Importing SEGY Data into SalvusProject

This notebook goes through the process of taking an external dataset and importing it into SalvusProject.

Preface

The data formats used for storing field data in geophysics vary widely. This can make importing external datasets a daunting task, since the importing process often differs from project to project. Luckily, there are a number of useful tools in Salvus that can help streamline this process.
Let's consider a set of (synthetically generated) near-surface measurements stored in the segy files included with this notebook. Each segy file contains a set of receiver measurements for a single source position. Before performing any sort of data analysis, we want to bring this raw data into SalvusProject and visualize what it looks like.
We start off by importing a few useful modules.
import os

SALVUS_FLOW_SITE_NAME = os.environ.get("SITE_NAME", "local")
PROJECT_DIR = "project"
import zipfile

import numpy as np
import obspy
from obspy.io.segy.segy import _read_segy
import salvus.namespace as sn

import utils
As usual, let's create a fresh project in which we'll store our subsequent data.
d = sn.domain.dim2.BoxDomain(
    x0=-41.0,
    x1=41.0,
    y0=-30.0,
    y1=0.0,
)

p = sn.Project.from_domain(path=PROJECT_DIR, domain=d, load_if_exists=True)
p.domain.plot()
Note: This section goes through an example of opening and extracting the data from the segy file in accordance with the standards outlined by the Society of Exploration Geophysicists (SEG) here.
As mentioned in the preface, importing this data into Python itself is generally something which will be dependent on the data format used by the recording instrument and, thus, is outside of the scope of this tutorial. However, here are a few options which may be useful for bringing this data into Python:
  • obspy.read(...) is excellent at dealing with a variety of common seismic data formats, so it may be a good starting point (see the sketch after this list).
  • Plain text files can often be read in with something similar to:
    with open(filename) as f:
        data = f.readlines()
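For instance, here is a minimal sketch of the first option (the filename is purely illustrative; obspy.read(...) auto-detects many common formats):
import obspy

# obspy auto-detects the file format (MiniSEED, SAC, SEG-Y, ...) and
# returns an obspy.Stream of traces.
st = obspy.read("data/some_file.segy")
print(st)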
For this example, we will use the function _read_segy(...) from obspy.io.segy.segy to import the segy data.
To understand how the importing process works, let's first import a single shotgather. Once we understand how this works, it is straightforward to wrap the procedure in a loop and apply it to the remaining datasets. Before we do this, however, we'll first need to unzip the downloaded data.
with zipfile.ZipFile("./data/data.zip") as fh:
    fh.extractall("./data")
Now, let's start off with shot_00.segy:
segy_data = _read_segy("data/shot_00.segy")
The header of this segy file contains the metadata for the shotgather. Let's get the header and print it to see what it contains.
header = segy_data.traces[0].header
print(header)
trace_sequence_number_within_line: 0
trace_sequence_number_within_segy_file: 1
original_field_record_number: 0
trace_number_within_the_original_field_record: 0
energy_source_point_number: 0
ensemble_number: 0
trace_number_within_the_ensemble: 0
trace_identification_code: 12
number_of_vertically_summed_traces_yielding_this_trace: 0
number_of_horizontally_stacked_traces_yielding_this_trace: 0
data_use: 0
distance_from_center_of_the_source_point_to_the_center_of_the_receiver_group: 0
receiver_group_elevation: 0
surface_elevation_at_source: 0
source_depth_below_surface: 0
datum_elevation_at_receiver_group: 0
datum_elevation_at_source: 0
water_depth_at_source: 0
water_depth_at_group: 0
scalar_to_be_applied_to_all_elevations_and_depths: 0
scalar_to_be_applied_to_all_coordinates: -100
source_coordinate_x: -3000
source_coordinate_y: 0
group_coordinate_x: -2400
group_coordinate_y: 0
coordinate_units: 0
weathering_velocity: 0
subweathering_velocity: 0
uphole_time_at_source_in_ms: 0
uphole_time_at_group_in_ms: 0
source_static_correction_in_ms: 0
group_static_correction_in_ms: 0
total_static_applied_in_ms: 0
lag_time_A: 0
lag_time_B: 0
delay_recording_time: 0
mute_time_start_time_in_ms: 0
mute_time_end_time_in_ms: 0
number_of_samples_in_this_trace: 9790
sample_interval_in_ms_for_this_trace: 55
gain_type_of_field_instruments: 0
instrument_gain_constant: 0
instrument_early_or_initial_gain: 0
correlated: 0
sweep_frequency_at_start: 0
sweep_frequency_at_end: 0
sweep_length_in_ms: 0
sweep_type: 0
sweep_trace_taper_length_at_start_in_ms: 0
sweep_trace_taper_length_at_end_in_ms: 0
taper_type: 0
alias_filter_frequency: 0
alias_filter_slope: 0
notch_filter_frequency: 0
notch_filter_slope: 0
low_cut_frequency: 0
high_cut_frequency: 0
low_cut_slope: 0
high_cut_slope: 0
year_data_recorded: 1969
day_of_year: 365
hour_of_day: 23
minute_of_hour: 59
second_of_minute: 59
time_basis_code: 0
trace_weighting_factor: 0
geophone_group_number_of_roll_switch_position_one: 0
geophone_group_number_of_trace_number_one: 0
geophone_group_number_of_last_trace: 0
gap_size: 0
over_travel_associated_with_taper: 0
x_coordinate_of_ensemble_position_of_this_trace: 0
y_coordinate_of_ensemble_position_of_this_trace: 0
for_3d_poststack_data_this_field_is_for_in_line_number: 0
for_3d_poststack_data_this_field_is_for_cross_line_number: 0
shotpoint_number: 0
scalar_to_be_applied_to_the_shotpoint_number: 0
trace_value_measurement_unit: 0
transduction_constant_mantissa: 0
transduction_constant_exponent: 0
transduction_units: 0
device_trace_identifier: 0
scalar_to_be_applied_to_times: 0
source_type_orientation: 0
source_energy_direction_mantissa: 0
source_energy_direction_exponent: 0
source_measurement_mantissa: 0
source_measurement_exponent: 0
source_measurement_unit: 0

As we can see, the segy file contains a lot of information. However, there are only a handful of fields we're actually interested in (see the sketch after this list):
  • Recorded data:
    • We'll want to be able to access each set of receiver measurements so that each trace within the shotgather is treated as a single array of size [n_samples].
  • Receiver metadata:
    • Receiver index within the sequence of receivers (i.e., the receiver number within the gather)
    • Cartesian coordinates of the receiver
    • Sampling rate (in Hz)
    • Start time of recording
    • End time of recording
  • Source metadata:
    • Source index within the survey (i.e., the shot number)
    • Cartesian coordinates of the source
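To make this concrete, here is a minimal sketch of where each of these quantities lives for a single trace (using the segy_data object read above; note that the coordinates still require the scaling discussed next):
tr0 = segy_data.traces[0]

data = tr0.data  # Receiver measurements: one array of size [n_samples]
rec_nr = tr0.header.trace_sequence_number_within_line  # Receiver index
rec_xy = (tr0.header.group_coordinate_x, tr0.header.group_coordinate_y)
src_nr = tr0.header.shotpoint_number  # Shot number
src_xy = (tr0.header.source_coordinate_x, tr0.header.source_coordinate_y)

# The timing metadata is easiest to obtain by converting to an obspy trace.
stats = tr0.to_obspy_trace().stats
print(stats.sampling_rate, stats.starttime, stats.endtime)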
Since this segy file only contains integer values, the source/receiver coordinates (for example header.source_coordinate_x and header.source_coordinate_y) must be scaled by the factor given in header.scalar_to_be_applied_to_all_coordinates to achieve the correct decimal values (see the aforementioned SEG standards here for more details).
# Apply no scaling if it is not provided
if header.scalar_to_be_applied_to_all_coordinates == 0:
    coord_scale_fac = 1
# Have the scaling factor act as a divisor if it is negative
elif header.scalar_to_be_applied_to_all_coordinates < 0:
    coord_scale_fac = -1 / header.scalar_to_be_applied_to_all_coordinates
# Have the scaling factor act as a multiplier if it is positive
else:
    coord_scale_fac = header.scalar_to_be_applied_to_all_coordinates
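For the header printed above, the scalar is -100, so coord_scale_fac becomes 0.01 and the raw integer coordinates are divided by 100. As a quick check:
# header.source_coordinate_x == -3000, so the true source x-coordinate is
# -3000 * 0.01 = -30.0 m, which lies within our 82 m-wide domain.
print(header.source_coordinate_x * coord_scale_fac)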
To help keep things organized, let's create an obspy.Stream object that contains the seismic traces along with the relevant metadata. To do this, we'll first create a list of obspy.Trace objects:
# To keep the numbering of the sources neat, we'll add some zero padding to the
# source names
src_zero_padding = 2

# Define the name of the source position
src_name = str(header.shotpoint_number).zfill(src_zero_padding)

# Initialize an empty list where we can store the `obspy.Trace` objects
traces = []

# Iterate over the traces
for trace in segy_data.traces:
    # Get some of the receiver metadata
    rec_nr = trace.header.trace_sequence_number_within_line

    # Define the names of the receivers; add zero padding to the receiver names
    # based on how many receivers are in the gather
    rec_zero_padding = int(np.log10(len(segy_data.traces))) + 1
    rec_name = str(rec_nr).zfill(rec_zero_padding)

    # Convert the trace to an obspy trace to make it a little easier to work with
    stats = trace.to_obspy_trace().stats

    # Package everything for the given trace into an `obspy.Trace`
    tr = obspy.Trace(
        data=trace.data,
        header={
            "station": rec_name,
            "starttime": stats.starttime,
            "endtime": stats.endtime,
            "sampling_rate": stats.sampling_rate,
            "receiver_location": np.array(
                [
                    trace.header.group_coordinate_x,
                    trace.header.group_coordinate_y,
                ]
            )
            * coord_scale_fac,
            "source_location": np.array(
                [header.source_coordinate_x, header.source_coordinate_y]
            )
            * coord_scale_fac,
            "source_id": src_name,
        },
    )
    traces.append(tr)

# Convert the traces to an obspy.Stream object
shotgather = obspy.Stream(traces)
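Before handing the stream off, it's worth a quick sanity check that it contains what we expect:
# Print a summary of all traces, plus the metadata we attached to the
# first receiver (station name, timing, and source/receiver locations).
print(shotgather)
print(shotgather[0].stats)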
Now that we have the data organized within an obspy.Stream object, we can use some of the handy utility functions in the included utils module to bring it into Salvus. Note that this utils module is open source, in case you wish to take a look under the hood and further customize how your data is imported into Salvus.
Let's create an ExternalData object and plot out where the source/receiver positions for the given dataset are.
external_data = utils.ExternalData(shotgather)
external_data.plot_shots()
external_data.plot_receivers(src_name)
To make sure that our data looks as expected, let's plot a trace for one of the receivers:
external_data.shots[0].plot_trace(10)
Next, let's import all of this information into Salvus. To summarize what's done behind-the-scenes here, the following steps must be performed to bring the data into Salvus Project:
  1. Add a new event with the relevant source and receiver positions for the given shot position.
  2. Create a temporary HDF5 file within which the following is stored:
    • Receiver measurements (traces)
    • Receiver metadata
  3. Add the temporary HDF5 file to the project.
In broad terms, this would have a structure similar to the following pseudo-code:
# Add the event to the project
p.events.add(...)

# Note that here we assume that there is only a single component to the data
receiver_data = np.zeros([N_RECEIVERS, 1, N_SAMPLES])
for i, trace in enumerate(traces):
    receiver_data[i, 0, :] = trace.data

with tempfile.TemporaryDirectory() as tmpdir:
    with h5py.File(..., "w") as f:
        f.create_group("point")
        f["point"].attrs["sampling_rate_in_hertz"] = ...
        f["point"].attrs["start_time_in_seconds"] = ...
        f["point"]["velocity"] = receiver_data
        f["point"]["velocity"].attrs["PHYSICS"] = "elastic".encode()

        f["coordinates_ELASTIC_point"] = ...
        f["names_ELASTIC_point"] = ...

    p.waveforms.add_external(
        data_name="survey_data",
        event=...,
        data_filename=...,
    )
For a comprehensive example of how this is performed internally, please see the add_to_project method within the accompanying utils module.
external_data.add_to_project(p)
Let's plot the shotgather for the data now that it's been added to the project:
p.waveforms.get(
    "EXTERNAL_DATA:survey_data",
    events=src_name,
)[0].plot(
    receiver_field="velocity",
    component="X",
)
Note that since the observed data only has a single component, it has automatically been assigned to the "X" component, despite being the vertical component of the velocity. A subsequent mapping applied during processing will associate the "X" component with the vertical component.
To add the remaining shots to the project, let's use the same process as above, but with everything wrapped in a single loop.
# We will keep a list of obspy.Stream objects
shotgathers = []

# Start at index 1 since we've already added shot 0 in the previous section
for i in range(1, 27):
    event_name = f"shot_{str(i).zfill(2)}"
    filename = f"{event_name}.segy"
    segy_data = _read_segy(f"data/{filename}")
    header = segy_data.traces[0].header

    # Apply no scaling if it is not provided
    if header.scalar_to_be_applied_to_all_coordinates == 0:
        coord_scale_fac = 1
    # Have the scaling factor act as a divisor if it is negative
    elif header.scalar_to_be_applied_to_all_coordinates < 0:
        coord_scale_fac = -1 / header.scalar_to_be_applied_to_all_coordinates
    # Have the scaling factor act as a multiplier if it is positive
    else:
        coord_scale_fac = header.scalar_to_be_applied_to_all_coordinates

    # Define the name of the source position
    src_name = str(header.shotpoint_number).zfill(src_zero_padding)

    # Initialize an empty list where we can store the `obspy.Trace` objects
    traces = []

    # Iterate over the traces
    for trace in segy_data.traces:
        # Get some of the receiver metadata
        rec_nr = trace.header.trace_sequence_number_within_line

        # Define the names of the receivers; add zero padding to the receiver
        # names based on how many receivers are in the gather
        rec_zero_padding = int(np.log10(len(segy_data.traces))) + 1
        rec_name = str(rec_nr).zfill(rec_zero_padding)

        # Convert the trace to an obspy trace to make it a little easier to work with
        stats = trace.to_obspy_trace().stats

        # Package everything for the given trace into an `obspy.Trace`
        tr = obspy.Trace(
            data=trace.data,
            header={
                "station": rec_name,
                "starttime": stats.starttime,
                "endtime": stats.endtime,
                "sampling_rate": stats.sampling_rate,
                "receiver_location": np.array(
                    [
                        trace.header.group_coordinate_x,
                        trace.header.group_coordinate_y,
                    ]
                )
                * coord_scale_fac,
                "source_location": np.array(
                    [header.source_coordinate_x, header.source_coordinate_y]
                )
                * coord_scale_fac,
                "source_id": src_name,
            },
        )
        traces.append(tr)

    shotgathers.append(obspy.Stream(traces))
Similar to before, let's add all of these shotgathers to the project.
external_data = utils.ExternalData(shotgathers)
external_data.add_to_project(p)
Finally, to check if they were indeed added correctly, let's plot the locations of all of our sources and receivers.
p.viz.nb.domain()
An important quality-control step is to validate that the data has been imported into SalvusProject correctly by plotting it. SalvusProject offers a number of handy processing functions, known as processing fragments, which we can use to apply some basic processing to the data being displayed. There are three primary processing fragments at our disposal:
  • bandpass(min_frequency, max_frequency)
  • normalize
  • time_shift(time_in_seconds)
Processing fragments are called in the data_name argument by separating the name of the data and the processing fragment with the | operator. Here's an example:
p.waveforms.get(
    "EXTERNAL_DATA:survey_data | bandpass(20, 80) | normalize | timeshift(0.1)",
    ...
)
Note that these processing fragments only affect how the data is displayed in the given plot; none of the processing is applied to or saved with the data in SalvusProject. Furthermore, multiple processing fragments can be chained within a single call, as in the example above.
Let's take one of the shots near the center of the gather as an example:
src_name = str(9).zfill(src_zero_padding)
p.waveforms.get(
    "EXTERNAL_DATA:survey_data",
    events=src_name,
)[0].plot(
    receiver_field="velocity",
    component="X",
)
There is such a large variation in amplitude between the traces that the signals at far offsets are completely uninterpretable. To rectify this, we can use the normalize processing fragment:
p.waveforms.get(
    "EXTERNAL_DATA:survey_data | normalize",
    events=src_name,
)[0].plot(
    receiver_field="velocity",
    component="X",
)
Finally, let's plot a few of the other shotgathers to verify that everything has been imported successfully.
for i in [0, 13, 26]:
    src_name = str(i).zfill(src_zero_padding)
    p.waveforms.get(
        "EXTERNAL_DATA:survey_data | normalize",
        events=src_name,
    )[0].plot(
        receiver_field="velocity",
        component="X",
    )
Awesome! We can now start setting up the simulations.