MemmapLoader

class segfast.memmap_loader.MemmapLoader(path, endian='big', strict=False, ignore_geometry=True)[source]

Custom reader/writer for SEG-Y files. Relies on a memory mapping mechanism for actual reads of headers and traces.

SEG-Y description

A SEG-Y file is a binary file divided into several blocks:

file-wide information block which in most cases takes the first 3600 bytes:

textual header: the first 3200 bytes are reserved for textual information about the file. Most software uses this header to store acquisition metadata, date of creation, author, etc.

binary header: bytes 3200–3600 contain file-wide headers that describe the number of traces, the format used for storing numbers, the number of samples in each trace, acquisition parameters, etc.

(optional) bytes from 3600 onward can be used to store extended textual information. The presence of such a header is indicated by a value within bytes 3200–3600.

a sequence of traces, where each trace is a combination of its header and signal data:

trace header: takes the first 240 bytes of a trace and describes its metadata: shot/receiver coordinates, the method of acquisition, the current trace length, etc. As with the file-wide headers, each trace can also have extended headers.

trace data: usually an array of amplitude values, which can be stored in various numerical types. Because the original SEG-Y standard is quite old (1975), one of those numerical formats is IBM float, which differs substantially from standard IEEE floats; special care is therefore required to decode values from such files correctly.

Most SEG-Y files are written with a constant trace size, although the standard itself allows variable-sized traces. We do not support files with variable-sized traces.
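The layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It builds a tiny synthetic SEG-Y-like byte string in memory and reads a few fields back. The byte positions (3217 for the sample interval, 3221 for the number of samples, 3225 for the format code, and 5 within a trace header for TRACE_SEQUENCE_FILE, all 1-based) follow the SEG-Y standard; the helper names `build_synthetic_segy`, `read_binary_header` and `trace_offset` are hypothetical and are not part of segfast.

```python
import struct

TEXT_HEADER_SIZE = 3200
BINARY_HEADER_SIZE = 400
TRACE_HEADER_SIZE = 240


def build_synthetic_segy(n_traces, n_samples, sample_interval=2000):
    """Build a tiny big-endian SEG-Y-like byte string with IEEE float32 samples (format 5)."""
    text_header = b' ' * TEXT_HEADER_SIZE
    binary_header = bytearray(BINARY_HEADER_SIZE)
    # Offsets below are relative to the start of the binary header (file byte 3201, 1-based).
    struct.pack_into('>h', binary_header, 16, sample_interval)  # file bytes 3217-3218
    struct.pack_into('>h', binary_header, 20, n_samples)        # file bytes 3221-3222
    struct.pack_into('>h', binary_header, 24, 5)                # file bytes 3225-3226

    traces = bytearray()
    for i in range(n_traces):
        header = bytearray(TRACE_HEADER_SIZE)
        struct.pack_into('>i', header, 4, i + 1)  # TRACE_SEQUENCE_FILE, trace bytes 5-8
        samples = struct.pack(f'>{n_samples}f', *[float(i)] * n_samples)
        traces += header + samples
    return bytes(text_header + binary_header + traces)


def read_binary_header(data):
    """Read three file-wide fields from the binary header."""
    sample_interval, = struct.unpack_from('>h', data, 3216)
    n_samples, = struct.unpack_from('>h', data, 3220)
    format_code, = struct.unpack_from('>h', data, 3224)
    return {'sample_interval': sample_interval, 'n_samples': n_samples, 'format': format_code}


def trace_offset(index, n_samples, itemsize=4):
    """Byte offset of trace `index` (0-based), assuming constant-size traces
    and no extended textual or trace headers."""
    trace_size = TRACE_HEADER_SIZE + n_samples * itemsize
    return TEXT_HEADER_SIZE + BINARY_HEADER_SIZE + index * trace_size


data = build_synthetic_segy(n_traces=3, n_samples=10)
info = read_binary_header(data)
offset = trace_offset(2, info['n_samples'])
tsf, = struct.unpack_from('>i', data, offset + 4)
print(info, offset, tsf)
```

With a constant trace size, the position of any trace is plain arithmetic, which is what makes a memory mapping over the whole file practical.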

Implementation details

We rely on segyio to infer file-wide parameters. For headers and traces, we use custom methods of reading binary data. The main differences from the segyio C++ implementation are:

we read all of the requested headers in one file-wide sweep, which is about an order of magnitude faster than the sequential segyio read of every requested header. The sweep is also split into chunks that are processed in multiple processes.

a memory map over trace data is used for loading values. Avoiding redundant copies and relying on numpy indexing speeds up reading, especially when traces are sliced along the samples axis. This is particularly relevant when loading horizontal (depth) slices.
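The memory-mapping idea can be sketched with plain numpy. This is an illustration under simplifying assumptions (fixed-size traces, big-endian IEEE float32 samples, no extended headers), not the actual segfast code; the file name and the variable names are made up for the example.

```python
import os
import tempfile

import numpy as np

n_traces, n_samples = 4, 8

# One record per trace: 240 opaque header bytes followed by the samples.
trace_dtype = np.dtype([('header', 'V240'), ('data', '>f4', (n_samples,))])

# Write a synthetic file: 3600 bytes of file-wide headers, then the traces.
traces = np.zeros(n_traces, dtype=trace_dtype)
traces['data'] = np.arange(n_traces * n_samples, dtype=np.float32).reshape(n_traces, n_samples)

path = os.path.join(tempfile.mkdtemp(), 'synthetic.sgy')
with open(path, 'wb') as file:
    file.write(b'\x00' * 3600)
    file.write(traces.tobytes())

# Map the trace region; data is read from disk only when elements are accessed.
mmap = np.memmap(path, mode='r', dtype=trace_dtype, offset=3600, shape=(n_traces,))
data_view = mmap['data']  # shape (n_traces, n_samples), still a view over the file

# Load two traces, restricted to samples 2..4 along the samples axis.
subset = np.array(data_view[[0, 2], 2:5], dtype=np.float32)
print(subset)
```

Only the requested samples are copied into `subset`; the headers and the remaining samples of each trace are never converted or stored in memory.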

load_headers(headers, indices=None, reconstruct_tsf=True, sort_columns=True, return_specs=False, chunk_size=25000, max_workers=4, pbar=False, **kwargs)[source]

Load requested trace headers from a SEG-Y file for each trace into a dataframe. If needed, we reconstruct the 'TRACE_SEQUENCE_FILE' manually by re-indexing traces.

Under the hood, we create a memory mapping over the SEG-Y file and view it with a special dtype. That dtype skips all of the trace data bytes and all of the unrequested headers, leaving only the requested headers as non-void fields.

The file is read in chunks in multiple processes.
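A dtype of this kind can be sketched with a numpy structured dtype that names only the requested fields and sets `itemsize` to the full trace size, so everything else in each trace is skipped. The snippet below is a simplified illustration, not the segfast implementation: it assumes fixed-size traces with float32 samples, uses the standard 1-based positions 189 and 193 for 'INLINE_3D' and 'CROSSLINE_3D', and processes chunks sequentially rather than in separate processes.

```python
import os
import tempfile

import numpy as np

n_traces, n_samples = 5, 6
trace_size = 240 + n_samples * 4

# Full layout, used here only to write a synthetic file.
full_dtype = np.dtype({
    'names': ['INLINE_3D', 'CROSSLINE_3D', 'data'],
    'formats': ['>i4', '>i4', ('>f4', (n_samples,))],
    'offsets': [188, 192, 240],          # 0-based offsets inside one trace
    'itemsize': trace_size,
})
traces = np.zeros(n_traces, dtype=full_dtype)
traces['INLINE_3D'] = np.arange(n_traces) + 100
traces['CROSSLINE_3D'] = np.arange(n_traces) * 2

path = os.path.join(tempfile.mkdtemp(), 'synthetic.sgy')
with open(path, 'wb') as file:
    file.write(b'\x00' * 3600)
    file.write(traces.tobytes())

# Sparse layout: only the requested headers are named fields;
# the rest of each trace (other headers and all samples) is unnamed padding.
header_dtype = np.dtype({
    'names': ['INLINE_3D', 'CROSSLINE_3D'],
    'formats': ['>i4', '>i4'],
    'offsets': [188, 192],
    'itemsize': trace_size,
})
mmap = np.memmap(path, mode='r', dtype=header_dtype, offset=3600, shape=(n_traces,))

# Read in chunks of traces and stitch the results together.
chunk_size = 2
headers = {
    name: np.concatenate([
        np.array(mmap[name][start:start + chunk_size], dtype=np.int32)
        for start in range(0, n_traces, chunk_size)
    ])
    for name in header_dtype.names
}
print(headers)
```

The resulting dictionary of equally sized arrays maps directly onto dataframe columns, one column per requested header.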

Parameters:

headers (sequence) –
An array-like where each element can be:
- str – header name,
- int – header starting byte,
- TraceHeaderSpec – used as is,
- tuple – args to init TraceHeaderSpec,
- dict – kwargs to init TraceHeaderSpec.
indices (sequence or None) – Indices of traces to load trace headers for. If not given, trace headers are loaded for all traces.
reconstruct_tsf (bool) – Whether to reconstruct TRACE_SEQUENCE_FILE manually.
sort_columns (bool) – Whether to sort columns in the resulting dataframe by their starting bytes.
return_specs (bool) – Whether to return header specs used to load trace headers.
chunk_size (int) – Maximum number of traces in each chunk.
max_workers (int, optional) – Maximum number of parallel processes to spawn. If None, then the number of CPU cores is used.
pbar (bool or str) – If bool, then whether to display progress bar over the file sweep. If str, then type of progress bar to display: 't' for textual, 'n' for widget.

Return type:

pd.DataFrame

Examples

Standard 'CDP_X' and 'CDP_Y' headers:

segfast_file.load_headers(['CDP_X', 'CDP_Y'])

Standard headers starting at bytes 181 and 185, with standard dtypes:

segfast_file.load_headers([181, 185])

Load 'CDP_X' and 'CDP_Y' from non-standard byte positions that correspond to other standard headers (i.e., load 'CDP_X' from the bytes of 'INLINE_3D' with the '<i4' dtype and 'CDP_Y' from the bytes of 'CROSSLINE_3D'):

segfast_file.load_headers([{'name': 'CDP_X', 'start_byte': 189, 'dtype': '<i4'}, ('CDP_Y', 193)])

Load 'CDP_X' and 'CDP_Y' from arbitrary positions:

segfast_file.load_headers([('CDP_X', 45, '>f4'), ('CDP_Y', 10, '>f4')])

Load 'FieldRecord' header for the first 5 traces:

segfast_file.load_headers(['FieldRecord'], indices=np.arange(5))

load_traces(indices, limits=None, buffer=None)[source]

Load traces by their indices. Under the hood, we use a pre-made memory mapping over the file, where trace data is viewed with a special dtype. Regardless of the numerical dtype of the SEG-Y file, we output IEEE float32; for IBM floats, that requires an additional conversion.
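For reference, the IBM-to-IEEE conversion mentioned above can be sketched in numpy as follows. An IBM single-precision value has a sign bit, a 7-bit base-16 exponent with a bias of 64, and a 24-bit fraction. The function name `ibm_to_ieee` is hypothetical, and this straightforward version is only an illustration of the format, not the routine segfast uses.

```python
import numpy as np


def ibm_to_ieee(raw):
    """Convert IBM System/360 32-bit floats, given as unsigned 32-bit words, to float32."""
    raw = np.asarray(raw, dtype=np.uint32)
    sign = np.where(raw >> 31, -1.0, 1.0)                     # top bit
    exponent = ((raw >> 24) & 0x7F).astype(np.int64) - 64     # base-16 exponent, bias 64
    fraction = (raw & 0x00FFFFFF).astype(np.float64) / 2**24  # 24-bit fraction in [0, 1)
    return (sign * fraction * 16.0 ** exponent).astype(np.float32)


# Three big-endian IBM floats as they would appear in a file: -118.625, 1.0 and 0.0.
buffer = bytes.fromhex('C276A000 41100000 00000000')
words = np.frombuffer(buffer, dtype='>u4')
values = ibm_to_ieee(words)
print(values)
```

Because the exponent is base 16 rather than base 2, reinterpreting the same bytes as IEEE floats would give wrong values, which is why the format code in the binary header has to be checked before decoding.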

Parameters:

indices (sequence) – Indices ('TRACE_SEQUENCE_FILE') of the traces to read.
limits (sequence of ints, slice, optional) – Slice of the data along the depth axis.
buffer (numpy.ndarray, optional) – Buffer to read the data into. If possible, avoids copies.

Return type:

numpy.ndarray

load_depth_slices(indices, buffer=None)[source]

Load horizontal (depth) slices of the data. Requires an almost full sweep through the SEG-Y file and is therefore slow.
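The cost of a depth slice follows from the file layout: samples of one depth are separated by a full trace. The sketch below shows this with numpy strides, assuming fixed-size traces with 240-byte headers and float32 samples; an in-memory array stands in for the memory-mapped file, and all names are illustrative.

```python
import numpy as np

n_traces, n_samples = 1000, 500
trace_dtype = np.dtype([('header', 'V240'), ('data', '>f4', (n_samples,))])

# Stand-in for a memory-mapped file with the same per-trace layout.
traces = np.zeros(n_traces, dtype=trace_dtype)
data = traces['data']              # shape (n_traces, n_samples)

depth_slice = data[:, 100]         # one sample from every trace
stride = depth_slice.strides[0]    # bytes between consecutive elements of the slice

useful_bytes = depth_slice.size * depth_slice.itemsize
spanned_bytes = (n_traces - 1) * stride + depth_slice.itemsize
print(stride, useful_bytes, spanned_bytes)
```

Each 4-byte sample of the slice sits 2240 bytes away from the next one, so collecting 4000 useful bytes means touching locations spread over more than 2 MB of the file.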

Parameters:

indices (sequence) – Indices (ordinals) of the depth slices to read.
buffer (numpy.ndarray, optional) – Buffer to read the data into. If possible, avoids copies.

Return type:

numpy.ndarray

convert(path=None, format=8, transform=None, chunk_size=25000, max_workers=4, pbar='t', overwrite=True)[source]

Convert the SEG-Y file to a different format, i.e. a different dtype of data values. Keeps the same binary header (except for byte 3225, which stores the format). Keeps the same header values for each trace: essentially, only the data values of each trace are transformed.

The most common use of this function is converting a float32 SEG-Y file into an int8 one: the latter is a lot faster to work with and takes about 4 times less disk space, at the cost of some data loss.
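The float32-to-int8 trade-off can be illustrated with a simple linear quantization. This is a hypothetical transform written for the example; it is not necessarily what convert does by default. The function name `make_int8_transform` and the clipping value `max_abs` are assumptions.

```python
import numpy as np


def make_int8_transform(max_abs):
    """Return a callable that linearly maps [-max_abs, max_abs] onto the int8 range [-127, 127].
    Values outside that interval are clipped, which is one source of data loss."""
    scale = 127.0 / max_abs

    def transform(array):
        scaled = np.clip(np.asarray(array, dtype=np.float32) * scale, -127, 127)
        return np.rint(scaled).astype(np.int8)   # rounding is the other source of loss

    return transform


traces = np.array([[-2.0, -1.0, 0.0], [0.5, 1.0, 4.0]], dtype=np.float32)
transform = make_int8_transform(max_abs=2.0)
quantized = transform(traces)

print(quantized)
print(traces.nbytes, quantized.nbytes)   # int8 needs a quarter of the float32 storage
```

Given the signature documented here, a callable like this could presumably be passed as `convert(format=8, transform=transform)`, since it returns int8 data; how convert feeds data to the callable is not specified in this section.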

Parameters:

path (str, optional) – Path to save the file to. If not provided, we use the path of the current cube with an added postfix.
format (int) – Target SEG-Y format. Refer to SEGY_FORMAT_TO_TRACE_DATA_DTYPE for the list of available formats and their data value dtypes.
transform (callable, optional) – Callable to transform data from the current file before it is saved to path. Must return data of the dtype specified by format.
chunk_size (int) – Maximum number of traces in each chunk.
max_workers (int or None) – Maximum number of parallel processes to spawn. If None, then the number of CPU cores is used.
pbar (bool, str) – If bool, then whether to display a progress bar. If str, then the type of progress bar to display: 't' for textual, 'n' for widget.
overwrite (bool) – Whether to overwrite the existing path or raise an exception.

Returns:

path

Return type:

str