Batch¶
- class Batch(index, dataset=None, pipeline=None, preloaded=None, copy=False, *args, **kwargs)[source]¶
The core Batch class
Note that if a method is wrapped with the @apply_parallel decorator, then inner calls (i.e. calls from other methods) must use the underscored version of that method: if method is decorated, call _method_ from inside other_method. The same applies to all child classes of
batch.Batch.
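The underscored-call convention can be sketched with a minimal pure-Python mimic. The decorator name (apply_parallel) and the underscored naming follow the docs above, but this toy implementation is an assumption, not batchflow's actual code:

```python
# Toy stand-in for the @apply_parallel pattern (assumed, simplified).
def apply_parallel(per_item_func):
    def wrapper(self, *args, **kwargs):
        # The public action applies the per-item function to every batch item
        return [per_item_func(self, item, *args, **kwargs) for item in self.items]
    return wrapper

class MyBatch:
    def __init__(self, items):
        self.items = items

    # per-item implementation, kept under the underscored name
    def _double_(self, item):
        return item * 2

    # public, batch-wide version produced by the decorator
    double = apply_parallel(_double_)

    def other_method(self):
        # inner call: use the underscored per-item version, not the wrapped one
        return [self._double_(item) + 1 for item in self.items]
```

Calling the wrapped double from inside other_method would apply it to the whole batch again, which is why the per-item _double_ is used instead.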
- add_components(components, init=None)[source]¶
Add new components
- Parameters
- Raises
ValueError – If a component or an attribute with the given name already exists
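A hypothetical sketch of the contract stated above: a new component is created, and reusing an existing name raises ValueError. The class and attribute-based storage here are illustrative assumptions (the real method also accepts several components at once):

```python
# Toy sketch of the add_components contract (assumed storage model).
class TinyBatch:
    def add_components(self, name, init=None):
        # refuse to overwrite an existing component or attribute
        if hasattr(self, name):
            raise ValueError(f"Component or attribute '{name}' already exists")
        setattr(self, name, init)
```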
- apply_defaults = {'dst': None, 'post': '_assemble', 'src': None, 'target': 'threads'}¶
- apply_parallel(func, init=None, post=None, src=None, dst=None, *args, p=None, target='for', requires_rng=False, rng_seeds=None, **kwargs)[source]¶
Apply a function to each item in the container returned by init, and assemble the results with post. Depending on the target parameter, different parallelization engines may be used: for, threads, mpc, async.
- Roughly, under the hood we perform the following:
- compute parameters, individual for each worker. Currently, these are:
p to indicate whether the function should be applied
worker id and a seed for the random generator, if required
- call the init function, which outputs a container of items passed directly to func. The simplest example is an init function that returns batch indices, so that func works off of each index.
- wrap each func call into the parallelization engine of choice
- compute the results of func calls for each item returned by init
- assemble the results with the post function, e.g. stack the obtained numpy arrays.
In the simplest possible case of init=None, src=images, dst=images_transformed, post=None, this function is almost equivalent to:
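That near-equivalence is just a per-item loop. As a sketch, with plain lists standing in for batch components and a hypothetical transform func (no actual parallelism shown):

```python
# Stand-ins for batch components: plain lists instead of real batch data.
images = [1, 2, 3]

def func(x):
    # stand-in for the transformation applied to each item
    return x * 10

# src='images', dst='images_transformed', init=None, post=None:
# apply func to each item and write the result to the destination.
images_transformed = [None] * len(images)
for i in range(len(images)):
    images_transformed[i] = func(images[i])
```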
If src is a list and dst is a list, then the function is applied recursively to each pair of src, dst items. If src is a tuple, then the tuple is used as a whole. This makes it possible to write functions that work on multiple components at once.
- Parameters
func (callable) – A function to apply to each item from the source. Should accept src and dst parameters, or be written in a way that accepts variable args.
target (str) –
- Parallelization engine:
'f', 'for' for executing each worker sequentially, like in a for-loop.
't', 'threads' for using threads.
'm', 'mpc' for using processes. Note the bigger overhead of process initialization.
'a', 'async' for asynchronous execution.
init (str, callable or container) –
Function to init data for individual workers: must return a container of items.
If 'data', then the src components are used as the init. If any other str, it must be the name of a batch attribute to use as the init. If a callable (or if any of the above returns a callable), the result of that callable is used as the init; in this case the callable should accept src and dst parameters, and kwargs are passed as well. Otherwise, the object is used directly, e.g. an np.ndarray.
post (str or callable) – Function to apply to the results of function evaluation on each item. Must accept src and dst parameters, as well as kwargs.
src (str, sequence, list of str) – The source to get data from:
- None
- str - a component name, e.g. 'images' or 'masks'
- tuple or list of str - several component names
- sequence - data as a numpy array, data frame, etc.
dst (str or array) – The destination to put the result in:
- None - dst is set to be the same as src
- str - a component name, e.g. 'images' or 'masks'
- tuple or list of str, e.g. ['images', 'masks']
p (float or None) – Probability of applying func to an element in the batch.
requires_rng (bool) – Whether func requires an RNG. Should be used to correctly initialize seeds for reproducibility. If True, a pre-initialized RNG is passed to the function call as the rng keyword parameter.
args – Other positional parameters passed to func.
kwargs – Other named parameters passed to func.
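The p parameter above can be sketched as follows; the function name and the per-item coin flip are illustrative assumptions, but the semantics match the docs (each item is transformed with probability p and left unchanged otherwise):

```python
import random

# Toy sketch of the `p` semantics: transform each item with probability p.
def apply_with_p(items, func, p, rng):
    return [func(x) if rng.random() < p else x for x in items]
```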
Notes
apply_parallel does the following (but in parallel):
for item in range(len(batch)):
    self.dst[item] = func(self.src[item], *args, **kwargs)
apply_parallel(func, src=['images', 'masks']) is equal to apply_parallel(func, src=['images', 'masks'], dst=['images', 'masks']), which in turn is equivalent to two subsequent calls:
images = func(images)
masks = func(masks)
However, named expressions will be evaluated only once before the first call.
Whereas apply_parallel(func, src=('images', 'masks')) (i.e. when src is a tuple of component names, not a list as in the previous example) passes both components' data into func simultaneously:
images, masks = func((images, masks))
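The list-vs-tuple distinction can be sketched with a plain dict standing in for the batch's components. This is an assumed, simplified model of the dispatch, not batchflow's implementation:

```python
# src as a list: apply func to each named component independently.
# src as a tuple: pass all components' data to func at once.
def apply_to_components(data, func, src):
    if isinstance(src, list):
        for name in src:
            data[name] = func(data[name])
    elif isinstance(src, tuple):
        results = func(tuple(data[name] for name in src))
        for name, result in zip(src, results):
            data[name] = result
    else:
        # single component name
        data[src] = func(data[src])
    return data
```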
Examples
apply_parallel(make_masks_fn, src='images', dst='masks')
apply_parallel(apply_mask, src=('images', 'masks'), dst='images_with_masks')
apply_parallel(rotate, src=['images', 'masks'], dst=['images', 'masks'], p=.2)
apply_parallel(MyBatch.some_static_method, p=.5)
apply_parallel(B.some_method, src='features', p=.5)
TODO: move logic of applying post function from inbatch_parallel here, as well as remove use_self arg.
- property array_of_nones¶
NumPy array with None values.
- Type
1-D ndarray
- as_dataset(dataset=None, copy=False)[source]¶
Makes a new dataset from batch data
- Parameters
dataset – an instance or a subclass of Dataset
copy (bool) – whether to copy batch data to allow for further inplace transformations
- Returns
an instance of the class specified by the dataset arg, preloaded with this batch data
- components = None¶
- property data¶
tuple or named components - batch data
- property data_setter¶
tuple or named components - batch data
- property dataset¶
Dataset - a dataset the batch has been taken from
- dump(*args, dst=None, fmt=None, components=None, **kwargs)[source]¶
Save data to another array or a file.
- Parameters
dst – a destination (e.g. an array or a file name)
fmt (str) – a destination format, one of None, ‘blosc’, ‘csv’, ‘hdf5’, ‘feather’
components (None or str or tuple of str) – components to save
*args – other parameters are passed to format-specific writers
**kwargs – other parameters are passed to format-specific writers
- property indices¶
numpy array - an array with the indices
- property items¶
list - batch items
- load(*args, src=None, fmt=None, dst=None, **kwargs)[source]¶
Load data from another array or a file.
- Parameters
Notes
Loading creates new components if necessary.
Examples
Load data from a pandas dataframe’s columns into all batch components:
batch.load(src=dataframe)
Load data from dataframe’s columns features and labels into components features and labels:
batch.load(src=dataframe, dst=('features', 'labels'))
Load a dataframe into a component features:
batch.load(src=dataframe, dst='features')
Load data from a dict into components images and masks:
batch.load(src=dict(images=images_array, masks=masks_array), dst=('images', 'masks'))
Load data from a tuple into components images and masks:
batch.load(src=(images_array, masks_array), dst=('images', 'masks'))
Load data from an array into a component images:
batch.load(src=images_array, dst='images')
Load data from a CSV file columns into components features and labels:
batch.load(fmt='csv', src='/path/to/file.csv', dst=('features', 'labels'), index_col=0)
- classmethod merge(batches, batch_size=None, components=None, batch_class=None)[source]¶
Merge several batches to form a new batch of a given size
- Parameters
batches (tuple of batches) –
batch_size (int or None) – if None, merge all batches into a single batch (the rest will be None); if int, make one batch of size batch_size and another batch with the rest of the data.
components (str, tuple or None) – if None, all components from initial batches will be created, if str or tuple, then create these components in new batches.
batch_class (Batch or None) – if None, created batches will be of the same class as initial batch, if Batch, created batches will be of that class.
- Returns
batch, rest
- Return type
tuple of two batches
- Raises
ValueError – If component is None in some batches and not None in others.
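The merge semantics described above can be sketched with plain lists standing in for batch data. This is assumed, simplified behavior (real merging works per component and preserves batch classes):

```python
# Toy sketch of Batch.merge: concatenate, then optionally split off
# a batch of batch_size items, returning (batch, rest).
def merge(batches, batch_size=None):
    data = [item for batch in batches for item in batch]
    if batch_size is None:
        return data, None            # one merged batch, no rest
    rest = data[batch_size:]
    return data[:batch_size], rest if rest else None
```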
- classmethod merge_component(component=None, data=None)[source]¶
Merge the same component data from several batches
- property pipeline¶
Pipeline - a pipeline the batch is being used in
- property random¶
A random number generator (numpy.random.Generator). Use it instead of np.random for reproducibility.
Examples
x = self.random.normal(0, 1)
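The reproducibility point can be illustrated with the standard library's random.Random standing in for numpy.random.Generator (same idea, different API):

```python
import random

# Two generators created from the same seed produce identical streams;
# seeding a per-batch generator is what makes pipeline runs reproducible.
rng_a = random.Random(42)
rng_b = random.Random(42)
draws_a = [rng_a.random() for _ in range(3)]
draws_b = [rng_b.random() for _ in range(3)]
```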
- property random_seed¶
SeedSequence for random number generation
- run_once(*args, **kwargs)[source]¶
Init function for no parallelism. Useful for async action-methods (will wait till the method finishes).
- property size¶
int - number of items in the batch