Batch¶
- class Batch(index, dataset=None, pipeline=None, preloaded=None, copy=False, *args, **kwargs)[source]¶
The core Batch class
Note that if a method is wrapped with the @apply_parallel decorator, then inner calls (i.e. calls from other methods) must use the underscored version of that method: if method is decorated, call _method_ from inside other_method. The same applies to all child classes of
batch.Batch.
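The underscored-call convention can be sketched with a minimal pure-Python mimic. The decorator name (apply_parallel) and the underscored naming follow the docs above, but this toy implementation is an assumption, not batchflow's actual code:

```python
# Toy stand-in for the @apply_parallel pattern (assumed, simplified).
def apply_parallel(per_item_func):
    def wrapper(self, *args, **kwargs):
        # The public action applies the per-item function to every batch item
        return [per_item_func(self, item, *args, **kwargs) for item in self.items]
    return wrapper

class MyBatch:
    def __init__(self, items):
        self.items = items

    # per-item implementation, kept under the underscored name
    def _double_(self, item):
        return item * 2

    # public, batch-wide version produced by the decorator
    double = apply_parallel(_double_)

    def other_method(self):
        # inner call: use the underscored per-item version, not the wrapped one
        return [self._double_(item) + 1 for item in self.items]
```

Calling the wrapped double from inside other_method would apply it to the whole batch again, which is why the per-item _double_ is used instead.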
- add_components(components, init=None)[source]¶
Add new components
- Parameters
- Raises
ValueError – If a component or an attribute with the given name already exists
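A hypothetical sketch of the contract stated above: a new component is created, and reusing an existing name raises ValueError. The class and attribute-based storage here are illustrative assumptions (the real method also accepts several components at once):

```python
# Toy sketch of the add_components contract (assumed storage model).
class TinyBatch:
    def add_components(self, name, init=None):
        # refuse to overwrite an existing component or attribute
        if hasattr(self, name):
            raise ValueError(f"Component or attribute '{name}' already exists")
        setattr(self, name, init)
```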
- apply_defaults = {'dst': None, 'post': '_assemble', 'src': None, 'target': 'threads'}¶
- apply_parallel(func, init=None, post=None, src=None, dst=None, *args, p=None, target='for', requires_rng=False, rng_seeds=None, **kwargs)[source]¶
Apply a function to each item in the container returned by init, and assemble the results with post. Depending on the target parameter, different parallelization engines may be used: for, threads, mpc, async.
- Roughly, under the hood we perform the following:
- compute parameters, individual for each worker. Currently, these are:
p to indicate whether the function should be applied
worker id and a seed for the random generator, if required
- call the init function, which outputs a container of items passed directly to func. The simplest example is an init function that returns batch indices, so that func works off of each index.
- wrap each func call into the parallelization engine of choice
- compute the results of func calls for each item returned by init
- assemble the results with the post function, e.g. stack the obtained numpy arrays.
In the simplest possible case of init=None, src=images, dst=images_transformed, post=None, this function is almost equivalent to:
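That near-equivalence is just a per-item loop. As a sketch, with plain lists standing in for batch components and a hypothetical transform func (no actual parallelism shown):

```python
# Stand-ins for batch components: plain lists instead of real batch data.
images = [1, 2, 3]

def func(x):
    # stand-in for the transformation applied to each item
    return x * 10

# src='images', dst='images_transformed', init=None, post=None:
# apply func to each item and write the result to the destination.
images_transformed = [None] * len(images)
for i in range(len(images)):
    images_transformed[i] = func(images[i])
```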
If src is a list and dst is a list, then the function is applied recursively to each pair of src, dst items. If src is a tuple, then the tuple is used as a whole. This makes it possible to write functions that work on multiple components at once.
- Parameters
func (callable) – A function to apply to each item from the source. Should accept src and dst parameters, or be written in a way that accepts variable args.
target (str) –
- Parallelization engine:
'f', 'for' for executing each worker sequentially, like in a for-loop.
't', 'threads' for using threads.
'm', 'mpc' for using processes. Note the bigger overhead of process initialization.
'a', 'async' for asynchronous execution.
init (str, callable or container) –
Function to init data for individual workers: must return a container of items.
If 'data', then the src components are used as the init. If any other str, it must be the name of a batch attribute to use as the init. If a callable (or if any of the above returns a callable), the result of that callable is used as the init; in this case the callable should accept src and dst parameters, and kwargs are passed as well. Otherwise, the object is used directly, e.g. an np.ndarray.
post (str or callable) – Function to apply to the results of function evaluation on each item. Must accept src and dst parameters, as well as kwargs.
src (str, sequence, list of str) – The source to get data from:
- None
- str - a component name, e.g. 'images' or 'masks'
- tuple or list of str - several component names
- sequence - data as a numpy array, data frame, etc.
dst (str or array) – The destination to put the result in:
- None - dst is set to be the same as src
- str - a component name, e.g. 'images' or 'masks'
- tuple or list of str, e.g. ['images', 'masks']
p (float or None) – Probability of applying func to an element in the batch.
requires_rng (bool) – Whether func requires an RNG. Should be used to correctly initialize seeds for reproducibility. If True, a pre-initialized RNG is passed to the function call as the rng keyword parameter.
args – Other positional parameters passed to func.
kwargs – Other named parameters passed to func.
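The p parameter above can be sketched as follows; the function name and the per-item coin flip are illustrative assumptions, but the semantics match the docs (each item is transformed with probability p and left unchanged otherwise):

```python
import random

# Toy sketch of the `p` semantics: transform each item with probability p.
def apply_with_p(items, func, p, rng):
    return [func(x) if rng.random() < p else x for x in items]
```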
Notes
apply_parallel does the following (but in parallel):
for item in range(len(batch)):
    self.dst[item] = func(self.src[item], *args, **kwargs)
apply_parallel(func, src=['images', 'masks']) is equal to apply_parallel(func, src=['images', 'masks'], dst=['images', 'masks']), which in turn is equivalent to two subsequent calls:
images = func(images)
masks = func(masks)
However, named expressions will be evaluated only once before the first call.
Whereas apply_parallel(func, src=('images', 'masks')) (i.e. when src is a tuple of component names, not a list as in the previous example) passes both components' data into func simultaneously:
images, masks = func((images, masks))
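The list-vs-tuple distinction can be sketched with a plain dict standing in for the batch's components. This is an assumed, simplified model of the dispatch, not batchflow's implementation:

```python
# src as a list: apply func to each named component independently.
# src as a tuple: pass all components' data to func at once.
def apply_to_components(data, func, src):
    if isinstance(src, list):
        for name in src:
            data[name] = func(data[name])
    elif isinstance(src, tuple):
        results = func(tuple(data[name] for name in src))
        for name, result in zip(src, results):
            data[name] = result
    else:
        # single component name
        data[src] = func(data[src])
    return data
```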
Examples
apply_parallel(make_masks_fn, src='images', dst='masks')
apply_parallel(apply_mask, src=('images', 'masks'), dst='images_with_masks')
apply_parallel(rotate, src=['images', 'masks'], dst=['images', 'masks'], p=.2)
apply_parallel(MyBatch.some_static_method, p=.5)
apply_parallel(B.some_method, src='features', p=.5)
TODO: move logic of applying post function from inbatch_parallel here, as well as remove use_self arg.
- property array_of_nones¶
NumPy array with None values.
- Type
1-D ndarray
- as_dataset(dataset=None, copy=False)[source]¶
Makes a new dataset from batch data
- Parameters
dataset – an instance or a subclass of Dataset
copy (bool) – whether to copy batch data to allow for further inplace transformations
- Returns
an instance of the class specified by the dataset arg, preloaded with this batch data
- components = None¶
- property data¶
tuple or named components - batch data
- property data_setter¶
tuple or named components - batch data
- property dataset¶
Dataset - a dataset the batch has been taken from
- dump(*args, dst=None, fmt=None, components=None, **kwargs)[source]¶
Save data to another array or a file.
- Parameters
dst – a destination (e.g. an array or a file name)
fmt (str) – a destination format, one of None, ‘blosc’, ‘csv’, ‘hdf5’, ‘feather’
components (None or str or tuple of str) – components to save
*args – other parameters are passed to format-specific writers
**kwargs – other parameters are passed to format-specific writers
- property indices¶
numpy array - an array with the indices
- property items¶
list - batch items
- load(*args, src=None, fmt=None, dst=None, **kwargs)[source]¶
Load data from another array or a file.
- Parameters
Notes
Loading creates new components if necessary.
Examples
Load data from a pandas dataframe’s columns into all batch components:
batch.load(src=dataframe)
Load data from dataframe’s columns features and labels into components features and labels:
batch.load(src=dataframe, dst=('features', 'labels'))
Load a dataframe into a component features:
batch.load(src=dataframe, dst='features')
Load data from a dict into components images and masks:
batch.load(src=dict(images=images_array, masks=masks_array), dst=('images', 'masks'))
Load data from a tuple into components images and masks:
batch.load(src=(images_array, masks_array), dst=('images', 'masks'))
Load data from an array into a component images:
batch.load(src=images_array, dst='images')
Load data from a CSV file columns into components features and labels:
batch.load(fmt='csv', src='/path/to/file.csv', dst=('features', 'labels'), index_col=0)
- classmethod merge(batches, batch_size=None, components=None, batch_class=None)[source]¶
Merge several batches to form a new batch of a given size
- Parameters
batches (tuple of batches) –
batch_size (int or None) – if None, merge all batches into a single batch (the rest will be None); if int, make one batch of size batch_size and another batch with the rest of the data.
components (str, tuple or None) – if None, all components from initial batches will be created, if str or tuple, then create these components in new batches.
batch_class (Batch or None) – if None, created batches will be of the same class as initial batch, if Batch, created batches will be of that class.
- Returns
batch, rest
- Return type
tuple of two batches
- Raises
ValueError – If component is None in some batches and not None in others.
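The merge semantics described above can be sketched with plain lists standing in for batch data. This is assumed, simplified behavior (real merging works per component and preserves batch classes):

```python
# Toy sketch of Batch.merge: concatenate, then optionally split off
# a batch of batch_size items, returning (batch, rest).
def merge(batches, batch_size=None):
    data = [item for batch in batches for item in batch]
    if batch_size is None:
        return data, None            # one merged batch, no rest
    rest = data[batch_size:]
    return data[:batch_size], rest if rest else None
```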
- classmethod merge_component(component=None, data=None)[source]¶
Merge the same component data from several batches
- property pipeline¶
Pipeline - a pipeline the batch is being used in
- property random¶
A random number generator (numpy.random.Generator). Use it instead of np.random for reproducibility.
Examples
x = self.random.normal(0, 1)
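The reproducibility point can be illustrated with the standard library's random.Random standing in for numpy.random.Generator (same idea, different API):

```python
import random

# Two generators created from the same seed produce identical streams;
# seeding a per-batch generator is what makes pipeline runs reproducible.
rng_a = random.Random(42)
rng_b = random.Random(42)
draws_a = [rng_a.random() for _ in range(3)]
draws_b = [rng_b.random() for _ in range(3)]
```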
- property random_seed¶
SeedSequence for random number generation
- run_once(*args, **kwargs)[source]¶
Init function for no parallelism. Useful for async action-methods (will wait till the method finishes).
- property size¶
int - number of items in the batch