Dataset¶
- class Dataset(index, batch_class=Batch, *args, preloaded=None, cast_to_array=True, copy=False, **kwargs)[source]
The Dataset holds an index of all data items (e.g. customers, transactions, etc.) and a specific action class to process small subsets of data (batches).
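A minimal sketch of creating a dataset (assuming plain integer indices are sufficient; any Batch subclass can be passed as batch_class):
>>> from batchflow import Dataset, Batch
>>> dataset = Dataset(10, batch_class=Batch)  # index holds items 0..9
>>> dataset.indices
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])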
- batch_class
the class of batches created from this dataset
- Type
type, a subclass of Batch
- index
the dataset’s index
- Type
DatasetIndex
- indices
an array with the indices
- Type
numpy.ndarray
- p
Actions which will be applied to this dataset
- Type
Pipeline
- preloaded
For small datasets it can be convenient to preload all data at once
- Type
data-type
- train
The train part of this dataset. It appears after splitting
- Type
Dataset
- test
The test part of this dataset. It appears after splitting
- Type
Dataset
- validation
The validation part of this dataset. It appears after splitting
- Type
Dataset
- CV(expr)[source]
Return a dataset which corresponds to the fold defined by a NamedExpression
- static build_index(index, *args, **kwargs)[source]
Check whether index is a DatasetIndex instance; if it is not, create a DatasetIndex from the inputs.
- Parameters
index (DatasetIndex or any) – an index or an object to build an index from
- Returns
a new or the original index
- Return type
DatasetIndex
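A brief sketch (the second call assumes some_index is already a DatasetIndex, which is returned as-is):
>>> Dataset.build_index(10)          # creates a DatasetIndex with items 0..9
>>> Dataset.build_index(some_index)  # some_index is a pre-built DatasetIndex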
- calc_split(shares=0.8)
Calculate split into train, test and validation subsets
- Parameters
shares (float or a sequence of floats) – shares of the train, test and validation subsets respectively
- Returns
a tuple which contains the number of items in the train, test and validation subsets
- Return type
tuple of 3 ints
- Raises
- if shares has more than 3 items
- if the sum of shares is greater than 1
- if this set does not have enough items to split
Examples
Split into train / test in 80/20 ratio
>>> some_set.calc_split()
Split into train / test / validation in 60/30/10 ratio
>>> some_set.calc_split([0.6, 0.3])
Split into train / test / validation in 50/30/20 ratio
>>> some_set.calc_split([0.5, 0.3, 0.2])
- copy()[source]
Make a shallow copy of the dataset object
- create_attrs(**kwargs)[source]
Create attributes from kwargs
- create_batch(index, pos=False, *args, **kwargs)[source]
Create a batch from given indices.
- Parameters
index (DatasetIndex) – Indices of dataset objects that should be included into batch
pos (bool) – Whether index contains elements’ positions. Defaults to False.
- Returns
a batch with the given items
- Return type
Batch
Notes
If pos is False, then index should contain the indices that should be included in the batch, otherwise index should contain their positions in current index.
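A short sketch (assuming a dataset over plain integer indices 0..9):
>>> dataset = Dataset(10)
>>> batch = dataset.create_batch(dataset.indices[:3])   # by index values
>>> batch = dataset.create_batch([0, 1, 2], pos=True)   # by positions in the index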
- create_subset(index)[source]
Create a dataset based on the given subset of indices
- Parameters
index (DatasetIndex or np.array) – indices of the source dataset’s items to include into the subset
- Returns
- Return type
Dataset
- Raises
IndexError – if the given index contains items that lie outside the source dataset’s index.
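A short sketch (again assuming integer indices 0..9):
>>> dataset = Dataset(10)
>>> subset = dataset.create_subset(dataset.indices[:5])  # first 5 items
>>> dataset.create_subset([100])                          # raises IndexError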
- cv(n)[source]
Return a dataset which corresponds to the n-th cross-validation split
- cv_split(method='kfold', n_splits=5, shuffle=False)[source]
Create datasets for cross-validation
Datasets are available as cv0, cv1 and so on. And they are already split into train and test parts.
Another way to access these splits is train.cv0, train.cv1, …, test.cv0, test.cv1, …
Note that each pair (e.g. cv0.train and train.cv0) refers to the very same instance of a dataset, i.e. if you change train.cv0, cv0.train will also change.
- Parameters
method (str) – a splitting method, 'kfold' by default
n_splits (int) – the number of folds, 5 by default
shuffle – whether to randomize the items order before splitting (same options as in split())
Examples
>>> dataset = Dataset(10)
>>> dataset.cv_split(n_splits=3)
>>> print(dataset.cv0.test.indices)  # [0, 1, 2, 3]
>>> print(dataset.cv1.test.indices)  # [4, 5, 6]
>>> print(dataset.cv2.test.indices)  # [7, 8, 9]
>>> print(dataset.test.cv0.indices)  # [0, 1, 2, 3]
>>> print(dataset.test.cv1.indices)  # [4, 5, 6]
>>> print(dataset.test.cv2.indices)  # [7, 8, 9]
- property data
Return preloaded data
- classmethod from_dataset(dataset, index, batch_class=None, copy=False, **kwargs)[source]
Create a Dataset object from another dataset with a new index (usually a subset of the source dataset index)
- Parameters
dataset (Dataset) – Source dataset
index (DatasetIndex) – Set of items from source dataset which should be in the new Dataset
batch_class (type) – a subclass of Batch
copy (bool) – whether to create a copy of the dataset or use the same instance wherever possible
- Returns
- Return type
Dataset
- gen_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, notifier=False, *args, **kwargs)
Generate batches
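A typical iteration sketch (process is a hypothetical user function):
>>> for batch in dataset.gen_batch(batch_size=4, n_epochs=1, shuffle=True):
...     process(batch)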
- get_attrs()[source]
Return additional attrs as kwargs
- classmethod get_default_iter_params()
Return iteration params with default values to start iteration from scratch
- property index
the dataset’s index
- Type
dataset.DatasetIndex
- property indices
an array with the indices
- Type
numpy.ndarray
- property is_split
True if dataset has been split into train / test / validation subsets
- Type
bool
- next_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, iter_params=None, *args, **kwargs)
Return a batch
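Unlike gen_batch, next_batch returns a single batch per call, so it fits an explicit loop (a sketch; n_epochs=None is assumed to allow iterating without an epoch limit):
>>> for _ in range(100):
...     batch = dataset.next_batch(batch_size=4, shuffle=True, n_epochs=None)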
- property p
A short alias for pipeline()
- pipeline(*args, **kwargs)[source]
Start a new data processing workflow
- Parameters
config (Config or dict) – Config lets you initialize variables in the Pipeline object, e.g. for the augmentation task https://analysiscenter.github.io/batchflow/intro/pipeline.html#initializing-a-variable
- Returns
a new pipeline for this dataset
- Return type
Pipeline
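A sketch of starting a workflow (the config key 'alpha' is hypothetical):
>>> ppl = dataset.pipeline(config={'alpha': 0.5})
>>> ppl = dataset.p  # the short alias creates an equivalent pipeline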
- reset(what='iter')
Clear all iteration metadata in order to start iterating from scratch
- reset_iter()
- property size
the number of items in the set
- Type
int
- split(shares=0.8, shuffle=False)
Split the dataset into train, test and validation sub-datasets. Subsets are available as .train, .test and .validation respectively.
- Parameters
shares (float, tuple of 2 floats, or tuple of 3 floats) – train/test/validation shares. Default is 0.8.
shuffle (bool, numpy.random.RandomState, int or callable) – whether to randomize the items order before splitting into subsets. Default is False. Can be:
- bool – False to keep the order of indices in the index, True to make random subsets
- a numpy.random.RandomState object which has an inplace shuffle method
- int – a random seed number which will be used internally to create a numpy.random.RandomState object
- callable – a function which gets an order and returns a shuffled order
Examples
Split into train / test in 80/20 ratio
>>> dataset.split()
Split into train / test / validation in 60/30/10 ratio
>>> dataset.split([0.6, 0.3])
Split into train / test / validation in 50/30/20 ratio
>>> dataset.split([0.5, 0.3, 0.2])
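Passing an int as shuffle seeds the randomization, making the split reproducible (a sketch; subset sizes use the size property documented above):
>>> dataset.split([0.6, 0.3], shuffle=42)  # seeded random split
>>> dataset.train.size, dataset.test.size, dataset.validation.size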