Dataset

class Dataset(index, batch_class=Batch, *args, preloaded=None, cast_to_array=True, copy=False, **kwargs)[source]

The Dataset holds an index of all data items (e.g. customers or transactions) and a specific action class to process a small subset of the data (a batch).

batch_class
Type

Batch

index
Type

DatasetIndex or FilesIndex

indices

an array with the indices

Type

numpy.ndarray

p

Actions which will be applied to this dataset

Type

Pipeline

preloaded

For small datasets it can be convenient to preload the data up front

Type

data-type

train

The train part of this dataset. It appears after splitting

Type

Dataset

test

The test part of this dataset. It appears after splitting

Type

Dataset

validation

The validation part of this dataset. It appears after splitting

Type

Dataset

CV(expr)[source]

Return a dataset which corresponds to the fold defined by a NamedExpression

static build_index(index, *args, **kwargs)[source]

Check whether index is a DatasetIndex instance; if it is not, create a DatasetIndex from the inputs

Parameters

index (DatasetIndex or any) –

Returns

Return type

DatasetIndex

calc_split(shares=0.8)

Calculate split into train, test and validation subsets

Parameters

shares (float or a sequence of floats) – Shares of the train, test and validation subsets, respectively.

Returns

Return type

a tuple with the number of items in the train, test and validation subsets

Raises

ValueError

  • if shares has more than 3 items

  • if the sum of shares is greater than 1

  • if this set does not have enough items to split

Examples

Split into train / test in 80/20 ratio

>>> some_set.calc_split()

Split into train / test / validation in 60/30/10 ratio

>>> some_set.calc_split([0.6, 0.3])

Split into train / test / validation in 50/30/20 ratio

>>> some_set.calc_split([0.5, 0.3, 0.2])
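The arithmetic behind these calls can be sketched as follows. This is a simplified stand-in for illustration, not batchflow's actual implementation; the helper name calc_split_sizes is hypothetical:

```python
def calc_split_sizes(n_items, shares=0.8):
    """Turn shares into (train, test, validation) item counts.

    A simplified sketch of the share arithmetic; batchflow's actual code
    may differ in rounding and validation details.
    """
    if isinstance(shares, float):
        shares = [shares]
    if len(shares) > 3:
        raise ValueError("shares cannot have more than 3 items")
    if sum(shares) > 1:
        raise ValueError("shares must not sum to more than 1")
    # Missing shares are filled so that train/test/validation always exist
    if len(shares) == 1:
        shares = [shares[0], 1 - shares[0], 0]
    elif len(shares) == 2:
        shares = [shares[0], shares[1], 1 - shares[0] - shares[1]]
    n_train = round(n_items * shares[0])
    n_test = round(n_items * shares[1])
    # Validation gets the remainder, so the counts always sum to n_items
    n_validation = n_items - n_train - n_test
    return n_train, n_test, n_validation
```

For a 10-item set, calc_split_sizes(10) gives (8, 2, 0) and calc_split_sizes(10, [0.6, 0.3]) gives (6, 3, 1), matching the examples above.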

copy()[source]

Make a shallow copy of the dataset object

create_attrs(**kwargs)[source]

Create attributes from kwargs

create_batch(index, pos=False, *args, **kwargs)[source]

Create a batch from given indices.

Parameters
  • index (DatasetIndex) – Indices of dataset objects that should be included into batch

  • pos (bool) – Whether index contains element positions rather than item indices. Defaults to False

Returns

Return type

Batch

Notes

If pos is False, then index should contain the indices that should be included in the batch, otherwise index should contain their positions in current index.
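The distinction can be illustrated with a small sketch (the item ids below are made up for illustration):

```python
import numpy as np

# A hypothetical dataset index of item ids (ids need not be 0..n-1)
dataset_index = np.array([101, 102, 103, 104, 105])

# pos=False: the index argument already contains item ids
batch_ids = np.array([102, 105])

# pos=True: the index argument contains positions within the current
# index, which must be resolved to item ids first
positions = np.array([1, 4])
resolved_ids = dataset_index[positions]  # resolves to the same two ids
```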

create_subset(index)[source]

Create a dataset based on the given subset of indices

Parameters

index (DatasetIndex or np.array) –

Returns

Return type

Dataset

Raises

IndexError – When some of the given indices are not contained in the source dataset’s index. A subset can only be created from indices that lie within the index of the source dataset.
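The containment check can be sketched like this (a simplified illustration, not batchflow's actual code; the helper name is hypothetical):

```python
import numpy as np

def check_subset_index(source_indices, subset_indices):
    """Raise IndexError if subset_indices is not contained in source_indices.

    A sketch of the bounds check performed when creating a subset.
    """
    mask = np.isin(subset_indices, source_indices)
    if not mask.all():
        missing = np.asarray(subset_indices)[~mask]
        raise IndexError(f"indices not found in the source dataset: {missing}")
```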

cv(n)[source]

Return a dataset which corresponds to n-th CV split

cv_split(method='kfold', n_splits=5, shuffle=False)[source]

Create datasets for cross-validation

Datasets are available as cv0, cv1 and so on, and they are already split into train and test parts.

Another way to access these splits is train.cv0, train.cv1, …, test.cv0, test.cv1, …

Note that each pair (e.g. cv0.train and train.cv0) refers to the very same instance of a dataset, i.e. if you change train.cv0, cv0.train will also change.

Parameters
  • method ({'kfold'}) – a splitting method (only kfold is supported)

  • n_splits (int) – a number of folds

  • shuffle – specifies the order of items (see shuffle())

Examples

dataset = Dataset(10)
dataset.cv_split(n_splits=3)
print(dataset.cv0.test.indices) # [0, 1, 2, 3]
print(dataset.cv1.test.indices) # [4, 5, 6]
print(dataset.cv2.test.indices) # [7, 8, 9]
print(dataset.test.cv0.indices) # [0, 1, 2, 3]
print(dataset.test.cv1.indices) # [4, 5, 6]
print(dataset.test.cv2.indices) # [7, 8, 9]

property data

Return preloaded data

classmethod from_dataset(dataset, index, batch_class=None, copy=False, **kwargs)[source]

Create a Dataset object from another dataset with a new index (usually a subset of the source dataset index)

Parameters
  • dataset (Dataset) – Source dataset

  • index (DatasetIndex) – Set of items from source dataset which should be in the new Dataset

  • batch_class (type) – a subclass of Batch class

  • copy (bool) – whether to create a copy of the dataset or use the same instance wherever possible

Returns

Return type

Dataset

gen_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, notifier=False, *args, **kwargs)

Generate batches

get_attrs()[source]

Return additional attrs as kwargs

classmethod get_default_iter_params()

Return iteration params with default values to start iteration from scratch

property index

the dataset’s index

Type

dataset.DatasetIndex

property indices

an array with the indices

Type

numpy.ndarray

property is_split

True if dataset has been split into train / test / validation subsets

Type

bool

next_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, iter_params=None, *args, **kwargs)

Return a batch

property p

A short alias for pipeline()

pipeline(*args, **kwargs)[source]

Start a new data processing workflow

Parameters

config (Config or dict) – Config lets you initialize variables in the Pipeline object, e.g. for the augmentation task https://analysiscenter.github.io/batchflow/intro/pipeline.html#initializing-a-variable

Returns

Return type

Pipeline

reset(what='iter')

Clear all iteration metadata in order to start iterating from scratch

reset_iter()

property size

int - number of items in the set

split(shares=0.8, shuffle=False)

Split the dataset into train, test and validation sub-datasets. Subsets are available as .train, .test and .validation respectively.

Parameters
  • shares (float, tuple of 2 floats, or tuple of 3 floats) – train/test/validation shares. Default is 0.8.

  • shuffle (bool, numpy.random.RandomState, int or callable) –

    whether to randomize the order of items before splitting into subsets. Default is False. Can be

    • bool – False to make subsets in the order of indices in the index, True to make random subsets.

    • a numpy.random.RandomState object which has an inplace shuffle method.

    • int – a random seed number which will be used internally to create a numpy.random.RandomState object.

    • callable – a function which gets an order and returns a shuffled order.
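The accepted shuffle values can be sketched as a normalization into an item order. This is an illustrative simplification, not batchflow's actual code; the helper name make_order is hypothetical:

```python
import numpy as np

def make_order(n_items, shuffle=False):
    """Normalize the shuffle argument into an order of item positions.

    A sketch of the accepted shuffle values described above.
    """
    order = np.arange(n_items)
    if shuffle is False:
        return order                       # keep the index order
    if shuffle is True:
        shuffle = np.random.RandomState()  # unseeded random order
    if isinstance(shuffle, int):
        shuffle = np.random.RandomState(shuffle)  # int acts as a seed
    if isinstance(shuffle, np.random.RandomState):
        shuffle.shuffle(order)             # RandomState shuffles in place
        return order
    if callable(shuffle):
        return shuffle(order)              # custom order-producing function
    raise ValueError(f"unsupported shuffle value: {shuffle!r}")
```

With an int seed the resulting order is reproducible across calls, which is useful for repeatable train/test splits.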

Examples

Split into train / test in 80/20 ratio

>>> dataset.split()

Split into train / test / validation in 60/30/10 ratio

>>> dataset.split([0.6, 0.3])

Split into train / test / validation in 50/30/20 ratio

>>> dataset.split([0.5, 0.3, 0.2])