Dataset¶
- class Dataset(index, batch_class=Batch, *args, preloaded=None, cast_to_array=True, copy=False, **kwargs)[source]
The Dataset holds an index of all data items (e.g. customers, transactions, etc.) and a specific action class to process small subsets of data (batches).
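A minimal sketch of creating a dataset (assuming plain integer indices are sufficient; any Batch subclass can be passed as batch_class):
>>> from batchflow import Dataset, Batch
>>> dataset = Dataset(10, batch_class=Batch)  # index holds items 0..9
>>> dataset.indices
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])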
- batch_class
the class of batches created from this dataset
- Type
type, a subclass of Batch
- index
the dataset’s index
- Type
DatasetIndex
- indices
an array with the indices
- Type
numpy.ndarray
- p
Actions which will be applied to this dataset
- Type
Pipeline
- preloaded
For small datasets it can be convenient to preload all data at once
- Type
data-type
- train
The train part of this dataset. It appears after splitting
- Type
Dataset
- test
The test part of this dataset. It appears after splitting
- Type
Dataset
- validation
The validation part of this dataset. It appears after splitting
- Type
Dataset
- CV(expr)[source]
Return a dataset which corresponds to the fold defined by a NamedExpression
- static build_index(index, *args, **kwargs)[source]
Check whether index is a DatasetIndex instance; if it is not, create a DatasetIndex from the inputs.
- Parameters
index (DatasetIndex or any) – an index or an object to build an index from
- Returns
a new or the original index
- Return type
DatasetIndex
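A brief sketch (the second call assumes some_index is already a DatasetIndex, which is returned as-is):
>>> Dataset.build_index(10)          # creates a DatasetIndex with items 0..9
>>> Dataset.build_index(some_index)  # some_index is a pre-built DatasetIndex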
- calc_split(shares=0.8)
Calculate split into train, test and validation subsets
- Parameters
shares (float or a sequence of floats) – shares of the train, test and validation subsets respectively
- Returns
a tuple which contains the number of items in the train, test and validation subsets
- Return type
tuple of 3 ints
- Raises
- if shares has more than 3 items
- if the sum of shares is greater than 1
- if this set does not have enough items to split
Examples
Split into train / test in 80/20 ratio
>>> some_set.calc_split()
Split into train / test / validation in 60/30/10 ratio
>>> some_set.calc_split([0.6, 0.3])
Split into train / test / validation in 50/30/20 ratio
>>> some_set.calc_split([0.5, 0.3, 0.2])
- copy()[source]
Make a shallow copy of the dataset object
- create_attrs(**kwargs)[source]
Create attributes from kwargs
- create_batch(index, pos=False, *args, **kwargs)[source]
Create a batch from given indices.
- Parameters
index (DatasetIndex) – Indices of dataset objects that should be included into batch
pos (bool) – Whether index contains elements’ positions. Defaults to False.
- Returns
a batch with the given items
- Return type
Batch
Notes
If pos is False, then index should contain the indices that should be included in the batch, otherwise index should contain their positions in current index.
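A short sketch (assuming a dataset over plain integer indices 0..9):
>>> dataset = Dataset(10)
>>> batch = dataset.create_batch(dataset.indices[:3])   # by index values
>>> batch = dataset.create_batch([0, 1, 2], pos=True)   # by positions in the index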
- create_subset(index)[source]
Create a dataset based on the given subset of indices
- Parameters
index (DatasetIndex or np.array) – indices of the source dataset’s items to include into the subset
- Returns
- Return type
Dataset
- Raises
IndexError – if the given index contains items that lie outside the source dataset’s index.
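A short sketch (again assuming integer indices 0..9):
>>> dataset = Dataset(10)
>>> subset = dataset.create_subset(dataset.indices[:5])  # first 5 items
>>> dataset.create_subset([100])                          # raises IndexError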
- cv(n)[source]
Return a dataset which corresponds to the n-th cross-validation split
- cv_split(method='kfold', n_splits=5, shuffle=False)[source]
Create datasets for cross-validation
Datasets are available as cv0, cv1 and so on. And they are already split into train and test parts.
Another way to access these splits is train.cv0, train.cv1, …, test.cv0, test.cv1, …
Note that each pair (e.g. cv0.train and train.cv0) refers to the very same instance of a dataset, i.e. if you change train.cv0, cv0.train will also change.
- Parameters
method (str) – a splitting method, 'kfold' by default
n_splits (int) – the number of folds, 5 by default
shuffle – whether to randomize the items order before splitting (same options as in split())
Examples
>>> dataset = Dataset(10)
>>> dataset.cv_split(n_splits=3)
>>> print(dataset.cv0.test.indices)  # [0, 1, 2, 3]
>>> print(dataset.cv1.test.indices)  # [4, 5, 6]
>>> print(dataset.cv2.test.indices)  # [7, 8, 9]
>>> print(dataset.test.cv0.indices)  # [0, 1, 2, 3]
>>> print(dataset.test.cv1.indices)  # [4, 5, 6]
>>> print(dataset.test.cv2.indices)  # [7, 8, 9]
- property data
Return preloaded data
- classmethod from_dataset(dataset, index, batch_class=None, copy=False, **kwargs)[source]
Create a Dataset object from another dataset with a new index (usually a subset of the source dataset index)
- Parameters
dataset (Dataset) – Source dataset
index (DatasetIndex) – Set of items from source dataset which should be in the new Dataset
batch_class (type) – a subclass of Batch
copy (bool) – whether to create a copy of the dataset or use the same instance wherever possible
- Returns
- Return type
Dataset
- gen_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, notifier=False, *args, **kwargs)
Generate batches
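A typical iteration sketch (process is a hypothetical user function):
>>> for batch in dataset.gen_batch(batch_size=4, n_epochs=1, shuffle=True):
...     process(batch)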
- get_attrs()[source]
Return additional attrs as kwargs
- classmethod get_default_iter_params()
Return iteration params with default values to start iteration from scratch
- property index
the dataset’s index
- Type
dataset.DatasetIndex
- property indices
an array with the indices
- Type
numpy.ndarray
- property is_split
True if dataset has been split into train / test / validation subsets
- Type
bool
- next_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, iter_params=None, *args, **kwargs)
Return a batch
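Unlike gen_batch, next_batch returns a single batch per call, so it fits an explicit loop (a sketch; n_epochs=None is assumed to allow iterating without an epoch limit):
>>> for _ in range(100):
...     batch = dataset.next_batch(batch_size=4, shuffle=True, n_epochs=None)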
- property p
A short alias for pipeline()
- pipeline(*args, **kwargs)[source]
Start a new data processing workflow
- Parameters
config (Config or dict) – Config lets you initialize variables in the Pipeline object, e.g. for the augmentation task https://analysiscenter.github.io/batchflow/intro/pipeline.html#initializing-a-variable
- Returns
a new pipeline for this dataset
- Return type
Pipeline
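A sketch of starting a workflow (the config key 'alpha' is hypothetical):
>>> ppl = dataset.pipeline(config={'alpha': 0.5})
>>> ppl = dataset.p  # the short alias creates an equivalent pipeline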
- reset(what='iter')
Clear all iteration metadata in order to start iterating from scratch
- reset_iter()
- property size
the number of items in the set
- Type
int
- split(shares=0.8, shuffle=False)
Split the dataset into train, test and validation sub-datasets. Subsets are available as .train, .test and .validation respectively.
- Parameters
shares (float, tuple of 2 floats, or tuple of 3 floats) – train/test/validation shares. Default is 0.8.
shuffle (bool, numpy.random.RandomState, int or callable) – whether to randomize the items order before splitting into subsets. Default is False. Can be:
- bool – False to keep the order of indices in the index, True to make random subsets
- a numpy.random.RandomState object which has an inplace shuffle method
- int – a random seed number which will be used internally to create a numpy.random.RandomState object
- callable – a function which gets an order and returns a shuffled order
Examples
Split into train / test in 80/20 ratio
>>> dataset.split()
Split into train / test / validation in 60/30/10 ratio
>>> dataset.split([0.6, 0.3])
Split into train / test / validation in 50/30/20 ratio
>>> dataset.split([0.5, 0.3, 0.2])
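Passing an int as shuffle seeds the randomization, making the split reproducible (a sketch; subset sizes use the size property documented above):
>>> dataset.split([0.6, 0.3], shuffle=42)  # seeded random split
>>> dataset.train.size, dataset.test.size, dataset.validation.size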