Dataset

Creating a dataset

The Dataset holds an index of all data items (e.g. customers or transactions) and a specific batch class which processes small subsets of the data (batches).

from batchflow import DatasetIndex, Dataset, Batch

client_index = DatasetIndex(client_ids_list)
client_dataset = Dataset(client_index)

By default, a dataset generates batches of the Batch class. A custom batch class may also be provided:

client_dataset = Dataset(client_index, batch_class=MyBatch)

Preloading data

For smaller datasets it might be convenient to preload all data at once:

client_dataset = Dataset(client_index, batch_class=Batch, preloaded=data)

As a result, each created batch will contain a portion of this data. For this to work, the preloaded data container should have a certain structure.

If a batch class does not contain components, preloaded should be indexed with the dataset indices, i.e. preloaded[dataset.indices[0]] returns the first item from the dataset.

So preloaded could be a numpy array, a pandas DataFrame or anything else that supports advanced indexing. For convenience, an ordinary dict is also allowed (advanced indexing for dicts is implemented internally).
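For example, a plain dict keyed by the dataset indices fits this structure. A minimal sketch, assuming the hypothetical client ids and random data below:

import numpy as np
from batchflow import DatasetIndex, Dataset, Batch

client_ids_list = ['c01', 'c02', 'c03']
# map each index to its data item
data = {cid: np.random.rand(5) for cid in client_ids_list}

client_index = DatasetIndex(client_ids_list)
client_dataset = Dataset(client_index, batch_class=Batch, preloaded=data)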

If a batch class contains components, preloaded should be indexed with component names first and then with the dataset indices, i.e. preloaded['images'][0] returns the first item of the images component.

For instance, a pandas.DataFrame fits the purpose very well. However, other data structures are also allowed. As in the previous case, preloaded[component] should support advanced indexing (and again a dict may be used here as well).
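A minimal sketch, assuming a hypothetical MyBatch class that declares images and labels components (see Custom batch class below) and random data:

import numpy as np
from batchflow import Dataset

# one container per component, each indexed by the dataset indices (here 0..99)
preloaded = dict(
    images=np.random.rand(100, 32, 32),
    labels=np.random.randint(0, 10, size=100),
)
dataset = Dataset(100, batch_class=MyBatch, preloaded=preloaded)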

Adding custom data

To store dataset-specific data which can be later accessed within pipelines or batches, just pass it as keyword arguments when instantiating a dataset.

client_dataset = Dataset(client_index, locations=loc_data, products=product_list)

Custom data is available as dataset attributes, e.g. client_dataset.locations and client_dataset.products.

Splitting a dataset

A dataset can be easily split into train, test and validation subsets.

client_dataset.split([.8, .1, .1])

All parts are also datasets, which can be addressed as dataset.train, dataset.test and dataset.validation.
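For example, after the split above you can inspect the parts:

len(client_dataset.train)      # about 80% of the items
client_dataset.test.indices    # indices that ended up in the test subset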

Parameters

shares - train/test/validation shares. Can be float, tuple of 2 floats, or tuple of 3 floats.

shuffle - whether to randomize the order of items before splitting into subsets (see the sketch after this list). Default is False. Can be

  • bool : False - to make subsets in the order of indices in the index, or True - to make random subsets.

  • a numpy.random.RandomState object which has an in-place shuffle method.

  • int - a random seed number which will be used internally to create a numpy.random.RandomState object.

  • callable - a function which gets an order and returns a shuffled order.
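A minimal sketch of the different shuffle arguments:

import numpy as np

client_dataset.split([.8, .1, .1], shuffle=True)             # random order
client_dataset.split([.8, .2], shuffle=42)                   # seeded, reproducible
client_dataset.split(.8, shuffle=np.random.RandomState(42))  # explicit RandomState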

Cross-validation

A dataset can also be partitioned for cross-validation.

dataset.cv_split(n_splits=3, shuffle=True)

The partitions, which are also datasets, are available as dataset.cv0, dataset.cv1 and so on. Each partition is already split into train and test parts.

dataset = Dataset(10)
dataset.cv_split(n_splits=3, shuffle=False)
dataset.cv0.test.indices # [0, 1, 2, 3]
dataset.cv1.test.indices # [4, 5, 6]
dataset.cv2.test.indices # [7, 8, 9]
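For example, the folds can be iterated with getattr (a sketch; replace the comment with actual training code):

for fold in range(3):
    partition = getattr(dataset, 'cv{}'.format(fold))
    train, test = partition.train, partition.test
    # fit a model on train and evaluate it on test here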

Iterating over a dataset

And now you can conveniently iterate over the dataset:

BATCH_SIZE = 200
for client_batch in client_dataset.gen_batch(BATCH_SIZE, shuffle=False, n_iters=3):
    # client_batch is an instance of Batch class which holds an index of the subset of the original dataset
    # so you can do anything you want with that batch
    # for instance, load some data, as the batch is empty when initialized
    batch_with_data = client_batch.load(client_data)

You might also create batches with next_batch function:

NUM_ITERS = 1000
for i in range(NUM_ITERS):
    client_batch = client_dataset.next_batch(BATCH_SIZE, shuffle=True, n_epochs=None)
    batch_with_data = client_batch.load(client_data)
    # ...

The only difference is that gen_batch() is a generator, while next_batch() is just an ordinary method.

Parameters

batch_size - number of items in the batch.

shuffle - whether to randomize the order of items before splitting into batches. Default is False. Can be

  • bool : False - to make batches in the order of indices in the index, or True - to make random batches.

  • a numpy.random.RandomState object which has an in-place shuffle method.

  • int - a random seed number which will be used internally to create a numpy.random.RandomState object.

  • callable - any function which gets an order and returns a shuffled order.

n_iters - number of iterations (i.e. number of batches created).

n_epochs - number of complete passes through the whole dataset.

Only one of n_iters and n_epochs should be defined. If both are None, then you will get an infinite sequence of batches.

drop_last - whether to skip the last batch if it has fewer items (for instance, if a dataset contains 10 items and the batch size is 3, there will be 3 batches of 3 items and a 4th batch with just 1 item; the last batch will be skipped if drop_last=True). See the sketch below and the API for more details.

bar - whether to show a tqdm bar.
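The drop_last example above as a short sketch:

dataset = Dataset(10)
batches = list(dataset.gen_batch(3, n_epochs=1, drop_last=True))
len(batches)   # 3, since the final 1-item batch is skipped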

Custom batch class

You can also define a new batch class with custom action methods to process your specific data.

from batchflow import Batch, action

class MyBatch(Batch):
    @action
    def my_custom_action(self):
        ...

    @action
    def another_custom_action(self):
        ...

And then create a dataset with a new batch class:

client_dataset = Dataset(client_index, batch_class=MyBatch)
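Action methods can then be called on the generated batches. A sketch, assuming each action returns the batch (the usual convention), so the calls can be chained:

for client_batch in client_dataset.gen_batch(BATCH_SIZE, n_epochs=1):
    client_batch.my_custom_action().another_custom_action()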

API

See Dataset API.