Dataset¶
Creating a dataset¶
The Dataset holds an index of all data items (e.g. customers, transactions, etc.) and a specific action class to process a small subset of data (a batch).
from batchflow import DatasetIndex, Dataset, Batch
client_index = DatasetIndex(client_ids_list)
client_dataset = Dataset(client_index)
By default, a dataset generates batches of the Batch class.
A custom batch class may also be provided:
client_dataset = Dataset(client_index, batch_class=MyBatch)
Preloading data¶
For smaller datasets it might be convenient to preload all data at once:
client_dataset = Dataset(client_index, batch_class=Batch, preloaded=data)
As a result, all created batches will contain a portion of the data. For this to work, the preloaded data container should have a certain structure.
If a batch class does not contain components, preloaded should be indexed with the dataset indices, i.e. preloaded[dataset.indices[0]] returns the first item from the dataset.
So preloaded could be a numpy array, a pandas dataframe or anything else that supports advanced indexing. For convenience, an ordinary dict is also allowed (as advanced indexing for dicts is implemented internally).
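For instance, a minimal sketch of the component-less case (the client ids and values below are made up):
from batchflow import DatasetIndex, Dataset, Batch

client_ids_list = ['c01', 'c02', 'c03']        # hypothetical client ids
data = {'c01': 1.0, 'c02': 2.0, 'c03': 3.0}    # a dict indexed by the dataset indices
client_dataset = Dataset(DatasetIndex(client_ids_list), batch_class=Batch, preloaded=data)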
If a batch class contains components, preloaded should be indexed with component names first and then with the dataset indices, i.e. preloaded['images'][0] returns the first item for the images component.
For instance, pandas.DataFrame fits the purpose very well. However, other data structures are also allowed. As in the previous case, preloaded[component] should support advanced indexing (and, again, a dict may be used here as well).
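To illustrate, here is a possible container for a hypothetical MyBatch class that declares images and labels components (the component names and array shapes are assumptions, not requirements):
import numpy as np

preloaded = {
    'images': np.zeros((100, 28, 28)),    # indexed by dataset indices along the first axis
    'labels': np.zeros(100, dtype='int64'),
}
dataset = Dataset(100, batch_class=MyBatch, preloaded=preloaded)
# preloaded['images'][dataset.indices[0]] is the first image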
Adding custom data¶
To store dataset-specific data which can be later accessed within pipelines or batches, just pass it as keyword-parameters when instantiating a dataset.
client_dataset = Dataset(client_index, locations=loc_data, products=product_list)
Custom data is available as dataset attributes, e.g. client_dataset.locations and client_dataset.products.
Splitting a dataset¶
A dataset can be easily split into train, test and validation subsets.
client_dataset.split([.8, .1, .1])
All parts are also datasets, which can be addressed as dataset.train, dataset.test and dataset.validation.
Parameters¶
shares - train/test/validation shares. Can be a float, a tuple of 2 floats, or a tuple of 3 floats.
shuffle - whether to randomize the order of items before splitting into subsets. Default is False. Can be one of the following (see the example after this list):
    bool - False to make subsets in the order of indices in the index, or True to make random subsets.
    numpy.random.RandomState - an object which has an inplace shuffle method.
    int - a random seed number which will be used internally to create a numpy.random.RandomState object.
    callable - a function which gets an order and returns a shuffled order.
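For example, a reproducible three-way split (the shares here are arbitrary):
# an int seed is used internally to create a numpy.random.RandomState,
# so the same seed always reproduces the same split
client_dataset.split([.7, .2, .1], shuffle=42)
print(client_dataset.train.indices)    # ~70% of the ids, in shuffled order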
Cross-validation¶
A dataset can also be partitioned for cross-validation.
dataset.cv_split(n_splits=3, shuffle=True)
The partitions, which are also datasets, become available as cv0, cv1 and so on, and each of them is already split into train and test parts.
dataset = Dataset(10)
dataset.cv_split(n_splits=3, shuffle=False)
dataset.cv0.test.indices # [0, 1, 2, 3]
dataset.cv1.test.indices # [4, 5, 6]
dataset.cv2.test.indices # [7, 8, 9]
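A typical usage pattern might then look like this (the loop body is just a placeholder):
for i in range(3):
    fold = getattr(dataset, f'cv{i}')    # dataset.cv0, dataset.cv1, dataset.cv2
    for batch in fold.train.gen_batch(2, n_epochs=1):
        ...                              # train a model on the batch
    # evaluate the model on fold.test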
Iterating over a dataset¶
And now you can conveniently iterate over the dataset:
BATCH_SIZE = 200
for client_batch in client_dataset.gen_batch(BATCH_SIZE, shuffle=False, n_iters=3):
    # client_batch is an instance of the Batch class which holds an index
    # of a subset of the original dataset, so you can do anything you want
    # with that batch; for instance, load some data, as the batch is empty
    # when initialized
    batch_with_data = client_batch.load(client_data)
You might also create batches with the next_batch function:
NUM_ITERS = 1000
for i in range(NUM_ITERS):
    client_batch = client_dataset.next_batch(BATCH_SIZE, shuffle=True, n_epochs=None)
    batch_with_data = client_batch.load(client_data)
    # ...
The only difference is that gen_batch() is a generator, while next_batch() is just an ordinary method.
Parameters¶
batch_size - number of items in the batch.
shuffle - whether to randomize the order of items before splitting into batches. Default is False. Can be one of the following:
    bool - False to make batches in the order of indices in the index, or True to make random batches.
    numpy.random.RandomState - an object which has an inplace shuffle method.
    int - a random seed number which will be used internally to create a numpy.random.RandomState object.
    callable - a sample function which gets an order and returns a shuffled order.
n_iters - number of iterations (i.e. number of batches created).
n_epochs - number of complete passes through the whole dataset.
Only one of n_iters and n_epochs should be defined. If both are None, then you will get an infinite sequence of batches.
drop_last - whether to skip the last batch if it has fewer items (for instance, if a dataset contains 10 items and the batch size is 3, then there will be 3 batches of 3 items and a 4th batch with just 1 item; the last batch will be skipped if drop_last=True). See the API for more details.
bar - whether to show a tqdm bar.
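Putting several of these parameters together, a sketch of a single epoch with a progress bar (the values are arbitrary):
# with 10 items and batch_size=3 there are three full batches;
# drop_last=True skips the fourth 1-item batch
for batch in client_dataset.gen_batch(3, n_epochs=1, drop_last=True, bar=True):
    pass    # process the batch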
Custom batch class¶
You can also define a new batch class with custom action methods to process your specific data.
from batchflow import Batch, action

class MyBatch(Batch):
    @action
    def my_custom_action(self):
        ...

    @action
    def another_custom_action(self):
        ...
And then create a dataset with the new batch class:
client_dataset = Dataset(client_index, batch_class=MyBatch)
API¶
See Dataset API.