=======
Dataset
=======

Creating a dataset
------------------
The `Dataset` holds an index of all data items (e.g. customers, transactions, etc.)
and a specific action class to process a small subset of data (a batch).

.. code-block:: python

    from batchflow import DatasetIndex, Dataset, Batch

    client_index = DatasetIndex(client_ids_list)
    client_dataset = Dataset(client_index)

By default, a dataset generates batches of the :class:`~batchflow.Batch` class.
A custom :doc:`batch class <batch>` may also be provided::

    client_dataset = Dataset(client_index, batch_class=MyBatch)


Preloading data
---------------
For smaller datasets it might be convenient to preload all data at once:

.. code-block:: python

    client_dataset = Dataset(client_index, batch_class=Batch, preloaded=data)

As a result, all created batches will contain a portion of `data`.

For this to work, the `preloaded` data container should have a certain structure.

If a batch class does not contain :ref:`components`, `preloaded` should be indexed
with the dataset indices, i.e. `preloaded[dataset.indices[0]]` returns the first item
from the dataset. So `preloaded` could be a numpy array, a pandas dataframe or anything
else that supports advanced indexing. For convenience, an ordinary `dict` is also
allowed (as advanced indexing for dicts is implemented internally).

If a batch class contains :ref:`components`, `preloaded` should be indexed with
component names first and then with the dataset indices, i.e. `preloaded['images'][0]`
returns the first item for the `images` component. For instance, a `pandas.DataFrame`
fits this purpose very well. However, other data structures are also allowed.
As in the previous case, `preloaded[component]` should support advanced indexing
(and again a `dict` may be used here as well).
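For example, for a batch class with `images` and `labels` :ref:`components`,
`preloaded` could be a plain `dict` keyed by component name. This is a minimal
sketch: the `MyBatch` class, the component names and the array shapes are all
illustrative, and the components are assumed to be declared via the `components`
class attribute of the :doc:`batch class <batch>`.

.. code-block:: python

    import numpy as np
    from batchflow import DatasetIndex, Dataset, Batch

    class MyBatch(Batch):
        # a hypothetical batch class with two components
        components = 'images', 'labels'

    index = DatasetIndex(np.arange(100))
    # indexed by component name first, then by dataset indices
    data = dict(images=np.random.rand(100, 28, 28),
                labels=np.random.randint(10, size=100))
    dataset = Dataset(index, batch_class=MyBatch, preloaded=data)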
Adding custom data
------------------
To store dataset-specific data which can later be accessed within pipelines or batches,
just pass it as keyword parameters when instantiating a dataset.

.. code-block:: python

    client_dataset = Dataset(client_index, locations=loc_data, products=product_list)

Custom data is available as dataset attributes, e.g. `client_dataset.locations`
and `client_dataset.products`.


Splitting a dataset
-------------------
A dataset can be easily split into train, test and validation subsets.

.. code-block:: python

    client_dataset.split([.8, .1, .1])

All the parts are also datasets, which can be addressed as `dataset.train`,
`dataset.test` and `dataset.validation`.

Parameters
^^^^^^^^^^
`shares` - train/test/validation shares. Can be a float, a tuple of 2 floats, or a tuple of 3 floats.

`shuffle` - whether to randomize the items order before splitting into subsets. Default is `False`. Can be

* `bool` : `False` - to make subsets in the order of indices in the index, or `True` - to make random subsets.
* a :class:`numpy.random.RandomState` object which has an inplace shuffle method.
* `int` - a random seed number which will be used internally to create a :class:`numpy.random.RandomState` object.
* `callable` - a function which gets an order and returns a shuffled order.


Cross-validation
----------------
A dataset can also be partitioned for cross-validation.

.. code-block:: python

    dataset.cv_split(n_splits=3, shuffle=True)

The partitions, which are also datasets, are available as `cv0`, `cv1` and so on,
and each of them is already split into train and test parts.

.. code-block:: python

    dataset = Dataset(10)
    dataset.cv_split(n_splits=3, shuffle=False)

    dataset.cv0.test.indices # [0, 1, 2, 3]
    dataset.cv1.test.indices # [4, 5, 6]
    dataset.cv2.test.indices # [7, 8, 9]


Iterating over a dataset
------------------------
And now you can conveniently iterate over the dataset::

    BATCH_SIZE = 200
    for client_batch in client_dataset.gen_batch(BATCH_SIZE, shuffle=False, n_iters=3):
        # client_batch is an instance of the Batch class which holds an index
        # of a subset of the original dataset, so you can do anything you want
        # with that batch - for instance, load some data into it,
        # as the batch is empty when initialized
        batch_with_data = client_batch.load(client_data)

You might also create batches with the `next_batch` function::

    NUM_ITERS = 1000
    for i in range(NUM_ITERS):
        client_batch = client_dataset.next_batch(BATCH_SIZE, shuffle=True, n_epochs=None)
        batch_with_data = client_batch.load(client_data)
        # ...

The only difference is that :func:`~batchflow.Dataset.gen_batch` is a generator,
while :func:`~batchflow.Dataset.next_batch` is an ordinary method.

Parameters
^^^^^^^^^^
`batch_size` - the number of items in each batch.

`shuffle` - whether to randomize the items order before splitting into batches. Default is `False`. Can be

* `bool` : `False` - to make batches in the order of indices in the index, or `True` - to make random batches.
* a :class:`numpy.random.RandomState` object which has an inplace shuffle method.
* `int` - a random seed number which will be used internally to create a :class:`numpy.random.RandomState` object.
* `sample function` - any callable which gets an order and returns a shuffled order.

`n_iters` - the number of iterations (i.e. the number of batches created).

`n_epochs` - the number of complete passes through the whole dataset.

Only one of `n_iters` and `n_epochs` should be defined. If both are `None`, you will get an infinite sequence of batches.

`drop_last` - whether to skip the last batch if it has fewer items (for instance, if a dataset contains 10 items and the batch size is 3, then there will be 3 batches of 3 items and a 4th batch with just 1 item; the last batch will be skipped if `drop_last=True`). See the :doc:`API <../api/batchflow.dataset>` for more details.

`bar` - whether to show a `tqdm` progress bar.


Custom batch class
------------------
You can also define a new :doc:`batch class <batch>` with custom action methods to process your specific data.

.. code-block:: python

    class MyBatch(Batch):
        @action
        def my_custom_action(self):
            ...

        @action
        def another_custom_action(self):
            ...

And then create a dataset with the new batch class::

    client_dataset = Dataset(client_index, batch_class=MyBatch)


API
---
See :doc:`Dataset API <../api/batchflow.dataset>`.
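
Putting it together
-------------------
A minimal end-to-end sketch of the workflow described above (the index size,
shares and batch size are illustrative, and the loop body is left empty):

.. code-block:: python

    import numpy as np
    from batchflow import DatasetIndex, Dataset, Batch

    dataset = Dataset(DatasetIndex(np.arange(1000)), batch_class=Batch)
    dataset.split([.8, .1, .1], shuffle=True)

    # iterate over the train subset for 2 epochs,
    # dropping the incomplete last batch of each epoch
    for batch in dataset.train.gen_batch(128, shuffle=True, n_epochs=2, drop_last=True):
        # the batch holds indices from the train subset only;
        # load data into it before processing
        ...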