====================
A short introduction
====================

Index
=====
An index holds a sequence of data item ids. As a dataset is split into batches, you need a mechanism to uniquely address each data item. In simple cases it can be just a `numpy.arange`:

.. code-block:: python

    dataset_index = DatasetIndex(np.arange(my_array.shape[0]))

`FilesIndex` is helpful when your data comes from multiple files:

.. code-block:: python

    dataset_index = FilesIndex("/path/to/files/*.png")

Most of the time, creating an index in one line of code is all you need to do about a dataset index. For more details see :doc:`How to work with an Index`.

Dataset
=======
A dataset consists of an index (a 1-d sequence with a unique key for each data item) and a batch class which processes small subsets of data.

.. code-block:: python

    client_ds = Dataset(dataset_index, batch_class=Batch)

Now you can iterate over sequential or random batches:

.. code-block:: python

    batch = client_ds.next_batch(BATCH_SIZE, shuffle=True, n_epochs=3)

You will rarely need anything more than creating a dataset in one line of code, but you can always dig deeper into :doc:`how to work with datasets`.

Batch
=====
A batch class holds the data and contains processing functions. Normally, you never create batch instances yourself, as they are created in the `Dataset` or `Pipeline` batch generators.

See more info about :doc:`useful batch methods and actions and how to create your own batch class`.

Pipeline
========
After a batch class is created, you can define a processing workflow for the whole dataset:

.. code-block:: python

    my_pipeline = (my_dataset.pipeline()
                     .load('/some/path')
                     .some_processing()
                     .another_processing()
                     .save('/other/path')
                     .run(BATCH_SIZE, shuffle=False))

All the methods here (except `run`) are :doc:`actions from the batch class`. Now you are ready for a deeper immersion into :doc:`how to create, use and run pipelines`.

Within-batch parallelism
========================
In order to accelerate data processing you can run batch methods in parallel:

.. code-block:: python

    from batchflow import Batch, inbatch_parallel, action

    class MyBatch(Batch):
        ...
        @action
        @inbatch_parallel(init='_init_fn', post='_post_fn', target='threads')
        def some_action(self, item):
            # process just one item from the batch
            return some_value

For more details see :doc:`How to make parallel methods`.

Inter-batch parallelism
=======================
To further increase pipeline performance and eliminate inter-batch delays, you may process several batches in parallel:

.. code-block:: python

    some_pipeline.next_batch(BATCH_SIZE, prefetch=3)

The `prefetch` parameter defines how many additional batches will be processed in the background. See more info about :doc:`prefetching`.

Models
======
Most often, pipelines are needed to train machine learning models or to make predictions with those models. See :doc:`Working with models` to understand what a model is and how to use it within pipelines. There is also a bunch of :doc:`predefined models` which you can use out of the box.

Research
========
To perform multiple experiments with different parameters you can use the `Research` class:

.. code-block:: python

    from batchflow.research import Research

    ...

    research = (Research()
        .add_pipeline(train_pipeline, variables='loss', name='train')
        .add_pipeline(test_pipeline, variables='accuracy', name='test',
                      import_model_from='train')
        .add_grid_config({'model_class': [VGG7, VGG16],
                          'layout': ['cna', 'can']})
        .run(n_reps=10, n_iters=1000))

See more info about :doc:`Research`.
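Putting it all together, here is a minimal end-to-end sketch built from the pieces shown above. It is an illustration only: `MyBatch`, `some_action`, `_init_fn`, `_post_fn`, the index of 1000 items and `BATCH_SIZE` are placeholder names, and the init/post contract (the init function returns the list of per-call arguments, the post function assembles the results back into a batch) is assumed from the decorator example above.

.. code-block:: python

    import numpy as np
    from batchflow import Dataset, DatasetIndex, Batch, action, inbatch_parallel

    BATCH_SIZE = 64   # illustrative value

    class MyBatch(Batch):
        def _init_fn(self):
            # one parallel call per item id in this batch (assumed init contract)
            return self.indices

        def _post_fn(self, results, *args, **kwargs):
            # reassemble per-item results; this toy version keeps the batch as is
            return self

        @action
        @inbatch_parallel(init='_init_fn', post='_post_fn', target='threads')
        def some_action(self, item):
            # process just one item from the batch
            return item

    # index -> dataset -> pipeline, as described in the sections above
    my_dataset = Dataset(DatasetIndex(np.arange(1000)), batch_class=MyBatch)

    my_pipeline = (my_dataset.pipeline()
                     .some_action()
                     .run(BATCH_SIZE, shuffle=True, n_epochs=1))

A real workflow would replace `some_action` with loading and processing actions like those in the pipeline example above.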