Index¶
Why is the index needed?¶
A dataset may be so large that it does not fit into memory, so you cannot process it all at once and have to work with it in parts. To address those parts, each data item in the Dataset should have a unique id. The id does not have to be meaningful (like a card number or a transaction id); sometimes it may be just a hash or an ordinal number. However, each index item should address exactly one data item (which in turn can have a complex structure, like a list, an array, a dataframe, or even a graph).
The values of the ids in the index matter only in two situations:
in the load action-method, when a batch gets data from some external source, like batch_items = external_source[batch.indices]. The external source should contain those indices, otherwise load will fail. Similarly, when data is loaded from files, indices usually point to those files and their full paths (see FilesIndex below).
in item selection, batch[some_item_id], where the index should contain the id you're referring to.
Everywhere else the particular id value is largely irrelevant, as all operations use an item's position in the index, not its id.
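The difference between an id and a position can be sketched in plain Python, without the library. The names external_source, indices and load_items below are hypothetical and serve only as an illustration; they are not part of the library API.

```python
# Hypothetical external source keyed by item id (not a library object).
external_source = {"c-17": [1.0, 2.0], "c-42": [3.0], "c-99": [4.0, 5.0, 6.0]}

# The index is just an ordered sequence of unique ids.
indices = ["c-42", "c-17", "c-99"]


def load_items(source, ids):
    """Fetch items by id; fail if the source lacks any requested id."""
    missing = [i for i in ids if i not in source]
    if missing:
        raise KeyError(f"ids not found in the source: {missing}")
    return [source[i] for i in ids]


# Loading depends on the id values: every id must exist in the source.
batch_indices = indices[:2]
batch_items = load_items(external_source, batch_indices)

# Later processing only needs positions within the batch.
lengths = [len(item) for item in batch_items]

# Selecting by id means translating the id into a position first.
position = batch_indices.index("c-17")
selected = batch_items[position]
```

In this sketch the id values matter only inside load_items and in the final lookup; the list comprehension in the middle never looks at them.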
DatasetIndex¶
DatasetIndex is a base index class which stores a sequence of unique ids for your data items. In the simplest case it might be just an ordered sequence of numbers (0, 1, 2, 3,…, e.g. numpy.arange(len(dataset))).
dataset_index = DatasetIndex(np.arange(my_array.shape[0]))
In other cases it can be a list of domain-specific identifiers (e.g. client ids, product codes, serial numbers, timestamps, etc.).
dataset_index = DatasetIndex(dataframe['client_id'])
You will rarely need to work with an index directly, but if you want to do something specific you may use its public API.
FilesIndex¶
When data comes from a file system, it might be convenient to use FilesIndex.
files_index = FilesIndex(path="/path/to/some/files/*.csv")
Thus files_index will contain the list of filenames that match a given mask.
The details of the mask specification can be found in the glob() documentation.
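As a quick reminder of how such masks behave, here is a small sketch that uses only the standard glob module on a temporary directory. The file names are made up for the example, and the sketch does not involve FilesIndex itself.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create a few empty files to match against (hypothetical names).
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(root, name), "w").close()

    # Escape the directory part so only the last component acts as a mask.
    base = glob.escape(root)

    # '*' matches any run of characters within one path component.
    csv_paths = glob.glob(os.path.join(base, "*.csv"))
    # '?' matches exactly one character.
    one_char_paths = glob.glob(os.path.join(base, "?.csv"))
    # '[...]' matches one character from the given set.
    only_a_paths = glob.glob(os.path.join(base, "[a].csv"))

    csv_names = sorted(os.path.basename(p) for p in csv_paths)
    one_char_names = sorted(os.path.basename(p) for p in one_char_paths)
    only_a_names = sorted(os.path.basename(p) for p in only_a_paths)
```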
No file extensions¶
When filenames contain extensions which are not part of the id, they can be stripped with the no_ext option:
dataset_index = FilesIndex(path="/path/to/some/files/*.csv", no_ext=True)
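To illustrate the effect, the sketch below derives ids from paths with the standard os.path functions. It assumes that the id is the base name of the file and that only the last extension is dropped; file_id is a hypothetical helper, not the library's implementation, and the library's exact rule may differ.

```python
import os


def file_id(path, no_ext=False):
    """Hypothetical helper: derive an id from a file path."""
    name = os.path.basename(path)
    if no_ext:
        # splitext removes only the last suffix, e.g. '.gz' from '.tar.gz'
        name, _ = os.path.splitext(name)
    return name


paths = ["/path/to/some/files/2016-01.csv", "/path/to/some/files/archive.tar.gz"]

ids_with_ext = [file_id(p) for p in paths]
ids_without_ext = [file_id(p, no_ext=True) for p in paths]
```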
Sorting¶
Since the order in which files are found may be arbitrary, you may want to sort your index items:
dataset_index = FilesIndex(path="/path/to/some/files/*.csv", sort=True)
However, this rarely makes any sense.
Directories¶
Sometimes you need directories, not files. For instance, a CT images dataset includes one subdirectory per patient, which is named with the patient id and contains many images of that patient. So the index should be built from these subdirectories, not from separate images.
dirs_index = FilesIndex(path="/path/to/archive/2016-*/scans/*", dirs=True)
Here dirs_index will contain a list of all subdirectory names.
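The idea of keeping only directories among the matches can be sketched with the standard library alone. The directory layout and names below are hypothetical, and the code is an illustration of the concept rather than what FilesIndex does internally.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one subdirectory per patient plus a stray file.
    for patient in ("patient_001", "patient_002"):
        os.makedirs(os.path.join(root, "scans", patient))
        open(os.path.join(root, "scans", patient, "slice_0.dcm"), "w").close()
    open(os.path.join(root, "scans", "README.txt"), "w").close()

    mask = os.path.join(glob.escape(root), "scans", "*")
    matches = glob.glob(mask)

    # Keep directories only, so each patient becomes one index item.
    dir_ids = sorted(os.path.basename(p) for p in matches if os.path.isdir(p))
    # For comparison: what a files-only selection would contain.
    file_ids = sorted(os.path.basename(p) for p in matches if os.path.isfile(p))
```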
Numerous sources¶
If the files you are interested in are located in different places, you can still build a single combined index:
dataset_index = FilesIndex(["/current/year/data/*", "/path/to/archive/2016/*", "/previous/years/*"])
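Conceptually, this amounts to collecting the matches of every mask into one sequence. The following sketch shows that with the standard library only; the folder and file names are made up, and the uniqueness check reflects the requirement that each id addresses exactly one item. How the library itself treats duplicate names across sources is not covered here.

```python
import glob
import os
import tempfile
from itertools import chain

with tempfile.TemporaryDirectory() as root:
    # Hypothetical files spread over two locations.
    for folder, name in (("current", "a.csv"), ("archive", "b.csv"), ("archive", "c.csv")):
        os.makedirs(os.path.join(root, folder), exist_ok=True)
        open(os.path.join(root, folder, name), "w").close()

    base = glob.escape(root)
    masks = [os.path.join(base, "current", "*"), os.path.join(base, "archive", "*")]

    # Concatenate the matches of all masks, keeping the order of the masks.
    all_paths = list(chain.from_iterable(sorted(glob.glob(m)) for m in masks))
    ids = [os.path.basename(p) for p in all_paths]

    # A combined index is only usable if ids stay unique across sources.
    ids_are_unique = len(ids) == len(set(ids))
```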
Creating your own index class¶
Constructor¶
We highly recommend using the following pattern:
class MyIndex(DatasetIndex):
    def __init__(self, index, my_arg, *args, **kwargs):
        # initialize new properties
        super().__init__(index, my_arg, *args, **kwargs)
        # do whatever you need
So to summarize:
the parent class should be DatasetIndex or its child
include *args and **kwargs in the constructor definition
pass all the arguments to the parent constructor
build_index¶
You might want to redefine the build_index method, which actually creates the index.
It takes all the arguments from the constructor and returns a numpy array with index items.
This method is called automatically from the DatasetIndex constructor.
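The interplay between the constructor and build_index can be sketched as follows. BaseIndex is a simplified, hypothetical stand-in for DatasetIndex, written only so that the example runs without the library; it returns a plain list instead of a numpy array. PrefixedIndex and its prefix argument are likewise made up for illustration.

```python
class BaseIndex:
    """Simplified stand-in for DatasetIndex (hypothetical, not the library class)."""

    def __init__(self, *args, **kwargs):
        # The constructor delegates index creation to build_index.
        self.index = self.build_index(*args, **kwargs)

    def build_index(self, index, *args, **kwargs):
        return list(index)

    def __len__(self):
        return len(self.index)


class PrefixedIndex(BaseIndex):
    def __init__(self, index, prefix, *args, **kwargs):
        # Initialize new properties first: build_index runs inside
        # the parent constructor and may need them.
        self.prefix = prefix
        super().__init__(index, prefix, *args, **kwargs)

    def build_index(self, index, prefix, *args, **kwargs):
        # Receives the same arguments that were passed to the constructor.
        return [f"{prefix}{item}" for item in index]


my_index = PrefixedIndex(range(3), "item_")
```

Because build_index is invoked from the parent constructor, passing all arguments up through super().__init__ is what makes them available to the redefined method.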