Index¶
DatasetIndex¶
- class DatasetIndex(*args, **kwargs)[source]¶
Stores an index for a dataset. The index should be 1-d array-like, e.g. numpy array, pandas Series, etc.
- Parameters
index (int, 1-d array-like or callable) – defines structure of DatasetIndex
Examples
>>> index = DatasetIndex(all_item_ids)
>>> index.split([0.8, 0.2])
>>> item_pos = index.get_pos(item_id)
- static build_index(index)[source]¶
Check index type and structure.
- Parameters
index (int, 1-d array-like or callable) –
Defines content of DatasetIndex
- 1-d array-like
Content is numpy.array
- int
Content is numpy.arange() of given length.
- callable
Content is return of given function (should be 1-d array-like).
- Raises
TypeError – If ‘index’ is not 1-dimensional.
ValueError – If ‘index’ is empty.
- Returns
Index to be stored in class instance.
- Return type
numpy.array
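The normalization performed by build_index can be sketched as follows. This is an illustrative sketch of the documented behaviour, not the library's actual implementation:

```python
import numpy as np

def build_index_sketch(index):
    """Rough sketch of the normalization that build_index describes."""
    if callable(index):
        index = index()           # callable: use its return value
    if isinstance(index, int):
        index = np.arange(index)  # int: arange of the given length
    index = np.asarray(index)     # 1-d array-like: kept as a numpy array
    if index.ndim != 1:
        raise TypeError("'index' must be 1-dimensional")
    if len(index) == 0:
        raise ValueError("'index' must not be empty")
    return index

build_index_sketch(5)                   # array([0, 1, 2, 3, 4])
build_index_sketch(lambda: ['a', 'b'])  # array(['a', 'b'], dtype='<U1')
```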
- calc_split(shares=0.8)¶
Calculate the split into train, test and validation subsets.
- Parameters
shares (float or a sequence of floats) – A share of train, test and validation subset respectively.
- Returns
Numbers of items in the train, test and validation subsets respectively.
- Return type
tuple
- Raises
ValueError –
- if shares has more than 3 items
- if the sum of shares is greater than 1
- if this set does not have enough items to split
Examples
Split into train / test in 80/20 ratio
>>> some_set.calc_split()
Split into train / test / validation in 60/30/10 ratio
>>> some_set.calc_split([0.6, 0.3])
Split into train / test / validation in 50/30/20 ratio
>>> some_set.calc_split([0.5, 0.3, 0.2])
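The share arithmetic behind these examples can be sketched roughly as below. This is an illustration of the documented ratios only; the library's actual rounding and its guaranteed-validation rule may differ:

```python
import numpy as np

def calc_split_sketch(n_items, shares=0.8):
    """Rough sketch of the train/test/validation share arithmetic."""
    shares = np.atleast_1d(shares).astype(float)
    if len(shares) > 3:
        raise ValueError('shares must not have more than 3 items')
    if shares.sum() > 1.0 + 1e-9:
        raise ValueError('the sum of shares must not be greater than 1')
    n_train = round(n_items * shares[0])
    if len(shares) == 1:
        return n_train, n_items - n_train, 0        # train / test only
    n_test = round(n_items * shares[1])
    return n_train, n_test, n_items - n_train - n_test  # rest goes to validation

calc_split_sketch(100)                   # (80, 20, 0)
calc_split_sketch(100, [0.6, 0.3])       # (60, 30, 10)
calc_split_sketch(100, [0.5, 0.3, 0.2])  # (50, 30, 20)
```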
- classmethod concat(*index_list)[source]¶
Create index by concatenating other indices.
- Parameters
index_list (list) – Indices to be concatenated. Each item is expected to contain index property with 1-d sequence of indices.
- Returns
Contains one common index.
- Return type
DatasetIndex
- create_batch(index, pos=True, as_array=False, *args, **kwargs)[source]¶
Create a batch from given indices.
- Parameters
index (int, slice, list, numpy.array or DatasetIndex) –
If ‘pos’ is True, then ‘index’ should contain positions of items in the current index to be returned as a separate batch.
If ‘pos’ is False, then ‘index’ should contain indices to be returned as a separate batch (so the expected batch is just the very same index).
pos (bool) – Whether to return indices or positions
as_array (bool) – Whether to return array or an instance of DatasetIndex
- Returns
Part of initial DatasetIndex, specified by ‘index’.
- Return type
DatasetIndex or numpy.array
Examples
Create a DatasetIndex with the first 100 natural numbers, then get a batch with every second item
>>> DatasetIndex(100).create_batch(index=2*numpy.arange(50))
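The pos flag roughly corresponds to the difference between positional (fancy) indexing and passing item indices through unchanged. A minimal numpy illustration of that distinction (not the library code itself):

```python
import numpy as np

items = np.array(['a', 'b', 'c', 'd', 'e'])  # stand-in for an index's contents

# pos=True: the argument holds *positions* into the current index
positions = np.array([0, 2, 4])
batch_by_pos = items[positions]              # array(['a', 'c', 'e'])

# pos=False: the argument already holds the item indices themselves,
# so the resulting batch is just that very same index
batch_by_id = np.array(['a', 'c', 'e'])
```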
- gen_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, notifier=False, iter_params=None)[source]¶
Generate batches
- Parameters
batch_size (int) – Desired number of items in the batch (the actual batch could contain fewer items).
shuffle – Specifies the order of items (see shuffle()).
n_iters (int) – Number of iterations to make (only one of n_iters and n_epochs should be specified).
n_epochs (int) – Number of epochs required (only one of n_iters and n_epochs should be specified).
drop_last (bool) –
If True, drops the last batch (in each epoch) if it contains fewer than batch_size items. If False, then the last batch in each epoch could contain repeating indices (which might be a problem) and the very last batch could contain fewer than batch_size items.
For instance, gen_batch(3, shuffle=False, n_epochs=2, drop_last=False) for a dataset with 4 items returns indices [0,1,2], [3,0,1], [2,3]. While gen_batch(3, shuffle=False, n_epochs=2, drop_last=True) returns indices [0,1,2], [0,1,2].
Take into account that gen_batch(3, shuffle=True, n_epochs=2, drop_last=False) could return batches [3,0,1], [2,0,2], [1,3]. Here the second batch contains two items with the same index “2”. This might become a problem if some action uses batch.get_pos() or batch.index.get_pos() methods so that one of the identical items will be missed. However, there is nothing to worry about if you don’t iterate over batch items explicitly (i.e. for item in batch) or implicitly (through batch[ix]).
notifier (str, dict, or instance of .Notifier) – Configuration of displayed progress bar, if any. If str or dict, then parameters of .Notifier initialization. For more details about notifying capabilities, refer to .Notifier documentation.
- Yields
An instance of the same class with a subset of indices
- Raises
ValueError – When n_epochs and n_iters have been passed at the same time.
Examples
>>> for index_batch in index.gen_batch(BATCH_SIZE, shuffle=True, n_epochs=2, drop_last=True):
...     pass  # do whatever you want
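The drop_last behaviour described above can be reproduced with a small sketch of sequential (shuffle=False) batch generation. This is illustrative only, not the library implementation:

```python
def gen_batches_sketch(n_items, batch_size, n_epochs, drop_last):
    """Rough sketch of sequential (shuffle=False) batch generation."""
    if drop_last:
        # batches never span epochs; a short tail batch is dropped
        for _ in range(n_epochs):
            for start in range(0, n_items - batch_size + 1, batch_size):
                yield list(range(start, start + batch_size))
    else:
        # batches may span epoch boundaries; only the very last one can be short
        order = list(range(n_items)) * n_epochs
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]

list(gen_batches_sketch(4, 3, 2, drop_last=False))  # [[0, 1, 2], [3, 0, 1], [2, 3]]
list(gen_batches_sketch(4, 3, 2, drop_last=True))   # [[0, 1, 2], [0, 1, 2]]
```

These are exactly the batches quoted in the drop_last description for a dataset with 4 items.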
- classmethod get_default_iter_params()¶
Return iteration params with default values to start iteration from scratch
- get_pos(index)[source]¶
Return position of an item in the index.
- Parameters
index (int, str, slice or Iterable) –
Items to return positions of.
- int, str
Return position of that item in the DatasetIndex.
- slice, Iterable
Return positions of multiple items, specified by argument.
- Returns
Positions of specified items in DatasetIndex.
- Return type
numpy.array
Examples
Create a DatasetIndex that holds an index of images and get the position of one of them
>>> DatasetIndex(['image_0', 'image_1']).get_pos('image_1')
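Conceptually, get_pos is a reverse lookup from item id to position in the index, similar to this numpy sketch (not the library's implementation, which may cache a lookup table):

```python
import numpy as np

index = np.array(['image_0', 'image_1', 'image_2'])

# position of a single item
pos = int(np.where(index == 'image_1')[0][0])          # 1

# positions of several items at once (works here because the index is sorted)
many = np.searchsorted(index, ['image_0', 'image_2'])  # array([0, 2])
```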
- property index¶
the dataset’s index
- Type
dataset.DatasetIndex
- property indices¶
an array with the indices
- Type
numpy.array
- next_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, iter_params=None)[source]¶
Return the next batch
- Parameters
batch_size (int) – Desired number of items in the batch (the actual batch could contain fewer items)
shuffle – Specifies the order of items (see shuffle()).
n_iters (int) – Number of iterations to make (only one of n_iters and n_epochs should be specified).
n_epochs (int) – Number of epochs required (only one of n_iters and n_epochs should be specified).
drop_last (bool) –
If True, drops the last batch (in each epoch) if it contains fewer than batch_size items. If False, then the last batch in each epoch could contain repeating indices (which might be a problem) and the very last batch could contain fewer than batch_size items.
For instance, next_batch(3, shuffle=False, n_epochs=2, drop_last=False) for a dataset with 4 items returns indices [0,1,2], [3,0,1], [2,3]. While next_batch(3, shuffle=False, n_epochs=2, drop_last=True) returns indices [0,1,2], [0,1,2].
Take into account that next_batch(3, shuffle=True, n_epochs=2, drop_last=False) could return batches [3,0,1], [2,0,2], [1,3]. Here the second batch contains two items with the same index “2”. This might become a problem if some action uses batch.get_pos() or batch.index.get_pos() methods so that one of the identical items will be missed. However, there is nothing to worry about if you don’t iterate over batch items explicitly (i.e. for item in batch) or implicitly (through batch[ix]).
- Raises
StopIteration – When n_epochs has been reached and there are no batches left in the dataset.
ValueError – When n_epochs and n_iters have been passed at the same time. When batch size exceeds the dataset size.
Examples
>>> for i in range(MAX_ITER):
...     index_batch = index.next_batch(BATCH_SIZE, shuffle=True, n_epochs=2, drop_last=True)
...     # do whatever you want
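Since next_batch raises StopIteration once n_epochs is exhausted, a common pattern is to catch it explicitly. Sketched here with a plain iterator standing in for the repeated next_batch calls:

```python
# a plain iterator stands in for repeated index.next_batch(...) calls
batches = iter([[0, 1, 2], [3, 0, 1], [2, 3]])

seen = []
while True:
    try:
        batch = next(batches)  # e.g. index.next_batch(BATCH_SIZE, n_epochs=2)
    except StopIteration:
        break                  # n_epochs has been reached
    seen.append(batch)         # do whatever you want with the batch
```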
- reset(what='iter')¶
Clear all iteration metadata in order to start iterating from scratch
- reset_iter()¶
- shuffle(shuffle, iter_params=None)[source]¶
Permute indices
- Parameters
shuffle (bool or seed) –
Specifies the order of items:
- if False, items go sequentially, one after another, as they appear in the index;
- if True, items are shuffled randomly before each epoch.
See make_rng() for seed specifications.
- Returns
a permuted order for indices
- Return type
ndarray
- property size¶
int - number of items in the set
- split(shares=0.8, shuffle=False)[source]¶
Split index into train, test and validation subsets.
Shuffles index if necessary.
Subsets are available as .train, .test and .validation respectively.
- Parameters
shares (float or a sequence of floats) – A share of train, test and validation subset respectively.
shuffle (bool or seed) – Specifies the order of items (see shuffle()).
Notes
If a tuple of 3 floats is passed, then the validation subset is always present.
Examples
Split into train / test in 80/20 ratio
>>> index.split()
Split into train / test / validation in 60/30/10 ratio
>>> index.split([0.6, 0.3])
Split into train / test / validation in 50/30/20 ratio
>>> index.split([0.5, 0.3, 0.2])
Use 1 sample as validation and split the rest evenly into train / test
>>> index.split([0.5, 0.5, 0])
FilesIndex¶
- class FilesIndex(*args, **kwargs)[source]¶
Bases:
batchflow.dsindex.DatasetIndex
An index of files or directories that match the given path pattern.
Examples
Create a sorted index of files in a directory:
>>> fi = FilesIndex(path='/path/to/data/files/*', sort=True)
Create an index of directories through all subdirectories:
>>> fi = FilesIndex(path='/path/to/data/archive*/patient*', dirs=True)
Create an index of files in several directories, ignoring file extensions:
>>> fi = FilesIndex(path=['/path/to/archive/2016/*','/path/to/current/file/*'], no_ext=True)
To get the path to a file, call get_fullpath(index_id):
>>> path = fi.get_fullpath(some_id)
Split into train / test / validation in 80/15/5 ratio
>>> fi.split([0.8, 0.15])
Get a position of a customer in the index
>>> item_pos = fi.get_pos(customer_id)
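Building a FilesIndex from a glob pattern roughly amounts to expanding the pattern, then optionally sorting and stripping extensions. A self-contained sketch of that idea using a temporary directory (illustrative only, not the library code):

```python
import glob
import os
import tempfile

# create a throwaway directory with a few files to index
tmpdir = tempfile.mkdtemp()
for name in ('b.csv', 'a.csv', 'c.csv'):
    open(os.path.join(tmpdir, name), 'w').close()

# FilesIndex(path=..., sort=True, no_ext=True) roughly amounts to:
paths = sorted(glob.glob(os.path.join(tmpdir, '*')))                  # sort=True
ids = [os.path.splitext(os.path.basename(p))[0] for p in paths]       # no_ext=True
print(ids)   # ['a', 'b', 'c']
```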
- build_from_index(index, paths, dirs=None)[source]¶
Build an index from another index, for the given subset of indices.
- build_from_path(path, dirs=False, no_ext=False, sort=False)[source]¶
Build index from a path/glob or a sequence of paths/globs.
- build_index(index=None, path=None, *args, **kwargs)[source]¶
Build an index from a given path string or index.
- classmethod concat(*index_list)[source]¶
Create index by concatenating other indices.
- Parameters
index_list (list) – Indices to be concatenated. Each item is expected to contain index property with 1-d sequence of indices.
- Returns
Contains one common index.
- Return type
FilesIndex
- property paths¶