Index¶
DatasetIndex¶
- class DatasetIndex(*args, **kwargs)[source]¶
Stores an index for a dataset. The index should be 1-d array-like, e.g. numpy array, pandas Series, etc.
- Parameters
index (int, 1-d array-like or callable) – defines structure of DatasetIndex
Examples
>>> index = DatasetIndex(all_item_ids)
>>> index.split([0.8, 0.2])
>>> item_pos = index.get_pos(item_id)
- static build_index(index)[source]¶
Check index type and structure.
- Parameters
index (int, 1-d array-like or callable) –
Defines content of DatasetIndex
- 1-d array-like
Content is numpy.array
- int
Content is numpy.arange() of given length.
- callable
Content is return of given function (should be 1-d array-like).
- Raises
TypeError – If ‘index’ is not 1-dimensional.
ValueError – If ‘index’ is empty.
- Returns
Index to be stored in class instance.
- Return type
numpy.array
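The normalization performed by build_index can be sketched as follows. This is an illustrative sketch of the documented behaviour, not the library's actual implementation:

```python
import numpy as np

def build_index_sketch(index):
    """Rough sketch of the normalization that build_index describes."""
    if callable(index):
        index = index()           # callable: use its return value
    if isinstance(index, int):
        index = np.arange(index)  # int: arange of the given length
    index = np.asarray(index)     # 1-d array-like: kept as a numpy array
    if index.ndim != 1:
        raise TypeError("'index' must be 1-dimensional")
    if len(index) == 0:
        raise ValueError("'index' must not be empty")
    return index

build_index_sketch(5)                   # array([0, 1, 2, 3, 4])
build_index_sketch(lambda: ['a', 'b'])  # array(['a', 'b'], dtype='<U1')
```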
- calc_split(shares=0.8)¶
Calculate the split into train, test and validation subsets.
- Parameters
shares (float or a sequence of floats) – A share of train, test and validation subset respectively.
- Returns
Numbers of items in the train, test and validation subsets respectively.
- Return type
tuple
- Raises
ValueError –
- if shares has more than 3 items
- if the sum of shares is greater than 1
- if this set does not have enough items to split
Examples
Split into train / test in 80/20 ratio
>>> some_set.calc_split()
Split into train / test / validation in 60/30/10 ratio
>>> some_set.calc_split([0.6, 0.3])
Split into train / test / validation in 50/30/20 ratio
>>> some_set.calc_split([0.5, 0.3, 0.2])
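The share arithmetic behind these examples can be sketched roughly as below. This is an illustration of the documented ratios only; the library's actual rounding and its guaranteed-validation rule may differ:

```python
import numpy as np

def calc_split_sketch(n_items, shares=0.8):
    """Rough sketch of the train/test/validation share arithmetic."""
    shares = np.atleast_1d(shares).astype(float)
    if len(shares) > 3:
        raise ValueError('shares must not have more than 3 items')
    if shares.sum() > 1.0 + 1e-9:
        raise ValueError('the sum of shares must not be greater than 1')
    n_train = round(n_items * shares[0])
    if len(shares) == 1:
        return n_train, n_items - n_train, 0        # train / test only
    n_test = round(n_items * shares[1])
    return n_train, n_test, n_items - n_train - n_test  # rest goes to validation

calc_split_sketch(100)                   # (80, 20, 0)
calc_split_sketch(100, [0.6, 0.3])       # (60, 30, 10)
calc_split_sketch(100, [0.5, 0.3, 0.2])  # (50, 30, 20)
```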
- classmethod concat(*index_list)[source]¶
Create index by concatenating other indices.
- Parameters
index_list (list) – Indices to be concatenated. Each item is expected to contain index property with 1-d sequence of indices.
- Returns
Contains one common index.
- Return type
DatasetIndex
- create_batch(index, pos=True, as_array=False, *args, **kwargs)[source]¶
Create a batch from given indices.
- Parameters
index (int, slice, list, numpy.array or DatasetIndex) –
If ‘pos’ is True, then ‘index’ should contain positions of items in the current index to be returned as a separate batch.
If ‘pos’ is False, then ‘index’ should contain indices to be returned as a separate batch (so the expected batch is just the very same index).
pos (bool) – Whether to return indices or positions
as_array (bool) – Whether to return array or an instance of DatasetIndex
- Returns
Part of initial DatasetIndex, specified by ‘index’.
- Return type
DatasetIndex or numpy.array
Examples
Create a DatasetIndex with the first 100 natural numbers, then get a batch with every second item
>>> DatasetIndex(100).create_batch(index=2*numpy.arange(50))
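The pos flag roughly corresponds to the difference between positional (fancy) indexing and passing item indices through unchanged. A minimal numpy illustration of that distinction (not the library code itself):

```python
import numpy as np

items = np.array(['a', 'b', 'c', 'd', 'e'])  # stand-in for an index's contents

# pos=True: the argument holds *positions* into the current index
positions = np.array([0, 2, 4])
batch_by_pos = items[positions]              # array(['a', 'c', 'e'])

# pos=False: the argument already holds the item indices themselves,
# so the resulting batch is just that very same index
batch_by_id = np.array(['a', 'c', 'e'])
```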
- gen_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, notifier=False, iter_params=None)[source]¶
Generate batches
- Parameters
batch_size (int) – Desired number of items in the batch (the actual batch could contain fewer items).
shuffle – Specifies the order of items (see shuffle()).
n_iters (int) – Number of iterations to make (only one of n_iters and n_epochs should be specified).
n_epochs (int) – Number of epochs required (only one of n_iters and n_epochs should be specified).
drop_last (bool) –
If True, drops the last batch (in each epoch) if it contains fewer than batch_size items. If False, then the last batch in each epoch could contain repeating indices (which might be a problem) and the very last batch could contain fewer than batch_size items.
For instance, gen_batch(3, shuffle=False, n_epochs=2, drop_last=False) for a dataset with 4 items returns indices [0,1,2], [3,0,1], [2,3]. While gen_batch(3, shuffle=False, n_epochs=2, drop_last=True) returns indices [0,1,2], [0,1,2].
Take into account that gen_batch(3, shuffle=True, n_epochs=2, drop_last=False) could return batches [3,0,1], [2,0,2], [1,3]. Here the second batch contains two items with the same index “2”. This might become a problem if some action uses batch.get_pos() or batch.index.get_pos() methods so that one of the identical items will be missed. However, there is nothing to worry about if you don’t iterate over batch items explicitly (i.e. for item in batch) or implicitly (through batch[ix]).
notifier (str, dict, or instance of .Notifier) – Configuration of displayed progress bar, if any. If str or dict, then parameters of .Notifier initialization. For more details about notifying capabilities, refer to .Notifier documentation.
- Yields
An instance of the same class with a subset of indices
- Raises
ValueError – When n_epochs and n_iters have been passed at the same time.
Examples
>>> for index_batch in index.gen_batch(BATCH_SIZE, shuffle=True, n_epochs=2, drop_last=True):
...     pass  # do whatever you want
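The drop_last behaviour described above can be reproduced with a small sketch of sequential (shuffle=False) batch generation. This is illustrative only, not the library implementation:

```python
def gen_batches_sketch(n_items, batch_size, n_epochs, drop_last):
    """Rough sketch of sequential (shuffle=False) batch generation."""
    if drop_last:
        # batches never span epochs; a short tail batch is dropped
        for _ in range(n_epochs):
            for start in range(0, n_items - batch_size + 1, batch_size):
                yield list(range(start, start + batch_size))
    else:
        # batches may span epoch boundaries; only the very last one can be short
        order = list(range(n_items)) * n_epochs
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]

list(gen_batches_sketch(4, 3, 2, drop_last=False))  # [[0, 1, 2], [3, 0, 1], [2, 3]]
list(gen_batches_sketch(4, 3, 2, drop_last=True))   # [[0, 1, 2], [0, 1, 2]]
```

These are exactly the batches quoted in the drop_last description for a dataset with 4 items.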
- classmethod get_default_iter_params()¶
Return iteration params with default values to start iteration from scratch
- get_pos(index)[source]¶
Return position of an item in the index.
- Parameters
index (int, str, slice or Iterable) –
Items to return positions of.
- int, str
Return position of that item in the DatasetIndex.
- slice, Iterable
Return positions of multiple items, specified by argument.
- Returns
Positions of specified items in DatasetIndex.
- Return type
numpy.array
Examples
Create a DatasetIndex that holds an index of images and get the position of one of them
>>> DatasetIndex(['image_0', 'image_1']).get_pos('image_1')
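Conceptually, get_pos is a reverse lookup from item id to position in the index, similar to this numpy sketch (not the library's implementation, which may cache a lookup table):

```python
import numpy as np

index = np.array(['image_0', 'image_1', 'image_2'])

# position of a single item
pos = int(np.where(index == 'image_1')[0][0])          # 1

# positions of several items at once (works here because the index is sorted)
many = np.searchsorted(index, ['image_0', 'image_2'])  # array([0, 2])
```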
- property index¶
the dataset’s index
- Type
dataset.DatasetIndex
- property indices¶
an array with the indices
- Type
numpy.array
- next_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, iter_params=None)[source]¶
Return the next batch
- Parameters
batch_size (int) – Desired number of items in the batch (the actual batch could contain fewer items)
shuffle – Specifies the order of items (see shuffle()).
n_iters (int) – Number of iterations to make (only one of n_iters and n_epochs should be specified).
n_epochs (int) – Number of epochs required (only one of n_iters and n_epochs should be specified).
drop_last (bool) –
If True, drops the last batch (in each epoch) if it contains fewer than batch_size items. If False, then the last batch in each epoch could contain repeating indices (which might be a problem) and the very last batch could contain fewer than batch_size items.
For instance, next_batch(3, shuffle=False, n_epochs=2, drop_last=False) for a dataset with 4 items returns indices [0,1,2], [3,0,1], [2,3]. While next_batch(3, shuffle=False, n_epochs=2, drop_last=True) returns indices [0,1,2], [0,1,2].
Take into account that next_batch(3, shuffle=True, n_epochs=2, drop_last=False) could return batches [3,0,1], [2,0,2], [1,3]. Here the second batch contains two items with the same index “2”. This might become a problem if some action uses batch.get_pos() or batch.index.get_pos() methods so that one of the identical items will be missed. However, there is nothing to worry about if you don’t iterate over batch items explicitly (i.e. for item in batch) or implicitly (through batch[ix]).
- Raises
StopIteration – When n_epochs has been reached and there are no batches left in the dataset.
ValueError – When n_epochs and n_iters have been passed at the same time. When batch size exceeds the dataset size.
Examples
>>> for i in range(MAX_ITER):
...     index_batch = index.next_batch(BATCH_SIZE, shuffle=True, n_epochs=2, drop_last=True)
...     # do whatever you want
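Since next_batch raises StopIteration once n_epochs is exhausted, a common pattern is to catch it explicitly. Sketched here with a plain iterator standing in for the repeated next_batch calls:

```python
# a plain iterator stands in for repeated index.next_batch(...) calls
batches = iter([[0, 1, 2], [3, 0, 1], [2, 3]])

seen = []
while True:
    try:
        batch = next(batches)  # e.g. index.next_batch(BATCH_SIZE, n_epochs=2)
    except StopIteration:
        break                  # n_epochs has been reached
    seen.append(batch)         # do whatever you want with the batch
```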
- reset(what='iter')¶
Clear all iteration metadata in order to start iterating from scratch
- reset_iter()¶
- shuffle(shuffle, iter_params=None)[source]¶
Permute indices
- Parameters
shuffle (bool or seed) –
Specifies the order of items:
- if False, items go sequentially, one after another, as they appear in the index;
- if True, items are shuffled randomly before each epoch.
See make_rng() for seed specifications.
- Returns
a permuted order for indices
- Return type
ndarray
- property size¶
int - number of items in the set
- split(shares=0.8, shuffle=False)[source]¶
Split index into train, test and validation subsets.
Shuffles index if necessary.
Subsets are available as .train, .test and .validation respectively.
- Parameters
shares (float or a sequence of floats) – A share of train, test and validation subset respectively.
shuffle (bool or seed) – Specifies the order of items (see shuffle()).
Notes
If a tuple of 3 floats is passed, then the validation subset is always present.
Examples
Split into train / test in 80/20 ratio
>>> index.split()
Split into train / test / validation in 60/30/10 ratio
>>> index.split([0.6, 0.3])
Split into train / test / validation in 50/30/20 ratio
>>> index.split([0.5, 0.3, 0.2])
Use 1 sample as validation and split the rest evenly into train / test
>>> index.split([0.5, 0.5, 0])
FilesIndex¶
- class FilesIndex(*args, **kwargs)[source]¶
Bases:
batchflow.dsindex.DatasetIndex
An index of files or directories that match the given path pattern.
Examples
Create a sorted index of files in a directory:
>>> fi = FilesIndex(path='/path/to/data/files/*', sort=True)
Create an index of directories through all subdirectories:
>>> fi = FilesIndex(path='/path/to/data/archive*/patient*', dirs=True)
Create an index of files in several directories, ignoring file extensions:
>>> fi = FilesIndex(path=['/path/to/archive/2016/*','/path/to/current/file/*'], no_ext=True)
To get the path to a file, call get_fullpath(index_id):
>>> path = fi.get_fullpath(some_id)
Split into train / test / validation in 80/15/5 ratio
>>> fi.split([0.8, 0.15])
Get a position of a customer in the index
>>> item_pos = fi.get_pos(customer_id)
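Building a FilesIndex from a glob pattern roughly amounts to expanding the pattern, then optionally sorting and stripping extensions. A self-contained sketch of that idea using a temporary directory (illustrative only, not the library code):

```python
import glob
import os
import tempfile

# create a throwaway directory with a few files to index
tmpdir = tempfile.mkdtemp()
for name in ('b.csv', 'a.csv', 'c.csv'):
    open(os.path.join(tmpdir, name), 'w').close()

# FilesIndex(path=..., sort=True, no_ext=True) roughly amounts to:
paths = sorted(glob.glob(os.path.join(tmpdir, '*')))                  # sort=True
ids = [os.path.splitext(os.path.basename(p))[0] for p in paths]       # no_ext=True
print(ids)   # ['a', 'b', 'c']
```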
- build_from_index(index, paths, dirs=None)[source]¶
Build an index from another index, for the given subset of indices.
- build_from_path(path, dirs=False, no_ext=False, sort=False)[source]¶
Build index from a path/glob or a sequence of paths/globs.
- build_index(index=None, path=None, *args, **kwargs)[source]¶
Build an index from a given path string or index.
- classmethod concat(*index_list)[source]¶
Create index by concatenating other indices.
- Parameters
index_list (list) – Indices to be concatenated. Each item is expected to contain index property with 1-d sequence of indices.
- Returns
Contains one common index.
- Return type
FilesIndex
- property paths¶