Index

DatasetIndex

class DatasetIndex(*args, **kwargs)[source]

Stores an index for a dataset. The index should be a 1-d array-like, e.g. a numpy array or a pandas Series.

Parameters

index (int, 1-d array-like or callable) – defines structure of DatasetIndex

Examples

>>> index = DatasetIndex(all_item_ids)
>>> index.split([0.8, 0.2])
>>> item_pos = index.get_pos(item_id)
static build_index(index)[source]

Check index type and structure.

Parameters

index (int, 1-d array-like or callable) –

Defines content of DatasetIndex

  • 1-d array-like

    Content is numpy.array

  • int

    Content is numpy.arange() of given length.

  • callable

    Content is return of given function (should be 1-d array-like).

Raises
Returns

Index to be stored in class instance.

Return type

numpy.array
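
The three accepted kinds of input can be pictured with a minimal pure-Python sketch. It is not the library's implementation: the helper name `build_index_sketch` is hypothetical, and plain lists stand in for numpy arrays.

```python
def build_index_sketch(index):
    """Return index content as a list (hypothetical helper, not the library's code)."""
    if callable(index):
        # A callable is expected to return a 1-d array-like.
        return list(index())
    if isinstance(index, int):
        # An int n stands for the values 0..n-1, like numpy.arange(n).
        return list(range(index))
    # Anything else is treated as a 1-d array-like and copied as-is.
    return list(index)


print(build_index_sketch(3))                 # [0, 1, 2]
print(build_index_sketch(["a", "b"]))        # ['a', 'b']
print(build_index_sketch(lambda: (10, 20)))  # [10, 20]
```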

build_pos()[source]

Create a dictionary with positions in the index.
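
One way to picture this mapping, assuming it is a plain dict from item to position (the function name below is hypothetical and the code is only a sketch):

```python
def build_pos_sketch(index):
    """Map every item of the index to its position (illustrative only)."""
    return {item: pos for pos, item in enumerate(index)}


positions = build_pos_sketch(["image_0", "image_1", "image_2"])
print(positions["image_1"])  # 1

# With a dict-based mapping, a repeated item keeps only its last position.
print(build_pos_sketch([2, 0, 2]))  # {2: 2, 0: 1}
```

Under that assumption, an index with repeated items cannot report a separate position for each duplicate.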

calc_split(shares=0.8)

Calculate split into train, test and validation subsets

Parameters

shares (float or a sequence of floats) – Shares of the train, test and validation subsets, respectively.

Returns

Number of items in train, test and validation subsets.

Return type

tuple

Raises

ValueError

  • if shares has more than 3 items

  • if sum of shares is greater than 1

  • if this set does not have enough items to split

Examples

Split into train / test in 80/20 ratio

>>> some_set.calc_split()

Split into train / test / validation in 60/30/10 ratio

>>> some_set.calc_split([0.6, 0.3])

Split into train / test / validation in 50/30/20 ratio

>>> some_set.calc_split([0.5, 0.3, 0.2])
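
The arithmetic behind these examples can be sketched in plain Python. This is a simplified illustration, not the library's code: the function name `split_sizes` is hypothetical, and the rounding rule (round test and validation, give train the remainder) is an assumption. Edge cases such as zero shares are not modelled.

```python
def split_sizes(n_items, shares=0.8):
    """Return (n_train, n_test, n_validation); illustrative, not the library's code."""
    shares = [shares] if isinstance(shares, (int, float)) else list(shares)
    if len(shares) > 3:
        raise ValueError("shares must have no more than 3 items")
    if sum(shares) > 1 + 1e-9:  # small tolerance for floating-point sums
        raise ValueError("shares must sum to at most 1")
    if len(shares) == 1:
        # A single share: the rest goes to test, no validation subset.
        shares = [shares[0], 1 - shares[0], 0.0]
    elif len(shares) == 2:
        # Two shares: the rest goes to validation.
        shares.append(1 - sum(shares))
    n_test = round(n_items * shares[1])
    n_validation = round(n_items * shares[2])
    # Assumption: train takes whatever is left, so the sizes add up to n_items.
    n_train = n_items - n_test - n_validation
    if n_train <= 0:
        raise ValueError("not enough items to split")
    return n_train, n_test, n_validation


print(split_sizes(100))                   # (80, 20, 0)
print(split_sizes(100, [0.6, 0.3]))       # (60, 30, 10)
print(split_sizes(100, [0.5, 0.3, 0.2]))  # (50, 30, 20)
```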
classmethod concat(*index_list)[source]

Create index by concatenating other indices.

Parameters

index_list (list) – Indices to be concatenated. Each item is expected to contain index property with 1-d sequence of indices.

Returns

Contains one common index.

Return type

DatasetIndex

create_batch(index, pos=True, as_array=False, *args, **kwargs)[source]

Create a batch from given indices.

Parameters
  • index (int, slice, list, numpy.array or DatasetIndex) –

    If ‘pos’ is True, then ‘index’ should contain positions of items in the current index to be returned as separate batch.

    If ‘pos’ is False, then ‘index’ should contain indices to be returned as separate batch (so expected batch is just the very same index).

  • pos (bool) – Whether ‘index’ contains positions (True) or indices (False)

  • as_array (bool) – Whether to return array or an instance of DatasetIndex

Returns

Part of initial DatasetIndex, specified by ‘index’.

Return type

DatasetIndex or numpy.array

Examples

Create DatasetIndex with first 100 natural numbers, then get batch with every second item

>>> DatasetIndex(100).create_batch(index=2*numpy.arange(50))
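
The difference between the two modes of ‘pos’ can be shown with a small stand-in. The function `create_batch_sketch` is hypothetical and works on plain lists; it only mirrors the behaviour described above.

```python
def create_batch_sketch(index, selector, pos=True):
    """Pick a part of `index`; a hypothetical stand-in for create_batch."""
    if pos:
        # `selector` holds positions, so look the items up in the index.
        return [index[p] for p in selector]
    # `selector` already holds the items themselves.
    return list(selector)


index = [f"item_{i}" for i in range(6)]
print(create_batch_sketch(index, [0, 2, 4]))
# ['item_0', 'item_2', 'item_4']
print(create_batch_sketch(index, ["item_1", "item_5"], pos=False))
# ['item_1', 'item_5']
```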
create_subset(index)[source]

Return a new index object based on the subset of indices given.

classmethod from_index(*args, **kwargs)[source]

Create index from another index.

gen_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, notifier=False, iter_params=None)[source]

Generate batches

Parameters
  • batch_size (int) – Desired number of items in the batch (the actual batch could contain fewer items).

  • shuffle – specifies the order of items (see shuffle())

  • n_iters (int) – Number of iterations to make (only one of n_iters and n_epochs should be specified).

  • n_epochs (int) – Number of epochs required (only one of n_iters and n_epochs should be specified).

  • drop_last (bool) –

    If True, drops the last batch (in each epoch) if it contains fewer than batch_size items. If False, then the last batch in each epoch could contain repeating indices (which might be a problem) and the very last batch could contain fewer than batch_size items.

    For instance, gen_batch(3, shuffle=False, n_epochs=2, drop_last=False) for a dataset with 4 items returns indices [0,1,2], [3,0,1], [2,3]. While gen_batch(3, shuffle=False, n_epochs=2, drop_last=True) returns indices [0,1,2], [0,1,2].

    Take into account that gen_batch(3, shuffle=True, n_epochs=2, drop_last=False) could return batches [3,0,1], [2,0,2], [1,3]. Here the second batch contains two items with the same index “2”. This might become a problem if some action uses batch.get_pos() or batch.index.get_pos() methods so that one of the identical items will be missed. However, there is nothing to worry about if you don’t iterate over batch items explicitly (i.e. for item in batch) or implicitly (through batch[ix]).

  • notifier (str, dict, or instance of Notifier) – Configuration of the displayed progress bar, if any. If str or dict, it is used as the parameters for Notifier initialization. For more details about notifying capabilities, refer to the Notifier documentation.

Yields

An instance of the same class with a subset of indices

Raises

ValueError – When n_epochs and n_iters have been passed at the same time.

Examples

for index_batch in index.gen_batch(BATCH_SIZE, shuffle=True, n_epochs=2, drop_last=True):
    pass  # do whatever you want with index_batch
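
The effect of drop_last on a 4-item dataset, as described in the parameters, can be reproduced with a short generator. This is only a sketch without shuffling: `gen_batches_sketch` is a hypothetical name and the code is not the library's implementation.

```python
def gen_batches_sketch(items, batch_size, n_epochs, drop_last=False):
    """Yield batches over several epochs (illustrative, no shuffling)."""
    items = list(items)
    if drop_last:
        # Each epoch yields only full batches; the leftover items are dropped.
        full_batches = len(items) // batch_size
        for _ in range(n_epochs):
            for i in range(full_batches):
                yield items[i * batch_size:(i + 1) * batch_size]
    else:
        # Epochs are chained, so a batch may span an epoch boundary and
        # only the very last batch may be shorter than batch_size.
        stream = items * n_epochs
        for start in range(0, len(stream), batch_size):
            yield stream[start:start + batch_size]


print(list(gen_batches_sketch(range(4), 3, n_epochs=2)))
# [[0, 1, 2], [3, 0, 1], [2, 3]]
print(list(gen_batches_sketch(range(4), 3, n_epochs=2, drop_last=True)))
# [[0, 1, 2], [0, 1, 2]]
```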
classmethod get_default_iter_params()

Return iteration params with default values to start iteration from scratch

get_pos(index)[source]

Return position of an item in the index.

Parameters

index (int, str, slice or Iterable) –

Items to return positions of.

  • int, str

    Return position of that item in the DatasetIndex.

  • slice, Iterable

    Return positions of multiple items, specified by argument.

Returns

Positions of specified items in DatasetIndex.

Return type

numpy.array

Examples

Create DatasetIndex that holds index of images and get position of one of them

>>> DatasetIndex(['image_0', 'image_1']).get_pos('image_1')
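
A rough illustration of the single-item and multi-item cases, assuming positions are kept in an item-to-position dict. The helper `get_pos_sketch` is hypothetical and does not handle slices.

```python
def get_pos_sketch(positions, index):
    """Look up one item or several items in an item -> position dict."""
    if isinstance(index, (int, str)):
        # A single item gives a single position.
        return positions[index]
    # Any other iterable gives one position per item, in the given order.
    return [positions[item] for item in index]


positions = {"image_0": 0, "image_1": 1}
print(get_pos_sketch(positions, "image_1"))               # 1
print(get_pos_sketch(positions, ["image_1", "image_0"]))  # [1, 0]
```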
property index

the dataset’s index

Type

dataset.DatasetIndex

property indices

an array with the indices

Type

numpy.ndarray

property is_split

True if dataset has been split into train / test / validation subsets

Type

bool

next_batch(batch_size, shuffle=False, n_iters=None, n_epochs=None, drop_last=False, iter_params=None)[source]

Return the next batch

Parameters
  • batch_size (int) – Desired number of items in the batch (the actual batch could contain fewer items)

  • shuffle – Specifies the order of items (see shuffle())

  • n_iters (int) – Number of iterations to make (only one of n_iters and n_epochs should be specified).

  • n_epochs (int) – Number of epochs required (only one of n_iters and n_epochs should be specified).

  • drop_last (bool) –

    If True, drops the last batch (in each epoch) if it contains fewer than batch_size items. If False, then the last batch in each epoch could contain repeating indices (which might be a problem) and the very last batch could contain fewer than batch_size items.

    For instance, next_batch(3, shuffle=False, n_epochs=2, drop_last=False) for a dataset with 4 items returns indices [0,1,2], [3,0,1], [2,3]. While next_batch(3, shuffle=False, n_epochs=2, drop_last=True) returns indices [0,1,2], [0,1,2].

    Take into account that next_batch(3, shuffle=True, n_epochs=2, drop_last=False) could return batches [3,0,1], [2,0,2], [1,3]. Here the second batch contains two items with the same index “2”. This might become a problem if some action uses batch.get_pos() or batch.index.get_pos() methods so that one of the identical items will be missed. However, there is nothing to worry about if you don’t iterate over batch items explicitly (i.e. for item in batch) or implicitly (through batch[ix]).

Raises
  • StopIteration – When n_epochs has been reached and there are no batches left in the dataset.

  • ValueError – When n_epochs and n_iters have been passed at the same time. When batch size exceeds the dataset size.

Examples

for i in range(MAX_ITER):
    index_batch = index.next_batch(BATCH_SIZE, shuffle=True, n_epochs=2, drop_last=True)
    # do whatever you want
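
Unlike a generator, a next-batch call has to remember where the previous call stopped. The class below is a hypothetical sketch of that idea (no shuffling, drop_last=False behaviour only); it is not how the library stores its iteration state.

```python
class BatchCursor:
    """Hand out batches one call at a time (illustrative only)."""

    def __init__(self, items, batch_size, n_epochs):
        items = list(items)
        if batch_size > len(items):
            raise ValueError("batch size exceeds the dataset size")
        self._stream = items * n_epochs  # all epochs chained together
        self._batch_size = batch_size
        self._start = 0                  # where the next batch begins

    def next_batch(self):
        if self._start >= len(self._stream):
            raise StopIteration("no batches left")
        batch = self._stream[self._start:self._start + self._batch_size]
        self._start += self._batch_size
        return batch


cursor = BatchCursor(range(4), batch_size=3, n_epochs=2)
print(cursor.next_batch())  # [0, 1, 2]
print(cursor.next_batch())  # [3, 0, 1]
print(cursor.next_batch())  # [2, 3]
```

A further call to `cursor.next_batch()` raises StopIteration, so a loop with a fixed number of iterations should be prepared to catch it.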
reset(what='iter')

Clear all iteration metadata in order to start iterating from scratch

reset_iter()
shuffle(shuffle, iter_params=None)[source]

Permute indices

Parameters

shuffle (bool or seed) –

specifies the order of items

  • if False, items go sequentially, one after another as they appear in the index.

  • if True, items are shuffled randomly before each epoch.

  • see make_rng() for seed specifications.

Returns

a permuted order for indices

Return type

ndarray
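
The three cases can be sketched with the standard `random` module. The function name `shuffled_order` is hypothetical, and treating an int as a seed is an assumption; the seed formats accepted by make_rng() are not covered here.

```python
import random


def shuffled_order(n_items, shuffle):
    """Return an order of positions 0..n_items-1 (illustrative only)."""
    order = list(range(n_items))
    if shuffle is False:
        # Sequential order, as the items appear in the index.
        return order
    # True means fresh randomness; an int is used as a seed (assumption).
    rng = random.Random() if shuffle is True else random.Random(shuffle)
    rng.shuffle(order)
    return order


print(shuffled_order(5, False))  # [0, 1, 2, 3, 4]
print(shuffled_order(5, 42) == shuffled_order(5, 42))  # True
```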

property size

int - number of items in the set

split(shares=0.8, shuffle=False)[source]

Split index into train, test and validation subsets.

Shuffles index if necessary.

Subsets are available as .train, .test and .validation respectively.

Parameters
  • shares (float or tuple of floats) – train, test and validation shares.

  • shuffle – specifies the order of items (see shuffle())

Notes

If tuple of 3 floats is passed, then validation subset is always present.

Examples

Split into train / test in 80/20 ratio

>>> index.split()

Split into train / test / validation in 60/30/10 ratio

>>> index.split([0.6, 0.3])

Split into train / test / validation in 50/30/20 ratio

>>> index.split([0.5, 0.3, 0.2])

Use 1 sample as validation and split the rest evenly into train / test

>>> index.split([0.5, 0.5, 0])
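
Once subset sizes are known, the split itself amounts to slicing. The sketch below assumes the subsets are taken as consecutive slices in train, test, validation order; `split_sketch` is a hypothetical helper, not the library's method.

```python
def split_sketch(index, n_train, n_test):
    """Cut an index into three consecutive parts (illustrative only)."""
    index = list(index)
    train = index[:n_train]
    test = index[n_train:n_train + n_test]
    validation = index[n_train + n_test:]  # whatever remains
    return train, test, validation


train, test, validation = split_sketch(range(10), n_train=6, n_test=3)
print(train)       # [0, 1, 2, 3, 4, 5]
print(test)        # [6, 7, 8]
print(validation)  # [9]
```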
subset_by_pos(pos)[source]

Return subset of index by given positions in the index.

Parameters

pos (int, slice, list or numpy.array) – Positions of items to include in subset.

Returns

Subset of DatasetIndex.index.

Return type

numpy.array

FilesIndex

class FilesIndex(*args, **kwargs)[source]

Bases: batchflow.dsindex.DatasetIndex

Index with the list of files or directories with the given path pattern

Examples

Create a sorted index of files in a directory:

>>> fi = FilesIndex(path='/path/to/data/files/*', sort=True)

Create an index of directories through all subdirectories:

>>> fi = FilesIndex(path='/path/to/data/archive*/patient*', dirs=True)

Create an index of files in several directories, ignoring file extensions:

>>> fi = FilesIndex(path=['/path/to/archive/2016/*','/path/to/current/file/*'], no_ext=True)

To get a path to the file call get_fullpath(index_id):

>>> path = fi.get_fullpath(some_id)

Split into train / test / validation in 80/15/5 ratio

>>> fi.split([0.8, 0.15])

Get a position of a customer in the index

>>> item_pos = fi.get_pos(customer_id)
build_from_index(index, paths, dirs=None)[source]

Build index from another index for indices given.

build_from_one_path(path, dirs=False, no_ext=False)[source]

Build index from a path/glob.

build_from_path(path, dirs=False, no_ext=False, sort=False)[source]

Build index from a path/glob or a sequence of paths/globs.

build_index(index=None, path=None, *args, **kwargs)[source]

Build index from a path string or an index given.

static build_key(fullpathname, no_ext=False)[source]

Create index item from full path name.

classmethod concat(*index_list)[source]

Create index by concatenating other indices.

Parameters

index_list (list) – Indices to be concatenated. Each item is expected to contain index property with 1-d sequence of indices.

Returns

Contains one common index.

Return type

DatasetIndex

create_subset(index)[source]

Return a new FilesIndex based on the subset of indices given.

get_fullpath(key)[source]

Return the full path name for an item in the index.
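
The relation between index items and full paths can be pictured with `glob` and `os.path`. Everything below is an assumption made for illustration: the helper names are hypothetical, and taking the file's base name as the key (optionally without its extension) is a guess at the behaviour, not the library's code.

```python
import glob
import os
import tempfile


def build_key_sketch(fullpathname, no_ext=False):
    """Derive an index item from a full path (assumed: the base name)."""
    name = os.path.basename(fullpathname)
    return os.path.splitext(name)[0] if no_ext else name


def build_paths_sketch(pattern, no_ext=False):
    """Map each key to the full path it came from, sorted by path."""
    return {build_key_sketch(p, no_ext): p for p in sorted(glob.glob(pattern))}


with tempfile.TemporaryDirectory() as tmp:
    # Create two empty files to index.
    for name in ("a.csv", "b.csv"):
        open(os.path.join(tmp, name), "w").close()
    paths = build_paths_sketch(os.path.join(tmp, "*"), no_ext=True)
    keys = list(paths)
    fullpath_of_a = paths["a"]  # the lookup a get_fullpath-style call would do

print(keys)  # ['a', 'b']
```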

property paths