Preprocessing

The module allows you to perform a number of preprocessing actions on a dataset of scans.

RadIO works with batches of scans, wrapped in the class CTImagesBatch. The class stores the scans' data in components, which represent the main attributes of the scans. E.g., the scans themselves are stacked into one tall 3d numpy array (called the skyscraper) and stored in the images component. The other components are spacing and origin, which store important metadata.
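
For orientation, the components of a loaded batch can be inspected directly; a minimal sketch, assuming batch is a CTImagesBatch produced by one of the pipelines below (shapes are illustrative and depend on your data):

print(batch.images.shape) # the skyscraper: all scans stacked along the z-axis
print(batch.spacing) # per-item voxel spacing in millimeters, one (z, y, x) triple per scan
print(batch.origin) # per-item world coordinates of the first voxel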

So, what can be done with the data?

Load and dump

The preprocessing module is primarily adapted to work with two large, publicly available datasets: the Luna dataset (.raw format) and the DsBowl2017 dataset (.dicom format). Say you have one of these two datasets (or a part of it) downloaded into the folder path/to/scans. The first step is to set up an index. FilesIndex from the dataset module reduces the task to defining a glob mask for the needed set of scans:

from radio import CTImagesBatch as CTIMB
from radio.dataset import FilesIndex, Dataset, Pipeline

ctx = FilesIndex(path='path/to/scans/*', no_ext=True)
ctset = Dataset(index=ctx, batch_class=CTIMB)
pipeline = Pipeline()

To load scans, you only need to call the load() action, specifying the format of the dataset:

pipeline = pipeline.load(fmt='raw') # use fmt='dicom' to load dicom scans

After performing some preprocessing operations, you may need to save the results to disk. The dump() action is of help here:

pipeline = ... # some preprocessing actions
pipeline = pipeline.dump(dst='path/to/preprocessed/')

In the end, the data of each scan from the batch will be packed with blosc and dumped into the folder. Dumped scans can be loaded later using the same methodology. To do this, specify the blosc format when calling load():

pipeline = Pipeline().load(fmt='blosc')

Both dump() and load() from blosc can work component-wise:

pipeline_dump = (
    pipeline
    .dump(dst='path/to/preprocessed/', fmt='blosc', components=['spacing', 'origin']) # dump the spacing and origin components
    .dump(dst='path/to/preprocessed/', fmt='blosc', components='images') # dump the scans themselves
)

pipeline_load = Pipeline().load(fmt='blosc', components=['spacing', 'origin', 'images'])

Resize and unify spacing

Another preprocessing step is resizing scans to a specific shape. The preprocessing module has a resize() action, where you specify the desired output shape in z, y, x order:

pipeline = pipeline.resize(shape=(128, 256, 256))

Currently the module supports two different resize engines: scipy.interpolate and PIL-simd. While the second engine is more robust and works faster on systems with a small number of cores, the first allows a greater degree of parallelization and can be more precise in some cases. You can choose the engine in the following way:

pipeline = pipeline.resize(shape=(128, 256, 256), method='scipy')

Sometimes it may be useful to convert scans to the same real-world scale, rather than simply reshape them to the same size. This can be achieved with the unify_spacing() action:

pipeline = pipeline.unify_spacing(spacing=(3.0, 2.0, 2.0), shape=(128, 256, 256))

To control the real-world scale of scans, you specify spacing, which represents the distances in millimeters between adjacent voxels along the three axes. The action works in two steps: the first unifies spacing by means of resize, while the second crops/pads the resized scan so that it fits the supplied shape. You can specify resize parameters and the padding mode:

pipeline = pipeline.unify_spacing(spacing=(3.0, 2.0, 2.0), shape=(128, 256, 256),
                                  padding='reflect', engine='pil-simd')

So far it was all about the images component, which can be viewed as the X-input of a neural network. What about the network's target, the Y-input?

Create masks with CTImagesMaskedBatch

Preparing the target for a network revolves around the class CTImagesMaskedBatch. It naturally has one new component: masks. Masks have the same shape as images and store the cancer masks of different items in binary format, where the value of each voxel is either 0 (non-cancerous voxel) or 1 (cancerous voxel). masks can be built in two steps. First, load info about the cancerous nodules in a batch with fetch_nodules_info():

pipeline = (
    pipeline
    .fetch_nodules_info(nodules=nodules) # nodules is a pandas.DataFrame
                                         # containing info about nodules
)
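
For example, with the Luna dataset, nodules can be built from the annotations file. The sketch below assumes the standard LUNA16 annotations.csv layout (columns seriesuid, coordX, coordY, coordZ, diameter_mm) and a hypothetical path:

import pandas as pd

nodules = pd.read_csv('path/to/annotations.csv') # hypothetical path to the LUNA16 annotations file
print(nodules.columns) # expected: seriesuid, coordX, coordY, coordZ, diameter_mm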

Then you can fill the masks component using the loaded info and the create_mask() action:

pipeline = (
    pipeline
    .create_mask()
)

Sample crops from scan

RadIO has sample_nodules(), which generates batches of small crops, balancing cancerous and non-cancerous examples. Let's start preprocessing by resizing the scans:

pipeline = (
    pipeline
    .resize(shape=(256, 512, 512))
)

Now that all scans have the same shape (256, 512, 512), it is possible to feed them into a neural network. However, this may fail for two main reasons:

  • only a small number of scans (say, 3) of such size can fit into the memory of a GPU
  • typically, there are not many scans available for training (888 for the Luna dataset), so making only one training example out of each scan is rather wasteful

A more efficient approach is to crop out interesting parts of scans using sample_nodules(). E.g., this piece of code

pipeline = (
    pipeline
    .resize(shape=(256, 512, 512))
    .sample_nodules(nodule_size=(32, 64, 64),
                    batch_size=20, share=0.5)
)

will generate batches of size 20 that contain 10 cancerous and 10 noncancerous crops of shape (32, 64, 64). Alternatively, this code

pipeline = (
    pipeline
    .resize(shape=(256, 512, 512))
    .sample_nodules(nodule_size=(32, 64, 64),
                    batch_size=20, share=0.6,
                    variance=(100, 200, 200),
                    histo=some_3d_histogram)
)

will generate batches of size 20 with 12 cancerous crops. Pay attention to the variance and histo parameters of sample_nodules():

  • variance introduces variability in the location of the cancerous nodule inside the crop. E.g., if set to (100, 200, 200), the location of the cancerous nodule will be sampled from a normal distribution with zero mean and variances (100, 200, 200) along the three axes.
  • histo allows you to control the positions of noncancerous crops. If histo is set to None, noncancerous crops will be sampled uniformly from scan boxes of shape (256, 512, 512). Sometimes, though, you may want to sample noncancerous crops from specific regions of the lungs, say, the interior of the left lung. In this case you can generate a 3d histogram (see numpy.histogramdd()) concentrated in this region and pass it to the sample_nodules action, as sketched after this list.
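
A minimal sketch of building such a histogram, assuming histo is expected in the (frequencies, edges) form returned by numpy.histogramdd and that the region of interest can be described by a cloud of sample points in the (z, y, x) coordinates of the resized scans (both are assumptions; the exact format expected by sample_nodules may differ):

import numpy as np

# hypothetical cloud of points concentrated around the region of interest,
# in (z, y, x) voxel coordinates of scans resized to (256, 512, 512)
points = np.random.normal(loc=(128, 256, 150), scale=(30, 60, 40), size=(10000, 3))

# 3d histogram over the scan box; the (frequencies, edges) pair is passed as histo
some_3d_histogram = np.histogramdd(points, bins=10,
                                   range=((0, 256), (0, 512), (0, 512)))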

Augment data on-the-fly

Medical datasets are often small and require additional augmentation to avoid overfitting. For this purpose, it is possible to combine rotate() and central_crop():

pipeline = (
    pipeline
    .resize(shape=(256, 512, 512))
    .rotate(angle=90, axes=(1, 2), random=True)
    .central_crop(crop_size=(32, 64, 64))
)

This pipeline first resizes all images to the same shape and then samples rotated crops of shape (32, 64, 64); the rotation angle is random, from 0 to 90 degrees, and the rotation is performed about the z-axis. Crops are padded with zeroes after rotation, if needed.

Accessing Batch components

You may want to access CTImagesBatch or CTImagesMaskedBatch data directly, e.g., if you decide to write your own actions. The batch classes provide this functionality: the 3d scan of an item indexed by ix can be fetched from a batch in the following way:

image_3d_ix = batch.get(ix, 'images')

The same goes for other components of item ix:

spacing_ix = batch.get(ix, 'spacing')

Or, alternatively

image_3d_ix = getattr(batch[ix], 'images')
spacing_ix = batch[ix].spacing

It is sometimes useful to print the indices of all items in a batch:

print(batch.indices) # batch.indices is a list of indices of all items

Writing your own actions

Now that you know how to work with the components of CTImagesBatch, you can write your own action. Say you need an action that subtracts the mean voxel density from each scan. You can inherit from one of RadIO's batch classes (we suggest CTImagesMaskedBatch) and make your action center a method of this class, just like that:

import numpy as np

from radio.dataset import action
from radio import CTImagesMaskedBatch

class CTImagesCustomBatch(CTImagesMaskedBatch):
    """ Ct-scans batch class with your own action """

    @action  # action-decorator allows you to chain your method with other actions in pipelines
    def center(self):
        """ Center values of pixels in each scan from batch """
        for ix in self.indices:
            mean_ix = np.mean(self.get(ix, 'images'))
            images_ix = getattr(self[ix], 'images')
            images_ix[:] -= mean_ix

        return self  # action must always return a batch-object

You can then chain your action center with other actions of CTImagesMaskedBatch to form custom preprocessing pipelines:

pipeline = (Pipeline()
            .load(fmt='blosc')  # load data
            .center()  # mean-normalize scans
            .sample_nodules(batch_size=20))  # sample cancerous and noncancerous crops
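
To actually execute such a pipeline, attach it to a dataset and request batches. A minimal sketch, assuming the dataset module's >> operator and next_batch() shown above (note that the dataset must be built with batch_class=CTImagesCustomBatch for the custom center action to be available; batch_size and shuffle values are illustrative):

ctset_custom = Dataset(index=ctx, batch_class=CTImagesCustomBatch)

workflow = ctset_custom >> pipeline # bind the pipeline to the dataset
batch = workflow.next_batch(batch_size=4, shuffle=True) # pull one preprocessed batch

# or process the whole dataset at once
# workflow.run(batch_size=4, shuffle=True, n_epochs=1)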