Preprocessing¶
The preprocessing module allows you to perform a number of preprocessing actions on a dataset of scans.
RadIO works with batches of scans, wrapped in the class CTImagesBatch.
The class stores scans' data in components, which represent the main attributes
of scans. E.g., the scans themselves are stacked into one
tall 3d numpy array (called the skyscraper) and stored in the images component. The other
components are spacing and origin, which store important metadata.
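The skyscraper layout can be illustrated with plain numpy: scans with different numbers of slices are stacked along the z-axis into one tall array, with per-item bounds tracked separately. The names below (scans, bounds) are illustrative, not RadIO's internals:

```python
import numpy as np

# three toy "scans" with different numbers of slices but equal in-plane shape
scans = [np.zeros((n_slices, 4, 4), dtype=np.float32) for n_slices in (10, 15, 12)]

# the skyscraper: all slices stacked along axis 0
skyscraper = np.concatenate(scans, axis=0)  # shape (37, 4, 4)

# per-item slice bounds let you recover an individual scan later
bounds = np.cumsum([0] + [s.shape[0] for s in scans])  # [0, 10, 25, 37]
second_scan = skyscraper[bounds[1]:bounds[2]]           # shape (15, 4, 4)
```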
So, what can be done with the data?
Load and dump¶
The preprocessing module is primarily adapted to work with two
large datasets available in open access: the Luna dataset
(.raw format) and the DsBowl2017 dataset (.dicom format).
Suppose you have one of these two datasets (or a part of it) downloaded into the
folder path/to/scans. The first step is to set up an index.
FilesIndex from the dataset module reduces this
task to defining a glob mask for the needed set of scans:
from radio import CTImagesBatch as CTIMB
from radio.dataset import FilesIndex, Dataset, Pipeline
ctx = FilesIndex(path='path/to/scans/*', no_ext=True)
ctset = Dataset(index=ctx, batch_class=CTIMB)
pipeline = Pipeline()
To load scans you only need to call the action load(), specifying
the format of the dataset:
pipeline = pipeline.load(fmt='raw') # use fmt = 'dicom' for load of dicom-scans
After performing some preprocessing operations you may need to save the
results to disk. The action dump() helps here:
pipeline = ... # some preprocessing actions
pipeline = pipeline.dump(dst='path/to/preprocessed/')
In the end, the data of each scan from the batch will be packed with
blosc and dumped into the target folder.
Dumped scans can be loaded later in the same way;
to do this, specify the blosc format when calling load():
pipeline = Pipeline().load(fmt='blosc')
Both dump() and load() from blosc can work component-wise:
pipeline_dump = (
    pipeline
    .dump(dst='path/to/preprocessed/', fmt='blosc', components=['spacing', 'origin'])  # dump spacing, origin components
    .dump(dst='path/to/preprocessed/', fmt='blosc', components='images')  # dump the scans themselves
)
pipeline_load = Pipeline().load(fmt='blosc', components=['spacing', 'origin', 'images'])
Resize and unify spacing¶
Another preprocessing step is resizing scans to a specific shape.
The preprocessing module has the resize() action; you specify the desired
output shape in z, y, x order:
pipeline = pipeline.resize(shape=(128, 256, 256))
Currently the module supports two different resize engines:
scipy.interpolate and PIL-simd. While the second engine
is more robust and works faster on systems with a small number
of cores, the first allows a greater degree of parallelization
and can be more precise in some cases. One can choose the engine
in the following way:
pipeline = pipeline.resize(shape=(128, 256, 256), method='scipy')
Sometimes it may be useful to convert scans to the same real-world scale,
rather than simply reshape them to the same size.
This can be achieved with the unify_spacing() action:
pipeline = pipeline.unify_spacing(spacing=(3.0, 2.0, 2.0), shape=(128, 256, 256))
To control the real-world scale of scans, you can specify spacing,
which represents the distances in millimeters between adjacent voxels along the three axes.
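As a quick sanity check, the real-world extent of a scan is simply spacing times shape along each axis. For example, a (128, 256, 256) volume with spacing (3.0, 2.0, 2.0) mm covers:

```python
import numpy as np

spacing = np.array([3.0, 2.0, 2.0])   # mm between voxel centers, (z, y, x)
shape = np.array([128, 256, 256])     # number of voxels, (z, y, x)

extent_mm = spacing * shape           # physical size along each axis
# [384., 512., 512.] mm, i.e. 38.4 cm x 51.2 cm x 51.2 cm
```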
The action works in two steps: the first unifies spacing
by means of resize, while the second crops/pads
the resized scan so that it fits the supplied shape. You can specify
resize parameters and the padding mode:
pipeline = pipeline.unify_spacing(spacing=(3.0, 2.0, 2.0), shape=(128, 256, 256),
padding='reflect', engine='pil-simd')
So far it was all about the images component, which can be viewed as
the X-input of a neural network. What about the network's target, the Y-input?
Create masks with CTImagesMaskedBatch¶
Preparing the target for a network revolves around the class CTImagesMaskedBatch.
It naturally has one new component: masks. Masks have the same
shape as images and store the cancer masks of the batch items
in binary format, where the value of each voxel is either 0 (non-cancerous voxel) or
1 (cancerous voxel). masks can be filled in two steps.
First, load the info about cancerous nodules in a batch with fetch_nodules_info():
pipeline = (
    pipeline
    .fetch_nodules_info(nodules=nodules)  # nodules is a pandas DataFrame
                                          # containing info about nodules
)
Then you can fill the masks component using the loaded info and the action create_mask():
pipeline = (
    pipeline
    .create_mask()
)
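Since masks are just binary numpy arrays, you can inspect them with plain numpy once they are filled. For instance, the fraction of cancerous voxels in a single mask (the mask below is a toy array, not one produced by RadIO):

```python
import numpy as np

# toy binary mask: a small cancerous cube inside an otherwise clean volume
mask = np.zeros((32, 64, 64), dtype=np.int8)
mask[10:14, 20:28, 20:28] = 1  # 4 * 8 * 8 = 256 cancerous voxels

cancer_fraction = mask.mean()  # share of voxels equal to 1
```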
Sample crops from scan¶
RadIO has sample_nodules(), which allows you to generate batches of small crops, balancing cancerous
and non-cancerous examples.
Let's start preprocessing by resizing the scans:
pipeline = (
    pipeline
    .resize(shape=(256, 512, 512))
)
Now that all scans have the same shape (256, 512, 512), it is possible to feed them into a neural network. However, this may fail for two main reasons:
- only a small number of scans (say, 3) of such size fit into the memory of a GPU
- typically, there are not many scans available for training (888 for the Luna dataset); as a result, making only one training example out of each scan is rather wasteful.
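The memory concern is easy to quantify: a single (256, 512, 512) scan stored as float32 already takes a quarter of a gibibyte, so a handful of full-size scans exhausts a typical GPU before activations and gradients are even accounted for:

```python
# bytes for one (256, 512, 512) scan stored as float32
voxels = 256 * 512 * 512                 # 67,108,864 voxels
bytes_per_scan = voxels * 4              # 268,435,456 bytes
mib_per_scan = bytes_per_scan / 2**20    # 256 MiB per scan

# an 8 GiB GPU fits only ~32 such scans, with nothing left for the network itself
scans_per_8gib = (8 * 2**30) // bytes_per_scan
```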
A more efficient approach is to crop out interesting parts of the scans using sample_nodules().
E.g., this piece of code
pipeline = (
    pipeline
    .resize(shape=(256, 512, 512))
    .sample_nodules(nodule_size=(32, 64, 64),
                    batch_size=20, share=0.5)
)
will generate batches of size 20, each containing 10 cancerous and 10 non-cancerous crops of shape (32, 64, 64). Or, alternatively, this code
pipeline = (
    pipeline
    .resize(shape=(256, 512, 512))
    .sample_nodules(nodule_size=(32, 64, 64),
                    batch_size=20, share=0.6,
                    variance=(100, 200, 200),
                    histo=some_3d_histogram)
)
will generate batches of size 20 with 12 cancerous crops. Pay attention to the
parameters variance and histo of sample_nodules():
- variance introduces variability in the location of the cancerous nodule inside the crop. E.g., if set to (100, 200, 200), the location of the cancerous nodule will be sampled from a normal distribution with zero mean and variances (100, 200, 200) along the three axes.
- histo allows you to control the positions of non-cancerous crops. If histo is set to None, non-cancerous crops are sampled uniformly from scan boxes of shape (256, 512, 512). Sometimes, though, you may want to sample non-cancerous crops from specific regions of the lungs, say, the interior of the left lung. In this case you can generate a 3d histogram (see numpy.histogramdd()) concentrated in this region and supply it to the sample_nodules action.
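A suitable histogram can be built with numpy.histogramdd. The sketch below concentrates sample points in an illustrative sub-box standing in for "the interior of the left lung"; the coordinates are made up, and whether sample_nodules expects the full (hist, edges) tuple returned by numpy.histogramdd or only the counts array should be checked against the RadIO API:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 points concentrated in an illustrative sub-box of a (256, 512, 512) scan
points = rng.uniform(low=(60, 100, 50), high=(200, 400, 250), size=(10_000, 3))

# 3d histogram over the whole scan volume; bins outside the sub-box stay zero
some_3d_histogram = np.histogramdd(
    points, bins=(8, 8, 8), range=((0, 256), (0, 512), (0, 512))
)
hist, edges = some_3d_histogram
```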
Augment data on-the-fly¶
Medical datasets are often small and require additional augmentation to avoid overfitting. For this purpose, it is possible to combine rotate() and central_crop():
pipeline = (
    pipeline
    .resize(shape=(256, 512, 512))
    .rotate(angle=90, axes=(1, 2), random=True)
    .central_crop(crop_size=(32, 64, 64))
)
This pipeline first resizes all images to the same shape and then samples rotated crops of shape (32, 64, 64); the rotation angle is random, from 0 to 90 degrees, and the rotation is performed about the z-axis. Crops are zero-padded after rotation if needed.
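Under the hood, a rotation about the z-axis is an in-plane rotation of each (y, x) slice. A comparable transform can be sketched with scipy.ndimage; this is an illustration, not RadIO's implementation:

```python
import numpy as np
from scipy.ndimage import rotate

crop = np.zeros((32, 64, 64), dtype=np.float32)
crop[:, 28:36, 28:36] = 1.0  # a bright block in the middle of each slice

# rotate every (y, x) slice by 30 degrees about the z-axis;
# reshape=False keeps the shape, empty corners are zero-padded
rotated = rotate(crop, angle=30.0, axes=(1, 2), reshape=False,
                 mode='constant', cval=0.0)
```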
Accessing Batch components¶
You may want to access CTImagesBatch or CTImagesMaskedBatch data directly,
e.g., if you decide to write your own actions.
Batch classes provide this functionality: the 3d scan for an item indexed by ix
can be accessed in the following way:
image_3d_ix = batch.get(ix, 'images')
The same goes for the other components of item ix:
spacing_ix = batch.get(ix, 'spacing')
Or, alternatively:
image_3d_ix = getattr(batch[ix], 'images')
spacing_ix = batch[ix].spacing
It is sometimes useful to print the indices of all items in a batch:
print(batch.indices) # batch.indices is a list of indices of all items
Writing your own actions¶
Now that you know how to work with the components of CTImagesBatch,
you can write your own action. Say you need an
action that subtracts the mean voxel density from each scan. You can easily inherit from one of
the batch classes of RadIO (we suggest CTImagesMaskedBatch) and make your action center
a method of this class, like so:
import numpy as np

from radio.dataset import action
from radio import CTImagesMaskedBatch

class CTImagesCustomBatch(CTImagesMaskedBatch):
    """ CT-scans batch class with your own action """

    @action  # the action decorator allows you to chain your method with other actions in pipelines
    def center(self):
        """ Center the pixel values of each scan in the batch """
        for ix in self.indices:
            mean_ix = np.mean(self.get(ix, 'images'))
            images_ix = getattr(self[ix], 'images')
            images_ix[:] -= mean_ix
        return self  # an action must always return a batch object
You can then chain your action center with other actions of CTImagesMaskedBatch to form custom preprocessing pipelines:
pipeline = (Pipeline()
            .load(fmt='blosc')               # load data
            .center()                        # mean-normalize scans
            .sample_nodules(batch_size=20))  # sample cancerous and non-cancerous crops