Preprocessing ============= Module allows to perform a number of preprocessing actions on a dataset of scans. `RadIO` works with batches of scans, wrapped in a class ``CTImagesBatch``. The class stores scans' data in :class:`components ` which represents main attributes of scans. E.g., scans itself are stacked in one tall 3d ``numpy``-array (called `skyscraper`) and stored in ``images``-component. The other components are ``spacing`` and ``origin``, which store important meta. So, what can be done with the data? Load and dump ------------- The ``preprocessing``-module is primarily adapted to work with two large datasets, available in open access: `Luna-dataset `_ (**.raw**-format) and `DsBowl2017-dataset `_ (**.dicom**-format). Consider, you have one of these two datasets (or a part of it) downloaded in folder ``path/to/scans``. The first step is to set up an index. :class:`~dataset.FilesIndex` from :mod:`dataset`-module reduces the task to defining a ``glob``-mask for a needed set of scans: .. code-block:: python from radio import CTImagesBatch as CTIMB from radio.dataset import FilesIndex, Dataset, Pipeline ctx = FilesIndex(path='path/to/scans/*', no_ext=True) ctset = Datset(index=ctx, batch_class=CTIMB) pipeline = Pipeline() For loading scans you only need to call action :meth:`~.CTImagesBatch.load` specifying format of dataset: .. code-block:: python pipeline = pipeline.load(fmt='raw') # use fmt = 'dicom' for load of dicom-scans After performing some preprocessing operations you may need to save the results on disk. Action :meth:`~.CTImagesBatch.dump` is of help here: .. code-block:: python pipeline = ... # some preprocessing actions pipeline = pipeline.dump(dst='path/to/preprocessed/') In the end, data of each scan from the batch will be packed with :func:`blosc ` and dumped into folder. Dumped scans can be loaded later using the same methodology. To do this, specify `blosc`-format when performing :meth:`~.CTImagesBatch.load`: .. code-block:: python pipeline = Pipeline().load(fmt='blosc') Both :meth:`~.CTImagesBatch.dump` and :meth:`~.CTImagesBatch.load` from `blosc` can work component-wise: .. code-block:: python pipeline_dump = ( pipeline .dump(fmt='blosc', components=['spacing', 'origin']) # dump spacing, origin components .dump(dst='path/to/preprocessed/', fmt='blosc', components='images') # dumps scans itself ) pipeline_load = Pipeline().load(fmt='blosc', components=['spacing', 'origin', 'images']) .. _ResizeUspac: Resize and unify spacing ------------------------ Another step of preprocessing is **resize** of scans to a specific shape. ``preprocessing``-module has :meth:`~.CTImagesBatch.resize`-action, specifying desired output shape in z, y, x order: .. code-block:: python pipeline = pipeline.resize(shape=(128, 256, 256)) Currently module supports two different resize-engines: :mod:`scipy.interpolate` and ``PIL-simd``. While the second engine is more robust and works faster on systems with small number of cores, the first allows greater degree of parallelization and can be more precise in some cases. One can choose engine in a following way: .. code-block:: python pipeline = pipeline.resize(shape=(128, 256, 256), method='scipy') Sometimes, it may be useful to convert scans to the same real-world scale, rather than simply reshape to same size. This can be achieved through :meth:`~.CTImagesBatch.unify_spacing`-action: .. code-block:: python pipeline = pipeline.unify_spacing(spacing=(3.0, 2.0, 2.0), shape=(128, 256, 256)) To control real-world world scale of scans, you can specify ``spacing``, that represents distances in millimeters between adjacent voxels along three axes. The action works in two steps. The first step stands for spacing unification by means of resize, while the second one crops/pads resized scan so that it fits in the supplied shape. You can specify resize parameters and padding mode: .. code-block:: python pipeline = pipeline.unify_spacing(spacing=(3.0, 2.0, 2.0), shape=(128, 256, 256), padding='reflect', engine='pil-simd') So far it was all about ``images``-components, that can be viewed as an **X**-input of a neural network. What about network's target, **Y**-input? Create masks with ``CTImagesMaskedBatch`` ----------------------------------------- Preparing target for network revolves around class ``CTImagesMaskedBatch``. It naturally has one new component - ``masks``. ``Masks`` have the same shape as ``images`` and store cancer-masks of different items in a binary format, where value of each voxel is either **0** (non-cancerous voxel) or **1** (cancerous voxel). ``masks`` can be made in two steps. First, load info about cancerous nodules in a batch with :meth:`~.CTImagesMaskedBatch.fetch_nodules_info`: .. code-block:: python pipeline = ( pipeline .fetch_nodules_info(nodules=nodules) # nodules is a Pandas.DataFrame # containing info about nodules ) Then you can fill the ``masks``-component using the loaded info and action :meth:`~.CTImagesMaskedBatch.create_mask`: .. code-block:: python pipeline = ( pipeline .create_mask() ) .. _sample_crops_from_scan: Sample crops from scan ---------------------- RadIO has :meth:`~.CTImagesMaskedBatch.sample_nodules` that allows to generate batches of small crops, balancing cancerous and non-cancerous examples. Let's start preprocessing with :ref:`resize ` of scans: .. code-block:: python pipeline = ( pipeline .resize(shape=(256, 512, 512)) ) Now all scans have the same shape **(256, 512, 512)**, it is possible to put them into a neural network. However, it may fail for two main reasons: * only small number of scans (say, 3) of such size can be put into a memory of a GPU * typically, there are not so many scans available for training (888 for Luna-dataset). As a result, making only one training example out of a scan is rather wasteful. A more efficient approach is to crop out interesting parts of scans using :meth:`~.CTImagesMaskedBatch.sample_nodules`. E.g., this piece of code .. code-block:: python pipeline = ( pipeline .resize(shape=(256, 512, 512)) .sample_nodules(nodule_size=(32, 64, 64), batch_size=20, share=0.5) ) will generate batches of size **20**, that will contain **10** cancerous and **10** noncancerous crops of shape **(32, 64, 64)**. Or, alternatively this code .. code-block:: python pipeline = ( pipeline .resize(shape=(256, 512, 512)) .sample_nodules(nodule_size=(32, 64, 64), batch_size=20, share=0.6, variance=(100, 200, 200), histo=some_3d_histogram) ) will generate batches of size **20** with **12** cancerous crops. Pay attention to parameters ``variance`` and ``histo`` of :meth:`~.CTImagesMaskedBatch.sample_nodules`: * ``variance`` introduces variability in the location of cancerous nodule inside the crop. E.g., if set to **(100, 200, 200)**, the location of cancerous nodule will be sampled from normal distribution with zero-mean and variances **(100, 200, 200)** along three axes. * ``histo`` allows you to control the positions of noncancerous crops. If ``histo`` set to ``None``, noncancerous crops will be sampled uniformly from scan-boxes of shape **(256, 512, 512)**. Sometimes, though, you may want to sample noncancerous crops from specific regions of lungs - say, the interior of the left lung. In this case you can generate a 3d-histogram (see :func:`numpy.histogram`) concentrated in this region and supply it into ``sample_nodules``-action. Augment data on-the-fly ----------------------- Medical datasets are often small and require additional augmentation to avoid overfitting. For this purpose, it is possible to combine :meth:`~.CTImagesBatch.rotate` and :meth:`~.CTImagesMaskedBatch.central_crop`: .. code-block:: python pipeline = ( pipeline .resize(shape=(256, 512, 512)) .rotate(angle=90, axes=(1, 2), random=True) .central_crop(crop_size=(32, 64, 64)) ) This pipeline first resizes all images to same shape and then samples rotated crops of shape **[32, 64, 64]**; rotation angle is random, from 0 to 90 degrees. Rotation is performed about **z**-axis. Crops are padded by zeroes after rotation, if needed. Accessing Batch components -------------------------- You may want to access ``CTImagesBatch`` or ``CTImagesMaskedBatch``-data directly. E.g., if you decide to write your own :func:`actions `. Batch-classes has such functionality: 3d-scan for an item indexed by ``ix`` from a ``batch`` can be accessed in the following way: .. code-block:: python image_3d_ix = batch.get(ix, 'images') The same goes for other components of item ``ix``: .. code-block:: python spacing_ix = batch.get(ix, 'spacing') Or, alternatively .. code-block:: python image_3d_ix = getattr(batch[ix], 'images') spacing_ix = batch[ix].spacing It is sometimes useful to print indices of all items from a ``batch``: .. code-block:: python print(batch.indices) # batch.indices is a list of indices of all items Writing your own actions ------------------------ Now that you know how to work with components of :class:`~.CTImagesBatch`, you can write your own action. E.g., you need an action, that subtracts mean-values of voxel densities from each scan. You can easily inherit one of batch classes of **RadIO** (we suggest to use :class:`~.CTImagesMaskedBatch`) add make your action ``center`` a method of this class, just like that: .. code-block:: python from RadIO.dataset import action from RadIO import CTImagesMaskedBatch class CTImagesCustomBatch(CTImagesMaskedBatch): """ Ct-scans batch class with your own action """ @action # action-decorator allows you to chain your method with other actions in pipelines def center(self): """ Center values of pixels in each scan from batch """ for ix in self.indices: mean_ix = np.mean(self.get(ix, 'images')) images_ix = getattr(self[ix], 'images') images_ix[:] -= mean_ix return self # action must always return a batch-object You can then chain your action ``center`` with other actions of :class:`~.CTImagesMaskedBatch` to form custom preprocessing pipelines: .. code-block:: python pipeline = (Pipeline() .load(fmt='blosc') # load data .center() # mean-normalize scans .sample_nodules(batch_size=20)) # sample cancerous and noncancerous crops