Pipelines

Pipelines are workflows that greatly simplify deep learning research on CT scans. Each workflow is represented as a sequence of preprocessing actions chained into a pipeline.

Let us start with a workflow that performs full-scale preprocessing over a dataset of scans and trains a model of your choice.

Preprocessing workflow

Say you need a workflow that loads scans from disk, resizes them to shape [128, 256, 256], and prepares batches of 20 cancerous and non-cancerous crops of shape [32, 64, 64]. The straightforward approach is to chain several actions:

import pandas as pd
from radio import dataset as ds

nodules = pd.read_csv('/path/to/annotation/nodules.csv')

crops_pipeline = (
    ds.Pipeline()
      .load(fmt='raw')                              # load scans in MetaImage format
      .fetch_nodules_info(nodules=nodules)          # attach nodule annotations to scans
      .unify_spacing(shape=(128, 256, 256), method='pil-simd',
                     padding='reflect', spacing=(1.7, 1.0, 1.0))  # common shape and voxel spacing
      .create_mask()                                # build binary nodule masks
      .normalize_hu()                               # normalize Hounsfield units
      .sample_nodules(nodule_size=(32, 64, 64), batch_size=20, share=0.5)  # 20 crops, half cancerous
      .run(batch_size=8, lazy=True, shuffle=True)   # lazy: executes once a dataset is attached
)
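
Since the pipeline was created with lazy=True, it executes only when a dataset of scans is attached. Below is a minimal sketch of building such a dataset; the path and the choice of CTImagesMaskedBatch as the batch class are assumptions following RadIO's standard setup:

from radio import CTImagesMaskedBatch
from radio.dataset import Dataset, FilesIndex

# index Luna scans stored as MetaImage (.mhd/.raw) files; the path is hypothetical
luna_index = FilesIndex(path='/path/to/luna/*.mhd', no_ext=True)
ctset = Dataset(index=luna_index, batch_class=CTImagesMaskedBatch)

(ctset >> crops_pipeline).run()  # now the chained actions actually run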

The simpler approach is to use the get_crops() function, which manufactures frequently used preprocessing pipelines. With get_crops() you can build the pipeline written above in two lines of code:

from radio.pipelines import get_crops
pipeline = get_crops(fmt='raw', shape=(128, 256, 256), nodules=nodules, histo=some_3d_histogram,
                     batch_size=20, share=0.5, nodule_shape=(32, 64, 64))

Pay attention to the parameters batch_size and share: they control the number of items in a batch of crops and the fraction of cancerous ones (with batch_size=20 and share=0.5, each batch contains 10 cancerous and 10 non-cancerous crops). The parameter histo controls the distribution used for sampling the locations of random (non-cancerous) crops. Although histo accepts any 3d histogram, we advise using the distribution of cancer locations (see the last section on update_histo()). You can chain the pipeline with additional actions for training a model, say, DenseNoduleNet:

from radio import CTImagesMaskedBatch as CT
from radio.dataset import F
from radio.models import DenseNoduleNet

pipeline = (
    pipeline
    .init_model('static', model_class=DenseNoduleNet, model_name='dnod_net')
    .train_model(model_name='dnod_net', feed_dict={
        'images': F(CT.unpack, component='images'),                  # unpack crop images
        'labels': F(CT.unpack, component='classification_targets')   # unpack binary cancer labels
    })
)
(ctset >> pipeline).run()

Alternatively, you can save the dataset of crops to disk and get back to training a network on it later:

pipeline = pipeline.dump('/path/to/crops/')
(ctset >> pipeline).run()
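
To get back to the dumped crops later, index the dump folder and load the crops from disk. A minimal sketch, assuming dump() saved the crops in RadIO's blosc format with one directory per crop:

# dataset over dumped crops; dirs=True because each crop is a folder
crops_set = Dataset(index=FilesIndex(path='/path/to/crops/*', dirs=True),
                    batch_class=CTImagesMaskedBatch)

train_pipeline = (
    ds.Pipeline()
      .load(fmt='blosc')    # read the dumped crops back from disk
      .run(batch_size=20, lazy=True, shuffle=True)
)
(crops_set >> train_pipeline).run()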

The created pipeline will generate roughly 1500 training examples in one run through the Luna dataset (one epoch). Working through the pipeline may take a couple of hours even on a high-performance machine, because both unify_spacing() and load() are costly operations.

That being said, for an efficient learning procedure we advise another workflow, which can generate more than 100000 training examples from a single run through the Luna dataset.

Requirements for get_crops(): a dataset of scans in DICOM or MetaImage format and a pandas.DataFrame of nodule annotations in Luna format.

Faster workflow

A richer training dataset can be prepared in two steps using two pipeline creators: split_dump() and combine_crops().

Step 1

During the first step you dump large sets of cancerous and non-cancerous crops into separate folders using split_dump():

from radio.pipelines import split_dump
pipeline = split_dump(cancer_path='/train/cancer', non_cancer_path='/train/non_cancer',
                      nodules=nodules)
(ctset >> pipeline).run()  # one run through Luna; may take a couple of hours

Requirements for split_dump(): a dataset of scans in DICOM or MetaImage format and a pandas.DataFrame of nodule annotations in Luna format.

Step 2

You can now combine cancerous and non-cancerous crops from the two folders using combine_crops(). First, associate a dataset with each folder:

# datasets of cancerous and non-cancerous crops
cancer_set = Dataset(index=FilesIndex(path='/train/cancer/*', dirs=True),
                     batch_class=CTImagesMaskedBatch)
non_cancer_set = Dataset(index=FilesIndex(path='/train/non_cancer/*', dirs=True),
                         batch_class=CTImagesMaskedBatch)

You can balance crops from the two datasets in any proportion you want:

from radio.pipelines import combine_crops
pipeline = combine_crops(cancer_set, non_cancer_set, batch_sizes=(10, 10))

Pay attention to the batch_sizes parameter of combine_crops(): it defines how many cancerous and non-cancerous crops are included in each batch. Just like with get_crops(), it is easy to add training of, say, a ResNet-based model to the pipeline:

from radio.models import ResNodule3DNet50

pipeline = (
    pipeline
    .init_model('static', model_class=ResNodule3DNet50, model_name='resnet')
    .train_model(model_name='resnet', feed_dict={
        'images': F(CT.unpack, component='images'),
        'labels': F(CT.unpack, component='classification_targets')
    })
)
(ctset >> pipeline).run(batch_size=12)

Requirements for combine_crops(): datasets of cancerous and non-cancerous crops prepared by split_dump() (see Step 1).

Calculation of cancer location distribution

Another useful pipeline creator is update_histo(). With update_histo() you can obtain a histogram estimate of the distribution of cancer locations inside preprocessed scans:

import numpy as np
from radio.pipelines import update_histo

SHAPE = (400, 512, 512)  # default shape of resize in preprocessing
ranges = list(zip([0] * 3, SHAPE))  # bounding box of preprocessed scans
# initialize an empty 3d histogram: a [counts, bin_edges] pair
histo = list(np.histogramdd(np.empty((0, 3)), range=ranges, bins=4))

pipeline = update_histo(nodules, histo)

It is time to run a dataset of scans through the pipeline and accumulate information about cancer locations in histo:

(ctset >> pipeline).run() # may take a couple of hours
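
After the run, histo holds the accumulated counts in the format returned by np.histogramdd. A small sketch of inspecting it (the variable names are just for illustration):

counts, bin_edges = histo
print(counts.shape)              # (4, 4, 4): nodule counts per spatial bin
probs = counts / counts.sum()    # empirical probability of cancer location per bin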

You can now pass histo to get_crops() to sample batches of cancerous and non-cancerous crops:

pipeline = get_crops(nodules=nodules, histo=histo)

This way, cancerous and non-cancerous examples will be cropped from similar locations, which, of course, makes training datasets more balanced.

Requirements for update_histo(): a dataset of scans in DICOM or MetaImage format and a pandas.DataFrame of nodule annotations in Luna format.