Pipelines
=========

Pipelines are workflows that greatly simplify deep learning research on CT scans.
Each workflow is a sequence of preprocessing `actions `_, chained in a :class:`pipeline `.

Let us start with a workflow that performs full-scale preprocessing of a dataset
of scans and starts training a model of your choice.

Preprocessing workflow
----------------------

Say you need a workflow that loads scans from disk, resizes them to shape
**[128, 256, 256]**, and prepares batches of **20** cancerous and non-cancerous
crops of shape **[32, 64, 64]**. The straightforward approach is to chain several actions:

.. code-block:: python

    import pandas as pd
    from radio import dataset as ds

    nodules = pd.read_csv('/path/to/annotation/nodules.csv')

    get_crops = (
        ds.Pipeline()
        .load(fmt='raw')
        .fetch_nodules_info(nodules=nodules)
        .unify_spacing(shape=(128, 256, 256), method='pil-simd',
                       padding='reflect', spacing=(1.7, 1.0, 1.0))
        .create_mask()
        .normalize_hu()
        .sample_nodules(nodule_size=(32, 64, 64), batch_size=20, share=0.5)
        .run(batch_size=8, lazy=True, shuffle=True)
    )

A simpler approach is to use the :func:`~.pipelines.get_crops` function, which
manufactures frequently used preprocessing pipelines. With :func:`~.pipelines.get_crops`
you can get a pipeline like the one above in two lines of code:

.. code-block:: python

    from radio.pipelines import get_crops

    pipeline = get_crops(fmt='raw', shape=(128, 256, 256), nodules=nodules,
                         histo=some_3d_histogram, batch_size=20, share=0.6,
                         nodule_shape=(32, 64, 64))

Pay attention to the parameters ``batch_size`` and ``share``: they control the number
of crops in a batch and the share of cancerous crops among them. The parameter ``histo``
controls the distribution used for sampling the locations of random (noncancerous) crops.
Although ``histo`` accepts any 3d-histogram, we advise using the
:ref:`distribution of cancer location <Histocalc>`.
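
To make the roles of ``batch_size``, ``share`` and ``histo`` more concrete, here is a
minimal NumPy-only sketch. It is not RadIO's implementation: the helpers ``split_batch``
and ``sample_from_histo`` are hypothetical names introduced for illustration, the rounding
rule for ``share`` is an assumption, and the histogram is built from made-up nodule centers.

.. code-block:: python

    import numpy as np


    def split_batch(batch_size, share):
        """Return (n_cancerous, n_random) crops for one batch (assumed rounding)."""
        n_cancerous = int(round(batch_size * share))
        return n_cancerous, batch_size - n_cancerous


    def sample_from_histo(histo, size, rng):
        """Draw `size` 3d points distributed according to a 3d-histogram.

        `histo` is a (counts, edges) pair as returned by np.histogramdd.
        """
        counts, edges = histo
        probs = counts.ravel() / counts.sum()
        # pick bins with probability proportional to their counts
        flat = rng.choice(counts.size, size=size, p=probs)
        idx = np.stack(np.unravel_index(flat, counts.shape), axis=1)
        # pick a uniformly distributed point inside each chosen bin
        lower = np.stack([edges[d][idx[:, d]] for d in range(3)], axis=1)
        upper = np.stack([edges[d][idx[:, d] + 1] for d in range(3)], axis=1)
        return lower + rng.random((size, 3)) * (upper - lower)


    rng = np.random.default_rng(0)
    # toy histogram over a (128, 256, 256) scan, built from fake nodule centers
    centers = rng.uniform(low=(0, 0, 0), high=(128, 256, 256), size=(500, 3))
    histo = np.histogramdd(centers, bins=4, range=[(0, 128), (0, 256), (0, 256)])

    n_cancerous, n_random = split_batch(batch_size=20, share=0.5)
    locations = sample_from_histo(histo, size=n_random, rng=rng)
    print(n_cancerous, n_random, locations.shape)

With ``batch_size=20`` and ``share=0.5`` the sketch yields 10 cancerous and 10 random
crops, and the random crop locations follow the distribution stored in the histogram.
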

You can chain ``pipeline`` with additional actions for training, say, ``DenseNoduleNet``:

.. code-block:: python

    pipeline = (
        pipeline
        # the model class is passed positionally: ``class`` is a reserved word
        # in Python and cannot be used as a keyword argument
        .init_model('static', DenseNoduleNet, model_name='dnod_net')
        .train_model(model_name='dnod_net', feed_dict={
            'images': F(CT.unpack, component='images'),
            'labels': F(CT.unpack, component='classification_targets')
        })
    )

    (ctset >> pipeline).run()

Alternatively, you can save the dataset of crops to disk and come back later
to train a network on it:

.. code-block:: python

    pipeline = pipeline.dump('/path/to/crops/')
    (ctset >> pipeline).run()

The resulting pipeline generates about 1500 training examples in one run through
the Luna dataset (one epoch). Working through the pipeline may take a couple of hours,
even on a high-performance machine, because both :meth:`~.CTImagesBatch.unify_spacing`
and :meth:`~.CTImagesBatch.load` are costly operations. For an efficient training
procedure we therefore advise using another workflow, which generates more than 100000
training examples in a single run through the Luna dataset.

**Requirements** for :func:`~.pipelines.get_crops`:

* Dataset of scans in **DICOM** or **MetaImage**.
* ``pandas.DataFrame`` of nodules-annotations in `Luna-format `_.

Faster workflow
---------------

A richer training dataset can be prepared in two steps using two pipeline-creators:
:func:`~.pipelines.split_dump` and :func:`~.pipelines.combine_crops`.

.. _StepFirst:

Step 1
^^^^^^

During the first step you dump large sets of cancerous and non-cancerous crops
into separate folders using :func:`~.pipelines.split_dump`:

.. code-block:: python

    from radio.pipelines import split_dump

    pipeline = split_dump(cancer_path='/train/cancer', non_cancer_path='/train/non_cancer',
                          nodules=nodules)

    (ctset >> pipeline).run()  # one run through Luna; may take a couple of hours

**Requirements** for :func:`~.pipelines.split_dump`:

* Dataset of scans in **DICOM** or **MetaImage**.
* ``pandas.DataFrame`` of nodules-annotations in `Luna-format `_.

Step 2
^^^^^^

You can now combine cancerous and non-cancerous crops from the two folders using
:func:`~.pipelines.combine_crops`. First, associate a :class:`dataset ` with each folder:

.. code-block:: python

    from radio.dataset import Dataset, FilesIndex

    # datasets of cancerous and non-cancerous crops
    cancer_set = Dataset(index=FilesIndex('/train/cancer/*', dirs=True))
    non_cancer_set = Dataset(index=FilesIndex('/train/non_cancer/*', dirs=True))

You can balance crops from the two datasets in any proportion you want:

.. code-block:: python

    from radio.pipelines import combine_crops

    pipeline = combine_crops(cancer_set, non_cancer_set, batch_sizes=(10, 10))

Pay attention to the ``batch_sizes`` parameter of :func:`~.pipelines.combine_crops`:
it defines how many cancerous and non-cancerous crops are included in each batch.
Just like with :func:`~.pipelines.get_crops`, it is easy to add training of *ResNet*
to ``pipeline``:

.. code-block:: python

    pipeline = (
        pipeline
        # the model class is passed positionally: ``class`` is a reserved word
        # in Python and cannot be used as a keyword argument
        .init_model('static', ResNodule3DNet50, model_name='resnet')
        .train_model(model_name='resnet', feed_dict={
            'images': F(CT.unpack, component='images'),
            'labels': F(CT.unpack, component='classification_targets')
        })
    )

    (ctset >> pipeline).run(BATCH_SIZE=12)

**Requirements** for :func:`~.pipelines.combine_crops`:

* Datasets of cancerous and noncancerous crops, prepared by
  :func:`~.pipelines.split_dump` (see :ref:`StepFirst`).

.. _Histocalc:

Calculation of cancer location distribution
-------------------------------------------

Another useful pipeline-creator is :func:`~radio.pipelines.pipelines.update_histo`.
With :func:`~.pipelines.update_histo` you can get a histogram estimate of the
distribution of cancer location inside preprocessed scans:

.. code-block:: python

    import numpy as np
    from radio.pipelines import update_histo

    SHAPE = (400, 512, 512)  # default shape of resize in preprocessing
    ranges = list(zip([0]*3, SHAPE))  # boxes of preprocessed scans
    histo = list(np.histogramdd(np.empty((0, 3)), range=ranges, bins=4))  # init empty 3d-histogram

    pipeline = update_histo(nodules, histo)

Now run a dataset of scans through ``pipeline`` to accumulate information about
cancer location in ``histo``:

.. code-block:: python

    (ctset >> pipeline).run()  # may take a couple of hours

You can now use ``histo`` in the pipeline created by :func:`~.pipelines.get_crops`
to sample batches of cancerous and noncancerous crops:

.. code-block:: python

    pipeline = get_crops(nodules=nodules, histo=histo)

This way, cancerous and noncancerous examples are cropped from similar locations,
which makes the training dataset more balanced.

**Requirements** for :func:`~.pipelines.update_histo`: Dataset of scans in **DICOM**
or **MetaImage** and ``pandas.DataFrame`` of nodules-annotations in `Luna-format `_.
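
The accumulation of cancer locations in ``histo`` can be pictured with the following
NumPy-only sketch. It reuses the empty-histogram initialization shown above, but the
``accumulate`` helper is a hypothetical stand-in, not the code behind
:func:`~.pipelines.update_histo`, and the nodule centers are random placeholders
rather than real annotations.

.. code-block:: python

    import numpy as np

    SHAPE = (400, 512, 512)
    ranges = list(zip([0] * 3, SHAPE))
    # [counts, edges]; kept as a list so the counts can be updated in place
    histo = list(np.histogramdd(np.empty((0, 3)), range=ranges, bins=4))


    def accumulate(histo, centers):
        """Add nodule centers (array of shape (n, 3)) to the running histogram."""
        # reuse the stored bin edges so that every batch is binned identically
        counts, _ = np.histogramdd(centers, bins=histo[1])
        histo[0] += counts


    rng = np.random.default_rng(0)
    for _ in range(3):  # three fake batches of nodule centers, in voxels
        accumulate(histo, rng.uniform(0, SHAPE, size=(5, 3)))

    print(histo[0].shape, histo[0].sum())  # (4, 4, 4) 15.0

Because the bin edges are fixed once and only the counts change, the histogram can be
filled batch by batch without keeping all nodule locations in memory.
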