mehta-lab / microdl


3D virtual staining with 2D and 2.5D U-Nets

License: BSD 3-Clause "New" or "Revised" License

Python 25.29% Jupyter Notebook 74.67% Shell 0.03%

deep-learning label-free phase polarization u-net-keras u-net-pytorch virtual-staining pytorch

microdl's People

Contributors


bryantchhun leanhphuong201 johannarahm smguo genevieve-ceo ziw-liu nur1225

microdl's Issues

Option to add masks as input

Some ground truth masks will be generated by some manual or semi-automatic means when a simple thresholding isn't enough.
microDL should be able to accept a mask directory as an input and add that to the training data seamlessly.

Allowing user specified masks

Parameter "mask_dir" seems to have a different meaning in inference_dataset.py from training and preprocessing. From the docstring:
":param str/None mask_dir: If inference targets are masks stored in a
different directory than the image dir. Assumes the directory contains
a frames_meta.csv containing mask channels (which will be target channels
in the inference config) z, t, p indices matching the ones in image_dir"

It seems to me here mask_dir is actually target_dir, but in training time mask_dir refers to the folder with binary or float maps used for weighted loss calculation.

I suggest we use "target_dir" instead of "mask_dir" in inference_dataset.py to avoid confusion. Also separate "data_dir" into "input_dir" and "target_dir" in the training and preprocessing pipeline to be consistent to this change, since currently the input and target images are assumed to be always in the same directory.

Make flat fielding step prior to normalization and before generating frames

Flat fielding should be performed on raw data before normalization, so if possible we should make flat fielding available outside of preprocessing. On the other hand, it is something that should be corrected immediately after acquisition, so this might not need to be addressed right away. Documenting discussion with @smguo.

Update inference_tiles to work with new data format

Update version of Matplotlib

Version 3.0.3 is buggy when adding color bars in subplots. Updating to 3.1.1 solved the issue

Inference image shape and pixel values

When applying inference to images of shape 2562x2562 px² they are cropped to 2048x2048 px². Only the inferred image is saved and this makes it impossible to further compare ground truth and input images to the inferred images as the exact cropping area is unknown.

Furthermore, the inferred image does not have the same dynamic range as the ground truth image. In the inference figure, both target and prediction have pixel values ranging up to 33K. However, the ground truth image only has pixel values up to 280. The inferred image is stored with values up to 33K.

Both scenarios make it hard to further compare ground truth and inferred image outside of microDL. Could we think of a strategy to solve this?

Pixel values of target image

Inference figure showing different pixel values for target image

Commit 151cc25 master branch

Upgrade to Tensorflow2 keras or Pytorch

Keras is now part of Tensorflow2's API. Switching to tf.keras will make it easier to maintain the code base and add new models.

Testing of pytorch_implementation branch

I've made this into an issue so we can have a coherent thread in which we can reference parts of the code and I can answer questions about the workflow.

I have just done some preliminary testing:

I've been able to complete an entire data cycle (preprocessing training and inference) with 2d and 3d data using a 2d and 2.5d network respectively completely on Bruno.
I am using an environment called microdl_torch in the comp_micro group environments
The config files for this testing can serve as examples, and can be found at:
/hpc/projects/CompMicro/projects/virtualstaining/torch_microDL/config_files/2022_09_27_A549_NuclStain/09_30_2022_15_09/ (for the 2d data and network)
/hpc/projects/CompMicro/projects/virtualstaining/torch_microDL/config_files/2019_02_15_KidneyTissue_DLMBL_subset/09_30_2022_12_06/ (for the 3d data and 2.5d network)

There are more thorough instructions and documentation of the new torch config file in the branch's micro_dl/torch_unet/readme.md. The general 'gist' of the PyTorch workflow is that right now we still need valid preprocessing, training, and inference configs for some set of data, but most of (see example training configs) the parameters pertaining to training and model initiation can be ignored, and are instead part of the torch_config.yml file. This is clunky, but it is only a temporary solution for testing, and once we've determined the integrity of the PyTorch models we can phase out the other config files.

Implement valid convolution in pytorch unet

The PyTorch 2.5D Unet currently uses same convolution, to simplify information-matching in the concatenation of skip connections. This introduces unreliable values into the edges of inference outputs from the network. Valid convolution is superior in this aspect, but faces a few challenges:

Because our inputs are high-dimensional in x and y but low-dimensional in z, valid convolution must only be implemented in the x and y dimensions. This can be done by specifying padding dimensions in pytorch
Because of the downsampling of the encoding path, valid convolution on the bottom layers imposes a minimum size constraint on the input, as you cannot pass the threshold at which the convolution shrinks the spatial dimensionality more than 2x upsampling can compensate for.
More importantly, valid convolution inflicts greater variation in the spatial sizes of the tensors passed through subsequent convolutional blocks than those passed through skip connections. In other words, tensors preserved for concatenation in the skip connections don't match their counterparts on the upsampling path. Some broken code has already been written for this in the pytorch implementation branch.
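The size bookkeeping behind the last two points can be checked with plain arithmetic. The sketch below is not taken from the branch. It assumes a hypothetical U-Net with 3x3 valid convolutions, two convolutions per block, 2x downsampling and 2x upsampling, and it tracks a single spatial dimension (x or y). The function name and its parameters are illustrative only.

```python
def valid_unet_sizes(input_size, num_levels=4, convs_per_block=2, kernel=3):
    """Track one spatial dimension through a U-Net built from valid convolutions.

    Returns (output_size, crops). crops[i] is the total number of pixels to
    crop from the i-th skip tensor (deepest first) so that it matches the
    upsampled tensor it is concatenated with.
    """
    shrink = convs_per_block * (kernel - 1)  # pixels lost per conv block
    size = input_size
    skip_sizes = []
    for _ in range(num_levels):  # encoder path
        size -= shrink
        if size <= 0 or size % 2:
            raise ValueError("input too small or not evenly downsampled")
        skip_sizes.append(size)
        size //= 2  # 2x downsampling
    size -= shrink  # bottleneck block
    if size <= 0:
        raise ValueError("input too small for the bottleneck block")
    crops = []
    for skip in reversed(skip_sizes):  # decoder path
        size *= 2  # 2x upsampling
        crops.append(skip - size)  # mismatch the skip connection must absorb
        size -= shrink
        if size <= 0:
            raise ValueError("input too small for the decoder path")
    return size, crops


if __name__ == "__main__":
    print(valid_unet_sizes(572))
```

Under these assumptions a 572-pixel input leaves 388 pixels at the output, and the skip tensors need to be cropped by 8, 32, 80 and 176 pixels in total (deepest level first) before concatenation. A 64-pixel input fails because the third encoder level would have an odd size.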

configuration files need more documentation

It's not clear what any of the fields in the configuration .yml files do.

For example, for config_preprocess, what is the relationship between tile: depths, tile: mask-depth, channel_ids and masks: channels?

Additionally, each step of the process (generate_meta, preprocess, etc.) produces files or folders, but it's not clear what those are or what purpose they serve.

Split of microDL pipeline usage on HPC and power9 machines

I have been using microDL by splitting the workflow across HPC cluster and the docker environment on power i9 machine. I have been predominantly using microDL to perform virtual staining using phase images. The initial data preparation and preprocessing is performed on HPC, and training and inference steps on the docker on power i9.

The primary input to the pipeline is the stack of brightfield images captured on the microscopes. The first step is phase image reconstruction from the brightfield image stacks using the waveorder library, after which the stacks are aligned so that the focal plane is centered at every imaged position. The reconstructed phase images are the input to the microDL preprocessing step. This workflow benefits from occasional visual checks of the resulting images by the user, which is easier with napari and ImageJ (best used on HPC).

The next steps, training and inference, are performed in the docker environment on the power i9 machine. These steps currently cannot be run elsewhere because of the unstable environment on HPC and the lower disk space availability on machines like fry2 (very slow training speed). So, they are currently best run in the docker environment.

ImageValidator doesn't accept images with file name format "image_n0_z0.npy"

merge data command

Documenting discussion with @smguo Basically one might have training data across different folders from different datasets. microdl must have a way to copy and merge all the 3 frames_meta.csv files for the 3 datasets and form one folder with one dataset for training and one csv file.
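A minimal sketch of the metadata half of such a merge command, using only the standard library. It is not an existing microDL command: the function name and the added `source_dataset` column are hypothetical, and copying or linking the image files themselves is left out. Index columns from different datasets may collide after merging, so a real implementation would likely need to re-index them.

```python
import csv


def merge_frames_meta(csv_paths, out_path, source_column="source_dataset"):
    """Concatenate several frames_meta.csv files into one CSV.

    Columns are the union of all input columns, and a (hypothetical)
    source column records which file each row came from.
    Returns the number of merged rows.
    """
    rows = []
    fieldnames = []
    for path in csv_paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                row[source_column] = path
                rows.append(row)
    if source_column not in fieldnames:
        fieldnames.append(source_column)
    with open(out_path, "w", newline="") as f:
        # restval fills columns that a given dataset does not have
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```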

train_script.py doesn't create model dir if it doesn't exist

Adding this to def run_action(args): will solve the problem:
if not os.path.exists(config['trainer']['model_dir']):
    os.makedirs(config['trainer']['model_dir'])
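The same check can be written without the explicit existence test. This is a small sketch, assuming the config is a nested dict with the `config['trainer']['model_dir']` key quoted above; the helper name `ensure_model_dir` is hypothetical.

```python
import os


def ensure_model_dir(config):
    """Create the model directory named in the config if it is missing."""
    model_dir = config['trainer']['model_dir']
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(model_dir, exist_ok=True)
    return model_dir
```

Using `exist_ok=True` also avoids the race in which another process creates the directory between the check and the `makedirs` call.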

train_script.py doesn't output error when all GPUs are occupied

Issues with testing Zarr reader branch

In testing Jenny's zarr reading PR I've come across multiple bugs. I believe more thorough testing is also needed to make sure that the issues weren't with my testing scripts. Here they are in order of urgency:

The reader determines whether to use zarr or tiff image reading by the presence of a frames_meta.csv metadata file in the image directory. However, after preprocessing it generates this file and leaves it in the directory. The next time the zarr folder is preprocessed (perhaps with different parameters) the preprocessing breaks, because it sees the existing metadata of the last preprocessing cycle and tries to read tiff files (when the data is in fact zarr format). There should be some other way to determine whether we are reading from zarr or not.
It seems that I am unable to run preprocessing (of any two directories of either file format) twice consecutively in the same script. Some very strange behavior happens where assertions about the config's flat_field and 'mask_dir' parameters fail when they definitely should not (they detect some parameter that doesn't exist, etc). This should be looked into more.
I cannot seem to generate masks from certain channels in .zarr files. I've traced the problem down to a ValueError: cannot convert float NaN to integer when attempting to read the channel_idx from the auto-generated frames_meta.csv metadata. Perhaps the typing of that parameter is incorrect when autogenerated.
If a user generates the frames_meta.csv metadata while mounted differently (for example if they are accessing a data folder through /gpfs/projects/CompMicro.. or /home/their.user/CompMicro/..), then running preprocessing on that data from any other mounting will fail, as the metadata captures that mounting's absolute path as the path to the reconstructed images in each folder. I had to write a script to go in and convert these paths to /hpc/projects/CompMicro/.. before using the data to test preprocessing.
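For the first item in this list, one possible alternative is to decide the format from the store itself instead of from the presence of frames_meta.csv. The sketch below is an assumption, not the reader's current logic: it treats a directory as zarr if its name ends in `.zarr` or if it contains one of the metadata files that zarr stores normally carry (`.zgroup` or `.zarray` for zarr v2, `zarr.json` for v3), and falls back to tiff otherwise. The function name is hypothetical.

```python
import os

# Metadata files found at the root of a zarr store (v2 and v3).
ZARR_MARKERS = (".zgroup", ".zarray", "zarr.json")


def detect_input_format(data_dir):
    """Guess the input format without relying on frames_meta.csv."""
    if data_dir.rstrip("/").endswith(".zarr"):
        return "zarr"
    for marker in ZARR_MARKERS:
        if os.path.exists(os.path.join(data_dir, marker)):
            return "zarr"
    # Assumption: anything that is not a zarr store is a tiff directory.
    return "tiff"
```

With this kind of check, a frames_meta.csv left behind by an earlier preprocessing run no longer changes which reader is chosen.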

There were a few other issues I ran into, but I believe they might have been symptoms of these main ones. Hopefully fixing these will allow for smooth usage of the .zarr reading.
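The mount-path problem described above could be handled by rewriting path prefixes when the metadata is read. The helper below is a hypothetical sketch, not the script mentioned in this issue. Which column of frames_meta.csv holds the absolute paths is not stated here, so the function only operates on a single path string.

```python
def remap_mount(path, old_prefixes, new_prefix):
    """Replace a known mount prefix at the start of path with new_prefix.

    Paths that start with none of the old prefixes are returned unchanged.
    """
    new_prefix = new_prefix.rstrip("/")
    for old in old_prefixes:
        old = old.rstrip("/")
        # Match whole path components only, so /a/b does not match /a/bc
        if path == old or path.startswith(old + "/"):
            return new_prefix + path[len(old):]
    return path
```

For example, `remap_mount("/gpfs/projects/CompMicro/data/img.tif", ["/gpfs/projects/CompMicro"], "/hpc/projects/CompMicro")` returns `"/hpc/projects/CompMicro/data/img.tif"`.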

The scripts I have used for this testing can be found in /hpc/projects/comp_micro/projects/virtualstaining/testing_zarr_scripts.

Add docstrings to config yaml files

The example config files currently don't have any docstrings to explain the function of each parameter. I suggest adding docstrings in the yaml files as in this example:

revise documentation for microDL (dependencies: tensorflow 1.13, keras 2.1.6)

The current microDL codebase needs refreshed documentation to ease the onboarding of new users to the current pipeline and to guide the transition of the code to tensorflow 2.0 or pytorch (#141).

The structure of documentation we are aiming for is as follows: Simplify readme to describe installation and how to use CLI for preprocessing, training, inference. The readme also provides an index of other documentation. Most of the documentation will be centralized in the docs folder, containing

the notebook used for image translation exercise at the deep learning course. The exercise should be updated in following ways:
- a graphical abstract near the top.
- note checkpoints that clarify when and how the user should examine the data and model before proceeding.
- (nice to have) illustrate how to write a custom training loop.
annotated list of configs used for specific projects:
- virtual staining as reported in our paper with 2D, 2.5D, and 3D U-Nets.
- refinements made for robust virtual staining to train models with data acquired during March/April 2022.
- models that turn out to be valuable.
summary of algorithms and approaches implemented in the code base. A good example is description of learning rate scheduler.

Since the docs folder is versioned with the repo, this approach will allow us to version the documentation along with the code.

Docker image

Add tmux
Add requirements.

document inference script and module

config_3D.yml

Can you share the config_3D.yml for preprocess,train,and inference script.py ?

enhancements bucket list

Is there an easier way to utilize all available GPUs for training?
Freeze and restart training (save everything related to training: weights, gradient, state etc and reload)
create config from template
consolidate model performance metrics from various runs

track metadata of computational experiments

We will use this google sheet to track the paths and config files of computational experiments, along with the imaging experiments.

Key decisions:

the table can be succinct - a lot of metadata exists in the config files.
log the commit id of codebase with which model was trained. It can be a github tag if one of the releases was used.
The metadata should be machine readable, e.g., readable as a pandas dataframe.
Organize the input data and preprocessed tiles within projects/virtualstaining/tf_microDL and torch_microDL folders.
Organize the configs in projects/virtualstaining/tf_microDL/config_files and projects/virtualstaining/torch_microDL/config_files. Many configs are currently saved in CompMicro/software/configfiles/microDL.
Experiment ID should never change once assigned.

Add feedback in preprocess.py

Feature request: Add some sort of feedback (like a progress bar) while preprocess.py is running. This would be more user-friendly, as the script takes a while.
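A minimal sketch of such feedback using only the standard library. The generator name `report_progress` is hypothetical and nothing here reflects how preprocess.py is actually structured; it simply wraps any sequence of work items and prints a counter. A third-party progress-bar library such as tqdm would be a ready-made alternative.

```python
import sys


def report_progress(items, total=None, label="preprocess", stream=sys.stderr):
    """Yield items unchanged while writing a running counter to stream."""
    if total is None:
        total = len(items)
    for done, item in enumerate(items, start=1):
        yield item
        # The carriage return rewrites the same terminal line on each update
        stream.write("\r{}: {}/{} ({:.0f}%)".format(
            label, done, total, 100 * done / total))
        stream.flush()
    stream.write("\n")
```

Usage would look like `for fname in report_progress(file_names): process(fname)`, where `file_names` and `process` stand in for whatever the preprocessing loop iterates over.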

if metrics not provided, not handled in trainer. Pass random_seed to dataset and training_table

callbacks = self._get_callbacks()
if 'metrics' in self.config:
    metrics_list = self.config['metrics']
    metrics = get_metrics(metrics_list)
    self._compile_model(loss, optimizer, metrics)
else:
    self._compile_model(loss, optimizer, metrics=None)

gpu_id is an int whereas mem_frac is a list
run_action(args, gpu_id, gpu_mem_frac[0])

if loss_is_masked:
    if 'metrics' in self.config:
        masked_metrics = [metric(self.num_target_channels)
                          for metric in metrics]
        self.model.compile(loss=masked_loss(loss,
                                            self.num_target_channels),
                           optimizer=optimizer,
                           metrics=masked_metrics)
    else:
        self.model.compile(loss=masked_loss(loss,
                                            self.num_target_channels),
                           optimizer=optimizer,
                           metrics=None)

Unet_stack_to_2D
@property
def _get_input_shape(self):
    """Return shape of input"""

    if self.config['data_format'] == 'channels_first':
        shape = (1,
                 self.config['depth'],
                 self.config['height'],
                 self.config['width'])
    else:
        shape = (self.config['depth'],
                 self.config['height'],
                 self.config['width'], 1)
    return shape

The number of channels is hardcoded to 1 and needs to be fixed.
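One way to remove the hardcoded value is to read the channel count from the config. The sketch below is a standalone function rather than the class property, and the `num_input_channels` key is a hypothetical name, not a confirmed microDL config field. It defaults to 1 so that existing configs keep their current behavior.

```python
def get_input_shape(config):
    """Return the input shape with a configurable number of channels."""
    # 'num_input_channels' is an assumed key; default keeps old behavior
    num_channels = config.get('num_input_channels', 1)
    spatial = (config['depth'], config['height'], config['width'])
    if config['data_format'] == 'channels_first':
        return (num_channels,) + spatial
    return spatial + (num_channels,)
```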

memory mapping the tiles used for training

we have redundant copies of training data after tiling. @smguo suggested we should make use of numpy memmap - https://github.com/czbiohub/bam2fasta/blob/master/bam2fasta/np_utils.py#L5 to write these tiles.

Add preprocessing_script tests

This should act as an integration test, and test all kinds of processing; resize, mask, flatfield, tile, for uniform and non-uniform data.

This will need to be broken down into several PRs, maybe by uniform/nonuniform.

Add SSIM as an option for loss + metrics

https://www.tensorflow.org/api_docs/python/tf/image/ssim
https://stackoverflow.com/questions/48744945/keras-ms-ssim-as-loss-function

Branch "Infer_on_large_image": shape mismatch errors when input image dimension is not power of 2

Proposed fix: add a cropping block in _read_one
def _read_one(tp_dir, channel_ids, fname, flat_field_dir=None):
    """Read one image set

    :param str tp_dir: timepoint dir
    :param list channel_ids: list of channels to read from
    :param str fname: fname of the image. Expects the fname to be the same
     in all channels
    :param str flat_field_dir: dir where flat field images are stored
    :returns: np.array of shape nb_channels, im_size (with or without
     flat field correction)
    """

    cur_images = []
    for ch in channel_ids:
        cur_fname = os.path.join(tp_dir,
                                 'channel_{}'.format(ch),
                                 fname)
        cur_image = np.load(cur_fname).astype(np.float32)
        if flat_field_dir is not None:
            ff_fname = os.path.join(flat_field_dir,
                                    'flat-field_channel-{}.npy'.format(ch))
            ff_image = np.load(ff_fname)
            cur_image = image_utils.apply_flat_field_correction(
                cur_image, flat_field_image=ff_image)
        cur_image = zscore(cur_image)
        # Minimally crop the image from the center so that each cropped
        # dimension is a power of 2. crop_shape holds half of that size.
        image_shape = np.asarray(cur_image.shape)
        crop_shape = 2 ** (np.floor(np.log2(image_shape)) - 1)
        # Use the builtin int: the np.int alias was removed in NumPy 1.24
        crop_shape = crop_shape.astype(int)
        ctr = np.floor(image_shape / 2).astype(int)
        cur_image = cur_image[ctr[0] - crop_shape[0]:ctr[0] + crop_shape[0],
                              ctr[1] - crop_shape[1]:ctr[1] + crop_shape[1]]

        cur_images.append(cur_image)
    cur_images = np.stack(cur_images)
    return cur_images
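The crop arithmetic in the proposed fix can be checked in isolation. The sketch below is a pure-Python restatement under the same assumption (a centered crop to the largest power of 2 that fits each dimension); the function name is hypothetical. Returning the bounds instead of the cropped array also makes it possible to store them, so that inputs and targets can be cropped to the same region later.

```python
def center_crop_pow2_bounds(shape):
    """Return (start, stop) per dimension for a centered power-of-2 crop."""
    bounds = []
    for n in shape:
        if n < 1:
            raise ValueError("dimensions must be positive")
        crop = 1 << (n.bit_length() - 1)  # largest power of two <= n
        start = (n - crop) // 2
        bounds.append((start, start + crop))
    return bounds


if __name__ == "__main__":
    print(center_crop_pow2_bounds((2562, 2562)))
```

For a 2562x2562 image this gives rows and columns 257 to 2305, that is, a 2048x2048 crop.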

Add test dataset

How about adding or referencing a test dataset with which the pipeline can be tried out? This would allow external users to clearly see which data structure is required and make the adaptation to their dataset and setup easier.

Predicted images generated by Inference_script.py don't have the full dimension as the target images

Predicted images don't have the full dimension as the targets

Integration tests are currently not running on travis - port to github

@smguo @mattersoflight Tagging here in this issue as you have mentioned Integration testing for the repo. I would think GitHub's actions workflows might be better.

z-alignment of different modalities

Hi all,

I wanted to prepare training data that cover different modalities, so we can compare the impact of phase vs brightfield and deconvolved vs not deconvolved channels. However, I get an unexpected behavior of the z-alignment script. The script simply copies all single page tif file images into the new aligned folder, without aligning them (from the 97 z-ids per position all z-ids are saved in the aligned folder).

I plan to prepare the following training data combinations:

Phase & non deconvolved fluorescence channels
Brightfield & deconvolved fluorescence channels
Brightfield & non deconvolved fluorescence channels
Phase & deconvolved fluorescence channels --> these are the modalities the current model is trained on, z-alignment works

The scripts are located here:
/gpfs/CompMicro/projects/HEK/2022_03_15_orgs_nuc_mem_63x_04NA/all_pos_single_page/

align_z_focus_2022_03_15_only_phase_refmem_min25_max60.py
align_z_focus_2022_03_15_only_deconv_refmem_min25_max60.py
align_z_focus_2022_03_15_raw_data_refmem_min25_max60.py
align_z_focus_2022_03_15_refmem_min25_max60.py. -> this script works! Aligned images are saved at /gpfs/CompMicro/projects/HEK/2022_03_15_orgs_nuc_mem_63x_04NA/all_pos_single_page/all_pos_Phase1e-3_Denconv_Nuc8e-4_Mem8e-4_pad15_bg50_registered_refmem_min25_max60

Furthermore, the alignment script takes quite a while to run through. If anyone finds time it would be great to speed the script up with e.g. multiprocessing.

I ran the scripts on hulk in a microDL docker container with the command python <script name>

I would highly appreciate if anyone finds time to debug this issue! Please let me know if there are open questions / if I can help with anything.

Best,
Johanna

Add config parameter checks

Make sure settings are compatible with model selection, or (even better) remove settings from config file and automatically set them based on model selection.

E.g. dataset -> squeeze should be set to true for 2D models and false otherwise.
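A sketch of what such a check could look like for the squeeze example. The config layout is an assumption: the `dataset` -> `squeeze` setting comes from this issue, while the `network` -> `class` key and the model name `UNet2D` are hypothetical placeholders for however the model selection is stored.

```python
def check_squeeze_setting(config, models_2d=("UNet2D",)):
    """Raise if dataset.squeeze does not match the selected model type.

    Returns the squeeze value the model selection implies.
    """
    model_class = config["network"]["class"]  # assumed key name
    expected = model_class in models_2d
    actual = config["dataset"].get("squeeze", False)
    if actual != expected:
        raise ValueError(
            "dataset.squeeze should be {} for model {}, got {}".format(
                expected, model_class, actual))
    return expected
```

The stronger variant suggested above would drop `squeeze` from the config entirely and use the returned value directly.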

find optimal # of z-slices for image translation (2.5D label-free -> 2D fluorescence)

What is the optimal receptive field along the z dimension for virtual staining? It should be similar to the receptive field along XY, because biological structures are not aligned in cartesian coordinates. However, optical resolution along Z is 2x lower than XY resolution.

The accuracy of prediction may improve when information outside optical resolution limit is used.
Imaging is typically done with 5x5 pixels within the XY resolution limit of microscope and 5 slices within the Z resolution limit.

inference 2.5D model bug

Hi, running inference with a 2.5D model results in an error related to the dimensions of the pred_stack (and target_stack) variable at inference->image_inference.py->predict_2d(). A 2D model yields stacks in the format (ZCYX), which can be transposed to (CYXZ) in the predict_2d function. A 2.5D model yields stacks in the format (ZC?YX), which leads to the dimensionality error. I also attached an image of the preprocessing config of the 2.5D single-channel model, to show that 2D tiles were correctly configured.

Paths to used configs

2.5D model single channel: /gpfs/CompMicro/software/configfiles/Johanna/micro_DL_copy/tests/config_inference_2021_04_20_HEK_OC43_widefield_pool_25UNet_DAPI_5.yml
2.5D model multi channel: gpfs/CompMicro/software/configfiles/Johanna/micro_DL_copy/tests/config_inference_2021_04_20_HEK_OC43_widefield_pool_25UNet_DAPI_MTub_5.yml
2D model single channel: /gpfs/CompMicro/software/configfiles/Johanna/micro_DL_copy/tests/2D_model/config_inference_2021_04_20_HEK_OC43_widefield_M1_DAPI.yml

Error message

Preprocessing config of 2.5D model single channel

Preprocessing is slow when overwriting existing dir

If preprocessing is called using an existing directory it is much slower (doesn't use multiprocessing).
And it doesn't check if all channels are already there, it runs again even if all preprocessed data is already there.

Trainer.py doesn't use multiple workers for training when workers >1

use_multiprocessing option needs to be True for model.fit_generator to initiate multiple generator instances

Nice post demonstrating behaviors of model.fit_generator with and without multiprocessing:
https://keunwoochoi.wordpress.com/2017/08/24/tip-fit_generator-in-keras-how-to-parallelise-correctly/

Will fix this in the next PR

aux_utils.get_row_idx

update methods that use aux_utils.get_row_idx to take pos_idx as a param or
replace with aux_utils.get_meta_idx

model_inference.py, gen_mask_seg.py (ignore), estimate_flat_field.py

remove 'tune_hyperparam' related options in train

Switch from keras to tf.keras

Convert networks/conv_blocks/pad_channels() and _crop_layer() to custom layers in keras to avoid serialization errors when saving model weights

Branch "Infer_on_large_image":"predict_on_larger_image" in model_inference.py only supports single channel input

if num_dims == 2:
    if config['network']['data_format'] == 'channels_first':
        im_shape = [1, 1, im_size[0], im_size[1]]
    else:
        im_shape = [1, im_size[0], im_size[1], 1]
elif num_dims == 3:
    if config['network']['data_format'] == 'channels_first':
        im_shape = [1, 1, im_size[0], im_size[1], im_size[2]]
    else:
        im_shape = [1, im_size[0], im_size[1], im_size[2], 1]

[Regression] Predicted images need to be "un-z-scored" for correct SSIM calculation

SSIM of the prediction and the target is currently estimated by z-scoring the target, but this makes the luminance term always zero and underestimates SSIM. Instead, the prediction should be scaled back to the same range as the target before computing metrics.

Note that this doesn't affect Pearson correlation metrics as it's scale invariant. This doesn't apply to segmentation output which is always in [0, 1]
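The rescaling itself is the inverse of the z-score: if z = (x - mean) / std, then x = z * std + mean. The sketch below illustrates this on plain lists with the standard library. It assumes the mean and standard deviation used for normalization are those of the target image; which statistics microDL actually stores is not stated in this issue, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Return z-scored values together with the mean and std that were used."""
    mean = fmean(values)
    std = pstdev(values)  # assumes the values are not all identical
    return [(v - mean) / std for v in values], mean, std


def unzscore(z_values, mean, std):
    """Map z-scored values back to the original intensity range."""
    return [z * std + mean for z in z_values]
```

Applying `unzscore` to the prediction with the target's mean and std puts both images in the same intensity range, so the mean-dependent luminance term of SSIM is computed on comparable values.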

Add 3D data augmentations

ModelCheckpoint monitor is train_loss in the config file template

It won't save the trained model by default as train_loss always decreases. Better to change it to val_loss

Dynamic importing of classes and logging

How to dynamically import classes across all sub-modules

add them to __init__.py for each module
for the input module the classes are well separated by functionality, there we use import from specific file
any other options?

Logging

should we log to a global file or individual loggers for modules?
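The two options are not mutually exclusive with the standard `logging` module: each module can own a named logger while a single handler on the package logger writes everything to one file. The sketch below shows that pattern. The package logger name `micro_dl` matches the package directory, but the submodule name `micro_dl.preprocessing` and the helper `configure_logging` are hypothetical.

```python
import logging


def configure_logging(log_path, level=logging.INFO):
    """Attach one file handler to the package logger and return it."""
    package_logger = logging.getLogger("micro_dl")
    package_logger.setLevel(level)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    package_logger.addHandler(handler)
    return handler


# In each module: a child logger whose records propagate to the package
# logger, so they end up in the shared file tagged with the module name.
logger = logging.getLogger("micro_dl.preprocessing")
```

Calling `configure_logging` once in the CLI script would then be enough. Modules only call `logger.info(...)` and never deal with files themselves.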

Add support for Zarr as input format

microDL currently only accepts single-page tiff as input format, but the microscopy community is switching to the zarr format. Adding support for zarr will reduce data conversion and duplication.

Modifying inference to apply trained model on data with different channel order

How can I modify the inference configuration to use a trained model on a dataset which has different order of channel than the dataset which was used for training?
For instance, consider two datasets A and B. The channels of the datasets were captured in the following order:
Dataset A : nucleus =channel 0, membrane =channel 1, phase =channel 2.
Dataset B: phase =channel 0, mitochondria =channel 1.
If I trained a microDL model on dataset A to predict the nucleus using phase images, how can I modify the inference configuration to work on dataset B to virtually stain the nucleus from the phase channels?
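One way to think about the question is to select channels by name rather than by position and translate names to indices per dataset. The helper below is a hypothetical sketch of that lookup. Whether and how the inference config exposes channel indices is not stated in this issue, so this only illustrates the mapping.

```python
def channel_ids_by_name(channel_names, wanted):
    """Translate channel names to the indices used by one dataset."""
    index = {name: i for i, name in enumerate(channel_names)}
    missing = [name for name in wanted if name not in index]
    if missing:
        raise KeyError("channels not found: {}".format(missing))
    return [index[name] for name in wanted]


dataset_a = ["nucleus", "membrane", "phase"]  # channel order of dataset A
dataset_b = ["phase", "mitochondria"]         # channel order of dataset B
```

Here the phase input sits at index 2 in dataset A and at index 0 in dataset B. Dataset B has no nucleus channel, so the lookup raises for the target, which means inference on B would have to run without a target channel or target-based metrics.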

[master branch] "tile_image" generate tiles with different sizes when the input image dimension is not the multiple of input tile size

proposed fix:
change
for row in range(0, n_rows, step_size[0]):
    if row + tile_size[0] > n_rows:
        row = check_in_range(row, n_rows, tile_size[0])
    for col in range(0, n_cols, step_size[1]):
        if col + tile_size[1] > n_cols:
            col = check_in_range(col, n_cols, tile_size[1])
to
for row in range(0, n_rows - tile_size[0] + step_size[0] + 1, step_size[0]):
    if row + tile_size[0] > n_rows:
        row = check_in_range(row, n_rows, tile_size[0])
    for col in range(0, n_cols - tile_size[1] + step_size[1] + 1, step_size[1]):
        if col + tile_size[1] > n_cols:
            col = check_in_range(col, n_cols, tile_size[1])
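The underlying idea, that every tile keeps the full tile size and the last tile is shifted back to end at the image border, can be sketched along one axis as follows. This is an alternative formulation, not the proposed patch: it does not use `check_in_range`, whose exact behavior is not shown in this issue, and the function name is hypothetical.

```python
def tile_starts(length, tile_size, step_size):
    """Return start indices so that every tile has exactly tile_size pixels."""
    if tile_size > length:
        raise ValueError("tile is larger than the image dimension")
    # Regular grid of tiles that fit entirely inside the image
    starts = list(range(0, length - tile_size + 1, step_size))
    # If the grid leaves uncovered pixels at the end, add one more tile
    # shifted back so that it ends exactly at the image border
    if starts[-1] + tile_size < length:
        starts.append(length - tile_size)
    return starts
```

For a dimension of 11 pixels with tile size 4 and step 3 this gives starts 0, 3, 6 and 7. The last tile overlaps its neighbor more than the others, but all four tiles are 4 pixels wide.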

Remove run_image_preprocessing

Replace it with something that generates a frames_meta.csv if your file names adhere to naming convention.

Update documentation

The documentation needs updating! In the readme.md config parameters are missing, are outdated or lack description. As mentioned in issue #127, the config files require docstrings as well.