mehta-lab / microDL
3D virtual staining with 2D and 2.5D U-Nets
License: BSD 3-Clause "New" or "Revised" License
Some ground truth masks will be generated by manual or semi-automatic means when simple thresholding isn't enough.
microDL should be able to accept a mask directory as an input and add it to the training data seamlessly.
The parameter "mask_dir" seems to have a different meaning in inference_dataset.py than in training and preprocessing. From the docstring:
":param str/None mask_dir: If inference targets are masks stored in a
different directory than the image dir. Assumes the directory contains
a frames_meta.csv containing mask channels (which will be target channels
in the inference config) z, t, p indices matching the ones in image_dir"
It seems to me that mask_dir here is actually target_dir, whereas at training time mask_dir refers to the folder with binary or float maps used for weighted loss calculation.
I suggest we use "target_dir" instead of "mask_dir" in inference_dataset.py to avoid confusion. We should also separate "data_dir" into "input_dir" and "target_dir" in the training and preprocessing pipeline to be consistent with this change, since currently the input and target images are assumed to always be in the same directory.
Flat fielding should be performed on raw data before normalization, so if possible we should see whether we can make flat fielding available outside of preprocessing. On the other hand, it is something that should be corrected immediately after acquisition, so this might not be an issue to address immediately. Again, documenting a discussion with @smguo.
Version 3.0.3 is buggy when adding color bars in subplots. Updating to 3.1.1 solved the issue
When applying inference to images of shape 2562x2562 px² they are cropped to 2048x2048 px². Only the inferred image is saved and this makes it impossible to further compare ground truth and input images to the inferred images as the exact cropping area is unknown.
Furthermore, the inferred image does not have the same dynamic range as the ground truth image. In the inference figure both target and prediction have pixel values ranging up to 33K. However, the ground truth image only has pixel values up to 280. The inferred image is stored with values up to 33K.
Both scenarios make it hard to further compare ground truth and inferred image outside of microDL. Could we think of a strategy to solve this?
Inference figure showing different pixel values for target image
Commit 151cc25 master branch
Keras is now part of TensorFlow 2's API. Switching to tf.keras will make it easier to maintain the code base and add new models.
I've made this into an issue so we can have a coherent thread in which we can reference parts of the code and I can answer questions about the workflow.
I have just done some preliminary testing:
microdl_torch in the comp_micro group environments
/hpc/projects/CompMicro/projects/virtualstaining/torch_microDL/config_files/2022_09_27_A549_NuclStain/09_30_2022_15_09/ (for the 2D data and network)
/hpc/projects/CompMicro/projects/virtualstaining/torch_microDL/config_files/2019_02_15_KidneyTissue_DLMBL_subset/09_30_2022_12_06/ (for the 3D data and 2.5D network)
There are more thorough instructions and documentation of the new torch config file in the branch's micro_dl/torch_unet/readme.md. The general gist of the PyTorch workflow is that right now we still need valid preprocessing, training, and inference configs for some set of data, but most of the parameters pertaining to training and model initialization (see the example training configs) can be ignored, and are instead part of the torch_config.yml file. This is clunky, but it is only a temporary solution for testing, and once we've determined the integrity of the PyTorch models we can phase out the other config files.
The PyTorch 2.5D Unet currently uses same convolution, to simplify information-matching in the concatenation of skip connections. This introduces unreliable values into the edges of inference outputs from the network. Valid convolution is superior in this aspect, but faces a few challenges:
It's not clear what any of the fields in the configuration .yml files do.
For example, for config_preprocess, what is the relationship between tile: depths, tile: mask-depth, channel_ids and masks: channels?
Additionally, each step of the process (generate_meta, preprocess, etc.) produces files or folders, but it's not clear what those are and what purpose they serve.
I have been using microDL by splitting the workflow between the HPC cluster and the docker environment on the power i9 machine, predominantly to perform virtual staining from phase images. The initial data preparation and preprocessing are performed on HPC, and the training and inference steps in docker on power i9.
The primary input to the pipeline is the stack of brightfield images captured on the microscopes. The first step is phase image reconstruction from the brightfield stacks using the waveorder library, after which the stacks are aligned so that the focal plane is centered at every imaged position. The reconstructed phase images are the input to the microDL preprocessing step. This part of the workflow benefits from occasional visual checks of the resulting images by the user, which are easier with napari and ImageJ (best used on HPC).
The next steps, training and inference, are performed in the docker environment on the power i9 machine. They currently cannot be run elsewhere because of the unstable environment on HPC and the lower disk space on machines like fry2 (where training is also very slow). So, for now, they are best run in the docker environment.
Documenting a discussion with @smguo: training data may be spread across different folders from different datasets. microDL needs a way to copy and merge the frames_meta.csv files of all datasets (three, in the case discussed) into one folder holding a single dataset and a single CSV file for training.
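A minimal sketch of the metadata half of that merge, using only the standard library. The function name `merge_frames_meta` and the added `dataset_dir` column are assumptions made for illustration, not part of microDL. Copying the image files and resolving colliding indices (for example, identical position indices in two datasets) are not covered here.

```python
import csv
import os


def merge_frames_meta(dataset_dirs, output_dir, meta_name="frames_meta.csv"):
    """Concatenate the metadata tables of several datasets into one CSV.

    A `dataset_dir` column (an assumption of this sketch) records where each
    row came from, so rows stay traceable after merging.
    """
    rows, columns = [], []
    for dataset_dir in dataset_dirs:
        with open(os.path.join(dataset_dir, meta_name), newline="") as f:
            reader = csv.DictReader(f)
            # Keep the union of all column names, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            for row in reader:
                row["dataset_dir"] = dataset_dir
                rows.append(row)
    if "dataset_dir" not in columns:
        columns.append("dataset_dir")

    os.makedirs(output_dir, exist_ok=True)
    merged_path = os.path.join(output_dir, meta_name)
    with open(merged_path, "w", newline="") as f:
        # restval="" fills columns that a given dataset does not have.
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return merged_path
```

Keeping the source directory per row means the merged table can point at the original files, which would avoid duplicating images if the readers accept per-row directories.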
Adding this to def run_action(args): will solve the problem:

    if not os.path.exists(config['trainer']['model_dir']):
        os.makedirs(config['trainer']['model_dir'])
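An equivalent and slightly more robust variant is sketched below. The helper name `ensure_model_dir` is hypothetical; only `config['trainer']['model_dir']` comes from the snippet above.

```python
import os


def ensure_model_dir(config):
    """Create config['trainer']['model_dir'] if needed and return the path."""
    model_dir = config["trainer"]["model_dir"]
    # exist_ok=True makes the call a no-op when the directory already exists
    # and avoids a race between the existence check and the creation.
    os.makedirs(model_dir, exist_ok=True)
    return model_dir
```

Calling it at the top of run_action would make the separate existence check unnecessary.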
In testing Jenny's zarr reading PR I've come across multiple bugs. I believe more thorough testing is also needed to make sure that the issues weren't with my testing scripts. Here they are in order of urgency:
The reader determines whether to use zarr or tiff image reading by the presence of a frames_meta.csv metadata file in the image directory. However, preprocessing generates this file and leaves it in the directory. The next time the zarr folder is preprocessed (perhaps with different parameters), preprocessing breaks, because it sees the metadata from the last preprocessing cycle and tries to read tiff files (when the data is in fact in zarr format). There should be some other way to determine whether we are reading from zarr or not.
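One possible alternative, sketched under assumptions: decide from the store itself rather than from frames_meta.csv. A Zarr store has `.zgroup` or `.zarray` at its root in format v2 and `zarr.json` in format v3, and store directories conventionally end in `.zarr`. The function name `detect_image_format` is hypothetical.

```python
import os

# Marker files written at the root of a Zarr store:
# .zgroup / .zarray for format v2, zarr.json for format v3.
ZARR_MARKERS = (".zgroup", ".zarray", "zarr.json")


def detect_image_format(data_dir):
    """Return 'zarr' if data_dir is or contains a Zarr store, else 'tiff'."""
    path = os.path.normpath(data_dir)
    if path.lower().endswith(".zarr"):
        return "zarr"
    if any(os.path.exists(os.path.join(path, m)) for m in ZARR_MARKERS):
        return "zarr"
    # A plain folder that holds one or more *.zarr stores also counts.
    if os.path.isdir(path) and any(
            name.lower().endswith(".zarr") for name in os.listdir(path)):
        return "zarr"
    return "tiff"
```

With a check like this, a leftover frames_meta.csv from an earlier preprocessing run no longer influences which reader is chosen.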
It seems that I am unable to run preprocessing (of any two directories of either file format) twice consecutively in the same script. Some very strange behavior occurs: assertions about the config's flat_field and mask_dir parameters fail when they definitely should not (they detect some parameter that doesn't exist, etc.). This should be looked into more.
I cannot seem to generate masks from certain channels in .zarr files. I've traced the problem down to a ValueError: cannot convert float NaN to integer when attempting to read the channel_idx from the auto-generated frames_meta.csv metadata. Perhaps the type of that parameter is incorrect when it is autogenerated.
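That ValueError is what Python raises for `int(float('nan'))`, which is what an empty cell in an otherwise integer CSV column typically becomes once it is loaded as a float. A defensive parser could look like the sketch below. `parse_index` is a hypothetical helper, and whether a missing channel_idx should be skipped or treated as an error is left to the caller.

```python
import math


def parse_index(value):
    """Convert a metadata cell to int, returning None when it is missing."""
    if value is None:
        return None
    if isinstance(value, str):
        value = value.strip()
        if value == "" or value.lower() == "nan":
            return None
        value = float(value)  # accepts "3" as well as "3.0"
    if isinstance(value, float) and math.isnan(value):
        return None
    return int(value)
```

Returning None instead of raising lets the mask generation report which rows lack a channel index rather than failing on the first one.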
If a user generates the frames_meta.csv metadata while the storage is mounted differently (for example, if they access a data folder through /gpfs/projects/CompMicro.. or /home/their.user/CompMicro/..), then running preprocessing on that data from any other mount will fail, because the metadata captures that mount's absolute path as the path to the reconstructed images in each folder. I had to write a script to convert these paths to /hpc/projects/CompMicro/.. before using the data to test preprocessing.
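The kind of conversion described here can be sketched as a prefix rewrite. This is not the script mentioned above. `remap_mount` and the mount prefixes in the usage lines are placeholders chosen for illustration.

```python
import posixpath


def remap_mount(path, known_roots, target_root):
    """Rewrite the leading mount point of an absolute POSIX path.

    known_roots: mount prefixes that all point at the same storage.
    target_root: prefix valid on the machine running preprocessing.
    """
    path = posixpath.normpath(path)
    for root in known_roots:
        root = posixpath.normpath(root)
        # Match whole path components only, so /mnt/a does not match /mnt/ab.
        if path == root or path.startswith(root + "/"):
            rest = path[len(root):].lstrip("/")
            return posixpath.join(target_root, rest) if rest else target_root
    return path  # no known prefix: leave untouched


# Hypothetical mounts, for illustration only.
old_roots = ["/mnt/old_mount", "/home/some.user/share"]
new_path = remap_mount("/mnt/old_mount/data/pos0", old_roots, "/mnt/new_mount")
```

Storing paths relative to the dataset root in the metadata would avoid the need for such a rewrite altogether.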
There were a few other issues I ran into, but I believe they might have been symptoms of these main ones. Hopefully fixing these will allow for smooth usage of the .zarr reading.
The scripts I have used for this testing can be found in /hpc/projects/comp_micro/projects/virtualstaining/testing_zarr_scripts.
The example config files currently don't have any docstrings to explain the function of each parameter. I suggest adding docstrings to the YAML files, as in this example:
The current microDL codebase needs refreshed documentation to ease the onboarding of new users to the current pipeline and to guide the transition of the code to TensorFlow 2.0 or PyTorch (#141).
The structure of documentation we are aiming for is as follows: Simplify readme to describe installation and how to use CLI for preprocessing, training, inference. The readme also provides an index of other documentation. Most of the documentation will be centralized in the docs folder, containing
Since the docs folder is versioned with the repo, this approach will allow us to version the documentation along with the code.
Add tmux
Add requirements.
Can you share the config_3D.yml for the preprocessing, training, and inference scripts?
We will use this google sheet to track the paths and config files of computational experiments, along with the imaging experiments.
Key decisions:
projects/virtualstaining/tf_microDL and torch_microDL folders.
projects/virtualstaining/tf_microDL/config_files and projects/virtualstaining/torch_microDL/config_files. Many configs are currently saved in CompMicro/software/configfiles/microDL.
Feature request: Add some sort of feedback (like a progress bar) while preprocess.py is running. This would be more user-friendly, as the script takes a while.
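A dependency-free sketch of such feedback is shown below. The generator `with_progress` is hypothetical, and where it would hook into preprocess.py is not specified here. A third-party progress bar library would be an alternative if adding a dependency is acceptable.

```python
import sys
import time


def with_progress(items, total, label="preprocess", stream=sys.stderr):
    """Yield items unchanged while printing a one-line progress report."""
    start = time.time()
    for done, item in enumerate(items, start=1):
        yield item
        # Runs once the consumer has finished with `item` and asks for the next.
        elapsed = time.time() - start
        stream.write("\r{}: {}/{} ({:.0f}%), {:.1f}s elapsed".format(
            label, done, total, 100.0 * done / total, elapsed))
        stream.flush()
    stream.write("\n")
```

Wrapping an existing loop, for example `for pos in with_progress(positions, len(positions)):`, leaves the loop body unchanged.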
    callbacks = self._get_callbacks()
    if 'metrics' in self.config:
        metrics_list = self.config['metrics']
        metrics = get_metrics(metrics_list)
        self._compile_model(loss, optimizer, metrics)
    else:
        self._compile_model(loss, optimizer, metrics=None)
gpu_id is an int whereas mem_frac is a list:

    run_action(args, gpu_id, gpu_mem_frac[0])
    if loss_is_masked:
        if 'metrics' in self.config:
            masked_metrics = [metric(self.num_target_channels)
                              for metric in metrics]
            self.model.compile(loss=masked_loss(loss,
                                                self.num_target_channels),
                               optimizer=optimizer,
                               metrics=masked_metrics)
        else:
            self.model.compile(loss=masked_loss(loss,
                                                self.num_target_channels),
                               optimizer=optimizer,
                               metrics=None)
Unet_stack_to_2D

    @property
    def _get_input_shape(self):
        """Return shape of input"""
        if self.config['data_format'] == 'channels_first':
            shape = (1,
                     self.config['depth'],
                     self.config['height'],
                     self.config['width'])
        else:
            shape = (self.config['depth'],
                     self.config['height'],
                     self.config['width'], 1)
        return shape

The number of channels is hardcoded at 1! Fix it.
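One way the hardcoded value could be lifted is sketched below as a free function. The config key `num_input_channels` is an assumption of this sketch. It defaults to 1 so that existing configs would behave as before.

```python
def get_input_shape(config):
    """Return the input shape (without batch axis) for a stack-to-2D network."""
    # 'num_input_channels' is an assumed config key; default to 1 to keep the
    # current behaviour when it is absent.
    num_channels = config.get("num_input_channels", 1)
    spatial = (config["depth"], config["height"], config["width"])
    if config["data_format"] == "channels_first":
        return (num_channels,) + spatial
    return spatial + (num_channels,)
```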
We have redundant copies of training data after tiling. @smguo suggested that we use numpy memmap (https://github.com/czbiohub/bam2fasta/blob/master/bam2fasta/np_utils.py#L5) to write these tiles.
This should act as an integration test, and test all kinds of processing; resize, mask, flatfield, tile, for uniform and non-uniform data.
This will need to be broken down into several PRs, maybe by uniform/nonuniform.
Proposed fix: add a cropping block in _read_one:

    # Assumes os, numpy (as np), image_utils and zscore are already imported
    # in the enclosing module.
    def _read_one(tp_dir, channel_ids, fname, flat_field_dir=None):
        """Read one image set

        :param str tp_dir: timepoint dir
        :param list channel_ids: list of channels to read from
        :param str fname: fname of the image. Expects the fname to be the same
         in all channels
        :param str flat_field_dir: dir where flat field images are stored
        :returns: np.array of shape nb_channels, im_size (with or without
         flat field correction)
        """
        cur_images = []
        for ch in channel_ids:
            cur_fname = os.path.join(tp_dir,
                                     'channel_{}'.format(ch),
                                     fname)
            cur_image = np.load(cur_fname).astype(np.float32)
            if flat_field_dir is not None:
                ff_fname = os.path.join(flat_field_dir,
                                        'flat-field_channel-{}.npy'.format(ch))
                ff_image = np.load(ff_fname)
                cur_image = image_utils.apply_flat_field_correction(
                    cur_image, flat_field_image=ff_image)
            cur_image = zscore(cur_image)
            # Minimally crop the image from the center so that the cropped
            # shape is always a power of 2. crop_shape holds half the crop
            # size per axis. np.int was removed from NumPy; use int instead.
            image_shape = np.asarray(cur_image.shape)
            crop_shape = 2 ** (np.floor(np.log2(image_shape)) - 1)
            crop_shape = crop_shape.astype(int)
            ctr = np.floor(image_shape / 2).astype(int)
            cur_image = cur_image[ctr[0] - crop_shape[0]:ctr[0] + crop_shape[0],
                                  ctr[1] - crop_shape[1]:ctr[1] + crop_shape[1]]
            cur_images.append(cur_image)
        cur_images = np.stack(cur_images)
        return cur_images
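The cropping arithmetic can also be isolated into a small helper that returns the crop window itself, so the same window could be applied to inputs, targets, and predictions alike. This is a sketch under the assumption that the crop is centered as in the block above. The name `pow2_center_crop_slices` is hypothetical.

```python
def pow2_center_crop_slices(shape):
    """Return one slice per axis selecting a centered power-of-two crop."""
    slices = []
    for size in shape:
        # Largest power of two that fits; bit_length avoids float log2 rounding.
        crop = 1 << (size.bit_length() - 1)
        start = (size - crop) // 2
        slices.append(slice(start, start + crop))
    return tuple(slices)


# A 2562 x 2562 image is cropped to rows and columns 257..2304 (2048 pixels).
window = pow2_center_crop_slices((2562, 2562))
```

Indexing an array with `image[window]` applies the crop, and saving `window` next to the output records which region was kept.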
How about adding or referencing a test dataset with which the pipeline can be tried out? This would allow external users to clearly see which data structure is required and make the adaptation to their dataset and setup easier.
Predicted images don't have the full dimension as the targets
@smguo @mattersoflight Tagging you here in this issue as you have mentioned integration testing for the repo. I think GitHub Actions workflows might be better.
Hi all,
I wanted to prepare training data that covers different modalities, so we can compare the impact of phase vs. brightfield and deconvolved vs. non-deconvolved channels. However, I get unexpected behavior from the z-alignment script. The script simply copies all single-page tif images into the new aligned folder without aligning them (of the 97 z-ids per position, all z-ids are saved in the aligned folder).
I plan to prepare the following training data combinations:
The scripts are located here:
/gpfs/CompMicro/projects/HEK/2022_03_15_orgs_nuc_mem_63x_04NA/all_pos_single_page/
align_z_focus_2022_03_15_only_phase_refmem_min25_max60.py
align_z_focus_2022_03_15_only_deconv_refmem_min25_max60.py
align_z_focus_2022_03_15_raw_data_refmem_min25_max60.py
align_z_focus_2022_03_15_refmem_min25_max60.py -> this script works! Aligned images are saved at /gpfs/CompMicro/projects/HEK/2022_03_15_orgs_nuc_mem_63x_04NA/all_pos_single_page/all_pos_Phase1e-3_Denconv_Nuc8e-4_Mem8e-4_pad15_bg50_registered_refmem_min25_max60
Furthermore, the alignment script takes quite a while to run through. If anyone finds time it would be great to speed the script up with e.g. multiprocessing.
I ran the scripts on hulk in a microDL docker container with the command python <script name>
I would highly appreciate if anyone finds time to debug this issue! Please let me know if there are open questions / if I can help with anything.
Best,
Johanna
Make sure settings are compatible with model selection, or (even better) remove settings from config file and automatically set them based on model selection.
E.g. dataset -> squeeze should be set to true for 2D models and false otherwise.
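A sketch of how such a derivation and check might look is given below. The model class names in `IS_2D_MODEL`, the `network.class` key, and the function name are placeholders assumed for illustration. Only `dataset -> squeeze` comes from the text above.

```python
# Assumed mapping from model class name to whether it takes 2D inputs;
# the names are placeholders for whatever the config's model class is called.
IS_2D_MODEL = {
    "UNet2D": True,
    "UNetStackTo2D": False,
    "UNet3D": False,
}


def apply_model_defaults(config):
    """Return a copy of config with dataset.squeeze derived from the model."""
    model_class = config["network"]["class"]
    if model_class not in IS_2D_MODEL:
        raise ValueError("Unknown model class: {}".format(model_class))
    expected = IS_2D_MODEL[model_class]
    dataset = dict(config.get("dataset", {}))
    # An explicit setting that contradicts the model is reported, not overridden.
    if "squeeze" in dataset and dataset["squeeze"] != expected:
        raise ValueError(
            "dataset.squeeze={} is incompatible with {}".format(
                dataset["squeeze"], model_class))
    dataset["squeeze"] = expected
    new_config = dict(config)
    new_config["dataset"] = dataset
    return new_config
```

Failing early on a contradictory setting keeps the error close to the config file rather than deep inside training.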
What is the optimal receptive field along the z dimension for virtual staining? It should be similar to the receptive field along XY, because biological structures are not aligned to Cartesian axes. However, optical resolution along Z is 2x lower than XY resolution.
The accuracy of prediction may improve when information outside the optical resolution limit is used.
Imaging is typically done with 5x5 pixels within the XY resolution limit of the microscope and 5 slices within the Z resolution limit.
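To compare candidate architectures on this question, the receptive field along one axis can be computed from the kernel sizes and strides of the layers on the path from input to output. The sketch below uses the standard recurrence for stacked convolution and pooling layers. The layer lists in the usage lines are illustrative and do not describe the microDL networks.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) along one axis.

    layers: sequence of (kernel_size, stride) pairs, input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer adds (kernel - 1) steps of size jump
        jump *= stride             # spacing of adjacent outputs, in input pixels
    return rf


# Three 3-wide convolutions with stride 1.
rf_plain = receptive_field([(3, 1), (3, 1), (3, 1)])   # 7
# Convolution, 2x pooling, convolution.
rf_pooled = receptive_field([(3, 1), (2, 2), (3, 1)])  # 8
```

Running it once with the z kernels and once with the XY kernels of a given network gives the two numbers to compare against the sampling figures above.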
Hi, running inference with a 2.5D model results in an error related to the dimensions of the pred_stack (and target_stack) variable at inference->image_inference.py->predict_2d(). A 2D model yields stacks in the format (ZCYX), which can be transposed to (CYXZ) in the predict_2d function. A 2.5D model yields stacks in the format (ZC?YX), which leads to the dimensionality error. I also attached an image of the preprocessing 2.5D model single channel, to show that 2D tiles were correctly configured.
Paths to used configs
Error message
Preprocessing config of 2.5D model single channel
If preprocessing is called using an existing directory it is much slower (doesn't use multiprocessing).
And it doesn't check if all channels are already there, it runs again even if all preprocessed data is already there.
The use_multiprocessing option needs to be True for model.fit_generator to initiate multiple generator instances.
Nice post demonstrating the behavior of model.fit_generator with and without multiprocessing:
https://keunwoochoi.wordpress.com/2017/08/24/tip-fit_generator-in-keras-how-to-parallelise-correctly/
Will fix this in the next PR
Update methods that use aux_utils.get_row_idx to take pos_idx as a param, or replace them with aux_utils.get_meta_idx:
model_inference.py, gen_mask_seg.py (ignore), estimate_flat_field.py
remove 'tune_hyperparam' related options in train
Convert networks/conv_blocks/pad_channels() and _crop_layer() to custom layers in keras to avoid serialization errors when saving model weights
    if num_dims == 2:
        if config['network']['data_format'] == 'channels_first':
            im_shape = [1, 1, im_size[0], im_size[1]]
        else:
            im_shape = [1, im_size[0], im_size[1], 1]
    elif num_dims == 3:
        if config['network']['data_format'] == 'channels_first':
            im_shape = [1, 1, im_size[0], im_size[1], im_size[2]]
        else:
            im_shape = [1, im_size[0], im_size[1], im_size[2], 1]
SSIM of the prediction and the target is currently estimated by z-scoring the target, but this makes the luminance term always zero and underestimates SSIM. Instead, the prediction should be scaled back to the same range as the target before computing metrics.
Note that this doesn't affect the Pearson correlation metric, as it is scale invariant. It also doesn't apply to segmentation output, which is always in [0, 1].
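A minimal sketch of the suggested rescaling, assuming the normalization was a plain z-score computed from the target's own mean and standard deviation. If other statistics were used for normalization, those would have to be used here instead. The function `unzscore` is hypothetical and operates on flat sequences of pixel values to stay dependency-free.

```python
import statistics


def unzscore(pred_z, target):
    """Map a z-scored prediction back to the intensity range of target."""
    mean = statistics.fmean(target)
    std = statistics.pstdev(target)  # population std, as in a plain z-score
    return [p * std + mean for p in pred_z]
```

With array data the same step is a single multiply-and-add using the stored mean and standard deviation, applied before the metrics are computed and before the prediction is written to disk.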
It won't save the trained model by default as train_loss always decreases. Better to change it to val_loss
How to dynamically import classes across all sub-modules
Logging
microDL currently only accepts single-page TIFF as input format, but the microscopy community is switching to the zarr format. Adding support for zarr will reduce data conversion and duplication.
How can I modify the inference configuration to use a trained model on a dataset that has a different channel order than the dataset used for training?
For instance, consider two datasets A and B. The datasets were captured in the following order:
Dataset A : nucleus =channel 0, membrane =channel 1, phase =channel 2.
Dataset B: phase =channel 0, mitochondria =channel 1.
If I trained a microDL model on dataset A to predict the nucleus using phase images, how can I modify the inference configuration to work on dataset B to virtually stain the nucleus from the phase channels?
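The underlying idea, looking channels up by name rather than by position, can be sketched as follows. The helper `channel_index` is hypothetical, and which field of the inference config receives the resulting index is not shown here.

```python
def channel_index(channel_names, wanted):
    """Return the index of channel `wanted` in a dataset's channel order."""
    lowered = [name.lower() for name in channel_names]
    try:
        return lowered.index(wanted.lower())
    except ValueError:
        raise KeyError(
            "channel {!r} not found in {}".format(wanted, channel_names))


dataset_a = ["nucleus", "membrane", "phase"]
dataset_b = ["phase", "mitochondria"]

# Input channel used for training on A and the matching channel in B.
train_input = channel_index(dataset_a, "phase")      # 2
inference_input = channel_index(dataset_b, "phase")  # 0
```

Note that dataset B has no nucleus channel, so there is no target to compare against and only the prediction itself can be produced for B.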
Proposed fix: change

    for row in range(0, n_rows, step_size[0]):
        if row + tile_size[0] > n_rows:
            row = check_in_range(row, n_rows, tile_size[0])
        for col in range(0, n_cols, step_size[1]):
            if col + tile_size[1] > n_cols:
                col = check_in_range(col, n_cols, tile_size[1])

to

    for row in range(0, n_rows - tile_size[0] + step_size[0] + 1, step_size[0]):
        if row + tile_size[0] > n_rows:
            row = check_in_range(row, n_rows, tile_size[0])
        for col in range(0, n_cols - tile_size[1] + step_size[1] + 1, step_size[1]):
            if col + tile_size[1] > n_cols:
                col = check_in_range(col, n_cols, tile_size[1])
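For comparison, the intent of tiling an axis so that no tile runs past the border can be written as a standalone helper. This is a different formulation and not the proposed patch. `tile_starts` is a hypothetical name, and the behavior of `check_in_range` is not assumed since it is not shown.

```python
def tile_starts(n, tile, step):
    """Start indices of tiles of length `tile` covering an axis of length n."""
    if tile > n:
        raise ValueError("tile size exceeds image size")
    starts = list(range(0, n - tile + 1, step))
    # If the regular grid stops short of the edge, add one last tile that
    # ends exactly at the border (it overlaps its neighbour more than usual).
    if starts[-1] + tile < n:
        starts.append(n - tile)
    return starts


even_fit = tile_starts(10, 4, 3)   # [0, 3, 6]
with_edge = tile_starts(11, 4, 3)  # [0, 3, 6, 7]
```

Computing the row and column start lists once up front also removes the need to reassign the loop variable inside the loop.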
Replace it with something that generates a frames_meta.csv if your file names adhere to naming convention.
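A sketch of what such a generator could do is shown below, assuming file names of the form im_c<channel>_z<slice>_t<time>_p<position>.<ext>. Both the pattern and the column names are assumptions of this sketch and would need to be checked against the actual naming convention.

```python
import re

# Assumed naming convention: im_c<channel>_z<slice>_t<time>_p<position>.<ext>
NAME_PATTERN = re.compile(
    r"^im_c(?P<channel_idx>\d+)_z(?P<slice_idx>\d+)"
    r"_t(?P<time_idx>\d+)_p(?P<pos_idx>\d+)\.\w+$"
)


def parse_file_name(file_name):
    """Return a metadata row for file_name, or None if it does not match."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None
    row = {key: int(value) for key, value in match.groupdict().items()}
    row["file_name"] = file_name
    return row


def build_frames_meta(file_names):
    """Collect rows for all matching names, sorted for reproducible output."""
    rows = [parse_file_name(name) for name in sorted(file_names)]
    return [row for row in rows if row is not None]
```

The resulting list of dictionaries can be written out with `csv.DictWriter`. Files that do not follow the convention are skipped rather than causing an error.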
The documentation needs updating! In the readme.md config parameters are missing, are outdated or lack description. As mentioned in issue #127, the config files require docstrings as well.