royerlab / cytoself

Self-supervised models for encoding protein localization patterns from microscopy images

License: BSD 3-Clause "New" or "Revised" License

Python 96.41% Jupyter Notebook 3.59%

deep-learning protein imaging autoencoder fluorescence pytorch self-supervised self-supervised-learning tensorflow opencell

cytoself's Issues

Labels that don't match up with available images

Hi! I'm using OpenCell data + Cytoself for a project, and want to use the provided label data to look up original images. However, I'm having some trouble matching the labels to their original images (in AWS).
For example, I have the following label:

ensg: ENSG00000038358
name: EDC4
loc_grade1: vesicles
loc_grade2:  NaN
loc_grade3: cytoplasmic
protein_id: 830
FOV_id: 2883

But there's no image with this FOV ID in the s3 bucket:

I'd appreciate any guidance on how to appropriately use the labels to find original images. Thanks so much.

Variance normalization

I'm trying to understand the variance normalization that happens here

cytoself/cytoself/trainer/cytoselffull_trainer.py

Line 144 in 9f48239

    
           self.model.mse_loss['reconstruction1_loss'] = mse_loss_fn(model_outputs[0], img) / variance

I can't seem to find a similar variance normalization discussed in the hierarchical VQ-VAE paper or other older papers, and it's not clear to me why normalization by a single scalar is needed here.

It's also not clear to me why you only normalize reconstruction1_loss, while reconstruction2_loss, calculated here, is left unnormalized:

cytoself/cytoself/trainer/autoencoder/cytoselffull.py

Line 198 in 9f48239

self.mse_loss[f'reconstruction{len(self.decoders) - i}_loss'] = nn.MSELoss()(

Wouldn't it make sense to normalize both of them, given that they get summed into the same loss?

Also do you think it's really important to calculate the variance for train/ val/ test separately? Could one just use the variance across train for all? The difference doesn't seem to be large, and it makes things a little easier.

Any help here would be appreciated - thanks!!
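
For readers following this question, here is a minimal sketch of what dividing an MSE by a single scalar variance does numerically. It assumes NumPy and synthetic arrays rather than cytoself's trainer, and the function name normalized_mse is hypothetical. It only illustrates the arithmetic; it does not claim to explain the authors' reasons.

```python
import numpy as np

def normalized_mse(pred, target, variance):
    """Mean squared error divided by a precomputed scalar data variance."""
    return np.mean((pred - target) ** 2) / variance

rng = np.random.default_rng(0)
img = rng.normal(loc=0.5, scale=0.2, size=(8, 1, 32, 32))  # synthetic "images"
recon = img + rng.normal(scale=0.05, size=img.shape)        # synthetic reconstruction

loss = normalized_mse(recon, img, img.var())

# Rescaling the data by a constant leaves the normalized loss unchanged,
# whereas the raw MSE would grow by scale**2.
scale = 10.0
loss_scaled = normalized_mse(scale * recon, scale * img, (scale * img).var())

# Predicting the dataset mean everywhere gives a normalized loss of 1
# (up to floating-point rounding).
baseline = normalized_mse(np.full_like(img, img.mean()), img, img.var())
```

In this toy setting the normalized value can be read against a fixed reference: 1 is what a constant mean prediction achieves, regardless of the intensity scale of the data. A loss term that is not divided by the variance keeps the units of the raw pixel values.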

Release trained model weights

Is there any chance @li-li-github @royerloic you could release trained weights for the PyTorch model that would allow me to use your autoencoder and the embedding space? Ideally it would be the same weights and umap parameters that allow me to remake figure 2 in your paper.

I’ve started trying to retrain the model with your data, but I’m not getting umaps that look as nice as yours in the paper. I can continue down that route, but if you have a nice trained model that you could add to the repo or hugging face that would be a big help!

thanks!!

Dataloader for full dataset

Great work here @li-li-github! I'm Nick from the CZI SciTech team and am interested in retraining cytoself on the full dataset. I noticed that DataManagerOpenCell currently has no convenience function for downloading the full dataset (like DataManagerOpenCell.download_sample_data) or for then creating a dataloader (like datamanager.const_dataloader).

Downloading the full data is very easy, but the format is then slightly different so I can't just use datamanager.const_dataloader out of the box.

I am thinking about expanding that method so it can handle the full data. Would you be interested in having that be a PR to this repo, or do you already have an alternative recommended way to deal with the full data?

failure to run plot_clustermaps()

When I run the line below from the Colab tutorial (currently running in a Jupyter notebook):
analytics.plot_clustermaps()

I get the following errors. Please let me know how to resolve it.

Thanks!

ValueError Traceback (most recent call last)
/tmp/ipykernel_2025/1780327345.py in <module>
----> 1 analytics.plot_clustermaps()

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/cytoself/analysis/analytics.py in plot_clustermaps(self, data, corr_idx_idx, target_vq_layer, datatype, savepath, filename, format, num_cores)
631 fileName="corridx_" + datatype,
632 num_cores=num_cores,
--> 633 savepath=savepath,
634 )
635 corr_idx_idx = self.corr_idx_idx

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/cytoself/analysis/analytics.py in calc_corr_idx_idx(self, data, fileName, num_cores, savepath)
567 for d in data:
568 if len(d) > 0:
--> 569 d = np.nan_to_num(selfpearson_multi(d, num_cores=num_cores))
570 self.corr_idx_idx.append(d)
571 # self.corr_idx_idx = np.stack(self.corr_idx_idx)

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/cytoself/analysis/pearson_correlation.py in selfpearson_multi(data, num_cores, axis)
32 corr = Parallel(n_jobs=num_cores, prefer="threads")(
33 delayed(corr_single)(i1, ar1, data.shape[0], data[i1:])
---> 34 for i1, ar1 in enumerate(tqdm(data))
35 )
36 corr = np.vstack(corr)

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1052
1053 with self._backend.retrieval_context():
-> 1054 self.retrieve()
1055 # Make sure that we get a last message telling us we are done
1056 elapsed_time = time.time() - self._start_time

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
931 try:
932 if getattr(self._backend, 'supports_timeout', False):
--> 933 self._output.extend(job.get(timeout=self.timeout))
934 else:
935 self._output.extend(job.get())

~/anaconda3/envs/cytoself/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
655 return self._value
656 else:
--> 657 raise self._value
658
659 def _set(self, i, obj):

~/anaconda3/envs/cytoself/lib/python3.7/multiprocessing/pool.py in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
119 job, i, func, args, kwds = task
120 try:
--> 121 result = (True, func(*args, **kwds))
122 except Exception as e:
123 if wrap_exception and func is not _helper_reraises_exception:

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/joblib/_parallel_backends.py in __call__(self, *args, **kwargs)
593 def __call__(self, *args, **kwargs):
594 try:
--> 595 return self.func(*args, **kwargs)
596 except KeyboardInterrupt as e:
597 # We capture the KeyboardInterrupt and reraise it as

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/cytoself/analysis/pearson_correlation.py in corr_single(i1, ar1, dim, data1)
16 corr = np.zeros((1, dim))
17 for i2, ar2 in enumerate(data1):
---> 18 corr[:, i2 + i1] = pearsonr(ar1, ar2)[0]
19 return corr
20

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/scipy/stats/stats.py in pearsonr(x, y)
4043 # scipy.linalg.norm(xm) does not overflow if xm is, for example,
4044 # [-5e210, 5e210, 3e200, -3e200]
-> 4045 normxm = linalg.norm(xm)
4046 normym = linalg.norm(ym)
4047

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/scipy/linalg/misc.py in norm(a, ord, axis, keepdims, check_finite)
143 # Differs from numpy only in non-finite handling and the use of blas.
144 if check_finite:
--> 145 a = np.asarray_chkfinite(a)
146 else:
147 a = np.asarray(a)

~/anaconda3/envs/cytoself/lib/python3.7/site-packages/numpy/lib/function_base.py in asarray_chkfinite(a, dtype, order)
487 if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
488 raise ValueError(
--> 489 "array must not contain infs or NaNs")
490 return a
491

ValueError: array must not contain infs or NaNs
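
The last frame shows that the input passed to pearsonr contained NaN or inf values. Below is a minimal sketch for locating such rows before computing correlations. It assumes NumPy and uses a made-up array in place of the real data; find_bad_rows is a hypothetical helper, not part of cytoself. It also flags constant rows, since a row with zero variance has no defined Pearson correlation.

```python
import numpy as np

def find_bad_rows(data):
    """Return indices of rows with non-finite values and of constant rows."""
    data = np.asarray(data, dtype=float)
    nonfinite = ~np.isfinite(data).all(axis=1)
    # A peak-to-peak range of zero means every value in the row is identical.
    constant = np.zeros(len(data), dtype=bool)
    constant[~nonfinite] = np.ptp(data[~nonfinite], axis=1) == 0
    return np.flatnonzero(nonfinite), np.flatnonzero(constant)

# Made-up example: row 1 has a NaN, row 2 is constant, row 3 has an inf.
example = np.array([
    [0.1, 0.4, 0.3],
    [0.2, np.nan, 0.5],
    [1.0, 1.0, 1.0],
    [0.7, np.inf, 0.1],
])
nonfinite_rows, constant_rows = find_bad_rows(example)
```

Running a check like this on each array handed to the correlation step would show whether the non-finite values come from the data itself or from an earlier processing stage.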

ModuleNotFoundError: No module named 'cytoself.datamanager'

Dear Author of cytoself,

I got an error when running the data preparation step in cytoself, as shown in the file attached to this email. Could you please suggest how to solve this problem?
Thank you for your attention.

Best regards,
Dedy

UMap not looking good

Hi @li-li-github - I was away for a while, but I'm getting back to this now. I closed my other issues to keep things clean, but thanks for your help there!

I've been trying to retrain cytoself on the full data, and I can get to pretty good looking images, but when I run umap I don't see any structure. For example

and

I've trained for ~17 epochs and reach a reconstruction loss of around 0.05 for reconstructions_loss1.

I'm using my own trainer class and might have slightly different hyperparameters, but things look much worse than I would expect if I had things implemented right. Do you have target numbers for other training losses? Or any other ideas on what might be going wrong here. If easier to hop on a zoom call I can talk some time too.

Thanks!!!

How to normalize the nuclear distance channel?

Hi, great work!

The image data in the image_data0x.npy files has an un-normalized nuclear distance channel. It would be nice if you could describe how to normalize that channel to the range -1 to 1.

Thanks.
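
One generic possibility is a linear min-max rescaling. This is an assumption and not necessarily the preprocessing the authors used. The sketch below assumes NumPy, a hypothetical function name, and made-up distance values rather than the real image_data files.

```python
import numpy as np

def rescale_to_unit_range(channel):
    """Linearly map an array so its minimum becomes -1 and its maximum +1."""
    channel = np.asarray(channel, dtype=float)
    lo, hi = channel.min(), channel.max()
    if hi == lo:
        # A constant channel has no range to stretch; map it to zeros.
        return np.zeros_like(channel)
    return 2.0 * (channel - lo) / (hi - lo) - 1.0

distance = np.array([[-12.0, 0.0], [6.0, 20.0]])  # made-up distance values
scaled = rescale_to_unit_range(distance)
```

Whether the minimum and maximum should be taken per image or over the whole dataset is a separate choice that this sketch does not settle; as written it uses whatever array it is given.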

Packages missing + embedding_dim is required for vq_args

I ran into two errors while trying to run the sample analysis. I had the same issue as Dedy (#26) but fixed this by downloading the cytoself folder from the github, then installing the additional dependencies. I now have this error while trying to run CytoselfFullTrainer:

Suggest setting the exact package version for torch in requirements.txt

I followed the install instructions in the README and tried running example_scripts/simple_example.py, but I got an error to do with the torch.nn.Upsample layer (sorry, I forgot to save the error).

I was able to fix it by downgrading torch from >2.0 to 1.13.1. So maybe some issue was introduced when torch 2.0 launched.

I don't know if this is affecting others, but you might be able to avoid the issue by setting the exact torch version that you used in the requirements.txt.

PicklingError: Can't pickle <function DataManagerOpenCell.<lambda>

Dear Author,

I tried reproducing the steps from the README file, but I received an error.
At step 2, the line
trainer.fit(datamanager, tensorboard_path='tb_logs')
throws the following error:

PicklingError: Can't pickle <function DataManagerOpenCell.<lambda> at 0x000001070AC03550>: attribute lookup DataManagerOpenCell.<lambda> on cytoself.datamanager.opencell failed.

I get the following Traceback:

PicklingError Traceback (most recent call last)
Cell In[16], line 20
12 train_args = {
13 'lr': 1e-3,
14 'max_epoch': 1,
(...)
17 'earlystop_patience': 6,
18 }
19 trainer = CytoselfFullTrainer(train_args, homepath='demo_output', model_args=model_args)
---> 20 trainer.fit(datamanager, tensorboard_path='tb_logs')

File D:\MA\CytoSelf\cytoself-main\cytoself-main\cytoself\trainer\basetrainer.py:427, in BaseTrainer.fit(self, datamanager, initial_epoch, tensorboard_path, **kwargs)
425 # Train the model
426 self.model.train(True)
--> 427 train_metrics = self.run_one_epoch(datamanager, 'train', **kwargs)
428 self.model.train(False)
430 # Validate the model

File D:\MA\CytoSelf\cytoself-main\cytoself-main\cytoself\trainer\vqvae_trainer.py:192, in VQVAETrainer.run_one_epoch(self, datamanager, phase, **kwargs)
190 raise ValueError('phase only accepts train, val or test.')
191 _metrics = []
--> 192 for _batch in tqdm(data_loader, desc=f'{phase.capitalize():>5}'):
193 loss = self.run_one_batch(
194 _batch, var, zero_grad=is_train, backward=is_train, optimize=is_train, **kwargs
195 )
196 _metrics.append(loss)

File D:\Programme\Miniconda\envs\cytoself\lib\site-packages\tqdm\std.py:1182, in tqdm.__iter__(self)
1179 time = self._time
1181 try:
-> 1182 for obj in iterable:
1183 yield obj
1184 # Update and possibly print the progressbar.
1185 # Note: does not call self.update(1) for speed optimisation.

File D:\Programme\Miniconda\envs\cytoself\lib\site-packages\torch\utils\data\dataloader.py:438, in DataLoader.__iter__(self)
436 return self._iterator
437 else:
--> 438 return self._get_iterator()

File D:\Programme\Miniconda\envs\cytoself\lib\site-packages\torch\utils\data\dataloader.py:386, in DataLoader._get_iterator(self)
384 else:
385 self.check_worker_number_rationality()
--> 386 return _MultiProcessingDataLoaderIter(self)

File D:\Programme\Miniconda\envs\cytoself\lib\site-packages\torch\utils\data\dataloader.py:1039, in _MultiProcessingDataLoaderIter.__init__(self, loader)
1032 w.daemon = True
1033 # NB: Process.start() actually take some time as it needs to
1034 # start a process and pass the arguments over via a pipe.
1035 # Therefore, we only add a worker to self._workers list after
1036 # it started, so that we do not call .join() if program dies
1037 # before it starts, and __del__ tries to join but will get:
1038 # AssertionError: can only join a started process.
-> 1039 w.start()
1040 self._index_queues.append(index_queue)
1041 self._workers.append(w)

File D:\Programme\Miniconda\envs\cytoself\lib\multiprocessing\process.py:121, in BaseProcess.start(self)
118 assert not _current_process._config.get('daemon'),
119 'daemonic processes are not allowed to have children'
120 _cleanup()
--> 121 self._popen = self._Popen(self)
122 self._sentinel = self._popen.sentinel
123 # Avoid a refcycle if the target function holds an indirect
124 # reference to the process object (see bpo-30775)

File D:\Programme\Miniconda\envs\cytoself\lib\multiprocessing\context.py:224, in Process._Popen(process_obj)
222 @staticmethod
223 def _Popen(process_obj):
--> 224 return _default_context.get_context().Process._Popen(process_obj)

File D:\Programme\Miniconda\envs\cytoself\lib\multiprocessing\context.py:327, in SpawnProcess._Popen(process_obj)
324 @staticmethod
325 def _Popen(process_obj):
326 from .popen_spawn_win32 import Popen
--> 327 return Popen(process_obj)

File D:\Programme\Miniconda\envs\cytoself\lib\multiprocessing\popen_spawn_win32.py:93, in Popen.__init__(self, process_obj)
91 try:
92 reduction.dump(prep_data, to_child)
---> 93 reduction.dump(process_obj, to_child)
94 finally:
95 set_spawning_popen(None)

File D:\Programme\Miniconda\envs\cytoself\lib\multiprocessing\reduction.py:60, in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)

PicklingError: Can't pickle <function DataManagerOpenCell.<lambda> at 0x000001070AC03550>: attribute lookup DataManagerOpenCell.<lambda> on cytoself.datamanager.opencell failed

Thank You for Your time,
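
The traceback ends in popen_spawn_win32.py, where the DataLoader worker process is pickled before it starts, and the object that fails is a lambda. A minimal standard-library sketch of that behaviour follows; is_picklable and the two example callables are hypothetical names, and nothing here is cytoself code.

```python
import functools
import operator
import pickle

def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

# A lambda has no importable name, so pickle cannot look it up by reference.
double_lambda = lambda x: 2 * x

# A partial built from a named, importable function pickles without trouble.
double_named = functools.partial(operator.mul, 2)

lambda_ok = is_picklable(double_lambda)  # False
named_ok = is_picklable(double_named)    # True
```

In general PyTorch code, the usual ways around this are to replace the lambda with a named module-level function, or to load data in the main process by creating the DataLoader with num_workers=0. Whether cytoself exposes an option for the latter is not confirmed here.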

Installation

Hello, when attempting to use the demo .py file I get an error that the modules in the following imports do not exist:
"
from cytoself.data_loader.data_manager import DataManager
from cytoself.models import CytoselfFullModel
from cytoself.data_loader.data_generator import image_and_label_generator
from cytoself.analysis.analytics import Analytics"

Is there any initial file of the manually selected spectrum index?

I need to align my own protein images, but I cannot get the same manually picked index as in the original paper.
How can I run my own prediction with the model you supply in the example path?

Pickling error running CytoselfFullTrainer

Hi, I've installed cytoself via cloning from this repo / dev instructions and tried running the example scripts but ran into this error.

Nuclear segmentation mask for the original resolution images

Hi!

Thanks for sharing the preprocessed data. I was wondering if you also have the segmentation for the images at their original resolution instead of these downsampled ones. That would be very helpful.

Alternatively, would it be possible to know where in the original images these 100x100 crops are located, e.g. the position of the top-left corner in the raw images?

Thanks!

Early stopping does not count consecutive steps on pytorch branch

Hey Hiro,

The paper says this about early stopping:

training was terminated when the validation loss did not improve for more than 12 consecutive epochs.

Here is the relevant code in cytoself/trainer/basetrainer.py

  if _vloss < best_vloss:
      best_vloss = _vloss
      self.best_model = deepcopy(self.model)
      # Save the best model checkpoint
      self.save_checkpoint()
  else:
      count_lr_no_improve += 1
      count_early_stop += 1

I believe you need to add count_early_stop=0 in the if branch to reset the counter.
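
To illustrate the difference, here is a minimal standalone sketch of patience-based early stopping with and without the reset. It is not the trainer's code: the function name, the reset flag, and the loss values are made up, and the "more than patience" comparison is taken from the sentence in the paper rather than from the repository.

```python
def epochs_until_stop(val_losses, patience, reset=True):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    no_improve = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            if reset:
                no_improve = 0  # an improvement breaks the streak of bad epochs
        else:
            no_improve += 1
        if no_improve > patience:
            return epoch
    return None

# Two bad epochs, an improvement, then two more bad epochs.
losses = [1.0, 1.1, 1.2, 0.9, 1.0, 1.0]
stop_with_reset = epochs_until_stop(losses, patience=2)
stop_without_reset = epochs_until_stop(losses, patience=2, reset=False)
```

With the reset, the longest streak of non-improving epochs is two, so training continues. Without it, the counter accumulates across the improvement at epoch 4 and training stops at epoch 5, which counts total rather than consecutive non-improving epochs.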

The difference between tensorflow and pytorch version

In the TensorFlow version, the decoder takes qtvec as its input.
In the PyTorch version, however, it takes encoded as its input instead of qtvec.

Could you please explain this difference?

Setup of full .npy dataset?

Hi! I wanted to ask about the setup of the full npy dataset and how each image relates to the labels. There are ~9-10 images per label. Are they ordered to match the labels (e.g. the first 9 images correspond to the first label), or is there some other structure?

Thanks so much!

Code about Nucleus Segmentation

Hi, very nice job!

Could you point me to the code for the nucleus segmentation mentioned in your NM article? I can't find it in this repository.

Thanks.

Loss weights for optimal training

Hi,

I had a question about the optimal weighting of the losses during training.

I notice that initially if you weigh the fc loss, vq_loss, and reconstruction loss equally, the quantizer is not trained well enough to provide meaningful outputs, and the network never learns a good codebook because it is biased too much towards prediction. Reducing the prediction loss, on the other hand, makes the network learn a representation which is good for reconstruction/quantization but not prediction. I am unable to find a good balance.

Was the weighting of the losses modulated during training? What loss weights were optimal for training?
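
For concreteness, the kind of weighting being asked about can be written as a weighted sum of the individual loss terms. The sketch below is an assumption made for illustration only: the function name, the dictionary keys, and all numbers are made up and do not come from cytoself's configuration.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms; every loss needs a matching weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())

# Made-up loss values and weights, purely for illustration.
losses = {"fc": 2.0, "vq": 0.5, "reconstruction": 0.1}
equal = total_loss(losses, {"fc": 1.0, "vq": 1.0, "reconstruction": 1.0})
down_weighted_fc = total_loss(losses, {"fc": 0.1, "vq": 1.0, "reconstruction": 1.0})
```

With equal weights the classification term dominates the sum in this toy example (2.0 out of 2.6), which is the imbalance described above. Lowering its weight shifts the balance toward the quantization and reconstruction terms.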

train!

Hi!
I have a few problems with training.
I tried to obtain the final model parameters myself when reproducing the code, but the results were terrible. So I would like to know how many epochs you trained for to get the final result.
Also, the amount of data for this work is too large for me, so can I train only on proteins that have a single first-level localization annotation? This would make my attempt much easier.

Google drive links inaccessible/invalid for examples/simple_example

Hello,

I have set up a conda environment according to the instructions in the README, and I receive the following message in my terminal when I run python examples/simple_example.py:

Downloading data...
Access denied with the following error:

 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=1gkiEMKdadOel4Xh6KoS2U603JTkZhgDw 

Downloading...
From: https://drive.google.com/uc?id=16-0bhKrUMbZ0DSz768Z_q13yNivHyfVO
To: /my/computer/Desktop/cytoself/example_label.npy
100%|████████████████████████████████████████████████████████████████| 51.3k/51.3k [00:00<00:00, 15.8MB/s]
Access denied with the following error:

 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=1znRLbYJJqd11Zqv-5_yUmNjarKcwIWMg 

Downloading...
From: https://drive.google.com/uc?id=1RM654Qavcy8gG5uy3mCzi8EsOT_xOlVd
To: /my/computer/Desktop/cytoself/protein_uniloc.csv
100%|████████████████████████████████████████████████████████████████| 8.57k/8.57k [00:00<00:00, 28.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1WrxhGsSzivZVAlL_K2FLVsRmHrsfhyrI
To: /my/computer/Desktop/cytoself/dgram_index1.npy
100%|████████████████████████████████████████████████████████████████| 16.5k/16.5k [00:00<00:00, 21.7MB/s]
Loading data...
Traceback (most recent call last):
  File "examples/simple_example.py", line 49, in <module>
    image_data = np.load('example_image.npy')
  File "/my/computer/anaconda3/envs/cytoself/lib/python3.7/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'example_image.npy'

When I try the google drive links in-browser (as suggested) they lead to files with different names. Any tips on how to resolve this?

Thanks,
Jeff