gleams's Introduction

GLEAMS

GLEAMS is a Learned Embedding for Annotating Mass Spectra. GLEAMS encodes mass spectra as vectors of features and feeds them to a neural network that embeds them into a 32-dimensional space in which spectra generated by the same peptide are close together. It then detects clusters of spectra generated by the same peptide.

The software is available as open-source under the BSD license.

If you use GLEAMS in your work, please cite the following publication:

  • Wout Bittremieux, Damon H. May, Jeffrey Bilmes, William Stafford Noble. A learned embedding for efficient joint analysis of millions of mass spectra. Nature Methods 19, 675–678 (2022). doi:10.1038/s41592-022-01496-1

Installation

GLEAMS requires Python 3.8, a Linux operating system, and a CUDA-enabled GPU.

  1. Create a Conda environment and install the necessary compiler tools and GPU runtime:
conda env create -f https://raw.githubusercontent.com/bittremieux/GLEAMS/master/environment.yml && conda activate gleams
  2. Install GLEAMS:
pip install git+https://github.com/bittremieux/GLEAMS.git

Using GLEAMS

For detailed usage information, see the command-line help messages:

gleams --help
gleams embed --help
gleams cluster --help

Spectrum embedding

GLEAMS provides the gleams embed command to convert MS/MS spectra in peak files to 32-dimensional embeddings. Example:

gleams embed *.mzML --embed_name GLEAMS_embed

This will read the MS/MS spectra from all matched mzML files and export the embeddings as a two-dimensional NumPy array of dimension n x 32 to the file GLEAMS_embed.npy, where n is the number of MS/MS spectra read from the mzML files. Additionally, a tabular file GLEAMS_embed.parquet will be created containing the corresponding metadata for the embedded spectra.
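
The two output files can then be loaded from Python for downstream analysis. Below is a minimal sketch, assuming NumPy and pandas (with a Parquet engine such as pyarrow) are installed and that the metadata rows correspond one-to-one to the rows of the embedding matrix:

import numpy as np
import pandas as pd

# Load the n x 32 embedding matrix produced by gleams embed.
embeddings = np.load("GLEAMS_embed.npy")
# Load the per-spectrum metadata accompanying the embeddings.
metadata = pd.read_parquet("GLEAMS_embed.parquet")

print(embeddings.shape)  # (n, 32)
print(metadata.head())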

Embedding clustering

After the MS/MS spectra have been converted to 32-dimensional embeddings, the gleams cluster command can be used to group spectra with similar embeddings. Example:

gleams cluster --embed_name GLEAMS_embed --cluster_name GLEAMS_cluster --distance_threshold 0.3

This will perform hierarchical clustering on the embeddings using the given distance threshold. The cluster label for each embedding will be written to the NumPy file GLEAMS_cluster.npy (-1 indicates noise; clusters contain at least 2 spectra). Additionally, a file GLEAMS_cluster_medoids.npy will be created containing the indexes of the cluster representative spectra (medoids).
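
As a minimal sketch (assuming NumPy), the clustering output can be inspected as follows:

import numpy as np

# Cluster label per embedding; -1 marks noise (spectra not assigned to a cluster).
cluster_labels = np.load("GLEAMS_cluster.npy")
# Indexes of the cluster representative (medoid) spectra.
medoid_idx = np.load("GLEAMS_cluster_medoids.npy")

n_clusters = np.unique(cluster_labels[cluster_labels != -1]).size
n_noise = int(np.sum(cluster_labels == -1))
print(f"{n_clusters} clusters, {n_noise} noise spectra")
# The medoid embeddings can be looked up in the embedding matrix:
medoid_embeddings = np.load("GLEAMS_embed.npy")[medoid_idx]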

Advanced usage

The full configuration of GLEAMS, including the settings used to train the neural network, can be modified in the gleams/config.py file.
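
For example, the active settings can be inspected from Python. This is only a sketch, assuming the configuration module is importable as gleams.config; model_filename is one setting known to be used when loading the network for embedding, while the full set of options is defined in gleams/config.py itself:

from gleams import config

# Path to the neural network weights used by gleams embed.
print(config.model_filename)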

Frequently Asked Questions

I get a "This repository is over its data quota" error when trying to install GLEAMS. Help!

You may get the following error when trying to install GLEAMS following the instructions above:

Warning This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

This happens when many people have downloaded GLEAMS recently, exhausting the Git LFS bandwidth used to distribute the model weights.

You can circumvent this error by cloning the repository and downloading the weights file manually:

  1. Create and activate a new Conda environment with the necessary dependencies, as described in step 1 above.
  2. Clone the GLEAMS repository while skipping the model weights:
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/bittremieux/GLEAMS.git
  3. Manually download the GLEAMS model weights from the latest release.
  4. Move the model weights file to the data/ directory in the cloned GLEAMS repository.
  5. From the root of the GLEAMS repository (e.g. cd GLEAMS), use pip to install GLEAMS:
pip install .

Where can I find the GLEAMS training data?

GLEAMS was trained on 30 million PSMs from the MassIVE-KB (v1) dataset. As this is a very large dataset, the spectra are not readily available as a single download.

To compile the full training dataset, go to MassIVE Knowledge Base > Human HCD Spectral Library > All Candidate library spectra > Download on the MassIVE website. This will give you a zipped TSV file with the metadata and peptide identifications for all 30 million PSMs. Using the file names (column "filename"), you can then retrieve the corresponding peak files from the MassIVE FTP server and extract the required spectra using their scan numbers (column "scan").
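
A hedged sketch of how the metadata could be grouped to plan the per-file downloads (assuming pandas; the TSV file name used here is hypothetical, and only the "filename" and "scan" columns are taken from the description above):

import pandas as pd

# Hypothetical name for the unzipped MassIVE-KB candidate library metadata TSV.
metadata = pd.read_csv("candidate_library_spectra.tsv", sep="\t")

# Group the required scan numbers per peak file so that each file only needs
# to be retrieved once from the MassIVE FTP server.
scans_per_file = metadata.groupby("filename")["scan"].apply(sorted)
for filename, scans in scans_per_file.items():
    print(filename, len(scans), "spectra to extract")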

These 30 million PSMs are a subset of all spectrum identifications that were obtained during compilation of the MassIVE-KB resource (at most the top 100 PSMs for each of 2.1 million unique precursors). All 185 million PSMs at 1% FDR obtained using MSGF+, as described by Wang et al., can be retrieved as multiple mzTab files from MassIVE. To do so, extract all unique search task identifiers in the "proteosafe_task" column from the previously downloaded metadata TSV file. Next, use these identifiers to compile URLs of the form https://proteomics2.ucsd.edu/ProteoSAFe/result.jsp?task=[ID]&view=view_result_list, replacing [ID] with each identifier. Finally, on each search task web page, click "Download" to obtain a zip file that contains all PSMs for that search task in mzTab format.
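
A short sketch of this URL compilation step (the TSV file name is again hypothetical; the "proteosafe_task" column and the URL template come from the description above):

import pandas as pd

metadata = pd.read_csv("candidate_library_spectra.tsv", sep="\t")

# Each search task identifier covers many PSMs, so deduplicate first.
task_ids = metadata["proteosafe_task"].unique()
for task_id in task_ids:
    print(f"https://proteomics2.ucsd.edu/ProteoSAFe/result.jsp"
          f"?task={task_id}&view=view_result_list")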

The full 669 million spectra that were processed using GLEAMS are all spectra in peak files listed in the MassIVE-KB metadata TSV file.

Contact

For more information you can visit the official code website or send an email to [email protected].

gleams's Issues

about MassIVE-KB datasets

Thanks for sharing this great work. I think GLEAMS is very helpful for processing mass spectra.
However, when checking the datasets used in GLEAMS, I cannot find how to download the 30 million high-quality PSMs of the MassIVE-KB spectral library and the 185 million PSMs of the identification results for the full MassIVE-KB dataset on the MassIVE website. Could you kindly help me find these data?
Looking forward to your reply.

Clustering MS/MS spectra from multiple species

Hello,

I have installed the GLEAMS tool in one of my conda environments. I am planning to examine MS/MS clusters specific to data from different species. However, the general gleams command only shows the embed and cluster options for mzML input files. It is not clear whether the embedding has to be done together for all the mzML files from all the species, or whether it can be done separately and the results clustered together at the "gleams cluster" step.

Please help

Thanks,
Chinmaya

OSError: SavedModel file does not exist at: /home/hp/miniconda3/envs/gleams/gleams/data/gleams_82c0124b.hdf5/{saved_model.pbtxt|saved_model.pb}

I was trying to execute gleams embed as provided in your instructions. However, I am getting the OSError below, saying that the saved model in .pbtxt or .pb format is not present at the given path.

2022-12-03 15:10:00,264 INFO [gleams/MainProcess] gleams.cli_embed : GLEAMS version 0.4.dev1+g8831ad6
2022-12-03 15:10:00,282 DEBUG [gleams/MainProcess] encoder.__init__ : Read the reference spectra
2022-12-03 15:10:00,907 DEBUG [gleams/MainProcess] encoder.__init__ : Select 500 valid reference spectra
2022-12-03 15:10:03,207 DEBUG [gleams/MainProcess] nn.embed : Load the stored GLEAMS neural network
2022-12-03 15:10:04,047 DEBUG [gleams/MainProcess] embedder.__init__ : Running the embedder model on 1 GPU(s)
Traceback (most recent call last):
  File "/home/hp/miniconda3/envs/gleams/bin/gleams", line 8, in <module>
    sys.exit(gleams())
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/gleams/gleams.py", line 97, in cli_embed
    nn.embed(metadata_filename, config.model_filename, f'{embed_name}.npy',
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/gleams/nn/nn.py", line 163, in embed
    emb.load()
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/gleams/nn/embedder.py", line 158, in load
    self.siamese_model = keras.models.load_model(self.filename)
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/tensorflow/python/keras/saving/save.py", line 186, in load_model
    loader_impl.parse_saved_model(filepath)
  File "/home/hp/miniconda3/envs/gleams/lib/python3.8/site-packages/tensorflow/python/saved_model/loader_impl.py", line 110, in parse_saved_model
    raise IOError("SavedModel file does not exist at: %s/{%s|%s}" %
OSError: SavedModel file does not exist at: /home/hp/miniconda3/envs/gleams/gleams/data/gleams_82c0124b.hdf5/{saved_model.pbtxt|saved_model.pb}

Is the error because a package is missing in the conda environment where GLEAMS has been installed?

--
Chinmaya

Some PSMs were missing

I used a reference MGF file with 1,000 PSMs, but it only output 968 embeddings. Could you please share information about the missing PSMs?

ValueError: invalid literal for int() with base 10

Hi, I am running GLEAMS on a set of MGF files using gleams embed *.mgf --embed_name GLEAMS_embed with 1 GPU.

The job starts:

2023-10-18 23:08:17,623 INFO [gleams/MainProcess] gleams.cli_embed : GLEAMS version 0.4.dev7+g13ebc74.d20231018
2023-10-18 23:08:17,672 DEBUG [gleams/MainProcess] encoder.__init__ : Read the reference spectra
2023-10-18 23:08:18,008 DEBUG [gleams/MainProcess] encoder.__init__ : Select 500 valid reference spectra
2023-10-18 23:08:19,146 DEBUG [gleams/MainProcess] nn.embed : Load the stored GLEAMS neural network
2023-10-18 23:08:19,200 DEBUG [gleams/MainProcess] embedder.__init__ : Running the embedder model on 1 GPU(s)
2023-10-18 23:08:19,660 INFO [gleams/MainProcess] nn.embed : Embed all peak files for metadata file /var/tmp/pbs.858407.hn-10-03/tmpy7q23kig/GLEAMS_embed.parquet
2023-10-18 23:08:19,662 INFO [gleams/MainProcess] nn.embed : Process dataset GLEAMS [ 1/ 1] (120 files)
2023-10-18 23:08:19,663 DEBUG [gleams/MainProcess] feature._peaks_to_features : Process file 202112249_TY_Nathaniel_1364_m2_3_FAIMS_OTIT_F1.mgf

but then encounters the following error:

Traceback (most recent call last):
  File "/data/petretto/home/e0470749/.conda/envs/gleams/bin/gleams", line 8, in <module>
    sys.exit(gleams())
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/gleams/gleams.py", line 97, in cli_embed
    nn.embed(metadata_filename, config.model_filename, f'{embed_name}.npy',
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/gleams/nn/nn.py", line 188, in embed
    for filename, file_scans, file_encodings in joblib.Parallel(
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/joblib/parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/gleams/feature/feature.py", line 68, in _peaks_to_features
    scans['scan'] = scans['scan'].astype(np.int64)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/pandas/core/generic.py", line 5815, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 418, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/data/petretto/home/e0470749/.conda/envs/gleams/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 327, in apply
