
depthcharge's Introduction


Depthcharge is a deep learning toolkit for building Transformer models to analyze mass spectrometry data.

About

Many deep learning tools have been developed for the analysis of mass spectra or mass spectrometry analytes, like peptides and small molecules. However, each one has had to reinvent the wheel.

Depthcharge aims to provide a flexible, but opinionated, framework for rapidly prototyping deep learning models for mass spectrometry data. Think of Depthcharge as a set of building blocks to get you started on a new deep learning project focused on mass spectrometry data. Depthcharge delivers these building blocks as PyTorch modules, which can be readily assembled into customized deep learning models for your task.
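The building-block pattern can be illustrated with plain PyTorch stand-ins (the `PeakEncoder` and `SpectrumClassifier` classes below are toy modules invented for this sketch, not actual depthcharge classes):

```python
import torch
from torch import nn


class PeakEncoder(nn.Module):
    """Toy stand-in for a reusable spectrum-encoder building block."""

    def __init__(self, d_model=32):
        super().__init__()
        self.proj = nn.Linear(2, d_model)  # (m/z, intensity) -> d_model

    def forward(self, peaks):
        # peaks: (batch, n_peaks, 2)
        return self.proj(peaks)


class SpectrumClassifier(nn.Module):
    """Assemble the encoder block with a small task-specific head."""

    def __init__(self, d_model=32, n_classes=3):
        super().__init__()
        self.encoder = PeakEncoder(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, peaks):
        emb = self.encoder(peaks).mean(dim=1)  # pool over peaks
        return self.head(emb)


model = SpectrumClassifier()
logits = model(torch.randn(4, 100, 2))  # 4 spectra, 100 peaks each
print(logits.shape)  # torch.Size([4, 3])
```

The same composition idea applies when the stand-ins are replaced by depthcharge's actual encoder and transformer modules.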

To learn more, visit our documentation.

depthcharge's People

Contributors

alfred-n, bercestedincer, bittremieux, justin-a-sanders, melihyilmaz, wfondrie



depthcharge's Issues

Error in reversing peptides with PeptideTokenizer.detokenize

Hi all,

Detokenizing peptides with modifications (e.g., PEP+79.996) from a reversed tokenized tensor, say [1,2,3], does not return the expected result. In the current implementation (https://github.com/wfondrie/depthcharge/blob/main/depthcharge/tokenizers/peptides.py#L201), the tokens [1,2,3] are first detokenized and joined to P+79.996EP before being reversed to PE699.97+P, which is then returned. I think they should first be detokenized without joining (i.e., join=False in detokenize()), then reversed, and finally joined.

Best,
Daniela
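A minimal sketch of the proposed order of operations (plain Python, not the actual detokenize() implementation; the token strings are taken from the example above):

```python
# Detokenized token strings, as returned with join=False.
tokens = ["P+79.996", "E", "P"]

# Current (buggy) order: join first, then reverse the string,
# which mangles the modification's mass text.
buggy = "".join(tokens)[::-1]      # "PE699.97+P"

# Proposed order: reverse the token list, then join, so the
# modification stays attached to its residue.
fixed = "".join(reversed(tokens))  # "PEP+79.996"

print(buggy, fixed)
```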

Problem with AnnotatedSpectrumDataset and n_workers>0

Hi there,
I started to use depthcharge and I really like it :)
However, I have a bit of a problem with the AnnotatedSpectrumDataset class from the latest depthcharge version. I am testing it with the function from https://github.com/wfondrie/depthcharge/blob/main/tests/unit_tests/test_data/test_loaders.py#L47, which works well. However, whenever I increase the number of workers, it gets stuck and the code never terminates.
Have you encountered problems like this? Do you have any idea why this may happen?
I am using Python 3.10.12 on a Linux machine. Here are some of my package versions:

torch==2.1.0
pytorch-lightning==1.9.5
pylance==0.8.16
pyteomics==4.6.3

Thanks a lot in advance!
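One general PyTorch consideration with IterableDataset and num_workers > 0 (not necessarily the bug here) is that each worker process iterates the dataset independently unless the dataset shards its items via get_worker_info(). A generic sketch with a toy dataset:

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ShardedIterable(IterableDataset):
    """Toy IterableDataset that splits its items across loader workers.

    Without sharding like this, every worker yields every item; and
    datasets holding non-picklable state (e.g., open file handles) can
    misbehave or hang when num_workers > 0.
    """

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        if info is None:  # single-process data loading
            start, step = 0, 1
        else:  # each worker takes every num_workers-th item
            start, step = info.id, info.num_workers
        yield from range(start, self.n, step)


loader = DataLoader(ShardedIterable(8), batch_size=None, num_workers=2)
items = sorted(int(x) for x in loader)
print(items)  # every item appears exactly once across both workers
```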

Example code to read .mgf file

Hi, I would like to know how to read and preprocess an .mgf file using the package. Could you please help me by providing example code for that, whose output can then be passed to other package components, such as the encoders and transformers? Thank you!

Existing index doesn't know whether it's annotated

When trying to re-use an existing HDF5 index, I get the following error in hdf5.py on line 80:

AttributeError: 'AnnotatedSpectrumIndex' object has no attribute 'annotated'

Looking at the code, I don't know when the annotated attribute should be set. In fact, _handle is set to None on line 63, so I don't fully understand how this piece of code is supposed to work.

SpectrumDataset is an IterableDataset: how can I use shuffle in the DataLoader?

When I run this code:

import depthcharge as dc
import pyarrow as pa
from pyarrow import int32
from torch.utils.data import DataLoader

from depthcharge.data import CustomField, SpectrumDataset

mgf_file = ["20190118_Q2_MD_ColQ2-51_AlexanderBull_P15_Fluide_4microscans.mgf"]

parse_kwargs = {
    "progress": False,
    "preprocessing_fn": [
        dc.data.preprocessing.set_mz_range(min_mz=0),
        dc.data.preprocessing.filter_intensity(max_num_peaks=200),
        dc.data.preprocessing.scale_intensity(scaling="root"),
        dc.data.preprocessing.scale_to_unit_norm,
    ],
    "custom_fields": [
        # CustomField("Seq", lambda x: x["params"]["seq"], pa.string()),
        CustomField("RT", lambda x: x["params"]["rtinseconds"], pa.float64()),
        CustomField("charge", lambda x: x["params"]["charge"], pa.list_(int32())),
    ],
}

dataset = SpectrumDataset(mgf_file, batch_size=8, parse_kwargs=parse_kwargs)


loader = DataLoader(dataset, batch_size=None, shuffle=True)

for batch in loader:
    print(batch)

I get an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 38
     33 dataset = SpectrumDataset(mzml_file, batch_size=8, parse_kwargs=parse_kwargs)
     36 from torch.utils.data import DataLoader
---> 38 loader = DataLoader(dataset, batch_size=None,shuffle=True)
     40 for batch in loader:
     41     print(batch)

File ~/miniconda3/envs/ttt/lib/python3.11/site-packages/torch/utils/data/dataloader.py:308, in DataLoader.__init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context, generator, prefetch_factor, persistent_workers, pin_memory_device)
--> 308     raise ValueError(
    309         f"DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle={shuffle}")
    311 if sampler is not None:
    312     # See NOTE [ Custom Samplers and IterableDataset ]
    313     raise ValueError(
    314         f"DataLoader with IterableDataset: expected unspecified sampler option, but got sampler={sampler}")
    315 elif batch_sampler is not None:
    316     # See NOTE [ Custom Samplers and IterableDataset ]
    317     raise ValueError(
    318         "DataLoader with IterableDataset: expected unspecified "
    319         f"batch_sampler option, but got batch_sampler={batch_sampler}")

ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True

How can I shuffle the data when using a DataLoader?
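As the traceback shows, PyTorch forbids shuffle=True when the dataset is an IterableDataset, so shuffling has to happen inside the dataset itself. A common generic workaround (not depthcharge-specific; `ShuffleBuffer` is a hypothetical helper written for this sketch) is a bounded shuffle buffer wrapped around the iterable:

```python
import random

from torch.utils.data import DataLoader, IterableDataset


class ShuffleBuffer(IterableDataset):
    """Approximately shuffle an iterable via a bounded buffer."""

    def __init__(self, source, buffer_size=1024, seed=None):
        self.source = source
        self.buffer_size = buffer_size
        self.rng = random.Random(seed)

    def __iter__(self):
        buffer = []
        for item in self.source:
            buffer.append(item)
            if len(buffer) >= self.buffer_size:
                # Emit a random element once the buffer is full.
                yield buffer.pop(self.rng.randrange(len(buffer)))
        # Drain the remaining elements in random order.
        self.rng.shuffle(buffer)
        yield from buffer


# Toy source iterable standing in for a SpectrumDataset.
shuffled = ShuffleBuffer(range(10), buffer_size=4, seed=0)
loader = DataLoader(shuffled, batch_size=None)  # no shuffle= here
items = sorted(int(x) for x in loader)
print(items)  # every item appears exactly once, in shuffled order
```

A larger buffer_size gives a better approximation of a full shuffle at the cost of memory.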
