matchms / matchms
Python library for processing (tandem) mass spectrometry data and for computing spectral similarities.
License: Apache License 2.0
Since recently I handle MS data from mgf as well as json files (both coming from GNPS). Unfortunately the two formats come with conflicting metadata fields. To a large extent I think it will be fine to simply parse the fields accordingly when loading data (PR matchms/matchms-backup#207). But this doesn't work for the precursor mass: the mgf files come with a pepmass field holding two values, [mz, intensity], while the json files come with a precursor_mz field.

Create a tutorial: Jupyter notebook with a lightweight dataset
see #33 (comment)
matchms/matchms/filtering/reduce_to_number_of_peaks.py
Lines 32 to 35 in 58c45ed
I tried using save_as_mgf on the actual data.
The good part: it seems to work even for large lists of spectrums.
The bad part: it is pretty slow. Writing the entire dataset (147,000 spectra) to an mgf file of about 1 GB took >20 minutes. I slightly changed save_as_mgf (#152) but that didn't improve things much, so it is most likely due to the way pyteomics writes the mgf file.
For comparison I ran a quickly written json function, which was at least 5 times faster:
```python
import json


def save_as_json(spectrums, filename):
    """Save spectrum(s) as json file.

    Args:
    ----
    spectrums: list of Spectrum() objects, Spectrum() object
        Expected input are matchms.Spectrum() objects.
    filename: str
        Provide filename to save spectrum(s).
    """
    if not isinstance(spectrums, list):
        # Assume that input was single Spectrum
        spectrums = [spectrums]

    # Convert matchms.Spectrum() into dictionaries
    spectrum_dicts = []
    for spectrum in spectrums:
        spec = spectrum.clone()
        spectrum_dict = {"intensities": spec.peaks.intensities.tolist(),
                         "mz": spec.peaks.mz.tolist(),
                         "metadata": spec.metadata}
        spectrum_dicts.append(spectrum_dict)

    # Write to json file
    with open(filename, 'w') as fout:
        json.dump(spectrum_dicts, fout)
```
Question: Does it make sense to add such a function?
We still need the save_as_mgf function because that is how the data can be shared within the community. But to create save points, store data during the processing pipeline, etc., one could switch to a different (better?) format.
By the way: json was just what first crossed my mind. It could of course be any other suitable file format!
Documentation was improved in #53, but https://matchms.readthedocs.io/en/latest/ has not been updated.
In PR #45 half of the tests fail because isort complains about "Imports are incorrectly sorted."
Even after running isort on one of the named files (add_fingerprint()), the error didn't disappear.
I am now not sure whether something is going wrong with the isort test, or whether imports must be in one very particular order and format (in which case that should be specified to avoid endless trial and error).
matchms/matchms/old/ms_functions.py
Lines 74 to 75 in 61602e7
We should consider splitting off matchms/similarity/spec2vec/* into its own repo. If we choose to do so, we may want to use the same method I used for splitting off notebooks and old-iomega-spec2vec.
Also, we should decide which repository (if any) should be a fork of iomega/spec2vec.
In the spec2vec project we compare different spectral similarities to the similarity between the respective molecular fingerprints (as a kind of "ground truth").
Those fingerprints are calculated from the actual structure of a molecule, so they need SMILES (or InChI) strings as input.
There are many different flavors of molecular fingerprints. We implemented some key types using rdkit in the old spec2vec code. That should be refactored and added to matchms, I think.
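The fingerprint-based "ground truth" comparison ultimately boils down to a similarity between two bit vectors. A minimal numpy sketch of a Tanimoto/Jaccard score, assuming the fingerprints have already been computed as boolean arrays (the function name here is ours, not matchms API):

```python
import numpy as np

def tanimoto_similarity(fp1, fp2):
    """Jaccard/Tanimoto score between two boolean fingerprint vectors."""
    fp1 = np.asarray(fp1, dtype=bool)
    fp2 = np.asarray(fp2, dtype=bool)
    union = np.count_nonzero(fp1 | fp2)
    if union == 0:
        return 0.0
    return np.count_nonzero(fp1 & fp2) / union

# two toy 8-bit fingerprints sharing 2 of 4 set bits in total
print(tanimoto_similarity([1, 0, 1, 0, 1, 0, 0, 0],
                          [1, 0, 0, 1, 1, 0, 0, 0]))  # -> 0.5
```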
The current implementation of calc_scores.calculate is deliberately naive while we are sorting out the refactored skeleton of matchms:
Lines 18 to 22 in 672cbbc
However, once we sort that out, we should start thinking about increasing performance through parallel execution of parts of the code.
There are various ways that can help us, such as:
However, I'd like to emphasize that we should first do a performance analysis (make your suggestions for tools and procedures in the comments below).
Regardless of the type of parallelization, we'll likely need some kind of chunking ability. We could opt to add a property mask to the Scores object, which is exactly equal in size to Scores.scores but of type Boolean. mask could then be used to slice the larger Scores.scores matrix into chunks, which can be sent off to different processes or machines. I think multithreading might not need this as the memory is already shared (needs confirmation).
If we do implement a mask and parallelization, it's probably convenient to also implement Scores.__add__(self, other), which could then be used to recombine partially evaluated instances of Scores according to their masks (with checks on equality of Scores.reference_spectrums and Scores.measured_spectrums).
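A minimal numpy sketch of the mask-plus-recombine idea (the ScoresChunk class and its fields are hypothetical, not matchms API):

```python
import numpy as np

class ScoresChunk:
    """Hypothetical partial Scores: a full-size score matrix plus a boolean
    mask marking which entries this chunk actually evaluated."""
    def __init__(self, scores, mask):
        self.scores = scores
        self.mask = mask

    def __add__(self, other):
        # recombine two partially evaluated chunks; masks must not overlap
        assert not np.any(self.mask & other.mask)
        combined = np.where(self.mask, self.scores, other.scores)
        return ScoresChunk(combined, self.mask | other.mask)

# split a 4x4 score matrix into upper/lower halves and recombine
full = np.arange(16, dtype=float).reshape(4, 4)
upper = np.zeros((4, 4), dtype=bool)
upper[:2] = True
a = ScoresChunk(np.where(upper, full, 0.0), upper)
b = ScoresChunk(np.where(~upper, full, 0.0), ~upper)
assert np.array_equal((a + b).scores, full)
```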
They both define and implement the functions get_peaks_arrays() and get_matching_pairs(), and share the same line of code to normalize the score at the end of the calculation.
To solve this duplication we could:
To get a quick feeling for what the library is about, I would like to see a code snippet showing how to use the library.
We could save memory by only storing scores that match a user-provided condition.
For problems such as library grouping, Scores.scores is symmetric (for most similarity functions). We could add an option is_symmetric, or triu_only (upper triangular part only) and tril_only (lower triangular part only), for a roughly 2x performance gain.
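The triangular idea can be sketched in plain numpy (not matchms code): evaluate only i <= j and mirror the upper triangle, which halves the number of pairwise calls for a symmetric score:

```python
import numpy as np

def symmetric_scores(spectrums, pairwise_score):
    """Evaluate a symmetric similarity only for i <= j and mirror the
    upper triangle into the lower one."""
    n = len(spectrums)
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            scores[i, j] = pairwise_score(spectrums[i], spectrums[j])
    # mirror the strict upper triangle into the lower triangle
    return scores + np.triu(scores, k=1).T

# toy "similarity": negative absolute difference of two numbers
toy = symmetric_scores([1.0, 2.0, 4.0], lambda a, b: -abs(a - b))
assert np.array_equal(toy, toy.T)
```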
Actual implementation on GNPS as a workflow. GNPS is in Conda and has an API.
Due to the changed workflow we no longer need an exponential fit to the peak intensity histogram. The current workflow switched to using the reduce_to_number_of_peaks filter.
(I think so)
The install instructions on https://anaconda.org/nlesc/matchms don't work.
When trying conda install -c nlesc matchms, it is unable to find the dependencies.
The install command conda install --channel nlesc --channel bioconda --channel conda-forge matchms
in the README.md does work.
We should somehow make sure the install instructions on https://anaconda.org/nlesc/matchms work or are corrected in the description.
Refs matchms/matchms-backup#189
Correction should be such that it passes
matchms/.github/workflows/build.yml
Line 43 in 61365e1
without problems. This may simply be a matter of adding an alias to setup.cfg containing the line above minus the --check-only and --diff parts.
Contact GNPS site maintainers via PI: Justin van der Hooft (Wageningen)
For GNPS see:
https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp
https://ccms-ucsd.github.io/GNPSDocumentation/
For users who don't like Python we could create KNIME nodes for matchms which can be used in a KNIME workflow.
The workflow can be based on the integration tests.
The first prototype could use Python nodes with code snippets written by us.
Later we could write proper nodes with the archetype
The signature of CosineGreedy.__call__ is
def __call__(self, spectrum: SpectrumType, reference_spectrum: SpectrumType) -> float
But it should be
def __call__(self, spectrum: SpectrumType, reference_spectrum: SpectrumType) -> Tuple[float, int]
So far we implemented the cosine score. The other main score we work with in the iomega project is called "modified cosine score" and in addition takes into account mass-shifts of peaks due to different precursor-m/z.
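The core extra ingredient can be sketched in a few lines (illustrative only, not the matchms implementation): peaks may match either directly or after shifting by the precursor-m/z difference:

```python
import numpy as np

def matching_pairs(mz1, mz2, tolerance=0.2, shift=0.0):
    """Indices (i, j) where mz1[i] matches mz2[j] + shift within tolerance."""
    return [(i, j)
            for i, m1 in enumerate(mz1)
            for j, m2 in enumerate(mz2)
            if abs(m1 - (m2 + shift)) <= tolerance]

mz_a = np.array([100.0, 150.0, 210.0])
mz_b = np.array([100.0, 140.0, 200.0])
precursor_shift = 10.0  # difference of the two precursor m/z values

direct = matching_pairs(mz_a, mz_b)
shifted = matching_pairs(mz_a, mz_b, shift=precursor_shift)
print(direct, shifted)  # -> [(0, 0)] [(1, 1), (2, 2)]
```

A modified cosine score would combine both sets of pairs before picking the best one-to-one assignment.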
Within the iOMEGA project I had to do the following a few times:
It would hence be nice to have a function like import_metadata() or import_metadata_json().
For instance something like this:

```python
from matchms.importing import import_metadata

import_metadata(spectrums, metadata_json,
                match_field_original='spectrumid',
                match_field_json='gnpsid',
                replace_fields=['inchi', 'smiles', 'inchikey'])
```

And for exporting it could be something like this:

```python
from matchms.exporting import export_metadata

export_metadata(spectrums, metadata_filename,
                fields_to_export=['name', 'spectrumid', 'inchi', 'smiles', 'inchikey'])
```
The data folder below should also be present in the conda package:
matchms/data
Any function that mentions:
is likely part of looking up what is known about a compound, so you can interpret a match with a measured spectrum. This is completely separate from the similarity calculation, and (optionally) happens after it. We should therefore move it to a new directory, /lookup, to be created in the root of the module.
See matchms/matchms-backup#228. Integration test coverage should be separate from unit test coverage.
The add_losses filter creates losses by subtracting peak m/z from the precursor-m/z.
When using it on actual data I ran into two issues. First, the filter should get from_mz and to_mz parameters, at least from_mz=0, because some spectra have peaks with m/z larger than the precursor-m/z. Second, the result depends on whether peak intensities were normalized first, so it would give very different results for running
```python
spectrums = [add_losses(s) for s in spectrums]
spectrums = [normalize_intensities(s) for s in spectrums]
```

than for

```python
spectrums = [normalize_intensities(s) for s in spectrums]
spectrums = [add_losses(s) for s in spectrums]
```
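A minimal sketch of what such a losses filter computes (not the matchms implementation), including the proposed from_mz/to_mz bounds:

```python
import numpy as np

def compute_losses(mz, precursor_mz, from_mz=0.0, to_mz=None):
    """Losses = precursor_mz - peak m/z, keeping only values in [from_mz, to_mz]."""
    if to_mz is None:
        to_mz = precursor_mz
    losses = precursor_mz - np.asarray(mz, dtype=float)
    return losses[(losses >= from_mz) & (losses <= to_mz)]

# a peak above the precursor-m/z (460 > 445) would give a negative loss
# and is dropped by from_mz=0
print(compute_losses([100.0, 150.0, 460.0], precursor_mz=445.0))  # -> [345. 295.]
```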
As an alternative to #60, we could give the user the option to only store the top x results and discard the rest.
For the similarity calculation it is crucial to remove excessive amounts of low-intensity peaks from spectra. Raw spectra vary widely in their number of peaks (about 6,000 of the 147,000 have more than 1,000 peaks; the maximum is even >70,000). The main reason why this is a problem:
In the past, this has been done in different ways.
One of them was to fit an exponential function to the peak intensity distribution. This is already partly implemented in spectrum.plot() but would need adjustments.
Even better in my view, however, is to skip that for now and move to something simpler: removing low-intensity peaks until a desired maximum number of peaks is reached.
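The simpler approach can be sketched in a few lines of numpy (illustrative, not the matchms filter):

```python
import numpy as np

def reduce_to_n_peaks(mz, intensities, n_max):
    """Keep only the n_max highest-intensity peaks, preserving m/z order."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    if len(mz) <= n_max:
        return mz, intensities
    # indices of the top-n intensities, sorted back into m/z order
    keep = np.sort(np.argsort(intensities)[-n_max:])
    return mz[keep], intensities[keep]

mz, intensities = reduce_to_n_peaks([100., 150., 200., 250.],
                                    [0.1, 0.9, 0.05, 0.4], n_max=2)
print(mz)  # -> [150. 250.]
```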
old-structure/ms_similarity_classical.py:cosine_score_hungarian() is a classical similarity measure function; we should port it to cosine_score_hungarian.py under /similarity/.
either by casting at the end, or by using a different type for its input arguments
I tried to use spikes[i], but got an unexpected result.

```python
import numpy as np
from matchms import Spectrum

spectrum = Spectrum(mz=np.array([100, 150, 200.]),
                    intensities=np.array([0.7, 0.2, 0.1]),
                    metadata={'id': 'spectrum1'})
print(spectrum.peaks[0])
```
I expected
(100.0, 0.7)
But I got
[100. 150. 200.]
Although they are covered by unit tests, sonarcloud does not recognize any test coverage of our numba-based functions (e.g. collect_peak_pairs).
Actual spectra can have widely varying numbers of peaks (some with many 1,000s or 10,000s).
The current CosineGreedy implementation is based on numpy vector operations. Unfortunately, that doesn't scale well with high numbers of peaks. When calculating the cosine similarity between spectrum1 and spectrum2 (with n1 and n2 peaks), the current CosineGreedy has to calculate a matrix multiplication of two n1*n2-sized matrices.
For spectra with 1,000s of peaks that explodes in terms of compute (and memory).
We should hence shift to the numba-based implementation which was added in PR matchms/matchms-backup#239.
Here is a quick comparison:

```python
import time

import numpy
from matchms import Scores, Spectrum
from matchms.similarity import CosineGreedy, CosineGreedyNumba

n_peaks = 2000
n_spectrums = 10
test_spectrum = Spectrum(mz=numpy.linspace(10, 1000, n_peaks),
                         intensities=numpy.ones((n_peaks)),
                         metadata={})

# current numpy vector based implementation
cosine_greedy = CosineGreedy(tolerance=0.01)
tstart = time.time()
similarity_matrix = Scores(n_spectrums*[test_spectrum], n_spectrums*[test_spectrum],
                           cosine_greedy).calculate().scores
tend = time.time()
print("execution time:", tend - tstart)

# numba based implementation
cosine_greedy = CosineGreedyNumba(tolerance=0.01)
tstart = time.time()
similarity_matrix = Scores(n_spectrums*[test_spectrum], n_spectrums*[test_spectrum],
                           cosine_greedy).calculate().scores
tend = time.time()
print("execution time:", tend - tstart)
```
For the above code I got:
execution time: 22.43099617958069
execution time: 0.5635242462158203
When it comes to finding bottlenecks regarding memory and compute, I strongly expect that the similarity score calculations will be the most important issue.
The current implementation of the cosine score calculation, CosineGreedy, already makes use of vectorized numpy calculations. But at least in a quick check I found that a former numba-based implementation still seems to be faster (a bit more than 2-fold?).
Depending on the number of epochs, the number of documents, and the size of the documents, model training can take a while. In my view, some type of logger is really needed to see that the process is working fine. For now I just used the following:
```python
from gensim.models.callbacks import CallbackAny2Vec


class EpochLogger(CallbackAny2Vec):
    """Callback to log information about training progress.

    Used to keep track of gensim model training."""
    def __init__(self, num_of_epochs):
        self.epoch = 0
        self.num_of_epochs = num_of_epochs
        self.loss = 0

    def on_epoch_end(self, model):
        """Report progress of model training."""
        # gensim reports the cumulative loss, so subtract the previous value
        loss = model.get_latest_training_loss()
        print('\r',
              ' Epoch ' + str(self.epoch + 1) + ' of ' + str(self.num_of_epochs) + '.',
              end="")
        print('Change in loss after epoch {}: {}'.format(self.epoch + 1, loss - self.loss))
        self.epoch += 1
        self.loss = loss
```
Maybe we can add something along those lines to spec2vec such that people could simply import it.
```python
from spec2vec import EpochLogger
...
```
I think we can do with just 1.
matchms/.github/workflows/conda_build.yml
Lines 47 to 53 in 106b4cb
In my code example (#41) I would like to link to documentation of methods. Sadly the docs do not show the methods because they do not have doc strings.
For
My former pre-processing of the data included a step where I tried to "repair" incorrect InChIs or SMILES by finding a matching compound via PubChem.
This is not a very simple task, because one needs to define what a good-enough match is (it used to be a very lengthy function, find_pubchem_match, which was part of ms_functions.py in the old code).
While I consider this something nice to have, it first should be decided:
When handling data from GNPS json files (see matchms/matchms-backup#207), a number of new issues arise.
One of them is that GNPS json files do not have a name field, but instead already provide a (poorly) cleaned compound_name and an adduct field.
The current add_adduct function does not work on the new format. And later lookup operations rely on one specific field to retrieve the compound name (preferably in a cleaned version).
Currently several filters use print statements when they change metadata.
As Stefan mentioned (in PR #34, #34 (comment)), it would probably be cleaner to do this via logging.
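A minimal sketch of what such a switch could look like (the helper name and logger name are assumptions, not matchms code):

```python
import logging

logger = logging.getLogger("matchms")

def set_metadata_field(metadata, key, value):
    """Hypothetical helper: change a metadata dict and report the change
    via logging instead of print()."""
    old = metadata.get(key)
    metadata[key] = value
    logger.info("Changed %s from %r to %r.", key, old, value)
    return metadata

logging.basicConfig(level=logging.INFO)
meta = set_metadata_field({"name": "foo"}, "inchikey", "ABCDEF")
```

With this, users can silence or redirect filter messages through the standard logging configuration instead of having them always printed.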