elucidatainc / elmaven

LC-MS data processing tool for large-scale metabolomics experiments.

Home Page: https://resources.elucidata.io/elmaven/

License: GNU General Public License v2.0

C++ 78.12% C 4.37% QMake 0.39% Shell 3.17% Python 0.72% Inno Setup 0.02% Makefile 8.79% M4 0.43% Objective-C++ 1.73% Objective-C 1.98% Assembly 0.17% Go 0.12% Batchfile 0.01%

c-plus-plus lc-ms lc-msms mass-spectrometry metabolomics qt

elmaven's People

arvindiyer reidme sahil21 shubhra-agrawal ipankaji coder5492 lparsons ky0016 thelittlewonder jainaman224 tanmaygoyal saifulbkhan biocodings shivamzaz mayankmtg paopaocui biterbilen naman calico ruohongw arnavvats jopark aspirincode podds sandhyads sarangab rish9511 markkun sakshikukreja14 vaniisgh jun-lizst smd105 jonatanfernandez saif-el mneinast xxing9703 wdglee alienzj andrewrosko metabolomicsaustralia-bioinformatics mshabdiz baopan0130 liuwuping duanshumeng ryan180711 liu5796796 akey7 nemtorcomplex2 samdecraemer ddlidded yinyinghao

elmaven's Issues

Deleting sample is not clearing all information.

This is a hypothesis based on what we observed while working on #5. After a sample is deleted, its memory is not released on Ubuntu, although it is released on Windows. This has to be checked more thoroughly.

Pick isotopic peaks even when parent abundance is zero

It's come up a few times now that we do a labeling experiment and end up with 100% labeling of one or more metabolites. In this case, when you're doing automated peak picking with a database, you won't pick the parent peak and therefore the labeled peak is lost. It would be kind of nice to be able to specify that maven should look for the isotopes even if the parent isn't there. It would probably have to be a compound-specific option I think -- otherwise, the number of potential peaks completely explodes.

shift-drag integration doesn't properly refresh bookmarked peaks table

Copied from Trello board

load some samples
pick a compound
shift-drag in the EIC window -- this should integrate the peak and all isotopes for the selected RT window.
the peak shows up in the peaks table but without the + icon to expand isotopes.
do this with another compound, and now the expansion is available for the first compound.

Internally, it's there immediately, because if you do export to csv, all isotopes will be present. It's just the table in the gui that's not properly being populated.

This bug is present in all versions of Maven that I've used. Double-clicking to bookmark based on the suggested peak does refresh the table properly.

by default, full scan doesn't show up in SRM list, but adjusting one of the Q1/Q3 values forces the window to refresh, and then it shows up

Replace pugixml, Eigen and libneural libraries with newer versions

All of these libraries are open source; replacing them with their newer releases would be a good thing.

deleted peaks causing crash

Bookmarking a peak and then deleting it seems to cause a whole host of issues.

Group ID of next peak is not set correctly (e.g. bookmark 1,2,3 then delete 2, next peak picked gets 3)
After peak is deleted, often get crash as you move between compounds.
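The group ID symptom above would be consistent with deriving the next ID from the current number of rows, though that cause is an assumption and has not been confirmed against the source. A minimal sketch of the alternative, a counter that never reuses an ID after a deletion, is shown below; BookmarkedGroup and BookmarkTable are hypothetical names, not El-MAVEN classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a bookmarked peak group; not the El-MAVEN class.
struct BookmarkedGroup {
    int groupId;
};

// Hypothetical table that never reuses an ID, even after deletions.
class BookmarkTable {
public:
    // Adds a group and returns its freshly assigned ID.
    int add() {
        groups_.push_back(BookmarkedGroup{nextId_});
        return nextId_++;
    }

    // Removes the group with the given ID; returns false if it is absent.
    bool remove(int groupId) {
        for (std::size_t i = 0; i < groups_.size(); ++i) {
            if (groups_[i].groupId == groupId) {
                groups_.erase(groups_.begin() + static_cast<std::ptrdiff_t>(i));
                return true;
            }
        }
        return false;
    }

    std::size_t size() const { return groups_.size(); }

private:
    std::vector<BookmarkedGroup> groups_;
    int nextId_ = 1;  // monotonically increasing, independent of size()
};
```

With this scheme, bookmarking 1, 2, 3 and deleting 2 gives the next peak ID 4 rather than a second 3.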

Crashing on opening on Windows 2008 server

Victor was able to install El Maven 0.2 on my laptop, and it’s mostly working. But when he installed it on the server (Windows Server 2008), it crashed immediately upon opening. Here are the two screenshots that he got during the crash.

Alignment

This is copied from trello.
Today I looked semi-seriously for the first time at the alignment code. There are actually a few things in there that make me very optimistic, among some that make me less so.

The alignment logic in peakdetector.cpp is completely broken and it's not at all surprising that it doesn't work at all. Groups have to be identified first, then doAlignment can be called on those groups. Also, some parameters are set at the end of the alignGroups function which probably break further processing. This is clearly code that was never finished or tested at all.
The basic structure of doAlignment() is fairly simple and very conducive to the changes I want to make. It's a fairly modularized function that takes a vector of peakgroups, sets the rt vector to a new set of times, and saves the old times in originalretentiontimes.

2b) What I want to do is have another option at the core of the algorithm, where on each sample, you run processCompounds using a special compound database that's only for alignment. Then fit some kind of spline from the expected (reference) RTs to the observed RTs for all the compounds you find, and use this to interpolate the whole RT vector. This is in many ways much much easier than aligning all samples at once. For the gui, the current dialog box could be reused to a great extent, and the command line version is even easier.
(This development would be in parallel to fixing the existing code.).

There's a clear bug in line 44 of mzAligner -- the argument is supposed to be polynomialDegree. Easy to fix but unfortunately this makes me think this code wasn't tested at all. However, this is actually correct in the open source version! (no argument, and polynomialDegree is referenced within Fit()). I did a quick diff of the two versions -- it looks like quite a number of changes were made.
At the moment I can get alignment to go through successfully for some sets, but it looks like it works only when there is no alignment needed (maybe it's ok when the algorithm exits after 1 iteration or something like this). Otherwise, I usually get a crash. (even after fixing the one bug I mentioned above)

Maybe the right place to start is to take a look at what was actually changed and perhaps reverting back to the version of mzAligner in the open source code, and trying some alignments to see if we still get crashes? If we do, maybe we can bring back or bring in some of the logging and debugging output and see if we can work this out.

Let me know if that seems reasonable and how it would fall in the timeline.

PS. I'm sure you understand this, but just to emphasize: the way that alignment needs to work is that there should be a single transformation from unaligned retention times to aligned retention times (i.e. you don't want to align mass slices individually -- you want to use the individual mass slices to get an average transformation that would then be applied to the entire RT vector). Ideally this transformation is constrained to be monotonic.
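The single, monotonic transformation described above can be sketched as follows. This is not the mzAligner code: it swaps the proposed spline for plain piecewise-linear interpolation between anchor compounds, and enforces monotonicity by simply discarding anchors that would break it. RtAnchor, makeMonotonic, alignRt and alignAll are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One alignment anchor: where a reference compound was observed in this
// sample versus where it is expected. Hypothetical type for illustration.
struct RtAnchor {
    double observedRt;
    double referenceRt;
};

// Sorts anchors by observed RT and drops any anchor that would make the
// mapping non-monotonic (reference RT not strictly increasing).
std::vector<RtAnchor> makeMonotonic(std::vector<RtAnchor> anchors) {
    std::sort(anchors.begin(), anchors.end(),
              [](const RtAnchor& a, const RtAnchor& b) {
                  return a.observedRt < b.observedRt;
              });
    std::vector<RtAnchor> kept;
    for (const RtAnchor& a : anchors) {
        if (kept.empty() || (a.observedRt > kept.back().observedRt &&
                             a.referenceRt > kept.back().referenceRt)) {
            kept.push_back(a);
        }
    }
    return kept;
}

// Maps one observed RT to an aligned RT by linear interpolation between
// neighbouring anchors; outside the anchor range the nearest offset is used.
// Precondition: anchors is non-empty and came from makeMonotonic().
double alignRt(const std::vector<RtAnchor>& anchors, double rt) {
    assert(!anchors.empty());
    if (rt <= anchors.front().observedRt)
        return rt + (anchors.front().referenceRt - anchors.front().observedRt);
    if (rt >= anchors.back().observedRt)
        return rt + (anchors.back().referenceRt - anchors.back().observedRt);
    std::size_t hi = 1;
    while (anchors[hi].observedRt < rt) ++hi;
    const RtAnchor& a = anchors[hi - 1];
    const RtAnchor& b = anchors[hi];
    const double t = (rt - a.observedRt) / (b.observedRt - a.observedRt);
    return a.referenceRt + t * (b.referenceRt - a.referenceRt);
}

// Applies the single transformation to a sample's whole RT vector.
std::vector<double> alignAll(const std::vector<RtAnchor>& anchors,
                             const std::vector<double>& rts) {
    std::vector<double> out;
    out.reserve(rts.size());
    for (double rt : rts) out.push_back(alignRt(anchors, rt));
    return out;
}
```

Because every segment has a positive slope and the extrapolation keeps a constant offset, the aligned RT vector preserves the order of the original one. A smoothing spline fitted to the same anchors would be a natural refinement.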

Separate widget logic into libmaven

Isotopic refactoring

Things to focus on isotopic refactoring:

Special Treatment of C13

Currently there are two places from where label information is to be taken. One in the peak detection dialogue box and other inside the options button on the mainwindow.
In El-MAVEN all the Isotopic settings are taken from options button. Peakdetection isotope checkbox is not functional.
Whether we select or deselect C13, it's always finding isotopes of C13.
There are two places where compute Isotope function is defined. One in isotopic logic and other as the part of the peakdetector.
@chubukov I remember you had some suggestion on how isotopic detection functionality should ideally be. Could you please share with us those suggestions?

Pick isotopic peaks even when parent abundance is zero

It's come up a few times now that we do a labeling experiment and end up with 100% labeling of one or more metabolites. In this case, when you're doing automated peak picking with a database, you won't pick the parent peak and therefore the labeled peak is lost. It would be kind of nice to be able to specify that maven should look for the isotopes even if the parent isn't there. It would probably have to be a compound-specific option I think -- otherwise, the number of potential peaks completely explodes.

Saving/loading mzroll screws up labeling information

This is a consequence of the "Feng addition" of the way we handle isotope outputs. Loading the mzroll file does not set the sample._C13labeled etc variables. So if you save a project, then reload it, there is no way to export the labeled species when you save to CSV.

parentgroup has no peaks

I was seeing some strange behavior with isotope detection, and I started playing with the options a bit to try to figure it out. I set the option "maxIsotopeScanDiff" to 500. This caused Maven to crash in the mzUtils::correlation function, because an empty vector was passed for y. This in turn was due to the fact that BackgroundPeakUpdate::pullIsotopes was called with a ParentGroup with no peaks. I don't fully understand what's going on yet. Hopefully this will be easier as we write out what actually happens during the peak detection and grouping. Should be easy to reproduce with almost any data. @chubukov We are not able to reproduce this.

@chubukov I am trying to capture all the isotopic related issues into one place so that we can refactor isotopic detection. I have captured reported isotopic issues also into this. I guess this will help us to coordinate refactoring in a better way. Could you please add suggestions you have in isotopic refactoring in this thread. I will keep on updating this list according to your suggestions.

Progress Bar doesn't update when loading mzroll

If mzXML files are loaded into Maven, the progress bar works perfectly fine. But when an mzroll is loaded, the progress bar stays at zero percent.

fast double click when bookmarking peaks

This is one that caused a lot of headaches, and I had some hints as to the issue, but I think I've figured it out.

Load some samples and select a compound from the list
Quickly (i.e. without hovering) double click on one of the circles

This will bookmark the peak, but will not pick any isotopes.

If you first single-click on the peak, triggering display of the isotopes plot, then double click will have the expected behavior. It doesn't seem to matter if isotope plot is enabled or not.
Even hovering over the circle for a few seconds without clicking seems to be sufficient.

Let me know if you're able to reproduce -- if not, I'll make a video.

m/z info in compound widget

This is present in all maven versions.

- If you provide m/z in the compound database, that m/z is displayed in the compounds widget
- If you provide formula as well as m/z, the m/z you provide is displayed, but the m/z calculated from the formula overrides it for all purposes (like pulling EIC)
- If you provide formula but not m/z, the m/z displayed is actually the neutral mass (and the correct ionized m/z is calculated from the formula and used for pulling EIC)

This is quite inconsistent, the last case in particular is misleading.

Not 100% sure what the right solution is, especially since ionizationMode can change after loading the compound database (it's determined when samples are loaded). I guess my vote is to display calculated m/z and trigger a refresh of the widget if ionizationMode changes.

Maybe if we do that let's have ionizationMode displayed somewhere in the main window (I think this has come up before). Maybe in the corner by the formula and ppm boxes. (and take it out of the main options).

Please don't forget about neutral ionizationMode as well.

get rid of set name row in csv output

When outputting to csv from the maven gui, the column headers are in row 1, and the second row has the set name for each sample. The third row onwards has the compound data.

In our hands the second row is just an annoyance -- we always delete that row before doing any further work. It just makes things awkward for further processing. I'm having a hard time imagining a good use case where it really would help -- I say let's just get rid of it unless you can think of a good reason to keep it.

|charge|>1

Reported by @chubukov on Bitbucket

One of the things that was never handled properly in Maven was multiple charged species (even though some parts of the code are designed to handle it). You can see https://bitbucket.org/elucidatainc/qews/commits/fd181d02608f3e0d1ad32709f8ea62fa0561b76f for an example of some basic edits to make it work in the CLI.
Browsing the new code, I can see that this bug is still present. E.g.

bool mzSlice::calculateMzMinMax(float CompoundppmWindow, int ionizationMode) {
    float ppmScale = 1e6;

    //Calculating the mzmin and mzmax
    if (!this->compound->formula.empty()) {
        //Computing the mass if the formula is given
        double mass = MassCalculator::computeMass(this->compound->formula, ionizationMode);

should really be compound->charge instead of ionizationMode. I'm certain this is far from the only place.
Charge for a known should be read in when compoundDb is read (that functionality already exists). When the user types a formula in the main window, we should probably default to charge=ionizationMode, though ideally there would be either an extra box for the charge or a way to specify it in the formula (e.g. C50H20N10/3) meaning charge=3*ionizationMode. If we do this, we have to make the ionizationMode option front and center in the main options. I can't think of any other problematic cases at the moment.
In terms of priority: on the one hand, it's definitely a bug and it will become critical if we ever apply Maven to proteomics, or, to some degree, lipidomics. On the other hand, it has never really worked, and we're probably never going to make Maven into a tool that's heavily used by the proteomics community. So in that sense, I'd say it's not critical for initial release.
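To make the charge handling above concrete, here is a minimal sketch of computing m/z from a neutral mass and a signed charge. It assumes simple protonation or deprotonation only, and it is not the MassCalculator implementation; mzFromNeutralMass and effectiveCharge are hypothetical helpers, and treating a compound charge of 0 as "not specified" is also an assumption.

```cpp
#include <cassert>
#include <cstdlib>

// Mass of a proton in daltons (CODATA value, rounded).
constexpr double kProtonMass = 1.00727647;

// Hypothetical helper: m/z of a species that gains (charge > 0) or loses
// (charge < 0) |charge| protons. charge == 0 returns the neutral mass.
// This assumes simple protonation/deprotonation; other adducts differ.
double mzFromNeutralMass(double neutralMass, int charge) {
    if (charge == 0) return neutralMass;
    const int z = std::abs(charge);
    return (neutralMass + charge * kProtonMass) / z;
}

// Charge to use for a compound: its own charge when the database gives
// one (non-zero), otherwise fall back to the global ionization mode.
int effectiveCharge(int compoundCharge, int ionizationMode) {
    assert(ionizationMode >= -1 && ionizationMode <= 1);
    return compoundCharge != 0 ? compoundCharge : ionizationMode;
}
```

For a neutral mass of 180.06339 this gives roughly 181.0707 at charge +1, 179.0561 at charge -1 and 91.0390 at charge +2, which is why passing ionizationMode where the compound charge belongs misplaces the mass window for multiply charged species.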

Clang in macOS (by default) doesn't support openmp

Add instructions to install OpenMP for clang.

crash if database file is malformed

When I accidentally had fewer than the expected number of fields in one line of the database file, Maven crashed when trying to load it. Probably need to fail more gracefully here, or accept the incomplete entry in some way.

To reproduce, put in just the compound name as the only field.
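One way to fail gracefully here is to check the field count before indexing into a line and to record which lines were skipped so the user can be told. The sketch below is a standalone illustration, not the existing database loader; splitFields, LoadResult and loadCompoundLines are hypothetical names, and the comma separator and field count are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits one delimited line into fields (interior empty fields are kept).
std::vector<std::string> splitFields(const std::string& line, char sep) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, sep)) fields.push_back(field);
    return fields;
}

// Rows that parsed, plus the 1-based numbers of the lines that were skipped.
struct LoadResult {
    std::vector<std::vector<std::string>> rows;
    std::vector<std::size_t> skippedLines;
};

// Hypothetical loader: skips short lines instead of indexing past the end.
LoadResult loadCompoundLines(const std::vector<std::string>& lines,
                             std::size_t expectedFields, char sep = ',') {
    LoadResult result;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        std::vector<std::string> fields = splitFields(lines[i], sep);
        if (fields.size() < expectedFields) {
            result.skippedLines.push_back(i + 1);
            continue;
        }
        result.rows.push_back(fields);
    }
    return result;
}
```

A line containing only the compound name then ends up in skippedLines, and the rest of the file still loads.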

Source File column headers

In El Maven, the column headers must be the following, in this order:
compound expectedRt id formula mz

the following columns in this order would not load into El Maven:
compound formula id rt mz
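If the loader looked columns up by header name rather than by position, both orderings above could be read. A minimal sketch of that idea follows; indexHeader and fieldByName are hypothetical helpers, and accepting "rt" as an alias for "expectedRt" is an assumption about the intended behavior, not something the current loader is known to do.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Maps each header name to its column index, so the loader can look up
// "formula" or "mz" wherever they appear. Later duplicates are ignored.
std::map<std::string, std::size_t> indexHeader(
        const std::vector<std::string>& header) {
    std::map<std::string, std::size_t> index;
    for (std::size_t i = 0; i < header.size(); ++i)
        index.emplace(header[i], i);
    return index;
}

// Returns the value of the first alias present in the header, or "".
std::string fieldByName(const std::map<std::string, std::size_t>& index,
                        const std::vector<std::string>& row,
                        const std::vector<std::string>& aliases) {
    for (const std::string& name : aliases) {
        auto it = index.find(name);
        if (it != index.end() && it->second < row.size())
            return row[it->second];
    }
    return "";
}
```

Calling fieldByName with the aliases "expectedRt" and "rt" returns the retention time for either header layout, and an absent column yields an empty string instead of a misread value.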

reading ms/ms files from qExactive

I'm sharing an example of an ms/ms file from the qexactive instrument. Maven has difficulty loading this file properly either in mzML or mzXML format (from msconvert/readw).

dropbox link

Crash: Changing options when no groups are present

Reported by @chubukov on Trello board.

Here's one I was able to reproduce and partially debug.

Load some files and manually pick a peak
Delete the peak in the bookmarked table
Hit options and change an isotope labeling checkbox

This will crash because it tries to call pullIsotopes on the currently displayed group, except that the group has been deleted, so some of the pointers are now invalid.

I think this just needs better checking on what's currently being displayed.

ms/ms files in mzML format

Here's an example where it looks like Maven is not correctly processing an mzML file created by msConvert. The mzXML created by the deprecated mzWiff utility works fine, however.

dropbox maven/bug_reports/2016_11_09_ms_ms_mzML_from_abi4000

mzData Parser

ElMaven doesn't handle mzData very well.

Be able to provide tolerance on expectedRt

loading mzroll from projectdockwidget

Loading an mzroll file works fine through either the file->load dialog, or the open file button in the main window. But when doing it through the open project button on the inner widget, maven will read all the files but not populate anything.

Reproducible with every mzroll file I've tried.

Saving/loading mzroll screws up labeling information

Reported by @chubukov on Trello

This is a consequence of the "Feng addition" of the way we handle isotope outputs.

Loading the mzroll file does not set the sample._C13labeled etc variables. So if you save a project, then reload it, there is no way to export the labeled species when you save to CSV.

Crash: segmentation fault when loading mzML samples when zlib is enabled

I enabled zlib in the El Maven code (it was commented out in mzroll.pri), and loading the mzML samples then gave a segmentation fault.

PS: I compiled the libmaven as dynamic library for this specific case as there was some issue when I enable ZLIB in static mode.

Check behavior when peak intensities are < 1.0

setting isotope options for manual peak picking

This was always problematic also in Maven 776, but I really can't get it to work at all in EL Maven 0.2.

It seems that no matter which options I set, I cannot get N15-labeled species to show up when bookmarking a peak. (I can get C13N15 species, but not N15 only). The peaks are clearly there in the spectrum.

Can you put together a working example? Tell me if you need data files.

If you're actively working on refactoring the isotope options as we discussed, then we can wait until that's finished.

use expected m/z, not observed m/z when pulling EIC based on peak table entry

Copied from Trello, with more detail added.

If you bookmark a peak (or get a table through database search) it will assign an m/z value to the peak group. That value might be a little different from the expected m/z for the compound, depending on the ppm window and the actual MS data. When you now click on the peak in the peaks table, maven will use the "observed" m/z as the center of the mass window, and show you an EIC for the observed m/z +/- ppm window. I think more proper behavior would be to use the original expected m/z value, which should still be stored (since there's a link to the compound in the peakgroup).

If the peak is the result of doing untargeted search or of entering a mass into the textbox at the upper right, then there is no expected m/z, and no compound linked, so then you would use observed m/z.

I guess the corner case is if the user enters a formula into the text box -- it would be nice if that resulted in a fake compound with the proper m/z being stored with the peakgroup, but honestly I don't care whether that works.
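The proposed rule for choosing the center of the mass window can be sketched as below. GroupMz and eicWindow are hypothetical, simplified names for this illustration and do not mirror the PeakGroup class; the only logic shown is "expected m/z if a compound is linked, otherwise observed m/z", followed by a symmetric ppm window.

```cpp
#include <cassert>

// Hypothetical, simplified view of a peak group for this sketch.
struct GroupMz {
    double observedMz;   // mean m/z measured for the group
    double expectedMz;   // m/z of the linked compound, valid if hasCompound
    bool hasCompound;    // false for untargeted or typed-in masses
};

struct MzWindow {
    double mzMin;
    double mzMax;
};

// Centers the window on the expected m/z when a compound is linked,
// otherwise on the observed m/z, and applies a symmetric ppm tolerance.
MzWindow eicWindow(const GroupMz& g, double ppm) {
    assert(ppm >= 0.0);
    const double center = g.hasCompound ? g.expectedMz : g.observedMz;
    const double delta = center * ppm / 1e6;
    return MzWindow{center - delta, center + delta};
}
```

With this rule, clicking a table entry for a database compound reproduces the same window that was used when the peak was originally picked.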

crash if peak picking is cancelled

I can semi-reliably get Maven to crash if I hit "cancel" after initiating peak picking (while it's running). This happens most reproducibly if you do untargeted analysis, but sometimes also with targeted.

Discrepancy between the mass shown in the EIC window and the part where all metabolites are shown

Victor:
This bug is due to the following behavior:

The way that the group mz is calculated is that it's an average of the mz value for each peak, which in turn is the mz of the scan with the highest intensity for that peak (within the mz and rt window). However, if for a particular sample there are no scans with intensity>0 (after baseline correction), the mz reported for that peak will be zero. That zero will be averaged with the correct mz values for the other peaks, leading to the nonsensical result.

To reproduce the behavior, load some samples with extremely low intensity and some with normal intensity, set "drop top X baseline intensities" to 0 in the options, and integrate a few peaks. I bet you will see some of them reported as mz=0 for the blank samples. (If it doesn't work right away, I'll make up a minimal working example).

The faulty code as far as I can tell is in EIC* mzSample::getEIC(float mzmin,float mzmax, float rtmin, float rtmax, int mslevel) (though I suspect that the SRM or MS/MS versions also have similar problems). You can see that the innermost for loop is structured such that if there are no scans with intensity > 0, __maxMz will remain zero.

I believe that a trivial fix will be to initialize __maxIntensity to a negative number instead of zero. Another solution, which would also handle the degenerate case of no scans at all, would be to leave that part as is, but in the code that computes the average for the group, ignore the zero values. I believe that happens in void PeakGroup::groupStatistics() but I'm not 100% certain.
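Both candidate fixes can be illustrated in isolation. The sketch below is not the code in mzSample::getEIC or PeakGroup::groupStatistics; ScanPoint, mzAtMaxIntensity and groupMeanMz are hypothetical names, and the functions only show the two ideas: start the running maximum below zero, and skip zero m/z values when averaging.

```cpp
#include <cassert>
#include <vector>

// Hypothetical (m/z, intensity) pair from one scan inside the window.
struct ScanPoint {
    double mz;
    double intensity;
};

// m/z of the most intense point. Starting the maximum below zero means an
// all-zero trace still reports a real m/z; only an empty trace returns 0.
double mzAtMaxIntensity(const std::vector<ScanPoint>& points) {
    double maxIntensity = -1.0;
    double maxMz = 0.0;
    for (const ScanPoint& p : points) {
        if (p.intensity > maxIntensity) {
            maxIntensity = p.intensity;
            maxMz = p.mz;
        }
    }
    return maxMz;
}

// Mean of the per-sample peak m/z values, skipping samples whose peak
// m/z is 0 (no usable scan). Returns 0 if no sample has a usable value.
double groupMeanMz(const std::vector<double>& peakMzs) {
    double sum = 0.0;
    int count = 0;
    for (double mz : peakMzs) {
        if (mz > 0.0) {
            sum += mz;
            ++count;
        }
    }
    return count > 0 ? sum / count : 0.0;
}
```

For peak m/z values of 179.056, 0 and 179.058, a plain mean would report about 119.37, while groupMeanMz reports 179.057.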

I was waiting to fix this until we had a better write-up of the workflow of the algorithm when it might be easier to understand any unintended consequences of this. I would be ok with just implementing a fix if we had a good set of tests to run.

Automatic Crash Reports

Right now El Maven has no automatic crash reporting that could capture, for example, a stack trace of the crashed instance.

crash in manual peak picking

Load some samples (I loaded anywhere from 5-30 samples).
Click on compounds tab and pick a compound
Either double click or shift-drag to integrate the peak

This has consistently crashed for me. Occasionally immediately, but usually 1-2 seconds after populating the bookmarked peaks table.

Doesn't seem to matter if I open a new compoundDb or not.

Support for online database queries

Some of the databases we could cover are: 1. HMDB 2. KEGG 3. LipidMaps 4. METLIN 5. MassBank 6. ChemSpider

Exporting CSV doesn't include Isotopic peaks

When CSV is exported after isotopic peak-picking, only the parent ion is present in the output CSV. There is no information about the isotopes of the parent ions.

Handle uploading of Samples if mzroll is being generated from one computer and being used in another computer

If an mzroll is generated on one computer and used on another, El Maven doesn't handle loading of the mzXML samples very well. It only checks the original path of the mzXML files and the directory in which the mzroll is present.

There should be a pop-up in case El Maven doesn't find the mzXML samples after loading the mzroll. It should ask for the new path where the mzXML samples are located.

better way of specifying blank samples

I don't like the way that Maven currently determines whether a sample is a blank injection:

if (mystrcasestr(filename,"blan"))

It's accident-prone and not at all documented. Let's make this more explicit somehow. For instance an extra column in the samples widget? Happy to hear other suggestions.

Not sure what the right thing to do is for the command-line version. Actually I think that a prefix string (or even a regexp) is ok (e.g. user can specify that samples starting with "BK" are blanks) as long as the default is to not treat any sample as a blank, and it's all well-documented.
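The prefix-based option for the command-line version could look roughly like this. isBlankSample is a hypothetical function, not an existing El Maven API, and the match is deliberately case-sensitive and anchored at the start of the name; the key property is that the default (an empty prefix) never marks a sample as a blank.

```cpp
#include <cassert>
#include <string>

// Hypothetical check: a sample is a blank only if the user supplied a
// non-empty prefix and the file name starts with it. With the default
// (empty prefix) no sample is ever treated as a blank.
bool isBlankSample(const std::string& fileName,
                   const std::string& blankPrefix = "") {
    if (blankPrefix.empty()) return false;
    return fileName.compare(0, blankPrefix.size(), blankPrefix) == 0;
}
```

Unlike a substring search for "blan", a file called blank_01.mzXML is not a blank here unless the user asked for that prefix, and a name that merely contains the prefix somewhere in the middle does not match.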

Store or link compound database in mzroll file

Right now loading an mzroll file into Maven will not load/display the peak integration information unless you have the right compound list preloaded. This information should really be stored together with the mzroll file (it's reasonably small relative to the rest of the file, so could even be embedded). This would dramatically improve the utility of storing the mzroll file.

Expanding row of buttons in header

If you shrink the width of a window such that not all the buttons in the top toolbar are visible, a ">>" button appears that's presumably supposed to expand the list to a drop-down. For me, it has never done anything (in any maven version).

This seems to be relevant? from https://doc.qt.io/archives/4.6/qtoolbar.html

"When a QToolBar is not a child of a QMainWindow, it looses the ability to populate the extension pop up with widgets added to the toolbar using addWidget(). Please use widget actions created by inheriting QWidgetAction and implementing QWidgetAction::createWidget() instead."

Definitely not high priority if it involves rewriting all the widgets.

install without admin privileges

Currently the installer will not run without administrator privileges on Windows. Is it possible to allow this (obviously would have to install in some non-protected directory)? I know there's a setting in the InstallBuilder software for it, and I know that there are definitely plenty of windows applications that allow installation without admin privileges.

crash report: bookmarking peaks with large numbers of samples

In a real road test, I tried to use maven to do manual integration with about 100 samples loaded. I consistently got it to crash after doing 4-5 peaks, and there didn't seem to be any pattern as to what compounds it crashed on. I reproduced the same thing with the debug version, and it was always crashing on this line with a segmentation fault:

inline void addChild(const PeakGroup& child) { children.push_back(child); children.back().parent = this; }
(line 224 in peakGroup.h)

I can't figure out why it's segfaulting. this looks like a valid peakGroup and children is not null (it always has size=1 at the time of crash).

Could multithreading be involved somehow?

loading additional samples causes samples widget to not display old samples

Load some samples
Load some more samples

The first set will not appear in the samples widget. For all other purposes they appear to be in memory (e.g. peak integration and EIC display show both the first and second set of samples).

I swear I saw this bug discussed earlier somewhere, sorry if it's a duplicate.

Replace MersenneTwister

The Mersenne Twister is part of the std:: namespace now (std::mt19937 in <random>, since C++11), so we don't need MersenneTwister.h in our code.
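As a rough illustration of what the replacement could look like, the sketch below wraps std::mt19937 with the standard distribution classes. uniformReal and uniformInt are hypothetical helper names; how the existing MersenneTwister.h calls map onto them would need to be checked call by call.

```cpp
#include <cassert>
#include <random>

// Draws a uniform real number in [lo, hi) from a caller-owned engine.
double uniformReal(std::mt19937& engine, double lo, double hi) {
    assert(lo < hi);
    std::uniform_real_distribution<double> dist(lo, hi);
    return dist(engine);
}

// Draws a uniform integer in [lo, hi], both ends inclusive.
int uniformInt(std::mt19937& engine, int lo, int hi) {
    assert(lo <= hi);
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(engine);
}
```

Seeding the engine with a fixed value (for example std::mt19937 engine(42u)) gives the same raw engine output on every platform, which helps when tests depend on random numbers; the distribution classes themselves are allowed to differ between standard library implementations.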

Uploading mzroll is not multi-processed

Loading mzXML files makes use of all the processors, but when loading via mzroll only one core is used.

Sorting the sample names that is uploaded.

After samples are loaded using parallel processing, the files are not sorted, so sample names that are alike end up far apart in the list. Add a sort function to address this issue.
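A minimal sketch of such a sort function is given below. It orders names case-insensitively so that similar names end up next to each other regardless of the order in which the worker threads finished; nameLess and sortSampleNames are hypothetical names, and plain lexicographic order (rather than a natural, number-aware order) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Case-insensitive "less than" for two sample names.
bool nameLess(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](unsigned char x, unsigned char y) {
            return std::tolower(x) < std::tolower(y);
        });
}

// Sorts names in place. The sort is stable, so names that differ only in
// letter case keep their original load order.
void sortSampleNames(std::vector<std::string>& names) {
    std::stable_sort(names.begin(), names.end(), nameLess);
}
```

In the application the same comparison would be applied to the sample objects by their name field once all loader threads have finished.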

Special Treatment of C13

Reported by Victor on Trello

I was looking at other places where pullIsotopes is used. One of those is in preparing the stacked barplot that shows the isotopic distribution in the EIC window. As expected, with the current bug, where the main window settings aren't read at all, changing the options has no effect. But when those lines are uncommented (or in Maven 772/775), you can control which isotopes show up in that plot by checking/unchecking them in the options. Except: C13 will always show up, regardless of the option.

This doesn't happen in the plot generation, it happens further upstream when the peakgroup is created. I haven't quite tracked it down yet, but it happens also through the peak detection dialog.

Basically, Maven has all these built-in features to deal with natural 13C abundance (but not natural abundance of anything else), and so there are places where 13C gets special treatment. Eventually, I would like to expose these and make the behavior consistent.

feature request: select and delete button should delete multiple entries in table

If you select multiple entries in a peaks table and then hit the delete key, it will still only delete the first entry.

Maybe there should also be a "delete" entry in the right-click context menu (right now there is only "delete all").
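Deleting several selected rows is easy to get wrong because each removal shifts the rows after it. A sketch of one safe approach, processing the selected indices from highest to lowest, is shown below; removeRows is a hypothetical helper operating on a plain vector, not on the actual table widget.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Removes every selected row from the table. Indices are processed from
// highest to lowest so earlier removals do not shift later ones;
// duplicates and out-of-range indices are ignored.
template <typename Row>
void removeRows(std::vector<Row>& table, std::vector<std::size_t> selected) {
    std::sort(selected.begin(), selected.end(), std::greater<std::size_t>());
    selected.erase(std::unique(selected.begin(), selected.end()),
                   selected.end());
    for (std::size_t idx : selected) {
        if (idx < table.size())
            table.erase(table.begin() + static_cast<std::ptrdiff_t>(idx));
    }
}
```

The same function could back both the delete key and a "delete" entry in the context menu.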

Having to double click to change set name is unintuitive

isotopes in positive mode

Something is going horribly wrong in the calculation of isotope masses in positive mode.

I uploaded some sample data in Maven/bug_reports/sample_positive_mode_data as well as a source file.

Load a few samples and scroll to any compound (try L-Glutamic Acid for a really nice clean example). Click to bring up the isotope barplot and doubleclick to integrate. I use a ppm of 10.

If you look in the table, the expected isotopes are missing, and it looks like all the mass windows are about 2 daltons lighter than they should be.

Is there a negative ionization (causing -1 instead of +1 for the mass calculation) hardcoded somewhere?

Note this doesn't happen if doing automatic peak integration through the peaks dialog.

Maven 776 does not have this bug.

deleted samples

bookmark a peak
delete one or more of the samples
click on the peak in the bookmarks table

This causes a crash. Probably the right behavior is to delete the entries in the peak table corresponding to that sample, but a temporarily acceptable alternative is to force the user to delete all the tables.
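The suggested behavior, removing the peak table entries that belong to a deleted sample, can be sketched as follows. SamplePeak, BookmarkGroup and purgeSample are hypothetical, simplified names; the sketch identifies samples by an integer ID rather than by pointer, which is an assumption made here to keep the example free of dangling references.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified types; the real classes hold far more state.
struct SamplePeak {
    int sampleId;   // identifies the sample this peak came from
    double area;
};

struct BookmarkGroup {
    std::vector<SamplePeak> peaks;
};

// Drops every peak that belongs to the deleted sample, then drops groups
// that are left with no peaks, so nothing refers to the sample afterwards.
void purgeSample(std::vector<BookmarkGroup>& groups, int deletedSampleId) {
    for (BookmarkGroup& g : groups) {
        g.peaks.erase(
            std::remove_if(g.peaks.begin(), g.peaks.end(),
                           [deletedSampleId](const SamplePeak& p) {
                               return p.sampleId == deletedSampleId;
                           }),
            g.peaks.end());
    }
    groups.erase(std::remove_if(groups.begin(), groups.end(),
                                [](const BookmarkGroup& g) {
                                    return g.peaks.empty();
                                }),
                 groups.end());
}
```

Running this before the sample object is destroyed would mean that a later click on the bookmarks table only ever touches peaks whose samples still exist.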