dylanhross / lipydomics Goto Github PK

View Code? Open in Web Editor NEW

8.0 2.0 2.0 171.57 MB

Python package for performing lipidomics analyses

License: MIT License

Python 78.57% Jupyter Notebook 21.43%

lipydomics's Introduction

LiPydomics (version 1.6.8)

Dylan H. Ross and Jang Ho Cho

Overview

lipydomics is a Python package for performing standard lipidomics analysis protocols on data in an efficient and reproducible fashion. The primary modules in the package are data, stats, plotting, identification and interactive. The data module is responsible for storing lipidomics datasets, along with their associated metadata and computed statistics. The stats module contains functions for computing statistics on a lipydomics Dataset instance and the plotting module has functions for generating various plots using computed statistics or the data itself. The identification module has functions for identifying lipid features using different levels of information. The interactive module provides a more user-friendly text-based interface for performing lipidomics data analysis for those who are not as familiar/comfortable with the more flexible Python API.

API

Documentation for the lipydomics API, organized by module, is available separately in api.md. An example of a complete scripted analysis is included in example_standard_analysis.py. An additional example analysis (executed in a Jupyter notebook) is available here.

Interactive

Documentation for the lipydomics.interactive module is available separately in interactive.md. A complete example demonstrating use of the interactive module (executed in a Jupyter notebook) is also available here.

Support

If you have any questions or suggestions, or notice any bugs please feel free to contact [email protected] or file an issue.

License

This software is made available under the MIT license. We ask that if you use this package in published work, please cite our paper.

lipydomics's People

Contributors

Stargazers

Watchers

Forkers

kaitlin-rempfert hasihays

lipydomics's Issues

`Dataset` method to drop features with user-defined criteria

The Dataset object needs a method for filtering the data on user-defined criteria and dropping any features that do not fit the criteria so that subsequently generated plots/statistics can more clearly reflect the desired message.
As an example, a user may compute a 2-group p-value and then want to only keep the features that significantly differ between the two groups. Another example, a user might only want to keep abundant features, so any feature with average intensity below a given threshold would be dropped.
Such a method would change the Dataset in an irreversible way, so the documentation should make that clear and encourage the user to save_bin prior to filtering so that they can go back if needed.
Multiple criteria can be applied through successive calls of the method.

The call signature should look something like the following:

class Dataset:
    # ...
    def drop_features(self, criteria, lower_bound=None, upper_bound=None, normed=None, idx=None):
        # ...

Where criteria specifies what to filter on and lower_bound and upper_bound are the bounds to use for the filtering (one or both may be used depending on the type of filtering being performed). Valid options for criteria are:

"mintensity" for filtering such that at least one sample must have an intensity >= the specified lower_bound, the normed kwarg must be set to a boolean to indicate whether to use normalized or raw intensities
"meantensity" for filtering such that the mean intensity across all samples must be >= the specified lower_bound, the normed kwarg must be set to a boolean to indicate whether to use normalized or raw intensities
any column-data statistics entry present in Dataset.stats. This might be a p-value or a feature loadings from PCA or PLS-DA. The entry must be column data, meaning that at least one dimension must be Dataset.n_features in length. If the statistic has more than 1 dimension (as something like PCA loadings would), the idx kwarg must be set to the index to use (e.g. for PCA loadings there are 3 columns for the feature loadings on axis 1, 2, or 3. To get the loadings for axis 1 the kwarg should be idx=0). Either lower_bound, upper_bound, or both must be set.

_An implementation note: _
It would probably make things easier to define a private helper method with a signature like:

class Dataset:
    # ...
    def drop_features_helper(self, target, lower_bound=None, upper_bound=None):
        # ...

Where target is an array of length Dataset.n_features which can be used directly to compute the indices of the features that should be retained using the provided bounds. It would then fo through all of the necessary arrays (any array that is Dataset.n_features in length, like intensities, identifications, stats entries...) and drop the features from them and update instance variables like Dataset.n_features. The public method would then just need to compute this target array depending on the criteria and pass that along.

Add some mechanism for detecting dubious lipid IDs

This example shows a set of lipid identifications made at the pred_mz_rt level. The blue are in line with the general expected trend for lipids, but the red have much lower CCS values, and are possibly not even lipids (?). In this case, identifications made without consideration of CCS has lead to some dubious assignments that are clearly wrong when examined manually. It would probably be beneficial to include some sort of mechanism in the lipid identification process for detecting such dubious identifications when matching is done without consideration of CCS.

For example: a global power fit (m/z vs. CCS) could be computed on the measured data, and lipid identifications could be ruled out which deviate significantly from the IM-MS conformational space of lipids (separate thresholds for low and high values to account for doubly-charged lipids which occupy a distinct space). Some work would need to be done to establish good empirical limits for such filtering, and it would need to be made optional in the identification process. This could be achieved by the addition of a kwarg in the identify_lipids(...) function with a corresponding helper function somewhere in the identification module to do the actual filtering. Alternatively, this could be made into a separate step entirely, as a standalone function from the identification module, so that this filtering must be specified explicitly and intentionally.

sort by confidence level when identifying multiple compounds

In the various id_feat_... functions, multiple identifications can be yielded for a given search criterion. Currently, they are only added in the order of their appearance when searching the database. There should be a way of ranking them on the basis of their relative confidence levels (i.e. how close are the query values to the theoretical values). This can be done using a score derived from either the l1 or l2 norm of the residuals vector. Example:

consider a 3-component match using m/z (mz_q) retention time (rt_q) and CCS (ccs_q)
q is the query vector defined as q = <mz_q, rt_q, ccs_q>
x is one potential feature vector returned from the query, defined as x = <mz_x, rt_x, ccs_x>
a residual vector r can be computed as follows r = <mz_r, rt_r, ccs_r> = <mz_x - mz_q, rt_x - rt_q, ccs_x - ccs_q>
We could directly take the l1 or l2 norm of this vector to get a confidence score, but it would not be very informative due to the widely different scales and precisions of each value.
One way to address this issue would be to apply a simple set of empirically determined rough scaling factors to the residual vector, based on knowledge about the scale and precision that can be expected from each value: r_adj = <100, 10, 1> * r = <100 * mz_r, 10 * rt_r, 1 * ccs_r>
A related but slightly different approach would be to derive the scaling factors from the already specified tolerance vector (tol = <mz_tol, rt_tol, ccs_tol>), since these also reflect the scale and precision of each value. In this case, we compute an adjusted residual vector as follows: r_adj = r / tol = <mz_r / mz_tol, rt_r / rt_tol, ccs_r/ccs_tol> = <mz_ra, rt_ra, ccs_ra>
finally, we can take either the l1 or l2 norm of the adjusted residuals to derive a score that is inversely proportional to confidence in a given prediction: (l1) 1/score = abs(mz_ra) + abs(rt_ra) + abs(ccs_ra) (l2) 1/score = sqrt(mz_ra^2 + rt_ra^2 + ccs_ra^2)
The choice of which norm to use will depend upon whether similar residuals in all components or especially small residuals in some components (and acceptance of some larger residuals in others) is more desirable.

In summary, need to associate some type of scoring with the identifications based upon how close the observed values are to queried values so that multiple identifications can be compared to one another.

option to use ppm when using m/z to search in lipid identification

Add an option to identify_lipids(...) to allow the use of ppm as the m/z tolerance instead of just a Da tolerance. This should also be added to the interactive interface.

add loadings plots for PCA and PLS-DA

Need to add some plots that include the feature loadings for PCA and PLS-DA analyses. These are just scatter plots of the feature loadings, perhaps with some form of annotation (labels, arrows?)

add type hints throughout package

https://docs.python.org/3/library/typing.html

type hints have been in python for a bit and all of the functions/modules and their docstrings specify type information so it would probably be beneficial to implement type hints throughout the package.

Probably kind of a big job and not necessarily a high priority...

api.md needs a section for LipidMass

LipidMass is exposed at the API level, the API documentation should have a section showing some examples on how to use it. This can be a new subsection within the Identification section. This should also include an example of the helper function get_lipid_mz from lipydomics.identification.mz_generation which is useful for quickly computing single m/z values.

Align axis labels in heatmaps with center of bins

Currently, heatmaps of lipid class log2fc have axis labels to the left (and below) the bins they correspond to. Example:

It would be better for interpretability to have these axis labels aligned with the center of the bins they correspond to. Example:

These plots are generated using the function: heatmap_lipid_class_log2fc from the module lipydomics/plotting.py

add_PCA3 needs to use group names

Currently, add_pca3(...) computes a 3-component PCA using the Dataset.intensities matrix directly. This does not allow for exclusion of a sample or samples from the computation of the PCA, which is necessary if a user wishes to consider only a subset of the overall dataset in this analysis. Proposed fix:

add_pca3(...) should take in a group_names arg, adopting an interface matching that of add_plsda(...)
the entries that are added to Dataset.stats need to follow naming conventions that include the group names (again, matching add_plsda(...))
the group_names arg in scatter_pca3_projections_bygroup(...) needs to match the groups used to compute the PCA, i.e., there will no longer be a way to plot only a subset of the groups used to compute a PCA

allow specifying a custom list of ID levels for lipid identification

currently, when identifying lipids the confidence level is specified as a single option. This can be an explicit single ID level or 'any' to use an approach that works through consecutive identification levels until an identification is made. Examples:

# identify features at the highest level possible
add_feature_ids(dset, tol, level='any')

# identify features using only experimental m/z and rt
add_feature_ids(dset, tol, level='meas_mz_rt')

There should be the possibility to pass a list of ID levels using the level= kwarg which would allow a user to specify a custom list of ID levels to try and in a specified order. Example:

# identify features by either theoretical or measured mz, rt, and CCS, try measured first
add_feature_ids(dset, tol, level=['meas_mz_rt_ccs', 'theo_mz_rt_ccs'])

Implement repr() method for Dataset

Implement __repr__() method for the lipydomics.data.Dataset class that prints relevant info like:

n_samples/n_features from the data
all of the groups defined in Dataset.group_indices
all of the stats entries defined in Dataset.stats along with their associated shapes

RuntimeWarning when computing absolute CCS from percentage

Not sure the particular cause, perhaps computing value using float (tol) vs. numpy.float64 (ccs) types? Or maybe because the computation happens in-place using a tuple: tol2[2] = tol2[2] / 100. * ccs?

The problem is likely somewhere in lipydomics/interactive.py or lipydomics/identification/__init__.py, but it will probably take some testing to diagnose the problem...

Not a huge issue since it only raises a RuntimeWarning, but I would rather that not happen if it means there is a problem with how we are handling the tolerance calculation.

Implement multivariate regression analysis using an external continuous variable

There are many lipidomics experiments where the samples have some sort of continuous variable associated with them (e.g. MIC for bacterial samples) that may or may not be correlated with lipidomic changes. In these cases, it could be useful to apply an analysis like PLS-RA which simultaneously performs dimensionality reduction and regression, that way a user can find out which lipid features correlate most strongly with the target variable.

There should be a way of associating the continuous variable with the Dataset instance. I think the best way to do this would be to introduce a new instance variable and corresponding method for adding them, like so:

define the instance variable in Dataset.__init___ ...

self.external_vars = None

... and add a method for assigning these external variables to the dataset

def add_external_var(self, label, var):
    """ ... """
    # input validation
    # ...
    if self.external_vars is None:
        self.external_vars = {}
    self.external_vars[label] = var

Add a function in the stats.py module that uses raw or normalized data, user-specified groups, and an external variable to compute a regression. All of the relevant statistics would be added to the Dataset.stats instance variable as usual.

defined in stats.py ...

def add_plsra(dataset, group_names, ext_var, normed=False, scaled=False):
    """ ... """
    # do the stuff 
    # ...

Implement both the assignment of external variables to a Dataset instance and the regression analysis in the interactive interface. Also, need to add an entry for the new regression analysis in the sheet name abbreviation function.
Add appropriate testing code for the new instance variable, method for adding it to the Dataset instance, and stats.py function for applying the regression analysis.
Add corresponding documentation and examples for the new instance variable, method for adding it to the Dataset instance, stats.py function for applying a regression analysis, and usage with interactive interface.

Error in interface when exporting to spreadsheet - sheet name too long

Unable to export data to spreadsheet when a statistic has too long of a name.

Example:

InvalidWorksheetName: Excel worksheet name 'PCA3_WT_Serum-geh_Serum-0641_Serum-sal1_Serum_loadings_raw' must be <= 31 chars.

Not sure how best to shorten the name without losing too much information...

Need to have full coverage of measured values in theoretical lipid m/z and CCS

The range of generated theoretical values needs to cover the entirety of measured values for each lipid class/MS adduct pair. An example is
where the measured values extend beyond the theoretically generated ones.

include an example of scripted analysis in api.md

It would be really useful to include an example of a complete analysis script that performs many of the functions already discussed in api.md. It would be especially useful to write the script in a flexible way to demonstrate how scripting can be used to establish a standardized analysis procedure that can be applied to multiple datasets.

Add Lysocardiolipins (LCL) to LipidMass

Add a class for predicting monolysocardiolipins (abbreviation: LCL) to the LipidMass package, and generate theoretical m/z values for them. We do not have reference CCS values (or retention times) for them yet, so only predict m/z.

utility function to print to console and logfile

There are a lot of cases in the database build scripts where statements are printed to the console and a log file. Right now most of those are just two consecutive calls to print() (one using the file=... kwarg) because I am lazy. It would be better to have a utility function that prints a single message to both places.

# utility function something like
def print_and_log(msg, log):
    print(msg)
    print(msg, file=log)

# ...

# used in the build scripts
msg = "wow, that worked"
print_and_log(msg, bld_log)

assign groups should be able to accept a list of group names and number of replicates

Dataset.assign_groups(...) should be amended such that a list of group names and number of replicates can be optionally provided instead of the existing group indices dictionary. Essentially this would take the group names ['Par', 'Dal2', 'Van8'] and 4 replicates and generate a group indices dictionary automatically: {'Par': [0, 1, 2, 3], 'Dal2': [4, 5, 6, 7], 'Van8': [8, 9, 10, 11]}.

Maybe this would be best as a separate method that generates group indices and just passes those along to Dataset.assign_groups(...)

interactive interface - internal standard normalization

When performing internal standard normalization, there is an error in the description when entering the m/z, rt, and CCS values where commas are shown in the example when the values should just be space delimited.
When searching for the internal standard feature, it seems that the user-provided values are cast to int, which is incorrect

Make database building more efficient

The database build steps where theoretical CCS or RT are added to the database take a very long time. I suspect that part of the issue is the way the values are computed:

# add theoretical CCS to the database
        print('\nadding predicted RT to database ...', end=' ')
        print('\nadding predicted RT to database ...', end=' ', file=bl)
        qry = 'SELECT t_id, lipid_class, lipid_nc, lipid_nu, fa_mod FROM theoretical_mz'
        tid_to_rt = {}
        for tid, lc, lnc, lnu, fam in cur.execute(qry).fetchall():
            if int(sum(c_encoder.transform([[lc]])[0])) != 0: # make sure lipid class is encodable
                x = np.array([featurize(lc, lnc, lnu, fam, c_encoder, f_encoder)]).reshape(1, -1)
                tid_to_rt[int(tid)] = model.predict(scaler.transform(x))[0]
        qry = 'INSERT INTO theoretical_rt VALUES (?, ?)'
        for tid in tid_to_rt:
            cur.execute(qry, (tid, tid_to_rt[tid]))
        print('ok\n')
        print('ok\n', file=bl)

The problem here is that we are fetching each sample from the database one at a time, featurizing them, then predicting CCS/RT. We are not taking advantage of the speed of performing the prediction all at once on a large input array. The solution is to break up the data fetching and CCS/RT prediction steps:

fetch the data with a query that includes the acceptance criteria (i.e. is the lipid class/FA mod encodable?)
iterate through these query results and featurize them, assembling them all into a single input array for CCS/RT prediction
Feed this large input array to the scaler and CCS/RT predictor to get a big array of results
iterate through the results (and input data) and add the predicted values to the database. This would be a good point to implement QC of predicted values (e.g. exclude absurdly high RT values)

Add stats function to get p-value for two group comparison

Need a new stats function that computes the p-value for comparison of two groups for all features. The actual statistical test used should be optional and controlled by the user, allowing them to select whether to use something like a t-test (parametric) or a kruskal-wallis (non-parametric) depending on what assumptions they want to make about their data (and therefore what statistical analyses are appropriate). I am leaning toward something that makes the test explicit (the option being something like test="kruskal-wallace") when calling the stats function. Which statistical test was used should also be included in the label for the Dataset.stats entry, something like: "kwallisP_Par_Dap_raw" or "ttestP_Dal_Van_normed"

Overall the function should have the following familiar signature:

def  add_2group_pvalue(dset, groups, stats_test, normed=False):

where dset is a reference to the Dataset object
groups is a list of the (two) groups to compare
stats_test specifies the statistical test to use (start with 'ttest' and 'kwallis')
and normed is the kwarg indicating whether to use normalized intensities for the comparison

filter_data in interactive converts the search terms to ints, should be floats

in filter_data function of interactive interface, the m/z, rt and CCS and their respective tolerances are converted to integers before being used to actually filter the data. This happens in the filter_d helper function:

def filter_d(mzs, rts, ccss, data):
    """ a helper function for filtering data
        given M/Z, RT, CCS ranges and a DataFrame containing data,
        find and returns all data within that range.
    """
    filtered = data[(data[0] < int(mzs[0]) + int(mzs[1])) & (data[0] > int(mzs[0]) - int(mzs[1])) &
                    (data[1] < int(rts[0]) + int(rts[1])) & (data[1] > int(rts[0]) - int(rts[1])) &
                    (data[2] < int(ccss[0]) + int(ccss[1])) & (data[2] > int(ccss[0]) - int(ccss[1]))]
    return filtered

These should all be floats since all of these properties are continuous values.

LipidMass add o- and p- variants for lysoglycerophospholipids

Implement o- and p- versions of lysoglycerophospholipids in lipydomics/identification/LipidMass/lipids/glycerophospholipids.py. This should work essentially the same as in the regular glycerophospholipids. Need to also amend the enumeration function (enumerate_all_lipids(...)) to reflect this change and add their masses to the theoretical_mz table of lipids.db.

LipidMass add PIP, PIP2, PIP3 subclasses of PI

Need to add classes for phosphorylated PI derivatives (PIP, PIP2, PIP3) in lipydomics/identification/LipidMass/lipids/glycerophospholipids.py as a subclass of PI. This should be very easy to implement, just add a mono-, di-, or tri-phosphate formula to the base.

the monophosphate (neutral) would be something like:


                 HO    OH
                 \___/      O
                 /   \      ||
        R  =  __/     \__O__P__OH
                \     /     |
                 \___/      OH
                 /   \
                HO     OH

Plotting PCA projections with >6 groups

When plotting PCA projections for a PCA that was computed with >6 groups, only the first 6 are actually plotted. This is because of the following code in plotting.py:

for dg, c, gn in zip(np.split(d, si), ['r', 'b', '#ffa600', 'purple', 'green', 'm'], group_names):
        ax.scatter(*dg.T[:2], marker='.', s=24, c=c, label=gn)

When I wrote this, the intent was to cycle through the 6 defined colors and repeat them when the number of groups exceeds 6. The above code does not do this, however, (relevant stackoverflow) so only the first 6 groups are actually plotted. This is a poor solution anyways because then the duplicated groups will be indistinguishable in the plot.

Possible solutions:

define a longer list of static colors, making it less likely that this problem will occur again (when will anyone have more than 8 groups? 16 groups?)
define some type of iterable that can produce an arbitrary number of unique colors (possibly randomly)
use matplotlib's cmap to get N colors spanning a predefined spectrum of colors

This issue also applies to the barplot_feature_bygroup plotting function

Need more info and usage examples in README

The README needs some more information about the API, including examples showing how to use the Dataset object, and associated stats and plotting functions:

summarize the structure and organization of the Dataset object
loading a dataset from .csv file, formatting the .csv file
performing data normalization
assigning groups
performing statistical analyses, viewing/interpreting results
generating plots of features, statistics
saving the Dataset object to a binary file, loading the binary file later

Direction of X-loadings in PLS-DA compared with Pearson Correlation

Sometimes the direction of the X-loadings from PLS-DA is flipped relative to the Pearson Correlation analysis. Normally when making S-Plots one features should show up in quadrant I (positive x-loading and positive correlation coefficient) for one sample and quadrant III (negative x-loading and negative correlation coefficient) for the other. At least when making an S-Plot (or perhaps this change should be made when storing the loadings in the first place?), the direction of the x-loadings needs to be flipped to reflect the proper group order used for correlation analysis.

The above is an example of an S-Plot where the x-loadings need to be flipped for the sample order to be correct (all negative values for Van4 and all positive values for Van8) and for the plot to be interpreted normally.

README needs a section on installation

There should be a section on how to install the library in README.md. This should come after the Overview and before the rest of the sections. It should mention the releases (source and wheel files), and pip installation (once a release has actually been uploaded to pip, ideally the next release which will have a few more fixes in it and will come before ASMS when we will be presenting lipydomics to a wider audience)

fix imports in LipidMass module so that it is accessible from lipydomics

if you try to import some code from the LipidMass module (e.g. from lipydomics.identification.LipidMass.lipids.glycerophospholipids import PIP3), it will fail since all of the import in LipidMass are absolute (I think that is the term for how it is set up now? i.e. from LipidMass.base import ...).
This could be solved using relative imports I think... or also by changing all of the from LipidMass.whatever import ... imports to from lipydomics.identification.LipidMass.whatever import ...
Either way, I think it would be beneficial to actually be able to import code from LipidMass since that could have some utility on its own.

make package pip installable

I would like for the package to be pip installable. It seems like we could just follow the instructions https://packaging.python.org/tutorials/packaging-projects/ to get it set up. I don't think we will necessarily need to put this up on the PyPI, but at least this will make it easier for users to download the package from this repo and get it working relatively quickly and easily

Implement saving/loading Dataset to/from disk

Implement serialization (or other method) for saving a lipidomics dataset (lipydomics.data.Dataset) to disk so that it can be easily loaded later.

A possible implementation would involve defining load/save functions separate from the Dataset class within lipydomics/data.py that either take a Dataset instance as an argument (save) or return one (load). Example:

from lipydomics.data import Dataset, load_dataset, save_dataset
dataset = Dataset('data.csv')
# ... do stuff with the dataset ...
save_dataset(dataset, 'dataset.pickle')
# ... later, load the saved version ...
dataset = load_dataset('dataset.pickle')

An alternative implementation would be defining a method within the Dataset class that saves the Dataset instance to disk, then altering Dataset.__init__(...) to include a boolean kwarg (something like load_bin=False) that controls whether the input file is treated as a .csv or a serialized Dataset instance (e.g. .pickle ). I am leaning toward this implementation because it seems a bit cleaner

from lipydomics.data import Dataset
dataset = Dataset('data.csv')
# ... do stuff with the dataset ...
dataset.save_bin('dataset.pickle')
# ... later, load the saved version by constructing a new Dataset ...
dataset = Dataset('dataset.pickle', load_bin=True)

Add Volcano Plot function

Volcano plots are popular in a lot of -omics analyses. Generally the idea is to plot Log2(fold-change) against the p-value for a two group comparison, and often certain cutoffs (e.g. p <= 0.05) are used to highlight the significantly different features. An example:

The plot itself should be easy enough to generate, it would work like the S-plot function which pulls two sets of statistics calculated using the same two groups: Log2(fold-change) and t-test (or equivalent nonparametric test) p-value and plot them against each other with formatting and whatnot. Two group Log2(fold-change) is already implemented as a stats function, but there is not yet a function for getting the p-value for comparison of two groups. See issue #66 regarding that stats function which needs to be implemented first...

add sheet with calibrated RT in exported spreadsheet

When exporting a Dataset instance to excel spreadsheet, check if the rt_calibration attribute is set and if so add a sheet with the calibrated retention times.

Getting the calibrated retention times could look something like:

if dset.rt_calibration is not None:
    cal_rts = [dset.rt_calibration.get_calibrated_rt(rt) for mz, rt, ccs in dset.labels.T]
    # add a sheet to the spreadsheet with the calibrated retention times
    # ...

retain/generate informative headers in exported spreadsheet

When exporting data to a spreadsheet, the headers of each column should have informative names. For instance, the feature labels should have m/z, retention time, and CCS, while all of the intensity columns should use group names (if assigned) or just indices (starting from 0)

For example, this is the "Data" sheet from an exported dataset:

	0	1	2	3	4	5	6
0	875.0819635	0.396533333	321.4978938	0	0	0	0
1	577.7609906	0.464166667	257.4985102	0	0	0	0
2	953.8316816	1.515533333	335.1760674	36532.75479	25451.21208	35097.7626
3	877.098118	0.51775	326.2188472	0	0	0	0
4	874.7856908	1.326683333	315.5401582	0	0.391970839	0.061965944	0.041801409

If groups have not been assigned, it should look like:

	mz	rt	ccs	0	1	2	3
0	875.0819635	0.396533333	321.4978938	0	0	0	0
1	577.7609906	0.464166667	257.4985102	0	0	0	0
2	953.8316816	1.515533333	335.1760674	36532.75479	25451.21208	35097.7626
3	877.098118	0.51775	326.2188472	0	0	0	0
4	874.7856908	1.326683333	315.5401582	0	0.391970839	0.061965944	0.041801409

And if there had been some groups assigned:

	mz	rt	ccs	group1	group1	group2	group2
0	875.0819635	0.396533333	321.4978938	0	0	0	0
1	577.7609906	0.464166667	257.4985102	0	0	0	0
2	953.8316816	1.515533333	335.1760674	36532.75479	25451.21208	35097.7626
3	877.098118	0.51775	326.2188472	0	0	0	0
4	874.7856908	1.326683333	315.5401582	0	0.391970839	0.061965944	0.041801409

Where the corresponding group indices would be {'group1': [0, 1], 'group2': [2, 3]}

Other sheets like identifications should do the same, e.g.:

	0	1	2	3	4
2670	775.4499298	0.086483333	252.0133333	theo_mz	PG(33:3)_[M+2Na-H]+
1596	475.3349619	0.1203	214.4559095	theo_mz	DG(24:2)_[M+Na]+
3154	459.3436095	0.1203	213.2139357	theo_mz	DG(26:4)_[M+H-H2O]+

would become:

	mz	rt	ccs	id_level	putative_id
2670	775.4499298	0.086483333	252.0133333	theo_mz	PG(33:3)_[M+2Na-H]+
1596	475.3349619	0.1203	214.4559095	theo_mz	DG(24:2)_[M+Na]+
3154	459.3436095	0.1203	213.2139357	theo_mz	DG(26:4)_[M+H-H2O]+

Convert database build scripts to use pure Python

Currently, the build scripts used to build the lipids database and predictive CCS and RT models are written as standalone executable scripts that are all coordinated using a bash script (build_new_db.sh). This limits the portability of the library when it comes to modifying or expanding the database because of the reliance on bash.

All of the build steps should be reworked so that the database build process (and predictive model training) is implemented in pure Python, that way it can be performed easily on any platform. It should end up working something like:

 python3 -m lipydomics.identification.build_db

use flush option on all print statements in test suite

When running tests, nothing gets printed until all the tests have completed. I think adding flush=True to all of the print() calls would fix this.

Need to be able to identify lipid features

Lipid features can be identified on the basis of m/z, LC retention time, CCS, or some combination of these descriptors. Identifications can be made by comparing against reference values or generated values. Need to add a module (lipydomics/identification) for performing such identifications. This module should have a function for applying identifications to the labels in a Dataset instance. An example of what this might look like:

from lipydomics.data import Dataset
from lipydomics.identification import add_identifications
dset = Dataset(...)
add_identifications(dset, ...) # the rest of the parameters indicate the ID criteria

This would then add/populate an instance variable, Dataset.identifications, with the feature identifications.
The add_identifications(...) function from above could be implemented in identifications/__init__.py, that way it is exposed in the same fashion as the stats or plotting functions, but it can still call all of the necessary internal utilities since the identification is going to take a few moving parts to work.
Other parameters passed to the add_identifications(...) function control the various criteria for making an identification.
plotting functions should be altered to be aware of identifications when relevant, e.g. when plotting the intensities of single features the identification should be included in the plot

Documentation on database building

The lipid database can be modified and rebuilt (and CCS/RT prediction models re-trained) using the built-in build system. There is currently no documentation on using the database build system (basically just invoking python3 -m lipydomics.identification.build_database) or the steps that it includes. This would be really good information to document, in addition to how to alter the build process to do things like changing which datasets are used to build the database and train the predictive models. This would probably be good information to house with the installation information (#62), since that is a likely step that should happen at installation time.

multi-barplot plotting function

People may want to look at a series of lipid species (i.e. a series of chain lengths) between different groups, or maybe a handful of lipids that show significant differences, all in one plot.
E.g.

<image from Hines, K. M. et al. mSphere 2, 99–16 (2017).>

It may therefore be nice to include a plotting function that can produce these multi-barplots. The challenges in implementing this would be in getting the formatting to look nice and consistent with varying numbers of lipid species or groups, as well as selection of the relevant data for each lipid species. I do not think the data selection method implemented in the normal barplot function would work for this, but perhaps something akin to what is used in the interactive selection of retention time calibrants may be suitable.

Documentation on the database build process

We need some documentation that describes the database build process, some important points:

running the main build script python3 -m lipydomics.identification.build_database
explain where the different database builds get saved
explain the general build process, what steps happen in what order
configuration of the database build process, how to customize what datasets are included, what build parameters are configurable
how to add new data, what files (and format) are required, what metadata is required and where does it go

this documentation should have an individual markdown file dedicated to it and referenced in README.md and api.md

Expose CCS/RT prediction functions in API

Currently, models trained to predict CCS and RT are only used in the database build process. The trained models are saved as pickle files and so they are technically available for making predictions. It might be good to expose these predictive models in the API so people can make predictions for individual lipids. There needs to be robust handling of the edge cases though, they should raise errors (or at the very least warnings) if a user tries to make predictions on lipid class, fa modifier, or ms adduct that has not been explicitly encoded. These functions should look something like:

from lipydomics.identification import predict_ccs, predict_rt

# get predicted CCS for PC(34:1) [M+H]+
theo_ccs = predict_ccs('PC', 34, 1, '[M+H]+')

# get predicted HILIC RT for PC(p33:2), MS adduct is irrelevant in this case
theo_rt = predict_rt('PC', 33, 2, fa_mod='p') 

# predict CCS for some fictitious lipid AbcDEF(o69:6) [M+Cl]-
# this will raise an error because AbcDEF is not an encoded lipid class
another_ccs = predict_ccs('AbcDEF', 69, 6, '[M+Cl]-', fa_mod='o')

Change CCS tolerances to use percentage package-wide

Absolute CCS tolerances are not typically used when comparing CCS values, the convention is to use a percent when matching. Therefore, all parts of the package that use CCS tolerance should be using percentage instead of absolute tolerance. The interactive interface should also be updated to reflect this difference.

api.md needs some more information on Dataset object

the Dataset object is the core of this library's functionality, as it stores all of the data, metadata, identifications and computed statistics for a lipidomics dataset. There should be a section added to api.md that describes how the Dataset object is organized, what type of information is stored and where. Having this information available to the user makes the library even more useful and flexible because the user might want to access intensities/identifications/statistics from the Dataset instance and do other things with them (custom analyses/reporting/plotting/etc.).

add normalization by internal standard

Add the ability to use a single feature's signal to normalize the rest of the data.

select a feature
get per-sample intensities
compute a normalization vector (set the max to 1.000, compute ratios, invert)
apply normalization (as usual)
this function can live in the interface module

Bar plot feature by group when no feature found

When trying to make bar plot feature by group in the interactive and no matching feature is found, the it tells you no matching feature was found and that the plot was generated while no plot was actually generated.