choderalab / assaytools

Modeling and Bayesian analysis of fluorescence and absorbance assays.
Home Page: http://assaytools.readthedocs.org
License: GNU Lesser General Public License v2.1
I'm currently writing an interface that quickly analyzes, stores, and makes a figure of the results of a Bayesian analysis, and noticed that for our simple pymc model function we aren't getting consistent results when rerunning the analysis:
from assaytools import pymcmodels
pymc_model = pymcmodels.make_model(Pstated, dPstated, Lstated, dLstated,
top_complex_fluorescence=reorder_protein,
top_ligand_fluorescence=reorder_buffer,
use_primary_inner_filter_correction=True,
use_secondary_inner_filter_correction=True,
assay_volume=assay_volume, DG_prior='uniform')
I mentioned these inconsistencies in my last lab meeting, and they are a big reason why I've started working on #56, but I thought I'd post this here as well, since it's coming up again.
In the image below, the two adjacent plots are from the same dataset, but the analyses were run at different times (notice the time stamps). The plotting is slightly different (delG in the title vs. in the legend), as this is what I was playing with when repeating the analysis. Results seem to be fine for Bosutinib Isomer, but not for Erlotinib (sorry the image is a bit fuzzy...).
The current parameters for MCMC sampling are:
niter = 500000 # number of iterations
nburn = 50000 # number of burn-in iterations to discard
nthin = 500 # thinning interval
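For reference, these settings retain (niter - nburn)/nthin posterior samples after burn-in and thinning, which is simple arithmetic worth keeping in mind when judging whether "increased sampling" will help:

```python
niter = 500000  # number of iterations
nburn = 50000   # number of burn-in iterations to discard
nthin = 500     # thinning interval

# Number of samples actually retained for posterior estimates
nretained = (niter - nburn) // nthin
print(nretained)  # 900
```

900 retained samples may be on the low side for estimating tail behavior of the posterior, which is consistent with hoping that more sampling will clear up the inconsistencies.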
See below images from the second half of this notebook: https://github.com/choderalab/assaytools/blob/master/data/spectra/Spectra-Analysis.ipynb
Hoping this will clear up with increased sampling...
Tested the effect of changing dPstated = 0.35 * inputs['Pstated'] to dPstated = 0.15 * inputs['Pstated']. Interestingly, this removed the long tails for the Src-Erlotinib fits, changing them from this (dPstated = 0.35 * inputs['Pstated']):
to this (dPstated = 0.15 * inputs['Pstated']):
Similarly, for the same number of iterations, running the same job three times gives more consistent answers with dPstated at 0.15, compared with this (dPstated = 0.35 * inputs['Pstated']):
As discussed in our meeting, we'd like to generalize the model to permit each well to have arbitrary concentrations of multiple components, and to attach a competitive binding model for each ligand species. Later, we could even provide a set of models to enable automated model selection.
For now, I'd like to propose an API for a general experiment, as well as a few convenience subclasses that make it easy to set up experiments of the kind we are currently performing.
We first need some way to describe, for each well:
I wonder if we might want to piggyback on autoprotocol-python to specify these in a form that will be compatible with autoprotocol.
We can use the Well object to describe all of this information by standardizing the names of specific additional well properties (which can be added via well.add_properties() or set through well.set_properties()). We can use a WellGroup to specify a collection of wells to pass to the Bayesian model for analysis.
An example of specifying a well using this approach would be:
from autoprotocol.container import Well, Container, ContainerType
# Define the container type
container_type = ContainerType(name='4ti-0110', is_tube=False, well_count=96, col_count=12,...)
# Define the container
container = Container(id=id, container_type=container_type)
# Define wells (the verbose way; we'd provide convenience methods to format the plate)
from autoprotocol.unit import Unit
well = container.well("A1") # retrieve a specific well from the plate
well.set_volume(Unit(100, "microliter")) # set the well volume
# Set concentrations of well components
concentrations = {
'Src' : Unit(0.5, "micromolar"),
'bosutinib' : Unit(20, "micromolar"),
'imatinib' : Unit(10, "micromolar")
}
well.set_properties({'concentrations' : concentrations})
# We would also need the optical path length through liquid, or well area.
well.set_properties({'area' : Unit(35, "millimeters**2")})
Since Unit does not provide automatic unit conversion capabilities, we would have to work out a way to automatically convert between simtk.unit and autoprotocol.unit.Unit.
autoprotocol doesn't specify any particular format for the data output from plate reads, so we're stuck with coming up with our own. We might want to create a class that ingests an Infinite XML file and provides a pythonic way to access data, e.g.
>>> from assaytools.platereaders import InfiniteDataset
>>> data = InfiniteDataset(filename='data/2015-01-20 18-14-25_plate_1.xml')
>>> print data.welldata('A6')
{'TopFluorescenceRead' : 12451, 'BottomFluorescenceRead' : 74252, 'Absorbance' : 0.5632 }
where InfiniteDataset would be a subclass of a more generic PlateReaderDataset.
Because we haven't been super consistent about naming schemes (though we should be!), we may want to provide a mapping dict, either here or in the analysis stage, that maps the names in the XML file to the names we want in the dict returned by data.welldata(id).
We could specify the components of a well as a dict of concentrations.
# Create a model
from assaytools.analysis import CompetitiveBindingAnalysis
receptor = 'Src' # receptor name (only one allowed)
ligands = ['bosutinib', 'bosutinib-isomer', 'imatinib'] # competitive ligand names
model = CompetitiveBindingAnalysis(wells=wellgroup, data=welldata, receptor=receptor, ligands=ligands)
# fit the maximum a posteriori (MAP) estimate
map = model.map_fit()
# run some MCMC sampling and return the MCMC object
mcmc = model.run_mcmc()
This could use some help!
Long-term:
Change xml2png to use seaborn so colors are more informative, and change row labels to A-H instead of 0-7.
We need to be sure to mention pint as a dependency in conda/docs. This was a problem @Lucelenie had when first installing assaytools.
We should add this console script too:
https://github.com/choderalab/kinome-data/blob/master/fluorescence-assay/Plate_Type_Analysis/xml2png4scans.py
Looks like some of the dependencies (seaborn, pymc) have some numpy version restrictions, so we may need to build our own versions of some of these packages for the omnia channel to match other omnia python versions.
/cbio/jclab/home/hansons/opt/anaconda/lib/python2.7/site-packages/assaytools-0.2.0-py2.7.egg/assaytools/pymcmodels.py:239: RuntimeWarning: invalid value encountered in divide
scaling = (1 - np.exp(-alpha)) / alpha
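The warning comes from evaluating (1 - exp(-alpha))/alpha at alpha = 0, where the expression is 0/0 even though the limit is 1. A sketch of a guarded version (just an illustration of the numerical issue, not the assaytools implementation):

```python
import numpy as np

alpha = np.array([0.0, 1e-8, 0.1, 1.0])

# Naive form: 0/0 at alpha == 0 triggers "invalid value encountered in divide"
with np.errstate(invalid='ignore'):
    naive = (1 - np.exp(-alpha)) / alpha

# Guarded form: use the small-alpha expansion (1 - exp(-a))/a ~ 1 - a/2 near 0
safe_alpha = np.maximum(alpha, 1e-6)  # avoid evaluating 0/0 in the taken branch
scaling = np.where(alpha > 1e-6,
                   (1 - np.exp(-safe_alpha)) / safe_alpha,
                   1.0 - alpha / 2.0)

print(np.isnan(naive[0]), scaling[0])  # True 1.0
```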
Update platereader.py, incorporating the good parts of xml2png.py and the good parts of the current platereader.py (e.g. toDataframe(), getWellData(), getColumn(), and getRow()), then update xml2png and the ipynbs accordingly.

At the moment quickmodel is designed to work with this plate layout:
I would like to analyse an experiment with a different layout:
Row A : protein1 + ligand1
Row B : protein2 + ligand1
Row C : protein3 + ligand1
Row D : protein4 + ligand1
Row E : protein5 + ligand1
Row F : protein6 + ligand1
Row G : buffer + ligand1
Row H : buffer + ligand1
I will try to modify quickmodel script to handle this case. Can I add it as an alternative quickmodel script to assaytools for now? @sonyahanson
I think this is the part I need to modify:
https://github.com/choderalab/assaytools/blob/master/scripts/quickmodel.py#L74-L75
We can use the current framework to fit multiple replicates of the same experiment that use the same stock solutions. We simply have to concatenate the concentration and fluorescence vectors from the multiple experiments. That is, if each experiment used 12 wells and we perform it in triplicate, the concentration and fluorescence vectors will have 36 elements.
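A sketch of what that concatenation would look like with numpy (the vector names and placeholder fluorescence values here are hypothetical):

```python
import numpy as np

# Three hypothetical replicates, each a 12-well dilution series
Lstated_rep = np.logspace(-5, -8, 12)   # ligand concentrations, M
fluor_rep1 = np.random.rand(12)         # fluorescence reads (placeholders)
fluor_rep2 = np.random.rand(12)
fluor_rep3 = np.random.rand(12)

# Concatenate concentrations and fluorescence for a single joint fit:
# the model sees one 36-element experiment
Lstated = np.concatenate([Lstated_rep] * 3)
fluorescence = np.concatenate([fluor_rep1, fluor_rep2, fluor_rep3])

print(Lstated.shape, fluorescence.shape)  # (36,) (36,)
```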
Whenever I run quickmodel.py I see the following warning, but only for the analysis of the 1st row of the plate. I don't understand what is causing it, but it is a curiosity that it doesn't happen in the analysis of the following rows:
/Users/isikm/opt/anaconda/lib/python2.7/site-packages/assaytools-0.2.0-py2.7.egg/assaytools/pymcmodels.py:77: RuntimeWarning: divide by zero encountered in power
tau = np.sqrt(np.log(1.0 + (stddev/mean)**2))**(-2)
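One way to reproduce exactly this warning (a guess, since I don't have the inputs in front of me): if stddev is zero for some quantity in that row, then log(1 + 0) = 0 and 0**(-2) is the divide-by-zero in the power, giving an infinite precision:

```python
import numpy as np

mean = 1.0e-6
stddev = 0.0  # zero stated uncertainty triggers the warning

with np.errstate(divide='ignore'):  # suppress the RuntimeWarning for the demo
    tau = np.sqrt(np.log(1.0 + (stddev / mean)**2))**(-2)

print(tau)  # inf
```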
Adjust makemodel so it can also be used for competition assay modeling. This might be easier if we use a dictionary as input instead of a bunch of separate terms.
From our discussion today, here is what we decided we would need for input:
compound_stocks = [
{ 'id' : 'BOS001',
'compound_name' : 'bosutinib',
'compound_mw' : 530.446, # g/mol
'compound_mass' : 12.45, # mg compound dispensed
'compound_purity' : 0.96, # purity
'solvent_mass' : 12.52, # g solvent dispensed
}
...
]
The compound stock dictionary could be used for all experiments, and would eventually be retrieved from a lightweight database running on Amazon EC2 or someplace.
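As an illustration of how such an entry could be turned into a stock concentration (a hypothetical helper, not part of assaytools; it assumes the solvent is DMSO at ~1.10 g/mL, which the dict itself does not record):

```python
def stock_concentration(stock, solvent_density=1.10):
    """Compute stock concentration (M) from a compound stock entry.

    solvent_density is in g/mL; the default 1.10 g/mL assumes DMSO
    (an assumption, not recorded in the stock dict).
    """
    moles = stock['compound_mass'] * 1e-3 * stock['compound_purity'] / stock['compound_mw']  # mol
    volume = stock['solvent_mass'] / solvent_density * 1e-3  # L
    return moles / volume

bos001 = {
    'id': 'BOS001',
    'compound_name': 'bosutinib',
    'compound_mw': 530.446,   # g/mol
    'compound_mass': 12.45,   # mg compound dispensed
    'compound_purity': 0.96,  # purity
    'solvent_mass': 12.52,    # g solvent dispensed
}

print('%.2f mM' % (stock_concentration(bos001) * 1e3))  # 1.98 mM
```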
['BOS001', 'GEF002', 'ERL001', 'BSI004']

It's actually surprisingly difficult to find a simple approach to compute water evaporation rates.
As a result, I'd suggest performing the experiment: Fill a 100 mL trough, a 15 mL trough, and a 96-well plate with a defined quantity of water and measure their mass every hour for 8 hours; then at 24 hours and 48 hours.
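Once those masses are collected, the per-well evaporation rate could be estimated with a simple linear fit; a sketch with entirely made-up numbers (purely illustrative, no real measurements):

```python
import numpy as np

# Hypothetical hourly mass measurements of a filled 96-well plate (g);
# these numbers are invented for illustration only
t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=float)  # hours
mass = np.array([40.00, 39.95, 39.91, 39.86, 39.82, 39.77, 39.73, 39.68, 39.64])

# Linear fit: slope is the total evaporation rate in g/hour
slope, intercept = np.polyfit(t, mass, 1)

# Per-well rate, assuming water (1 g ~ 1 mL) split evenly over 96 wells
rate_per_well_uL = -slope / 96 * 1000.0  # uL/hour/well

print('%.2f uL/hour/well' % rate_per_well_uL)
```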
Tested the effect of setting F_PL using values taken from the mean after equilibration from a previous run of assaytools, and dF_PL=0.50, which was approximated from the standard deviation. Tested this for Src:Erlotinib and Src:Gefitinib.
F_PL_def = {'Src-Erlotinib-EF': 59601584903.0,
'Src-Gefitinib-GH': 81835891154.8}
Similar to issue #106, we see the expected removal of long tails and better reproducibility compared to not setting F_PL. Note that here we are still using dPstated = 0.35 * inputs['Pstated'].
Without setting F_PL:
Setting F_PL and dF_PL:
Without setting F_PL:
Setting F_PL and dF_PL:
Also, we no longer see any correlation between F_PL and DeltaG:
Without setting F_PL:
While this is nice, I'm worried this is maybe too restrictive. Maybe we can just use a slightly more informed prior instead (link):
F_PL_guess = (Fmax - Fmin) / min(Pstated.max(), Lstated.max())
model['F_PL'] = pymc.Uniform('F_PL', lower=0.0, upper=2*Fmax/min(Pstated.max(),Lstated.max()), value=F_PL_guess)
While installing assaytools in a Python 3.5 environment, I noticed some dependencies are missing. I had to install lxml, seaborn, and pymc separately after running python setup.py install.
I noticed that those missing dependencies are actually listed in setup.py, but commented out: https://github.com/choderalab/assaytools/blob/master/setup.py#L87
I want to activate those lines so that these dependencies are also installed at setup.
This is just a reminder for me to convert all strictly positive quantities (concentrations, intensities, quantum yields, etc.) to log-space for the pymc2 models as I've done for the pymc3 models.
Fix xml2png so it works well for scans and > 3 single reads without manual intervention.
Thanks to Patrick and Bas for chatting about this over coffee after my lab meeting. Sounds like Lee had a very promising answer: Gaussian Processes!
Here are some potentially useful links I found by googling 'gaussian process outliers python':
https://bugra.github.io/work/notes/2014-05-11/robust-regression-and-outlier-detection-via-gaussian-processes/
https://ocefpaf.github.io/python4oceanographers/blog/2015/03/16/outlier_detection/
Add the testing script from openpathsampling (opentis) and integrate into travis testing.
cc: #83 (comment)
When plotting some of the new data (with higher gain and larger bandwidth), but not all of it, there are errors when plotting is attempted. We think it is related to the fact that we are using e^sigma in plots.py:
https://github.com/choderalab/assaytools/blob/master/AssayTools/plots.py#L356
There was some confusion during the constant-pH grant process about what units we are using here for deltaG. Until now we've been consistently using John's k_B*T units (which do seem to actually be R*T, see below). I'm not clear, however, on how we want to proceed with changing/confirming this within assaytools as a whole. Should we change everything to kcal/mol? Or kJ/mol (SI units)? Or should we provide options for the user to output the unit they find most informative?
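For reference, converting a DeltaG expressed in thermal units (k_B*T per molecule, equivalently R*T per mole) to kcal/mol or kJ/mol is a one-liner; a sketch at T = 298.15 K (the example DeltaG value is arbitrary):

```python
# Convert DeltaG from thermal units (kT, i.e. RT per mole) to kcal/mol and kJ/mol
R_KCAL = 1.9872e-3  # gas constant, kcal/(mol*K)
R_KJ = 8.3145e-3    # gas constant, kJ/(mol*K)
T = 298.15          # temperature, K

def kT_to_kcal(dG_kT, T=T):
    return dG_kT * R_KCAL * T

def kT_to_kJ(dG_kT, T=T):
    return dG_kT * R_KJ * T

dG = -12.5  # example binding free energy in kT units
print('%.2f kcal/mol  %.2f kJ/mol' % (kT_to_kcal(dG), kT_to_kJ(dG)))
```

A user-facing option could simply apply one of these factors at output time, so the internal representation stays in kT.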
Is the chembl prior centered around 0? If so, this seems wrong and we should adjust accordingly if we ever plan to use this: https://github.com/choderalab/assaytools/blob/master/AssayTools/pymcmodels.py#L372
Currently we generally use the uniform prior, so this hasn't been an issue thus far...
Started working on a way of storing the results, most useful for tracking different ways of analyzing the same dataset. So far this is what I've come up with: a dictionary of inputs that gets combined with a dictionary of outputs that gets stored as a csv file.
inputs = {
'data_file' : 'p38_singlet1_20160420_153238.xml',
'Lstated' : np.array([20.0e-6,14.0e-6,9.82e-6,6.88e-6,4.82e-6,3.38e-6,2.37e-6,1.66e-6,1.16e-6,0.815e-6,0.571e-6,0.4e-6,0.28e-6,0.196e-6,0.138e-6,0.0964e-6,0.0676e-6,0.0474e-6,0.0320e-6,0.0240e-6,0.0160e-6,0.0120e-6,0.008e-6,0.00001e-6], np.float64), # ligand concentration, M
'ligand_order' : ['BOS_w_backfill','BSI_w_backfill','ERL_w_backfill','GEF_w_backfill','BOS_no_backfill','BSI_no_backfill','ERL_no_backfill','GEF_no_backfill'],
'protein' : 'p38',
'Pstated' : 0.5e-6 * np.ones([24],np.float64), # protein concentration, M
'assay_volume' : 50e-6, # assay volume, L
'well_area' : 0.1369, # well area, cm^2 for 4ti-0203 [http://4ti.co.uk/files/3113/4217/2464/4ti-0201.pdf]
}
outputs = {
'analysis' : 'pymcmodels',
'outfiles' : 'DeltaG_%s.npy, DeltaG_trace_%s.npy' %(rows_to_analyze, rows_to_analyze),
'rows_to_analyze' : rows_to_analyze,
'section' : section,
'DeltaG' : "DeltaG = %.1f +- %.1f kT" % (DeltaG, dDeltaG),
'Kd' : Kd_summary,
'datetime' : datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
}
This is a rough start so more input on what to include and what format would be ideal is much appreciated.
Currently this is created and run by a quick_model function that takes in the inputs dictionary plus rows_to_analyze = 'BSI_w_backfill' and section = '280_480_TOP_120', makes a few figures, and spits out the results. So far this just spits out a csv file for each analysis rather than appending to a database.
Would be great for @sonyahanson to check in her latest fluorescence assay examples when possible.
To do this we will add a special case for wells where [L] = 0, both for [P] = the stated protein concentration and for [P] = 0. Currently we have to define a minimal ligand concentration in these wells, even though this is not actually the case.
Add a quickmodel.py example and notebook to examples.
We need to figure out what tests we want. This is currently pretty incomplete.
A decent amount of updates have been added to the code. Would it be helpful to create a new prerelease? Since we actively use some of this code in projects, it might be easier to rely on releases to indicate what version of the code was used for an analysis.
Have we standardized the Section names for the XML files generated by the Infinite? If so, are these documented somewhere?
At least in the 'BGI...'.png files I'm looking at, it just says 'None, None, None'.
Change these to nan's so they are not taken into account in analysis.
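A minimal sketch of that conversion (assuming the missing values arrive as Python None or the string 'None' when parsed, which is a guess about the parse output):

```python
import numpy as np

# Values as they might come out of the XML parse (hypothetical)
raw = [0.5632, None, 'None', 0.6120]

values = np.array([float(x) if x not in (None, 'None') else np.nan
                   for x in raw])

# nan entries are now ignored by nan-aware reductions
print(np.nanmean(values))  # 0.5876
```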
There were some changes at anaconda.org that require updating the automated testing framework.
@sonyahanson has some nice code that generates a full page of traces that are useful for understanding what is going on with sampling issues. It would be great to add a command-line argument to quickmodel to generate these immediately after sampling is completed.
I was testing xml2png.py with Infinite results of different assay types. I will list here some suggestions:
- None is printed for the 'Part of Plate' section.
- I tried the singlet_384 type, but I noticed it has a hard-coded ligand concentration (Lstated) for plotting. Can we remove that?
- The difference between spectra and scan wasn't obvious to me.

This warning appears when running quickbayes.sh:
/Users/choderaj/miniconda3/lib/python3.5/site-packages/matplotlib/pyplot.py:524: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
max_open_warning, RuntimeWarning)
This is just a reminder for me to create a convenience function to generate pymc.Lognormal random variates, similar to the pymc3 convenience function.
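The underlying moment-matching math is the same in either pymc version: given the desired arithmetic mean and standard deviation of a positive quantity, solve for the (mu, tau) of the log-quantity. A sketch with just the math module (not the actual pymc wrapper, which would pass these into pymc.Lognormal):

```python
import math

def lognormal_params(mean, stddev):
    """Return (mu, tau) for a lognormal whose arithmetic mean and
    standard deviation match the given values (moment matching)."""
    var_ratio = (stddev / mean)**2
    tau = 1.0 / math.log(1.0 + var_ratio)  # precision of the log-quantity
    mu = math.log(mean) - 0.5 * math.log(1.0 + var_ratio)
    return mu, tau

# e.g. a 0.5 uM concentration known to ~10%
mu, tau = lognormal_params(0.5e-6, 0.05e-6)
```

Note this tau is the same expression as the one in pymcmodels.py line 77 above, so the convenience function would centralize it in one place.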
As suggested by @MehtapIsik: the names of the xml data files are confusing. Can we change them all to a "[protein name]_[fluorescent probe]_[non-fluorescent ligand (if exists)]_[experiment date].xml" format?
From some profiling tests, it looks like root solving in the GeneralBindingModel is the dominant cost of the new autoprotocol-based assaytools changes, with somewhere around 76% of total sampling time spent in calls to scipy.optimize.root here.
I imagine we could greatly speed this up by optimizing the target function (e.g. supplying an analytic Jacobian or moving it to compiled code).
(See also #52 on Gaussian Processes for outlier detection.)
I'm evaluating different strategies for automated Bayesian outlier detection, but it would be very helpful to have an idea of the physical mechanism causing outliers since this will affect how we deal with it in the code.
Some possibilities would affect all measurements:
Some effects might affect only one type of measurement:
Any other ideas?
@sonyahanson has prepared a nice dataset that illustrates this here, but it may need to be updated to the new scheme for representing input for quickmodel.
Probably good to change how/where we are putting the raw data. I guess I have started doing this with the assaytools/data folder, where the data is separated out by data type (singlet, spectra, etc.). Though right now there are also snippets of analysis in there, which I should move.
Also, do we want to keep the data we are analyzing for the paper and the data we have for other projects (Mehtap, Andrea, etc.) in a different place? Possibly a different repository? Or is a separate folder within assaytools/data sufficient?