
assaytools's Introduction


assaytools

Assay modeling and Bayesian analysis made easy.

AssayTools is a Python library that allows users to model automated assays and analyze assay data using powerful Bayesian techniques that allow complete characterization of the uncertainty in fit models.

See the online documentation.

See some examples of assaytools in action.

WARNING: This is an early, experimental pre-release. AssayTools is still under active development. Use at your own risk!

Authors

Acknowledgments

  • Many thanks to the authors of the MDTraj project on which much of the modern project skeleton is based.

assaytools's People

Contributors

gregoryross, grundye, jchodera, jhprinz, lucelenie, mehtapisik, mikemhenry, sonyahanson, steven-albanese


assaytools's Issues

Missing dependencies

While installing assaytools in a Python 3.5 environment, I noticed that some dependencies are missing. I had to install lxml, seaborn, and pymc separately after running python setup.py install.

I noticed that those missing dependencies are actually listed in setup.py, but commented out: https://github.com/choderalab/assaytools/blob/master/setup.py#L87

I want to activate those lines so that these dependencies are also installed during setup.
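For reference, a minimal sketch of what re-enabling those lines would look like (the version string and exact package list here are illustrative; the authoritative list is the commented block in setup.py):

from setuptools import setup, find_packages

setup(
    name='assaytools',
    version='0.2.0',  # illustrative
    packages=find_packages(),
    install_requires=[
        'numpy',
        'scipy',
        'lxml',     # reported missing above
        'seaborn',  # reported missing above
        'pymc',     # reported missing above
    ],
)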

Bayesian outlier detection

(See also #52 on Gaussian Processes for outlier detection.)

I'm evaluating different strategies for automated Bayesian outlier detection, but it would be very helpful to have an idea of the physical mechanisms causing outliers, since this will affect how we deal with them in the code.

Some possibilities would affect all measurements:

  • Erroneous compound dispensing via the HP D300
  • Erroneous protein dispensing via the EVO
  • Incomplete mixing
  • Ligand or protein aggregation

Some effects might affect only one type of measurement:

  • Dust particles might only affect top or bottom reads
  • Scratch, smear, or defect in plate may only affect bottom reads

Any other ideas?

@sonyahanson has prepared a nice dataset that illustrates this here, but it may need to be updated to the new scheme for representing input for quickmodel.

Fitting multiple replicates

We can use the current framework to fit multiple replicates of the same experiment that use the same stock solutions.

We simply have to concatenate the concentration and fluorescence vectors from the multiple experiments. That is, if each experiment used 12 wells and we performed them in triplicate, the concentration and fluorescence vectors will have 36 elements.
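A minimal sketch of the concatenation (array contents are illustrative):

import numpy as np

# Three replicates of a 12-well titration (values illustrative).
Lstated_reps = [np.logspace(-8, -5, 12) for _ in range(3)]  # ligand concentrations, M
fluorescence_reps = [np.random.rand(12) for _ in range(3)]  # measured fluorescence

# Concatenate into single 36-element vectors for the existing model.
Lstated = np.concatenate(Lstated_reps)
fluorescence = np.concatenate(fluorescence_reps)
assert Lstated.shape == (36,) and fluorescence.shape == (36,)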

Another error/warning when using quickmodel

/cbio/jclab/home/hansons/opt/anaconda/lib/python2.7/site-packages/assaytools-0.2.0-py2.7.egg/assaytools/pymcmodels.py:239: RuntimeWarning: invalid value encountered in divide
  scaling = (1 - np.exp(-alpha)) / alpha     
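The expression tends to 1 as alpha -> 0, so one possible fix (a sketch, not the actual pymcmodels.py code) is to guard that limit explicitly:

import numpy as np

def inner_filter_scaling(alpha):
    # scaling = (1 - exp(-alpha)) / alpha, with the alpha == 0 limit
    # handled explicitly to avoid the 0/0 that triggers the warning.
    alpha = np.asarray(alpha, dtype=float)
    safe = np.where(alpha == 0.0, 1.0, alpha)  # dummy divisor where alpha == 0
    # -expm1(-x) = 1 - exp(-x), computed accurately for small x
    return np.where(alpha == 0.0, 1.0, -np.expm1(-safe) / safe)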

Proposed API for more general experiment handling

As discussed in our meeting, we'd like to generalize the model to permit each well to have arbitrary concentrations of multiple components, and to attach a competitive binding model for each ligand species. Later, we could even provide a set of models to enable automated model selection.

For now, I'd like to propose an API for a general experiment, as well as a few convenience subclasses that make it easy to set up experiments of the kind we are currently performing.

Specifying wells

We first need some way to describe, for each well:

  • total well volume
  • liquid level height (which might be computed from the well geometry and volume)
  • concentrations (and uncertainties) of all components in the well

I wonder if we might want to piggyback on autoprotocol-python to specify these in a form that will be compatible with autoprotocol.

We can use the Well object to describe all of this information by standardizing the names of specific additional well properties (which can be added via well.add_properties() or set through well.set_properties()). We can use a WellGroup to specify a collection of wells to pass to the Bayesian model for analysis.

An example of specifying a well using this approach would be

from autoprotocol.container import Well, Container, ContainerType
# Define the container type
container_type = ContainerType(name='4ti-0110', is_tube=False, well_count=96, col_count=12,...)
# Define the container
container = Container(id=id, container_type=container_type)
# Define wells (the verbose way; we'd provide convenience methods to format the plate)
from autoprotocol.unit import Unit
well = container.well("A1") # retrieve a specific well from the plate
well.set_volume(Unit(100, "microliter")) # set the well volume
# Set concentrations of well components
concentrations = { 
    'Src' : Unit(0.5, "micromolar"), 
    'bosutinib' : Unit(20, "micromolar"), 
    'imatinib' : Unit(10, "micromolar")
}
well.set_properties({'concentrations' : concentrations})
# We would also need the optical path length through liquid, or well area.
well.set_properties({'area' : Unit(35, "millimeters**2")})

Since Unit does not provide automatic unit conversion capabilities, we would have to work out a way to automatically convert between simtk.unit and autoprotocol.unit.Unit.
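A sketch of one direction for that conversion, assuming the unit name (e.g. 'micromolar') is spelled identically in simtk.unit and autoprotocol:

from autoprotocol.unit import Unit
from simtk import unit as simtk_unit

def simtk_to_autoprotocol(quantity, unit_name):
    # Convert a simtk.unit.Quantity to an autoprotocol Unit, given a
    # target unit name spelled the way both packages expect.
    value = quantity.value_in_unit(getattr(simtk_unit, unit_name))
    return Unit(value, unit_name)

print(simtk_to_autoprotocol(0.5 * simtk_unit.micromolar, 'micromolar'))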

Specifying plate reader data

autoprotocol doesn't specify any particular format for the data output from plate reads, so we're stuck with coming up with our own. We might want to create a class that ingests an Infinite XML file and provides a pythonic way to access data, e.g.

>>> from assaytools.platereaders import InfiniteDataset
>>> data = InfiniteDataset(filename='data/2015-01-20 18-14-25_plate_1.xml')
>>> print data.welldata('A6')
{'TopFluorescenceRead' : 12451, 'BottomFluorescenceRead' : 74252, 'Absorbance' : 0.5632 }

where InfiniteDataset would be a subclass of a more generic PlateReaderDataset.

Because we haven't been super consistent about naming schemes (though we should be!), we may want to provide a mapping dict, either here or in the analysis stage, that maps the names in the XML file to the names we want in the dict returned by data.welldata(id).
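A sketch of what that mapping dict might look like (the XML section labels here are hypothetical, and name_map is a proposed argument, not existing API):

name_map = {
    '280_480_TOP_120'    : 'TopFluorescenceRead',
    '280_480_BOTTOM_120' : 'BottomFluorescenceRead',
    'Abs_280'            : 'Absorbance',
}
data = InfiniteDataset(filename='data/2015-01-20 18-14-25_plate_1.xml',
                       name_map=name_map)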

General API for Bayesian analysis

We could specify the components of a well as a dict of concentrations.

# Create a model
from assaytools.analysis import CompetitiveBindingAnalysis
receptor = 'Src' # receptor name (only one allowed)
ligands = ['bosutinib', 'bosutinib-isomer', 'imatinib'] # competitive ligand names
model = CompetitiveBindingAnalysis(wells=wellgroup, data=welldata, receptor=receptor, ligands=ligands)
# fit the maximum a posteriori (MAP) estimate
map = model.map_fit()
# run some MCMC sampling and return the MCMC object
mcmc = model.run_mcmc()

This could use some help!

What units should we be using for deltaG?

There was some confusion during the constant pH grant process about what units we are using here for deltaG. Until now we've been consistently using John's k_B*T units (which do seem to actually be R*T; see below). I'm not clear, however, on how we want to proceed with changing/confirming this within assaytools as a whole. Should we change everything to kcal/mol? Or should we use kJ/mol (SI units)? Or should we provide options for the user to output the unit they find most informative?

[Slack screenshot: units]
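For concreteness, a sketch of the conversions under discussion, taking the "kT" units to mean multiples of R*T per mole at T = 298.15 K (an assumed reference temperature):

RT_kJ = 8.314e-3 * 298.15   # R*T in kJ/mol at 298.15 K (~2.48 kJ/mol)
RT_kcal = RT_kJ / 4.184     # R*T in kcal/mol (~0.593 kcal/mol)

DeltaG_kT = -15.0           # example value in R*T units
print('%.1f kJ/mol' % (DeltaG_kT * RT_kJ))      # ~ -37.2 kJ/mol
print('%.1f kcal/mol' % (DeltaG_kT * RT_kcal))  # ~ -8.9 kcal/mol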

Standardize XML data file names.

As suggested by @MehtapIsik

Names of the XML data files are confusing. Can we change them all to a "[protein name]_[fluorescent probe]_[non-fluorescent ligand (if exists)]_[experiment date].xml" format?
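A sketch of the proposed naming scheme in code (names illustrative):

def xml_filename(protein, probe, competitor=None, date='20160420'):
    # Build "[protein]_[probe]_[competitor (if exists)]_[date].xml".
    parts = [protein, probe] + ([competitor] if competitor else []) + [date]
    return '_'.join(parts) + '.xml'

print(xml_filename('p38', 'BOS', 'GEF', '20160420'))  # p38_BOS_GEF_20160420.xml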

Quickmodel should explicitly close figures after writing them to disk

This warning appears when running quickbayes.sh:

/Users/choderaj/miniconda3/lib/python3.5/site-packages/matplotlib/pyplot.py:524: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
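A sketch of the fix (filename illustrative): close each figure explicitly after saving it.

import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([0, 1], [0, 1])
fig.savefig('example.png', dpi=300)
plt.close(fig)  # or plt.close('all') at the end of each analysis iteration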

Determine dependency of delG error on dPstated

Tested the effect of changing dPstated = 0.35 * inputs['Pstated'] to dPstated = 0.15 * inputs['Pstated']. Interestingly, this removed the long tails for the Src-Erlotinib fits, changing them from this (dPstated = 0.35 * inputs['Pstated']):

[Figure: src-erlotinib-ef_binding_iter0]

to this (dPstated = 0.15 * inputs['Pstated']):

[Figure: src-erlotinib-ef_binding_iter0]

Similarly for the same number of iterations, running the same job three times gives more consistent answers with dPstated at 0.15, changing this (dPstated = 0.35 * inputs['Pstated']):

[Figure: comparing_src_geferl_3iter]

to this (dPstated = 0.15 * inputs['Pstated']):
[Figure: comparing_src_geferl_3iter]

New pre-release

A decent number of updates have been made to the code. Would it be helpful to create a new pre-release? Since we actively use some of this code in projects, it might be easier to rely on releases to indicate what version of the code was used for an analysis.

Improving xml2png.py

I was testing xml2png.py with Infinite results of different assay types. I will list here some suggestions:

  1. Adding a usage example with the type argument would be useful here:
    https://github.com/choderalab/assaytools/blob/master/scripts/xml2png.py#L4
  2. For the single_96 type, the plots are too wide. We should adjust the plot dimensions.
    There is empty space on the x-axis beyond well index 12 on the 96-well plot. [Figure: mi_flu_hsa_lig1_20150914_172954]
  3. I think adding axis labels would be great. x-axis label: Well Index, y-axis label: Fluorescence.
  4. The plot title usually has None printed for the 'Part of Plate' section.
    https://github.com/choderalab/assaytools/blob/master/scripts/xml2png.py#L111-L124
    Do we need the "Part of Plate" information? Or can we assign a default "Full Plate" when the XML doesn't have this information?
  5. I don't have any 384-well data, so I couldn't test the singlet_384 type, but I noticed it has a hard-coded ligand concentration (Lstated) for plotting. Can we remove that?
    https://github.com/choderalab/assaytools/blob/master/scripts/xml2png.py#L404-L418
  6. The difference between the spectra and scan types wasn't obvious to me.

Create storage object for analyses

Started working on a way of storing the results, most useful for tracking different ways of analyzing the same dataset. So far this is what I've come up with: a dictionary of inputs that gets combined with a dictionary of outputs and stored as a CSV file.

inputs = {
    'data_file'     :  'p38_singlet1_20160420_153238.xml',
    'Lstated'       :  np.array([20.0e-6,14.0e-6,9.82e-6,6.88e-6,4.82e-6,3.38e-6,2.37e-6,1.66e-6,1.16e-6,0.815e-6,0.571e-6,0.4e-6,0.28e-6,0.196e-6,0.138e-6,0.0964e-6,0.0676e-6,0.0474e-6,0.0320e-6,0.0240e-6,0.0160e-6,0.0120e-6,0.008e-6,0.00001e-6], np.float64), # ligand concentration, M
    'ligand_order'  :  ['BOS_w_backfill','BSI_w_backfill','ERL_w_backfill','GEF_w_backfill','BOS_no_backfill','BSI_no_backfill','ERL_no_backfill','GEF_no_backfill'],
    'protein'       :  'p38',
    'Pstated'       :  0.5e-6 * np.ones([24],np.float64), # protein concentration, M
    'assay_volume'  :  50e-6, # assay volume, L
    'well_area'     :  0.1369, # well area, cm^2 for 4ti-0203 [http://4ti.co.uk/files/3113/4217/2464/4ti-0201.pdf]
    }

outputs = {
        'analysis'        : 'pymcmodels',
        'outfiles'        : 'DeltaG_%s.npy, DeltaG_trace_%s.npy' %(rows_to_analyze, rows_to_analyze),
        'rows_to_analyze' : rows_to_analyze,
        'section'         : section,
        'DeltaG'          : "DeltaG = %.1f +- %.1f kT" % (DeltaG, dDeltaG),
        'Kd'              :  Kd_summary,
        'datetime'        : datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
    }

This is a rough start, so input on what to include and what format would be ideal is much appreciated.

Currently this is created and run by a quick_model function that takes in the inputs dictionary along with rows_to_analyze = 'BSI_w_backfill' and section = '280_480_TOP_120', makes a few figures, and spits out the results.

So far this just writes a separate CSV file for each analysis rather than appending to a database.
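One lightweight option (a sketch; the merged column set is illustrative) would be to append each record to a single growing CSV instead:

import csv
import os

def append_record(record, path='analyses.csv'):
    # Append one analysis record to a shared CSV, writing the header
    # only when the file is first created.
    write_header = not os.path.exists(path)
    with open(path, 'a') as f:
        writer = csv.DictWriter(f, fieldnames=sorted(record.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(record)

record = dict(inputs)   # the inputs dict defined above
record.update(outputs)  # the outputs dict defined above
append_record(record)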

Figure out how to get consistent delG predictions

I'm currently writing an interface that quickly analyzes, stores, and plots the results of a Bayesian analysis, and noticed that for our simple pymc model function we aren't getting consistent results when rerunning the analysis:

from assaytools import pymcmodels
pymc_model = pymcmodels.make_model(Pstated, dPstated, Lstated, dLstated,
    top_complex_fluorescence=reorder_protein,
    top_ligand_fluorescence=reorder_buffer,
    use_primary_inner_filter_correction=True,
    use_secondary_inner_filter_correction=True,
    assay_volume=assay_volume, DG_prior='uniform')

I mentioned these inconsistencies in my last lab meeting, and this is a big reason why I've started working on #56, but I thought I'd post it here as well, since it's coming up again.

In the image below the two adjacent plots are from the same dataset, but the analysis is done at different times (notice time stamp). The plotting is slightly different (delG in title vs. in legend) as this is what I was playing with when repeating the analysis. Seems to be fine for Bosutinib Isomer, but not for Erlotinib (sorry the image is a bit fuzzy...).

[Figure: compare]

Currently parameters for MCMC sampling are:

niter = 500000 # number of iterations
nburn = 50000 # number of burn-in iterations to discard
nthin = 500 # thinning interval

Water evaporation rates

It's actually surprisingly difficult to find a simple approach to compute water evaporation rates.

As a result, I'd suggest performing the experiment: Fill a 100 mL trough, a 15 mL trough, and a 96-well plate with a defined quantity of water and measure their mass every hour for 8 hours; then at 24 hours and 48 hours.

Clean up quickmodel

  • Move 95% interval plotting to plots.py
  • Add ability to increase samples through inputs.py
  • Add ability to use filesets for single wavelength assays.
  • Include subset of data used for analysis in output file
  • Include recently added things in output file (equilibration, 95% interval)

Longterm:

  • Add autoprotocol analysis via quickmodel

Update platereader.py, incorporating good parts of xml2png.py, then use in xml2png

Update platereader.py, incorporating the good parts of xml2png.py and the good parts of the current platereader.py:

  • Make a constructor with toDataframe(), getWellData(), getColumn(), and getRow().
  • In the process, find out whether a dataframe or a dictionary underneath is easiest for now, and see if there's a convenient way to attach the read parameters to the actual data.
  • Then use this in new xml2png and ipynbs.

Raw Data Organization

Probably good to change how/where we are putting the raw data. I guess I have started doing this with the assaytools/data folder, where the data is separated out by data type (singlet, spectra, etc.). Though right now there are also snippets of analysis in there, which I should move.

Also, do we want to keep the data we are analyzing for the paper and the data we have for other projects (Mehtap, Andrea, etc.) in a different place? Possibly a different repository? Or is just a separate folder within assaytools/data sufficient?

Could we just set a slightly more restrictive prior on F_PL?

Tested the effect of setting F_PL using values taken from the mean after equilibration of a previous assaytools run, with dF_PL=0.50 approximated from the standard deviation. Tested this for Src:Erlotinib and Src:Gefitinib.

F_PL_def = {'Src-Erlotinib-EF': 59601584903.0,
            'Src-Gefitinib-GH': 81835891154.8}

Similar to issue #106, we see the expected removal of long tails and better reproducibility compared to not setting F_PL. Note that here we are still using dPstated = 0.35 * inputs['Pstated'].

Without setting F_PL:

[Figure: src-erlotinib-ef_binding_iter0]

Setting F_PL and dF_PL:

[Figure: src-erlotinib-ef_binding_iter0]

Without setting F_PL:

[Figure: comparing_src_geferl_3iter]

Setting F_PL and dF_PL:

[Figure: comparing_src_geferl_3iter]

Also we no longer see any sort of correlation between F_PL and DeltaG:

Without setting F_PL:

[Figure: delg_v_fpl_erl0]

Setting F_PL and dF_PL:
[Figure: delg_v_fpl_erl0]

While this is nice, I'm worried this is maybe too restrictive. Maybe we can just have a slightly more informed prior instead of this (link):

F_PL_guess = (Fmax - Fmin) / min(Pstated.max(), Lstated.max())
model['F_PL'] = pymc.Uniform('F_PL', lower=0.0, upper=2*Fmax/min(Pstated.max(),Lstated.max()), value=F_PL_guess)
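One way to encode "slightly more informed" (a sketch, not tested): keep F_PL free but give it a lognormal prior centered on the estimate from a previous run, with a generous relative spread. The tau expression follows the same relative-error parameterization used elsewhere in pymcmodels.py; the spread value is an assumption.

import numpy as np
import pymc

F_PL_prev = 59601584903.0  # e.g. the Src-Erlotinib-EF mean from above
rel_spread = 2.0           # allow a factor-of-several spread (assumption)
tau = np.log(1.0 + rel_spread**2)**(-1)
model['F_PL'] = pymc.Lognormal('F_PL', mu=np.log(F_PL_prev), tau=tau,
                               value=F_PL_prev)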

Unknown error/warning with Quickmodel

Whenever I run quickmodel.py, I see the following warning, but only for the analysis of the first row of the plate. I don't understand what is causing it, and it is a curiosity that it doesn't happen in the analysis of the following rows:

/Users/isikm/opt/anaconda/lib/python2.7/site-packages/assaytools-0.2.0-py2.7.egg/assaytools/pymcmodels.py:77: RuntimeWarning: divide by zero encountered in power
  tau = np.sqrt(np.log(1.0 + (stddev/mean)**2))**(-2)
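A guess at a guard (a sketch, not the actual fix): the warning fires when stddev == 0, since then log(1 + 0) = 0 and 0**(-2) divides by zero. Clamping the relative error to a small floor keeps tau finite; note that sqrt(x)**(-2) is algebraically just x**(-1).

import numpy as np

def lognormal_tau(mean, stddev, min_rel_error=1e-6):
    # Relative error, clamped away from zero so the log term never vanishes.
    rel = np.maximum(np.asarray(stddev, dtype=float) / mean, min_rel_error)
    return np.log(1.0 + rel**2)**(-1)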

Conda packaging

Looks like some of the dependencies (seaborn, pymc) have some numpy version restrictions, so we may need to build our own versions of some of these packages for the omnia channel to match other omnia python versions.

Allow handling of [L] = 0 data.

To do this we will add a special case for wells where [L] = 0, covering both [P] = Protein_concentration and [P] = 0. Currently we have to define a fictitious minimal ligand concentration in these wells, even though the true concentration is zero.
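A sketch of how the special case might be flagged (names illustrative; the actual model changes are still to be worked out):

import numpy as np

Lstated = np.array([20.0e-6, 9.82e-6, 0.0])  # last well is a true blank
is_blank = (Lstated == 0.0)
# For nonzero wells, keep the usual lognormal concentration uncertainty;
# for blanks, fix [L] = 0 and skip the uncertainty prior entirely.
Lstated_nonzero = Lstated[~is_blank]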

Write Tests

We need to figure out what tests we want. This is currently pretty incomplete.

Making quickmodel more flexible to accept different plate layouts

At the moment quickmodel is designed to work with this plate layout:

  • odd rows: protein + ligand1
  • even rows: buffer + ligand1

I would like to analyse an experiment with a different layout:

  • Row A: protein1 + ligand1
  • Row B: protein2 + ligand1
  • Row C: protein3 + ligand1
  • Row D: protein4 + ligand1
  • Row E: protein5 + ligand1
  • Row F: protein6 + ligand1
  • Row G: buffer + ligand1
  • Row H: buffer + ligand1

I will try to modify the quickmodel script to handle this case (see the sketch at the end of this issue). Can I add it as an alternative quickmodel script to assaytools for now? @sonyahanson

I think this is the part I need to modify:
https://github.com/choderalab/assaytools/blob/master/scripts/quickmodel.py#L74-L75
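A sketch of what a configurable layout might look like (names illustrative; this is a suggestion, not existing quickmodel behavior):

# Map each row to its (sample, ligand) contents instead of hard-coding
# the odd rows = protein / even rows = buffer pattern.
layout = {
    'A': ('protein1', 'ligand1'), 'B': ('protein2', 'ligand1'),
    'C': ('protein3', 'ligand1'), 'D': ('protein4', 'ligand1'),
    'E': ('protein5', 'ligand1'), 'F': ('protein6', 'ligand1'),
    'G': ('buffer', 'ligand1'),   'H': ('buffer', 'ligand1'),
}
protein_rows = sorted(row for row, (sample, _) in layout.items() if sample != 'buffer')
buffer_rows = sorted(row for row, (sample, _) in layout.items() if sample == 'buffer')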

Optimize GeneralBindingModel

From some profiling tests, it looks like root solving in the GeneralBindingModel is the dominant cost of the new autoprotocol-based assaytools changes, with somewhere around 76% of total sampling time spent in calls to scipy.optimize.root here.

I imagine we could greatly speed this up by rewriting the target function using some optimization scheme.
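One such scheme (a suggestion, not from any existing assaytools code) is to hand scipy.optimize.root an analytic Jacobian so it stops estimating derivatives numerically; the target function below is a stand-in for the GeneralBindingModel equations:

import numpy as np
from scipy.optimize import root

def target(x):
    # Stand-in system of equations, not the actual binding model.
    return np.array([x[0]**2 + x[1] - 1.0, x[0] - x[1]])

def jacobian(x):
    # Analytic Jacobian of the stand-in system.
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, -1.0]])

sol = root(target, x0=np.array([0.5, 0.5]), jac=jacobian, method='hybr')
print(sol.x)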

Suggestions for API improvements to facilitate competition experiment analysis

From our discussion today, here is what we decided we would need for input:

  • Compound stock dictionary. A master data structure listing all compound stocks:
compound_stocks = [
   { 'id' : 'BOS001',
     'compound_name' : 'bosutinib',
     'compound_mw' : 530.446, # g/mol
     'compound_mass' : 12.45, # mg compound dispensed
     'compound_purity' : 0.96, # purity
     'solvent_mass' : 12.52, # g solvent dispensed
   }
...
]

The compound stock dictionary could be used for all experiments, and would eventually be retrieved from a lightweight database running on Amazon EC2 or someplace.
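For example, the stock concentration implied by one entry could be computed like this (a sketch; the DMSO density is an assumption, and complete dissolution is assumed):

dmso_density = 1.10  # g/mL, approximate density of DMSO (assumption)
entry = compound_stocks[0]
moles = entry['compound_mass'] * 1e-3 * entry['compound_purity'] / entry['compound_mw']
volume_L = entry['solvent_mass'] / dmso_density * 1e-3
stock_concentration_M = moles / volume_L  # ~2.0e-3 M for the BOS001 entry above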

  • D300 XML file specifying compound and DMSO concentrations in each well
  • For each assay plate:
    • A list of compound stock ids used for the rows (e.g. ['BOS001', 'GEF002', 'ERL001', 'BSI004'])
    • The protein name pipetted into rows and its stated concentration
    • The Infinite XML file after the plate has been read
  • An error assumption dictionary which would contain
    • D300 dispense CVs
    • protein concentration uncertainties
    • any other assumptions that go into error modeling
