choderalab / assaytools

Modeling and Bayesian analysis of fluorescence and absorbance assays.
Home Page: http://assaytools.readthedocs.org
License: GNU Lesser General Public License v2.1
I'm currently writing an interface that quickly analyzes, stores, and makes a figure of the results of a Bayesian analysis, and noticed that for our simple pymc model function we aren't getting consistent results when rerunning the analysis:
from assaytools import pymcmodels
pymc_model = pymcmodels.make_model(Pstated, dPstated, Lstated, dLstated,
top_complex_fluorescence=reorder_protein,
top_ligand_fluorescence=reorder_buffer,
use_primary_inner_filter_correction=True,
use_secondary_inner_filter_correction=True,
assay_volume=assay_volume, DG_prior='uniform')
I mentioned these inconsistencies in my last lab meeting, and they are a big reason why I've started working on #56, but I thought I'd post this here as well, since it's coming up again.
In the image below, the two adjacent plots are from the same dataset, but the analyses were run at different times (notice the time stamps). The plotting is slightly different (delG in the title vs. in the legend), as this is what I was playing with when repeating the analysis. Results seem to be fine for Bosutinib Isomer, but not for Erlotinib (sorry the image is a bit fuzzy...).
The current parameters for MCMC sampling are:
niter = 500000 # number of iterations
nburn = 50000 # number of burn-in iterations to discard
nthin = 500 # thinning interval
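For reference, these settings retain (niter - nburn)/nthin posterior samples after burn-in and thinning, which is simple arithmetic worth keeping in mind when judging whether "increased sampling" will help:

```python
niter = 500000  # number of iterations
nburn = 50000   # number of burn-in iterations to discard
nthin = 500     # thinning interval

# Number of samples actually retained for posterior estimates
nretained = (niter - nburn) // nthin
print(nretained)  # 900
```

900 retained samples may be on the low side for estimating tail behavior of the posterior, which is consistent with hoping that more sampling will clear up the inconsistencies.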
See below images from the second half of this notebook: https://github.com/choderalab/assaytools/blob/master/data/spectra/Spectra-Analysis.ipynb
Hoping this will clear up with increased sampling...
Tested the effect of changing dPstated = 0.35 * inputs['Pstated'] to dPstated = 0.15 * inputs['Pstated']. Interestingly, this removed the long tails for the Src-Erlotinib fits, changing them from this (dPstated = 0.35 * inputs['Pstated']):
to this (dPstated = 0.15 * inputs['Pstated']):
Similarly, for the same number of iterations, running the same job three times gives more consistent answers with dPstated at 0.15, compared with this (dPstated = 0.35 * inputs['Pstated']):
As discussed in our meeting, we'd like to generalize the model to permit each well to have arbitrary concentrations of multiple components, and to attach a competitive binding model for each ligand species. Later, we could even provide a set of models to enable automated model selection.
For now, I'd like to propose an API for a general experiment, as well as a few convenience subclasses that make it easy to set up experiments of the kind we are currently performing.
We first need some way to describe, for each well:
I wonder if we might want to piggyback on autoprotocol-python to specify these in a form that will be compatible with autoprotocol.
We can use the Well object to describe all of this information by standardizing the names of specific additional well properties (which can be added via well.add_properties() or set through well.set_properties()). We can use a WellGroup to specify a collection of wells to pass to the Bayesian model for analysis.
An example of specifying a well using this approach would be:
from autoprotocol.container import Well, Container, ContainerType
# Define the container type
container_type = ContainerType(name='4ti-0110', is_tube=False, well_count=96, col_count=12,...)
# Define the container
container = Container(id=id, container_type=container_type)
# Define wells (the verbose way; we'd provide convenience methods to format the plate)
from autoprotocol.unit import Unit
well = container.well("A1") # retrieve a specific well from the plate
well.set_volume(Unit(100, "microliter")) # set the well volume
# Set concentrations of well components
concentrations = {
'Src' : Unit(0.5, "micromolar"),
'bosutinib' : Unit(20, "micromolar"),
'imatinib' : Unit(10, "micromolar")
}
well.set_properties({'concentrations' : concentrations})
# We would also need the optical path length through liquid, or well area.
well.set_properties({'area' : Unit(35, "millimeters**2")})
Since Unit does not provide automatic unit conversion capabilities, we would have to work out a way to automatically convert between simtk.unit and autoprotocol.unit.Unit.
autoprotocol doesn't specify any particular format for the data output from plate reads, so we're stuck with coming up with our own. We might want to create a class that ingests an Infinite XML file and provides a pythonic way to access data, e.g.
>>> from assaytools.platereaders import InfiniteDataset
>>> data = InfiniteDataset(filename='data/2015-01-20 18-14-25_plate_1.xml')
>>> print data.welldata('A6')
{'TopFluorescenceRead' : 12451, 'BottomFluorescenceRead' : 74252, 'Absorbance' : 0.5632 }
where InfiniteDataset would be a subclass of a more generic PlateReaderDataset.
Because we haven't been super consistent about naming schemes (though we should be!), we may want to provide a mapping dict, either here or in the analysis stage, that maps the names in the XML file to the names we want in the dict returned by data.welldata(id).
We could specify the components of a well as a dict of concentrations.
# Create a model
from assaytools.analysis import CompetitiveBindingAnalysis
receptor = 'Src' # receptor name (only one allowed)
ligands = ['bosutinib', 'bosutinib-isomer', 'imatinib'] # competitive ligand names
model = CompetitiveBindingAnalysis(wells=wellgroup, data=welldata, receptor=receptor, ligands=ligands)
# fit the maximum a posteriori (MAP) estimate
map = model.map_fit()
# run some MCMC sampling and return the MCMC object
mcmc = model.run_mcmc()
This could use some help!
Long-term:
Change xml2png to use seaborn so colors are more informative, and change row labels to A-H instead of 0-7.
We need to be sure to mention pint as a dependency in conda/docs. This was a problem @Lucelenie had when first installing assaytools.
We should add this console script too:
https://github.com/choderalab/kinome-data/blob/master/fluorescence-assay/Plate_Type_Analysis/xml2png4scans.py
Looks like some of the dependencies (seaborn, pymc) have some numpy version restrictions, so we may need to build our own versions of some of these packages for the omnia channel to match other omnia python versions.
/cbio/jclab/home/hansons/opt/anaconda/lib/python2.7/site-packages/assaytools-0.2.0-py2.7.egg/assaytools/pymcmodels.py:239: RuntimeWarning: invalid value encountered in divide
scaling = (1 - np.exp(-alpha)) / alpha
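The warning comes from evaluating (1 - exp(-alpha))/alpha at alpha = 0, where the expression is 0/0 even though the limit is 1. A sketch of a guarded version (just an illustration of the numerical issue, not the assaytools implementation):

```python
import numpy as np

alpha = np.array([0.0, 1e-8, 0.1, 1.0])

# Naive form: 0/0 at alpha == 0 triggers "invalid value encountered in divide"
with np.errstate(invalid='ignore'):
    naive = (1 - np.exp(-alpha)) / alpha

# Guarded form: use the small-alpha expansion (1 - exp(-a))/a ~ 1 - a/2 near 0
safe_alpha = np.maximum(alpha, 1e-6)  # avoid evaluating 0/0 in the taken branch
scaling = np.where(alpha > 1e-6,
                   (1 - np.exp(-safe_alpha)) / safe_alpha,
                   1.0 - alpha / 2.0)

print(np.isnan(naive[0]), scaling[0])  # True 1.0
```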
Update platereader.py, incorporating the good parts of xml2png.py and the good parts of the current platereader.py (e.g. toDataframe(), getWellData(), getColumn(), and getRow()), then update xml2png and the ipynbs accordingly.

At the moment quickmodel is designed to work with this plate layout:
I would like to analyse an experiment with a different layout:
Row A : protein1 + ligand1
Row B : protein2 + ligand1
Row C : protein3 + ligand1
Row D : protein4 + ligand1
Row E : protein5 + ligand1
Row F : protein6 + ligand1
Row G : buffer + ligand1
Row H : buffer + ligand1
I will try to modify quickmodel script to handle this case. Can I add it as an alternative quickmodel script to assaytools for now? @sonyahanson
I think this is the part I need to modify:
https://github.com/choderalab/assaytools/blob/master/scripts/quickmodel.py#L74-L75
We can use the current framework to fit multiple replicates of the same experiment that use the same stock solutions. We simply have to concatenate the concentration and fluorescence vectors from the multiple experiments. That is, if each experiment used 12 wells and we perform it in triplicate, the concentration and fluorescence vectors will have 36 elements.
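A sketch of what that concatenation would look like with numpy (the vector names and placeholder fluorescence values here are hypothetical):

```python
import numpy as np

# Three hypothetical replicates, each a 12-well dilution series
Lstated_rep = np.logspace(-5, -8, 12)   # ligand concentrations, M
fluor_rep1 = np.random.rand(12)         # fluorescence reads (placeholders)
fluor_rep2 = np.random.rand(12)
fluor_rep3 = np.random.rand(12)

# Concatenate concentrations and fluorescence for a single joint fit:
# the model sees one 36-element experiment
Lstated = np.concatenate([Lstated_rep] * 3)
fluorescence = np.concatenate([fluor_rep1, fluor_rep2, fluor_rep3])

print(Lstated.shape, fluorescence.shape)  # (36,) (36,)
```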
Whenever I run quickmodel.py I see the following warning, but only for the analysis of the 1st row of the plate. I don't understand what is causing it, but it is a curiosity that it doesn't happen in the analysis of the following rows:
/Users/isikm/opt/anaconda/lib/python2.7/site-packages/assaytools-0.2.0-py2.7.egg/assaytools/pymcmodels.py:77: RuntimeWarning: divide by zero encountered in power
tau = np.sqrt(np.log(1.0 + (stddev/mean)**2))**(-2)
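One way to reproduce exactly this warning (a guess, since I don't have the inputs in front of me): if stddev is zero for some quantity in that row, then log(1 + 0) = 0 and 0**(-2) is the divide-by-zero in the power, giving an infinite precision:

```python
import numpy as np

mean = 1.0e-6
stddev = 0.0  # zero stated uncertainty triggers the warning

with np.errstate(divide='ignore'):  # suppress the RuntimeWarning for the demo
    tau = np.sqrt(np.log(1.0 + (stddev / mean)**2))**(-2)

print(tau)  # inf
```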
Adjust makemodel so it can also be used for competition assay modeling. This might be easier if we use a dictionary as input instead of a bunch of separate terms.
From our discussion today, here is what we decided we would need for input:
compound_stocks = [
{ 'id' : 'BOS001',
'compound_name' : 'bosutinib',
'compound_mw' : 530.446, # g/mol
'compound_mass' : 12.45, # mg compound dispensed
'compound_purity' : 0.96, # purity
'solvent_mass' : 12.52, # g solvent dispensed
}
...
]
The compound stock dictionary could be used for all experiments, and would eventually be retrieved from a lightweight database running on Amazon EC2 or someplace.
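As an illustration of how such an entry could be turned into a stock concentration (a hypothetical helper, not part of assaytools; it assumes the solvent is DMSO at ~1.10 g/mL, which the dict itself does not record):

```python
def stock_concentration(stock, solvent_density=1.10):
    """Compute stock concentration (M) from a compound stock entry.

    solvent_density is in g/mL; the default 1.10 g/mL assumes DMSO
    (an assumption, not recorded in the stock dict).
    """
    moles = stock['compound_mass'] * 1e-3 * stock['compound_purity'] / stock['compound_mw']  # mol
    volume = stock['solvent_mass'] / solvent_density * 1e-3  # L
    return moles / volume

bos001 = {
    'id': 'BOS001',
    'compound_name': 'bosutinib',
    'compound_mw': 530.446,   # g/mol
    'compound_mass': 12.45,   # mg compound dispensed
    'compound_purity': 0.96,  # purity
    'solvent_mass': 12.52,    # g solvent dispensed
}

print('%.2f mM' % (stock_concentration(bos001) * 1e3))  # 1.98 mM
```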
['BOS001', 'GEF002', 'ERL001', 'BSI004']

It's actually surprisingly difficult to find a simple approach to compute water evaporation rates.
As a result, I'd suggest performing the experiment: Fill a 100 mL trough, a 15 mL trough, and a 96-well plate with a defined quantity of water and measure their mass every hour for 8 hours; then at 24 hours and 48 hours.
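Once those masses are collected, the per-well evaporation rate could be estimated with a simple linear fit; a sketch with entirely made-up numbers (purely illustrative, no real measurements):

```python
import numpy as np

# Hypothetical hourly mass measurements of a filled 96-well plate (g);
# these numbers are invented for illustration only
t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=float)  # hours
mass = np.array([40.00, 39.95, 39.91, 39.86, 39.82, 39.77, 39.73, 39.68, 39.64])

# Linear fit: slope is the total evaporation rate in g/hour
slope, intercept = np.polyfit(t, mass, 1)

# Per-well rate, assuming water (1 g ~ 1 mL) split evenly over 96 wells
rate_per_well_uL = -slope / 96 * 1000.0  # uL/hour/well

print('%.2f uL/hour/well' % rate_per_well_uL)
```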
Tested the effect of setting F_PL using values taken from the mean after equilibration from a previous run of assaytools, and dF_PL=0.50, which was approximated from the standard deviation. Tested this for Src:Erlotinib and Src:Gefitinib.
F_PL_def = {'Src-Erlotinib-EF': 59601584903.0,
'Src-Gefitinib-GH': 81835891154.8}
Similar to issue #106, we see the expected removal of long tails and better reproducibility compared to not setting F_PL. Note that here we are still using dPstated = 0.35 * inputs['Pstated'].
Without setting F_PL:
Setting F_PL and dF_PL:
Without setting F_PL:
Setting F_PL and dF_PL:
Also, we no longer see any correlation between F_PL and DeltaG:
Without setting F_PL:
While this is nice, I'm worried this is maybe too restrictive. Maybe we can just use a slightly more informed prior instead (link):
F_PL_guess = (Fmax - Fmin) / min(Pstated.max(), Lstated.max())
model['F_PL'] = pymc.Uniform('F_PL', lower=0.0, upper=2*Fmax/min(Pstated.max(),Lstated.max()), value=F_PL_guess)
While installing assaytools in a Python 3.5 environment, I noticed some dependencies are missing. I had to install lxml, seaborn, and pymc separately after running python setup.py install.
I noticed that those missing dependencies are actually listed in setup.py, but commented out: https://github.com/choderalab/assaytools/blob/master/setup.py#L87
I want to activate those lines so that these dependencies are also installed at setup.
This is just a reminder for me to convert all strictly positive quantities (concentrations, intensities, quantum yields, etc.) to log-space for the pymc2 models as I've done for the pymc3 models.
Fix xml2png so it works well for scans and > 3 single reads without manual intervention.
Thanks to Patrick and Bas for chatting about this over coffee after my lab meeting. Sounds like Lee had a very promising answer: Gaussian Processes!
Here are some potentially useful links I found by googling 'gaussian process outliers python':
https://bugra.github.io/work/notes/2014-05-11/robust-regression-and-outlier-detection-via-gaussian-processes/
https://ocefpaf.github.io/python4oceanographers/blog/2015/03/16/outlier_detection/
Add the testing script from openpathsampling (opentis) and integrate into travis testing.
cc: #83 (comment)
When plotting some of the new data (with higher gain and larger bandwidth), but not all of it, there are errors when plotting is attempted. We think it is related to the fact that we are using e^sigma in plots.py:
https://github.com/choderalab/assaytools/blob/master/AssayTools/plots.py#L356
There was some confusion during the constant-pH grant process about what units we are using here for deltaG. Until now we've been consistently using John's k_B*T units (which do seem to actually be R*T, see below). I'm not clear, however, on how we want to proceed with changing/confirming this within assaytools as a whole. Should we change everything to kcal/mol? Or kJ/mol (SI units)? Or should we provide options for the user to output the unit they find most informative?
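For reference, converting a DeltaG expressed in thermal units (k_B*T per molecule, equivalently R*T per mole) to kcal/mol or kJ/mol is a one-liner; a sketch at T = 298.15 K (the example DeltaG value is arbitrary):

```python
# Convert DeltaG from thermal units (kT, i.e. RT per mole) to kcal/mol and kJ/mol
R_KCAL = 1.9872e-3  # gas constant, kcal/(mol*K)
R_KJ = 8.3145e-3    # gas constant, kJ/(mol*K)
T = 298.15          # temperature, K

def kT_to_kcal(dG_kT, T=T):
    return dG_kT * R_KCAL * T

def kT_to_kJ(dG_kT, T=T):
    return dG_kT * R_KJ * T

dG = -12.5  # example binding free energy in kT units
print('%.2f kcal/mol  %.2f kJ/mol' % (kT_to_kcal(dG), kT_to_kJ(dG)))
```

A user-facing option could simply apply one of these factors at output time, so the internal representation stays in kT.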
Is the chembl prior centered around 0? If so, this seems wrong and we should adjust accordingly if we ever plan to use this: https://github.com/choderalab/assaytools/blob/master/AssayTools/pymcmodels.py#L372
Currently we generally use the uniform prior, so this hasn't been an issue thus far...
Started working on a way of storing the results, most useful for tracking different ways of analyzing the same dataset. So far this is what I've come up with: a dictionary of inputs that gets combined with a dictionary of outputs that gets stored as a csv file.
inputs = {
'data_file' : 'p38_singlet1_20160420_153238.xml',
'Lstated' : np.array([20.0e-6,14.0e-6,9.82e-6,6.88e-6,4.82e-6,3.38e-6,2.37e-6,1.66e-6,1.16e-6,0.815e-6,0.571e-6,0.4e-6,0.28e-6,0.196e-6,0.138e-6,0.0964e-6,0.0676e-6,0.0474e-6,0.0320e-6,0.0240e-6,0.0160e-6,0.0120e-6,0.008e-6,0.00001e-6], np.float64), # ligand concentration, M
'ligand_order' : ['BOS_w_backfill','BSI_w_backfill','ERL_w_backfill','GEF_w_backfill','BOS_no_backfill','BSI_no_backfill','ERL_no_backfill','GEF_no_backfill'],
'protein' : 'p38',
'Pstated' : 0.5e-6 * np.ones([24],np.float64), # protein concentration, M
'assay_volume' : 50e-6, # assay volume, L
'well_area' : 0.1369, # well area, cm^2 for 4ti-0203 [http://4ti.co.uk/files/3113/4217/2464/4ti-0201.pdf]
}
outputs = {
'analysis' : 'pymcmodels',
'outfiles' : 'DeltaG_%s.npy, DeltaG_trace_%s.npy' %(rows_to_analyze, rows_to_analyze),
'rows_to_analyze' : rows_to_analyze,
'section' : section,
'DeltaG' : "DeltaG = %.1f +- %.1f kT" % (DeltaG, dDeltaG),
'Kd' : Kd_summary,
'datetime' : datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
}
This is a rough start so more input on what to include and what format would be ideal is much appreciated.
Currently this is created and run by a quick_model function that takes in the inputs dictionary plus rows_to_analyze = 'BSI_w_backfill' and section = '280_480_TOP_120', makes a few figures, and spits out the results. So far this just spits out a csv file for each analysis rather than appending to a database.
Would be great for @sonyahanson to check in her latest fluorescence assay examples when possible.
To do this we will add a special case for wells where [L] = 0, both for [P] = the stated protein concentration and for [P] = 0. Currently we have to define a minimal ligand concentration in these wells, even though this is not actually the case.
Add a quickmodel.py example and notebook to examples.
We need to figure out what tests we want. This is currently pretty incomplete.
A decent amount of updates have been added to the code. Would it be helpful to create a new prerelease? Since we actively use some of this code in projects, it might be easier to rely on releases to indicate what version of the code was used for an analysis.
Have we standardized the Section names for the XML files generated by the Infinite? If so, are these documented somewhere?
At least in the 'BGI...'.png files I'm looking at, it just says 'None, None, None'.
Change these to nan's so they are not taken into account in analysis.
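A minimal sketch of that conversion (assuming the missing values arrive as Python None or the string 'None' when parsed, which is a guess about the parse output):

```python
import numpy as np

# Values as they might come out of the XML parse (hypothetical)
raw = [0.5632, None, 'None', 0.6120]

values = np.array([float(x) if x not in (None, 'None') else np.nan
                   for x in raw])

# nan entries are now ignored by nan-aware reductions
print(np.nanmean(values))  # 0.5876
```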
There were some changes at anaconda.org that require updating the automated testing framework.
@sonyahanson has some nice code that generates a full page of traces that are useful for understanding what is going on with sampling issues. It would be great to add a command-line argument to quickmodel to generate these immediately after sampling is completed.
I was testing xml2png.py with Infinite results of different assay types. I will list here some suggestions:
- None is printed for the 'Part of Plate' section.
- I tried the singlet_384 type, but I noticed it has a hard-coded ligand concentration (Lstated) for plotting. Can we remove that?
- The difference between spectra and scan wasn't obvious to me.

This warning appears when running quickbayes.sh:
/Users/choderaj/miniconda3/lib/python3.5/site-packages/matplotlib/pyplot.py:524: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
max_open_warning, RuntimeWarning)
This is just a reminder for me to create a convenience function to generate pymc.Lognormal random variates, similar to the pymc3 convenience function.
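The underlying moment-matching math is the same in either pymc version: given the desired arithmetic mean and standard deviation of a positive quantity, solve for the (mu, tau) of the log-quantity. A sketch with just the math module (not the actual pymc wrapper, which would pass these into pymc.Lognormal):

```python
import math

def lognormal_params(mean, stddev):
    """Return (mu, tau) for a lognormal whose arithmetic mean and
    standard deviation match the given values (moment matching)."""
    var_ratio = (stddev / mean)**2
    tau = 1.0 / math.log(1.0 + var_ratio)  # precision of the log-quantity
    mu = math.log(mean) - 0.5 * math.log(1.0 + var_ratio)
    return mu, tau

# e.g. a 0.5 uM concentration known to ~10%
mu, tau = lognormal_params(0.5e-6, 0.05e-6)
```

Note this tau is the same expression as the one in pymcmodels.py line 77 above, so the convenience function would centralize it in one place.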
As suggested by @MehtapIsik: the names of the xml data files are confusing. Can we change them all to a "[protein name]_[fluorescent probe]_[non-fluorescent ligand (if exists)]_[experiment date].xml" format?
From some profiling tests, it looks like root solving in the GeneralBindingModel is the dominant cost of the new autoprotocol-based assaytools changes, with somewhere around 76% of total sampling time spent in calls to scipy.optimize.root here.
I imagine we could greatly speed this up by optimizing the target function (e.g. supplying an analytic Jacobian or moving it to compiled code).
(See also #52 on Gaussian Processes for outlier detection.)
I'm evaluating different strategies for automated Bayesian outlier detection, but it would be very helpful to have an idea of the physical mechanism causing outliers since this will affect how we deal with it in the code.
Some possibilities would affect all measurements:
Some effects might affect only one type of measurement:
Any other ideas?
@sonyahanson has prepared a nice dataset that illustrates this here, but it may need to be updated to the new scheme for representing input for quickmodel.
Probably good to change how/where we are putting the raw data. I guess I have started doing this with the assaytools/data folder, where the data is separated out by data type (singlet, spectra, etc.). Though right now there are also snippets of analysis in there, which I should move.
Also, do we want to keep the data we are analyzing for the paper and the data we have for other projects (Mehtap, Andrea, etc.) in a different place? Possibly a different repository? Or is a separate folder within assaytools/data sufficient?