General question--how should topology proposals be organized, and how should they be m

I like grouping this by type of proposal: <div class="snippet-clipboard-content no

Topology proposals and system construction,about choderalab/perses

Comments (42)

jchodera commented on June 19, 2024

I like grouping this by type of proposal:

* Transformation
   * SmallMoleculeTransformation
      * SmallMoleculeLibrary
      * CombinatorialLibrary
      * ReactionAwareSynthesis
   * PolymerTransformation
      * PointMutation
      * PeptideLibrary

These classes should focus on manipulating the Topology object and creating the atom mapping between old and new Topology objects

We can separate out the system building machinery into a helper SystemBuilder tool that uses the Topology to create a modified System. We're nearly done making changes to the OpenMM ffxml handler and AMBER forcefields that will allow us to resurrect the old YANK SystemBuilder for this purpose.

from perses.

pgrinaway commented on June 19, 2024

Yeah, the more I think about this, the more I like grouping by type of proposal. Scientifically, it seems to make more sense.

from perses.

jchodera commented on June 19, 2024

The design should maximize code reuse (or minimize code duplication). There are certainly opportunities for experimentation within these classes to improve efficiency. This could either be done with adding more class options or through subclasses, though class options may be easier. We should think about how we might want to specify these options at runtime, perhaps through a YAML file as with YANK.

We will also need sampler drivers:

MCMCSampler - samples with fixed log weights g_k
SAMSSampler - adjusts weights g_k to produce marginal distribution that matches imposed pi_k

Once we write these components, it should be easy to confine most of our scientific tinkering to various design/optimization classes that run multiple SAMSSamplers (via MPI?) and do their thing by using the online g_k estimated to update the target pi_k that is fed back to the samplers.

from perses.

jchodera commented on June 19, 2024

Would it help to flesh out the APIs for these things further, or is it pretty straightforward?

from perses.

pgrinaway commented on June 19, 2024

It would probably help to do that. If anyone has a suggestion, feel free to either post it here or open a PR. After I finish the current PR in the next few days, I am happy to modify my current topology proposal stuff to conform to the consensus.

from perses.

jchodera commented on June 19, 2024

Let's brainstorm a bit on how we want to use this for a few use cases:

Computing all binding free energies for molecules in a library
Identifying the best binders from a combinatorial library
Identifying resistance mutants to a given compound
Identifying compounds that optimally bind a target but not an antitarget using reaction-aware synthesis
Identifying peptides from a combinatorially-generated library that bind well to a given protein
Identifying peptides from a fixed library that bind well to an MHC:TCR complex

What about something like this?

Computing binding free energies for all molecules in a library.

# Initialize a forcefield engine, which will also be able to parameterize small molecules in addition to polymer residues.
ffxmls = ['amber99sbildn.xml', 'gaff.xml', 'tip3p.xml', 'ions.xml']
forcefieldplus = ForceFieldPlus(*ffxmls)
gaff_typer = GAFFSmallMoleculeParameterizer('gaff.xml')
forcefield.addExtension(gaff_typer) # parameter assignment for small molecules

# Load a protein-ligand system with a reference ligand bound.
from simtk.openmm.app import PDBFile
pdb = PDBFile('abl-imatinib.pdb') # load a file where all residues have already been modeled and all atoms have been added to the ligand

# Solvate the complex.
from simtk.openmm.app import Modeller, ForceField
modeller = Modeller(pdb.topology, pdb.positions)
forcefield = ForceField(*ffxmls)
modeller.addSolvent(forcefield, model='tip3p', padding=10*angstroms, neutralize=True, ionicStrength=50*millimolar)

# Create a system.
system = forcefieldplus.createSystem(pdb.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=9*angstroms, rigidWater=True, removeCMMotion=False)

# Load the small molecule library, a SmallMoleculeTransformation subclass.
# Maybe it is better to call this a ProposalEngine instead of a Transformation?
library = SmallMoleculeLibrary(filename='library.mol2')

# Add the reference ligand to the compound library and tell the Topology object which part of the system it should be sampling.
topology = TopologyPlus(pdb.topology) # an enhanced version of the OpenMM Topology object
reference_ligand = topology.select('resname MOL')
library.addMolecule(reference_ligand.asOEMol()) # add the reference compound to the library
library.setMoleculeToSample(reference_ligand) # tell it which part of the Topology object it should be sampling

# Create a sampler to automatically determine their free energies
sampler = SAMSSampler(system=system, topology=topology, positions=modeller.positions, proposal_engine=library)

# Run the sampler.
sampler.timestep = 2*femtoseconds
sampler.nsteps_per_iteration = 500
sampler.run(niterations=1000)

# Extract binding free energies.
report = library.generateReport(sampler)

This postulates we create two additional classes that behave like more functional versions of OpenMM classes:

ForceFieldPlus: Like ForceField, but allows the addition of extensions that can type other parts of the Topology object
TopologyPlus: Like Topology, but useful

We may be able to get these into OpenMM proper, but we don't need to worry about that for now.

from perses.

jchodera commented on June 19, 2024

Whoops---this only handles the case where there is a single simulation (the complex), which doesn't correspond to any of our design problems. Let me think about how to flexibly generalize this to the case where we have SAMS deal with the ligand-in-complex and ligand-in-solvent simulations at the same time.

from perses.

pgrinaway commented on June 19, 2024

Hm, I like the proposed extensions to ForceField. I think a useful version of Topology would be great, but at least for my present applications, I don't think it's completely necessary at the moment.

I think that this part:

topology = TopologyPlus(pdb.topology) # an enhanced version of the OpenMM Topology object
reference_ligand = topology.select('resname MOL')
library.addMolecule(reference_ligand.asOEMol()) # add the reference compound to the library
library.setMoleculeToSample(reference_ligand)

is very useful, but I think we can accomplish it in another way (for the near term, at least). For instance, do we need the atom selection DSL? Could we have instead: reference_ligand = topology.residues['MOL']?

Also, why not just have the library initially include the starting molecule, and just set the sampler's initial state that way? Currently, the idea is that inside the sampler we'd instantiate the Transformation (+1 on changing the name to ProposalEngine, btw, I'm just using the current name for clarity), and when we call transformation.propose(...) we pass to it the current system, topology, positions. We do need to correctly set the initial SMILES string, so there may be some use for something that perceives the initial molecule.

I also don't think that we necessarily want the library object to keep the state--instead, the Sampler should do that, and call library.propose(current_state...). Otherwise, it's possible that a bug could make things out of sync...

In general, my thinking thus far is:

I really like the proposed ForceField extensions and think that's the priority here.
We probably want something to perceive the ligand in the initial simulation to avoid the potential bug of someone mis-specifying the initial molecule.
I don't think the selection DSL is particularly necessary for us, since simple queries can be addressed in a dict style paradigm, and more complex queries can be made nicely in Python as long as TopologyPlus is a somewhat pythonic object.
I don't think that the Transformation or ProposalEngine should contain the logic for estimating free energies or other analyses. I actually think that might be better left to an Analysis class (which would also allow easier flexibility in terms of mixing and matching proposals and analysis methods).

More thoughts to come later, but that is my initial reaction.

from perses.

jchodera commented on June 19, 2024

I think a useful version of Topology would be great, but at least for my present applications, I don't think it's completely necessary at the moment.

Excellent point. If we can specify residue spans we can get away with that. We just have to make sure it works more or less straightforwardly for selecting the entity to be modified in all of the applications above.

Also, why not just have the library initially include the starting molecule, and just set the sampler's initial state that way?

The library on disk may not contain the molecule in the X-ray structure if we're using a molecule set form ZINC or eMolecules. We needed a simply way to add whatever molecule is already present and set it as the current ligand, so I suggested one way to do it. We can certainly go with other ways!

In thinking about how to extend this to multitarget design, we may need to add a separate ligand for each cocrystal structure (both of which should be in the whole pool?) and allow each MCMCSampler or SASASampler to have its own residue selection for which part of the topology object is being replaced by the SmallMoleculeLibrary proposed transformations. Each sampler should also separately track which state is currently active. I'm not sure if that means we would have a single SmallMoleculeLibrary object that could be associated with multiple samplers (this seems ideal for memory savings, but could be a problem if we distribute the samplers to multiple nodes) or a separate instance of SmallMoleculeLibrary associated with each sampler.

We do need to correctly set the initial SMILES string, so there may be some use for something that perceives the initial molecule.

Yeah, this should probably be part of the call like library.setMoleculeToSample()---it could match the molecule to one in the library and set the metadata to be the current SMILES string.

Do you want to use an integer index representation of which state is active in order for SASASampler to simply work in the compact space of integer indices [0,states), or would you rather use some more general key system so that it can adjust free energies based on, say, SMILES strings as keys to a dict of relative free energies?

I also don't think that we necessarily want the library object to keep the state--instead, the Sampler should do that, and call library.propose(current_state...). Otherwise, it's possible that a bug could make things out of sync...

I think your suggestion of having the sampler keep track of the state is the best approach.

I don't think that the Transformation or ProposalEngine should contain the logic for estimating free energies or other analyses. I actually think that might be better left to an Analysis class (which would also allow easier flexibility in terms of mixing and matching proposals and analysis methods).

Excellent point. If SASASampler keeps track of relative free energies with general keys (which could be SMILES, peptide sequences, etc) we can always have accessors to in the ProposalEngine subclasses that query more information about these keys and just have an Analysis object do that when printing reports.

from perses.

jchodera commented on June 19, 2024

It looks like we will need a whole set of Design and Analysis drivers---one to run the simulations and couple the SASASampler chains via updates to `$\hat{\pi}_k$, and one to do the analysis afterwards.

Maybe something like this?

# Construct a target function for multi-target design to bind to Abl but not Src.
# Keys specify SASASampler objects, and values are exponents in the computation of target weights.
# \hat{\pi}_k = Z_{Abl L_k} / Z_{Src L_k}
targets = { abl_sampler : 1.0, src_sampler : -1.0 }
# Set up the design to run in parallel via MPI.
designer = MultiTargetDesign(targets, mpicomm=MPI.COMM_WORLD)
# Run the designer engine
designer.run()

Would there be use for other variants of MultiTargetDesign, or is this general enough for what we want?

from perses.

jchodera commented on June 19, 2024

Also tagging @andrrizzi (who is working on streamlining system setup) and @juliebehr (who will need to build her mutational design platform with this framework).

from perses.

jchodera commented on June 19, 2024

Could we have instead: reference_ligand = topology.residues['MOL']?

Does this capability exist now, @pgrinaway, as a parmed or mdtraj thing? Or is this functionality we would implement?

from perses.

pgrinaway commented on June 19, 2024

I think we might need to implement it. AFAIK parmed and mdtraj offer a ResidueList or a generator, respectively (as well as a method that allows residue access by index in mdtraj).

from perses.

pgrinaway commented on June 19, 2024

mdtraj does have the atom selection syntax, though.

from perses.

jchodera commented on June 19, 2024

Here's a more complete version of the above example. Still not perfect, but improving.

Computing binding free energies for all molecules in a library

# Parameters
cutoff = 9*angstroms
padding = cutoff + 1*angstrom
ionic_strength = 50*millimolar

# Initialize a forcefield engine, which will also be able to parameterize small molecules in addition to polymer residues.
ffxmls = ['amber99sbildn.xml', 'gaff.xml', 'tip3p.xml', 'ions.xml']
forcefieldplus = ForceFieldPlus(*ffxmls)
gaff_typer = GAFFSmallMoleculeTyper('gaff-type-definitions.xml')
forcefield.addExtension(gaff_typer) # parameter assignment for small molecules

#
# Set up sampler for protein:ligand complex
#

# Load a protein-ligand system with a reference ligand bound.
# This is a convenience function for some modeling steps, returning parmed.Structure.
from tools import StructureHelper
complex = StructureHelper(rcsbid='2HYY', chains=['A'], addMissingResidues=True, addMissingHeavyAtoms=True, protonateLigand=True)

# Identify the reference ligand
reference_ligand = complex[':STI'] # imatinib (resname 'STI')

# Solvate the complex.
from simtk.openmm.app import Modeller, ForceField
complex_modeller = Modeller(complex.topology, complex.positions)
complex_modeller.addSolvent(forcefieldplus, model='tip3p', padding=padding, neutralize=True, ionicStrength=ionic_strength)

# Create a system for the ligand in complex
complex_system = forcefieldplus.createSystem(complex_modeller.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=cutoff, rigidWater=True, removeCMMotion=False)

# Load the small molecule library, a SmallMoleculeTransformation subclass.
# Maybe it is better to call this a ProposalEngine instead of a Transformation?
library = SmallMoleculeLibrary(filename='library.mol2')

# Add the reference ligand to the compound library
library.addMolecule(reference_ligand.topology) # add the reference compound to the library

# Create a SAMS sampler for the complex.
# Note that we tell the SAMSSampler which atoms to tell the ProposalEngine to use in constructing its first proposal
# SAMSSampler needs to keep track of these atoms because it will know whether a proposal is accepted or rejected, while the ProposalEngine doesn't keep track of this information (though we may want it to)
complex_sampler = SAMSSampler(system=complex_system, topology=complex_modeller.topology, positions=complex_modeller.positions, proposal_engine=library, proposal_atom_selection=reference_ligand.topology)

#
# Set up sampler for ligand in solvent
#

# Solvate the reference ligand.
solvent_modeller = Modeller(reference_ligand.topology, reference_ligand.positions)
solvent_modeller.addSolvent(forcefieldplus, model='tip3p', padding=padding, neutralize=True, ionicStrength=ionic_strength)

# Create a system for the ligand in complex.
solvent_system = forcefieldplus.createSystem(solvent_modeller.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=cutoff, rigidWater=True, removeCMMotion=False)

# Create a SAMS sampler for the ligand in solvent.
# Note that we tell the SAMSSampler which atoms to tell the ProposalEngine to use in constructing its first proposal
# SAMSSampler needs to keep track of these atoms because it will know whether a proposal is accepted or rejected, while the ProposalEngine doesn't keep track of this information (though we may want it to)
solvent_sampler = SAMSSampler(system=solvent_system, topology=solvent_modeller.topology, positions=solvent_modeller.positions, proposal_engine=library, proposal_atom_selection=reference_ligand.topology)

#
# Create the design sampler
#

# Construct a target function for maximizing ligand binding to Abl
# Keys specify SASASampler objects, and values are exponents in the computation of target weights.
# \hat{\pi}_k = Z_{Abl L_k} / Z_{L_k}
targets = { complex_sampler : 1.0, solvent_sampler : -1.0 }
# Set up the design to run in parallel via MPI.
designer = MultiTargetDesign(targets, mpicomm=MPI.COMM_WORLD)
# Run the designer engine
designer.run()

from perses.

jchodera commented on June 19, 2024

Updated the above a bit. This scheme relies on duck typing: ForceFieldPlus behaves like ForceField; parmed.Topology behaves like Topology, etc. We make use of parmed.Structure for selections.

from perses.

jchodera commented on June 19, 2024

Here's what target/antitarget ligand design would look like using this approach (tagging @steven-albanese):

# Parameters
cutoff = 9*angstroms
padding = cutoff + 1*angstrom
ionic_strength = 50*millimolar

# Initialize a forcefield engine, which will also be able to parameterize small molecules in addition to polymer residues.
ffxmls = ['amber99sbildn.xml', 'gaff.xml', 'tip3p.xml', 'ions.xml']
forcefieldplus = ForceFieldPlus(*ffxmls)
gaff_typer = GAFFSmallMoleculeTyper('gaff-type-definitions.xml')
forcefield.addExtension(gaff_typer) # parameter assignment for small molecules

#
# Load protein:ligand complex structures
#

from tools import StructureHelper
abl_complex = StructureHelper(rcsbid='2HYY', chains=['A'], addMissingResidues=True, addMissingHeavyAtoms=True, protonateLigand=True)
abl_ligand = abl_complex[':STI']
src_complex = StructureHelper(rcsbid='1Y57', chains=['A'], residues=range(270,524), addMissingResidues=True, addMissingHeavyAtoms=True, protonateLigand=True)
src_ligand = abl_complex[':MPZ']

#
# Set up the library
#

# Load the small molecule library, a SmallMoleculeTransformation subclass.
library = SmallMoleculeLibrary(filename='library.mol2')

# Add the reference ligands to the compound library
library.addMolecule(abl_ligand.topology)
library.addMolecule(src_ligand.topology)

#
# Set up sampler for abl:ligand
#

# Solvate the complex
abl_modeller = Modeller(abl_complex.topology, abl_complex.positions)
abl_modeller.addSolvent(forcefieldplus, model='tip3p', padding=padding, neutralize=True, ionicStrength=ionic_strength)

# Create a system
abl_system = forcefieldplus.createSystem(abl_modeller.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=cutoff, rigidWater=True, removeCMMotion=False)

# Create a SAMS sampler for the complex.
abl_sampler = SAMSSampler(system=abl_system, topology=abl_modeller.topology, positions=abl_modeller.positions, proposal_engine=library, proposal_atom_selection=abl_ligand.topology)

#
# Set up sampler for src:ligand
#

# Solvate the complex
src_modeller = Modeller(src_complex.topology, src_complex.positions)
src_modeller.addSolvent(forcefieldplus, model='tip3p', padding=padding, neutralize=True, ionicStrength=ionic_strength)

# Create a system
src_system = forcefieldplus.createSystem(abl_modeller.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=cutoff, rigidWater=True, removeCMMotion=False)

# Create a SAMS sampler for the complex.
src_sampler = SAMSSampler(system=src_system, topology=src_modeller.topology, positions=src_modeller.positions, proposal_engine=library, proposal_atom_selection=src_ligand.topology)

#
# Create the design sampler
#

# Construct a target function for maximizing ligand binding to Src but not Abl
# Keys specify SASASampler objects, and values are exponents in the computation of target weights.
# \hat{\pi}_k = Z_{Src L_k} / Z_{Abl L_k}
target_samplers = { src_sampler : 1.0, abl_sampler : -1.0 }
# Set up the design to run in parallel via MPI.
designer = MultiTargetDesign(target_samplers, mpicomm=MPI.COMM_WORLD)
# Run the designer engine
designer.run()

We could probably even layer helper functions on top of this to set up the target samplers, like this:

# Specify targets with a dict
targets = [
    { name : 'Src', weight : +1, rcsbid : '2HYY', chains : ['A'], ligand_selection : ':STI' },
    { name : 'Abl', weight : -1, rcsbid : '1Y57', chains : ['A'], residues : range(270,524), ligand_selection=':MPZ' },
    ...
]

# Load the small molecule library, a SmallMoleculeTransformation subclass.
library = SmallMoleculeLibrary(filename='library.mol2')

# Create all the target samplers using same options for all samplers.
options = { 'forcefield' : forcefieldplus, 'ionic_strength' : 50*millimolar, 'nonbondedMethod':app.CutoffPeriodic, ... }
target_samplers = CreateTargetSamplers(targets, proposal_engine=library, options=options)

# Set up the design to run in parallel via MPI.
designer = MultiTargetDesign(target_samplers, mpicomm=MPI.COMM_WORLD)
# Run the designer engine
designer.run()

from perses.

jchodera commented on June 19, 2024

Thinking about how @juliebehr's mutational resistance problem would look.

Instead of SmallMoleculeLibrary, we would have a class like PointMutation that is a subclass of PolymerProposal and is able to rebuild Topology objects containing polymers by allowing one or more residues to be substituted for another.

PolymerProposal objects would need to take the following information:

a ForceField object (or the ffxml files directly) containing the residue definitions, so that it understands how to change one Topology object into another by substituting residues

PointMutation objects would also need information about:

the wild-type sequence, which could either be specified or obtained from the Topology file
which chain in the Topology file it would be modifying
some parameters specifying how many point mutants away from the wild-type sequence are allowed
some specification of how mutations are to be proposed (e.g. single-nucleotide vs single-amino acid changes)

This could potentially look like this:

# Parameters
cutoff = 9*angstroms
padding = cutoff + 1*angstrom
ionic_strength = 50*millimolar

# Initialize a forcefield engine, which will also be able to parameterize small molecules in addition to polymer residues.
ffxmls = ['amber99sbildn.xml', 'gaff.xml', 'tip3p.xml', 'ions.xml']
forcefieldplus = ForceFieldPlus(*ffxmls)
gaff_typer = GAFFSmallMoleculeTyper('gaff-type-definitions.xml')
forcefield.addExtension(gaff_typer) # parameter assignment for small molecules

#
# Retrieve reference wild-type sequence from UniProt
#

from tools import UniProtHelper # new tool we build from Ensembler code or BioPython

# Isoform IB is dominant; we will work with only the a truncated form of the kinase domain.
wt_sequence = UniProtHelper.retrieveSequence(id='ABL1_HUMAN', isoform='IB', span=(229,500))

#
# Load Abl:imatinib and Abl:ATP structures and identify the kinase in each
#

from tools import StructureHelper
# Abl:imatinib structure
# We need to specify exactly which sequence and residue span we want to build/retain to make sure kinase is same in both structures
inhibitor_complex = StructureHelper(rcsbid='2HYY', chains=['A'], addMissingResidues=True, addMissingHeavyAtoms=True, protonateLigand=True, 
    reference_sequences = {'A' : wt_sequence }, retain_residue_spans={'A' : (229,500)} ) # tricky since ligand is in chain A as well
inhibitor_kinase = abl_complex['::A'] # chain A
# Abl:ATP structure
atp_complex = StructureHelper(rcsbid='2G1T', chains=['A','E'], addMissingResidues=True, addMissingHeavyAtoms=True, protonateLigand=True,
   reference_sequences = {'A' : wt_sequence }, retain_residue_spans={'A' : (229,500)} )
# chain E is a non-hydrolyzable ATP analogue that must be substituted with a real ATP
atp_complex.substituteLigand(source=atp, dest=atp_complex['::E'], method='shape-overlay') # we would need to write this function
atp_kinase = atp_complex['::A'] # chain A

#
# Set up the proposal engine
#

# Create a point mutation proposal object, which is a subclass of the polymer transformation proposal class
# In this case, allow up to two point mutations from the "wild-type" sequence represented by the sequence in `abl_kinase` topology, allowing all natural amino acid substitutions anywhere
# See below for many other variations on what we may want from this API.
from perses.mutations import PointMutationEngine, NaturalAminoAcids
mutation_engine = PointMutationEngine(reference_sequence=wt_sequence, forcefield=forcefieldplus, allowed_residues=NaturalAminoAcids, max_point_mutants=2)

#
# Set up sampler for Abl:inhibitor
#

# Solvate the complex
inhibitor_modeller = Modeller(inhibitor_complex.topology, inhibitor_complex.positions)
inhibitor_modeller.addSolvent(forcefieldplus, model='tip3p', padding=padding, neutralize=True, ionicStrength=ionic_strength)

# Create a system
inhibitor_system = forcefieldplus.createSystem(abl_modeller.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=cutoff, rigidWater=True, removeCMMotion=False)

# Create a SAMS sampler for the complex.
inhibitor_sampler = SAMSSampler(system=inhibitor_system, topology=inhibitor_modeller.topology, 
    positions=inhibitor_modeller.positions, proposal_engine=mutation_engine, 
    proposal_atom_selection=inhibitor_kinase.topology)

#
# Set up sampler for Abl:ATP
#

# Solvate the complex
atp_modeller = Modeller(atp_complex.topology, atp_complex.positions)
atp_modeller.addSolvent(forcefieldplus, model='tip3p', padding=padding, neutralize=True, ionicStrength=ionic_strength)

# Create a system
atp_system = forcefieldplus.createSystem(atp_modeller.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=cutoff, rigidWater=True, removeCMMotion=False)

# Create a SAMS sampler for the complex.
atp_sampler = SAMSSampler(system=satp_system, topology=atp_modeller.topology, 
    positions=atp_modeller.positions, proposal_engine=mutation_engine, 
    proposal_atom_selection=atp_kinase.topology)

#
# Create the design sampler
#

# Construct a target function for identifying mutants that maximize ATP binding but minimize inhibitor binding
# Keys specify SASASampler objects, and values are exponents in the computation of target weights.
# \hat{\pi}_k = Z_{Abl_k:ATP} / Z_{Abl_k:I}
target_samplers = { atp_sampler : 1.0, inhibitor_sampler : -1.0 }
# Set up the design to run in parallel via MPI.
designer = MultiTargetDesign(target_samplers, mpicomm=MPI.COMM_WORLD)
# Run the designer engine
designer.run()

Some notes:

Chain selection still needs to be implemented in ParmEd.

from perses.

jchodera commented on June 19, 2024

And we could probably once again think of a way to simplify this with a helper function:

# Set up the proposal engine
from perses.mutations import PointMutationEngine, NaturalAminoAcids
mutation_engine = PointMutationEngine(reference_sequence=wt_sequence, forcefield=forcefieldplus, allowed_residues=NaturalAminoAcids, max_point_mutants=2)

# Specify targets with a dict
# This isn't quite where we need it to be, but might be close.
targets = [
    { name : 'Abl:inhibitor', weight : -1, rcsbid : '2HYY', chains : ['A'], proposal_selection='::A:229-500', target_sequences = {'A' : wt_sequence } },
    { name : 'Abl:ATP', weight : +1, rcsbid : '2G1T', chains : ['A','E'], proposal_selection='::A:229-500', target_sequences = {'A' : wt_sequence } },
    ...
]

# Load the small molecule library, a SmallMoleculeTransformation subclass.
library = SmallMoleculeLibrary(filename='library.mol2')

# Create all the target samplers using same options for all samplers.
options = { 'forcefield' : forcefieldplus, 'ionic_strength' : 50*millimolar, 'nonbondedMethod':app.CutoffPeriodic, ... }
target_samplers = CreateTargetSamplers(targets, proposal_engine=library, options=options)

# Set up the design to run in parallel via MPI.
designer = MultiTargetDesign(target_samplers, mpicomm=MPI.COMM_WORLD)
# Run the designer engine
designer.run()

We'd have to think hard about the setup pipeline aspect of things (@andrrizzi), since this needs to be flexible enough to let us:

Specify where to take template structural data from (RCSB, PDB file, mmCIF)
Specify what target sequence(s) we want to model---which may differ in sequence and/or length from the template data---and which parts of which chains should provide templates for which targets. This is complicated by the fact that both protein and ligand are in chain 'A' in 2HYY but in separate chains in 2G1T. The target sequence(s) may be manually specified, come from UniProt, or be specified in the PDB file either via the DBREF/SEQRES section or the resolved span of residues.
Specify target pH for protonation state modeling of polymers
Specify how the ligand is to be replaced (e.g. ROCS shape overlay) if we need to do so
Specify how to fully protonate the ligand and select the appropriate protonation state.

from perses.

jchodera commented on June 19, 2024

Finally, here's an example for the MHC:peptide:TCR use case:

# Parameters
cutoff = 9*angstroms
padding = cutoff + 1*angstrom
ionic_strength = 50*millimolar

# Initialize a forcefield
ffxmls = ['amber99sbildn.xml', 'tip3p.xml', 'ions.xml']
forcefield = ForceField(*ffxmls)

#
# Load MHC:peptide:TCR structure
#

from tools import StructureHelper
# We just need to build any missing atoms/residues from the SEQRES
complex = StructureHelper(rcsbid='1AO7', chains=['A','B','C'], addMissingResidues=True, addMissingHeavyAtoms=True) 
# Identify components
mhc = complex['::A']
tcr = complex['::B']
peptide = complex['::C']

#
# Set up the proposal engine
#

# Create a point mutation proposal object, which is a subclass of the polymer transformation proposal class
# In this case, we use a library of sequences to define the accessible space of sequences
library = SequenceLibrary(filename='sequence-library.fasta')
library.addSequence(peptide) # make sure the peptide sequence is included

#
# Set up sampler for MHC:peptide:TCR
#

# Solvate the complex
complex_modeller = Modeller(complex.topology, complex.positions)
complex_modeller.addSolvent(forcefield, model='tip3p', padding=padding, neutralize=True, ionicStrength=ionic_strength)

# Create a system
complex_system = forcefield.createSystem(complex_modeller.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=cutoff, rigidWater=True, removeCMMotion=False)

# Create a SAMS sampler for the complex.
complex_sampler = SAMSSampler(system=complex_system, topology=complex_modeller.topology, 
    positions=complex_modeller.positions, proposal_engine=library, 
    proposal_atom_selection=peptide)

#
# Set up sampler for peptide in solvent
#

# Solvate the peptide
peptide_modeller = Modeller(peptide.topology, peptide.positions)
peptide_modeller.addSolvent(forcefield, model='tip3p', padding=padding, neutralize=True, ionicStrength=ionic_strength)

# Create a system
peptide_system = forcefield.createSystem(peptide_modeller.topology, nonbondedMethod=CutoffPeriodic, nonbondedCutoff=cutoff, rigidWater=True, removeCMMotion=False)

# Create a SAMS sampler for the complex.
peptide_sampler = SAMSSampler(system=peptide_system, topology=peptide_modeller.topology, 
    positions=peptide_modeller.positions, proposal_engine=library, 
    proposal_atom_selection=peptide.topology)

#
# Create the design sampler
#

# Construct a target function for identifying peptides that maximize the MHC:peptide:TCR ternary complex affinity
# Keys specify SASASampler objects, and values are exponents in the computation of target weights.
# \hat{\pi}_k = Z_{MHC:peptide_k:TCR} / Z_{peptide_k}
target_samplers = { complex_sampler : 1.0, peptide_sampler : -1.0 }
# Set up the design to run in parallel via MPI.
designer = MultiTargetDesign(target_samplers, mpicomm=MPI.COMM_WORLD)
# Run the designer engine
designer.run()

from perses.

swails commented on June 19, 2024

Chain selection still needs to be implemented in ParmEd.

FWIW, the selection syntax handles chain IDs just fine. For example: struct[('A','B'),:10,:] will select all atoms in residues 1 to 10 of chains A and B. I'd suggest this route rather than waiting for the mask syntax grow chain selection IDs.

from perses.

jchodera commented on June 19, 2024

Thanks, @swails!

from perses.

jchodera commented on June 19, 2024

At one point, @pgrinaway suggested we might be able to use the small molecule graph-based (MCSS) topology mapping scheme even for peptide or protein transformations. While I don't think this will work if fed things which are not small molecules, I wonder if this could actually do the trick if we constrain this to work residue-by-residue.

What we really need is a way to take a Topology object, program one or more residue mutations (which could be small molecules, amino acids, nucleic acids, etc.), an end up with a new Topology and an atom mapping from old to new atoms.

For subsituting a small molecule with another small molecule, MCSS could be used to propose several candidate matches, and one of the matches selected according to some criteria (with ties selected randomly).

For a residue with one or more bonds to other residues, however, this is no so simple. We could, however, work residue by residue and perform this operation. Bonds to other residues could either be truncated or replaced with special atoms (e.g. X, Y, Z...) and forced to match. The same MCSS scheme could then be used on a residue-by-residue basis.

The templates for each residue could be pulled from a ForceField object, which maintains a representation of all known residue topologies. New small molecule templates can also be added to the ForceField as they are generated.

We could imagine an API for this either being a static factory that operates on Topology objects

from perses import TopologyTools
residues = [ residue for residue in topology.residues() ]
substitutions = { residues[10] : 'GLY', residues[20] : 'LEU', residues[100] : 'benzene' }
[new_topology, atom_mapping] = TopologyTools.substituteResidues(topology, forcefield, substitutions)

where this repeatedly calls substituteResidue() and updates the atom mapping. Not sure if it would be best if atom_mapping maps atom indices or actual Atom objects (from which atom indices can be extracted).

We could also imagine this might just be a feature of an enhanced TopologyPlus class:

[new_topology, atom_mapping] = topology.substituteResidue(residue, new_resname)

residues = [ residue for residue in topology.residues() ]
substitutions = { residues[10] : 'GLY', residues[20] : 'LEU', residues[100] : 'benzene' }
[new_topology, atom_mapping] = topology.substituteResidues(residue_substitutions)

from perses.

jchodera commented on June 19, 2024

(I think @juliebehr may have used a similar scheme in constructing her overlay of amino acids.)

from perses.

jchodera commented on June 19, 2024

Just compiling a list of all the components we would need for the above API proposal:

Structure preparation pipelines

@andrrizzi can implement these:

StructureHelper (or some better name) - assists in preparing a polymer structure coming from the RCSB or a PDB/PDBx file, selecting chains, adding missing residues/heavy atoms, making mutations or modeling a sequence onto a template, replacing a ligand, retaining desired cofactors, etc. We can make extensive use of parmed for this.

System builder schemes

enhanced ForceField or ForceFieldPlus - has plugin extensions to parameterize small molecules or other unknown residues using GAFF/Antechamber, as well as allow GBSA parameters to be assigned if desired; @jchodera is working on this

Proposal engines

ProposalEngine - @pgrinaway has already written the base class; just needs renaming/refactoring
- SmallMoleculeProposalEngine - @pgrinaway has already written this
  - SmallMoleculeLibrary - @pgrinaway has already written this
  - CombinatorialLibrary - @pgrinaway will eventually write this
  - ReactionAwareSynthesis - @pgrinaway will eventually write this
- PolymerProposalEngine - propose changes in polymeric residues
  - PointMutationEngine - propose point mutations; @juliebehr
  - PeptideLibraryEngine - propose changes among a number of peptides (or other polymers) in a library; @juliebehr can write this

Topology proposal schemes

@juliebehr and @pgrinaway can implement these:

TopologyTools.substituteResidue(s) or TopologyPlus.substituteResidue(s) - substitute one or more residues in a topology file for another residue from definitions in a ForceField object

Samplers

Presumably @pgrinaway will implement these:

Sampler - base class
- MCMCSampler - used by SAMSSampler to perform MCMC sampling in configuration space and discrete design space
  - SAMSSampler - perform self-adjusted mixture sampling using fixed target weights that can be updated through API calls

Design drivers

MultiTargetDesign - universal positive/negative design driver that manages multiple SAMSSampler objects on one or more GPUs. Perhaps @pgrinaway can write this?

Utility functions

CreateTargetSamplers - this could automate the setup of multiple SAMSSampler objects and pipeline the setup of multiple StructureHelpers to feed them

Alchemical functions

DualTopologyFactory (choderalab/alchemy#21) - this would allow us to experiment with different RJMC schemes that would less abruptly change charges, types, and valence terms when a scaffold or molecule is switched from one to the other; @jchodera

Near-term examples

Computing all binding free energies for molecules in a library; @pgrinaway
- e.g. host-guest ligands from SAMPL challenges
- e.g. ligands and non-ligands for T4 lysozyme mutant systems (L99A and L99A/M102Q)
Identifying selective compounds from molecules in a library; @steven-albanese, @pgrinaway, @sonyahanson
- e.g. molecules from a large kinase inhibitor library with specific kinase or kinase conformation selectivities
Identifying resistance mutants to a given compound; @juliebehr
- e.g. Abl:imatinib
Identifying compounds that optimally bind a target but not an antitarget using reaction-aware synthesis; @pgrinaway and @steven-albanese
- e.g. kinase inhibitor design
Identifying peptides from a combinatorially-generated library that bind well to a given protein; @juliebehr and @pgrinaway
- e.g. MYB from Kentsis
Identifying peptides from a fixed library that bind well to an MHC:TCR complex; @juliebehr
- MHC:peptide:TCR example from Ron Gejman

from perses.

jchodera commented on June 19, 2024

Any comments on what needs to be done first or fleshing out a more specific API or skeleton for some of these components?

from perses.

pgrinaway commented on June 19, 2024

My thoughts on the API + the work division:

Regarding the Sampler-facing topology proposal API, save for a potential name change, I think it should remain as it is. I like the idea of the TopologyTools class, which would be very useful in the implementation of the the polymer topology proposal classes. I think from the point of view of the Sampler classes, the topology proposal API should be very simple.
For the Sampler, I have an initial suggestion, though it may need more thought:

mcmc_sampler = MCMCSampler(topology_engine, geometry_engine, ncmc_engine)

#inside a SAMS sampler?
#should we do something like this:
mcmc_sampler.sample_x(1000) #run 1000 steps of integration in configuration space
mcmc_sampler.sample_k(1) #attempt to jump to a new state (compound/mutation/whatever)

#update the weights
current_k = mcmc_sampler.current_k
weights_new = self._update_weights(current_k) #get a vector of updated weights
mcmc_sampler.weights = weights_new #set via property setter

Then maybe for the external interface of the SAMSSampler (pretending we are inside a method
that uses the SAMSSampler):

sams_sampler = SAMSSampler(system=peptide_system, topology=peptide_modeller.topology, 
    positions=peptide_modeller.positions, proposal_engine=library, 
    proposal_atom_selection=peptide.topology)

#I wonder if it would be useful to be able to specify a protocol for drawing samples of
#configurations and states, such as:
protocol = {n_iterations : 100, x_update_per_iteration : 100, k_update_per_iteration : 1}

sams_sampler.run_mcmc(protocol) 
sams_sampler.adjust_weights() #do we need an explicit call to this?
current_targets = sams_sampler.targets
current_weights = sams_sampler.weights

#update the target weights. Other SAMSSamplers that would be required here left out for simplicity
new_targets = self._update_target_weights(current_targets, current_weights...)

I generally like the rest of the API as presented. I'm also satisfied with the assignment of tasks here. In general, I love MCMC and SA, so I'm happy to work on any of those things.

from perses.

jchodera commented on June 19, 2024

(deleted because of formatting screwups)

from perses.

jchodera commented on June 19, 2024

My phone screwed up the previous comment, so here's a new try:

Regarding the Sampler-facing topology proposal API, save for a potential name change, I think it should remain as it is. I like the idea of the TopologyTools class, which would be very useful in the implementation of the the polymer topology proposal classes. I think from the point of view of the Sampler classes, the topology proposal API should be very simple.

Remain as in the proposal here or the currently implemented code?

For the Sampler, I have an initial suggestion, though it may need more thought:

mcmc_sampler = MCMCSampler(topology_engine, geometry_engine, ncmc_engine)

#inside a SAMS sampler?
#should we do something like this:
mcmc_sampler.sample_x(1000) #run 1000 steps of integration in configuration space
mcmc_sampler.sample_k(1) #attempt to jump to a new state (compound/mutation/whatever)

#update the weights
current_k = mcmc_sampler.current_k
weights_new = self._update_weights(current_k) #get a vector of updated weights
mcmc_sampler.weights = weights_new #set via property setter

What if we have classes that were intended to interoperate:

MCMCSampler - sample positions using composition of MCMC moves. This is essentially already written and living in repex. We could ingest it or make it into its own package to facilitate reuse and testing.
ExpandedEnsembleSampler - sample positions with MCMCSampler, state with some form of Gibbs sampling, according to fixed weights (that can be changed via a public API). Discrete states might be arbitrary keys or integers.
SAMSSampler - uses an ExpandedEnsembleSampler and adjusts the weights dynamically according to one of the strategies in Z Tan's paper. Can also be split out because multiple projects can use it.

This way, we have a basic API for each part of the sampler chain and encapsulates the behavior relevant to the immediate algorithm inside the relevant class, rather than having, say, Gibbs sampling and weight adaptation mixed into the same thing.

Things I am less sure of: would SAMSSampler create an ExpandedEnsembleSampler under the hood (in which case it may need to relay options to the constructor) or is it best to have the caller pass in an ExpandedEnsembleSampler? Same thing with MCMCSampler. If we have the caller pass the sampler, it gives us more flexibility in subclassing each of these to change their behavior.

Then maybe for the external interface of the SAMSSampler (pretending we are inside a method
that uses the SAMSSampler):

sams_sampler = SAMSSampler(system=peptide_system, topology=peptide_modeller.topology,
positions=peptide_modeller.positions, proposal_engine=library,
proposal_atom_selection=peptide.topology)

This looks pretty good. Is there a way to generalize the SAMSSampler further so we could use it for the SAMS paper for parallel tempering and alchemical solvation free energy calculations too, or would that just cause too much trouble?

#I wonder if it would be useful to be able to specify a protocol for drawing samples of
#configurations and states, such as:
protocol = {n_iterations : 100, x_update_per_iteration : 100, k_update_per_iteration : 1}

We could have some options we could set for each of the samplers through either an API or an options dict.

sams_sampler.run_mcmc(protocol)
sams_sampler.adjust_weights() #do we need an explicit call to this?
current_targets = sams_sampler.targets
current_weights = sams_sampler.weights

#update the target weights. Other SAMSSamplers that would be required here left out for simplicity
new_targets = self._update_target_weights(current_targets, current_weights...)
I generally like the rest of the API as presented. I'm also satisfied with the assignment of tasks here.
In general, I love MCMC and SA, so I'm happy to work on any of those things.

Great!

from perses.

jchodera commented on June 19, 2024

For example, we could have

from thermodynamics import ThermodynamicState
thermodynamic_state = ThermodynamicState(system=test.system, temperature=298*u.kelvin)
# Create a sampler state.
sampler_state = SamplerState(system=test.system, positions=test.positions)
# Create a move set specifying probabilities fo each type of move.
move_set = { HMCMove(nsteps=10) : 0.5, LangevinDynamicsMove(nsteps=10) : 0.5 }
# Create MCMC sampler
mcmc_sampler = MCMCSampler(thermodynamic_state, move_set=move_set)
# Create Expanded Ensemble sampler
expanded_ensemble_sampler = ExpandedEnsembleSampler(mcmc_sampler, log_weights=log_weights)
# Create an SAMS sampler
sams_sampler = SAMSSampler(mcmc_sampler, log_target_probabilities=log_target_probabilities)

from perses.

pgrinaway commented on June 19, 2024

Remain as in the proposal here or the currently implemented code?

Currently implemented code (though the innards will change with the OpenMM parameterization stuff, I like the simple interface of topology_engine.propose(current_state)). The other objects you mentioned to assist are great, but they might just be used inside a TopologyEngine.

What if we have classes that were intended to interoperate:
MCMCSampler - sample positions using composition of MCMC moves. This is essentially already written and living in repex. We could ingest it or make it into its own package to facilitate reuse and testing.
ExpandedEnsembleSampler - sample positions with MCMCSampler, state with some form of Gibbs sampling, according to fixed weights (that can be changed via a public API). Discrete states might be arbitrary keys or integers.
SAMSSampler - uses an ExpandedEnsembleSampler and adjusts the weights dynamically according to one of the strategies in Z Tan's paper. Can also be split out because multiple projects can use it.
This way, we have a basic API for each part of the sampler chain and encapsulates the behavior relevant to the immediate algorithm inside the relevant class, rather than having, say, Gibbs sampling and weight adaptation mixed into the same thing.

I've been mulling this over for a while now. At first I was a bit unsure, but I think that this might be the way to go. Regardless of the state k, we'll likely want to propagate x with the same MCMC steps. And a modification of the k update could be accomplished by subclassing ExpandedEnsembleSampler. I'm +1 on this

Things I am less sure of: would SAMSSampler create an ExpandedEnsembleSampler under the hood (in which case it may need to relay options to the constructor) or is it best to have the caller pass in an ExpandedEnsembleSampler? Same thing with MCMCSampler. If we have the caller pass the sampler, it gives us more flexibility in subclassing each of these to change their behavior.

I would vote for having the caller pass the sampler, for the flexibility reasons cited. I'd like to make these components as independent as possible so that we can mix and match.

This looks pretty good. Is there a way to generalize the SAMSSampler further so we could use it for the SAMS paper for parallel tempering and alchemical solvation free energy calculations too, or would that just cause too much trouble?

Well, if we're passing in an ExpandedEnsembleSampler, we could subclass that and have the k update be equivalent to updating the temperature, lambda, or whatever, and it should still work as expected. So I don't think that's too much trouble at all.

Going to post this and then comment on the repex stuff.

from perses.

jchodera commented on June 19, 2024

Well, if we're passing in an ExpandedEnsembleSampler, we could subclass that and have the k update be equivalent to updating the temperature, lambda, or whatever, and it should still work as expected. So I don't think that's too much trouble at all.

Yeah, this might be the best way, since I think each of these scenarios is pretty different.

We could have

ExpandedEnsembleSampler - base class
* TopologyExpandedEnsemblerSampler - sample over different topology proposals
* SimulatedTemperingSampler - sample over different temperatures
* AlchemicalSampler - sample over different alchemical states?

from perses.

pgrinaway commented on June 19, 2024

Ok, I think that repex MCMC sampler code is really nice, so I'm +1 on using it here. As to bringing it into the repo, I am fine with that as well, but I suppose it depends on future plans for repex.

from perses.

pgrinaway commented on June 19, 2024

My thoughts exactly regarding the subclasses of ExpandedEnsembleSampler.

from perses.

jchodera commented on June 19, 2024

Ok, I think that repex MCMC sampler code is really nice, so I'm +1 on using it here. As to bringing it into the repo, I am fine with that as well, but I suppose it depends on future plans for repex.

What about breaking it into a separate mcmc or openmm-mcmc repo?

from perses.

pgrinaway commented on June 19, 2024

What about breaking it into a separate mcmc or openmm-mcmc repo?

That's great. It can probably help with other projects then, too.

from perses.

jchodera commented on June 19, 2024

OK, want me to do it or do you?

from perses.

pgrinaway commented on June 19, 2024

I can do it. Should we then make repex dependent on it?

from perses.

jchodera commented on June 19, 2024

Yes!

from perses.

jchodera commented on June 19, 2024

@pgrinaway if this class structure looks good, can you and @juliebehr take a stab at implementing a skeleton on Friday? We should really get something basic up and running just to make sure we haven't forgotten something important with the design. Then we can come back to concentrating on the sampler contents.

from perses.

pgrinaway commented on June 19, 2024

The class structure looks fine to me. I think I'm misunderstanding something about your request, though.

I have opened a new issue with a description of what I think the near-term goals and work are in #65.

from perses.

pgrinaway commented on June 19, 2024

Closing this since discussion seems to have ended here. We can certainly talk more about it later, and we'll have the closed issue to reference at that point.

from perses.

Topology proposals and system construction about perses HOT 42 CLOSED

Comments (42)

Computing binding free energies for all molecules in a library.

Computing binding free energies for all molecules in a library

Structure preparation pipelines

System builder schemes

Proposal engines

Topology proposal schemes

Samplers

Design drivers

Utility functions

Alchemical functions

Near-term examples

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent