Ammolite, Aligned Molecule Matching

Ammolite is a software tool for finding structurally similar molecules. It is especially well suited to searching large molecule databases, such as PubChem, quickly. Among other uses, Ammolite can help researchers working with a novel molecule predict its properties by finding known molecules with similar structures.

Ammolite matches pairs of molecules based on how much their structures overlap. This overlap is quantified by the Tanimoto coefficient. Two molecules with a Tanimoto coefficient of 1 are identical, while lower coefficients (down to 0) indicate increasingly different molecules. Typical Ammolite searches look for molecules that match a query with a Tanimoto coefficient of at least 0.7 to 0.9.
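As a rough illustration of the coefficient described above, the sketch below computes a Tanimoto coefficient from the size of each molecule and the size of their overlap. It assumes the common set-based definition, overlap / (query + target - overlap); the exact formula Ammolite uses is not stated here, and the class and method names are illustrative only.

```java
/** Minimal sketch of an overlap-based Tanimoto coefficient (not Ammolite's own code). */
final class TanimotoSketch {

    /**
     * Assumed definition: overlap / (querySize + targetSize - overlapSize),
     * with all sizes counted in the same unit (for example, atoms).
     */
    static double tanimoto(int querySize, int targetSize, int overlapSize) {
        if (querySize < 0 || targetSize < 0 || overlapSize < 0
                || overlapSize > Math.min(querySize, targetSize)) {
            throw new IllegalArgumentException("inconsistent sizes");
        }
        int union = querySize + targetSize - overlapSize;
        // Treating two empty structures as identical is a convention of this sketch.
        return union == 0 ? 1.0 : (double) overlapSize / union;
    }

    public static void main(String[] args) {
        System.out.println(tanimoto(20, 20, 20)); // identical molecules: 1.0
        System.out.println(tanimoto(20, 25, 18)); // 18 / 27, roughly 0.67
        System.out.println(tanimoto(20, 25, 0));  // no overlap: 0.0
    }
}
```

Under this definition, a pair with 18 overlapping atoms out of 20 and 25 would fall just below a 0.7 threshold.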

For efficient search, Ammolite requires that a molecule database be compressed first. Compression is slow, so pre-compressed versions of the PubChem database are available for download at WEBSITE.

For more information about the algorithm behind Ammolite see PAPER

How to use the Ammolite command line tool

The command line version of Ammolite supports three basic commands: compress, search, and examine.

Compress:

The compress command is used to compress a database of molecules.

The syntax is as follows:

./Ammolite compress --source list/of/sdf/files/to/compress --target name/of/new/database

If an existing Ammolite database is given as the target, the SDF files in the source will be appended to it.

Optionally a maximum number of threads can be specified using

--threads

To use JVM settings other than the default (usually to increase the RAM available to the JVM), replace './Ammolite' with

java -jar path/to/Ammolite.jar compress

A sample compress command:

java -Xmx360g -jar Ammolite.jar compress -s all/pubchem/sdf/*.sdf -t all-pubchem-ammolite-compression

Compressing large databases is resource intensive and is only recommended on machines with a large amount of RAM and many CPU cores. Compressing a ~50M molecule database took about 10 days on a machine with 20 processing cores and 360GB of RAM.

Search:

The search command is used to search an Ammolite database or a set of SDF files.

Ammolite search outputs results as a CSV file with the following structure:

Query_ID, Target_ID, Size_of_Query, Size_of_Target, Size_of_Overlap, Tanimoto_Coefficient

Optionally Ammolite search can also write a series of SDF files detailing the structure of the overlaps between query and target molecules.
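The CSV layout above can be read back with a few lines of Java. The following is a minimal sketch, assuming six comma-separated fields per row with no embedded commas, integer sizes, and an optional header row starting with Query_ID; the class name, the record, and the sample rows are made up for illustration. It needs Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of a reader for the search result CSV (illustrative, not part of Ammolite). */
final class SearchResultCsvSketch {

    /** One result row; the fields follow the column order listed above. */
    record Row(String queryId, String targetId, int querySize,
               int targetSize, int overlapSize, double tanimoto) {}

    /** Parses one data line; assumes exactly six comma-separated fields. */
    static Row parseLine(String line) {
        String[] f = line.split(",");
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields: " + line);
        }
        return new Row(f[0].trim(), f[1].trim(),
                Integer.parseInt(f[2].trim()), Integer.parseInt(f[3].trim()),
                Integer.parseInt(f[4].trim()), Double.parseDouble(f[5].trim()));
    }

    /** Keeps rows at or above minTanimoto, skipping blank lines and a header row if present. */
    static List<Row> parse(List<String> lines, double minTanimoto) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.isBlank() || line.startsWith("Query_ID")) {
                continue;
            }
            Row row = parseLine(line);
            if (row.tanimoto() >= minTanimoto) {
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not real search output.
        List<String> sample = List.of(
                "Query_ID, Target_ID, Size_of_Query, Size_of_Target, Size_of_Overlap, Tanimoto_Coefficient",
                "Q1, T7, 20, 25, 18, 0.667",
                "Q1, T9, 20, 20, 20, 1.0");
        for (Row row : parse(sample, 0.7)) {
            System.out.println(row.queryId() + " matches " + row.targetId()
                    + " with coefficient " + row.tanimoto());
        }
    }
}
```

For a results file written with --out-file, the lines could be loaded with java.nio.file.Files.readAllLines before calling parse.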

The search command requires only three arguments:

  • -d, --database ... Path to the database. If using linear search this may include multiple files and SDF files. Otherwise this must be a compressed Ammolite-database.
  • -q, --queries ... SDF files of queries.
  • -t, --threshold The minimum Tanimoto coefficient for search results.

Optional arguments are:

  • -l, --linear-search Search the database exhaustively using linear search.
  • -o, --out-file By default Ammolite will write its results to stdout. Instead a file may be specified.
  • -w, --write-sdfs Make SDF files detailing the overlap between targets and queries.

The difference between linear-search and regular search:

By default, Ammolite uses compressive acceleration to search only a subset of a large database. This gives significant speed gains at the cost of a slight loss of accuracy, and it requires that databases be compressed before search.

If search speed is not a concern (for small sets of molecules, up to about 100k depending on the situation) or if accuracy is paramount, linear-search may be preferable.

linear-search does not require a special database format; target files may be given as SDF files.

A sample search command:

./Ammolite search -d path/to/Ammolite-Database.adb -q path/to/queries.sdf -o ammolite-search-results.csv -t 0.9

Arguments can be supplied directly to the JVM by replacing ./Ammolite with

java -jar Ammolite.jar search

Search typically requires much less RAM than compression, but a search of the entire PubChem database can still benefit from up to 80GB of RAM. This can be supplied to the JVM with the argument

java -Xmx80g
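When a search is launched from another Java program, the same command line can be assembled as an argument list. The sketch below only uses the flags documented above (-d, -q, -t, -o, -l) and the -Xmx JVM option; the class and method names are hypothetical, and the paths are the placeholders from the sample command.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch that builds an Ammolite search command line (illustrative helper). */
final class SearchCommandSketch {

    /**
     * Builds "java [-Xmx<maxHeap>] -jar <jar> search -d ... -q ... -t ... [-o ...] [-l]".
     * Pass null for maxHeap or outFile to leave that option out.
     */
    static List<String> searchCommand(String jarPath, String maxHeap, String database,
                                      String queries, double threshold,
                                      String outFile, boolean linearSearch) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        if (maxHeap != null) {
            cmd.add("-Xmx" + maxHeap);   // e.g. "80g"
        }
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("search");
        cmd.add("-d");
        cmd.add(database);
        cmd.add("-q");
        cmd.add(queries);
        cmd.add("-t");
        cmd.add(String.valueOf(threshold));
        if (outFile != null) {
            cmd.add("-o");
            cmd.add(outFile);
        }
        if (linearSearch) {
            cmd.add("-l");
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = searchCommand("Ammolite.jar", "80g",
                "path/to/Ammolite-Database.adb", "path/to/queries.sdf",
                0.9, "ammolite-search-results.csv", false);
        System.out.println(String.join(" ", cmd));
        // To run it (requires Ammolite.jar and the input files to exist):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Note that ProcessBuilder does not expand shell wildcards such as *.sdf, so each file has to be passed as its own argument.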

Examine:

The examine command is useful for listing some basic statistics about a database. It is relatively simple to use.

./Ammolite examine an-ammolite-database.gad

How to use the Ammolite Java API

Ammolite provides a small API for researchers who want to incorporate molecule search into their programs. Using the Ammolite API requires some familiarity with programming in Java.

Ammolite is built on the Chemistry Development Kit (CDK) and Small Molecule Subgraph Detection (SMSD). The CDK provides a large amount of functionality for computational chemistry, and a developer familiar with it should have an easy time incorporating Ammolite into their programs.

The edu.mit.csail.ammolite.search package provides three methods that are useful for API use:

SearchHandler.handleSearch(arguments ...) which mirrors the command line usage

Ammolite.search(arguments ...) which allows for compressively-accelerated search from CDK objects stored in memory

SMSDSearcher.search(arguments ...) which allows for linear-search from CDK objects stored in memory.

To use either Ammolite.search() or SMSDSearcher.search(), developers may have to implement the IResultHandler interface. There are three example implementations of the result handler interface in the ammolite.search package, and writing a custom one should be relatively easy. Most developers will find that MatchingMoleculeHandler either meets their needs or provides a good basis for building a custom result handler.
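The method signatures of IResultHandler are not listed here, so the sketch below does not implement the real interface. It defines its own stand-in types (HypotheticalResultHandler and Hit, both invented for this example) purely to illustrate the callback pattern a custom handler might follow: receive each match, keep those above a threshold, and expose the collected results. Consult the source of the ammolite.search package for the actual interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a result-handler pattern using stand-in types (NOT Ammolite's real API). */
final class ResultHandlerSketch {

    /** Stand-in for one search hit; Ammolite's own result type may differ. */
    record Hit(String queryId, String targetId, double tanimoto) {}

    /** Stand-in callback interface; the real IResultHandler may declare different methods. */
    interface HypotheticalResultHandler {
        void handle(Hit hit);
        void finish();
    }

    /** Collects hits at or above a threshold, in the order they arrive. */
    static final class CollectingHandler implements HypotheticalResultHandler {
        private final double threshold;
        private final List<Hit> kept = new ArrayList<>();
        private boolean finished = false;

        CollectingHandler(double threshold) {
            this.threshold = threshold;
        }

        // Synchronized in case a searcher reports hits from several threads.
        @Override
        public synchronized void handle(Hit hit) {
            if (hit.tanimoto() >= threshold) {
                kept.add(hit);
            }
        }

        @Override
        public synchronized void finish() {
            finished = true;
        }

        synchronized List<Hit> results() {
            return List.copyOf(kept);
        }

        synchronized boolean isFinished() {
            return finished;
        }
    }

    public static void main(String[] args) {
        CollectingHandler handler = new CollectingHandler(0.8);
        // Made-up hits standing in for what a searcher would report.
        handler.handle(new Hit("Q1", "T1", 0.95));
        handler.handle(new Hit("Q1", "T2", 0.42));
        handler.handle(new Hit("Q1", "T3", 0.80));
        handler.finish();
        System.out.println(handler.results().size() + " hits kept");
    }
}
```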

Development Notes:

Ammolite is a new piece of software and may have a few bugs. To report bugs or suggest changes feel free to contact us.

Users should also feel free to contact us for help using either the command line tool or the API.

You may contact us by email at:

[email protected] (preferred) or [email protected]

Contributors:

dcdanko, ndaniels
