
qcfractal's Introduction

QCArchive


A platform for computing, managing, compiling, and sharing large amounts of quantum chemistry data

Introduction

QCArchive is a platform that makes running large numbers of quantum chemistry calculations in a robust and scalable manner accessible to computational chemists. QCArchive is designed to handle thousands to millions of computations, storing them in a database for later retrieval, analysis, sharing, or export.

Documentation

Full documentation is available here

Installing from the git repo

To install these packages with pip directly from this git repository,

pip install ./qcportal ./qcfractal ./qcfractalcompute ./qcarchivetesting

or, for a developer (editable) install,

pip install -e ./qcportal -e ./qcfractal -e ./qcfractalcompute -e ./qcarchivetesting

About this repository

This repository follows a monorepo layout; that is, this single repository contains several different Python packages, each with its own setup information (pyproject.toml).

  • qcfractal - The main QCFractal server (database and web API)
  • qcportal - Python client for interacting with the server
  • qcfractalcompute - Workers that are deployed to run computations
  • qcarchivetesting - Helpers and pytest harnesses for testing QCArchive components

The reason for this is that, at this stage, these components are tightly coupled, and changing one often requires changing the others. This layout allows for that, while still making it possible to build and distribute separate Python packages (that is, qcportal can be packaged separately and uploaded to PyPI or conda-forge).

License

BSD-3C. See the License File for more information.


qcfractal's Issues

Handle import errors on compute

If a worker hits an import error (e.g., it fails to import RDKit), the result is currently marked as failed. ImportErrors (ModuleNotFoundError) should instead be detected and resubmitted to the queue as "soft" errors, up to a maximum of n times. Matching tasks to corresponding workers with arbitrary queue backends will continue to be an issue under the current structure.
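
A minimal sketch of the detection logic, assuming a hypothetical task dict and a retry counter stored on it; these names are illustrative, not the current QCFractal API:

def classify_failure(task, exc, max_soft_retries=3):
    # ModuleNotFoundError subclasses ImportError, so one check covers both.
    if isinstance(exc, ImportError) and task.get("soft_retries", 0) < max_soft_retries:
        task["soft_retries"] = task.get("soft_retries", 0) + 1
        task["status"] = "WAITING"   # soft error: back into the queue
    else:
        task["status"] = "ERROR"     # hard failure, surfaced to the user
    return task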

Results documentation

Several points to make:

  • Contains a single quantum chemistry execution (energy/gradient/Hessian/property)
    • Not a finite-difference gradient or geometry optimization (see procedures)
  • Uniquely defined by (program, molecule_id, driver, method, basis, options).
    • ("psi4", "molecule_id", "energy", "b3lyp", "6-31g", "default")
  • Default execution engine is qcengine.compute.
  • Run completely on-node.
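
For context, the default engine call for a single result might look roughly like this — a sketch assuming Psi4 is installed locally; the molecule and model are illustrative:

import qcelemental as qcel
import qcengine as qcng

mol = qcel.models.Molecule.from_data("He 0 0 0")
inp = {
    "molecule": mol,
    "driver": "energy",
    "model": {"method": "b3lyp", "basis": "6-31g"},
}
# (program, molecule, driver, method, basis, options) uniquely defines the record
result = qcng.compute(inp, "psi4")
print(result.return_result)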

Switch PyMongo with Motor

Motor is a coroutine-based wrapper around MongoDB, whereas PyMongo makes direct blocking calls. The advantages are the following:

  • Allows async IO to the database to help avoid bottlenecks and increase throughput.
  • Can use a tornado IOLoop as the primary event loop.
  • API virtually identical to PyMongo to make switching easy.

Questions:

  • How far do we want async to permeate through the code base? Might want to isolate at DB sockets and Tornado front ends.
  • How much of a performance benefit is async over non-async?
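
A minimal sketch of what an async query could look like, using Motor's asyncio client (the tornado client is analogous; the database/collection names are illustrative):

import asyncio
from motor.motor_asyncio import AsyncIOMotorClient

async def count_waiting():
    client = AsyncIOMotorClient("mongodb://localhost:27017")
    # Non-blocking query; other coroutines can run while the database responds.
    return await client.qcarchive.queue.count_documents({"status": "WAITING"})

print(asyncio.run(count_waiting()))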

Check base MongoDB version

The base MongoDB version should be checked:

>>> client = pymongo.MongoClient()
>>> client.server_info()["version"]
'3.4.4'
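
A sketch of the check itself; the minimum version shown is illustrative, not a decided requirement:

import pymongo

MIN_VERSION = (3, 2)  # illustrative minimum

client = pymongo.MongoClient()
version = tuple(int(x) for x in client.server_info()["version"].split(".")[:2])
if version < MIN_VERSION:
    raise RuntimeError("MongoDB {} is older than the required {}".format(version, MIN_VERSION))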

Services submission-based observer

Services track submitted jobs via their unique keys. As it is difficult for procedures to have unique keys, submitted jobs should include hooks that report execution and ID back to the service.

Example:

Service:tmp_ID {
    "completed_jobs": []
}

task = {
    "procedure",
    "hooks": [("service", "tmp_ID", "completed_jobs")],
}

When the task is complete, the hook will be executed as the completed procedure is inserted into the procedures table.

Service:tmp_ID {
    "completed_jobs": [("procedure", "ID"), ]
}

TODO:

  • As jobs are added to services, how do we tell that a service iteration is complete?
  • Who triggers a service update?

Local Server Processes

Currently we try to support all distributed queue managers locally from the CLI. The complexity of this keeps increasing but has very little benefit; for this kind of operation we only need a single manager type that supports laptop-scale compute.

For this I would propose the Python standard library's ProcessPoolExecutor. Interfacing with it is simple, dependency-free, and well supported. Unlike something like Dask, it does not spin up a pile of threads we have little control over, and it avoids the overhead of Parsl.

This would also make operations like:

server = FractalServer(local_compute="local") # Likely need a better kwarg
client = FractalClient(server)
# Do stuff, test, whatever

really quite straightforward and much more lightweight than currently supported.

I think this will also clean up testing quite a bit: we run everything possible in local mode and only test additional managers where there is a possibility of divergence (queue tags, error handling, shutdowns, etc.).
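
A rough sketch of what the local pool could look like under the hood; run_task stands in for the real qcengine dispatch and the task specs are placeholders:

from concurrent.futures import ProcessPoolExecutor

def run_task(task_spec):
    # Stand-in for the qcengine.compute call a worker process would perform.
    return {"task": task_spec, "status": "COMPLETE"}

if __name__ == "__main__":
    pending_tasks = ["task-1", "task-2"]  # placeholder task specifications
    with ProcessPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_task, t) for t in pending_tasks]
        print([f.result() for f in futures])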

Queue database update lock

Currently, the next n computations in the queue are obtained via:

found = queue.find({"status": "WAITING"}).sort(sort_order).limit(n)  # sort_order: whatever priority ordering is in use

query = {"_id": {"$in": [x["_id"] for x in found]}}
queue.update_many(query, {"$set": {"status": "RUNNING"}})

The reason for this format is that find_one_and_update can only update a single document at a time, and a loop of such calls is ~50x slower than the above. The downside is that if multiple queue handlers query for the next n computations at the same time, the computations could be duplicated across handlers, since there are no locks.

The lock issue should be fixed so that there is a lock between the find and update steps. Perhaps it is possible to do this in a single bulk_write or aggregate call.
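
For reference, the lock-safe (but slower) per-document variant mentioned above looks roughly like this; whether a single bulk or aggregate form can match its safety is the open question:

def claim_tasks(queue, n, manager_id):
    # find_one_and_update is atomic per document, so two handlers can never
    # claim the same task, at the cost of one round trip per task.
    claimed = []
    for _ in range(n):
        doc = queue.find_one_and_update(
            {"status": "WAITING"},
            {"$set": {"status": "RUNNING", "manager": manager_id}},
        )
        if doc is None:
            break
        claimed.append(doc)
    return claimed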

Tests require active Mongod

The testing structure currently assumes there is an active mongod on the standard localhost:27017. This assumption is problematic: each Mongo access takes ~30 seconds to time out, which can turn into a ~30-minute test failure. Several options are available:

  • Create a new fixture that checks for a live mongo instance and wraps all mongo tests (see the sketch after this list).
  • Fail if no DB backend is detected.
  • Spawn a temporary Mongod instance if no live instance is identified.
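
A sketch of the fixture option, using a short server-selection timeout so a missing database fails fast instead of hanging; the fixture name is illustrative:

import pymongo
import pytest

@pytest.fixture(scope="session")
def mongod_client():
    client = pymongo.MongoClient("localhost", 27017, serverSelectionTimeoutMS=1000)
    try:
        client.admin.command("ping")
    except pymongo.errors.ConnectionFailure:
        pytest.skip("No running mongod found on localhost:27017")
    return client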

Procedure documentation

Several points to make:

  • Fall into categories such as optimization, CBS, etc.
  • Each category has a unique and well-defined schema.
  • Default execution engine is qcengine.compute_procedure.
  • Run completely on-node.
  • Must contain at least two single computations.
  • Unified query interface (category, program, options).

Systematic logging overhaul

Logging is currently messy, with competing formats across modules leading to non-uniform output logs.

Proposed standard:

  • date - module - operation - extra info

A survey of good logging practices should be undertaken to decide the future direction.
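
A sketch of the proposed format with the standard library logger; the operation/extra-info split is left to the call sites:

import logging

logging.basicConfig(
    format="%(asctime)s - %(name)s - %(message)s",  # date - module - operation/extra info
    level=logging.INFO,
)
logger = logging.getLogger("qcfractal.queue")
logger.info("UPDATE - marked 25 tasks as RUNNING")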

More flexible way to specify DB connection settings

In the CLI, I suggest allowing the host, port, and DB name to be given as separate inputs to the QCFractal server; they are very convenient and easier to format than squashing user/pass/host/port/DB_name into a URI. We can still keep the URI option too.
The common user would only want to choose a DB name.

Fireworks Cleanup

The queue currently leaves all completed Fireworks jobs in the Fireworks queue. Since the QCFractal database queue should be the central pivot for queue technology and Fireworks should only keep temporary data, the workflows associated with runs should be removed when complete.

The associated command should be LaunchPad.delete_wf(fw_id, delete_launch_dirs=True).

Task Queue Size Assumption in Documentation

It is assumed that the number of tasks in QCFractal's task queue will always far exceed the rate at which tasks can be completed. The task distribution design was built around this assumption, and the assumption must therefore be added to the documentation.

Crank Interface

Requirements gathering for interaction with a Crank result.

Search capabilities:

  • Search by id
  • Search by initial molecule (geometry information included) and SMILES.
  • Search by QM method, program, basis, and options

Required information:

  • Computational details (method, program(s), basis, molecule)
  • Lowest energy at each grid point
  • Lowest energy final trajectory at each grid point.
  • Trajectory "history" at each grid point.

@yudongqiu @ChayaSt can you fill in the details of your requirements here?

Hash Index Usage in Services

Currently the hash_index is used by services such as TorsionDrive to track submitted jobs and to query whether the results are complete. As the hash_index is not guaranteed to be unique, these services should instead use the queue_id returned after submission and find the result via the result_location of the queue document.

Add QCElemental

QCElemental is now on conda-forge. First pass integration includes:

  • Molecule parsing
  • Physical constants

Future passes will include:

  • Molecule/Results etc object base models

Service Document Enhancements

Services could use a few additional fields to keep some consistency:

  • created/modified_on so we can look for stalled jobs.
  • A "status_message" or similar so that we can provide data upon lookup (what torsiondrive generation, how many optimizations run, when the latest drive completed, any possible errors).

@ChayaSt Any other thoughts here?

Safe `kill` when run as background thread

Currently, a `kill <pid>` when the server runs as a background thread causes a hard stop, since KeyboardInterrupt only catches SIGINT while `kill` sends SIGTERM. Between the Python signal module and the signal_callback function this can be handled correctly.
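
A minimal sketch of the handler wiring (registration must happen in the main thread; the cleanup body is a placeholder):

import signal

def _handle_shutdown(signum, frame):
    # Placeholder for the real cleanup: stop queue managers, close sockets, etc.
    raise SystemExit(0)

for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, _handle_shutdown)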

Use Mongoengine as an ODM for MongoDB

We would like to have the option to use an ODM (Object-Document Mapper, analogous to an ORM) when handling MongoDB. This will move some of the business logic and validation, currently implemented in the mongo_socket, into Mongoengine.

This addition will require some changes in the design of the indexes and collections relations in the current design to be able to handle it efficiently in the Mongoengine layer.
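
An illustrative Mongoengine model for the task queue, loosely following the fields sketched in the queue-table issue elsewhere in this tracker; field names are not final:

import datetime
import mongoengine as me

class TaskQueueDocument(me.Document):
    hash_index = me.StringField(required=True, unique=True)
    spec = me.DictField()
    status = me.StringField(default="WAITING", choices=["WAITING", "RUNNING", "COMPLETE", "ERROR"])
    tag = me.StringField()
    created_on = me.DateTimeField(default=datetime.datetime.utcnow)
    modified_on = me.DateTimeField(default=datetime.datetime.utcnow)

me.connect("qcarchive")
waiting = TaskQueueDocument.objects(status="WAITING").limit(10)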

Optimize DB indexes

After the new DB design, create appropriate DB indexes to support and speed up current queries.

Note

Keep an index for the hash_index field in the Procedure table.

Separate Queue Processes

Currently a FractalServer instance must own a queue nanny in order to issue compute. It would be better if queue nannies could be standalone processes with communication via a standard REST API so that multiple compute queue types can be integrated into a single fractal server and added/removed on the fly.

The REST API would need to handle the following tasks:

  • Nanny would query for new tasks with size and label arguments.
    • FractalServer would add a lock to the queried tasks with a Nanny ID.
    • Nanny would push jobs into its respective distributed workflow client (Fireworks/Dask/etc).
  • Nanny would push completed tasks back into FractalServer.
    • Optionally, tasks can be pushed back as not-completed to remove the lock in case the Nanny shuts down.
  • Nanny would ping FractalServer to update status periodically. If FractalServer has not heard from Nanny after a certain time, FractalServer should recycle the tasks back into the general task queue.
    • It would be good to also report the current compute load, the amount of connected compute, etc. A sketch of the nanny side of this cycle follows below.
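
A rough sketch of the nanny's pull/push/heartbeat loop; the endpoints and payloads below are hypothetical, not an existing API:

import time
import requests

SERVER = "https://localhost:7777"  # illustrative address

def nanny_loop(nanny_id, adapter):
    while True:
        # Claim new work; the server locks the returned tasks with our ID.
        tasks = requests.get(SERVER + "/queue/claim", params={"nanny": nanny_id, "limit": 50}).json()
        adapter.submit_tasks(tasks)
        # Push back anything that finished since the last pass.
        done = adapter.acquire_complete()
        if done:
            requests.post(SERVER + "/queue/complete", json={"nanny": nanny_id, "results": done})
        # Heartbeat so the server does not recycle our locked tasks.
        requests.post(SERVER + "/queue/heartbeat", json={"nanny": nanny_id})
        time.sleep(30)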

Database based queue implementation

Currently, additions to the queue are held in the respective queue adapters. This has several significant drawbacks:

  • If the server shuts down, everything in the queue is lost.
  • It is difficult to query errors later in the process.
  • Services are not able to fully track computations through the queue until they are inserted into the procedures/results tables.

A new table with the following structure will solve these issues:

task = {
    "hash_index": idx,
    "spec": {
        "function": "qcengine.compute_procedure",
        "args": [{
            "json_blob": "data"
        }],
        "kwargs": {},
    },
    "hooks": [("service", "service_id", "field")],
    "tag": None,
    "created_on": datetime(),
    "modified_on": datetime()
}

The required fields are:

  • hash_index - a full hash of the incoming data to avoid duplicates.
  • spec - the standard distributed Python call signature as seen in Fireworks/Dask.
  • hooks - the new observer pattern for services, see #....
  • tag - a tag on the queue, such as openff or refdb, to help select new tasks out of the queue.
  • created_on/modified_on - DateTime entries for creation/modification.

Client add_compute should throw if molecule not found

Describe the bug
The current add_compute as part of the client works if you give it a molecule ID that is already in the server, or a fully defined molecule, which will then be inserted into the server.

The corner case is when you give it a molecule ID that is NOT already in the server: the current function does not error and simply returns an empty return object. We should change this to throw an error indicating which IDs were missing.

Add GenericCollection

Add a GenericCollection object which contains little to no structure allowing the user to fully define the underlying data. This would be used by other programs interacting with a FractalClient to control the computation and build internal structure appropriate for that program. An example here could be the SI2 DDD @loriab.

See here.

Conda env channel changes

Conda-forge is currently the primary channel used for all environments; this is no longer needed, as Anaconda has pulled the majority of the packages back upstream. Also, conda-forge NumPy and Psi4 have conflicts, as conda-forge uses OpenBLAS. This does not appear to be an issue, but there is no reason to have multiple BLAS libraries in possible conflict.

Containerization of distributed compute

Currently distributed compute through QCEngine can be obtained via conda installs. It would be good to support two methods of obtaining the distributed computing software:

  • Through Docker images. See MolSSI's current docker hub.
  • Through pinned conda environment.yamls. See here.

Containerization opens a general question of how we package different subunits. For example, we likely do not want to distribute MOPAC, Psi4, and NWChem in the same container due to size. Also, which "external" packages should we default to (PyOptKing, geomeTRIC, etc.)?

Task submission use cases

These are the possible use cases when submitting a task to the queue.

Case 1: Single Result

The user would like to find or compute a result identified by a specific:

  • 'program',
  • 'driver',
  • 'method',
  • 'basis',
  • 'molecule',
  • 'options'.

The user will create the molecule, store it in the database, and keep its ID.
For the options, they should find them in the options table?? (TODO)

Question: is the common case to have many such single results at the same time?

The following are the logical steps to be done (a condensed sketch follows after the completion steps below):

  • Search the results table to see if the result exists, using the data the user has (flagged with include_queued=True).
    1. If found and status=COMPLETE, then return the results data.
    2. If found but status=INCOMPLETE, then merge hooks and return the job ID and result ID, or a top-level procedure ID.
    3. If not found at all, then add a result or a top-level procedure, and submit a task into the queue.

This should be mostly done in the socket itself.

When a task is done:

  1. If it is a simple result, store the data in task.base_result, then call the hooks.
  2. If it is a procedure, "create" the children results, then update the procedure object (its ID is in task.base_result) with those IDs. Next, call the hooks.
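
A condensed sketch of the lookup-then-submit flow above; storage, queue, and their methods are hypothetical placeholders for the real socket calls:

def find_or_compute(storage, queue, spec, hooks=()):
    found = storage.query_result(spec, include_queued=True)
    if found and found["status"] == "COMPLETE":
        return found                          # case 1: just return the data
    if found:                                 # case 2: incomplete, merge hooks
        storage.merge_hooks(found["id"], hooks)
        return {"queued": found["id"]}
    # Case 3: not found, create the stub and submit a task.
    result_id = storage.add_result_stub(spec)
    queue.submit({"base_result": result_id, "spec": spec, "hooks": list(hooks)})
    return {"queued": result_id}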

QueueAdapter Base Class

A base class for the QueueAdapter tech should be created to lower the current duplication and ensure a number of functions are correctly implemented.
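
A sketch of what such a base class might enforce; the method names are illustrative, taken loosely from what the adapters already do:

from abc import ABC, abstractmethod

class BaseQueueAdapter(ABC):
    def __init__(self, client, logger=None):
        self.client = client  # the underlying Dask/Fireworks/Parsl client
        self.logger = logger

    @abstractmethod
    def submit_tasks(self, tasks):
        """Push a list of task specifications into the backend queue."""

    @abstractmethod
    def acquire_complete(self):
        """Return finished results and remove them from the backend."""

    @abstractmethod
    def await_results(self):
        """Block until all submitted tasks have completed (useful for tests)."""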

Database Update Option

Currently, once a database is added from the Client, there is no way to update the database. The Database.save member function does have a force option that does not currently work. force should be fixed, and a test added.

"Loose" molecule index

The current molecule hash is effectively an identity check over the following fields:

  • Exact matches on "symbols", "multiplicity", "real", "fragments", "fragment_charges", and "fragment_multiplicities".
  • 1.e-6 match on "mass"
  • 1.e-4 match "charge"
  • 1.e-8 match on "geometry"

This hash should be effectively a unique index allowing for a quick search of identical molecules. For more approximate searches that may return many molecules there have been several suggestions of new molecule hashes:

  • atom_count = "".join(sym + str(symbols.count(sym)) for sym in sorted(set(symbols))), e.g., C6H6 for benzene. Very simple and possibly sieves queries down dramatically.
  • Similar to the current molecule_hash with a canonical symbol order, orientation, center of mass, and loose geometry match (~1e-2). This hash would allow quick identification of similar molecules.
  • SMILES - how do we obtain a "canonical" ordering that is deterministic?

Note that a hash search of a molecule is O(log(N)) while a direct comparison is O(N). As the database project expects on the order of 1e8 to 1e9 molecules, this difference is decisive.

New indices should be discussed and added on an as-needed basis.
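
A quick sketch of the first (composition-string) option, with a deterministic ordering:

from collections import Counter

def atom_count_index(symbols):
    # e.g. ["C"] * 6 + ["H"] * 6  ->  "C6H6"; alphabetical order keeps it deterministic.
    counts = Counter(symbols)
    return "".join(sym + str(counts[sym]) for sym in sorted(counts))

print(atom_count_index(["C"] * 6 + ["H"] * 6))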

from_server does not find an existing OpenFFWorkflow

Describe the bug
When I try to retrieve an existing workflow by calling collections.OpenFFWorkflow.from_server(name='chemper2_rdkit', client=client) of a collection that I had already registered, I get the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-3-b4da8d7fca6e> in <module>
----> 1 wf_2 = portal.collections.OpenFFWorkflow.from_server(name='chemper2_rdkit', client=client)

~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in from_server(cls, client, name)
    121             raise KeyError("Warning! `{}: {}` not found.".format(class_name, name))
    122 
--> 123         return cls.from_json(tmp_data["data"][0], client=client)
    124 
    125     @classmethod

~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in from_json(cls, data, client)
    168         print('data: {}'.format(data))
    169         # Allow PyDantic to handle type validation
--> 170         return cls(name, client=client, **data)
    171 
    172     def to_json(self, filename=None):

~/src/molssi/QCFractal/qcfractal/interface/collections/openffworkflow.py in __init__(self, name, options, client, **kwargs)
     39         # Expand options
     40         if options is None:
---> 41             raise KeyError("No record of OpenFFWorkflow {} found and no initial options passed in.".format(name))
     42         super().__init__(name, client=client, **options, **kwargs)
     43 

KeyError: 'No record of OpenFFWorkflow chemper2_rdkit found and no initial options passed in.'

But when I look at client.get_collection("OpenFFWorkflow", 'chemper2_rdkit', full_return=True)
I see that the workflow does exist:

{'meta': {'errors': [],
  'n_found': 1,
  'success': True,
  'error_description': False,
  'missing': []},
 'data': [{'name': 'chemper2_rdkit',
   'collection': 'openffworkflow',
   ...

But since options is None here, it raises a KeyError.
To Reproduce

# set up a client
client = portal.FractalClient("https://localhost:7777/", verify=False)
# register workflow
with open('example_workflow.json') as f:
    wf_json = json.load(f)['example']['fragmenter']
wf = portal.collections.OpenFFWorkflow(name='example', client=client, options=wf_json)
# new workflow by getting an existing one from the server
wf_2 = portal.collections.OpenFFWorkflow.from_server(name='example', client=client)

The example workflow is here

@dgasmith

Locator Objects

Locator objects currently look like the following:

{
  table: ...,
  index: ...,
  data: ....
}

These specify either a single result or a collection of results, and the index can be either a true index or an id. This is being used to abstract individual queries away from the user.

  • Come up with a better name than locator objects.
  • Implement locator returns consistently throughout the code.
  • Implement a special locator object for the queue which can do the following:
    • Return the queue documents to indicate status, queue depth, error message, etc.
    • Return either a queue document or the actual result. This is useful when a user submits a piece of compute and would like the results of that compute based on queue id.

These were first implemented in #57.

Service documentation

Several topics of discussion:

  • Run on the server (or perhaps Lambda) and require stateless functions.
  • Generally iterative and "sleep" during an iteration's execution.
  • Custom IO on a per-service basis.
  • Each service is a unique class found in qcfractal/services/.

A service query looks like:

  • DB search for service.
  • Load JSON and create new instantiations from the JSON.
  • Run the custom object.query function.
  • Return result.

Better handle completed torsion drive task result checking

As described in #61, the check for previously completed torsion drive scans searches the procedure table, not the tasks table as the other jobs do. This will cause a problem, since the logic for iterate assumes the job_map IDs live in the task queue table. However, this has not come up in testing or in production yet.

Things needed:

  • Make a test to hit this code
  • Improve the logic to make sure all IDs in the job_map are either from the same table, or all use a locator object to fetch across the MongoDB

Unify Task vs Queue Scheduler language

Right now the Interface Client makes requests to the task_scheduler entry point, while internally on the server this is called queue_scheduler. We should settle on either task or queue, but not both, to help our documentation and others as they try to learn the package.

Super low priority, though; just a thought I had while looking through the code.

SQL Database Backend

The database backend is isolated through adapter classes which opens the possibility of additional backends as there are performance questions around MongoDB. To test the performance of various backends a SQL-based backend should be developed, tried, and compared against the Mongo version.

Test examples

The examples are currently untested and likely to break unless this is done. The following pieces likely need to exist:

  • A per-example bash script to be run, where the final execution includes a simple test within it.
  • A script which iterates over the examples and runs each test (a sketch follows after this list).
  • Addition of that script to the .travis.yml, which also includes codecov.
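
A sketch of the runner; the per-directory run_example.sh name is an assumption, not an existing convention:

import subprocess
from pathlib import Path

# Assumes every example directory ships a run_example.sh ending in a simple self-test.
for script in sorted(Path("examples").glob("*/run_example.sh")):
    print("Running {} ...".format(script))
    subprocess.run(["bash", script.name], check=True, cwd=script.parent)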

Database query logging

All queries to the database should be logged to obtain usage statistics and help us optimize for the most queried cases. Logging can be basic: (endpoint, JSON payload, number of results, size of return payload).

Revised value retrieval + stoich interaction

As we discussed last week, I will be implementing the ability to specify which field to pull with a get_data function call. We mentioned a flag, which I call do_stoich, to determine whether the aggregation of values should be multiplied by the stoichiometry weights and then summed. My question is: if do_stoich is False, what operation do I perform on the values I retrieve?

For reference, here is my get_data revised header:

# Returns a single data value from the database
#   field: the name of the field in Page to retrieve
#   db: the name of the DB to query
#   rxn: the name of the reaction to query
#   method: the method to use to query a Page
#   do_stoich: flag whether or not to sum based on stoich
#   debug_level: 0-2 verbosity
def get_value(self, field, db, rxn, stoich, method, do_stoich=True, debug_level=1):

For example, let's say I'm pulling success fields. My get_value function has three molecules under stoichiometry/cp, so I look up each with the method to get a page. The success fields of each are True, True, and False. I imagine do_stoich=False in this case as the values are not numerical. What then do I return from this function?

As a follow up, when the return field is return_value, we already decided to have a fallback procedure that does a lookup in reaction_results/cp. What should happen if the field is not return_value (say, success instead)? Do I just return None?

Thanks, if any of this is unclear let me know.

Return DB objects as pymongo directly.

In storage_socket, when appropriate, return query objects as_pymongo directly.
In this case, we will need to translate _id to id.
Note that there will be no dereferencing of reference fields.

Manager Heartbeats

When managers shut down, they tell their QCFractal instance to revert owned RUNNING jobs to WAITING so that they can be returned to the queue. However, managers that have a hard shutdown (SIGINT/SIGKILL) cannot call back to the server to free jobs. To prevent this, heartbeats should be implemented so that jobs related to a manager are freed if the manager is not heard from within a certain amount of time. Related to #90.

Raise error if workflow does not exist

Currently, client.get_collection(collection_name=workflow_id, collection_type='OpenFFWorkflow') silently returns None if the collection is not found.
It would be better if it raised an error.
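
Until that change lands, a client-side guard is a short sketch:

result = client.get_collection(collection_name=workflow_id, collection_type="OpenFFWorkflow")
if result is None:
    raise KeyError("OpenFFWorkflow '{}' was not found on the server.".format(workflow_id))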

GridOptimization

We currently have the TorsionDrive service, which drives geometry optimizations in a wavefront propagation. As this can spawn multiple geometry optimizations without dependence between them, it makes for a good QCFractal service. Currently, a single optimization looks something like:

{
 'keywords':             # geometry optimizer keywords
 'input_specification':  # a QCSchema input specification minus a molecule
 'initial_molecule':     # the initial molecule

 # Above is input, below is the output
 'trajectory':           # an ordered list of single results
 'energies':             # an ordered list of energies
 'final_molecule':       # the final molecule specification (sometimes not in the trajectory itself)
}

If we add a GridOptimization, we will likely want to reuse the single-optimization framework. Would we want something similar, like:

{
 'keywords':             # geometry optimizer keywords
 'input_specification':  # a QCSchema input specification minus a molecule
 'initial_molecule':     # the initial molecule

 'optimizations':        # a list (?) of optimizations performed
}

What other top level fields do we want?

@leeping @yudongqiu
