chemellia / chemistryfeaturization.jl

Interface package for featurizing atomic structures

Home Page: https://chemistryfeaturization.chemellia.org/dev/

License: MIT License

Julia 91.41% Python 8.59%

machine-learning featurization chemistry

chemistryfeaturization.jl's Issues

custom broadcast for featurize

Ideally we could call featurize.([atoms1, atoms2, atoms3], fzn) without needing to wrap fzn in brackets, and there's definitely a way to do this...
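One possible route is Julia's `Base.broadcastable` hook, which lets a type declare that it should act as a scalar under broadcasting. The sketch below uses hypothetical stand-ins (`ToyFeaturization` and a toy `featurize` method are assumptions for illustration, not the package's actual definitions):

```julia
# Hypothetical stand-ins; the real package's types and featurize method differ.
abstract type AbstractFeaturization end

struct ToyFeaturization <: AbstractFeaturization
    scale::Float64
end

# Treat any featurization as a scalar in broadcast expressions.
Base.broadcastable(fzn::AbstractFeaturization) = Ref(fzn)

# Toy "featurize": scales each number in a structure.
featurize(atoms::Vector{Float64}, fzn::ToyFeaturization) = fzn.scale .* atoms

structures = [[1.0, 2.0], [3.0]]
encoded = featurize.(structures, ToyFeaturization(2.0))
```

Defining the method on the abstract type means every concrete featurization would pick it up without extra work.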

add subgraph/"supergraph" functions to AtomGraph

It would be really cool to be able to filter based on feature values. For example, if the contextual features idea from this issue gets implemented, we could take only the adsorbate atoms or only the surface atoms.

Would need to make sure to modify feature matrix and element list appropriately and also recompute laplacian, but this should be easy.

Similarly, would be cool to construct larger graphs (e.g. adsorbed species on a surface) by just inputting two smaller ones and specifying which edges need to be added (e.g. an edge from node X in graph 1 to node y in graph 2), though how to handle working out weights would need to be figured out...
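A minimal sketch of the subgraph half of this, assuming the graph is held as a plain weighted adjacency matrix with features stored one column per node. The `subgraph` function and its return fields are hypothetical, not the AtomGraph API:

```julia
using LinearAlgebra

# Hypothetical helper: keep only the nodes flagged in `keep`, slicing the
# weights, feature matrix, and element list consistently, then recompute
# the (unnormalized) graph Laplacian L = D - W.
function subgraph(weights::AbstractMatrix, features::AbstractMatrix,
                  elements::Vector{String}, keep::AbstractVector{Bool})
    idx = findall(keep)
    w = weights[idx, idx]
    degree = Diagonal(vec(sum(w, dims=2)))
    return (weights = w,
            features = features[:, idx],
            elements = elements[idx],
            laplacian = degree - w)
end

weights  = [0.0 1.0 0.0; 1.0 0.0 2.0; 0.0 2.0 0.0]
features = [1.0 2.0 3.0; 4.0 5.0 6.0]
elements = ["Pt", "C", "O"]

# e.g. drop the surface atom and keep the adsorbate
adsorbate = subgraph(weights, features, elements, [false, true, true])
```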

rdkit deps via RDKitMinimalLib

see https://github.com/eloyfelix/RDKitMinimalLib.jl

This is probably a very reasonable medium-term (and maybe even long-term) solution for the rdkit CI headaches...

serialization

Need to have a way to save out AtomGraph and AtomFeat types to files...
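One low-effort option is the `Serialization` standard library. The sketch below round-trips a hypothetical `ToyGraph` struct (an assumption standing in for AtomGraph). Note that, as far as I know, this format is not guaranteed to be stable across Julia versions, so it suits caching better than long-term archiving:

```julia
using Serialization

# Hypothetical stand-in for an AtomGraph-like object.
struct ToyGraph
    elements::Vector{String}
    weights::Matrix{Float64}
end

g = ToyGraph(["H", "H"], [0.0 1.0; 1.0 0.0])

# Write to a temporary file and read it back.
path = joinpath(mktempdir(), "graph.jls")
serialize(path, g)
g2 = deserialize(path)
```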

make `DummyFeaturization`

We want to enforce that Atoms objects with encoded features populated always carry a featurization. However, for certain model layers (e.g. graph convolution), the output is another such Atoms object, but the original featurization doesn't actually make sense anymore (and in fact may not even be strictly valid if the size of the feature matrix changed through the layer). Hence, we should probably make a DummyFeaturization to basically be a placeholder and if you try to actually do a decode with it, it gives some helpful message explaining why it can't decode.

On a related note, I had to comment out some checks in the main AtomGraph constructor that were validating that the encoded features actually matched the featurization (because it now dispatches on AbstractFeaturization rather than GraphNodeFeaturization specifically), so we could just make a new function like validate_features(encoded, fzn) that would dispatch across different featurization types, and if you fed in a DummyFeaturization it would always say it was okay.
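A rough sketch of how the placeholder and the dispatching check could fit together. `ToyNodeFeaturization`, its `feature_length` field, and the error text are all hypothetical, and the `AbstractFeaturization` type is redefined locally so the snippet runs on its own:

```julia
abstract type AbstractFeaturization end

# Placeholder for outputs of layers whose original featurization no longer applies.
struct DummyFeaturization <: AbstractFeaturization end

# Hypothetical concrete featurization that knows its encoded feature length.
struct ToyNodeFeaturization <: AbstractFeaturization
    feature_length::Int
end

# A dummy featurization accepts anything.
validate_features(encoded, ::DummyFeaturization) = true

# A real featurization checks that the encoded matrix has the expected row count.
validate_features(encoded::AbstractMatrix, fzn::ToyNodeFeaturization) =
    size(encoded, 1) == fzn.feature_length

# Decoding with the placeholder fails with an explanatory message.
decode(encoded, ::DummyFeaturization) = throw(ArgumentError(
    "these features came out of a model layer and no longer correspond to a featurization, so they cannot be decoded"))
```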

Labeling priority because some solution to this is needed to make AtomicGraphNets compatible with the restructured package.

Use Intervals.jl in onehot-decode

More a slick thing than a critical functionality, but dropping this here anyway: https://invenia.github.io/Intervals.jl/latest/

More options in graph-building

Currently we have the cgcnn.py option plus the Voronoi option. It would make sense to add an option to cut off by distance (e.g. nearest, second-nearest, etc. neighbors). Things to figure out here:

what's a reasonable "default" tolerance for counting neighbors as being in a set (e.g. all being nearest or second-nearest)? Once we set this we can ensure all their edge weights are the same which will be useful e.g. in DEQ stuff
would it make sense to implement a cutoff about distances to an atom's periodic images of itself? e.g. a "stop-adding-edges" condition once an atom has hit its own image? In a lot of cases (i.e. really large unit cells) you probably wouldn't actually want to go anywhere near this far, but for smaller ones I doubt you'd often want to go much past this...
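To make the tolerance question concrete, here is a sketch of grouping neighbor distances into shells (nearest, second-nearest, ...) with a relative tolerance. The function name and the 5% default are assumptions for illustration, not a recommendation for what the default should be:

```julia
# Hypothetical helper: assign each distance a shell index, where a new shell
# starts once a distance exceeds the current shell's first distance by more
# than the relative tolerance `rtol`.
function neighbor_shells(distances::Vector{Float64}; rtol::Float64 = 0.05)
    shells = zeros(Int, length(distances))
    shell = 0
    ref = -Inf
    for i in sortperm(distances)
        d = distances[i]
        if shell == 0 || d > ref * (1 + rtol)
            shell += 1
            ref = d
        end
        shells[i] = shell
    end
    return shells
end

shells = neighbor_shells([2.0, 2.01, 2.9, 2.0, 4.1])
```

Every neighbor in the same shell could then be given the same edge weight, and a distance cutoff becomes "keep shells 1 through k".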

bulk version of add_features! function

Would be good to have a version that can take in an array of AtomGraph objects and add features. Should take a few minutes to add and make tests for but I'm in the middle of something else so I'm making this issue to remind myself about it...

other neighbor list options

Not high-priority, but I am curious how sensitive accuracy is to these, since I did find that there was a noticeable difference between the Ulissi group's Voronoi approach vs. the more "naive" cutoff distance (surprisingly, the latter seemed to work better, at least in the cases I tested).

see ideas here: https://pymatgen.org/pymatgen.analysis.local_env.html (source here)

finish updating WeaveModel-related stuff for restructure

must-haves

get to the point of being able to build/featurize the molecule and convert it to the right format to be fed into WeaveModel.jl
figure out any other naming things, e.g. for BondFeatureDescriptors

nice-to-haves

some kind of visualization, maybe via OpenSMILES or MolecularGraph

Feel free to make a more "granular" version of this checklist, haha

Change from DiffEqBase to SciMLBase

DiffEqBase is the lowest common denominator for the DiffEq packages, not necessarily the whole SciML ecosystem, and so it has a lot of DiffEq dependencies. These are generally not required by downstream packages. If what you're looking for is a way to define problems without having most dependencies, we recommend you use SciMLBase as the dependency since everything like ODEProblem, SteadyStateProblem, etc. is defined there. We basically recommend depending on SciMLBase for problem definitions, and solver packages for specific solvers, but generally most non-SciML packages should not be depending on DiffEqBase directly (given the split of SciMLBase in 2021).

For more details see: https://diffeq.sciml.ai/stable/features/low_dep/ and https://discourse.julialang.org/t/psa-the-right-dependency-to-reduce-from-differentialequations-jl/72757

Let me know if you need any help updating this, though for almost all dependents here it should be a trivial name change as you're actually using pieces from SciMLBase.

make `SpeciesFeature` implementations, tests

This may end up going along with Weave stuff since most of the actual examples of SpeciesFeature that we have atm are rdkit things, but filing it as a separate issue for now.

Register the package!

Don't want to do this until I have some more thorough docs/examples.

Resources:

https://github.com/JuliaRegistries/Registrator.jl
for getting Python/Conda deps working: https://github.com/JuliaPy/PyPlot.jl/blob/e547385265314eb57065309644cf6d8c8a068fa0/src/init.jl#L77

custom contextual features

Possibility to add custom features not from a lookup table but from user-provided labels. See this issue for more details.

make `featurization` in `FeaturizedAtoms` a `const` once Julia 1.8 is released

With Julia 1.8, we'll be able to declare const fields in mutable structs. We should use this to make featurization in FeaturizedAtoms a const, since the featurization assigned to a FeaturizedAtoms object shouldn't change.

diff --git a/src/featurizedatoms.jl b/src/featurizedatoms.jl
index 983cde3..b36050a 100644
--- a/src/featurizedatoms.jl
+++ b/src/featurizedatoms.jl
@@ -14,7 +14,7 @@ Container object for an atomic structure object, a featurization, and the result
 """
 struct FeaturizedAtoms{A,F<:AbstractFeaturization}
     atoms::A
-    featurization::F
+    const featurization::F
     encoded_features::Any
     FeaturizedAtoms{A,F}(atoms, featurization) where {A,F<:AbstractFeaturization} =
         new(atoms, featurization, encode(atoms, featurization))
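For reference, a small self-contained sketch of what a const field in a mutable struct looks like (requires Julia 1.8 or later). `ToyFeaturizedAtoms` is a hypothetical stand-in with the same field names; it is declared `mutable struct`, which is where const fields apply:

```julia
# Hypothetical stand-in; requires Julia >= 1.8.
mutable struct ToyFeaturizedAtoms{A,F}
    atoms::A
    const featurization::F
    encoded_features::Any
end

fa = ToyFeaturizedAtoms("H2O", :toy_featurization, [1, 2, 3])

# Non-const fields can still be reassigned...
fa.encoded_features = [4, 5, 6]

# ...but reassigning the const field throws.
reassign_failed = try
    fa.featurization = :other
    false
catch
    true
end
```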

make features unitful?

Really just a "nice to have" thing and not crucial at all...leaving here as a note to self, to be eventually categorized properly once I decide on a list of issue tags.

fix macOS CI

Currently, the tests don't run on macOS CI, and the following error is raised -
Intel MKL FATAL ERROR: Cannot load libmkl_intel_thread.1.dylib.

parallelize graph-building

It's a one-time cost for a given dataset, but it's quite slow (~1k/hr) for larger sets, and should be easy to batch out in some reasonably automated way...
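A minimal sketch of one way to batch this out with threads, assuming each graph can be built independently from its input path. `build_all_graphs` and the `build` callback are hypothetical names; failures are recorded as `missing` so one bad file doesn't stop the run:

```julia
# Hypothetical helper: apply `build` to every path in parallel.
# Start Julia with multiple threads (e.g. `julia -t 4`) to see a speedup.
function build_all_graphs(build, paths::Vector{String})
    results = Vector{Any}(undef, length(paths))
    Threads.@threads for i in eachindex(paths)
        results[i] = try
            build(paths[i])
        catch
            missing  # record the failure and keep going
        end
    end
    return results
end

# Toy "builder" that fails on one input.
toy_build(path) = path == "bad.cif" ? error("cannot build") : length(path)
results = build_all_graphs(toy_build, ["a.cif", "bad.cif", "abc.cif"])
```

If the builder calls into Python via PyCall, threads may not be safe, and a process-based approach such as `Distributed.pmap` might be a better fit.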

AtomGraph type

Making a custom type for atomic graphs so we can jettison FeaturedGraph and my weird fork of GeometricFlux.

Currently, planning to store:

A SimpleWeightedGraph representing the graph itself
a list of element symbols corresponding to each node
the graph laplacian
optionally, a feature matrix, which, if present, must have second dimension equal to the number of nodes in the graph
(maybe) a ref to featurization metadata (feature names, lengths, binning types, etc.)

Functions to define on this object (need to look at LightGraphs API and see if there's a way to write it so they'll "just work"):

laplacian, normalized_laplacian
adjacency_matrix, degree_matrix, etc.
feature, weights, atom_ids
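A rough sketch of the container described above, using a dense weight matrix as a stand-in for the SimpleWeightedGraph so the snippet needs no extra packages. `ToyAtomGraph` and its field names are assumptions for illustration:

```julia
using LinearAlgebra

# Hypothetical sketch of the planned type.
struct ToyAtomGraph
    weights::Matrix{Float64}                  # stand-in for a SimpleWeightedGraph
    elements::Vector{String}                  # element symbol per node
    laplacian::Matrix{Float64}                # precomputed graph Laplacian
    features::Union{Nothing,Matrix{Float64}}  # optional, one column per node

    function ToyAtomGraph(weights::AbstractMatrix, elements::Vector{String},
                          features = nothing)
        n = size(weights, 1)
        size(weights, 2) == n ||
            throw(ArgumentError("weight matrix must be square"))
        length(elements) == n ||
            throw(ArgumentError("need one element symbol per node"))
        features === nothing || size(features, 2) == n ||
            throw(ArgumentError("feature matrix needs one column per node"))
        lap = Diagonal(vec(sum(weights, dims=2))) - weights
        return new(weights, elements, lap, features)
    end
end

g = ToyAtomGraph([0.0 1.0; 1.0 0.0], ["C", "O"])
```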

Future possibilities (either for this or perhaps subtypes of it):

storing oxidation states
storing bond/pair features so @SeanXiaoyuSun can use it for other DeepChem stuff

Using strings to create Featurizations via GraphNodeFeaturization doesn't work with things other than ElementFeatureDescriptors

MWE Below:

using ChemistryFeaturization

# these work
featurization = GraphNodeFeaturization([
    "Group",
    "Row",
    "Block",
    "Atomic mass",
    "Atomic radius",
    "X",
])

# but with "isaromatic", which is a SpeciesFeatureDescriptor, it will not. (Block thrown in just for fun)
featurization = GraphNodeFeaturization([
    "Block",
    "isaromatic"
])

The error is: ERROR: LoadError: AssertionError: Feature isaromatic isn't in the lookup table!, so do these terms just need to simply be added?

`visualize(::AtomGraph)` should write to file

This should be easy enough. I think @sakshi-satyanand is going to take a crack at this :)

Examples

At least one example (documented with comments and readme) of building features for each paradigm (currently, just two - graphs from CIF files and Weave model). These can likely be copy/pasted from the other repos...

`AtomGraph`s can't be built from files if a relative path is provided

Trying to build an AtomGraph from a file doesn't seem to work if the path provided as input_file_path is a relative path.

julia> AtomGraph("test/test_data/strucs/mp-195.cif")

┌ Warning: Unable to build graph for test/test_data/strucs/mp-195.cif
└ @ ChemistryFeaturization.Atoms ~/Chemellia/ChemistryFeaturization.jl/src/atoms/atomgraph.jl:107
missing

This seems to be because rc[:paths][:crystals] = @__DIR__, which Xtals.jl requires (see here), hasn't been defined, for reasons of relocatability with PackageCompiler (see discussion here).

Make neighbor list building for graphs consistent

Currently the pymatgen version borks if you pass it a nonperiodic structure (e.g. a molecule)...I think switching over to the ASE version may work better, then we can just feed that into the weight-calculating function. Probably the Voronoi option will just have to only work with pymatgen structures for a while...

allow more edge features

Currently we can effectively have only one edge feature, parameterized by the edge weights.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Coulomb Matrix featurization

This might be useful to have at some point. Could be intriguing to treat it as an adjacency matrix for a convolution, though probably the diagonal elements might have to be scaled down...
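For reference, the usual Coulomb matrix definition has diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. A minimal sketch, assuming positions are given as one column per atom (the function name and the layout are assumptions; units follow whatever the positions are in, conventionally atomic units):

```julia
using LinearAlgebra

# Sketch of the standard Coulomb matrix:
#   M[i, i] = 0.5 * Z[i]^2.4
#   M[i, j] = Z[i] * Z[j] / |R[:, i] - R[:, j]|   for i != j
function coulomb_matrix(Z::AbstractVector{<:Real}, R::AbstractMatrix{<:Real})
    n = length(Z)
    size(R, 2) == n || throw(ArgumentError("need one position column per atom"))
    M = zeros(Float64, n, n)
    for i in 1:n, j in 1:n
        M[i, j] = i == j ? 0.5 * Z[i]^2.4 :
                           Z[i] * Z[j] / norm(R[:, i] - R[:, j])
    end
    return M
end

# H2 with the two atoms one length unit apart
M = coulomb_matrix([1, 1], [0.0 1.0; 0.0 0.0; 0.0 0.0])
```

The diagonal grows much faster with Z than the off-diagonal terms do, which is likely why it would need rescaling before use as an adjacency matrix.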

pretty printing for `FeaturizedAtoms`

Fix Windows CI

Note to self to eventually go back and troubleshoot the windows-latest runner in the CI, which is currently removed entirely. This will likely be easier if/when we're able to excise some more Python dependencies, since the last time around the errors had to do with numpy C-extensions...

improve/expand tests for FeaturizedAtoms, encoding

They also probably need to be reorganized a bit in light of the addition of the FeaturizedAtoms type...

make a changelog

(and add it to the docs) before it's too painful to contemplate...

MolecularGraph

I came here from https://julialang.org/jsoc/gsoc/deepchem/ which looks really exciting.

Just to make sure efforts are as non-redundant as possible...I may have chatted with @rkurchin previously about this, but I can't find an issue. Since you mention SMILES and other such stuff, are you aware of:

https://github.com/mojaie/MolecularGraph.jl (this package is less discoverable than it deserves, I'm not sure why but see https://github.com/JuliaComputing/JuliaHub/issues/81)
https://github.com/JuliaHealth/PubChemCrawler.jl

add "overwrite" option in build_graphs_batch

Should default to true. If false, it checks whether the given path already exists in the output folder and skips it, which speeds things up and allows finishing an interrupted graph-building run.
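A sketch of the skip logic, with hypothetical names throughout (`build_batch`, the `build_and_save` callback, and the `.jls` output extension are assumptions, not the signature of build_graphs_batch):

```julia
# Hypothetical helper: build one output file per input, optionally skipping
# inputs whose output file already exists. Returns the paths actually built.
function build_batch(build_and_save, inputs::Vector{String}, output_dir::String;
                     overwrite::Bool = true)
    built = String[]
    for input in inputs
        name = splitext(basename(input))[1]
        out = joinpath(output_dir, name * ".jls")
        if !overwrite && isfile(out)
            continue  # already done in a previous (possibly interrupted) run
        end
        build_and_save(input, out)
        push!(built, out)
    end
    return built
end

# Toy run: pretend one graph was already built before an interruption.
outdir = mktempdir()
write(joinpath(outdir, "a.jls"), "old")
toy_build_and_save(input, out) = write(out, "graph for " * input)
built = build_batch(toy_build_and_save, ["a.cif", "b.cif"], outdir; overwrite = false)
```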

Option of numerical and not just one-hot features

I'd like to play with this. My suspicion is that if it works, it will take a lot more epochs to train to the same accuracy, but it would also allow us to get away with much smaller models. In addition, having input parameters that are actual values would allow cool things like very transparent sensitivity analyses via autodiff.

This would almost certainly require some normalization of the input features, so the AtomFeat objects would need to store the normalization in order to be able to invert the encoding properly. My inclination now is that this is best achieved via new types, e.g. split into OneHotAtomFeat and NumericalAtomFeat that both inherit from an abstract AtomFeat type. A lot of things could be fairly easily dispatched onto both.
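A minimal sketch of the numerical half of that split, assuming simple min-max normalization. The type layout, the field names, and the `encode_value`/`decode_value` functions are hypothetical:

```julia
abstract type AbstractAtomFeat end

# Hypothetical numerical feature that stores its own normalization bounds,
# so the encoding can be inverted later.
struct NumericalAtomFeat <: AbstractAtomFeat
    name::String
    lo::Float64
    hi::Float64
end

# Min-max scale a raw value into [0, 1] (for values within [lo, hi]).
encode_value(f::NumericalAtomFeat, x::Real) = (x - f.lo) / (f.hi - f.lo)

# Invert the scaling using the stored bounds.
decode_value(f::NumericalAtomFeat, y::Real) = f.lo + y * (f.hi - f.lo)

mass = NumericalAtomFeat("Atomic mass", 1.0, 201.0)
half = encode_value(mass, 101.0)
```

A OneHotAtomFeat subtype could define the same two functions with bin-based logic, so downstream code would only ever dispatch on the abstract type.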

Make docs!

Can use GitHub wiki to start with

make `validate_features` function

Copying over from #85 ...

I had to comment out some checks in the main AtomGraph constructor that were validating that the encoded features actually matched the featurization (because it now dispatches on AbstractFeaturization rather than GraphNodeFeaturization specifically), so we could just make a new function like validate_features(encoded, fzn) that would dispatch across different featurization types, and if you fed in a DummyFeaturization it would always say it was okay.

The abstract dispatch is actually already there in #87 so "all" that remains is actual implementations (envisionable, though certainly not entirely trivial).

move `output_file_path` to positional arguments in `AtomGraph` constructor from file

This should be really easy, I'm just busy with other stuff this second and don't want to forget.

Reason being, currently if we want to build and serialize a large number of graphs, they'd all have to be stored in memory and then serialized one by one, and you lose progress if something goes wrong (such as, for example, some graphs couldn't be built and then the script borks when it tries to serialize a missing object). If we make this change, then we can broadcast over output files and serialize each graph as it's created, and we won't run into the problem I'm describing.

Is this informed by very specific experience today? Maayyyybe...

Autodiff the graph building for "forces"

Came up in a group meeting discussion. Could be a neat idea to try to differentiate with respect to graph weights and see if you can get something like a force by propagating that through a pretrained model...

more features, sources, allow user to add features

would be good to showcase that it's easy to add other features to lookup table
also maybe add oxidation-state-dependent things

#92 might not be a sustainable fix

Running upgrade ChemistryFeaturization in a local environment for AtomicGraphNets caused an error -

ERROR: Error building `ChemistryFeaturization`: 
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed
PackagesNotFoundError: The following packages are missing from the target environment:
  - mkl
[ Info: Running `conda install -y -c conda-forge ase` in root environment
[ Info: Running `conda install -y -c conda-forge rdkit` in root environment
[ Info: Running `conda install -y -c conda-forge pymatgen` in root environment
[ Info: Running `conda remove -y mkl` in root environment
ERROR: LoadError: failed process: Process(setenv(`/Users/rkurchin/.julia/conda/3/bin/conda remove -y mkl`,["XPC_FLAGS=0x0", "PWD=/Users/rkurchin/git_repos/AtomicGraphNets.jl", "XPC_SERVICE_NAME=0", "TERM_PROGRAM=vscode", "VSCODE_GIT_ASKPASS_NODE=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Frameworks/Code Helper (Renderer).app/Contents/MacOS/Code Helper (Renderer)", "SHELL=/bin/zsh", "VSCODE_GIT_ASKPASS_MAIN=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/extensions/git/dist/askpass-main.js", "__CF_USER_TEXT_ENCODING=0x1F5:0x0:0x0", "GIT_ASKPASS=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/extensions/git/dist/askpass.sh", "VSCODE_GIT_IPC_HANDLE=/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/vscode-git-2ea4c5f5ce.sock"  …  "SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.DRpPU5mZJv/Listeners", "JULIA_LOAD_PATH=@:/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/jl_cHo7jp", "JULIA_EDITOR=\"/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/bin/code\"", "HOME=/Users/rkurchin", "TERM=xterm-256color", "TERM_PROGRAM_VERSION=1.47.3", "JULIA_NUM_THREADS=", "COLORTERM=truecolor", "OPENBLAS_MAIN_FREE=1", "PYTHONIOENCODING=UTF-8"]), ProcessExited(1)) [1]
Stacktrace:
 [1] pipeline_error
   @ ./process.jl:525 [inlined]
 [2] run(::Cmd; wait::Bool)
   @ Base ./process.jl:440
 [3] run
   @ ./process.jl:438 [inlined]
 [4] runconda(args::Cmd, env::String)
   @ Conda ~/.julia/packages/Conda/sNGum/src/Conda.jl:129
 [5] rm (repeats 2 times)
   @ ~/.julia/packages/Conda/sNGum/src/Conda.jl:226 [inlined]
 [6] top-level scope
   @ ~/.julia/packages/ChemistryFeaturization/O2LBl/deps/build.jl:6
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:444
 [8] top-level scope
   @ none:5
in expression starting at /Users/rkurchin/.julia/packages/ChemistryFeaturization/O2LBl/deps/build.jl:6

This seems to have been caused by #92. We probably need a more sustainable and less sketchy fix for that.

Note - The error occurred on a macOS-x64 system, and I wasn't able to personally reproduce this error on a Linux-x86_64 system.

some `ElementFeature` convenience functions

pull all possible values (categorical) or range (continuous) of encodable values
dispatch encoding to elemental symbols (could be nice for demos)

PyCall may be a CI hurdle

Currently it's only used for graph IO/building with ase and pymatgen. Would it be hard to have Julia versions of these? Are there artifacts that we can use instead? Maybe consider spinning the PyCall dep out into a separate package?