chemistryfeaturization.jl's People

Contributors

chrisrackauckas, dhairyalgandhi, github-actions[bot], rkurchin, seanxiaoyusun, simonschoelly, sinamostafanejad, thazhemadam, timholy, viralbshah

chemistryfeaturization.jl's Issues

custom broadcast for featurize

ideally we could do featurize.([atoms1, atoms2, atoms3], fzn) without needing to put the fzn in brackets, and there's definitely a way to do this...
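One possible route, sketched here under assumptions, is Julia's `Base.broadcastable` hook: returning `Ref(fzn)` makes broadcasting treat the featurization as a scalar. `ToyFeaturization` and this toy `featurize` are hypothetical stand-ins, not the package's actual types.

```julia
# Hypothetical stand-in for a featurization type.
struct ToyFeaturization
    scale::Float64
end

# Treat a featurization as a scalar under broadcasting, so `featurize.(xs, fzn)`
# reuses the same `fzn` for every element instead of trying to iterate over it.
Base.broadcastable(fzn::ToyFeaturization) = Ref(fzn)

# Toy featurize: scales a vector of numbers that stands in for a structure.
featurize(atoms::Vector{Float64}, fzn::ToyFeaturization) = fzn.scale .* atoms

fzn = ToyFeaturization(2.0)
structures = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
encoded = featurize.(structures, fzn)  # no Ref(...) or [fzn] wrapper needed
```

Defining the method on the abstract featurization type would presumably cover every concrete featurization at once.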

add subgraph/"supergraph" functions to AtomGraph

It would be really cool to be able to filter based on feature values (for example, if the contextual features idea from this issue gets implemented, then we could take only the adsorbate atoms or only the surface atoms).

Would need to make sure to modify feature matrix and element list appropriately and also recompute laplacian, but this should be easy.

Similarly, would be cool to construct larger graphs (e.g. adsorbed species on a surface) by just inputting two smaller ones and specifying which edges need to be added (e.g. an edge from node X in graph 1 to node y in graph 2), though how to handle working out weights would need to be figured out...
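A rough sketch of both operations, assuming a simplified graph record in which a dense weight matrix stands in for the real graph object. `ToyGraph`, `subgraph`, and `supergraph` are hypothetical names, and cross-edge weights are simply supplied by the caller because how to derive them is still open.

```julia
using LinearAlgebra

# Hypothetical minimal graph record: weights (N×N), element list (N), features (F×N).
struct ToyGraph
    weights::Matrix{Float64}
    elements::Vector{String}
    features::Matrix{Float64}
end

# Unnormalized graph Laplacian L = D - W, computed on demand so it never goes stale.
laplacian(g::ToyGraph) = Diagonal(vec(sum(g.weights, dims = 2))) - g.weights

# Keep only the nodes for which `keep(feature_column)` returns true.
function subgraph(g::ToyGraph, keep)
    idx = [i for i in 1:length(g.elements) if keep(g.features[:, i])]
    return ToyGraph(g.weights[idx, idx], g.elements[idx], g.features[:, idx])
end

# Join two graphs and add cross edges given as (node_in_g1, node_in_g2, weight).
function supergraph(g1::ToyGraph, g2::ToyGraph, new_edges)
    n1, n2 = length(g1.elements), length(g2.elements)
    w = zeros(n1 + n2, n1 + n2)
    w[1:n1, 1:n1] = g1.weights
    w[n1+1:end, n1+1:end] = g2.weights
    for (i, j, wt) in new_edges
        w[i, n1 + j] = w[n1 + j, i] = wt  # keep the weight matrix symmetric
    end
    return ToyGraph(w, vcat(g1.elements, g2.elements), hcat(g1.features, g2.features))
end

# One feature row flags adsorbate atoms (1.0) versus surface atoms (0.0).
surface = ToyGraph([0.0 1.0; 1.0 0.0], ["Pt", "Pt"], [0.0 0.0])
adsorbate = ToyGraph(zeros(1, 1), ["O"], ones(1, 1))
combined = supergraph(surface, adsorbate, [(1, 1, 0.5)])
only_adsorbate = subgraph(combined, f -> f[1] == 1.0)
```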

serialization

Need to have a way to save out AtomGraph and AtomFeat types to files...

make `DummyFeaturization`

We want to enforce that Atoms objects with encoded features populated always carry a featurization. However, for certain model layers (e.g. graph convolution), the output is another such Atoms object, but the original featurization doesn't actually make sense anymore (and in fact may not even be strictly valid if the size of the feature matrix changed through the layer). Hence, we should probably make a DummyFeaturization to basically be a placeholder and if you try to actually do a decode with it, it gives some helpful message explaining why it can't decode.

On a related note, I had to comment out some checks in the main AtomGraph constructor that were validating that the encoded features actually matched the featurization (because it now dispatches on AbstractFeaturization rather than GraphNodeFeaturization specifically), so we could just make a new function like validate_features(encoded, fzn) that would dispatch across different featurization types, and if you fed in a DummyFeaturization it would always say it was okay.

Labeling priority because some solution to this is needed to make AtomicGraphNets compatible with the restructured package.
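A minimal sketch of the placeholder idea, assuming a `decode(encoded, featurization)` call shape. `ToyAbstractFeaturization` is a hypothetical stand-in for the package's abstract type, and the message wording is only illustrative.

```julia
# Hypothetical stand-in for AbstractFeaturization.
abstract type ToyAbstractFeaturization end

# Placeholder carried by objects whose features no longer map back to a real featurization.
struct DummyFeaturization <: ToyAbstractFeaturization end

# Decoding is undefined for the placeholder, so fail with an explanatory message.
function decode(encoded, ::DummyFeaturization)
    throw(ArgumentError(
        "These features were produced by a model layer, not by a featurization, " *
        "so they cannot be decoded back to physical values.",
    ))
end
```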

More options in graph-building

Currently we have the cgcnn.py option plus the Voronoi option. It would make sense to add an option to cut off by distance (e.g. nearest, second-nearest, etc. neighbors). Things to figure out here:

  • what's a reasonable "default" tolerance for counting neighbors as being in a set (e.g. all being nearest or second-nearest)? Once we set this we can ensure all their edge weights are the same which will be useful e.g. in DEQ stuff
  • would it make sense to implement a cutoff about distances to an atom's periodic images of itself? e.g. a "stop-adding-edges" condition once an atom has hit its own image? In a lot of cases (i.e. really large unit cells) you probably wouldn't actually want to go anywhere near this far, but for smaller ones I doubt you'd often want to go much past this...
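To make the tolerance question concrete, here is one way the shell assignment could work; it is a sketch, not a proposal for the default. The function names, the `tol = 0.1` placeholder, and the `1/shell` weighting are all assumptions.

```julia
# Assign each distance to a neighbor shell (1 = nearest, 2 = second-nearest, ...).
# A new shell starts whenever a distance exceeds the current shell's smallest
# distance by more than `tol`. The default `tol` is an arbitrary placeholder.
function neighbor_shells(distances::Vector{Float64}; tol::Float64 = 0.1)
    shells = zeros(Int, length(distances))
    shell, shell_start = 0, -Inf
    for i in sortperm(distances)
        if distances[i] - shell_start > tol
            shell += 1
            shell_start = distances[i]
        end
        shells[i] = shell
    end
    return shells
end

# Give every neighbor in the same shell the same edge weight (here 1/shell)
# and drop neighbors beyond `max_shell`.
shell_weights(shells; max_shell = 2) = [s <= max_shell ? 1 / s : 0.0 for s in shells]

d = [2.81, 2.80, 3.97, 2.82, 3.98, 4.87]
shells = neighbor_shells(d)
weights = shell_weights(shells)
```

With this scheme, all neighbors in a shell share one weight by construction, which is the property wanted for the DEQ use case.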

bulk version of add_features! function

Would be good to have a version that can take in an array of AtomGraph objects and add features. Should take a few minutes to add and make tests for but I'm in the middle of something else so I'm making this issue to remind myself about it...

finish updating WeaveModel-related stuff for restructure

must-haves

  • get to the point of being able to build/featurize the molecule and convert it to the right format to be fed into WeaveModel.jl
  • figure out any other naming things, e.g. for BondFeatureDescriptors

nice-to-haves

  • some kind of visualization, maybe via OpenSMILES or MolecularGraph

Feel free to make a more "granular" version of this checklist, haha

Change from DiffEqBase to SciMLBase

DiffEqBase is the lowest common denominator for the DiffEq packages, not necessarily the whole SciML ecosystem, and so it has a lot of DiffEq dependencies. These are generally not required by downstream packages. If what you're looking for is a way to define problems without having most dependencies, we recommend you use SciMLBase as the dependency since everything like ODEProblem, SteadyStateProblem, etc. is defined there. We basically recommend depending on SciMLBase for problem definitions, and solver packages for specific solvers, but generally most non-SciML packages should not be depending on DiffEqBase directly (given the split of SciMLBase in 2021).

For more details see: https://diffeq.sciml.ai/stable/features/low_dep/ and https://discourse.julialang.org/t/psa-the-right-dependency-to-reduce-from-differentialequations-jl/72757

Let me know if you need any help updating this, though for almost all dependents here it should be a trivial name change as you're actually using pieces from SciMLBase.

make `SpeciesFeature` implementations, tests

This may end up going along with Weave stuff since most of the actual examples of SpeciesFeature that we have atm are rdkit things, but filing it as a separate issue for now.

make `featurization` in `FeaturizedAtoms` a `const` once Julia 1.8 is released

With Julia 1.8, we'll be able to declare const fields in mutable structs. We should use this to make featurization in FeaturizedAtoms a const, since the featurization assigned to a FeaturizedAtoms object shouldn't change.

diff --git a/src/featurizedatoms.jl b/src/featurizedatoms.jl
index 983cde3..b36050a 100644
--- a/src/featurizedatoms.jl
+++ b/src/featurizedatoms.jl
@@ -14,7 +14,7 @@ Container object for an atomic structure object, a featurization, and the result
 """
 struct FeaturizedAtoms{A,F<:AbstractFeaturization}
     atoms::A
-    featurization::F
+    const featurization::F
     encoded_features::Any
     FeaturizedAtoms{A,F}(atoms, featurization) where {A,F<:AbstractFeaturization} =
         new(atoms, featurization, encode(atoms, featurization))
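For reference, a small standalone illustration of the language feature on a toy type (Julia 1.8 or later). `ToyFeaturized` is hypothetical and not the package's definition. Note that `const` field declarations apply to a `mutable struct`; in a plain `struct` every field is already fixed after construction.

```julia
# Requires Julia 1.8 or later. Toy type, not the package's actual definition.
mutable struct ToyFeaturized
    atoms::String
    const featurization::Symbol   # cannot be reassigned after construction
    encoded_features::Any
end

fa = ToyFeaturized("H2O", :onehot, nothing)
fa.encoded_features = [1, 0, 0]   # fine: ordinary mutable field

# Reassigning the const field throws an error.
reassign_failed = try
    fa.featurization = :numerical
    false
catch
    true
end
```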

make features unitful?

Really just a "nice to have" thing and not crucial at all...leaving here as a note to self, to be eventually categorized properly once I decide on a list of issue tags.

fix macOS CI

Currently, the tests don't run on macOS CI, and the following error is raised -
Intel MKL FATAL ERROR: Cannot load libmkl_intel_thread.1.dylib.

parallelize graph-building

It's a one-time cost for a given dataset, but it's quite slow (~1k/hr) for larger sets, and should be easy to batch out in some reasonably automated way...
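One way this could be batched, sketched with Julia's built-in threading. `build_graph_stub` is a hypothetical placeholder for the real builder. If the real builder calls into Python, threads may not help, and process-based parallelism might be the better fit.

```julia
# Hypothetical stand-in for an expensive graph-building call; returns `missing` on failure.
function build_graph_stub(path::String)
    endswith(path, ".cif") || return missing
    return length(path)  # placeholder result
end

# Build all graphs using the available threads (start Julia with e.g. `julia -t 4`).
function build_all(paths::Vector{String})
    results = Vector{Any}(undef, length(paths))
    Threads.@threads for i in eachindex(paths)
        results[i] = build_graph_stub(paths[i])  # each iteration writes only its own slot
    end
    return results
end

results = build_all(["a.cif", "b.txt", "cc.cif"])
```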

AtomGraph type

Making a custom type for atomic graphs so we can jettison FeaturedGraph and my weird fork of GeometricFlux.

Currently, planning to store:

  • A SimpleWeightedGraph representing the graph itself
  • a list of element symbols corresponding to each node
  • the graph laplacian
  • optionally, a feature matrix, which, if present, must have second dimension equal to the number of nodes in the graph
  • (maybe) a ref to featurization metadata (feature names, lengths, binning types, etc.)

Functions to define on this object (need to look at LightGraphs API and see if there's a way to write it so they'll "just work"):

  • laplacian, normalized_laplacian
  • adjacency_matrix, degree_matrix, etc.
  • feature, weights, atom_ids

Future possibilities (either for this or perhaps subtypes of it):

  • storing oxidation states
  • storing bond/pair features so @SeanXiaoyuSun can use it for other DeepChem stuff
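As a rough illustration of the planned fields and the feature-matrix constraint, here is a sketch in which a dense weight matrix stands in for the SimpleWeightedGraph. `SketchAtomGraph` is a hypothetical name, and the featurization metadata is omitted.

```julia
using LinearAlgebra

# Sketch only: a plain weight matrix stands in for the SimpleWeightedGraph.
struct SketchAtomGraph
    weights::Matrix{Float64}
    elements::Vector{String}
    laplacian::Matrix{Float64}
    features::Union{Nothing,Matrix{Float64}}

    function SketchAtomGraph(weights, elements, features = nothing)
        n = length(elements)
        size(weights) == (n, n) || throw(DimensionMismatch("weights must be $(n)×$(n)"))
        if features !== nothing
            # The second dimension of the feature matrix must equal the number of nodes.
            size(features, 2) == n ||
                throw(DimensionMismatch("feature matrix needs $(n) columns, one per node"))
        end
        lap = Diagonal(vec(sum(weights, dims = 2))) - weights  # L = D - W
        return new(weights, elements, lap, features)
    end
end

g = SketchAtomGraph([0.0 1.0; 1.0 0.0], ["Na", "Cl"], rand(4, 2))
```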

Using strings to create Featurizations via GraphNodeFeaturization doesn't work with things other than ElementFeatureDescriptors

MWE Below:

using ChemistryFeaturization

# these work
featurization = GraphNodeFeaturization([
    "Group",
    "Row",
    "Block",
    "Atomic mass",
    "Atomic radius",
    "X",
])

# but with "isaromatic", which is a SpeciesFeatureDescriptor, it will not. (Block thrown in just for fun)
featurization = GraphNodeFeaturization([
    "Block",
    "isaromatic"
])

The error is: ERROR: LoadError: AssertionError: Feature isaromatic isn't in the lookup table!, so do these terms just need to simply be added?

Examples

At least one example (documented with comments and readme) of building features for each paradigm (currently, just two - graphs from CIF files and Weave model). These can likely be copy/pasted from the other repos...

`AtomGraph`s can't be built from files if a relative path is provided

Trying to build an AtomGraph from a file doesn't seem to work if the path provided as input_file_path is a relative path.

julia> AtomGraph("test/test_data/strucs/mp-195.cif")

┌ Warning: Unable to build graph for test/test_data/strucs/mp-195.cif
└ @ ChemistryFeaturization.Atoms ~/Chemellia/ChemistryFeaturization.jl/src/atoms/atomgraph.jl:107
missing

This seems to be because rc[:paths][:crystals] = @__DIR__, which Xtals.jl requires (see here), hasn't been defined, for reasons of relocatability with package compiler (see discussion here).

Make neighbor list building for graphs consistent

Currently the pymatgen version borks if you pass it a nonperiodic structure (e.g. molecule)...I think switching over to the ASE version may work better, then we can just feed that into the weight-calculating function. Probably the voronoi option will just have to only work with pymatgen structures for awhile...

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Coulomb Matrix featurization

This might be useful to have at some point. Could be intriguing to treat it as an adjacency matrix for a convolution, though probably the diagonal elements might have to be scaled down...
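For concreteness, the standard Coulomb matrix has diagonal entries 0.5·Z_i^2.4 and off-diagonal entries Z_i·Z_j/|R_i − R_j|. A small sketch follows; the function name and the 3×N position layout are assumptions, and no diagonal rescaling is applied.

```julia
using LinearAlgebra

# Coulomb matrix: M[i,i] = 0.5 * Z_i^2.4, M[i,j] = Z_i * Z_j / |R_i - R_j| for i ≠ j.
# `positions` is 3×N; units follow whatever convention the inputs use.
function coulomb_matrix(Z::Vector{<:Real}, positions::Matrix{Float64})
    n = length(Z)
    M = zeros(n, n)
    for i in 1:n, j in 1:n
        M[i, j] = i == j ? 0.5 * Z[i]^2.4 :
                  Z[i] * Z[j] / norm(positions[:, i] - positions[:, j])
    end
    return M
end

# Two hydrogen atoms one length unit apart.
M = coulomb_matrix([1, 1], [0.0 1.0; 0.0 0.0; 0.0 0.0])
```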

Fix Windows CI

Note to self to eventually go back and troubleshoot the windows-latest runner in the CI, which is currently removed entirely. This will likely be easier if/when we're able to excise some more Python dependencies, since the last time around the errors had to do with numpy C-extensions...

make a changelog

(and add it to the docs) before it's too painful to contemplate...

MolecularGraph

I came here from https://julialang.org/jsoc/gsoc/deepchem/ which looks really exciting.

Just to make sure efforts are as non-redundant as possible...I may have chatted with @rkurchin previously about this, but I can't find an issue. Since you mention SMILES and other such stuff, are you aware of:

Option of numerical and not just one-hot features

I'd like to play with this. My suspicion is that if it works, it will take a lot more epochs to train to the same accuracy, but it would also allow us to get away with much smaller models. In addition, having input parameters that are actual values would allow cool things like very transparent sensitivity analyses via autodiff.

This would almost certainly require some normalization of the input features, so the AtomFeat objects would need to store the normalization in order to be able to invert the encoding properly. My inclination now is that this is best achieved via a new type (e.g. split into OneHotAtomFeat and NumericalAtomFeat that both inherit from an abstract AtomFeat class). A lot of things could be fairly easily dispatched onto both.
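A sketch of what storing the normalization could look like, assuming simple mean/standard-deviation scaling. `NumericalFeatureSketch`, `fit_feature`, and the `encode`/`decode` signatures here are hypothetical.

```julia
using Statistics

# Hypothetical: stores the statistics needed to undo the normalization.
struct NumericalFeatureSketch
    name::String
    mu::Float64
    sigma::Float64
end

# Fit normalization parameters from raw values.
fit_feature(name, values) = NumericalFeatureSketch(name, mean(values), std(values))

# Encode to zero mean and unit variance; decode inverts it exactly.
encode(f::NumericalFeatureSketch, x) = (x .- f.mu) ./ f.sigma
decode(f::NumericalFeatureSketch, z) = z .* f.sigma .+ f.mu

masses = [1.008, 12.011, 15.999]
f = fit_feature("Atomic mass", masses)
roundtrip = decode(f, encode(f, masses))
```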

make `validate_features` function

Copying over from #85 ...

I had to comment out some checks in the main AtomGraph constructor that were validating that the encoded features actually matched the featurization (because it now dispatches on AbstractFeaturization rather than GraphNodeFeaturization specifically), so we could just make a new function like validate_features(encoded, fzn) that would dispatch across different featurization types, and if you fed in a DummyFeaturization it would always say it was okay.

The abstract dispatch is actually already there in #87 so "all" that remains is actual implementations (envisionable, though certainly not entirely trivial).
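A sketch of how the dispatch might be laid out. All type names are hypothetical stand-ins, and the row-count check is only one plausible validation rule.

```julia
# Hypothetical stand-ins for the package's featurization types.
abstract type SketchFeaturization end

struct SketchNodeFeaturization <: SketchFeaturization
    feature_lengths::Vector{Int}  # encoded length of each feature
end

struct SketchDummyFeaturization <: SketchFeaturization end

# Fallback: featurization types without their own method fail loudly.
validate_features(encoded, fzn::SketchFeaturization) =
    error("validate_features not implemented for $(typeof(fzn))")

# Node featurization: row count must equal the total encoded feature length.
validate_features(encoded::AbstractMatrix, fzn::SketchNodeFeaturization) =
    size(encoded, 1) == sum(fzn.feature_lengths)

# Placeholder featurization: always accepted.
validate_features(encoded, ::SketchDummyFeaturization) = true
```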

move `output_file_path` to positional arguments in `AtomGraph` constructor from file

This should be really easy, I'm just busy with other stuff this second and don't want to forget.

Reason being, currently if we want to build and serialize a large number of graphs, they'd all have to be stored in memory and then serialized one by one, and you lose progress if something goes wrong (such as, for example, some graphs couldn't be built and then the script borks when it tries to serialize a missing object). If we make this change, then we can broadcast over output files and serialize each graph as it's being created, and we won't run into the problem I'm describing.
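The pattern being described might look roughly like this, using the Serialization standard library. `build_and_save` and its stub builder are hypothetical, and the real constructor would take the place of the `Dict` placeholder.

```julia
using Serialization

# Hypothetical builder: returns `missing` when a graph cannot be built,
# otherwise writes the result to `output_path` immediately.
function build_and_save(input_path::String, output_path::String)
    graph = endswith(input_path, ".cif") ? Dict("source" => input_path) : missing
    ismissing(graph) && return missing   # nothing is written for failed builds
    serialize(output_path, graph)
    return graph
end

outdir = mktempdir()
inputs = ["a.cif", "broken.xyz", "b.cif"]
outputs = joinpath.(outdir, first.(splitext.(inputs)) .* ".jls")
graphs = build_and_save.(inputs, outputs)  # each graph is saved as soon as it is built
```

A failed build then costs only that one graph, since everything built before it is already on disk.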

Is this informed by very specific experience today? Maayyyybe...

Autodiff the graph building for "forces"

Came up in a group meeting discussion. Could be a neat idea to try to differentiate with respect to graph weights and see if you can get something like a force by propagating that through a pretrained model...

#92 might not be a sustainable fix

Running upgrade ChemistryFeaturization in a local environment for AtomicGraphNets caused an error -

ERROR: Error building `ChemistryFeaturization`: 
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed
PackagesNotFoundError: The following packages are missing from the target environment:
  - mkl
[ Info: Running `conda install -y -c conda-forge ase` in root environment
[ Info: Running `conda install -y -c conda-forge rdkit` in root environment
[ Info: Running `conda install -y -c conda-forge pymatgen` in root environment
[ Info: Running `conda remove -y mkl` in root environment
ERROR: LoadError: failed process: Process(setenv(`/Users/rkurchin/.julia/conda/3/bin/conda remove -y mkl`,["XPC_FLAGS=0x0", "PWD=/Users/rkurchin/git_repos/AtomicGraphNets.jl", "XPC_SERVICE_NAME=0", "TERM_PROGRAM=vscode", "VSCODE_GIT_ASKPASS_NODE=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Frameworks/Code Helper (Renderer).app/Contents/MacOS/Code Helper (Renderer)", "SHELL=/bin/zsh", "VSCODE_GIT_ASKPASS_MAIN=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/extensions/git/dist/askpass-main.js", "__CF_USER_TEXT_ENCODING=0x1F5:0x0:0x0", "GIT_ASKPASS=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/extensions/git/dist/askpass.sh", "VSCODE_GIT_IPC_HANDLE=/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/vscode-git-2ea4c5f5ce.sock"  …  "SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.DRpPU5mZJv/Listeners", "JULIA_LOAD_PATH=@:/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/jl_cHo7jp", "JULIA_EDITOR=\"/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/bin/code\"", "HOME=/Users/rkurchin", "TERM=xterm-256color", "TERM_PROGRAM_VERSION=1.47.3", "JULIA_NUM_THREADS=", "COLORTERM=truecolor", "OPENBLAS_MAIN_FREE=1", "PYTHONIOENCODING=UTF-8"]), ProcessExited(1)) [1]
Stacktrace:
 [1] pipeline_error
   @ ./process.jl:525 [inlined]
 [2] run(::Cmd; wait::Bool)
   @ Base ./process.jl:440
 [3] run
   @ ./process.jl:438 [inlined]
 [4] runconda(args::Cmd, env::String)
   @ Conda ~/.julia/packages/Conda/sNGum/src/Conda.jl:129
 [5] rm (repeats 2 times)
   @ ~/.julia/packages/Conda/sNGum/src/Conda.jl:226 [inlined]
 [6] top-level scope
   @ ~/.julia/packages/ChemistryFeaturization/O2LBl/deps/build.jl:6
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:444
 [8] top-level scope
   @ none:5
in expression starting at /Users/rkurchin/.julia/packages/ChemistryFeaturization/O2LBl/deps/build.jl:6

This seems to have been caused because of #92. We probably need a more sustainable, and less sketchy fix for that.

Note - The error occurred on a macOS-x64 system, and I wasn't able to personally reproduce this error on a Linux-x86_64 system.

PyCall may be a CI hurdle

Currently it's only used for graph IO/ building with ase and pymatgen. Would it be hard to have Julia versions of these? Are there artifacts that we can use instead? Maybe consider spinning the PyCall dep to a separate package?
