chemellia / chemistryfeaturization.jl
Interface package for featurizing atomic structures
Home Page: https://chemistryfeaturization.chemellia.org/dev/
License: MIT License
Ideally we could do featurize.([atoms1, atoms2, atoms3], fzn) without needing to put the fzn in brackets, and there's definitely a way to do this...
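One possible way to get there, sketched under assumptions: Julia's `Base.broadcastable` hook lets a type declare that it should act as a scalar during broadcasting. The types and the toy `featurize` method below are hypothetical stand-ins, not the package's actual definitions.

```julia
# Hypothetical stand-in for the package's abstract featurization type.
abstract type AbstractFeaturization end

# Toy featurization used only for this illustration.
struct ScaleFeaturization <: AbstractFeaturization
    scale::Float64
end

# Treat every featurization as a scalar under broadcasting,
# so callers do not have to wrap it in brackets or Ref themselves.
Base.broadcastable(fzn::AbstractFeaturization) = Ref(fzn)

# Toy featurize method: scales a vector of numbers standing in for a structure.
featurize(atoms::Vector{Float64}, fzn::ScaleFeaturization) = fzn.scale .* atoms

fzn = ScaleFeaturization(2.0)
results = featurize.([[1.0, 2.0], [3.0]], fzn)
```

Defining the method once on the abstract type would cover every concrete featurization, which seems preferable to asking users to remember the wrapper.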
It would be really cool to be able to filter based on feature values (for example, if the contextual features idea from this issue gets implemented, we could take only the adsorbate atoms or only the surface atoms).
Would need to make sure to modify the feature matrix and element list appropriately and also recompute the Laplacian, but this should be easy.
Similarly, would be cool to construct larger graphs (e.g. adsorbed species on a surface) by just inputting two smaller ones and specifying which edges need to be added (e.g. an edge from node X in graph 1 to node y in graph 2), though how to handle working out weights would need to be figured out...
see https://github.com/eloyfelix/RDKitMinimalLib.jl
This is probably a very reasonable medium-term (and maybe even long-term) solution for the rdkit CI headaches...
Need to have a way to save out AtomGraph and AtomFeat types to files...
We want to enforce that Atoms objects with encoded features populated always carry a featurization. However, for certain model layers (e.g. graph convolution), the output is another such Atoms object, but the original featurization doesn't actually make sense anymore (and in fact may not even be strictly valid if the size of the feature matrix changed through the layer). Hence, we should probably make a DummyFeaturization to basically be a placeholder, and if you try to actually do a decode with it, it gives some helpful message explaining why it can't decode.
On a related note, I had to comment out some checks in the main AtomGraph constructor that were validating that the encoded features actually matched the featurization (because it now dispatches on AbstractFeaturization rather than GraphNodeFeaturization specifically), so we could just make a new function like validate_features(encoded, fzn) that would dispatch across different featurization types, and if you fed in a DummyFeaturization it would always say it was okay.
Labeling priority because some solution to this is needed to make AtomicGraphNets compatible with the restructured package.
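A minimal sketch of how the placeholder and the dispatching check could fit together. Everything here is an assumption for illustration: `SizedFeaturization` is a hypothetical toy type, and the real `validate_features` and `decode` signatures may end up different.

```julia
# Hypothetical stand-in for the package's abstract featurization type.
abstract type AbstractFeaturization end

# Placeholder for outputs of layers whose features can no longer be decoded.
struct DummyFeaturization <: AbstractFeaturization end

# Toy featurization that only records how many feature rows it produces.
struct SizedFeaturization <: AbstractFeaturization
    num_features::Int
end

# Each featurization type decides what "valid" means for encoded features.
validate_features(encoded::AbstractMatrix, fzn::SizedFeaturization) =
    size(encoded, 1) == fzn.num_features

# The placeholder accepts anything.
validate_features(encoded, ::DummyFeaturization) = true

# Decoding through the placeholder fails with an explanatory message.
decode(encoded, ::DummyFeaturization) = throw(ArgumentError(
    "these features carry only a placeholder featurization " *
    "(they are the output of a model layer), so they cannot be decoded"))
```

With this shape, a constructor could call `validate_features(encoded, fzn)` unconditionally and let dispatch pick the right check.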
More a slick thing than a critical functionality, but dropping this here anyway: https://invenia.github.io/Intervals.jl/latest/
Currently we have the cgcnn.py option plus the Voronoi option. It would make sense to add an option to cut off by distance (e.g. nearest, second-nearest, etc. neighbors). Things to figure out here:
Would be good to have a version that can take in an array of AtomGraph objects and add features. Should take a few minutes to add and make tests for but I'm in the middle of something else so I'm making this issue to remind myself about it...
Not high-priority, but I am curious how sensitive accuracy is to these, since I did find that there was a noticeable difference between the Ulissi group's Voronoi approach and the more "naive" cutoff distance (surprisingly, the latter seemed to work better, at least in the cases I tested).
see ideas here: https://pymatgen.org/pymatgen.analysis.local_env.html (source here)
must-haves
BondFeatureDescriptors
nice-to-haves
Feel free to make a more "granular" version of this checklist, haha
DiffEqBase is the lowest common denominator for the DiffEq packages, not necessarily the whole SciML ecosystem, and so it has a lot of DiffEq dependencies. These are generally not required by downstream packages. If what you're looking for is a way to define problems without having most dependencies, we recommend you use SciMLBase as the dependency, since everything like ODEProblem, SteadyStateProblem, etc. is defined there. We basically recommend depending on SciMLBase for problem definitions and on solver packages for specific solvers, but generally most non-SciML packages should not depend on DiffEqBase directly (given the split of SciMLBase in 2021).
For more details see: https://diffeq.sciml.ai/stable/features/low_dep/ and https://discourse.julialang.org/t/psa-the-right-dependency-to-reduce-from-differentialequations-jl/72757
Let me know if you need any help updating this, though for almost all dependents here it should be a trivial name change as you're actually using pieces from SciMLBase.
This may end up going along with Weave stuff since most of the actual examples of SpeciesFeature that we have atm are rdkit things, but filing it as a separate issue for now.
Don't want to do this until I have some more thorough docs/examples.
Resources:
Possibility to add custom features not from a lookup table but from user-provided labels. See this issue for more details.
With Julia 1.8, we'll be able to declare const fields in mutable structs. We should use this to make featurization in FeaturizedAtoms a const, since the featurization assigned to a FeaturizedAtoms object shouldn't change.
diff --git a/src/featurizedatoms.jl b/src/featurizedatoms.jl
index 983cde3..b36050a 100644
--- a/src/featurizedatoms.jl
+++ b/src/featurizedatoms.jl
@@ -14,7 +14,7 @@ Container object for an atomic structure object, a featurization, and the result
"""
 struct FeaturizedAtoms{A,F<:AbstractFeaturization}
     atoms::A
-    featurization::F
+    const featurization::F
     encoded_features::Any
     FeaturizedAtoms{A,F}(atoms, featurization) where {A,F<:AbstractFeaturization} =
         new(atoms, featurization, encode(atoms, featurization))
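For reference, a small standalone sketch of the language feature itself (requires Julia 1.8 or later). The type names are hypothetical toys and the struct is assumed to be declared `mutable`, which is where `const` fields apply.

```julia
# Hypothetical stand-ins, not the package's real types.
abstract type AbstractFeaturization end
struct EmptyFeaturization <: AbstractFeaturization end

mutable struct ToyFeaturizedAtoms{A,F<:AbstractFeaturization}
    atoms::A
    const featurization::F   # fixed at construction time
    encoded_features::Any    # still reassignable
end

fa = ToyFeaturizedAtoms("H2O", EmptyFeaturization(), [1, 2])

# Non-const fields of a mutable struct can be reassigned as usual.
fa.encoded_features = [3, 4]

# Reassigning the const field throws an error:
# fa.featurization = EmptyFeaturization()
```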
Really just a "nice to have" thing and not crucial at all...leaving here as a note to self, to be eventually categorized properly once I decide on a list of issue tags.
Currently, the tests don't run on macOS CI, and the following error is raised: Intel MKL FATAL ERROR: Cannot load libmkl_intel_thread.1.dylib.
It's a one-time cost for a given dataset, but it's quite slow (~1k/hr) for larger sets, and should be easy to batch out in some reasonably automated way...
Making a custom type for atomic graphs so we can jettison FeaturedGraph and my weird fork of GeometricFlux.
Currently, planning to store:
Functions to define on this object (need to look at LightGraphs API and see if there's a way to write it so they'll "just work"):
laplacian, normalized_laplacian
adjacency_matrix, degree_matrix, etc.
feature, weights, atom_ids
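As a rough sketch of what these functions could look like on a custom type, assuming the graph is stored as a symmetric weighted adjacency matrix. `ToyAtomGraph` and its fields are hypothetical, and nothing here hooks into the LightGraphs API.

```julia
using LinearAlgebra

# Hypothetical minimal container: weighted adjacency matrix plus atom labels.
struct ToyAtomGraph
    weights::Matrix{Float64}
    atom_ids::Vector{String}
end

adjacency_matrix(g::ToyAtomGraph) = g.weights

# Degree of each node is the sum of its edge weights.
degree_matrix(g::ToyAtomGraph) = Diagonal(vec(sum(g.weights, dims = 2)))

# Combinatorial Laplacian L = D - A.
laplacian(g::ToyAtomGraph) = degree_matrix(g) - adjacency_matrix(g)

# Symmetric normalized Laplacian I - D^(-1/2) A D^(-1/2).
# Nodes with zero degree get a zero scaling factor to avoid dividing by zero.
function normalized_laplacian(g::ToyAtomGraph)
    d = diag(degree_matrix(g))
    inv_sqrt = Diagonal([x > 0 ? 1 / sqrt(x) : 0.0 for x in d])
    return I - inv_sqrt * adjacency_matrix(g) * inv_sqrt
end

g = ToyAtomGraph([0 1; 1 0], ["H", "H"])
L = laplacian(g)
```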
Future possibilities (either for this or perhaps subtypes of it):
MWE Below:
using ChemistryFeaturization
# these work
featurization = GraphNodeFeaturization([
"Group",
"Row",
"Block",
"Atomic mass",
"Atomic radius",
"X",
])
# but with "isaromatic", which is a SpeciesFeatureDescriptor, it will not. (Block thrown in just for fun)
featurization = GraphNodeFeaturization([
"Block",
"isaromatic"
])
The error is: ERROR: LoadError: AssertionError: Feature isaromatic isn't in the lookup table!, so do these terms just need to simply be added?
This should be easy enough. I think @sakshi-satyanand is going to take a crack at this :)
At least one example (documented with comments and readme) of building features for each paradigm (currently, just two - graphs from CIF files and Weave model). These can likely be copy/pasted from the other repos...
Trying to build an AtomGraph from a file doesn't seem to work if the path provided as input_file_path is a relative path.
julia> AtomGraph("test/test_data/strucs/mp-195.cif")
┌ Warning: Unable to build graph for test/test_data/strucs/mp-195.cif
└ @ ChemistryFeaturization.Atoms ~/Chemellia/ChemistryFeaturization.jl/src/atoms/atomgraph.jl:107
missing
This seems to be because rc[:paths][:crystals] = @__DIR__, which Xtals.jl requires (see here), hasn't been defined, for reasons of relocatability with package compiler (see discussion here).
Currently the pymatgen version borks if you pass it a nonperiodic structure (e.g. molecule)... I think switching over to the ASE version may work better, then we can just feed that into the weight-calculating function. Probably the Voronoi option will just have to only work with pymatgen structures for a while...
Currently can have effectively one parameterized by the edge weights.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
This might be useful to have at some point. Could be intriguing to treat it as an adjacency matrix for a convolution, though probably the diagonal elements might have to be scaled down...
Note to self to eventually go back and troubleshoot the windows-latest runner in the CI, which is currently removed entirely. This will likely be easier if/when we're able to excise some more Python dependencies, since the last time around the errors had to do with numpy C-extensions...
They also probably need to be reorganized a bit in light of the addition of the FeaturizedAtoms type...
(and add it to the docs) before it's too painful to contemplate...
I came here from https://julialang.org/jsoc/gsoc/deepchem/ which looks really exciting.
Just to make sure efforts are as non-redundant as possible...I may have chatted with @rkurchin previously about this, but I can't find an issue. Since you mention SMILES and other such stuff, are you aware of:
Should default to true, but if false, checks whether the given path already exists in the output folder and skips it to speed things up, which allows finishing an interrupted graph-building run.
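Something along these lines, as a sketch only. The keyword name `overwrite`, the function `build_graphs`, and the stand-in `build_graph` are all hypothetical; the real graph-building call would go where the stand-in is.

```julia
using Serialization

# Hypothetical stand-in for the real (slow) graph construction.
build_graph(input_path::AbstractString) = (source = input_path, nodes = 3)

# Build and save one graph per input file. With overwrite = false,
# inputs whose output file already exists are skipped.
function build_graphs(input_paths, output_dir; overwrite::Bool = true)
    mkpath(output_dir)
    built = String[]
    for input_path in input_paths
        name = first(splitext(basename(input_path)))
        output_path = joinpath(output_dir, name * ".jls")
        if !overwrite && isfile(output_path)
            continue  # already built in an earlier (possibly interrupted) run
        end
        serialize(output_path, build_graph(input_path))
        push!(built, output_path)
    end
    return built
end
```

Calling it a second time with `overwrite = false` after an interruption would then only build the graphs that are still missing from the output folder.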
I'd like to play with this. My suspicion is that if it works, it will take a lot more epochs to train to the same accuracy, but it would also allow us to get away with much smaller models. In addition, having input parameters that are actual values would allow cool things like very transparent sensitivity analyses via autodiff.
This would almost certainly require some normalization of the input features, so the AtomFeat objects would need to store the normalization in order to be able to invert the encoding properly. My inclination now is that this is best achieved via new types, e.g. split into OneHotAtomFeat and NumericalAtomFeat that both inherit from an abstract AtomFeat type. A lot of things could be fairly easily dispatched onto both.
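A sketch of how that split might look, assuming min-max scaling as the normalization (just one option). The field names and the `encode`/`decode` signatures here are hypothetical, and the abstract type is called `AbstractAtomFeat` only to keep the example self-contained.

```julia
# Hypothetical abstract parent for both kinds of feature.
abstract type AbstractAtomFeat end

# Categorical feature encoded as a one-hot vector.
struct OneHotAtomFeat <: AbstractAtomFeat
    name::String
    categories::Vector{String}
end

# Continuous feature that stores its own normalization (min-max scaling here).
struct NumericalAtomFeat <: AbstractAtomFeat
    name::String
    min::Float64
    max::Float64
end

encode(f::OneHotAtomFeat, value) = Float64.(f.categories .== value)
decode(f::OneHotAtomFeat, encoded) = f.categories[argmax(encoded)]

# Because min and max live on the feature, the encoding can be inverted exactly.
encode(f::NumericalAtomFeat, value) = (value - f.min) / (f.max - f.min)
decode(f::NumericalAtomFeat, encoded) = f.min + encoded * (f.max - f.min)

block = OneHotAtomFeat("Block", ["s", "p", "d", "f"])
mass = NumericalAtomFeat("Atomic mass", 1.0, 201.0)
```

Code that only needs a feature's name, or that loops over a list of features calling `encode`, could be written once against the abstract type.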
Can use GitHub wiki to start with
Copying over from #85 ...
I had to comment out some checks in the main AtomGraph constructor that were validating that the encoded features actually matched the featurization (because it now dispatches on AbstractFeaturization rather than GraphNodeFeaturization specifically), so we could just make a new function like validate_features(encoded, fzn) that would dispatch across different featurization types, and if you fed in a DummyFeaturization it would always say it was okay.
The abstract dispatch is actually already there in #87 so "all" that remains is actual implementations (envisionable, though certainly not entirely trivial).
This should be really easy, I'm just busy with other stuff this second and don't want to forget.
Reason being, currently if we want to build and serialize a large number of graphs, they'd all have to be stored in memory and then serialized one by one, and you lose progress if something goes wrong (such as, for example, some graphs couldn't be built and then the script borks when it tries to serialize a missing object). If we make this change, then we can broadcast over output files and serialize a graph as it's being created, and we won't run into the problem I'm describing.
Is this informed by very specific experience today? Maayyyybe...
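Roughly what the broadcast-over-output-files pattern could look like, as a sketch. `try_build_graph` is a hypothetical stand-in that returns `missing` on failure, and `build_and_serialize` is an illustrative name, not an existing function.

```julia
using Serialization

# Hypothetical builder: returns `missing` when a graph cannot be built.
try_build_graph(path) = endswith(path, ".cif") ? (source = path,) : missing

# Build one graph and write it out immediately, so nothing accumulates in
# memory and a failed build never reaches the serializer.
function build_and_serialize(input_path, output_path)
    graph = try_build_graph(input_path)
    ismissing(graph) && return false
    serialize(output_path, graph)
    return true
end

inputs = ["a.cif", "broken.txt", "c.cif"]
output_dir = mktempdir()
outputs = joinpath.(output_dir, ["a.jls", "broken.jls", "c.jls"])

# Broadcast over matching input/output paths; each graph is saved as it is made.
statuses = build_and_serialize.(inputs, outputs)
```

The returned flags make it easy to list which inputs failed without losing the graphs that succeeded.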
Came up in a group meeting discussion. Could be a neat idea to try to differentiate with respect to graph weights and see if you can get something like a force by propagating that through a pretrained model...
Running upgrade ChemistryFeaturization in a local environment for AtomicGraphNets caused an error:
ERROR: Error building `ChemistryFeaturization`:
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed
PackagesNotFoundError: The following packages are missing from the target environment:
- mkl
[ Info: Running `conda install -y -c conda-forge ase` in root environment
[ Info: Running `conda install -y -c conda-forge rdkit` in root environment
[ Info: Running `conda install -y -c conda-forge pymatgen` in root environment
[ Info: Running `conda remove -y mkl` in root environment
ERROR: LoadError: failed process: Process(setenv(`/Users/rkurchin/.julia/conda/3/bin/conda remove -y mkl`,["XPC_FLAGS=0x0", "PWD=/Users/rkurchin/git_repos/AtomicGraphNets.jl", "XPC_SERVICE_NAME=0", "TERM_PROGRAM=vscode", "VSCODE_GIT_ASKPASS_NODE=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Frameworks/Code Helper (Renderer).app/Contents/MacOS/Code Helper (Renderer)", "SHELL=/bin/zsh", "VSCODE_GIT_ASKPASS_MAIN=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/extensions/git/dist/askpass-main.js", "__CF_USER_TEXT_ENCODING=0x1F5:0x0:0x0", "GIT_ASKPASS=/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/extensions/git/dist/askpass.sh", "VSCODE_GIT_IPC_HANDLE=/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/vscode-git-2ea4c5f5ce.sock" … "SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.DRpPU5mZJv/Listeners", "JULIA_LOAD_PATH=@:/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/jl_cHo7jp", "JULIA_EDITOR=\"/private/var/folders/s5/dcc5cmts1t14tsx8k7jxcn5m0000gn/T/AppTranslocation/F063CB06-C177-4113-A641-1D39C7CF6994/d/Visual Studio Code.app/Contents/Resources/app/bin/code\"", "HOME=/Users/rkurchin", "TERM=xterm-256color", "TERM_PROGRAM_VERSION=1.47.3", "JULIA_NUM_THREADS=", "COLORTERM=truecolor", "OPENBLAS_MAIN_FREE=1", "PYTHONIOENCODING=UTF-8"]), ProcessExited(1)) [1]
Stacktrace:
[1] pipeline_error
@ ./process.jl:525 [inlined]
[2] run(::Cmd; wait::Bool)
@ Base ./process.jl:440
[3] run
@ ./process.jl:438 [inlined]
[4] runconda(args::Cmd, env::String)
@ Conda ~/.julia/packages/Conda/sNGum/src/Conda.jl:129
[5] rm (repeats 2 times)
@ ~/.julia/packages/Conda/sNGum/src/Conda.jl:226 [inlined]
[6] top-level scope
@ ~/.julia/packages/ChemistryFeaturization/O2LBl/deps/build.jl:6
[7] include(fname::String)
@ Base.MainInclude ./client.jl:444
[8] top-level scope
@ none:5
in expression starting at /Users/rkurchin/.julia/packages/ChemistryFeaturization/O2LBl/deps/build.jl:6
This seems to have been caused by #92. We probably need a more sustainable and less sketchy fix for that.
Note - The error occurred on a macOS-x64 system, and I wasn't able to personally reproduce this error on a Linux-x86_64 system.
Currently it's only used for graph IO/building with ase and pymatgen. Would it be hard to have Julia versions of these? Are there artifacts that we can use instead? Maybe consider spinning the PyCall dep out to a separate package?