hackingmaterials / matminer
Data mining for materials science
Home Page: https://hackingmaterials.github.io/matminer/
License: Other
matminer/matminer/distance_metrics/site.py has a note:
THIS IS CURRENTLY JUST HOSTING LEGACY CODE.
It will soon be refactored / changed based on some discussions.
- Anubhav (12/7/17)
@nisse3000 can you figure out the plan? Think this hosts your legacy method
Thanks for this project! This is not an issue, but more like a feature request: would you be interested in adding more sources of data (e.g. Nomad, Materials Cloud etc.)?
e.g. string column -> composition column
pymatgen dict repr -> pymatgen object
structure (no oxid) -> structure with oxidation state guesses
as suggested by @WardLT
Return nan or raise an exception if featurization fails? Either myself or @dyllamt can implement
Talk to @WardLT if interested
When BaseFeaturizer.featurize_dataframe() is called for a featurizer which returns equal-dimension numpy arrays (such as OrbitalFieldMatrix and the in-dev ManyBodyTensor), the code exits with an error, as shown in the Python code and output attached in testfeat.txt. The reason is that featurize_dataframe joins the features into a numpy array, which is interpreted as a multidimensional array if and only if each element of the array has the same length. If the features are single values, the array is 1D, and if they are variable-length arrays, the array is interpreted as a 1D collection of objects. The assign call in featurize_dataframe does not handle multidimensional arrays.
A possible fix is shown in testfeat.py, in which a dataframe is featurized by adding a Series initialized
from a list of features. As shown in testnew.txt, there does not seem to be a significant speed difference between these methods. I apologize if the fix is underthought; I am not familiar with the reasoning for the original design, so I may well be missing something.
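A minimal sketch of the shape behavior described above, plus the Series-based workaround from testfeat.py; this assumes nothing about the actual featurize_dataframe internals, and the column/feature names are hypothetical:

```python
import numpy as np
import pandas as pd

# Equal-length feature lists stack into one 2-D array...
equal = np.array([[1, 2], [3, 4]])
assert equal.ndim == 2

# ...while variable-length ones become a 1-D array of objects.
ragged = np.array([[1, 2], [3, 4, 5]], dtype=object)
assert ragged.ndim == 1

# Workaround sketch: assign a Series of per-row feature arrays, so each
# dataframe cell holds one array regardless of its dimensionality.
df = pd.DataFrame({"composition": ["Li2O", "Na2S"]})
feats = [np.array([1, 2]), np.array([3, 4])]
df["features"] = pd.Series(feats, index=df.index)
```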
Error message is
Downloading matminer/datasets/diel_ref.csv (3.2 MB)
Error downloading object: matminer/datasets/diel_ref.csv (7f3c427): Smudge error: Error downloading matminer/datasets/diel_ref.csv (7f3c4276d159e56eaf13231ad7580824f0512a9aba51e6d7fdad0de513691fad): [7f3c4276d159e56eaf13231ad7580824f0512a9aba51e6d7fdad0de513691fad] Object does not exist on the server: [404] Object does not exist on the server
Errors logged to /Users/shyuepingong/repos/matminer/.git/lfs/objects/logs/20170824T115944.174205621.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: matminer/datasets/diel_ref.csv: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
i.e., a parallel addition type term:
1 / avg = 1 / x1 + 1 / x2 + 1 / x3 + ...
Sometimes the "reduced mass" of a system can be more informative than an average mass, for example.
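As a concrete sketch of the parallel-addition term above (the helper name is hypothetical):

```python
def parallel_mean(values):
    """Parallel addition as above: 1/avg = 1/x1 + 1/x2 + ..."""
    return 1.0 / sum(1.0 / x for x in values)

# Reduced mass of a two-body system: 1/mu = 1/m1 + 1/m2
mu = parallel_mean([2.0, 2.0])  # two equal masses of 2.0 -> 1.0
```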
This would take in a structure and retain a list that contains the number of bonds (probably as fraction). For example, if a structure had 2 Li-O bonds and 3 Li-P bonds, it could return [0.4, 0.6] as the featurizer value and ["Li-O", "Li-P"] as the feature labels.
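The example above can be sketched as follows (helper name hypothetical; in practice the bond list would come from a structure's neighbor analysis):

```python
from collections import Counter

def bond_fractions(bonds):
    """Turn a list of bond labels into (fractions, labels)."""
    counts = Counter(bonds)
    total = sum(counts.values())
    labels = sorted(counts)
    return [counts[label] / total for label in labels], labels

# 2 Li-O bonds and 3 Li-P bonds -> fractions [0.4, 0.6]
fracs, labels = bond_fractions(["Li-O"] * 2 + ["Li-P"] * 3)
```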
Some aspects to note:
# imports assumed for this snippet (module paths per matminer/pymatgen at the time)
from pymatgen import MPRester
from matminer.featurizers.site import OPSiteFingerprint
from matminer.featurizers.structure import OPStructureFingerprint

optypes = {4: ["sq"]}
mpr = MPRester()
s = mpr.get_structure_by_material_id("mp-20027")
site_f = OPSiteFingerprint(optypes=optypes, dr=0.1, zero_ops=False, dist_exp=0)
structure_f = OPStructureFingerprint(op_site_fp=site_f, stats=("mean",))
cols = structure_f.feature_labels()
print(cols)
I tried to use the featurize method in the CrystalSiteFingerprint class, but hit this bug:
TypeError: __init__() got an unexpected keyword argument 'target'
Possible solution:
In line 484 of featurizer/site.py, vnn = VoronoiNN(cutoff=self.cutoff_radius, targets=target) - there is no 'target' parameter in VoronoiNN.
In line 486 of featurizer/site.py, n_w = vnn.get_voronoi_polyhedra(struct, idx) - there is no 'use_weights' parameter in the VoronoiNN.get_voronoi_polyhedra method.
I recently compared the performance of matminer to Magpie for computing composition-based features. For ~250k entries, matminer requires 24 minutes to evaluate the same features that take Magpie 4 seconds.
From a quick profiling run, it seems like a fair amount of the time (25%) is spent retrieving elemental properties and, surprisingly, computing the mode of a list. Digging further, the slow part of retrieving elemental properties is sorting the properties by atomic number and calling Composition.get_el_amt_dict(). My plan is to adjust the AbstractData API such that it is possible to avoid performing either of these operations.
I have opened this issue to see if anyone else has ideas for speeding up the code, and to let you know I am overhauling the AbstractData API. Let me know if I should make a public branch if you want to work on this together and we can avoid merge conflicts.
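One cheap speedup along these lines (a sketch, not matminer's actual implementation) is to memoize per-element property lookups so that repeated elements across ~250k compositions hit a cache instead of re-reading and re-sorting property tables; the table below is a hypothetical stand-in for matminer's data files:

```python
from functools import lru_cache

# Hypothetical stand-in for matminer's elemental property tables.
_ATOMIC_MASS = {"Li": 6.94, "Fe": 55.845, "P": 30.974, "O": 15.999}

@lru_cache(maxsize=None)
def get_elemental_property(element, prop="atomic_mass"):
    """Memoized lookup: repeated elements are served from the cache."""
    return _ATOMIC_MASS[element]
```

In the real code the cache key would also need to cover the property name and data source.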
Hello!
Thanks for your excellent program. I ran into trouble with get_pymatgen_descriptor('Com', 'ionic_radii'). It reports this bug: "TypeError: float() argument must be a string or a number, not 'dict'". Maybe you can fix this bug. Thanks!
In Section 3, the random forest model should have subnumbers like 3e, 3f, etc. instead of resetting to 3a.
Make it a generic site avg fingerprint
Part 1: This would be a tool that takes in a feature (or list of features) and returns some functional outputs like:
where x is the original feature value. Note that the list of features above is from "Machine Learning and Materials Informatics: Recent Applications and Prospects" by Ramprasad et al.
Part 2: This would take in a list of features (either raw features or perhaps after applying the functions above) and combine them, i.e., give you x1*x2 for all feature pairs. Note that if you applied the functions above, you will also have features like x1, x2, 1/x1, 1/x2, etc., so by multiplying all feature combinations you will end up with things like x1*x2, x1/x2, x2/x1, etc. Unfortunately, some of these will just be 1 (x1*(1/x1)) or redundant (x1*x1^2 is the same as x1^3).
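Part 2 can be sketched with itertools (function name hypothetical):

```python
from itertools import combinations

def cross_features(names, values):
    """Pairwise products and ratios of features."""
    out = {}
    for (na, xa), (nb, xb) in combinations(zip(names, values), 2):
        out[f"{na}*{nb}"] = xa * xb
        out[f"{na}/{nb}"] = xa / xb
        out[f"{nb}/{na}"] = xb / xa
    return out

# x1=2, x2=4 -> {'x1*x2': 8.0, 'x1/x2': 0.5, 'x2/x1': 2.0}
cross = cross_features(["x1", "x2"], [2.0, 4.0])
```

Pruning the redundant combinations noted above (products with a feature's own reciprocal, powers already present) would be a natural second step.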
The features chosen for the "pymatgen" preset are somewhat random (I chose them on a whim, based on no data, while coding in a rush)
Someone can try to do a better job with this...
There is a way to group columns in DataFrames which makes things look much prettier and more organized. See for example this example from Patrick Huck in MP:
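A minimal sketch of that kind of column grouping with a pandas MultiIndex (the feature names below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    [[1.2, 3.4, 0.1, 0.9]],
    columns=pd.MultiIndex.from_tuples([
        ("composition", "avg_mass"),
        ("composition", "avg_radius"),
        ("structure", "density"),
        ("structure", "packing_fraction"),
    ]),
)

# Selecting a group returns just that family of features
comp_feats = df["composition"]
```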
It would be nice if:
I made some changes to the "datasets" package of matminer. Some require your attention:
And maybe a string label
@JFChen3 @WardLT Many of the descriptors in composition seem to need knowledge of the oxidation state of an element. As far as I can tell, this is read off some kind of table. But many elements are ambivalent (e.g., P3- in AlP vs P5+ in LiFePO4).
I would suggest using pymatgen Composition's oxi_state_guesses function which I recently developed to guess the oxidation state instead. This should correctly figure out whether P is -, +, or neutral (e.g. P element).
Let me know if you have any issues with the function. If it is slow for large systems let me know because I think I have ideas to speed it up (have an option to reduce the Composition first)
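The charge-balance idea behind oxi_state_guesses can be sketched in pure Python (a simplified illustration, not pymatgen's implementation; function name and state tables are hypothetical): enumerate the allowed states per element and keep the combinations whose total charge is zero.

```python
from itertools import product

def guess_oxi_states(amounts, allowed):
    """amounts: {el: count}; allowed: {el: [possible oxidation states]}."""
    elements = list(amounts)
    for states in product(*(allowed[el] for el in elements)):
        charge = sum(amounts[el] * s for el, s in zip(elements, states))
        if charge == 0:
            yield dict(zip(elements, states))

# AlP: Al3+ balances P3-, not P5+
guesses = list(guess_oxi_states({"Al": 1, "P": 1}, {"Al": [3], "P": [-3, 3, 5]}))
```

The real function additionally ranks candidate assignments by how common each oxidation state is for the element.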
See PR #86 for details
See this TODO:
# TODO: this featurizer should fail gracefully for compounds with no clear anions (e.g., metals where all elements have zero oxidation) - returning either NaN or zero.
matminer.featurizers.composition.ElectronegativityDiff
Talk to @WardLT if interested
https://journals.aps.org/prb/abstract/10.1103/PhysRevB.95.144110
have a multiprocessing option in featurize dataframe to speed it up. Should detect the number of processors (straightforward with Python) and use multiprocessing package to parallelize.
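A minimal sketch of the idea, assuming the featurize function and entries are picklable (names are hypothetical, not matminer's API):

```python
from multiprocessing import Pool, cpu_count

def featurize_many(featurize_fn, entries, n_jobs=None):
    """Map featurize_fn over entries using one worker per processor."""
    n_jobs = n_jobs or cpu_count()
    with Pool(n_jobs) as pool:
        return pool.map(featurize_fn, entries)

# e.g. featurize_many(featurizer.featurize, df["structure"], n_jobs=4)
```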
@kylebystrom can you help clean up some of the sample dataframes?
Use the index_col parameter of pd.read_csv (you'll notice there is no extraneous index column after doing this).
If volume is volume per site, rename it to volume_per_site. If it's not per site, probably remove it.

There seems to be some issue running some examples. In particular (as of Jan 8, despite using api_key), in the bulk modulus example https://hackingmaterials.github.io/matminer/example_bulkmod.html the line:
from matminer.featurizers.data import PymatgenData
returns
ModuleNotFoundError: No module named 'matminer.featurizers.data'
Also:
from matminer.descriptors.composition_features import get_pymatgen_descriptor
returns
ModuleNotFoundError: No module named 'matminer.descriptors'
It appears that something is wrong with the MPDS and Citrine IPython notebooks.* I get the following output when I try to pull from the skeleton:
$ git lfs pull skeleton master
Git LFS: (0 of 2 files) 0 B / 191.73 KB
Username for 'https://github.com': kylebystrom
Password for 'https://[email protected]':
Git LFS: (0 of 0 files, 2 skipped) 0 B / 0 B, 191.73 KB skipped [1f60dd7f710b54cf410b4215a740abc9b2e6eba0f92aadd5fd25046112c60094] Object does not exist on the server: [404] Object does not exist on the server
[13656b91a004cd765c2ad641c0efd741407b02aa382fce71ccbae96de51b48cc] Object does not exist on the server: [404] Object does not exist on the server
error: failed to fetch some objects from 'https://github.com/hackingmaterials/matminer.git/info/lfs'
*I verified that the two missing files are in fact the MPDS and Citrine notebooks.
I don't need the notebooks right now, but this issue also prevents me pushing any updates to my fork, which is problematic. Does anyone else have this issue or know a fix?
Also, I noticed that not all of the notebooks are stored with git lfs. Is this intentional?
There are two different (common) definitions of standard deviation. One of them has the number of samples in the denominator, n, and the other one has one less (n-1), which is known as Bessel's correction.
The questions are:
My vote is in favor of n-1 by default. I expect most of the time we will not have a full population in matminer. I expect most matminer users will be trying to build models for the purposes of applying that model on unseen data that is not in the data set, which would go along with the n-1 definition. But this should be discussed to make sure we get the right solution.
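For reference, Python's standard library already distinguishes the two definitions (numpy does the same via its ddof parameter):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop = statistics.pstdev(data)   # divides by n (population) -> 2.0
samp = statistics.stdev(data)   # divides by n - 1 (Bessel's correction)
assert samp > pop               # n-1 always gives the larger value
```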
this will make sure the package_data correctly contains the CSV files needed for sample data sets (as @kylebystrom found out was an issue in the past)
putting this issue here as a reminder so we don't forget about this
A very interesting project. However when I follow the tutorial and try to:
from matminer.descriptors.composition_features import get_pymatgen_descriptor
avg_mass = np.mean(get_pymatgen_descriptor('LiFePO4', 'atomic_mass'))
...
this yields a ModuleNotFoundError: No module named 'matminer.descriptors'
under v0.1.1.
Am I missing something?
@WardLT The AGNIFingerPrints tests fail on Py2 with the error listed below. I tried a few fixes but it still doesn't seem to work. Note that everything is OK in Py3.
Traceback (most recent call last):
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/tests/test_site.py", line 52, in test_off_center_cscl
site1 = agni.featurize(self.cscl, 0)
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/site.py", line 103, in featurize
raise Exception('Unrecognized direction')
Exception: Unrecognized direction
Traceback (most recent call last):
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/tests/test_site.py", line 31, in test_simple_cubic
features = agni.featurize(self.sc, 0)
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/site.py", line 103, in featurize
raise Exception('Unrecognized direction')
Exception: Unrecognized direction
Hi all
I'd like to make the capitalization of feature names in matminer consistent. We can do:
I'm leaning towards the former; it's just easier to remember and type. The latter leads to some people deciding to do "Formation energy FERE", which is not the same ...
Thoughts, comments, etc?
get_cbm_vbm_scores is way too hacky. It only works for tet vs. oct, uses a custom function, etc. Remove this var. If someone wants this function they should code it well (works for multiple environments, etc.), or keep it inside their own test code. Note that there is another issue open for a featurizer that takes in a site and returns back a string label for the coordination number of that site.

Many featurizers don't have a set number or type of features that they return - the features depend on the input. For example, the BagofBonds featurizer (currently a PR) will return one value for each potential bond combination in the material. For Li2O this would mean Li-Li, Li-O, and O-O. But for Li (metal) the only feature is Li-Li. Or for Na2S it is Na-Na, Na-S, S-S. So the features are different for each entry. There are multiple featurizers like this, like ChemicalSRO.
There are two issues here:
It would be better if someone could design an improvement to BaseFeaturize() to support these kinds of features. I should say that text mining handles this kind of use case all the time. For example, the CountVectorizer in scikit-learn will return one "feature" for every distinct word in a block of text, and the features will be different for each text block. The way to do this is to use the fit_transform() method in scikit-learn.
I am wondering if something similar is needed / helpful for some of our featurizers.
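A CountVectorizer-style fit/transform pattern for variable-feature featurizers might look roughly like this (a sketch; the class and attribute names are hypothetical):

```python
class BagOfBondsLike:
    """Fit/transform pattern for featurizers whose feature set
    depends on the dataset, not on a single entry."""

    def fit(self, entries):
        # Learn the full feature vocabulary: every bond type seen anywhere.
        self.labels_ = sorted({b for bonds in entries for b in bonds})
        return self

    def transform(self, entries):
        # Fixed-length vectors: count of each known bond type per entry.
        return [[bonds.count(label) for label in self.labels_]
                for bonds in entries]

bob = BagOfBondsLike().fit([["Li-Li", "Li-O", "O-O"], ["Li-Li"]])
vec = bob.transform([["Li-Li"]])  # one column per bond type in the dataset
```

After fitting, every entry gets the same columns, with zeros where a bond type is absent.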
See this todo:
# TODO: @sbajaj unclear whether cohesive energies are taken from first ref, second ref, or combination of both
matminer.featurizers.composition.CohesiveEnergy#citations
You can add some code comments that explains what is going on. This could be part of the main doc for the CohesiveEnergy featurizer, e.g.
Class to get cohesive energy per atom of a compound by adding known
elemental cohesive energies from the formation energy of the
compound. <<SOME MORE DESCRIPTION HERE ABOUT KNOWN COHESIVE ENERGIES SOURCE>>
Apparently pip install of matminer does not copy data tables - which results in much of the core code breaking. Someone needs to investigate and fix. See:
better tutorials (e.g. show off featurize_dataframe, better use of featurizers, include structure featurizers, separate into "getting data", "featurizing data", and "full data mining workflow", make sure they work with Py3 (no print 'x'), use built-in example datasets for tutorials, etc.)
Commit 948fe5e makes citrination-client and plotly required libraries of matminer. This should not be the case. We can think of a typical use case of matminer as someone that has a spreadsheet of data and wants to add descriptors (composition or structure) to it and then run a data mining model. Neither plotly nor citrination-client is needed for that (although pymatgen and pandas are). If a library is extremely lightweight or robust then it's also usually OK to put it in requirements.
As far as I can tell you added the requirements to get a unit test to pass. But the way to do it is to fix the unit test, not require additional libraries. e.g., pymatgen contains examples on how to write unittests with optional libraries (you raise a SkipTest if the library is not installed). Can you try it this way?
Note also that if you do add requirements, you should also pin the version as per the other requirements.
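The optional-dependency test pattern looks roughly like this (the import name below is an assumption; adapt it to the actual optional library):

```python
import unittest

try:
    import citrination_client  # optional dependency, may be absent
    has_citrine = True
except ImportError:
    has_citrine = False

class CitrineRetrievalTest(unittest.TestCase):
    @unittest.skipIf(not has_citrine, "citrination-client not installed")
    def test_retrieve(self):
        # code that actually exercises the optional library goes here
        self.assertTrue(has_citrine)
```

When the library is missing, the test is reported as skipped rather than failed, so the requirement stays optional.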
See note from @WardLT
matminer/featurizers/composition.py:29
(internal note for myself)
many ML algorithms (e.g., clustering, kernel regression) require only distance between points, not explicit features. Add a distance_metrics package to help with this
Note that it's possible to code a generalized distance function that takes in a featurizer, args1 for data point1, args2 for data point2, and a choice of distance metric (angle or vector distance). Such a feature would work for any existing featurizer.
Other distance functions could be there if you didn't use explicit featurization to get distance.
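The generalized distance function described above might look like this (a sketch; the metric names and signature are assumptions):

```python
import math

def feature_distance(featurize, a, b, metric="euclidean"):
    """Distance between two data points in the space defined by `featurize`,
    which maps a data point to a feature vector (works with any featurizer)."""
    va, vb = featurize(a), featurize(b)
    if metric == "euclidean":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    if metric == "angle":
        dot = sum(x * y for x, y in zip(va, vb))
        na = math.sqrt(sum(x * x for x in va))
        nb = math.sqrt(sum(y * y for y in vb))
        # clamp for floating-point safety before acos
        return math.acos(max(-1.0, min(1.0, dot / (na * nb))))
    raise ValueError("Unknown metric: %s" % metric)
```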
I think there are a few issues with the RDF calculation method (link for convenience), involving the dist_rdf dict: in the loop when normalizing dist_rdf, it seems like the keys are treated as bin indices (based on the shell thickness calculation).

If you agree these are problems, I can make these changes. As you have tests for this function already, I figured it worth checking in before altering both the code and the tests.
Hi, I'm hoping you can provide a reference for matminer/matminer/featurizers/data_files/okeeffe_params.json.
I presume these come from https://doi.org/10.1107/S0108768190011041 or some extension but it's not clear which values are pulled (e.g., oxidation state, bonding partner, etc.) or the significance of "r" vs "c".
Thanks,
Chris
Some of the featurizers have a large number of presets, all of which are doing essentially the same thing. Here's one from a recent PR, although there were many before this of the same form (some committed by me).
+ if preset == "VoronoiNN":
+ return ChemicalSRO(VoronoiNN())
+ elif preset == "JMolNN":
+ return ChemicalSRO(JMolNN())
+ elif preset == "MinimumDistanceNN":
+ return ChemicalSRO(MinimumDistanceNN())
+ elif preset == "MinimumOKeeffeNN":
+ return ChemicalSRO(MinimumOKeeffeNN())
+ elif preset == "MinimumVIRENN":
+ return ChemicalSRO(MinimumVIRENN())
+ else:
+ raise RuntimeError('Unknown preset.')
We should:
It will also allow us to automatically support new NearNeighbors algorithms, etc.
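One generic alternative (a sketch, not matminer's actual API) is to resolve the preset name to a class via a lookup, e.g. getattr on pymatgen.analysis.local_env; shown here with a plain registry and a dummy class so the example stands alone:

```python
class DummyNN:
    """Stand-in for a pymatgen NearNeighbors implementation."""

NN_REGISTRY = {"DummyNN": DummyNN}
# In matminer this lookup could instead be:
#   nn_class = getattr(pymatgen.analysis.local_env, preset)

class ChemicalSROSketch:
    def __init__(self, nn):
        self.nn = nn

    @classmethod
    def from_preset(cls, preset):
        """Replace the elif chain with a single name lookup."""
        try:
            return cls(NN_REGISTRY[preset]())
        except KeyError:
            raise RuntimeError("Unknown preset: %s" % preset)
```

Any newly registered (or newly added pymatgen) NearNeighbors class becomes a valid preset with no code change to the featurizer.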