hackingmaterials / matminer
Data mining for materials science
Home Page: https://hackingmaterials.github.io/matminer/
License: Other
matminer/matminer/distance_metrics/site.py has a note:
THIS IS CURRENTLY JUST HOSTING LEGACY CODE.
It will soon be refactored / changed based on some discussions.
- Anubhav (12/7/17)
@nisse3000 can you figure out the plan? Think this hosts your legacy method
Thanks for this project! This is not an issue, but more like a feature request: would you be interested in adding more sources of data (e.g. Nomad, Materials Cloud etc.)?
e.g. string column -> composition column
pymatgen dict repr -> pymatgen object
structure (no oxid) -> structure with oxidation state guesses
as suggested by @WardLT
Return nan or raise an exception if featurization fails? Either myself or @dyllamt can implement
Talk to @WardLT if interested
When BaseFeaturizer.featurize_dataframe() is called for a featurizer which returns equal-dimension numpy arrays (such as OrbitalFieldMatrix and the in-dev ManyBodyTensor), the code exits with an error, as shown in the Python code and output attached in testfeat.txt. The reason is that featurize_dataframe joins the features into a numpy array, which is interpreted as a multidimensional array if and only if each element of the array has the same length. If the features are single values, the array is 1D, and if they are variable-length arrays, the array is interpreted as a 1D collection of objects. The assign call in featurize_dataframe does not handle multidimensional arrays.
A possible fix is shown in testfeat.py, in which a dataframe is featurized by adding a Series initialized
from a list of features. As shown in testnew.txt, there does not seem to be a significant speed difference between these methods. I apologize if the fix is underthought; I am not familiar with the reasoning for the original design, so I may well be missing something.
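A minimal sketch of the shape behavior described above, plus the Series-based workaround from testfeat.py; this assumes nothing about the actual featurize_dataframe internals, and the column/feature names are hypothetical:

```python
import numpy as np
import pandas as pd

# Equal-length feature lists stack into one 2-D array...
equal = np.array([[1, 2], [3, 4]])
assert equal.ndim == 2

# ...while variable-length ones become a 1-D array of objects.
ragged = np.array([[1, 2], [3, 4, 5]], dtype=object)
assert ragged.ndim == 1

# Workaround sketch: assign a Series of per-row feature arrays, so each
# dataframe cell holds one array regardless of its dimensionality.
df = pd.DataFrame({"composition": ["Li2O", "Na2S"]})
feats = [np.array([1, 2]), np.array([3, 4])]
df["features"] = pd.Series(feats, index=df.index)
```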
Error message is
Downloading matminer/datasets/diel_ref.csv (3.2 MB)
Error downloading object: matminer/datasets/diel_ref.csv (7f3c427): Smudge error: Error downloading matminer/datasets/diel_ref.csv (7f3c4276d159e56eaf13231ad7580824f0512a9aba51e6d7fdad0de513691fad): [7f3c4276d159e56eaf13231ad7580824f0512a9aba51e6d7fdad0de513691fad] Object does not exist on the server: [404] Object does not exist on the server
Errors logged to /Users/shyuepingong/repos/matminer/.git/lfs/objects/logs/20170824T115944.174205621.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: matminer/datasets/diel_ref.csv: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
i.e., a parallel addition type term:
1 / avg = 1 / x1 + 1 / x2 + 1 / x3 + ...
Sometimes the "reduced mass" of a system can be more informative than an average mass, for example.
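As a concrete sketch of the parallel-addition term above (the helper name is hypothetical):

```python
def parallel_mean(values):
    """Parallel addition as above: 1/avg = 1/x1 + 1/x2 + ..."""
    return 1.0 / sum(1.0 / x for x in values)

# Reduced mass of a two-body system: 1/mu = 1/m1 + 1/m2
mu = parallel_mean([2.0, 2.0])  # two equal masses of 2.0 -> 1.0
```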
This would take in a structure and retain a list that contains the number of bonds (probably as fraction). For example, if a structure had 2 Li-O bonds and 3 Li-P bonds, it could return [0.4, 0.6] as the featurizer value and ["Li-O", "Li-P"] as the feature labels.
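The example above can be sketched as follows (helper name hypothetical; in practice the bond list would come from a structure's neighbor analysis):

```python
from collections import Counter

def bond_fractions(bonds):
    """Turn a list of bond labels into (fractions, labels)."""
    counts = Counter(bonds)
    total = sum(counts.values())
    labels = sorted(counts)
    return [counts[label] / total for label in labels], labels

# 2 Li-O bonds and 3 Li-P bonds -> fractions [0.4, 0.6]
fracs, labels = bond_fractions(["Li-O"] * 2 + ["Li-P"] * 3)
```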
Some aspects to note:
# imports assumed for this snippet (module paths per matminer/pymatgen at the time)
from pymatgen import MPRester
from matminer.featurizers.site import OPSiteFingerprint
from matminer.featurizers.structure import OPStructureFingerprint

optypes = {4: ["sq"]}
mpr = MPRester()
s = mpr.get_structure_by_material_id("mp-20027")
site_f = OPSiteFingerprint(optypes=optypes, dr=0.1, zero_ops=False, dist_exp=0)
structure_f = OPStructureFingerprint(op_site_fp=site_f, stats=("mean",))
cols = structure_f.feature_labels()
print(cols)
I tried to use the featurize method in the CrystalSiteFingerprint class, but hit this bug:
TypeError: __init__() got an unexpected keyword argument 'target'
Possible solution:
In line 484 of featurizer/site.py, vnn = VoronoiNN(cutoff=self.cutoff_radius, targets=target) - there is no 'target' parameter in VoronoiNN.
In line 486 of featurizer/site.py, n_w = vnn.get_voronoi_polyhedra(struct, idx) - there is no 'use_weights' parameter in the VoronoiNN.get_voronoi_polyhedra method.
I recently compared the performance of matminer to Magpie for computing composition-based features. For ~250k entries, matminer requires 24 minutes to evaluate the same features that take Magpie 4 seconds.
From a quick profiling run, it seems like a fair amount of the time (25%) is spent retrieving elemental properties and, surprisingly, computing the mode of a list. Digging further, the slow part of retrieving elemental properties is sorting the properties by atomic number and calling Composition.get_el_amt_dict(). My plan is to adjust the AbstractData API such that it is possible to avoid performing either of these operations.
I have opened this issue to see if anyone else has ideas for speeding up the code, and to let you know I am overhauling the AbstractData API. Let me know if I should make a public branch if you want to work on this together and we can avoid merge conflicts.
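One cheap speedup along these lines (a sketch, not matminer's actual implementation) is to memoize per-element property lookups so that repeated elements across ~250k compositions hit a cache instead of re-reading and re-sorting property tables; the table below is a hypothetical stand-in for matminer's data files:

```python
from functools import lru_cache

# Hypothetical stand-in for matminer's elemental property tables.
_ATOMIC_MASS = {"Li": 6.94, "Fe": 55.845, "P": 30.974, "O": 15.999}

@lru_cache(maxsize=None)
def get_elemental_property(element, prop="atomic_mass"):
    """Memoized lookup: repeated elements are served from the cache."""
    return _ATOMIC_MASS[element]
```

In the real code the cache key would also need to cover the property name and data source.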
Hello!
Thanks for your excellent program. I ran into trouble with get_pymatgen_descriptor('Com', 'ionic_radii'). It reports this bug: "TypeError: float() argument must be a string or a number, not 'dict'". Maybe you can fix this bug. Thanks!
In Section 3, the random forest model should have subnumbers like 3e, 3f, etc. instead of resetting to 3a.
Make it a generic site avg fingerprint
Part 1: This would be a tool that takes in a feature (or list of features) and returns some functional outputs like:
where x is the original feature value. Note that the list of features above is from "Machine Learning and Materials Informatics: Recent Applications and Prospects" by Ramprasad et al.
Part 2: This would take in a list of features (either raw features or perhaps after applying the functions above) and combine them, i.e., give you x1*x2 for all feature pairs. Note that if you applied the functions above, you will also have features like x1, x2, 1/x1, 1/x2, etc., so by multiplying all feature combinations you will end up with things like x1*x2, x1/x2, x2/x1, etc. Unfortunately, some of these will just be 1 (x1*(1/x1)) or redundant (x1*x1^2 is the same as x1^3).
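Part 2 can be sketched with itertools (function name hypothetical):

```python
from itertools import combinations

def cross_features(names, values):
    """Pairwise products and ratios of features."""
    out = {}
    for (na, xa), (nb, xb) in combinations(zip(names, values), 2):
        out[f"{na}*{nb}"] = xa * xb
        out[f"{na}/{nb}"] = xa / xb
        out[f"{nb}/{na}"] = xb / xa
    return out

# x1=2, x2=4 -> {'x1*x2': 8.0, 'x1/x2': 0.5, 'x2/x1': 2.0}
cross = cross_features(["x1", "x2"], [2.0, 4.0])
```

Pruning the redundant combinations noted above (products with a feature's own reciprocal, powers already present) would be a natural second step.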
The features chosen for the "pymatgen" preset are somewhat random (I chose them on a whim, based on no data, while coding in a rush)
Someone can try to do a better job with this...
There is a way to group columns in DataFrames which makes things look much prettier and more organized. See for example this example from Patrick Huck in MP:
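A minimal sketch of that kind of column grouping with a pandas MultiIndex (the feature names below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    [[1.2, 3.4, 0.1, 0.9]],
    columns=pd.MultiIndex.from_tuples([
        ("composition", "avg_mass"),
        ("composition", "avg_radius"),
        ("structure", "density"),
        ("structure", "packing_fraction"),
    ]),
)

# Selecting a group returns just that family of features
comp_feats = df["composition"]
```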
It would be nice if:
I made some changes to the "datasets" package of matminer. Some require your attention:
And maybe a string label
@JFChen3 @WardLT Many of the descriptors in composition seem to need knowledge of the oxidation state of an element. As far as I can tell, this is read off some kind of table. But many elements are ambivalent (e.g., P3- in AlP vs P5+ in LiFePO4).
I would suggest using pymatgen Composition's oxi_state_guesses function which I recently developed to guess the oxidation state instead. This should correctly figure out whether P is -, +, or neutral (e.g. P element).
Let me know if you have any issues with the function. If it is slow for large systems let me know because I think I have ideas to speed it up (have an option to reduce the Composition first)
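The charge-balance idea behind oxi_state_guesses can be sketched in pure Python (a simplified illustration, not pymatgen's implementation; function name and state tables are hypothetical): enumerate the allowed states per element and keep the combinations whose total charge is zero.

```python
from itertools import product

def guess_oxi_states(amounts, allowed):
    """amounts: {el: count}; allowed: {el: [possible oxidation states]}."""
    elements = list(amounts)
    for states in product(*(allowed[el] for el in elements)):
        charge = sum(amounts[el] * s for el, s in zip(elements, states))
        if charge == 0:
            yield dict(zip(elements, states))

# AlP: Al3+ balances P3-, not P5+
guesses = list(guess_oxi_states({"Al": 1, "P": 1}, {"Al": [3], "P": [-3, 3, 5]}))
```

The real function additionally ranks candidate assignments by how common each oxidation state is for the element.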
See PR #86 for details
See this TODO:
# TODO: this featurizer should fail gracefully for compounds with no clear anions (e.g., metals where all elements have zero oxidation) - returning either NaN or zero.
matminer.featurizers.composition.ElectronegativityDiff
Talk to @WardLT if interested
https://journals.aps.org/prb/abstract/10.1103/PhysRevB.95.144110
have a multiprocessing option in featurize dataframe to speed it up. Should detect the number of processors (straightforward with Python) and use multiprocessing package to parallelize.
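A minimal sketch of the idea, assuming the featurize function and entries are picklable (names are hypothetical, not matminer's API):

```python
from multiprocessing import Pool, cpu_count

def featurize_many(featurize_fn, entries, n_jobs=None):
    """Map featurize_fn over entries using one worker per processor."""
    n_jobs = n_jobs or cpu_count()
    with Pool(n_jobs) as pool:
        return pool.map(featurize_fn, entries)

# e.g. featurize_many(featurizer.featurize, df["structure"], n_jobs=4)
```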
@kylebystrom can you help clean up some of the sample dataframes?
Use the index_col parameter of pd.read_csv (you'll notice there is no extraneous index column after doing this).
If volume is volume per site, rename it to volume_per_site. If it's not per site, probably remove it.

There seems to be some issue running some examples. In particular (as of Jan 8, despite using api_key), in the bulk modulus example https://hackingmaterials.github.io/matminer/example_bulkmod.html the line:
from matminer.featurizers.data import PymatgenData
returns
ModuleNotFoundError: No module named 'matminer.featurizers.data'
Also:
from matminer.descriptors.composition_features import get_pymatgen_descriptor
returns
ModuleNotFoundError: No module named 'matminer.descriptors'
It appears that something is wrong with the MPDS and Citrine IPython notebooks.* I get the following output when I try to pull from the skeleton:
$ git lfs pull skeleton master
Git LFS: (0 of 2 files) 0 B / 191.73 KB
Username for 'https://github.com': kylebystrom
Password for 'https://[email protected]':
Git LFS: (0 of 0 files, 2 skipped) 0 B / 0 B, 191.73 KB skipped [1f60dd7f710b54cf410b4215a740abc9b2e6eba0f92aadd5fd25046112c60094] Object does not exist on the server: [404] Object does not exist on the server
[13656b91a004cd765c2ad641c0efd741407b02aa382fce71ccbae96de51b48cc] Object does not exist on the server: [404] Object does not exist on the server
error: failed to fetch some objects from 'https://github.com/hackingmaterials/matminer.git/info/lfs'
*I verified that the two missing files are in fact the MPDS and Citrine notebooks.
I don't need the notebooks right now, but this issue also prevents me pushing any updates to my fork, which is problematic. Does anyone else have this issue or know a fix?
Also, I noticed that not all of the notebooks are stored with git lfs. Is this intentional?
There are two different (common) definitions of standard deviation. One of them has the number of samples in the denominator, n, and the other one has one less (n-1), which is known as Bessel's correction.
The questions are:
My vote is in favor of n-1 by default. I expect most of the time we will not have a full population in matminer. I expect most matminer users will be trying to build models for the purposes of applying that model on unseen data that is not in the data set, which would go along with the n-1 definition. But this should be discussed to make sure we get the right solution.
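For reference, Python's standard library already distinguishes the two definitions (numpy does the same via its ddof parameter):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop = statistics.pstdev(data)   # divides by n (population) -> 2.0
samp = statistics.stdev(data)   # divides by n - 1 (Bessel's correction)
assert samp > pop               # n-1 always gives the larger value
```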
this will make sure the package_data correctly contains the CSV files needed for sample data sets (as @kylebystrom found out was an issue in the past)
putting this issue here as a reminder so we don't forget about this
A very interesting project. However when I follow the tutorial and try to:
from matminer.descriptors.composition_features import get_pymatgen_descriptor
avg_mass = np.mean(get_pymatgen_descriptor('LiFePO4', 'atomic_mass'))
...
this yields a ModuleNotFoundError: No module named 'matminer.descriptors'
under v0.1.1.
Am I missing something?
@WardLT The AGNIFingerPrints tests fail on Py2 with the error listed below. I tried a few fixes but it still doesn't seem to work. Note that everything is OK in Py3.
Traceback (most recent call last):
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/tests/test_site.py", line 52, in test_off_center_cscl
site1 = agni.featurize(self.cscl, 0)
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/site.py", line 103, in featurize
raise Exception('Unrecognized direction')
Exception: Unrecognized direction
Traceback (most recent call last):
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/tests/test_site.py", line 31, in test_simple_cubic
features = agni.featurize(self.sc, 0)
File "/Users/ajain/Documents/code_matgen/matminer/matminer/featurizers/site.py", line 103, in featurize
raise Exception('Unrecognized direction')
Exception: Unrecognized direction
Hi all
I'd like to make the capitalization of feature names in matminer consistent. We can do:
I'm leaning towards the former; it's just easier to remember and type. The latter leads to some people deciding to do "Formation energy FERE", which is not the same ...
Thoughts, comments, etc?
get_cbm_vbm_scores is way too hacky. It only works for tet vs. oct, uses a custom function, etc. Remove this var. If someone wants this function they should code it well (works for multiple environments, etc.), or keep it inside their own test code. Note that there is another issue open for a featurizer that takes in a site and returns back a string label for the coordination number of that site.

Many featurizers don't have a set number or type of features that they return - the features depend on the input. For example, the BagofBonds featurizer (currently a PR) will return one value for each potential bond combination in the material. For Li2O this would mean Li-Li, Li-O, and O-O. But for Li (metal) the only feature is Li-Li. Or for Na2S it is Na-Na, Na-S, S-S. So the features are different for each entry. There are multiple featurizers like this, like ChemicalSRO.
There are two issues here:
It would be better if someone could design an improvement to BaseFeaturize() to support these kinds of features. I should say that text mining handles this kind of use case all the time. For example, the CountVectorizer in scikit-learn will return one "feature" for every distinct word in a block of text, and the features will be different for each text block. The way to do this is to use the fit_transform() method in scikit-learn.
I am wondering if something similar is needed / helpful for some of our featurizers.
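A CountVectorizer-style fit/transform pattern for variable-feature featurizers might look roughly like this (a sketch; the class and attribute names are hypothetical):

```python
class BagOfBondsLike:
    """Fit/transform pattern for featurizers whose feature set
    depends on the dataset, not on a single entry."""

    def fit(self, entries):
        # Learn the full feature vocabulary: every bond type seen anywhere.
        self.labels_ = sorted({b for bonds in entries for b in bonds})
        return self

    def transform(self, entries):
        # Fixed-length vectors: count of each known bond type per entry.
        return [[bonds.count(label) for label in self.labels_]
                for bonds in entries]

bob = BagOfBondsLike().fit([["Li-Li", "Li-O", "O-O"], ["Li-Li"]])
vec = bob.transform([["Li-Li"]])  # one column per bond type in the dataset
```

After fitting, every entry gets the same columns, with zeros where a bond type is absent.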
See this todo:
# TODO: @sbajaj unclear whether cohesive energies are taken from first ref, second ref, or combination of both
matminer.featurizers.composition.CohesiveEnergy#citations
You can add some code comments that explains what is going on. This could be part of the main doc for the CohesiveEnergy featurizer, e.g.
Class to get cohesive energy per atom of a compound by adding known
elemental cohesive energies from the formation energy of the
compound. <<SOME MORE DESCRIPTION HERE ABOUT KNOWN COHESIVE ENERGIES SOURCE>>
Apparently pip install of matminer does not copy data tables - which results in much of the core code breaking. Someone needs to investigate and fix. See:
better tutorials (e.g. show off featurize_dataframe, better use of featurizers, include structure featurizers, separate into "getting data", "featurizing data", and "full data mining workflow", make sure they work with Py3 (no print 'x'), use built-in example datasets for tutorials, etc.)
Commit 948fe5e makes citrination-client and plotly required libraries of matminer. This should not be the case. We can think of a typical use case of matminer as someone that has a spreadsheet of data and wants to add descriptors (composition or structure) to it and then run a data mining model. Neither plotly nor citrination-client is needed for that (although pymatgen and pandas are). If a library is extremely lightweight or robust then it's also usually OK to put it in requirements.
As far as I can tell you added the requirements to get a unit test to pass. But the way to do it is to fix the unit test, not require additional libraries. e.g., pymatgen contains examples on how to write unittests with optional libraries (you raise a SkipTest if the library is not installed). Can you try it this way?
Note also that if you do add requirements, you should also pin the version as per the other requirements.
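The optional-dependency test pattern looks roughly like this (the import name below is an assumption; adapt it to the actual optional library):

```python
import unittest

try:
    import citrination_client  # optional dependency, may be absent
    has_citrine = True
except ImportError:
    has_citrine = False

class CitrineRetrievalTest(unittest.TestCase):
    @unittest.skipIf(not has_citrine, "citrination-client not installed")
    def test_retrieve(self):
        # code that actually exercises the optional library goes here
        self.assertTrue(has_citrine)
```

When the library is missing, the test is reported as skipped rather than failed, so the requirement stays optional.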
See note from @WardLT
matminer/featurizers/composition.py:29
(internal note for myself)
many ML algorithms (e.g., clustering, kernel regression) require only distance between points, not explicit features. Add a distance_metrics package to help with this
Note that it's possible to code a generalized distance function that takes in a featurizer, args1 for data point1, args2 for data point2, and a choice of distance metric (angle or vector distance). Such a feature would work for any existing featurizer.
Other distance functions could be there if you didn't use explicit featurization to get distance.
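The generalized distance function described above might look like this (a sketch; the metric names and signature are assumptions):

```python
import math

def feature_distance(featurize, a, b, metric="euclidean"):
    """Distance between two data points in the space defined by `featurize`,
    which maps a data point to a feature vector (works with any featurizer)."""
    va, vb = featurize(a), featurize(b)
    if metric == "euclidean":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    if metric == "angle":
        dot = sum(x * y for x, y in zip(va, vb))
        na = math.sqrt(sum(x * x for x in va))
        nb = math.sqrt(sum(y * y for y in vb))
        # clamp for floating-point safety before acos
        return math.acos(max(-1.0, min(1.0, dot / (na * nb))))
    raise ValueError("Unknown metric: %s" % metric)
```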
I think there are a few issues with the RDF calculation method (link for convenience), involving the dist_rdf dict: in the loop when normalizing dist_rdf, it seems like the keys are treated as bin indices (based on the shell thickness calculation).

If you agree these are problems, I can make these changes. As you have tests for this function already, I figured it worth checking in before altering both the code and the tests.
Hi, I'm hoping you can provide a reference for matminer/matminer/featurizers/data_files/okeeffe_params.json.
I presume these come from https://doi.org/10.1107/S0108768190011041 or some extension but it's not clear which values are pulled (e.g., oxidation state, bonding partner, etc.) or the significance of "r" vs "c".
Thanks,
Chris
Some of the featurizers have a large number of presets, all of which are doing essentially the same thing. Here's one from a recent PR, although there were many before this of the same form (some committed by me).
+ if preset == "VoronoiNN":
+ return ChemicalSRO(VoronoiNN())
+ elif preset == "JMolNN":
+ return ChemicalSRO(JMolNN())
+ elif preset == "MinimumDistanceNN":
+ return ChemicalSRO(MinimumDistanceNN())
+ elif preset == "MinimumOKeeffeNN":
+ return ChemicalSRO(MinimumOKeeffeNN())
+ elif preset == "MinimumVIRENN":
+ return ChemicalSRO(MinimumVIRENN())
+ else:
+ raise RuntimeError('Unknown preset.')
We should:
It will also allow us to automatically support new NearNeighbors algorithms, etc.
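One generic alternative (a sketch, not matminer's actual API) is to resolve the preset name to a class via a lookup, e.g. getattr on pymatgen.analysis.local_env; shown here with a plain registry and a dummy class so the example stands alone:

```python
class DummyNN:
    """Stand-in for a pymatgen NearNeighbors implementation."""

NN_REGISTRY = {"DummyNN": DummyNN}
# In matminer this lookup could instead be:
#   nn_class = getattr(pymatgen.analysis.local_env, preset)

class ChemicalSROSketch:
    def __init__(self, nn):
        self.nn = nn

    @classmethod
    def from_preset(cls, preset):
        """Replace the elif chain with a single name lookup."""
        try:
            return cls(NN_REGISTRY[preset]())
        except KeyError:
            raise RuntimeError("Unknown preset: %s" % preset)
```

Any newly registered (or newly added pymatgen) NearNeighbors class becomes a valid preset with no code change to the featurizer.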