Not an issue per se, but some thoughts I had that may be useful.
A paper to look at is "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation" (link). I think it goes over most of the "best practices" relevant to matbench.
The main topic of the paper is the types of bias and variance that arise in model selection and evaluation. An important case is how over-fitting during model selection can cause the final model to be either under- or over-fit. The more models / hyper-parameters that are searched over, the more likely it is that the "best" model is chosen due to variance, which can result in a worse final model. The paper notes that minimizing bias during model selection is less important than minimizing variance, since (assuming a roughly uniform bias across models) the best model will still be chosen. The paper gives a few options for dealing with over-fitting during model selection, such as regularization and stopping criteria. A practical choice would be to use 5-fold CV instead of 10-fold CV, as it will likely have somewhat higher bias but lower variance, and is less computationally expensive. This also highlights why it is so important to have a final held-out test set to evaluate the final chosen model, as the cross-validation scores can be heavily biased. Another option for small datasets (though it is computationally expensive) is nested cross-validation.
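To make the nested CV idea concrete, here is a rough sketch of what it could look like with scikit-learn; the dataset, estimator, and parameter grid are just placeholders for illustration, not a recommendation for matbench:

```python
# Sketch of nested cross-validation (placeholder data/estimator/grid).
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Inner loop: model selection (hyper-parameter search) via 5-fold CV.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"max_depth": [4, 8, None], "n_estimators": [100, 300]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=inner_cv,
                      scoring="neg_mean_absolute_error")

# Outer loop: estimates the generalization error of the *whole* selection
# procedure, so the reported score is not biased by the search itself.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv,
                         scoring="neg_mean_absolute_error")
print(f"nested CV MAE: {-scores.mean():.3f} +/- {scores.std():.3f}")
```

The point is just that the entire search is repeated inside each outer fold, so the outer score reflects the selection procedure, not one lucky configuration.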
This quote gives the main conclusion:
model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.
In practice this means there must be a train / test split, with the training set then used for train / validation either via a single split or via cross validation, and with the entirety of model selection (including most types of feature selection, hyper-parameter optimization, etc.) internal to that cross validation. The TPOT code is a good example of this: "pipelines" are compared to each other as a whole using cross validation, followed by a final evaluation of the best pipeline on a test set to estimate the true generalization error. A useful idea in model comparison would be to use the variance of the estimates of generalization error (e.g. the variance of the CV error) to check whether differences are statistically significant, but this is very hard to do correctly.
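For illustration, here is a minimal sketch of that workflow (a simple GridSearchCV as a stand-in for TPOT-style pipeline comparison, again with placeholder data and models):

```python
# Sketch: keep all model selection internal to the training data,
# then evaluate the chosen pipeline once on a held-out test set.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Hold out a test set first; it is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Preprocessing and hyper-parameter choices all happen inside the CV
# folds on the training set only.
pipeline = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(pipeline, {"ridge__alpha": [0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

# One final evaluation of the best pipeline on the untouched test set
# estimates the true generalization error.
print("selected:", search.best_params_)
print("test MAE:", -search.score(X_test, y_test))
```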
(Sidenote) With a goal of simply comparing many models (that is, without the goal of choosing a single best one at the end), the bias-variance trade-off is tricky. Theoretically we can either choose biased results to minimize variance, and hope that this gives an accurate ranking of performance even though it may not reflect the true generalization error, or choose unbiased results that may reflect the generalization error but may not give an accurate ranking. In practice we hope to land somewhere in the middle that gives reasonable results. The hold-out test set does not necessarily help here (though it may be better than nothing?), as using it to evaluate many models leads to the same kind of bias as in model selection. For example, after years of competitions training NNs to do well on CIFAR-10, the selected models may simply be over-fitting to the CIFAR-10 competition test set, and would do worse than previous models on new test data, even if they are currently winning by some small percentage increase in classification accuracy. This is just a thought; I'm not sure how to deal with this issue, and I haven't read much of the ML literature on large-scale model comparison.
Let me know if you have any questions, want clarification, more references, or anything else.