Comments (16)
---
It is probably best if we try to list all possible options before deciding on one.
Just to continue listing potential options, another one is to change `featurize()` so that it returns a NamedTuple of `(values, labels)`. We would get rid of the `feature_labels()` method altogether - `featurize()` would return both data and labels. One reason to do this is that the initial sketch of the BaseFeaturizer assumed that `feature_labels()` was purely an attribute of the featurizer (plus any initialization options). But, after coding many of these featurizers, it turns out that in many cases the labels are an attribute of the featurizer plus the data thrown into it.
So something like:

```python
values, labels = MyFeaturizer().featurize(data)
```

or a different usage to highlight the NamedTuple aspect:

```python
x = MyFeaturizer().featurize(data)
values = x.values
labels = x.labels
```
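A rough sketch of what that return type could look like (the names here are illustrative, not a concrete proposal):

```python
from typing import List, NamedTuple

# Hypothetical return type for the proposal above: featurize() yields both
# values and labels in one object, replacing feature_labels().
class FeaturizeResult(NamedTuple):
    values: List[float]
    labels: List[str]

class MyFeaturizer:
    def featurize(self, data) -> FeaturizeResult:
        # the labels can depend on the data itself, per the observation above
        return FeaturizeResult(values=[float(x) for x in data],
                               labels=[f"feat_{i}" for i in range(len(data))])

values, labels = MyFeaturizer().featurize([1, 2, 3])  # tuple unpacking works too
```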
---
I agree, listing the options is a good plan. One more to add to the list: having a separate base class for featurizers that do not fit the current mold. That makes the options:

1. No changes. All options that affect `featurize`, `feature_labels`, et cetera must be set in the initializer.
2. Add a training operation (`fit`) to BaseFeaturizer. The training operation sets attributes of the featurizer that may not be known before training data is available.
3. `featurize` produces both labels and features. Each entry (row) can produce different features, and the `feature_labels` operation becomes redundant.
4. No changes to BaseFeaturizer, and we develop a new class. Perhaps featurizers that generate a "bag of words" type feature are indeed different enough to warrant a different API.
Am I missing any?
Of this current list, I think Option 2 is the best (i.e., the `fit_transform` method proposed by @computron, if I am not misunderstanding his proposal):

1. Option 2 provides the easiest interface to ML libraries. The main purpose of the BaseFeaturizer class is to transform materials data into a form that works with many machine learning libraries: a matrix of features. The current contract for `featurize` encourages users to build featurizers that will generate an easy-to-use matrix of features.
2. Option 2 will make it easier to update a model than Option 1. As @computron noted, we now have featurizers that generate different features depending on the input data. If we choose Option 1, each time a training set is updated we will have to re-initialize the featurizer, which places a burden on the user to track what the other inputs to the constructor are. A separate operation that updates the state of the featurizer given only the updated dataset reduces the complexity of updating a model by providing exactly the needed behavior with minimal inputs.
3. Option 2 will preserve `featurize_dataframe` not affecting the featurizer state. To be useful, `featurize_dataframe` must generate the same features each time it is called on different datasets. A featurizer that generates different inputs for a model depending on whether it sees the training set or the test set will complicate evaluating the performance of the model. To address this, we must have a separate operation that alters which features are being generated.
The downside of Option 2 is that there are indeed some ML libraries that expect inputs besides rectangular matrices. Some libraries operate on bag-of-words-like representations, others on sequences or graphs. The current BaseFeaturizer does not work well with these. Perhaps these will require a separate base class that does not force outputs to be table-like.

PS. Once we agree on a pattern, I can draft an "Implementation Guide" for BaseFeaturizer that captures our plan. This will also address issue #93.
---
I looked into the `fit_transform` method for `CountVectorizer`. To assign a total number of features across several text documents, the method finds the vocabulary of each input document and merges them. In the context of featurizers, this is analogous to computing `feature_labels` for each input sample.
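For illustration, here is that behavior in `CountVectorizer` (the documents are just a toy example):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the crystal structure", "the band structure"]
vec = CountVectorizer()
X = vec.fit_transform(docs)  # fitting scans the corpus to build the vocabulary

# The vocabulary doubles as the "feature labels" and is known only after fitting
print(sorted(vec.vocabulary_))  # ['band', 'crystal', 'structure', 'the']
print(X.toarray())              # one fixed-length count vector per document
```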
For some featurizers you might be able to infer the number of features (for each sample) from the input data (e.g., many of the structure featurizers create site-specific features, so you could count the number of crystal sites; `SiteStatsFingerprint` is a good example of this). However, this is not the general case, and you're back to computing `feature_labels` for all samples before calling `fit_transform`.
Therefore, I don't think there is a way around returning both features and feature labels for every sample when you have a featurizer that can return heterogeneous features. To prevent `GeneralFeaturizer` from being a stateful object, you'll need to return both features and labels with every `featurize` call.
However, you bring up a good point that Matminer is supposed to generate easy-to-use features, and featurizers that don't return the same number of features don't really fit the bill. A user might have to develop a feature-prep pipeline that combines several site-specific features in a meaningful way (e.g., sum all the site-specific features that are on tetrahedral sites together and then drop all the other features).
To separate featurizers that are "ML ready" from those that may require extra post-processing, I created a `HeterogeneousFeaturizer`, which is a child of the `BaseFeaturizer` class. Featurizers that are children of this new class do not use `feature_labels`, but instead return a dictionary of features/labels from `featurize`. It's easy to generate a sparse DataFrame from a list of dictionaries.
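As a rough sketch of that contract (the class and features below are hypothetical, not the actual `HeterogeneousFeaturizer` code):

```python
import pandas as pd

# Hypothetical featurizer whose featurize() returns a {label: value} dict,
# so each entry may carry a different set of features.
class BondCountSketch:
    def featurize(self, bonds):
        counts = {}
        for bond in bonds:  # e.g., ["Fe-O", "Fe-O", "Na-Cl"]
            counts[bond] = counts.get(bond, 0) + 1
        return counts

feat = BondCountSketch()
rows = [feat.featurize(["Fe-O", "Fe-O"]), feat.featurize(["Na-Cl"])]

# pandas aligns the union of keys; absent features become NaN,
# which can be stored sparsely if desired
df = pd.DataFrame(rows)
print(df)
```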
---
This is just a temporary solution for now, but it offers some advantages, like relieving the need for a `feature_labels` function in general.
You might have two classes, `HeterogeneousFeaturizer` and `HomogeneousFeaturizer`, that are otherwise identical but allow you to semantically differentiate between featurizing behaviors, so the user knows what they're getting with each featurizer.
---
Thanks for this detailed reply, @dyllamt! I think the `HeterogeneousFeaturizer` might be a good solution for features that really don't fit the mold of `BaseFeaturizer`. How rare do you expect that to be?
You make a good analogy: the `fit` function is equivalent to computing labels before features. I agree that there are some examples where figuring out the feature labels ahead of time is easy, and I think this could be the general case. As another example, for the PRDF features you could determine which elements are present in your dataset, which defines both the features that will be computed and their labels.
To give us something concrete to discuss, I'm going to implement the PRDF using the `fit` approach (Option 2) above. We can then contrast that with your `HeterogeneousFeaturizer` implementation and, perhaps, use that to discuss which approach we like better. [Hopefully, I'll get the time to do that today or early tomorrow.]
---
Hi all,
OK, I thought a bit about this. Here's my current suggestion, which echoes Option 2 above but with some differences.
1. We make all featurizers follow the `Transformer` pattern in scikit-learn. See these links: http://scikit-learn.org/stable/data_transforms.html and http://www.dreisbach.us/articles/building-scikit-learn-compatible-transformers/ . This will also allow using matminer featurizers in sklearn pipelines if desired (https://signal-to-noise.xyz/post/sklearn-pipeline/) and naturally integrate matminer features with other sklearn tools.
2. `BaseFeaturizer` will subclass `TransformerMixin` from sklearn.
3. To make this work, `BaseFeaturizer` must implement:
   - `fit(X)`: for simple featurizers, this can be empty. For things like `BagofBonds`, it can generate the list of potential bonds and store it as an internal variable.
   - `transform(X)`: this is just like our current `featurize_dataframe()` function, except it takes in a numpy matrix instead of a dataframe & cols.
   - `fit_transform(X)`: OPTIONAL. One can implement this if there is an efficient process to do both fit and transform; otherwise it is automatically implemented by sequentially calling `fit()` and `transform()` (via `TransformerMixin`). A minimal sketch of the whole pattern follows this list.
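To make the proposal concrete, here is a minimal sketch of a featurizer following that pattern (the class and data are hypothetical stand-ins, not matminer code):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class BagOfItemsSketch(BaseEstimator, TransformerMixin):
    """Hypothetical featurizer that learns its feature set from training data."""

    def fit(self, X, y=None):
        # learn the full set of "items" (e.g., bond types) seen during training
        self.items_ = sorted({item for row in X for item in row})
        return self

    def transform(self, X):
        # emit one fixed-length count vector per entry, using the fitted items
        return [[row.count(item) for item in self.items_] for row in X]

featurizer = BagOfItemsSketch()
X_train = featurizer.fit_transform([["Fe-O", "Fe-O"], ["Na-Cl"]])  # via TransformerMixin
X_test = featurizer.transform([["Fe-O", "K-Br"]])  # unseen "K-Br" is simply ignored
```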
Some changes that will need to be made to BaseFeaturizer:

- Need to add the `fit` method to BaseFeaturizer. The default implementation can be `pass`; the simple featurizers will use this.
- Need to add the `transform` method to BaseFeaturizer. Currently, I am thinking of a default implementation based on how `featurize_dataframe` operates (i.e., using the user-defined `fit()` and `featurize()` methods instead of expecting the user to implement `transform`).
- Probably, change `featurize_dataframe` to mainly just call `transform`.
- Keep `feature_labels` around, the same as before, but note that one might need to call `fit()` first (which might set an internal variable used to get those feature labels). This means that `BaseFeaturizer` is stateful - i.e., `feature_labels` returns the labels from the most recent `fit()` operation only and might throw an error if it's called before calling `fit()`. Under the new rules, that's OK.
- Note also that, for certain featurizers, `featurize()` could also throw an error if `fit()` is not called first. Again, under the new rules, that's OK.
- Possibly add a method to BaseFeaturizer called `requires_fit()` with a default implementation returning False. The more complex featurizers can override this to return True. This might help communicate to the user which featurizers require a fit. It's possible that BaseFeaturizer's implementation of `transform` or `featurize_dataframe` will leverage this function (a sketch of these defaults follows the list).
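A rough sketch of those defaults, assuming hypothetical internals (this is not matminer's actual code):

```python
class BaseFeaturizerSketch:
    """Hypothetical defaults for the proposed methods above; subclasses
    implement featurize(), and complex ones override fit()/requires_fit()."""

    def fit(self, X, y=None):
        # default: nothing to learn
        return self

    def requires_fit(self):
        # default False; featurizers like BagofBonds would override to True
        return False

    def feature_labels(self):
        # stateful: for featurizers that require fitting, labels only exist
        # after the most recent fit() has set self._labels
        if not hasattr(self, "_labels"):
            raise RuntimeError("feature_labels() called before fit()")
        return self._labels

    def transform(self, X):
        # default transform built from the user-defined featurize()
        return [self.featurize(x) for x in X]
```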
For the current set of feature implementations, technically one does not need to change anything for the vast majority of featurizers. All that code can stay the same: the `fit()` method already has a default implementation of `pass`, the `transform()` method is handled by `BaseFeaturizer`, and the `fit_transform` method comes from `TransformerMixin`. However, the more complex featurizers like `BagofBonds` or `Dos` that are heterogeneous should be rewritten to take advantage of `fit()` - this will simplify their code and make it more natural, as well as make `featurize_dataframe` work out-of-the-box.
Note that as a positive result of all these changes, all the matminer featurizers will be compatible with standard sklearn Transformers and use the same language - which sounds to me like the right way to go.
---
(See also this link re: Pipelines: https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_05_classification_databases/2.2-lesson/readme.html)
---
👍 I'm on board with this plan.
---
A quick thought: is the desired behavior of `transform` close to what we currently have named `featurize_many`?
---
I see why you were both in favor of this paradigm now! I didn't see the vision before.

I think you're right about `transform` and `featurize_many`, @WardLT. sklearn's `FeatureUnion` might also take care of `MultipleFeaturizer`.
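For reference, `FeatureUnion` concatenates the outputs of several transformers side by side, which is essentially what `MultipleFeaturizer` does for featurizers (the transformers below are generic stand-ins, not matminer classes):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# fit each transformer on X, then stack their outputs column-wise
union = FeatureUnion([("std", StandardScaler()), ("minmax", MinMaxScaler())])
X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(union.fit_transform(X).shape)  # (2, 4): two transformers x two columns each
```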
I noticed that in sklearn's `CountVectorizer`, the `fit` method is just a call to `fit_transform`. That might be convenient for `BagofBonds` and `Dos`, where computing the features gives you the correct column headers. `BagofBonds` seems to be already coded in that style.
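The pattern looks something like this (a hypothetical sketch, not the `BagofBonds` code):

```python
class DelegatingFitSketch:
    """Hypothetical featurizer where fitting and featurizing share the work."""

    def fit(self, X, y=None):
        # computing the features already reveals the labels, so just run
        # fit_transform and discard the feature matrix
        self.fit_transform(X)
        return self

    def fit_transform(self, X, y=None):
        rows = [self._featurize(x) for x in X]
        self.labels_ = sorted({key for row in rows for key in row})
        return [[row.get(key, 0) for key in self.labels_] for row in rows]

    def _featurize(self, x):
        # stand-in for per-entry feature computation, e.g., counting bond types
        return {item: x.count(item) for item in set(x)}
```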
---
I'm currently in the process of trying to capture @computron's description with an update of BaseFeaturizer and an implementation of the PRDF. Is anyone else already doing that?
---
Not me right now.
---
And, it occurs to me that we should probably make BaseFeaturizer inherit from `BaseEstimator` from scikit-learn. I think that will provide us the ability to work with `Pipeline`s, with the only additional coding requirement being that we cannot use `*args` and `**kwargs` in the initializers (see here). Sound reasonable?
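The reason for that constraint, as I understand it, is that `BaseEstimator.get_params()` introspects the `__init__` signature. A sketch (the class and parameters are hypothetical):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CompliantFeaturizer(BaseEstimator, TransformerMixin):
    """Hypothetical featurizer with an sklearn-friendly constructor."""

    def __init__(self, cutoff=5.0, bin_size=0.1):
        # every parameter is named explicitly and stored under the same name,
        # so get_params()/set_params() can clone the object inside a Pipeline
        self.cutoff = cutoff
        self.bin_size = bin_size

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

print(CompliantFeaturizer(cutoff=8.0).get_params())
# {'bin_size': 0.1, 'cutoff': 8.0}
```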
---
I am mainly chasing around my daughter this long weekend, so I am not implementing - @WardLT, I think you are good to go!

Subclassing `BaseEstimator` sounds good to me.

As an aside, at first I thought we might want to have `*args` and `**kwargs` in certain Featurizer constructors (e.g., to pass those arguments into various pymatgen objects), but I think we can do without them. For example, we can directly take in a pymatgen object in the constructor and skip the need for passing `*args` and `**kwargs`.
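That alternative might look like this (hypothetical names; the caller configures the pymatgen helper and passes it in directly):

```python
from sklearn.base import BaseEstimator

class StructureFeaturizerSketch(BaseEstimator):
    """Hypothetical featurizer that accepts a configured helper object."""

    def __init__(self, site_analyzer=None):
        # rather than forwarding *args/**kwargs into a pymatgen constructor,
        # accept the ready-made object, keeping the signature explicit
        self.site_analyzer = site_analyzer
```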
---
Closing this now based on #156