
Comments (16)

computron commented on August 16, 2024

It is probably best if we try to list all possible options before deciding on one.

Just to continue listing potential options, another one is to change featurize() so that it returns a NamedTuple of (values, labels). We would get rid of the feature_labels() method altogether - featurize() would return both data and labels. One reason to do this is that the initial sketch of the BaseFeaturizer assumed that feature_labels() was an attribute of the featurizer (plus any initialization options). But, after coding many of these featurizers, it turns out that in many cases the labels are an attribute of the featurizer plus the data thrown into it.

So something like:

values, labels = MyFeaturizer.featurize(data)

or a different usage to highlight the NamedTuple aspect:

x = MyFeaturizer.featurize(data)
values = x.values
labels = x.labels
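
For concreteness, here is a minimal sketch of what a NamedTuple-returning featurize() could look like; FeaturizeResult and the toy featurizer below are illustrative assumptions, not actual matminer code:

    from typing import List, NamedTuple

    class FeaturizeResult(NamedTuple):
        # Hypothetical return type bundling values with their labels
        values: List[float]
        labels: List[str]

    class MyFeaturizer:
        # Toy featurizer: the labels depend on the data itself, which is
        # the motivation for returning them alongside the values
        def featurize(self, data) -> FeaturizeResult:
            values = [float(x) for x in data]
            labels = ["feature_{}".format(i) for i in range(len(values))]
            return FeaturizeResult(values=values, labels=labels)

    # Both usage patterns from above work:
    values, labels = MyFeaturizer().featurize([1, 2, 3])
    x = MyFeaturizer().featurize([1, 2, 3])
    values, labels = x.values, x.labels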

WardLT commented on August 16, 2024

I agree, listing the options is a good plan. One more to add to the list: having a separate base class for featurizers that do not fit the current API well. That makes the full list:

  1. No changes. All options that affect featurize, feature_labels, et cetera must be set in the initializer.
  2. Add a training operation (fit) to BaseFeaturizer. The training operation sets attributes of the featurizer that may not be known before training data is available.
  3. featurize produces both labels and features. Each entry (row) can produce different features, and the feature_labels operation becomes redundant.
  4. No changes to BaseFeaturizer, and we develop a new class. Perhaps featurizers that generate a "bag of words" type feature are indeed different enough to warrant a different API.

Am I missing any?

Of this current list, I think Option 2 is the best (e.g., the fit_transform method proposed by @computron , if I am not misunderstanding his proposal):

  • Option 2 provides the easiest interface to ML libraries. The main purpose of the BaseFeaturizer class is to transform materials data into a form that works with many machine learning libraries: a matrix of features. The current contract for featurize encourages users to build featurizers that will generate an easy-to-use matrix of features.

  • Option 2 will make it easier to update a model than Option 1. As @computron noted, we now have featurizers that generate different features depending on the input data. With Option 1, each time a training set is updated we would have to re-initialize the featurizer, which places a burden on the user to track what the other constructor inputs are. A separate operation that updates the state of the featurizer given only the updated dataset reduces the complexity of updating a model by providing exactly the needed behavior with minimal inputs.

  • Option 2 preserves the property that featurize_dataframe does not affect the featurizer state. To be useful, featurize_dataframe must generate the same features each time it is called on different datasets. A featurizer that generates different model inputs for the training set and the test set will complicate evaluating the performance of the model. To address this, we need a separate operation that alters which features are being generated (see the sketch after this list).
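
To illustrate the contract Option 2 implies, here is a minimal sketch (assumed names, not matminer code) in which fit() is the only operation allowed to change which features are produced, so featurize() stays side-effect free:

    class FittableFeaturizer:
        # Sketch of the Option 2 contract: fit() fixes the feature set,
        # and featurize() never changes featurizer state
        def fit(self, entries):
            # Inspect the training data and fix the feature labels
            self._labels = sorted({key for entry in entries for key in entry})
            return self  # returning self mirrors the scikit-learn convention

        def feature_labels(self):
            return list(self._labels)

        def featurize(self, entry):
            # The same labels are used for every entry once fit() has been called
            return [entry.get(label, 0.0) for label in self._labels]

    featurizer = FittableFeaturizer().fit([{"a": 1.0}, {"b": 2.0}])
    print(featurizer.feature_labels())       # ['a', 'b']
    print(featurizer.featurize({"a": 3.0}))  # [3.0, 0.0]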

The downside of Option 2 is that there are indeed some ML libraries that expect inputs other than rectangular matrices. Some libraries operate on bag-of-words-like representations, others on sequences or graphs. The current BaseFeaturizer does not work well with these. Perhaps they will require a separate base class that does not force outputs to be table-like.

PS. Once we agree on a pattern, I can draft an 'Implementation Guide' for BaseFeaturizer that captures our plan. This will also address issue #93.

dyllamt commented on August 16, 2024

I looked into the fit_transform method for CountVectorizer. To determine the total number of features across several text documents, the method finds the vocabulary of each input document. In the context of featurizers, this is analogous to computing feature_labels for each input sample.

For some featurizers you might be able to infer the number of features (for each sample) from the input data (e.g. many of the structure featurizers create site-specific features, so you could count the number of crystal sites. The SiteStatsFingerprint is a good example of this). However, this is not the general case, and you're back to computing feature_labels for all samples before calling fit_transform.

Therefore, I don't think there is a way around returning both features and feature_labels for every sample when you have a featurizer that can return heterogeneous features. To prevent GeneralFeaturizer from being a stateful object, you'll need to return both features and labels with every featurize call.

However, you bring up a good point that matminer is supposed to generate easy-to-use features, and featurizers that don't return the same number of features don't really fit the bill. A user might have to develop a feature-prep pipeline that combines several site-specific features in a meaningful way (e.g. sum all the site-specific features that are on tetrahedral sites and drop all the other features).

To separate featurizers that are "ML ready" from those that may require extra post-processing, I created a HeterogeneousFeaturizer, which is a child of the BaseFeaturizer class. GeneralFeaturizers that are children of this new class do not use feature_labels, but instead return a dictionary of features/labels with featurize. It's easy to generate a sparse DataFrame from a list of dictionaries.
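
For reference, building a DataFrame from per-sample dictionaries with pandas might look roughly like this (the column names here are made up for illustration):

    import pandas as pd

    # Each featurize() call returns a {label: value} dict; rows may have
    # different keys, and pandas fills the missing entries with NaN
    rows = [
        {"bond_A-B": 2, "bond_A-A": 1},
        {"bond_B-B": 4},
    ]
    df = pd.DataFrame(rows)
    print(df)
    #    bond_A-B  bond_A-A  bond_B-B
    # 0       2.0       1.0       NaN
    # 1       NaN       NaN       4.0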

dyllamt commented on August 16, 2024

This is just a temporary solution for now, but it offers some advantages, like removing the need for a feature_labels function in general.

You might have two classes, HeterogeneousFeaturizer and HomogeneousFeaturizer, that are otherwise identical but let you semantically differentiate featurizing behavior, so the user knows what they're getting from each featurizer.

WardLT commented on August 16, 2024

Thanks for this detailed reply, @dyllamt ! I think the HeterogeneousFeaturizer might be a good solution for features that really don't fit the mold of BaseFeaturizer. How rare do you expect that to be?

You make a good analogy: the fit function is equivalent to computing labels before features. I agree that there are some examples where figuring out the feature labels ahead of time is easy, and I think this could be the general case. As another example, for the PRDF features you could determine which elements are present in your dataset, which defines both the features that will be computed and their labels.

To give us something concrete to discuss, I'm going to implement the PRDF using the fit approach (Option 2) above. We can then contrast that with your HeterogeneousFeaturizer implementation and perhaps use the comparison to decide which approach we like better. [Hopefully, I'll get the time to do that today or early tomorrow.]
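
As a rough, hypothetical outline of that fit-based PRDF (not the actual implementation; the class and attribute names below are made up):

    from itertools import combinations_with_replacement

    class PartialRDFSketch:
        # Hypothetical outline only; the real PRDF featurizer will differ
        def fit(self, structures):
            # Scan the training set for the elements present; every element
            # pair then defines a fixed block of features
            elems = sorted({el.symbol for s in structures
                            for el in s.composition})
            self._pairs = list(combinations_with_replacement(elems, 2))
            return self

        def feature_labels(self):
            # One label per element pair (distance bins omitted here)
            return ["prdf {}-{}".format(a, b) for a, b in self._pairs]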

computron commented on August 16, 2024

Hi all,

OK, I thought a bit about this. Here's my current suggestion, which echoes Option 2 above but with some differences.

  1. We make all featurizers follow the Transformer pattern in scikit-learn. See these links:
    http://scikit-learn.org/stable/data_transforms.html and http://www.dreisbach.us/articles/building-scikit-learn-compatible-transformers/ . This will also allow using matminer featurizers in sklearn pipelines if desired (https://signal-to-noise.xyz/post/sklearn-pipeline/) and naturally integrate matminer features with other sklearn tools.

  2. BaseFeaturizer will subclass TransformerMixin from sklearn.

  3. To make this work, the BaseFeaturizer must implement the following (a rough sketch is given after this list):

  • fit(X matrix): for simple featurizers, this can be empty. For things like BagofBonds, it can generate the list of potential bonds and store it as an internal variable.
  • transform(X matrix): this is just like our current featurize_dataframe() function, except it takes in a numpy matrix instead of a dataframe and column names.
  • fit_transform(X matrix): OPTIONAL. One can implement this if there is an efficient way to do both fit and transform at once; otherwise it is implemented automatically by sequentially calling fit() and transform() (via TransformerMixin).
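
A minimal sketch of that structure, assuming we inherit from TransformerMixin (the class and method bodies below are placeholders, not the proposed matminer code):

    from sklearn.base import TransformerMixin

    class BaseFeaturizerSketch(TransformerMixin):
        # Sketch only: simple featurizers need no fitting, so the default
        # fit() is a no-op that just returns self
        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # Analogous to featurize_dataframe, but operating on a plain
            # sequence of entries and returning a feature matrix
            return [self.featurize(x) for x in X]

        def featurize(self, x):
            raise NotImplementedError

        # fit_transform() comes for free from TransformerMixin: it calls
        # fit(X, y).transform(X) unless a subclass overrides it for efficiency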

Some changes that will need to be made to BaseFeaturizer:

  1. Need to add the fit method to BaseFeaturizer. The default implementation can be a no-op (pass), which the simple featurizers will use.
  2. Need to add the transform method to BaseFeaturizer. Currently, I am thinking of a default implementation based on how featurize_dataframe operates (i.e., using the user-defined fit() and featurize() methods) instead of expecting the user to implement transform.
  3. Probably change featurize_dataframe so it mainly just calls transform.
  4. Keep feature_labels around much as before, but note that one might need to call fit() first (which might set an internal variable used to produce those feature labels). This means that BaseFeaturizer is stateful - i.e., feature_labels returns the labels from the most recent fit() operation only and might throw an error if it's called before fit(). Under the new rules, that's OK.
  5. Note also that, for certain featurizers, featurize() could also throw an error if fit() is not called first. Again, under the new rules, that's OK.
  6. Possibly add a method to BaseFeaturizer called requires_fit() with default implementation returning False. The more complex featurizers can override this to True. This might help communicate to the user which featurizers require a fit. It's possible that the BaseFeaturizer's implementation of transform or featurize_dataframe will leverage this function.

For the current set of featurizer implementations, technically one does not need to change anything for the vast majority of featurizers. All that code can stay the same - the fit() method gets a default implementation of pass, the transform() method is handled by BaseFeaturizer, and the fit_transform method comes from TransformerMixin. However, the more complex featurizers like BagofBonds or Dos that are heterogeneous should be rewritten to take advantage of fit() - this will simplify their code and make it more natural, as well as make featurize_dataframe work out-of-the-box.

Note that as a positive result of all these changes, all the matminer featurizers will be compatible with standard sklearn Transformers and use the same language - which sounds to me like the right way to go.
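
Once the featurizers follow the fit/transform contract, they can drop straight into an sklearn Pipeline. A self-contained toy example (the featurizer here is a placeholder, not a real matminer class):

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import Pipeline

    class ToyFeaturizer(BaseEstimator, TransformerMixin):
        # Placeholder featurizer: turns each number x into [x, x**2]
        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return [[x, x ** 2] for x in X]

    model = Pipeline([
        ("featurize", ToyFeaturizer()),
        ("regress", RandomForestRegressor(n_estimators=10)),
    ])
    model.fit([1, 2, 3, 4], [1.0, 4.0, 9.0, 16.0])
    print(model.predict([5]))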

computron commented on August 16, 2024

(see also this link re:Pipelines: https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_05_classification_databases/2.2-lesson/readme.html)

WardLT commented on August 16, 2024

👍 I'm on board with this plan.

WardLT commented on August 16, 2024

A quick thought: is the desired behavior of transform close to what we currently have named featurize_many?

dyllamt commented on August 16, 2024

I see why you were both in favor of this paradigm now! I didn't see the vision before.

I think you're right about transform and featurize_many, @WardLT. sklearn's FeatureUnion might also take care of MultipleFeaturizer.

I noticed that in sklearn's CountVectorizer, the fit method is just a call to fit_transform. That might be convenient for BagofBonds and Dos, where computing the features gives you the correct column headers. BagofBonds already seems to be coded in that style.
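
That pattern is easy to express. A hypothetical sketch (not matminer code) of a bag-of-features style featurizer, where fit() simply runs fit_transform() and discards the matrix because computing the features is what reveals the column headers:

    class BagLikeFeaturizerSketch:
        # Hypothetical: mirrors the CountVectorizer trick of implementing
        # fit() in terms of fit_transform()
        def fit(self, X, y=None):
            self.fit_transform(X)
            return self

        def fit_transform(self, X, y=None):
            rows = [self._count(x) for x in X]
            self._labels = sorted({k for row in rows for k in row})
            return [[row.get(label, 0) for label in self._labels]
                    for row in rows]

        def _count(self, x):
            # Stand-in for the real per-entry computation (e.g. counting bonds)
            counts = {}
            for token in x:
                counts[token] = counts.get(token, 0) + 1
            return counts

    f = BagLikeFeaturizerSketch()
    print(f.fit_transform(["ab", "bc"]))  # [[1, 1, 0], [0, 1, 1]]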

WardLT commented on August 16, 2024

I'm currently in the process of trying to capture @computron's description with an update of BaseFeaturizer and an implementation of the PRDF. Is anyone else already doing that?

dyllamt commented on August 16, 2024

Not me right now.

WardLT commented on August 16, 2024

And, it occurs to me that we should probably make BaseFeaturizer inherit from BaseEstimator in scikit-learn. I think that will give us the ability to work with Pipelines, with the only additional coding requirement being that we cannot use *args and **kwargs in the initializers (see here). Sound reasonable?

computron commented on August 16, 2024

I am mainly chasing around my daughter this long weekend so I am not implementing - @WardLT I think you are good to go!

Subclassing BaseEstimator sounds good to me.

As an aside, at first I thought we might want to have *args and **kwargs in certain Featurizer constructors (e.g., to pass those arguments into various pymatgen objects), but I think we can do without them. For example, we can directly take in a pymatgen object in the constructor and skip the need for passing *args and **kwargs.
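
As an illustration of that convention (the class and parameters below are hypothetical): explicit keyword arguments stored under the same names keep a featurizer compatible with BaseEstimator's get_params()/set_params():

    from sklearn.base import BaseEstimator

    class ExampleFeaturizer(BaseEstimator):
        # Hypothetical featurizer: takes the configured helper object
        # directly instead of forwarding *args/**kwargs to build it inside
        def __init__(self, analyzer=None, cutoff=13.0):
            self.analyzer = analyzer
            self.cutoff = cutoff

    print(ExampleFeaturizer(cutoff=10.0).get_params())
    # {'analyzer': None, 'cutoff': 10.0}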

computron commented on August 16, 2024

Closing this now based on #156
