Comments (4)
I kinda wanted to have it only about the `DataArray` at first, but through discussions we decided to include that in the SLEP.
from enhancement_proposals.
It looks like the two motivations for the `InputArray` are:

- Input feature names for model inspection, which pushes for defining `input_feature_names_`.
- Allowing estimators to treat columns differently (for fairness or categories).

Would voting for SLEP 012 indirectly be a vote for `input_feature_names_`? If not, the SLEP vote is kind of for "infrastructure". In other words, it would enable us to work on the two items above.

(BTW, I'm trying to see what we need to do to get SLEP 012 ready to vote on.)

Are there any other use cases you see for `InputArray`?
I guess I need to give a bit of context.

The history of the feature names "feature" (:D) seems something like the following to me (please correct me if I'm missing something, @amueller and @jorisvandenbossche):

1. A `get_feature_names` is introduced in some of the estimators. In most estimators it's the output feature names, but if it's implemented in a predictor, or a pipeline which has a predictor as the last step, it's the input features to the predictor (hence kind of inconsistent).
2. Propagating the feature names through a pipeline, in order to be able to inspect the feature names going into each step, can be done if we allow `get_feature_names` to accept an existing set of feature names (from the previous step in the pipeline); having the pipeline's `get_feature_names` do that for us would allow the user to inspect the features going into the last step. This would be done once `fit` is called. Some of the drawbacks: it's not clear whether `get_feature_names` refers to the input or the output, it only works in a pipeline, and estimators don't have access to the feature names at `fit` time (@amueller has an implementation of this).
3. The pipeline propagates the feature names at fit time, getting the output feature names of each step and passing them to the next step's `fit`. This could be done either with `get_feature_names` or with `feature_names_in_` and `feature_names_out_`; the latter makes it clear which one is which. It also means estimators have access to the feature names during `fit`. However, it's quite a bit of change in the pipeline, and it also requires us to pass a special `feature_names` arg to `fit`, which doesn't seem the nicest design (@amueller has an implementation of this).
4. Building on the previous step, we could use a data structure which includes the feature names as the output of `transform`, pass that to `fit`, and have estimators understand feature names from it. I had an implementation of this using `xarray`, and we seemed pretty okay with the implementation. It also doesn't touch the pipeline machinery, since it's transparent to the pipeline and other meta-estimators. The main issue with this solution is the dependency on `xarray`, and hence `pandas`.
5. Further developing the idea, we had a few discussions about developing a new data structure to keep the feature names alongside the data, and that's how the `DataArray` or `InputArray` came to be. It has the downside of introducing yet another data structure to the ecosystem, which we'd need to maintain. We also contemplated the idea of ideally pushing it out of sklearn in the future to make it a standalone package. The upside is that we'd avoid the `xarray` and/or `pandas` dependency. It's also a more backward-compatible implementation, since any operation on our object would result in a `numpy` array. Understanding this history makes it kinda clear (I hope) why the SLEP includes `feature_names_in_`. If you look at the SLEP's history, you'll see that I started it as the infrastructure SLEP, as you mention, but the reviewers were not happy with that since it didn't have enough motivation, and that motivation would include the introduction of feature names.
6. The latest development of the idea is to have feature names implemented using `xarray` and/or `pandas`, but with a soft dependency, i.e. users would be able to have/enable the feature only if they have the dependency installed. From my conversations with a few of the core devs, it seems it's a solution which would rather easily get a consensus.
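A minimal pure-Python sketch of the fit-time propagation idea in points 3 and 4 above (none of this is scikit-learn's actual implementation; the class and attribute names only loosely mirror the proposal): the pipeline hands each step the previous step's output names, so every estimator sees its input names during `fit`.

```python
# Toy sketch of fit-time feature-name propagation (points 3 and 4 above).
# Not scikit-learn's actual implementation; names are illustrative only.

class Transformer:
    """Toy transformer: records input names, derives output names."""

    def __init__(self, name):
        self.name = name

    def fit(self, X, feature_names=None):
        # The estimator sees the incoming names at fit time.
        self.feature_names_in_ = list(feature_names or [])
        # Output names are derived from the input names.
        self.feature_names_out_ = [f"{self.name}__{f}" for f in self.feature_names_in_]
        return self

    def transform(self, X):
        return X  # identity on the data, for illustration


class Pipeline:
    """Toy pipeline: hands each step the previous step's output names."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, feature_names=None):
        names = feature_names
        for step in self.steps:
            step.fit(X, feature_names=names)
            X = step.transform(X)
            names = step.feature_names_out_  # propagate to the next step
        return self


pipe = Pipeline([Transformer("scale"), Transformer("select")])
pipe.fit([[1.0, 2.0]], feature_names=["age", "height"])
print(pipe.steps[1].feature_names_in_)   # ['scale__age', 'scale__height']
print(pipe.steps[1].feature_names_out_)  # ['select__scale__age', 'select__scale__height']
```

The point of the sketch is that no special `feature_names` plumbing is visible to the user: the second step learns its input names purely from what the first step emitted.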
With the above background in mind, I'm hoping that the last solution would gain consensus, and I hope the implementation is not too complicated. There are a few bits which need to be discussed, such as:

- `xarray` or `pandas`? I very much prefer `xarray`, Andy prefers `pandas`, and there are pros and cons on either side.
- A global config or a local one, to enable feature names?
- Is it enabled by default if the user passes a data structure containing feature names?
- What's the default value of those configs? Probably off by default (on both accounts) for at least a few releases, since returning the new data structure is not backward compatible with people's existing code.
The parts where we seem to already have a consensus (I think) are:

- having both `feature_names_in_` and `feature_names_out_`
- deprecating `get_feature_names`
- transformers returning a data structure which includes the feature names, if feature names are enabled
If we move towards the latest solution (which I support), then I'd withdraw SLEP012 in favor of the new solution.
Some additions:

1. It's right now always the output feature names, so it's consistent, though potentially unexpected.
2. This got a bit better since we can now slice pipelines, so we could do `pipe[:-1].get_feature_names()`. That doesn't solve issues for other meta-estimators (if we want feature names there), though.
3. If we do this with `get_feature_names()` it has similar limitations wrt other meta-estimators as 2). It's true that I mixed the fit-time aspect with the `feature_names_in_` and `feature_names_out_` parts. These are somewhat orthogonal, but together they help consistently address all meta-estimators.
4. It's not entirely clear whether this requires adding `feature_names_in_`. Probably it does.
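The slicing trick in point 2 can be sketched in miniature (the classes below are illustrative stand-ins, not scikit-learn's `Pipeline`): `pipe[:-1]` yields a sub-pipeline of all but the last step, so calling `get_feature_names()` on it returns the names going *into* the final estimator.

```python
# Toy sketch of pipeline slicing: pipe[:-1] returns a sub-pipeline of all
# but the last step. Classes are illustrative, not scikit-learn's code.

class Pipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def __getitem__(self, key):
        # Slicing returns a new Pipeline over a subset of the steps.
        if isinstance(key, slice):
            return Pipeline(self.steps[key])
        return self.steps[key]

    def get_feature_names(self):
        # Output names of the last step of this (possibly sliced) pipeline.
        _, last = self.steps[-1]
        return last.feature_names_out_


class Stub:
    def __init__(self, names):
        self.feature_names_out_ = names


pipe = Pipeline([("vect", Stub(["a", "b"])), ("clf", Stub(["pred"]))])
print(pipe[:-1].get_feature_names())  # ['a', 'b'] -- the names entering "clf"
```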
I agree with the summary. Though I think one aspect you're missing is that @jnothman (and also @thomasjpfan) have concerns that having `feature_names_in_` and `feature_names_out_` will mean potentially generating and storing a lot of strings.

Will the `HashingVectorizer` create 1M strings at fit time, independent of the input? Will they be copied at every step in a pipeline? We could decide not to have feature names for sparse data, but then we can't really deprecate `get_feature_names` in `CountVectorizer`.

Having it be opt-in might solve some of these issues, but that means you need to decide ahead of time whether you might want to look at the feature names or not.
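To put a rough number on that concern, here is a back-of-the-envelope sketch. `HashingVectorizer`'s default output dimensionality is `n_features=2**20`, which is where the "1M strings" figure comes from; the `hashed_{i}` naming scheme below is purely hypothetical.

```python
import sys

# HashingVectorizer's default is n_features = 2**20 output columns. The
# hashing trick never materializes a vocabulary, so generated names would
# be pure overhead. The "hashed_{i}" scheme here is hypothetical.
n_features = 2 ** 20
names = [f"hashed_{i}" for i in range(n_features)]

total_mb = (sys.getsizeof(names) + sum(sys.getsizeof(s) for s in names)) / 1e6
print(f"{n_features:,} generated names cost roughly {total_mb:.0f} MB")
```

Tens of megabytes per estimator, before any copying between pipeline steps, which is why opt-in (or skipping names for sparse outputs) keeps coming up.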