Comments (4)
If we implemented it for TargetEncoder, we'd also need it for every other encoder that encodes each column independently (that is, all encoders except hashing).
I think the library should focus on encoding alone and not do this kind of feature engineering. My subjective opinion is that leaving concatenation to the user is the way to go. Is it really that clunky? It's just one line of code, isn't it? What exactly do you mean by "leads to some unnecessary categorical features"? Do you mean that if you concat product and color to productcolor it will also encode product and color (if you do not explicitly specify columns)? On this topic I agree it's annoying.
Maybe we could leave the encoders unchanged and offer some preprocessing functions instead? There could be a module preprocessing with a function create_composite_columns(input_df: pd.DataFrame, composite_cols: List[List[str]]) -> pd.DataFrame that will concatenate the columns and drop the individual cols for encoding.
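A minimal sketch of what such a preprocessing function could look like, using only pandas. Only the function name and the input_df / composite_cols parameters come from the comment above; the sep and drop_originals parameters and the naming of the new column are assumptions made for illustration and are not part of category_encoders.

```python
from typing import List

import pandas as pd


def create_composite_columns(
    input_df: pd.DataFrame,
    composite_cols: List[List[str]],
    sep: str = "_",               # hypothetical: separator for values and for the new column name
    drop_originals: bool = True,  # hypothetical: whether to drop the individual columns
) -> pd.DataFrame:
    """Return a copy of input_df with one joint string column per column group."""
    out = input_df.copy()
    to_drop = set()
    for group in composite_cols:
        # Build the joint value by string concatenation; missing values become the string "nan".
        joined = input_df[group[0]].astype(str)
        for col in group[1:]:
            joined = joined + sep + input_df[col].astype(str)
        out[sep.join(group)] = joined
        to_drop.update(group)
    if drop_originals:
        out = out.drop(columns=[c for c in input_df.columns if c in to_drop])
    return out


df = pd.DataFrame(
    {
        "product": ["shirt", "shirt", "hat"],
        "color": ["red", "blue", "red"],
        "price": [10, 12, 5],
    }
)
composite = create_composite_columns(df, [["product", "color"]])
print(composite)
```

The result can then be passed to any encoder as usual. Setting drop_originals=False keeps product and color so they can be encoded separately as well.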
from category_encoders.
I like the idea that you can choose whether or not to also encode product and color as separate columns.
I also like the tuple solution. This does not interfere with the current API and can be understood if documented well.
If you want to create this PR, please go ahead. I think it's a useful feature and it's sufficiently clean and backwards compatible. The only thing I'd ask is that you implement it in all encoders, not just target encoding, so that the API is uniform across all encoders.
from category_encoders.
Great! I like the preprocessing idea. I will scope it out and work on a PR for this in the next couple weeks.
from category_encoders.
One point of clarification on what you wrote:
that will concatenate the columns and drop the individual cols for encoding.
The use case I have in mind is to get an encoding of the joint product & color fields, but not return a string column of those two (as it will internally handle the encoding and then throw away the concatenated field). I may or may not want to also encode product & color separately. Here is a pseudocode example for how I would do it now.
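import category_encoders as ce

# Bracket access is required: df.product resolves to the DataFrame.product method, not the column.
df['product_color'] = df['product'] + df['color']
encoder = ce.TargetEncoder(cols=['product_color', 'product', 'color'])
encoder.fit(df, y)
X_enc = encoder.transform(df)
# Remove the temporary string column from the input frame; X_enc keeps its encoded version.
df = df.drop(columns=['product_color'])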
So it is a minor nuisance (2 extra lines of code). The cleanest solution I can think of is to allow tuples to be passed into the cols argument, indicating that those columns should be concatenated before encoding. So I may have spoken too soon about the preprocessing idea. Let me know if this seems like enough of a quality of life improvement to warrant a modification.
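To make the tuple idea concrete, here is a rough sketch of how a cols list containing tuples could be expanded before encoding. The helper name expand_tuple_cols, its sep parameter, and the naming of the joint column are all hypothetical; nothing here is an existing category_encoders API, and a real implementation inside the encoders would also have to discard the temporary string column after encoding.

```python
from typing import List, Sequence, Tuple, Union

import pandas as pd

# A column spec is either a single column name or a tuple of names to be joined.
ColSpec = Union[str, Tuple[str, ...]]


def expand_tuple_cols(
    df: pd.DataFrame,
    cols: Sequence[ColSpec],
    sep: str = "_",  # hypothetical separator for joined values and column names
) -> Tuple[pd.DataFrame, List[str]]:
    """Add one joint string column per tuple in cols and return the flat column list."""
    out = df.copy()
    flat_cols: List[str] = []
    for spec in cols:
        if isinstance(spec, tuple):
            # Concatenate the string values of every column named in the tuple.
            joined = df[spec[0]].astype(str)
            for col in spec[1:]:
                joined = joined + sep + df[col].astype(str)
            name = sep.join(spec)
            out[name] = joined
            flat_cols.append(name)
        else:
            flat_cols.append(spec)
    return out, flat_cols


df = pd.DataFrame(
    {"product": ["shirt", "shirt", "hat"], "color": ["red", "blue", "red"]}
)
expanded, flat_cols = expand_tuple_cols(df, [("product", "color"), "product"])
print(flat_cols)
print(expanded)
```

With this shape, a user chooses whether product and color are also encoded on their own simply by listing them, or not, next to the tuple. The flat column list is what would then be handed to an encoder's cols argument.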
I'm also curious about this point:
Do you mean that if you concat product and color to productcolor it will also encode product and color (if you do not explicitly specify columns)? On this topic I agree it's annoying.
If I were in this position and didn't want to specify columns (maybe I have a long list of cat columns to encode), then I think it would be simple enough to drop product & color before encoding? Let me know if I'm missing something and if you'd like me to build a solution for it.
from category_encoders.