Comments (1)
Hi @bking124
I haven't heard of this approach before. Searching for "Dracula Encoder" or "CTR encoder" (as mentioned in the talk) doesn't yield much either. Since the talk and blog post are already 8 years old and the approach hasn't gained much traction since, I'd be surprised if it yields great results.
On the other hand, we could include it in the package. I think it should be rather straightforward to implement.
From what I understood the encoded value is calculated as:
- calculate the counts for each label: df.groupBy(col, label).count(). This can be done only for the top N and the rest will go to a rest category
- use as encoded value for a label x: counts[x, target=0], counts[x, target=1], ..., log-odds, flag_is_rest
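The two steps above can be sketched in plain Python. This is only my reading of the description, not the talk's reference implementation and not part of category_encoders: the names `REST`, `fit_count_encoding` and `transform` are hypothetical, the target is assumed to be binary (0/1), and the `prior` pseudo-count is an assumption added so the log-odds stay finite when a class is missing.

```python
import math
from collections import Counter

REST = "__rest__"  # hypothetical sentinel for the shared "rest" category


def fit_count_encoding(categories, targets, top_n=2, prior=0.5):
    """Map each kept category to [count(target=0), count(target=1), log_odds, is_rest].

    Assumes a binary 0/1 target. `prior` is an assumed pseudo-count that keeps
    the log-odds finite when one class never occurs for a category.
    """
    # Keep only the top_n most frequent categories; all others share one bucket.
    top = {cat for cat, _ in Counter(categories).most_common(top_n)}

    # Count (category, target) pairs, sending infrequent categories to REST.
    counts = Counter()
    for cat, y in zip(categories, targets):
        key = cat if cat in top else REST
        counts[(key, y)] += 1

    encoding = {}
    for key in top | {REST}:
        n0, n1 = counts[(key, 0)], counts[(key, 1)]
        log_odds = math.log((n1 + prior) / (n0 + prior))
        encoding[key] = [n0, n1, log_odds, int(key == REST)]
    return encoding


def transform(encoding, categories):
    """Encode each value; anything not kept during fitting gets the REST vector."""
    return [encoding.get(cat, encoding[REST]) for cat in categories]


categories = ["a", "a", "a", "b", "b", "c", "d"]
targets = [1, 1, 0, 0, 1, 1, 0]
encoding = fit_count_encoding(categories, targets, top_n=2)
encoded = transform(encoding, ["a", "c", "never_seen"])
```

Here "c", "d" and any unseen value all map to the same rest vector, with the last element set to 1.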
I'm not quite sure how to handle the regression case. Probably we'd need some binning of the target variable there?
Also, small categories might result in overfitting if the classifier basically ignores the counts and just uses the log-odds (which it will). This could be an issue, just like in target encoding with too little regularization.
In fact this is pretty much what you'd get when you encode a variable with both count encoder and target encoder (with no regularisation).
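To illustrate why the two are close, here is a small plain-Python sketch, assuming a binary 0/1 target. It does not call category_encoders; `count_and_mean` and `per_class_counts_and_log_odds` are hypothetical helpers. The first computes what a count encoder and an unregularised target encoder would each contribute (count and mean target per category), and the second shows that the per-class counts and the log-odds can be recovered from just those two numbers.

```python
import math
from collections import defaultdict


def count_and_mean(categories, targets):
    """Per category: (count, mean target), i.e. count encoding plus
    target encoding without any smoothing. Assumes a binary 0/1 target."""
    total = defaultdict(int)
    positives = defaultdict(int)
    for cat, y in zip(categories, targets):
        total[cat] += 1
        positives[cat] += y
    return {cat: (total[cat], positives[cat] / total[cat]) for cat in total}


def per_class_counts_and_log_odds(count, mean):
    """Recover count(target=0), count(target=1) and the log-odds from (count, mean)."""
    n1 = round(count * mean)
    n0 = count - n1
    if n1 == 0:
        log_odds = -math.inf  # no positives observed for this category
    elif n0 == 0:
        log_odds = math.inf   # no negatives observed for this category
    else:
        log_odds = math.log(n1 / n0)
    return n0, n1, log_odds


stats = count_and_mean(["a", "a", "a", "b", "b"], [1, 1, 0, 0, 1])
recovered_a = per_class_counts_and_log_odds(*stats["a"])
recovered_b = per_class_counts_and_log_odds(*stats["b"])
```

The infinite log-odds for pure categories are the unregularised case; that is exactly where the overfitting concern above applies.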
from category_encoders.