Comments (8)
Thanks for the example! (In the future, providing code as formatted text is more helpful: people can copy and paste to quickly retry what you're showing.)
This demonstrates what I was alluding to in my second paragraph: this is expected behavior. If you print the results of the penultimate line (the `fit_transform`), you'll see different values within each category. You should get the same output if you change the last line to `cbe_encoder.transform(df1['f1'], df1['label'])`. Using `fit_transform`, or `transform` with `y` specified, tells the package that you're transforming the training dataset, and so it applies the sliding transformation that CatBoost is known for. On the other hand, when you call `transform` with `y=None`, the package takes that to mean you're transforming the test set, and so fixed values per category are used (roughly, the mean target from the entire training set). See the NOTE at the end of the docstring.
from category_encoders.
@bmreiniger I revisited the code and noticed that in the source code of CatBoost, the value is calculated as `(countInClass + prior) / (counts + priorDenominator)`, while in category_encoders the formula is `(countInClass - y + mean*a) / (counts + a)`. Why would there be a difference in this part?
I'm not entirely sure, in particular what the CatBoost source's definitions of those terms are. But they seem to be just two (probably different but similar) ways to regularize/smooth the raw mean-target-so-far. The `mean` here is the global mean, a sensible default for a prior. And note that the `- y` part is just to remove the row's own contribution from pandas's cumulative sum.
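To see concretely why the `- y` is needed (a toy illustration, not the package's code): pandas's grouped cumulative sum includes the current row, so subtracting the row's own target leaves the sum over the preceding rows of the same category only:

```python
import pandas as pd

cat = pd.Series(["a", "a", "a"])
y = pd.Series([1, 0, 1])

cumsum = y.groupby(cat).cumsum()  # includes the current row's target
sum_so_far = cumsum - y           # targets of preceding rows only

print(cumsum.tolist())      # [1, 1, 2]
print(sum_so_far.tolist())  # [0, 1, 1]
```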
Can you provide code with your example?
Note that `fit_transform` should produce different values per category, whereas `transform` should not. (Edit: `transform` with `y=None` should not, but if `y` is provided, then it should behave the same as `fit_transform`.)
Here is my code:
As the code shows, different categories will have different results. But in CatBoost, the same category in different row orders will also produce different results.
Very clear! Thanks for your reply!
@bmreiniger
When conducting a regression task, I observed that CatBoost implements bucketing for labels, which is not the case here. Is the bucketing process necessary in this context?
I don't see why it would be, but maybe bucketing (depending on how you assign the label then) acts as another source of regularization? Or maybe it's just faster? Can you link to their source that performs bucketing? (Maybe better to ask over there.)