Comments (8)
Thanks for the example! (In the future, providing code as formatted text is more helpful: people can copy and paste to quickly retry what you're showing.)
This demonstrates what I was alluding to in my second paragraph: this is expected behavior. If you print the results of the penultimate line (the `fit_transform`), you'll see different values within each category. You should get the same output if you change the last line to `cbe_encoder.transform(df1['f1'], df1['label'])`. Using `fit_transform`, or `transform` with `y` specified, tells the package that you're transforming the training dataset, and so it applies the sliding transformation that CatBoost is known for. On the other hand, when you call `transform` with `y=None`, the package takes that to mean you're transforming the test set, and so fixed values per category are used (roughly, the mean target from the entire training set). See the NOTE at the end of the docstring.
from category_encoders.
@bmreiniger I revisited the code and noticed that in the source code of CatBoost, the value is calculated as `(countInClass + prior) / (counts + priorDenominator)`, while in category_encoders the formula is `(countInClass - y + mean*a) / (counts + a)`. Why would there be a difference in this part?
I'm not entirely sure, in particular what the CatBoost source's definitions of those terms are. But they seem to be just two (probably different but similar) ways to regularize/smooth the raw mean-target-so-far. The `mean` here is the global mean, a sensible default for a prior. And note that the `- y` part is just to remove the row's own contribution from pandas's cumulative sum.
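To see concretely why the `- y` is needed (a toy illustration, not the package's code): pandas's grouped cumulative sum includes the current row, so subtracting the row's own target leaves the sum over the preceding rows of the same category only:

```python
import pandas as pd

cat = pd.Series(["a", "a", "a"])
y = pd.Series([1, 0, 1])

cumsum = y.groupby(cat).cumsum()  # includes the current row's target
sum_so_far = cumsum - y           # targets of preceding rows only

print(cumsum.tolist())      # [1, 1, 2]
print(sum_so_far.tolist())  # [0, 1, 1]
```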
Can you provide code with your example?
Note that `fit_transform` should produce different values per category, whereas `transform` should not. (Edit: `transform` with `y=None` should not, but if `y` is provided, then it should behave the same as `fit_transform`.)
Here is my code:
As the code shows, different categories will have different results. But in CatBoost, the same category in different row orders will also produce different results.
Very clear! Thanks for your reply!
@bmreiniger
When conducting a regression task, I observed that CatBoost implements bucketing for labels, which is not the case here. Is the bucketing process necessary in this context?
I don't see why it would be, but maybe bucketing (depending on how you assign the label then) acts as another source of regularization? Or maybe it's just faster? Can you link to their source that performs bucketing? (Maybe better to ask over there.)