Comments (8)

bmreiniger commented on August 28, 2024

Thanks for the example! (In the future, providing code as formatted text is more helpful: people can copy and paste to quickly retry what you're showing.)

This demonstrates what I was alluding to in my second paragraph: this is expected behavior. If you print the result of the penultimate line (fit_transform), you'll see different values within each category. You should get the same output if you change the last line to cbe_encoder.transform(df1['f1'], df1['label']). Using fit_transform, or transform with y specified, tells the package that you're transforming the training dataset, so it applies the sliding (ordered) transformation that CatBoost is known for: each row is encoded using only the targets of earlier rows in the same category. On the other hand, when you transform with y=None, the package takes that to mean you're transforming the test set, so it uses a fixed value per category (roughly, the mean target from the entire training set). See the NOTE at the end of the docstring.
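To make the two modes concrete, here is a minimal pure-Python sketch of the idea. It is not the category_encoders source: the function names, the toy data, the smoothing value a=1.0, and the fallback to the global mean for unseen categories are all assumptions made for illustration.

```python
from collections import defaultdict


def ordered_encode(categories, targets, a=1.0):
    """Training-style encoding: each row only sees earlier rows of its category."""
    prior = sum(targets) / len(targets)  # global target mean used as the prior
    running_sum = defaultdict(float)
    running_count = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        # Encode BEFORE adding this row's own target to the running totals.
        encoded.append((running_sum[cat] + prior * a) / (running_count[cat] + a))
        running_sum[cat] += y
        running_count[cat] += 1
    return encoded


def fixed_encode(train_categories, train_targets, new_categories, a=1.0):
    """Test-style encoding: one smoothed value per category from all training rows."""
    prior = sum(train_targets) / len(train_targets)
    total = defaultdict(float)
    count = defaultdict(int)
    for cat, y in zip(train_categories, train_targets):
        total[cat] += y
        count[cat] += 1
    # Assumption of this sketch: unseen categories fall back to the prior.
    return [
        (total[c] + prior * a) / (count[c] + a) if c in count else prior
        for c in new_categories
    ]


cats = ["a", "a", "b", "a", "b"]
ys = [1, 0, 1, 1, 0]

train_encoded = ordered_encode(cats, ys)     # values vary within a category
test_encoded = fixed_encode(cats, ys, cats)  # one value per category
print(train_encoded)
print(test_encoded)
```

In the first list the three "a" rows get three different values because each depends on the rows before it. In the second list every "a" row gets the same value, which is the behavior described for transform with y=None.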

from category_encoders.

bmreiniger commented on August 28, 2024

@bmreiniger I revisited the code and noticed that in the CatBoost source code, the value is calculated as (countInClass + prior) / (counts + priorDenominator), while in sklearn the function is (countInClass - y + mean*a) / (counts + a). Why is there a difference in this part?

I'm not entirely sure; in particular, I don't know how the CatBoost source defines those terms. But they seem to be just two (probably different but similar) ways to regularize/smooth the raw mean-target-so-far. The mean here is the global mean, a sensible default for a prior. And note that the - y part is just to remove the row's own contribution from pandas's cumulative sum.
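A small pure-Python sketch of the "- y" point, using made-up targets for rows that share one category. The data and the value a=1.0 are assumptions; itertools.accumulate stands in for the cumulative sum.

```python
from itertools import accumulate

ys = [1, 0, 1, 1]         # targets of rows sharing one category, in row order
a = 1.0                   # smoothing strength (hypothetical value)
mean = sum(ys) / len(ys)  # global mean used as the prior

# An inclusive cumulative sum counts the current row's own target.
inclusive = list(accumulate(ys))

# Subtracting y leaves only the targets of earlier rows.
exclusive = [s - y for s, y in zip(inclusive, ys)]

# Number of earlier rows in the category (0 for the first row).
counts = list(range(len(ys)))

encoded = [(s + mean * a) / (n + a) for s, n in zip(exclusive, counts)]
print(exclusive)
print(encoded)
```

The first row has no earlier rows, so its encoding is the prior itself. As a purely algebraic observation, setting prior = mean*a and priorDenominator = a in the CatBoost-style expression gives the same shape of formula; whether CatBoost actually defines those terms that way is not confirmed in this thread.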

bmreiniger commented on August 28, 2024

Can you provide code with your example?

Note that fit_transform should produce different values per category, whereas transform should not. (Edit: transform with y=None should not, but if y is provided, then it should behave the same as fit_transform.)

ccylance commented on August 28, 2024

Here is my code:
[image]
As the code shows, different categories get different results. But in CatBoost, the same category can also get different results depending on the order of the rows.

ccylance commented on August 28, 2024

Very clear! I appreciate your reply!

ccylance commented on August 28, 2024

@bmreiniger I revisited the code and noticed that in the CatBoost source code, the value is calculated as (countInClass + prior) / (counts + priorDenominator), while in sklearn the function is (countInClass - y + mean*a) / (counts + a). Why is there a difference in this part?

ccylance commented on August 28, 2024

@bmreiniger
When conducting a regression task, I observed that CatBoost implements bucketing for labels, which is not the case here. Is the bucketing process necessary in this context?

bmreiniger commented on August 28, 2024

I don't see why it would be, but maybe bucketing (depending on how you assign the label then) acts as another source of regularization? Or maybe it's just faster? Can you link to their source that performs bucketing? (Maybe better to ask over there.)
