
Comments (3)

FlorianWetschoreck commented on July 21, 2024

It would be great if you could share some of the datasets where you observed those huge differences. They most likely have to do with the high number of unique values.

I agree that it would be good to show a warning if a column has a large number of unique values, since that threatens the validity of the calculated score.
We could push this even further and automatically adjust the sample size based on the distribution of the target/feature values.
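
A minimal sketch of what such a warning could look like. The helper name `warn_on_high_cardinality` and the default threshold `max_unique=500` are assumptions for illustration only; neither is part of ppscore.

```python
import warnings


def warn_on_high_cardinality(values, column_name, max_unique=500):
    """Hypothetical helper: warn when a column has many unique values.

    `max_unique` is an illustrative threshold, not a ppscore setting.
    Returns the number of unique values so callers can reuse it.
    """
    n_unique = len(set(values))
    if n_unique > max_unique:
        warnings.warn(
            f"Column '{column_name}' has {n_unique} unique values "
            f"(threshold: {max_unique}); a score computed on a sample "
            f"may be unreliable.",
            UserWarning,
        )
    return n_unique
```

Such a check would run once per column before sampling, for example `warn_on_high_cardinality(df["user_id"], "user_id")` on a pandas column, where `df` and `user_id` are placeholder names.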

from ppscore.

8080labs commented on July 21, 2024

In the future, we will provide more in-depth documentation (and a paper) where we document all the tests and the results.

In short, here are the thoughts and observations:

  • General note 1: the sampling is a heuristic to reduce computation time, and it has drawbacks. There are other ways to reduce computation time as well.

  • General note 2: ppscore only takes two columns into account in a single calculation, and there are only so many patterns that can exist between two columns.

  • If you have two numeric columns, it often does not matter much how many rows you have, and a sample of 5000 is already plenty. There are edge cases where this is not true.

  • If you have categoric columns with many unique values (say, more than 500), there is a good chance that sampling becomes a problem. And if your dataset has millions of rows, there is a good chance that it has more than 500 unique categoric values. However, many distributions of categoric values are highly skewed, and in that case there might be hardly any problem.
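
To make the last point concrete, here is a rough sketch that measures how much of a categoric column survives a random sample of 5000 rows. The function `sample_coverage` is a hypothetical helper, not part of ppscore, and it uses plain random sampling, which is an assumption about how the sample is drawn.

```python
import random
from collections import Counter


def sample_coverage(values, sample_size=5000, seed=0):
    """Hypothetical helper: how well does a random sample cover a column?

    Returns (category_coverage, row_coverage):
      - category_coverage: share of distinct values that appear in the sample
      - row_coverage: share of all rows whose value appears in the sample
    """
    values = list(values)
    counts = Counter(values)
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    seen = set(sample)
    category_coverage = len(seen) / len(counts)
    row_coverage = sum(counts[v] for v in seen) / len(values)
    return category_coverage, row_coverage


# Evenly spread column: 20,000 categories with 10 rows each.
uniform = list(range(20_000)) * 10
# Skewed column: one dominant category plus 1,000 rare ones.
skewed = ["common"] * 99_000 + list(range(1_000))

print(sample_coverage(uniform))
print(sample_coverage(skewed))
```

For the evenly spread column, a sample of 5000 rows cannot contain more than a quarter of the 20,000 categories, so both coverage values stay low. For the skewed column most rare categories are still missed, but nearly all rows belong to a category that the sample does contain.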

Which observations did you make so far? Do you want to share some of them?

Dyex719 commented on July 21, 2024

I agree with what you said.
I saw differences of ~0.5 in some scores on my dataset, which is huge.
These were mostly in columns with a high number of unique values; however, I will need to dig deeper to see if this is always the case.

I think it would be useful to provide this information, maybe as a warning for columns that have a large number of unique values.
