In the readme it is mentioned, In case that the dataset

Does sampling 5000 rows from a dataset lead to consistent ppscore matrices?

8080labs commented on July 21, 2024

Does sampling 5000 rows from a dataset lead to consistent ppscore matrices?

Comments (3)

FlorianWetschoreck commented on July 21, 2024

It would be great if you could share some of the datasets where you observed those huge differences. They most likely have to do with the high number of unique values.

And I agree that it would be good to show a warning if a column has so many unique values that the calculation result may no longer be valid.
We could also push this even further and automatically adjust the sample size based on the distribution of the target/feature values.
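
The two ideas above (a warning for high-cardinality columns and a sample size that adapts to the data) could look roughly like the sketch below. This is not part of ppscore: the function names, the threshold of 500 unique values (borrowed from the rule of thumb elsewhere in this thread), and the target of 10 rows per unique value are all assumptions made for illustration.

```python
import warnings


def has_many_unique_values(values, max_unique=500, sample_size=5000):
    """Hypothetical check: warn if a column has more than `max_unique` distinct values."""
    n_unique = len(set(values))
    if n_unique > max_unique:
        warnings.warn(
            f"column has {n_unique} unique values (> {max_unique}); "
            f"a score based on a sample of {sample_size} rows may be unstable",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


def suggest_sample_size(values, rows_per_unique=10, minimum=5000):
    """Hypothetical heuristic: aim for `rows_per_unique` sampled rows per distinct value.

    The result is never below `minimum` and never above the number of rows.
    """
    n_unique = len(set(values))
    wanted = max(minimum, n_unique * rows_per_unique)
    return min(len(values), wanted)
```

A caller could run `has_many_unique_values(column)` before scoring and, if it returns `True`, pass the result of `suggest_sample_size(column)` as the sample size instead of the default.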

8080labs commented on July 21, 2024

In the future, we will provide more in-depth documentation (and a paper) where we document all the tests and the results.

In short, here are the thoughts and observations:

General note 1: the sampling is a heuristic to reduce computation time, and it has drawbacks. There are also other methods for reducing the computation time.
General note 2: ppscore only takes 2 columns into account during a single calculation, and there are only so many patterns that can exist between 2 columns.
If you have two numeric columns, it often does not matter much how many rows you have, and a sample of 5000 is already plenty. There are edge cases where this is not true.
If you have categorical columns with many unique values (let's say more than 500), there is a good chance that sampling is a problem. And if your dataset has millions of rows, there is a good chance that there are more than 500 unique categorical values. However, many distributions of categorical values are highly skewed, and in that case there might be hardly any problem.
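
The point about skewed distributions can be illustrated with a small simulation. It is only a sketch and does not use ppscore: the function name, the 20,000 categories, and the Zipf-like weights are assumptions chosen for illustration. It measures what share of all rows belongs to categories that show up at least once in a 5000-row sample.

```python
import random


def covered_row_share(weights, sample_size=5000, seed=0):
    """Share of all rows whose category appears at least once in a random sample.

    `weights[i]` is the relative frequency of category i in the full dataset.
    """
    rng = random.Random(seed)
    sample = rng.choices(range(len(weights)), weights=weights, k=sample_size)
    seen = set(sample)
    return sum(weights[i] for i in seen) / sum(weights)


n_categories = 20_000

# Every category is equally frequent.
uniform = [1.0] * n_categories
# Zipf-like skew: the category at rank r is about 1/r as frequent as the top one.
skewed = [1.0 / (rank + 1) for rank in range(n_categories)]

uniform_share = covered_row_share(uniform)
skewed_share = covered_row_share(skewed)

print(f"uniform: {uniform_share:.2f} of rows fall in sampled categories")
print(f"skewed:  {skewed_share:.2f} of rows fall in sampled categories")
```

With equally frequent categories, the sample can contain at most 5000 of the 20,000 categories, so most rows belong to categories it never saw. With skewed frequencies, the few dominant categories are almost certain to be sampled, so the sample represents a much larger share of the rows.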

Which observations did you make so far? Do you want to share some of them?

Dyex719 commented on July 21, 2024

I agree with what you said.
Some scores in my dataset differed by ~0.5, which is huge.
These differences were mostly in columns with a high number of unique values. However, I will need to dig deeper to see if this is always the case.

I think it would be useful to provide this information, maybe as a warning for columns that have a large number of unique values.
