Comments (3)
It would be great if you could share some of the datasets where you observed those huge differences. They most likely have to do with a high number of unique values.
I agree that it would be good to show a warning when a large number of unique values threatens the validity of the calculated score.
We could push this even further and automatically adjust the sample size based on the distribution of the target/feature values.
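As a rough illustration of what adjusting the sample size could look like, here is a minimal sketch. It is not part of ppscore: the function name `suggest_sample_size` and the parameters `base_sample` and `min_rows_per_value` are hypothetical, and the rule (aim for a minimum number of rows per unique value, never below the default sample of 5000) is only one possible heuristic for categoric columns.

```python
def suggest_sample_size(values, base_sample=5000, min_rows_per_value=10):
    """Hypothetical heuristic (not a ppscore API).

    Grows the sample so that, on average, every unique value can be
    represented by at least `min_rows_per_value` rows, and never asks
    for more rows than the column actually has.
    """
    n_rows = len(values)
    n_unique = len(set(values))
    # Assumption: rows needed scale linearly with the number of unique values.
    wanted = max(base_sample, n_unique * min_rows_per_value)
    return min(n_rows, wanted)


if __name__ == "__main__":
    low_cardinality = [0, 1] * 50_000            # 2 unique values
    high_cardinality = list(range(2_000)) * 50   # 2000 unique values
    print(suggest_sample_size(low_cardinality))
    print(suggest_sample_size(high_cardinality))
```

A rule like this ignores how skewed the values are, so it would likely oversample columns where a few values dominate.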
from ppscore.
In the future, we will provide more in-depth documentation (and a paper) where we document all the tests and the results.
In short, here are the thoughts and observations:
- General note 1: sampling is a heuristic to reduce the computation time, and it has drawbacks. There are also other methods for reducing the computation time.
- General note 2: the ppscore only takes 2 columns into account during a single calculation, and there are only so many patterns that can exist between 2 columns.
- If you have two numeric columns, it often does not matter very much how many rows you have, and a sample of 5000 is already plenty. There are edge cases where this is not true.
- If you have categoric columns with many unique values (let's say more than 500), there is a good chance that sampling is a problem. If your dataset has millions of rows, there is a good chance that it has more than 500 unique categoric values. However, many distributions of categoric values are highly skewed, and in that case there might be hardly any problem.
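The point about skewed categoric distributions can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not call ppscore, the helpers `make_column` and `row_coverage` are hypothetical, and the skewed column uses weights proportional to 1/rank as a stand-in for "highly skewed". It measures which share of all rows belongs to a category that appears at least once in a random sample of 5000 rows.

```python
import random


def make_column(n_rows, n_categories, skewed, seed=0):
    """Build a synthetic categoric column (hypothetical helper)."""
    rng = random.Random(seed)
    categories = range(n_categories)
    # Skewed: weight of the k-th category is proportional to 1 / (k + 1).
    weights = [1 / (k + 1) for k in categories] if skewed else None
    return rng.choices(categories, weights=weights, k=n_rows)


def row_coverage(column, sample_size, seed=0):
    """Share of rows whose category occurs at least once in a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(column, min(sample_size, len(column)))
    seen = set(sample)
    return sum(1 for value in column if value in seen) / len(column)


if __name__ == "__main__":
    uniform = make_column(100_000, 10_000, skewed=False)
    skewed = make_column(100_000, 10_000, skewed=True)
    print("uniform:", round(row_coverage(uniform, 5000), 3))
    print("skewed: ", round(row_coverage(skewed, 5000), 3))
```

In the uniform column most categories never make it into the sample, while in the skewed column the frequent categories, which account for most rows, are almost always sampled.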
Which observations have you made so far? Do you want to share some of them?
from ppscore.
I agree with what you said.
I saw differences of ~0.5 in some scores on my dataset, which is huge.
These were mostly in columns with a high number of unique values; however, I will need to dig deeper to see if this is always the case.
I think it would be useful to surface this information, maybe as a warning for columns that have a large number of unique values.
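A warning like this could be sketched as follows. This is not an existing ppscore feature: `warn_if_high_cardinality` and `max_unique` are hypothetical names, and the default of 500 is simply the rough threshold mentioned earlier in this thread.

```python
import warnings


def warn_if_high_cardinality(name, values, max_unique=500):
    """Hypothetical check (not a ppscore API).

    Emits a UserWarning and returns True when the column has more than
    `max_unique` distinct values; otherwise returns False.
    """
    n_unique = len(set(values))
    if n_unique > max_unique:
        warnings.warn(
            f"Column '{name}' has {n_unique} unique values "
            f"(more than {max_unique}); a score computed on a sample "
            "may differ from the score on the full data.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


if __name__ == "__main__":
    warn_if_high_cardinality("user_id", range(10_000))
    warn_if_high_cardinality("country", ["DE", "US", "FR"] * 100)
```

Such a check could run once per column before scoring, so that the warning points at the specific columns whose sampled scores deserve a second look.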
from ppscore.