8080labs / ppscore
Predictive Power Score (PPS) in Python
License: MIT License
I was wondering if you explored the option of using pyspark to reduce the running time.
Since all the regression/classification models are independent of each other, it seems like a good candidate for parallelisation.
Would love to discuss more about this if you are interested.
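A minimal sketch of that idea, using joblib processes as a stand-in for pyspark (the toy frame and column names here are assumptions, not from the original report):
import itertools
import numpy as np
import pandas as pd
import ppscore as pps
from joblib import Parallel, delayed

# toy frame; in practice this would be your own data
df = pd.DataFrame(np.random.uniform(-2, 2, (10_000, 4)), columns=list("ABCD"))

# every (x, y) score is an independent model fit, so the calls can run in parallel
pairs = list(itertools.permutations(df.columns, 2))
results = Parallel(n_jobs=-1)(delayed(pps.score)(df, x, y) for x, y in pairs)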
To find the score, two columns (namely the params x and y in ppscore.score()) need to be supplied in the call to score().
It would be nice to have a functionality where the user can get an array of results corresponding to all columns in a single line of code.
For instance, if comparisons between columns 'A', 'B', 'C', and 'D' are to be made, it should be possible to do so with a single line of code. Results could be obtained as a list of comparisons between A-B, A-C, A-D, B-C, B-D, and C-D.
This can be achieved in two ways: (1) a new function, or (2) modifying score for the same. Regarding test cases, it'd be easier to write a new one for a new function (as per 1) rather than modifying the existing implementation.
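A minimal sketch of such a helper (the function name all_pairs is hypothetical; note that pps.matrix already computes the full x-by-y grid):
import itertools
import numpy as np
import pandas as pd
import ppscore as pps

def all_pairs(df, columns):
    # hypothetical helper: one score per unordered column pair (A-B, A-C, ...)
    return [pps.score(df, x, y) for x, y in itertools.combinations(columns, 2)]

df = pd.DataFrame(np.random.rand(100, 4), columns=list("ABCD"))
results = all_pairs(df, ["A", "B", "C", "D"])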
@8080labs I'd like to work on it.
Thank you for creating a power metric to calculate the score for choosing features.
I hope you can release a version that supports calculating this score on a GPU. It would reduce computation time.
Thank you.
Numerical attributes do not always imply regression; they could be categorical attributes for which a classification model should be used.
PyCaret offers such features, where we can explicitly list numerical and categorical features. Something of that kind would greatly increase the usability of this library.
I am running the following line on an all-numerical dataset:
pps.predictors(df, "organic")
It throws the following error:
AttributeError: module 'ppscore' has no attribute 'predictors'
pps.matrix(df) runs fine.
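If I recall correctly, predictors() was only added in a later release (the 1.x line), so this error usually just means an older version is installed. A quick way to check, assuming a standard pip install:
import importlib.metadata
# predictors() is missing from the 0.x releases (assumption: it arrived in 1.0)
print(importlib.metadata.version("ppscore"))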
I would love to use this on time series data, and out of the box it seems to do pretty well, but I don't know if I'm interpreting the score right. I read over the readme at this link: https://github.com/8080labs/ppscore#calculation-of-the-pps
'''
The score is calculated on the test sets of a 4-fold cross-validation (number is adjustable via cross_validation). For classification, stratifiedKFold is used. For regression, normal KFold. Please note that this sampling might not be valid for time series data sets
'''
I don't understand: should I set sample=None for time series, or should I modify the cross_validation kwarg for time series data?
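As far as I can tell, neither kwarg makes the folds time-aware: sample only controls row subsampling and cross_validation only sets the number of folds. A safer route is a chronological holdout computed outside ppscore; here is a rough sketch of a PPS-style score on such a split (toy data, not ppscore internals):
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# toy series; the train/test boundary respects time order, unlike (Stratified)KFold
df = pd.DataFrame({"t": np.arange(1000.0)})
df["y"] = np.sin(df["t"] / 50) + np.random.normal(0, 0.1, len(df))
cut = int(len(df) * 0.75)
train, test = df.iloc[:cut], df.iloc[cut:]

model = DecisionTreeRegressor().fit(train[["t"]], train["y"])
model_mae = mean_absolute_error(test["y"], model.predict(test[["t"]]))
# naive baseline: always predict the training median
baseline_mae = mean_absolute_error(test["y"], np.full(len(test), train["y"].median()))
print(max(0, 1 - model_mae / baseline_mae))  # PPS-style normalization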
Question
At the moment this is just a query: is there anything equivalent to "factor analysis" in PPScore, or does someone know of steps to come up with one given a dataset?
It's possible this is a missing feature but before I make such claims I wanted to inquire and start a discussion.
Additional context
Found these resources related to this topic/concept:
Hi,
I'm having trouble understanding the baseline score. From what I can see in the code, the baseline is the score we get when we are naive and choose the median of y as our predicted value.
That would mean that if the baseline_score is higher than the model_score, then using the median as the predicted value would be better than the prediction from the decision tree. Is this correct?
Btw, thanks for releasing this.
Correlation works perfectly okay but the PPS is giving me weird results
The dataset link is:
dataset6 TO 10.xlsx
Please help me resolve this, I am trying to predict Ev/EBITDA here.
When the PPS is applied to linear relationships with the same error but different slopes, the score varies a lot, e.g. from 0.1 to 0.7, depending on the slope.
This might not be the behaviour that we expect intuitively and normalizing the target does not help.
The reason for this is that the ppscore calculates the ratio of the variance of the predictor to the variance of the baseline. If the slope is steep, the ratio is higher because the baseline makes more errors. If the slope is flat, the variances are nearly the same.
The underlying problem is that the current metric and calculation of the ppscore couples two questions: is there a valid pattern, and does that pattern have low variance?
If either of those two criteria is wrong or weak, the ppscore will be low, too.
Only if both are true will the ppscore be high.
The problem with the linear cases is that the pattern is valid BUT the variance of the pattern is not low because there is a lot of noise, even if the pattern is statistically significant (high error-to-signal ratio).
For this scenario (and maybe also for others), we might want to find a calculation that decouples those two concerns.
Some rough code:
import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]
df["0.3_linear_x"] = 0.3 * df["x"] + df["error"]  # 0.11 pps
df["0.5_linear_x"] = 0.5 * df["x"] + df["error"]  # 0.4 pps
df["1_linear_x"] = 1 * df["x"] + df["error"]  # 0.68 pps
# normalized linear to [0, 1] via +2 and /4
df["1_linear_x_norm"] = (df["1_linear_x"] + 2) / 4  # 0.68 pps, too

# compute the scores quoted in the comments above
for col in ["0.3_linear_x", "0.5_linear_x", "1_linear_x", "1_linear_x_norm"]:
    print(col, pps.score(df, "x", col)["ppscore"])
Hello folks,
I really like the idea of your package and the approach. I was just curious how difficult it might be to introduce a custom CV validation (or even just a time-series-meaningful one).
I could probably assist in that with a bit of guidance from you :)
Thanks,
Anton.
As it is written in your code, if a column has fewer than 15 unique values, it is considered categorical during fitting.
But I am facing the following warning during execution:
Reproducible sample:
# importing the libs
import ppscore as pps
import numpy as np
import pandas as pd

# creating a toy df
# feature columns 1 and 2: continuous
col1 = np.random.randn(10) * 10
col2 = np.random.randn(10) * 10
# feature column 3: low cardinality (range(4) yields at most 4 unique values)
col3 = np.random.choice(range(4), 10)

# creating the dataframe
df = pd.DataFrame(
    {'feature1': col1,
     'feature2': col2,
     'feature3': col3}
)

# trying to calculate the pps matrix
pps.matrix(df)
Warning that I observed:
anaconda3\envs\DL_py37\lib\site-packages\sklearn\model_selection\_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: Unknown label type: 'continuous'
This is not a new issue; an example solution (found on Kaggle) is to temporarily convert the data type of the feature3 column to categorical with LabelEncoder.
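A minimal sketch of that workaround on the toy frame above (astype(str) is used here in place of LabelEncoder; both turn the low-cardinality numeric column into explicit labels):
# make the labels explicit strings so sklearn no longer sees a 'continuous' target
df["feature3"] = df["feature3"].astype(str)
pps.matrix(df)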
I have improved the error-handling code in ppscore. As the same error-handling statements were being used repetitively, I have created a class called Check_Error which makes it quite easy to call the necessary 'raise error' statements. I have tested it too, and only one test failed, but it also fails in the main repo.
As Hacktoberfest has started, I would like to contribute this code as an open source contribution, so it would be great if you could label this issue and the pull request 'hacktoberfest'.
A pull request has been created against the master branch.
I am trying to calculate PPS for a classification task between two categorical columns and get this warning:
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
I included a screenshot of my code with some information on these two columns.
Also, there are no missing values in the data.
Thank you for your help!
Hi guys,
I heard of PPS through your article and was curious to test it. I have tried implementing it on some data I've been working on.
Unfortunately, I get numerous error messages when calculating the pps matrix:
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
My guess is pps is considering my data to be categorical and is therefore trying to apply classification with a huge number of labels.
Looking at how pps determines whether the data is numerical or categorical, I cannot find the reason it would consider my data categorical.
Also, if I try to force the pps score to be calculated using task = 'regression', I get the following error:
'DataFrame' object has no attribute 'dtype'
Here is my code:
import pandas as pd
import ppscore as pps

df = pd.read_csv('seattle_building_energy_benchmark.csv', sep=';')
df.dtypes
df.nunique()

pps.NUMERIC_AS_CATEGORIC_BREAKPOINT = 10

for col in df.columns:
    print(col)
    pps.score(df, x='YearBuilt', y=col, task=None)

for col in df.columns:
    print(col)
    pps.score(df, x='YearBuilt', y=col, task='regression')

pps.matrix(df)
Is there something I am missing? If not, would you like me to share the data with you? (I do not know which sharing method is most convenient for you.)
It shows this warning. I attached the CSV file.
C:\Users....\sklearn\model_selection_split.py:667: UserWarning:
The least populated class in y has only 1 members, which is less than n_splits=4.
Hi @FlorianWetschoreck, @tkrabel , @SuryaThiru
Quick question: how to properly interpret ppscore?
Say you have a dataset, 3000 rows x 30 columns; you then apply pps.matrix() and sort the values by ppscore. Is there a "rule of thumb" or rational guideline to categorize ppscore levels?
Like the following:
I read this article - https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598 - but couldn't find an answer there.
Thanks a million, Fernando
Hello,
First of all, PPScore is so good, great job! Running PPS alone to reduce the number of features gives better results than other feature elimination techniques, and running it first before other feature elimination techniques ALWAYS yields better results than running pretty much any other feature elimination technique(s) alone.
I just wanted to ask how to change from the F1 score to ROC, specifically precision-recall curves (as I have moderately imbalanced classes), for my binary classification problem.
Thank you for your help in the matter.
Regards,
Achilleas
Hi there,
This appears to be quite a valuable metric, and I am curious: have you published a pre-print or paper on the details surrounding the experiments and calculations?
Thanks.
Nice idea! From a blog post I read that it's not very fast.
If you use randomized trees, you'll make it very much faster, handling hundreds of columns and many thousands of rows in a few seconds. Moreover, because you are fitting a single column, the randomized trees don't have any drawback compared to very advanced trees.
I can think more about it if you need help.
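A standalone sketch of the suggestion (this is not ppscore's internals; it just contrasts sklearn's exhaustive-split tree with its randomized-split variant on a single feature):
import time
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor

X = np.random.uniform(-2, 2, (100_000, 1))
y = X[:, 0] ** 2 + np.random.uniform(-0.5, 0.5, 100_000)

for model in (DecisionTreeRegressor(), ExtraTreeRegressor()):
    start = time.perf_counter()
    # ExtraTreeRegressor draws split thresholds at random instead of
    # searching all of them, which is much cheaper on a single column
    mae = -cross_val_score(model, X, y, cv=4, scoring="neg_mean_absolute_error").mean()
    print(type(model).__name__, round(mae, 4), round(time.perf_counter() - start, 2))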
What about duplicate entries (of a feature-target pair) that could become part of both the train and test datasets? Shouldn't they be removed? Otherwise, it is clear overfitting.
Hi, first of all, thank you for a great tool!
I tried reproducing the experiment from the Jupyter notebooks, but got different results, both for the quadratic function and the Titanic dataset.
For the quadratic function, I am getting 0.67 instead of the 0.88 mentioned in the article. Although that discrepancy might be caused by the randomness of the data, the discrepancy I get with the Titanic dataset is bigger:
in the article, you mention a PPS of 0.67 between TicketId and TicketPrice, whereas reproducing your notebooks, I am getting a score of 0.27.
You can see the steps to reproduce the discrepancy in this notebook, please skip to line 26.
I have Python version 3.6.8, sklearn 0.22 and ppscore version 0.0.2; you can see them in the notebook mentioned above.
Scikit-Learn recently released their first stable version.
There aren't many breaking changes, so I think it would be nice if this could be reflected in the dependencies of PPScore.
Would be happy to help with this as well.
Note: "Issue" is not the correct tag for this as it is more of a comment, but anyone working with noisy data should consider it.
I'm interested in using PPS for feature selection. I work with data that has a high noise-to-signal ratio, and the PPS score is consistently 0 despite changes to parameters such as sample size and number of cross-validation folds. One can easily reproduce this result by changing the "error" term in the example from the "Getting started" section to be uniform -5 to 5 instead of -0.5 to 0.5 and leaving the "x" value as uniform -2 to 2. Does anyone have a similar experience or any insight on the usefulness of PPS for this type of data?
I think a more constructive approach to identifying when a relationship exists between data would be to not RIP correlation. Despite some downsides, it still has its place. Instead, please consider the benefits of using multiple scores that measure relatedness between data when working on your data science projects.
While Pearson correlation measures linear relationships between data, there are several other correlation measures that do not make that assumption and should be examined before abandoning sound mathematical/statistical methods. Spearman correlation measures rank correlation (i.e., it does not assume a linear relationship), and Kendall's tau (and several variants of it, such as Goodman and Kruskal's gamma) measures ordinal association between data with a non-parametric construction. It would be interesting to see how these other correlation measures stack up against PPS in the canonical "0 correlation" scenarios.
Also, on a slightly different topic: One inherent benefit of PPS that its authors have not taken advantage of is its ability to easily be extended to measure predictive power of combinations of features (i.e., interaction terms). The decision tree model used under the hood already supports multiple input features so why not allow "x" to be "x's"? Traditional correlation measures (that I mention above) require the user to assume the form of the interaction between features (is it x1 * x2? x1 / x2? etc.) in order to test the relationship while an extended PPS would not require that assumption. I think this is a very powerful use case for PPS despite its challenge of being visualized in a traditional correlation matrix format.
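A rough sketch of that extension, under stated assumptions (the median baseline and the normalization mimic ppscore's regression case; the interaction y = x1 * x2 is a toy example):
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame(np.random.uniform(-2, 2, (10_000, 2)), columns=["x1", "x2"])
df["y"] = df["x1"] * df["x2"] + np.random.uniform(-0.5, 0.5, len(df))

# the tree sees both features at once, so the interaction needs no assumed form
model_mae = -cross_val_score(
    DecisionTreeRegressor(), df[["x1", "x2"]], df["y"],
    cv=4, scoring="neg_mean_absolute_error",
).mean()
baseline_mae = np.abs(df["y"] - df["y"].median()).mean()  # naive median predictor
print(max(0, 1 - model_mae / baseline_mae))  # PPS-style "multivariate" score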
Sorry for the rant. I do not mean to disparage PPS and the work the authors have done. I think it does have its merits as a practical solution to real world data science problems that we are all facing in our work and with some improvements it can be even more useful.
Hi,
I'm quickly experimenting by implementing ppscore in my pipeline for the assessment of functional connectivity between brain regions, and I noticed two things:
1/ I think we should be able to use pps.matrix() even on a 2D numpy array when we don't have explicit column names; as of now, it raises the error AttributeError: 'numpy.ndarray' object has no attribute 'columns'
2/ I got a strange error telling me that "continuous" is an unknown label:
File "/home/clementpoiret/anaconda3/envs/nilearn/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
Code to reproduce the error:
import numpy as np
import pandas as pd
import ppscore as pps

X = pd.DataFrame(np.random.randn(10, 10))
pps.matrix(X)
The error is solved by passing task='regression'. I have sklearn 0.23.0.
One additional comment: maybe the diagonal of the resulting matrix should be 1, because it makes sense that the predictive power of a vector on itself is 1, no?
Can missing values be treated as a separate category? I sometimes see errors along the lines of "after dropping missing values no valid rows remain".
I see in the doc
"All rows which have a missing value in the feature or the target column are dropped"
This is not desirable, as missingness may be a predictive factor, which lightgbm and xgboost can handle.
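Until something like that exists, one workaround is to encode missingness explicitly before scoring. A minimal sketch (the sentinel "MISSING" and the toy data are arbitrary choices):
import numpy as np
import pandas as pd
import ppscore as pps

df = pd.DataFrame({"f": np.random.choice(["a", "b", None], 1_000)})
df["t"] = (df["f"] == "a").astype(float) + np.random.rand(1_000)
# the sentinel turns missingness into its own category instead of a dropped row
df["f"] = df["f"].fillna("MISSING")
print(pps.score(df, "f", "t"))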
Hello, before anything, thanks for the package; it is very useful, and the overall approach is innovative and generates a lot of efficiency. I have a comment regarding the "state" of the data the PPS analysis is run on: it seems (I may be mistaken) that any transformation of the data beforehand (standardization, for example) will lead to large data leakage into the k-fold cross-validations. Is this correct? The module could use sklearn's pipelining and standard transforms to possibly increase the information generated; would this be of value to the module?
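On the leakage point, the general concern is valid whenever data is transformed before ppscore sees it. A minimal illustration of the pipeline idea in plain sklearn (ppscore itself has no transform hook as far as I know):
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X = np.random.randn(1_000, 1)
y = X[:, 0] ** 2 + np.random.randn(1_000) * 0.1

# inside a pipeline the scaler is re-fit on each training fold only,
# so no test-fold statistics leak into the model
pipe = make_pipeline(StandardScaler(), DecisionTreeRegressor())
print(cross_val_score(pipe, X, y, cv=4, scoring="neg_mean_absolute_error").mean())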
I am having a similar issue to #11. I recreated the sample dataset from their example, but with 100 rows, and made features 2 and 3 both categorical. When I run pps.matrix with features 2 and 3 as dtype=object, I get the expected outcome. However, if I convert those same features to dtype=category, the output of pps.matrix for those features is only 1s and 0s. Is this intentional behavior? I appreciate your help. Thank you.
This works:
# importing the libs
import ppscore as pps
import numpy as np
import pandas as pd

# creating a sample df
col1 = np.random.randn(100) * 10
col2 = np.random.choice(['cat1', 'cat2', 'cat3', 'cat4'], 100)
col3 = np.random.choice(['yes', 'no'], 100)

# creating the dataframe
df = pd.DataFrame(
    {'feature1': col1,
     'feature2': col2,
     'feature3': col3}
)

# calculating the pps matrix
pps.matrix(df)
This does not:
df['feature2'] = pd.Categorical(df.feature2)
df['feature3'] = pd.Categorical(df.feature3)
pps.matrix(df)
Hello,
I can get pps.matrix to work just fine, but I'm having trouble running pps.score.
I get "TypeError: unhashable type: 'numpy.ndarray'".
This is the code I've used:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import ppscore as pps
dataset = pd.read_csv('D:/location/test.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
pps_matrix_results = pps.matrix(dataset)
pps_score_results = pps.score(dataset, x,y)
What am I doing wrong?
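If I read the error right, it comes from passing value arrays where pps.score expects column names. A hedged sketch of the likely fix for a single feature-target pair (assuming the last column of test.csv is the target):
# pps.score takes the dataframe plus two column *names*, not numpy arrays
feature_col = dataset.columns[0]
target_col = dataset.columns[-1]
pps_score_results = pps.score(dataset, feature_col, target_col)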
Has anyone compared RF feature importance to PPS?
My one concern with PPS is that it is a univariate calculation. Even in the matrix, the numbers correspond to a single feature's predictive power on the target.
My inkling is that feature importance can cut across multiple variables when determining which variables provide the most power in that sense.
Just curious. Thanks!
I have a simple question regarding ppscore. When I calculated the correlation between two columns, the result was -0.248, which means that when one increases the other decreases. But when I calculated the ppscore of the same columns, the result was 0.37 from x to y and 0 from y to x, which clearly indicates that x can predict y with a ppscore of 0.37 and y cannot predict x.
But what I actually want to know is the relation between the two columns: whether they are directly proportional (positive) or inversely proportional (negative) to each other.
Thank you,
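That reading is correct: the PPS is non-negative and carries no sign. A small sketch of one common workaround, pairing the PPS with the sign of a rank correlation (toy data):
import numpy as np
import pandas as pd
import ppscore as pps

df = pd.DataFrame({"x": np.arange(200.0)})
df["y"] = -df["x"] + np.random.randn(200) * 5

strength = pps.score(df, "x", "y")["ppscore"]             # how well x predicts y
sign = np.sign(df["x"].corr(df["y"], method="spearman"))  # +1 direct, -1 inverse
print(strength, sign)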
Currently, the PPS score is already very useful and I regularly use it for feature selection and general insights whenever I encounter a new data set. Recently I had an idea to maybe increase the capabilities of the metric.
As mentioned in the article "RIP correlation. Introducing the Predictive Power Score", when using the PPS one should keep in mind that it only captures direct relations, not combinations of input features.
To address this weakness, would it be an idea to give the underlying decision tree 2 variables instead of one? This would take significantly longer, but it gives combinations of variables a chance and might also provide additional information about the input features.
For example, say I have target variable 'y' and input features 'x1, x2, x3, x4'. I apply the pps and find the scores 0.4, 0, 0.4 and 0.6 respectively. Now, as a follow-up, I try all combinations and discover the following:
This would require a slightly different implementation of the algorithm, and before committing to developing it I was wondering if this train of thought makes any sense. Opinions on such an additional feature?
Hello!
I wonder if there is a way to test a single target variable against all others and print the coefficients in descending order?
pps.score(df, "x", "y")
What do I put instead of "x" in that case?
Sorry, I'm rather new to Python.
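For a single target against all other columns, pps.predictors covers this; a minimal sketch, assuming df is your dataframe and "y" your target column (in the versions I have seen, it already returns rows sorted by ppscore):
import ppscore as pps

# one row per candidate feature, scored against the target column "y"
results = pps.predictors(df, "y")
print(results[["x", "ppscore"]])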
Hi,
Currently, PPS can be used with pandas dataframes, can you include the capability to run PPS on koalas dataframes as well?
https://github.com/databricks/koalas
Thanks!
This is an exception to the general rule of respecting the data types, but it still might make sense.
Would it be possible to add the ability to print the decision tree classifier used by pps.score?
There might be a way to do this directly from my end, but I am currently unsure of how to do this.
Thanks, Lauren
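ppscore does not expose its fitted tree, so the closest workaround I know is refitting a comparable sklearn tree yourself and plotting that (a sketch on toy data, not the exact tree pps.score trained):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree

df = pd.DataFrame({"x": np.random.uniform(-2, 2, 1_000)})
df["y"] = df["x"] ** 2

# refit a decision tree like the one ppscore trains internally
clf = tree.DecisionTreeRegressor(max_depth=3)
clf.fit(df[["x"]], df["y"])
tree.plot_tree(clf)
plt.show()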
Since most of the sklearn predictors (especially DT) release the GIL, one could simply wrap the for x in features / for y in features loops into a single threadpool, collecting the results in the end.
Also, I think the prediction task itself could return earlier, even before dropping NaN values, by first checking whether the two columns are identical and, if so, simply appending a 1 (which holds by definition) instead of calling the score function directly.
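A rough sketch of the threadpool idea over the public API (whether the GIL is actually released depends on the sklearn internals in use, so the speedup is an assumption to verify):
import itertools
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
import ppscore as pps

df = pd.DataFrame(np.random.rand(5_000, 4), columns=list("ABCD"))
pairs = list(itertools.permutations(df.columns, 2))

# one matrix cell per (x, y) pair, computed concurrently
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda p: pps.score(df, p[0], p[1]), pairs))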
I want to use ppscore via R by means of reticulate (which I know from Keras in R; see the link for details) and do the following in a fresh R session:
library(reticulate)
conda_list()
This shows the Python environments actually available on my computer:
1 Miniconda3 C:\Miniconda3\python.exe
2 r-reticulate C:\Miniconda3\envs\r-reticulate\python.exe
I choose use_condaenv("r-reticulate"), then py_install("ppscore", pip = TRUE), which produces:
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
All requested packages already installed.
Collecting ppscore
Using cached ppscore-0.0.2.tar.gz (38 kB)
Building wheels for collected packages: ppscore
Building wheel for ppscore (setup.py): started
Building wheel for ppscore (setup.py): finished with status 'done'
Created wheel for ppscore: filename=ppscore-0.0.2-py2.py3-none-any.whl size=9634 sha256=6d04a943bc87ef27f697de2cedef048966e2a56f4f41667949d37f3f4fcebc2c
Stored in directory: c:\users\lf\appdata\local\pip\cache\wheels\fd\39\a8\130eda2ee307e849923caf5b555b0d113ec7f7e8c7de731f9f
Successfully built ppscore
Installing collected packages: ppscore
Successfully installed ppscore-0.0.2
Now I want to import ppscore with ppscore <- import('ppscore'), which produces:
Error in py_module_import(module, convert = convert) :
ModuleNotFoundError: No module named 'sklearn'
Ok, so I do py_install(sklearn), which gives:
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
All requested packages already installed.
Error in conda_install(envname, packages = packages, conda = conda, python_version = python_version, :
Object 'sklearn' not found
This is somewhat surprising, since when I do a pip list at the Anaconda prompt, I get output which clearly shows that the package sklearn is present. And now I am at a loss as to what more I could do to successfully import and use (!) ppscore. I should add that I am a complete newcomer to Python, so I am not at all familiar with Python environments.
I really would appreciate any useful hints - thanks in advance,
Leo
In the readme it is mentioned:
In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows with a fixed random seed (ppscore.RANDOM_SEED). You can adjust the number of rows or skip this sampling via the API. However, in most scenarios the results will be very similar.
What datasets were tested on to make this claim?
It seems highly unlikely that sampling 5000 rows from a dataset with millions of rows would lead to consistent ppscore matrices.
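I cannot speak to which datasets were tested, but the claim is easy to probe on any given dataset via the sample kwarg (per the readme, sample=None skips the subsampling):
import numpy as np
import pandas as pd
import ppscore as pps

df = pd.DataFrame({"x": np.random.uniform(-2, 2, 1_000_000)})
df["y"] = df["x"] ** 2 + np.random.uniform(-0.5, 0.5, len(df))

print(pps.score(df, "x", "y")["ppscore"])               # default 5,000-row sample
print(pps.score(df, "x", "y", sample=None)["ppscore"])  # all rows, much slower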
Hi All,
I found it useful for extracting non-linear relations. My understanding is that the relation ppscore estimates between two variables or features is not impacted by the other available variables when I use pps.matrix() or pps.predictors(). Is my understanding correct?
I found it weird that pps.matrix tends to give different scores for features than pps.predictors() does for the same target column.
Can you check what causes the difference? Is it valid?
I'm using PPScore in my thesis and I was hoping you could create a DOI reference using zenodo so that I can cite it properly.
Here is a guide on how to do it if you are interested. Takes like five minutes!
https://guides.github.com/activities/citable-code/
Vinicius reached out and proposed to maybe use another evaluation metric for classification problems:
Have you used the average_precision metric (average_precision = AUPRC = area under the precision-recall curve)? I think it's a better metric for imbalanced class problems.
Read the post: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
On multiclass problems, it's possible to use the same options as for the F1 score: macro, micro, weighted, etc.
This might be a viable alternative to the F1 Score.
Next steps (can be taken by anyone)
Create experiment notebook which clearly shows the differences/advantages of the AUPRC in certain situations (what is the benefit? how does it change the PPS? What is a suitable naive baseline of the AUPRC?)
Decide about the potential usage of AUPRC
Decide about possible integration into ppscore (change the default? add as an option? ...?)
Hi! I'm wondering if you've compiled your performance tests on different learning algorithms? I'd like to see what you found, in addition to what you mention in the readme.
Thanks a lot in advance
Currently, setup.py doesn't include the package dependencies, so they don't automatically install when using pip install.
To fix this, use the install_requires kwarg in the setup() call in setup.py. Documentation and examples are available here: https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-dependencies
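A minimal sketch of what that could look like (the dependency list and version pins here are assumptions, not taken from the actual ppscore setup.py):
from setuptools import setup, find_packages

setup(
    name="ppscore",
    version="0.0.2",
    packages=find_packages(),
    # install_requires makes pip pull these in automatically
    install_requires=[
        "pandas>=1.0",          # assumed lower bounds
        "scikit-learn>=0.20",
    ],
)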
Hi,
The ppscore is very good; I have tried it out previously and it yielded good results.
I was using the 0.0.2 version, and very recently (about two weeks ago) I tried it out on a dataset where it gave a score to multiple variables. This week I had to upgrade because I received an AttributeError: module 'ppscore' has no attribute 'predictors'.
After the upgrade, only 2 variables yielded ppscores, with all the other variables going to 0.
Nothing else in the code was changed, so I assume this was due to the upgrade. I suspect the 0 values are not correct because I have visually confirmed some distributions and applied statistical tests to some variables (KS and chi-square), and they seemed to be somewhat predictive (plus they previously had scores).
Any clue on what might have happened? Many thanks.