8080labs / ppscore
Predictive Power Score (PPS) in Python
License: MIT License
I was wondering if you explored the option of using pyspark to reduce the running time.
Since all the regression/classification models are independent of each other, it seems like a good candidate for parallelisation.
Would love to discuss more about this if you are interested.
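A minimal sketch of that idea, using joblib processes as a stand-in for pyspark (the toy frame and column names here are assumptions, not from the original report):
import itertools
import numpy as np
import pandas as pd
import ppscore as pps
from joblib import Parallel, delayed

# toy frame; in practice this would be your own data
df = pd.DataFrame(np.random.uniform(-2, 2, (10_000, 4)), columns=list("ABCD"))

# every (x, y) score is an independent model fit, so the calls can run in parallel
pairs = list(itertools.permutations(df.columns, 2))
results = Parallel(n_jobs=-1)(delayed(pps.score)(df, x, y) for x, y in pairs)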
To find the score, two columns (namely the params x and y in ppscore.score()) need to be supplied in the call to score().
It would be nice to have a functionality where the user can get an array of results corresponding to all columns in a single line of code.
For instance, if comparisons between columns 'A', 'B', 'C', and 'D' are to be made, it should be possible to do so with a single line of code. Results could be obtained as a list of comparisons between A-B, A-C, A-D, B-C, B-D, and C-D.
This can be achieved in two ways: (1) a new function, or (2) modifying score for the same. Regarding test cases, it'd be easier to write a new one for a new function (as per 1) rather than modifying the existing implementation.
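A minimal sketch of such a helper (the function name all_pairs is hypothetical; note that pps.matrix already computes the full x-by-y grid):
import itertools
import numpy as np
import pandas as pd
import ppscore as pps

def all_pairs(df, columns):
    # hypothetical helper: one score per unordered column pair (A-B, A-C, ...)
    return [pps.score(df, x, y) for x, y in itertools.combinations(columns, 2)]

df = pd.DataFrame(np.random.rand(100, 4), columns=list("ABCD"))
results = all_pairs(df, ["A", "B", "C", "D"])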
@8080labs I'd like to work on it.
Thank you for creating a power metric to calculate the score for choosing features.
I hope you can release a version that supports calculating this score on a GPU. It would reduce computation time.
Thank you.
Numerical attributes do not always imply regression; they could be categorical attributes for which a classification model should be used.
PyCaret offers such features, where we can explicitly list numerical and categorical features. Something of that kind would greatly increase the usability of this library.
I am running the following line on an all-numerical dataset:
pps.predictors(df, "organic")
It throws the following error:
AttributeError: module 'ppscore' has no attribute 'predictors'
pps.matrix(df) runs fine.
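If I recall correctly, predictors() was only added in a later release (the 1.x line), so this error usually just means an older version is installed. A quick way to check, assuming a standard pip install:
import importlib.metadata
# predictors() is missing from the 0.x releases (assumption: it arrived in 1.0)
print(importlib.metadata.version("ppscore"))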
I would love to use this on time series data, and out of the box it seems to do pretty well, but I don't know if I'm interpreting the score right. I read over the readme at this link: https://github.com/8080labs/ppscore#calculation-of-the-pps
'''
The score is calculated on the test sets of a 4-fold cross-validation (number is adjustable via cross_validation). For classification, stratifiedKFold is used. For regression, normal KFold. Please note that this sampling might not be valid for time series data sets
'''
I don't understand: should I set sample=None for time series, or should I modify the cross_validation kwarg for time series data?
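As far as I can tell, neither kwarg makes the folds time-aware: sample only controls row subsampling and cross_validation only sets the number of folds. A safer route is a chronological holdout computed outside ppscore; here is a rough sketch of a PPS-style score on such a split (toy data, not ppscore internals):
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# toy series; the train/test boundary respects time order, unlike (Stratified)KFold
df = pd.DataFrame({"t": np.arange(1000.0)})
df["y"] = np.sin(df["t"] / 50) + np.random.normal(0, 0.1, len(df))
cut = int(len(df) * 0.75)
train, test = df.iloc[:cut], df.iloc[cut:]

model = DecisionTreeRegressor().fit(train[["t"]], train["y"])
model_mae = mean_absolute_error(test["y"], model.predict(test[["t"]]))
# naive baseline: always predict the training median
baseline_mae = mean_absolute_error(test["y"], np.full(len(test), train["y"].median()))
print(max(0, 1 - model_mae / baseline_mae))  # PPS-style normalization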
Question
At the moment this is just a query: is there anything equivalent to "factor analysis" in PPScore, or does someone know of steps to come up with one given a dataset?
It's possible this is a missing feature but before I make such claims I wanted to inquire and start a discussion.
Additional context
Found these resources related to this topic/concept:
Hi,
I'm having trouble understanding the baseline score. From what I can see in the code, the baseline is the score we get when we are naive and choose the median of y as our predicted value.
That would mean that if the baseline_score is higher than the model_score, then using the median as the predicted value would be better than the prediction from the decision tree. Is this correct?
Btw, thanks for releasing this.
Correlation works perfectly okay but the PPS is giving me weird results
The dataset link is:
dataset6 TO 10.xlsx
Please help me resolve this, I am trying to predict Ev/EBITDA here.
When the PPS is applied to linear relationships with the same error but different slopes, the score varies a lot, e.g. from 0.1 to 0.7, depending on the slope.
This might not be the behaviour that we expect intuitively and normalizing the target does not help.
The reason for this is that the ppscore calculates the ratio of the variance of the predictor to the variance of the baseline. If the slope is steep, the ratio is higher because the baseline makes more errors. If the slope is flat, the variances are nearly the same.
The underlying problem is that the current metric and calculation of the ppscore couples two questions: is there a valid pattern, and does that pattern have low variance?
If either of those two criteria is wrong or weak, the ppscore will be low, too.
Only if both are true will the ppscore be high.
The problem with the linear cases is that the pattern is valid BUT the variance of the pattern is not low because there is a lot of noise, even if the pattern is statistically significant (high error-to-signal ratio).
For this scenario (and maybe also for others), we might want to find a calculation that decouples those two concerns.
Some rough code:
import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]
df["0.3_linear_x"] = 0.3 * df["x"] + df["error"]  # 0.11 pps
df["0.5_linear_x"] = 0.5 * df["x"] + df["error"]  # 0.4 pps
df["1_linear_x"] = 1 * df["x"] + df["error"]  # 0.68 pps
# normalized linear to [0, 1] via +2 and /4
df["1_linear_x_norm"] = (df["1_linear_x"] + 2) / 4  # 0.68 pps, too

# compute the scores quoted in the comments above
for col in ["0.3_linear_x", "0.5_linear_x", "1_linear_x", "1_linear_x_norm"]:
    print(col, pps.score(df, "x", col)["ppscore"])
Hello folks,
I really like the idea of your package and the approach. I was just curious how difficult it might be to introduce a custom CV validation (or even just a time-series-meaningful one).
I could probably assist in that with a bit of guidance from you :)
Thanks,
Anton.
As it is written in your code, if a column has fewer than 15 unique values, it is considered categorical during fitting.
But I am facing the following warning during execution:
Reproducible sample:
# importing the libs
import ppscore as pps
import numpy as np
import pandas as pd

# creating a toy df
# feature columns 1 and 2: continuous
col1 = np.random.randn(10) * 10
col2 = np.random.randn(10) * 10
# feature column 3: low cardinality (range(4) yields at most 4 unique values)
col3 = np.random.choice(range(4), 10)

# creating the dataframe
df = pd.DataFrame(
    {'feature1': col1,
     'feature2': col2,
     'feature3': col3}
)

# trying to calculate the pps matrix
pps.matrix(df)
Warning that I observed:
anaconda3\envs\DL_py37\lib\site-packages\sklearn\model_selection\_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: Unknown label type: 'continuous'
This is not a new issue; an example solution (found on Kaggle) is to temporarily convert the data type of the feature3 column to categorical with LabelEncoder.
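A minimal sketch of that workaround on the toy frame above (astype(str) is used here in place of LabelEncoder; both turn the low-cardinality numeric column into explicit labels):
# make the labels explicit strings so sklearn no longer sees a 'continuous' target
df["feature3"] = df["feature3"].astype(str)
pps.matrix(df)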
I have improved the error-handling code in ppscore. As the same error-handling statements were being used repetitively, I have created a class called Check_Error which makes it quite easy to call the necessary 'raise error' statements. I have tested it too, and only one test failed, but it also fails in the main repo.
As Hacktoberfest has started, I would like to contribute this code as an open source contribution, so it would be great if you could label this issue and the pull request 'hacktoberfest'.
A pull request has been created against the master branch.
I am trying to calculate PPS for a classification task between two categorical columns and get this warning:
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
I included a screenshot of my code with some information on these two columns.
Also, there are no missing values in the data.
Thank you for your help!
Hi guys,
I heard of PPS through your article and was curious to test it. I have tried implementing it on some data I've been working on.
Unfortunately, I get numerous error messages when calculating the pps matrix:
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
My guess is pps is considering my data to be categorical and is therefore trying to apply classification with a huge number of labels.
Looking at how pps determines whether the data is numerical or categorical, I cannot find the reason it would consider my data categorical.
Also, if I try to force the pps score to be calculated using task = 'regression', I get the following error:
'DataFrame' object has no attribute 'dtype'
Here is my code:
import pandas as pd
import ppscore as pps

df = pd.read_csv('seattle_building_energy_benchmark.csv', sep=';')
df.dtypes
df.nunique()

pps.NUMERIC_AS_CATEGORIC_BREAKPOINT = 10

for col in df.columns:
    print(col)
    pps.score(df, x='YearBuilt', y=col, task=None)

for col in df.columns:
    print(col)
    pps.score(df, x='YearBuilt', y=col, task='regression')

pps.matrix(df)
Is there something I am missing? If not, would you like me to share the data with you? (I do not know which sharing method is most convenient for you.)
It shows this warning. I attached the CSV file.
C:\Users....\sklearn\model_selection_split.py:667: UserWarning:
The least populated class in y has only 1 members, which is less than n_splits=4.
Hi @FlorianWetschoreck, @tkrabel , @SuryaThiru
Quick question: how to properly interpret ppscore?
Say you have a dataset, 3000 rows x 30 columns; you then apply pps.matrix() and sort the values by ppscore. Is there a "rule of thumb" or rational guideline to categorize ppscore levels?
Like the following:
I read this article - https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598 - but couldn't find an answer there.
Thanks a million, Fernando
Hello,
First of all, PPScore is so good, great job! Running PPS alone to reduce the number of features gives better results than other feature elimination techniques, and running it first before other feature elimination techniques ALWAYS yields better results than running pretty much any other feature elimination technique(s) alone.
I just wanted to ask how to change from the F1 score to ROC, specifically precision-recall curves (as I have moderately imbalanced classes), for my binary classification problem.
Thank you for your help in the matter.
Regards,
Achilleas
Hi there,
This appears to be quite a valuable metric, and I am curious: have you published a pre-print or paper on the details surrounding the experiments and calculations?
Thanks.
Nice idea! From a blog post I read that it's not very fast.
If you use randomized trees, you'll make it very much faster, handling hundreds of columns and many thousands of rows in a few seconds. Moreover, because you are fitting a single column, the randomized trees don't have any drawback compared to very advanced trees.
I can think more about it if you need help.
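A standalone sketch of the suggestion (this is not ppscore's internals; it just contrasts sklearn's exhaustive-split tree with its randomized-split variant on a single feature):
import time
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor

X = np.random.uniform(-2, 2, (100_000, 1))
y = X[:, 0] ** 2 + np.random.uniform(-0.5, 0.5, 100_000)

for model in (DecisionTreeRegressor(), ExtraTreeRegressor()):
    start = time.perf_counter()
    # ExtraTreeRegressor draws split thresholds at random instead of
    # searching all of them, which is much cheaper on a single column
    mae = -cross_val_score(model, X, y, cv=4, scoring="neg_mean_absolute_error").mean()
    print(type(model).__name__, round(mae, 4), round(time.perf_counter() - start, 2))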
What about duplicate entries (of a feature-target pair) that could become part of both the train and test datasets? Shouldn't they be removed? Otherwise, it is clear overfitting.
Hi, first of all, thank you for a great tool!
I tried reproducing the experiment from the Jupyter notebooks, but got different results, both for the quadratic function and the Titanic dataset.
For the quadratic function, I am getting 0.67 instead of the 0.88 mentioned in the article. Although that discrepancy might be caused by the randomness of the data, the discrepancy I get with the Titanic dataset is bigger:
in the article, you mention a PPS of 0.67 between TicketId and TicketPrice, whereas reproducing your notebooks, I am getting a score of 0.27.
You can see the steps to reproduce the discrepancy in this notebook, please skip to line 26.
I have Python version 3.6.8, sklearn 0.22 and ppscore version 0.0.2; you can see them in the notebook mentioned above.
Scikit-Learn recently released their first stable version.
There aren't many breaking changes, so I think it would be nice if this could be reflected in the dependencies of PPScore.
Would be happy to help with this as well.
Note: "Issue" is not the correct tag for this as it is more of a comment, but anyone working with noisy data should consider it.
I'm interested in using PPS for feature selection. I work with data that has a high noise-to-signal ratio, and the PPS score is consistently 0 despite changes to parameters such as sample size and number of cross-validation folds. One can easily reproduce this result by changing the "error" term in the example from the "Getting started" section to be uniform -5 to 5 instead of -0.5 to 0.5 and leaving the "x" value as uniform -2 to 2. Does anyone have a similar experience or any insight on the usefulness of PPS for this type of data?
I think a more constructive approach to identifying when a relationship exists between data would be to not RIP correlation. Despite some downsides, it still has its place. Instead, please consider the benefits of using multiple scores that measure relatedness between data when working on your data science projects.
While Pearson correlation measures linear relationships between data, there are several other correlation measures that do not make that assumption and should be examined before abandoning sound mathematical/statistical methods. Spearman correlation measures rank correlation (i.e., it does not assume a linear relationship), and Kendall's tau (and several variants of it, such as Goodman and Kruskal's gamma) measures ordinal association between data with a non-parametric construction. It would be interesting to see how these other correlation measures stack up against PPS in the canonical "0 correlation" scenarios.
Also, on a slightly different topic: One inherent benefit of PPS that its authors have not taken advantage of is its ability to easily be extended to measure predictive power of combinations of features (i.e., interaction terms). The decision tree model used under the hood already supports multiple input features so why not allow "x" to be "x's"? Traditional correlation measures (that I mention above) require the user to assume the form of the interaction between features (is it x1 * x2? x1 / x2? etc.) in order to test the relationship while an extended PPS would not require that assumption. I think this is a very powerful use case for PPS despite its challenge of being visualized in a traditional correlation matrix format.
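A rough sketch of that extension, under stated assumptions (the median baseline and the normalization mimic ppscore's regression case; the interaction y = x1 * x2 is a toy example):
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame(np.random.uniform(-2, 2, (10_000, 2)), columns=["x1", "x2"])
df["y"] = df["x1"] * df["x2"] + np.random.uniform(-0.5, 0.5, len(df))

# the tree sees both features at once, so the interaction needs no assumed form
model_mae = -cross_val_score(
    DecisionTreeRegressor(), df[["x1", "x2"]], df["y"],
    cv=4, scoring="neg_mean_absolute_error",
).mean()
baseline_mae = np.abs(df["y"] - df["y"].median()).mean()  # naive median predictor
print(max(0, 1 - model_mae / baseline_mae))  # PPS-style "multivariate" score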
Sorry for the rant. I do not mean to disparage PPS and the work the authors have done. I think it does have its merits as a practical solution to real world data science problems that we are all facing in our work and with some improvements it can be even more useful.
Hi,
I'm quickly experimenting by implementing ppscore in my pipeline for the assessment of functional connectivity between brain regions, and I noticed two things:
1/ I think we should be able to use pps.matrix() even on a 2D numpy array when we don't have explicit column names; as of now, it raises the error AttributeError: 'numpy.ndarray' object has no attribute 'columns'
2/ I got a strange error telling me that "continuous" is an unknown label:
File "/home/clementpoiret/anaconda3/envs/nilearn/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
Code to reproduce the error:
import numpy as np
import pandas as pd
import ppscore as pps

X = pd.DataFrame(np.random.randn(10, 10))
pps.matrix(X)
The error is solved by passing task='regression'. I have sklearn 0.23.0.
One additional comment: maybe the diagonal of the resulting matrix should be 1, because it makes sense that the predictive power of a vector on itself is 1, no?
Can missing values be treated as a separate category? I sometimes see errors along the lines of "after dropping missing values no valid rows remain".
I see in the doc
"All rows which have a missing value in the feature or the target column are dropped"
This is not desirable, as missingness may be a predictive factor, which lightgbm and xgboost can handle.
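Until something like that exists, one workaround is to encode missingness explicitly before scoring. A minimal sketch (the sentinel "MISSING" and the toy data are arbitrary choices):
import numpy as np
import pandas as pd
import ppscore as pps

df = pd.DataFrame({"f": np.random.choice(["a", "b", None], 1_000)})
df["t"] = (df["f"] == "a").astype(float) + np.random.rand(1_000)
# the sentinel turns missingness into its own category instead of a dropped row
df["f"] = df["f"].fillna("MISSING")
print(pps.score(df, "f", "t"))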
Hello, before anything, thanks for the package; it is very useful, and the overall approach is innovative and generates a lot of efficiency. I have a comment regarding the "state" of the data the PPS analysis is run on: it seems (I may be mistaken) that any transformation of the data beforehand (standardization, for example) will lead to large data leakage into the k-fold cross-validations. Is this correct? The module could use sklearn's pipelining and standard transforms to possibly increase the information generated; would this be of value to the module?
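On the leakage point, the general concern is valid whenever data is transformed before ppscore sees it. A minimal illustration of the pipeline idea in plain sklearn (ppscore itself has no transform hook as far as I know):
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X = np.random.randn(1_000, 1)
y = X[:, 0] ** 2 + np.random.randn(1_000) * 0.1

# inside a pipeline the scaler is re-fit on each training fold only,
# so no test-fold statistics leak into the model
pipe = make_pipeline(StandardScaler(), DecisionTreeRegressor())
print(cross_val_score(pipe, X, y, cv=4, scoring="neg_mean_absolute_error").mean())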
I am having a similar issue to #11. I recreated the sample dataset from their example, but with 100 rows, and made features 2 and 3 both categorical. When I run pps.matrix with features 2 and 3 as dtype=object, I get the expected outcome. However, if I convert those same features to dtype=category, the output of pps.matrix for those features is only 1s and 0s. Is this intentional behavior? I appreciate your help. Thank you.
This works:
# importing the libs
import ppscore as pps
import numpy as np
import pandas as pd

# creating a sample df
col1 = np.random.randn(100) * 10
col2 = np.random.choice(['cat1', 'cat2', 'cat3', 'cat4'], 100)
col3 = np.random.choice(['yes', 'no'], 100)

# creating the dataframe
df = pd.DataFrame(
    {'feature1': col1,
     'feature2': col2,
     'feature3': col3}
)

# calculating the pps matrix
pps.matrix(df)
This does not:
df['feature2'] = pd.Categorical(df.feature2)
df['feature3'] = pd.Categorical(df.feature3)
pps.matrix(df)
Hello,
I can get pps.matrix to work just fine, but I'm having trouble running pps.score.
I get "TypeError: unhashable type: 'numpy.ndarray'".
This is the code I've used:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import ppscore as pps
dataset = pd.read_csv('D:/location/test.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
pps_matrix_results = pps.matrix(dataset)
pps_score_results = pps.score(dataset, x,y)
What am I doing wrong?
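If I read the error right, it comes from passing value arrays where pps.score expects column names. A hedged sketch of the likely fix for a single feature-target pair (assuming the last column of test.csv is the target):
# pps.score takes the dataframe plus two column *names*, not numpy arrays
feature_col = dataset.columns[0]
target_col = dataset.columns[-1]
pps_score_results = pps.score(dataset, feature_col, target_col)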
Has anyone compared RF feature importance to PPS?
My one concern with PPS is that it is a univariate calculation. Even in the matrix, the numbers correspond to a single feature's predictive power on the target.
My inkling is that feature importance can cut across multiple variables when determining which variables provide the most power in that sense.
Just curious. Thanks!
I have a simple question regarding ppscore. When I calculated the correlation between two columns, the result was -0.248, which means that when one increases the other decreases. But when I calculated the ppscore of the same columns, the result was 0.37 from x to y and 0 from y to x, which clearly indicates that x can predict y with a ppscore of 0.37 and y cannot predict x.
But what I actually want to know is the relation between the two columns: whether they are directly proportional (positive) or inversely proportional (negative) to each other.
Thank you,
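That reading is correct: the PPS is non-negative and carries no sign. A small sketch of one common workaround, pairing the PPS with the sign of a rank correlation (toy data):
import numpy as np
import pandas as pd
import ppscore as pps

df = pd.DataFrame({"x": np.arange(200.0)})
df["y"] = -df["x"] + np.random.randn(200) * 5

strength = pps.score(df, "x", "y")["ppscore"]             # how well x predicts y
sign = np.sign(df["x"].corr(df["y"], method="spearman"))  # +1 direct, -1 inverse
print(strength, sign)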
Currently, the PPS score is already very useful and I regularly use it for feature selection and general insights whenever I encounter a new data set. Recently I had an idea to maybe increase the capabilities of the metric.
As mentioned in the article "RIP correlation. Introducing the Predictive Power Score", when using the PPS one should keep in mind that it only captures direct relations, not combinations of input features.
To address this weakness, would it be an idea to give the underlying decision tree 2 variables instead of one? This would take significantly longer, but it gives combinations of variables a chance and might also provide additional information about the input features.
For example, say I have target variable 'y' and input features 'x1, x2, x3, x4'. I apply the pps and find the scores 0.4, 0, 0.4 and 0.6 respectively. Now, as a follow-up, I try all combinations and discover the following:
This would require a slightly different implementation of the algorithm, and before committing to developing it I was wondering if this train of thought makes any sense. Opinions on such an additional feature?
Hello!
I wonder if there is a way to test a single target variable against all others and print the coefficients in descending order?
pps.score(df, "x", "y")
What do I put instead of "x" in that case?
Sorry, I'm rather new to Python.
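For a single target against all other columns, pps.predictors covers this; a minimal sketch, assuming df is your dataframe and "y" your target column (in the versions I have seen, it already returns rows sorted by ppscore):
import ppscore as pps

# one row per candidate feature, scored against the target column "y"
results = pps.predictors(df, "y")
print(results[["x", "ppscore"]])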
Hi,
Currently, PPS can be used with pandas dataframes, can you include the capability to run PPS on koalas dataframes as well?
https://github.com/databricks/koalas
Thanks!
This is an exception to the general rule of respecting the data types, but it still might make sense.
Would it be possible to add the ability to print the decision tree classifier used by pps.score?
There might be a way to do this directly from my end, but I am currently unsure of how to do this.
Thanks, Lauren
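ppscore does not expose its fitted tree, so the closest workaround I know is refitting a comparable sklearn tree yourself and plotting that (a sketch on toy data, not the exact tree pps.score trained):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree

df = pd.DataFrame({"x": np.random.uniform(-2, 2, 1_000)})
df["y"] = df["x"] ** 2

# refit a decision tree like the one ppscore trains internally
clf = tree.DecisionTreeRegressor(max_depth=3)
clf.fit(df[["x"]], df["y"])
tree.plot_tree(clf)
plt.show()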
Since most of the sklearn predictors (especially DT) release the GIL, one could simply wrap the for x in features / for y in features loops into a single threadpool, collecting the results in the end.
Also, I think the prediction task itself could return earlier, even before dropping NaN values, by first checking whether the two columns are identical and, if so, simply appending a 1 (which holds by definition) instead of calling the score function directly.
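A rough sketch of the threadpool idea over the public API (whether the GIL is actually released depends on the sklearn internals in use, so the speedup is an assumption to verify):
import itertools
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
import ppscore as pps

df = pd.DataFrame(np.random.rand(5_000, 4), columns=list("ABCD"))
pairs = list(itertools.permutations(df.columns, 2))

# one matrix cell per (x, y) pair, computed concurrently
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda p: pps.score(df, p[0], p[1]), pairs))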
I want to use ppscore via R by means of reticulate (which I know from Keras in R; see the link for details) and do the following in a fresh R session:
library(reticulate)
conda_list()
This shows the Python environments actually available on my computer:
1 Miniconda3 C:\Miniconda3\python.exe
2 r-reticulate C:\Miniconda3\envs\r-reticulate\python.exe
I choose use_condaenv("r-reticulate"), then py_install("ppscore", pip = TRUE), which produces:
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
All requested packages already installed.
Collecting ppscore
Using cached ppscore-0.0.2.tar.gz (38 kB)
Building wheels for collected packages: ppscore
Building wheel for ppscore (setup.py): started
Building wheel for ppscore (setup.py): finished with status 'done'
Created wheel for ppscore: filename=ppscore-0.0.2-py2.py3-none-any.whl size=9634 sha256=6d04a943bc87ef27f697de2cedef048966e2a56f4f41667949d37f3f4fcebc2c
Stored in directory: c:\users\lf\appdata\local\pip\cache\wheels\fd\39\a8\130eda2ee307e849923caf5b555b0d113ec7f7e8c7de731f9f
Successfully built ppscore
Installing collected packages: ppscore
Successfully installed ppscore-0.0.2
Now I want to import ppscore with ppscore <- import('ppscore'), which produces:
Error in py_module_import(module, convert = convert) :
ModuleNotFoundError: No module named 'sklearn'
Ok, so I do py_install(sklearn), which gives:
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
All requested packages already installed.
Error in conda_install(envname, packages = packages, conda = conda, python_version = python_version, :
Object 'sklearn' not found
This is somewhat surprising, since when I do a pip list at the Anaconda prompt, I get output which clearly shows that the package sklearn is present. And now I am at a loss as to what more I could do to successfully import and use (!) ppscore. I should add that I am a complete newcomer to Python, so I am not at all familiar with Python environments.
I really would appreciate any useful hints - thanks in advance,
Leo
In the readme it is mentioned:
In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows with a fixed random seed (ppscore.RANDOM_SEED). You can adjust the number of rows or skip this sampling via the API. However, in most scenarios the results will be very similar.
What datasets were tested on to make this claim?
It seems highly unlikely that sampling 5000 rows from a dataset with millions of rows would lead to consistent ppscore matrices.
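I cannot speak to which datasets were tested, but the claim is easy to probe on any given dataset via the sample kwarg (per the readme, sample=None skips the subsampling):
import numpy as np
import pandas as pd
import ppscore as pps

df = pd.DataFrame({"x": np.random.uniform(-2, 2, 1_000_000)})
df["y"] = df["x"] ** 2 + np.random.uniform(-0.5, 0.5, len(df))

print(pps.score(df, "x", "y")["ppscore"])               # default 5,000-row sample
print(pps.score(df, "x", "y", sample=None)["ppscore"])  # all rows, much slower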
Hi All,
I found it useful for extracting non-linear relations. My understanding is that the relation ppscore estimates between two variables or features is not impacted by the other available variables when I use pps.matrix() or pps.predictors(). Is my understanding correct?
I found it weird that pps.matrix tends to give different scores for features than pps.predictors() does for the same target column.
Can you check what causes the difference? Is it valid?
I'm using PPScore in my thesis and I was hoping you could create a DOI reference using zenodo so that I can cite it properly.
Here is a guide on how to do it if you are interested. Takes like five minutes!
https://guides.github.com/activities/citable-code/
Vinicius reached out and proposed to maybe use another evaluation metric for classification problems:
Have you used the average_precision metric (average_precision = AUPRC = area under the precision-recall curve)? I think it's a better metric for imbalanced class problems.
Read the post: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
On multiclass problems, it's possible to use the same options as for the F1 score: macro, micro, weighted, etc.
This might be a viable alternative to the F1 Score.
Next steps (can be taken by anyone)
Create experiment notebook which clearly shows the differences/advantages of the AUPRC in certain situations (what is the benefit? how does it change the PPS? What is a suitable naive baseline of the AUPRC?)
Decide about the potential usage of AUPRC
Decide about possible integration into ppscore (change the default? add as an option? ...?)
Hi! I'm wondering if you've compiled your performance tests on different learning algorithms? I'd like to see what you found, in addition to what you mention in the readme.
Thanks a lot in advance
Currently, setup.py doesn't include the package dependencies, so they don't automatically install when using pip install.
To fix this, use the install_requires kwarg in the setup() call in setup.py. Documentation and examples are available here: https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-dependencies
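A minimal sketch of what that could look like (the dependency list and version pins here are assumptions, not taken from the actual ppscore setup.py):
from setuptools import setup, find_packages

setup(
    name="ppscore",
    version="0.0.2",
    packages=find_packages(),
    # install_requires makes pip pull these in automatically
    install_requires=[
        "pandas>=1.0",          # assumed lower bounds
        "scikit-learn>=0.20",
    ],
)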
Hi,
The ppscore is very good; I have tried it out previously and it yielded good results.
I was using the 0.0.2 version, and very recently (about two weeks ago) I tried it out on a dataset where it gave a score to multiple variables. This week I had to upgrade because I received an AttributeError: module 'ppscore' has no attribute 'predictors'.
After the upgrade, only 2 variables yielded ppscores, with all the other variables going to 0.
Nothing else in the code was changed, so I assume this was due to the upgrade. I suspect the 0 values are not correct because I have visually confirmed some distributions and applied statistical tests to some variables (KS and chi-square), and they seemed to be somewhat predictive (plus they previously had scores).
Any clue on what might have happened? Many thanks.