
Comments (7)

Ijustwantyouhappy commented on August 22, 2024
from scipy.stats import pearsonr, spearmanr, kendalltau
import ppscore
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()


def plot_data(xy, corrs, ax=None, xlim=None, ylim=None):
    """
    
    :param xy: np.array, shape: n * 2
    :param corrs: [(corr_name, corr_func), ...], corr_func(x, y) -> corr_xy
    :param ax:
    :return: 
    """
    if ax is None:
        _, ax = plt.subplots()
    if xlim is not None:
        ax.set_xlim(xlim)
    if ylim is not None:
        ax.set_ylim(ylim)
    ax.set_frame_on(False)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    # draw the points as pixel markers
    ax.plot(*xy.T, ',')
    title = []
    for name, func in corrs:
        corr_xy = func(xy[:, 0], xy[:, 1])
        corr_yx = func(xy[:, 1], xy[:, 0]) 
        title.append(f"{name}: {np.round(corr_xy, 3)}, {np.round(corr_yx, 3)}")
    ax.set_title('\n'.join(title))


def pearson(x, y):
    return pearsonr(x, y)[0]

def spearman(x, y):
    return spearmanr(x, y)[0]

def kendall(x, y):
    return kendalltau(x, y)[0]

def pps(x, y):
    df = pd.DataFrame({"x": x, "y": y})
    return ppscore.score(df, 'x', 'y')['ppscore']

def rotation(xy, t):
    return np.dot(xy, [[np.cos(t), -np.sin(t)],
                       [np.sin(t), np.cos(t)]])


corr_config = [('pearson', pearson), ('spearman', spearman), ('kendall', kendall), ('pps', pps)]

# image1
corrs = [1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0]
n = 800
_, axes = plt.subplots(1, len(corrs), figsize=(2 * len(corrs), 2.5))
for i, corr in enumerate(corrs):
    cov = [[1, corr], [corr, 1]]  # covariance matrix
    xy = np.random.multivariate_normal([0, 0], cov, n)  # multivariate normal sample
    plot_data(xy, corr_config, axes[i])
plt.tight_layout();


# image2 rotation normal
ts = np.array([0, 1/12, 1/6, 1/4, 1/3, 5/12, 1/2]) * np.pi
n = 1000
xy = np.random.multivariate_normal([0, 0], [[1, 1], [1, 1]], n)
_, axes = plt.subplots(1, len(ts), figsize=(2 * len(ts), 2.5))
for i, t in enumerate(ts):
    xy_rot = rotation(xy, t)
    plot_data(xy_rot, corr_config,
              axes[i], xlim=(-4, 4), ylim=(-4, 4))
plt.tight_layout();


# image3 non-linear situations
_, axes = plt.subplots(1, len(ts), figsize=(2 * len(ts), 2.5))
n = 1000

# fig1
x = np.random.uniform(-1, 1, n)
y = 4 * (x**2 - 0.5)**2 + np.random.uniform(-1, 1, n) / 3
plot_data(np.array([x, y]).T, corr_config, ax=axes[0])

# fig2
y = np.random.uniform(-1, 1, n)
xy = np.array([x, y]).T
xy = rotation(xy, -np.pi / 8)
plot_data(xy, corr_config, ax=axes[1])

# fig3
xy = rotation(xy, -np.pi / 8)
plot_data(xy, corr_config, ax=axes[2])

# fig4
y = 2 * x**2 + np.random.uniform(-1, 1, n)
plot_data(np.array([x, y]).T, corr_config, ax=axes[3])

# fig5
y = (x**2 + np.random.uniform(0, 0.5, n)) * \
        np.array([-1, 1])[np.random.randint(0, 2, size=n)]
plot_data(np.array([x, y]).T, corr_config, ax=axes[4])

# fig6
y = np.cos(x * np.pi) + np.random.uniform(0, 1/8, n)
x = np.sin(x * np.pi) + np.random.uniform(0, 1/8, n)
plot_data(np.array([x, y]).T, corr_config, ax=axes[5])

# fig7
xy1 = np.random.multivariate_normal([3, 3], [[1, 0], [0, 1]], int(n/4))
xy2 = np.random.multivariate_normal([-3, 3], [[1, 0], [0, 1]], int(n/4))
xy3 = np.random.multivariate_normal([-3, -3], [[1, 0], [0, 1]], int(n/4))
xy4 = np.random.multivariate_normal([3, -3], [[1, 0], [0, 1]], int(n/4))
xy = np.concatenate((xy1, xy2, xy3, xy4), axis=0)
plot_data(xy, corr_config, ax=axes[6])

plt.tight_layout();

from ppscore.

Ijustwantyouhappy commented on August 22, 2024
corrs = [1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0]
n = 800
_, axes = plt.subplots(1, len(corrs), figsize=(2 * len(corrs), 2.5))
for i, corr in enumerate(corrs):
    cov = [[1, corr], [corr, 1]]  # covariance matrix
    xy = np.random.multivariate_normal([0, 0], cov, n)  # multivariate normal sample
    plot_data(xy, corr_config, axes[i])
plt.tight_layout();

[figure: scatter plots for corr = 1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0, each titled with its pearson/spearman/kendall/pps scores]

Still on fake data: PPS and the correlations both perform well on perfect linear relations (fig1 and fig7). But when the noise becomes heavier, so that for a single feature x the range of y gets wider, the correlations still perform well while PPS drops almost to 0 (fig2 and fig6), even though we can detect this strong linear relationship by eye. Honestly, this situation is fairly common in practical problems. So, as in my last comment, I think we can't just ignore correlations and visualizations of the data.
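A minimal numpy-only reproduction of that effect (the noise scale and sample size are my own choices; running ppscore on the same data should land near 0, matching fig2 and fig6):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# linear signal with uniform noise of the same variance as the signal
x = rng.uniform(-1, 1, n)
y = x + rng.uniform(-1, 1, n)

# Pearson correlation still clearly detects the trend;
# the theoretical value here is sqrt(1/2) ~ 0.71
corr = np.corrcoef(x, y)[0, 1]
print(round(corr, 2))
```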

I do like the concept of PPS; it provides a novel perspective in EDA. So I tried to use it in my recent work for feature selection. It's an online sales forecasting problem for a global cosmetics giant. We transform this time series forecasting
problem to... Uh, in short, we extract more than one hundred features and build tree-based ensemble models, but unfortunately we only have a little historical data, less than 2 years. Uh, I seem to be too wordy... Because of the privacy policy, I can't show specific graphs or scores here.

The design of PPS only considers a single feature's predictive power for the target, so many features' PPS will be exactly 0, yet they can reach high feature importance when used together with others, for example a feature like Sex in regression problems.
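A toy illustration of this limitation (my own XOR construction, not from this thread): each feature alone carries no information about y, so its single-feature PPS and correlation would both be near 0, yet the pair predicts y exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# target is the XOR of two random binary features
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
y = x1 ^ x2

# x1 alone: no correlation with y, and knowing x1 leaves y a coin flip
r1 = np.corrcoef(x1, y)[0, 1]
p = np.mean(y[x1 == 0])
acc_single = max(p, 1 - p)

# but both features together determine y exactly
acc_joint = np.mean((x1 ^ x2) == y)

print(round(r1, 2), round(acc_single, 2), acc_joint)
```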


FlorianWetschoreck commented on August 22, 2024

Thank you for sharing this analysis @Ijustwantyouhappy
Can you maybe share the code for the analysis and translate the Chinese (?) characters? :)


FlorianWetschoreck commented on August 22, 2024

Great, thank you. :)
Can you please also translate the Chinese characters that you added between some of the graphs in the picture? I am interested in what they say.

It is also fine if you just copy the characters here so that I can run them through Google Translate, but it is a little inconvenient to extract them from the picture.


Ijustwantyouhappy commented on August 22, 2024

Uh... those are just my feelings about these metrics. I'm used to writing some notes after trying out a brand-new tool.

In my opinion, PPS is beyond doubt a creative and informative metric, but if the noise is heavy, or the relationship from x to y is potentially one-to-many, PPS will perform poorly, even worse than the correlations, and the blog post didn't seem to mention this.
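A numpy sketch of the one-to-many case (my own toy construction): y is x squared with a random sign, so the best single-valued prediction of y from x is roughly 0 everywhere, and a point regressor (hence PPS) has almost nothing to aim at, even though the magnitude of y is fully determined by x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

x = rng.uniform(-1, 1, n)
sign = rng.choice([-1.0, 1.0], n)
y = sign * x**2              # one-to-many: each x maps to +x^2 or -x^2

# both the linear correlation and the conditional mean are ~0 ...
corr = np.corrcoef(x, y)[0, 1]

# ... but |y| is an exact function of x
corr_abs = np.corrcoef(x**2, np.abs(y))[0, 1]

print(round(corr, 2), round(corr_abs, 2))
```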


FlorianWetschoreck commented on August 22, 2024

Sure, thank you for sharing your honest thoughts.
To which scenario (in the graph or outside the graph) are you referring where PPS performs worse than the correlations?


FlorianWetschoreck commented on August 22, 2024

Thank you for sharing this comparison and the code.

And I agree that the PPS is worse in those linear cases. We have also thought about reporting the max of the PPS and the correlation, in order not to lose the insights from the correlation, but I think we need some more testing to see if this makes sense. Maybe it also makes sense to merge the PPS with MIC or another score.
Also, the PPS currently has some problems when there are numeric outliers, because they distort the total sum of errors. We hope to find a workaround there, too.
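A small numpy sketch of the outlier issue (the median baseline mirrors ppscore's naive model for regression; the data is a toy example of mine): a single extreme value dominates the naive MAE that the regression PPS normalizes by.

```python
import numpy as np

def naive_mae(y):
    # ppscore-style naive regression baseline: always predict the median
    return np.mean(np.abs(y - np.median(y)))

y = np.linspace(0.0, 1.0, 100)
mae_clean = naive_mae(y)               # ~0.25

# a single extreme outlier blows up the baseline error
y_out = np.append(y, 1000.0)
mae_outlier = naive_mae(y_out)         # ~10, driven by that one point

print(round(mae_clean, 2), round(mae_outlier, 1))
```

Since the regression PPS is 1 - MAE_model / MAE_naive, an inflated denominator can make a mediocre model look deceptively strong on such data.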

And of course I agree that PPS alone is not sufficient for feature selection because we also need to assess the feature importance when using multiple variables in the model.

