
Comments (7)

Ijustwantyouhappy commented on August 22, 2024
from scipy.stats import pearsonr, spearmanr, kendalltau
import ppscore
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()


def plot_data(xy, corrs, ax=None, xlim=None, ylim=None):
    """
    
    :param xy: np.array, shape: n * 2
    :param corrs: [(corr_name, corr_func), ...], corr_func(x, y) -> corr_xy
    :param ax:
    :return: 
    """
    if ax is None:
        _, ax = plt.subplots()
    if xlim is not None:
        ax.set_xlim(xlim)
    if ylim is not None:
        ax.set_ylim(ylim)
    ax.set_frame_on(False)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    # draw the points as pixel markers
    ax.plot(*xy.T, ',')
    title = []
    for name, func in corrs:
        corr_xy = func(xy[:, 0], xy[:, 1])
        corr_yx = func(xy[:, 1], xy[:, 0]) 
        title.append(f"{name}: {np.round(corr_xy, 3)}, {np.round(corr_yx, 3)}")
    ax.set_title('\n'.join(title))


def pearson(x, y):
    return pearsonr(x, y)[0]

def spearman(x, y):
    return spearmanr(x, y)[0]

def kendall(x, y):
    return kendalltau(x, y)[0]

def pps(x, y):
    df = pd.DataFrame({"x": x, "y": y})
    return ppscore.score(df, 'x', 'y')['ppscore']

def rotation(xy, t):
    return np.dot(xy, [[np.cos(t), -np.sin(t)],
                       [np.sin(t), np.cos(t)]])


corr_config = [('pearson', pearson), ('spearman', spearman), ('kendall', kendall), ('pps', pps)]

# image1
corrs = [1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0]
n = 800
_, axes = plt.subplots(1, len(corrs), figsize=(2 * len(corrs), 2.5))
for i, corr in enumerate(corrs):
    cov = [[1, corr], [corr, 1]]  # covariance matrix
    xy = np.random.multivariate_normal([0, 0], cov, n)  # multivariate normal sample
    plot_data(xy, corr_config, axes[i])
plt.tight_layout();


# image2 rotation normal
ts = np.array([0, 1/12, 1/6, 1/4, 1/3, 5/12, 1/2]) * np.pi
n = 1000
xy = np.random.multivariate_normal([0, 0], [[1, 1], [1, 1]], n)
_, axes = plt.subplots(1, len(ts), figsize=(2 * len(ts), 2.5))
for i, t in enumerate(ts):
    xy_rot = rotation(xy, t)
    plot_data(xy_rot, corr_config,
              axes[i], xlim=(-4, 4), ylim=(-4, 4))
plt.tight_layout();


# image3 non-linear situations
_, axes = plt.subplots(1, len(ts), figsize=(2 * len(ts), 2.5))
n = 1000

# fig1
x = np.random.uniform(-1, 1, n)
y = 4 * (x**2 - 0.5)**2 + np.random.uniform(-1, 1, n) / 3
plot_data(np.array([x, y]).T, corr_config, ax=axes[0])

# fig2
y = np.random.uniform(-1, 1, n)
xy = np.array([x, y]).T
xy = rotation(xy, -np.pi / 8)
plot_data(xy, corr_config, ax=axes[1])

# fig3
xy = rotation(xy, -np.pi / 8)
plot_data(xy, corr_config, ax=axes[2])

# fig4
y = 2 * x**2 + np.random.uniform(-1, 1, n)
plot_data(np.array([x, y]).T, corr_config, ax=axes[3])

# fig5
y = (x**2 + np.random.uniform(0, 0.5, n)) * \
        np.array([-1, 1])[np.random.randint(0, 2, size=n)]
plot_data(np.array([x, y]).T, corr_config, ax=axes[4])

# fig6
y = np.cos(x * np.pi) + np.random.uniform(0, 1/8, n)
x = np.sin(x * np.pi) + np.random.uniform(0, 1/8, n)
plot_data(np.array([x, y]).T, corr_config, ax=axes[5])

# fig7
xy1 = np.random.multivariate_normal([3, 3], [[1, 0], [0, 1]], int(n/4))
xy2 = np.random.multivariate_normal([-3, 3], [[1, 0], [0, 1]], int(n/4))
xy3 = np.random.multivariate_normal([-3, -3], [[1, 0], [0, 1]], int(n/4))
xy4 = np.random.multivariate_normal([3, -3], [[1, 0], [0, 1]], int(n/4))
xy = np.concatenate((xy1, xy2, xy3, xy4), axis=0)
plot_data(xy, corr_config, ax=axes[6])

plt.tight_layout();

from ppscore.

Ijustwantyouhappy commented on August 22, 2024
corrs = [1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0]
n = 800
_, axes = plt.subplots(1, len(corrs), figsize=(2 * len(corrs), 2.5))
for i, corr in enumerate(corrs):
    cov = [[1, corr], [corr, 1]]  # covariance matrix
    xy = np.random.multivariate_normal([0, 0], cov, n)  # multivariate normal sample
    plot_data(xy, corr_config, axes[i])
plt.tight_layout();

[figure: scatter plots for corr = 1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0, each titled with its pearson/spearman/kendall/pps scores]

Still on fake data: PPS and the correlations both perform well on perfect linear relations (fig1 and fig7). But when the noise becomes heavier, so that for a single feature x the range of y gets wider, the correlations still perform well while PPS drops almost to 0 (fig2 and fig6), even though we can detect this strong linear relationship by eye. Honestly, this situation is fairly common in practical problems. So, as in my last comment, I think we can't just ignore correlations and visualizations of the data.
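A minimal numpy-only reproduction of that effect (the noise scale and sample size are my own choices; running ppscore on the same data should land near 0, matching fig2 and fig6):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# linear signal with uniform noise of the same variance as the signal
x = rng.uniform(-1, 1, n)
y = x + rng.uniform(-1, 1, n)

# Pearson correlation still clearly detects the trend;
# the theoretical value here is sqrt(1/2) ~ 0.71
corr = np.corrcoef(x, y)[0, 1]
print(round(corr, 2))
```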

I do like the concept of PPS; it provides a novel perspective in EDA. So I tried to use it in my recent work for feature selection. It's an online sales forecasting problem for a global cosmetics giant. We transform this time series forecasting
problem to... Uh, in short, we extract more than one hundred features and build tree-based ensemble models, but unfortunately we only have a little historical data, less than 2 years. Uh, I seem to be too wordy... Because of the privacy policy, I can't show specific graphs or scores here.

The design of PPS only considers a single feature's predictive power for the target, so many features' PPS will be exactly 0, yet they can reach high feature importance when used together with others, for example a feature like Sex in regression problems.
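A toy illustration of this limitation (my own XOR construction, not from this thread): each feature alone carries no information about y, so its single-feature PPS and correlation would both be near 0, yet the pair predicts y exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# target is the XOR of two random binary features
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
y = x1 ^ x2

# x1 alone: no correlation with y, and knowing x1 leaves y a coin flip
r1 = np.corrcoef(x1, y)[0, 1]
p = np.mean(y[x1 == 0])
acc_single = max(p, 1 - p)

# but both features together determine y exactly
acc_joint = np.mean((x1 ^ x2) == y)

print(round(r1, 2), round(acc_single, 2), acc_joint)
```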


FlorianWetschoreck commented on August 22, 2024

Thank you for sharing this analysis @Ijustwantyouhappy
Can you maybe share the code for the analysis and translate the Chinese (?) characters? :)


FlorianWetschoreck commented on August 22, 2024

Great, thank you. :)
Can you please also translate the Chinese characters that you added between some of the graphs in the picture? I am interested in what they say.

It is also fine if you just copy the characters here so that I can run them through Google Translate, but it is a little inconvenient to extract them from the picture.


Ijustwantyouhappy commented on August 22, 2024

Uh... those are just my feelings about these metrics. I'm used to writing some notes after trying out a brand-new tool.

In my opinion, PPS is beyond doubt a creative and informative metric, but if the noise is heavy, or the relationship from x to y is potentially one-to-many, PPS will perform poorly, even worse than the correlations, and the blog post didn't seem to mention this.
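A numpy sketch of the one-to-many case (my own toy construction): y is x squared with a random sign, so the best single-valued prediction of y from x is roughly 0 everywhere, and a point regressor (hence PPS) has almost nothing to aim at, even though the magnitude of y is fully determined by x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

x = rng.uniform(-1, 1, n)
sign = rng.choice([-1.0, 1.0], n)
y = sign * x**2              # one-to-many: each x maps to +x^2 or -x^2

# both the linear correlation and the conditional mean are ~0 ...
corr = np.corrcoef(x, y)[0, 1]

# ... but |y| is an exact function of x
corr_abs = np.corrcoef(x**2, np.abs(y))[0, 1]

print(round(corr, 2), round(corr_abs, 2))
```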


FlorianWetschoreck commented on August 22, 2024

Sure, thank you for sharing your honest thoughts.
To which scenario (in the graph or outside the graph) are you referring where PPS performs worse than the correlations?


FlorianWetschoreck commented on August 22, 2024

Thank you for sharing this comparison and the code.

And I agree that the PPS is worse in those linear cases. We have also thought about reporting the max of the PPS and the correlation, in order not to lose the insights from the correlation, but I think we need some more testing to see if this makes sense. Maybe it also makes sense to merge the PPS with MIC or another score.
Also, the PPS currently has some problems when there are numeric outliers, because they distort the total sum of errors. We hope to find a workaround there, too.
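A small numpy sketch of the outlier issue (the median baseline mirrors ppscore's naive model for regression; the data is a toy example of mine): a single extreme value dominates the naive MAE that the regression PPS normalizes by.

```python
import numpy as np

def naive_mae(y):
    # ppscore-style naive regression baseline: always predict the median
    return np.mean(np.abs(y - np.median(y)))

y = np.linspace(0.0, 1.0, 100)
mae_clean = naive_mae(y)               # ~0.25

# a single extreme outlier blows up the baseline error
y_out = np.append(y, 1000.0)
mae_outlier = naive_mae(y_out)         # ~10, driven by that one point

print(round(mae_clean, 2), round(mae_outlier, 1))
```

Since the regression PPS is 1 - MAE_model / MAE_naive, an inflated denominator can make a mediocre model look deceptively strong on such data.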

And of course I agree that PPS alone is not sufficient for feature selection because we also need to assess the feature importance when using multiple variables in the model.

