Hi, first of all, thank you for a great tool! I tried reproducing th

Getting different results from the ones in article about ppscore HOT 8 CLOSED

8080labs commented on August 22, 2024

Getting different results from the ones in article

from ppscore.

Comments (8)

8080labs commented on August 22, 2024

Hi Ivan,

thank you for pointing this out. We will have a further look into this and adjust the article.

The main reason for the difference should be the randomness. In the Titanic example, your calculation and mine most likely used different splits for the cross-validation sets.

In addition, the score seems to be unstable because of the specific relationships:

First, TicketId has many unique values: 60% of the values are unique. In total, only 49 rows (5%) belong to TicketIds that occur 5 times or more. The maximum value count is 7.

The score with the mismatch is the score from TicketPrice to TicketId.

For a given TicketId, the TicketPrice is always constant. So the model can perfectly predict the TicketPrice if it has already seen the TicketId (e.g. for the 40% of rows that belong to TicketIds occurring more than once). This is also the reason why I was not suspicious of the high score.

In the other direction, from TicketPrice to TicketId, the relationship is sometimes ambiguous. Here it would be interesting to see how the choice of the cross-validation splits affects the splits of the DecisionTreeClassifier.
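To make the split dependence in the TicketId to TicketPrice direction more concrete, here is a minimal sketch using only the Python standard library. It does not use ppscore or the Titanic data: the rows are synthetic, and `make_rows` and `seen_fraction_per_fold` are hypothetical helper names introduced for illustration. Because the price is constant per ticket id, the share of test rows whose id also appears in the training folds is the share a purely memorizing model could predict exactly, and that share shifts with the shuffle seed.

```python
import random


def make_rows(seed=0):
    """Build synthetic (ticket_id, price) rows; the price is constant per ticket id."""
    rng = random.Random(seed)
    rows = []
    for ticket_id in range(60):
        price = round(rng.uniform(5, 100), 2)
        # Most ids appear once; a minority appear several times.
        copies = 1 if ticket_id < 40 else rng.randint(2, 4)
        rows.extend((ticket_id, price) for _ in range(copies))
    return rows


def seen_fraction_per_fold(rows, n_folds=4, seed=0):
    """For each fold, return the share of test rows whose id occurs in the training folds."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    fractions = []
    for i, test in enumerate(folds):
        train_ids = {
            tid for j, fold in enumerate(folds) if j != i for tid, _ in fold
        }
        hits = sum(1 for tid, _ in test if tid in train_ids)
        fractions.append(hits / len(test))
    return fractions


rows = make_rows()
for seed in (1, 2, 3):
    # Different seeds give different splits; the same seed reproduces the same split.
    print(seed, [round(f, 2) for f in seen_fraction_per_fold(rows, seed=seed)])
```

Fixing the seed makes the per-fold numbers repeatable, which is the same idea as pinning the seed when comparing two runs of the score.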

To better understand the variability, we might need to dig deeper, but those are some first observations. As next steps, it would be interesting to plot the actual predictions of the models during cross-validation. It would also be interesting to look at how the final F1 is built.

Best,
Florian

lucazav commented on August 22, 2024

In order to guarantee the reproducibility of results, I'm trying to use the random_seed parameter in the matrix function. But for any value passed to the parameter, I get a matrix of all zeros except for the diagonal (all ones).

Am I doing anything wrong?

FlorianWetschoreck commented on August 22, 2024

Can you please share the full code of your analysis? Then I can have a look at it.
Also, which version of ppscore are you using?

lucazav commented on August 22, 2024

@FlorianWetschoreck my fault! Now it's working like a charm.

FlorianWetschoreck commented on August 22, 2024

@lucazav happy to hear that :)

ibuda commented on August 22, 2024

Gentlemen, your conversation triggered my interest in checking whether the results have changed since I last posted here. I upgraded all the modules used in the article and re-ran all the examples. I noticed an interesting new thing, different from the previous execution. I assume its cause is the same as the one @FlorianWetschoreck describes, but still, it is worth exploring (IMHO). The coefficient between the Class and Survived features is now 0.
You can check the notebook here, in line 34.
Note: This is in no way a critique; on the contrary, it is an attempt to discuss interesting behavior. As before, I am very grateful for your contribution to the data science community.

FlorianWetschoreck commented on August 22, 2024

Hi Ivan,
In ppscore 1.0 we changed the way ppscore chooses the case (classification or regression) based on the data columns. Currently, Survived (as it is encoded in the Titanic dataset) is seen as numeric, and thus ppscore tries a regression. That should be the source of the difference, but we will have a look again.
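As a rough illustration of why treating a 0/1 target as a regression problem can push a score to 0, here is a standard-library sketch. It rests on several assumptions that do not come from this thread: that the regression score is computed as 1 - MAE_model / MAE_baseline with a median-predicting baseline and a floor at 0, and that the model predicts the mean outcome of each group. The numbers are made up and are not the Titanic figures.

```python
from statistics import mean, median

# Made-up 0/1 outcomes per group; these are NOT the Titanic numbers.
groups = {
    "first": [1] * 6 + [0] * 4,    # 60% positives
    "second": [1] * 5 + [0] * 5,   # 50% positives
    "third": [1] * 5 + [0] * 15,   # 25% positives
}


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


y_true, y_model = [], []
for outcomes in groups.values():
    group_mean = mean(outcomes)  # assumed model: predict the mean of each group
    y_true.extend(outcomes)
    y_model.extend([group_mean] * len(outcomes))

# Assumed baseline: always predict the median of the target.
baseline = median(y_true)
mae_baseline = mae(y_true, [baseline] * len(y_true))
mae_model = mae(y_true, y_model)

# Assumed normalization: 1 - MAE_model / MAE_baseline, floored at 0.
score = max(0.0, 1 - mae_model / mae_baseline)
print(mae_baseline, mae_model, score)
```

Under these assumptions the median baseline is 0, and predicting fractional group means yields a larger absolute error than always predicting 0, so the normalized score is floored at 0 even though the groups clearly differ in their share of positives.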

ibuda commented on August 22, 2024

@FlorianWetschoreck Thank you for explaining the difference. I see no reason to keep this issue open, since the discrepancy comes from the difference in approach between the previous and current versions. I think it would be appropriate to update the article with the latest information.
