Comments (8)

8080labs commented on August 22, 2024

Hi Ivan,

Thank you for pointing this out. We will take a closer look at this and adjust the article.

The main reason for the difference should be randomness. In the Titanic example, your calculation and mine most likely used different splits for the cross-validation sets.

In addition, the score seems to be unstable because of the specific relationships:

First, TicketId has many distinct values: 60% of the rows have a TicketId that occurs only once. In total, only 49 rows (5%) belong to TicketIds which occur 5 times or more. The maximum value count is 7.
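
A minimal sketch of how such counts can be computed with the standard library. The `ticket_ids` list is a made-up toy column, not the Titanic data, so its numbers differ from the ones above.

```python
from collections import Counter

# Hypothetical toy column standing in for TicketId (values are invented).
ticket_ids = ["A1", "A1", "B2", "C3", "C3", "C3", "D4", "E5", "F6", "G7"]

counts = Counter(ticket_ids)   # occurrences per ticket id
total_rows = len(ticket_ids)

# Share of rows whose ticket id occurs exactly once.
singleton_row_share = sum(1 for c in counts.values() if c == 1) / total_rows

# Share of rows whose ticket id occurs at least 3 times.
frequent_row_share = sum(c for c in counts.values() if c >= 3) / total_rows

# Largest number of rows sharing one ticket id.
max_count = max(counts.values())

print(singleton_row_share, frequent_row_share, max_count)
```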

The score with the mismatch is the score from TicketPrice to TicketId.

For a given TicketId the TicketPrice is always constant. So, the model can perfectly predict the TicketPrice if it has already seen the TicketId before (e.g., for the 40% of rows which belong to TicketIds that occur more than once). This is also the reason why I was not suspicious of the high score.

In the other direction, from TicketPrice to TicketId, the relationship is sometimes ambiguous. Here it would be interesting to see how the choice of the cross-validation splits affects the splits of the DecisionTreeClassifier.
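
The asymmetry between the two directions can be checked directly. This is only a sketch on invented `(ticket_id, ticket_price)` pairs, not on the Titanic data: it tests whether each id maps to a single price, and which prices map to several ids.

```python
from collections import defaultdict

# Hypothetical (ticket_id, ticket_price) rows; the values are invented.
rows = [
    ("A1", 7.25), ("A1", 7.25),
    ("B2", 71.28),
    ("C3", 8.05), ("C3", 8.05),
    ("D4", 8.05),
]

prices_by_ticket = defaultdict(set)
tickets_by_price = defaultdict(set)
for ticket_id, price in rows:
    prices_by_ticket[ticket_id].add(price)
    tickets_by_price[price].add(ticket_id)

# TicketId -> TicketPrice: True if every id has exactly one price.
price_is_determined = all(len(p) == 1 for p in prices_by_ticket.values())

# TicketPrice -> TicketId: prices that are shared by more than one id.
ambiguous_prices = {p for p, ids in tickets_by_price.items() if len(ids) > 1}

print(price_is_determined, ambiguous_prices)
```

In this toy data the price follows from the id, while the price 8.05 belongs to two different ids, so the reverse lookup cannot be unique.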

To better understand the variability we might need to dig deeper, but those are already some first observations. As next steps, it would be interesting to plot the actual predictions of the models during cross-validation. It would also be interesting to look at how the final F1 is built.
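
To illustrate how the split alone can move the result, here is a rough sketch that does not use ppscore or scikit-learn. The helper `seen_share_per_fold` and the toy `ticket_ids` column are hypothetical. For each fold it reports the share of test rows whose id also appears in the training rows, which is the share a model could look up.

```python
import random

# Hypothetical TicketId column: 24 ids occur once, 8 ids occur twice (40 rows).
ticket_ids = [f"T{i}" for i in range(24)] + [f"S{i % 8}" for i in range(16)]

def seen_share_per_fold(ids, n_folds, seed):
    """Shuffle row indices with `seed`, cut them into `n_folds` folds and
    return, per fold, the share of test rows whose id occurs in training."""
    indices = list(range(len(ids)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    shares = []
    for test in folds:
        test_set = set(test)
        train_ids = {ids[i] for i in indices if i not in test_set}
        seen = sum(1 for i in test if ids[i] in train_ids)
        shares.append(seen / len(test))
    return shares

for seed in (0, 1, 2):
    print(seed, [round(s, 2) for s in seen_share_per_fold(ticket_ids, 4, seed)])
```

The same seed always yields the same folds, while different seeds typically give different per-fold shares. That is one plausible source of run-to-run differences in a cross-validated score.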

Best,
Florian

from ppscore.

lucazav commented on August 22, 2024

To guarantee reproducible results, I'm trying to use the random_seed parameter in the matrix function. But whatever value I pass to the parameter, I get a matrix of all zeros except for the diagonal (all ones).

Am I doing anything wrong?

FlorianWetschoreck commented on August 22, 2024

Can you please share the full code of your analysis? Then I can have a look at it.
Also, which version of ppscore are you using?

lucazav commented on August 22, 2024

@FlorianWetschoreck my fault! Now it's working like a charm.

FlorianWetschoreck commented on August 22, 2024

@lucazav happy to hear that :)

ibuda commented on August 22, 2024

Gentlemen, your conversation prompted me to check whether the results have changed since I last posted here. I upgraded all the modules used in the article and re-ran all the examples. I noticed something new and interesting that differs from the previous execution. I assume the cause is the same one @FlorianWetschoreck describes, but it is still worth exploring (IMHO). The coefficient between the Class and Survived features is now 0.
You can check the notebook here, at line 34.
Note: This is in no way a critique. On the contrary, it is an attempt to discuss interesting behavior. As before, I am very grateful for your contribution to the data science community.

FlorianWetschoreck commented on August 22, 2024

Hi Ivan,
in ppscore 1.0 we changed the way ppscore chooses the case (classification or regression) based on the data columns. Currently, Survived (as it is encoded in the Titanic dataset) is seen as numeric, so ppscore treats the task as a regression. That should be the source of the difference, but we will have another look.
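
A rough sketch of why a regression on a 0/1 target can end up at 0. It assumes the regression case is scored by mean absolute error (MAE) relative to a naive model that always predicts the median, with negative results floored at 0. That scoring rule is an assumption here, not something taken from this thread. The group data is invented and is not the Titanic data.

```python
from statistics import mean, median

# Hypothetical 0/1 target grouped by a three-level feature (values invented).
groups = {
    "first":  [1, 1, 1, 0, 0],
    "second": [1, 0, 1, 0],
    "third":  [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
}
all_targets = [y for ys in groups.values() for y in ys]

# Naive baseline: always predict the overall median of the target.
baseline = median(all_targets)
naive_mae = mean(abs(y - baseline) for y in all_targets)

# Regression-style model: predict the mean of each group, as a
# squared-error regression tree leaf would.
model_mae = mean(abs(y - mean(ys)) for ys in groups.values() for y in ys)

# Assumed normalisation: 1 - model_mae / naive_mae, floored at 0.
regression_score = max(0.0, 1 - model_mae / naive_mae)

print(naive_mae, model_mae, regression_score)
```

Although the feature clearly separates the groups, predicting group means gives a larger MAE than predicting the median 0 everywhere, so the score under these assumptions is 0.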

ibuda commented on August 22, 2024

@FlorianWetschoreck Thank you for explaining the difference. I see no reason to keep this issue open, since the discrepancy comes from the difference in approach between the previous and the current version. I think it would be appropriate to update the article with the latest information.
