Comments (10)

matthewcarbone commented on July 17, 2024

@mikkokotila I can try to work on this but that second link is broken! Would be good to get a better understanding of what is required before I do anything.

mikkokotila commented on July 17, 2024

@x94carbone I think that would be great. I've fixed the link above.

Right now performance.py implements a modified version of the F1 score that is "better" than the regular F1 score in two ways (see the sketch after this list):

  • it avoids giving a high score in corner cases that actually warrant poor scores
  • it works exactly the same for both binary and multi-class tasks (by handling both kinds of prediction output in the same manner)
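A minimal sketch of how I read that idea (an illustration only, not the actual performance.py code): collapse predictions to label vectors in the same way for binary and multi-class output, and return a null result instead of a score when the prediction is degenerate.

```python
# Hypothetical sketch of a corner-case-aware F1; not the actual performance.py
# implementation. Assumes predictions arrive as probabilities (binary) or
# per-class probabilities (multi-class).
import numpy as np
from sklearn.metrics import f1_score


def trustworthy_f1(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Treat binary and multi-class output the same way: reduce to label vectors.
    if y_pred.ndim > 1 and y_pred.shape[1] > 1:
        y_pred = y_pred.argmax(axis=1)
        if y_true.ndim > 1:
            y_true = y_true.argmax(axis=1)
    else:
        y_pred = (y_pred.ravel() >= 0.5).astype(int)
        y_true = y_true.ravel()

    # Corner case: the model predicts a single class for everything.
    # A plain F1 can still look good here, so return a null result instead.
    if len(np.unique(y_pred)) == 1:
        return np.nan

    return f1_score(y_true, y_pred, average='macro')
```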

The question is whether this can reasonably be converted into a Keras metric with the same logic, or whether it is better to take the Keras fmeasure_acc here as a base and apply the modifications from performance.py to it.

Then the second step would be to identify a similar "best of class" objective measure for continuous prediction tasks, and have that in both a Python version (as we have performance.py now) and a Keras version (for callbacks).

What do you think?

matthewcarbone commented on July 17, 2024

Yeah that sounds good. I would also at some point like to give the user some freedom to choose a custom metric, although that will be a challenge. I'm not sure I know what you mean by best of class though. Do you mean implementing a multi-class version of precision/recall/f1?

I'll definitely look into this when I have the time!

mikkokotila commented on July 17, 2024

Ah, my bad with the choice of words. Basically, right now in the world of statistics it seems that F1 gets us close to the best possible objective measure, but not quite there yet (because of the corner cases). With the corner cases fixed, this seems to be the best possible way to handle classification tasks. This naturally leads to the question of what the "gold standard" measure is for continuous prediction tasks.

The current modified F1 works for all kinds of classification tasks; it's modified to solve exactly that problem. I have to test it a lot more to validate it, but as far as I can gather, the way I built it leads to a situation where binary and multi-label tasks are both measured objectively in the same way.

As for the user choosing their own metric, that's already possible: any metric used in model.compile will come into the log.
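For reference, anything passed to the metrics list in model.compile (built-in names or a custom function) ends up in the training history, which is what gets logged. A minimal example, with the metric name being purely illustrative:

```python
# Attaching a custom metric via model.compile; 'balanced_acc_proxy' is an
# illustrative name, not part of Talos.
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense


def balanced_acc_proxy(y_true, y_pred):
    # any symbolic expression over y_true / y_pred can act as a Keras metric
    return K.mean(K.equal(K.round(y_pred), y_true))


model = Sequential([Dense(1, activation='sigmoid', input_shape=(10,))])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc', balanced_acc_proxy])  # both show up in the history/log
```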

With performance.py my focus has been on creating a metric we can "trust" as part of a completely automated pipeline later (where, for example, evolutionary algorithms help handle some of the parts that are still manual).

matthewcarbone commented on July 17, 2024

@mikkokotila sorry for the late response!

OK, so I'm still confused about something. In performance.py it appears that what's being used are just the usual F1 and F-beta scores. What's modified about them? I could just be missing it.

As for a trustworthy metric, that is something I could possibly help with, since I understand that right off the bat. For multi-class classification, we will want to look into the macro F1 score. It is something of an intensive quantity in the sense that it averages each class's F1 score and treats the classes equally, as opposed to what's going on now. This ensures that classes with few entries carry the same weight as classes with many entries.
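A toy imbalanced example of why the averaging choice matters (sklearn used just for illustration):

```python
# Macro vs. weighted F1 under class imbalance: the minority class is never
# predicted, which weighted averaging largely hides and macro averaging exposes.
from sklearn.metrics import f1_score

y_true = [0] * 90 + [1] * 10   # 90 samples of class 0, 10 of class 1
y_pred = [0] * 100             # the minority class is never predicted

print(f1_score(y_true, y_pred, average='weighted'))  # ~0.85, looks acceptable
print(f1_score(y_true, y_pred, average='macro'))     # ~0.47, exposes the failure
```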

Isn't creating a totally generalized trustworthy metric that works for any system a somewhat impossible task? I'm sorry for all the questions, but I still don't totally understand the direction you want to go in with this 👍

matthewcarbone commented on July 17, 2024

Bumping this.

@mikkokotila let me know if you can clarify when you have the time! Thanks! 👌

mikkokotila commented on July 17, 2024

There are two sides to this. The first is: "why is it important to have as unified a metric across all experiments as possible?" My goal would be to have one day a database of millions of experiments all measured against a single objective metric. Currently this lives in the master log, which users could then choose to contribute to the open database. It's important to note that here we are only talking about what is stored in the master log; otherwise users can of course use any metric they like. That said, at this point it does seem reasonable to accept that we will need two metrics: one for category predictions (single, multi-class, multi-label) and one for continuous predictions. performance.py is an attempt at the first one. There is a little bit more about this in the FAQ.

The second side is the question of how performance.py differs from the standard F1. It handles some of the corner cases as "null labels" instead of giving a misleading numeric result. The way it's currently done is focused 100% on the above purpose. If we want a Keras F1 without modifications, we can use the one that is already available in keras_metrics, which is more or less directly from Keras (they removed it from Keras 2, strangely enough).
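For reference, the metric that was removed from Keras 2 (and that keras_metrics carries forward) boils down to something like this, computed with backend ops:

```python
# Sketch of a backend-level F1 in the style of the old Keras 1 fbeta_score /
# fmeasure; note it is evaluated per batch, so it only approximates the
# epoch-level score.
from keras import backend as K


def f1(y_true, y_pred):
    y_pred = K.round(K.clip(y_pred, 0, 1))

    true_positives = K.sum(y_true * y_pred)
    predicted_positives = K.sum(y_pred)
    possible_positives = K.sum(y_true)

    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())
```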

Of course, when we talk about an objective metric, ideally it is also something we would use as the base metric for automatic optimization, callbacks, etc.

matthewcarbone commented on July 17, 2024

"My goal would be to have one day a database of millions of experiments all measured against a single objective metric."

This is interesting, but why? Millions of experiments for the purpose of what? Are these all related experiments? Sorry for all the questions - I'm really interested. Although I think I'm beginning to understand a bit more what you mean by a unified metric. F1 is currently the gold standard I guess since it accounts for class imbalances...

"If we want a Keras F1 without modifications, we can use the one that is already available in keras_metrics, which is more or less directly from Keras (they removed it from Keras 2, strangely enough)."

Yup, precision and recall were removed. Very strange considering they're such important metrics. Their reason:

"Basically these are all global metrics that were approximated batch-wise, which is more misleading than helpful. This was mentioned in the docs but it's much cleaner to remove them altogether. It was a mistake to merge them in the first place."

found here.

Although from the look of it, you've already implemented all the relevant functions in terms of the Keras backend. Isn't that all that's necessary to implement them at the epoch level?
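For what it's worth, the usual way to get a true epoch-level (rather than batch-wise) score is a callback that evaluates the full validation set at the end of each epoch; a minimal sketch, with the class name and validation arrays being assumptions:

```python
# Epoch-level F1 via a Keras callback; EpochF1 and the x_val/y_val arrays are
# illustrative, not part of Talos.
from keras.callbacks import Callback
from sklearn.metrics import f1_score


class EpochF1(Callback):
    def __init__(self, x_val, y_val):
        super(EpochF1, self).__init__()
        self.x_val = x_val
        self.y_val = y_val

    def on_epoch_end(self, epoch, logs=None):
        # predict on the whole validation set, so the score is not a
        # batch-wise approximation
        preds = (self.model.predict(self.x_val).ravel() >= 0.5).astype(int)
        score = f1_score(self.y_val, preds, average='macro')
        print('epoch %d val_f1: %.4f' % (epoch + 1, score))

# usage: model.fit(x, y, validation_data=(x_val, y_val),
#                  callbacks=[EpochF1(x_val, y_val)])
```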

mikkokotila commented on July 17, 2024

@x94carbone sorry, missed this.

The reason for logging the results of many experiments (regardless of the prediction type) has to do with the potential value such data will have for better understanding a) the hyperparameter optimization problem and b) the optimization of the hyperparameter optimization process itself.

Yes, I've already implemented the fscore etc. from the Keras backend (the old version), but this does not quite do what Performance does, i.e. treat single-class, multi-class, and multi-label prediction tasks all in the same way (an objective metric across all those kinds of tasks), nor does it deal with the F-score corner cases, e.g. when there are many positives and few negatives and everything is predicted as positive.
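That corner case is easy to reproduce with a plain F1, which still comes out looking strong even though the model has learned nothing:

```python
# Many positives, few negatives, everything predicted positive:
# the unmodified F1 is misleadingly high for a useless constant model.
from sklearn.metrics import f1_score

y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100             # constant "always positive" prediction

print(f1_score(y_true, y_pred))  # ~0.97
```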

mikkokotila commented on July 17, 2024

Closing this to make way for the soon-to-happen inclusion of sklearn.metrics, and then gradually making the most important ones available at the Keras backend level for epoch-by-epoch evaluation.
