Write an article about binned RMSE and cheating calibration metrics (srs-benchmark issue, closed)

open-spaced-repetition commented on September 25, 2024

Write an article about binned RMSE and cheating calibration metrics

from srs-benchmark.

Comments (7)

L-M-Sherlock commented on September 25, 2024

I have invited you to join the organization. After you accept, I will give you write access to the fsrs4anki repository. Then you can edit the following wiki page:

https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Metric


user1823 commented on September 25, 2024

It would be more convenient if I had editing rights too.


L-M-Sherlock commented on September 25, 2024

I'm writing a draft. You can give some advice, or just write some paragraphs.

Introduction

Weighted Root Mean Square Error in Bins (RMSE (bins)) is a metric engineered to evaluate the accuracy of memory prediction by FSRS and other spaced repetition algorithms in our SRS Benchmark.

In our recent algorithm comparison experiment, we found that the RMSE (bins) metric can be gamed in certain cases. To prevent cheating algorithms from obtaining artificially good (that is, low) scores, we modified the definition of the metric.

This article contains three parts:

the old definition of RMSE (bins)
the cheating case
the new definition

Old definition

https://www.reddit.com/r/Anki/comments/15mab6e/fsrs_explained_part_2_accuracy/

The cheating case

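The cheating method is very simple: always output the average probability. Taking weather forecasting as an example, a forecast that always predicts tomorrow's probability of rain as the historical average will have a very low RMSE (bins). All of its predictions fall into a single bin, and within that bin the average prediction matches the observed frequency of rain almost exactly. However, such a forecast is completely useless.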
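To make the cheating case concrete, here is a minimal Python sketch. It assumes the old metric bins reviews by the predicted probability using 20 equal-width bins; the bin count, the function name `rmse_bins_by_prediction`, and the toy rain data are illustrative assumptions, not the benchmark's actual implementation.

```python
import math
from collections import defaultdict


def rmse_bins_by_prediction(predictions, outcomes, n_bins=20):
    """Weighted RMSE over bins formed from the predicted probability itself.

    Assumption: equal-width bins on [0, 1]; the real metric may bin differently.
    """
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the last bin
        bins[index].append((p, y))

    total = sum(len(members) for members in bins.values())
    weighted_sq_error = 0.0
    for members in bins.values():
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        # Each bin contributes its squared gap, weighted by its sample size.
        weighted_sq_error += len(members) * (mean_p - mean_y) ** 2
    return math.sqrt(weighted_sq_error / total)


# Toy data (made up): it rained on 3 of 10 days.
rained = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# The "cheating" forecast: always predict the historical average.
constant_forecast = [sum(rained) / len(rained)] * len(rained)
cheating_score = rmse_bins_by_prediction(constant_forecast, rained)
```

In this sketch `cheating_score` comes out essentially zero (up to floating-point rounding), even though the forecast never distinguishes a rainy day from a dry one.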

New definition

The main difference is the binning method. Instead of grouping the predictions and review outcomes by the predicted probability, the new method groups them based on three features: the interval length, the number of reviews, and the number of lapses.

Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These squared differences are weighted by the number of reviews in each bin and averaged, and the square root of that weighted average is the final RMSE (bins).
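The calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the benchmark's code: the function name `rmse_bins_by_features` and the record layout are hypothetical, and each distinct (interval, reviews, lapses) combination is used directly as a bin, whereas a real implementation would probably group nearby values together.

```python
import math
from collections import defaultdict


def rmse_bins_by_features(reviews):
    """Weighted RMSE over bins keyed by review features, not by the prediction.

    reviews: iterable of (interval, n_reviews, n_lapses, predicted, recalled).
    Assumption: every distinct feature triple is its own bin.
    """
    bins = defaultdict(list)
    for interval, n_reviews, n_lapses, predicted, recalled in reviews:
        bins[(interval, n_reviews, n_lapses)].append((predicted, recalled))

    total = sum(len(members) for members in bins.values())
    if total == 0:
        raise ValueError("at least one review is required")

    weighted_sq_error = 0.0
    for members in bins.values():
        mean_predicted = sum(p for p, _ in members) / len(members)
        mean_recalled = sum(r for _, r in members) / len(members)
        # Squared gap between average prediction and average recall rate,
        # weighted by the number of reviews in the bin.
        weighted_sq_error += len(members) * (mean_predicted - mean_recalled) ** 2
    return math.sqrt(weighted_sq_error / total)


# Toy data (made up): short intervals are recalled well, long ones poorly.
features_and_outcomes = (
    [(1, 1, 0, 1)] * 4                                         # recall rate 1.0
    + [(30, 3, 1, 1), (30, 3, 1, 0), (30, 3, 1, 0), (30, 3, 1, 0)]  # rate 0.25
)
overall_average = sum(r for *_, r in features_and_outcomes) / len(features_and_outcomes)

# Cheating predictor: the same overall average for every review.
constant_reviews = [(i, n, l, overall_average, r) for i, n, l, r in features_and_outcomes]
constant_score = rmse_bins_by_features(constant_reviews)

# Feature-aware predictor: predicts each bin's recall rate.
aware_reviews = [(i, n, l, 1.0 if i == 1 else 0.25, r) for i, n, l, r in features_and_outcomes]
aware_score = rmse_bins_by_features(aware_reviews)
```

With this toy data the constant predictor is off by 0.375 in both bins, so `constant_score` is 0.375, while the feature-aware predictor scores 0. Binning on features the predictor cannot control is what removes the advantage of outputting the average.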

Taking weather forecasting as an example again, we can group the predicted probability of rain by season, temperature, air pressure, and other features. The actual frequency of rain differs from group to group, so the historical average method will obviously perform very poorly on this metric.


Expertium commented on September 25, 2024

It would be more convenient if you made a wiki page and gave me editing rights.


Expertium commented on September 25, 2024

Thank you!


Expertium commented on September 25, 2024

Alright, I think it looks good. @user1823 here: https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Metric. Any feedback is welcome.


user1823 commented on September 25, 2024

I have made some edits. It LGTM.

