Comments (9)
Done in open-spaced-repetition/fsrs-vs-sm17@6d4585a
from srs-benchmark.
That's my post, lol. And the conclusion that FSRS is more accurate than algorithms in this repo is still valid: confidence intervals don't overlap. I even mentioned that in the post to avoid confusion.
Thankfully, since LMSherlock has plenty of Anki data, the conclusion regarding FSRS vs other algorithms (not SuperMemo) is still valid: FSRS is more accurate than any other open-source algorithm that was used in the benchmark with Anki data.
When I said "FSRS is not (necessarily) better than SuperMemo", I was referring to results from this repo, which are based on very little data.
So the final code would look like this:
confidence_interval_99 = 2.576 * np.sqrt(np.cov(metric, aweights=sizes)) / np.sqrt(len(sizes))
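For context, here is a self-contained sketch of that theoretical formula on made-up data (the per-user metric values and review-count weights below are invented purely for illustration):

```python
import numpy as np

# Hypothetical example data: per-user metric values and review counts (weights)
rng = np.random.default_rng(42)
metrics = rng.normal(0.35, 0.05, size=500)   # e.g. per-user LogLoss
sizes = rng.integers(100, 10_000, size=500)  # number of reviews per user

# Weighted variance of the metric via np.cov(aweights=...), then the usual
# 99% normal-approximation half-width: z * s / sqrt(n)
ci_99 = 2.576 * np.sqrt(float(np.cov(metrics, aweights=sizes))) / np.sqrt(len(sizes))
print(f"99% CI half-width: {ci_99:.4f}")
```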
I did this in 2 different ways: theoretical and by using bootstrapping. I didn't know how to do bootstrapping with weighted data, but thankfully someone already did that for me.
import numpy as np
import scipy.stats

CI_99 = 2.576 * np.sqrt(np.cov(metrics, aweights=sizes)) / np.sqrt(len(sizes))
print(f"99% CI of {metric} (n_reviews), theoretical: {CI_99:.4f}")

identifiers = [i for i in range(len(metrics))]
# create the mapping dictionary: identifier -> (metric value, weight)
data = zip(metrics, sizes)
dict_x_w = {identifier: (value, weight) for identifier, (value, weight) in enumerate(data)}

def weighted_mean(z, axis):
    # create arrays of values and weights by mapping z to dict_x_w
    data = np.vectorize(dict_x_w.get)(z)
    return np.average(data[0], weights=data[1], axis=axis)

CI_99_bootstrap = scipy.stats.bootstrap((identifiers,), statistic=weighted_mean, confidence_level=0.99, axis=0, method='percentile')
low, high = CI_99_bootstrap.confidence_interval
print(f"99% CI of {metric} (n_reviews), bootstrapping (scipy): {(high - low) / 2:.4f}")
And I get quite different results, unfortunately. This suggests that the weighted mean is not normally distributed. I tried this on random data generated from np.random.lognormal() and np.random.normal(), and the bootstrapping method agrees with theory for normally distributed data, but not for lognormal data. So I don't know whether we should use the theoretical method or the bootstrapping method in this case.
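To make that normal-vs-lognormal comparison concrete, here is a minimal unweighted sketch of the check (the sample sizes and distribution parameters are arbitrary; the weighted version follows the same pattern with the identifier-mapping trick):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ci_half_widths(values):
    """Return (theoretical, bootstrap) 99% CI half-widths for the mean."""
    # Normal-approximation half-width: z * s / sqrt(n)
    theoretical = 2.576 * values.std(ddof=1) / np.sqrt(len(values))
    # Percentile bootstrap of the sample mean
    res = stats.bootstrap((values,), np.mean, confidence_level=0.99,
                          method='percentile', random_state=0)
    boot = (res.confidence_interval.high - res.confidence_interval.low) / 2
    return theoretical, boot

for name, sample in (("normal", rng.normal(1.0, 0.5, 300)),
                     ("lognormal", rng.lognormal(0.0, 1.0, 300))):
    t, b = ci_half_widths(sample)
    print(f"{name}: theoretical={t:.4f}, bootstrap={b:.4f}")
```

For the normal sample the two half-widths come out close; for the skewed lognormal sample they diverge, which is the behavior described above.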
Below you can see the confidence intervals for RMSE (bins). I'll go over my math again, to see if maybe I messed something up.
While at first glance the confidence intervals may look very similar, numerical values differ by roughly a factor of 1.8 for logloss and by a factor of 1.5 for RMSE.
I checked everything and I don't think I've made any errors. I tried this again with np.random.lognormal() and np.random.normal(), and found that the greater the skewness of the distribution, the greater the confidence interval given by the bootstrapping method. I believe bootstrapping is superior in that case (and in our case) since it doesn't assume normality and better captures the properties of a non-normal distribution. So I think we should use that for confidence intervals. Here's the code:
for metric in ("LogLoss", "RMSE", "RMSE(bins)"):
    values = np.array([item[metric] for item in m])
    identifiers = [i for i in range(len(values))]
    data = zip(values, sizes)
    dict_x_w = {identifier: (value, weight) for identifier, (value, weight) in enumerate(data)}

    def weighted_mean(z, axis):
        # create arrays of values and weights by mapping z to dict_x_w
        data = np.vectorize(dict_x_w.get)(z)
        return np.average(data[0], weights=data[1], axis=axis)

    CI_99_bootstrap = scipy.stats.bootstrap((identifiers,), statistic=weighted_mean, confidence_level=0.99, axis=0, method='percentile')
    low, high = CI_99_bootstrap.confidence_interval
    print(f"99% CI of {metric} (n_reviews), bootstrapping (scipy): {(high - low) / 2:.4f}")
(high - low) / 2 is the final value. Then all you have to do is change the current tables:
For example, instead of displaying 0.0533 for RMSE (bins) for FSRS v4, it would display 0.0533±0.0012 (0.0032 is the value that I got using a 3000 collections dataset, the real value should be around 0.0012 since you have a larger dataset).
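For rendering the table cells, a tiny helper like the following would do (format_cell is a hypothetical name; only the mean±half-width pattern comes from the comment above):

```python
def format_cell(mean, ci_half_width):
    """Render a metric together with its 99% CI half-width for the results table."""
    return f"{mean:.4f}±{ci_half_width:.4f}"

print(format_cell(0.0533, 0.0012))  # → 0.0533±0.0012
```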
Please do this in benchmark repos with SM-15 and SM-17 as well.
I want to update my benchmark post, so I would appreciate if you added confidence intervals here: https://github.com/open-spaced-repetition/fsrs-vs-sm17
I haven't verified this but am sharing in case of interest:
This thread's analysis claims that the confidence intervals show that nothing can be concluded about which one is more effective https://www.reddit.com/r/Anki/comments/194tkjz/fsrs_is_not_better_than_supermemo_or_why_you/
Thx, someone asked me to share that post with this org after I mentioned it, but I have no personal investment in it, so apologies for the inaccuracy from my lack of attention.