Comments (9)

L-M-Sherlock commented on July 28, 2024

Done in open-spaced-repetition/fsrs-vs-sm17@6d4585a

Expertium commented on July 28, 2024

That's my post, lol. And the conclusion that FSRS is more accurate than the algorithms in this repo is still valid: the confidence intervals don't overlap. I even mentioned that in the post to avoid confusion.

Thankfully, since LMSherlock has plenty of Anki data, the conclusion regarding FSRS vs other algorithms (not SuperMemo) is still valid: FSRS is more accurate than any other open-source algorithm that was used in the benchmark with Anki data.

When I said "FSRS is not (necessarily) better than SuperMemo", I was referring to results from this repo, which are based on very little data.
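
For concreteness, two such intervals fail to overlap exactly when the gap between the means exceeds the sum of the CI half-widths; a tiny check with invented numbers:

    # invented means and 99% CI half-widths for two algorithms
    mean_a, ci_a = 0.050, 0.002
    mean_b, ci_b = 0.060, 0.003
    # non-overlap <=> |mean_a - mean_b| > ci_a + ci_b
    print(abs(mean_a - mean_b) > ci_a + ci_b)  # True: the intervals are disjoint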

Expertium commented on July 28, 2024

So the final code would look like this:
    confidence_interval_99 = 2.576 * np.sqrt(np.cov(metric, aweights=sizes)) / np.sqrt(len(sizes))
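
For reference, np.cov applied to a 1-D array with aweights returns the weighted sample variance, so this line is the usual normal-approximation interval z * s / sqrt(n). A minimal self-contained sketch with made-up numbers:

    import numpy as np

    # made-up per-collection metric values and review counts (the weights)
    metric = np.array([0.05, 0.06, 0.04, 0.07, 0.05])
    sizes = np.array([100, 250, 80, 300, 120])

    # np.cov of a 1-D array with aweights is the weighted sample variance,
    # so this is the normal-approximation 99% CI half-width: z * s / sqrt(n)
    confidence_interval_99 = 2.576 * np.sqrt(np.cov(metric, aweights=sizes)) / np.sqrt(len(sizes))
    print(f"{float(confidence_interval_99):.4f}")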

Expertium commented on July 28, 2024

I did this in two different ways: theoretically and by bootstrapping. I didn't know how to do bootstrapping with weighted data, but thankfully someone had already done that for me.

        import numpy as np
        import scipy.stats

        # theoretical (normal-approximation) CI: z * weighted std / sqrt(n)
        CI_99 = 2.576 * np.sqrt(np.cov(metrics, aweights=sizes)) / np.sqrt(len(sizes))
        print(f"99% CI of {metric} (n_reviews), theoretical: {CI_99:.4f}")

        # scipy.stats.bootstrap resamples the data passed to it, so to resample
        # (value, weight) pairs we bootstrap indices and map them back
        identifiers = [i for i in range(len(metrics))]
        # we create the mapping dictionary: index -> (value, weight)
        data = zip(metrics, sizes)
        dict_x_w = {identifier: (value, weight) for identifier, (value, weight) in enumerate(data)}

        def weighted_mean(z, axis):
            # creating arrays of values and weights by mapping z to dict_x_w
            data = np.vectorize(dict_x_w.get)(z)
            return np.average(data[0], weights=data[1], axis=axis)

        CI_99_bootstrap = scipy.stats.bootstrap((identifiers,), statistic=weighted_mean, confidence_level=0.99, axis=0, method='percentile')
        low, high = CI_99_bootstrap.confidence_interval
        print(f"99% CI of {metric} (n_reviews), bootstrapping (scipy): {(high - low) / 2:.4f}")

And I get quite different results, unfortunately. This suggests that the weighted mean is not normally distributed. I tried this on random data generated from np.random.lognormal() and np.random.normal(), and the bootstrapping method agrees with theory for normally distributed data, but not for lognormal data. So I don't know whether we should use the theoretical method or the bootstrapping method in this case.
Below you can see the confidence intervals for RMSE (bins). I'll go over my math again to see if maybe I messed something up.

[image: confidence intervals for RMSE (bins)]

While at first glance the confidence intervals may look very similar, the numerical values differ by roughly a factor of 1.8 for logloss and by a factor of 1.5 for RMSE.
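
For anyone who wants to reproduce the disagreement, here is a rough, unweighted sketch of the experiment described above (sample sizes and distribution parameters are arbitrary):

    import numpy as np
    import scipy.stats

    rng = np.random.default_rng(0)

    def theoretical_ci(x, z=2.576):
        # normal-approximation 99% CI half-width for the mean
        return z * np.std(x, ddof=1) / np.sqrt(len(x))

    def bootstrap_ci(x):
        # percentile bootstrap 99% CI half-width for the mean
        res = scipy.stats.bootstrap((x,), statistic=np.mean,
                                    confidence_level=0.99, method='percentile')
        low, high = res.confidence_interval
        return (high - low) / 2

    # the two methods roughly agree on normal data but diverge on skewed data
    for name, sample in (("normal", rng.normal(size=300)),
                         ("lognormal", rng.lognormal(sigma=2.0, size=300))):
        print(name, round(theoretical_ci(sample), 4), round(bootstrap_ci(sample), 4))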

Expertium commented on July 28, 2024

I checked everything and I don't think I've made any errors. I tried this again with np.random.lognormal() and np.random.normal(), and found that the greater the skewness of the distribution, the greater the confidence interval given by the bootstrapping method. I believe bootstrapping is superior in that case (and in our case) since it doesn't assume normality and better captures the properties of a non-normal distribution. So I think we should use that for confidence intervals. Here's the code:

    # m holds the per-collection metric dicts (defined earlier); sizes are the weights
    for metric in ("LogLoss", "RMSE", "RMSE(bins)"):
        values = np.array([item[metric] for item in m])
        identifiers = [i for i in range(len(values))]
        # we create the mapping dictionary: index -> (value, weight)
        data = zip(values, sizes)
        dict_x_w = {identifier: (value, weight) for identifier, (value, weight) in enumerate(data)}

        def weighted_mean(z, axis):
            # creating arrays of values and weights by mapping z to dict_x_w
            data = np.vectorize(dict_x_w.get)(z)
            return np.average(data[0], weights=data[1], axis=axis)

        CI_99_bootstrap = scipy.stats.bootstrap((identifiers,), statistic=weighted_mean, confidence_level=0.99, axis=0, method='percentile')
        low, high = CI_99_bootstrap.confidence_interval
        print(f"99% CI of {metric} (n_reviews), bootstrapping (scipy): {(high - low) / 2:.4f}")

(high - low) / 2 is the final value. Then all you have to do is change the current tables:

[image: current benchmark results table]

For example, instead of displaying 0.0533 for RMSE (bins) for FSRS v4, it would display 0.0533±0.0012 (0.0032 is the value that I got using a 3000-collection dataset; the real value should be around 0.0012 since you have a larger dataset).
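
A one-line sketch of the suggested cell format (variable names are hypothetical; the numbers are the ones from this comment):

    rmse_mean, rmse_ci = 0.0533, 0.0012  # hypothetical names for the table values
    print(f"{rmse_mean:.4f}±{rmse_ci:.4f}")  # -> 0.0533±0.0012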

Expertium commented on July 28, 2024

Please do this in the benchmark repos with SM-15 and SM-17 as well.

Expertium commented on July 28, 2024

I want to update my benchmark post, so I would appreciate it if you added confidence intervals here: https://github.com/open-spaced-repetition/fsrs-vs-sm17

aehlke commented on July 28, 2024

I haven't verified this, but am sharing in case it's of interest:

This thread's analysis claims that the confidence intervals show that nothing can be concluded about which algorithm is more effective: https://www.reddit.com/r/Anki/comments/194tkjz/fsrs_is_not_better_than_supermemo_or_why_you/

aehlke commented on July 28, 2024

Thanks. Someone asked me to share that post with this org after I mentioned it, but I have no personal investment in it, so apologies for the inaccuracy stemming from my lack of attention.
