Comments (9)
Done in open-spaced-repetition/fsrs-vs-sm17@6d4585a
from srs-benchmark.
That's my post, lol. And the conclusion that FSRS is more accurate than algorithms in this repo is still valid: confidence intervals don't overlap. I even mentioned that in the post to avoid confusion.
Thankfully, since LMSherlock has plenty of Anki data, the conclusion regarding FSRS vs other algorithms (not SuperMemo) is still valid: FSRS is more accurate than any other open-source algorithm that was used in the benchmark with Anki data.
When I said "FSRS is not (necessarily) better than SuperMemo", I was referring to results from this repo, which are based on very little data.
So the final code would look like this:
confidence_interval_99 = 2.576 * np.sqrt(np.cov(metric, aweights=sizes)) / np.sqrt(len(sizes))
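For context, here is a self-contained sketch of that theoretical formula on made-up data (the per-user metric values and review-count weights below are invented purely for illustration):

```python
import numpy as np

# Hypothetical example data: per-user metric values and review counts (weights)
rng = np.random.default_rng(42)
metrics = rng.normal(0.35, 0.05, size=500)   # e.g. per-user LogLoss
sizes = rng.integers(100, 10_000, size=500)  # number of reviews per user

# Weighted variance of the metric via np.cov(aweights=...), then the usual
# 99% normal-approximation half-width: z * s / sqrt(n)
ci_99 = 2.576 * np.sqrt(float(np.cov(metrics, aweights=sizes))) / np.sqrt(len(sizes))
print(f"99% CI half-width: {ci_99:.4f}")
```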
I did this in 2 different ways: theoretical and by using bootstrapping. I didn't know how to do bootstrapping with weighted data, but thankfully someone already did that for me.
import numpy as np
import scipy.stats

CI_99 = 2.576 * np.sqrt(np.cov(metrics, aweights=sizes)) / np.sqrt(len(sizes))
print(f"99% CI of {metric} (n_reviews), theoretical: {CI_99:.4f}")

identifiers = [i for i in range(len(metrics))]
# create the mapping dictionary: identifier -> (metric value, weight)
data = zip(metrics, sizes)
dict_x_w = {identifier: (value, weight) for identifier, (value, weight) in enumerate(data)}

def weighted_mean(z, axis):
    # create arrays of values and weights by mapping z to dict_x_w
    data = np.vectorize(dict_x_w.get)(z)
    return np.average(data[0], weights=data[1], axis=axis)

CI_99_bootstrap = scipy.stats.bootstrap((identifiers,), statistic=weighted_mean, confidence_level=0.99, axis=0, method='percentile')
low, high = CI_99_bootstrap.confidence_interval
print(f"99% CI of {metric} (n_reviews), bootstrapping (scipy): {(high - low) / 2:.4f}")
And I get quite different results, unfortunately. This suggests that the weighted mean is not normally distributed. I tried this on random data generated from np.random.lognormal() and np.random.normal(), and the bootstrapping method agrees with theory for normally distributed data, but not for lognormal data. So I don't know whether we should use the theoretical method or the bootstrapping method in this case.
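To make that normal-vs-lognormal comparison concrete, here is a minimal unweighted sketch of the check (the sample sizes and distribution parameters are arbitrary; the weighted version follows the same pattern with the identifier-mapping trick):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ci_half_widths(values):
    """Return (theoretical, bootstrap) 99% CI half-widths for the mean."""
    # Normal-approximation half-width: z * s / sqrt(n)
    theoretical = 2.576 * values.std(ddof=1) / np.sqrt(len(values))
    # Percentile bootstrap of the sample mean
    res = stats.bootstrap((values,), np.mean, confidence_level=0.99,
                          method='percentile', random_state=0)
    boot = (res.confidence_interval.high - res.confidence_interval.low) / 2
    return theoretical, boot

for name, sample in (("normal", rng.normal(1.0, 0.5, 300)),
                     ("lognormal", rng.lognormal(0.0, 1.0, 300))):
    t, b = ci_half_widths(sample)
    print(f"{name}: theoretical={t:.4f}, bootstrap={b:.4f}")
```

For the normal sample the two half-widths come out close; for the skewed lognormal sample they diverge, which is the behavior described above.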
Below you can see the confidence intervals for RMSE (bins). I'll go over my math again, to see if maybe I messed something up.
While at first glance the confidence intervals may look very similar, numerical values differ by roughly a factor of 1.8 for logloss and by a factor of 1.5 for RMSE.
I checked everything and I don't think I've made any errors. I tried this again with np.random.lognormal() and np.random.normal(), and found that the greater the skewness of the distribution, the greater the confidence interval given by the bootstrapping method. I believe bootstrapping is superior in that case (and in our case) since it doesn't assume normality and better captures the properties of a non-normal distribution. So I think we should use that for confidence intervals. Here's the code:
for metric in ("LogLoss", "RMSE", "RMSE(bins)"):
    values = np.array([item[metric] for item in m])
    identifiers = [i for i in range(len(values))]
    data = zip(values, sizes)
    dict_x_w = {identifier: (value, weight) for identifier, (value, weight) in enumerate(data)}

    def weighted_mean(z, axis):
        # create arrays of values and weights by mapping z to dict_x_w
        data = np.vectorize(dict_x_w.get)(z)
        return np.average(data[0], weights=data[1], axis=axis)

    CI_99_bootstrap = scipy.stats.bootstrap((identifiers,), statistic=weighted_mean, confidence_level=0.99, axis=0, method='percentile')
    low, high = CI_99_bootstrap.confidence_interval
    print(f"99% CI of {metric} (n_reviews), bootstrapping (scipy): {(high - low) / 2:.4f}")
(high - low) / 2 is the final value. Then all you have to do is change the current tables:
For example, instead of displaying 0.0533 for RMSE (bins) for FSRS v4, it would display 0.0533±0.0012 (0.0032 is the value that I got using a 3000 collections dataset, the real value should be around 0.0012 since you have a larger dataset).
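For rendering the table cells, a tiny helper like the following would do (format_cell is a hypothetical name; only the mean±half-width pattern comes from the comment above):

```python
def format_cell(mean, ci_half_width):
    """Render a metric together with its 99% CI half-width for the results table."""
    return f"{mean:.4f}±{ci_half_width:.4f}"

print(format_cell(0.0533, 0.0012))  # → 0.0533±0.0012
```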
Please do this in benchmark repos with SM-15 and SM-17 as well.
I want to update my benchmark post, so I would appreciate if you added confidence intervals here: https://github.com/open-spaced-repetition/fsrs-vs-sm17
I haven't verified this but am sharing in case of interest:
This thread's analysis claims that the confidence intervals show that nothing can be concluded about which one is more effective https://www.reddit.com/r/Anki/comments/194tkjz/fsrs_is_not_better_than_supermemo_or_why_you/
Thx, someone asked me to share that post with this org after I mentioned it, but I have no personal investment in it, so apologies for the inaccuracy from my lack of attention.