open-spaced-repetition / srs-benchmark
A benchmark for spaced repetition schedulers/algorithms
Home Page: https://github.com/open-spaced-repetition/fsrs4anki/wiki
Here's my code:
biLSTM.zip
It's basically other.py, but you don't need to specify the model; just run `set DEV_MODE=1 && python biLSTM.py`. I removed the other algorithms and changed the LSTM class. The problem is that I don't get any errors, but I don't get any output files either. You said that it could be a RAM issue, but I tried setting n_hidden to 1 and still got no output. Since there are no errors, I found it hard to debug. So I want you to do 2 things:
Would it be possible to include any of the boosting models, like CatBoost, LightGBM, or XGBoost? They are very good with tabular time-series data such as this, and there is already a neural network in the comparison.
Thank you for this very interesting analysis! If you all feel inclined to include it, I'd be curious to see how Ebisu compares.
We can identify people who misuse Hard based on 2 criteria:
So here's what I want you, LMSherlock, to do:
This will ensure that the default parameters are not affected by the misuses of Hard.
I have an idea for how to measure the degree of "cheatiness" of an algorithm (a sketch is below, after the list):
1. Do the same procedure that you do for plotting the calibration graph.
2. Record the number of values in the densest bin, i.e., the height of the highest bar.
3. Divide it by the total number of reviews. For a cheating algorithm, this will be 100%, since there is only one bin, so 100% of reviews fall into that bin.
4. Do this for every user for a given algorithm.
5. Calculate the (unweighted) average.
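A minimal sketch of this metric (assuming 20 equal-width bins over [0, 1]; the actual calibration-graph binning may differ):

import numpy as np

def densest_bin_fraction(predictions, n_bins=20):
    # fraction of one user's predicted R values that land in the densest bin;
    # 1.0 means every prediction falls into a single bin
    counts, _ = np.histogram(predictions, bins=n_bins, range=(0.0, 1.0))
    return counts.max() / counts.sum()

# per_user_predictions: hypothetical list of arrays, one per user
cheatiness = np.mean([densest_bin_fraction(p) for p in per_user_predictions])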
From a theoretical point of view, the issue is that the cutoff will be arbitrary. If the average is 90%, meaning that on average 90% of predicted R values fall within the same bin, is it cheating? What about 80%? Or 50%?
From a practical point of view, this will require re-running every single algorithm since this information cannot be obtained from .json result files right now. At the very least, you will have to re-run FSRS-4.5, ACT-R and DASH[ACT-R], since we are sure that FSRS-4.5 isn't cheating, and ACT-R algorithms are the main suspects. But of course, to get a better idea of what values of this metric are good and what values are bad, you need to re-run the entire benchmark.
Also, this is not intended to be included in the readme. It's for our internal testing.
Jones, M. N. (Ed.). (2016). Predicting and Improving Memory Retention: Psychological Theory Matters in the Big Data Era. In Big Data in Cognitive Science (pp. 43–73). Psychology Press. https://doi.org/10.4324/9781315413570-8
Randazzo, Giacomo. (2020-21). Memory Models for Spaced Repetition Systems (Tesi di Laurea Magistrale in Mathematical Engineering - Ingegneria Matematica, Politecnico di Milano). Advisor: Marco D. Santambrogio. Retrieved from https://hdl.handle.net/10589/186407
It's not very important, but I do want to see how well (or, rather, how poorly) it performs.
https://memrise.zendesk.com/hc/en-us/articles/360015889057-How-does-the-spaced-repetition-system-work-
Next review in: 4 hours > 12 hours > 24 hours > 6 days > 12 days > 48 days > 96 days > 180 days.
My ancient Python code:
import numpy as np

def memrise(history):
    # Memrise's fixed interval ladder, in days
    intervals = [1, 6, 12, 24, 48, 96, 180]
    ivl = 0
    reps = 0
    for delta_t, rating in history:
        # unpack tensor scalars
        delta_t = delta_t.item()
        rating = rating.item() + 1
        if rating > 1:
            # success: climb to the next rung, capped at the last interval
            reps += 1
            if reps > 7:
                reps = 7
            ivl = intervals[reps - 1]
        else:
            # lapse: reset to the first interval
            ivl = 1
            reps = 1
    return ivl

dataset['memrise_interval'] = dataset['tensor'].map(memrise)
dataset['memrise_p'] = np.exp(np.log(0.9) * dataset['delta_t'] / dataset['memrise_interval'])
The obvious issue is that it's unclear what retention level it aims for; I guess we should use 90%. I tried searching for anything that could give me a hint, but found nothing. Btw, what about HLR? I don't know what retention it aims for either.
As for the description, you can add something like: "Memrise, the algorithm used by the language learning app Memrise."
FSRS4 ignores it, but FSRS-rs creates a file for it, meaning that the file needs to be removed manually for the comparison to be fair.
In #28 a link to a pre-processed small dataset was shared.
While testing different ways of converting review logs of different spacing algorithms to FSRS, my evaluation on ~7000 reviews generated using an Emacs Lisp implementation of py-fsrs suggests that updating the difficulty and stability for reviews with an interval greater than 1 day is slightly better than using the (re)learning/review states of the py-fsrs implementation. To make sure I didn't make any mistakes in my evaluation code, and to test on larger datasets, I'd like to retry this experiment using the code and datasets of this benchmark, but I can't do so with "tiny_dataset.zip" because the delta_t values have been rounded to days. Would it be possible to get access to a similar dataset, either in an unprocessed format or with floating-point delta_t values?
This seems to be related to a difference between how the benchmark and the optimizer implement the FSRS algorithm (using the first review of each day, as I understand it) and how it's implemented in e.g. py-fsrs (using states to decide when to update the parameters). I'm not sure how to compare the two approaches other than using review logs from FSRS and testing whether the recall prediction would have been more accurate if we had included reviews that occurred in the (re)learning state but after a sufficiently large interval or on a different day.
Although transformers would probably give the best performance with enough training and hyperparameter tweaking, I suspect that a gradient-boosted decision tree ensemble might outperform FSRS with very little tweaking, using a methodology similar to this: https://machinelearningmastery.com/xgboost-for-time-series-forecasting/. It would, however, be a much heavier model, with many more parameters than even the LSTM that was attempted.
This is something I'd be interested in exploring if I could have access to the training data. A rough sketch of what I have in mind is below.
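For illustration, a minimal sketch of that lagged-feature approach applied to review logs. The `reviews` DataFrame with card_id/review_th/delta_t/rating columns matches the dataset format described elsewhere in this thread; everything else (feature choices, hyperparameters) is just a guess:

import pandas as pd
import xgboost as xgb

def make_features(reviews: pd.DataFrame) -> pd.DataFrame:
    # per-card lag features: previous interval, previous rating, review count
    reviews = reviews.sort_values(["card_id", "review_th"])
    g = reviews.groupby("card_id")
    reviews["prev_delta_t"] = g["delta_t"].shift(1).fillna(-1)
    reviews["prev_rating"] = g["rating"].shift(1).fillna(0)
    reviews["n_reviews"] = g.cumcount()
    return reviews

df = make_features(reviews)
X = df[["delta_t", "prev_delta_t", "prev_rating", "n_reviews"]]
y = (df["rating"] > 1).astype(int)  # recall = any rating above "Again"

model = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X, y)
p_recall = model.predict_proba(X)[:, 1]  # predicted probability of recall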
Hey, first of all thank you for the dataset. I was wondering if you could provide some more details for people who want to train their own ML algorithms but might not be familiar with the internals of Anki.
Let me see if I understand the dataset correctly:
What is the target variable? And how are you handling the time series for RNNs?
open-spaced-repetition/fsrs4anki#493 (comment)
Now that Dae has provided us with far more data, I'd like you to update the benchmark repo and include the following algorithms:
LSTM (short-term) is optional IMO. I'm planning to make a reddit post, and I will need those 8 algorithms. Since benchmarking will take a lot more CPU time now, I can also help you speed it up by doing some of it myself, if you want me to. Also, please make the dataset downloadable using a `python download_data.py` command; I want to redo my test of button usage and RMSE.
Currently, each user is treated as an isolated dataset, and the results are aggregated as weighted averages under various schemes.
It would be helpful to allow models that learn from one user and can be applied to others, even without card data. A simple one would just take the current FSRS or LSTM benchmark and apply regularization to the parameters relative to the individual mean (or the mean can be its own parameter); see the sketch below.
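A minimal sketch of that regularization idea, not the benchmark's actual code; `user_params` and `population_mean` are hypothetical tensors:

import torch

def regularized_loss(base_loss, user_params, population_mean, lam=0.01):
    # L2 penalty pulling each user's parameters toward the population mean
    penalty = lam * torch.sum((user_params - population_mean) ** 2)
    return base_loss + penalty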
This will be the new issue to discuss this, as I was polluting the previous issue.
@L-M-Sherlock here's some interesting data from my preliminary testing with sqrt(n) in pretrain: 107,489,380 reviews from 2,988 users. All three estimates (half-range, half-sample, kernel density) seem to agree with one another in most, but not all, cases. Below are the values of the modes obtained by these three estimators, after sorting the values:
[0.3534, 0.3601, 0.4564]
[0.9, 0.9, 0.903]
[2.3, 2.5091, 2.5861]
[10.9, 10.9, 365.0]
[5.0375, 5.049, 5.0491]
[1.0596, 1.0597, 1.0598]
[0.7406, 0.8595, 0.86]
[0.0, 0.0, 0.0]
[1.49, 1.49, 1.49]
[0.1, 0.1, 0.1]
[0.94, 0.94, 0.9405]
[2.1257, 2.18, 2.1803]
[0.01, 0.01, 0.01]
[0.3399, 0.34, 0.34]
[1.26, 1.26, 2.0]
[0.0, 0.0, 0.0]
[2.61, 2.61, 4.0]
Just to clarify: to calculate the final value, I use all three estimators (HRM, HSK, KDE) and take the average of the two closest ones. This isn't used in the data above.
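A sketch of that combination rule (HRM/HSK/KDE labels as above):

def combine_modes(estimates):
    # sort the three mode estimates and average whichever adjacent pair is closer
    a, b, c = sorted(estimates)
    return (a + b) / 2 if (b - a) <= (c - b) else (b + c) / 2

combine_modes([10.9, 10.9, 365.0])  # -> 10.9, the outlier is discarded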
S0 for "Easy" is the problem. It's either the default value of 10.9, or the max. value, so modes don't help in this case.
Calculating the confidence interval is very easy if we assume that the values of the metrics (RMSE, log loss, etc.) are normally distributed. I really hope they are; otherwise, this will get complicated.
CI = z * (st. dev.) / sqrt(N)
For a 99% confidence interval, z = 2.576; sqrt(N) is just the square root of the number of collections, and st. dev. is the standard deviation of the selected metric.
So all you need are the RMSE and log loss values for every collection, which, of course, you have. There is one tricky part, however: since we are using a weighted average (with the number of reviews as the weight), the standard deviation also has to be weighted. I found a short and elegant solution after a little bit of googling:
np.sqrt(np.cov(values, aweights=weights))
Here, `values` would be the log loss or RMSE of each collection, and `weights` would be the number of reviews of each collection.
This should make our benchmark more rigorous as well as give us an idea of whether some algorithm actually performs better than another or whether it's just a fluke.
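Putting it together, a minimal sketch (assuming per-collection metric values and review counts are already loaded):

import numpy as np

def weighted_ci(values, weights, z=2.576):
    # weighted mean of the metric and the 99% CI half-width, assuming normality
    mean = np.average(values, weights=weights)
    stdev = np.sqrt(np.cov(values, aweights=weights))  # weighted st. dev.
    return mean, z * stdev / np.sqrt(len(values))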
This isn't high priority at all, of course, but I would like to see how well a Transformer neural network can perform.
Slightly unrelated, but please add the number of parameters used by LSTM to the description.
Also, ideally, the number of parameters in LSTM and in the Transformer should be roughly the same to make the comparison fair and clearly see how much the difference in architecture matters.
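If it helps, a quick way to count trainable parameters, assuming both models are PyTorch nn.Modules (`model` is a hypothetical variable here):

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)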
https://github.com/maimemo/SSP-MMC-Plus
Is there a reason this isn't added to the benchmark?
Relevant paper.
Main formulas: (reconstructed below, since the screenshots didn't copy over).
The appendix has example calculations.
It only has 4 parameters: decay intercept (a), decay scale (c), activation threshold (τ), and noise (s). It seems fairly straightforward to implement. In the paper, τ and s are fixed, but I think we should make them optimizable; I don't see why they should be fixed.
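My reading of the equations, in the form they're usually published (a reconstruction; verify against the paper):

% activation after n presentations, where t_k is the age of the k-th memory trace:
m_n = \ln\left( \sum_{k=1}^{n} t_k^{-d_k} \right)
% per-trace decay, with decay intercept a and decay scale c:
d_k = c \, e^{m_{k-1}} + a
% probability of recall, with activation threshold \tau and noise s:
p(m) = \frac{1}{1 + e^{(\tau - m)/s}}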
Btw, a reminder: please add the Transformer to the table with p-values.
I tried reading this code, and while I don't really understand how it works, it seems that there is some sort of training procedure. So I'm surprised that it performs about as poorly as SM-2.
revlogs2dataset.zip
Here are stats_pb2.py and revlogs2dataset.py.
Also, here are the 10 revlogs:
10.zip
For file 1, I expected this result:
card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
1,4,-1,3
1,14,0,1
1,16,0,1
1,21,0,3
1,30,0,3
1,111,2,3
1,160,4,4
1,340,8,3
2,5,-1,1
2,7,0,1
2,10,0,3
2,17,0,3
2,101,2,4
2,158,4,3
2,243,1,2
2,352,7,4
2,384,4,2
from revlog 1, but got this result:
card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
1,4998,-1,3
1,5008,0,1
1,5010,0,1
1,5015,0,3
1,5024,0,3
1,5276,1,3
1,5843,4,4
1,6371,9,3
2,4999,-1,1
2,5001,0,1
2,5004,0,3
2,5011,0,3
2,5266,1,4
2,5841,4,3
2,6111,2,2
2,6383,7,4
2,6800,4,2
Hello, I believe our approaches are not compatible, but ClarityInMadness seems convinced otherwise: https://www.reddit.com/r/Anki/comments/1cdm1y2/. What are your thoughts?
from datasets import load_dataset
raw_datasets = load_dataset("open-spaced-repetition/FSRS-Anki-20k")
produces the following error:
Generating train split: 720284748 examples [05:41, 2111906.74 examples/s]
Traceback (most recent call last):
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 2011, in _prepare_split_single
writer.write_table(table)
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/arrow_writer.py", line 585, in write_table
pa_table = table_cast(pa_table, self._schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/table.py", line 2295, in table_cast
return cast_table_to_schema(table, schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/table.py", line 2249, in cast_table_to_schema
raise CastError(
datasets.table.CastError: Couldn't cast
card_id: null
review_th: null
delta_t: null
rating: null
__index_level_0__: null
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 780
to
{'card_id': Value(dtype='int64', id=None), 'review_th': Value(dtype='int64', id=None), 'delta_t': Value(dtype='int64', id=None), 'rating': Value(dtype='int64', id=None)}
because column names don't match
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/nieradzik/anki/download.py", line 3, in <module>
raw_datasets = load_dataset("open-spaced-repetition/FSRS-Anki-20k")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/load.py", line 2609, in load_dataset
builder_instance.download_and_prepare(
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1027, in download_and_prepare
self._download_and_prepare(
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1882, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 2013, in _prepare_split_single
raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 1 new columns ({'__index_level_0__'})
This happened while the csv dataset builder was generating data using
hf://datasets/open-spaced-repetition/FSRS-Anki-20k/dataset/2/10054.csv (at revision 9440578f519d7113db474c284bba7828fcbeccaf)
Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
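A possible workaround (a sketch I haven't tested against this dataset): fetch the offending CSV files directly and drop the stray index column with pandas:

import pandas as pd
from huggingface_hub import hf_hub_download

# download a single CSV from the dataset repo (the file named in the error)
path = hf_hub_download(
    repo_id="open-spaced-repetition/FSRS-Anki-20k",
    repo_type="dataset",
    filename="dataset/2/10054.csv",
)
# keep only the expected columns, ignoring __index_level_0__
df = pd.read_csv(path, usecols=["card_id", "review_th", "delta_t", "rating"])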
I'll copy what I said in Discord:
Another request: calculate the Matthews correlation coefficient: https://en.wikipedia.org/wiki/Phi_coefficient
import math
from sklearn.metrics import confusion_matrix

# binarize the predicted probabilities at 0.5
y_pred = [1 if p > 0.5 else 0 for p in y_pred]

# convert to plain Python ints so the product below can't overflow
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel())

def mcc(tn, fp, fn, tp):
    sums = [tp + fp, tp + fn, tn + fp, tn + fn]
    n_zero = sums.count(0)
    if n_zero == 0:
        # with Python ints the product can't overflow (np.prod sometimes outputs
        # negative values due to overflow), and math.sqrt handles very large
        # numbers, unlike np.sqrt
        x = sums[0] * sums[1] * sums[2] * sums[3]
        return ((tp * tn) - (fp * fn)) / math.sqrt(x)
    # if one of the four sums in the denominator is zero, return 0
    elif n_zero == 1:
        return 0
    # if two of the four sums are zero, return 1 or -1, depending on TP, TN, FP and FN
    # https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7
    elif n_zero == 2:
        if tp != 0 or tn != 0:
            return 1
        elif fp != 0 or fn != 0:
            return -1
    # if more than two sums are zero, return None
    else:
        return None
I linked 2 articles about MCC (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9938573 and https://ieeexplore.ieee.org/abstract/document/9440903). It has an advantage over AUC: it takes into account all four numbers (true positives, true negatives, false positives, false negatives), whereas AUC only takes into account two.
So with AUC and MCC added, we will have 2 calibration metrics and 2 classification metrics, which is more than enough.