Comments (42)

lars76 avatar lars76 commented on June 22, 2024 1

If I duplicate the data in both the training and the test dataset, the results are better (0.6988735931049374). The modified model is better at predicting the probability outside the interval (e.g. if the Anki review was on day 13, then the prediction for day > 13 is better). However, if I only want to predict day 13, the duplicated data does not help.

So it depends on the task. If we need the probability for each day, duplicating the data might be beneficial. If we only need the probability for the first day when the user forgets the card, it is better not to change the data.

I will close this issue now. When I have more time, I will start working on a scheduler/metric again. Thank you for the discussion πŸ‘


Expertium avatar Expertium commented on June 22, 2024

I would suggest showing your code, at least the input and the output part. Also, here's the small dataset:
tiny_dataset.zip


lars76 avatar lars76 commented on June 22, 2024

It's not completely user-ready yet. I created the post on Reddit in the hope of finding people interested in collaborating on a new algorithm. It's a proof of concept to show that the idea works. I believe it can outperform all other algorithms because it doesn't rely on assumptions like exponential decay. I have not yet optimized the architecture, the features, and many other aspects.


Expertium avatar Expertium commented on June 22, 2024

I meant that LMSherlock would need to know what your output looks like.

I believe it can outperform all other algorithms because it doesn't rely on assumptions like exponential decay.

If you predict the probability of recall directly, without predicting memory stability (with a "fixed" shape of the forgetting curve) or some other intermediate value, it can lead to weird results, such as the probability never reaching 0% or 100%, or a forgetting curve that is non-monotonic with respect to time. I think we can all agree that the probability of recall should monotonically decrease as time passes.
Btw, DASH algorithms don't use exponential decay (well, neither does FSRS; it actually uses a power function).
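To make the difference between the two shapes concrete, here is a minimal sketch; the power-law constants (factor, decay) are illustrative placeholders rather than the exact values of any particular FSRS version.

```python
import numpy as np

def exponential_curve(t, s):
    """Exponential forgetting curve: R(t) = exp(-t / s)."""
    return np.exp(-t / s)

def power_curve(t, s, factor=19/81, decay=-0.5):
    """Power-law forgetting curve of the FSRS kind: R(t) = (1 + factor * t / s) ** decay.
    The constants are illustrative, not the exact FSRS values."""
    return (1 + factor * t / s) ** decay

t = np.arange(0, 61)
# Both curves start at 1.0 and decrease monotonically, but the power law
# has a heavier tail: it stays above the exponential for large t.
print(exponential_curve(t, s=10)[:5])
print(power_curve(t, s=10)[:5])
```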


lars76 avatar lars76 commented on June 22, 2024

I am not predicting recall. My approach is based on the proportional hazards model (https://en.wikipedia.org/wiki/Proportional_hazards_model). As I wrote on Reddit, I am predicting forgetfulness. The output is a sampled cumulative distribution function (CDF): https://en.wikipedia.org/wiki/Failure_rate#hazard_function
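For readers unfamiliar with the hazard formulation, here is a minimal sketch of how per-step hazards translate into a sampled forgetting CDF; the hazard values are made up, and this is not the actual model, just the bookkeeping.

```python
import numpy as np

def forgetting_cdf(hazards):
    """Given per-day discrete hazards h_t = P(forget on day t | not forgotten before),
    return the sampled CDF F(t) = P(forgotten by day t) = 1 - prod_{i <= t} (1 - h_i)."""
    survival = np.cumprod(1.0 - np.asarray(hazards))
    return 1.0 - survival

# hypothetical per-day hazards, e.g. produced by a proportional-hazards style model
hazards = [0.02, 0.03, 0.05, 0.08, 0.12, 0.17]
print(forgetting_cdf(hazards))  # monotonically increasing, bounded by 1
```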


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

Is it possible to sample from this distribution? If so, we can convert the sampling results to recall.


Expertium avatar Expertium commented on June 22, 2024

Yeah, I thought "If you can output a day for the specified probability, surely there must be a way to do the inverse and output a probability for the specified day?"


lars76 avatar lars76 commented on June 22, 2024

Of course, I could also calculate 1 - F, where F stands for forgetfulness. Then we get the recall score. But the metric would still be a problem because I am not optimizing recall but have two objectives (sigmoid + softmax). It would be biased against my model. If I want to minimize the log likelihood, I would create some features like [elapsed_time, last_recall, ...] and train a network to predict 0/1 (sigmoid). This model would then be excellent at predicting whether we can still recall the card at time t. Then we can use https://arxiv.org/pdf/1706.04599 and get even better results. However, such a model itself would be useless.
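A minimal sketch of the kind of sigmoid baseline described here (not the actual model), using a hypothetical feature matrix of [elapsed_time, last_recall]; calibration in the spirit of the linked paper (e.g. temperature scaling) would then be applied to the predicted probabilities on a held-out set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical review log: features [elapsed_time, last_recall] -> recalled (1) / forgot (0)
X = np.array([[3, 1], [9, 1], [21, 1], [30, 0], [5, 1], [40, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
# predict_proba gives P(recall) at a given elapsed time; this is the sigmoid-style
# baseline that directly optimizes the log-likelihood of 0/1 outcomes.
print(clf.predict_proba([[13, 1]])[:, 1])
```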


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

I agree that a well-calibrated model may not be a good scheduler. But a good scheduler should predict recall well.


lars76 avatar lars76 commented on June 22, 2024

Let us assume our model is perfectly calibrated. Then I am still not convinced that log-likelihood is the best metric.

Example: Anki gave us reviews at time t=3 (recall=1), t=9 (recall=1) and t=21 (recall=0). Using log likelihood, you compare the predicted and ground-truth values individually at t=3, t=9, t=21.

But this does not tell us if the model is good. The sample t=9 could already be a mistake. If we have another sequence t=3, t=10, t=21 and it leads to the same result "t=21 with recall=0", then {3,10,21} would be a better sequence.

Hence, we have to look not at the probabilities individually but as a whole. Mathematically, this means that we have a monotonicity requirement such that the recall at t=k is greater than at t=k+1.

In my opinion, a good scheduler is not one that predicts recall well but one that delays the review as far as possible. We have essentially two objectives: make t really big (reduce the time the user spends on Anki), keep recall at 1.

For me, a good metric considers the times t=3, t=9, t=21 as a curve. All three values are connected with each other. With increasing t and without a review, the chance of forgetting increases.

I have been thinking about possible alternative metrics that may not suffer from the same issues. In the literature, the continuous ranked probability score is quite common for this use case: https://datascience.stackexchange.com/questions/63919/what-is-continuous-ranked-probability-score-crps

What do you think?
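For concreteness, here is a sketch of how CRPS could be computed for a forgetting CDF sampled on a daily grid; the grid, the CDF, and the observed day are made-up examples, not output from any model in the benchmark.

```python
import numpy as np

def crps_sampled(days, cdf, observed_day):
    """CRPS for a forgetting CDF sampled on a day grid.

    cdf[i] is the predicted P(forgotten by days[i]); observed_day is the first day
    with a failed review. CRPS = integral of (F(t) - 1{t >= observed_day})^2 dt,
    approximated here by a sum over the grid."""
    days = np.asarray(days, dtype=float)
    cdf = np.asarray(cdf, dtype=float)
    step = (days >= observed_day).astype(float)  # empirical CDF of the single observation
    widths = np.gradient(days)                   # grid spacing (1.0 for daily samples)
    return float(np.sum((cdf - step) ** 2 * widths))

days = np.arange(1, 31)
pred_cdf = 1 - np.exp(-days / 10)                # hypothetical forgetting CDF
print(crps_sampled(days, pred_cdf, observed_day=21))
```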


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

keep recall at 1.

It's impossible. Forgetting starts when you stop doing reviews. Otherwise, the forgetting curve would be a piecewise function in which R(t) = 100% when t < T. But there is no literature supporting that.


Expertium avatar Expertium commented on June 22, 2024

https://datascience.stackexchange.com/questions/63919/what-is-continuous-ranked-probability-score-crps

Interesting, I've never heard of this metric. It would be great if we could implement it in the benchmark.


Expertium avatar Expertium commented on June 22, 2024

Anki gave us reviews at time t=3 (recall=1), t=9 (recall=1) and t=21 (recall=0).

For me, a good metric considers the times t=3, t=9, t=21 as a curve. All three values are connected with each other. With increasing t and without a review, the chance of forgetting increases.

The two things that I highlighted cannot both happen at the same time. If that's what you got from Anki, then those are three different reviews, and hence their forgetting curves will be different. You cannot treat them as a single curve. You cannot "probe" memory without affecting it.


lars76 avatar lars76 commented on June 22, 2024

You are actually right about the sequence t=3, t=9 and t=21, because t=9 affects the memory, but we can at least consider the values pairwise.

curve0, t=1
curve0, t=2
curve0, t=3, Anki result = 1
curve1, t=4
curve1, t=5
curve1, t=6
curve1, t=7
curve1, t=8
curve1, t=9
curve1, t=10, Anki result = 0
curve2, t=11
...

One more issue would be: the ground truth is only partially correct. We only observe that Anki's result is 0 at t=10. Maybe at some t > 3 and t < 10, the card was already forgotten.

But still I think a better metric can be constructed based on that.


Expertium avatar Expertium commented on June 22, 2024

curve0, t=1
curve0, t=2
curve0, t=3, Anki result = 1

But we don't know the result at t=1 and t=2, so we cannot put them into any loss function


lars76 avatar lars76 commented on June 22, 2024

I would only consider rows (t, t+1) where the result at t+1 is 0 (forget) and at t is 1 (recall), and where there is more than one time step between them. In my example, I would only consider curve1. We actually know the result at t=4 to t=9 partially: from the forgetting curve, we are sure that the probability should at least decrease. It is true that we do not know whether the probability already drops to 0% at t=8 or t=9, but this does not matter, because the model prediction at t=9 should at least be close to 0 (e.g. 10%).
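A small sketch of the pair extraction described above, assuming the review log is available as (day, result) tuples; the names and the example log are hypothetical.

```python
def recall_to_forget_spans(reviews):
    """reviews: list of (t, result) sorted by t, with result 1 = recall, 0 = forget.
    Yield (t_recall, t_forget) spans where a successful review is directly followed
    by a failed one -- the only spans where the forgetting curve is partially known."""
    for (t0, r0), (t1, r1) in zip(reviews, reviews[1:]):
        if r0 == 1 and r1 == 0:
            yield (t0, t1)

# hypothetical log: recall at t=3, forget at t=10, recall at t=21, forget at t=57
reviews = [(3, 1), (10, 0), (21, 1), (57, 0)]
print(list(recall_to_forget_spans(reviews)))  # [(3, 10), (21, 57)]
```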


lars76 avatar lars76 commented on June 22, 2024

Maybe I can clarify my idea with an example.

Example for continuous ranked probability score with "curve1":

  • ground truth is [1 1 1 1 1 1 1 0] (t=3 to t=10). Invert it as [0 0 0 0 0 0 0 1].
  • prediction depends on the model:
    • If you fit an exponential function exp(-theta * t/S), then we have 1 - exp(-theta * t/S). Sample it for t=3 to t=10.
    • If your model outputs a CDF, just take it.
    • If you only have two values, e.g. t=3 and t=10, interpolate the values between them.
  • Take the mean squared error between ground truth and prediction.
  • Repeat for all other curves (a sketch of this recipe follows below).
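A minimal sketch of this recipe for curve1, with hypothetical values for theta and S.

```python
import numpy as np

t = np.arange(3, 11)                       # t = 3 .. 10
ground_truth = np.array([1, 1, 1, 1, 1, 1, 1, 0])
inverted = 1 - ground_truth                # "forgot by day t" instead of "recalled at day t"

theta, S = 1.0, 5.0                        # hypothetical fitted parameters
pred_cdf = 1 - np.exp(-theta * t / S)      # predicted probability of having forgotten by day t

mse = np.mean((pred_cdf - inverted) ** 2)  # error for curve1; repeat and average over all curves
print(mse)
```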


Expertium avatar Expertium commented on June 22, 2024
  • ground truth is [1 1 1 1 1 1 1 0] (t=3 to t=10).

But that's what I don't understand - how will you find ground truth for all those intermediate values of t? As I've already mentioned, you can't probe memory without affecting it. Sure, if ground truth is 1 at t=3 and 0 at t=10, you can assume that everything in between is 1, but it's unclear how reasonable such an assumption is. Why not [1 1 1 1 0 0 0 0] instead?


lars76 avatar lars76 commented on June 22, 2024

Most (all?) models of human memory assume that recall gets lower as t increases: for example, the exponential forgetting curve, half-life regression, and generalized power laws.

So we know two things about the ground truth:

  • function is monotonically decreasing
  • at t=3 it is 100% and at t=10 it has to be 0%

You are right that [1 1 1 1 0 0 0 0] would be possible as well. However, I think it is a reasonable assumption that the "ground truth recall" is not exactly 100% after day 1. Maybe it is [1 0.9 0.8 0.7 0.4 0.3 0.2 0.0] or [1 0.5 0.3 0.2 0.2 0.2 0.1 0.0] etc.

For example, if the real "unobserved" recall is 0.3 but we say it is 0.0, this is not a huge issue for the metric (even though it is wrong).

Another option would be to partially skip t=4 to t=9. For example, only consider t=3, t=10 and t=4.

I want to note, however, that even if we only consider three individual data points, it would still not be the same as log-likelihood. The reasons are:

  • only consider tuples (recall, forget), ignore all other data points in the dataset
  • we measure the error based on the cumulative distribution function (CDF) of ground truth and prediction, not the regular output of the neural network (e.g. sigmoid output).


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

What's the role of CDF here?


lars76 avatar lars76 commented on June 22, 2024

The CDF accumulates the probabilities of all previous time steps. For example, at t=0 we have a 0% forgetting probability. At t=1, we add the probability of the new step to the previous one: 0% + 5% = 5%. And so on.

Let me give an example with half-life regression. I fitted: f(t) = exp(-t/(a*w_1 + b*w_2 + c*w_3 + w_4)), where the w_i are weights and a, b, c are features.

Then I am using as CDF: P(T <= t) = 1 - f(t). We have, for example, P(T <= 1) = P(T = 0) + P(T = 1).
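A sketch of this half-life-regression style CDF; the features, weights, and bias are made-up values for illustration.

```python
import numpy as np

def recall_hlr(t, features, weights, bias):
    """Half-life-regression style recall: f(t) = exp(-t / h), with h = w . features + bias."""
    h = np.dot(weights, features) + bias
    return np.exp(-t / h)

def forgetting_cdf(t, features, weights, bias):
    """CDF of forgetting: P(T <= t) = 1 - f(t)."""
    return 1.0 - recall_hlr(t, features, weights, bias)

# hypothetical features a, b, c and weights w_1 .. w_4 (the last weight acts as a bias)
features = np.array([2.0, 1.0, 0.5])
weights = np.array([1.5, 0.8, 0.3])
bias = 4.0
for day in (1, 5, 10):
    print(day, forgetting_cdf(day, features, weights, bias))
```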


I thought about the metric again. I would consider all time steps after the review was done, as well.

Let us compute the error for the following example:

[image: example forgetting curve for this card]

We compare time t from day=8 to day=infinity. Starting from approx. day=15, we have no error. But we have an error at day=8, day=9, day=10, day=11, ... The ground truth is 1, but we say that it is lower than 1.


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

If we have such a dataset for the first forgetting curve of cards whose first rating is Easy, how do we calculate the error?

[image: pretrain_3, first forgetting curve for cards whose first rating is Easy]

Note: we have 2681 samples in this graph.

first rating   delta_t   y_mean        size
4              1         0.979851538   943
4              2         0.980810235   469
4              3         0.97257384    474
4              4         0.95505618    445
4              5         0.975609756   287
4              6         0.984126984   63


lars76 avatar lars76 commented on June 22, 2024

I compute the error of the curves individually. So I would take one user and one card, then plot a curve and then compute the error.

If I understand correctly, you group all users+cards based on the difficulty of the first rating. Then you have an average probability for each day. This is not really a CDF because it is not monotonic.

You can estimate a CDF of that sample using: https://en.wikipedia.org/wiki/Empirical_distribution_function Then it is possible to use this CRPS metric.

However, I am still not sure it is the best approach. Why are you not measuring the error individually of each user and each card?

So basically, if you want to group it, count how many times your algorithm outputs a certain day. In your example:

Prediction CDF: P(T <= 1) = ?/n, P(T <= 2) = ?/n, ...
Ground truth CDF: P(T <= 1) = 943/n, P(T <= 2) = (943+469)/n, ...

For the prediction CDF, one could maybe use the sum of the probabilities.
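A sketch of this grouping, using the size column from the table above to build the ground-truth CDF; the per-day prediction probabilities are invented purely for illustration.

```python
import numpy as np

# counts of first reviews per delta_t from the table above (size column)
delta_t = np.array([1, 2, 3, 4, 5, 6])
size = np.array([943, 469, 474, 445, 287, 63])
n = size.sum()  # 2681

# ground-truth "CDF" as sketched above: cumulative share of samples up to each day
ground_truth_cdf = np.cumsum(size) / n

# hypothetical prediction: per-day probabilities assigned by some model, accumulated
per_day_prob = np.array([0.30, 0.20, 0.18, 0.15, 0.10, 0.07])
prediction_cdf = np.cumsum(per_day_prob)

# CRPS-style comparison: squared difference between the two CDFs, summed over the day grid
print(np.sum((prediction_cdf - ground_truth_cdf) ** 2))
```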


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

If I understand correctly, you group all users+cards based on the difficulty of the first rating.

They are from my collection. So there is only one user here.

Why are you not measuring the error individually of each user and each card?

Because I need to use the model to predict the forgetting curve for cards it has not seen before.


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

Here is a dataset extracted from my collection: evaluation.tsv.zip

It contains the prediction results (column retrievability) and the target labels (column review_rating). I hope it's useful.

In evaluation.tsv, the ratings 2, 3, and 4 count as 1 (successful recall) and the rating 1 counts as 0 (failed recall).


lars76 avatar lars76 commented on June 22, 2024

Let me see if I understand your dataset correctly. I only consider part of the second card. I only kept the day, and replaced the ratings 2, 3, 4 with 1 and the rating 1 with 0.

2018-06-17 1 0
2018-06-22 1,1 0,3
2018-07-02 1,1,1 0,3,5
2018-07-25 1,1,1,1 0,3,5,10
2018-09-20 1,1,1,1,1 0,3,5,10,23
2018-09-21 1,1,1,1,1,0 0,3,5,10,23,57

The information here is:
t=0, recall=1, curve1
t=3, recall=1, curve2
t=5, recall=1, curve3
t=10, recall=1, curve4
t=23, recall=1, curve5
t=57, recall=0, curve6

Then I would only consider the curve5 from "t=23,recall=1" to "t=57,recall=0". This is the ground truth.

For the prediction, I would use the forgetting_curve of one of your models:

def forgetting_curve(self, t, s):
Or what I used before: half-life regression.

Then, to compare, measure the MSE from t=57 up to t=infty.

This is my idea.
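A sketch of that last step, truncating "up to t=infty" at a finite horizon and using a simple exponential curve as a stand-in for whatever forgetting_curve the model provides; the stability value and the horizon are arbitrary.

```python
import numpy as np

def curve_mse_after_failure(forgetting_curve, s, t_fail, horizon=365):
    """MSE between the model's predicted recall and the ground truth of 0,
    from the failed review at t_fail up to a finite horizon (a stand-in for infinity)."""
    t = np.arange(t_fail, horizon + 1)
    predicted_recall = forgetting_curve(t, s)
    return float(np.mean((predicted_recall - 0.0) ** 2))

# hypothetical exponential model with stability s; swap in any model's forgetting_curve
exp_curve = lambda t, s: np.exp(-t / s)
print(curve_mse_after_failure(exp_curve, s=30.0, t_fail=57))
```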


Expertium avatar Expertium commented on June 22, 2024

Then I would only consider the curve5 from "t=23,recall=1" to "t=57,recall=0".

Then you will be effectively throwing away a lot of data. I can't imagine that it would produce good results.


lars76 avatar lars76 commented on June 22, 2024

It is about having an evaluation metric that measures the error of the forgetting curve. Whether it is sufficient to measure the entire model performance, I cannot say yet. It might be possible to modify it to also consider curves with only 1s. However, the only forgetting curve where we know the ground truth is given in this example by t=23 to t=infty.

By only considering individual samples, I think you are also throwing away a lot of data.


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

By only considering individual samples, I think you are also throwing away a lot of data.

Which data is thrown away? I don't follow. Did I miss something?


lars76 avatar lars76 commented on June 22, 2024

Day 1: We did an Anki review. Anki reschedules the card for day 12.
Day 12: We get the card. Now we know the result: whether we forgot or recalled the card.


Assume we forgot the card at day 12.


We know that any day after 12 would lead to the same result: day 13, day 14, day 15, ... are all status "forget".

If Anki had rescheduled the card not for day 12 but for day 13, then at day 13 the status would have been "forget" as well.

You only consider day 1 and day 12. You throw away any day after 12.


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

Does it matter? I think the information from day 1 and day 12 includes all of the rest if we have a hypothesis for the shape of the forgetting curve.


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

If we have 5 cards reviewed on day 5, and we forget 2 of them, the data will look like this:

1 1 1 1 1 ? ? ?
1 1 1 1 1 ? ? ?
1 1 1 1 1 ? ? ?
? ? ? ? 0 0 0 0
? ? ? ? 0 0 0 0

How could we deal with these when calculating the metric?


lars76 avatar lars76 commented on June 22, 2024

Does it matter? I think the information from day 1 and day 12 includes all of the rest if we have a hypothesis for the shape of the forgetting curve.

Sometimes we predict 5% for day 12; then day > 12 might not matter. But sometimes we predict 65% for day 12; then it will take a long time to reach 0%. Here is another example, where each number indicates the probability at day t (starting at day 1 and going up to day 100):
prediction = [1.0, 0.94, 0.88, 0.83, 0.78, 0.74, 0.69, 0.65, 0.61, 0.58, 0.54, 0.51, 0.48, 0.45, 0.42, 0.4, 0.37, 0.35, 0.33, 0.31, 0.29, 0.28, 0.26, 0.24, 0.23, 0.22, 0.2, 0.19, 0.18, 0.17, 0.16, 0.15, 0.14, 0.13, 0.12, 0.12, 0.11, 0.1, 0.1, 0.09, 0.09, 0.08, 0.08, 0.07, 0.07, 0.06, 0.06, 0.06, 0.05, 0.05, 0.05, 0.04, 0.04, 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

ground truth = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

The error will be really high for this card.


I also realized we should measure the error between two correct predictions.

Day 1: We did an Anki review. Anki reschedules the card for day 12.
Day 12: We get the card. It is correct.

Then between day 1 and 12, the forgetting curve has to decrease really slowly. It should start at around 100% at day 1, and at day 12 it should still be close to 100%.

Therefore, at day 2, 3, 4, 5, ... and 11 the recall should be 1 as well.


Another reason for including time steps outside the sample range (e.g. day 1 and 12) is that the samples come from Anki's default scheduler. If we train a model on this dataset, we effectively learn Anki's scheduler and not the real forgetting curve.

Most of my cards get a 3-day interval after the first review, because this is the default interval for "Good".


This is, at least, how I view it theoretically. One has to see what the real difference is in practice, especially whether it affects the ranking.


lars76 avatar lars76 commented on June 22, 2024

If we have 5 cards reviewed on day 5, and we forget 2 of them, the data will look like this:

1 1 1 1 1 ? ? ?
1 1 1 1 1 ? ? ?
1 1 1 1 1 ? ? ?
? ? ? ? 0 0 0 0
? ? ? ? 0 0 0 0

How could we deal with these when calculating the metric?

We do not know anything about day 6, 7 and 8 for card1, card2 and card3. It is possible that for day 6 or 7 or 8 it is zero or one.

Similarly, we do not know anything about day 1, 2, 3, 4 for card4 and card5.

day 1 2 3 4 5 6 7 8
card1 = 1 1 1 1 1 ? ? ?
card2 = 1 1 1 1 1 ? ? ?
card3 = 1 1 1 1 1 ? ? ?
card4 = ? ? ? ? 0 0 0 0
card5 = ? ? ? ? 0 0 0 0

So we can only skip them. Before, I thought we could maybe just look at one sample between the unknown parts, but now I think that is not good either.
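One way to handle the "?" entries in such a metric is simply to mask them out; a minimal sketch with made-up predictions follows (np.nan marks the unknown entries).

```python
import numpy as np

# 5 cards x 8 days; np.nan marks the "?" entries where the ground truth is unknown
ground_truth = np.array([
    [1, 1, 1, 1, 1, np.nan, np.nan, np.nan],
    [1, 1, 1, 1, 1, np.nan, np.nan, np.nan],
    [1, 1, 1, 1, 1, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan, 0, 0, 0, 0],
    [np.nan, np.nan, np.nan, np.nan, 0, 0, 0, 0],
])

# hypothetical per-day recall predictions, identical for all 5 cards
prediction = np.tile(np.exp(-np.arange(1, 9) / 6.0), (5, 1))

mask = ~np.isnan(ground_truth)  # keep only the days where the outcome is known
mse = np.mean((prediction[mask] - ground_truth[mask]) ** 2)
print(mse)
```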


Expertium avatar Expertium commented on June 22, 2024

Another reason for including time steps outside the sample range (e.g. day 1 and 12) is that the samples come from Anki's default scheduler. If we train a model on this dataset, we effectively learn Anki's scheduler and not the real forgetting curve.

Not really, no. The interval depends on Anki's scheduler, but not the shape of the forgetting curve. If Anki scheduled a card to appear after 13 days instead of 12 days, it wouldn't change the forgetting curve.

Especially if it affects the ranking.

What do you mean?


lars76 avatar lars76 commented on June 22, 2024

Another reason for including time steps outside the sample range (e.g. day 1 and 12) is that the samples come from Anki's default scheduler. If we train a model on this dataset, we effectively learn Anki's scheduler and not the real forgetting curve.

Not really, no. The interval depends on Anki's scheduler, but not the shape of the forgetting curve. If Anki scheduled a card to appear after 13 days instead of 12 days, it wouldn't change the forgetting curve.

Especially if it affects the ranking.

What do you mean?

Let's say Anki schedules a card for day 52, but we actually already forgot the card at day 16. Then we learn the forgetting curve: day 52 should be recall = 0. The curve "day 1 to day 52" decays much more slowly than the curve "day 1 to day 16". So we learned the wrong forgetting curve, and the ground truth we have is not 100% exact. The smaller the time interval, the more correct the prediction. So you are right that day 13 vs. day 12 might not make a huge difference.

However, forgetting curves with really large time intervals are certainly incorrect due to data sparseness. The main issue is that the distributions here are long-tailed: we have a lot of samples in the range of day 1-49, but far fewer for day 50+.

Regarding the ranking, I mean whether changing the metric affects which model is better. For example, I modified half-life regression by replacing the time t with t^2. Then I compared some metrics and got different results.


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

Data sparseness is a real problem for our case (optimization for a single Anki user). That's why FSRS is better than some general models like GRU and transformer. FSRS is based on the DSR model, in which we try to find the best SInc function to fit the data, where the shape of the function is fixed.

After considering your advice, I think it's a method of data augmentation. For successive successful reviews, we can insert more positive samples into the intervals between adjacent reviews. For successive failed reviews, we can insert more negative samples. The problem is that we don't know which type of sample we should insert into the interval between a successful review and a failed review.
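A sketch of this augmentation rule on a toy review log; the log and the function name are hypothetical.

```python
def augment_reviews(reviews):
    """reviews: list of (day, label) sorted by day, with label 1 = recall, 0 = forget.
    Insert one sample per intermediate day between two successes (label 1) or two
    failures (label 0); success -> failure gaps are left alone, since the true day
    of forgetting inside that gap is unknown."""
    augmented = list(reviews)
    for (d0, r0), (d1, r1) in zip(reviews, reviews[1:]):
        if r0 == r1:  # both successes or both failures
            augmented += [(d, r0) for d in range(d0 + 1, d1)]
    return sorted(augmented)

reviews = [(3, 1), (9, 1), (21, 0), (30, 0)]
print(augment_reviews(reviews))  # fills days 4-8 with 1s and days 22-29 with 0s
```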


Expertium avatar Expertium commented on June 22, 2024

I thought about data augmentation before, and here's what I'm wondering about: should the final evaluation be conducted on the original dataset, or on the augmented dataset (regardless of the augmentation method)?

If we evaluate any algorithm on the augmented dataset, then the metrics (like log loss and RMSE) become kind of meaningless, since this isn't real data.
If we evaluate it on the original dataset, then the algorithm is almost guaranteed to perform worse, since it was optimized on different (augmented) data.

EDIT: after some googling, it seems like training the model on the augmented dataset and evaluating it on the original dataset is the standard practice.


lars76 avatar lars76 commented on June 22, 2024

Data sparseness is a real problem for our case (optimization for a single Anki user). That's why FSRS is better than some general models like GRU and transformer. FSRS is based on the DSR model, in which we try to find the best SInc function to fit the data, where the shape of the function is fixed.

After considering your advice, I think it's a method of data augmentation. For successive successful reviews, we can insert more positive samples into the intervals between adjacent reviews. For successive failed reviews, we can insert more negative samples. The problem is that we don't know which type of sample we should insert into the interval between a successful review and a failed review.

Yeah, one could view it to some degree as data augmentation. Not sure if it will lead to better results. I need to do some tests as well.

But no matter the outcome, it could still be interesting for the evaluation to see the effects.

My new modified idea for a metric is:

  • Look at the mean squared error (MSE) between the predicted probability and the ground truth. This would be your "log loss" column.
  • Add to the previous MSE also the time steps outside the sample intervals (as discussed before).
  • Group all cards into three intervals: short-term intervals (1-15 days), medium-term intervals (15-50 days), and long-term intervals (50+ days); a sketch follows below.

The MSE results for "short term intervals (1-15 days)" will be most accurate. This is the ground truth we can trust the most.

Then we can average the three MSE of the intervals (maybe with weighting?).
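A sketch of this bucketed metric; the bucket boundaries follow the proposal above, while the weights and the sample data are placeholders.

```python
import numpy as np

def bucketed_mse(elapsed_days, predictions, labels, weights=(0.5, 0.3, 0.2)):
    """Group samples by interval length, compute the MSE per bucket, and return both
    the per-bucket MSEs and their weighted average. The bucket boundaries follow the
    proposal above (1-15, 15-50, 50+ days); the weights are arbitrary."""
    elapsed_days = np.asarray(elapsed_days)
    err = (np.asarray(predictions) - np.asarray(labels)) ** 2
    buckets = [elapsed_days < 15,
               (elapsed_days >= 15) & (elapsed_days < 50),
               elapsed_days >= 50]
    mses = [err[b].mean() if b.any() else 0.0 for b in buckets]
    return mses, float(np.dot(weights, mses))

elapsed = [3, 10, 20, 45, 60, 90]
preds = [0.95, 0.85, 0.70, 0.50, 0.30, 0.15]
labels = [1, 1, 1, 0, 0, 0]
print(bucketed_mse(elapsed, preds, labels))
```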


lars76 avatar lars76 commented on June 22, 2024

I thought about data augmentation before, and here's what I'm wondering about: should the final evaluation be conducted on the original dataset, or on the augmented dataset (regardless of the augmentation method)?

If we evaluate any algorithm on the augmented dataset, then the metrics (like log loss and RMSE) become kind of meaningless, since this isn't real data. If we evaluate it on the original dataset, then the algorithm is almost guaranteed to perform worse, since it was optimized on different (augmented) data.

EDIT: after some googling, it seems like training the model on the augmented dataset and evaluating it on the original dataset is the standard practice.

Yeah, I usually work on computer vision tasks. Then I just do rotations etc. on the training data, but for prediction I use the original dataset.

In our case now, however, the additional values are "real" data samples. So it might not be 100% data augmentation.

We kind of change the task for the model. Before, we only have the first day when the user forgot the card. For example: start at day 1, stop at day 12. The task is "how long until the user forgets the card". We never have any days beyond the first "forgetting day".

After data augmentation, the dataset also contains samples day 13, 14, 15, ... The task is no longer "how long until" but "what is the probability at day t".


lars76 avatar lars76 commented on June 22, 2024

Here is an implementation:

https://github.com/lars76/spaced_repetition_benchmark_test

train.csv = "only train on the available data"
train_extra.csv = "duplicate data as described before"
test.csv = "only test on the available data"
test_extra.csv = "duplicate data as described before"

See predict_half_life.py.

Train: train.csv, Test: test.csv
Low interval 0.14023966177700425
Mid interval 0.11249547502469187
Long interval: 0.08420035903374629
Total 0.3369354958354424

Train: train.csv, Test: test_extra.csv
Low interval 0.11888207811944794
Mid interval 0.4312504243280266
Long interval: 0.7343956261327806
Total 1.2845281285802552

Train: train_extra.csv, Test: test.csv
Low interval 0.3616415048209399
Mid interval 0.2302383584575336
Long interval: 0.2227537455923405
Total 0.8146336088708139

Train: train_extra.csv, Test: test_extra.csv
Low interval 0.22079208085316215
Mid interval 0.24161667288426994
Long interval: 0.23646483936750526
Total 0.6988735931049374


L-M-Sherlock avatar L-M-Sherlock commented on June 22, 2024

What's your conclusion about the test?

