Comments (10)
@pablo-sanchez-sony, would you mind opening a PR with the changes you suggest?
Hi @arushi-08,
the checkpoint files are "just" normal torch archives, i.e., you can load them via torch.load,
as done in the code snippet you linked (or, more precisely, one line above it; I have updated your text above to include it).
The checksum was calculated from the string representations of the model and the optimizer, cf. here
pykeen/src/pykeen/training/training_loop.py
Lines 203 to 209 in d1222b7
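The checksum logic can be sketched roughly as follows (a minimal illustration, assuming an MD5 hash over the UTF-8 encoded string representations; the stand-in classes and the helper name compute_checksum are hypothetical, and PyKEEN's actual implementation may differ in detail):

```python
import hashlib

# Stand-ins for the real model/optimizer; only their repr() matters here.
class FakeModel:
    def __repr__(self):
        return "RotatE(embedding_dim=200)"

class FakeOptimizer:
    def __repr__(self):
        return "Adam(lr=0.01)"

def compute_checksum(model, optimizer):
    """Hash the string representations of model and optimizer."""
    h = hashlib.md5()
    h.update(str(model).encode("utf-8"))
    h.update(str(optimizer).encode("utf-8"))
    return h.hexdigest()

c1 = compute_checksum(FakeModel(), FakeOptimizer())
c2 = compute_checksum(FakeModel(), FakeOptimizer())
print(c1 == c2)  # True: identical setup -> identical checksum
```

The key point: any change to either string representation, however small, yields a different checksum and thus triggers the mismatch error.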
I would suggest that you load the checkpoint file via torch.load
and carefully compare it with the configuration. If you still think that everything is sane, I would suggest manually overriding the checkpoint file's checksum and writing the result to a new checkpoint file:
import torch

d = torch.load(path)       # path: the existing checkpoint file
d["checksum"] = checksum   # the checksum matching the current setup
torch.save(d, new_path)    # write to a fresh checkpoint file
Hi,
I was having the same error. I believe the problem arises when using a scheduler object from PyTorch: as we can observe in its constructor, whenever last_epoch=-1,
the initial_lr of the optimizer is updated.
Basically, this makes str(self.optimizer).encode("utf-8")
differ, given that we have not yet reloaded the optimizer or the scheduler.
I believe the issue can be solved by moving the checksum comparison to the end of the method.
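This side effect can be demonstrated in a few lines (a standalone sketch, not PyKEEN-specific; it only assumes a standard PyTorch optimizer and scheduler):

```python
import torch

# A trivial parameter and optimizer.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)

repr_before = str(optimizer)

# Creating a scheduler with last_epoch=-1 (the default) writes
# 'initial_lr' into each parameter group of the optimizer ...
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

repr_after = str(optimizer)

# ... so the optimizer's string representation (and hence any
# checksum derived from it) changes.
print(repr_before == repr_after)  # False
```

Because the checksum is derived from exactly this string, comparing it before the scheduler is reconstructed will always fail.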
Sure!
I am facing this checkpoint mismatch error in the same training loop for the RotatE KGE model.
The following log messages show that rotate-checkpoint.pt is created at some initial epoch; after 30 epochs, the pipeline tries to read from this checkpoint and raises this error:
INFO:pykeen.training.training_loop:=> no checkpoint found at '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt'. Creating a new file.
Training epochs on cuda:0: 2%|▏ | 9/500 [07:47<6:22:12, 46.71s/epoch, loss=0.123, prev_loss=0.123]INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
Saved model weights to /afs/ars539/.data/pykeen/checkpoints/best-model-weights-0aa7f269-26c4-4e84-8d47-a55fc43911c6.pt
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 20.
INFO:pykeen.stoppers.early_stopping:Stopping early at epoch 30. The best result 0.14622531740871292 occurred at epoch 10.
INFO:pykeen.stoppers.early_stopping:Re-loading weights from best epoch from /afs/ars539/.data/pykeen/checkpoints/best-model-weights-0aa7f269-26c4-4e84-8d47-a55fc43911c6.pt
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 30.
INFO:pykeen.evaluation.evaluator:Evaluation took 547.88s seconds
Best is trial 0 with value: 0.06680432707071304.
INFO:pykeen.pipeline.api:loaded random seed 42 from checkpoint.
INFO:pykeen.pipeline.api:Using device: None
INFO:pykeen.stoppers.early_stopping:Inferred checkpoint path for best model weights: /afs/ars539/.data/pykeen/checkpoints/best-model-weights-ea7a231a-d250-422a-a747-49f6b3a70e2f.pt
INFO:pykeen.training.training_loop:=> loading checkpoint '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt'
[W 2023-09-22 18:37:16,297] Trial 1 failed with parameters: {'model.embedding_dim': 200, 'loss.margin': 1.0271124464019343, 'optimizer.lr': 0.026733931043720773, 'negative_sampler.num_negs_per_pos': 3, 'training.batch_size': 64} because of the following error: CheckpointMismatchError("The checkpoint file '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt' that was provided already exists, but seems to be from a different training loop setup.").
My training script is:
result = hpo_pipeline(
    study_name='rotate_hpo',
    training=training,
    testing=testing,
    validation=validation,
    pruner="MedianPruner",
    sampler="tpe",
    model='RotatE',
    model_kwargs={
        "random_seed": 42,
    },
    model_kwargs_ranges=dict(
        embedding_dim=dict(type=int, low=100, high=300, q=100),
    ),
    negative_sampler_kwargs_ranges=dict(
        num_negs_per_pos=dict(type=int, low=1, high=100),
    ),
    stopper='early',
    n_trials=30,
    training_loop="sLCWA",
    training_kwargs=dict(
        num_epochs=500,
        checkpoint_name='rotate-checkpoint.pt',
        checkpoint_frequency=10,
    ),
    evaluator_kwargs={"filtered": True, "batch_size": 128},
)
Kindly suggest how to resolve this: I am not explicitly trying to resume training; rather, the hpo_pipeline itself is reloading from the checkpoint.
When setting a checkpoint name via
checkpoint_name='rotate-checkpoint.pt',
it seems to be used for all trials => the second trial thinks it is a continuation of the first, but the model hyperparameters do not match.
Here is a smaller script to reproduce the error:
from pykeen.hpo import hpo_pipeline

result = hpo_pipeline(
    study_name="rotate_hpo",
    dataset="nations",
    model="RotatE",
    model_kwargs_ranges=dict(
        embedding_dim=dict(type=int, low=8, high=24, q=8),
    ),
    stopper="early",
    n_trials=2,
    training_loop="sLCWA",
    training_kwargs=dict(
        num_epochs=2,
        checkpoint_name="rotate-checkpoint.pt",
        checkpoint_frequency=1,
    ),
)
@arushi-08 , what is your use case for providing a checkpoint name? Do you want to save each trial's model? If yes, we have an explicit save_model_directory
for that, which will take care of creating one sub-directory per trial.
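Adapting the reproduction script above, that would look roughly like this (a sketch based on the comment above; the directory path is a placeholder, and the exact placement of save_model_directory may vary across PyKEEN versions):

```python
from pykeen.hpo import hpo_pipeline

result = hpo_pipeline(
    study_name="rotate_hpo",
    dataset="nations",
    model="RotatE",
    stopper="early",
    n_trials=2,
    training_loop="sLCWA",
    training_kwargs=dict(num_epochs=2),
    # Instead of a shared checkpoint_name, save each trial's model
    # into its own sub-directory under this path:
    save_model_directory="models/rotate_hpo",
)
```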
I have opened a small PR (#1324) to fail fast on the first trial with an error message about how to fix it 🙂
@pablo-sanchez-sony , would this resolve your issue, too?