mberr commented on June 9, 2024

@pablo-sanchez-sony, would you mind opening a PR with the changes you suggest?

from pykeen.

mberr commented on June 9, 2024

Hi @arushi-08,

the checkpoint files are "just" normal torch archives, i.e., you can load them via torch.load, as done in the code snippet you linked (or, more precisely, one line above it; I have updated your text above to include it).

The checksum is calculated from the string representations of the model and the optimizer, cf. the following property of the training loop:

@property
def checksum(self) -> str:  # noqa: D401
    """The checksum of the model and optimizer the training loop was configured with."""
    h = md5()  # noqa: S303
    h.update(str(self.model).encode("utf-8"))
    h.update(str(self.optimizer).encode("utf-8"))
    return h.hexdigest()
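
To illustrate why such a checksum is sensitive to hyperparameters, here is a minimal sketch using only the standard library. ToyModel and ToyOptimizer are hypothetical stand-ins (not PyKEEN or PyTorch classes); only their string representations matter, as in the property above.

```python
from hashlib import md5


class ToyModel:
    """Hypothetical stand-in for a model; only its string form is hashed."""

    def __init__(self, embedding_dim: int) -> None:
        self.embedding_dim = embedding_dim

    def __repr__(self) -> str:
        return f"ToyModel(embedding_dim={self.embedding_dim})"


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer; only its string form is hashed."""

    def __init__(self, lr: float) -> None:
        self.lr = lr

    def __repr__(self) -> str:
        return f"ToyOptimizer(lr={self.lr})"


def checksum(model: object, optimizer: object) -> str:
    """Hash the string representations, mirroring the property shown above."""
    h = md5()
    h.update(str(model).encode("utf-8"))
    h.update(str(optimizer).encode("utf-8"))
    return h.hexdigest()


# Identical configurations give identical checksums ...
same_a = checksum(ToyModel(100), ToyOptimizer(0.01))
same_b = checksum(ToyModel(100), ToyOptimizer(0.01))
# ... while a different embedding dimension changes the string, and thus the hash.
different = checksum(ToyModel(200), ToyOptimizer(0.01))

print(same_a == same_b, same_a == different)
```

Anything that alters either string representation, including a changed hyperparameter, therefore yields a different checksum.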

I would suggest that you load the checkpoint file via torch.load and carefully compare it with the configuration. If you still think that everything is sane, I would suggest manually overriding the checkpoint file's checksum and writing the result to a new checkpoint file.

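import torch

d = torch.load(path)  # path: the existing checkpoint file
d["checksum"] = checksum  # checksum: the value expected by the current training loop
torch.save(d, new_path)  # new_path: where to write the patched checkpoint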

pablo-sanchez-sony commented on June 9, 2024

Hi,

I was having the same error. I believe the problem arises when using a scheduler object from PyTorch. In its constructor, we can observe that whenever last_epoch=-1, the initial_lr of the optimizer is updated.

https://github.com/pytorch/pytorch/blob/a5d841ef01e615e2a654fb12cf0cd08697d12ccf/torch/optim/lr_scheduler.py#L38

Basically, this makes str(self.optimizer).encode("utf-8") different, given that we have not yet reloaded the optimizer or the scheduler.

I believe the issue can be solved by moving the checksum comparison to the end of the method.
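
The effect described above can be sketched without PyTorch. ToyOptimizer and ToyScheduler below are hypothetical stand-ins: the optimizer prints its parameter groups, and the scheduler's constructor adds initial_lr when last_epoch=-1, which is assumed here to mimic the behaviour in the linked PyTorch source.

```python
class ToyOptimizer:
    """Hypothetical stand-in whose string form lists its parameter groups."""

    def __init__(self, lr: float) -> None:
        self.param_groups = [{"lr": lr}]

    def __repr__(self) -> str:
        lines = []
        for i, group in enumerate(self.param_groups):
            lines.append(f"Parameter Group {i}")
            lines.extend(f"    {key}: {group[key]}" for key in sorted(group))
        return "\n".join(lines)


class ToyScheduler:
    """Hypothetical stand-in that mimics the constructor side effect."""

    def __init__(self, optimizer: ToyOptimizer, last_epoch: int = -1) -> None:
        if last_epoch == -1:
            for group in optimizer.param_groups:
                # Adds a new key to the optimizer's own parameter group.
                group.setdefault("initial_lr", group["lr"])


optimizer = ToyOptimizer(lr=0.01)
before = str(optimizer)
ToyScheduler(optimizer)  # constructing the scheduler mutates the optimizer
after = str(optimizer)

print(before != after)
```

Because the string changes as soon as the scheduler is constructed, a checksum taken after that point no longer matches one taken before it, even though no hyperparameter was changed.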

pablo-sanchez-sony commented on June 9, 2024

Sure!

arushi-08 commented on June 9, 2024

I am facing this checkpoint mismatch error in the same training loop for the RotatE KGE model.
The following log messages show that rotate-checkpoint.pt is created at some initial epoch; then, after 30 epochs, the pipeline tries to read from this checkpoint and raises this error:

INFO:pykeen.training.training_loop:=> no checkpoint found at '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt'. Creating a new file.
Training epochs on cuda:0:   2%|▏         | 9/500 [07:47<6:22:12, 46.71s/epoch, loss=0.123, prev_loss=0.123]INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...                                                                              
Saved model weights to /afs/ars539/.data/pykeen/checkpoints/best-model-weights-0aa7f269-26c4-4e84-8d47-a55fc43911c6.pt
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 10.
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 20.
INFO:pykeen.stoppers.early_stopping:Stopping early at epoch 30. The best result 0.14622531740871292 occurred at epoch 10.
INFO:pykeen.stoppers.early_stopping:Re-loading weights from best epoch from /afs/ars539/.data/pykeen/checkpoints/best-model-weights-0aa7f269-26c4-4e84-8d47-a55fc43911c6.pt
INFO:pykeen.training.training_loop:=> Saved checkpoint after having finished epoch 30.
INFO:pykeen.evaluation.evaluator:Evaluation took 547.88s seconds
Best is trial 0 with value: 0.06680432707071304.
INFO:pykeen.pipeline.api:loaded random seed 42 from checkpoint.
INFO:pykeen.pipeline.api:Using device: None
INFO:pykeen.stoppers.early_stopping:Inferred checkpoint path for best model weights: /afs/ars539/.data/pykeen/checkpoints/best-model-weights-ea7a231a-d250-422a-a747-49f6b3a70e2f.pt
INFO:pykeen.training.training_loop:=> loading checkpoint '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt'
[W 2023-09-22 18:37:16,297] Trial 1 failed with parameters: {'model.embedding_dim': 200, 'loss.margin': 1.0271124464019343, 'optimizer.lr': 0.026733931043720773, 'negative_sampler.num_negs_per_pos': 3, 'training.batch_size': 64} because of the following error: CheckpointMismatchError("The checkpoint file '/afs/ars539/.data/pykeen/checkpoints/rotate-checkpoint.pt' that was provided already exists, but seems to be from a different training loop setup.").

My training script is:

result = hpo_pipeline(
    study_name='rotate_hpo',
    training=training,
    testing=testing,
    validation=validation,
    pruner="MedianPruner",
    sampler="tpe",
    model='RotatE',
    model_kwargs={
        "random_seed": 42,
    },
    model_kwargs_ranges=dict(
        embedding_dim=dict(type=int, low=100, high=300, q=100),
    ),
    negative_sampler_kwargs_ranges=dict(
        num_negs_per_pos=dict(type=int, low=1, high=100),
    ),
    stopper='early',
    n_trials=30,
    training_loop="sLCWA",
    training_kwargs=dict(
        num_epochs=500,
        checkpoint_name='rotate-checkpoint.pt',
        checkpoint_frequency=10,
    ),
    evaluator_kwargs={"filtered": True, "batch_size": 128},
)

Kindly suggest how to resolve this, as I am not explicitly trying to resume training; rather, the hpo_pipeline itself is reloading from the checkpoint.

mberr commented on June 9, 2024

When setting a checkpoint name

checkpoint_name='rotate-checkpoint.pt',

it seems to be used for all trials => the second run thinks it is a continuation of the first trial, but the model hyperparameters do not match.
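
A minimal simulation of this failure mode, using only the standard library: run_trial, config_checksum, and CheckpointMismatch are hypothetical names, the checkpoint is a small JSON file rather than a torch archive, and the checksum hashes the configuration directly instead of the model and optimizer strings.

```python
import json
import tempfile
from hashlib import md5
from pathlib import Path


class CheckpointMismatch(Exception):
    """Hypothetical error, standing in for the CheckpointMismatchError in the log."""


def config_checksum(config: dict) -> str:
    """Hash a trial configuration (simplified; not PyKEEN's actual checksum)."""
    return md5(json.dumps(config, sort_keys=True).encode("utf-8")).hexdigest()


def run_trial(config: dict, checkpoint_path: Path) -> str:
    """Resume if a matching checkpoint exists, otherwise create a new one."""
    checksum = config_checksum(config)
    if checkpoint_path.exists():
        saved = json.loads(checkpoint_path.read_text())
        if saved["checksum"] != checksum:
            raise CheckpointMismatch(f"{checkpoint_path.name} is from a different setup")
        return "resumed"
    checkpoint_path.write_text(json.dumps({"checksum": checksum}))
    return "created"


# Two trials with different hyperparameters, as sampled during an HPO study.
trials = [{"embedding_dim": 100}, {"embedding_dim": 200}]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # One fixed checkpoint name for all trials: the second trial finds the
    # first trial's file and fails the checksum comparison.
    shared = root / "rotate-checkpoint.pt"
    outcomes_shared = []
    for config in trials:
        try:
            outcomes_shared.append(run_trial(config, shared))
        except CheckpointMismatch:
            outcomes_shared.append("mismatch")

    # One checkpoint file per trial: no trial sees another trial's file.
    outcomes_per_trial = [
        run_trial(config, root / f"rotate-checkpoint-{i}.pt")
        for i, config in enumerate(trials)
    ]

print(outcomes_shared, outcomes_per_trial)
```

With the shared name, the second trial hits the mismatch; with one file per trial, both trials start from scratch.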

mberr commented on June 9, 2024

Here is a smaller script to reproduce the error:

from pykeen.hpo import hpo_pipeline

result = hpo_pipeline(
    study_name="rotate_hpo",
    dataset="nations",
    model="RotatE",
    model_kwargs_ranges=dict(
        embedding_dim=dict(type=int, low=8, high=24, q=8),
    ),
    stopper="early",
    n_trials=2,
    training_loop="sLCWA",
    training_kwargs=dict(
        num_epochs=2,
        checkpoint_name="rotate-checkpoint.pt",
        checkpoint_frequency=1,
    ),
)

mberr commented on June 9, 2024

@arushi-08 , what is your use case for providing a checkpoint name? Do you want to save each trial's model? If yes, we have an explicit save_model_directory for that, which will take care of creating one sub-directory per trial.

mberr commented on June 9, 2024

I have opened a small PR (#1324) to fail fast on the first trial with an error message about how to fix it 🙂

mberr commented on June 9, 2024

@pablo-sanchez-sony , would this resolve your issue, too?
