Comments (2)
Hi @gboeer

I am "happy" to see I am not the only one having issues logging with MLflow.

I am fine-tuning a pretrained transformer model on roughly 2000 images, so not an insane amount of data.

As you can see, metrics such as validation_accuracy, although recorded with on_step=False, on_epoch=True, only ever show me the value of the last epoch. I would like to see an actual graph with all my previous epochs, but here it's just a scalar.

Also, I tell my trainer to log every 50 steps, but in my step plots I only see points at steps 49, 199, 349, 499, ... not every 50.
Here is my logger:
```python
logger = MLFlowLogger(
    experiment_name=config['logger']['experiment_name'],
    tracking_uri=config['logger']['tracking_uri'],
    log_model=config['logger']['log_model'],
)
```
Passed to my trainer:
```python
trainer = Trainer(
    accelerator=config['accelerator'],
    devices=config['devices'],
    max_epochs=config['max_epochs'],
    logger=logger,
    log_every_n_steps=50,
    callbacks=[early_stopping, lr_monitor, checkpoint, progress_bar],
)
```
My metrics are logged in the following way in the training_step and validation_step functions:
```python
def training_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.train_accuracy.update(predictions, targets)
    # Loss is logged per step and aggregated per epoch; accuracy per epoch only.
    self.log("training_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("training_accuracy", self.train_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("training_gpu_allocation", torch.cuda.memory_allocated(), on_step=True, on_epoch=False)
    return {"inputs": inputs, "targets": targets, "predictions": predictions, "loss": loss}
```
```python
def validation_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    # Update the stateful metric objects; they are computed and reset at epoch end.
    self.validation_accuracy(predictions, targets)
    self.validation_precision(predictions, targets)
    self.validation_recall(predictions, targets)
    self.validation_f1_score(predictions, targets)
    self.validation_confmat.update(predictions, targets)
    self.log("validation_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("validation_accuracy", self.validation_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_precision", self.validation_precision, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_recall", self.validation_recall, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_f1_score", self.validation_f1_score, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
```
I guess it's a problem with Lightning, but I'm not 100% sure.

I hope we'll get support soon. I serve my ML models with MLflow and it works fine, so I don't want to go back to TensorBoard just for my DL models.
EDIT: My bad, it seems to do that only while training is still running. Once training is finished, the plots display correctly.

Still, I thought we were supposed to be able to follow the evolution of metrics as training progresses, and in this case that isn't really possible.
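In the meantime, one way to check whether the per-epoch values actually reach the tracking server during training (i.e., whether it is only the UI that lags) is to query the metric history directly. A minimal sketch, assuming you copy the run ID from the MLflow UI (the URI and run ID below are placeholders):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")  # placeholder URI
run_id = "<run-id-from-the-mlflow-ui>"  # placeholder

# get_metric_history returns one entry per logged value, so a healthy run
# should already show one point per finished epoch, even mid-training.
for m in client.get_metric_history(run_id, "validation_accuracy"):
    print(m.step, m.value)
```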
@Antoine101
Interesting that your plots change after the training is finished. For me, they stay the same, though. I tried opening the app in a private window to see if there were any caching issues, but it didn't change anything.
I guess what you observed about the step size may just have to do with zero-indexing; see the sketch below.
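A quick sketch of that arithmetic (illustrative only, not Lightning's actual internals): with a zero-indexed global step, "log every 50 steps" fires once 50 steps have completed, i.e., at indices 49, 99, 149, ...

```python
log_every_n_steps = 50

# The n-th completed step has zero-based index n - 1, so a "log every 50"
# condition on completed steps fires at 49, 99, 149, ... rather than 50, 100, ...
logged_steps = [step for step in range(500) if (step + 1) % log_every_n_steps == 0]
print(logged_steps)  # [49, 99, 149, 199, 249, 299, 349, 399, 449, 499]
```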