Comments (2)
Hi @gboeer

I am "happy" to see I am not the only one having issues logging with MLflow.

I am fine-tuning a pretrained transformer model on roughly 2000 images, so not an insane amount of data.

As you can see, metrics such as validation_accuracy, although recorded with on_step=False, on_epoch=True, only ever show me the value of the last epoch. I would like to see an actual graph with all my previous epochs, but here it's just a scalar.

Also, I tell my trainer to log every 50 steps, but in my step plots I only see points at steps 49, 199, 349, 499, ... not every 50.
Here is my logger:
```python
logger = MLFlowLogger(
    experiment_name=config['logger']['experiment_name'],
    tracking_uri=config['logger']['tracking_uri'],
    log_model=config['logger']['log_model'],
)
```
Passed to my trainer:
```python
trainer = Trainer(
    accelerator=config['accelerator'],
    devices=config['devices'],
    max_epochs=config['max_epochs'],
    logger=logger,
    log_every_n_steps=50,
    callbacks=[early_stopping, lr_monitor, checkpoint, progress_bar],
)
```
My metrics are logged in the following way in the training_step and validation_step functions:
```python
def training_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.train_accuracy.update(predictions, targets)
    # Loss is logged per step and aggregated per epoch; accuracy per epoch only.
    self.log("training_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("training_accuracy", self.train_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("training_gpu_allocation", torch.cuda.memory_allocated(), on_step=True, on_epoch=False)
    return {"inputs": inputs, "targets": targets, "predictions": predictions, "loss": loss}
```
```python
def validation_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    # Update the stateful metric objects; they are computed and reset at epoch end.
    self.validation_accuracy(predictions, targets)
    self.validation_precision(predictions, targets)
    self.validation_recall(predictions, targets)
    self.validation_f1_score(predictions, targets)
    self.validation_confmat.update(predictions, targets)
    self.log("validation_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("validation_accuracy", self.validation_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_precision", self.validation_precision, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_recall", self.validation_recall, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_f1_score", self.validation_f1_score, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
```
I guess it's a problem with Lightning, but I'm not 100% sure.

I hope we'll get support soon. I serve my ML models with MLflow and it works fine, so I don't want to go back to TensorBoard just for my DL models.
EDIT: My bad, it seems to do that only while training is still running. Once training is finished, the plots display correctly.

Still, I thought we were supposed to be able to follow the evolution of metrics as training progresses, and in this case that isn't really possible.
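In the meantime, one way to check whether the per-epoch values actually reach the tracking server during training (i.e., whether it is only the UI that lags) is to query the metric history directly. A minimal sketch, assuming you copy the run ID from the MLflow UI (the URI and run ID below are placeholders):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")  # placeholder URI
run_id = "<run-id-from-the-mlflow-ui>"  # placeholder

# get_metric_history returns one entry per logged value, so a healthy run
# should already show one point per finished epoch, even mid-training.
for m in client.get_metric_history(run_id, "validation_accuracy"):
    print(m.step, m.value)
```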
@Antoine101
Interesting that your plots change after the training is finished. For me, they stay the same, though. I tried opening the app in a private window to see if there were any caching issues, but it didn't change anything.
I guess what you observed about the step size may just have to do with zero-indexing; see the sketch below.
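A quick sketch of that arithmetic (illustrative only, not Lightning's actual internals): with a zero-indexed global step, "log every 50 steps" fires once 50 steps have completed, i.e., at indices 49, 99, 149, ...

```python
log_every_n_steps = 50

# The n-th completed step has zero-based index n - 1, so a "log every 50"
# condition on completed steps fires at 49, 99, 149, ... rather than 50, 100, ...
logged_steps = [step for step in range(500) if (step + 1) % log_every_n_steps == 0]
print(logged_steps)  # [49, 99, 149, 199, 249, 299, 349, 399, 449, 499]
```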