Comments (7)
@moyid Good catch. In commit 682aaa2 (when I switched the loss function from BCELoss to BCEWithLogitsLoss) I forgot to apply the sigmoid function in the validation and testing steps, so any very large logit from the model crashed the accuracy calculation. This should be fixed in commit 6808a42. The validation sanity check passed because it never produced such large values.
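For anyone hitting the same thing, here's a minimal sketch of the idea (hypothetical values, not the exact TransformerSum code): BCEWithLogitsLoss consumes raw logits, but metric code needs probabilities, so the sigmoid has to be applied manually before thresholding.

```python
import torch
from torch import nn

# Raw logits straight from the model; 45.0 is the kind of large value
# that blows up float32 metric code if treated as a probability.
logits = torch.tensor([0.5, -30.0, 45.0])
labels = torch.tensor([1.0, 0.0, 1.0])

# Training: BCEWithLogitsLoss applies the sigmoid internally (numerically stable).
loss = nn.BCEWithLogitsLoss()(logits, labels)

# Validation/testing: convert logits to probabilities before computing metrics.
probs = torch.sigmoid(logits)
preds = (probs > 0.5).float()
accuracy = (preds == labels).float().mean().item()
```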
from transformersum.
ok, I'll try first with 32bit and let you know how that turns out.
@HHousen - I'm able to train past 1 epoch with 32bit!
This is with the extractive CNN/DM dataset downloaded from here: https://drive.google.com/uc?id=1-DLTTioISS8i3UrOG4sjjc_js0ncnBnn
@HHousen - thank you for making the change! For some reason, this worked fine when I tried training with a small dataset (only train.0.pt, plus all of the validation data), but failed again when I tried with the full set of training data.
Again, here is my error:
Epoch 0: 96%|█████████▌| 71750/75115 [7:49:12<22:00, 2.55it/s, loss=nan, v_num=zq23, train_loss_total=nan, train_loss_total_norm_batch=nan, train_loss_avg_seq_sum=nan, train_loss_avg_seq_mean=nan, train_loss_avg=nan]
Traceback (most recent call last):
  File "src/main.py", line 398, in <module>
    main(main_args)
  File "src/main.py", line 97, in main
    trainer.fit(model)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit
    results = self.accelerator_backend.train()
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 482, in train
    self.train_loop.run_training_epoch()
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 569, in run_training_epoch
    self.trainer.run_evaluation(test_mode=False)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 567, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 171, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 76, in validation_step
    output = self.__validation_step(args)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 86, in __validation_step
    output = self.trainer.model.validation_step(*args)
  File "/home/jupyter/TransformerSum/src/extractive.py", line 686, in validation_step
    y_hat.detach().cpu().numpy(), y_true.float().detach().cpu().numpy()
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/transformers/data/metrics/__init__.py", line 37, in acc_and_f1
    f1 = f1_score(y_true=labels, y_pred=preds)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/metrics/_classification.py", line 1047, in f1_score
    zero_division=zero_division)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/metrics/_classification.py", line 1175, in fbeta_score
    zero_division=zero_division)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/metrics/_classification.py", line 1434, in precision_recall_fscore_support
    pos_label)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/metrics/_classification.py", line 1250, in _check_set_wise_labels
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/metrics/_classification.py", line 83, in _check_targets
    type_pred = type_of_target(y_pred)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/utils/multiclass.py", line 287, in type_of_target
    _assert_all_finite(y)
  File "/home/jupyter/TransformerSum/env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 99, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Okay, this is a strange issue. The model must be outputting NaNs, because we know the NaNs are not in the ground-truth values, and neither the sigmoid nor the greater-than/less-than comparisons remove NaNs; they pass through unchanged.
Additionally, I'm concerned that the log output shows the loss is NaN, which suggests the model isn't training properly. Does your loss curve look correct? Are you using 16-bit precision? Can you try 32-bit precision?
If you're using 16-bit precision, then I think the problem may be caused by replacing padding values with -9e9 in the classifiers: -9e9 cannot be represented in 16 bits. I've changed this value to -9e3, which is still small enough that the sigmoid function will output 0. However, this doesn't explain why it worked when trained on only a little data. It also might be an exploding-gradient problem.
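You can verify the float16 range claim directly (a quick hypothetical check, not project code): float16's largest finite magnitude is about 65504, so -9e9 overflows to -inf, while -9e3 is representable and still saturates the sigmoid to 0.

```python
import torch

# -9e9 exceeds the float16 range (~±65504), so casting overflows to -inf.
big = torch.tensor(-9e9).half()

# -9e3 fits in float16 exactly, and sigmoid still maps it to 0,
# so it works as a padding mask value without overflowing.
safe = torch.tensor(-9e3).half()
masked_prob = torch.sigmoid(safe.float())
```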
You can also try validating with the lines pertaining to this calculation commented out (682-687, 696-698, 720-722, and 731-733 in commit e1f6022). However, the underlying problem may still persist.
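Rather than commenting the metric lines out entirely, a defensive guard like this (a sketch with toy values, not the repo's actual code) would let validation continue by masking non-finite predictions before they reach sklearn, which raises ValueError on any NaN/inf input:

```python
import numpy as np

# Toy stand-ins for y_hat / y_true as they'd appear in validation_step.
y_hat = np.array([0.9, np.nan, 0.2, 0.8], dtype=np.float32)
y_true = np.array([1, 0, 0, 1], dtype=np.float32)

# sklearn's metric input validation rejects NaN/inf, so filter first.
mask = np.isfinite(y_hat)
if not mask.all():
    y_hat, y_true = y_hat[mask], y_true[mask]  # or skip/flag the batch

preds = (y_hat > 0.5).astype(int)
acc = float((preds == y_true).mean())
```

This only avoids the crash; a NaN loss still means the underlying training problem needs fixing.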