Comments (4)
Transformer model run with --trafo_d_model=512 --trafo_encoder_layers=4 --trafo_heads=2 --trafo_dropout=0.4 --trafo_feedforward_dim=512
on the flickr1d dataset (only 30 samples, so batch_size is also set to 30).
python -m platalea.experiments.flickr8k.transformer -c $DATAPATH/config.yml --flickr8k_root=$DATAPATH --epochs=1 --trafo_d_model=512 --trafo_encoder_layers=4 --trafo_heads=2 --trafo_dropout=0.4 --trafo_feedforward_dim=512 --batch_size=30
INFO:root:Arguments: {'config': '/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/flickr1d/config.yml', 'verbose': False, 'silent': False, 'audio_features_fn': 'mfcc_features.pt', 'seed': 123, 'epochs': 1, 'downsampling_factor': None, 'lr_scheduler': 'cyclic', 'cyclic_lr_max': 0.0002, 'cyclic_lr_min': 1e-06, 'constant_lr': 0.0001, 'device': None, 'hidden_size_factor': 1024, 'l2_regularization': 0, 'flickr8k_root': '/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/flickr1d', 'flickr8k_meta': 'dataset.json', 'flickr8k_audio_subdir': 'flickr_audio/wavs/', 'flickr8k_image_subdir': 'Flickr8k_Dataset/Flicker8k_Dataset/', 'flickr8k_language': 'en', 'librispeech_root': '/home/bjrhigy/corpora/LibriSpeech', 'librispeech_meta': 'metadata.json', 'batch_size': 30, 'trafo_d_model': 512, 'trafo_encoder_layers': 4, 'trafo_heads': 2, 'trafo_feedforward_dim': 512, 'trafo_dropout': 0.4, 'score_on_cpu': False, 'validate_on_cpu': False}
INFO:root:Loading data
INFO:root:Building model
INFO:root:Training
INFO:root:Run 'wandb disabled' if you don't want to use wandb cloud logging.
INFO:wandb:setting login settings: {}
wandb: Offline run mode, not syncing to the cloud.
wandb: W&B is disabled in this directory. Run `wandb on` to enable cloud syncing.
INFO:root:Setting stepsize of 4
/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
===============================================================================================
Layer (type:depth-idx) Output Shape Param #
===============================================================================================
├─Conv1d: 1-1 [30, 64, 299] 14,976
├─Linear: 1-2 [299, 30, 512] 33,280
├─TransformerEncoder: 1-3 [299, 30, 512] --
| └─ModuleList: 2 [] --
| | └─TransformerEncoderLayer: 3-1 [299, 30, 512] --
| | | └─MultiheadAttention: 4-1 [299, 30, 512] --
| | | └─Dropout: 4-2 [299, 30, 512] --
| | | └─LayerNorm: 4-3 [299, 30, 512] 1,024
| | | └─Linear: 4-4 [299, 30, 512] 262,656
| | | └─Dropout: 4-5 [299, 30, 512] --
| | | └─Linear: 4-6 [299, 30, 512] 262,656
| | | └─Dropout: 4-7 [299, 30, 512] --
| | | └─LayerNorm: 4-8 [299, 30, 512] 1,024
| | └─TransformerEncoderLayer: 3-2 [299, 30, 512] --
| | | └─MultiheadAttention: 4-9 [299, 30, 512] --
| | | └─Dropout: 4-10 [299, 30, 512] --
| | | └─LayerNorm: 4-11 [299, 30, 512] 1,024
| | | └─Linear: 4-12 [299, 30, 512] 262,656
| | | └─Dropout: 4-13 [299, 30, 512] --
| | | └─Linear: 4-14 [299, 30, 512] 262,656
| | | └─Dropout: 4-15 [299, 30, 512] --
| | | └─LayerNorm: 4-16 [299, 30, 512] 1,024
| | └─TransformerEncoderLayer: 3-3 [299, 30, 512] --
| | | └─MultiheadAttention: 4-17 [299, 30, 512] --
| | | └─Dropout: 4-18 [299, 30, 512] --
| | | └─LayerNorm: 4-19 [299, 30, 512] 1,024
| | | └─Linear: 4-20 [299, 30, 512] 262,656
| | | └─Dropout: 4-21 [299, 30, 512] --
| | | └─Linear: 4-22 [299, 30, 512] 262,656
| | | └─Dropout: 4-23 [299, 30, 512] --
| | | └─LayerNorm: 4-24 [299, 30, 512] 1,024
| | └─TransformerEncoderLayer: 3-4 [299, 30, 512] --
| | | └─MultiheadAttention: 4-25 [299, 30, 512] --
| | | └─Dropout: 4-26 [299, 30, 512] --
| | | └─LayerNorm: 4-27 [299, 30, 512] 1,024
| | | └─Linear: 4-28 [299, 30, 512] 262,656
| | | └─Dropout: 4-29 [299, 30, 512] --
| | | └─Linear: 4-30 [299, 30, 512] 262,656
| | | └─Dropout: 4-31 [299, 30, 512] --
| | | └─LayerNorm: 4-32 [299, 30, 512] 1,024
├─Attention: 1-4 [30, 512] --
| └─Linear: 2-1 [30, 299, 128] 65,664
| └─Linear: 2-2 [30, 299, 512] 66,048
| └─Softmax: 2-3 [30, 299, 512] --
===============================================================================================
Total params: 2,289,408
Trainable params: 2,289,408
Non-trainable params: 0
Total mult-adds (M): 23.66
===============================================================================================
Input size (MB): 2.82
Forward/backward pass size (MB): 675.12
Params size (MB): 9.16
Estimated Total Size (MB): 687.10
===============================================================================================
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
├─Linear: 1-1 [30, 512] 1,049,088
==========================================================================================
Total params: 1,049,088
Trainable params: 1,049,088
Non-trainable params: 0
Total mult-adds (M): 1.05
==========================================================================================
Input size (MB): 0.25
Forward/backward pass size (MB): 0.12
Params size (MB): 4.20
Estimated Total Size (MB): 4.56
==========================================================================================
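For reference, the two summaries above look like torchinfo (torch-summary) output for the two halves of the model: the speech encoder (Conv1d → Linear projection → TransformerEncoder → attention pooling) and the image encoder (a single 2048→512 linear transform). A minimal sketch of how such a summary can be generated; the Conv1d settings (39 MFCC channels, kernel size 6, stride 2, no bias) are inferred from the 14,976 conv parameters and the [30, 64, 299] output shape, not taken from the platalea source:

import torch.nn as nn
from torchinfo import summary  # pip install torchinfo

# Hypothetical reconstruction of the speech encoder summarized above
# (attention pooling omitted). Conv1d settings are inferred, see note above.
class SpeechEncoderSketch(nn.Module):
    def __init__(self, d_model=512, layers=4, heads=2, ff_dim=512, dropout=0.4):
        super().__init__()
        self.conv = nn.Conv1d(39, 64, kernel_size=6, stride=2, bias=False)
        self.scale_conv_to_trafo = nn.Linear(64, d_model)  # 33,280 params
        layer = nn.TransformerEncoderLayer(d_model, heads, ff_dim, dropout)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):                # x: [batch, 39, time]
        x = self.conv(x)                 # [30, 64, 299]
        x = x.permute(2, 0, 1)           # [299, 30, 64] (seq, batch, feature)
        x = self.scale_conv_to_trafo(x)  # [299, 30, 512]
        return self.transformer(x)       # [299, 30, 512]

summary(SpeechEncoderSketch(), input_size=(30, 39, 602))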
GRU model run with default parameters on the flickr1d dataset (only 30 samples, so the default batch_size of 32 is not filled completely).
python -m platalea.experiments.flickr8k.basic_default -c $DATAPATH/config.yml --flickr8k_root=$DATAPATH --epochs=1
INFO:root:Arguments: {'config': '/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/flickr1d/config.yml', 'verbose': False, 'silent': False, 'audio_features_fn': 'mfcc_features.pt', 'seed': 123, 'epochs': 1, 'downsampling_factor': None, 'lr_scheduler': 'cyclic', 'cyclic_lr_max': 0.0002, 'cyclic_lr_min': 1e-06, 'constant_lr': 0.0001, 'device': None, 'hidden_size_factor': 1024, 'l2_regularization': 0, 'flickr8k_root': '/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/flickr1d', 'flickr8k_meta': 'dataset.json', 'flickr8k_audio_subdir': 'flickr_audio/wavs/', 'flickr8k_image_subdir': 'Flickr8k_Dataset/Flicker8k_Dataset/', 'flickr8k_language': 'en', 'librispeech_root': '/home/bjrhigy/corpora/LibriSpeech', 'librispeech_meta': 'metadata.json'}
INFO:root:Loading data
INFO:root:Building model
INFO:root:Training
INFO:root:Run 'wandb disabled' if you don't want to use wandb cloud logging.
INFO:wandb:setting login settings: {}
wandb: Offline run mode, not syncing to the cloud.
wandb: W&B is disabled in this directory. Run `wandb on` to enable cloud syncing.
INFO:root:Setting stepsize of 4
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
├─Conv1d: 1-1 [30, 64, 299] 14,976
├─GRU: 1-2 [5780, 2048] 63,356,928
├─Attention: 1-3 [30, 2048] --
| └─Linear: 2-1 [30, 299, 128] 262,272
| └─Linear: 2-2 [30, 299, 2048] 264,192
| └─Softmax: 2-3 [30, 299, 2048] --
==========================================================================================
Total params: 63,898,368
Trainable params: 63,898,368
Non-trainable params: 0
Total mult-adds (M): 68.83
==========================================================================================
Input size (MB): 2.82
Forward/backward pass size (MB): 255.44
Params size (MB): 255.59
Estimated Total Size (MB): 513.86
==========================================================================================
torch.Size([30, 2048])
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
├─Linear: 1-1 [30, 2048] 4,196,352
==========================================================================================
Total params: 4,196,352
Trainable params: 4,196,352
Non-trainable params: 0
Total mult-adds (M): 4.19
==========================================================================================
Input size (MB): 0.25
Forward/backward pass size (MB): 0.49
Params size (MB): 16.79
Estimated Total Size (MB): 17.52
==========================================================================================
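As a sanity check on the GRU table: 63,356,928 parameters is exactly what a 4-layer bidirectional GRU with 64 input features and hidden size 1024 has (the bidirectional output is then 2048 wide, matching the [5780, 2048] shape). The layer count is an assumption, verified only by this arithmetic:

import torch.nn as nn

# 4-layer bidirectional GRU: input size 64 (conv output channels),
# hidden size 1024, so the concatenated output is 2048 wide.
gru = nn.GRU(input_size=64, hidden_size=1024, num_layers=4, bidirectional=True)
print(sum(p.numel() for p in gru.parameters()))  # 63356928, as in the table

# Per layer and direction: weight_ih (3h x input), weight_hh (3h x h),
# plus bias_ih and bias_hh of 3h each.
h = 1024
layer1 = 3 * h * (64 + h + 2)      # first layer sees the 64 conv channels
layerN = 3 * h * (2 * h + h + 2)   # later layers see the 2048-wide output
assert 2 * layer1 + 3 * 2 * layerN == 63356928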
We now also used torchprof to profile memory usage of the Transformer model. Results (with slightly different parameters than above, namely --trafo_heads=4 instead of 2):
python -m platalea.experiments.flickr8k.transformer -c $DATAPATH/config.yml --flickr8k_root=$DATAPATH --epochs=1 --trafo_d_model=512 --trafo_encoder_layers=4 --trafo_heads=4 --trafo_dropout=0.4 --trafo_feedforward_dim=512 --batch_size=30
INFO:root:Arguments: {'config': '/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/flickr1d/config.yml', 'verbose': False, 'silent': False, 'audio_features_fn': 'mfcc_features.pt', 'seed': 123, 'epochs': 1, 'downsampling_factor': None, 'lr_scheduler': 'cyclic', 'cyclic_lr_max': 0.0002, 'cyclic_lr_min': 1e-06, 'constant_lr': 0.0001, 'device': None, 'hidden_size_factor': 1024, 'l2_regularization': 0, 'flickr8k_root': '/Users/pbos/sw/miniconda3/envs/platalea/lib/python3.8/site-packages/flickr1d', 'flickr8k_meta': 'dataset.json', 'flickr8k_audio_subdir': 'flickr_audio/wavs/', 'flickr8k_image_subdir': 'Flickr8k_Dataset/Flicker8k_Dataset/', 'flickr8k_language': 'en', 'librispeech_root': '/home/bjrhigy/corpora/LibriSpeech', 'librispeech_meta': 'metadata.json', 'batch_size': 30, 'trafo_d_model': 512, 'trafo_encoder_layers': 4, 'trafo_heads': 4, 'trafo_feedforward_dim': 512, 'trafo_dropout': 0.4, 'score_on_cpu': False, 'validate_on_cpu': False}
INFO:root:Loading data
INFO:root:Building model
INFO:root:Training
INFO:root:Run 'wandb disabled' if you don't want to use wandb cloud logging.
INFO:wandb:setting login settings: {}
wandb: Offline run mode, not syncing to the cloud.
wandb: W&B is disabled in this directory. Run `wandb on` to enable cloud syncing.
INFO:root:Setting stepsize of 4
INFO:root:Saving model in net.1.pt
INFO:root:Calculating and saving epoch score results
Module | Self CPU total | CPU total | Self CPU Mem | CPU Mem | Number of Calls
-------------------------|----------------|-----------|--------------|-----------|----------------
SpeechImage | | | | |
├── SpeechEncoder | | | | |
│├── Conv | 66.873ms | 289.453ms | 6.65 Mb | 36.94 Mb | 2
│├── Transformer | | | | |
││├── layers | | | | |
│││├── 0 | | | | |
││││├── self_attn | | | | |
│││││└── out_proj | 0.000us | 0.000us | | | 0
││││├── linear1 | 83.397ms | 151.518ms | 41.41 Mb | 124.22 Mb | 3
││││├── dropout | 35.445ms | 72.829ms | 52.56 Mb | 227.75 Mb | 3
││││├── linear2 | 58.937ms | 109.521ms | 41.41 Mb | 124.22 Mb | 3
││││├── norm1 | 13.010ms | 26.011ms | 41.47 Mb | 124.61 Mb | 3
││││├── norm2 | 5.002ms | 9.874ms | 41.47 Mb | 124.61 Mb | 3
││││├── dropout1 | 45.915ms | 100.343ms | 52.56 Mb | 227.75 Mb | 3
││││└── dropout2 | 52.156ms | 114.384ms | 52.56 Mb | 227.75 Mb | 3
│││├── 1 | | | | |
││││├── self_attn | | | | |
│││││└── out_proj | 0.000us | 0.000us | | | 0
││││├── linear1 | 86.873ms | 157.309ms | 41.41 Mb | 124.22 Mb | 3
││││├── dropout | 23.239ms | 48.686ms | 52.56 Mb | 227.75 Mb | 3
││││├── linear2 | 60.659ms | 115.586ms | 41.41 Mb | 124.22 Mb | 3
││││├── norm1 | 14.783ms | 29.562ms | 41.47 Mb | 124.61 Mb | 3
││││├── norm2 | 19.597ms | 39.195ms | 41.47 Mb | 124.61 Mb | 3
││││├── dropout1 | 24.864ms | 50.021ms | 52.56 Mb | 227.75 Mb | 3
││││└── dropout2 | 22.024ms | 44.320ms | 52.56 Mb | 227.75 Mb | 3
│││├── 2 | | | | |
││││├── self_attn | | | | |
│││││└── out_proj | 0.000us | 0.000us | | | 0
││││├── linear1 | 50.357ms | 98.688ms | 41.41 Mb | 124.22 Mb | 3
││││├── dropout | 37.660ms | 80.871ms | 52.56 Mb | 227.75 Mb | 3
││││├── linear2 | 56.380ms | 110.692ms | 41.41 Mb | 124.22 Mb | 3
││││├── norm1 | 9.368ms | 18.736ms | 41.47 Mb | 124.61 Mb | 3
││││├── norm2 | 6.272ms | 12.558ms | 41.47 Mb | 124.61 Mb | 3
││││├── dropout1 | 35.492ms | 70.916ms | 52.56 Mb | 227.75 Mb | 3
││││└── dropout2 | 45.605ms | 100.806ms | 52.56 Mb | 227.75 Mb | 3
│││└── 3 | | | | |
│││ ├── self_attn | | | | |
│││ │└── out_proj | 0.000us | 0.000us | | | 0
│││ ├── linear1 | 59.860ms | 116.771ms | 41.41 Mb | 124.22 Mb | 3
│││ ├── dropout | 33.931ms | 70.847ms | 52.56 Mb | 227.75 Mb | 3
│││ ├── linear2 | 49.675ms | 94.920ms | 41.41 Mb | 124.22 Mb | 3
│││ ├── norm1 | 6.048ms | 12.110ms | 41.47 Mb | 124.61 Mb | 3
│││ ├── norm2 | 7.821ms | 15.640ms | 41.47 Mb | 124.61 Mb | 3
│││ ├── dropout1 | 37.179ms | 79.899ms | 52.56 Mb | 227.75 Mb | 3
│││ └── dropout2 | 32.213ms | 64.872ms | 52.56 Mb | 227.75 Mb | 3
│├── scale_conv_to_trafo | 20.789ms | 30.776ms | 26.87 Mb | 83.60 Mb | 2
│├── att | | | | |
││├── hidden | 26.160ms | 56.518ms | 29.86 Mb | 113.46 Mb | 2
││├── out | 10.915ms | 19.699ms | 23.89 Mb | 71.66 Mb | 2
││└── softmax | 24.973ms | 49.969ms | 23.89 Mb | 71.66 Mb | 2
└── ImageEncoder | | | | |
└── linear_transform | 1.649ms | 1.849ms | 64.00 Kb | 64.00 Kb | 2
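For reference, a table like the one above can be produced with a recent torchprof (which added the profile_memory option) roughly as follows; the toy model and input shape here are placeholders, not the actual SpeechImage network:

import torch
import torch.nn as nn
import torchprof  # pip install torchprof

# Stand-in model; replace with the real network to get the table above.
model = nn.Sequential(nn.Conv1d(39, 64, kernel_size=6, stride=2), nn.ReLU())
x = torch.randn(30, 39, 602)  # dummy MFCC batch, shapes as in the runs above

# profile_memory=True adds the "Self CPU Mem" / "CPU Mem" columns;
# every forward pass inside the context adds to "Number of Calls".
with torchprof.Profile(model, use_cuda=False, profile_memory=True) as prof:
    model(x)
print(prof.display(show_events=False))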
We decreased the GPU memory usage by using larger convolution strides. See report https://wandb.ai/spokenlanguage/platalea_transformer/reports/Feb-4-Project-Update-Conv-stride-sweep--Vmlldzo0NDkwMzA.
We also tried to use the extra available memory by adding more layers; a sketch of the stride effect follows below. See report https://wandb.ai/spokenlanguage/platalea_transformer/reports/Feb-23-Project-Update-grid-search-on-trafo-layers-and-heads--Vmlldzo0ODQ5MzE.
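To make the stride effect concrete: the conv front-end fixes the sequence length that enters the transformer, and self-attention stores O(L²) attention weights per head, so doubling the stride roughly quarters that part of the memory. A small sketch, with kernel size 6 and zero padding assumed from the shapes in the first comment (stride 2 reproduces the 299-frame conv output seen there):

def conv1d_out_len(l_in, kernel=6, stride=2, padding=0):
    # Standard Conv1d output-length formula (dilation 1).
    return (l_in + 2 * padding - kernel) // stride + 1

l_in = 602  # input frames per utterance in the runs above
ref = conv1d_out_len(l_in, stride=1)
for stride in (1, 2, 4, 8):
    l_out = conv1d_out_len(l_in, stride=stride)
    # attention-weight memory relative to stride=1 scales like (L/ref)^2
    print(stride, l_out, round((l_out / ref) ** 2, 3))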