DLPerf - Deep Learning Framework Performance Profiling Toolkit

Introduction

This repository provides state-of-the-art classical deep neural network (DNN) models implemented in different deep learning frameworks, which are easy to train and deploy, achieving the best reproducible performance on NVIDIA GPU server clusters.

DLPerf measures how fast deep learning frameworks can train DNN models, so both DL frameworks and DNN models are involved in this benchmark test.

Evaluated Deep Learning Frameworks

Five deep learning frameworks are evaluated in this repository:

  1. OneFlow
  2. TensorFlow 1.x and 2.x
  3. PyTorch
  4. MXNet
  5. PaddlePaddle

More frameworks, such as MindSpore and MegEngine, will be included in the future.

Evaluated Deep Neural Network models

Two classical deep neural network models are tested in this repository:

  1. ResNet-50 v1.5
  2. BERT-Base

There are many different implementations of these DNN models. We chose the official benchmark sources as well as NVIDIA-DeepLearningExamples. In most cases, we avoid changing any scripts or code from the original sources; where changes were necessary, they are noted in the documentation.

More DNN models will be tested in the future.

Benchmark Test Scopes

Each DNN model of each framework is tested on a multi-node cluster with different batch sizes, with XLA enabled or disabled, and with automatic mixed precision enabled or disabled.

Multi-Node and Multi-Device

We suggest performing each test with 1-node-1-device, 1-node-8-device, 2-node-16-device, and 4-node-32-device configurations.

Batch Size

In this repository, batch size always means the number of samples per device during training; the total (global) batch size is the per-device batch size multiplied by the total number of devices.
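
For example, with a per-device batch size of 128 on 4 nodes with 8 GPUs each (32 devices in total), the total batch size is 128 × 32 = 4096.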

Because each DL framework has its own device memory management strategy, the maximum batch size per device differs between frameworks. For this reason, we run several groups of tests with different batch sizes.

Normally, a larger batch size produces better performance.

XLA

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate models with potentially no source code changes.

We plan to test these DNN models with and without XLA where the framework supports it.
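
As a rough illustration of how little code this requires (not the exact mechanism used by each benchmark script), here is a minimal sketch of enabling XLA for a training step in TensorFlow 2.x; the `jit_compile` argument is the name used in TF 2.5 and later (earlier 2.x releases call it `experimental_compile`), and the model, optimizer, and data names below are placeholders:

```python
import tensorflow as tf

# Minimal sketch: ask TensorFlow to compile this training step with XLA.
# The model code itself does not need to change.
@tf.function(jit_compile=True)
def train_step(model, optimizer, images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                labels, logits, from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```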

AMP

On some NVIDIA GPUs, Automatic Mixed Precision (AMP) uses FP16 to deliver up to a 3x performance boost over FP32.

We plan to test these DNN models with and without AMP where the framework supports it.
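
As an illustration only (each framework's benchmark scripts enable AMP in their own way, and the model/criterion names below are placeholders), native mixed precision in PyTorch wraps the forward pass in an autocast context and scales the loss so that small FP16 gradients do not underflow:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # maintains the loss-scaling factor

def train_step(model, optimizer, criterion, images, labels):
    optimizer.zero_grad()
    with autocast():                   # run eligible ops in FP16
        outputs = model(images)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()      # backward pass on the scaled loss
    scaler.step(optimizer)             # unscale gradients, then step
    scaler.update()                    # adjust the scale for the next step
    return loss.item()
```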

Median Value Principle

As described in the Benchmark Test Scopes chapter, each test case varies with the following parameters:

  • number of nodes, number of devices
  • batch size per device
  • XLA
  • AMP

Each test case is repeated several times (7 repetitions are suggested), and the median value is chosen as the final result.
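
For instance, the reported number for one test case could be derived from the per-run throughputs as follows (a trivial sketch with made-up values; the repository's own scripts may aggregate their logs differently):

```python
from statistics import median

# Hypothetical throughputs (samples/sec) from 7 repeated runs of one test case.
runs = [12350.4, 12411.9, 12390.1, 12298.7, 12455.3, 12402.8, 12377.0]

print(f"median throughput: {median(runs):.2f} samples/sec")
```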

Throughput

Throughput is the average training samples per second, e.g. images/sec for image classification.

To obtain a continuous and stable measurement, the first several training steps are ignored. In practice, we skip the first 20 training steps and measure the processing time of the following 100 steps to calculate throughput.
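
A minimal sketch of the calculation, assuming a hypothetical `run_step()` callable that performs one training step:

```python
import time

WARMUP_STEPS = 20     # steps ignored at the beginning of training
MEASURE_STEPS = 100   # steps whose processing time is measured

def measure_throughput(run_step, batch_size_per_device, num_devices):
    """Return the average number of training samples processed per second."""
    for _ in range(WARMUP_STEPS):
        run_step()
    start = time.perf_counter()
    for _ in range(MEASURE_STEPS):
        run_step()
    elapsed = time.perf_counter() - start
    return MEASURE_STEPS * batch_size_per_device * num_devices / elapsed
```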

Files and Folders

  • README.md: introduces general information of this repository.
  • NVIDIADeepLearningExamples/: holds the reproducible scripts and test reports for DNN models from NVIDIA DeepLearningExamples, covering frameworks such as TensorFlow 1.x, PyTorch, and MXNet with the corresponding models optimized by NVIDIA;
  • OneFlow/: holds the reproducible scripts and test reports for DNN models from OneFlow official benchmark;
  • PaddlePaddle/: holds the reproducible scripts and test reports for DNN models from PaddlePaddle official benchmark;
  • TensorFlow/: holds the reproducible scripts and test reports for DNN models from TensorFlow 2.x official benchmark;
  • PyTorch/: holds the reproducible scripts and test reports for DNN models from PyTorch official benchmark;
  • MxNet/: holds the reproducible scripts and test reports for DNN models from gluon-nlp and gluon-cv;
  • reports/: holds rounds of DNN benchmark test reports.

Summary of Latest Test Results

This section maintains a summary of the latest results. For more details, please see the reports folder.

Latest Test Report

DLPerf Benchmark Test Report v1.0 on 4 nodes with 8x Tesla V100-SXM2-16GB GPUs each.

DLPerf Benchmark Test Report v1.0 (Chinese version)

ResNet50-v1.5 Training Performance (images/sec)

Our results were obtained by running the applicable training scripts on 4 nodes with 8x Tesla V100-SXM2-16GB GPUs each. The specific training script used is documented in the corresponding model's README. bsz denotes the batch size per GPU.

The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: ResNet50 v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.

This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% images/sec).
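
To make the placement of the stride concrete, here is a stripped-down, purely illustrative PyTorch-style view of the three convolutions inside a downsampling bottleneck block (BatchNorm, ReLU, and the shortcut path are omitted; the function name is ours, not from any of the benchmarked repositories):

```python
import torch.nn as nn

def bottleneck_convs(in_ch, mid_ch, stride, v1_5=True):
    # v1: the downsampling stride sits in the first 1x1 convolution.
    # v1.5: the stride moves to the 3x3 convolution.
    s1, s3 = (1, stride) if v1_5 else (stride, 1)
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=s1, bias=False),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=s3, padding=1, bias=False),
        nn.Conv2d(mid_ch, mid_ch * 4, kernel_size=1, stride=1, bias=False),
    )
```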

| Framework | Source | FP32 throughput (img/s), bsz=128 | FP32 speedup, bsz=128 | AMP throughput (img/s), bsz=256 | AMP speedup, bsz=256 |
| --- | --- | --- | --- | --- | --- |
| OneFlow | OneFlow-Benchmark | 12411.97 | 31.21 | 33141.02 | 22.50 |
| NGC MXNet | NVIDIA-DeepLearningExamples | 11233.92 | 28.67 | 30713.68 | 22.03 |
| NGC TensorFlow 1.x | NVIDIA-DeepLearningExamples | 9514.64 | 26.25 | [1] 29171.69 w/ XLA, 24734.22 w/o XLA | 24.34 w/ XLA, 26.17 w/o XLA |
| NGC PyTorch | NVIDIA-DeepLearningExamples | 10917.09 | 29.72 | 22551.16 | 28.09 |
| MXNet | gluon-cv | 9579.74 | 24.93 | 10565.55 | 12.67 |
| TensorFlow 2.x | TensorFlow-models | 9418.44 | 29.27 | 19314.31 | 17.96 |
| PyTorch | PyTorch-examples | 10021.29 | 28.75 | [2] - | - |
| PaddlePaddle | PaddleCV | 9348.17 | 26.50 | [3] 10633.22, 11617.57 w/ DALI | 10.2, 13.1 w/ DALI |

[1]: The AMP throughput of TensorFlow 1.x is reported both with and without XLA, and was measured with bsz = 224 because bsz = 256 runs out of memory (OOM).

[2]: The official PyTorch benchmark repository PyTorch-examples does NOT support AMP; we will test with the NVIDIA APEX plug-in in the near future.

[3]: The AMP throughput of 10633.22 img/s for PaddlePaddle was obtained with bsz = 224 and without DALI, because bsz = 256 runs out of memory (OOM). The throughput of 11617.57 img/s was obtained with bsz = 196 and the DALI-paddle plug-in; DALI occupies additional GPU device memory, so both bsz = 224 and bsz = 256 encounter OOM. The official figure of 28594 img/s published by PaddlePaddle was measured on V100 32GB GPUs with a PaddlePaddle Docker image including DALI that has not been released, so we cannot reproduce that result. If anyone can help us improve the PaddlePaddle results, please contact us via an issue.

BERT-Base Pretraining Performance (sentences/sec)

Our results were obtained by running the applicable training scripts on 4 nodes with 8x Tesla V100-SXM2-16GB GPUs each. The specific training script used is documented in the corresponding model's README. bsz denotes the batch size per GPU.

| Framework | Source | FP32 throughput, bsz=max | FP32 throughput, bsz=32 | AMP throughput, bsz=max | AMP throughput, bsz=64 |
| --- | --- | --- | --- | --- | --- |
| OneFlow | OneFlow-Benchmark | 4664.10 (bsz=96) | 3689.80 | 15724.70 (bsz=160) | 9911.78 |
| NGC TensorFlow 1.x | NVIDIA-DeepLearningExamples | 3089.74 (bsz=48) | 2727.90 | 11650.0 w/ XLA (bsz=96) | [4] 9409.2 w/ XLA, 5189.07 w/o XLA |
| NGC PyTorch | NVIDIA-DeepLearningExamples | 3039.3 (bsz=48) | 2885.81 | 10349.12 (bsz=96) | 9331.72 |
| PaddlePaddle | PaddleNLP | 3167.68 (bsz=96) | 2073.60 | 5452.35 (bsz=160) | 3406.36 |
| OneFlow w/o clip | OneFlow-Benchmark | 4799.64 (bsz=96) | 4019.45 | 17210.63 (bsz=160) | 11195.72 |
| [5] MXNet w/o clip | gluon-nlp | 4340.89 (bsz=64) | 3671.45 | 14822.31 (bsz=128) | 11269.14 |

[4]: The AMP throughput of TensorFlow 1.x is reported both with and without XLA.

[5]: The MXNet BERT script in the gluon-nlp repository does NOT support the clip_by_global_norm operation in the Adam optimizer. Without clip_by_global_norm, throughput is higher but fine-tuning accuracy may be lower, so we also tested OneFlow without the clip operation for comparison.
