
dawn-bench-entries's Introduction

DAWNBench Submission Instructions

Thank you for your interest in DAWNBench!

To add your model to our leaderboard, open a Pull Request titled <Model name> || <Task name> || <Author name> (example PR), with JSON (and TSV where applicable) result files in the format outlined below.

Tasks

CIFAR10 Training

Task Description

We evaluate image classification performance on the CIFAR10 dataset.

For training, we have two metrics:

  • Training Time: Train an image classification model for the CIFAR10 dataset. Report the time needed to train a model with test set accuracy of at least 94%
  • Cost: On public cloud infrastructure, compute the total time needed to reach a test set accuracy of 94% or greater, as outlined above. Multiply the time taken (in hours) by the cost of the instance per hour, to obtain the total cost of training the model

Including cost is optional and will only be calculated if the costPerHour field is included in the JSON file. Submissions that only aim for time aren't restricted to public cloud infrastructure.
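For concreteness, the cost metric is simply the measured wall-clock training time multiplied by the hourly instance price. Below is a minimal Python sketch with illustrative values (not taken from any submission):

hours_to_94 = 2.5          # measured wall-clock time to reach 94% test accuracy, in hours (illustrative)
cost_per_hour = 0.90       # on-demand instance price in USD per hour (illustrative)

total_cost = hours_to_94 * cost_per_hour    # cost metric in USD
print(f"Training cost: ${total_cost:.2f}")  # -> Training cost: $2.25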

JSON Format

Results for the CIFAR10 training tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model training was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • costPerHour: [Optional] Reported in USD ($). Cost of instance per hour
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as learning rate schedule, optimization algorithm, framework version, etc.

In addition, report training progress at the end of every epoch in a TSV with the following format,

epoch\thours\ttop1Accuracy

We will compute time to reach a test set accuracy of 94% by reading off the first entry in the above TSV with a top-1 test set accuracy of at least 94%.
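For illustration, this read-off could be done with a short Python script like the sketch below (the file name follows the naming convention described just below and is only an example):

import csv

# Find the first epoch whose top-1 test accuracy reaches 94% and report
# the corresponding elapsed training time in hours.
with open("dawn_resnet56_1k80-gc_tensorflow.tsv") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        if float(row["top1Accuracy"]) >= 94.0:
            print(f"Reached 94% at epoch {row['epoch']} after {float(row['hours']):.3f} hours")
            break
    else:
        print("94% accuracy was not reached in this log")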

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the CIFAR10/train/ sub-directory.
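A small sketch of the naming convention applied programmatically, using the example values above (purely illustrative):

author, model, hardware_tag, framework = "dawn", "resnet56", "1k80-gc", "tensorflow"
base = f"{author}_{model}_{hardware_tag}_{framework}"
json_path = f"CIFAR10/train/{base}.json"  # CIFAR10/train/dawn_resnet56_1k80-gc_tensorflow.json
tsv_path = f"CIFAR10/train/{base}.tsv"    # CIFAR10/train/dawn_resnet56_1k80-gc_tensorflow.tsv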

Example JSON and TSV

JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow",
    "model": "ResNet 56",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "costPerHour": 0.90,
    "timestamp": "2017-08-14",
    "misc": {}
}

TSV

epoch   hours top1Accuracy
1       0.07166666666666667     33.57
2       0.1461111111111111      52.51
3       0.21805555555555556     61.71
4       0.2902777777777778      69.46
5       0.3622222222222222      71.47
6       0.43416666666666665     69.64
7       0.5061111111111111      75.81

CIFAR10 Inference

Task Description

We evaluate image classification performance on the CIFAR10 dataset.

For inference, we have two metrics:

  • Latency: Use a model that has a test set accuracy of 94% or greater. Measure the total time needed to classify all 10,000 images in the CIFAR10 test set one-at-a-time, and then divide by 10,000
  • Cost: Use a model that has a test set accuracy of 94% or greater. Measure the average per-image latency over the CIFAR10 test set (as above), and then multiply by the cost of the instance per unit time to obtain the cost of classifying a single image
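The latency metric follows a strict one-at-a-time protocol: each test image is classified individually (batch size 1) and the total wall-clock time is averaged over the 10,000 images. Below is a minimal, framework-agnostic Python sketch; classify is a placeholder for the submission's own single-image inference call, not part of any required API:

import time

def mean_latency_ms(classify, images):
    # Classify each image individually (batch size 1) and average the
    # wall-clock time over the whole test set.
    start = time.perf_counter()
    for image in images:
        classify(image)
    elapsed = time.perf_counter() - start
    return elapsed / len(images) * 1000.0   # average per-image latency in milliseconds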

JSON Format

Results for the CIFAR10 inference tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model inference was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • latency: Reported in milliseconds. Time needed to classify one image
  • cost: Reported in USD ($). Cost of performing inference on a single image. Computed as costPerHour * latency
  • top1Accuracy: Reported in percentage points from 0 to 100. Accuracy of model on CIFAR10 test dataset.
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training / inference logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as batch size, framework version, etc.

Note that it is only necessary to specify one of the latency and cost fields outlined above. However, it is encouraged to specify both (if available) in a single JSON result file.
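Since latency is reported in milliseconds and instance pricing in USD per hour, computing the per-image cost involves a unit conversion. A minimal sketch with illustrative values (they do not correspond to any particular entry):

latency_ms = 43.45      # average per-image latency in milliseconds (illustrative)
cost_per_hour = 0.90    # instance price in USD per hour (illustrative)

# Convert the latency from milliseconds to hours before multiplying by the hourly price.
cost_per_image = cost_per_hour * (latency_ms / 3_600_000)
print(f"{cost_per_image:.2e} USD per image")   # -> 1.09e-05 USD per image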

JSON files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_resnet56_1k80-gc_tensorflow.json. Put the JSON file in the CIFAR10/inference/ sub-directory.

Example JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow",
    "model": "ResNet 56",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "latency": 43.45,
    "cost": 1e-6,
    "accuracy": 94.45,
    "timestamp": "2017-08-14",
    "misc": {}
}

ImageNet Training

Task Description

We evaluate image classification performance on the ImageNet dataset.

For training, we have two metrics:

  • Training Time: Train an image classification model for the ImageNet dataset. Report the time needed to train a model with top-5 validation accuracy of at least 93%
  • Cost: On public cloud infrastructure, compute the total time needed to reach a top-5 validation accuracy of 93% or greater, as outlined above. Multiply the time taken (in hours) by the cost of the instance per hour, to obtain the total cost of training the model

Including cost is optional and will only be calculated if the costPerHour field is included in the JSON file. Submissions that only aim for time aren't restricted to public cloud infrastructure.

JSON Format

Results for the ImageNet training tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model training was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • costPerHour: [Optional] Reported in USD ($). Cost of instance per hour
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as learning rate schedule, optimization algorithm, framework version, etc.

In addition, report training progress at the end of every epoch in a TSV with the following format,

epoch\thours\ttop1Accuracy\ttop5Accuracy

We will compute time to reach a top-5 validation accuracy of 93% by reading off the first entry in the above TSV with a top-5 validation accuracy of at least 93%.
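For illustration, a training loop might append one row to this TSV at the end of every epoch. The sketch below is a minimal, framework-agnostic example; the helper name and the way elapsed time is tracked are assumptions, not requirements:

import csv, os, time

def append_epoch_row(path, epoch, start_time, top1, top5):
    # Append one per-epoch record in the required tab-separated format,
    # writing the header row the first time the file is created.
    is_new = not os.path.exists(path)
    hours = (time.time() - start_time) / 3600.0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if is_new:
            writer.writerow(["epoch", "hours", "top1Accuracy", "top5Accuracy"])
        writer.writerow([epoch, hours, top1, top5])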

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the ImageNet/train/ sub-directory.

Example JSON and TSV

JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow",
    "model": "ResNet 50",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "costPerHour": 0.90,
    "timestamp": "2017-08-14",
    "misc": {}
}

TSV

epoch   hours top1Accuracy top5Accuracy
1       0.07166666666666667     33.57     68.93
2       0.1461111111111111      52.51     72.48 
3       0.21805555555555556     61.71     81.46
4       0.2902777777777778      69.46     81.92
5       0.3622222222222222      71.47     82.17 
6       0.43416666666666665     69.64     83.68
7       0.5061111111111111      75.81     84.31 

ImageNet Inference

Task Description

We evaluate image classification performance on the ImageNet dataset.

For inference, we have two metrics:

  • Latency: Use a model that has a top-5 validation accuracy of 93% or greater. Measure the total time needed to classify all 50,000 images in the ImageNet validation set one-at-a-time, and then divide by 50,000
  • Cost: Use a model that has a top-5 validation accuracy of 93% or greater. Measure the average latency of performing inference on a single image (as described above), and then multiply by the cost of the instance per hour to obtain the cost of performing inference on a single image

JSON Format

Results for the ImageNet inference tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model inference was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • latency: Reported in milliseconds. Time needed to classify one image
  • cost: Reported in USD ($). Cost of performing inference on a single image. Computed as costPerHour * latency
  • top5Accuracy: Reported in percentage points from 0 to 100. Top-5 accuracy of the model on the ImageNet validation set.
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training / inference logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as batch size, framework version, etc.

Note that it is only necessary to specify one of the latency and cost fields outlined above. However, it is encouraged to specify both (if available) in a single JSON result file.

JSON files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_resnet56_1k80-gc_tensorflow.json. Put the JSON file in the ImageNet/inference/ sub-directory.

Example JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow",
    "model": "ResNet 50",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "latency": 43.45,
    "cost": 4.27e-6,
    "top5Accuracy": 93.45,
    "timestamp": "2017-08-14",
    "misc": {}
}

SQuAD Training

Task Description

We evaluate question answering performance on the SQuAD dataset.

For training, we have two metrics:

  • Training Time: Train a question answering model for the SQuAD dataset. Report the time needed to train a model with a dev set F1 score of at least 0.73
  • Cost: On public cloud infrastructure, compute the total time needed to reach a dev set F1 score of 0.73 or greater, as outlined above. Multiply the time taken (in hours) by the cost of the instance per hour, to obtain the total cost of training the model

Including cost is optional and will only be calculated if the costPerHour field is included in the JSON file. Submissions that only aim for time aren't restricted to public cloud infrastructure.

JSON Format

Results for the SQuAD training tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model training was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • costPerHour: [Optional] Reported in USD ($). Cost of instance per hour
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training / inference logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as learning rate schedule, optimization algorithm, framework version, etc.

In addition, report training progress at the end of every epoch in a TSV with the following format,

epoch\thours\tf1Score

We will compute time to reach an F1 score of 0.73 by reading off the first entry in the above TSV with an F1 score of at least 0.73.

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_bidaf_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the SQuAD/train/ sub-directory.

Example JSON and TSV

JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow_qa/bi-att-flow",
    "model": "BiDAF",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "costPerHour": 0.90,
    "timestamp": "2017-08-14",
    "misc": {}
}

TSV

epoch   hours f1Score
1     0.7638888888888888      0.5369029640999999
2     1.5238381055555557      0.6606892943
3     2.2855751       0.700419426
4     3.0448481305555557      0.7229908705
5     3.806446388888889       0.731013
6     4.5750864       0.7370445132
7     5.346703258333334       0.7413719296

SQuAD Inference

Task Description

We evaluate question answering performance on the SQuAD dataset.

For inference, we have two metrics:

  • Latency: Use a model that has a dev set F1 score of 0.73 or greater. Measure the total time needed to answer all questions in the SQuAD dev set one-at-a-time, and then divide by the number of questions
  • Cost: Use a model that has a dev set F1 score of 0.73 or greater. Measure the average latency needed to perform inference on a single question, and then multiply by the cost of the instance per unit time to obtain the cost of answering a single question

JSON Format

Results for the SQuAD inference tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model inference was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • latency: Reported in milliseconds. Time needed to answer one question
  • cost: Reported in USD ($). Cost of performing inference on a single question. Computed as costPerHour * latency
  • f1Score: Reported as a fraction from 0.0 to 1.0. F1 score of the model on the SQuAD development dataset
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training / inference logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as batch size, framework version, etc.

Note that it is only necessary to specify one of the latency and cost fields outlined above. However, it is encouraged to specify both (if available) in a single JSON result file.

JSON files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_bidaf_1k80-gc_tensorflow.json. Put the JSON file in the SQuAD/inference/ sub-directory.

Example JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow_qa/bi-att-flow",
    "model": "BiDAF",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "latency": 590.0,
    "cost": 2e-6,
    "f1Score": 0.7524165510999999,
    "timestamp": "2017-08-14",
    "misc": {}
}

FAQ

  • Can spot instances be used for cost metrics? For submissions including cost, please use on-demand, i.e., non-preemptible, instance pricing. Spot pricing is too volatile for the current release of the benchmark. We're open to suggestions on better ways to deal with pricing volatility, so if you have ideas, please pitch them on the Google group
  • Is validation time included in training time? No, you don't need to include the time required to calculate validation accuracy and save checkpoints.
  • What happens after I submit a pull request with a new result? After you submit a PR, unit tests should automatically run to check basic requirements. Assuming the unit tests pass, we review the code and the submission. If it is sufficiently similar to existing results or the difference is easily justified, we accept the submission without reproducing it. If there are issues with the code or someone questions the results, the process is a little more complicated and can vary from situation to situation. If the issues are small, it may be as simple as changing the JSON file.

Disclosure: The Stanford DAWN research project is a five-year industrial affiliates program at Stanford University and is financially supported in part by founding members including Intel, Microsoft, NEC, Teradata, VMWare, and Google. For more information, including information regarding Stanford’s policies on openness in research and policies affecting industrial affiliates program membership, please see DAWN's membership page.



dawn-bench-entries's Issues

Printing # of epochs and training time during training

Hi, @yaroslavvb

What script and flags are you using to get the training results with ResNet-50 on ImageNet? I am running the tf_cnn_benchmarks.py benchmarks with ImageNet, and I need to print the number of epochs and the training time during training. I can't find the flags to enable printing this information; do I need to modify the script to do it?

Here is the display I am getting:
Step Img/sec total_loss top_1_accuracy top_5_accuracy
1 images/sec: 451.0 +/- 0.0 (jitter = 0.0) 8.168 0.003 0.005

Questions on inference latency

For the DAWNBench latency rule,
I have a question and need your confirmation:

when we calculate the latency, can we ignore image processing time?
For example, can we handle image processing (including decoding, resizing, and cropping) offline?

Thanks

Kindly requesting some info

Hi everyone, thanks for this amazing work.
I was wondering if you could shed some light on the following questions.

I see different benchmarks on different datasets and different hardware, but I am having some trouble inferring the following information:

For instance, we see training time differ by half on TPUs, but between different models, i.e. ResNet and AmoebaNet.

I would like to know what the speed gain would be if:

  1. We have the same model on the same hardware but a different version of the library, for instance TF 1.7 vs. TF 1.8. In other words, how much of the speed gain is solely due to the new software release.

  2. We test the same model on different hardware but the same version of the library, so that we can understand how much of the percentage gain is solely from the hardware.

Thanks in advance!

Clarification about checkpoints in training on Imagenet

We're working on a DAWNBench entry that uses a slice of a TPU pod. Each epoch is processed so quickly on the pod that a significant amount of time is now being spent on saving checkpoints. Would it be possible for us to provide a submission where we only checkpoint once at the end of the training run and then run eval to validate accuracy?

As another possibility, we could provide data on two runs, one with checkpointing enabled for every epoch and the other with checkpointing disabled until the end. You could use the timing of the run without checkpointing but inspect the accuracy values along the way via the auxiliary run with checkpointing.

Please let us know if either of these paths would be acceptable.

Questions on inference Latency

Hello~
I'm trying to reproduce the work of the PingAn GammaLab & PingAn Cloud team, which is No. 1 on the inference latency benchmark. This work uses this model to evaluate the inference time.

I notice that this model is not the original resnet50, and the network architecture is quite different from resnet50.

So, could you help me confirm their real network architecture?
And I'm wondering: is it allowed to use a lightweight network like MobileNet here?

Resubmission should not be allowed after competition deadline

I noticed that there were resubmissions from fast.ai for the ImageNet training track after the competition deadline:
https://github.com/stanford-futuredata/dawn-bench-entries/blob/master/ImageNet/train/fastai_pytorch.json
https://github.com/stanford-futuredata/dawn-bench-entries/blob/master/ImageNet/train/fastai_pytorch.tsv

Their new result came from many code changes, including hyperparameter tuning and model changes. I don't think this is fair to other participants. Resubmission should not be allowed after the competition deadline.

As a fair alternative, I would suggest that the organizers create a new ranking list for ImageNet without those blacklisted images. Any submission made prior to the deadline can be ranked in the respective list. This would avoid confusion and also honor the competition rules.

Initialization / data preparation / checkpointing time included?

I assume the hours field reported in DAWNBench is the time between the start of the program and the point at which each checkpoint is saved. Can we exclude initialization time before training? For example, we could load the entire CIFAR dataset into memory first. Also, saving checkpoints to disk is expensive, especially when training is very fast. Can we exclude the checkpointing time as well?

Additionally, can we report training progress saved every x epochs?

Blacklisted or non-blacklisted validation set:

The ImageNet validation set consists of 50,000 images. In the 2014 devkit, there is a list of 1762 "blacklisted" files. When we report the top-5 accuracy, should we use the blacklisted or non-blacklisted version? In Google's submission, results are obtained by using the full 50,000 including those blacklisted images. But some submissions used the blacklisted version. Just make sure we're comparing the same thing.

Clarification about training time for Imagenet

I had a quick question. In measuring the training cost and time, should the job be run as:

  1. A series of train and eval steps (1 epoch at a time), where the total time is measured for train and eval combined.
    OR
  2. Training alone, with checkpoints stored at each epoch; this is the training cost.
    Eval results can be generated post hoc from the checkpoints but don't contribute to the training time.

What are the guidelines on how the job should be configured here?

Questions on inference latency/cost

Hello,

I am trying to understand the latency rule in DAWNBench:
• Latency: Use a model that has a top-5 validation accuracy of 93% or greater. Measure the total time needed to classify all 50,000 images in the ImageNet validation set one-at-a-time, and then divide by 50,000

I am not sure how to interpret "one-at-a-time" here, so I have a few questions and need your confirmation:

  1. Does it allow the pipeline of image processing and CNN inference?
  2. Does it allow preprocessed images (resize and crop done offline)?
  3. Does it allow dummy data?

Thanks.

Question on inference cost

Hi,
To calculate the inference cost, is it permitted to use dual instances in one VM?
To say it explicitly:

  1. Launch 2 inference processes (or threads), each serving 25k images from imagenet-2012-val
  2. Get the total time and calculate the average cost for every 10k images.
    The formula would be: max[sum(process1_time), sum(process2_time)] / 50 * 10 * vm_cost_per_milliseconds.

For inference latency, we will still use: [sum(process1_time) + sum(process2_time)] / total_images to measure per-image latency.

Thanks very much.
