
dlrm's Introduction

Deep Learning Recommendation Model for Personalization and Recommendation Systems:

Copyright (c) Facebook, Inc. and its affiliates.

Description:

An implementation of a deep learning recommendation model (DLRM). The model input consists of dense and sparse features. The former is a vector of floating point values; the latter is a list of sparse indices into embedding tables, which consist of vectors of floating point values. The selected vectors are passed to MLP networks (denoted by triangles in the diagram below); in some cases the vectors are interacted through operators (Ops).

output:
                    probability of a click
model:                        |
                             /\
                            /__\
                              |
      _____________________> Op  <___________________
    /                         |                      \
   /\                        /\                      /\
  /__\                      /__\           ...      /__\
   |                          |                       |
   |                         Op                      Op
   |                    ____/__\_____           ____/__\____
   |                   |_Emb_|____|__|    ...  |_Emb_|__|___|
input:
[ dense features ]     [sparse indices] , ..., [sparse indices]

More precise definition of model layers:

  1. fully connected layers of an MLP

    z = f(y)

    y = Wx + b

  2. embedding lookup (for a list of sparse indices p=[p1,...,pk])

    z = Op(e1,...,ek)

    obtain vectors e1=E[:,p1], ..., ek=E[:,pk]

  3. Operator Op can be one of the following

    Sum(e1,...,ek) = e1 + ... + ek

    Dot(e1,...,ek) = [e1'e1, ..., e1'ek, ..., ek'e1, ..., ek'ek]

    Cat(e1,...,ek) = [e1', ..., ek']'

    where ' denotes transpose operation
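
A minimal PyTorch sketch of these building blocks (illustrative only; all sizes here are made up, see dlrm_s_pytorch.py for the actual implementation):

import torch
import torch.nn as nn

# 1. one fully connected layer of an MLP: z = f(Wx + b), here with f = ReLU
fc = nn.Sequential(nn.Linear(4, 3), nn.ReLU())
x = torch.randn(2, 4)                        # mini-batch of 2 dense feature vectors
y = fc(x)

# 2. embedding lookup for sparse indices p = [p1, ..., pk];
#    EmbeddingBag with mode="sum" fuses the lookup with the Sum operator
E = nn.EmbeddingBag(10, 3, mode="sum")       # table with 10 vectors of dimension 3
p = torch.tensor([1, 4, 7])                  # sparse indices p1, ..., pk
offsets = torch.tensor([0])                  # a single bag starting at position 0
z_sum = E(p, offsets)                        # Sum(e1, ..., ek)

# 3. the Dot and Cat operators on a stack of k vectors e1, ..., ek
e = torch.randn(3, 3)                        # k = 3 vectors stacked as rows
z_dot = (e @ e.t()).reshape(-1)              # Dot: all pairwise dot products ei'ej
z_cat = e.reshape(1, -1)                     # Cat: concatenation [e1', ..., ek']'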

See our blog post to learn more about DLRM: https://ai.facebook.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/.

Cite Work:

@article{DLRM19,
  author    = {Maxim Naumov and Dheevatsa Mudigere and Hao{-}Jun Michael Shi and Jianyu Huang and Narayanan Sundaraman and Jongsoo Park and Xiaodong Wang and Udit Gupta and Carole{-}Jean Wu and Alisson G. Azzolini and Dmytro Dzhulgakov and Andrey Mallevich and Ilia Cherniavskii and Yinghai Lu and Raghuraman Krishnamoorthi and Ansha Yu and Volodymyr Kondratenko and Stephanie Pereira and Xianjie Chen and Wenlin Chen and Vijay Rao and Bill Jia and Liang Xiong and Misha Smelyanskiy},
  title     = {Deep Learning Recommendation Model for Personalization and Recommendation Systems},
  journal   = {CoRR},
  volume    = {abs/1906.00091},
  year      = {2019},
  url       = {https://arxiv.org/abs/1906.00091},
}

Related Work:

On the system architecture implications, with DLRM as one of the benchmarks,

@article{ArchImpl19,
  author    = {Udit Gupta and Xiaodong Wang and Maxim Naumov and Carole{-}Jean Wu and Brandon Reagen and David Brooks and Bradford Cottel and Kim M. Hazelwood and Bill Jia and Hsien{-}Hsin S. Lee and Andrey Malevich and Dheevatsa Mudigere and Mikhail Smelyanskiy and Liang Xiong and Xuan Zhang},
  title     = {The Architectural Implications of Facebook's DNN-based Personalized Recommendation},
  journal   = {CoRR},
  volume    = {abs/1906.03109},
  year      = {2019},
  url       = {https://arxiv.org/abs/1906.03109},
}

On the embedding compression techniques (for number of vectors), with DLRM as one of the benchmarks,

@article{QuoRemTrick19,
  author    = {Hao{-}Jun Michael Shi and Dheevatsa Mudigere and Maxim Naumov and Jiyan Yang},
  title     = {Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems},
  journal   = {CoRR},
  volume    = {abs/1909.02107},
  year      = {2019},
  url       = {https://arxiv.org/abs/1909.02107},
}

On the embedding compression techniques (for dimension of vectors), with DLRM as one of the benchmarks,

@article{MixDimTrick19,
  author    = {Antonio Ginart and Maxim Naumov and Dheevatsa Mudigere and Jiyan Yang and James Zou},
  title     = {Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems},
  journal   = {CoRR},
  volume    = {abs/1909.11810},
  year      = {2019},
  url       = {https://arxiv.org/abs/1909.11810},
}

Implementation

DLRM PyTorch. Implementation of DLRM in PyTorch framework:

   dlrm_s_pytorch.py

DLRM Caffe2. Implementation of DLRM in Caffe2 framework:

   dlrm_s_caffe2.py

DLRM Data. Implementation of DLRM data generation and loading:

   dlrm_data_pytorch.py, dlrm_data_caffe2.py, data_utils.py

DLRM Tests. Implementation of DLRM tests in ./test

   dlrm_s_test.sh

DLRM Benchmarks. Implementation of DLRM benchmarks in ./bench

   dlrm_s_criteo_kaggle.sh, dlrm_s_criteo_terabyte.sh, dlrm_s_benchmark.sh

Related Work:

On the Glow framework implementation

https://github.com/pytorch/glow/blob/master/tests/unittests/RecommendationSystemTest.cpp

On the FlexFlow framework distributed implementation with Legion backend

https://github.com/flexflow/FlexFlow/blob/master/examples/cpp/DLRM/dlrm.cc

How to run dlrm code?

  1. A sample run of the code with a tiny model is shown below
$ python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6
time/loss/accuracy (if enabled):
Finished training it 1/3 of epoch 0, -1.00 ms/it, loss 0.451893, accuracy 0.000%
Finished training it 2/3 of epoch 0, -1.00 ms/it, loss 0.402002, accuracy 0.000%
Finished training it 3/3 of epoch 0, -1.00 ms/it, loss 0.275460, accuracy 0.000%
  2. A sample run of the code with a tiny model in debug mode
$ python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6 --debug-mode
model arch:
mlp top arch 3 layers, with input to output dimensions:
[8 4 2 1]
# of interactions
8
mlp bot arch 2 layers, with input to output dimensions:
[4 3 2]
# of features (sparse and dense)
4
dense feature size
4
sparse feature size
2
# of embeddings (= # of sparse features) 3, with dimensions 2x:
[4 3 2]
data (inputs and targets):
mini-batch: 0
[[0.69647 0.28614 0.22685 0.55131]
 [0.71947 0.42311 0.98076 0.68483]]
[[[1], [0, 1]], [[0], [1]], [[1], [0]]]
[[0.55679]
 [0.15896]]
mini-batch: 1
[[0.36179 0.22826 0.29371 0.63098]
 [0.0921  0.4337  0.43086 0.49369]]
[[[1], [0, 2, 3]], [[1], [1, 2]], [[1], [1]]]
[[0.15307]
 [0.69553]]
mini-batch: 2
[[0.60306 0.54507 0.34276 0.30412]
 [0.41702 0.6813  0.87546 0.51042]]
[[[2], [0, 1, 2]], [[1], [2]], [[1], [1]]]
[[0.31877]
 [0.69197]]
initial parameters (weights and bias):
[[ 0.05438 -0.11105]
 [ 0.42513  0.34167]
 [-0.1426  -0.45641]
 [-0.19523 -0.10181]]
[[ 0.23667  0.57199]
 [-0.16638  0.30316]
 [ 0.10759  0.22136]]
[[-0.49338 -0.14301]
 [-0.36649 -0.22139]]
[[0.51313 0.66662 0.10591 0.13089]
 [0.32198 0.66156 0.84651 0.55326]
 [0.85445 0.38484 0.31679 0.35426]]
[0.17108 0.82911 0.33867]
[[0.55237 0.57855 0.52153]
 [0.00269 0.98835 0.90534]]
[0.20764 0.29249]
[[0.52001 0.90191 0.98363 0.25754 0.56436 0.80697 0.39437 0.73107]
 [0.16107 0.6007  0.86586 0.98352 0.07937 0.42835 0.20454 0.45064]
 [0.54776 0.09333 0.29686 0.92758 0.569   0.45741 0.75353 0.74186]
 [0.04858 0.7087  0.83924 0.16594 0.781   0.28654 0.30647 0.66526]]
[0.11139 0.66487 0.88786 0.69631]
[[0.44033 0.43821 0.7651  0.56564]
 [0.0849  0.58267 0.81484 0.33707]]
[0.92758 0.75072]
[[0.57406 0.75164]]
[0.07915]
DLRM_Net(
  (emb_l): ModuleList(
    (0): EmbeddingBag(4, 2, mode=sum)
    (1): EmbeddingBag(3, 2, mode=sum)
    (2): EmbeddingBag(2, 2, mode=sum)
  )
  (bot_l): Sequential(
    (0): Linear(in_features=4, out_features=3, bias=True)
    (1): ReLU()
    (2): Linear(in_features=3, out_features=2, bias=True)
    (3): ReLU()
  )
  (top_l): Sequential(
    (0): Linear(in_features=8, out_features=4, bias=True)
    (1): ReLU()
    (2): Linear(in_features=4, out_features=2, bias=True)
    (3): ReLU()
    (4): Linear(in_features=2, out_features=1, bias=True)
    (5): Sigmoid()
  )
)
time/loss/accuracy (if enabled):
Finished training it 1/3 of epoch 0, -1.00 ms/it, loss 0.451893, accuracy 0.000%
Finished training it 2/3 of epoch 0, -1.00 ms/it, loss 0.402002, accuracy 0.000%
Finished training it 3/3 of epoch 0, -1.00 ms/it, loss 0.275460, accuracy 0.000%
updated parameters (weights and bias):
[[ 0.0543  -0.1112 ]
 [ 0.42513  0.34167]
 [-0.14283 -0.45679]
 [-0.19532 -0.10197]]
[[ 0.23667  0.57199]
 [-0.1666   0.30285]
 [ 0.10751  0.22124]]
[[-0.49338 -0.14301]
 [-0.36664 -0.22164]]
[[0.51313 0.66663 0.10591 0.1309 ]
 [0.32196 0.66154 0.84649 0.55324]
 [0.85444 0.38482 0.31677 0.35425]]
[0.17109 0.82907 0.33863]
[[0.55238 0.57857 0.52154]
 [0.00265 0.98825 0.90528]]
[0.20764 0.29244]
[[0.51996 0.90184 0.98368 0.25752 0.56436 0.807   0.39437 0.73107]
 [0.16096 0.60055 0.86596 0.98348 0.07938 0.42842 0.20453 0.45064]
 [0.5476  0.0931  0.29701 0.92752 0.56902 0.45752 0.75351 0.74187]
 [0.04849 0.70857 0.83933 0.1659  0.78101 0.2866  0.30646 0.66526]]
[0.11137 0.66482 0.88778 0.69627]
[[0.44029 0.43816 0.76502 0.56561]
 [0.08485 0.5826  0.81474 0.33702]]
[0.92754 0.75067]
[[0.57379 0.7514 ]]
[0.07908]

Testing

Testing scripts to confirm functional correctness of the code

./test/dlrm_s_test.sh
Running commands ...
python dlrm_s_pytorch.py
python dlrm_s_caffe2.py
Checking results ...
diff test1 (no numeric values in the output = SUCCESS)
diff test2 (no numeric values in the output = SUCCESS)
diff test3 (no numeric values in the output = SUCCESS)
diff test4 (no numeric values in the output = SUCCESS)

NOTE: Testing scripts accept extra arguments which will be passed along to the model, such as --use-gpu

Benchmarking

  1. Performance benchmarking

    ./bench/dlrm_s_benchmark.sh
    
  2. The code supports interface with the Criteo Kaggle Display Advertising Challenge Dataset.

    • Please do the following to prepare the dataset for use with DLRM code:
      • First, specify the raw data file (train.txt) as downloaded with --raw-data-file=<path/train.txt>
      • This is then pre-processed (categorized, concatenated across days, ...) for use with the DLRM code
      • The processed data is stored as a .npz file in <root_dir>/input/.npz
      • The processed file (.npz) can be used for subsequent runs with --processed-data-file=<path/.npz>
    • The model can be trained using the following script
      ./bench/dlrm_s_criteo_kaggle.sh [--test-freq=1024]
      

  3. The code supports interface with the Criteo Terabyte Dataset.
    • Please do the following to prepare the dataset for use with DLRM code:
      • First, download the raw data files day_0.gz, ...,day_23.gz and unzip them
      • Specify the location of the unzipped text files day_0, ...,day_23, using --raw-data-file=<path/day> (the day number will be appended automatically)
      • These are then pre-processed (categorized, concatenated across days, ...) for use with the DLRM code
      • The processed data is stored as a .npz file in <root_dir>/input/.npz
      • The processed file (.npz) can be used for subsequent runs with --processed-data-file=<path/.npz>
    • The model can be trained using the following script
      ./bench/dlrm_s_criteo_terabyte.sh ["--test-freq=10240 --memory-map --data-sub-sample-rate=0.875"]
    

NOTE: Benchmarking scripts accept extra arguments which will be passed along to the model, such as --num-batches=100 to limit the number of data samples

  4. The code supports interface with MLPerf benchmark.

    • Please refer to the following training parameters
      --mlperf-logging that keeps track of multiple metrics, including area under the curve (AUC)
    
      --mlperf-acc-threshold that allows early stopping based on accuracy metric
    
      --mlperf-auc-threshold that allows early stopping based on AUC metric
    
      --mlperf-bin-loader that enables preprocessing of data into a single binary file
    
      --mlperf-bin-shuffle that controls whether a random shuffle of mini-batches is performed
    
    • The MLPerf training model is completely specified and can be trained using the following script
      ./bench/run_and_time.sh [--use-gpu]
    
  5. The code now supports synchronous distributed training. Supported backends are gloo, nccl, and mpi, and launch modes are provided for the PyTorch distributed launcher and mpirun. For MPI, users need to write their own MPI launch scripts to configure the running hosts. For example, using the PyTorch distributed launcher, we can launch with the following commands:

# for single node 8 gpus and nccl as backend on randomly generated dataset:
python -m torch.distributed.launch --nproc_per_node=8 dlrm_s_pytorch.py --arch-embedding-size="80000-80000-80000-80000-80000-80000-80000-80000" --arch-sparse-feature-size=64 --arch-mlp-bot="128-128-128-128" --arch-mlp-top="512-512-512-256-1" --max-ind-range=40000000
--data-generation=random --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=2 --print-time --test-freq=2 --test-mini-batch-size=2048 --memory-map --use-gpu --num-batches=100 --dist-backend=nccl

# for multiple nodes, user can add the related argument according to the launcher manual like:
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234
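
A minimal sketch of how a worker launched this way typically initializes the distributed backend (a generic PyTorch pattern, not the exact code in dlrm_s_pytorch.py; the backend string corresponds to --dist-backend above):

import os
import torch
import torch.distributed as dist

# run via torch.distributed.launch, which sets RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT for each spawned process (and LOCAL_RANK, or the --local_rank argument);
# mpirun-based setups usually derive these from the MPI environment instead
backend = "nccl"   # or "gloo" / "mpi", matching --dist-backend
dist.init_process_group(backend=backend)

local_rank = int(os.environ.get("LOCAL_RANK", 0))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

print(f"rank {dist.get_rank()} / {dist.get_world_size()} initialized")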

Model checkpoint saving/loading

During training, the model can be saved using --save-model=<path/model.pt>

The model is saved if there is an improvement in test accuracy (which is checked at --test-freq intervals).

A previously saved model can be loaded using --load-model=<path/model.pt>

Once loaded, the model can be used to continue training, with the saved model serving as a checkpoint. Alternatively, the saved model can be used to evaluate only on the test data set by specifying the --inference-only option.
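
A minimal sketch of the underlying PyTorch save/load pattern (illustrative only; the model and file names here are made up, and the actual script also stores the training/testing state shown in the issues below, so checkpoints produced by dlrm_s_pytorch.py should be loaded with its --load-model flag):

import torch
import torch.nn as nn

# hypothetical tiny model and checkpoint contents, for illustration only
model = nn.Linear(4, 1)
checkpoint = {
    "state_dict": model.state_dict(),
    "epoch": 0,                 # bookkeeping needed to resume training
    "best_test_acc": 0.0,
}
torch.save(checkpoint, "model.pt")

# later: restore the weights, then either continue training or evaluate only
restored = torch.load("model.pt", map_location="cpu")
model.load_state_dict(restored["state_dict"])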

Version

0.1 : Initial release of the DLRM code

1.0 : DLRM with distributed training, CPU support for the row-wise Adagrad optimizer

Requirements

pytorch-nightly (11/10/20)

scikit-learn

numpy

onnx (optional)

pydot (optional)

torchviz (optional)

mpi (optional for distributed backend)

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

dlrm's People

Contributors

amirstar, artemry-nv, colin2328, d4l3k, dkorchevgithub, dmudiger, huwan, janekl, jayhpark530, jeng1220, joshuadeng, mnaumovfb, mneilly-et, narayanan2004, r-barnes, reissaavedra, rohithkrn, s4ayub, samiwilf, taylanbil, tayo, tginart, tgrel, tigrani, tonytong999, xulunfan, xw285cornell, yazhigao, ylgh, zsol


dlrm's Issues

Can't create output model + plots (pytorch)

Hi,

I'm running the benchmark (./bench/dlrm_s_criteo_kaggle.sh) with the smaller dataset.
I added the --save-model=./output/model.pt and --memory-map flags to the command.
By using the memory map I was able to actually load the data, and I think the training was successful, as I saw the following output:

Finished training it 297984/306969 of epoch 0, 41.97 ms/it, loss 0.449257, accuracy 79.112 %
Finished training it 299008/306969 of epoch 0, 41.55 ms/it, loss 0.446398, accuracy 79.331 %
Finished training it 300032/306969 of epoch 0, 42.07 ms/it, loss 0.448324, accuracy 79.166 %
Finished training it 301056/306969 of epoch 0, 41.56 ms/it, loss 0.447720, accuracy 79.162 %
Finished training it 302080/306969 of epoch 0, 41.75 ms/it, loss 0.443990, accuracy 79.356 %
Finished training it 303104/306969 of epoch 0, 41.58 ms/it, loss 0.445195, accuracy 79.299 %
Finished training it 304128/306969 of epoch 0, 41.76 ms/it, loss 0.448235, accuracy 79.118 %
Finished training it 305152/306969 of epoch 0, 41.64 ms/it, loss 0.447002, accuracy 79.177 %
Finished training it 306176/306969 of epoch 0, 41.81 ms/it, loss 0.446282, accuracy 79.279 %
Finished training it 306969/306969 of epoch 0, 41.33 ms/it, loss 0.445721, accuracy 79.342 %

A few points that I couldn't understand:

  1. I see many files with the .npz extension in my folder, but I don't see the file name that is passed in the benchmark script, ./input/kaggleAdDisplayChallenge_processed.npz
  2. --save-model doesn't save anything to the output folder
  3. How can I generate the plots from the benchmark?
  4. What is the recommended hardware for training on the kaggle ad dataset? I'm using a machine with 64 cpu cores and it took about 5 hours

Any help will be greatly appreciated!

ONNX export in pytorch

--save-onnx does not work.

  1. nn.EmbeddingBag is not supported in the ONNX converter
  2. Zflat = Z[:, li, lj] at line 235 of dlrm_s_pytorch.py is translated to a torch index op, which can't be exported.

proposed solution:

  1. Since the sparse features in the Criteo dataset are categorical, nn.EmbeddingBag can be replaced with nn.Embedding.
  2. Since the size of the bmm matrix is known, prepare the li, lj indexes with np.tril_indices, then use torch.index_select, which will be translated to an ONNX Gather:
    li, lj = np.tril_indices(ni, offset, nj)
    tril_indexes = []
    for i, j in zip(li, lj):
        tril_indexes.append(i * ni + j)
    Zflat = torch.reshape(Z, (Z.shape[0], Z.shape[1] * Z.shape[2]))
    tril_indexes_tensor = torch.LongTensor(tril_indexes)
    Zflat = torch.index_select(Zflat, 1, tril_indexes_tensor)

After all that, there is still an issue, since the PyTorch ONNX translator uses protobuf, which doesn't support buffers that exceed 2 GB (which is the case for the Criteo DLRM model).
The model can be saved without the parameters using "export_params=False".
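
A self-contained sketch of the proposed workaround, with hypothetical tensor sizes, showing how the lower-triangle entries of the interaction matrix can be gathered with torch.index_select (which exports to an ONNX Gather):

import numpy as np
import torch

batch, ni, nj = 2, 4, 4          # Z has shape (batch, ni, nj), e.g. pairwise dot products
offset = -1                      # strictly lower triangle, as in the interaction op
Z = torch.randn(batch, ni, nj)

# precompute the flat positions of the lower-triangle entries
li, lj = np.tril_indices(ni, offset, nj)
tril_index = torch.LongTensor([i * nj + j for i, j in zip(li, lj)])

# flatten the last two dims, then gather: equivalent to Z[:, li, lj]
Zflat = torch.reshape(Z, (Z.shape[0], Z.shape[1] * Z.shape[2]))
Zflat = torch.index_select(Zflat, 1, tril_index)

assert torch.equal(Zflat, Z[:, torch.from_numpy(li), torch.from_numpy(lj)])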

GPU Configuration Error

When I run the following command for testing out the GPU configuration:

python dlrm_s_caffe2.py --inference-only --use-gpu

I get the following error message:

 File "dlrm_s_caffe2.py", line 831, in <module>
    device_opt = core.DeviceOption(workspace.GpuDeviceType, 0)
AttributeError: 'module' object has no attribute 'GpuDeviceType'

visualization is malfunctioning

I followed the instructions to enable visualization:

-# from torchviz import make_dot
-# import torch.nn.functional as Functional
-# from torch.nn.parameter import Parameter
+from torchviz import make_dot
+import torch.nn.functional as Functional
+from torch.nn.parameter import Parameter

and

     # plot compute graph
     if args.plot_compute_graph:
-        sys.exit(
-            "ERROR: Please install pytorchviz package in order to use the"
-            + " visualization. Then, uncomment its import above as well as"
-            + " three lines below and run the code again."
-        )
-        # V = Z.mean() if args.inference_only else L
-        # make_dot(V, params=dict(dlrm.named_parameters()))
-        # dot.render('dlrm_s_pytorch_graph') # write .pdf file
+        V = Z.mean() if args.inference_only else L
+        make_dot(V, params=dict(dlrm.named_parameters()))
+        dot.render('dlrm_s_pytorch_graph') # write .pdf file

then ran the command:

$ python ./dlrm_s_pytorch.py  --plot-compute-graph

However, I encountered an error:

Traceback (most recent call last):
  File "./dlrm_s_pytorch.py", line 891, in <module>
    make_dot(V, params=dict(dlrm.named_parameters()))
  File "/opt/conda/lib/python3.6/site-packages/torchviz/dot.py", line 37, in make_dot
    output_nodes = (var.grad_fn,) if not isinstance(var, tuple) else tuple(v.grad_fn for v in var)
AttributeError: 'numpy.ndarray' object has no attribute 'grad_fn'

Could anyone help please?

Environment:

  • ubuntu 18.04
  • python 3.6.8
  • torch 1.2.0a0+0885dd2
  • torchtext 0.4.0
  • torchvision 0.2.1
  • torchviz 0.0.1
  • graphviz 0.11.1

Is there a way to avoid model parallelism in multi-gpu environment

I have a model whose embedding tables take less than 2 GB in total, and I have a 4-GPU system. When I try to run the model in a multi-GPU environment, it takes 20-30 percent more time per iteration compared to running the model on a single GPU. This should not happen.
I am guessing that the architecture divides the embedding tables across multiple GPUs, and the increased time is due to communication between these devices during the butterfly shuffle. Am I right?
If yes, is there a way to copy all tables to all GPUs and make the model data-parallel only?
Am I interpreting this increased iteration time correctly?

Use of --max-ind-range

I am having an issue with the --max-ind-range argument. It is used when preprocessing the dataset, and if we set it to a lower value for a training run, it produces a runtime error saying the index is out of range. Ideally, preprocessing should not use it; instead, the training run should do the modulo operation in the data loader. What do you think?

Here is an example of the error. I have preprocessed the Terabyte dataset using a 10M range. Then, if I run training with --max-ind-range=1000000 (1M), it produces this runtime error:

python dlrm_s_pytorch.py --arch-sparse-feature-size=128 --arch-mlp-bot="13-512-256-128" --arch-mlp-top="1024-1024-512-256-1" --data-generation=dataset --data-set=terabyte --raw-data-file=$HOME/dlrm_dataset/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=10 --num-batches=100 --print-time --test-freq=0 --test-mini-batch-size=16384 --test-num-workers=0 --memory-map --mlperf-logging  --max-ind-range=1000000 --numpy-rand-seed 4


time/loss/accuracy (if enabled):
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1055, in <module>
    Z = dlrm_wrap(X, lS_o, lS_i, use_gpu, device)
  File "dlrm_s_pytorch.py", line 948, in dlrm_wrap
    return dlrm(X, lS_o, lS_i)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in _call_
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 384, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 396, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l)
  File "dlrm_s_pytorch.py", line 338, in apply_emb
    V = E(sparse_index_group_batch.contiguous(), sparse_offset_group_batch)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in _call_
    result = self.forward(*input, **kwargs)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 281, in forward
    per_sample_weights)
  File "/nfs_home/ddkalamk/venv/pytsrc/lib/python3.7/site-packages/torch/nn/functional.py", line 1646, in embedding_bag
    per_sample_weights)
RuntimeError: [enforce fail at embedding_lookup_idx.cc:226] 0 <= idx && idx < data_size. Index 815 is out of bounds: 2147347, range 0 to 1000000
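
A minimal sketch of the suggested alternative (a hypothetical helper, not how the current code behaves): fold the indices inside the data loader so they never exceed --max-ind-range at training time.

import numpy as np

def remap_sparse_indices(sparse_indices, max_ind_range):
    # illustrative helper: fold raw categorical indices into [0, max_ind_range)
    if max_ind_range > 0:
        return sparse_indices % max_ind_range
    return sparse_indices

raw = np.array([815, 2147347, 10000000])
print(remap_sparse_indices(raw, 1000000))   # -> [815 147347 0]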

How should I set the parameters of CriteoDataset()?

I downloaded the raw Kaggle Criteo dataset; it contains a readme.txt, train.txt and test.txt.

  1. How can I get the training set, validation set and test set using your dlrm_data_pytorch.py?
  2. And how do I set the raw_path, pro_data and memory_map hyper-parameters when training?
    Thank you!

Checkpoint

Can you please add a PyTorch checkpoint save (best validation score) besides ONNX?

Accuracy for Testing and Training

I am trying to better understand the code and how to get accuracy numbers for testing and training. I have a few questions. Please help!

Training Command:

Terminal_linux> python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --processed-data-file=./kaggle_data/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1000 --use-gpu --test-freq=1000 --print-time --save-model=model.pt --num-batches=2000
--plot-compute-graph $dlrm_extra_option 2>&1 | tee run_kaggle_pt.log

--num-batches=2000
run pytorch ...
Using 1 GPU(s)...
Loading kaggle dataset...
Reading from pre-processed data=./kaggle_data/kaggleAdDisplayChallenge_processed.npz
Training data
Limiting to 2000 batches of the total 306968 batches
Reading in batch: 2000 / 2000

Testing data
Limiting to 2000 batches of the total 25580 batches
Reading in batch: 2000 / 2000

time/loss/accuracy (if enabled):
Time, Loss, and Accuracy
Finished training it 1000/2000 of epoch 0, 115.30 ms/it, loss 0.517118, accuracy 0.000 %
Saving model to model.pt
Testing at - 1000/2000 of epoch 0, loss 0.516816, accuracy 75.354 %, best 75.354 %
Time, Loss, and Accuracy
Finished training it 2000/2000 of epoch 0, 19.67 ms/it, loss 0.509206, accuracy 0.000 %
Saving model to model.pt
Testing at - 2000/2000 of epoch 0, loss 0.505281, accuracy 76.267 %, best 76.267 %
done

Question - The command I gave - is it training or testing?
Question - Why is it showing 0.0% accuracy for training?
Question - How does the script handle training data? How can I run tests for the 7th-day Kaggle data that is split between validation and test data?

Inference command:

python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --processed-data-file=./kaggle_data/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1000 --use-gpu --test-freq=1000 --print-time --load-model=model.pt --inference-only --plot-compute-graph $dlrm_extra_option 2>&1 | tee run_kaggle_pt.log

Loading saved mode model.pt
Saved model Training state: epoch = 0/1, batch = 2000/2000, train loss = 0.509206, train accuracy = 0.000 %
Saved model Testing state: nbatches = 2000, test loss = 0.505281, test accuracy = 76.267 %
time/loss/accuracy (if enabled):
Time, Loss, and Accuracy
Finished inference it 2000/2000 of epoch 0, 50.94 ms/it, loss 0.505281, accuracy 0.000 %

Question - For inference, I didn't get any accuracy. Why is that?
Question - How can I get the accuracy for testing and training just like the authors did (picture attached below)?

kaggle_dac_loss_accuracy_plots

Thanks!

Is there a way to use text field?

Hi,

Is there an easy way to leverage text data (words ID) in this model?
Categorical data extraction as done in the convertUStringToDistinctInts doesn't seem to be usable for text.

Parameter of RMC1, RMC2, RMC3

Hi,

What are the exact parameters for RMC1, RMC2 and RMC3? In the "The Architectural Implications of Facebook’s DNN-based Personalized Recommendation" paper, only the normalized parameters are specified, what are the exact parameters of these recommendation models?

Memory error while pre processing Criteo Terabyte Dataset

What is the memory requirement for pre-processing the Criteo Terabyte Dataset? I am getting a MemoryError even with more than a terabyte of memory installed on the system.

Loaded day: 1 y = 1: 12659815 y = 0: 382745703
Loaded day: 2 y = 1: 18909454 y = 0: 573288083
Loaded day: 3 y = 1: 24713866 y = 0: 748598879
Loaded day: 4 y = 1: 29955964 y = 0: 895472591
Loaded day: 5 y = 1: 35796385 y = 0: 1062180677
Loaded day: 6 y = 1: 42279110 y = 0: 1260544797
Loaded day: 7 y = 1: 48934987 y = 0: 1454689923
Loaded day: 8 y = 1: 55323510 y = 0: 1642073892
Loaded day: 9 y = 1: 61896030 y = 0: 1833925744
Loaded day: 10 y = 1: 68022095 y = 0: 2013577734
Loaded day: 11 y = 1: 73425097 y = 0: 2161763432
Loaded day: 12 y = 1: 79235670 y = 0: 2324956223
Loaded day: 13 y = 1: 85501891 y = 0: 2512906522
Loaded day: 14 y = 1: 91803478 y = 0: 2700686214
Loaded day: 15 y = 1: 97911762 y = 0: 2881732526
Loaded day: 16 y = 1: 103881614 y = 0: 3053747608
Loaded day: 17 y = 1: 109329329 y = 0: 3211682495
Traceback (most recent call last):
  File "cython_criteo.py", line 52, in <module>
    args.memory_map
  File "data_utils_cython.pyx", line 1253, in data_utils_cython.loadDataset
    file = getCriteoAdData(
  File "data_utils_cython.pyx", line 1194, in data_utils_cython.getCriteoAdData
    o_file = concatCriteoAdData(
  File "data_utils_cython.pyx", line 758, in data_utils_cython.concatCriteoAdData
    with np.load(filename_i) as data:
  File "data_utils_cython.pyx", line 764, in data_utils_cython.concatCriteoAdData
    X_cat = np.concatenate((X_cat, data["X_cat"]))
MemoryError

Is there a flag to get validation(test) accuracy in caffe2?

Hi, I want to ask something about the code.

I found that in the PyTorch version there is a flag '--test-freq' to get validation accuracy, but I can't find an equivalent in the Caffe2 version. Is there the same flag in Caffe2 for this purpose?

Thanks,
Jeageun

Pre-trained Model Availability

Hello,

Are there any pre-trained models that we can use for benchmark testing / evaluation? I'm trying to reproduce the results from section 5.1/5.2 of the original DLRM paper and it would be useful to have a pre-trained model available to do speed / accuracy tradeoff comparisons for a class project.

Thanks!

Segmentation fault when using 4 GPUs for training

Specs:

Python version: 3.6.8
Pytorch version: 1.4.0
4 v100 GPUs
Cuda version: 10.1
Nvidia Driver Version: 418.87.00 

I added the following line in dlrm_s_pytorch.py

import faulthandler; faulthandler.enable()

and used the following command to run the code

python3 -X faulthandler dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=/path-to-data --processed-data-file=/path-to-npz-file --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=64 --test-freq 0 --print-freq=1024 --print-time --use-gpu

It executes for some iterations (the number varies between runs)
and then fails with a segmentation fault.
Here is a sample output

Using 4 GPU(s)...
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
Reading pre-processed data=/users/ushmal/kaggleAdDisplayChallenge_processed.npz
Sparse features= 26, Dense features= 13
time/loss/accuracy (if enabled):
Finished training it 1024/613937 of epoch 0, 51.35 ms/it, loss 0.520202, accuracy 75.478 %
Finished training it 2048/613937 of epoch 0, 28.98 ms/it, loss 0.506464, accuracy 76.196 %
Finished training it 3072/613937 of epoch 0, 29.48 ms/it, loss 0.505029, accuracy 76.314 %
Finished training it 4096/613937 of epoch 0, 30.34 ms/it, loss 0.494111, accuracy 76.935 %
Finished training it 5120/613937 of epoch 0, 30.36 ms/it, loss 0.496054, accuracy 76.781 %
Finished training it 6144/613937 of epoch 0, 30.44 ms/it, loss 0.487835, accuracy 77.235 %
Finished training it 7168/613937 of epoch 0, 30.65 ms/it, loss 0.486214, accuracy 77.292 %
Fatal Python error: Segmentation fault

Thread 0x00007f64c1a25700 (most recent call first):

Thread 0x00007f64c2226700 (most recent call first):

Current thread 0x00007f64c2a27700 (most recent call first):

Thread 0x00007f64c3a29700 (most recent call first):
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 165 in gather
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68 in forward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 101 in backward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/function.py", line 77 in apply

Thread 0x00007f64c3228700 (most recent call first):

Thread 0x00007f65f70b2740 (most recent call first):
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99 in backward
  File "/users/ushmal/.local/lib/python3.6/site-packages/torch/tensor.py", line 195 in backward
  File "dlrm_s_pytorch.py", line 814 in <module>
Segmentation fault (core dumped)

When using 2 GPUs or a single GPU the segmentation fault doesn't arise even after 100000 iterations.
Thanks!

C++ frontend support

Hi there,

I wonder if there is a plan to release DLRM written with the C++ frontend. In addition, is there any noticeable performance difference between the implementation here and the Glow one under a production-scale setting?

Thanks,
Yongkee

Terabyte dataset

Hi,

Is there a chance to upload the preprocessed TeraByte dataset?

Thanks.

RuntimeError: [enforce fail at init.cc:88] success. Failed to run some init functions for caffe2

Hello.
I succeeded in profiling the Python version of the code.
However, when I tried to profile the equivalent Caffe2 code, I got the following error.

python dlrm_s_caffe2.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./input/kaggle_caffe2/train.txt --processed-data-file=./input/kaggle_caffe2/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --num-batches=10000 --mini-batch-size=128 --print-freq=100 --print-time --enable-profiling --plot-compute-graph --inference-only

My test environment is like below.
caffe_master_cudnn_v7.4_cuda10.0
cudnn_v7.4_cuda10.0
nccl_2.3.7
openMPI_v2.1.2
magma_2.5.0
python-3.6.6

Error Log.

Traceback (most recent call last):
File "dlrm_s_caffe2.py", line 1056, in
enable_prof=args.enable_profiling,
File "dlrm_s_caffe2.py", line 496, in init
workspace.GlobalInit(global_init_opt)
RuntimeError: [enforce fail at init.cc:88] success. Failed to run some init functions for caffe2.
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x2b94d3161021 in /home/sr2/junki.park/jk_env/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x49 (0x2b94d3160dc9 in /home/sr2/junki.park/jk_env/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #2: caffe2::GlobalInit(int*, char***) + 0x29c (0x2b949a80689c in /home/sr2/junki.park/jk_env/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #3: + 0x506d7 (0x2b9498dc16d7 in /home/sr2/junki.park/jk_env/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)
frame #4: + 0x8c45e (0x2b9498dfd45e in /home/sr2/junki.park/jk_env/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)

frame #26: __libc_start_main + 0xf5 (0x2b9477efec05 in /lib64/libc.so.6)
frame #27: python() [0x400bda]

There seems to be an error in the next section of code:

if enable_prof:
    global_init_opt += [
        "--logtostderr=0",
        "--log_dir=./",
        "--caffe2_logging_print_net_summary=1",
    ]

Is there anyone who could help me?
Thank you.

PyTorch script copies all the embeddings into GPU0 in a multi-gpu environment.

In the line below, dlrm = dlrm.to(device), all the tables are initially copied to GPU 0 and then copied to the other devices in parallel_forward().

dlrm = dlrm.to(device) # .cuda()

This will be a problem when you pass in large embeddings and all of them do not fit in a single GPU. Ideally, we should be able to split tables across GPUs without the need to copy them to one GPU first. That is how the Caffe2 script does it.
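
A minimal sketch of that alternative (illustrative only, not the repository's code): build each embedding table directly on its target device instead of materializing the whole model on one GPU first. The table sizes here are made up.

import torch
import torch.nn as nn

table_sizes = [80000, 80000, 80000]   # hypothetical per-table row counts
emb_dim = 64
have_cuda = torch.cuda.is_available()
ndevices = torch.cuda.device_count() if have_cuda else 1

emb_l = nn.ModuleList()
for k, n in enumerate(table_sizes):
    device = torch.device(f"cuda:{k % ndevices}") if have_cuda else torch.device("cpu")
    # create each table and move it straight to its own device,
    # so no single GPU ever holds all the tables at once
    emb_l.append(nn.EmbeddingBag(n, emb_dim, mode="sum").to(device))

for k, E in enumerate(emb_l):
    print(k, next(E.parameters()).device)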

Incorrect "No Data in Split" for the last split

It seems that "Number Data in Split" for the last split is incorrect when using Kaggle Criteo Ad Data.

dlrm/data_utils.py

Lines 388 to 399 in a60b5e6

print(
"Loading %d/%d Split: %d No Data in Split: %d true label: %d stored label: %d"
% (
i,
total_count,
split,
num_data_in_split,
data["label"],
y[i - count],
),
end="\r",
)

For example, if the total number of datapoints is 100, "No Data in Split" is 15 for the first 6 splits, but for the last split, it should be 10 = 100-15*6, instead of 15.

Total number of datapoints: 100
Loading 14/100   Split: 1   No Data in Split: 15  true label: 0  stored label: 0
Saved kaggle_day_1.npz!
Loading 29/100   Split: 2   No Data in Split: 15  true label: 1  stored label: 1
Saved kaggle_day_2.npz!
Loading 44/100   Split: 3   No Data in Split: 15  true label: 0  stored label: 0
Saved kaggle_day_3.npz!
Loading 59/100   Split: 4   No Data in Split: 15  true label: 0  stored label: 0
Saved kaggle_day_4.npz!
Loading 74/100   Split: 5   No Data in Split: 15  true label: 0  stored label: 0
Saved kaggle_day_5.npz!
Loading 89/100   Split: 6   No Data in Split: 15  true label: 0  stored label: 0
Saved kaggle_day_6.npz!
Loading 99/100   Split: 7   No Data in Split: 15  true label: 0  stored label: 0
Saved kaggle_day_7.npz!

PS: Would it be better to change "No Data in Split" to "Number Data in Split"?
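
For reference, a tiny snippet reproducing the split sizes expected in the example above (illustrative only):

total, num_splits = 100, 7
per_split = (total + num_splits - 1) // num_splits            # 15 rows per full split
sizes = [min(per_split, total - i * per_split) for i in range(num_splits)]
print(sizes)   # -> [15, 15, 15, 15, 15, 15, 10]; the last split holds only the remainder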

Caffe2 using 8 GPUs

Hi,
When using 8 GPUs with dlrm_s_caffe2.py, it crashes with the following error. When using 4 of the 8 GPUs it works fine. It seems to be connected with the loss-averaging operator accessing a different device than the one allocated for some tensor. Any idea what the problem may be?

Traceback (most recent call last):
File "dlrm_s_caffe2.py", line 1127, in
dlrm.run(lX[j], lS_l[j], lS_i[j], lT[j]) # args.enable_profiling
File "dlrm_s_caffe2.py", line 648, in run
workspace.RunNet(self.model.net)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/workspace.py", line 254, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/workspace.py", line 215, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at context_gpu.h:206] error == cudaSuccess. 700 vs 0. Error at: /opt/conda/conda-bld/pytorch_1573049304260/work/caffe2/core/context_gpu.h:206: an illegal memory access was encountered
Error from operator:
input: "gpu_7/sd2" input: "gpu_7/loss_autogen_grad" output: "gpu_7/sd2_grad" name: "" type: "AveragedLossGradient" device_option { device_type: 1 device_id: 7 } is_gradient_op: true
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x5b (0x7f2f2b90341b in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: + 0x3f09035 (0x7f2f2fc5c035 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libtorch.so)
frame #2: + 0x40364d0 (0x7f2f2fd894d0 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libtorch.so)
frame #3: caffe2::SimpleNet::Run() + 0x1a9 (0x7f2f2e446d39 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libtorch.so)
frame #4: caffe2::Workspace::RunNet(std::string const&) + 0x802 (0x7f2f2e4916b2 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libtorch.so)
frame #5: + 0x457eb (0x7f2f5a7ef7eb in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)
frame #6: + 0x46a1f (0x7f2f5a7f0a1f in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)
frame #7: + 0x95696 (0x7f2f5a83f696 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)

frame #29: __libc_start_main + 0xe7 (0x7f2f70847b97 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::Error'
what(): [enforce fail at context_gpu.h:206] error == cudaSuccess. 700 vs 0. Error at: /opt/conda/conda-bld/pytorch_1573049304260/work/caffe2/core/context_gpu.h:206: an illegal memory access was encountered
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x5b (0x7f2f2b90341b in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: + 0x3f09035 (0x7f2f2fc5c035 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libtorch.so)
frame #2: + 0x3f091fa (0x7f2f2fc5c1fa in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libtorch.so)
frame #3: + 0x5ce4568 (0x7f2f31a37568 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libtorch.so)
frame #4: + 0x26f873e (0x7f2f2e44b73e in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/../../torch/lib/libtorch.so)
frame #5: std::_Rb_tree<std::string, std::pair<std::string const, std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > >, std::_Select1st<std::pair<std::string const, std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > > >, std::lessstd::string, std::allocator<std::pair<std::string const, std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::string const, std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > > >) + 0x2f9 (0x7f2f5a81a839 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)
frame #6: std::_Rb_tree<std::string, std::pair<std::string const, std::unique_ptr<caffe2::Workspace, std::default_deletecaffe2::Workspace > >, std::_Select1st<std::pair<std::string const, std::unique_ptr<caffe2::Workspace, std::default_deletecaffe2::Workspace > > >, std::lessstd::string, std::allocator<std::pair<std::string const, std::unique_ptr<caffe2::Workspace, std::default_deletecaffe2::Workspace > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::string const, std::unique_ptr<caffe2::Workspace, std::default_deletecaffe2::Workspace > > >
) + 0x149 (0x7f2f5a81cfb9 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)
frame #7: + 0x2f684 (0x7f2f5a7d9684 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)
frame #8: + 0x95696 (0x7f2f5a83f696 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/caffe2/python/caffe2_pybind11_state_gpu.cpython-36m-x86_64-linux-gnu.so)

frame #15: __libc_start_main + 0xe7 (0x7f2f70847b97 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

lS_o vs. lS_i

Hi,
Sparse data is translated, in the PyTorch implementation, into two tensors, lS_o and lS_i. Can you please explain how this translation from the Kaggle input data happens, and what the difference between lS_o (sparse offsets) and lS_i (sparse indices) is?
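
A minimal illustration of the offsets/indices pair that nn.EmbeddingBag consumes (generic PyTorch semantics with made-up sizes, mirroring the E(sparse_index_group_batch, sparse_offset_group_batch) call visible in the --max-ind-range traceback above): lS_i is the flat list of category indices for all samples in the batch, and lS_o marks where each sample's slice of that list starts.

import torch
import torch.nn as nn

E = nn.EmbeddingBag(10, 4, mode="sum")     # one embedding table: 10 rows, dimension 4

# three samples looking up {1, 3}, {2} and {0, 4, 5} in the same table
lS_i = torch.tensor([1, 3, 2, 0, 4, 5])    # sparse indices, concatenated across samples
lS_o = torch.tensor([0, 2, 3])             # sparse offsets: start of each sample's slice

out = E(lS_i, lS_o)                        # shape (3, 4): one pooled vector per sample
print(out.shape)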

loadDataset function call is not consistent with its api

Traceback (most recent call last):
File "dlrm_s_pytorch.py", line 490, in
args.processed_data_file,
File "/home/bweinstein/software/Algorithms/dlrm-melrose/dlrm_data_pytorch.py", line 53, in read_dataset
dataset, num_samples, raw_data, processed_data
TypeError: loadDataset() takes 1 positional argument but 4 were given

Saving convertDicts on data_utils.py seems necessary

I think saving the dictionary used to convert the categories is necessary in order to handle test.txt (kaggle_data).
It would be better if there were an example of how to use Kaggle's test.txt after training on Kaggle's train.txt, even though the current example uses only train.txt to split into train, val and test datasets.

Cannot save the model

I am able to train the model successfully with and without GPUs. But, when I try to save the model using "--test-freq=1024 --save-model=./input/model.pt" I don't see any output model being generated.

Adagrad optimizer

Hi,

I see in the paper that you get slightly better results with Adagrad rather than SGD, but your code supports only SGD. Why is that?

Thanks,
Berry

Run the code with train dataset

Which command is correct
python dlrm_s_pytorch.py --data-generation="dataset" --mini-batch-size=2 --data-size=6 --data-set = "train.txt"
or
python dlrm_s_pytorch.py --data-generation="dataset" --mini-batch-size=2 --data-size=6 --raw-data-file = "train.txt"
or
anything else? Please suggest.

pretrained model mode (ADV/TB) to skip training

Hello,

First, thanks for the code quality (for PyTorch at least).
My problem is simple.
I need to perform accuracy validation of DLRM (for the ADV and TB challenges) against my company's DSP target, so I'm not interested in the training phase. Unless I misunderstood the code, it is not possible to load a pretrained model and only perform inference.
Unless there is another solution, it would be nice to support a mode like "pretrained model inference".

PS: By the way, I think that args.inference_only is ambiguous.

plot_compute_graph

I'm trying to use torchviz to generate a compute graph, but fail to do so.
A runtime error happens when using make_dot.
Traceback (most recent call last):
File "dlrm_s_pytorch.py", line 962, in
dot = make_dot(V, params=dict(dlrm.named_parameters()))
File "/anaconda3/lib/python3.6/site-packages/torchviz-0.0.1-py3.6.egg/torchviz/dot.py", line 70, in make_dot
File "/anaconda3/lib/python3.6/site-packages/torchviz-0.0.1-py3.6.egg/torchviz/dot.py", line 59, in add_nodes
File "/anaconda3/lib/python3.6/site-packages/torchviz-0.0.1-py3.6.egg/torchviz/dot.py", line 59, in add_nodes
File "/anaconda3/lib/python3.6/site-packages/torchviz-0.0.1-py3.6.egg/torchviz/dot.py", line 59, in add_nodes
File "/anaconda3/lib/python3.6/site-packages/torchviz-0.0.1-py3.6.egg/torchviz/dot.py", line 60, in add_nodes
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Can't use --memory-map and --debug-mode options simultaneously

The following command crashes when using both --memory-map and --debug-mode options. (The command works if I disable one of them.)

python ../dlrm/dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot='13-512-256-64-16' --arch-mlp-top='512-256-1' --data-generation=dataset --data-set=kaggle --raw-data-file=./data.dlrm/train.txt --processed-data-file=./data.dlrm/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=8 --test-freq=1024 --load-model=./data.dlrm/dlrm.model.pt --inference-only --memory-map --debug-mode

Output:

Using CPU...
Reading pre-processed data=./data/kaggleAdDisplayChallenge_processed.npz
Sparse fea = 26, Dense fea = 13
Defined train indices...
Randomized indices across days ...
......
Loading saved model ./data.dlrm/dlrm.model.pt

Saved at: epoch = 0/1, batch = 299000/306969, ntbatch = 200
Training state: loss = 0.450191, accuracy = 79.028 %
Testing state: loss = 0.452747, accuracy = 78.865 %
time/loss/accuracy (if enabled):
Traceback (most recent call last):
  File "../dlrm/dlrm_s_pytorch.py", line 850, in <module>
    for j, (X, lS_o, lS_i, T) in enumerate(train_ld):
  File "/home/test/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 560, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/test/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 560, in <listcomp>
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/home/test/dlrm/dlrm_data_pytorch.py", line 292, in __getitem__
    return self.X_int[i], self.X_cat[i], self.y[i]
IndexError: index -32743299 is out of bounds for axis 0 with size 6548659

Thanks.

TeraByte training starts with high accuracy

Hi,
I finally managed to start training on the TeraByte dataset, but the average accuracy seems suspiciously high. Does this mean that something wasn't processed correctly?

dlrm_s_pytorch.py --arch-sparse-feature-size=64 --arch-mlp-bot=13-512-256-64 --arch-mlp-top=512-512-256-1 --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=./data/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --memory-map
Using 4 GPU(s)...
Reading pre-processed data=./input/terabyte_processed.npz
Sparse features= 26, Dense features= 13
Reading pre-processed data=./input/terabyte_processed.npz
Sparse features= 26, Dense features= 13
time/loss/accuracy (if enabled):
Finished training it 1024/2048437 of epoch 0, 53.58 ms/it, loss 0.138725, accuracy 96.726 %
Finished training it 2048/2048437 of epoch 0, 35.21 ms/it, loss 0.136623, accuracy 96.696 %
Finished training it 3072/2048437 of epoch 0, 35.06 ms/it, loss 0.135998, accuracy 96.691 %
Finished training it 4096/2048437 of epoch 0, 34.12 ms/it, loss 0.135519, accuracy 96.685 %

Killedg in train batch

Hi,
After processing the train files, I get this "Killed" message and the counting stops. Can you please advise?

Running:
python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --processed-data-file=./kaggle_data/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --raw-data-file ~/criteo/train.txt --use-gpu

Using 4 GPU(s)...
Loading kaggle dataset...
Reading dataset from ./kaggle_data/kaggleAdDisplayChallenge_processed.npz
Defined training and testing indices...
Randomized indices...
Split data according to indices...
Converted to tensors...done!
Sparse features = 26, Dense features = 13
Training data
Total number of batches 263115
Killedg in train batch: 127687 / 263115

Convert dlrm to torch.jit.script model

Hi, I want to convert DLRM to a JIT model. I found the instructions at https://pytorch.org/docs/stable/jit.html#creating-torchscript-code and, following them, added one line,
"dlrm = torch.jit.script(dlrm)", after dlrm has been initialized, but it doesn't work.

Error Message:
TypeError:
'int64' object for attribute 'num_embeddings' is not a valid constant.
Valid constants are:

  1. a nn.ModuleList
  2. a value of type {bool, float, int, str, NoneType, device, layout, dtype}
  3. a list or tuple of (2)

After solving that by replacing "n = ln[i]" with "n = int(ln[i])" in the function "create_emb", I got the error message below:

'DLRM_Net.sequential_forward' is being compiled since it was called from 'DLRM_Net.forward'
File "dlrm_s_pytorch.py", line 288
    def forward(self, dense_x, lS_o, lS_i):
        if self.ndevices <= 1:
            return self.sequential_forward(dense_x, lS_o, lS_i)
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        else:
            return self.parallel_forward(dense_x, lS_o, lS_i)

I would be grateful for any suggestions.

KeyError: 'counts is not a file in the archive'

Command Running:
python3 dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./train.txt --processed-data-file=./kaggle_data/kaggle_day_1.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --use-gpu | tee run_kaggle_pt.log

Full Error Message:
Using 1 GPU(s)... TITAN V
Loading kaggle dataset...
Reading from pre-processed data=./kaggle_data/kaggle_day_1.npz
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 491, in <module>
    args.processed_data_file,
  File "/home2/abhiman1/dlrm/dlrm_data_pytorch.py", line 53, in read_dataset
    dataset, num_samples, raw_data, processed_data
  File "/home2/abhiman1/dlrm/data_utils.py", line 417, in loadDataset
    counts = data["counts"]
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py", line 239, in __getitem__
    raise KeyError("%s is not a file in the archive" % key)
KeyError: 'counts is not a file in the archive'

How did I get here?
I ran ./bench/dlrm_s_criteo_kaggle.sh (for pytorch) with the same args as above command, except I did not specify the --processed-data-file as I didn't have those. The script pre-processed the data and saved kaggle_day_i.npz (i = 1,2,..,7). After pre-processing, the script failed at line 491 in dlrm_s_pytorch.py in relation to line 98 in data_utils.py:
with np.load(str(d_path) + "kaggle_day_{0}_processed.npz".format(i)) as data:

With error:
./kaggle_data/kaggle_day_1_processed.npz doesn't exist
(Or something like that, I don't have the exact error message with me).
This might be because the files were stored with a different name (kaggle_day_1.npz).

What does the kaggle_day_1.npz contain?
I ran the following script:

from numpy import load

data = load('kaggle_day_1.npz')
lst = data.files
for item in lst:
    print(item)
    print(data[item])

Output:

X_int

[[  1   1   5 ...   2  -1   2]
 [  2   0  44 ...   1  -1   4]
 [  2   0   1 ...   3   3  45]
 ...
 [ -1   4   3 ...   1  -1   3]
 [ -1  -1   0 ...  -1  -1  -1]
 [  0 115   5 ...   2  -1  16]]

X_cat

[['68fd1e64' '80e26c9b' 'fb936136' ... 'c5c50484' 'e8b83407' '9727dd16']
 ['68fd1e64' 'f0cf0024' '6f67f7e5' ... '43f13e8b' 'e8b83407' '731c3655']
 ['287e684f' '0a519c5c' '02cf9876' ... '3b183c5c' '' '']
 ...
 ['05db9164' '8084ee93' '02cf9876' ... '3b183c5c' '' '']
 ['5a9ed9b0' '38a947a1' 'df65428e' ... '4bc4a47f' '' '']
 ['f473b8dc' '09e68b86' '22451e36' ... '3fdb382b' 'e8b83407' '49d68486']]

y

[0 0 0 ... 0 0 0]

Clearly, it doesn't contain the field "counts".
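
A hedged sketch of how such a "counts" array could be built, assuming it holds the number of distinct categories per sparse feature (an assumption; the per-day keys shown above are only X_int, X_cat and y, and the data and file name here are for illustration):

import numpy as np

# hypothetical toy data: 3 samples x 2 categorical features, already mapped to ints
X_cat = np.array([[0, 2],
                  [1, 2],
                  [0, 0]])

# assumed meaning of "counts": number of distinct categories per sparse feature
counts = np.array([len(np.unique(X_cat[:, j])) for j in range(X_cat.shape[1])])
print(counts)   # -> [2 2]

# the KeyError above indicates the processed .npz is expected to carry this key
np.savez_compressed("toy_processed.npz", X_cat=X_cat, counts=counts)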

Caffe2 profiling

I have installed dependencies through Anaconda, apart from pytorch/caffe2, which I installed from source at SHA f2623c74a9e76dd3aa5dcdd5aaeb63d12db52587 to match the stipulation of pytorch-nightly 06-10-2019. I also tried some of the latest point releases of pytorch/caffe2.

Running ./bench/dlrm_s_benchmark.sh, the PyTorch output is fine, but the only text in the caffe2 output is an error message:

WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
ERROR: unknown command line flag 'caffe2_logging_print_net_summary'

where is the data pipeline for training?

I cannot find the info for training and testing on the Kaggle dataset.

there is only one line for dataset creation in data_utils.py

getKaggleCriteoAdData(datafile="<path-to-train.txt>", o_filename="kaggleAdDisplayChallenge_processed.npz")

however, in dlrm_s_pytorch.py, there is nowhere to input a dataset.

    parser.add_argument("--data-set", type=str, default="kaggle")  # or terabyte
    parser.add_argument("--raw-data-file", type=str, default="")
    parser.add_argument("--processed-data-file", type=str, default="")

These 3 lines above appear to be unused, since there are no references to these parameters in the code.

What is the purpose of publishing source code that is not able to read any dataset?

dlrm terabyte training crashes when running on test data set

Using the below command from ./bench/dlrm_s_criteo_terabyte.sh

python3.6 dlrm_s_pytorch.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=/rafa/terabyte/day --processed-data-file=/rafa/terabyte/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --data-sub-sample-rate=0.875 --memory-map --test-freq 1024

The command crashes with the error below:

Using 1 GPU(s)...
Reading pre-processed data=/rafa/terabyte/terabyte_processed.npz
Sparse features= 26, Dense features= 13
Reading pre-processed data=/rafa/terabyte/terabyte_processed.npz
Sparse features= 26, Dense features= 13
time/loss/accuracy (if enabled):
Finished training it 1024/315310 of epoch 0, 97.42 ms/it, loss 0.476448, accuracy 78.929 %
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 866, in <module>
    for jt, (X_test, lS_o_test, lS_i_test, T_test) in enumerate(test_loader):
  File "/root/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
    return self._process_data(data)
  File "/root/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/root/.local/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/root/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/rafa/dlrm/dlrm_data_pytorch.py", line 286, in __getitem__
    return self.X_int[i], self.X_cat[i], self.y[i]
AttributeError: 'CriteoDataset' object has no attribute 'X_int'

I could work around this by setting --test-num-workers=0, and the training then runs to completion.
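
For reference, here is the same command with only --test-num-workers changed to apply the workaround:

python3.6 dlrm_s_pytorch.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=/rafa/terabyte/day --processed-data-file=/rafa/terabyte/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=0 --use-gpu --data-sub-sample-rate=0.875 --memory-map --test-freq 1024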

@mnaumovfb

Test the model for output

python dlrm_s_pytorch.py --data-generation=dataset --data-set=kaggle --raw-data-file=train.txt --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --test-freq=10 --save-model=model.pt

After saving the model with this command, how can I get the output from the trained model?
python dlrm_s_pytorch.py --load-model=model.pt --data-generation=dataset --data-set=kaggle --raw-data-file=test.txt

Is this command correct for getting the output?

In dlrm_s_pytorch.py, line 275:
z = p
Is z the output?
Please help!
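
A minimal sketch of inspecting the saved checkpoint and getting predictions; the "state_dict" key and the forward signature are assumptions here, so check how dlrm_s_pytorch.py actually saves and calls the model:

import torch

ckpt = torch.load("model.pt", map_location="cpu")
print(ckpt.keys())  # assumption: the network weights are stored under "state_dict"

# Rebuild DLRM_Net with the same --arch-* settings used for training, then:
#   dlrm.load_state_dict(ckpt["state_dict"]); dlrm.eval()
# A forward pass dlrm(dense_x, lS_o, lS_i) returns the per-sample click
# probability, i.e. the z returned at the end of forward ("z = p").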

KeyError: -1758995178

Hello,
I am stuck with a key error when executing the Kaggle benchmark.

python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./input/kaggle/train.txt --processed-data-file=./input/kaggle/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --test-mini-batch-size=16384 --enable-profiling --plot-compute-graph --test-num-workers=16 $dlrm_extra_option 2>&1 | tee run_kaggle_pt.log

Using CPU...
Reading raw data=./input/kaggle/train.txt
Skipping counts per file (already exist)
Skip existing ./input/kaggle/train_day_0.npz
Skip existing ./input/kaggle/train_day_1.npz
Skip existing ./input/kaggle/train_day_2.npz
Skip existing ./input/kaggle/train_day_3.npz
Skip existing ./input/kaggle/train_day_4.npz
Skip existing ./input/kaggle/train_day_5.npz
Skip existing ./input/kaggle/train_day_6.npz
Total number of samples: 45840617
Divided into days/splits:
[6548660, 6548660, 6548660, 6548660, 6548659, 6548659, 6548659]
Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 600, in <module>
    dp.make_criteo_data_and_loaders(args)
  File "/home/junki.park/AXDIMM/dlrm-master/dlrm_data_pytorch.py", line 481, in make_criteo_data_and_loaders
    args.memory_map
  File "/home/junki.park/AXDIMM/dlrm-master/dlrm_data_pytorch.py", line 118, in __init__
    memory_map
  File "/home/junki.park/AXDIMM/dlrm-master/data_utils.py", line 1114, in getCriteoAdData
    processCriteoAdData(d_path, d_file, npzfile, days, convertDicts, counts)
  File "/home/junki.park/AXDIMM/dlrm-master/data_utils.py", line 153, in processCriteoAdData
    X_cat_t[j, k] = convertDicts[j][x]
KeyError: -1758995178

To check it in detail,

(jk_env3) junki.park@npu134:~/AXDIMM/dlrm-master$ tail run_kaggle_pt.log
X_cat_t[24,6548655]: 4.0, convertDicts[24][2045441]: 4
k: 6548656, x: 0
X_cat_t[24,6548656]: 1.0, convertDicts[24][0]: 1
k: 6548657, x: 0
X_cat_t[24,6548657]: 1.0, convertDicts[24][0]: 1
k: 6548658, x: 0
X_cat_t[24,6548658]: 1.0, convertDicts[24][0]: 1
k: 6548659, x: -390581241
X_cat_t[24,6548659]: 0.0, convertDicts[24][-390581241]: 0
k: 0, x: -1758995178

I am really stuck because of this problem. Please help me.
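
As a diagnostic stop-gap (not a fix for the root cause), the assignment shown in the traceback can be guarded so that unseen raw values are logged and mapped to a placeholder index instead of raising:

# around data_utils.py line 153 (processCriteoAdData); index 0 is only a
# placeholder and will collide with whatever value legitimately maps to 0
try:
    X_cat_t[j, k] = convertDicts[j][x]
except KeyError:
    print("feature", j, "row", k, "unseen raw value", x)
    X_cat_t[j, k] = 0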

Adagrad doesn't learn, gradients quickly reach zero

Hi, I tried Adagrad, since the paper mentions it as the better performer. But in my case, Adagrad is not learning and the gradient updates quickly reach zero. My dataset is balanced and randomized, the batch size is reasonable (256), and I have tried a range of learning rates. But nothing seems to work.
What could be the issue, and did you also face this problem?
I have around 3k dense features and 50 sparse features. What would you suggest?

Also, can we use a combination of Adam and SparseAdam?
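
On the last question, a minimal sketch (independent of how this repo wires its optimizers) of pairing Adam for the dense/MLP parameters with SparseAdam for embeddings that emit sparse gradients:

import torch
import torch.nn as nn

emb = nn.EmbeddingBag(1000, 16, mode="sum", sparse=True)  # produces sparse gradients
mlp = nn.Linear(16, 1)                                    # produces dense gradients

opt_sparse = torch.optim.SparseAdam(emb.parameters(), lr=0.01)
opt_dense = torch.optim.Adam(mlp.parameters(), lr=0.01)

# In the training loop, call loss.backward(), then step and zero both:
#   opt_sparse.step(); opt_dense.step()
#   opt_sparse.zero_grad(); opt_dense.zero_grad()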

Evaluation for co-located models

Hi again,

Is there any way in the current implementation to evaluate co-located models, as discussed in [1]? If I understand correctly, the current implementation supports only one model at a time (while being configurable).

If I'd like to reproduce the results, for example as shown in Figures 8 and 9 of [1], I think I need to add yet another dimension to the data generator and model creator.

Please correct me if I missed anything. Also, any advice on implementing that feature would be appreciated!

Thanks,
Yongkee

[1] The Architectural Implications of Facebook’s DNN-based Personalized Recommendation

How is the number of rows for each embedding bag decided?

Hi,

From my understanding, in DLRM, the number of embedding bags is the same as the number of sparse feature types. For example, for Kaggle data, there are 26 sparse feature types and thus 26 different embedding bags.

Now, I was wondering how the number of rows for each embedding bag is decided. Is the number of rows for an embedding bag random, or is it determined by the data size or some other factor?

Thank you!
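
For what it's worth, with --data-generation=dataset the per-table sizes appear to come from a "counts" field in the processed data (one entry per sparse feature, i.e. the number of unique categorical values observed), possibly capped by --max-ind-range. A small check, assuming the combined processed npz stores that field:

import numpy as np

with np.load("kaggleAdDisplayChallenge_processed.npz") as data:
    counts = data["counts"]  # one entry per sparse feature (26 for the Kaggle set)
print(counts)                # candidate row counts of the 26 embedding tables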

day_count file for Kaggle

Hi,

First of all, thanks for the high-quality code. On line 119 of "dlrm_data_pytorch.py" you are trying to read files containing the counts of data points in the training data, but when the Kaggle raw data is processed, such a file is not produced. Is there supposed to be a file for each training day, or just one file for all training days? A "train_day_count.npz" file is produced, and I am assuming it contains the counts for all days. When working with Kaggle, I have been replacing that line of code with the "train_day_count.npz" path. Is that an okay thing to do? And if that is what we are supposed to do, maybe the names of the files produced after processing the Kaggle data should be fixed to match that part of the code?
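
A quick way to see what the produced count file actually holds (file name taken from the report above; adjust the path to wherever the processed files were written):

import numpy as np

with np.load("train_day_count.npz") as data:
    for name in data.files:       # list every stored array and its contents
        print(name, data[name])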
