
kobe's Introduction


New: We released KOBE v2, a refactored version of the original code built with the latest (2021) deep learning tools, with greatly improved installation, reproducibility, performance, and visualization, in memory of Kobe Bryant.

This repo contains code and pre-trained models for KOBE, a sequence-to-sequence approach to automatically generating product descriptions by leveraging conditional inputs (e.g., user categories) and incorporating knowledge through retrieval-augmented product titles.

Paper accepted at KDD 2019 (Applied Data Science Track). Latest version at arXiv.

Prerequisites

  • Linux
  • Python >= 3.8
  • PyTorch >= 1.10

Getting Started

Installation

Clone and install KOBE.

git clone https://github.com/THUDM/KOBE
cd KOBE
pip install -e .

Verify that KOBE is correctly installed by running import kobe.
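For example, a quick check from the shell (this simply performs the import described above):

python -c "import kobe"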

Dataset

We use the TaoDescribe dataset, which contains 2,129,187 product titles and descriptions in Chinese.

Run the following command to automatically download the dataset:

python -m kobe.data.download

The downloaded files will be placed at saved/raw/:

 1.6G KOBE/saved
 1.6G ├──raw
  42K │  ├──test.cond
 1.4M │  ├──test.desc
 2.0M │  ├──test.fact
 450K │  ├──test.title
  17M │  ├──train.cond
 553M │  ├──train.desc
 794M │  ├──train.fact
 183M │  ├──train.title
  80K │  ├──valid.cond
 2.6M │  ├──valid.desc
 3.7M │  ├──valid.fact
 853K │  └──valid.title
...
Meanings of downloaded data files
  • train/valid/test.title: The product title as input (source)
  • train/valid/test.desc: The product description as output (generation target)
  • train/valid/test.cond: The product attribute and user category used as conditions in the KOBE model. The interpretations of these tags are explained at #14 (comment).
  • train/valid/test.fact: The retrieved knowledge for each product
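
If you want to peek at the raw data yourself, here is a minimal sketch, assuming the four files of each split are line-aligned UTF-8 text with one example per line (the snippet is illustrative, not part of the KOBE codebase):

# Peek at the first training example in each raw file.
from pathlib import Path

raw = Path("saved/raw")
for suffix in ["title", "cond", "fact", "desc"]:
    with open(raw / f"train.{suffix}", encoding="utf-8") as f:
        first_line = f.readline().strip()
    print(f"train.{suffix}: {first_line[:80]}")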

Preprocessing

Preprocessing is a commonly neglected part of code releases. We provide the preprocessing scripts to rebuild the vocabulary and tokenize the texts, in case you wish to preprocess the KOBE data yourself or need to run on your own data.

Build vocabulary

We use BPE to build a vocabulary over the conditions (including attributes and user categories). For the text, we use the existing BertTokenizer from the Hugging Face transformers library.

python -m kobe.data.vocab \
  --input saved/raw/train.cond \
  --vocab-file saved/vocab.cond \
  --vocab-size 31 --algo word
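
After the command above finishes, the condition vocabulary can be inspected with the sentencepiece package (a minimal sketch, assuming saved/vocab.cond.model is a SentencePiece model as suggested by kobe/data/vocab.py):

# Load the condition vocabulary and encode the first raw condition line.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="saved/vocab.cond.model")
print("condition vocab size:", sp.vocab_size())
with open("saved/raw/train.cond", encoding="utf-8") as f:
    example = f.readline().strip()
print(sp.encode(example, out_type=str))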

Tokenization

Then, we tokenize the raw inputs and save the preprocessed samples to .tar files. Note: this process can take a while (about 20 minutes with an 8-core processor).

python -m kobe.data.preprocess \
  --raw-path saved/raw/ \
  --processed-path saved/processed/ \
  --split train valid test \
  --vocab-file bert-base-chinese \
  --cond-vocab-file saved/vocab.cond.model
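
The text side relies on the standard bert-base-chinese tokenizer. Here is a standalone sketch of what that tokenization looks like on a product title (not the exact pipeline in kobe/data/preprocess.py):

# Tokenize a Chinese product title with the Hugging Face BertTokenizer.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
title = "牛仔外套女2019春秋装新款宽松学生韩版bf原宿风外套牛仔衣潮"
print(tokenizer.tokenize(title)[:10])
print(tokenizer.encode(title, add_special_tokens=True)[:10])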

You can peek into the saved/ directories to see what these preprocessing scripts did:

 8.2G KOBE/saved
  16G ├──processed
  20M │  ├──test.tar
 1.0G │  ├──train-0.tar
 1.0G │  ├──train-1.tar
 1.0G │  ├──train-2.tar
 1.0G │  ├──train-3.tar
 1.0G │  ├──train-4.tar
 1.0G │  ├──train-5.tar
 1.0G │  ├──train-6.tar
 1.0G │  ├──train-7.tar
  38M │  └──valid.tar
 1.6G ├──raw
      │  ├──...
 238K └──vocab.cond.model

Experiments

Visualization with WandB

First, set up WandB, an 🌟 incredible tool for visualizing deep learning experiments. If you haven't used it before, log in and follow the instructions.

wandb login

Training your own KOBE

We provide four training modes: baseline, kobe-attr, kobe-know, and kobe-full, corresponding to the models explored in the paper. They can be trained with the following commands:

python -m kobe.train --mode baseline --name baseline
python -m kobe.train --mode kobe-attr --name kobe-attr
python -m kobe.train --mode kobe-know --name kobe-know
python -m kobe.train --mode kobe-full --name kobe-full

After launching any of the experiments above, go to the WandB link printed in the terminal to view the training progress and evaluation results (updated at the end of every epoch, roughly once every 2 hours).

If you would like to change other hyperparameters, please look at kobe/utils/options.py. For example, the default setting trains the models for 30 epochs with batch size 64, which is around 1 million steps. You can add options such as --epochs 100 to train for more epochs and obtain better results, or increase --num-encoder-layers and --num-decoder-layers if better GPUs are available.
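
For example, a longer and deeper run could combine the flags mentioned above (the run name is arbitrary; check kobe/utils/options.py for the exact flag names and defaults):

python -m kobe.train --mode kobe-full --name kobe-full-deep --epochs 100 --num-encoder-layers 6 --num-decoder-layers 6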

Expected Training Progress

We provide a reference for the training progress (training takes about 150 hours on a 2080 Ti). The full KOBE model achieves the best BERTScore and diversity, with a slightly lower BLEU score than KOBE-Attr (as shown in the paper).

The resulting training/validation/test curves and examples are shown below:

Training Progress

Evaluating KOBE

Evaluation is now super convenient and reproducible with the help of pytorch-lightning and WandB. The checkpoint with the best BLEU score will be saved at kobe-v2/<wandb-run-id>/checkpoints/<best_epoch-best_step>.ckpt. To evaluate this model, run the following command:

python -m kobe.train --mode baseline --name test-baseline --test --load-file kobe-v2/<wandb-run-id>/checkpoints/<best_epoch-best_step>.ckpt

The results will be displayed on the WandB dashboard via the link printed in the terminal. The evaluation metrics include BLEU (via sacreBLEU), a diversity score, and BERTScore. You can also manually inspect some generated examples and their references under the examples/ section on WandB.
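
If you want to compute the same kinds of metrics on your own outputs outside of WandB, here is a standalone sketch using the sacrebleu and bert-score packages (KOBE's own evaluation code may differ in details; the example strings are illustrative):

# Compute corpus BLEU and BERTScore for a toy hypothesis/reference pair.
import sacrebleu
from bert_score import score as bert_score

hyps = ["这件牛仔外套版型宽松,上身显瘦"]
refs = [["这款牛仔外套版型宽松,穿着显瘦"]]

bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="zh")
print("BLEU:", bleu.score)

P, R, F1 = bert_score(hyps, [r[0] for r in refs], lang="zh")
print("BERTScore F1:", F1.mean().item())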

We provide nucleus sampling (https://arxiv.org/abs/1904.09751) as an alternative to the beam search used in the original KOBE paper. To try this decoding strategy, run:

python -m kobe.train --mode baseline --name test-baseline --test --load-file kobe-v2/<wandb-run-id>/checkpoints/<best_epoch-best_step>.ckpt --decoding-strategy nucleus

Pre-trained Models

Pre-trained model checkpoints are available at https://bit.ly/3FiI7Ed (requires network access to Google Drive). In addition, download the vocabulary file and place it under saved/.

Cite

Please cite our paper if you use this code in your own work:

@inproceedings{chen2019towards,
  title={Towards knowledge-based personalized product description generation in e-commerce},
  author={Chen, Qibin and Lin, Junyang and Zhang, Yichang and Yang, Hongxia and Zhou, Jingren and Tang, Jie},
  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  pages={3040--3050},
  year={2019}
}

kobe's People

Contributors

qibinc


kobe's Issues

Is there any dataset in English?

Hello, your model is very interesting, but I do not understand Chinese, which is the language of the model's outputs. Is there any English dataset available?

Building Dataset In English Language

Hi, I have data consisting of product titles, descriptions, and categories from an English-language marketplace. Would you be kind enough to explain how to prepare a proper dataset/format for this data?
Thank you in advance.

inference

It seems that the latest README no longer has the generation and API sections. How should I run inference/testing?

BOS/EOS and attribute embeddings

Sorry to bother you, I have a question about the title token sequence: do you add start/end tokens such as BOS and EOS around it?

If so, then regarding the title embedding: when adding an attribute embedding to each token as in the paper, is it also added to the embeddings of the BOS/EOS tokens, or are they skipped so that the attribute embedding is only added to the actual title tokens?

Thank you for your help!

How to fine-tune and run inference on my own dataset?

I have some product and user information, also in Chinese (images and text, plus the single tag each user is most interested in). How can I use the checkpoint you released to run inference on it?

Also, is it possible to fine-tune your model? I cannot find any instructions in your README on how to build my own dataset.

code error?

When doing knowledge encoding, the code uses tgt_vocab for the knowledge and changes src_vocab_size, but afterwards the code sets config.src_vocab_size = config.src_vocab_size again; this may be an error.

ignore this issue


Hello, python -m kobe.data.preprocess takes far longer than 20 minutes for me; it is extremely slow. Did I do something wrong?
I changed examples = Parallel(n_jobs=8) to examples = Parallel(n_jobs=16), but the speed did not improve.

Reversed word order in target compared to input

I trained a baseline model on my own data and found an interesting result. Unlike your baseline setting, I use word-based encoding and char-based decoding. The model tends to generate words that follow the input, but in reversed order.

Could you explain this phenomenon? I also wonder whether word-based encoding and char-based decoding lead to an information-interpretation gap between the encoder and decoder.

Error when preprocessing dataset

I am following the README instructions to download and preprocess the provided data, but I am stuck on the preprocessing step.
I am trying to run python -m kobe.data.vocab --input saved/raw/train.cond --vocab-file saved/vocab.cond --vocab-size 31 --algo word

Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\alien\Documents\PyCharm-Projects\KOBE\kobe\data\vocab.py", line 35, in <module>
    spm.SentencePieceTrainer.Train(
  File "C:\Program Files\Python39\lib\site-packages\sentencepiece\__init__.py", line 407, in Train
    return SentencePieceTrainer._TrainFromString(arg)
  File "C:\Program Files\Python39\lib\site-packages\sentencepiece\__init__.py", line 385, in _TrainFromString
    return _sentencepiece.SentencePieceTrainer__TrainFromString(arg)
OSError: Not found: "C:\Users\alien\AppData\Local\Temp\tmpxplscd17": Permission denied Error #13

I have checked that the folder is fully accessible to my account, and I am already running cmd as administrator.

About KOBE-v2 training

Hi, I have two questions about training the KOBE-v2 code:

  1. Does the KOBE-v2 code currently support only single-GPU training? The paper mentions training in a multi-GPU environment; what configuration changes are needed to train on multiple GPUs?


  2. About the training time: the README says "We provide a reference for the training progress (training takes about 150 hours on a 2080 Ti)." Does this refer to kobe-full or to the baseline on a single GPU? Could you provide the concrete training time of each variant (kobe-full, kobe-attr, ...) for reference?

Thank you for taking the time to answer!

English language dataset

Hi Team,

Did you also try training your model on an English-language dataset? If so, could you please provide a link to it?

Thanks
Manish

Question about LabelSmoothingLoss

Hi @qibinc ,

I am using your code and noticed that you use LabelSmoothingLoss instead of the standard cross-entropy loss. Could you give some explanation? The paper does not explain this either.

Best wishes,

Some confusion about preprocessing with preprocess.py

Thanks to the authors for open-sourcing the code! I trained aspect_user.yaml successfully and wanted to try the trained model on my own data, but I could not get it to run; it gets stuck at makeVocabulary/makeData.
The TaoDescribe dataset downloaded from Tianchi is the raw dataset; the one downloaded with download_preprocessed_tao.py is the preprocessed one.
During the raw-to-preprocessed conversion in preprocess.py, the comment says "In src files, <x> <y> means this product is intended to show with aspect <x> and user category <y>", but the definition of aspect <x> is not given?

detail user category

Could you tell us the 24 detailed user categories?
I cannot find them in your code or in your paper.

Question about personalized generation result

I trained your model with the default parameters. Are my results the same as yours? I cannot see an obvious difference between the sentences generated (for me) with different item aspects and user categories. Could you provide your result (best_bleu_prediction.txt) after training? Thanks!

Here is my best_bleu_prediction.txt after training (without adapting beam search).


Question on the decoding behavior

Hi Qibin,

Thanks for the nice work. I have a question about the Transformer decoding behavior. I noticed that during training the whole ground truth is fed into the decoder, allowing the token at step $i$ to attend to the previous $i-1$ tokens in the self-attention layer. However, in the inference stage, you only feed one token at step $i$, i.e. the output token from step $i-1$, as follows:

https://github.com/THUDM/KOBE/blob/master/core/models/tensor2tensor.py#L369

I guess that the self-attention layer loses its effectiveness at this time. Why don't we feed the outputs from step 0 to step $i-1$ for decoding step $i$ as we do in the training phase?

Issue with evaluating model with beam search

Hello,

Thank you for providing this well-written and useful repository. After training a model, I tried to evaluate the saved checkpoint using beam search with a command similar to the one in the README:

python core/train.py --config configs/baseline.yaml --mode eval --restore experiments/finals-baseline/checkpoint.pt --expname eval-baseline --beam-size 10

However, I am getting an issue which produces a stack trace like this:

Traceback (most recent call last):  
    File 'KOBE/core/train.py', line 371, in <module>  
        score = eval_model(model, data, params, config, device, writer)  
    File 'KOBE/core/train.py', line 250, in eval_model
        samples, alignment = model.beam_sample(
    File 'KOBE/core/models/tensor2tensor.py', line 467, in beam_sample
        b.advance(output[j, :], attn[j, :]) # batch index
    File 'KOBE/core/models/beam.py', line 101, in advance
        self.attn.append(attnOut.index_select(0, prevK))
RuntimeError: "index_select_out_cuda_impl" not implemented for 'Float'
Process finished with exit code 1

It seems to me that it actually makes sense to happen, since we are trying to index a tensor (attnOut) with a tensor of floats (prevK). Here is the code chunk from beam.py for reference:

prevK = bestScoresId / numWords
self.prevKs.append(prevK)
self.nextYs.append((bestScoresId - prevK * numWords))
self.attn.append(attnOut.index_select(0, prevK))

Am I doing something wrong here? Thanks.
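
For reference, one way this could be addressed (a minimal sketch, not the maintainers' official patch) is to keep prevK an integer tensor via floor division, since index_select requires integer indices:

# Reproduce the indexing step with floor division so prevK stays int64.
import torch

bestScoresId = torch.tensor([7, 12, 25])  # illustrative flattened beam indices
numWords = 10                             # illustrative vocabulary size
attnOut = torch.rand(3, 5)                # illustrative attention rows

prevK = torch.div(bestScoresId, numWords, rounding_mode="floor")  # beam index (int64)
nextY = bestScoresId - prevK * numWords                           # token index within beam
print(attnOut.index_select(0, prevK))                             # no longer raises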

What are the proper configs?

Thanks for sharing the code.
In the config file, learning_rate is set to 2, but the paper says the learning rate is set to 10^-4.
I tried both settings for the baseline and found that 10^-4 was too small to improve the BLEU score. With learning_rate=2, the BLEU score was about 6.0 in the last 1M steps, which is not the 7.2 reported in the paper. Did I make any mistakes? Thanks.

Question about dataset

Hi, I am interested in this paper, but I am a little confused about the dataset used for the experiments. I can easily get product titles from Taobao, but I cannot find product descriptions. What should I do to obtain descriptions, especially personalized descriptions, to use as ground truth like in your dataset?

Named entity to match product title with knowledge graph

Hi,
Thank you so much for a great paper and for sharing the code!

I read the paper and have a small confusion.
https://arxiv.org/pdf/1903.12457.pdf

On page 5, you said "Formally, given a product title x = (x1, x2, . . . , xn), we match each word xi to a named entity vi ∈ V, which should be a vertex in the knowledge graph."
Could you please explain what those named entities are?
What process did you use to match each word to its named entity vi?
For Chinese, you used CN-DBpedia. How about English? Which English knowledge graph would you recommend?

Thank you so much again!

Questions about the title input

Hello, thank you for sharing your work. I would like to ask:
Question 1:
Regarding the title x: do you take "(10)(a)牛仔外套女2019春秋装新款宽松学生韩版bf原宿风外套牛仔衣潮" as x, convert it to embeddings, add the embedding of the (10)(a) attributes, and feed it into the encoder,
or do you take only "牛仔外套女2019春秋装新款宽松学生韩版bf原宿风外套牛仔衣潮" as x and add the embedding of the (10)(a) attributes to that?
Question 2:
Regarding the final personalized product description, is the number of generated characters random?
Is there a way to limit the length, or is it determined by the description lengths in the training set?

User Categories as an attribute

What are the real tags (string format) corresponding to the user category attributes (integer format) in the src.str files?

P.S.
Great work! I really appreciate your efforts in open-sourcing the datasets.

Question about BiAttention

Hi, Qibin

First, thanks for open-sourcing the code.

I have some questions about your code.

  1. I would like to know whether the baseline model uses two Transformer encoders.
  2. In your code, BiAttention is instantiated as self.condition_context_attn = BiAttention(config.hidden_size, config.dropout), but self.condition_context_attn is never used. Could you explain the detailed setting?

Best Wishes!

How to build Data.pkl?

Thank you for sharing such a wonderful project. I find the code very inspiring. However, while reading the code, I could not find the script that builds "data.pkl", so I can hardly infer the data format. Could you please upload the corresponding code or describe the format of data.pkl? Thank you again.

Building Datasets for an English Marketplace

Hi,
I have data consisting of product titles, descriptions, and categories from an English-language marketplace.

Could you explain what other data I need to build a dataset? If possible, I will try to create or obtain that data. Also, please help me understand what knowledge base I require. Do you think this link https://github.com/IBM/build-knowledge-base-with-domain-specific-documents/blob/master/README.md can help me extract a knowledge base from the data (product titles and descriptions) I already have?

I would be glad if you could help me create the dataset and advise on any changes I need to make in the code after changing the dataset.
Thanks in advance.

user's implicit feedback data

Thanks for the code and paper.
I am curious about the soft assignment of user categories. Would it be possible to release the implicit feedback data (clicks, dwell time) to the public?

Is reinforcement learning helpful?

I notice that there are some pieces of code related to reinforcement learning. Did you try training with reinforcement learning, and does RL improve the results?

Have you tried any other hyperparameters?

It seems that the Transformer encoder is not more important than the decoder, so why do you use 6 encoder layers but only 2 decoder layers? Does this help improve performance?
