
kobe's Introduction


New: We released KOBE v2, a refactored version of the original code built with the latest (2021) deep learning tools, with greatly improved installation, reproducibility, performance, and visualization, in memory of Kobe Bryant.

This repo contains code and pre-trained models for KOBE, a sequence-to-sequence approach to automatically generating product descriptions by leveraging conditional inputs (e.g., user categories) and incorporating knowledge through retrieval-augmented product titles.

Paper accepted at KDD 2019 (Applied Data Science Track). Latest version at arXiv.

Prerequisites

  • Linux
  • Python >= 3.8
  • PyTorch >= 1.10

Getting Started

Installation

Clone and install KOBE.

git clone https://github.com/THUDM/KOBE
cd KOBE
pip install -e .

Verify that KOBE is correctly installed by running import kobe.
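For example, a quick check from the shell (this simply performs the import described above):

python -c "import kobe"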

Dataset

We use the TaoDescribe dataset, which contains 2,129,187 product titles and descriptions in Chinese.

Run the following command to automatically download the dataset:

python -m kobe.data.download

The downloaded files will be placed at saved/raw/:

 1.6G KOBE/saved
 1.6G ├──raw
  42K │  ├──test.cond
 1.4M │  ├──test.desc
 2.0M │  ├──test.fact
 450K │  ├──test.title
  17M │  ├──train.cond
 553M │  ├──train.desc
 794M │  ├──train.fact
 183M │  ├──train.title
  80K │  ├──valid.cond
 2.6M │  ├──valid.desc
 3.7M │  ├──valid.fact
 853K │  └──valid.title
...
Meanings of downloaded data files
  • train/valid/test.title: The product title as input (source)
  • train/valid/test.desc: The product description as output (generation target)
  • train/valid/test.cond: The product attribute and user category used as conditions in the KOBE model. The interpretations of these tags are explained at #14 (comment).
  • train/valid/test.fact: The retrieved knowledge for each product
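
If you want to peek at the raw data yourself, here is a minimal sketch, assuming the four files of each split are line-aligned UTF-8 text with one example per line (the snippet is illustrative, not part of the KOBE codebase):

# Peek at the first training example in each raw file.
from pathlib import Path

raw = Path("saved/raw")
for suffix in ["title", "cond", "fact", "desc"]:
    with open(raw / f"train.{suffix}", encoding="utf-8") as f:
        first_line = f.readline().strip()
    print(f"train.{suffix}: {first_line[:80]}")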

Preprocessing

Preprocessing is a commonly neglected part of code releases. We provide the preprocessing scripts to rebuild the vocabulary and tokenize the texts, in case you wish to preprocess the KOBE data yourself or need to run on your own data.

Build vocabulary

We use BPE to build a vocabulary over the conditions (including attributes and user categories). For the text, we use the existing BertTokenizer from the Hugging Face transformers library.

python -m kobe.data.vocab \
  --input saved/raw/train.cond \
  --vocab-file saved/vocab.cond \
  --vocab-size 31 --algo word
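
After the command above finishes, the condition vocabulary can be inspected with the sentencepiece package (a minimal sketch, assuming saved/vocab.cond.model is a SentencePiece model as suggested by kobe/data/vocab.py):

# Load the condition vocabulary and encode the first raw condition line.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="saved/vocab.cond.model")
print("condition vocab size:", sp.vocab_size())
with open("saved/raw/train.cond", encoding="utf-8") as f:
    example = f.readline().strip()
print(sp.encode(example, out_type=str))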

Tokenization

Then, we tokenize the raw inputs and save the preprocessed samples to .tar files. Note: this process can take a while (about 20 minutes with an 8-core processor).

python -m kobe.data.preprocess \
  --raw-path saved/raw/ \
  --processed-path saved/processed/ \
  --split train valid test \
  --vocab-file bert-base-chinese \
  --cond-vocab-file saved/vocab.cond.model
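
The text side relies on the standard bert-base-chinese tokenizer. Here is a standalone sketch of what that tokenization looks like on a product title (not the exact pipeline in kobe/data/preprocess.py):

# Tokenize a Chinese product title with the Hugging Face BertTokenizer.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
title = "牛仔外套女2019春秋装新款宽松学生韩版bf原宿风外套牛仔衣潮"
print(tokenizer.tokenize(title)[:10])
print(tokenizer.encode(title, add_special_tokens=True)[:10])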

You can peek into the saved/ directories to see what these preprocessing scripts did:

 8.2G KOBE/saved
  16G ├──processed
  20M │  ├──test.tar
 1.0G │  ├──train-0.tar
 1.0G │  ├──train-1.tar
 1.0G │  ├──train-2.tar
 1.0G │  ├──train-3.tar
 1.0G │  ├──train-4.tar
 1.0G │  ├──train-5.tar
 1.0G │  ├──train-6.tar
 1.0G │  ├──train-7.tar
  38M │  └──valid.tar
 1.6G ├──raw
      │  ├──...
 238K └──vocab.cond.model

Experiments

Visualization with WandB

First, set up WandB, an 🌟 incredible tool for visualizing deep learning experiments. If you haven't used it before, log in and follow the instructions.

wandb login

Training your own KOBE

We provide four training modes: baseline, kobe-attr, kobe-know, and kobe-full, corresponding to the models explored in the paper. They can be trained with the following commands:

python -m kobe.train --mode baseline --name baseline
python -m kobe.train --mode kobe-attr --name kobe-attr
python -m kobe.train --mode kobe-know --name kobe-know
python -m kobe.train --mode kobe-full --name kobe-full

After launching any of the experiments above, go to the WandB link printed in the terminal to view the training progress and evaluation results (updated at the end of every epoch, roughly once every 2 hours).

If you would like to change other hyperparameters, please look at kobe/utils/options.py. For example, the default setting trains the models for 30 epochs with batch size 64, which is around 1 million steps. You can add options such as --epochs 100 to train for more epochs and obtain better results, or increase --num-encoder-layers and --num-decoder-layers if better GPUs are available.
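
For example, a longer and deeper run could combine the flags mentioned above (the run name is arbitrary; check kobe/utils/options.py for the exact flag names and defaults):

python -m kobe.train --mode kobe-full --name kobe-full-deep --epochs 100 --num-encoder-layers 6 --num-decoder-layers 6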

Expected Training Progress

We provide a reference for the training progress (training takes about 150 hours on a 2080 Ti). The full KOBE model achieves the best BERTScore and diversity, with a slightly lower BLEU score than KOBE-Attr (as shown in the paper).

The resulting training/validation/test curves and examples are shown below:

Training Progress

Evaluating KOBE

Evaluation is now super convenient and reproducible with the help of pytorch-lightning and WandB. The checkpoint with the best BLEU score will be saved at kobe-v2/<wandb-run-id>/checkpoints/<best_epoch-best_step>.ckpt. To evaluate this model, run the following command:

python -m kobe.train --mode baseline --name test-baseline --test --load-file kobe-v2/<wandb-run-id>/checkpoints/<best_epoch-best_step>.ckpt

The results will be displayed on the WandB dashboard via the link printed in the terminal. The evaluation metrics include BLEU (via sacreBLEU), a diversity score, and BERTScore. You can also manually inspect some generated examples and their references under the examples/ section on WandB.
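
If you want to compute the same kinds of metrics on your own outputs outside of WandB, here is a standalone sketch using the sacrebleu and bert-score packages (KOBE's own evaluation code may differ in details; the example strings are illustrative):

# Compute corpus BLEU and BERTScore for a toy hypothesis/reference pair.
import sacrebleu
from bert_score import score as bert_score

hyps = ["这件牛仔外套版型宽松,上身显瘦"]
refs = [["这款牛仔外套版型宽松,穿着显瘦"]]

bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="zh")
print("BLEU:", bleu.score)

P, R, F1 = bert_score(hyps, [r[0] for r in refs], lang="zh")
print("BERTScore F1:", F1.mean().item())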

We provide nucleus sampling (https://arxiv.org/abs/1904.09751) as an alternative to the beam search used in the original KOBE paper. To try this decoding strategy, run:

python -m kobe.train --mode baseline --name test-baseline --test --load-file kobe-v2/<wandb-run-id>/checkpoints/<best_epoch-best_step>.ckpt --decoding-strategy nucleus

Pre-trained Models

Pre-trained model checkpoints are available at https://bit.ly/3FiI7Ed (requires network access to Google Drive). In addition, download the vocabulary file and place it under saved/.

Cite

Please cite our paper if you use this code in your own work:

@inproceedings{chen2019towards,
  title={Towards knowledge-based personalized product description generation in e-commerce},
  author={Chen, Qibin and Lin, Junyang and Zhang, Yichang and Yang, Hongxia and Zhou, Jingren and Tang, Jie},
  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  pages={3040--3050},
  year={2019}
}

kobe's People

Contributors

qibinc


kobe's Issues

Is there any dataset in English?

Hello, your model is very interesting, but I do not understand Chinese, which is the language of the model's outputs. Is there any English dataset available?

Building Dataset In English Language

Hi, I have data consisting of product titles, descriptions, and categories from an English-language marketplace. Would you be kind enough to explain how to prepare a proper dataset/format for this data?
Thank you in advance.

inference

It seems that the latest README no longer has the generation and API sections. How should I run inference/testing?

BOS/EOS and attribute embeddings

Sorry to bother you, I have a question about the title token sequence: do you add start/end tokens such as BOS and EOS around it?

If so, then regarding the title embedding: when adding an attribute embedding to each token as in the paper, is it also added to the embeddings of the BOS/EOS tokens, or are they skipped so that the attribute embedding is only added to the actual title tokens?

Thank you for your help!

How to fine-tune and run inference on my own dataset?

I have some product and user information, also in Chinese (images and text, plus the single tag each user is most interested in). How can I use the checkpoint you released to run inference on it?

Also, is it possible to fine-tune your model? I cannot find any instructions in your README on how to build my own dataset.

code error?

When doing knowledge encoding, the code uses tgt_vocab for the knowledge and changes src_vocab_size, but afterwards the code sets config.src_vocab_size = config.src_vocab_size again; this may be an error.

ignore this issue


Hello, python -m kobe.data.preprocess takes far longer than 20 minutes for me; it is extremely slow. Did I do something wrong?
I changed examples = Parallel(n_jobs=8) to examples = Parallel(n_jobs=16), but the speed did not improve.

Reversed word order in target compared to input

I trained a baseline model on my own data and found an interesting result. Unlike your baseline setting, I use word-based encoding and char-based decoding. The model tends to generate words that follow the input, but in reversed order.

Could you explain this phenomenon? I also wonder whether word-based encoding and char-based decoding lead to an information-interpretation gap between the encoder and decoder.

Error when preprocessing dataset

I am following the README instructions to download and preprocess the provided data, but I am stuck on the preprocessing step.
I am trying to run python -m kobe.data.vocab --input saved/raw/train.cond --vocab-file saved/vocab.cond --vocab-size 31 --algo word

Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\alien\Documents\PyCharm-Projects\KOBE\kobe\data\vocab.py", line 35, in <module>
    spm.SentencePieceTrainer.Train(
  File "C:\Program Files\Python39\lib\site-packages\sentencepiece\__init__.py", line 407, in Train
    return SentencePieceTrainer._TrainFromString(arg)
  File "C:\Program Files\Python39\lib\site-packages\sentencepiece\__init__.py", line 385, in _TrainFromString
    return _sentencepiece.SentencePieceTrainer__TrainFromString(arg)
OSError: Not found: "C:\Users\alien\AppData\Local\Temp\tmpxplscd17": Permission denied Error #13

I have checked that the folder is fully accessible to my account, and I am already running cmd as administrator.

About KOBE-v2 training

Hi, I have two questions about training the KOBE-v2 code:

  1. Does the KOBE-v2 code currently support only single-GPU training? The paper mentions training in a multi-GPU environment; what configuration changes are needed to train on multiple GPUs?


  2. About the training time: the README says "We provide a reference for the training progress (training takes about 150 hours on a 2080 Ti)." Does this refer to kobe-full or to the baseline on a single GPU? Could you provide the concrete training time of each variant (kobe-full, kobe-attr, ...) for reference?

Thank you for taking the time to answer!

English language dataset

Hi Team,

Did you also try training your model on an English-language dataset? If so, could you please provide a link to it?

Thanks
Manish

Question about LabelSmoothingLoss

Hi @qibinc ,

I am using your code and noticed that you use LabelSmoothingLoss instead of the standard cross-entropy loss. Could you give some explanation? The paper does not explain this either.

Best wishes,

Some confusion about preprocessing with preprocess.py

Thanks to the authors for open-sourcing the code! I trained aspect_user.yaml successfully and wanted to try the trained model on my own data, but I could not get it to run; it gets stuck at makeVocabulary/makeData.
The TaoDescribe dataset downloaded from Tianchi is the raw dataset; the one downloaded with download_preprocessed_tao.py is the preprocessed one.
During the raw-to-preprocessed conversion in preprocess.py, the comment says "In src files, <x> <y> means this product is intended to show with aspect <x> and user category <y>", but the definition of aspect <x> is not given?

detail user category

Could you tell us the 24 detailed user categories?
I cannot find them in your code or in your paper.

Question about personalized generation result

I trained your model with the default parameters. Are my results the same as yours? I cannot see an obvious difference between the sentences generated (for me) with different item aspects and user categories. Could you provide your result (best_bleu_prediction.txt) after training? Thanks!

Here is my best_bleu_prediction.txt after training (without adapting beam search).


Question on the decoding behavior

Hi Qibin,

Thanks for the nice work. I have a question about the Transformer decoding behavior. I noticed that during training the whole ground truth is fed into the decoder, allowing the token at step $i$ to attend to the previous $i-1$ tokens in the self-attention layer. However, in the inference stage, you only feed one token at step $i$, i.e. the output token from step $i-1$, as follows:

https://github.com/THUDM/KOBE/blob/master/core/models/tensor2tensor.py#L369

I guess that the self-attention layer loses its effectiveness at this time. Why don't we feed the outputs from step 0 to step $i-1$ for decoding step $i$ as we do in the training phase?

Issue with evaluating model with beam search

Hello,

Thank you for providing this well-written and useful repository. After training a model, I tried to evaluate the saved checkpoint using beam search with a command similar to the one in the README:

python core/train.py --config configs/baseline.yaml --mode eval --restore experiments/finals-baseline/checkpoint.pt --expname eval-baseline --beam-size 10

However, I am getting an issue which produces a stack trace like this:

Traceback (most recent call last):  
    File 'KOBE/core/train.py', line 371, in <module>  
        score = eval_model(model, data, params, config, device, writer)  
    File 'KOBE/core/train.py', line 250, in eval_model
        samples, alignment = model.beam_sample(
    File 'KOBE/core/models/tensor2tensor.py', line 467, in beam_sample
        b.advance(output[j, :], attn[j, :]) # batch index
    File 'KOBE/core/models/beam.py', line 101, in advance
        self.attn.append(attnOut.index_select(0, prevK))
RuntimeError: "index_select_out_cuda_impl" not implemented for 'Float'
Process finished with exit code 1

It seems to me that it actually makes sense to happen, since we are trying to index a tensor (attnOut) with a tensor of floats (prevK). Here is the code chunk from beam.py for reference:

prevK = bestScoresId / numWords
self.prevKs.append(prevK)
self.nextYs.append((bestScoresId - prevK * numWords))
self.attn.append(attnOut.index_select(0, prevK))

Am I doing something wrong here? Thanks.
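
For reference, one way this could be addressed (a minimal sketch, not the maintainers' official patch) is to keep prevK an integer tensor via floor division, since index_select requires integer indices:

# Reproduce the indexing step with floor division so prevK stays int64.
import torch

bestScoresId = torch.tensor([7, 12, 25])  # illustrative flattened beam indices
numWords = 10                             # illustrative vocabulary size
attnOut = torch.rand(3, 5)                # illustrative attention rows

prevK = torch.div(bestScoresId, numWords, rounding_mode="floor")  # beam index (int64)
nextY = bestScoresId - prevK * numWords                           # token index within beam
print(attnOut.index_select(0, prevK))                             # no longer raises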

What are the proper configs?

Thanks for sharing the code.
In the config file, learning_rate is set to 2, but the paper says the learning rate is set to 10^-4.
I tried both settings for the baseline and found that 10^-4 was too small to improve the BLEU score. With learning_rate=2, the BLEU score was about 6.0 in the last 1M steps, which is not the 7.2 reported in the paper. Did I make any mistakes? Thanks.

Question about dataset

Hi, I am interested in this paper, but I am a little confused about the dataset used for the experiments. I can easily get product titles from Taobao, but I cannot find product descriptions. What should I do to obtain descriptions, especially personalized descriptions, to use as ground truth like in your dataset?

Named entity to match product title with knowledge graph

Hi,
Thank you so much for a great paper and for sharing the code!

I read the paper and have a small confusion.
https://arxiv.org/pdf/1903.12457.pdf

On page 5, you said "Formally, given a product title x = (x1, x2, . . . , xn), we match each word xi to a named entity vi ∈ V, which should be a vertex in the knowledge graph."
Could you please explain what those named entities are?
What process did you use to match each word to its named entity vi?
For Chinese, you used CN-DBpedia. How about English? Which English knowledge graph would you recommend?

Thank you so much again!

Questions about the title input

Hello, thank you for sharing your work. I would like to ask:
Question 1:
Regarding the title x: do you take "(10)(a)牛仔外套女2019春秋装新款宽松学生韩版bf原宿风外套牛仔衣潮" as x, convert it to embeddings, add the embedding of the (10)(a) attributes, and feed it into the encoder,
or do you take only "牛仔外套女2019春秋装新款宽松学生韩版bf原宿风外套牛仔衣潮" as x and add the embedding of the (10)(a) attributes to that?
Question 2:
Regarding the final personalized product description, is the number of generated characters random?
Is there a way to limit the length, or is it determined by the description lengths in the training set?

User Categories as an attribute

What are the real tags (string format) corresponding to the user category attributes (integer format) in the src.str files?

P.S.
Great work! I really appreciate your efforts in open-sourcing the datasets.

Question about BiAttention

Hi, Qibin

First, thanks for open-sourcing the code.

I have some questions about your code.

  1. I would like to know whether the baseline model uses two Transformer encoders.
  2. In your code, BiAttention is instantiated as self.condition_context_attn = BiAttention(config.hidden_size, config.dropout), but self.condition_context_attn is never used. Could you explain the detailed setting?

Best Wishes!

How to build Data.pkl?

Thank you for sharing such a wonderful project. I find the code very inspiring. However, while reading the code, I could not find the script that builds "data.pkl", so I can hardly infer the data format. Could you please upload the corresponding code or describe the format of data.pkl? Thank you again.

Building Datasets for an English Marketplace

Hi,
I have data consisting of product titles, descriptions, and categories from an English-language marketplace.

Could you explain what other data I need to build a dataset? If possible, I will try to create or obtain that data. Also, please help me understand what knowledge base I require. Do you think this link https://github.com/IBM/build-knowledge-base-with-domain-specific-documents/blob/master/README.md can help me extract a knowledge base from the data (product titles and descriptions) I already have?

I would be glad if you could help me create the dataset and advise on any changes I need to make in the code after changing the dataset.
Thanks in advance.

user's implicit feedback data

Thanks for the code and paper.
I am curious about the soft assignment of user categories. Would it be possible to release the implicit feedback data (clicks, dwell time) to the public?

Is reinforcement learning helpful?

I notice that there are some pieces of code related to reinforcement learning. Did you try training with reinforcement learning, and does RL improve the results?

Have you tried any other hyperparameters?

It seems that the Transformer encoder is not more important than the decoder, so why do you use 6 encoder layers but only 2 decoder layers? Does this help improve performance?
