THUDM / KOBE
Towards Knowledge-Based Personalized Product Description Generation in E-commerce @ KDD 2019
License: MIT License
Hi, I have recently been studying related topics and found your project very interesting. Could you explain how to generate descriptions with this model, and could you provide a demo?
Hi Team,
Did you also try training your model on an English-language dataset? If yes, could you please provide a link to it?
Thanks
Manish
Hello, your model is very interesting, but I don't understand any Chinese, which is what the model outputs. So I'd like to ask: is there any English dataset available?
Hi @qibinc
I am using ROUGE to evaluate the model. I changed metrics: ['bleu'] to metrics: ['rouge'], but it does not seem to work. Is there anything else I need to do?
Best wishes,
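In case it helps, here is a minimal sketch of scoring predictions with ROUGE outside the training loop, using the third-party rouge package (pip install rouge). The file names are hypothetical, Chinese text must be whitespace-tokenized first, and KOBE's own metrics hook may expect a different interface.

```python
# Score hypotheses against references with the `rouge` pip package.
# File names are hypothetical; each line must be whitespace-tokenized.
from rouge import Rouge

with open("best_bleu_prediction.txt") as f:
    hyps = [line.strip() for line in f]
with open("reference.txt") as f:
    refs = [line.strip() for line in f]

scores = Rouge().get_scores(hyps, refs, avg=True)
print(scores["rouge-1"], scores["rouge-2"], scores["rouge-l"])
```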
I noticed that there are some pieces of code related to reinforcement learning. Did you try training with reinforcement learning, and does RL improve the results?
Hi,
Thank you so much for a great paper and for sharing the code!
I read the paper and have a small confusion.
https://arxiv.org/pdf/1903.12457.pdf
On page 5, you said "Formally, given a product title x = (x1, x2, . . . , xn), we match each word xi to a
named entity vi ∈ V, which should be a vertex in the knowledge graph."
Could you please explain what those named entities are?
What process did you use to match each word to its named entity vi?
For Chinese, you used CN-DBPedia. What about English? Which English knowledge graph do you recommend?
Thank you so much again!
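Not the authors' pipeline, but for intuition, here is a minimal sketch of linking title tokens to knowledge-graph vertices by longest-span string matching. The toy dictionary and all names are hypothetical; the paper's actual linking against CN-DBPedia may work differently.

```python
# Hypothetical longest-match entity linking: map surface spans of the
# title to knowledge-graph vertex ids. Toy data, not CN-DBPedia.
kg_entities = {"denim jacket": "e_123", "jacket": "e_45"}

def link_title(tokens):
    """Greedily match the longest spans that name a KG vertex."""
    matches, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):  # try the longest span first
            span = " ".join(tokens[i:j])
            if span in kg_entities:
                matches.append((span, kg_entities[span]))
                i = j
                break
        else:
            i += 1  # no entity starts at token i
    return matches

print(link_title("loose denim jacket women".split()))
# -> [('denim jacket', 'e_123')]
```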
Hi @qibinc,
I found that the PyTorch version affects the code.
For example, torch.gt() in version 1.2 returns bool values.
Best wishes
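For reference, the change in question: since PyTorch 1.2, comparison ops return BoolTensor instead of the old uint8 ByteTensor, and some arithmetic that older mask code relied on no longer works on bool tensors. A small illustration:

```python
import torch

x = torch.tensor([0.1, 0.9])
mask = torch.gt(x, 0.5)   # dtype torch.bool on PyTorch >= 1.2 (uint8 before)
# e.g. `1 - mask` raises on bool tensors; cast explicitly instead:
inv = 1 - mask.long()
print(mask, inv)          # tensor([False,  True]) tensor([1, 0])
```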
Hi Qibin,
Thanks for the nice work. I have a question about the Transformer decoding behavior. I noticed that during training the whole ground truth is fed into the decoder, allowing the tokens at each step to attend over the ground-truth sequence:
https://github.com/THUDM/KOBE/blob/master/core/models/tensor2tensor.py#L369
I guess that the self-attention layer loses its effectiveness in this case. Why don't we instead feed the model's own outputs from step 0 up to the current step?
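For context, the usual answer to this question (not specific to KOBE's code) is that teacher forcing feeds the whole shifted ground truth at once, while a causal mask still prevents position t from attending to positions after t, so self-attention keeps doing real work; feeding the model's own step-by-step outputs would make training sequential and slow, and is what happens only at inference. A generic sketch of such a mask:

```python
import torch

def subsequent_mask(size: int) -> torch.Tensor:
    """Lower-triangular mask: True where attention is allowed,
    so position t can see positions 0..t but nothing later."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

print(subsequent_mask(4))
```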
Sorry, I have a question about the title token sequence: do you wrap it with start/end markers such as BOS and EOS?
If so, then for the title embedding, where this paper adds an attr embedding to each token, is the attr embedding also added to the BOS and EOS token embeddings, or are those skipped so that only the actual title tokens get it? (See the sketch below.)
Thanks for your kindness!
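To make the question concrete, here is a minimal sketch of one possible reading: a single attribute embedding is broadcast-added to every token embedding of the title, BOS/EOS included. Sizes and ids are hypothetical, and whether KOBE actually skips the special tokens is exactly what is being asked.

```python
import torch
import torch.nn as nn

vocab_size, n_attrs, d_model = 1000, 24, 512   # illustrative sizes
tok_emb = nn.Embedding(vocab_size, d_model)
attr_emb = nn.Embedding(n_attrs, d_model)

tokens = torch.tensor([[1, 57, 92, 3, 2]])     # [BOS, w1, w2, w3, EOS], ids made up
attr = torch.tensor([10])                      # one attribute id per sequence

# Broadcast the attribute embedding over all positions, BOS/EOS included.
x = tok_emb(tokens) + attr_emb(attr).unsqueeze(1)
print(x.shape)  # torch.Size([1, 5, 512])
```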
It seems the latest README no longer has the generation and api sections? How should I run testing?
I trained your model with the default parameters. Is the result the same as yours? I can't tell an obvious difference between the generated sentences (to my eye) across different item aspects and user categories. Could the authors provide their result (best_bleu_prediction.txt) after training? Thanks!
Here is my best_bleu_prediction.txt after training (without adapting beam search).
Thanks to the authors for open-sourcing this! I trained aspect_user.yaml successfully and wanted to try the trained model on my own data, but I never got it to run; it gets stuck at makeVocabulary / makeData.
The TaoDescribe dataset downloaded from Tianchi is the source dataset; the one fetched by download_preprocessed_tao.py is the preprocessed one. In the source-dataset -> preprocessed-data step, in preprocess.py, the note says "In src files, <x> <y> means this product is intended to show with aspect <x> and user category <y>", but the definition of aspect <x> is never given?
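For what it's worth, my reading of that note is that each line of a src file simply prepends the aspect and user-category markers to the tokenized title. A purely hypothetical example (the marker syntax, values, and word segmentation are my guesses, not taken from the repo):

```
<1> <a> 牛仔外套 女 春秋装 新款 宽松 学生 韩版 外套
```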
Hi, Qibin
First, thanks to your open source codes.
I have some questions about your code.
I see that self.condition_context_attn = BiAttention(config.hidden_size, config.dropout) is defined, but self.condition_context_attn is never actually used. Could you share the intended setup? Best wishes!
Hi,
PS: I have data consisting of product titles, descriptions, and categories of a marketplace in English.
Could you explain what other data I need to build a dataset? If possible, I will try to create or obtain that data. Also, please help me understand what knowledge base I need. Do you think this link https://github.com/IBM/build-knowledge-base-with-domain-specific-documents/blob/master/README.md can help me extract a knowledge base from the data (product titles and descriptions) that I already have?
I would be glad if you help me create the dataset, and advise any changes that I need to make in code after changing the dataset.
Thanks in advance.
Hello, I have read the Generation section, but I still don't understand how to use api.py.
Hi, I am interested in this paper, but I am a little confused about the dataset for the experiments. I can easily get product titles from Taobao, but I cannot find the product descriptions. So I am unsure how to obtain descriptions, especially personalized descriptions, as ground truth like in your dataset.
Hello, and thank you for sharing your work. I have two questions.
Question 1:
For the title x, do you take "(10)(a)牛仔外套女2019春秋装新款宽松学生韩版bf原宿风外套牛仔衣潮" as x, convert it to embeddings, add the embedding of the (10)(a) attribute, and feed the sum into the encoder? Or do you take only "牛仔外套女2019春秋装新款宽松学生韩版bf原宿风外套牛仔衣潮" as x and add the (10)(a) attribute embedding to that before feeding it in?
Question 2:
For the final generated personalized product description, is the number of generated characters random? Is there a way to impose a length limit, or is the length determined by the description lengths in the training set?
I trained a baseline model using my own data and I found an interesting result. Different from your baseline setting, I use word-based encoding and char-based decoding. It seems that the model tends to generate words according to the input but in reversed order.
Could you explain this phenomenon? I also wonder whether word-based encoding combined with char-based decoding creates an information gap between the encoder and the decoder.
Thank you for sharing such a wonderful project; I find the code very inspiring. However, while reading the code, I could not find the script that builds "data.pkl", so I can hardly infer the data format. Could you please upload the corresponding code or document the format of data.pkl? Thank you again.
Version 2 has no bi-attention code, why?
How can I use the pretrained model for inference? api.py seems to provide an example, but it does not work.
How can I use the example dataset to train the baseline and achieve the total of 81 reported in the paper?
Hi @qibinc
Thanks for your patience. I noticed that your beam_sample function takes a parameter eval_. Is it actually used during training, or am I mistaken?
Best wishes,
It seems that the Transformer encoder is not more important than the decoder,
so why do you use 6 encoder layers but only 2 decoder layers?
Does it help improve performance?
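For concreteness, the asymmetry being asked about looks like this with stock PyTorch modules (illustrative hyperparameters, not KOBE's own tensor2tensor.py):

```python
import torch.nn as nn

d_model, n_heads = 512, 8
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads), num_layers=6)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads), num_layers=2)
```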
Thanks for sharing the code.
In the config file, learning_rate is set to 2, but the paper says the learning rate is set to 10^-4.
I tried both settings for the baseline and found 10^-4 too small to improve the BLEU score. With learning_rate=2, the BLEU score was about 6.0 over the last 1M steps, not the 7.2 shown in the paper. Did I make any mistakes? Thanks.
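One possible reconciliation (my assumption, not confirmed by the authors): in tensor2tensor/OpenNMT-style configs, learning_rate is not the literal step size but a multiplier on the Noam warmup schedule, so learning_rate=2 and a much smaller peak effective rate are not contradictory. A sketch of the effective rate under that convention:

```python
# Noam schedule (an assumption about what learning_rate=2 means here):
#   lr(step) = factor * d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)
def noam_lr(step: int, factor: float = 2.0,
            d_model: int = 512, warmup: int = 4000) -> float:
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(4000))  # peak: ~1.4e-3 with these illustrative values
```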
Thanks for the code and paper.
I am curious about the soft assignment of user categories. Is it possible to provide the implicit feedback data (click, dwell time) to the public?
Hi @qibinc ,
I am using your code and noticed that you do not use the CrossEntropy loss but LabelSmoothingLoss instead. Could you explain this choice? The paper does not explain it either.
Best wishes,
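For anyone with the same question, a generic label-smoothing loss (not KOBE's exact class) looks like this: the target distribution puts 1 - eps on the gold token and spreads eps uniformly over the vocabulary, which regularizes the model compared with plain cross entropy.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, eps=0.1):
    """Cross entropy against a smoothed target distribution:
    (1 - eps) on the gold token, eps spread uniformly."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)
    return ((1 - eps) * nll + eps * smooth).mean()

logits = torch.randn(4, 100)            # (batch, vocab), illustrative
target = torch.randint(0, 100, (4,))
print(label_smoothing_loss(logits, target))
```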
I failed to download the processed training data. Can you help me?
How can I test a trained model?
Hi, could you please point out where this released code implements BiDAF to combine the knowledge encoding with the title representation? Thanks a lot.
Hello,
Thank you for providing this well-written and useful repository. After having trained a model, I try to evaluate the saved checkpoint model using beam search with a command similar to the one from the README:
python core/train.py --config configs/baseline.yaml --mode eval --restore experiments/finals-baseline/checkpoint.pt --expname eval-baseline --beam-size 10
However, I am getting an issue which produces a stack trace like this:
Traceback (most recent call last):
File 'KOBE/core/train.py', line 371, in <module>
score = eval_model(model, data, params, config, device, writer)
File 'KOBE/core/train.py', line 250, in eval_model
samples, alignment = model.beam_sample(
File 'KOBE/core/models/tensor2tensor.py', line 467, in beam_sample
b.advance(output[j, :], attn[j, :]) # batch index
File 'KOBE/core/models/beam.py', line 101, in advance
self.attn.append(attnOut.index_select(0, prevK))
RuntimeError: "index_select_out_cuda_impl" not implemented for 'Float'
Process finished with exit code 1
It seems to me that this failure actually makes sense, since we are indexing a tensor (attnOut) with a tensor of floats (prevK). Here is the code chunk from beam.py for reference:
prevK = bestScoresId / numWords
self.prevKs.append(prevK)
self.nextYs.append((bestScoresId - prevK * numWords))
self.attn.append(attnOut.index_select(0, prevK))
Am I doing something wrong here? Thanks.
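A likely fix (my assumption, not a maintainer-confirmed patch): on recent PyTorch, / on integer tensors performs true division and returns floats, so the division needs to be an explicit floor division before the result is used as an index:

```python
# Replacement for the chunk above (requires PyTorch >= 1.8):
prevK = torch.div(bestScoresId, numWords, rounding_mode="floor")
self.prevKs.append(prevK)
self.nextYs.append(bestScoresId - prevK * numWords)
self.attn.append(attnOut.index_select(0, prevK))
```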
Hi, I have data consisting of product titles, descriptions, and categories from an English-language marketplace. Would you be kind enough to explain how to prepare a proper dataset, and the required dataset format, for this data?
Thank you in advance.
I have similar Chinese product information and user information (images and text, plus the single tag each user is most interested in). How can I run inference with the checkpoint you released?
Also, is it possible to fine-tune your model? I cannot find instructions in your README on how to build a dataset myself.
Could you tell us the 24 detailed user categories?
I cannot find them in either your code or your paper.
I am following the README instructions to download and preprocess the provided data, but I am stuck at the preprocessing step.
I am trying to run python -m kobe.data.vocab --input saved/raw/train.cond --vocab-file saved/vocab.cond --vocab-size 31 --algo word
Traceback (most recent call last):
File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\alien\Documents\PyCharm-Projects\KOBE\kobe\data\vocab.py", line 35, in <module>
spm.SentencePieceTrainer.Train(
File "C:\Program Files\Python39\lib\site-packages\sentencepiece\__init__.py", line 407, in Train
return SentencePieceTrainer._TrainFromString(arg)
File "C:\Program Files\Python39\lib\site-packages\sentencepiece\__init__.py", line 385, in _TrainFromString
return _sentencepiece.SentencePieceTrainer__TrainFromString(arg)
OSError: Not found: "C:\Users\alien\AppData\Local\Temp\tmpxplscd17": Permission denied Error #13
I have checked that the folder's permissions give my account full access, and I am already running cmd as admin.
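My guess at the cause (unverified): the tmpxplscd17 name looks like a Python tempfile, and on Windows a NamedTemporaryFile created with delete=True cannot be reopened by another reader (here, the SentencePiece trainer) while the Python handle is still open, which surfaces as Permission denied. If kobe/data/vocab.py does write its input that way, the usual workaround is:

```python
import os
import tempfile

# Create the temp file with delete=False and close it before handing
# the path to SentencePiece, so Windows allows it to be reopened.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("example input line\n")
tmp.close()
# ... pass tmp.name to spm.SentencePieceTrainer.Train(...) here ...
os.unlink(tmp.name)  # clean up afterwards
```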
Hello, I am trying to build my own dataset, and I would like to know how you build the facts file in the raw data.
How can I train the model on multiple GPUs?
When doing knowledge encoding, the code uses tgt_vocab for the knowledge and changes src_vocab_size, but afterwards it sets config.src_vocab_size = config.src_vocab_size, which is a no-op; there may be an error here.
File "core/train.py", line 327, in
device, devices_id = misc_utils.set_cuda(config)
File "/home/shreya/KOBE/core/utils/misc_utils.py", line 19, in set_cuda
assert config.use_cuda == use_cuda
AssertionError
What are the real tags (string format) corresponding to the user category attributes (int format) in the src.str files?
P.S.
Great work! Really appreciate your efforts in open-sourcing the datasets.
Where is the dataset?