
dialog's Introduction

Dialog

Dialog is a Japanese chatbot project.
The architecture used in this project is an encoder-decoder model with a BERT encoder and a Transformer decoder.
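As a rough illustration, the wiring looks like the sketch below. The real implementation lives in nn/model/encoder_decoder.py; the class and argument names here are illustrative guesses, not the repository's actual code.

import torch
import torch.nn as nn
from transformers import BertModel

class EncoderDecoderSketch(nn.Module):
    # BERT encodes the input utterance; a Transformer decoder generates
    # the reply token by token, attending to BERT's hidden states.
    def __init__(self, bert_name='cl-tohoku/bert-base-japanese-whole-word-masking',
                 d_model=768, nhead=8, num_layers=6, vocab_size=32000):
        super().__init__()
        self.encoder = BertModel.from_pretrained(bert_name)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.generator = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        # index [0] is the sequence of hidden states in both old and new transformers
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask)[0]
        tgt = self.embed(tgt_ids)
        # causal mask: each position may only attend to earlier positions
        t = tgt_ids.size(1)
        causal = torch.triu(torch.full((t, t), float('-inf')), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.generator(out)  # (batch, tgt_len, vocab_size)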

The accompanying article is written in Japanese.

News

Added Colab notebooks.

You can run the training and evaluation scripts on Google Colab without setting up an environment.
Please click one of the following links.
Note that the training notebook describes a download command at the end, but it hasn't been tested yet. If you run the training notebook and cannot download the trained weight file, please download it manually.

  • Train: Open In Colab
  • Eval : Open In Colab

Text-to-Speech Examples

Blog post (written in Japanese)

@ycat3 created a text-to-speech example by using this project for sentence generation and Parallel WaveNet for speech synthesis. The source code isn't shared, but you can reproduce it if you leverage Parallel WaveNet. The blog has some audio samples, so please try listening to them.

I'd like to create an app that lets us talk with the AI by voice, using speech synthesis and speech recognition, if I ever have enough free time, but right now I can't because I'm preparing for exams...

Contents

  1. Result
  2. Pretrained Model
  3. Usage
    1. Install Packages
    2. Train
    3. Evaluate
  4. Architecture

Result

After 2 epochs:

[Result image]

This model still suffers from the dull-response problem, i.e. it tends to return the same generic reply regardless of input.
I'm currently researching how to solve it.

I then found a paper that tackles this problem:

Another Diversity-Promoting Objective Function for Neural Dialogue Generation

The authors belong to the Nara Institute of Science and Technology, a.k.a. NAIST.
They propose a new objective function for neural dialogue generation.
I hope this method can help me solve the problem.
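As I understand it, the core idea is to down-weight frequent (and therefore dull) tokens in the training objective. Below is a rough sketch of an inverse-token-frequency weighted cross entropy; this is my own simplification, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def itf_loss(logits, targets, token_freq, alpha=0.5, pad_id=0):
    # logits: (batch, seq, vocab), targets: (batch, seq),
    # token_freq: (vocab,) corpus count of every vocabulary token.
    # Frequent tokens get weight < 1, rare tokens weight > 1.
    weight = token_freq.clamp(min=1).float().pow(-alpha)
    weight = weight / weight.mean()
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           weight=weight, ignore_index=pad_id)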

Pretrained Model

  • Pretrained model : ckpt.pth
  • Training data : training_data.txt or train_data.pkl

Both are available on Google Drive.

Usage

Install Packages

The required packages are:

  • pytorch
  • transformers
  • tqdm
  • MeCab (to use transformers.tokenization_bert_japanese.BertJapaneseTokenizer)
  • neologdn
  • emoji

If errors occur because of missing packages, please install whatever is missing.

Example using conda:

# create a new environment
$ conda create -n dialog python=3.7

# activate the new environment
$ conda activate dialog

# install pytorch
$ conda install pytorch torchvision cudatoolkit={YOUR_VERSION} -c pytorch

# install the remaining dependencies except MeCab
$ pip install transformers tqdm neologdn emoji

##### MeCab already installed #####
### Ubuntu ###
$ pip install mecab-python3

### Windows ###
# check that "path/to/MeCab/bin" has been added to the system environment variables
$ pip install mecab-python-windows

##### MeCab not installed #####
# Install MeCab according to your OS.
# The steps below are one way to do it; any method is fine
# as long as transformers.BertJapaneseTokenizer ends up working.
### Ubuntu ###
# if you haven't installed MeCab yet, execute the following commands.
$ apt install aptitude
$ aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
$ pip install mecab-python3

### Windows ###
# Install MeCab from https://github.com/ikegami-yukino/mecab/releases/tag/v0.996
# and add "path/to/MeCab/bin" to the system environment variables,
# then run the following command.
$ pip install mecab-python-windows
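Once everything is installed, a quick smoke test (my own check, not part of the repository) confirms that MeCab and the tokenizer cooperate. Older transformers releases expose the class under transformers.tokenization_bert_japanese; newer ones export it at the top level.

# tokenize a Japanese sentence to verify the MeCab-backed tokenizer works
try:
    from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
except ImportError:
    from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained(
    'cl-tohoku/bert-base-japanese-whole-word-masking')
print(tokenizer.tokenize('こんにちは、元気ですか?'))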

Train

Prepare the conversation data.

  1. Download the training data from Google Drive:
    • train_data.pkl
  2. Change the path in config.py:

# in config.py, line 24
# default value is './data'
data_dir = 'path/to/dir_contains_training_data'

Execute

When you're ready to start training, run the main script.

$ python main.py

Evaluate

  • Download the pretrained weights from Google Drive.
  • Change the path to the pretrained model in config.py:
# in config.py, line 24
# default value is './data'
data_dir = 'path/to/dir_contains_pretrained'
  • Run run_eval.py:
$ python run_eval.py
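For reference, the restore step inside run_eval.py looks roughly like the snippet below (reconstructed from the tracebacks quoted in the issues further down; the exact lines may differ). Passing map_location makes a GPU-saved checkpoint loadable on a CPU-only machine.

import torch
from config import Config
from nn import build_model

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# map tensors onto whatever device is actually available
state_dict = torch.load(f'{Config.data_dir}/{Config.fn}.pth', map_location=device)
model = build_model(Config).to(device)
model.load_state_dict(state_dict['model'])
model.eval()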

Usage of get_tweet.py

If you want more conversation data, use get_tweet.py.

Note that you need to change consumer_key and access_token in order to use this script.

Then execute commands like the following.

# usage
$ python get_tweet.py "query" "Num of continuous utterances"

# Example
# This command runs until an error occurs
# and makes a file named "tweet_data_私は_5.txt" in "./data"
$ python get_tweet.py 私は 5

If you execute the example command, the script starts collecting threads of 5 consecutive utterances whose last utterance contains "私は".

As long as you set "Num of continuous utterances" to 3 or more, make_training_data.py automatically creates pairs of utterances (see the sketch below).

Then execute the following command.

$ python make_training_data.py

As its name suggests, this script builds the training data from './data/tweet_data_*.txt'.
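The pairing step presumably works like this: a thread of N consecutive utterances yields N-1 (input, reply) pairs. This is a sketch of the idea, not the script's actual code.

def to_pairs(thread):
    # ['u1', 'u2', 'u3'] -> [('u1', 'u2'), ('u2', 'u3')]
    return list(zip(thread, thread[1:]))

# a 5-utterance thread collected by get_tweet.py gives 4 training pairs
thread = ['おはよう', 'おはようございます', '今日は暑いね', 'そうだね', '散歩でも行く?']
print(to_pairs(thread))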

Architecture

If you want more information about the architecture of BERT or the Transformer, please refer to the following articles.

dialog's People

Contributors

  • dependabot[bot]
  • reppy4620


dialog's Issues

gpu memory estimation issue

I tried to use the evaluation script with the downloaded pretrained weights.
But when I entered the command "python3 run_eval.py", the error below occurred:

Traceback (most recent call last):
File "run_eval.py", line 12, in <module>
state_dict = torch.load(f'{Config.data_dir}/{Config.fn}.pth')
File "/home/m-ishihara/.local/lib/python3.6/site-packages/torch/serialization.py", line 585, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/m-ishihara/.local/lib/python3.6/site-packages/torch/serialization.py", line 765, in _legacy_load
result = unpickler.load()
File "/home/m-ishihara/.local/lib/python3.6/site-packages/torch/serialization.py", line 721, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/m-ishihara/.local/lib/python3.6/site-packages/torch/serialization.py", line 174, in default_restore_location
result = fn(storage, location)
File "/home/m-ishihara/.local/lib/python3.6/site-packages/torch/serialization.py", line 154, in _cuda_deserialize
return storage_type(obj.size())
File "/home/m-ishihara/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.96 GiB total capacity; 1.40 GiB already allocated; 11.81 MiB free; 1.48 GiB reserved in total by PyTorch)

After investigating GPU memory usage, I found that all the GPU memory was free.
Please tell me what's wrong with it.
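For what it's worth, this usually happens because torch.load tries to restore each tensor onto the CUDA device it was saved from, and a 2 GiB GPU may not hold the checkpoint at deserialization time. Loading onto the CPU first avoids that allocation; a suggested workaround, not an official fix:

import torch
from config import Config

# deserialize onto the CPU; move the model to the GPU later only if it fits
state_dict = torch.load(f'{Config.data_dir}/{Config.fn}.pth', map_location='cpu')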

training error

Thank you for the great work.
I followed the usage guide on GitHub, but some errors occurred. No code was changed.
The following error occurred after copying the pretrained model to the './pretrained' folder and executing 'python main.py':
"OSError: file ./pretrained/config.json not found"
The 'config.json' file was not found in your GitHub repository. When I copied 'config.json' from the original BERT, that error was cleared.
After that, I got the error "OSError: file ./pretrained/pytorch_model.bin not found".
Please give me some tips to solve it.

Thanks in advance.
Seungkwon.

Missing key(s) in state_dict: "encoder.embeddings.position_ids"

I was trying to test the pretrained model,
but an error popped up.

The versions of my packages:
transformers: 3.3.1
torch: 1.3.1

Can I ask the exact versions of transformers and torch you used for this project?

Thank you so much in advance.


RuntimeError Traceback (most recent call last)
in <module>
3 tokenizer = Tokenizer.from_pretrained(Config.model_name)
4 model = build_model(Config).to(device)
----> 5 model.load_state_dict(state_dict['model'])
6 model.eval()
7 model.freeze()

~/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
837 if len(error_msgs) > 0:
838 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 839 self.__class__.__name__, "\n\t".join(error_msgs)))
840 return _IncompatibleKeys(missing_keys, unexpected_keys)
841

RuntimeError: Error(s) in loading state_dict for EncoderDecoder:
Missing key(s) in state_dict: "encoder.embeddings.position_ids".
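This mismatch typically appears when a checkpoint written by an older transformers release is loaded into a model built with a newer release that registers position_ids as a buffer. One workaround (an assumption on my part, not an official fix) is to load non-strictly, since that buffer is deterministic and the model recreates it itself:

# tolerate the missing "encoder.embeddings.position_ids" buffer
missing, unexpected = model.load_state_dict(state_dict['model'], strict=False)
print(missing, unexpected)  # verify nothing besides position_ids is affected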

Seeking Assistance with FileNotFoundError in Neural Network Model

I attempted to run the code by executing !python main.py, and unfortunately encountered the following error:

!python main.py

Traceback (most recent call last):
File "/content/Dialog/main.py", line 10, in <module>
from nn import build_model
File "/content/Dialog/nn/__init__.py", line 2, in <module>
from .model import EncoderDecoder, build_model
File "/content/Dialog/nn/model/__init__.py", line 1, in <module>
from .encoder_decoder import EncoderDecoder, build_model
File "/content/Dialog/nn/model/encoder_decoder.py", line 5, in <module>
from .encoder import build_encoder
File "/content/Dialog/nn/model/encoder.py", line 2, in <module>
from transformers.modeling_bert import BertModel
ModuleNotFoundError: No module named 'transformers.modeling_bert'


Additionally, when attempting to download the file with files.download('./data/ckpt.pth'), I received the following error:

from google.colab import files
files.download('./data/ckpt.pth')

FileNotFoundError Traceback (most recent call last)
in <cell line: 2>()
1 from google.colab import files
----> 2 files.download('./data/ckpt.pth')

/usr/local/lib/python3.10/dist-packages/google/colab/files.py in download(filename)
223 if not _os.path.exists(filename):
224 msg = 'Cannot find file: {}'.format(filename)
--> 225 raise FileNotFoundError(msg) # pylint: disable=undefined-variable
226
227 comm_manager = _IPython.get_ipython().kernel.comm_manager

FileNotFoundError: Cannot find file: ./data/ckpt.pth


I believe that the first error is related to a missing module, but I'm uncertain how to resolve it. Additionally, the second error seems to indicate that the file ckpt.pth cannot be found in the specified directory.

I would greatly appreciate it if you could provide some guidance on how to address these issues.
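The first error is a transformers version mismatch: transformers.modeling_bert existed up to the 3.x releases, but the module was moved in 4.x. A version-tolerant import, offered as a suggestion rather than a repository fix:

try:
    from transformers.modeling_bert import BertModel   # transformers 3.x
except ModuleNotFoundError:
    from transformers import BertModel                 # transformers 4.x and later

The second error then most likely follows from the first: main.py crashed before training started, so ./data/ckpt.pth was never written and there is nothing for files.download to fetch.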

Tokenizer

Hello ,

when I execute the code, I get the following error: 'Tokenizer' object has no attribute 'ids_to_tokens'.
I don't know where it comes from. Thank you.

Errors during evaluation

Thank you very much for the excellent program you have created.

I trained for 5 epochs in exactly the same way, but the bot always returns 「おはようございます」 ("Good morning").

It seems to be a simple mistake, but I would appreciate any feedback.

In addition, I modified the following parts of the program so that it would run (as of 7/29/2021).

1. config.py
#model_name = "bert-base-japanese-whole-word-masking"
model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"

2. tokenizer.py
#from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
from transformers import BertJapaneseTokenizer

Sincerely yours.

The following is what I did during the training.

python main.py
INFO:root:*** Initializing ***
INFO:root:Preparing training data
INFO:root:Define Models
Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertEncoder: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']

  • This IS expected if you are initializing BertEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:root:Define Loss and Optimizer
INFO:root:Start Training
Epoch: 1: 100%|██████████| 54882/54882 [3:42:46<00:00, 4.11it/s, Loss: 2.24628]
*** Saved Model ***
おはようございます。
Epoch: 2: 100%|██████████| 54882/54882 [3:41:45<00:00, 4.12it/s, Loss: 2.35587]
*** Saved Model ***
おはようございます。
Epoch: 3: 100%|██████████| 54882/54882 [3:41:52<00:00, 4.12it/s, Loss: 2.28201]
*** Saved Model ***
おはようございます
Epoch: 4: 100%|██████████| 54882/54882 [3:43:05<00:00, 4.10it/s, Loss: 2.19522]
*** Saved Model ***
おはようございます
Epoch: 5: 100%|██████████| 54882/54882 [3:41:27<00:00, 4.13it/s, Loss: 2.32014]
*** Saved Model ***
おはようございます

The following transcripts show the errors from my chatbots.
ckpt_4 (epoch 5)
 
You>こんにちは。
BOT>おはようございます
You>暑いですね。
BOT>おはようございます
You>元気ですか?
BOT>おはようございます
You>今日雨フルらしいよ
BOT>おはようございます

ckpt (original model)

You>おはよう
BOT>おはようございます
You>元気ですか?
BOT>おはようございます
You>むむ
BOT>おはようございます
You>

Notebook error: Model name 'bert-base-japanese-whole-word-masking' was not found

When I ran the notebook Dialog-Evaluation.ipynb,
an error occurred in the !python run_eval.py cell.

Traceback (most recent call last):
  File "run_eval.py", line 15, in <module>
    tokenizer = Tokenizer.from_pretrained(Config.model_name)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py", line 1591, in from_pretrained
    list(cls.vocab_files_names.values()),
OSError: Model name 'bert-base-japanese-whole-word-masking' was not found in tokenizers model name list (cl-tohoku/bert-base-japanese, cl-tohoku/bert-base-japanese-whole-word-masking, cl-tohoku/bert-base-japanese-char, cl-tohoku/bert-base-japanese-char-whole-word-masking). We assumed 'bert-base-japanese-whole-word-masking' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

If I set model_name in config.py to 'cl-tohoku/bert-base-japanese-whole-word-masking' and install the additional libraries, this error no longer occurs, but then model loading fails because the layer names don't match:

Traceback (most recent call last):
  File "run_eval.py", line 21, in <module>
    model.load_state_dict(state_dict['model'])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for EncoderDecoder:
	Missing key(s) in state_dict: "encoder.embeddings.position_ids".

For reference, the library versions currently installed by pip are as follows:
emoji-0.6.0 neologdn-0.4 sacremoses-0.0.43 sentencepiece-0.1.94 tokenizers-0.9.2 transformers-3.4.0

Sorry for the trouble, but I would appreciate guidance on how to resolve this.

Question about the memory representation

Hello,

thank you for providing such an interesting project!
I have a question about the following lines:

x = torch.cat([x, x.clone()], dim=1)
source_mask = torch.cat([source_mask, source_mask.clone()], dim=1)

I do not understand why the two source embeddings are concatenated.
What does this implementation mean?

thanks!

"run_eval.py" error

Thank you for your support.
By the way, when I tried to run it using the pre-trained model, I unfortunately got the error below:

python3 run_eval.py
Traceback (most recent call last):
File "run_eval.py", line 17, in <module>
model.load_state_dict(state_dict['model'])
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for EncoderDecoder:
Missing key(s) in state_dict: "encoder.embeddings.position_ids".

Please let me know if you have any idea.

pretrained model's response is just only 'おはようございます'

Dialog training has just finished.
The training log is as follows:

INFO:root:*** Initializing ***
INFO:root:Preparing training data
INFO:root:Define Models
INFO:root:Define Loss and Optimizer
INFO:root:Start Training
Epoch: 1: 100%|██████████| 54882/54882 [9:39:27<00:00, 1.58it/s, Loss: 2.18444]
Epoch: 2: 100%|██████████| 54882/54882 [9:37:28<00:00, 1.58it/s, Loss: 2.54174]
Epoch: 3: 100%|██████████| 54882/54882 [9:31:07<00:00, 1.60it/s, Loss: 2.34356]
*** Saved Model ***
おはようございます
*** Saved Model ***
おはようございます。
*** Saved Model ***
おはようございます

Then I tried to test the trained 'ckpt.pth' file using run_eval.py.
But the trained model's responses are as follows:

$ python3 run_eval.py
You>おはよう
BOT>おはようございます
You>今日は疲れた
BOT>おはようございます
You>美味しいものを食べたい
BOT>おはようございます

The responses are always just 'おはようございます'.
What's wrong?
Please let me know what you think.
