
bert-vocab-builder's Introduction

👋 Hi, I’m @kwonmha

  • I'm working as a researcher at the AI Center of Samsung Life Insurance.

👀 I’m interested in

  • NLP, including Language Modeling and Representation Learning
  • And also ML and recommender systems

🌱 I’m currently working on

  • Fine-tuning and serving large language models
  • Retrieval-augmented generation (RAG)

📫 How to reach me: [email protected]


bert-vocab-builder's Issues

Issue with tf.gfile / tf.io.gfile

The code in the repo does not work with either TensorFlow 1.11 or 2.0. I got it to work on 1.11 by changing the code according to the 1.11 API:

  filenames = sorted(tf.gfile.Glob(filepattern))
  print(filenames)
  lines_read = 0
  for filename in filenames:
    start = time.time()
    with tf.gfile.Open(filename) as f:

Alternatively, update the README to reflect the TF version you used.
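One way to make the snippet above run unchanged across TF versions is to resolve the file APIs through tf.compat.v1 when it exists and fall back to the standard library otherwise. A minimal sketch (the helper name and the fallback are my assumptions, not code from the repo):

```python
import time

try:
    # tf.compat.v1 exposes the old gfile API on TF >= 1.13 and TF 2.x
    import tensorflow.compat.v1 as tf
    _glob, _open = tf.gfile.Glob, tf.gfile.Open
except ImportError:  # no usable tf.compat.v1: plain stdlib fallback
    import glob
    _glob, _open = glob.glob, open

def read_filepattern(filepattern):
    """Read every file matching the pattern, counting lines (sketch)."""
    filenames = sorted(_glob(filepattern))
    print(filenames)
    lines_read = 0
    for filename in filenames:
        start = time.time()
        with _open(filename) as f:
            for _ in f:
                lines_read += 1
        print("%.3fs for reading %s" % (time.time() - start, filename))
    return lines_read
```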

Corpus preprocessing steps

Hi Kwonmha,
Thanks for open-sourcing the repo. May I ask whether the general preprocessing steps for the vocab builder, for an uncased BERT model, are as follows:

  1. Convert the corpus text file to lower case
  2. Remove punctuation from the corpus text file?
  3. Build the vocab
  4. Match the vocab file to the BERT model configuration, e.g. take the top 30k lines (as the vocab should be ordered by descending frequency?), and manually adjust the vocab file so that it contains punctuation (i.e. entries for . , ? ! ##. ##, ##? ##! etc.)?
  5. Use the vocab file later for pretraining the BERT model; the pretraining corpus needs to be lower-cased, but without removal of punctuation?

Let me know if my understanding is not correct.

Thanks!
Regards
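For what it's worth, steps 1–2 above might look like the following minimal sketch (my own helper, not from the repo; whether punctuation should be stripped before vocab building is exactly the open question here):

```python
import re
import string

def preprocess_uncased(line, strip_punct=True):
    """Lowercase a corpus line and optionally drop punctuation (sketch)."""
    line = line.lower()
    if strip_punct:
        # Remove every ASCII punctuation character.
        line = line.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", line).strip()

print(preprocess_uncased("Great movie, really!"))  # great movie really
```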

AttributeError: module 'tensorflow.io' has no attribute 'gfile'

Hello,
first of all, thank you for this project.

I tried to run it using tensorflow==1.11.0 and got this error:
AttributeError: module 'tensorflow.io' has no attribute 'gfile'

I also tried to run it using tensorflow==2.0.0 and got this error:
AttributeError: module 'tensorflow' has no attribute 'flags'

Could you list all the requirements or suggest a solution? Thank you!
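Both tracebacks are consistent with the flag and gfile symbols having moved between TF releases: tf.flags lives at tf.compat.v1.flags in TF 2.x, and tf.io.gfile only appeared around TF 1.13. One hedged workaround (my own helper, not part of the repo) is to resolve the TF-1.x namespace once at import time:

```python
def v1_namespace(tf_module):
    """Return the TF-1.x style namespace for whatever TF is installed.

    On TF >= 1.13 / 2.x this returns tf.compat.v1, where tf.flags and
    tf.gfile still resolve; on older releases, the module itself.
    """
    compat = getattr(tf_module, "compat", None)
    return getattr(compat, "v1", tf_module) if compat is not None else tf_module
```

With `import tensorflow; tf = v1_namespace(tensorflow)` near the top of subword_builder.py, both tf.flags and tf.gfile should then resolve on either major version.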

Unable to understand the input format and also the generated output

I wanted to build a vocab on my corpus. I made a folder named data, put the file in it (a small text file, just for a sanity check), set corpus_max_lines to 8 (the number of lines in my test text) and ran the following command:

python subword_builder.py --corpus_filepattern=/media/ayushjain1144/"New Volume"/"IGCAR PS"/data --corpus_max_lines=8 --output_filename=/media/ayushjain1144/"New Volume"/"IGCAR PS"/out

The output I received is this:

['/media/ayushjain1144/New Volume/IGCAR PS/data']
0.00014281272888183594 for reading read file : /media/ayushjain1144/New Volume/IGCAR PS/data
read all files
61
61
61
61

The out directory is empty. I'm also not able to understand what 61 means. Please help!
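For what it's worth, --corpus_filepattern looks like it expects a glob over files rather than a bare directory; passing the directory path makes the glob match the directory entry itself, which would explain the near-empty output. A small sketch of the distinction (stdlib glob here; the repo itself goes through tf.gfile.Glob):

```python
import glob
import os

def expand_corpus_filepattern(pattern):
    """Expand a corpus file pattern and keep only regular files (sketch).

    A bare directory path matches the directory entry itself, not the
    files inside it, so /path/to/data yields no readable corpus files;
    /path/to/data/* matches the files.
    """
    files = [m for m in sorted(glob.glob(pattern)) if os.path.isfile(m)]
    if not files:
        raise ValueError(
            "no files match %r; try %r" % (pattern, os.path.join(pattern, "*")))
    return files
```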

is there a format for corpus_filepattern?

Hi, thank you for sharing.
I am trying to make a vocab.txt as below for the IMDB movie review dataset.
python3 subword_builder.py --corpus_filepattern IMDB_review.txt --output_filename vocab.txt --min_count 30000
WARNING:tensorflow:From subword_builder.py:81: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:133: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

W0304 18:26:35.470829 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:133: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

['./IMDB_review.txt']
WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:138: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

W0304 18:26:35.492865 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:138: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

19.23373532295227 for reading read file : ./IMDB_review.txt
read all files
WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/text_encoder.py:588: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W0304 18:26:54.772613 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/text_encoder.py:588: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Iteration 0
I0304 18:26:54.772828 140030738089792 text_encoder.py:588] Iteration 0
INFO:tensorflow:vocab_size = 668
I0304 18:26:59.560518 140030738089792 text_encoder.py:660] vocab_size = 668
INFO:tensorflow:Iteration 1
I0304 18:26:59.560930 140030738089792 text_encoder.py:588] Iteration 1
INFO:tensorflow:vocab_size = 378
I0304 18:27:02.865697 140030738089792 text_encoder.py:660] vocab_size = 378
INFO:tensorflow:Iteration 2
I0304 18:27:02.866119 140030738089792 text_encoder.py:588] Iteration 2
INFO:tensorflow:vocab_size = 403
I0304 18:27:06.409686 140030738089792 text_encoder.py:660] vocab_size = 403
INFO:tensorflow:Iteration 3
I0304 18:27:06.409908 140030738089792 text_encoder.py:588] Iteration 3
INFO:tensorflow:vocab_size = 397
I0304 18:27:10.208530 140030738089792 text_encoder.py:660] vocab_size = 397
INFO:tensorflow:Iteration 4
I0304 18:27:10.208930 140030738089792 text_encoder.py:588] Iteration 4
INFO:tensorflow:vocab_size = 399
I0304 18:27:13.905530 140030738089792 text_encoder.py:660] vocab_size = 399
total vocab size : 456, 19.1799635887146 seconds elapsed
INFO:tensorflow:vocab_size = 456
I0304 18:27:13.912348 140030738089792 text_encoder.py:686] vocab_size = 456

But the vocab size is very small?
What's wrong?

IMDB_review.txt
I thought this was wonderful way to spend time on too hot summer weekend sitting in the air conditioned theater and watching light hearted comedy The plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer While some may be disappointed when they realize this is not Match Point Risk Addiction thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love This was the most d laughed at one of Woody comedies in years dare say decade While ve never been impressed with Scarlet Johanson in this she managed to tone down her sexy image and jumped right into average but spirited young woman This may not be the crown jewel of his career but it was wittier than Devil Wears Prada and more interesting than Superman great comedy to go see with friends .
Basically there a family where little boy Jake thinks there a zombie in his closet his parents are fighting all the time This movie is slower than soap opera and suddenly Jake decides to become Rambo and kill the zombie OK first of all when you re going to make film you must Decide if its thriller or drama As drama the movie is watchable Parents are divorcing arguing like in real life And then we have Jake with his closet which totally ruins all the film expected to see BOOGEYMAN similar movie and instead watched drama with some meaningless thriller spots out of just for the well playing parents descent dialogs As for the shots with Jake just ignore them .

And my TensorFlow version is 1.15.
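A likely explanation (an assumption on my part, based on how such builders usually behave): --min_count is a minimum token-frequency threshold, not a target vocabulary size, so --min_count 30000 keeps only subwords seen at least 30,000 times and shrinks the vocab to a few hundred entries. The effect in miniature:

```python
from collections import Counter

def build_vocab(tokens, min_count):
    """Keep only tokens whose corpus frequency reaches min_count (sketch).

    Assumption: min_count is a frequency floor, so raising it shrinks
    the resulting vocabulary rather than growing it.
    """
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if c >= min_count)

corpus = "the movie the movie the plot".split()
print(build_vocab(corpus, min_count=2))  # ['movie', 'the']
print(build_vocab(corpus, min_count=3))  # ['the']
```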

BERT trained on custom corpus

Hi M. H. Kwon,
Your tokenization script is really helpful.

I trained a BERT model on a custom corpus using Google's scripts such as create_pretraining_data.py, run_pretraining.py, extract_features.py, etc. As a result I got a vocab file, a .tfrecord file, a .json file and checkpoint files.

Now, how do I use those files for the tasks below:

  1. to predict a missing word in a given sentence?
  2. for next sentence prediction
  3. a Q&A model

Need your help.

Inaccurate sub-words for German

I tried using the vocab builder on the German Wikipedia, but some words aren't accurately split into sub-words. For example, "eintausendneunhundertneunzig" is treated as a single sub-word, although I expected "ein", "tausend", "neun", "hundert", "neun", "zig". Are there any tweaks to make the model more suitable for German, which is very compound-heavy?
Thank you
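Frequency-driven subword builders tend to keep any string that is frequent enough as a single unit, so a common compound can survive unsplit. One possible workaround, sketched under the assumption that a wordlist of valid parts is available, is to pre-split compounds before building the vocab (a toy greedy splitter; real German would also need linking elements like "-s-" handled):

```python
def split_compound(word, parts):
    """Split a compound into known parts, longest prefix first (sketch).

    Returns None when no full segmentation exists. The wordlist below
    is an assumption for illustration, not something the repo ships.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in parts:
            tail = split_compound(word[end:], parts)
            if tail is not None:  # backtrack if the remainder won't split
                return [head] + tail
    return None

parts = {"ein", "tausend", "neun", "hundert", "zig"}
print(split_compound("eintausendneunhundertneunzig", parts))
# ['ein', 'tausend', 'neun', 'hundert', 'neun', 'zig']
```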

I am getting an error while running vocab builder.

Code and files used for the vocab builder:
!git clone https://github.com/kwonmha/bert-vocab-builder.git
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/restaurant_review_nopunct.txt
!python ./bert-vocab-builder/subword_builder.py --corpus_filepattern "restaurant_review_nopunct.txt" --output_filename "vocab.txt" --min_count 1

Issue 1: fixed by replacing 'tf.flags' with 'tf.compat.v1.flags' (version issue)
Traceback (most recent call last):
File "./bert-vocab-builder/subword_builder.py", line 37, in
tf.flags.DEFINE_string('output_filename', '/tmp/my.subword_text_encoder',
AttributeError: module 'tensorflow' has no attribute 'flags'

Issue 2:
The number of files to read : 1
Traceback (most recent call last):
File "./bert-vocab-builder/subword_builder.py", line 86, in
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./bert-vocab-builder/subword_builder.py", line 67, in main
split_on_newlines=FLAGS.split_on_newlines, additional_chars=FLAGS.additional_chars)
File "/content/bert-vocab-builder/tokenizer.py", line 191, in corpus_token_counts
split_on_newlines=split_on_newlines):
File "/content/bert-vocab-builder/tokenizer.py", line 139, in _read_filepattern
tf.logging.INFO("Start reading ", filename)
TypeError: 'int' object is not callable

Could anyone please help me out with this issue? Thanks in advance.
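The TypeError in Issue 2 is consistent with tokenizer.py calling the log-level constant rather than the logging function: tf.logging mirrors Python's logging module here, where INFO is an integer and info() is the callable. Reproduced with the stdlib:

```python
import logging

filename = "restaurant_review_nopunct.txt"

# logging.INFO is an integer level constant, just like tf.logging.INFO;
# calling it reproduces the TypeError from the traceback above.
try:
    logging.INFO("Start reading ", filename)  # the mistake in tokenizer.py
except TypeError as err:
    print(err)                                # 'int' object is not callable
logging.info("Start reading %s", filename)    # the intended lowercase call
```

So changing `tf.logging.INFO(...)` to `tf.logging.info(...)` in tokenizer.py should clear Issue 2.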

Windows fatal exception: access violation

I ran subword_builder.py on Windows 7 and got the error message below. Could you please let me know what the problem is?

Windows fatal exception: access violation

Current thread 0x00002078 (most recent call first):
File "D:\Anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 384 in get_matching_files_v2
File "D:\Anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 363 in get_matching_files
File "F:\BERT\bert-vocab-builder-master\tokenizer.py", line 133 in _read_filepattern
File "F:\BERT\bert-vocab-builder-master\tokenizer.py", line 188 in corpus_token_counts
File "subword_builder.py", line 63 in main
File "D:\Anaconda3\lib\site-packages\absl\app.py", line 250 in _run_main
File "D:\Anaconda3\lib\site-packages\absl\app.py", line 299 in run
File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 40 in run
File "subword_builder.py", line 81 in

Projects using this and evaluation results

Hi @kwonmha,

your project is exactly what came to my mind when dealing with BERT vocab creation. Currently I'm doing some vocab optimizations for my BERT project, too.

Can you say something about improvements/degradations related to your vocab changes? I'm really curious whether this approach delivers better results.

error in running ALBERT create_pretraining_data.py

I am facing the issue below when running the command to create data for input to the ALBERT model.

python create_pretraining_data.py --input_file /media/xxxx/NewVolume/Albert_Finetune_with_Pretrain_on_Custom_Corpus/data_toy/restaurant_review_train --output_file /media/xxxx/NewVolume/ALBERT/ouput --vocab_file /media/xxxx/NewVolume/Albert_Finetune_with_Pretrain_on_Custom_Corpus/models_toy/vocab.txt
WARNING:tensorflow:From create_pretraining_data.py:653: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

WARNING:tensorflow:From create_pretraining_data.py:618: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0114 11:29:58.957636 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:618: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From create_pretraining_data.py:618: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0114 11:29:58.957761 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:618: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

WARNING:tensorflow:From create_pretraining_data.py:626: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

W0114 11:29:58.958572 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:626: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

WARNING:tensorflow:From create_pretraining_data.py:628: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W0114 11:29:58.959418 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:628: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:*** Reading from input files ***
I0114 11:29:58.959500 140552204957504 create_pretraining_data.py:628] *** Reading from input files ***
INFO:tensorflow: /media/xxxx/NewVolume/Albert_Finetune_with_Pretrain_on_Custom_Corpus/data_toy/restaurant_review_train
I0114 11:29:58.959625 140552204957504 create_pretraining_data.py:630] /media/xxxx/NewVolume/Albert_Finetune_with_Pretrain_on_Custom_Corpus/data_toy/restaurant_review_train
WARNING:tensorflow:From create_pretraining_data.py:228: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

W0114 11:29:58.960052 140552204957504 module_wrapper.py:139] From create_pretraining_data.py:228: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

Traceback (most recent call last):
File "create_pretraining_data.py", line 653, in
tf.app.run()
File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/xxxx/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/xxxx/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "create_pretraining_data.py", line 636, in main
rng)
File "create_pretraining_data.py", line 230, in create_training_instances
line = reader.readline()
File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/lib/io/file_io.py", line 179, in readline
return self._prepare_value(self._read_buf.ReadLineAsString())
File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/lib/io/file_io.py", line 98, in _prepare_value
return compat.as_str_any(val)
File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/util/compat.py", line 123, in as_str_any
return as_str(value)
File "/home/xxxx/.local/lib/python3.6/site-packages/tensorflow_core/python/util/compat.py", line 93, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 8: invalid start byte

@kwonmha
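The UnicodeDecodeError means the corpus file is not valid UTF-8 (byte 0xb9 is legal in Latin-1/CP1252 but not as a UTF-8 start byte). One hedged fix is to re-encode the corpus before feeding it to create_pretraining_data.py (the helper name and the assumed source encoding are mine, not from the repo):

```python
def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite a corpus file as UTF-8 (sketch).

    Assumes the source is Latin-1/CP1252-like, where byte 0xb9 is valid;
    errors="replace" substitutes anything undecodable instead of crashing.
    """
    with open(src_path, "r", encoding=src_encoding, errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
```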

splitting strategy in tokenize.py

I was trying to use the repo for building a vocab and I realized that the encode(text) function is used as the tokenizer. I am not sure if I am right, but I am not able to get the last token in the returned result.

def encode(text):
  """Encode a unicode string as a list of tokens.

  Args:
    text: a unicode string
  Returns:
    a list of tokens as Unicode strings
  """
  if not text:
    return []
  ret = []
  token_start = 0
  # Classify each character in the input string
  is_alnum = [c in _ALPHANUMERIC_CHAR_SET for c in text]
  add_remaining = False
  for pos in range(1, len(text)):
    add_remaining = False
    if is_alnum[pos] != is_alnum[pos - 1]:
      if not is_alnum[pos]:
        token = text[token_start:pos]
        if token != u" " or token_start == 0:
          add_remaining = False
          ret.append(token)
      else:
        add_remaining = True
        token_start = pos


  final_token = text[token_start:] if text[-1] in _ALPHANUMERIC_CHAR_SET else text[token_start:-1]
  if add_remaining:
    ret.append(final_token)
  return ret

The following is a sample result:

print(encode("knee injury present"))
>>['knee', 'injury']
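That matches the reset of add_remaining at the top of every loop iteration: after the final iteration it is almost always False, so the trailing token gets dropped. A sketch of a fix, which I believe mirrors the upstream tensor2tensor tokenizer this code appears to derive from (the simplified character set here stands in for the repo's Unicode-category-based _ALPHANUMERIC_CHAR_SET):

```python
import string

# Simplified character set for illustration; the real tokenizer builds
# this from Unicode "L" and "N" categories.
_ALPHANUMERIC_CHAR_SET = set(string.ascii_letters + string.digits)

def encode(text):
    """Encode a string as a list of tokens, keeping the trailing token."""
    if not text:
        return []
    ret = []
    token_start = 0
    is_alnum = [c in _ALPHANUMERIC_CHAR_SET for c in text]
    for pos in range(1, len(text)):
        if is_alnum[pos] != is_alnum[pos - 1]:
            token = text[token_start:pos]
            # Skip single interior spaces, as the original does.
            if token != u" " or token_start == 0:
                ret.append(token)
            token_start = pos
    ret.append(text[token_start:])  # always emit the final run
    return ret

print(encode("knee injury present"))  # ['knee', 'injury', 'present']
```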
