tokenmonster's Introduction

TokenMonster

UPDATE: Benchmark results from pretraining 16 language models on different tokenizers.

TokenMonster is an ungreedy subword tokenizer and vocabulary generator, enabling language models to run faster, cheaper, smarter and generate longer streams of text.

Large and sub-optimal vocabularies lead to the waste of computational and memory resources in language models. By switching to TokenMonster, you can potentially achieve the same or better performance with a vocabulary that is less than a quarter of the size.

TokenMonster can train and generate an optimal vocabulary on a 1 GB dataset within 24 hours on a typical desktop. 442 pretrained vocabularies are provided, as well as tools to train your own vocabularies, and implementations in Go, Python & JavaScript for tokenization and detokenization using the pretrained vocabularies or your own.

You can test TokenMonster in your browser here, tokenizing live in native Javascript.

TokenMonster is a novel approach to tokenization with broad-ranging potential uses, but its primary motivation is to improve the training, inference and context length of large language models. By using a more optimal vocabulary and an ungreedy tokenization algorithm, text can be represented with 37.5% fewer tokens at the same vocabulary size compared to other modern tokenization methods, increasing the speed of inference and training and the length of text that fits in the context window. And/or the vocabulary size can be reduced by 75% or more, freeing resources that can be used to make the model smarter and faster.

You can also import existing vocabularies from other tokenizers, allowing you to take advantage of TokenMonster's fast, ungreedy tokenization whilst still using the existing vocabulary your model was trained for. TokenMonster vocabularies for GPT2 Tokenizer and LLaMa Tokenizer are included.

Features

  • Outperforms other tokenization algorithms in every area (benchmark)
  • Selects the optimal vocabulary for a given dataset
  • 5 optimization modes to choose from: unfiltered, clean, balanced, consistent, strict
  • Ungreedy: follows up to 6 parallel branches at a time
  • Fast: follows 6 branches faster than other algorithms can follow 1 (benchmark)
  • Utilizes capcode marker tokens to encode uppercasing and forward delete
  • Successfully identifies words, subwords, common phrases and figures of speech by itself
  • Works with HTML tags, sequential spaces, tabs, etc. without wasting context
  • Can be trained on any language
  • Achieves up to 7 chr/token (depending on vocabulary size & optimization mode)
  • Vocabularies can be modified and resized after training
  • Full support for "special" and "single-byte" tokens
  • Import and export vocabularies to and from human-readable YAML format
  • 442 pretrained vocabularies ready for use

Pretrained Vocabularies

442 vocabularies are planned or have already been built. Download them from Hugging Face, or in the Python library you can simply specify them by name and they'll be downloaded automatically. (Note: the pretrained vocabularies are still being trained, check here to see which are currently available.)

  • Choose a dataset from: code english englishcode fiction
  • Choose a vocab size from: 1024 2048 4096 8000 16000 24000 32000 40000 50256 65536 100256
  • Choose an optimization mode from: unfiltered clean balanced consistent strict
  • For a capcode disabled vocabulary add: nocapcode
  • Finally add the version number: v1

Examples: fiction-24000-strict-v1 code-4096-clean-nocapcode-v1

Usage:

import tokenmonster
vocab = tokenmonster.load("englishcode-32000-consistent-v1")
tokens = vocab.tokenize("This is a test.")

There are 2 additional pre-built vocabularies: gpt2 and llama. These are imports of GPT2 Tokenizer and LLaMa Tokenizer from Hugging Face Transformers into TokenMonster. The tokens and IDs are identical; however, they do not always tokenize the text in exactly the same way. For example, LLaMa Tokenizer on Hugging Face tokenizes " decoded" as dec oded, whilst TokenMonster tokenizes it [correctly] as decode d. TokenMonster-trained vocabularies are massively more efficient, so only use gpt2 and llama if you have to. The scripts used to import them into TokenMonster are here.

vocab = tokenmonster.load("gpt2")
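
As a quick sanity check, the sketch below (illustrative only; it assumes the englishcode-32000-consistent-v1 vocabulary is available and prints counts rather than hard-coding them) round-trips a string through the imported gpt2 vocabulary and compares its token count against a TokenMonster-trained one:

import tokenmonster

gpt2_vocab = tokenmonster.load("gpt2")
tm_vocab = tokenmonster.load("englishcode-32000-consistent-v1")

text = "The quick brown fox jumps over the lazy dog."
gpt2_tokens = gpt2_vocab.tokenize(text)
tm_tokens = tm_vocab.tokenize(text)

# Both vocabularies decode losslessly back to the original string
assert gpt2_vocab.decode(gpt2_tokens) == text
assert tm_vocab.decode(tm_tokens) == text

# The TokenMonster-trained vocabulary typically needs fewer tokens for the same text
print("gpt2:", len(gpt2_tokens), "tokens")
print("englishcode-32000:", len(tm_tokens), "tokens")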

Optimization Modes

All the optimization modes are lossless. The stricter the optimization mode (higher number), the more tokens are used to tokenize the same text, but the grammar is simpler and therefore easier for the language model to learn. The less strict the mode (lower number), the more text can be represented with fewer tokens, but the language model has to learn a more complicated grammar.

0 unfiltered allows the training process to freely determine the tokens. clean is preferred in almost every case, because unfiltered tends to result in overfitting, especially for code as it results in tokens for things like \n\t\t\t\tif (. Use unfiltered for tokenizing language or data that does not use spaces as word boundaries.

1 clean introduces filters to avoid overfitting. It forces the vocabulary to begin words with a space, and limits the way in which whitespace can be combined with other characters.

2 balanced prioritizes whole words and attempts to dissuade the vocabulary from doing things that are difficult to learn.

3 consistent is a looser version of strict. It aims to limit the number of different tokens that can represent the same word or phrase, and doesn't allow open-close delimiters to be combined with words or each other. Numbers are also limited to fewer variants.

4 strict aims to have only 1 token per word, no matter how it is encoded. For example, However, however and HOWEVER! will all use the same however token, in combination with other tokens that indicate its spacing and capitalization.
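
The trade-off is easy to measure for your own text. Below is a minimal sketch (assuming the english-32000-strict-v1 and english-32000-clean-v1 vocabularies are available; exact counts will vary) that tokenizes the same string with a strict and a clean vocabulary and compares the token counts:

import tokenmonster

text = "However, HOWEVER you write it, however always maps to the same token under strict."

strict_vocab = tokenmonster.load("english-32000-strict-v1")
clean_vocab = tokenmonster.load("english-32000-clean-v1")

strict_tokens = strict_vocab.tokenize(text)
clean_tokens = clean_vocab.tokenize(text)

# strict usually needs more tokens, but its simpler grammar is easier to learn
print("strict:", len(strict_tokens), "tokens")
print("clean: ", len(clean_tokens), "tokens")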

Vocabulary Selection Guidance

View the TokenMonster Vocabulary Comparison to see a line chart of the relationship between vocab size, optimization mode and characters/token. From this chart I can offer the rule of thumb that every doubling of vocabulary size increases the characters/token by 0.5. This pattern holds from vocab size 4096 with consistent mode up to 100256.

It's tempting to use large vocabularies, which has been the norm, but you can see on the TokenMonster Tester and Interactive Benchmark that reducing the vocabulary by 50 - 75% often results in only a relatively minor increase in the number of tokens required to tokenize the same text. Even the very general englishcode vocabularies, which are for all intents and purposes multi-lingual, do very well at vocab size 24000. Story or article writing models can go as low as a 4096 vocabulary size and still tokenize at 4 characters per token.
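
You can run this kind of comparison on your own data. The sketch below (a rough illustration; sample.txt is a placeholder for any representative sample of your dataset, and it assumes the named englishcode vocabularies are available) measures characters per token at several vocabulary sizes:

import tokenmonster

# Placeholder path: substitute any representative sample of your own data
with open("sample.txt", encoding="utf-8") as f:
    text = f.read()

for size in (8000, 16000, 24000, 32000):
    vocab = tokenmonster.load(f"englishcode-{size}-consistent-v1")
    tokens = vocab.tokenize(text)
    print(size, round(len(text) / len(tokens), 2), "characters/token")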

TokenMonster works well with small vocabularies because it's using an optimal selection process. In most cases it's simply not necessary to use vocabulary sizes greater than 32000, unless it's a multi-lingual vocabulary. More is not better. Using a vocabulary that is excessively large can lead to inefficient usage of embeddings, not to mention an over-complicated grammar. The embeddings for all those unneeded tokens occupy memory and computational resources that could be used more efficiently.

In my opinion, the 100K vocabulary size is excessive and wasteful, unless your aim is to support at least three languages in the same vocabulary. With a 100K size, you have "spare" tokens. By "spare", I mean that the vocabulary starts assigning tokens to lengthy, specific sequences like "limitations under the License" and "#### According to", suggesting that the vocabulary has reached its optimal size and is now just compressing frequently occurring strings.

My advice is to find the smallest vocabulary size that meets your requirements. With this, you can either be content with a smaller, faster model, or opt to augment the size of the embeddings accordingly, or find a balance between the two.

As for optimization modes, strict is the one to go for if your model is limited by its size or is largely undertrained. If it's a small model that isn't particularly smart and you want to get the most out of it, choose strict, because it'll probably result in a smarter model given that the simpler grammar is quicker to learn (words, punctuation and modifiers are all separate tokens). On the other hand, if you're training something serious with enough training data that each token is exposed to a variety of contexts and can learn its more complex grammar, you probably want to go for clean or balanced.

strict performs very well with longform natural text, such as novels and articles, but it's too strict for code. consistent gives the best balance of consistency for tokenizing code whilst keeping the grammar simple. balanced and clean are excellent at compressing code into fewer tokens, but this comes with the trade-off of a more complex grammar. That said, a smaller vocabulary implies a simpler grammar (fewer possible combinations), so it may be in your interest to aim for balanced with a fairly small vocabulary size, such as 16000. All of this you can determine by playing around with the TokenMonster Tester.

Capcode

Capcode is an alternative encoding for uppercase in UTF-8 text, supporting all UTF-8 characters. It's completely lossless, changing the way in which capital letters are encoded so they can share tokens with lowercase letters but without losing any information. In theory, capcode makes it easier for a model to learn the meaning of words. Additionally, capcode makes for more efficient tokenization because it frees up so many tokens that would otherwise be used for uppercase variants of already existing lowercase tokens.
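
A quick way to see that capcode costs nothing in fidelity is to round-trip mixed-case text through a capcode vocabulary and its nocapcode counterpart (a minimal sketch, assuming both named vocabularies are available):

import tokenmonster

text = "NASA, Nasa and nasa all share the same lowercase token when capcode is enabled."

for name in ("english-32000-balanced-v1", "english-32000-balanced-nocapcode-v1"):
    vocab = tokenmonster.load(name)
    tokens = vocab.tokenize(text)
    assert vocab.decode(tokens) == text  # capitalization is fully recovered either way
    print(name, len(tokens), "tokens")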

Normalization

TokenMonster is designed to be plug-and-play, taking care of normalization concerns for you. UTF-8 and UTF-16 vocabularies are automatically NFD normalized and encoded Little Endian regardless of architecture. When tokenizing, the exact same transformations are applied transparently, so you can pass a string to either UTF-8 or UTF-16 vocabularies, with or without capcode, and on either Little or Big Endian architecture, and it will be processed correctly.

No normalizations are applied to charset "None" vocabularies. If you're not sure which to choose, UTF-8 is preferred.
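
Because the same transformations are applied when tokenizing, you do not need to normalize text yourself. The sketch below (illustrative; it compares NFD-normalized strings because UTF-8 vocabularies are stored in NFD form) passes precomposed accented text straight to the tokenizer and checks that the round trip is canonically equivalent:

import unicodedata
import tokenmonster

vocab = tokenmonster.load("english-32000-balanced-v1")

text = "café résumé naïve"  # precomposed (NFC) input
tokens = vocab.tokenize(text)
decoded = vocab.decode(tokens)

# Decoded text is canonically equivalent to the input (both reduce to the same NFD form)
assert unicodedata.normalize("NFD", decoded) == unicodedata.normalize("NFD", text)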

How does it work and how is it different from BPE?

Byte-Pair Encoding starts with single-byte tokens and iteratively merges frequently occurring tokens together, growing the vocabulary out of single characters. TokenMonster takes an entirely different approach, beginning with all possible tokens and distilling the vocabulary down to the vocab size using a method inspired by chemical distillation. TokenMonster thereby avoids the issue BPE has: once a merge is chosen it is assumed to be beneficial, and although it can later be pruned, the alternative branch that might have performed better has already been lost.

The secret sauce that enables TokenMonster to outperform other algorithms is made from:

  1. The distillation method is an effective means of separating that which is wanted from that which is not, without losing any of the cream.
  2. The training process targets the tokenization method being used. The vocabulary is generated to be optimal for the specific tokenization algorithm and dataset, which is a necessary step for optimal tokenization.

In simplified terms it does the following:

  • Generates all possible tokens in the dataset (40 billion in 1 GB of text)
  • Deletes all tokens that have no more than 100 occurrences, leaving roughly 4 million
  • Generates random vocabularies of vocab_size
  • Tokenizes the dataset using the target tokenization algorithm with the random vocabulary
  • Deletes the 1% "worst" scoring tokens
  • Repeat hundreds of thousands of times
  • When vocab_size is reached, resurrect potential tokens
  • Keep doing this until a more optimal vocabulary cannot be found 1000 times in a row
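
The toy loop below is an illustration of that shape, not the real trainer: it uses a plain greedy tokenizer as the scorer, tiny thresholds, and no resurrection step or random vocabulary sampling. It only shows the "generate every candidate, then repeatedly delete the worst scorers" idea on a small string:

from collections import Counter

def candidate_tokens(text, max_len=8, min_occur=2):
    # Generate every substring up to max_len and keep those that occur often enough
    counts = Counter(text[i:i + n] for n in range(1, max_len + 1)
                     for i in range(len(text) - n + 1))
    return {t for t, c in counts.items() if c >= min_occur or len(t) == 1}

def tokenize_greedy(text, vocab, max_len=8):
    # Simple greedy stand-in for the real (ungreedy) tokenizer
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                tokens.append(text[i:i + n])
                i += n
                break
    return tokens

def distill(text, vocab_size, prune_fraction=0.05):
    vocab = candidate_tokens(text)
    single_chars = {t for t in vocab if len(t) == 1}   # stand-in for single-byte tokens, never deleted
    while len(vocab) > vocab_size:
        used = Counter(tokenize_greedy(text, vocab))
        # Score each token by the characters it covers in the tokenization; unused tokens score zero
        scores = {t: used.get(t, 0) * len(t) for t in vocab if t not in single_chars}
        if not scores:
            break
        k = max(1, int(len(scores) * prune_fraction))
        worst = sorted(scores, key=scores.get)[:k]
        vocab -= set(worst)
    return vocab

corpus = "the cat sat on the mat and the cat sat on the hat " * 20
print(sorted(distill(corpus, vocab_size=40), key=len, reverse=True)[:10])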

TokenMonster does not need any information about the language or structure, and results in a neat list of words, subwords and common phrases. Sample:

a number of 
a series of 
a wonderful 
ability and 
able to get 
about being 
about their 
account for 
acknowledge 
acquisition 
addition to 
address the 
advertising 
affected by 
after being 
against the 

The Ungreedy Tokenization Algorithm

TokenMonster uses an ungreedy tokenization method in which each token has up to 2 alternatives selected during training, which are subwords of itself. First the longest token that matches the next segment of text is selected in a greedy fashion. The alternative tokens are looked up on an index that is included in the vocabulary file. The longest token matching the following text segment is found for the original and its alternatives, giving 3 possible branches. If any of those do not end on a word boundary, a further branch is followed utilizing a forward delete token, which allows for words beginning with a space to be used as parts of other words. The 6 total branches are scored based on various rules, the optimal branch is chosen and the tokenization continues along that branch.

Because the training process targets the tokenization algorithm, the training is not only selecting for tokens but selecting for the relationship between tokens in the vocabulary.
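
To make the branching concrete, here is a heavily simplified sketch of the idea (not the production implementation: it omits capcode, the forward-delete branch and the real scoring rules, and simply keeps whichever branch covers the most characters after one token of lookahead; the vocabulary and alternatives below are made up for the example):

def longest_match(text, pos, vocab, max_len=12):
    # Greedily find the longest token matching the text at pos
    for n in range(min(max_len, len(text) - pos), 0, -1):
        if text[pos:pos + n] in vocab:
            return text[pos:pos + n]
    return text[pos]  # fall back to a single character

def tokenize_ungreedy(text, vocab, alternatives):
    # alternatives maps a token to up to two shorter subword tokens of it
    tokens, pos = [], 0
    while pos < len(text):
        greedy = longest_match(text, pos, vocab)
        branches = [greedy] + alternatives.get(greedy, [])
        best, best_score = greedy, -1
        for candidate in branches:
            next_pos = pos + len(candidate)
            follow = longest_match(text, next_pos, vocab) if next_pos < len(text) else ""
            score = len(candidate) + len(follow)   # characters covered by this branch
            if score > best_score:
                best, best_score = candidate, score
        tokens.append(best)
        pos += len(best)
    return tokens

vocab = {" the", " them", " they", " there", "fore", "re", " ", "t", "h", "e", "m", "y", "r", "f", "o"}
alternatives = {" there": [" the"], " them": [" the"]}
print(tokenize_ungreedy(" therefore they themselves", vocab, alternatives))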

Datasets

The datasets used for generating the pretrained vocabularies are all available on Hugging Face. The sources and scripts used to generate these datasets are included in the training directory.

The training data mostly came from Red Pajamas 1B Token Sample. However, to reduce formal English and emphasize other languages, informal writing and code, c4_sample & cc_sample were cropped to 100MB, and Reddit conversations data were added (also cropped to 100MB.)

Additionally, equally weighted code samples of 2MB per language (code_2mb) and 10MB per language (code_10mb) were added for 30 different programming languages to ensure all programming languages have representation. The source of this is codeparrot/github-code. To ensure a range of coding styles, I allowed only 1 file per GitHub repository, and per file a maximum of 200 lines selected from the middle of the file.

Given the evolving nature of writing styles, I felt that book_sample.txt, which consists of out-of-copyright books, was not a good representation of contemporary fiction. To better represent a more modern style, I curated fiction.txt and fiction_100mb.txt by throwing together a few other datasets and cleaning it up.

Note: fiction_100mb.txt is a subset of fiction.txt, and code_2mb.txt is a subset of code_10mb.txt.

english

Filename                   Filesize (bytes)
arxiv_sample.txt           88,925,569
book_sample.txt            108,069,616
c4_sample.txt              100,560,318
cc_2023-06_sample.txt      100,852,231
fiction_100mb.txt          94,235,489
stackexchange_sample.txt   71,940,138
wikipedia_sample.txt       79,181,873
reddit.txt                 100,027,565
Total                      743,792,799

englishcode

Filename                   Filesize (bytes)
arxiv_sample.txt           88,925,569
book_sample.txt            108,069,616
c4_sample.txt              100,560,318
cc_2023-06_sample.txt      100,852,231
code_2mb.txt               62,895,904
fiction_100mb.txt          94,235,489
github_sample.txt          191,123,094
stackexchange_sample.txt   71,940,138
wikipedia_sample.txt       79,181,873
reddit.txt                 100,027,565
Total                      997,811,797

fiction

Filename                   Filesize (bytes)
book_sample.txt            108,069,616
fiction.txt                357,119,086
reddit.txt                 100,027,565
Total                      565,216,267

code

Filename                   Filesize (bytes)
code_10mb.txt              314,006,799
github_sample.txt          191,123,094
stackexchange_sample.txt   71,940,138
Total                      577,070,031

The following programming and markup languages are represented in both "englishcode" and "code" vocabularies:

Assembly, Batchfile, C, C#, C++, CMake, CSS, Dockerfile, FORTRAN, Go, Haskell, HTML, Java, JavaScript, Julia, Lua, Makefile, Markdown, PHP, Perl, PowerShell, Python, Ruby, Rust, SQL, Scala, Shell, TypeScript, TeX, Visual Basic

Support & Consultation

Use the "Discussions" tab for free support on how to use TokenMonster. You can also hire me for a paid consultation on how to get the best out of TokenMonster, or to generate a vocabulary for you according to your specific requirements.

tokenmonster's People

Contributors

alasdairforsythe

tokenmonster's Issues

Humble question regarding JS performance

First and foremost, very impressive work!

As a low-level JS performance enthusiast, I'd be interested to see how much faster I'd be able to make the JS implementation on V8 in particular (with expected gains across all JITs I'm aware of).

And by "faster" it's of course implied to mean repeatably, measurably, explainably, and significantly faster. (Just say no to microbenchmarks).

Main strategies include well-known run-of-the-mill techniques like enforcing 100% monomorphic code and other related JIT-appeasing goodness.

Would this be of any interest whatsoever to you? Absolutely fine if not, but I wanted to extend you a nerdy E.T. glow-finger of enthusiasm and test the waters before deciding to proceed on my own instead.

Apologies in advance for sending this much text your way unsolicited.

All the best, and again, great work. 👏

"data is required error"

I'm getting a "dataset is required" error with this command:

./getalltokens -capcode true -charset UTF-8 -chunk-size 100000 -dataset /Users/me/wikitext-103-raw/wikitest.raw -max-token-length 1000 -micro-chunks 10 -min-occur 30 -min-occur-chunk 3 -output string enwiki_dict.txt -workers 1

Implemented in the new AI framework Zeta

Hey, I like TokenMonster a lot and have implemented it in our framework, Zeta, the framework enabling one to build the best multi-modality transformer models!

https://github.com/kyegomez/zeta

#!pip install zetascale

from zeta.tokenizers import TokenMonster
tokenizer = TokenMonster("englishcode-32000-consistent-v1")
tokens = tokenizer.tokenize("Hello world!")
print(tokens)

Continuous training: Deleted 0 of 0 tokens; Remaining 0 tokens; reachedMidway withinVocabX2 reachedVocab

Splendid to see this algorithm and the name is stellar. I've been excited to test it out since it was shared!

I finally processed my files and got a vocab file. I executed the command on a very small text size for testing purposes:

./getalltokens -charset utf8 -chunk-size 10000 -dataset my_data.txt -min-occur-chunk 2 -output vocab.vc -workers 2
It gives seemingly no problem in output:

...
2023/06/08 09:55:34 Tokens before trimming: 350634
2023/06/08 09:55:34 Trimming final tokens for min 100
2023/06/08 09:55:34 Tokens after trimming: 14770
2023/06/08 09:55:34 Saving tokens list
2023/06/08 09:55:34 Done

I execute the next portion of the code:

./trainvocab -charset utf-8 -dataset my_data.txt -dir vocab_dir -dictionary vocab.vc -max-token-length 64 -vocab 1048575

It looks like I needed to set vocab to < total number of tokens after trimming. Perhaps that should be documented in the notes.

Wrapping lib in a go cli client

Greetings, there are many situations where relying on curl is enough for utilizing inference on otherwise not very capable machines where a python or js interpreter isn't always assumed.

In case it is of interest to you, I would like to kindly request that you wrap the Go library into a very basic CLI, which would allow the library to be leveraged as a small self-contained Go executable. I'm really referring to the simplest, bare-minimum client, no need for thinking about interactive fuss like completions etc.

I'd normally go ahead and just do it but I never worked with go and dedicating a couple of days to get myself onboard is more or less not an option as of now. I sincerely hope this is something you've thought about doing at some point and you're interested and comfortable enough to make it happen, with the least possible effort without giving up any significant amount of your otherwise precious time =)

I'm looking forward to your answer, feel free to turn it down without any hesitation if you think its appropriate to do so, its totally understandable in any case! Thanks

Spacecode: extend Capcode idea to composite words

I have a suggestion to discuss that could enhance the tokenization in your already amazing approach. I wonder if there's a benefit to consistently pushing all spaces to the front (similar to what OpenAI does) or to the end, or using some other strategy.

Currently, I don't see a specific strategy in english-100256-capcode. The patterns seem to stem from the statistical properties of the corpus:

Format            Count of tokens (after trim)   Count of tokens (unmodified)
word              47380                          47125
word+space        45516                          48173
space+word        1447                           2652
space+word+space  1447                           2050
other             4466                           256

The difference between columns is subtle, and appears with multi-space tokens:

  • unmodified version just compares the first and the last byte of the token with a space
  • trim-version removes all the surrounding spaces and then tries to add them back and search the vocabulary.

There is a noticeable overlap between formats. We can also count how many forms each trimmed word has:

Forms   1       2       3     4
Words   68897   11286   471   727

68897 (69%) of all tokens are alright. They might have spaces today, but at least there is exactly one version.
If we address this, we can save 2 * 11286 + 3 * 471 + 4 * 727 = 26893 = 27% of token ids and reuse them for something else.
I also believe it might help you with performance, because some tokens will be combined early in the process.

Extending Capcode to Composite Words

Capcode does a great job at treating Greedy• and greedy• as the same token (40734). However, issues can arise when considering alternate spellings, such as •greedy. By extending the Capcode idea to composite words, we could address these concerns.

What about •greedy? If such token can possibly appear, it would introduce a near-duplicate. Currently there are no alternative spellings, so greedy+space is the only token. Can it be used at the end of the sentence? greedy. Maybe it is exactly the use-case you foresee with delete token?

•and• is 18564, and• 13207, there is no •and

Proposal

All text tokens should have no spaces at the start or end. Punctuation tokens would remain unchanged.

In this approach, TokenMonster and ungreedy would be spelled using a version of zero-width joiner or word joiner from Unicode:

TokenMonster demo

Token+Monster is an un+greedy tokenizer and vocabulary builder !

Or, with Capcode:

^token+^monster is an un+greedy tokenizer and vocabulary builder !

This extension could reduce the number of tokens used by 27% and repurpose them to bring more semantically meaningful words into the vocabulary. With more known individual words, there will be even less need to use concatenation token.

I believe implementing this idea would make the tokenization process more efficient and consistent. Looking forward to hearing your thoughts!

code-65536 models cannot decode

Hi,
I was just trying out the code tokenizers; it seems like all the code-65536-* models are unable to decode:

import tokenmonster

tokenizer = tokenmonster.load("code-65536-balanced-nocapcode-v1")
tokens = tokenizer.tokenize("hello world") # [  127 51042]
decoded_string = tokenizer.decode(tokens)
print(decoded_string)
> ''

The 100k and 32k models work.

hello!

Hi, I was reaching out because I couldn't find a way to contact you privately, so I apologize for how out of place this message is. Would you happen to be available for a meeting some time soon? I can be reached at [email protected], or here directly if you prefer :)

Special tokens not showing up correctly when tokenized.

I tried adding some special tokens to the vocab of a pretrained model. Made a PR for a minor code fix. When I try to encode strings, these new tokens are sometimes broken out into many tokens instead of being encoded as a single token.

How do I make sure my special tokens always map to the same id?
code to reproduce what I am seeing:

vocab = tokenmonster.load("englishcode-32000-consistent-v1")

vocab.modify(["<|im_start|>", "<|im_end|>", "<s>"], None, None, 0)


vocab.resize(32000, reset_token_ids=False)


# Tokenize some text
text = [
    "<s>Some text to turn into token IDs. Why is this happening?<|im_end|>",
    "<s>Some text to turn into token IDs. <|im_end|>",
    "<s>Some text to turn into token IDs....<|im_end|>",
]

Tokenize strings of only N-types of characters?

Hi, I'm looking at this for tokenizing biological sequences: protein, DNA, RNA. These have between 4-22 letters, generally. When I use the procedure, it only finds the base-letters as tokens. The vocabulary that is produced consists of the base ascii characters
less vocab.txt

^A
^B
^C
^D
^E
^F
^G
^H



^K
^L
^M
^N
^O
^P
^Q
^R
^S
^T
^U
^V
^W
^X
^Y
^Z
^[
^\
^]
^^
^_

!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
A
B
C
D
...

Even though none of those characters, except for ACGT, were in the input text. The code I entered was this:

./getalltokens -charset utf8 -chunk-size 10000 -dataset my_data.txt -min-occur-chunk 2 -output vocab.vc -workers 2
./trainvocab -charset utf-8 -dataset my_data.txt -dir vocab_dir -dictionary vocab.vc -max-token-length 64 -vocab 4095
./exporttokens ./vocab_dir/794556_5737.zlib vocab -capcode -charset UTF-8 -txt

Can you advise?

C implementation

Great work! I noticed however there's no implementation in C or C++, only in higher-level languages which may make it difficult to integrate into projects like llama.cpp. Is this something being worked on?

Question/issue about uppercase

I'm currently training a language model with your tokenizer. I have to say that preliminary results are amazing and am very impressed.

I have a question though: how is uppercase handled by the tokenizer? It's maybe a side effect of the tokenizer, but my current model seems to produce a lot of fully uppercased texts (90% of the time), even though the training data does not contain that much uppercased text. What's strange is that the produced text is very coherent and likely comes from lowercased training data. I suspect that uppercase is some kind of state that can be switched on and off in your tokenizer and that the model does not learn to switch it off (EDIT: this is very likely, as my model is able to code html in uppercase). Another hypothesis is an error in your code.

I'm training on Eleuther's Pile in a way that I'm very familiar with, and only the tokenizer changed, so it's not an error from this side of things.

"vocab.load_multiprocess_safe" doesn't work while multi-processing.

I have two datasets, a train_set and an eval_set.

When using a single instance of the tokenizer, using the vocab.load_multiprocess_safe, passed to each dataset, the tokenizer simply refuses to function, regardless of whether or not it is frozen, or whether the datasets are active at the same time.

I am able to resolve the issue by using only vocab.load; however, I get yelled at about multiprocessing, so I need to further debug by passing separate instances of the tokenizer to each dataset. This is not ideal, but it is at least functional.

Simply as an FYI. I appreciate the work you've done so far, it's always nice to see independent coders and researchers doing cool things.

Update on multilingual

Is there any update on the multilingual tokenizers? The project seems to be on pause.

Hangs with PyTorch data loaders when `num_workers > 0`

OS: Ubuntu 22.04
Python version: 3.11.8
PyTorch version: 2.2.1
Tokenmonster package version: 1.1.12
Other libraries: lightning==2.2.1, datasets==2.18.0

Like in the title, I load the tokenizer with load_multiprocess_safe, the dataset is just a bunch of plain text files to load and tokenize. I have tested each stage of loading and there are no problems until I wrap it in a DataLoader and use num_workers > 0, it hangs forever then.

Inquiry on Extending Algorithm to Other Languages

Impressed by Your Project

Dear alasdairforsythe,

I am genuinely impressed by your wonderful project and appreciate your sharing it. Thank you sincerely.

Inquiry on Documentation and Algorithm

I'm curious to know if there is any simple explanation or documentation about the entire development process of your project.

If not, could you please provide a brief description of the overall algorithm, even if it's very approximate? I am familiar with concepts like BPE, BBPE, unigram, ngram, and word piece, as well as various packages like SentencePiece, TikToken, tokenizers, and transformers. Therefore, feel free to skip any basic information and directly share what improvements you've made, the overall development process, your objectives, and the approaches you took to solve specific problems.

Inquiry on Extending Algorithm to Other Languages

I read on Reddit that your focus was on speed improvements, but I noticed you also reduced the vocab size. Could you elaborate on your overall approach to this?

Additionally, I am curious about where to start with your package to develop an efficient tokenizer for Korean. While I'm considering the BBPE method for creating an efficient Korean vocab, your advanced work in this area has prompted me to reach out for guidance.

Thank you for your time and insights.

Sincerely,
Daniel

Meaning of C and D

Thanks for this amazing library. Looking forward to actually train and adapt some models for it.

After creating my first vocabulary I noticed that a lot of the tokens contain uppercase C and uppercase D. Do those have a special meaning? I could also see them referenced in the code, but I could not find the meaning.

Thanks in advance

Example:

tokens:
    - token:   "D"
      id:      35
      score:   0.006828829
      encoded: true
    - token:   " und"
      id:      2657
      score:   0.0047021606
      encoded: true
    - token:   " der"
      id:      2099
      score:   0.0032128973
      encoded: true
    - token:   "C"
      id:      34
      score:   0.0031624683
      encoded: true
    - token:   " die"
      id:      2105
      score:   0.002436903
      encoded: true
    - token:   " von"
      id:      2684
      score:   0.0021727835
      encoded: true
    - token:   ".C"
      id:      271
      score:   0.0020115946
      encoded: true
    - token:   " für"
      id:      5997
      score:   0.0017581019
      encoded: true
    - token:   "-DC"
      id:      1163
      score:   0.0017092729
      encoded: true
    - token:   " des"
      id:      2100
      score:   0.0016576286
      encoded: true
    - token:   " mit"
      id:      2407
      score:   0.0014818916
      encoded: true
    - token:   " in"
      id:      993
      score:   0.0014810717
      encoded: true
    - token:   ",C"
      id:      259
      score:   0.0014182056
      encoded: true
    - token:   ","

panic: assignment to entry in nil map

Thank you for your work.

I tried to train vocab with a new code but it is failing

Loading 3.dict
2023/07/03 21:07:33 Parsing 3.special.json
Charset: UTF-8, Capcode Enabled
Optimization mode: 4 (strict)
Vocabulary size: 65536
Single byte tokens: 233
Loading normalized.tsv
panic: assignment to entry in nil map

goroutine 1 [running]:
main.main()
        trainvocab.go:1551 +0x1d33

Add a Python test and installation guide

Thanks for your work. Consider adding some Python tests and uploading a package on PyPi that works out-of-the-box. This is crucial for potential adoption. Adding some cython or pybind11 bindings may also get the code to work faster.
