
Tokenization with Factorized Subword Encoding


David Samuel and Lilja Øvrelid

University of Oslo
Language Technology Group


Paper (arXiv)



Installation

git clone https://github.com/ltgoslo/factorizer.git
cd factorizer
python3 setup.py install  

Pretrained factorizer models

Language URL
Arabic (ar) arabic.dawg (191 MB)
Chinese (zh) chinese.dawg (180 MB)
Czech (cs) czech.dawg (158 MB)
English (en) english.dawg (129 MB)
Norwegian (no) norwegian.dawg (186 MB)
Scottish Gaelic (gd) gaelic.dawg (187 MB)
Turkish (tr) turkish.dawg (206 MB)

Usage

from factorizer import Factorizer


tokenizer = Factorizer("english.dawg")
sentence = "The echo of a distant time comes willowing across the sand, and everything is green and submarine."

encoding = tokenizer(sentence)

print(f"INPUT:    {sentence}")
print(f"SUBWORDS: {' '.join(encoding.tokens)}")
print(f"INDICES:  {' '.join(str(index) for index in encoding.ids)}")
print(f"DECODED:  {tokenizer.decode(encoding.ids}")

This should output:

INPUT:    The echo of a distant time comes willowing across the sand, and everything is green and submarine.
SUBWORDS: ⸥The⸤ ⸥echo⸤ ⸥of⸤ ⸥a⸤ ⸥distant⸤ ⸥time⸤ ⸥comes⸤ ⸥wil lowing⸤ ⸥across⸤ ⸥the⸤ ⸥sand ,⸤ ⸥and⸤ ⸥everything⸤ ⸥is⸤ ⸥green⸤ ⸥and⸤ ⸥submarine .⸤
INDICES:  (52, 74, 62) (221, 21, 77) (135, 64, 137) (181, 45, 79) (248, 77, 122) (88, 92, 159) (124, 92, 64) (49, 151, 114) (79, 180, 104) (129, 186, 151) (52, 74, 219) (49, 127, 34) (35, 174, 39) (76, 101, 35) (32, 176, 191) (135, 209, 205) (44, 28, 242) (76, 101, 35) (13, 171, 144) (211, 41, 131)
DECODED:  The echo of a distant time comes willowing across the sand, and everything is green and submarine.

Documentation

class Encoding:

A named tuple containing:

  • ids (List[Tuple[int, int, int]])
  • tokens (List[str])
  • perplexities (List[float])
  • offsets (List[Tuple[int, int]])
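
The fields can be inspected directly; the sketch below assumes english.dawg has been downloaded and that offsets holds (start, end) character positions of each subword in the input (an assumption based on the field name):

from factorizer import Factorizer

tokenizer = Factorizer("english.dawg")
encoding = tokenizer("Everything is green and submarine.")

# Walk the parallel lists of the Encoding named tuple.
for token, triplet, (start, end), perplexity in zip(
    encoding.tokens, encoding.ids, encoding.offsets, encoding.perplexities
):
    print(f"{token!r:>16}  ids={triplet}  span=({start}, {end})  perplexity={perplexity:.2f}")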

Factorizer.__init__

Argument Description
tokenizer_path (str) path to a DAWG file containing the pretrained vocabulary
alpha (float) the alpha_split hyperparameter controlling the granularity of subword splits (default: 0.1)
sigma (float) the sigma_sample hyperparameter controlling the randomness (temperature) of sampling (default: 0.0, i.e. no sampling)
merge_unks (bool) set this argument to True if you want to merge consecutive UNK tokens (default: True)
allow_decoding (bool) set this argument to True if you want to precompute the inverse vocabulary for decoding (default: False)
sample (bool) set this argument to True if you want to sample from the subword distribution; set to False if you want to always do the optimal tokenization (default: False)
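
A minimal sketch of these options; the specific alpha and sigma values below are illustrative, not recommendations:

from factorizer import Factorizer

# Deterministic tokenizer that can also decode (precomputes the inverse vocabulary).
tokenizer = Factorizer("english.dawg", allow_decoding=True)

# Stochastic tokenizer: sample from the subword distribution with temperature sigma.
sampling_tokenizer = Factorizer("english.dawg", alpha=0.1, sigma=0.5, sample=True)

# With sampling enabled, repeated calls may split the same word differently.
print(sampling_tokenizer("factorization").tokens)
print(sampling_tokenizer("factorization").tokens)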

Factorizer.__call__

Factorizes the input string (or list of strings)

Argument Description
text (Union[str, List[str]]) the input string (or list of strings)

Returns: Union[Encoding, List[Encoding]]
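
For example, passing a list of strings returns one Encoding per input (a short sketch assuming english.dawg is available locally):

from factorizer import Factorizer

tokenizer = Factorizer("english.dawg")

batch = [
    "The echo of a distant time",
    "comes willowing across the sand",
]
encodings = tokenizer(batch)

# One Encoding per input string.
for sentence, encoding in zip(batch, encodings):
    print(sentence, "->", " ".join(encoding.tokens))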


Factorizer.encode

Behaves identically to Factorizer.__call__


Factorizer.decode

Takes the factorized indices and decodes them back to a string (also accepts batched input)

Argument Description
indices (Union[List[Tuple[int, int, int]], List[List[Tuple[int, int, int]]]]) the factorized indices

Returns: Union[str, List[str]]
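
A batched round trip might look like the sketch below; it uses allow_decoding=True on the assumption that the precomputed inverse vocabulary is needed for decoding, as described under Factorizer.__init__:

from factorizer import Factorizer

tokenizer = Factorizer("english.dawg", allow_decoding=True)

sentences = [
    "Everything is green and submarine.",
    "The echo of a distant time.",
]
encodings = tokenizer(sentences)

# A list of index lists decodes back to a list of strings.
decoded = tokenizer.decode([encoding.ids for encoding in encodings])
print(decoded)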


Please cite the following publication

@inproceedings{samuel-ovrelid-2023-tokenization,
    title = "Tokenization with Factorized Subword Encoding",
    author = "Samuel, David  and
      {\O}vrelid, Lilja",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.890",
    doi = "10.18653/v1/2023.findings-acl.890",
    pages = "14143--14161",
    abstract = "In recent years, language models have become increasingly larger and more complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model. The effectiveness of the proposed tokenization method, referred to as the Factorizer, is evaluated on language modeling and morpho-syntactic tasks for 7 diverse languages. Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.",
}


Issues

Codes beyond 255

Thank you for releasing code and pretrained tokenizers!

Could you please elaborate on why the codes are sometimes greater than 255? The simplest example is directly in the paper: melons → [261, 255, 209]. If the codebook size is 256, shouldn't 255 be the maximum possible value?

We found codes as far as 263 in practice.

How to train

Hi, thanks for sharing the code for both de/encoding and training. Could you put up a readme for training on new data? I would like to try this out on Greek and Hebrew text.
