
Spacyface aligner

Align Huggingface Transformer model tokenizations with linguistic metadata provided by spaCy!

Currently only supports English tokenizations

Getting started

Pip

  1. pip install spacyface
  2. python -m spacy download en_core_web_sm

Manual (Clone and conda)

From the root of this project:

conda env create -f environment.yml
conda activate spacyface
# conda env update -f environment-dev.yml # OPTIONAL
python -m spacy download en_core_web_sm
pip install -e .

Usage

Basic Usage on a sentence

Every aligner can be created and used as described in the example below:

from spacyface import BertAligner

alnr = BertAligner.from_pretrained("bert-base-cased")
sentence = "Do you know why they call me the Count? Because I love to count! Ah-hah-hah!"
tokens = alnr.meta_tokenize(sentence)
print("Tokens:\n\n", [(tok.token, tok.pos) for tok in tokens])
Tokens:

   [('Do', 'AUX'), ('you', 'PRON'), ('know', 'VERB'), ('why', 'ADV'), ('they', 'PRON'), ('call', 'VERB'), ('me', 'PRON'), ('the', 'DET'), ('Count', 'PROPN'), ('?', 'PUNCT'), ('Because', 'SCONJ'), ('I', 'PRON'), ('love', 'VERB'), ('to', 'PART'), ('count', 'VERB'), ('!', 'PUNCT'), ('Ah', 'INTJ'), ('-', 'PUNCT'), ('ha', 'X'), ('##h', 'X'), ('-', 'PUNCT'), ('ha', 'NOUN'), ('##h', 'NOUN'), ('!', 'PUNCT')]

Because the information is coming directly from spaCy's Token class, any information that spaCy exposes about a token can be included in the huggingface token. The user only needs to modify the exposed attributes in the SimpleSpacyToken class.

This approach also extends to tokenizing entire English corpora with the help of a generator (see the sketch below). An example raw corpus representing a subset of Wikipedia is included in the tests directory.
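
A minimal sketch of the idea, assuming a plain-text corpus with one sentence per line (the file path points to the bundled Wikipedia subset; swap in your own corpus as needed):

from spacyface import BertAligner

alnr = BertAligner.from_pretrained("bert-base-cased")

def meta_tokenized_corpus(path):
    # Lazily yield the aligned tokens for each non-empty line of the corpus
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield alnr.meta_tokenize(line)

for tokens in meta_tokenized_corpus("tests/wiki.test.txt"):
    print([(tok.token, tok.pos) for tok in tokens])
    break  # inspect only the first sentence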

Observing attention between linguistic features

This library also lets us look at the attention heatmap for a particular layer and head in terms of the linguistic features of the tokens in the input sentence.

from transformers import AutoModel
import torch
import matplotlib.pyplot as plt
import seaborn as sn
from spacyface import RobertaAligner

alnr_cls = RobertaAligner
model_name = "roberta-base"
sentence = "A simple sentence for the ages."
layer = 8
heads = [7]

alnr = alnr_cls.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval() # Remove DropOut effect

model_input, meta_info = alnr.sentence_to_input(sentence)

# With output_attentions=True, the model's output tuple ends with the per-layer attention tensors
_, _, atts = model(**model_input)

to_show = atts[layer][0][heads].mean(0)[1:-1, 1:-1] # Don't show special tokens for Roberta Model

deps = [t.dep for t in meta_info[1:-1]]

# Plot
plt.figure()
sn.set(font_scale=1.5)
sn.heatmap(to_show.detach().numpy(), xticklabels=deps, yticklabels=deps)
plt.title(f"Layer {layer} for head(s): {heads}\n\"{sentence}\"")
plt.show()

Attention heatmap for layer 8, head 7

Interestingly, we have discovered that layer 8, head 7 shows a strong affinity for an object of a preposition (POBJ) attending to its preposition (PREP). Cool!

Background

Different transformer models use different tokenizations. At the time of this writing, many of these tokenizations split larger English words into smaller tokens called "wordpieces" and use different methods of indicating that a token was once part of a larger word.
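
For example, different pretrained tokenizers split and mark the same words in different ways. A small illustration using the Huggingface tokenizers directly (the exact pieces depend on each model's vocabulary):

from transformers import AutoTokenizer

sentence = "Counterintuitively, she sneezed."
for name in ["bert-base-cased", "gpt2", "roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    # BERT marks word-internal pieces with "##"; GPT2 and RoBERTa prefix
    # pieces that begin a new space-separated word with "Ġ".
    print(name, tok.tokenize(sentence))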

For inspection and research, it is helpful to align these tokenizations with the linguistic features of the original words of the sentence. spaCy is a fantastic Python library for assigning linguistic features (e.g., dependencies, parts of speech, tags, exceptions) to the words of different languages, but its method for tokenizing is vastly different from the tokenization schemes that operate on the wordpiece level. This repository aims to align spaCy tokens with the wordpiece tokens needed for training and inference of the different Huggingface Transformer models.
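
For reference, this is what spaCy's own tokenization and metadata look like on their own (plain spaCy usage, independent of this library):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr. Smith ain't going to N.Y. today.")
# spaCy tokenizes at the word level and attaches linguistic features to each token
print([(t.text, t.pos_, t.dep_) for t in doc])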

In short, this repository enables the strange and varied tokenizations belonging to different transformer models to be correctly annotated with the metadata returned by spaCy's tokenization.

Currently, the repository only supports the English language, and the following huggingface pretrained models have been tested:

  • Bert
  • GPT2 (covers distilgpt2)
  • Roberta (covers distilroberta)
  • DistilBert
  • TransfoXL
  • XLNet
  • XLM
  • Albert
  • CTRL
  • OpenAIGPT
  • XLMRoberta

At the time of release, the only tokenization scheme that does not work with the aligner is T5's.

Originally created to ease the development of exBERT, these tools have been made available for others to use in their own projects as they see fit.

Testing the aligner

A few edge case sentences that include hardcoded exceptions to the English language as well as strange punctuation have been included in EN_TEST_SENTS.py. You can run these tests on the established aligners with python -m pytest from the root folder.

Sometimes, your application may not care about edge cases that are hard to detect. You can test an alignment on a more representative subset of the English language with the included wikipedia subset, or use your own text file corpus. To do this, run

from spacyface import TransfoXLAligner
from spacyface.checker import check_against_corpus
corpus = 'tests/wiki.test.txt'
alnr = TransfoXLAligner.from_pretrained('transfo-xl-wt103')
check_against_corpus(alnr, corpus)

and wait a few minutes to see if any sentences break.

Notable Behavior and Exceptions

This repository makes the large assumption that there is no English "word" which is smaller than a token needed for a transformer model. This is an accurate assumption for most of the published transformer models.

It is difficult to align such completely different tokenization schemes. Namely, there are a few strange behaviors that, while not desired, are intentional simplifications that make aligning the different tokenization schemes tractable. These behaviors are listed below.

  • When a token exists as part of a larger word, the linguistic information belonging to the larger word is bestowed on the token (see the sketch after this list).
  • Multiple consecutive spaces in a sentence are replaced with a single space.
  • The English language is riddled with exceptions to tokenization rules. Sometimes, punctuation appears in the middle of what is a single token (e.g., "Mr." or "N.Y."). Other times, contractions that look nothing like the words they combine (e.g., "ain't" looks nothing like "is not", "am not", or "are not") create difficulties for aligning. To prevent these from being an issue, this repository replaces such exceptions with their "normalized" spaCy representations.
  • Many tokenizers insert special tokens (e.g., "[CLS]", "[SEP]", "[MASK]", "<s>") for certain functionalities. The metadata for all these tokens is assigned to None.
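
For instance, a word that the transformer tokenizer splits into several wordpieces carries the same part-of-speech tag on every piece. A small sketch reusing the BertAligner from the earlier example (the sentence is just illustrative):

from spacyface import BertAligner

alnr = BertAligner.from_pretrained("bert-base-cased")
# Every wordpiece of a split word inherits the metadata of the whole word
tokens = alnr.meta_tokenize("The anthropologist hummed quietly.")
print([(tok.token, tok.pos) for tok in tokens])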

Specific to GPT2

  • Sometimes, the GPT2 tokenization produced by this aligner will include a space before a punctuation mark that should not be there. For example, the tokenization of "Hello Bob." should be ["Hello", "ĠBob", "."], but it is instead ["Hello", "ĠBob", "Ġ."] (see the sketch below). This has not had any notable effect on performance, but note that it differs from the way the original model was pretrained. Hidden representations may also be slightly different than expected for terminating punctuation.
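
A quick way to see the difference is to compare the raw Huggingface tokenizer with the aligner's meta tokens. This is only a sketch: it assumes the GPT2 aligner class is named GPT2Aligner, following the naming pattern of the other aligners.

from transformers import AutoTokenizer
from spacyface import GPT2Aligner  # class name assumed from the BertAligner/RobertaAligner pattern

sentence = "Hello Bob."

raw_tok = AutoTokenizer.from_pretrained("gpt2")
print("raw tokenizer:", raw_tok.tokenize(sentence))  # expected: ['Hello', 'ĠBob', '.']

alnr = GPT2Aligner.from_pretrained("gpt2")
print("aligner:", [tok.token for tok in alnr.meta_tokenize(sentence)])  # may show 'Ġ.' as described above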

Known Issues

  • A spaCy exception that is part of a hyphen-delimited word (e.g., "dont-touch-me") will cause the meta tokenization to produce a different result from the model's tokenization strategy. See the GitHub issues for a more detailed description of this problem.

Acknowledgements

  • IBM Research & Harvard NLP
