jbrry / irish-bert
Repository to store helper scripts for creating an Irish BERT model.
License: Other
There will be many characters that are not in the word-piece vocabulary, especially when we limit the building of the vocabulary to the cleanest sources and then move to very different domains such as social media content, which can use a rich and expanding set of emojis.
Should such characters be mapped to <UNK>, so that BERT learns how to handle new characters that appear at test time, or to [MASK], pretending one cannot see them? (At pre-training time, it probably would be a good idea to exclude such tokens from the loss.)
The heuristic in split_tokenised_text_into_sentences.py is too simplistic:
- The full stop inside the quotation ' Is cuid den searmanas é . ' ar sise . should not count as a split point.
- Ordinals such as 1 . seem to be tokenised as two tokens in the NCI. These should not be split.
- Roman numerals such as IV or iv at the start of a sentence.
Suggestion: maintain a list of known abbreviations. Cases checked so far:
- DR . (Dr. is tokenised correctly.) ✔️
- Prof . (Prof. does not occur.) ✔️
- nDr . (seems to be an inflected form of Dr. ; always following an) ✔️
- Iml .
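A less simplistic heuristic could look like the following minimal sketch (hypothetical; the function name and the abbreviation list are assumptions, not the contents of split_tokenised_text_into_sentences.py):

```python
# Sketch: split pre-tokenised text on sentence-final punctuation, but
# not inside an open quotation and not after known abbreviations or
# bare ordinals like "1 .".

ABBREVIATIONS = {"Dr", "DR", "Prof", "nDr", "Iml"}  # extend as needed

def split_sentences(tokens):
    sentences = []
    current = []
    in_quote = False
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok == "'":
            in_quote = not in_quote
        if tok in {".", "!", "?"} and not in_quote:
            prev = tokens[i - 1] if i > 0 else ""
            # do not split after abbreviations or ordinals
            if prev in ABBREVIATIONS or prev.isdigit():
                continue
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

With this rule the quoted example above stays one sentence, because the full stop before the closing quote is inside an open quotation.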
#41 (comment) links to vocabulary conversion code that converts from sentencepiece to wordpiece format. However, this conversion does not catch cases where the sentencepiece boundary is inside a sentencepiece token, e.g. [Hell] [o▁Wor] [ld] [.] instead of [Hello] [▁Wor] [ld] [.]. Note that this is not a normal underscore but U+2581.
In wordpiece vocabularies the U+2581 symbol is not used. Doesn't this mean that such vocabulary entries are unused and could be removed or renamed to additional [unused%d] entries?
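The conversion described above could be sketched as follows (a hypothetical re-implementation, not the code linked from #41; convert_vocab and first_unused are made-up names):

```python
import itertools

SP_SPACE = "\u2581"  # sentencepiece word-boundary marker

def convert_vocab(pieces, first_unused=100):
    """Convert a list of sentencepiece pieces to wordpiece entries.

    In sentencepiece, U+2581 marks the START of a word; in wordpiece,
    "##" marks a CONTINUATION. A piece with U+2581 anywhere other than
    position 0 can never match correctly segmented text, so we recycle
    such entries as [unused%d] slots.
    """
    unused = itertools.count(first_unused)
    out = []
    for piece in pieces:
        if SP_SPACE in piece[1:]:
            out.append("[unused%d]" % next(unused))
        elif piece.startswith(SP_SPACE):
            out.append(piece[1:])        # word-initial piece
        else:
            out.append("##" + piece)     # continuation piece
    return out
```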
entries?Issue #4 reports: Tokens in tab-separated columns may contain space characters.
This caused early versions of our extractor to miss some tokens.
With very aggressive filtering, we don't see improvements over the unfiltered results. See what happens when we keep more, e.g. sentences containing titles in English.
Rather than having to tell users of our BERT models which tokeniser to use, it would be nice to be robust to the choice of tokeniser. Robustness is likely to improve by combining data obtained with different tokenisers, preferably the most popular ones.
To some extent we are doing this already:
Carlini et al. (2020) suggest that training data can be recovered from modern LMs. Furthermore, there are "membership inference" methods that can check whether a given text fragment was part of the training data. Do such methods also work for ga_BERT? If yes, are our data providers ok with this?
It may also make our resource paper stronger to include such an analysis.
Issue #33 points out that there are 99 unused entries in the mBERT vocabulary, intended for users to add task-specific vocabulary entries for fine-tuning. We could use these entries to improve the vocabulary's coverage of Irish without having to train from scratch. However, so as not to put obstacles in the way of users of our models who want to use the unused entries for their own tasks, we should not use all 99 entries.
A way to choose the entries to add would be to induce new vocabularies for a clean Irish corpus, reducing the size until the number of new entries, i.e. entries that are not in the mBERT vocabulary, is less than or equal to the number of entries we want to add, say 49.
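The size-reduction loop just described could be sketched like this (induce_vocab is a hypothetical stand-in for the real vocabulary trainer; the names, start size and step are all assumptions):

```python
# Sketch: shrink a newly induced Irish vocabulary until the number of
# entries NOT already in mBERT's vocabulary fits the budget (e.g. 49).

def new_entries(induced_vocab, mbert_vocab):
    return [e for e in induced_vocab if e not in mbert_vocab]

def choose_additions(induce_vocab, mbert_vocab, budget=49,
                     start_size=2000, step=100):
    size = start_size
    while size > 0:
        candidates = new_entries(induce_vocab(size), mbert_vocab)
        if len(candidates) <= budget:
            return candidates
        size -= step
    return []
```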
Lauren's annotation of a sample of 1000 <s> segments from the .vert file, i.e. not yet split into sentences according to sentence-final punctuation, indicates that about 1% of the NCI is English and about 0.6% is code-switching. A further 1.4% cannot be annotated out of context.
If we still want to try applying a language filter, we can choose between Ailbhe's hand-crafted filter and the machine-learning based filter in our current BERT pipeline. These could be tested using the sample annotated by Lauren.
We aren't using ParaCrawl at the moment, but it might be worth looking at the latest version (7.1).
As discussed in issue #33, having a few unused entries in the vocabulary is a great idea to make it easier for users of a model to add extra tokens for fine-tuning. We should do this as well when training our final "from scratch" models. Multilingual BERT provides 99 such entries. We should use the same number of entries and the same ["[unused%d]" % i for i in range(99)] format.
At the moment we are using version 2.5 of the IDT; switch to 2.7.
It would be handy for issue #35 to have gdrive_filelist.csv
in this repo rather than in cloud storage. Can the list of filenames be published or is the list a secret?
Investigate the frequency of code fragments like color= and filter / clean up if worthwhile.
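A quick way to measure this (a sketch; the regex is only one assumption about what counts as a code fragment):

```python
# Sketch: count code-like fragments such as "color=" or "href=" so we
# can decide whether filtering is worthwhile.
import re
from collections import Counter

CODE_FRAGMENT = re.compile(r"\b[a-zA-Z-]+=")

def count_code_fragments(lines):
    counts = Counter()
    for line in lines:
        counts.update(CODE_FRAGMENT.findall(line))
    return counts
```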
The file Irish_Data > processed_ga_files_for_BERT_runs > train.txt mentioned in issue #32 is severely out of date and there is no documentation of what settings were used. Please update and add a readme. Given that BERT requires multiple input files for its next-sentence objective, it would also be better for reproducibility to provide these individual files, e.g. as a .tgz.
Issue #39 (comment) discovered that the pre-processing pipeline removes lines with more than 100 tokens. Why? Is there a problem feeding very long sentences into BERT?
Issue #4 reports: There are empty sentences.
Our extractor skips these: https://github.com/jbrry/Irish-BERT/blob/master/scripts/extract_text_from_nci_vert.py
Lauren mentioned that she will be using parser bootstrapping to annotate the Irish Twitter UD treebank.
The current data used in ga_BERT might not be that suitable for parsing tweets. It might be a good idea to create a ga_BERT model tailored to Irish Twitter data. This could either be:
- ga_BERT (with all data) with continued pre-training on a corpus of Irish tweets.
- ga_tweeBERT (or some other name) trained on all the data used in ga_BERT plus ga Twitter data. This is initialised from scratch so that the vocabulary contains code-switched tokens, acronyms, slang etc.
For reference, see: BERTweet
Running python scripts/download_handler.py --datasets conll17 NCI
again, I get error messages
unxz: data/ga/conll17/Irish/ga-common_crawl-000.conllu: File exists
bzip2: Output file data/ga/conll17/raw/ga-common_crawl-000.txt.bz2 already exists.
and the process seems to take as long as in the initial run. It would be great if the download handler detected that nothing needs to be done.
Thanks to Joachim for pointing this issue out and providing the command line.
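The requested behaviour could look like this (a sketch, not the actual download_handler.py; needs_download and process are made-up names, and a non-zero-size check stands in for proper checksum verification):

```python
# Sketch: skip any file whose decompressed output already exists.
import os

def needs_download(target_path):
    # a file with size > 0 is treated as complete; detecting partial
    # downloads would need checksums, which are out of scope here
    return not (os.path.exists(target_path)
                and os.path.getsize(target_path) > 0)

def process(paths, download):
    skipped = 0
    for p in paths:
        if needs_download(p):
            download(p)
        else:
            skipped += 1
    print("skipped %d already-downloaded files" % skipped)
```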
There are decoding issues with some (approx 65,000) characters in plaintext files.
Searching through text files for regex [&][#0-9a-zA-Z]+[;]
yields the counts and matched strings listed below.
(RegEx reminder: [&][#0-9a-zA-Z]+[;] matches any string beginning with &, followed by one or more characters from #, digits and letters, and ending in ;.)
Note that some/many/all of these strings may be in files that are on the exclude list of Irish_Data/gdrive_filelist.csv and so could potentially be ignored. Assuming the strings should be replaced by the correct character in the first instance, further investigation and action are required.
39 �
125 |
1 é
1 ú
8 &
1854 [
1828 ]
176 ‘
174 ’
1 “
1 ”
1 á
2 &Dodgers;
1 á
26840 &
15743 '
4 &c;
85 >
59 <
35
18510 "
find Irish_Data -type f | fgrep -v .tmx | xargs grep -h -o -E "[&][#0-9a-zA-Z]+[;]" | sort | uniq -c
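Assuming the matched strings are standard HTML character references, the standard library can decode them, and invalid ones such as &Dodgers; can be collected for manual inspection (a sketch; decode_entities is a made-up name):

```python
# Sketch: decode numeric and named character references with the
# standard library; anything the decoder leaves behind is not a valid
# entity and is reported for manual follow-up.
import html
import re

ENTITY = re.compile(r"&[#0-9a-zA-Z]+;")

def decode_entities(text):
    decoded = html.unescape(text)
    leftovers = ENTITY.findall(decoded)  # e.g. "&Dodgers;" survives
    return decoded, leftovers
```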
Do the sentences follow on from each other? How big are the passages that are sampled? Do we have document/passage delimiter information?
This could affect the Next Sentence Prediction task in BERT.
Some corpora (e.g. Roinn na Gaeltachta) have anonymised versions available. In this situation, we have made the decision to train on the anonymised version.
However, we still need to understand what the process of anonymisation does to the data, so that we can tell whether whole sentences have been deleted (this could affect the Next Sentence Prediction task) or whether emails and names have been masked with special tokens or simply deleted.
Several changes, including removal of NSP task
This issue is about training a ga-roberta model on our data. For evaluating existing multilingual roberta models in our tasks, see issue #68 .
Issue #4 reports 7 occurrences of \x\x13
but no other use of backslashes as escape characters with special meaning.
Section "Steps for Downloading pre-training Corpora" gives the reader freedom to choose the bucket size as they see fit, and from recent discussion I understood we need multiple buckets for the next sentence prediction objective. However, Section "Steps for Filtering Corpora" says one must have only 1 bucket.
What vector representation does the BERT model provide if the sequence length is over 128 tokens? Over 512 tokens?
Look at some long sentences in ga_idt
.
Preliminary results with less strict language filtering suggest that including English corpora may help.
What kind of material does the filter remove from the NCI? Take a random sample of the 886823 sentences and look for patterns.
Issue #4 reports: Looking at the first 100 lines, it seems that all-caps headings and the first sentence of a section are not separated. However, re-doing the sentence splitting without the extra signals from markup in the original documents probably would produce an overall worse segmentation.
Issue #4 reports: There are cases of words split into small pieces, e.g. T UA R A SC Á I L B H L I A N TÚ I L A N O M B UD S MA N 1 9 9 7.
How frequent is this issue? Are there any tools we could use to automatically detect and fix such cases?
An idea for detecting the errors: scan a window of, say, 5 tokens for a surge in the OOV rate that does not go hand in hand with a high rate of unknown character n-grams after removal of all spaces.
An idea for fixing the errors: synthesise a parallel corpus pairing text with this error automatically inserted against the original text, and then train a model to restore the original.
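Generating the corrupted side of such a parallel corpus is straightforward (a sketch; insert_spaces, make_pairs and the corruption rate are assumptions):

```python
# Sketch: corrupt clean text by inserting spaces inside words,
# mimicking the PDF-extraction damage, to create (corrupted, original)
# training pairs for a repair model. Seeded for reproducibility.
import random

def insert_spaces(sentence, rate=0.3, seed=0):
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        out.append(ch)
        if ch != " " and rng.random() < rate:
            out.append(" ")
    return "".join(out)

def make_pairs(sentences, **kw):
    return [(insert_spaces(s, **kw), s) for s in sentences]
```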
Get the original version of the NCI from Meghan
Issue #4 reports: The content inside some <s>
elements is huge and spans many sentences. The longest element has 65094 tokens. The 100th longest has 5153 tokens.
Assuming the tokeniser used to tokenise the text is good, sentence-ending punctuation appearing as a separate token is a reliable indicator of a sentence boundary. In the case of a quotation at the end of a sentence, the boundary may have to be moved past the closing quote. A trickier case is quotations of full sentences within a sentence.
For BERT, however, we should be ok with some wrong boundaries as these will only be visible to BERT if a pre-processing filter removes a sentence next to a wrong boundary.
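The boundary rule just described can be sketched as follows (hypothetical names; the quote set is an assumption, and full-sentence quotations inside a sentence are deliberately not handled):

```python
# Sketch: a sentence-final punctuation token ends a sentence, but a
# directly following closing quote stays with the sentence it closes.
FINAL = {".", "!", "?"}
QUOTES = {"'", '"', "\u2019", "\u201d"}

def split_on_punct(tokens):
    sentences, current = [], []
    i = 0
    while i < len(tokens):
        current.append(tokens[i])
        if tokens[i] in FINAL:
            # move the boundary past any closing quotes
            while i + 1 < len(tokens) and tokens[i + 1] in QUOTES:
                i += 1
                current.append(tokens[i])
            sentences.append(current)
            current = []
        i += 1
    if current:
        sentences.append(current)
    return sentences
```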
Some characters in the NCI are not properly encoded. This affects characters in otherwise ok sentences, or whole blocks of text.
Issue #4 reports: Some of the <doc>
tags have a title attribute containing Irish text that is not part of the document itself. We could add this text as a separate sentence before the first sentence to get even more data. The same could be done with the author attribute whenever the pubdate field is not empty and the medium is one of "book" and "newspaper".
Our extraction script can include these with --title
and --author
. A restriction to particular media types or pubdates is not implemented yet.
To make our BERT model more robust to the deviations from well-edited text typically found in real-world input, we could augment the training corpus with synthetic text derived from the current corpus: removing some accents from characters in a way that mimics social media content, putting text into all-caps, removing punctuation and/or spaces, using the short forms common in Irish text messages, and inserting common spelling errors.
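One of these augmentations, accent removal, can be done with the standard library (a sketch; strip_accents is a made-up name, and it strips all accents rather than the random subset the augmentation would use):

```python
# Sketch: strip fada accents (á é í ó ú) the way informal social-media
# text often does, via unicode decomposition.
import unicodedata

def strip_accents(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))
```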
We noticed some all-caps text, mostly headings.
How frequent is this?
It can be argued that these cases should be kept as is to allow BERT to learn to produce useful representations for all-caps text as all-caps text may also occur at test / production time.
When training on Irish, English and possibly other languages, Hung et al. (2020) "Improving Multilingual Models with Language-Clustered Vocabularies" suggest creating wordpiece vocabularies for clusters of related languages and then using the union of these vocabularies as the final vocabulary during BERT training and prediction. For us this could mean splitting the data into (1) clearly English-only text, (2) clearly Irish-only text and (3) all other text, training 3 vocabularies and merging them.
Issue #4 reports various issues around ampersands in the NCI.
Somebody added a download script with a URL for OSCAR unshuffled that should not become public. TODO: ask the OSCAR people whether they can invalidate the URL. If not, we must be careful never to make this repo public. Only releases without code history can be made, after the URL has been removed from the code base.
https://arxiv.org/abs/2010.08275 found via https://twitter.com/hila_gonen/status/1318465935104245760 suggests that BERT can translate via simple text prompts (gap completion). This means that it learns the necessary connections and may mean that BERT's knowledge of words (and their translations) can be improved by feeding sentences containing statements about translation equivalences into BERT at training time. Same may work for phrases and complete sentences.
When combining NCI with common crawl, paracrawl, OSCAR and other noisy corpora, it may be beneficial to give more weight to clean corpora, e.g. by concatenating multiple copies.
Issue #4 reports that doc id="itgm0022", doc id="icgm1042" and doc id="iwx00055" have unescaped & in the value of attributes, which makes the XML parser fail.
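A possible pre-processing fix (a sketch; the regex assumes every valid character reference matches [&][#0-9a-zA-Z]+[;]):

```python
# Sketch: escape ampersands that do not start a valid character
# reference, so the affected files pass an XML parser.
import re

STRAY_AMP = re.compile(r"&(?![#0-9a-zA-Z]+;)")

def escape_stray_ampersands(xmlish):
    return STRAY_AMP.sub("&amp;", xmlish)
```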
We could augment the BERT training data with English text, or text in other languages, machine translated to Irish and/or with automatic paraphrases of Irish text.
Is there previous work on adding synthetic text in the target language to the BERT training data, such as output from a machine translation model?
Issue #4 reports:
- I found a missing <s> tag. Our new extractor script should use any of <s>, <p>, <doc> and <file> (and the respective closing tags) as a trigger for a sentence boundary.
- Glue tags <g/>, indicating that there was no space between the neighbouring tokens, are not used.
- There are no occurrences of < or > outside tags.
- The number of <p> equals the number of <s>, i.e. the <p> tags are useless here.
- Some </p> and </s> are missing.
Issue #4 reports: The unicode combining diaeresis character occurs 18 times. When slicing and recombining character sequences, care must be taken not to separate it from its preceding character, or at least not to let it end up at the start of a token, so as not to fail strict unicode encoding checks.
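A split point can be made safe by backing it up past any combining marks (a sketch; safe_split is a made-up name and does not handle a string that starts with a combining mark):

```python
# Sketch: when slicing a character sequence, move the cut point back
# so a combining mark (such as U+0308 COMBINING DIAERESIS) stays
# attached to its base character.
import unicodedata

def safe_split(text, pos):
    while pos < len(text) and unicodedata.combining(text[pos]):
        pos -= 1
    return text[:pos], text[pos:]
```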
Since inconsistencies in tokenisation are hard to avoid when working with corpora from different sources, it may help the final model to force tokens like "etc." to be split into two word pieces before we train BERT: remove the vocabulary entry X+PUNCT if X is in the vocabulary, and replace X+PUNCT with X if it is not. This should help in particular if the user's tokeniser splits more aggressively than our tokenisers.
Issue #4 reports that some tokens contain unexpected hyphens, e.g. Massa-chusetts. This is probably a problem with conversion from PDF.
Wagner et al. (2007) Section 5.1.3 propose to "create three candidate substitutions (deletion, space, normal hyphen) and vote based on the frequency of the respective tokens and bigrams in [a reference corpus]".
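The voting scheme could be sketched as follows (hypothetical names; freq stands in for unigram and bigram counts from a reference corpus):

```python
# Sketch of the voting idea from Wagner et al. (2007): for a token
# like "Massa-chusetts", score three candidate repairs by frequency in
# a reference corpus and keep the most frequent reading.
def best_repair(token, freq):
    left, _, right = token.partition("-")
    candidates = {
        "deletion": left + right,        # Massachusetts
        "space":    left + " " + right,  # Massa chusetts
        "hyphen":   token,               # Massa-chusetts
    }
    # freq maps a unigram or bigram string to its corpus count
    return max(candidates.values(), key=lambda c: freq.get(c, 0))
```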