cisnlp / glot500
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023
Home Page: https://aclanthology.org/2023.acl-long.61
License: Other
I've re-trained a Mistral (~ Llama) language-specific tokenizer on the training portion of the Yoruba samples and noticed strange tokens. For example:
{"Ìròyìn▁tó▁ṣe▁kókó▁Àbámọ̀▁ni▁yóò▁gbẹ̀yin▁ẹgbẹ́▁Association▁of▁Stingy▁Men▁tí▁kò▁fẹ́▁náwó▁fóbìnrin-▁Akeugbagoldwákàtí▁9▁sẹ́yìn▁Gbọ́,▁Ìṣẹ́jú▁kan▁BBC▁07:00▁UTCwákàtí▁kan▁sẹ́yìn▁Wo▁ohun▁tí▁a▁mọ̀▁nípa▁gbèdéke▁ti▁Sunday▁Igboho▁fún▁àwọn▁Fulani▁ní▁Ibarapa▁àti▁èsì▁tí▁wọ́n▁fún▁unwákàtí▁3▁sẹ́yìn▁Ìwádìí▁kíkún▁lóríi▁kókó▁ìròyìn▁Amad▁Diallo▁darapọ̀▁mọ́▁Manchester▁United8▁Sẹ́rẹ́▁2021▁Èsíò!": 29494}
This token occurs 1832 times in the training split (rg $STRING yoruba_textified.txt | wc -l), each time in ever so slightly different contexts (i.e., near duplicates).
I hence checked for duplicates in the dataset and found abundant hard duplicates: the 4.5M lines reduce to 1.16M unique lines. I understand that datasets for low-resource languages are noisy, but I presume users expect hard duplicates not to occur.
To reproduce:
from datasets import load_dataset
from collections import Counter
import numpy as np
import pandas as pd
dataset = load_dataset("cis-lmu/Glot500", "yor_Latn", split="train")
counter = Counter(dataset["text"])
c = sorted(counter.items(), key=lambda counts: counts[1])
_, counts = zip(*c)
counts = np.array(counts)
print(
    pd.DataFrame(counts)
    .round(0)
    .describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
    .T
)
# original length is 4.5M lines
# count mean std min 10% 25% 50% 75% 90% 95% 99% max
# 1167327.0 3.84903 1.379304 2.0 2.0 2.0 5.0 5.0 5.0 5.0 5.0 10.0
I have briefly checked the train splits of some other languages, which also comprise duplicates to varying degrees (see the sketch after the numbers below):
kin_Latn: nearly correct
Original length: 415405
Deduplicated length: 401856
count mean std min 10% 25% 50% 75% 90% 95% 99% max
0 401856.0 1.033716 0.180498 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0
uzb_Latn: OK
Original length: 3182175
Deduplicated length: 3182175
count mean std min 10% 25% 50% 75% 90% 95% 99% max
0 3182175.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
ibo_Latn: bad
Original length: 5608630
Deduplicated length: 1526812
count mean std min 10% 25% 50% 75% 90% 95% 99% max
0 1526812.0 3.673425 0.739383 2.0 2.0 4.0 4.0 4.0 4.0 4.0 4.0 20.0
wol_Latn: nearly correct
Original length: 92358
Deduplicated length: 92357
count mean std min 10% 25% 50% 75% 90% 95% 99% max
0 92357.0 1.000011 0.003291 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0
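For reference, the per-language numbers above can be reproduced with a small variant of the snippet from the Yoruba case; a minimal sketch, assuming the same Hugging Face configs as used there:
# Sketch: count hard duplicates per language config of cis-lmu/Glot500.
from collections import Counter
from datasets import load_dataset

def duplicate_stats(lang: str, split: str = "train") -> None:
    dataset = load_dataset("cis-lmu/Glot500", lang, split=split)
    counter = Counter(dataset["text"])
    print(f"{lang}:")
    print(f"Original length: {len(dataset)}")
    print(f"Deduplicated length: {len(counter)}")

for lang in ["kin_Latn", "uzb_Latn", "ibo_Latn", "wol_Latn"]:
    duplicate_stats(lang)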
Hi there!
Now that I have access to the full Glot500 and am starting my experiments, I realized that the publicly available Glot500 on HF does not seem to comprise the 41580525 sentences (or a very large subset thereof) for tel_Telu.
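A quick way to check the discrepancy (a sketch; the expected figure of 41580525 sentences is the one mentioned above):
from datasets import load_dataset

# Sketch: compare the HF train split size for tel_Telu against the expected 41580525 sentences.
dataset = load_dataset("cis-lmu/Glot500", "tel_Telu", split="train")
print(len(dataset))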
How to reproduce the NER evaluation?
Hello,
I came across your work through a virtual talk by Prof. Schütze and found it to be a valuable resource. I'm particularly interested in the Glot500-c(Glot500 corpus) data.
At the moment, your README mentions that access to the corpus will be given after filling an online form, and the form will be available soon. Is there any tentative date for the release of the form? The multilingual corpus would greatly assist my research group with our research on language models.
Thank you for maintaining such a helpful repository, your work is greatly appreciated 😊
Yesterday, the Taxi1500 corpus was added to Glot500 on GitHub. I think this is not meant to be the case.
See also https://huggingface.co/datasets/cis-lmu/Glot500/tree/main
tl;dr: some shards of the languages below (and potentially more) have the extra column "__index_level_0__". The dataset thus cannot be fully loaded.
Thanks for providing a potentially super cool dataset for multilingual NLP research!
While my request for access to the full Glot500-c is still awaiting processing, I thought I would try to use what's available on Hugging Face and quickly ran into the issue already documented here https://huggingface.co/datasets/cis-lmu/Glot500/discussions/3
While I am still loading the dataset, among the train splits of the first 139 languages, the arrow files of
afr_Latn
amh_Ethi
ara_Arab
en_Latn
fra_Latn
hau_Latn
mlg_Latn
nya_Latn
sna_Latn
som_Latn
sot_Latn
swa_Latn
zul_Latn
have inconsistent column names. That is, some shards have "__index_level_0__" as an extra column. The Python script below is slow but should eventually fix the problem.
# not mega pretty but gets the job done
from datasets import load_dataset, concatenate_datasets
from pathlib import Path
from datasets.exceptions import DatasetGenerationError

CWD = Path.cwd()  # inside Glot500 folder
BACKUP = Path("../Glot500_backup")
if not BACKUP.exists():
    BACKUP.mkdir()
SPLIT = "train"
langs = [p for p in CWD.glob("*") if p.is_dir() and "_" in str(p)]


def fix(lang: str, lang_split_dir: str, paths: list[Path]):
    datasets = []
    original_dir = BACKUP / lang / SPLIT
    if not original_dir.exists():
        original_dir.mkdir(parents=True)
    # move the original shards into the backup folder
    for path in paths:
        new_path = original_dir.joinpath(path.name)
        path.rename(new_path)
    # load each shard individually
    for path in original_dir.glob("*.arrow"):
        datasets.append(
            load_dataset("arrow", data_files={"train": str(path)}, split="train")
        )
    # drop the stray index column from shards that have it
    col = "__index_level_0__"
    datasets_ = []
    counter = 0
    for d in datasets:
        if col in d.features:
            d_ = d.remove_columns(col)
            counter += 1
        else:
            d_ = d
        datasets_.append(d_)
    print(f"Cleaned up {counter} shards for {SPLIT} of {lang}")
    dataset = concatenate_datasets(datasets_)
    dataset.save_to_disk(lang_split_dir)


datasets = {}
for i, lang in enumerate(langs):
    print(f"Processing {i}/{len(langs)}: {lang}")
    lang_train = lang / "train"
    lang_train_arrow = list(map(str, lang_train.glob("*.arrow")))
    try:
        datasets[lang] = load_dataset(
            "arrow", data_files={"train": lang_train_arrow}, split="train"
        )
    except DatasetGenerationError:
        print(f"Fixing {lang}")
        fix(lang.stem, str(lang_train), list(lang_train.glob("*.arrow")))
        lang_train_arrow = list(map(str, lang_train.glob("*.arrow")))
        datasets[lang] = load_dataset(
            "arrow", data_files={"train": lang_train_arrow}, split="train"
        )
Proposed fix: I suppose it would be relatively straightforward for you to run a variant of the above script and reupload the fully loadable dataset.
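Once reuploaded, a quick sanity check could look like the sketch below (my assumption being that every language config should then load cleanly with a train split):
# Sketch: verify that every config of the reuploaded dataset loads and has no stray index column.
from datasets import get_dataset_config_names, load_dataset

for config in get_dataset_config_names("cis-lmu/Glot500"):
    ds = load_dataset("cis-lmu/Glot500", config, split="train")
    assert "__index_level_0__" not in ds.features, config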
I would highly appreciate also getting full access to the dataset :)
Thanks a lot in advance!
Are there plans to train a large Glot500 model?
Thanks for your work so far!