
data-preparation's Introduction

Data preparation

This repository contains all the tools and code used to build the ROOTS dataset, produced by the BigScience initiative to train the BLOOM models, as well as a reduced version of it used to train the tokenizer.

General pipeline for the preparation of the ROOTS dataset

[Diagram: overview of the ROOTS preprocessing pipeline]

More detail on the process, including the specifics of the cleaning, filtering, and deduplication operations, can be found in Sections 2 "(Crowd)Sourcing a Language Resource Catalogue" and 3 "Processing OSCAR" of our paper on the ROOTS dataset creation.
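
At a high level, the pipeline chains cleaning, filtering, and deduplication passes over the sourced documents. The sketch below only illustrates that flow; the function names and thresholds are placeholders, not this repository's actual API.

# Minimal sketch of the cleaning -> filtering -> deduplication flow described above.
# The function names and thresholds are illustrative placeholders, not this repository's API.
from typing import Iterable, Iterator

def clean(doc: str) -> str:
    """Normalize whitespace (stand-in for the real cleaning operations)."""
    return " ".join(doc.split())

def keep(doc: str, min_words: int = 10) -> bool:
    """Stand-in quality filter: drop documents that are too short."""
    return len(doc.split()) >= min_words

def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Exact deduplication on the full text; ROOTS also used fuzzy methods such as SimHash."""
    seen = set()
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            yield doc

def prepare(raw_docs: Iterable[str]) -> Iterator[str]:
    cleaned = (clean(d) for d in raw_docs)
    filtered = (d for d in cleaned if keep(d))
    yield from deduplicate(filtered)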


Citation

@inproceedings{bigscience-roots:2022,
title={The BigScience {ROOTS} Corpus: A 1.6{TB} Composite Multilingual Dataset},
author={Hugo Lauren{\c{c}}on and Lucile Saulnier and Thomas Wang and Christopher Akiki and Albert Villanova del Moral and Teven Le Scao and Leandro Von Werra and Chenghao Mou and Eduardo Gonz{\'a}lez Ponferrada and Huu Nguyen and J{\"o}rg Frohberg and Mario {\v{S}}a{\v{s}}ko and Quentin Lhoest and Angelina McMillan-Major and G{\'e}rard Dupont and Stella Biderman and Anna Rogers and Loubna Ben allal and Francesco De Toni and Giada Pistilli and Olivier Nguyen and Somaieh Nikpoor and Maraim Masoud and Pierre Colombo and Javier de la Rosa and Paulo Villegas and Tristan Thrush and Shayne Longpre and Sebastian Nagel and Leon Weber and Manuel Romero Mu{\~n}oz and Jian Zhu and Daniel Van Strien and Zaid Alyafeai and Khalid Almubarak and Vu Minh Chien and Itziar Gonzalez-Dios and Aitor Soroa and Kyle Lo and Manan Dey and Pedro Ortiz Suarez and Aaron Gokaslan and Shamik Bose and David Ifeoluwa Adelani and Long Phan and Hieu Tran and Ian Yu and Suhas Pai and Jenny Chim and Violette Lepercq and Suzana Ilic and Margaret Mitchell and Sasha Luccioni and Yacine Jernite},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2022},
url={https://openreview.net/forum?id=UoEw6KigkUn}
}

data-preparation's People

Contributors

albertvillanova, cakiki, dependabot[bot], hugolaurencon, lvwerra, paulovn, saullu, tevenlescao, thomasw21, tristanthrush

data-preparation's Issues

Add citation

When the paper is up somewhere on arXiv, put a link to it in this repo so that people using this code can cite the paper.

Extending this codebase

I was looking at this codebase and encountered this bit:
https://github.com/bigscience-workshop/data-preparation/tree/main/sourcing/code_dataset#code-dataset-sourcing

The query to create the dataset can be found in query.sql. After creation the dataset was preprocessed with processing.py. Note that there is a bug in the script that filters only for GPL licenses instead of filtering them out. There are instructions to remove the bug but it is left there for reproducibility.

This leads me to believe that the code here is meant to be used and investigated "as is", without modification.
Is this repo primarily meant for reproducibility?

If I wanted to improve and extend it for an independent dataset-building project, should I fork it or work from a branch?
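
For anyone who does extend the code, the bug mentioned in the quoted note presumably comes down to a condition that should be negated. The sketch below is only an illustration with a hypothetical row schema; the actual logic lives in sourcing/code_dataset/processing.py.

# Illustration only: the licence-filtering bug described in the quoted note,
# with a hypothetical row schema (not the real processing.py code).
def is_gpl(license_str: str) -> bool:
    return "gpl" in license_str.lower()

def buggy_keep(row: dict) -> bool:
    # Bug: keeps only GPL-licensed files.
    return is_gpl(row["license"])

def intended_keep(row: dict) -> bool:
    # Intended behaviour: filters GPL-licensed files out.
    return not is_gpl(row["license"])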

Rename cc_pseudo_crawl seeds_batch_1 slurm scripts

When the thomas/re_organize branch is merged, we need to rename the slurm scripts inside repos/data-preparation/sourcing/cc_pseudo_crawl/seeds_batch_1 as follows:

download_warc.slurm -> 01_download_warc.slurm
download_warc_trial_4.slurm -> 02_download_warc_trial_4.slurm
download_warc_trial_5.slurm -> 03_download_warc_trial_5.slurm
download_warc_too_big.slurm -> 04_download_warc_too_big.slurm
redownload_warc.slurm -> 05_redownload_warc.slurm
check_errors_in_dataset.slurm -> 06_check_errors_in_dataset.slurm
preprocess_warc.slurm -> 08_preprocess_warc.slurm
extract_text_and_html_metadata.slurm -> 09_extract_text_and_html_metadata.slurm
shard_by_seed_id.slurm -> 10_shard_by_seed_id.slurm
merge_seed_shards.slurm -> 11_merge_seed_shards.slurm
shard_and_compress.slurm -> 12_shard_and_compress.slurm

Then we still need to determine which of the following files was used in the end: divide_in_subshards.slurm or divide_in_subshards_1000.slurm (step 7).

EDIT: divide_in_subshards.slurm is step 7, and divide_in_subshards_1000.slurm is in reality done in step 10.
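
A minimal sketch of how these renames could be applied, assuming it is run from inside the seeds_batch_1 directory; the 07_ entry for divide_in_subshards.slurm is inferred from the EDIT above and is not part of the original list.

# Sketch only: apply the renames listed above from inside
# sourcing/cc_pseudo_crawl/seeds_batch_1. The 07_ name for divide_in_subshards.slurm
# is an assumption based on the EDIT; double-check it before running.
from pathlib import Path

RENAMES = {
    "download_warc.slurm": "01_download_warc.slurm",
    "download_warc_trial_4.slurm": "02_download_warc_trial_4.slurm",
    "download_warc_trial_5.slurm": "03_download_warc_trial_5.slurm",
    "download_warc_too_big.slurm": "04_download_warc_too_big.slurm",
    "redownload_warc.slurm": "05_redownload_warc.slurm",
    "check_errors_in_dataset.slurm": "06_check_errors_in_dataset.slurm",
    "divide_in_subshards.slurm": "07_divide_in_subshards.slurm",
    "preprocess_warc.slurm": "08_preprocess_warc.slurm",
    "extract_text_and_html_metadata.slurm": "09_extract_text_and_html_metadata.slurm",
    "shard_by_seed_id.slurm": "10_shard_by_seed_id.slurm",
    "merge_seed_shards.slurm": "11_merge_seed_shards.slurm",
    "shard_and_compress.slurm": "12_shard_and_compress.slurm",
}

for old, new in RENAMES.items():
    src = Path(old)
    if src.exists():
        src.rename(new)
        print(f"{old} -> {new}")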

Mismatch of the Available Data Quantity on Huggingface

I tried to download the English part of ROOTS recently.
According to the paper, there are 484,953,009,124 bytes of English data.
However, after downloading all the ROOTS-related datasets I could find on Hugging Face by filtering, I ended up with only about 43.8 GB of data.
How can this difference be explained?
Are the Hugging Face datasets only a subset of ROOTS?
Or are they a processed version of ROOTS, so that the quantity shrank from roughly 480 GB to 43.8 GB?
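
For anyone debugging a similar mismatch, a first step is to measure what was actually downloaded locally and compare it with the figure from the paper; a sketch follows, where the cache path is an assumption.

# Quick local sanity check: sum the sizes of the downloaded files.
# The cache path below is an assumption; point it at wherever the datasets were saved.
from pathlib import Path

cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
total_bytes = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file())
print(f"{total_bytes:,} bytes ({total_bytes / 1e9:.1f} GB)")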

Changing parameter values to extremes in parameters_filtering.py doesn't change the no. of documents being removed

Hi! Kudos to the authors for an end-to-end pipeline for cleaning and filtering a large corpus. I was working with main_filtering.py and tried changing the parameter values in parameters_filtering.py, hoping to increase or decrease the number of documents being removed, but I observe no changes.

  1. I have an English dataset, so I use parameters_filtering_en, and I have experimented with the given values as well as with modifications to one or more of the conditions and cutoffs.
  2. I have also tried parameters_filtering_default, where I do observe changes in the documents being filtered out; the numbers differ from those obtained with parameters_filtering_en.
  • parameters_filtering_default had an issue: I modified languages_id.py to account for "default" as a language, but used the flagged_words/stopwords of English.
  3. Within parameters_filtering_default or parameters_filtering_en, changing parameter values produces no change in the number of documents removed, nor in which documents get removed (a toy sketch of the expected behaviour follows below).
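
As a point of reference, here is a toy sketch (not the repository's main_filtering.py logic) of the behaviour one would expect when a cutoff is changed; the parameter name only mirrors the number-of-words cutoff in parameters_filtering.py and the values are made up.

# Toy illustration: changing a cutoff should change the number of removed documents.
docs = [
    "tiny",
    "a somewhat longer document " * 5,
    "short text here",
    "another fairly long example document " * 10,
]

def count_removed(documents, number_words_min_cutoff):
    # hypothetical analogue of a minimum-number-of-words filter
    return sum(1 for d in documents if len(d.split()) < number_words_min_cutoff)

for cutoff in (1, 5, 50):
    # The removal count should vary with the cutoff; if it does not,
    # the parameters are probably not being picked up by the filtering script.
    print(cutoff, count_removed(docs, cutoff))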

Kindly review the repository code and let me know the solution; also let me know if I'm missing something.

Thank You!

Check links

Especially in README files, at the very end, convert every link pointing to another repo (for example data_tooling) to this repo when possible
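
One possible way to do this is a small sweep over the README files; the sketch below assumes the links to replace point at the data_tooling repository, and every change should be reviewed before committing.

# Minimal sketch of rewriting data_tooling links to this repository in README files.
# The OLD/NEW URLs are assumptions; review every change before committing.
from pathlib import Path

OLD = "github.com/bigscience-workshop/data_tooling"
NEW = "github.com/bigscience-workshop/data-preparation"

for readme in Path(".").rglob("README*.md"):
    text = readme.read_text(encoding="utf-8")
    if OLD in text:
        readme.write_text(text.replace(OLD, NEW), encoding="utf-8")
        print(f"updated {readme}")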

Welcome to try SailCraft - A data cleaning tool built upon this repository

We extend our gratitude to the authors of this repository! Your documentation and code have greatly benefited the community.

We have used this repo in building the data processing pipeline tool SailCraft.
It consists of four stages: initial data cleaning, near deduplication, exact deduplication, and a second round of data cleaning.

Many thanks for your contribution to open research! We welcome developers to try SailCraft.

Why stopwords_min_cutoff rather than stopwords_max_cutoff?

Thanks for your helpful codebase!

I am a bit confused about stop words filtering.
The released code removes a document if its stop-word ratio is below a certain cutoff.

cond = stopwords_ratio >= stopwords_min_cutoff

But in the notebook, section 2.5 states: "If the stop words ratio for a document is higher than a certain cutoff, it is removed."
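
For clarity, a toy sketch contrasting the two conditions (the cutoff values below are made up, not the repository's defaults):

# The two interpretations side by side, with made-up values.
stopwords_ratio = 0.12        # fraction of stop words in a document
stopwords_min_cutoff = 0.05   # released code
stopwords_max_cutoff = 0.50   # notebook wording

keep_if_min = stopwords_ratio >= stopwords_min_cutoff  # drops text with too FEW stop words (e.g. keyword lists)
keep_if_max = stopwords_ratio <= stopwords_max_cutoff  # drops text with too MANY stop words
print(keep_if_min, keep_if_max)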

I am wondering which one is more useful in your practice.
Thanks in advance!

the version of simhash

Which version of simhash is used in this project, and why does the simhash.find_all() method always return an empty list?
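
For reference, a minimal usage sketch, assuming the question refers to the seomoz simhash-py package (other packages also publish a simhash module, so this may not match the version used here). With that package, find_all only reports pairs whose fingerprints differ in at most the given number of bits, so an empty list can simply mean that no two inputs were that close under the chosen parameters.

# Hedged sketch assuming the seomoz "simhash" (simhash-py) package; verify against
# the version this repository actually uses before relying on it.
import simhash

base = hash("some document text") & 0xFFFFFFFFFFFFFFFF       # any 64-bit unsigned fingerprint
near_dup = base ^ 0b1                                         # differs from `base` in a single bit
unrelated = hash("something else entirely") & 0xFFFFFFFFFFFFFFFF

# find_all(hashes, number_of_blocks, number_of_differing_bits) returns only the pairs
# whose fingerprints differ in at most `number_of_differing_bits` bits.
matches = simhash.find_all([base, near_dup, unrelated], 6, 3)
print(matches)  # expected to contain the (base, near_dup) pair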

Write READMEs

We need to write a bunch of READMEs on how we used each tool.
