redpajama-data's People

Contributors

antocodes, deep145757, eltociear, ivan-zhou, mauriceweber, xzyaoi, zhangce


redpajama-data's Issues

Got error while running `python -m cc_net -l my -l gu`

I followed the example at https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net#pipeline-overview but got the following error. Did I forget to run some preparation step?

(racoon) t@medu:~/repos/NAM/red-pajama/data_prep/cc/cc_net$ python -m cc_net -l my -l gu
usage: __main__.py [-h] [-c CONFIG_NAME] [-d DUMP] [-o OUTPUT_DIR] [-m MINED_DIR] [-e EXECUTION] [-n NUM_SHARDS] [--min_shard MIN_SHARD]
                   [--num_segments_per_shard NUM_SEGMENTS_PER_SHARD] [--metadata METADATA] [--min_len MIN_LEN] [--hash_in_mem HASH_IN_MEM]   
                   [-l LANG_WHITELIST] [--lang_blacklist LANG_BLACKLIST] [--lang_threshold LANG_THRESHOLD] [-k KEEP_BUCKET]
                   [--lm_dir LM_DIR] [--cutoff CUTOFF] [--lm_languages LM_LANGUAGES] [--mine_num_processes MINE_NUM_PROCESSES]
                   [-t TARGET_SIZE] [--cleanup_after_regroup] [--task_parallelism TASK_PARALLELISM] [-p PIPELINE]
                   [--experiments EXPERIMENTS] [--cache_dir CACHE_DIR] [--config CONFIG]
__main__.py: error: argument -l/--lang_whitelist: invalid Sequence value: 'my'

Understanding the quality filter

Hello and thank you for the great work.
I am trying to understand the quality filter you used, described here.

I took the trained model and script you provided in this issue and tried to run it, writing the short sanity check below.
The first paragraph is from Wikipedia, and the second is a lower-quality paragraph.

These are the outputs I receive; the scores and probabilities are almost the same:

{'pred_label': '__label__cc', 'pred_label_prob': 0.9966633915901184, 'wiki_prob': 0.003336608409881592, ...} # wikipedia paragraph
{'pred_label': '__label__cc', 'pred_label_prob': 0.9801203012466431, 'wiki_prob': 0.019879698753356934, ...} # low quality paragraph

Am I missing something? Am I using it correctly? I follow the exact steps you take with the model in the classify.py file.

  • To make sure I understand: is your measure of high/low quality simply whether the page comes from CC or Wikipedia, or are there additional quality scorers I am missing?
    My primary suspect is the page length/structure; I understand you trained the model on full pages, which are not the same as paragraphs, but I would be happy to verify.
    Thanks again for the great work :)
wikipedia_paragraph = '''A language model is a probability distribution over sequences of words.[1] Given any sequence of words of length m, a language model assigns a probability to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences (the property of digital infinity), language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data. Several modelling approaches have been designed to surmount this problem, such as applying the Markov assumption or using neural architectures such as recurrent neural networks or transformers'''
bad_paragraph = '''language thing is like, you know, when you get lots of words together and there's like a chance for one word after another. Like when you're talking and stuff. And there's a thing that's called infinity digital or something that means you can make lots and lots of sentences, even ones that you might never hear before. Some smart people have found some ways to not make this a problem, like there's this thing called Markov (sounds Russian) and there's other brain-like things that help, but don't ask me about those.'''

import fasttext

model_path = "model.bin"  # placeholder: path to the fastText quality classifier provided in the issue
model = fasttext.load_model(model_path)

def get_output(content):
    output = {}
    # run classifier
    text = " ".join(content.strip().splitlines())
    pred = model.predict(text)
    (pred_label, pred_prob) = pred
    pred_label = pred_label[0]
    wiki_prob = pred_prob[0]
    if pred_label == "__label__cc":
        wiki_prob = 1 - wiki_prob
    output["pred_label"] = pred_label
    output["pred_label_prob"] = pred_prob[0]
    output["wiki_prob"] = wiki_prob
    output["text"] = content
    return output

print(get_output(wikipedia_paragraph))
print(get_output(bad_paragraph))

Languages

Can you mention which languages are covered in this dataset? Based on arXiv:2302.13971v1, LLaMA only covers these languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. Would it be possible to add some new low-resource languages, like Indonesian? Thanks

[Errno 2] No such file or directory: 'cutoff.csv'

Hi, I'm trying to run this test case:
python3 -m cc_net --config config/test_segment.json
but encountered the following error:

Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2019-09', output_dir=PosixPath('test_data2'), mined_dir='mined_by_segment', execution='debug', num_shards=4, min_shard=-1, num_segments_per_shard=1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['de', 'it', 'fr'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=0, target_size='32M', cleanup_after_regroup=False, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'minify', 'split_by_segment'], experiments=[], cache_dir=PosixPath('test_data/wet_cache'))
['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'minify', 'split_by_segment']
2023-04-27 11:41 INFO 39932:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x11b5b77c0>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x11b5b7a30>, <cc_net.perplexity.MultiSentencePiece object at 0x11b5b78e0>, <cc_net.perplexity.DocLM object at 0x11b5b7970>, <cc_net.perplexity.PerplexityBucket object at 0x11b5b7a60>, <cc_net.minify.Minifier object at 0x11b5b7be0>]
/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/flat_hash_set.py:115: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try `pip install cc_net[getpy]
  warnings.warn(
2023-04-27 11:41 INFO 39932:DuplicatesRemover - Loaded hashes from test_data2/hashes/2019-09/0000.bin (0.700GB total, took 0.02m)
2023-04-27 11:41 INFO 39932:DuplicatesRemover - Loaded 3_361_543 hashes from 1 files. (0.7GB total, took 0.02m)
2023-04-27 11:41 INFO 39932:Classifier - Loading bin/lid.bin
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/de.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/de.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/it.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/it.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/fr.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/fr.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/de.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/de.arpa.bin (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/it.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/it.arpa.bin (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/fr.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/fr.arpa.bin (took 0.0min)
Traceback (most recent call last):
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 142, in debug_executor
    message = function(*x)
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
    jsonql.run_pipes(
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 432, in run_pipes
    transform = stack.enter_context(compose(transformers))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/contextlib.py", line 429, in enter_context
    result = _cm_type.__enter__(cm)
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 312, in __enter__
    self._prepare()
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 352, in _prepare
    t.__enter__()
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 312, in __enter__
    self._prepare()
  File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/perplexity.py", line 267, in _prepare
    cutoffs = pd.read_csv(self.cutoff_csv, index_col=0)
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/common.py", line 859, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/Users/work/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'

Are there any possible reasons? This is Python 3.9.6 on macOS.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 2556: invalid start byte

Another question, please: I hit an error in the hashing step. Have you ever seen this before?

Failed job 34548 (4 / 60): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
  File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
    jsonql.run_pipes(
  File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 235, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 160, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 133, in group_by_docs
    for warc in warc_lines:
  File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
    yield from file
  File "/root/anaconda3/envs/py38/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 2556: invalid start byte
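A minimal sketch of one possible local workaround, not the repository's own fix: read the WET segment with tolerant UTF-8 decoding so a single bad byte does not abort the whole shard. The path below is a placeholder.

import gzip
import io

def read_wet_lines(path):
    # errors="replace" substitutes undecodable bytes instead of raising
    # UnicodeDecodeError, so the rest of the segment can still be processed.
    with gzip.open(path, "rb") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
        for line in text:
            yield line.rstrip("\n")

for line in read_wet_lines("segment.warc.wet.gz"):  # placeholder file name
    pass  # feed lines into the rest of the pipeline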

`No module named 'datasets'` in `data_prep/book/`

I'm going through gathering the data from each of the data_prep folders, and besides some inconsistency in where the data folder is located in each README, this is the only error I've come across.

cd data_prep
mkdir -p data/book
python3 book/download.py
Traceback (most recent call last):
  File "/mnt/AI/RedPajama-Data/data_prep/book/download.py", line 1, in <module>
    from datasets import load_dataset
ModuleNotFoundError: No module named 'datasets'
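For what it's worth, the import on line 1 of download.py needs a separately installed package. A minimal pre-flight check, assuming the intended package is Hugging Face's `datasets` from PyPI:

import importlib.util

# Assumes download.py expects the Hugging Face `datasets` package from PyPI.
if importlib.util.find_spec("datasets") is None:
    raise SystemExit("Missing dependency: install it first, e.g. `pip install datasets`.")

from datasets import load_dataset  # the import that fails in book/download.py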

error while download from url

I tried to download from a URL, e.g. wget -P ./ -c https://data.together.xyz/redpajama-data-1T/v1.0.0/wikipedia/wiki.jsonl, but it fails with the error below.

63888150K .......... .......... .......... .......... .......... 54% 26.5M 82m10s
63888200K .......... .......... .......... .......... .......... 54% 28.9M 82m10s
63888250K .......... .......... .......... .......... .......... 54% 27.6M 82m10s
63888300K .......... .......... .......... ..........            54% 26.1M=98m14s

2023-06-08 12:02:19 (10.6 MB/s) - Connection closed at byte 65421660160. Retrying.

--2023-06-08 12:02:20--  (try: 2)  https://data.together.xyz/redpajama-data-1T/v1.0.0/wikipedia/wiki.jsonl
Connecting to data.together.xyz (data.together.xyz)|104.26.15.50|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 120142320713 (112G), 54720660553 (51G) remaining
Saving to: ‘./wiki.jsonl’

          [ skipping 63888300K ]
63888300K ,,,,,,,,,, ,,,,,,,,,, ,,,,,,,,,, ,,,,,,,,,, .......... 54% 7.02M 10h19m
63888350K .......... .......... .......... .......... .......... 54%  938K 14h54m
63888400K .......... .......... .......... .......... .......... 54% 6.70M 9h6m
63888450K .......... .......... .......... .......... .......... 54% 11.6M 6h39m
63888500K .......... .......... .......... .......... .......... 54% 10.2M 5h24m
63888550K .......... .......... .......... .......... .......... 54%  215K 17h39m
63888600K .......... .......... .......... .......... .......... 54%  992K 17h13m
63888650K .......... .......... .......... .......... .......... 54%  496K 18h59m
63888700K .......... .......... .......... .......... .......... 54%  659K 19h25m
63888750K .......... .......... .......... .......... .......... 54% 1.44M 18h24m
63888800K .......... .......... .......... .......... .......... 54% 1017K 18h1m
63888850K .......... .......... .......... .......... .......... 54% 1.27M 17h26m
63888900K .......... .......... .......... .......... .......... 54% 1.30M 16h55m
63888950K .......... .......... .......... .......... .......... 54% 1.21M 16h33m
63889000K .......... .......... .......... .......... .......... 54% 1.07M 16h20m
63889050K .......... .......... .......... .......... .......... 54% 1.55M 15h53m
63889100K .......... .......... .......... .......... .......... 54% 1.97M 15h21m
63889150K .......... .......... .......... .......... .......... 54% 1.76M 14h56m
63889200K .......... .......... .......... .......... .......... 54% 1.52M 14h38m
63889250K .......... .......... .......... .......... .......... 54% 2.09M 14h14m
63889300K .......... .......... .......... .......... .......... 54% 2.22M 13h51m
63889350K ..........                                             54% 1.21M=1.0s


Cannot write to ‘./wiki.jsonl’ (Success).

How can I fix this?

Specifying arxiv dates

Hi there,

Thanks for making this code available. I am trying to use the arXiv downloader, but I am interested in downloading only papers from a certain date range. Any tips on how to approach this?

Many thanks
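One possible angle, offered as a sketch rather than a supported feature: post-2007 arXiv identifiers start with YYMM, so the list of source files could be filtered by that prefix before cleaning. The helper below is hypothetical, and the file naming is only an assumption based on paths seen elsewhere in these issues.

import re

def in_date_range(filename, start_yymm="2201", end_yymm="2301"):
    # Keep only files whose (post-2007) arXiv identifier falls in a YYMM range.
    match = re.search(r"(\d{4})\.\d{4,5}", filename)
    if match is None:
        return False  # old-style IDs (e.g. math/0309136) would need separate handling
    yymm = match.group(1)
    return start_yymm <= yymm <= end_yymm

files = ["2301/2301.01234.gz", "1012/1012.4321.gz"]
print([f for f in files if in_date_range(f)])  # -> ['2301/2301.01234.gz']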

Script fixes in data_prep/github

First of all, thank you for your great work to create this project. I didn't have access to a Slurm workload manager, but I was able to use these scripts to preprocess a sample of the GitHub dataset from BigQuery (which was exactly what I wanted to do!). Here are a couple points which would improve the scripts for the next person:

  • The script scripts/github-prepare-download.sh mentioned in this README.md seems to be missing from the scripts directory.
  • The TARGET_DIR variable in the github-global-dedup-slurm.sbatch script should probably be ./data/github/processed_deduped instead of ./data/github_scratch/processed_deduped
  • Similarly, the TARGET_DIR and DEDUPED_DIR variables in the github-run-filter-slurm.sbatch script should use github instead of github_scratch

Thanks again for your work on this project.

Partially downloaded datasets

Hi there!

I looked through the corpora and found that some documents are not 100% downloaded. I'm not sure whether the issue is with the download scripts. Below are some examples grepped from bookcorpus and arxiv; if you open the URL in each example, you will see a full document that contains more text than just a header. For now I am planning to filter such documents out of the training set (there are not too many of them), but it would be great to eventually download these documents properly and include the full text in the corpus; the dataset would then be even larger and more useful for model training. I suspect more subsets are affected, not just arxiv and book.
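A minimal sketch of the stop-gap filter mentioned above (dropping records whose text is suspiciously short); the file names and the 500-character threshold are assumptions, not project defaults.

import json

def filter_truncated(in_path="arxiv_sample.jsonl", out_path="arxiv_filtered.jsonl", min_chars=500):
    # Drop jsonl records whose "text" is so short that probably only a header survived.
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if len(record.get("text", "")) < min_chars:
                dropped += 1
                continue
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    print(f"kept {kept}, dropped {dropped}")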


I am currently working on RedPajama-v2; check out our Discord for more info on what we found out about these datasets: https://discord.gg/KMmsHFxE

Thanks!

Drive space to store

Hey, could you list somewhere the total drive space needed to store the full dataset?

cc-net failure on slurm cluster

I went from doing the cc-net pulls locally to using slurm. When I try to execute

theskaz@c4140:/nfs/slow/RedPajama-Data/data_prep/cc/cc_net$ python -m cc_net --dump 2020-05 --num_shards 5000 --hash_in_mem 1

Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2020-05', output_dir=PosixPath('data'), mined_dir='mined', execution='auto', num_shards=5000, num_segments_per_shard=-1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['en'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting _hashes_shard in a job array (3983 jobs)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 18, in <module>
    main()
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 14, in main
    func_argparse.parse_and_call(cc_net.mine.get_main_parser())
  File "/home/theskaz/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 631, in main
    all_files = mine(conf)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 334, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 263, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 106, in map_array_and_wait
    jobs = ex.map_array(function, *args)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 771, in map_array
    return self._internal_process_submissions(submissions)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 328, in _internal_process_submissions
    array_ex.update_parameters(**self.parameters)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 810, in update_parameters
    self._internal_update_parameters(**kwargs)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 302, in _internal_update_parameters
    raise ValueError(
ValueError: Unavailable parameter(s): ['slurm_time']
Valid parameters are:
  - account (default: None)
  - additional_parameters (default: None)
  - array_parallelism (default: 256)
  - comment (default: None)
  - constraint (default: None)
  - cpus_per_gpu (default: None)
  - cpus_per_task (default: None)
  - exclude (default: None)
  - exclusive (default: None)
  - gpus_per_node (default: None)
  - gpus_per_task (default: None)
  - gres (default: None)
  - job_name (default: 'submitit')
  - mem (default: None)
  - mem_per_cpu (default: None)
  - mem_per_gpu (default: None)
  - nodes (default: 1)
  - ntasks_per_node (default: None)
  - num_gpus (default: None)
  - partition (default: None)
  - qos (default: None)
  - setup (default: None)
  - signal_delay_s (default: 90)
  - srun_args (default: None)
  - stderr_to_stdout (default: False)
  - time (default: 5)
  - wckey (default: 'submitit')

If I then go into execution.py, comment out the slurm_time parameter (lines 58-61), and try again, I get this error:

theskaz@c4140:/nfs/slow/RedPajama-Data/data_prep/cc/cc_net$ python -m cc_net --dump 2020-05 --num_shards 5000 --hash_in_mem 1
Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2020-05', output_dir=PosixPath('data'), mined_dir='mined', execution='auto', num_shards=5000, num_segments_per_shard=-1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['en'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting _hashes_shard in a job array (3983 jobs)
sbatch: error: Batch job submission failed: Invalid job array specification
subprocess.CalledProcessError: Command '['sbatch', '/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/data/logs/submission_file_4751207924ea4dde903eace6afeb2a38.sh']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 18, in <module>
    main()
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 14, in main
    func_argparse.parse_and_call(cc_net.mine.get_main_parser())
  File "/home/theskaz/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 631, in main
    all_files = mine(conf)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 334, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 263, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 106, in map_array_and_wait
    jobs = ex.map_array(function, *args)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 771, in map_array
    return self._internal_process_submissions(submissions)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 332, in _internal_process_submissions
    first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 934, in _submit_command
    output = utils.CommandFunction(command_list, verbose=False)()  # explicit errors
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/utils.py", line 352, in __call__
    raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid job array specification

Not sure where to go from here. I can verify that slurm is working and all compute nodes are in the idle state.

Memory and space requirements

Hi, I ran the tokenization scripts you provided to tokenize the datasets.
With the cl100k_base vocab and tiktoken, I get "can't allocate memory" on my server for every dataset.
The server actually has over 300 GB of memory, so I'm wondering how much memory and disk space is needed to tokenize all the datasets?
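A minimal sketch of a lower-memory approach, assuming one JSON object with a "text" field per line; "wiki.jsonl" is a placeholder file name, and this is not the project's tokenization script.

import json
import tiktoken

# Streaming token count: process one document at a time so memory stays flat
# instead of loading a whole dataset at once.
enc = tiktoken.get_encoding("cl100k_base")

total_tokens = 0
with open("wiki.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        # disallowed_special=() avoids errors on documents that happen to
        # contain special-token strings such as "<|endoftext|>"
        total_tokens += len(enc.encode(doc["text"], disallowed_special=()))

print(f"total tokens: {total_tokens:,}")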

How can you map the common crawl source back to metadata?

The common crawl data entries have a source like this:

"source":"cc/2023-06/en_head_0000.json.gz/line401859"

What's the right way to map that back to the metadata of the entry it came from? In particular, I'd like the original URL and the timestamp at which it was downloaded. Is that possible? Most of the metadata seems to be expressed in terms of the WARC format, not the WET format that I believe cc_net used to process the data.

Thanks,
Craig
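Not an answer to the WARC/WET mapping itself, but a minimal sketch that just splits the quoted "source" string into its parts, to make the structure of the field explicit; the dictionary keys are my own labels, not an official schema.

def parse_source(source):
    # e.g. "cc/2023-06/en_head_0000.json.gz/line401859"
    corpus, dump, shard_file, line_part = source.split("/")
    return {
        "corpus": corpus,        # "cc"
        "dump": dump,            # Common Crawl snapshot, e.g. "2023-06"
        "shard": shard_file,     # cc_net output file, e.g. "en_head_0000.json.gz"
        "line": int(line_part.removeprefix("line")),  # line index within that shard
    }

print(parse_source("cc/2023-06/en_head_0000.json.gz/line401859"))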

Common Crawl metadata

Current State

Currently the data in the commoncrawl slice contains the following fields in addition to the text field:

"pred_label": "__label__cc", "pred_label_prob": XXX, "wiki_prob": XXX, "source": "cc/2019-30/en_middle_0053.json.gz/line1"

Goal

We would like to also include the metadata that gets generated by the cc_net pipeline (see the metadata described here: https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net). The goal is that one record in the final jsonl files should follow this schema:

{
  "text": " ... ",
  "meta": {
    "pred_label": "...",
    "pred_label_prob": "...",
    "wiki_prob": "...",
    "source": "...",
    "url": "...",
    "date_download": "...",
    "digest": "...",
    "length": "...",
    "nlines": "...",
    "source_domain": "...",
    "title": "...",
    "original_nlines": "...",
    "original_length": "...",
    "language": "...",
    "language_score": "...",
    "perplexity": "..."
  }
}
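A minimal sketch of reshaping a current flat record into the target schema above; `ccnet_meta` stands in for the per-document metadata produced by the cc_net pipeline, and the values below are placeholders rather than real data.

import json

def to_target_schema(flat_record, ccnet_meta):
    # Move everything except "text" under "meta", then merge in the cc_net fields.
    record = dict(flat_record)
    text = record.pop("text")
    return {"text": text, "meta": {**record, **ccnet_meta}}

flat = {"text": "...", "pred_label": "__label__cc", "pred_label_prob": 0.98,
        "wiki_prob": 0.02, "source": "cc/2019-30/en_middle_0053.json.gz/line1"}
ccnet = {"url": "...", "date_download": "...", "digest": "...", "language": "en",
         "language_score": 0.99, "perplexity": 312.4}

print(json.dumps(to_target_schema(flat, ccnet), indent=1))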

How do I prepare the data for visualisation?

I'm trying to use the data visualization built on Meerkat. The viz/main.py visualization uses a sample of the GitHub data. Is there a script I can use to extend it to the other datasets?

No file named github-prepare-local-dedup.sh

Hey, I've been trying to process and clean the dataset for a while now, and I keep getting this error; I can't find the file anywhere in the repo. Any help would be greatly appreciated. Here is the error message: bash: scripts/github-prepare-local-dedup.sh: No such file or directory

The training data for Quality Classifier

I saw the code for loading cc data in create_corpus.py:
https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/cc/classifier/create_corpus.py#LL32C1-L34C26

for file in glob.glob("common_crawl/*/*/*.gz"):
    if ("middle" in file or "head" in file) and "dedup" not in file:
        jobs.append(file)

As mentioned in the LLaMA paper, the CommonCrawl data used here is treated as negative samples for the classifier. Why not use the lower-quality data (tail) instead of head and middle?

how to process arXiv tex files without downloading?

I downloaded the arXiv tex files myself, without running scripts/arxiv-kickoff-download.sh.

My data structure is

my_arxiv_src
 |- papername1
      |- name.tex
 |- papername2
      |- name.tex
      |- other_name.tex

I want to preprocess my LaTeX data, so I run
bash scripts/arxiv-kickoff-cleaning.sh
where arxiv-kickoff-cleaning.sh is the following:

#!/bin/bash

set -e

WORKERS=2

# load modules
module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 conda/pytorch_1.12.0
pip install -r arxiv_requirements.txt

export DATA_DIR="./my_ arxiv_src"
export TARGET_DIR="./data/arxiv/processed"
export WORK_DIR="./work"

mkdir -p logs/arxiv/cleaning

# setup partitions
python run_clean.py --data_dir "$DATA_DIR" --target_dir "$TARGET_DIR" --workers $WORKERS --setup

# run cleaning in job array
sbatch scripts/arxiv-clean-slurm.sbatch

arxiv-kickoff-cleaning.sh runs with no errors, but
the result files, arxiv_1.jsonl and arxiv_2.jsonl, have no content...

What should DATA_DIR and TARGET_DIR be?
Is there a supported way to run the cleaning on local LaTeX files?

About downloading a small portion of CC

Hello there, thanks for your good work.
I want to download a small portion of CC (to run through the whole process first).
When I run 'python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1', is it enough to add the argument --num_segments_per_shard 2 and change some numbers, e.g. 'python -m cc_net --dump 2023-06 --task_parallelism 10 --num_shards 10 -l en --mine_num_processes 10 --hash_in_mem 1 --num_segments_per_shard 2'? Do other arguments like target_size='4G' have any influence?
Or how should I set these arguments, i.e. which arguments should I modify and what values are appropriate?
Thanks a lot!

Questions about the quality classifier in common crawl

Thank you for your work! I am preprocessing data for another language (zh). I have some questions regarding the provided instructions:

In extracted_urls.txt, we provide 38M URLs that are processed from the Wikipedia dump. We early stop this process to only keep 300K pages.

Regarding the extracted_urls.txt file, how was the decision made to keep only 300K pages out of the 38M URLs processed from the Wikipedia dump? Should I follow the same ratio for the zhwiki-20230420-pages-articles-multistream.xml file, which is smaller than the English one?

We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet.

Can you provide more guidance on how to run the pipeline on this file? I have read through the cc_net code and found nothing about Wikipedia processing other than data_prep/cc/cc_net/cc_net/get_wiki_cirrus.py, but that seems to download from https://dumps.wikimedia.org/other/cirrussearch/current/zhwiki-20230501-cirrussearch-content.json.gz.

python classifier/create_corpus.py > data_train

I notice that the input of create_corpus.py is ["cc_net/data/mined/wikipedia/en_head_0000.json.gz", "cc_net/data/mined/wikipedia/en_middle_0000.json.gz"] (maybe parsing an argument would be better). Can you provide instructions on how to obtain these files?

for file in glob.glob("common_crawl/*/*/*.gz") in create_corpus.py

Can you clarify whether it should be run on cc_net/data/mined/{CC_DUMP}/*.gz? The glob here may be ambiguous.

Lastly, I would appreciate it if you could improve the Quality Classifier section in the README and the scripts in data_prep/cc/classifier to make them easier for newcomers to follow. Thank you!
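For anyone else reconstructing this step, a minimal sketch of training a fastText classifier on such a corpus, assuming data_train follows fastText's supervised format (one "__label__wiki ..." or "__label__cc ..." line per document), as the "__label__cc" predictions elsewhere on this page suggest; the hyperparameters are generic choices, not the project's.

import fasttext

# Assumes data_train is in fastText supervised format: "__label__<name> <text>" per line.
model = fasttext.train_supervised(input="data_train", lr=0.1, epoch=5, wordNgrams=2)
model.save_model("quality_classifier.bin")

labels, probs = model.predict("A language model is a probability distribution over sequences of words.")
print(labels[0], probs[0])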

How much disk space will be used?

Hello there.
I want to get the zh data for one dump. How much disk space will be occupied during data download and processing, and what is the final data size?

Failed building wheel for cc-net

Hello,

I'm trying to play with this repo, and when doing a "make install" in the cc-net folder as per the instructions, I get a build error:

(Ubuntu 22.04, python 3.10, python-dev and build-essential installed)

Building wheels for collected packages: cc-net, kenlm
  Building wheel for cc-net (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for cc-net (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [52 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib
      creating build/lib/cc_net
      copying cc_net/get_wiki_cirrus.py -> build/lib/cc_net
      copying cc_net/tokenizer.py -> build/lib/cc_net
      copying cc_net/perplexity.py -> build/lib/cc_net
      copying cc_net/dedup.py -> build/lib/cc_net
      copying cc_net/__init__.py -> build/lib/cc_net
      copying cc_net/mine.py -> build/lib/cc_net
      copying cc_net/minify.py -> build/lib/cc_net
      copying cc_net/split_by_lang.py -> build/lib/cc_net
      copying cc_net/execution.py -> build/lib/cc_net
      copying cc_net/jsonql.py -> build/lib/cc_net
      copying cc_net/process_wet_file.py -> build/lib/cc_net
      copying cc_net/regroup.py -> build/lib/cc_net
      copying cc_net/flat_hash_set.py -> build/lib/cc_net
      copying cc_net/__main__.py -> build/lib/cc_net
      copying cc_net/text_normalizer.py -> build/lib/cc_net
      creating build/lib/cc_net/data
      copying cc_net/data/cutoff.csv -> build/lib/cc_net/data
      copying cc_net/data/test_stats.json -> build/lib/cc_net/data
      running install
      running install_lib
      creating build/bdist.linux-x86_64
      creating build/bdist.linux-x86_64/wheel
      creating build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/text_normalizer.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/mine.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/regroup.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/__main__.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/split_by_lang.py -> build/bdist.linux-x86_64/wheel/cc_net
      creating build/bdist.linux-x86_64/wheel/cc_net/data
      copying build/lib/cc_net/data/test_stats.json -> build/bdist.linux-x86_64/wheel/cc_net/data
      copying build/lib/cc_net/data/cutoff.csv -> build/bdist.linux-x86_64/wheel/cc_net/data
      copying build/lib/cc_net/execution.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/perplexity.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/flat_hash_set.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/jsonql.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/dedup.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/__init__.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/tokenizer.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/minify.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/process_wet_file.py -> build/bdist.linux-x86_64/wheel/cc_net
      copying build/lib/cc_net/get_wiki_cirrus.py -> build/bdist.linux-x86_64/wheel/cc_net
      running install_egg_info
      running egg_info
      writing manifest file 'cc_net.egg-info/SOURCES.txt'
      Copying cc_net.egg-info to build/bdist.linux-x86_64/wheel/cc_net-1.0.0.egg-info
      error: [Errno 524] Unknown error 524: 'build/bdist.linux-x86_64/wheel/cc_net-1.0.0.egg-info/dependency_links.txt'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for cc-net
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.0.0-cp310-cp310-linux_x86_64.whl size=3184509 sha256=ef764db78260d4918be7a70e7f121e0a141496d4d3461625c465f070735cb605
  Stored in directory: /tmp/pip-ephem-wheel-cache-hc0p3z_q/wheels/8c/79/77/66697759ddfd5399956d18962ce87af09bddb6f8f49848bf4b
Successfully built kenlm
Failed to build cc-net
ERROR: Could not build wheels for cc-net, which is required to install pyproject.toml-based projects
make: *** [Makefile:45: install] Error 1

kenlm and everything else builds fine; I can't seem to find an existing issue citing this specific error. I will say that dependency_links.txt is blank, not sure if that is supposed to be the case.

Issue on book datasets download

When running download.py in the current 'book' folder, an error occurs [screenshot]. It seems like this is because the dataset is now defunct [screenshot].

Hashes to verify download integrity

I've downloaded the dataset with wget -i https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt. I'd like to make sure that I got everything correctly. Any chance you could release some form of CRC/MD5/SHA hashes to make sure I didn't download any corrupted files? (I'm not worried about needing cryptographic hashes because of some adversary given that this data is all self hosted. I'm mainly worried I've had a file truncated or something.)
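No official checksums have been published (that is what this issue asks for), but here is a minimal local sketch for computing SHA-256 sums of the downloaded files, to compare against whatever list gets released; the directory name is a placeholder for wherever wget put the data.

import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so large jsonl files never need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# "redpajama-data-1T" is a placeholder for the download directory.
for file in sorted(Path("redpajama-data-1T").rglob("*.jsonl")):
    print(sha256_of(file), file)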

How the 5 dumps of Common Crawl are selected?

When exploring the RedPajama dataset, I found that you have selected five dumps of Common Crawl as the following:

common_crawl/2023-06
common_crawl/2020-05
common_crawl/2021-04
common_crawl/2022-05
common_crawl/2019-30

What were the criteria for selecting these, considering that many more dumps are available in Common Crawl? Could you please provide more information? Thanks a lot!

will there be a trained model?

First of all: thank you very much for your contribution!

That said, I still have a question: in order to really "democratise" AI, a trained model will be needed that may be used for (fine-tuning and) inference - not too many people have the resources to train a new model from scratch.

Will such a model be made available? And, if yes, do you have any idea when?

Thanks in advance for your effort!

Andreas Rozek

Overlap between Common Crawl and C4

The C4 dataset is summarized as "A colossal, cleaned version of Common Crawl's web crawl corpus." I am confused about why this dataset is used in addition to the Common Crawl dataset. Am I mistaken in understanding that C4 overlaps completely with Common Crawl, and that using both introduces nothing but duplication?

Question about Size Difference of arXiv Data in RedPajama and AWS S3

Hello there, thank you for your work!

I noticed that the arXiv data used in RedPajama was downloaded from AWS S3 (https://info.arxiv.org/help/bulk_data_s3.html), which states that the complete set of files as of March 2023 is about 5.6 TB. However, I downloaded the preprocessed arXiv jsonl files (from this RedPajama link: https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt) and found that their total size on disk is only 87.35 GB.

So I am wondering where this huge gap comes from. There are two possibilities I can think of:

  1. RedPajama may have used only a subset of the arXiv AWS S3 data.
  2. The cleaning code and deduplication metrics used may have been so strict that most of the arXiv data was filtered out.

I would appreciate it if you could help answer my question. Thanks a lot!

Unlock open science for dataset generation

Hello everyone,

While navigating openness in AI, I ended up here and was wondering which open science sources you would use for this kind of tool. I found only arXiv listed. Are you considering including more open science sources?

I'm not sure whether this is due to a lack of understanding of how open access publication works, but I thought some explanation might help the development of tools to extract text from millions of scientific articles. It's surely something that takes time to create; I don't expect to develop it myself, I'm just trying to help open some useful discussions.

Actually, I work on education around open models (open science, open education, open software, open hardware...), so I'm just doing a bit of that here.

Quick landscape of Open Science

Open science is going mainstream in science policy: the White House announced 2023 as the Year of Open Science. It is becoming more and more mandatory for researchers working on public funds to publish in open access, and countries are adopting open science policies, fuelled by crises like COVID.

Universities and organisations themselves are involved in this evolution, as there are interests in scientific dissemination, quality, and equity.

Organisations are setting up open access repositories where they store their content: it's called DASH at Harvard, DSpace at MIT, CERN hosts a shared platform called Zenodo, and so on. A lot of universities have their own OA repository.

Explore open access repositories worldwide

All of these repositories are decentralised, and you need a way to access many of them at once to perform effective searches. There are open science search engines like CORE, with access to a large number of organisations (~10,000).

They do have an API, but it may not be the most interesting way to perform this kind of task.

Two things:

  • Open access repositories (usually) implement the OAI-PMH protocol for querying their content
  • You will need an index of repositories; some exist, like the Directory of Open Access Journals (DOAJ)

There are potentially OAI-PMH queries to get all the information about a repository's content, so these are some paths to explore. I hope this helps to dig into open science.

Shell example with Zenodo (I'm not sure what percentage of the resources' metadata this command extracts):

pip install oaiharvest
oai-harvest https://zenodo.org/oai2d -d oai_dc
ls oai_dc


Memory requirement for book deduplication?

Hi, I tried the book data preparation process on my server, which has over 100 cores and about 200 GB of memory.

When I run dedup.py with the default parameters you provided (w=6, k=5, l=0, n=100), an "out of memory" error occurs, and even after decreasing the number of processes to 8, OOM still happens after processing several splits.

So could you please tell me how much memory you used when applying the deduplication process to books, and how much time it took to finish the whole process? Thanks!

change wikipedia folder name

If you test download.py in the wikipedia folder, it will show an error:

"name": "FileNotFoundError",
"message": "Unable to resolve any data file that matches '['**']' at /storage/store/work/lgrinszt/memorization/the_pile with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'BLP', 'BMP', 'DIB', 'BUFR', 'CUR', 'PCX', 'DCX', 'DDS', 'PS', 'EPS', 'FIT', 'FITS', 'FLI', 'FLC', 'FTC', 'FTU', 'GBR', 'GIF', 'GRIB', 'H5', 'HDF', 'PNG', 'APNG', 'JP2', 'J2K', 'JPC', 'JPF', 'JPX', 'J2C', 'ICNS', 'ICO', 'IM', 'IIM', 'TIF', 'TIFF', 'JFIF', 'JPE', 'JPG', 'JPEG', 'MPG', 'MPEG', 'MSP', 'PCD', 'PXR', 'PBM', 'PGM', 'PPM', 'PNM', 'PSD', 'BW', 'RGB', 'RGBA', 'SGI', 'RAS', 'TGA', 'ICB', 'VDA', 'VST', 'WEBP', 'WMF', 'EMF', 'XBM', 'XPM', 'aiff', 'au', 'avr', 'caf', 'flac', 'htk', 'svx', 'mat4', 'mat5', 'mpc2k', 'ogg', 'paf', 'pvf', 'raw', 'rf64', 'sd2', 'sds', 'ircam', 'voc', 'w64', 'wav', 'nist', 'wavex', 'wve', 'xi', 'mp3', 'opus', 'AIFF', 'AU', 'AVR', 'CAF', 'FLAC', 'HTK', 'SVX', 'MAT4', 'MAT5', 'MPC2K', 'OGG', 'PAF', 'PVF', 'RAW', 'RF64', 'SD2', 'SDS', 'IRCAM', 'VOC', 'W64', 'WAV', 'NIST', 'WAVEX', 'WVE', 'XI', 'MP3', 'OPUS', 'zip']"

please look at the issue here

Add missing script or update README.md

Thanks for your great work on this project! As mentioned in #25, the script scripts/github-prepare-download.sh, which is referenced in this README.md, is not present in the repository. Should the file be added, or is the README.md incorrect?
Thanks!

What is the cutoff.csv file mentioned in data_prep/cc/cc_net/cc_net/mine.py?

In mine.py, on lines 32-34, it introduces this file called cutoff.csv:

32 | # Constant
33 | FILE_DIR = Path(__file__).parent
34 | CUTOFF_CSV = FILE_DIR / "data" / "cutoff.csv"
35 |
36 | DEFAULT_PIPELINE = [
37 |     "dedup",
...
68 |     cutoff: cutoff file to use for split in head/middle/tail
...
93 |     cutoff: Path = CUTOFF_CSV
...
412 |     steps["pp_bucket"] = perplexity.PerplexityBucket(CUTOFF_CSV)

It doesn't seem to come with the repository or be generated anywhere, yet it is critical for classifying documents into head/middle/tail. Does anybody know what the file looks like, so that perhaps I can write it out manually? Thanks!
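Based only on what is quoted above and in the earlier traceback (the file is loaded with pd.read_csv(..., index_col=0) and feeds perplexity.PerplexityBucket, which splits documents into head/middle/tail), here is an assumption-heavy sketch of how such a table could be used; the CSV layout below is a guess for illustration, not the real file shipped with cc_net.

import io
import pandas as pd

# Guessed layout: one column per language, rows giving perplexity thresholds.
fake_cutoff_csv = io.StringIO("cutoff,en,de\nhead,300.0,320.0\ntail,900.0,950.0\n")
cutoffs = pd.read_csv(fake_cutoff_csv, index_col=0)

def bucket(language, perplexity):
    # Lower perplexity under the language model means higher quality -> "head".
    if perplexity <= cutoffs.loc["head", language]:
        return "head"
    if perplexity <= cutoffs.loc["tail", language]:
        return "middle"
    return "tail"

print(bucket("en", 250.0), bucket("en", 500.0), bucket("en", 2000.0))  # head middle tail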

The portion of the dataset left after each processing step

Hi,

In this pipeline, the major steps are as follows:

  1. quality filtering (cc-net)
  2. deduplication
  3. filtering with a classifier (trained on sampled CommonCrawl and wiki text)

My question is: how much data does each step filter out, and was there any comparison experiment with The Pile's process?
For instance:
after step 1: 50% of a single snapshot remains
after step 2: 25% remains (compared to the previous step only half remains, i.e. one quarter of the original snapshot)
after step 3: 12% remains (again roughly half of the previous step, i.e. about 0.12 of the original snapshot)

The final number of tokens from The Pile's pipeline and from this pipeline seems to differ quite a lot when using a single snapshot (it seems the final token count with this pipeline is approximately 3 times more than The Pile's).

At first glance, I thought the third step was the reason, since this pipeline's classifier (trained on wiki data) filters out docs below a 0.25 threshold and therefore keeps more docs compared to The Pile (which filters docs following the GPT-3 logic but using OpenWebText). However, after several experiments I found that the third step of this pipeline actually filters documents more harshly, yet this pipeline's final token count still seems to be around 170~200B.

Are there any comments on what causes this gap?

Language diversity

Hello. Thank you for your work.

Can you please provide information about the languages in RedPajama, or is it English only?
I've downloaded the Common Crawl part, but I don't see a language field in the metadata.

ArXiv cleaning issue

I was able to get the content downloaded from S3 (it shows 181 GB) and attempted to run the ./run_clean.py script. I get thousands of errors like this one:

[2023-08-02T22:28:40.123948][ERROR] UnicodeDecodeError: ~/Documents/ai_data/RedPajama-Data/data_prep/arxiv/work/329f2d6d-b1f1-48f6-ac00-c42769cdb1ef__e515y_o/tmp0t76cedh/0809/0809.0966.gz

and then the stack trace:

Traceback (most recent call last):
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 125, in <module>
    main()
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 116, in main
    run_clean(
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 67, in run_clean
    arxiv_cleaner.run_parallel(
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/arxiv_cleaner.py", line 60, in run_parallel
    for record, arxiv_id in executor.map(
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 766, in map
    results = super().map(partial(_process_chunk, fn),
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 610, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 610, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 190, in _get_chunks
    chunk = tuple(itertools.islice(it, chunksize))
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/arxiv_cleaner.py", line 146, in arxiv_iterator
    with tempfile.TemporaryDirectory(dir=self._work_dir) as tmpdir:
  File "/usr/lib/python3.10/tempfile.py", line 1008, in __exit__
    self.cleanup()
  File "/usr/lib/python3.10/tempfile.py", line 1012, in cleanup
    self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
  File "/usr/lib/python3.10/tempfile.py", line 994, in _rmtree
    _rmtree(name, onerror=onerror)
  File "/usr/lib/python3.10/shutil.py", line 725, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.10/shutil.py", line 664, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 662, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: '1012'

I have attempted to re-download it once, but due to the costs I don't want to try again without reaching out first.

Result does not contain raw_content

Hi there.
With the following config:
{
  "hash_in_mem": 1,
  "dump": "2023-06",
  "num_shards": 1,
  "task_parallelism": 1,
  "num_segments_per_shard": 1,
  "mine_num_processes": 1,
  "cleanup_after_regroup": "True",
  "lang_whitelist": ["zh"],
  "keep_bucket": ["head", "middle", "tail"],
  "pipeline": [
    "dedup",
    "lid",
    "keep_lang",
    "sp",
    "lm",
    "pp_bucket",
    "minify",
    "split_by_segment"
  ],
  "execution": "debug",
  "output_dir": "../zh_data",
  "mined_dir": "zh_mined_by_segment",
  "target_size": "256MB",
  "cache_dir": "../zh_data/wet_cache"
}
I got a result like this:
"sha1:XEGMU6NDDKQFGIP36I3TJUMYCQFW5QLX", "cc_segment": "crawl-data/CC-MAIN-2023-06/segments/1674764494826.88/wet/CC-MAIN-20230126210844-20230127000844-00000.warc.wet.gz", "language": "zh", "language_score": 0.95, "perplexity": 2445.1, "bucket": "tail", "line_ids": "AAABAAIAAwAEAAUABgAIAAkACgALAAwADQAOAA8AEAARABIAEwAUABUAFgAXABgAGgAbABwAHQAhACIAIwAkACUAJgAnACgAKQA="}
I can't find the reason why raw_content is missing.

EOFError: Compressed file ended before the end-of-stream marker was reached

Hi, thank you in advance.
I am facing the following error while using the same command for processing CommonCrawl as in the README:
python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1
The error seems to be caused by a file with a bad connection. As I understand it, the code processes the file remotely, so it needs to keep a connection open to a single WET .gz file (is that right?). However, the network connection to the CommonCrawl S3 seems unstable these days. So if my suspicion is correct and this is due to bad network conditions, is there nothing more I can do, or is there anything I'm missing?

Also, the process is killed right away, before finishing the whole job, when it hits this error. I'm thinking of editing the code with an exception handler so that the process gives up on the bad .gz file but continues to the next one. Do you think that is a viable idea?

Thank you!

----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93398_0_log.err
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93398_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 93707 (12 / 5000): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
  File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
    jsonql.run_pipes(
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
    for warc in warc_lines:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
    yield from file
  File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 313, in read1
    return self._buffer.read1(size)
  File "/root/anaconda3/envs/test/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 506, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93707_0_log.err
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93707_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 93259 (13 / 5000): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
  File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
    jsonql.run_pipes(
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
    for warc in warc_lines:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
    yield from file
  File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 313, in read1
    return self._buffer.read1(size)
  File "/root/anaconda3/envs/test/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 506, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93259_0_log.err
  - /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93259_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 75105 (14 / 5000): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
  File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
    jsonql.run_pipes(
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
    for warc in warc_lines:
  File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
    yield from file
  File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 313, in read1
    return self._buffer.read1(size)
  File "/root/anaconda3/envs/test/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 506, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
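A minimal sketch of the skip-and-continue idea described in the issue text above: wrap the per-segment iteration so a truncated download is logged and the job moves on to the next segment. Where exactly this would hook into cc_net's pipeline is not shown here, and the function name is hypothetical.

import logging

def skip_truncated_segments(segment_iterators):
    # segment_iterators: an iterable of per-segment document iterators
    for segment in segment_iterators:
        try:
            yield from segment
        except EOFError:
            logging.warning("Compressed file ended before the end-of-stream marker; skipping segment.")
            continue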

Please consider adding a source of natural dialogue data

Many of the open source datasets are missing natural dialogue data. As a result, the models seem less genuine, less interesting, and less able to chat.

Another way to say this: I just searched "baby cry after 6 week vaccines" and found a bunch of vague articles with generic advice; on the other hand, "baby cry after 6 week vaccines reddit" led my wife and me to some very helpful r/beyondthebump conversations. And that's because natural dialogue is a very high-signal data source!

I'd like AIs trained on this dataset to be just as direct and helpful. So I propose including a few trillion tokens of Reddit comments, perhaps selected from the higher-quality subreddits (writingprompts, science, changemymind, askscience, etc.).

The largest source of dialogue data is Reddit, so I'd like to propose that you include Reddit data.

There are also other dialogue datasets.

Thoughts? Would you merge a PR on this?
