togethercomputer / redpajama-data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
License: Apache License 2.0
Hi,
How do I fine-tune RedPajama on my dataset? Is there a training script that I can reuse?
I am following the example at https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net#pipeline-overview but got the following error. Did I miss a preparation step?
(racoon) t@medu:~/repos/NAM/red-pajama/data_prep/cc/cc_net$ python -m cc_net -l my -l gu
usage: __main__.py [-h] [-c CONFIG_NAME] [-d DUMP] [-o OUTPUT_DIR] [-m MINED_DIR] [-e EXECUTION] [-n NUM_SHARDS] [--min_shard MIN_SHARD]
[--num_segments_per_shard NUM_SEGMENTS_PER_SHARD] [--metadata METADATA] [--min_len MIN_LEN] [--hash_in_mem HASH_IN_MEM]
[-l LANG_WHITELIST] [--lang_blacklist LANG_BLACKLIST] [--lang_threshold LANG_THRESHOLD] [-k KEEP_BUCKET]
[--lm_dir LM_DIR] [--cutoff CUTOFF] [--lm_languages LM_LANGUAGES] [--mine_num_processes MINE_NUM_PROCESSES]
[-t TARGET_SIZE] [--cleanup_after_regroup] [--task_parallelism TASK_PARALLELISM] [-p PIPELINE]
[--experiments EXPERIMENTS] [--cache_dir CACHE_DIR] [--config CONFIG]
__main__.py: error: argument -l/--lang_whitelist: invalid Sequence value: 'my'
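The error comes from argparse rejecting the raw value before cc_net's own logic runs. A minimal sketch of the likely mechanism, assuming the parser's bare `typing.Sequence` annotation is being used as the argparse `type` callable (verify against cc_net/mine.py; the comma-splitting callable below is an illustration, not necessarily the repo's fix):

```python
import argparse

# Hedged reproduction: argparse applies its `type` callable to each raw
# string, and calling typing.Sequence on 'my' fails with
# "invalid Sequence value". A comma-splitting callable parses the flag:
parser = argparse.ArgumentParser()
parser.add_argument("-l", "--lang_whitelist",
                    type=lambda s: s.split(","), default=[])
args = parser.parse_args(["-l", "my,gu"])
print(args.lang_whitelist)  # -> ['my', 'gu']
```

Under this reading, passing the languages as a single comma-separated value may also be worth trying against the unmodified CLI.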
Hello and thank you for the great work.
I am trying to understand the quality filter you had, described here
I took the trained model and script you provided in this issue and wrote the short sanity check below.
The first paragraph is from wikipedia, and the second paragraph is a lower quality paragraph.
These are the outputs I receive, almost the same scores & probabilities:
{'pred_label': '__label__cc', 'pred_label_prob': 0.9966633915901184, 'wiki_prob': 0.003336608409881592, ...} # wikipedia paragraph
{'pred_label': '__label__cc', 'pred_label_prob': 0.9801203012466431, 'wiki_prob': 0.019879698753356934, ...} # low quality paragraph
Am I missing something? Am I using it correctly? I follow the exact steps you take with the model in the classify.py file.
wikipedia_paragraph = '''A language model is a probability distribution over sequences of words.[1] Given any sequence of words of length m, a language model assigns a probability to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences (the property of digital infinity), language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data. Several modelling approaches have been designed to surmount this problem, such as applying the Markov assumption or using neural architectures such as recurrent neural networks or transformers'''
bad_paragraph = '''language thing is like, you know, when you get lots of words together and there's like a chance for one word after another. Like when you're talking and stuff. And there's a thing that's called infinity digital or something that means you can make lots and lots of sentences, even ones that you might never hear before. Some smart people have found some ways to not make this a problem, like there's this thing called Markov (sounds Russian) and there's other brain-like things that help, but don't ask me about those.'''
import fasttext

model = fasttext.load_model(model_path)

def get_output(content):
    output = {}
    # run classifier
    text = " ".join(content.strip().splitlines())
    pred = model.predict(text)
    (pred_label, pred_prob) = pred
    pred_label = pred_label[0]
    wiki_prob = pred_prob[0]
    if pred_label == "__label__cc":
        wiki_prob = 1 - wiki_prob
    output["pred_label"] = pred_label
    output["pred_label_prob"] = pred_prob[0]
    output["wiki_prob"] = wiki_prob
    output["text"] = content
    return output

print(get_output(wikipedia_paragraph))
print(get_output(bad_paragraph))
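For context on interpreting such scores: the quality filter keeps documents whose wiki_prob clears a cutoff. The sketch below uses 0.25 as an illustrative threshold (an assumption to verify against the repo's documentation, not a value from this script); both paragraphs above would be dropped under it.

```python
# Minimal sketch of a wiki_prob-based quality filter. The 0.25 threshold
# is an assumption for illustration only.
def keep_document(output, threshold=0.25):
    return output["wiki_prob"] > threshold

print(keep_document({"wiki_prob": 0.0033}))  # False: wikipedia paragraph above
print(keep_document({"wiki_prob": 0.0199}))  # False: low-quality paragraph above
```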
First of all: thank you very much for your contribution!
Many thanks if you can share the pretrained FastText model used to classify whether each Common Crawl webpage is a low-quality page.
Can you mention which languages are covered in this dataset? Based on arXiv:2302.13971v1, LLaMA only covers these languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. Would it be possible to add some new low-resource languages, such as Indonesian? Thanks.
Hi, I'm trying to run this test case:
python3 -m cc_net --config config/test_segment.json
but encountered the following error:
Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2019-09', output_dir=PosixPath('test_data2'), mined_dir='mined_by_segment', execution='debug', num_shards=4, min_shard=-1, num_segments_per_shard=1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['de', 'it', 'fr'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=0, target_size='32M', cleanup_after_regroup=False, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'minify', 'split_by_segment'], experiments=[], cache_dir=PosixPath('test_data/wet_cache'))
['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'minify', 'split_by_segment']
2023-04-27 11:41 INFO 39932:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x11b5b77c0>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x11b5b7a30>, <cc_net.perplexity.MultiSentencePiece object at 0x11b5b78e0>, <cc_net.perplexity.DocLM object at 0x11b5b7970>, <cc_net.perplexity.PerplexityBucket object at 0x11b5b7a60>, <cc_net.minify.Minifier object at 0x11b5b7be0>]
/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/flat_hash_set.py:115: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try `pip install cc_net[getpy]
warnings.warn(
2023-04-27 11:41 INFO 39932:DuplicatesRemover - Loaded hashes from test_data2/hashes/2019-09/0000.bin (0.700GB total, took 0.02m)
2023-04-27 11:41 INFO 39932:DuplicatesRemover - Loaded 3_361_543 hashes from 1 files. (0.7GB total, took 0.02m)
2023-04-27 11:41 INFO 39932:Classifier - Loading bin/lid.bin
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/de.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/de.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/it.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/it.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loading data/lm_sp/fr.sp.model...
2023-04-27 11:41 INFO 39932:MultiSentencePiece - Loaded data/lm_sp/fr.sp.model (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/de.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/de.arpa.bin (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/it.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/it.arpa.bin (took 0.0min)
2023-04-27 11:41 INFO 39932:DocLM - Loading data/lm_sp/fr.arpa.bin...
2023-04-27 11:41 INFO 39932:DocLM - Loaded data/lm_sp/fr.arpa.bin (took 0.0min)
Traceback (most recent call last):
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 142, in debug_executor
message = function(*x)
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
jsonql.run_pipes(
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 432, in run_pipes
transform = stack.enter_context(compose(transformers))
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/contextlib.py", line 429, in enter_context
result = _cm_type.__enter__(cm)
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 312, in __enter__
self._prepare()
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 352, in _prepare
t.__enter__()
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 312, in __enter__
self._prepare()
File "/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/perplexity.py", line 267, in _prepare
cutoffs = pd.read_csv(self.cutoff_csv, index_col=0)
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 577, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
self._engine = self._make_engine(f, self.engine)
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
self.handles = get_handle(
File "/Users/Library/Python/3.9/lib/python/site-packages/pandas/io/common.py", line 859, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/Users/work/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'
Are there any possible causes? Python 3.9.6 on macOS.
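Note that the printed config has cutoff=PosixPath('/Users/work/temp/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv') while the traceback fails on '/Users/work/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv', without the temp component, suggesting a stale path is resolved somewhere. A pre-flight check (the relative path below is illustrative, not the repo's layout) can surface such a mismatch before the pipeline starts:

```python
from pathlib import Path

# Illustrative pre-flight check: confirm the cutoff file the config points
# at actually exists before launching the pipeline.
cutoff = Path("cc_net/data/cutoff.csv")  # hypothetical relative location
if not cutoff.is_file():
    print(f"cutoff file missing: {cutoff.resolve()}")
```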
Another question, please: I hit an error in the hashing step. Have you seen this before?
Failed job 34548 (4 / 60): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
jsonql.run_pipes(
File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
write_jsons(data, output)
File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
for res in source:
File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
for x in source:
File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 235, in __iter__
for doc in parse_warc_file(self.open_segment(segment), self.min_len):
File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 160, in parse_warc_file
for doc in group_by_docs(lines):
File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 133, in group_by_docs
for warc in warc_lines:
File "/disk2/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
yield from file
File "/root/anaconda3/envs/py38/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 2556: invalid start byte
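WET segments occasionally contain bytes that are not valid UTF-8. One hedged workaround (a behavior change to the reader, not the repo's documented fix) is to decode with errors="replace", which substitutes U+FFFD for undecodable bytes instead of aborting the shard:

```python
# Sketch: decode bytes tolerantly so a single bad byte (0x94 here, as in
# the traceback above) does not raise UnicodeDecodeError.
raw = b"some text \x94 more text"
text = raw.decode("utf-8", errors="replace")
print(text)  # the bad byte becomes the U+FFFD replacement character
```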
I'm gathering the data from each of the data_prep folders, and aside from some inconsistency in each README about where the data folder lives, this is the only error I've come across.
cd data_prep
mkdir -p data/book
python3 book/download.py
Traceback (most recent call last):
File "/mnt/AI/RedPajama-Data/data_prep/book/download.py", line 1, in <module>
from datasets import load_dataset
ModuleNotFoundError: No module named 'datasets'
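The traceback just means the Hugging Face datasets package is not installed in the environment, so `pip install datasets` should resolve it. A small check sketch before running the script:

```python
import importlib.util

# Report the missing dependency before running book/download.py.
if importlib.util.find_spec("datasets") is None:
    print("missing dependency; run: pip install datasets")
else:
    print("datasets is available")
```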
I tried to download from the URL, e.g. wget -P ./ -c https://data.together.xyz/redpajama-data-1T/v1.0.0/wikipedia/wiki.jsonl, but hit the error below:
63888150K .......... .......... .......... .......... .......... 54% 26.5M 82m10s
63888200K .......... .......... .......... .......... .......... 54% 28.9M 82m10s
63888250K .......... .......... .......... .......... .......... 54% 27.6M 82m10s
63888300K .......... .......... .......... .......... 54% 26.1M=98m14s
2023-06-08 12:02:19 (10.6 MB/s) - Connection closed at byte 65421660160. Retrying.
--2023-06-08 12:02:20-- (try: 2) https://data.together.xyz/redpajama-data-1T/v1.0.0/wikipedia/wiki.jsonl
Connecting to data.together.xyz (data.together.xyz)|104.26.15.50|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 120142320713 (112G), 54720660553 (51G) remaining
Saving to: ‘./wiki.jsonl’
[ skipping 63888300K ]
63888300K ,,,,,,,,,, ,,,,,,,,,, ,,,,,,,,,, ,,,,,,,,,, .......... 54% 7.02M 10h19m
63888350K .......... .......... .......... .......... .......... 54% 938K 14h54m
63888400K .......... .......... .......... .......... .......... 54% 6.70M 9h6m
63888450K .......... .......... .......... .......... .......... 54% 11.6M 6h39m
63888500K .......... .......... .......... .......... .......... 54% 10.2M 5h24m
63888550K .......... .......... .......... .......... .......... 54% 215K 17h39m
63888600K .......... .......... .......... .......... .......... 54% 992K 17h13m
63888650K .......... .......... .......... .......... .......... 54% 496K 18h59m
63888700K .......... .......... .......... .......... .......... 54% 659K 19h25m
63888750K .......... .......... .......... .......... .......... 54% 1.44M 18h24m
63888800K .......... .......... .......... .......... .......... 54% 1017K 18h1m
63888850K .......... .......... .......... .......... .......... 54% 1.27M 17h26m
63888900K .......... .......... .......... .......... .......... 54% 1.30M 16h55m
63888950K .......... .......... .......... .......... .......... 54% 1.21M 16h33m
63889000K .......... .......... .......... .......... .......... 54% 1.07M 16h20m
63889050K .......... .......... .......... .......... .......... 54% 1.55M 15h53m
63889100K .......... .......... .......... .......... .......... 54% 1.97M 15h21m
63889150K .......... .......... .......... .......... .......... 54% 1.76M 14h56m
63889200K .......... .......... .......... .......... .......... 54% 1.52M 14h38m
63889250K .......... .......... .......... .......... .......... 54% 2.09M 14h14m
63889300K .......... .......... .......... .......... .......... 54% 2.22M 13h51m
63889350K .......... 54% 1.21M=1.0s
Cannot write to ‘./wiki.jsonl’ (Success).
How to fix it?
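wget's "Cannot write to … (Success)" message usually points at the filesystem rather than the network; with a 112 GB file, running out of disk space mid-download is the usual suspect (a hedged diagnosis, not a confirmed cause). A quick check before resuming with wget -c:

```python
import shutil

# Report free space on the download target; wiki.jsonl is ~112 GB, so the
# filesystem needs at least that much headroom before resuming.
free_gb = shutil.disk_usage(".").free / 1e9
print(f"{free_gb:.1f} GB free on the current filesystem")
```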
Hi there,
Thanks for making this code available. I am trying to use the arxiv downloader, but would be interested in a certain date range of papers to be downloaded. Any tips on how to approach this?
Many thanks
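One lightweight approach, assuming post-2007 arXiv identifiers (whose leading four digits encode year and month as YYMM): filter the id list by that prefix before downloading. The helper below is illustrative, not part of the repo's downloader.

```python
def id_in_range(arxiv_id: str, start: str = "2201", end: str = "2304") -> bool:
    """Keep post-2007 arXiv ids (format 'YYMM.number') whose YYMM prefix
    falls in [start, end]; plain string comparison works because YYMM
    sorts lexicographically."""
    yymm = arxiv_id.split(".")[0]
    return len(yymm) == 4 and start <= yymm <= end

print(id_in_range("2302.13971"))  # True: Feb 2023 is inside 2201..2304
print(id_in_range("2105.00001"))  # False: May 2021 predates the range
```

Old-style ids like "quant-ph/9901001" would need separate handling, since they do not follow the YYMM.number layout.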
First of all, thank you for your great work to create this project. I didn't have access to a Slurm workload manager, but I was able to use these scripts to preprocess a sample of the GitHub dataset from BigQuery (which was exactly what I wanted to do!). Here are a couple points which would improve the scripts for the next person:
scripts/github-prepare-download.sh mentioned in this README.md seems missing from the scripts directory.
The TARGET_DIR variable in the github-global-dedup-slurm.sbatch script should probably be ./data/github/processed_deduped instead of ./data/github_scratch/processed_deduped.
The TARGET_DIR and DEDUPED_DIR variables in the github-run-filter-slurm.sbatch script should use github instead of github_scratch.
Thanks again for your work on this project.
How can I do that? I didn't find that cc_net can convert WARC to WARC.wet.
Hi there!
I looked through the corpora and found that some documents are not 100% downloaded; I'm not sure whether the issue is with the download scripts. Below are some examples grepped from bookcorpus and arxiv. If you follow the URL in each example, you will see that the full document contains much more text than just a header. For now I plan to filter such documents out of the training set, since there are not many of them, but it would be great to eventually re-download them properly so the corpus includes the full text. That would make the dataset even larger and more useful for model training. I suspect more datasets are affected, not just arxiv and book.
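The interim filtering described above can be sketched as a simple length check; the 100-character cutoff is an arbitrary illustrative value, not one from the repo.

```python
def looks_truncated(record: dict, min_chars: int = 100) -> bool:
    """Flag documents whose text is so short it is likely just a header
    left over from an incomplete download (illustrative heuristic)."""
    return len(record.get("text", "")) < min_chars

print(looks_truncated({"text": "Chapter 1"}))  # True: header-only stub
```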
I am currently working on RedPajama-v2; check out our Discord for more info on what we found out about these datasets: https://discord.gg/KMmsHFxE
Thanks!
Hey, could you document somewhere how much drive space the full dataset requires?
I went from doing the cc-net pulls locally to using slurm. When I try to execute
theskaz@c4140:/nfs/slow/RedPajama-Data/data_prep/cc/cc_net$ python -m cc_net --dump 2020-05 --num_shards 5000 --hash_in_mem 1
Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2020-05', output_dir=PosixPath('data'), mined_dir='mined', execution='auto', num_shards=5000, num_segments_per_shard=-1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['en'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting _hashes_shard in a job array (3983 jobs)
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 18, in <module>
main()
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 14, in main
func_argparse.parse_and_call(cc_net.mine.get_main_parser())
File "/home/theskaz/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
return command(**parsed_args)
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 631, in main
all_files = mine(conf)
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 334, in mine
hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 263, in hashes
ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 106, in map_array_and_wait
jobs = ex.map_array(function, *args)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 771, in map_array
return self._internal_process_submissions(submissions)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
return self._executor._internal_process_submissions(delayed_submissions)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 328, in _internal_process_submissions
array_ex.update_parameters(**self.parameters)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 810, in update_parameters
self._internal_update_parameters(**kwargs)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 302, in _internal_update_parameters
raise ValueError(
ValueError: Unavailable parameter(s): ['slurm_time']
Valid parameters are:
- account (default: None)
- additional_parameters (default: None)
- array_parallelism (default: 256)
- comment (default: None)
- constraint (default: None)
- cpus_per_gpu (default: None)
- cpus_per_task (default: None)
- exclude (default: None)
- exclusive (default: None)
- gpus_per_node (default: None)
- gpus_per_task (default: None)
- gres (default: None)
- job_name (default: 'submitit')
- mem (default: None)
- mem_per_cpu (default: None)
- mem_per_gpu (default: None)
- nodes (default: 1)
- ntasks_per_node (default: None)
- num_gpus (default: None)
- partition (default: None)
- qos (default: None)
- setup (default: None)
- signal_delay_s (default: 90)
- srun_args (default: None)
- stderr_to_stdout (default: False)
- time (default: 5)
- wckey (default: 'submitit')
I then went into execution.py, commented out the slurm_time parameter (lines 58-61), and tried again, which returns this error:
theskaz@c4140:/nfs/slow/RedPajama-Data/data_prep/cc/cc_net$ python -m cc_net --dump 2020-05 --num_shards 5000 --hash_in_mem 1
Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2020-05', output_dir=PosixPath('data'), mined_dir='mined', execution='auto', num_shards=5000, num_segments_per_shard=-1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['en'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting _hashes_shard in a job array (3983 jobs)
sbatch: error: Batch job submission failed: Invalid job array specification
subprocess.CalledProcessError: Command '['sbatch', '/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/data/logs/submission_file_4751207924ea4dde903eace6afeb2a38.sh']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 18, in <module>
main()
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 14, in main
func_argparse.parse_and_call(cc_net.mine.get_main_parser())
File "/home/theskaz/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
return command(**parsed_args)
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 631, in main
all_files = mine(conf)
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 334, in mine
hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 263, in hashes
ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 106, in map_array_and_wait
jobs = ex.map_array(function, *args)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 771, in map_array
return self._internal_process_submissions(submissions)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
return self._executor._internal_process_submissions(delayed_submissions)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 332, in _internal_process_submissions
first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str)
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 934, in _submit_command
output = utils.CommandFunction(command_list, verbose=False)() # explicit errors
File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/utils.py", line 352, in __call__
raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid job array specification
Not sure where to go from here. I can verify that slurm is working and all compute nodes are in the idle state.
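sbatch's "Invalid job array specification" commonly means the array index range (3983 tasks here) exceeds the cluster's MaxArraySize limit, which you can inspect with scontrol show config. One hedged workaround, independent of whatever batching submitit itself supports, is to split the submission into chunks under that limit:

```python
def array_chunks(n_jobs: int, max_array: int = 1000):
    """Split n_jobs task indices into [start, end) chunks no larger than
    max_array, so each sbatch array stays under the cluster limit.
    max_array=1000 is an illustrative value; use your site's MaxArraySize."""
    return [(i, min(i + max_array, n_jobs)) for i in range(0, n_jobs, max_array)]

print(array_chunks(3983))  # four arrays: the last covers indices 3000-3982
```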
Hi, I ran the tokenization scripts you provided to tokenize the datasets.
With the cl100k_base vocabulary and the tiktoken tokenizer, every dataset fails with "can't allocate memory" on my server.
The server has over 300 GB of memory, so I'm wondering how much memory and disk space is needed to tokenize all the datasets.
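One way to bound memory, assuming the allocation failure comes from loading entire jsonl files at once, is to stream one document at a time. In this sketch `encode` stands in for tiktoken's cl100k_base encoder (an assumption; in practice pass tiktoken.get_encoding("cl100k_base").encode):

```python
import json

def stream_token_counts(path, encode):
    """Yield a per-document token count without holding more than one
    record in memory; `encode` maps a string to a list of tokens."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            yield len(encode(doc["text"]))
```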
The common crawl data entries have a source like this:
"source":"cc/2023-06/en_head_0000.json.gz/line401859"
What's the right way to map that back to the metadata for where the entry came from? In particular, I'd like the original URL and the timestamp at which it was downloaded. Is that possible? Most of the metadata seems to be in terms of the WARC format, not the WET format that I believe cc_net used to process the data.
Thanks,
Craig
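The source string itself encodes only the dump, the output shard, and the line offset, which can be parsed as below (an illustrative helper, assuming the four-component layout shown above); recovering the original URL and fetch timestamp requires the url/date_download fields from cc_net's metadata rather than the WET text alone.

```python
def parse_source(source: str) -> dict:
    """Split 'cc/2023-06/en_head_0000.json.gz/line401859' into its parts
    (assumes the four-component layout shown in the question)."""
    prefix, dump, shard, line = source.split("/")
    return {"dataset": prefix, "dump": dump, "shard": shard,
            "line": int(line.removeprefix("line"))}

print(parse_source("cc/2023-06/en_head_0000.json.gz/line401859"))
```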
Currently the data in the commoncrawl slice contains the following fields in addition to the text field:
"pred_label": "__label__cc", "pred_label_prob": XXX, "wiki_prob": XXX, "source": "cc/2019-30/en_middle_0053.json.gz/line1"
We would like to also include the metadata generated by the cc_net pipeline (see the metadata [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net)). The goal is that one record in the final jsonl files should follow this schema:
{
"text": " ... ",
"meta": {
"pred_label": "...",
"pred_label_prob": "...",
"wiki_prob": "...",
"source": "...",
"url": "...",
"date_download": "...",
"digest": "...",
"length": "...",
"nlines": "...",
"source_domain": "...",
"title": "...",
"original_nlines": "...",
"original_length": "...",
"language": "...",
"language_score": "...",
"perplexity": "..."
}
}
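A small validator for the target schema above (the field list is taken directly from the schema; the helper itself is an illustration, not part of the pipeline):

```python
META_FIELDS = [
    "pred_label", "pred_label_prob", "wiki_prob", "source", "url",
    "date_download", "digest", "length", "nlines", "source_domain",
    "title", "original_nlines", "original_length", "language",
    "language_score", "perplexity",
]

def is_valid_record(record: dict) -> bool:
    """Check that a jsonl record has text plus every meta field above."""
    meta = record.get("meta", {})
    return "text" in record and all(k in meta for k in META_FIELDS)

print(is_valid_record({"text": "...", "meta": {}}))  # False: meta incomplete
```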
I'm trying to use the data visualization built on Meerkat. The viz/main.py visualization covers a sample of the GitHub data. Is there a script I can use to extend it to the other datasets?
Hey, I've been trying to process and clean the dataset for a while now, and I keep getting this error; I can't find the file anywhere in the repo. Any help would be greatly appreciated. Here is the error message: bash: scripts/github-prepare-local-dedup.sh: No such file or directory
If the program terminates due to a power outage while I'm running the cc-net data preparation pipeline, how can it resume execution from where it stopped when restarted?
I saw the code for loading cc data in create_corpus.py:
https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/cc/classifier/create_corpus.py#LL32C1-L34C26
for file in glob.glob("common_crawl/*/*/*.gz"):
if ("middle" in file or "head" in file) and "dedup" not in file:
jobs.append(file)
As mentioned in the LLaMA paper, the Common Crawl data used here should be treated as negative samples for the classifier. Why not use the inferior data (tail) instead of head and middle?
I downloaded the arXiv TeX files myself, without running scripts/arxiv-kickoff-download.sh.
My data structure is
my_arxiv_src
|- papername1
|- name.tex
|- papername2
|- name.tex
|- other_name.tex
I want to preprocess my LaTeX data, so I run
bash scripts/arxiv-kickoff-cleaning.sh
where arxiv-kickoff-cleaning.sh is the following:
#!/bin/bash
set -e
WORKERS=2
# load modules
module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 conda/pytorch_1.12.0
pip install -r arxiv_requirements.txt
export DATA_DIR="./my_ arxiv_src"
export TARGET_DIR="./data/arxiv/processed"
export WORK_DIR="./work"
mkdir -p logs/arxiv/cleaning
# setup partitions
python run_clean.py --data_dir "$DATA_DIR" --target_dir "$TARGET_DIR" --workers $WORKERS --setup
# run download in job array
sbatch scripts/arxiv-clean-slurm.sbatch
arxiv-kickoff-cleaning.sh runs with no errors, but the resulting files arxiv_1.jsonl and arxiv_2.jsonl have no content.
What should DATA_DIR and TARGET_DIR be?
Is there a supported way to run this on local LaTeX files?
Hello there, thanks for your good work.
I want to download a small portion of Common Crawl (to run through the whole process first).
Instead of running 'python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1', is it enough to add --num_segments_per_shard 2 and reduce some of the numbers, e.g. 'python -m cc_net --dump 2023-06 --task_parallelism 10 --num_shards 10 -l en --mine_num_processes 10 --hash_in_mem 1 --num_segments_per_shard 2'? And do other arguments such as target_size='4G' have any influence?
Or how should I set these arguments? Which ones should I modify, and what values are appropriate?
Thanks a lot!
Thank you for your work! I am preprocessing data for another language (zh). I have some questions regarding the provided instructions:
"In extracted_urls.txt, we provide 38M URLs that are processed from the Wikipedia dump. We early stop this process to only keep 300K pages."
Regarding the extracted_urls.txt file, how was the decision made to keep only 300K pages out of the 38M URLs processed from the Wikipedia dump? Should I follow the same ratio for the zhwiki-20230420-pages-articles-multistream.xml file, which is smaller than the English one?
"We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet."
Can you provide more guidance on how to run the pipeline on this file? I read through the cc_net code and found nothing about Wikipedia processing other than data_prep/cc/cc_net/cc_net/get_wiki_cirrus.py, which seems to download from https://dumps.wikimedia.org/other/cirrussearch/current/zhwiki-20230501-cirrussearch-content.json.gz instead.
python classifier/create_corpus.py > data_train
I notice that the input of create_corpus.py is ["cc_net/data/mined/wikipedia/en_head_0000.json.gz", "cc_net/data/mined/wikipedia/en_middle_0000.json.gz"] (maybe parsing it as an argument would be better). Can you provide instructions on how to obtain these files?
Regarding for file in glob.glob("common_crawl/*/*/*.gz") in create_corpus.py, can you clarify whether it should be run on cc_net/data/mined/{CC_DUMP}/*.gz? The glob here may be ambiguous.
Lastly, I would appreciate it if you could improve the Quality Classifier section in the README and the scripts in data_prep/cc/classifier to make them easier for newcomers to follow. Thank you!
Hello there.
I want to get the zh data for one dump. How much disk space will be occupied during download and processing, and what is the final data size?
Hello,
Trying to play with this repo: when doing a "make install" in the cc-net folder as per the instructions, I get a build error.
(Ubuntu 22.04, python 3.10, python-dev and build-essential installed)
Building wheels for collected packages: cc-net, kenlm
Building wheel for cc-net (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for cc-net (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [52 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib
creating build/lib/cc_net
copying cc_net/get_wiki_cirrus.py -> build/lib/cc_net
copying cc_net/tokenizer.py -> build/lib/cc_net
copying cc_net/perplexity.py -> build/lib/cc_net
copying cc_net/dedup.py -> build/lib/cc_net
copying cc_net/__init__.py -> build/lib/cc_net
copying cc_net/mine.py -> build/lib/cc_net
copying cc_net/minify.py -> build/lib/cc_net
copying cc_net/split_by_lang.py -> build/lib/cc_net
copying cc_net/execution.py -> build/lib/cc_net
copying cc_net/jsonql.py -> build/lib/cc_net
copying cc_net/process_wet_file.py -> build/lib/cc_net
copying cc_net/regroup.py -> build/lib/cc_net
copying cc_net/flat_hash_set.py -> build/lib/cc_net
copying cc_net/__main__.py -> build/lib/cc_net
copying cc_net/text_normalizer.py -> build/lib/cc_net
creating build/lib/cc_net/data
copying cc_net/data/cutoff.csv -> build/lib/cc_net/data
copying cc_net/data/test_stats.json -> build/lib/cc_net/data
running install
running install_lib
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/wheel
creating build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/text_normalizer.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/mine.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/regroup.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/__main__.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/split_by_lang.py -> build/bdist.linux-x86_64/wheel/cc_net
creating build/bdist.linux-x86_64/wheel/cc_net/data
copying build/lib/cc_net/data/test_stats.json -> build/bdist.linux-x86_64/wheel/cc_net/data
copying build/lib/cc_net/data/cutoff.csv -> build/bdist.linux-x86_64/wheel/cc_net/data
copying build/lib/cc_net/execution.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/perplexity.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/flat_hash_set.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/jsonql.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/dedup.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/__init__.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/tokenizer.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/minify.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/process_wet_file.py -> build/bdist.linux-x86_64/wheel/cc_net
copying build/lib/cc_net/get_wiki_cirrus.py -> build/bdist.linux-x86_64/wheel/cc_net
running install_egg_info
running egg_info
writing manifest file 'cc_net.egg-info/SOURCES.txt'
Copying cc_net.egg-info to build/bdist.linux-x86_64/wheel/cc_net-1.0.0.egg-info
error: [Errno 524] Unknown error 524: 'build/bdist.linux-x86_64/wheel/cc_net-1.0.0.egg-info/dependency_links.txt'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for cc-net
Building wheel for kenlm (pyproject.toml) ... done
Created wheel for kenlm: filename=kenlm-0.0.0-cp310-cp310-linux_x86_64.whl size=3184509 sha256=ef764db78260d4918be7a70e7f121e0a141496d4d3461625c465f070735cb605
Stored in directory: /tmp/pip-ephem-wheel-cache-hc0p3z_q/wheels/8c/79/77/66697759ddfd5399956d18962ce87af09bddb6f8f49848bf4b
Successfully built kenlm
Failed to build cc-net
ERROR: Could not build wheels for cc-net, which is required to install pyproject.toml-based projects
make: *** [Makefile:45: install] Error 1
kenlm and everything else builds fine; I can't seem to find an issue citing this specific error. I will say that dependency_links.txt is blank... not sure if that is supposed to be the case.
I've downloaded the dataset with wget -i https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
. I'd like to make sure that I got everything correctly. Any chance you could release some form of CRC/MD5/SHA hashes to make sure I didn't download any corrupted files? (I'm not worried about needing cryptographic hashes because of some adversary given that this data is all self hosted. I'm mainly worried I've had a file truncated or something.)
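In the meantime, a checksum manifest could be as simple as a file of `<sha256>  <filename>` lines. A minimal verification sketch (the manifest format is my assumption; nothing official has been published):

```python
import hashlib
import sys

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest_path: str) -> bool:
    """Check every '<hex>  <file>' line in a manifest; report mismatches."""
    ok = True
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            expected, _, name = line.strip().partition("  ")
            if not name:
                continue
            if sha256sum(name) != expected:
                print(f"MISMATCH: {name}", file=sys.stderr)
                ok = False
    return ok
```

Streaming in 1 MiB chunks keeps memory flat even for multi-GB jsonl shards.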
When exploring the RedPajama dataset, I found that you have selected five dumps of Common Crawl as the following:
common_crawl/2023-06
common_crawl/2020-05
common_crawl/2021-04
common_crawl/2022-05
common_crawl/2019-30
What are the criteria for selection, considering that there are many more dumps available in Common Crawl? Could you please provide more information? Thanks a lot!
First of all: thank you very much for your contribution!
That said, I still have a question: in order to really "democratise" AI, a trained model will be needed that may be used for (fine-tuning and) inference - not too many people have the resources to train a new model from scratch.
Will such a model be made available? And, if yes, do you have any idea when?
Thanks in advance for your effort!
Andreas Rozek
Can you share a link to a guide on how to use this model?
One more question, please.
Using the provided command, how long does it take to finish each step (e.g., quality filtering, deduplication, quality classification) when processing a single Common Crawl index (e.g., 2023-06)?
Thank you!
The C4 dataset is summarized as "A colossal, cleaned version of Common Crawl's web crawl corpus." I am confused about why this dataset is used in addition to the Common Crawl dataset. Am I mistaken in my understanding that C4 overlaps completely with Common Crawl, and that using both introduces nothing but duplication?
Hello there, thank you for your work!
I noticed that the arXiv data used in RedPajama was downloaded from AWS S3 (https://info.arxiv.org/help/bulk_data_s3.html), which states that the complete set of files as of March 2023 is about 5.6 TB. However, I downloaded the preprocessed arXiv jsonl files (from this RedPajama link: https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt) and found that their total disk size is only 87.35GB.
So I am wondering where this huge gap comes from. There are two possibilities I can think of:
I would appreciate it if you could help answer my question. Thanks a lot!
thanks.
The code is asking me for a name, e.g.,
`load_dataset('togethercomputer/RedPajama-Data-1T', "default")`?
I want to use all the data sets. Is "default" the right argument?
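For what it's worth, my reading of the Hugging Face dataset card is that "default" loads the full corpus, with one config per source alongside it. The config names below are an assumption from the card, so double-check them with `datasets.get_dataset_config_names` before relying on them:

```python
# ASSUMPTION: per-source config names as listed on the HF dataset card;
# "default" is the union of all of them. Verify with:
#   from datasets import get_dataset_config_names
#   get_dataset_config_names("togethercomputer/RedPajama-Data-1T")
SUBSET_CONFIGS = [
    "arxiv", "book", "c4", "common_crawl",
    "github", "stackexchange", "wikipedia",
]

# from datasets import load_dataset
# ds = load_dataset("togethercomputer/RedPajama-Data-1T", "default",
#                   streaming=True)   # everything, streamed to avoid 1T on disk
# arxiv = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
#                      streaming=True)  # a single source
```

Streaming mode is worth considering either way, since the full "default" config is on the order of terabytes.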
Hello everyone,
While exploring openness in AI, I ended up here and was wondering which open-science sources you would use for this kind of tool. I found only arXiv listed. Have you considered including more open-science sources?
I'm not sure whether this is due to a lack of familiarity with how open-access publication works, but I was thinking that some explanation might help in developing tools to extract text from millions of scientific articles. It is surely something that takes time to create; I don't expect to develop it myself, I'm just trying to open some useful discussions.
Actually, I work on open-model education (open science, open education, open software, open hardware...), so I'm just contributing some of that here.
Open science is going mainstream in science policy: the White House declared 2023 the Year of Open Science. Publishing in open access is becoming more and more mandatory for researchers working on public funds, countries are adopting open-science policies, and crises like COVID have fuelled the trend.
Universities and organizations are themselves involved in this evolution, since it serves scientific diffusion, quality, and equity.
Organizations are installing open-access repositories where they store their content: it's called DASH at Harvard and DSpace at MIT, and CERN hosts a shared platform called Zenodo. Many universities have their own OA repository.
All of these repositories are decentralized, so you need a way to access many of them at once to perform effective searches. There are open-science search engines like CORE, with access to a large number of organizations (~10,000).
They do have an API, but it may not be the most practical way to perform this kind of task.
Two things:
There are potentially some OAI-PMH queries to get all the information about a repository's content; some paths to explore? I hope this helps to dig into open science.
Shell example with Zenodo (I'm not sure what percentage of the resource metadata this command extracts):
pip install oaiharvest
oai-harvest https://zenodo.org/oai2d -d oai_dc
ls oai_dc
Hi, I tried the data preparation process of books using my server, which has over 100 cores and about 200GB memory.
When I run dedup.py with the default parameters you provided (w=6, k=5, l=0, n=100), an "out of memory" problem occurs, and even after I decreased the number of processes to 8, OOM still happens after processing several splits.
So could you please tell me how much memory you used when applying the deduplication process to the books, and how long it took to finish the whole process? Thanks!
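This is not the repo's MinHash dedup, but the usual pattern for bounding peak memory is to partition by hash prefix on disk and deduplicate one partition at a time. A minimal sketch of that idea (exact-hash dedup, for illustration only):

```python
import hashlib
import os

def dedup_streaming(docs, workdir, n_buckets=16):
    """Exact-duplicate removal with peak memory bounded to roughly
    1/n_buckets of the hash set, by partitioning hashes on disk first.
    NOTE: a generic sketch, not the repo's MinHash (w, k, n) dedup.
    """
    paths = [os.path.join(workdir, f"bucket_{i}.txt") for i in range(n_buckets)]
    files = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        for doc in docs:
            h = hashlib.sha1(doc.encode("utf-8")).hexdigest()
            # Route each doc to a bucket by its hash prefix.
            files[int(h[:2], 16) % n_buckets].write(
                h + "\t" + doc.replace("\n", " ") + "\n")
    finally:
        for f in files:
            f.close()
    unique = []
    for p in paths:
        seen = set()  # only one bucket's hashes live in memory at a time
        with open(p, encoding="utf-8") as f:
            for line in f:
                h, doc = line.rstrip("\n").split("\t", 1)
                if h not in seen:
                    seen.add(h)
                    unique.append(doc)
    return unique
```

The same partition-then-process trick applies to MinHash signatures, which is why decreasing the process count alone often doesn't help: the signature table itself has to be sharded.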
If you run download.py in the wikipedia folder, it shows an error:
"name": "FileNotFoundError",
"message": "Unable to resolve any data file that matches '['**']' at /storage/store/work/lgrinszt/memorization/the_pile with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'BLP', 'BMP', 'DIB', 'BUFR', 'CUR', 'PCX', 'DCX', 'DDS', 'PS', 'EPS', 'FIT', 'FITS', 'FLI', 'FLC', 'FTC', 'FTU', 'GBR', 'GIF', 'GRIB', 'H5', 'HDF', 'PNG', 'APNG', 'JP2', 'J2K', 'JPC', 'JPF', 'JPX', 'J2C', 'ICNS', 'ICO', 'IM', 'IIM', 'TIF', 'TIFF', 'JFIF', 'JPE', 'JPG', 'JPEG', 'MPG', 'MPEG', 'MSP', 'PCD', 'PXR', 'PBM', 'PGM', 'PPM', 'PNM', 'PSD', 'BW', 'RGB', 'RGBA', 'SGI', 'RAS', 'TGA', 'ICB', 'VDA', 'VST', 'WEBP', 'WMF', 'EMF', 'XBM', 'XPM', 'aiff', 'au', 'avr', 'caf', 'flac', 'htk', 'svx', 'mat4', 'mat5', 'mpc2k', 'ogg', 'paf', 'pvf', 'raw', 'rf64', 'sd2', 'sds', 'ircam', 'voc', 'w64', 'wav', 'nist', 'wavex', 'wve', 'xi', 'mp3', 'opus', 'AIFF', 'AU', 'AVR', 'CAF', 'FLAC', 'HTK', 'SVX', 'MAT4', 'MAT5', 'MPC2K', 'OGG', 'PAF', 'PVF', 'RAW', 'RF64', 'SD2', 'SDS', 'IRCAM', 'VOC', 'W64', 'WAV', 'NIST', 'WAVEX', 'WVE', 'XI', 'MP3', 'OPUS', 'zip']"
please look at the issue here
Thanks for your great work on this project! As mentioned in #25, the script scripts/github-prepare-download.sh,
which is referenced in this README.md, is not present in the repository. Should the file be added, or is the README.md incorrect?
Thanks!
In mine.py, lines 32-34 introduce this file called cutoff.csv:
--
33 | FILE_DIR = Path(__file__).parent
34 | CUTOFF_CSV = FILE_DIR / "data" / "cutoff.csv"
35 |
36 | DEFAULT_PIPELINE = [
37 | "dedup",
68 | cutoff: cutoff file to use for split in head/middle/tail
93 | cutoff: Path = CUTOFF_CSV
412 | steps["pp_bucket"] = perplexity.PerplexityBucket(CUTOFF_CSV)
It doesn't seem to come with the repository or be generated anywhere, yet it's critical for classifying head/middle/tail. Does anybody know what the file looks like, so perhaps I can write it out manually? Thanks!
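If I remember the cc_net layout right, cutoff.csv is a table of per-language perplexity cutoffs with one row per percentile 0-100; that layout is an assumption here, so verify against a real file before using. A hedged sketch of regenerating such a table from reference-corpus perplexities:

```python
import csv

def percentile(sorted_vals, p):
    """Nearest-rank percentile for p in [0, 100] over pre-sorted values."""
    k = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[k]

def write_cutoff_csv(path, ppl_by_lang):
    """Write one row per percentile 0..100, one column per language.

    ASSUMPTION: this mirrors the cc_net cutoff.csv layout (percentile
    index plus per-language perplexity cutoffs); check a real file.
    """
    langs = sorted(ppl_by_lang)
    sorted_ppl = {lang: sorted(ppl_by_lang[lang]) for lang in langs}
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["percentile"] + langs)
        for p in range(101):
            w.writerow([p] + [percentile(sorted_ppl[lang], p) for lang in langs])
```

The perplexities would come from scoring a reference corpus (e.g. Wikipedia) with the same KenLM models the pipeline uses, so the head/middle/tail split matches the LM being applied.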
Hi,
In this pipeline, the major steps are as follows.
My question is: how much data does each step filter out? And was there any comparison experiment against The Pile's process?
For instance:
after step 1: 50% of the single index remains
after step 2: 25% remains (half of the previous step's output, i.e. a quarter of the index)
after step 3: 12% remains (half of the previous step's output, i.e. about 0.12 of the index)
The final token counts from The Pile's pipeline and this pipeline seem to differ quite a bit on a single snapshot (the final token count with this pipeline appears to be roughly 3x The Pile's).
At first glance I thought the third step was the reason, since this pipeline's classifier (trained on wiki) filters out docs below a 0.25 threshold and therefore keeps more docs than The Pile (which filters docs following the GPT-3 logic, but using OpenWebText). However, after several experiments I found that the third step of this pipeline actually filters documents more harshly, yet this pipeline's final token count still seems to be around 170-200B.
Do you have any comments on what causes this gap?
Hello. Thank you for your work.
Can you please provide information about the languages in RedPajama, or is it English only?
I've downloaded the Common Crawl part, but I don't see a language field in the metadata.
I was able to get the content downloaded from S3 (it shows 181GB) and attempted to run the ./run_clean.py script. I get thousands of errors like this one:
[2023-08-02T22:28:40.123948][ERROR] UnicodeDecodeError: ~/Documents/ai_data/RedPajama-Data/data_prep/arxiv/work/329f2d6d-b1f1-48f6-ac00-c42769cdb1ef__e515y_o/tmp0t76cedh/0809/0809.0966.gz
and then the stack trace:
Traceback (most recent call last):
File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 125, in <module>
main()
File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 116, in main
run_clean(
File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 67, in run_clean
arxiv_cleaner.run_parallel(
File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/arxiv_cleaner.py", line 60, in run_parallel
for record, arxiv_id in executor.map(
File "/usr/lib/python3.10/concurrent/futures/process.py", line 766, in map
results = super().map(partial(_process_chunk, fn),
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 610, in map
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 610, in <listcomp>
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "/usr/lib/python3.10/concurrent/futures/process.py", line 190, in _get_chunks
chunk = tuple(itertools.islice(it, chunksize))
File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/arxiv_cleaner.py", line 146, in arxiv_iterator
with tempfile.TemporaryDirectory(dir=self._work_dir) as tmpdir:
File "/usr/lib/python3.10/tempfile.py", line 1008, in __exit__
self.cleanup()
File "/usr/lib/python3.10/tempfile.py", line 1012, in cleanup
self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
File "/usr/lib/python3.10/tempfile.py", line 994, in _rmtree
_rmtree(name, onerror=onerror)
File "/usr/lib/python3.10/shutil.py", line 725, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/usr/lib/python3.10/shutil.py", line 664, in _rmtree_safe_fd
onerror(os.rmdir, fullname, sys.exc_info())
File "/usr/lib/python3.10/shutil.py", line 662, in _rmtree_safe_fd
os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: '1012'
I have attempted to re-download it once, but due to costs I don't want to try again without reaching out.
Hi there.
With the following config:
{
"hash_in_mem": 1,
"dump": "2023-06",
"num_shards": 1,
"task_parallelism": 1,
"num_segments_per_shard": 1,
"mine_num_processes": 1,
"cleanup_after_regroup": "True",
"lang_whitelist": [
"zh"
],
"keep_bucket": ["head", "middle", "tail"],
"pipeline": [
"dedup",
"lid",
"keep_lang",
"sp",
"lm",
"pp_bucket",
"minify",
"split_by_segment"
],
"execution": "debug",
"output_dir": "../zh_data",
"mined_dir": "zh_mined_by_segment",
"target_size": "256MB",
"cache_dir": "../zh_data/wet_cache"
}
I got a result like this:
"sha1:XEGMU6NDDKQFGIP36I3TJUMYCQFW5QLX", "cc_segment": "crawl-data/CC-MAIN-2023-06/segments/1674764494826.88/wet/CC-MAIN-20230126210844-20230127000844-00000.warc.wet.gz", "language": "zh", "language_score": 0.95, "perplexity": 2445.1, "bucket": "tail", "line_ids": "AAABAAIAAwAEAAUABgAIAAkACgALAAwADQAOAA8AEAARABIAEwAUABUAFgAXABgAGgAbABwAHQAhACIAIwAkACUAJgAnACgAKQA="}
I can't find the reason why the raw_content is missing.
Hi, thank you in advance.
I am facing the following error while using the same command for processing Common Crawl as in the README.
python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1
The error seems to be caused by a bad connection while reading a file. As I understand it, the code processes the file remotely, so it needs to keep a connection open to a single wet .gz file (is that right?). However, the connection to the Common Crawl S3 bucket seems unstable these days. So if my suspicion is correct and this is due to bad network conditions, there seems to be nothing more I can do; or is there anything I'm missing?
Also, the process is killed right away, before finishing the whole job, when it hits this error. I'm thinking of editing the code with exception handling so that the process gives up on the bad .gz file but continues to the next one. Do you think that is a viable idea?
Thank you!
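For what it's worth, skipping a truncated segment instead of dying can be sketched as below. This is a generic pattern, not the cc_net code; in the repo the equivalent change would be catching EOFError around the per-segment loop in process_wet_file.py:

```python
import gzip
import sys

def iter_segments_robust(segment_paths):
    """Yield text lines from each gzip segment; on a truncated or
    corrupt segment (EOFError/OSError), log it and move on to the
    next segment instead of killing the whole job."""
    for path in segment_paths:
        try:
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
                for line in f:
                    yield line
        except (EOFError, OSError) as exc:
            print(f"skipping bad segment {path}: {exc}", file=sys.stderr)
```

One caveat: a segment that fails mid-stream may already have yielded some lines before the error, so a retry-then-skip policy (re-download once, then give up) is safer than skipping immediately.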
----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93398_0_log.err
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93398_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 93707 (12 / 5000): Job (task=0) failed during processing with trace:
----------------------
Traceback (most recent call last):
File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/root/anaconda3/envs/test/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 275, in _hashes_shard
jsonql.run_pipes(
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 455, in run_pipes
write_jsons(data, output)
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 496, in write_jsons
for res in source:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 284, in map
for x in source:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
for doc in parse_warc_file(self.open_segment(segment), self.min_len):
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
for doc in group_by_docs(lines):
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
for warc in warc_lines:
File "/ext_data/RedPajama-Data/data_prep/cc/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
yield from file
File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 313, in read1
return self._buffer.read1(size)
File "/root/anaconda3/envs/test/lib/python3.9/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/root/anaconda3/envs/test/lib/python3.9/gzip.py", line 506, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93707_0_log.err
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93707_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 93259 (13 / 5000): Job (task=0) failed during processing with trace:
----------------------
(traceback identical to the one above)
----------------------
You can check full logs with 'job.stderr(0)' and 'job.stdout(0)'or at paths:
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93259_0_log.err
- /ext_data/RedPajama-Data/data_prep/cc/cc_net/data/logs/93259_0_log.out
Waiting on 20 running jobs. Job ids: 74750,74833,75105,75247...
Failed job 75105 (14 / 5000): Job (task=0) failed during processing with trace:
----------------------
(traceback identical to the one above)
Many of the open source datasets are missing natural dialogue data. As a result, the models seem less genuine, less interesting, and less able to chat.
Another way to say this: I just searched "baby cry after 6 week vaccines" and found a bunch of vague articles with generic advice; on the other hand, "baby cry after 6 week vaccines reddit" led my wife and me to some very helpful r/beyondthebump conversations. And that's because natural dialogue is a very high-signal data source!
I'd like AIs trained on this dataset to be just as direct and helpful. So I propose including a few T tokens of Reddit comments, perhaps selected from the higher-quality subreddits (writingprompts, science, changemymind, askscience, etc.).
The largest source of dialogue data is Reddit, so I'd like to propose that you include Reddit data.
There are also other dialogue datasets.
Thoughts? Would you merge a PR on this?