Comments (2)
For the full list of arguments, run python -m cc_net --help
or read RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py.
To take the arguments from a JSON file instead, run python -m cc_net --config config/test_segment.json.
An example config for testing the whole process:
{
  "dump": "2023-06",
  "output_dir": "test_data",
  "num_shards": 5,
  "num_segments_per_shard": 3,
  "hash_in_mem": 1,
  "mine_num_processes": 5,
  "task_parallelism": 5
}
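As a minimal sketch, the example config can be generated from Python and passed to the --config flag mentioned above. The file name config/test_run.json and the helper names are hypothetical, chosen only for illustration; the keys and values are copied from the example config.

```python
import json
from pathlib import Path

# Keys and values copied from the example config in this thread.
config = {
    "dump": "2023-06",
    "output_dir": "test_data",
    "num_shards": 5,
    "num_segments_per_shard": 3,
    "hash_in_mem": 1,
    "mine_num_processes": 5,
    "task_parallelism": 5,
}

# Hypothetical file name; any path accepted by --config should work.
CONFIG_PATH = "config/test_run.json"

config_text = json.dumps(config, indent=2)

# Command line that reads its arguments from the JSON file.
command = ["python", "-m", "cc_net", "--config", CONFIG_PATH]


def write_config(path=CONFIG_PATH, text=config_text):
    """Write the config to disk, creating the parent directory if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return target


if __name__ == "__main__":
    write_config()
    print(" ".join(command))
```

Running the script writes the file and prints the command to execute from the cc_net directory.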
from redpajama-data.
Thank you very much. I will try it.
When I run the following command in the demo environment:
(demo) mayutuan@mayutuans-MacBook-Pro cc_net % python -m cc_net --dump 2023-06 --task_parallelism 6 --num_shards 6 -l en --mine_num_processes 6 --hash_in_mem 1 --num_segments_per_shard 2
it finishes downloading the six xxxx.bin files, and then I get this error:
Traceback (most recent call last):
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/Users/mayutuan/Downloads/projects/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
jsonql.run_pipes(
File "/Users/mayutuan/Downloads/projects/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/jsonql.py", line 439, in run_pipes
multiprocessing.Pool(
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
self._repopulate_pool()
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_mine_shard..'
I found a suggestion online: globals()['my_local_function'] = my_local_function. I don't know whether it is correct or how I should apply it here.
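The traceback goes through popen_spawn_posix.py, which means the worker processes are started with the "spawn" method (the default on macOS since Python 3.8). With spawn, the function handed to multiprocessing has to be pickled, and pickle cannot serialize a function defined inside another function. The sketch below reproduces that failure in isolation and shows one common workaround: a module-level function combined with functools.partial. The names make_local_worker, add_offset, and can_pickle are hypothetical and are not taken from cc_net; this is an illustration of the general technique, not a patch for mine.py.

```python
import pickle
from functools import partial


def make_local_worker(offset):
    # A function defined inside another function is a "local object".
    # pickle stores functions by qualified name, and a name containing
    # "<locals>" cannot be looked up again, so pickling fails.
    def worker(x):
        return x + offset

    return worker


def add_offset(x, offset):
    # Module-level function: pickle can reference it by name when the
    # defining module is importable (for example when run as a script).
    return x + offset


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


# The local closure is rejected by pickle.
local_ok = can_pickle(make_local_worker(1))

# The same behaviour expressed as a partial over a module-level function.
bound = partial(add_offset, offset=1)
partial_ok = can_pickle(bound)

if __name__ == "__main__":
    print("local function picklable:", local_ok)
    print("partial of module-level function picklable:", partial_ok)
```

Assigning the function into globals() does not change its __qualname__, so pickle would most likely still reject it. Moving the local function (or lambda) to module level is the more reliable fix.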