Comments (2)

ladit commented on August 15, 2024

You can refer to python -m cc_net --help and RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py for all arguments.
You can run python -m cc_net --config config/test_segment.json to take the arguments from a JSON file.
An example config for testing the whole process:

{
  "dump": "2023-06",
  "output_dir": "test_data",
  "num_shards": 5,
  "num_segments_per_shard": 3,
  "hash_in_mem": 1,
  "mine_num_processes": 5,
  "task_parallelism": 5
}
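
A minimal sketch of how such a config file could be generated and passed to the command above. The helper names write_config and build_command are hypothetical (they are not part of cc_net); the keys and values are copied from the example config, and the --config flag is the one mentioned above.

```python
import json
from pathlib import Path

# Keys and values copied from the example config above.
CONFIG = {
    "dump": "2023-06",
    "output_dir": "test_data",
    "num_shards": 5,
    "num_segments_per_shard": 3,
    "hash_in_mem": 1,
    "mine_num_processes": 5,
    "task_parallelism": 5,
}


def write_config(path, cfg):
    """Write cfg as JSON to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cfg, indent=2) + "\n")
    return path


def build_command(config_path):
    """Return the argument list for running cc_net with a config file."""
    return ["python", "-m", "cc_net", "--config", str(config_path)]
```

For example, build_command(write_config("config/test_segment.json", CONFIG)) returns an argument list that can be handed to subprocess.run from the cc_net directory.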

from redpajama-data.

newbietuan commented on August 15, 2024

Thank you very much. I will try it.
When I run this command:
(demo) mayutuan@mayutuans-MacBook-Pro cc_net % python -m cc_net --dump 2023-06 --task_parallelism 6 --num_shards 6 -l en --mine_num_processes 6 --hash_in_mem 1 --num_segments_per_shard 2
the six xxxx.bin files finish downloading, and then I get this message:
Traceback (most recent call last):
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/Users/mayutuan/Downloads/projects/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/mine.py", line 439, in _mine_shard
jsonql.run_pipes(
File "/Users/mayutuan/Downloads/projects/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/jsonql.py", line 439, in run_pipes
multiprocessing.Pool(
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
self._repopulate_pool()
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Users/mayutuan/anaconda3/envs/demo/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_mine_shard.<locals>.<lambda>'
I found a suggestion online: globals()['my_local_function'] = my_local_function. But I don't know whether it is right or how I should implement it.
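
A note on the traceback above: the popen_spawn_posix.py frames show that multiprocessing is using the spawn start method (the default on macOS since Python 3.8), which pickles everything handed to the worker processes. Functions and lambdas defined inside another function cannot be pickled, and that is what the AttributeError reports. The sketch below reproduces this in isolation and shows one common workaround: move the function to module level and bind its arguments with functools.partial. The names lang_is_allowed, make_local_filter, and can_pickle are hypothetical and are not taken from cc_net; whether the same change is the right fix inside mine.py is an assumption that would need testing.

```python
import pickle
from functools import partial


def lang_is_allowed(doc, allowed):
    """Module-level predicate: pickle can find it again by its name."""
    return doc.get("language") in allowed


def make_local_filter(allowed):
    # The lambda is created inside a function, so its qualified name
    # contains '<locals>' and pickle cannot look it up by name.
    return lambda doc: doc.get("language") in allowed


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


local_filter = make_local_filter({"en"})
global_filter = partial(lang_is_allowed, allowed=frozenset({"en"}))

print(can_pickle(local_filter))   # False
print(can_pickle(global_filter))  # True
```

The globals() assignment is unlikely to be enough on its own, because pickle looks a function up by its qualified name, and that name still contains '<locals>' after the assignment. A plain module-level function is usually the simpler route.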

