The lm-evaluation-harness from rwkv

Potential error with accelerate + lm-eval + rwkv sharded models

The following error occurs when running lm-eval through accelerate with a sharded RWKV model:
# ------------------------------
# Running Task : anli
# ------------------------------
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `7`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
2024-02-25:19:16:47,809 INFO     [utils.py:145] Note: detected 160 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-02-25:19:16:47,809 INFO     [utils.py:148] Note: NumExpr detected 160 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-25:19:16:47,887 INFO     [utils.py:145] Note: detected 160 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-02-25:19:16:47,887 INFO     [utils.py:148] Note: NumExpr detected 160 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-25:19:16:47,897 INFO     [utils.py:145] Note: detected 160 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-02-25:19:16:47,897 INFO     [utils.py:148] Note: NumExpr detected 160 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-25:19:16:47,904 INFO     [utils.py:145] Note: detected 160 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-02-25:19:16:47,905 INFO     [utils.py:148] Note: NumExpr detected 160 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-25:19:16:47,905 INFO     [utils.py:145] Note: detected 160 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-02-25:19:16:47,905 INFO     [utils.py:148] Note: NumExpr detected 160 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-25:19:16:47,906 INFO     [utils.py:145] Note: detected 160 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-02-25:19:16:47,906 INFO     [utils.py:148] Note: NumExpr detected 160 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-25:19:16:47,937 INFO     [utils.py:145] Note: detected 160 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-02-25:19:16:47,937 INFO     [utils.py:148] Note: NumExpr detected 160 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-02-25:19:16:48,003 INFO     [config.py:58] PyTorch version 2.1.2 available.
2024-02-25:19:16:48,088 INFO     [config.py:58] PyTorch version 2.1.2 available.
2024-02-25:19:16:48,097 INFO     [config.py:58] PyTorch version 2.1.2 available.
2024-02-25:19:16:48,098 INFO     [config.py:58] PyTorch version 2.1.2 available.
2024-02-25:19:16:48,101 INFO     [config.py:58] PyTorch version 2.1.2 available.
2024-02-25:19:16:48,102 INFO     [config.py:58] PyTorch version 2.1.2 available.
2024-02-25:19:16:48,110 INFO     [config.py:58] PyTorch version 2.1.2 available.
2024-02-25:19:16:49,485 INFO     [__main__.py:162] Verbosity set to INFO
2024-02-25:19:16:49,485 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-25:19:16:49,558 INFO     [__main__.py:162] Verbosity set to INFO
2024-02-25:19:16:49,559 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-25:19:16:49,568 INFO     [__main__.py:162] Verbosity set to INFO
2024-02-25:19:16:49,568 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-25:19:16:49,589 INFO     [__main__.py:162] Verbosity set to INFO
2024-02-25:19:16:49,589 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-25:19:16:49,599 INFO     [__main__.py:162] Verbosity set to INFO
2024-02-25:19:16:49,599 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-25:19:16:49,607 INFO     [__main__.py:162] Verbosity set to INFO
2024-02-25:19:16:49,607 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-25:19:16:49,681 INFO     [__main__.py:162] Verbosity set to INFO
2024-02-25:19:16:49,681 INFO     [__init__.py:358] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-02-25:19:16:54,373 INFO     [__main__.py:238] Selected Tasks: ['anli']
2024-02-25:19:16:54,373 INFO     [__main__.py:239] Loading selected tasks...
2024-02-25:19:16:54,478 INFO     [__main__.py:238] Selected Tasks: ['anli']
2024-02-25:19:16:54,478 INFO     [__main__.py:239] Loading selected tasks...
2024-02-25:19:16:54,487 INFO     [__main__.py:238] Selected Tasks: ['anli']
2024-02-25:19:16:54,487 INFO     [__main__.py:239] Loading selected tasks...
2024-02-25:19:16:54,490 INFO     [__main__.py:238] Selected Tasks: ['anli']
2024-02-25:19:16:54,490 INFO     [__main__.py:239] Loading selected tasks...
2024-02-25:19:16:54,501 INFO     [__main__.py:238] Selected Tasks: ['anli']
2024-02-25:19:16:54,501 INFO     [__main__.py:239] Loading selected tasks...
2024-02-25:19:16:54,507 INFO     [__main__.py:238] Selected Tasks: ['anli']
2024-02-25:19:16:54,507 INFO     [__main__.py:239] Loading selected tasks...
  exitcode  : 1 (pid: 45813)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-02-25_19:16:59
  host      : 0c7aa106d18a
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 45814)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-02-25_19:16:59
  host      : 0c7aa106d18a
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 45815)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-02-25_19:16:59
  host      : 0c7aa106d18a
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 45816)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-02-25_19:16:59
  host      : 0c7aa106d18a
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 45817)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-02-25_19:16:59
  host      : 0c7aa106d18a
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 45818)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-25_19:16:59
  host      : 0c7aa106d18a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 45812)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
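Every rank in the summary above reports `error_file: <N/A>`, so the underlying exception is not shown. The page linked in the `traceback` lines describes PyTorch's `record` decorator, which writes each rank's uncaught exception to an error file. The sketch below is one possible way to apply it; it assumes that lm-eval 0.4.x exposes its CLI entry point as `lm_eval.__main__.cli_evaluate`, and the names `run_with_traceback` and `main` are hypothetical helpers introduced only for illustration.

```python
try:
    # PyTorch API: records the uncaught exception of a worker process so the
    # elastic failure summary can report it.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback so the sketch still imports where torch is not installed.
    def record(fn):
        return fn


def run_with_traceback(entrypoint, *args, **kwargs):
    """Call `entrypoint` wrapped in @record and return its result."""
    return record(entrypoint)(*args, **kwargs)


def main():
    # Assumption: lm-eval 0.4.x exposes its CLI as cli_evaluate in
    # lm_eval/__main__.py; adjust the import if your version differs.
    from lm_eval.__main__ import cli_evaluate

    run_with_traceback(cli_evaluate)
```

Saved as a script that ends with a call to `main()` and started through `accelerate launch` with the same arguments as before, this should make the per-rank traceback appear in the failure summary instead of `<N/A>`.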
This error does not seem to occur with smaller, unsharded models.
As a temporary workaround, we will avoid sharding the checkpoint for the HF implementation.
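A minimal sketch of that workaround follows, assuming the checkpoint loads through Transformers' `AutoModelForCausalLM` and fits in CPU memory. Transformers writes an index file (`model.safetensors.index.json` or `pytorch_model.bin.index.json`) next to a sharded checkpoint, which is what `is_sharded` looks for. The function names, the directory arguments, and the `200GB` limit are illustrative assumptions, not part of the original report.

```python
from pathlib import Path

# Index files that Hugging Face Transformers writes for a sharded checkpoint.
SHARD_INDEX_FILES = (
    "model.safetensors.index.json",
    "pytorch_model.bin.index.json",
)


def is_sharded(checkpoint_dir):
    """Return True if the directory contains a shard index file."""
    root = Path(checkpoint_dir)
    return any((root / name).is_file() for name in SHARD_INDEX_FILES)


def resave_unsharded(src_dir, dst_dir, max_shard_size="200GB"):
    """Reload a checkpoint and save it again as a single weight file.

    Assumption: the model loads via AutoModelForCausalLM and fits in CPU RAM.
    The tokenizer files are not copied by this sketch.
    """
    # Imported lazily so is_sharded() works without transformers installed.
    from transformers import AutoModelForCausalLM

    # trust_remote_code is an assumption for RWKV checkpoints that ship
    # their own modelling code; drop it if the model does not need it.
    model = AutoModelForCausalLM.from_pretrained(src_dir, trust_remote_code=True)
    # A limit larger than the whole model keeps all weights in one file.
    model.save_pretrained(dst_dir, max_shard_size=max_shard_size)
    return Path(dst_dir)
```

Calling `is_sharded` on the output directory afterwards is a quick check that no index file was written before pointing lm-eval at it.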