Comments (11)
After the first Ctrl-C to kill the job:
0%| | 0/122 [00:00<?, ?it/s]^C╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /notebooks/ChatMusician/eval/opencompass/run.py:4 in <module> │
│ │
│ 1 from opencompass.cli.main import main │
│ 2 │
│ 3 if __name__ == '__main__': │
│ ❱ 4 │ main() │
│ 5 │
│ │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/cli/main.py:309 in main │
│ │
│ 306 │ │ │ for task in tasks: │
│ 307 │ │ │ │ cfg.attack.dataset = task.datasets[0][0].abbr │
│ 308 │ │ │ │ task.attack = cfg.attack │
│ ❱ 309 │ │ runner(tasks) │
│ 310 │ │
│ 311 │ # evaluate │
│ 312 │ if args.mode in ['all', 'eval']: │
│ │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/base.py:38 in __call__ │
│ │
│ 35 │ │ │ tasks (list[dict]): A list of task configs, usually generated by │
│ 36 │ │ │ │ Partitioner. │
│ 37 │ │ """ │
│ ❱ 38 │ │ status = self.launch(tasks) │
│ 39 │ │ status_list = list(status) # change into list format │
│ 40 │ │ self.summarize(status_list) │
│ 41 │
│ │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/local.py:148 in launch │
│ │
│ 145 │ │ │ │ │
│ 146 │ │ │ │ return res │
│ 147 │ │ │ │
│ ❱ 148 │ │ │ with ThreadPoolExecutor( │
│ 149 │ │ │ │ │ max_workers=self.max_num_workers) as executor: │
│ 150 │ │ │ │ status = executor.map(submit, tasks, range(len(tasks))) │
│ 151 │
│ │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py:649 in __exit__ │
│ │
│ 646 │ │ return self │
│ 647 │ │
│ 648 │ def __exit__(self, exc_type, exc_val, exc_tb): │
│ ❱ 649 │ │ self.shutdown(wait=True) │
│ 650 │ │ return False │
│ 651 │
│ 652 │
│ │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py:235 in shutdown │
│ │
│ 232 │ │ │ self._work_queue.put(None) │
│ 233 │ │ if wait: │
│ 234 │ │ │ for t in self._threads: │
│ ❱ 235 │ │ │ │ t.join() │
│ 236 │ shutdown.__doc__ = _base.Executor.shutdown.__doc__ │
│ 237 │
│ │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1096 in join │
│ │
│ 1093 │ │ │ raise RuntimeError("cannot join current thread") │
│ 1094 │ │ │
│ 1095 │ │ if timeout is None: │
│ ❱ 1096 │ │ │ self._wait_for_tstate_lock() │
│ 1097 │ │ else: │
│ 1098 │ │ │ # the behavior of a negative timeout isn't documented, but │
│ 1099 │ │ │ # historically .join(timeout=x) for x<0 has acted as if timeout=0 │
│ │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1116 in _wait_for_tstate_lock │
│ │
│ 1113 │ │ │ return │
│ 1114 │ │ │
│ 1115 │ │ try: │
│ ❱ 1116 │ │ │ if lock.acquire(block, timeout): │
│ 1117 │ │ │ │ lock.release() │
│ 1118 │ │ │ │ self._stop() │
│ 1119 │ │ except: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyboardInterrupt
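Reading the traceback, the main thread is blocked in ThreadPoolExecutor.shutdown(wait=True): the first Ctrl-C interrupts executor.map, but the with block still joins every worker thread on exit, so if a worker is wedged the process never dies. A minimal sketch of the same shape (hypothetical, just to illustrate the mechanism):

import time
from concurrent.futures import ThreadPoolExecutor

def wedged_task(_):
    time.sleep(10_000)  # stand-in for a worker stuck in native code

with ThreadPoolExecutor(max_workers=2) as executor:
    # Ctrl-C here raises KeyboardInterrupt in the main thread, but
    # __exit__ still calls shutdown(wait=True) and joins the workers,
    # which is exactly the t.join() frame in the traceback above.
    list(executor.map(wedged_task, range(4)))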
How long does the process stay stuck at:
launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0
0%|
And can I see the content of your model config .models.chat_musician.hf_chat_musician?
Regarding "How long does the process stay stuck at launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0, 0%|": I am sorry, I don't understand what you're asking for here...
As for the content of the model config .models.chat_musician.hf_chat_musician, here it is:
from opencompass.models import HuggingFaceCausalLM

model_path_mapping = {
    "ChatMusician": "m-a-p/ChatMusician",
    "ChatMusician-Base": "m-a-p/ChatMusician-Base",
}

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr=model_abbr,
        path=model_path,
        tokenizer_path=model_path,
        tokenizer_kwargs=dict(
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False,  # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
    for model_abbr, model_path in model_path_mapping.items()
]
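For reference, eval_chat_musician_7b.py presumably pulls this in via OpenCompass's read_base mechanism, something like the sketch below (the import path mirrors the .models.chat_musician.hf_chat_musician name you mentioned):

from mmengine.config import read_base

# Presumed include inside configs/eval_chat_musician_7b.py (sketch only).
with read_base():
    from .models.chat_musician.hf_chat_musician import models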
Just for reference, this has been stuck for 45 minutes:
(opencompass) ml@njdfaracvb:/notebooks/ChatMusician/eval$ date; python run.py configs/eval_chat_musician_7b.py
Wed Apr 10 16:13:49 UTC 2024
04/10 16:13:57 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
04/10 16:13:57 - OpenCompass - INFO - Partitioned into 122 tasks.
launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0
0%| | 0/122 [00:00<?, ?it/s]
It seems like the system is in the process of caching the ChatMusician model on the backend. Please try again once the model has been fully downloaded and cached.
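If it helps to rule caching out, you could pre-download both checkpoints outside OpenCompass first; a sketch using huggingface_hub, with the repo ids taken from your config above:

from huggingface_hub import snapshot_download

# Blocks until every file of each checkpoint is present in the local cache.
for repo_id in ("m-a-p/ChatMusician", "m-a-p/ChatMusician-Base"):
    print(repo_id, "->", snapshot_download(repo_id))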
Models are cached; it is still just sitting there (I have just re-run the predict code to bring the models down again):
(opencompass) ml@nwlciszbmg:/notebooks/ChatMusician/eval$ cd
(opencompass) ml@nwlciszbmg:~$ ls -al .cache/huggingface/hub/
total 24
drwxrwxr-x 5 ml ml 4096 Apr 11 06:13 .
drwxrwxr-x 3 ml ml 4096 Apr 11 06:07 ..
drwxrwxr-x 4 ml ml 4096 Apr 11 06:13 .locks
drwxrwxr-x 6 ml ml 4096 Apr 11 06:19 models--m-a-p--ChatMusician
drwxrwxr-x 6 ml ml 4096 Apr 11 06:12 models--m-a-p--ChatMusician-Base
-rw-rw-r-- 1 ml ml 1 Apr 11 06:07 version.txt
(opencompass) ml@nwlciszbmg:~$ du -s !$
du -s .cache/huggingface/hub/
26326872 .cache/huggingface/hub/
(opencompass) ml@nwlciszbmg:~$ du -hs .cache/huggingface/hub/*
13G .cache/huggingface/hub/models--m-a-p--ChatMusician
13G .cache/huggingface/hub/models--m-a-p--ChatMusician-Base
4.0K .cache/huggingface/hub/version.txt
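Equivalently, huggingface_hub reports the same picture programmatically (a small sketch):

from huggingface_hub import scan_cache_dir

# Report each cached repo and its on-disk size, mirroring the du output above.
for repo in scan_cache_dir().repos:
    print(repo.repo_id, f"{repo.size_on_disk / 1e9:.1f} GB")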
I think something is hanging in the partitioner; is there an easy way to debug this?
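The only generic approach I know of is to make the process dump every thread's stack on a signal via faulthandler; a sketch, assuming a Unix host, though an OpenCompass-specific pointer would be welcome:

import faulthandler
import signal

# Put this near the top of run.py; then run kill -USR1 <pid> from another
# shell to print every thread's stack and see exactly where it hangs.
faulthandler.register(signal.SIGUSR1)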
Could you please provide the contents of your log? It can be found in output/WORK_DIR/logs. Is it empty?
Nothing found like output/WORK_DIR/logs.
I only have outputs:
ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name output -print
ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name outputs -print
./outputs
Attached, from outputs/default/20240411_124955/configs/20240411_124955.py:
20240411_124955.py.txt
Please try the --debug option in the CLI, e.g. python run.py configs/eval_chat_musician_7b.py --debug, and paste the stuck output here.
Besides, will the following code run successfully?
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'm-a-p/ChatMusician'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map='cuda').eval()

prompt = 'Hello, how are you?'
inputs = tokenizer(prompt, return_tensors='pt')
response = model.generate(input_ids=inputs['input_ids'].to(model.device))
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(response[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
Yes, your code runs successfully:
(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ python test.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.72s/it]
/home/ml/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py:1132: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
I'm doing well, thank you.
(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ conda list | grep transformers
sentence-transformers 2.2.2 pypi_0 pypi
transformers 4.39.3 pypi_0 pypi
Solution
After a marathon deep dive, here is the TL;DR. There was indeed a threading deadlock, as I suspected: a conflict between Intel's Math Kernel Library (MKL) and the GNU OpenMP library (libgomp.so.1) in the platform environment I am using (DigitalOcean Paperspace).
The steps in brief:
- Put in a fresh install of OpenCompass.
- Copy over the config and dataset for ChatMusician:
  configs/eval_chat_musician_7b.py
  configs/datasets/music_theory_bench
- conda install python-Levenshtein -y
- Pin the threading layer to the GNU version:
  export MKL_THREADING_LAYER=GNU
- Rerun:
  time python run.py configs/eval_chat_musician_7b.py
A few errors result, but it is running on a 45GB A6000 at Paperspace. The first run took approximately 5h30m. Thank you all for your patience and feedback in tracking this down.
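For anyone who would rather pin the workaround in code than in the shell, a minimal sketch (the assumption being that it runs before anything loads MKL):

import os

# Equivalent to the shell's export MKL_THREADING_LAYER=GNU; must be set
# before numpy/torch first initialise MKL, or the pin has no effect.
os.environ.setdefault("MKL_THREADING_LAYER", "GNU")

import torch  # MKL-backed imports are safe after the pin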