[Bug] open compass hangs when evaluating chat musician trained model - waiting for semaphore? about opencompass HOT 11 CLOSED

from opencompass.

Comments (11)

petergreis commented on June 12, 2024

After the first ctrl-c to kill the job:

0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]^C╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /notebooks/ChatMusician/eval/opencompass/run.py:4 in <module>                                    │
│                                                                                                  │
│   1 from opencompass.cli.main import main                                                        │
│   2                                                                                              │
│   3 if __name__ == '__main__':                                                                   │
│ ❱ 4 │   main()                                                                                   │
│   5                                                                                              │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/cli/main.py:309 in main                     │
│                                                                                                  │
│   306 │   │   │   for task in tasks:                                                             │
│   307 │   │   │   │   cfg.attack.dataset = task.datasets[0][0].abbr                              │
│   308 │   │   │   │   task.attack = cfg.attack                                                   │
│ ❱ 309 │   │   runner(tasks)                                                                      │
│   310 │                                                                                          │
│   311 │   # evaluate                                                                             │
│   312 │   if args.mode in ['all', 'eval']:                                                       │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/base.py:38 in __call__              │
│                                                                                                  │
│   35 │   │   │   tasks (list[dict]): A list of task configs, usually generated by                │
│   36 │   │   │   │   Partitioner.                                                                │
│   37 │   │   """                                                                                 │
│ ❱ 38 │   │   status = self.launch(tasks)                                                         │
│   39 │   │   status_list = list(status)  # change into list format                               │
│   40 │   │   self.summarize(status_list)                                                         │
│   41                                                                                             │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/local.py:148 in launch              │
│                                                                                                  │
│   145 │   │   │   │                                                                              │
│   146 │   │   │   │   return res                                                                 │
│   147 │   │   │                                                                                  │
│ ❱ 148 │   │   │   with ThreadPoolExecutor(                                                       │
│   149 │   │   │   │   │   max_workers=self.max_num_workers) as executor:                         │
│   150 │   │   │   │   status = executor.map(submit, tasks, range(len(tasks)))                    │
│   151                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py:649 in __exit__   │
│                                                                                                  │
│   646 │   │   return self                                                                        │
│   647 │                                                                                          │
│   648 │   def __exit__(self, exc_type, exc_val, exc_tb):                                         │
│ ❱ 649 │   │   self.shutdown(wait=True)                                                           │
│   650 │   │   return False                                                                       │
│   651                                                                                            │
│   652                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py:235 in shutdown  │
│                                                                                                  │
│   232 │   │   │   self._work_queue.put(None)                                                     │
│   233 │   │   if wait:                                                                           │
│   234 │   │   │   for t in self._threads:                                                        │
│ ❱ 235 │   │   │   │   t.join()                                                                   │
│   236 │   shutdown.__doc__ = _base.Executor.shutdown.__doc__                                     │
│   237                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1096 in join                     │
│                                                                                                  │
│   1093 │   │   │   raise RuntimeError("cannot join current thread")                              │
│   1094 │   │                                                                                     │
│   1095 │   │   if timeout is None:                                                               │
│ ❱ 1096 │   │   │   self._wait_for_tstate_lock()                                                  │
│   1097 │   │   else:                                                                             │
│   1098 │   │   │   # the behavior of a negative timeout isn't documented, but                    │
│   1099 │   │   │   # historically .join(timeout=x) for x<0 has acted as if timeout=0             │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1116 in _wait_for_tstate_lock    │
│                                                                                                  │
│   1113 │   │   │   return                                                                        │
│   1114 │   │                                                                                     │
│   1115 │   │   try:                                                                              │
│ ❱ 1116 │   │   │   if lock.acquire(block, timeout):                                              │
│   1117 │   │   │   │   lock.release()                                                            │
│   1118 │   │   │   │   self._stop()                                                              │
│   1119 │   │   except:                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyboardInterrupt

liushz commented on June 12, 2024

How long does the process stay stuck at

launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|

Also, can I see the contents of your model config, .models.chat_musician.hf_chat_musician?

petergreis commented on June 12, 2024

How long does the process stay stuck at

launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|

I am sorry, I don't understand what you're asking for here...

Also, can I see the contents of your model config, .models.chat_musician.hf_chat_musician?

from opencompass.models import HuggingFaceCausalLM

model_path_mapping = {
    "ChatMusician": "m-a-p/ChatMusician",
    "ChatMusician-Base": "m-a-p/ChatMusician-Base"
}

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr=model_abbr,
        path=model_path,
        tokenizer_path=model_path,
        tokenizer_kwargs=dict(
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False, # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
    for model_abbr, model_path in model_path_mapping.items()
]

petergreis commented on June 12, 2024

Just for reference, this has been stuck for 45 minutes:

(opencompass) ml@njdfaracvb:/notebooks/ChatMusician/eval$ date; python run.py configs/eval_chat_musician_7b.py
Wed Apr 10 16:13:49 UTC 2024
04/10 16:13:57 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
04/10 16:13:57 - OpenCompass - INFO - Partitioned into 122 tasks.
launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]

liushz commented on June 12, 2024

It seems like the system is in the process of caching the ChatMusician model on the backend. Please try again once the model has been fully downloaded and cached.

petergreis commented on June 12, 2024

The models are cached, but it is still just sitting there (I have just re-run the predict code to bring down the models again):

(opencompass) ml@nwlciszbmg:/notebooks/ChatMusician/eval$ cd
(opencompass) ml@nwlciszbmg:~$ ls -al .cache/huggingface/hub/
total 24
drwxrwxr-x 5 ml ml 4096 Apr 11 06:13 .
drwxrwxr-x 3 ml ml 4096 Apr 11 06:07 ..
drwxrwxr-x 4 ml ml 4096 Apr 11 06:13 .locks
drwxrwxr-x 6 ml ml 4096 Apr 11 06:19 models--m-a-p--ChatMusician
drwxrwxr-x 6 ml ml 4096 Apr 11 06:12 models--m-a-p--ChatMusician-Base
-rw-rw-r-- 1 ml ml    1 Apr 11 06:07 version.txt
(opencompass) ml@nwlciszbmg:~$ du -s !$
du -s .cache/huggingface/hub/
26326872        .cache/huggingface/hub/
(opencompass) ml@nwlciszbmg:~$ du -hs .cache/huggingface/hub/*
13G     .cache/huggingface/hub/models--m-a-p--ChatMusician
13G     .cache/huggingface/hub/models--m-a-p--ChatMusician-Base
4.0K    .cache/huggingface/hub/version.txt

petergreis commented on June 12, 2024

I think something is hanging in the partitioner; is there an easy way to debug this?
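
One generic way to see where a Python process is stuck, independent of OpenCompass, is the standard library's faulthandler module, which can periodically dump the stack of every thread. The sketch below is only an illustration under assumptions: the helper names install_hang_watchdog and cancel_hang_watchdog are hypothetical, the call would have to be added near the top of run.py by hand, and it only covers the process it runs in, so a hang inside a separately launched task process would not show up.

```python
import faulthandler
import sys


def install_hang_watchdog(timeout_s=600.0, file=None):
    """Dump the tracebacks of all threads every `timeout_s` seconds.

    `file` must be a real file object with a file descriptor;
    it defaults to sys.stderr (hypothetical helper, not part of OpenCompass).
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def cancel_hang_watchdog():
    """Stop the periodic dumps installed by install_hang_watchdog()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling install_hang_watchdog(60) before main() would print a traceback for each thread once a minute, which makes it easier to tell a slow download from a blocked lock.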

liushz commented on June 12, 2024

Could you please provide the contents of your log? It can be found in output/WORK_DIR/logs. Is it empty?

petergreis commented on June 12, 2024

Nothing found like output/WORK_DIR/logs.

I only have outputs:

ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name output -print
ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name outputs -print
./outputs

20240411_124955.py.txt, from outputs/default/20240411_124955/configs/20240411_124955.py

Leymore commented on June 12, 2024

Please try the --debug option in the CLI, e.g. python run.py configs/eval_chat_musician_7b.py --debug, and paste the output from the point where it gets stuck here.

Also, does the following code run successfully?

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'm-a-p/ChatMusician'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map='cuda').eval()
prompt = 'Hello, how are you?'
inputs = tokenizer(prompt, return_tensors='pt')
response = model.generate(input_ids=inputs['input_ids'].to(model.device),)
response = tokenizer.decode(response[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

petergreis commented on June 12, 2024

Yes, your code runs successfully:

(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ python test.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.72s/it]
/home/ml/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py:1132: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(


I'm doing well, thank you.


(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ conda list | grep transformers
sentence-transformers     2.2.2                    pypi_0    pypi
transformers              4.39.3                   pypi_0    pypi

Solution

After a marathon deep dive, here is the TL;DR version. There was indeed a threading deadlock, as I suspected. It comes from a conflict between Intel's Math Kernel Library (MKL) and the GNU OpenMP library (libgomp.so.1) in the platform environment I am using, DigitalOcean Paperspace.

The steps in brief:

1. Put in a fresh install of OpenCompass.
2. Copy over the config and dataset for ChatMusician:
   configs/eval_chat_musician_7b.py
   configs/datasets/music_theory_bench
3. Install python-Levenshtein: conda install python-Levenshtein -y
4. Pin the threading model to use the GNU version: export MKL_THREADING_LAYER=GNU
5. Rerun: time python run.py configs/eval_chat_musician_7b.py
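
For anyone who would rather not depend on the shell environment, the same pin can be applied from Python. This is a minimal sketch, not something I have run against OpenCompass: the helper name pin_mkl_threading_layer is hypothetical, and it assumes the variable is set before numpy or torch is first imported and that any task subprocesses inherit the parent's environment.

```python
import os


def pin_mkl_threading_layer(layer="GNU"):
    """Set MKL_THREADING_LAYER for this process and any children it spawns.

    Hypothetical helper; it must run before numpy/torch are imported,
    because MKL reads the variable when it is first loaded.
    """
    os.environ["MKL_THREADING_LAYER"] = layer
    return os.environ["MKL_THREADING_LAYER"]


# Equivalent to `export MKL_THREADING_LAYER=GNU` in the shell.
pin_mkl_threading_layer()
```

Placing this at the very top of run.py, above the opencompass import, should have the same effect as the export in the steps above.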

A few errors result, but it is running on a 45GB A6000 at Paperspace. The first run took approximately 5h30m. Thank you all for your patience and feedback in tracking this down.