
Comments (11)

petergreis commented on June 12, 2024

After the first ctrl-c to kill the job:

0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]^C╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /notebooks/ChatMusician/eval/opencompass/run.py:4 in <module>                                    │
│                                                                                                  │
│   1 from opencompass.cli.main import main                                                        │
│   2                                                                                              │
│   3 if __name__ == '__main__':                                                                   │
│ ❱ 4 │   main()                                                                                   │
│   5                                                                                              │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/cli/main.py:309 in main                     │
│                                                                                                  │
│   306 │   │   │   for task in tasks:                                                             │
│   307 │   │   │   │   cfg.attack.dataset = task.datasets[0][0].abbr                              │
│   308 │   │   │   │   task.attack = cfg.attack                                                   │
│ ❱ 309 │   │   runner(tasks)                                                                      │
│   310 │                                                                                          │
│   311 │   # evaluate                                                                             │
│   312 │   if args.mode in ['all', 'eval']:                                                       │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/base.py:38 in __call__              │
│                                                                                                  │
│   35 │   │   │   tasks (list[dict]): A list of task configs, usually generated by                │
│   36 │   │   │   │   Partitioner.                                                                │
│   37 │   │   """                                                                                 │
│ ❱ 38 │   │   status = self.launch(tasks)                                                         │
│   39 │   │   status_list = list(status)  # change into list format                               │
│   40 │   │   self.summarize(status_list)                                                         │
│   41                                                                                             │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/local.py:148 in launch              │
│                                                                                                  │
│   145 │   │   │   │                                                                              │
│   146 │   │   │   │   return res                                                                 │
│   147 │   │   │                                                                                  │
│ ❱ 148 │   │   │   with ThreadPoolExecutor(                                                       │
│   149 │   │   │   │   │   max_workers=self.max_num_workers) as executor:                         │
│   150 │   │   │   │   status = executor.map(submit, tasks, range(len(tasks)))                    │
│   151                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py:649 in __exit__   │
│                                                                                                  │
│   646 │   │   return self                                                                        │
│   647 │                                                                                          │
│   648 │   def __exit__(self, exc_type, exc_val, exc_tb):                                         │
│ ❱ 649 │   │   self.shutdown(wait=True)                                                           │
│   650 │   │   return False                                                                       │
│   651                                                                                            │
│   652                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py:235 in shutdown  │
│                                                                                                  │
│   232 │   │   │   self._work_queue.put(None)                                                     │
│   233 │   │   if wait:                                                                           │
│   234 │   │   │   for t in self._threads:                                                        │
│ ❱ 235 │   │   │   │   t.join()                                                                   │
│   236 │   shutdown.__doc__ = _base.Executor.shutdown.__doc__                                     │
│   237                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1096 in join                     │
│                                                                                                  │
│   1093 │   │   │   raise RuntimeError("cannot join current thread")                              │
│   1094 │   │                                                                                     │
│   1095 │   │   if timeout is None:                                                               │
│ ❱ 1096 │   │   │   self._wait_for_tstate_lock()                                                  │
│   1097 │   │   else:                                                                             │
│   1098 │   │   │   # the behavior of a negative timeout isn't documented, but                    │
│   1099 │   │   │   # historically .join(timeout=x) for x<0 has acted as if timeout=0             │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1116 in _wait_for_tstate_lock    │
│                                                                                                  │
│   1113 │   │   │   return                                                                        │
│   1114 │   │                                                                                     │
│   1115 │   │   try:                                                                              │
│ ❱ 1116 │   │   │   if lock.acquire(block, timeout):                                              │
│   1117 │   │   │   │   lock.release()                                                            │
│   1118 │   │   │   │   self._stop()                                                              │
│   1119 │   │   except:                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyboardInterrupt

from opencompass.

liushz commented on June 12, 2024

How long does the process stay stuck at

launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%| 

Also, can I see the contents of your model config, .models.chat_musician.hf_chat_musician?

petergreis commented on June 12, 2024

How long does the process stay stuck at

launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%| 

I am sorry, I don't understand what you're asking for here...

Also, can I see the contents of your model config, .models.chat_musician.hf_chat_musician?

from opencompass.models import HuggingFaceCausalLM

model_path_mapping = {
    "ChatMusician": "m-a-p/ChatMusician",
    "ChatMusician-Base": "m-a-p/ChatMusician-Base"
}

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr=model_abbr,
        path=model_path,
        tokenizer_path=model_path,
        tokenizer_kwargs=dict(
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False, # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
    for model_abbr, model_path in model_path_mapping.items()
]

petergreis commented on June 12, 2024

Just for reference, this has been stuck for 45 minutes:

(opencompass) ml@njdfaracvb:/notebooks/ChatMusician/eval$ date; python run.py configs/eval_chat_musician_7b.py
Wed Apr 10 16:13:49 UTC 2024
04/10 16:13:57 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
04/10 16:13:57 - OpenCompass - INFO - Partitioned into 122 tasks.
launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]

liushz commented on June 12, 2024

It seems like the system is in the process of caching the ChatMusician model on the backend. Please try again once the model has been fully downloaded and cached.

petergreis commented on June 12, 2024

The models are cached, but it is still just sitting there (I have just re-run the predict code to bring down the models again):

(opencompass) ml@nwlciszbmg:/notebooks/ChatMusician/eval$ cd
(opencompass) ml@nwlciszbmg:~$ ls -al .cache/huggingface/hub/
total 24
drwxrwxr-x 5 ml ml 4096 Apr 11 06:13 .
drwxrwxr-x 3 ml ml 4096 Apr 11 06:07 ..
drwxrwxr-x 4 ml ml 4096 Apr 11 06:13 .locks
drwxrwxr-x 6 ml ml 4096 Apr 11 06:19 models--m-a-p--ChatMusician
drwxrwxr-x 6 ml ml 4096 Apr 11 06:12 models--m-a-p--ChatMusician-Base
-rw-rw-r-- 1 ml ml    1 Apr 11 06:07 version.txt
(opencompass) ml@nwlciszbmg:~$ du -s !$
du -s .cache/huggingface/hub/
26326872        .cache/huggingface/hub/
(opencompass) ml@nwlciszbmg:~$ du -hs .cache/huggingface/hub/*
13G     .cache/huggingface/hub/models--m-a-p--ChatMusician
13G     .cache/huggingface/hub/models--m-a-p--ChatMusician-Base
4.0K    .cache/huggingface/hub/version.txt
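
The same check can be scripted. This is only a minimal sketch: it assumes the hub cache uses the `models--<org>--<name>` directory layout visible in the listing above, and the helper names `repo_cache_dir` and `cached_size_bytes` are made up for illustration. If `HF_HOME` or a similar variable relocates the cache, pass that directory instead.

```python
from pathlib import Path


def repo_cache_dir(repo_id, hub_dir):
    """Return the directory where a model repo would be cached.

    Assumes the 'models--<org>--<name>' layout seen in the listing above.
    """
    return Path(hub_dir) / ("models--" + repo_id.replace("/", "--"))


def cached_size_bytes(repo_id, hub_dir):
    """Return total bytes stored for repo_id, or None if it is not cached.

    Symlinks are skipped so that files linked from snapshot directories
    are not counted a second time.
    """
    root = repo_cache_dir(repo_id, hub_dir)
    if not root.is_dir():
        return None
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


# Example call (path taken from the listing above):
# cached_size_bytes("m-a-p/ChatMusician", Path.home() / ".cache/huggingface/hub")
```

A result of `None`, or a size far below the roughly 13G shown above, would suggest the download is incomplete.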

petergreis commented on June 12, 2024

I think something is hanging in the partitioner; is there an easy way to debug this?
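
One generic way to see where a hung Python process is sitting, independent of OpenCompass, is the standard-library `faulthandler` module. The sketch below is an assumption about how it could be wired in (for example near the top of run.py); `arm_hang_dump` is a hypothetical helper name, and the 300-second timeout in the usage comment is arbitrary.

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s, file=sys.stderr):
    """Dump the stack of every thread if the process is still running
    after timeout_s seconds, repeating every timeout_s seconds.

    Uses only the standard library; the output goes to `file`.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disarm_hang_dump():
    """Cancel the pending dump once the work has finished normally."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage at the top of the entry script:
# arm_hang_dump(300)
# main()
# disarm_hang_dump()
```

Note that this only covers the process it is called in. If the work is launched as a separate subprocess, the dump of the parent will just show it waiting on that child.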

liushz commented on June 12, 2024

Could you please provide the contents of your log? It can be found in output/WORK_DIR/logs. Is it empty?

petergreis commented on June 12, 2024

I found nothing like output/WORK_DIR/logs.

I only have outputs:

ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name output -print
ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name outputs -print
./outputs

From outputs/default/20240411_124955/configs/20240411_124955.py (attached as 20240411_124955.py.txt)



Leymore commented on June 12, 2024

Please try the --debug option in the CLI, e.g. python run.py configs/eval_chat_musician_7b.py --debug, and paste the output from the point where it gets stuck here.

Also, does the following code run successfully?

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'm-a-p/ChatMusician'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map='cuda').eval()
prompt = 'Hello, how are you?'
inputs = tokenizer(prompt, return_tensors='pt')
response = model.generate(input_ids=inputs['input_ids'].to(model.device),)
response = tokenizer.decode(response[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

petergreis commented on June 12, 2024

Yes, your code runs successfully:

(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ python test.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.72s/it]
/home/ml/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py:1132: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(


I'm doing well, thank you.


(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ conda list | grep transformers
sentence-transformers     2.2.2                    pypi_0    pypi
transformers              4.39.3                   pypi_0    pypi

Solution

After a marathon deep dive, here is the TL;DR version. There was indeed a threading deadlock, as I suspected. It comes from a conflict between Intel's Math Kernel Library (MKL) and the GNU OpenMP library (libgomp.so.1) in the platform environment I am using, Digital Ocean Paperspace.

The steps in brief:

  • Do a fresh install of OpenCompass
  • Copy over the config and dataset for ChatMusician:
    configs/eval_chat_musician_7b.py
    configs/datasets/music_theory_bench
  • Install python-Levenshtein: conda install python-Levenshtein -y
  • Pin the MKL threading layer to the GNU version:
    export MKL_THREADING_LAYER=GNU
  • Rerun: time python run.py configs/eval_chat_musician_7b.py
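
For anyone who prefers to set the variable from Python rather than the shell, the last two steps can be sketched roughly as below. This is an illustration only: `build_env` and `run_eval` are hypothetical helper names, and the script is assumed to be started from the eval directory so that run.py and the config path resolve.

```python
import os
import subprocess
import sys


def build_env(base=None):
    """Copy the environment and pin MKL's threading layer to GNU,
    mirroring `export MKL_THREADING_LAYER=GNU`."""
    env = dict(os.environ if base is None else base)
    env["MKL_THREADING_LAYER"] = "GNU"
    return env


def run_eval(config="configs/eval_chat_musician_7b.py", dry_run=True):
    """Build the rerun command; execute it only when dry_run is False."""
    cmd = [sys.executable, "run.py", config]
    env = build_env()
    if dry_run:
        return cmd, env
    # The variable must be in the child's environment before numpy/torch load.
    return subprocess.run(cmd, env=env, check=False).returncode


# Hypothetical usage from the eval directory:
# run_eval(dry_run=False)
```

Passing the variable through the child's environment means it is set before any numerical library is imported, which is the same effect the `export` has.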

A few errors still come up, but it is running on a 45GB A6000 at Paperspace. The first run took approximately 5h30m. Thank you all for your patience and feedback in tracking this down.
