
Comments (13)

RobertLou commented on August 22, 2024

Thanks. I downloaded it manually from meta.com and converted it into HF format using convert_llama_weights_to_hf.py, and got the following files:

root@d7b9ced7ced8:/workspace/DistServe# ls -lt ../lama2-7b-hf/
total 13163352
-rw-r--r-- 1 root root 23950 Aug 6 02:00 model.safetensors.index.json
-rw-r--r-- 1 root root 3590488816 Aug 6 02:00 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 4947390880 Aug 6 02:00 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 4938985352 Aug 6 02:00 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 659 Aug 6 02:00 config.json
-rw-r--r-- 1 root root 111 Aug 6 02:00 generation_config.json
-rw-r--r-- 1 root root 1842622 Aug 6 01:59 tokenizer.json
-rw-r--r-- 1 root root 414 Aug 6 01:59 special_tokens_map.json
-rw-r--r-- 1 root root 499723 Aug 6 01:59 tokenizer.model
-rw-r--r-- 1 root root 960 Aug 6 01:59 tokenizer_config.json

But when I start it with:
$ python distserve/api_server/distserve_api_server.py --model ../lama2-7b-hf/

it reports another issue:

(ParaWorker pid=30741) INFO 02:18:28 runtime peak memory: 12.706 GB
(ParaWorker pid=30741) INFO 02:18:28 total GPU memory: 44.521 GB
(ParaWorker pid=30741) INFO 02:18:28 kv cache size for one token: 0.50000 MB
(ParaWorker pid=30741) INFO 02:18:28 num_gpu_blocks: 3502
(ParaWorker pid=30741) INFO 02:18:28 num_cpu_blocks: 2048
(ParaWorker pid=30742) Gpt<T>::load() - ../lama2-7b-hf/decoder.embed_tokens.weight.pt not found

I ran into the same problem before. It happens because DistServe needs to convert the model weights into its own format first. I changed distserve/downloader/downloader.py to solve it; you can replace the corresponding part of that file with the code below:

        if is_local:
            if model_name_or_path[-1] == '/':
                # Local HF checkpoint: convert it into DistServe's format
                # and cache the result under DISTSERVE_CACHE
                allow_patterns = "*.bin"
                hf_files = os.path.join(model_name_or_path, allow_patterns)
                cache_dir = DISTSERVE_CACHE
                storage_folder = \
                    os.path.join(cache_dir,
                                 repo_folder_name(repo_id=model_name_or_path)) + '/'
                done_file = os.path.join(storage_folder, "done")
                if os.path.exists(done_file):
                    logger.info(f"Found cached model weights in {storage_folder}.")
                    return storage_folder

                # Convert the weights, then drop an empty "done" marker so
                # the conversion is skipped on subsequent runs
                convert_weights(hf_files, storage_folder, dtype, model)
                open(done_file, 'w').close()
                return storage_folder
            else:
                return model_name_or_path + '/'
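The pattern in that snippet is: run the expensive conversion once, then create an empty "done" marker file so later runs reuse the cached result. A standalone sketch of just that pattern (the folder names and the convert callback here are illustrative, not DistServe's actual code):

```python
import os
import tempfile

def get_or_convert(storage_folder, convert):
    """Run `convert` once per folder; skip it when a 'done' marker exists."""
    done_file = os.path.join(storage_folder, "done")
    if os.path.exists(done_file):
        return storage_folder              # cache hit: conversion already done
    os.makedirs(storage_folder, exist_ok=True)
    convert(storage_folder)                # the expensive one-time step
    open(done_file, "w").close()           # drop the empty marker
    return storage_folder

calls = []
with tempfile.TemporaryDirectory() as cache:
    folder = os.path.join(cache, "llama2-7b")
    get_or_convert(folder, calls.append)
    get_or_convert(folder, calls.append)   # second call is a cache hit
print(len(calls))  # → 1
```

Note the marker is only written after the conversion finishes, so an interrupted conversion will be retried on the next run.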

from distserve.

RobertLou commented on August 22, 2024

Hi Robert, thanks for your great help.

Is it necessary to download all the files? It will take a lot of time to download them all. And if I want to run examples/offline.py, how do I specify the local dir in the code?

I'm not sure, but according to the code, the *.bin files are necessary. You can use --model to specify the local dir, like

python offline.py --model ../Llama2-7b-hf/

By the way, if you have any questions, checking the code is the fastest way. You can find the '--model' arg in offline.py, which lets you specify the local dir.
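For reference, this is roughly how such a --model flag is wired up with argparse (the parser below is an illustrative sketch; check offline.py for DistServe's actual argument definitions):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="offline inference example")
    # --model accepts either a hub model name or a local directory path
    parser.add_argument("--model", type=str, required=True,
                        help="model name or local path, e.g. ../Llama2-7b-hf/")
    return parser

args = build_parser().parse_args(["--model", "../Llama2-7b-hf/"])
print(args.model)  # → ../Llama2-7b-hf/
```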


William12github commented on August 22, 2024

Hi Robert,
Thank you for your enthusiastic help and good advice!


William12github commented on August 22, 2024

Is there an alternative way to provide the model, other than downloading it online directly?


RobertLou commented on August 22, 2024

You can use this website to download the model: https://modelscope.cn/my/overview


William12github commented on August 22, 2024

Thanks. I downloaded it manually from meta.com and converted it into HF format using convert_llama_weights_to_hf.py, and got the following files:

root@d7b9ced7ced8:/workspace/DistServe# ls -lt ../lama2-7b-hf/
total 13163352
-rw-r--r-- 1 root root 23950 Aug 6 02:00 model.safetensors.index.json
-rw-r--r-- 1 root root 3590488816 Aug 6 02:00 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 4947390880 Aug 6 02:00 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 4938985352 Aug 6 02:00 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 659 Aug 6 02:00 config.json
-rw-r--r-- 1 root root 111 Aug 6 02:00 generation_config.json
-rw-r--r-- 1 root root 1842622 Aug 6 01:59 tokenizer.json
-rw-r--r-- 1 root root 414 Aug 6 01:59 special_tokens_map.json
-rw-r--r-- 1 root root 499723 Aug 6 01:59 tokenizer.model
-rw-r--r-- 1 root root 960 Aug 6 01:59 tokenizer_config.json

But when I start it with:
$ python distserve/api_server/distserve_api_server.py --model ../lama2-7b-hf/

it reports another issue:

(ParaWorker pid=30741) INFO 02:18:28 runtime peak memory: 12.706 GB
(ParaWorker pid=30741) INFO 02:18:28 total GPU memory: 44.521 GB
(ParaWorker pid=30741) INFO 02:18:28 kv cache size for one token: 0.50000 MB
(ParaWorker pid=30741) INFO 02:18:28 num_gpu_blocks: 3502
(ParaWorker pid=30741) INFO 02:18:28 num_cpu_blocks: 2048
(ParaWorker pid=30742) Gpt<T>::load() - ../lama2-7b-hf/decoder.embed_tokens.weight.pt not found
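That error means the converted per-tensor .pt files are missing from the model directory. A quick, hedged sketch for sanity-checking a directory before launching (the exact file set DistServe expects comes from its converter; decoder.embed_tokens.weight.pt is one example):

```python
import glob
import os
import tempfile

def converted_weight_files(model_dir):
    """List the per-tensor .pt files a DistServe worker would try to load.

    An empty list means the HF checkpoint (.bin/.safetensors) has not
    been converted into DistServe's format yet.
    """
    return sorted(glob.glob(os.path.join(model_dir, "*.pt")))

# an HF-format folder holding only .safetensors shards has no .pt files
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "model-00001-of-00003.safetensors"), "w").close()
    print(converted_weight_files(d))  # → []
```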


William12github commented on August 22, 2024

But I don't have any .bin files in the folder; I only got these files:

root@d7b9ced7ced8:/workspace/DistServe# ls -lt ../lama2-7b-hf/
total 13163352
-rw-r--r-- 1 root root 23950 Aug 6 02:00 model.safetensors.index.json
-rw-r--r-- 1 root root 3590488816 Aug 6 02:00 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 4947390880 Aug 6 02:00 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 4938985352 Aug 6 02:00 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 659 Aug 6 02:00 config.json
-rw-r--r-- 1 root root 111 Aug 6 02:00 generation_config.json
-rw-r--r-- 1 root root 1842622 Aug 6 01:59 tokenizer.json
-rw-r--r-- 1 root root 414 Aug 6 01:59 special_tokens_map.json
-rw-r--r-- 1 root root 499723 Aug 6 01:59 tokenizer.model
-rw-r--r-- 1 root root 960 Aug 6 01:59 tokenizer_config.json


RobertLou commented on August 22, 2024

But I don't have any .bin files in the folder; I only got these files:

root@d7b9ced7ced8:/workspace/DistServe# ls -lt ../lama2-7b-hf/
total 13163352
-rw-r--r-- 1 root root 23950 Aug 6 02:00 model.safetensors.index.json
-rw-r--r-- 1 root root 3590488816 Aug 6 02:00 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 4947390880 Aug 6 02:00 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 4938985352 Aug 6 02:00 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 659 Aug 6 02:00 config.json
-rw-r--r-- 1 root root 111 Aug 6 02:00 generation_config.json
-rw-r--r-- 1 root root 1842622 Aug 6 01:59 tokenizer.json
-rw-r--r-- 1 root root 414 Aug 6 01:59 special_tokens_map.json
-rw-r--r-- 1 root root 499723 Aug 6 01:59 tokenizer.model
-rw-r--r-- 1 root root 960 Aug 6 01:59 tokenizer_config.json

https://modelscope.cn/models/shakechen/Llama-2-7b-hf/files has the *.bin files; maybe download it from there again?


William12github commented on August 22, 2024

Hi Robert,
Thanks for your great help.

Is it necessary to download all the files? It will take a lot of time to download them all.
And if I want to run examples/offline.py, how do I specify the local dir in the code?


William12github commented on August 22, 2024

I am able to run examples/offline.py and got the following result:

INFO 13:00:01 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 13:00:01 (decoding) GPU blocks: 3 / 941 (0.32%) used, (0 swapping out)
INFO 13:00:01 (decoding) 0 unaccepted, 0 waiting, 2 processing
INFO 13:00:02 (context) 0 waiting, 0 finished but unaccepted, 0 blocks occupied by on-the-fly requests
INFO 13:00:02 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 13:00:02 (decoding) GPU blocks: 2 / 941 (0.21%) used, (0 swapping out)
INFO 13:00:02 (decoding) 0 unaccepted, 0 waiting, 1 processing
INFO 13:00:03 (context) 0 waiting, 0 finished but unaccepted, 0 blocks occupied by on-the-fly requests
INFO 13:00:03 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 13:00:03 (decoding) GPU blocks: 3 / 941 (0.32%) used, (0 swapping out)
INFO 13:00:03 (decoding) 0 unaccepted, 0 waiting, 1 processing
Prompt: 'Life blooms like a flower. Far away or by the road. Waiting', Generated text: for the right time to blo om . Ћ
(10 tokens generated).
Prompt: 'A quick brown fox', Generated text: j umps over the lazy dog .
(8 tokens generated).
Prompt: 'Artificial intelligence is', Generated text: a hot topic in the te ch world . The term is thrown around a lot , but what does it really mean ?
(25 tokens generated).
Prompt: 'To be or not to be,', Generated text: that is the question .
(6 tokens generated).
Prompt: 'one two three four', Generated text: five six seven eight nine ten eleven eleven twelve th ir teen fifteen six teen sevent een eigh teen nin ete en twenty one twenty - one twenty - two twenty - three twenty - four twenty - five twenty - six twenty - se ven twenty - one twenty - two
(53 tokens generated).
(ParaWorker pid=5130) INFO 13:00:00 (worker context.#0) model /workspace/Llama-2-7b-hf/ loaded
(ParaWorker pid=5130) INFO 13:00:00 runtime peak memory: 12.497 GB
(ParaWorker pid=5130) INFO 13:00:00 total GPU memory: 22.059 GB
(ParaWorker pid=5130) INFO 13:00:00 kv cache size for one token: 0.50000 MB
(ParaWorker pid=5130) INFO 13:00:00 num_gpu_blocks: 941
(ParaWorker pid=5130) INFO 13:00:00 num_cpu_blocks: 128
root@5a65df8f9a43:/workspace/DistServe#

But I'm still confused:

  1. How do I measure the performance of this test?
  2. If I want to compare it with a colocated solution (both the prefill and decoding phases running on a single GPU), how do I launch that test?


Youhe-Jiang commented on August 22, 2024

Hi mate, have you ever run into this problem:

(ParaWorker pid=1955026) Error: Peer-to-peer access is unsupported on this platform.
(ParaWorker pid=1955026) In the current version of distserve, it is necessary to use a platform that supports GPU P2P access.
(ParaWorker pid=1955026) Exiting...

I checked P2P access, and it should actually be supported...

Thank you for any help!


William12github commented on August 22, 2024

You can use the command below to check whether your system supports P2P:

    $ nvidia-smi topo -p2p wr
         GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
    GPU0  X   OK   OK   OK   OK   OK   OK   OK
    GPU1 OK    X   OK   OK   OK   OK   OK   OK
    GPU2 OK   OK    X   OK   OK   OK   OK   OK
    GPU3 OK   OK   OK    X   OK   OK   OK   OK
    GPU4 OK   OK   OK   OK    X   OK   OK   OK
    GPU5 OK   OK   OK   OK   OK    X   OK   OK
    GPU6 OK   OK   OK   OK   OK   OK    X   OK
    GPU7 OK   OK   OK   OK   OK   OK   OK    X

    Legend:

    X   = Self
    OK  = Status Ok
    CNS = Chipset not supported
    GNS = GPU not supported
    TNS = Topology not supported
    NS  = Not supported
    U   = Unknown
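If you want to check the matrix programmatically, here is a small pure-Python sketch that parses output of this shape (the sample string below is illustrative; in practice you would feed it the matrix portion of the real `nvidia-smi topo -p2p` output, without the legend):

```python
def parse_p2p_matrix(text):
    """Parse an `nvidia-smi topo -p2p` style matrix into {(src, dst): status}."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header = rows[0]                       # column GPU labels
    result = {}
    for row in rows[1:]:
        src, statuses = row[0], row[1:]
        for dst, status in zip(header, statuses):
            result[(src, dst)] = status
    return result

sample = """\
      GPU0 GPU1
GPU0  X    OK
GPU1  OK   X
"""
matrix = parse_p2p_matrix(sample)
print(matrix[("GPU0", "GPU1")])  # → OK
```

Any pair whose status is not "OK" (or "X" on the diagonal) indicates a link where P2P is unsupported.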


