Comments (11)
Hi @lekurile, the benchmark will proceed but hits some other errors when running on CPU. I'll check with the vLLM CPU engineers to investigate. I also submitted a PR adding a flag that allows starting the server from a separate command line:
#900
@delock - FYI. Created this issue so we can track and fix it. Please work with the folks assigned to this issue.
Hello @delock,
Thank you for raising this issue. I ran a local vllm benchmark with the microsoft/Phi-3-mini-4k-instruct model using the following code:
# Run benchmark
python ./run_benchmark.py \
--model microsoft/Phi-3-mini-4k-instruct \
--tp_size 1 \
--num_replicas 1 \
--max_ragged_batch_size 768 \
--mean_prompt_length 2600 \
--mean_max_new_tokens 60 \
--stream \
--backend vllm \
--overwrite_results

### Generate the plots
python ./src/plot_th_lat.py --data_dirs results_vllm/
echo "Find figures in ./plots/ and log outputs in ./results/"
I also had to add the "--trust-remote-code" argument to the vllm_cmd.
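For reference, here is a rough sketch of that change; the surrounding vllm_cmd entries below are illustrative placeholders, not the exact command list in the benchmark's server code:
# Illustrative sketch only: the real vllm_cmd in the benchmark differs.
vllm_cmd = (
    "python",
    "-m",
    "vllm.entrypoints.api_server",
    "--model", args.model,
    "--tensor-parallel-size", str(args.tp_size),
    "--trust-remote-code",  # needed for models such as Phi-3-mini-4k-instruct
)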
To reproduce the issue you show above, can you please provide a reproduction script so I can test on my end?
To answer your question:
Is it possible to run this script to benchmark a local API server? I'm thinking of running vllm serving in a separate command and using this benchmark to test the API server vllm started. That way I'd have better control over how the vllm server is started and could see all the error messages from the vllm server if it fails.
We can update the benchmarking script with an additional argument that takes existing local server information; when provided, the script will not stand up a new server, but will instead target the existing server.
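For example, a minimal sketch of what that could look like (the flag name and plumbing are hypothetical, not the final implementation):
import argparse

# Hypothetical sketch: when --server_url is given, skip standing up a new
# server and point the benchmark clients at the existing one instead.
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--server_url", default=None,
                    help="URL of an existing server; if set, no new server is started")
args = parser.parse_args()

if args.server_url is None:
    print(f"starting a new server for {args.model}")          # current behavior
else:
    print(f"targeting existing server at {args.server_url}")  # proposed behavior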
@awan-10 @lekurile Thanks for starting this thread. I hit this error when I tried to run this example on a Xeon server with CPU. I suspect it's a configuration issue. For now, I plan to modify the script to run the client code only and start the server from a separate command line, so I can see more error messages and get a better understanding.
Hi @lekurile
I can now start the server from a separate command line and run the benchmark against it with a reduced test size (max batch 128, avg prompt 128) to start with.
However, I hit the following error during post-processing. I suspect it's due to the transformers version. Which transformers version are you using? Mine is transformers==4.40.1:
Traceback (most recent call last):
File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/./run_benchmark.py", line 44, in <module>
run_benchmark()
File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/./run_benchmark.py", line 36, in run_benchmark
print_summary(client_args, response_details)
File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/utils.py", line 235, in print_summary
ps = get_summary(vars(args), response_details)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/postprocess_results.py", line 80, in get_summary
[
File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/postprocess_results.py", line 81, in <listcomp>
(len(get_tokenizer().tokenize(r.prompt)) + len(get_tokenizer().tokenize(r.generated_tokens)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 396, in tokenize
return self.encode_plus(text=text, text_pair=pair, add_special_tokens=add_special_tokens, **kwargs).tokens()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3037, in encode_plus
return self._encode_plus(
^^^^^^^^^^^^^^^^^^
File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
batched_output = self._batch_encode_plus(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
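For reference, the same TypeError can be reproduced by handing tokenize() a list instead of a string, which makes me think the streaming path stores generated_tokens as a token list. This is just my attempt at narrowing it down, not a confirmed root cause:
from transformers import AutoTokenizer

# Minimal illustration (assumes network access to download the tokenizer).
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct",
                                    trust_remote_code=True)
print(tok.tokenize("a short prompt"))          # OK: string input
tok.tokenize(["already", "split", "tokens"])   # raises the same TypeError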
Hi @delock,
I'm using transformers==4.40.1 as well.
After #895 was committed to the repo, I'm seeing the same error on my end as well.
File "/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
Can you please try detaching your repo HEAD to fab5d06, one commit prior, and running again? I'll look into this PR and see if we need to revert or not.
Thanks,
Lev
@delock, here's the PR fixing the tokens_per_sec metric to work for both the streaming and non-streaming cases: #897
You should be able to get past your error above with this PR, but I'm curious if you're seeing any failures still.
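For context, my understanding of the fix's shape: avoid re-tokenizing streaming output that is already a list of token strings. A rough sketch, illustrative rather than the literal diff in #897:
# Sketch only, not the literal #897 change: count generated tokens without
# calling tokenize() on input that is already a token list (streaming case).
def num_generated_tokens(tokenizer, generated_tokens):
    if isinstance(generated_tokens, list):            # streaming output
        return len(generated_tokens)
    return len(tokenizer.tokenize(generated_tokens))  # non-streaming text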
Yes, with the latest version the benchmark can move forward. I'll see whether it runs to completion.
Thanks @delock - can we close this issue for now?
Yes, this is no longer an issue, thanks!
Thanks!