Comments (12)
https://github.com/huggingface/transformers-bloom-inference/blob/abe365066fec6e03ce0ea2cc8136f2da1254e2ea/bloom-inference-server/ds_inference/grpc_server.py#L33
@cderinbogaz I hacked my way around it for now:
I pass the path of the locally downloaded weights and a checkpoint dict for the model I need, while keeping model="bigscience/bloom".
I know this is not the most elegant way to do it :(
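A rough sketch of that workaround, for anyone hitting the same wall. The checkpoint-dict layout, shard filenames, paths, and the `bloom_checkpoint_dict` / `deploy_local_bloom` helpers below are my assumptions, not the exact code from the comment:

```python
import json
import os

def bloom_checkpoint_dict(num_shards=8):
    # Hypothetical DS-inference checkpoint descriptor for sharded BLOOM
    # weights; the shard filenames here are illustrative only.
    return {
        "type": "BLOOM",
        "version": 1.0,
        "checkpoints": [f"bloom-mp_{i:02d}.pt" for i in range(num_shards)],
    }

def deploy_local_bloom(weights_dir):
    # Needs GPUs and deepspeed-mii installed; not executed here.
    import mii

    # Drop the checkpoint descriptor next to the downloaded weights.
    with open(os.path.join(weights_dir, "checkpoints.json"), "w") as f:
        json.dump(bloom_checkpoint_dict(), f)

    mii.deploy(task="text-generation",
               model="bigscience/bloom",      # keeps MII's model/task check happy
               deployment_name="bloom_deployment",
               model_path=weights_dir,        # locally downloaded weights
               mii_config={"dtype": "fp16", "tensor_parallel": 8})
```

The point is just to satisfy MII's supported-model check with the canonical model name while the actual weights come from `model_path`.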
from deepspeed-mii.
@mrwyattii I believe your commit yesterday has fixed this?
Let me know.
I am closely watching this repo :)
It seems there is a check in place that prevents the new weights from working with MII.
Any updates on this?
@jeffra @RezaYazdaniAminabadi
Also, the same thing happens with bigscience/bloom-350m for some reason.
I just ran the example in the README and got an
AssertionError: text-generation only supports [.....]
error.
Thanks for the response @mayank31398!
I think it's a neat solution :)
```
    weight_quantizer.quantize(transpose(sd[0][prefix + 'self_attention.query_key_value.' + 'weight'])))
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/module_inject/replace_module.py", line 100, in copy
    dim=self.in_dim)[self.gpu_index].to(
```
This is the error I got today while trying int8 inference with bloom.
Hi @TahaBinhuraib I think MII doesn't support int8 models.
Can you try vanilla DS-inference?
https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-server
You can run it via the CLI or deploy a generation server as described in the instructions ^^.
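For reference, a minimal sketch of what loading BLOOM in int8 with vanilla DS-inference could look like. The tensor-parallel degree, the checkpoint JSON path, and the helper names are assumptions; see the linked repo for the real implementation:

```python
def ds_inference_args(checkpoint_json, world_size=8):
    # Kwargs for deepspeed.init_inference on the int8 kernel-injection path.
    # The tensor-parallel degree and checkpoint path are assumptions.
    return {
        "mp_size": world_size,               # tensor-parallel degree
        "checkpoint": checkpoint_json,       # JSON describing the sharded weights
        "replace_with_kernel_inject": True,  # use DeepSpeed's fused kernels
    }

def load_int8_bloom(checkpoint_json):
    # Needs GPUs plus deepspeed and transformers installed; sketch only.
    import torch
    import deepspeed
    from transformers import AutoConfig, AutoModelForCausalLM

    # Build the model on the meta device so no full-precision weights
    # are allocated before DeepSpeed loads the sharded checkpoint.
    config = AutoConfig.from_pretrained("bigscience/bloom")
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = AutoModelForCausalLM.from_config(config)

    return deepspeed.init_inference(model,
                                    dtype=torch.int8,
                                    **ds_inference_args(checkpoint_json))
```

This bypasses MII entirely, which is why it sidesteps the supported-model check mentioned above.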
The fp16 Bloom weights are now supported. Int8 models are also supported, but currently the DeepSpeed sharded int8 weights for the Bloom model will throw an error. I'm working on a fix for this and automatic loading of the sharded weights (so you don't have to manually download the weights and define the checkpoint file list). Those changes will come in #69 and likely another PR.
Thanks @mrwyattii
Thanks @mrwyattii can't wait!
@mayank31398 @TahaBinhuraib I finally found the time to fix #69 so that it works with int8. You no longer need to download the sharded checkpoint files separately and MII will handle this for you (but it will take a while as the checkpoints are quite large). I just confirmed that it's working on my side, but if you have the opportunity to test it out, please do. The script I used:
```python
import mii

mii_configs = {
    "dtype": "int8",
    "tensor_parallel": 4,
    "port_number": 50950,
}
name = "microsoft/bloom-deepspeed-inference-int8"
mii.deploy(task='text-generation',
           model=name,
           deployment_name="bloom_deployment",
           model_path="/data/bloom-ckpts",
           mii_config=mii_configs)
```
You will probably want to change the model_path parameter if you run this on your local machine.