Comments (30)
Hello @Shahrukh-Alethea! We have successfully built our demo app using Gradio. Could you please provide more error information so we can see what happens?
from glm-130b.
@Sengxian
I am trying this approach to integrate GLM model with API.
- First, I use the generate.sh script to load the model.
- I start a FastAPI server in a separate terminal.
- I start an RQ worker. RQ is a Redis-based queue solution I am using to pass request data to the GLM model.
- I have also created a separate method to call generate_continually whenever a job is registered. I am using non-interactive mode and passing a named file.
- I get the error message `model` not defined. So I changed `model` and `tokenizer` to global variables, but I still get the same error that `model` and `tokenizer` are not defined. I also logged both variables and their value is null.
As an alternative approach, I decided to start the server from the same script as the model, i.e. generate.py. Since the script executes 8 times, the server crashes because the port is already occupied.
Command to run the model: `torchrun --nproc_per_node $MP_SIZE ${ARGS}`, where `$MP_SIZE` is 8.
from glm-130b.
For your alternative approach, you can start the server only on GPU 0 (by checking `torch.distributed.get_rank()`) and then use `torch.distributed.broadcast_object_list` to broadcast the information to the other GPUs.
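A minimal sketch of that pattern, assuming `run_inference` is a hypothetical stand-in for whatever actually runs generation in your script:

```python
import torch.distributed as dist

def handle_request(raw_text, model, tokenizer):
    # Called on rank 0 from the API callback: tell the other ranks what to
    # generate, then join the model-parallel forward pass yourself.
    dist.broadcast_object_list([raw_text], src=0)
    return run_inference(model, tokenizer, raw_text)  # hypothetical

def follower_loop(model, tokenizer):
    # Ranks != 0 block here, waiting for rank 0 to broadcast each query.
    while True:
        payload = [None]  # filled in place on non-source ranks
        dist.broadcast_object_list(payload, src=0)
        run_inference(model, tokenizer, payload[0])  # hypothetical
```

Rank 0 starts the HTTP server and calls `handle_request` per request, while every other rank sits in `follower_loop`.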
from glm-130b.
Thanks for the tip, it is working now. Just another question: is there a method similar to get_tokenizer to get the current model?
from glm-130b.
I'm afraid not, but I'm guessing you could use global variables to achieve something similar. Here is a possible example:
```python
MODEL = None

def set_model(model):
    global MODEL
    MODEL = model

def get_model():
    global MODEL
    return MODEL
```
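For example (hypothetical usage; `initialize_model_and_tokenizer` stands in for whatever your initialization code actually returns), you would call `set_model` once after loading and `get_model` inside the API callback:

```python
# After initialization in main() (hypothetical initializer name):
model, tokenizer = initialize_model_and_tokenizer(args)
set_model(model)

# Later, inside an API callback:
model = get_model()
```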
from glm-130b.
@Sengxian Thanks for the support.
from glm-130b.
How can I replace interactive mode with something like a Flask API? I converted GLM-10B's generate.py into an API, but the prompt "Please Input Query (stop to exit) >>>" seems to come from SwissArmyTransformer, and I am not sure how to turn that into an API.
from glm-130b.
Hello, you can just build the API around the fill_blanks function in generate.py; note that the API server should only be started on process 0.
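A minimal sketch of that idea, assuming the `fill_blanks(raw_text, model, tokenizer, strategy)` signature from generate.py; the Flask route, port, and global names are illustrative only, and in a multi-GPU setup the other ranks would still need to enter the same call via a broadcast as discussed above:

```python
from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    raw_text = request.json["prompt"]
    # model, tokenizer and strategy are the objects built during
    # initialization, kept in module-level globals here for brevity.
    answers, answers_with_style, blanks = fill_blanks(raw_text, model, tokenizer, strategy)
    return jsonify({"answers": answers})

if torch.distributed.get_rank() == 0:  # start the server only on process 0
    app.run(host="0.0.0.0", port=5000)
```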
from glm-130b.
@Sengxian After building the service, requests get stuck in filling_sequence with no response. What could be the problem?
from glm-130b.
> For your alternative approach, you can start the server only on GPU 0 (by checking `torch.distributed.get_rank()`) and then use `torch.distributed.broadcast_object_list` to broadcast the information to the other GPUs.
@Sengxian This will dedicate one GPU to the API, and the GLM-130B model needs all 8 GPUs for inference. Do you know any other method where I can run the API on the CPU, load the model on the GPUs, and communicate between the CPU and the GPUs?
Machine specs: V100 (8 × 32 GB)
from glm-130b.
Yes, starting the API server on process 0 only performs inference on a single card.
from glm-130b.
@Shahrukh-Alethea Hi, did you get the server and the inference process running properly? Can you guide me?
from glm-130b.
@jiangliqin I was not able to run inference successfully. The server occupies one of the GPUs, and 7 GPUs are not enough for the GLM-130B model, so I am facing the same issue as you.
Maybe @Sengxian can help with this?
from glm-130b.
@jiangliqin @Shahrukh-Alethea Hello, may I ask why the API server would dedicate one GPU to the server? A custom callback function is triggered when the API server receives a request, so that GPU should still be usable for inference at that point.
from glm-130b.
@Sengxian
I am using this to start the server
```python
with torch.no_grad():
    main(args)
    if torch.distributed.get_rank() == 0:
        uvicorn.run(app, host="0.0.0.0", port=8080)
```
The API server starts without any problem and I am able to call test API callbacks too. I have also created my own `generate_continually` method to accept input from the API callback method rather than from the terminal. The signature of the method now looks like this: `generate_continually(initialize=False, raw_text="")`. `initialize` is used to break the `while` loop on device 0 so I can start the API while all the other devices are kept in the loop.
```python
while True:
    is_stop = False
    if torch.distributed.get_rank() == 0 and initialize:
        break
    if torch.distributed.get_rank() == 0:
        raw_text = raw_text.strip()
        torch.distributed.broadcast_object_list([raw_text, is_stop])
    else:
        print("===========text========", raw_text)
        info = [raw_text, is_stop]
        torch.distributed.broadcast_object_list(info)
        raw_text, is_stop = info
        print("-------------done-----------")
    if is_stop:
        return
    try:
        print("----------calling process function-----------")
        start_time = time.time()
        process(raw_text)
        if torch.distributed.get_rank() == 0:
            print("\nTaken time {:.2f}\n".format(time.time() - start_time), flush=True)
    except (ValueError, FileNotFoundError) as e:
        print(e)
        continue
```
When the method is called from the `main` method it has `initialize` set to `True`, which makes sure that device 0 breaks out of the loop and continues on to start the server. The second time, `generate_continually` is called from the API callback. When it is called from the API callback, I get an "args are not defined" error on device 0 and it throws an exception. I have tried saving the args in a global variable with no success.
from glm-130b.
Hello, this error is caused by a communication timeout; I cannot see the real error message from it.
from glm-130b.
@Sengxian The symptom is that the service is started on card 0, and by default the route handling the request also only uses card 0's resources; it cannot use multiple cards.
from glm-130b.
How can I send you the entry-point file so you can locate the problem? Can I add you on WeChat? My ID is hashenbb
Hello, please join our Slack channel for a more detailed discussion. If it is not convenient to share the file there, you can DM me on Slack :)
from glm-130b.
I'm guessing `uvicorn.run(app, host="0.0.0.0", port=8080)` will somehow start a new process, so you get the "not defined" error. Could you please try using a basic Flask API and see if it still shows the error?
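For instance, a minimal diagnostic along these lines (purely illustrative; the globals would be assigned from your initialization code) would show whether the request handler still sees the objects set up in `main`:

```python
import torch
from flask import Flask, jsonify

app = Flask(__name__)

# These would be assigned from main() after initialization, e.g. via the
# set_model()-style helpers discussed earlier in this thread.
MODEL = None
ARGS = None

@app.route("/debug")
def debug():
    # If both flags come back False, the request handler is not seeing the
    # globals that were set during model initialization.
    return jsonify({"model_is_set": MODEL is not None, "args_is_set": ARGS is not None})

if torch.distributed.get_rank() == 0:  # start the server only on rank 0
    app.run(host="0.0.0.0", port=8080)
```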
from glm-130b.
Still facing the same issue.
from glm-130b.
I have sent you the file in a DM, please check~
from glm-130b.
@Sengxian Thanks for the patient guidance above. After quantization the response time is over 10 s; can FasterTransformer be used to accelerate inference with the quantized model?
from glm-130b.
We have not yet finished implementing the quantized version in FasterTransformer; we will announce it as soon as it is done.
from glm-130b.
What is the supported context length? When the input is too long, filling_sequence reports an error: index 272 is out of bounds for dimension 0 with size 272
from glm-130b.
GLM-130B supports a context length of up to 2048. Do you have the complete error log?
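For example, one could guard against over-length prompts before generation (a hypothetical sketch; `tokenizer.tokenize` is an assumption here, so swap in whatever encode method your tokenizer actually exposes):

```python
MAX_CONTEXT = 2048  # maximum context length supported by GLM-130B

def check_length(raw_text, tokenizer):
    tokens = tokenizer.tokenize(raw_text)  # assumed tokenizer API
    if len(tokens) > MAX_CONTEXT:
        raise ValueError(
            f"Prompt is {len(tokens)} tokens, which exceeds the {MAX_CONTEXT}-token limit."
        )
    return tokens
```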
from glm-130b.
@Shahrukh-Alethea
You have to kick off a separate thread in the main process (rank == 0) to:
- accept input and broadcast it to the other ranks,
- do inference and return the result to the app server.
Something like below (the code is from my host script, just for illustration):
```python
# do initialization here
model, tokenizer = get_model_and_tokenizer(args)

def loop():
    global TASK_QUEUE
    q = TASK_QUEUE
    while True:
        task = q.get()  # block until a new item arrives
        torch.distributed.broadcast_object_list(
            [task.op, task.opts],
            src=0,
            group=mpu.get_model_parallel_group()
        )
        # function_caller: the model inference part.
        rsp = function_caller(task.op, task.opts)
        task.q.put(rsp)

def workmain():
    if mpu.get_model_parallel_rank() == 0:
        print("master engaged.")
        # loop() runs on rank 0: accept input and broadcast it to the other ranks.
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        app.run(host="0.0.0.0", port=48888, threaded=True)
    else:
        # other ranks: accept tasks from rank 0 and handle them.
        print("slave engaged")
        while True:
            task = [None, None]
            torch.distributed.broadcast_object_list(
                task,
                src=0,
                group=mpu.get_model_parallel_group()
            )
            _ = function_caller(task[0], task[1])

workmain()
```
I created the above according to metaseq/interactive_hosted.py. The task queue here could be fully replaced by Python's `queue` module.
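The piece not shown above is the request handler that feeds TASK_QUEUE; a hypothetical sketch of how it could look (the `Task` class and the `/generate` route are illustrative, not part of the original host script):

```python
import queue
from dataclasses import dataclass, field
from flask import Flask, request, jsonify

TASK_QUEUE = queue.Queue()
app = Flask(__name__)

@dataclass
class Task:
    op: str
    opts: dict
    q: queue.Queue = field(default_factory=queue.Queue)  # per-task response channel

@app.route("/generate", methods=["POST"])
def generate():
    # Hand the request over to the rank-0 loop() thread and wait for its reply.
    task = Task(op="generate", opts=request.json)
    TASK_QUEUE.put(task)
    return jsonify(task.q.get())
```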
from glm-130b.
> The error message after getting stuck in filling_sequence

Hi, I am facing the same timeout issue while doing simple inference using the generate.sh script. Can anyone share what the issue was and how to solve it?
from glm-130b.
This issue can be closed with our official FasterTransformer API server.
from glm-130b.