Comments (30)
Hello @Shahrukh-Alethea! We have successfully built our demo app using Gradio. Could you please provide more error information so we can see what happens?
from glm-130b.
@Sengxian
I am trying this approach to integrate GLM model with API.
- First, I use the generate.sh script to load the model.
- I start a FastAPI server in a separate terminal.
- I start an RQ worker. RQ is a Redis-based queue solution I am using to pass request data to the GLM model.
- I have also created a separate method to call generate_continually whenever a job is registered. I am using non-interactive mode and passing a named file.
- I get the error message `model` not defined. So I changed `model` and `tokenizer` to global variables, but I still get the same error that `model` and `tokenizer` are not defined. I also logged both variables and their value is null.
As an alternative approach, I decided to start the server from the same script as the model, i.e. generate.py. Since the script executes 8 times, the server crashes because the port is already occupied.
Command to run the model: `torchrun --nproc_per_node $MP_SIZE ${ARGS}`, where `$MP_SIZE` is 8.
from glm-130b.
For your alternative approach, you can start the server only on GPU 0 (by checking `torch.distributed.get_rank()`) and then use `torch.distributed.broadcast_object_list` to broadcast the information to the other GPUs.
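A minimal sketch of that pattern, assuming `run_inference` is a hypothetical stand-in for whatever actually runs generation in your script:

```python
import torch.distributed as dist

def handle_request(raw_text, model, tokenizer):
    # Called on rank 0 from the API callback: tell the other ranks what to
    # generate, then join the model-parallel forward pass yourself.
    dist.broadcast_object_list([raw_text], src=0)
    return run_inference(model, tokenizer, raw_text)  # hypothetical

def follower_loop(model, tokenizer):
    # Ranks != 0 block here, waiting for rank 0 to broadcast each query.
    while True:
        payload = [None]  # filled in place on non-source ranks
        dist.broadcast_object_list(payload, src=0)
        run_inference(model, tokenizer, payload[0])  # hypothetical
```

Rank 0 starts the HTTP server and calls `handle_request` per request, while every other rank sits in `follower_loop`.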
from glm-130b.
Thanks for the tip, it is working now. Just another question: is there a method similar to get_tokenizer to get the current model?
from glm-130b.
I'm afraid not, but I'm guessing you could use global variables to achieve something similar. Here is a possible example:
```python
MODEL = None

def set_model(model):
    global MODEL
    MODEL = model

def get_model():
    global MODEL
    return MODEL
```
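For example (hypothetical usage; `initialize_model_and_tokenizer` stands in for whatever your initialization code actually returns), you would call `set_model` once after loading and `get_model` inside the API callback:

```python
# After initialization in main() (hypothetical initializer name):
model, tokenizer = initialize_model_and_tokenizer(args)
set_model(model)

# Later, inside an API callback:
model = get_model()
```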
from glm-130b.
@Sengxian Thanks for the support.
from glm-130b.
How can I replace interactive mode with something like a Flask API? I converted GLM-10B's generate.py into an API, but the prompt "Please Input Query (stop to exit) >>>" seems to come from SwissArmyTransformer, and I am not sure how to turn that into an API.
from glm-130b.
Hello, you can just build the API around the fill_blanks function in generate.py; note that the API server should only be started on process 0.
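A minimal sketch of that idea, assuming the `fill_blanks(raw_text, model, tokenizer, strategy)` signature from generate.py; the Flask route, port, and global names are illustrative only, and in a multi-GPU setup the other ranks would still need to enter the same call via a broadcast as discussed above:

```python
from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    raw_text = request.json["prompt"]
    # model, tokenizer and strategy are the objects built during
    # initialization, kept in module-level globals here for brevity.
    answers, answers_with_style, blanks = fill_blanks(raw_text, model, tokenizer, strategy)
    return jsonify({"answers": answers})

if torch.distributed.get_rank() == 0:  # start the server only on process 0
    app.run(host="0.0.0.0", port=5000)
```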
from glm-130b.
@Sengxian After building the service, requests get stuck in filling_sequence with no response. What could be the problem?
from glm-130b.
> For your alternative approach, you can start the server only on GPU 0 (by checking `torch.distributed.get_rank()`) and then use `torch.distributed.broadcast_object_list` to broadcast the information to the other GPUs.
@Sengxian This will dedicate one GPU to the API, and the GLM-130B model needs all 8 GPUs for inference. Do you know any other method where I can run the API on the CPU, load the model on the GPUs, and communicate between the CPU and the GPUs?
Machine specs: V100 (8 × 32 GB)
from glm-130b.
Yes, starting the API server on process 0 only performs inference on a single card.
from glm-130b.
@Shahrukh-Alethea Hi, did you get the server and the inference process running properly? Can you guide me?
from glm-130b.
@jiangliqin I was not able to run inference successfully. The server occupies one of the GPUs, and 7 GPUs are not enough for the GLM-130B model, so I am facing the same issue as you.
Maybe @Sengxian can help with this?
from glm-130b.
@jiangliqin @Shahrukh-Alethea Hello, may I ask why the API server would dedicate one GPU to the server? A custom callback function is triggered when the API server receives a request, so that GPU should still be usable for inference at that point.
from glm-130b.
@Sengxian
I am using this to start the server
```python
with torch.no_grad():
    main(args)
    if torch.distributed.get_rank() == 0:
        uvicorn.run(app, host="0.0.0.0", port=8080)
```
The API server starts without any problem and I am able to call test API callbacks too. I have also created my own `generate_continually` method to accept input from the API callback method rather than from the terminal. The signature of the method now looks like this: `generate_continually(initialize=False, raw_text="")`. `initialize` is used to break the `while` loop on device 0 so I can start the API while all the other devices are kept in the loop.
```python
while True:
    is_stop = False
    if torch.distributed.get_rank() == 0 and initialize:
        break
    if torch.distributed.get_rank() == 0:
        raw_text = raw_text.strip()
        torch.distributed.broadcast_object_list([raw_text, is_stop])
    else:
        print("===========text========", raw_text)
        info = [raw_text, is_stop]
        torch.distributed.broadcast_object_list(info)
        raw_text, is_stop = info
        print("-------------done-----------")
    if is_stop:
        return
    try:
        print("----------calling process function-----------")
        start_time = time.time()
        process(raw_text)
        if torch.distributed.get_rank() == 0:
            print("\nTaken time {:.2f}\n".format(time.time() - start_time), flush=True)
    except (ValueError, FileNotFoundError) as e:
        print(e)
        continue
```
When the method is called from the `main` method it has `initialize` set to `True`, which makes sure that device 0 breaks out of the loop and continues on to start the server. The second time, `generate_continually` is called from the API callback. When it is called from the API callback, I get an "args are not defined" error on device 0 and it throws an exception. I have tried saving the args in a global variable with no success.
from glm-130b.
Hello, this error is caused by a communication timeout; I cannot see the real error message from it.
from glm-130b.
@Sengxian The symptom is that the service is started on card 0, and by default the route handling the request also only uses card 0's resources; it cannot use multiple cards.
from glm-130b.
How can I send you the entry-point file so you can locate the problem? Can I add you on WeChat? My ID is hashenbb
Hello, please join our Slack channel for a more detailed discussion. If it is not convenient to share the file there, you can DM me on Slack :)
from glm-130b.
I'm guessing `uvicorn.run(app, host="0.0.0.0", port=8080)` will somehow start a new process, so you get the "not defined" error. Could you please try using a basic Flask API and see if it still shows the error?
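For instance, a minimal diagnostic along these lines (purely illustrative; the globals would be assigned from your initialization code) would show whether the request handler still sees the objects set up in `main`:

```python
import torch
from flask import Flask, jsonify

app = Flask(__name__)

# These would be assigned from main() after initialization, e.g. via the
# set_model()-style helpers discussed earlier in this thread.
MODEL = None
ARGS = None

@app.route("/debug")
def debug():
    # If both flags come back False, the request handler is not seeing the
    # globals that were set during model initialization.
    return jsonify({"model_is_set": MODEL is not None, "args_is_set": ARGS is not None})

if torch.distributed.get_rank() == 0:  # start the server only on rank 0
    app.run(host="0.0.0.0", port=8080)
```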
from glm-130b.
Still facing the same issue.
from glm-130b.
I have sent you the file in a DM, please check~
from glm-130b.
@Sengxian Thanks for the patient guidance above. After quantization the response time is over 10 s; can FasterTransformer be used to accelerate inference with the quantized model?
from glm-130b.
We have not yet finished implementing the quantized version in FasterTransformer; we will announce it as soon as it is done.
from glm-130b.
What is the supported context length? When the input is too long, filling_sequence reports an error: index 272 is out of bounds for dimension 0 with size 272
from glm-130b.
GLM-130B supports a context length of up to 2048. Do you have the complete error log?
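For example, one could guard against over-length prompts before generation (a hypothetical sketch; `tokenizer.tokenize` is an assumption here, so swap in whatever encode method your tokenizer actually exposes):

```python
MAX_CONTEXT = 2048  # maximum context length supported by GLM-130B

def check_length(raw_text, tokenizer):
    tokens = tokenizer.tokenize(raw_text)  # assumed tokenizer API
    if len(tokens) > MAX_CONTEXT:
        raise ValueError(
            f"Prompt is {len(tokens)} tokens, which exceeds the {MAX_CONTEXT}-token limit."
        )
    return tokens
```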
from glm-130b.
@Shahrukh-Alethea
You have to kick off a separate thread in the main process (rank == 0) to:
- accept input and broadcast it to the other ranks,
- do inference and return the result to the app server.
Something like below (the code is from my host script, just for illustration):
```python
# do initialization here
model, tokenizer = get_model_and_tokenizer(args)

def loop():
    global TASK_QUEUE
    q = TASK_QUEUE
    while True:
        task = q.get()  # block until a new item arrives
        torch.distributed.broadcast_object_list(
            [task.op, task.opts],
            src=0,
            group=mpu.get_model_parallel_group()
        )
        # function_caller: the model inference part.
        rsp = function_caller(task.op, task.opts)
        task.q.put(rsp)

def workmain():
    if mpu.get_model_parallel_rank() == 0:
        print("master engaged.")
        # loop() runs on rank 0: accept input and broadcast it to the other ranks.
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        app.run(host="0.0.0.0", port=48888, threaded=True)
    else:
        # other ranks: accept tasks from rank 0 and handle them.
        print("slave engaged")
        while True:
            task = [None, None]
            torch.distributed.broadcast_object_list(
                task,
                src=0,
                group=mpu.get_model_parallel_group()
            )
            _ = function_caller(task[0], task[1])

workmain()
```
I created the above according to metaseq/interactive_hosted.py. The task queue here could be fully replaced by Python's `queue` module.
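The piece not shown above is the request handler that feeds TASK_QUEUE; a hypothetical sketch of how it could look (the `Task` class and the `/generate` route are illustrative, not part of the original host script):

```python
import queue
from dataclasses import dataclass, field
from flask import Flask, request, jsonify

TASK_QUEUE = queue.Queue()
app = Flask(__name__)

@dataclass
class Task:
    op: str
    opts: dict
    q: queue.Queue = field(default_factory=queue.Queue)  # per-task response channel

@app.route("/generate", methods=["POST"])
def generate():
    # Hand the request over to the rank-0 loop() thread and wait for its reply.
    task = Task(op="generate", opts=request.json)
    TASK_QUEUE.put(task)
    return jsonify(task.q.get())
```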
from glm-130b.
> The error message after getting stuck in filling_sequence

Hi, I am facing the same timeout issue while doing simple inference using the generate.sh script. Can anyone share what the issue was and how to solve it?
from glm-130b.
This issue can be closed with our official FasterTransformer API server.
from glm-130b.