Comments (7)
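With the fp16 model there is essentially no first-token latency. I haven't finished the int8 path yet, so that will probably take a while.
The main reason is that fp16 can call cuBLAS functions directly, whereas int8 needs a float * int8 -> float matrix multiplication, for which there is no ready-made accelerated library.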
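To make the float * int8 -> float point concrete, here is a minimal sketch in plain Python. It is not fastllm's kernel: the symmetric per-row quantization scheme and the names `quantize_rows` and `matmul_float_int8` are assumptions chosen for illustration. It shows that the inner loop mixes float activations with int8 weights and then rescales, which is the mixed-type operation the comment says has no off-the-shelf accelerated routine.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rows(weights: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Quantize each weight row to int8 with one float scale per row.

    Hypothetical symmetric scheme for illustration only; the real
    storage layout in fastllm may differ.
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Clamp to the symmetric int8 range [-127, 127].
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def matmul_float_int8(x: Matrix, q_weights: List[List[int]],
                      scales: List[float]) -> Matrix:
    """Compute x @ W^T where W is stored as int8 rows plus per-row scales.

    x has shape (batch, in_features); q_weights has shape
    (out_features, in_features). The result is float.
    """
    out = []
    for x_row in x:
        out_row = []
        for q_row, scale in zip(q_weights, scales):
            acc = 0.0
            for a, q in zip(x_row, q_row):
                acc += a * float(q)  # float activation times int8 weight
            out_row.append(acc * scale)  # rescale once per output element
        out.append(out_row)
    return out


if __name__ == "__main__":
    w = [[1.0, -2.0], [0.5, 0.25]]
    q, s = quantize_rows(w)
    print(matmul_float_int8([[1.0, 1.0]], q, s))
```

With fp16 weights the whole triple loop collapses into a single library GEMM call, which is why the two paths differ so much when the prompt is long.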
from fastllm.
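Great work. Did the commit that was just pushed break the fp16 first-token path? It is back to 2s~3s. With the code I pulled on the 12th it took only about 200 ms.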
Oh, which model exactly? I just tested it and it looks normal; fp16 shows essentially no first-token latency on my side.
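chatglm. I pulled the latest code, converted the model to fp16, and enabled the per-token latency printing in chatglm.cpp; the results are shown in Figure 1. The same query ran at normal speed on the 12th (Figure 2). Input length: 1000+.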
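The kind of measurement described above, printing the latency of each token, can be sketched as follows. This is a Python illustration rather than the actual chatglm.cpp code: `token_latencies` and `fake_token_stream` are hypothetical names, and the fake stream only stands in for a real decoding loop. It separates first-token latency (which includes processing the whole prompt) from the latency of later tokens.

```python
import time
from typing import Callable, Iterable, Iterator, List


def token_latencies(stream: Iterable[str],
                    clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Return the seconds that elapsed before each token was produced."""
    latencies = []
    last = clock()
    for _token in stream:
        now = clock()
        latencies.append(now - last)
        last = now
    return latencies


def fake_token_stream(prefill_s: float, decode_s: float,
                      n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model: slow first token, faster later ones."""
    time.sleep(prefill_s)  # simulated prompt processing before the first token
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(decode_s)  # simulated per-token decoding
        yield f"tok{i}"


if __name__ == "__main__":
    lat = token_latencies(fake_token_stream(0.2, 0.01, 5))
    print(f"first token: {lat[0] * 1000:.0f} ms")
    for i, t in enumerate(lat[1:], start=1):
        print(f"token {i}: {t * 1000:.0f} ms")
```

Comparing the first entry against the rest on two checkouts of the code is enough to tell a first-token regression apart from a general slowdown.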
Fixed. Thanks for reporting it.
Yes, I've confirmed the speed is back to normal. Great work.
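"With the fp16 model there is essentially no first-token latency. I haven't finished the int8 path yet, so that will probably take a while. The main reason is that fp16 can call cuBLAS functions directly, whereas int8 needs a float * int8 -> float matrix multiplication, for which there is no ready-made accelerated library."
I'm seeing the same problem with an int4 model: when the prompt is very long, the first token is slow, while fp16 is fast. So there is no good solution for this yet, right?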