qwenlm / qwen
The official repo of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.
License: Apache License 2.0
Just the main /v1/chat/completions endpoint would be enough.
Running both the example in the official README and demo.py in this repo raises the following error:
File "/root/.cache/huggingface/modules/transformers_modules/Qwen/Qwen-7B-Chat/44e46a0f02169a2c4790fbcccec82cd20f4df717/qwen_generation_utils.py", line 349, in __call__
scores[i, self.eos_token_id] = float(2**30)
RuntimeError: value cannot be converted to type at::Half without overflow
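A minimal sketch of why this overflows, plus one possible (hedged) workaround of capping the EOS boost at the dtype's maximum instead of 2**30; the attribute names follow the traceback above:

import torch

# float(2**30) is far beyond the largest value float16 can represent (~65504),
# so writing it into a half-precision `scores` tensor overflows.
print(torch.finfo(torch.float16).max)  # 65504.0

# Hedged workaround: cap the EOS logit boost at the dtype's maximum.
def boost_eos(scores: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    scores[:, eos_token_id] = torch.finfo(scores.dtype).max
    return scores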
Hi
I followed the flash attention installation steps you provided and installed it successfully.
At runtime the log also shows:
use flash_attn rotary
use flash_attn rms_norm
Testing on an A100, installing flash attention gives me less than a 5% inference speedup (per-token generation latency) compared with not installing it.
So I'd like to ask: in your internal tests, roughly how much speedup does flash attention bring?
Hi, I'm trying to run the model on an M1 Mac. Because of memory constraints I added an offload_folder and a torch_dtype; the code is as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
tokenizer = AutoTokenizer.from_pretrained("/Users/sniper/model/Qwen-7b-chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "/Users/sniper/model/Qwen-7b-chat", device_map="auto",
    offload_folder="offload", torch_dtype=torch.float16,
    trust_remote_code=True, fp16=True,
).eval()
model.generation_config = GenerationConfig.from_pretrained(
    "/Users/sniper/model/Qwen-7b-chat", trust_remote_code=True,
)

# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
But the chat call (second-to-last line) raises an error:
position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: MPS does not support cumsum op with int64 input
What could be the cause of this?
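A hedged workaround sketch, assuming the failure is only the int64 cumsum on the MPS backend: either let unsupported ops fall back to the CPU, or run the cumsum in a dtype MPS supports. Both options below are assumptions, not an official fix.

import os
# Option 1: allow unsupported MPS ops to fall back to CPU (set before importing torch).
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

# Option 2 (sketch): mirror the failing line but compute the cumsum in int32.
attention_mask = torch.ones(1, 8, dtype=torch.long, device="mps")
position_ids = (attention_mask.to(torch.int32).cumsum(-1) - 1).to(torch.long)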
As stated in the title.
Thanks for open-sourcing the pretrained models. I tried the chat model and it feels very capable, and I'm now trying to fine-tune it. While integrating, I noticed that the Qwen/Qwen-7B
tokenizer also implements the "avoid injection attacks" behavior; this shouldn't be needed for the non-chat model, right?
BTW: one occurrence of "OpenAI" in a warning was not updated: https://huggingface.co/Qwen/Qwen-7B/blob/65b57b1a586a38c959e91bb9dd5fc37cdb5c86fa/tokenization_qwen.py#L156
For example, the data format, how alignment is done, and so on.
I'm a complete beginner and would like to give it a try.
Very valuable work!
However, there isn't much material on how tool calling was evaluated. Could the team share the scale of the evaluation, whether training targeted specific APIs, and how well tool selection works on APIs the model has never seen? Hoping the developers can reply, thanks!
Error:
TypeError: cannot pickle 'builtins.CoreBPE' object
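A hedged sketch of where this usually comes from: 'builtins.CoreBPE' is the Rust object inside tiktoken that the Qwen tokenizer wraps, and it cannot be pickled, so the error typically appears when the tokenizer is sent to worker processes (e.g. datasets.map with num_proc > 1 or DataLoader workers). The data file below is hypothetical.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
dataset = load_dataset("json", data_files="train.json")  # hypothetical data file

def tokenize_fn(example):
    return tokenizer(example["text"])

# Workaround: keep the map single-process so the tokenizer is never pickled.
dataset = dataset.map(tokenize_fn, num_proc=1)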
Fine-tuning Qwen-7B with 24 GB of GPU memory.
Project link: https://github.com/yangjianxin1/Firefly
Training script:
torchrun --nproc_per_node={num_gpus} train_qlora.py --train_args_file train_args/qlora/qwen-7b-qlora.json
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
When tokenizing "<|endoftext|>", it is split into multiple tokens instead of the single token 151643.
Script:
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
print('encode <|endoftext|>: {}'.format(tokenizer.encode('<|endoftext|>')))
Tokenization result:
encode <|endoftext|>: [27, 91, 8691, 723, 427, 91, 29]
Hope the Qwen team can fix this.
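This looks related to the injection-attack guard mentioned earlier: special-token text in user input is encoded as plain text by default. A hedged sketch of recovering the single id, assuming the tokenizer exposes the tiktoken-style allowed_special keyword:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# Explicitly allow the special token when encoding it from text (assumed kwarg),
ids = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # expected: [151643]

# or skip the text round-trip entirely and append the id directly.
eos_id = 151643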
And documentation of the related deployment steps.
I can't install flash-attention with either Python 3.8 or 3.10. It fails with:
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
Has anyone else run into this?
Have the hardware requirements for training and evaluation been published? I need to assess hardware resource needs, but the markdown docs don't seem to state them explicitly. Has anyone seen this information?
First of all, thanks for open-sourcing the Qwen-7B model. I implemented QLoRA multi-turn dialogue fine-tuning on top of it; project: https://github.com/hiyouga/LLaMA-Efficient-Tuning
QLoRA instruction fine-tuning:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--model_name_or_path Qwen/Qwen-7B-Chat \
--do_train \
--dataset sharegpt_zh \
--template chatml \
--finetuning_type lora \
--lora_target c_attn \
--output_dir qwen_lora \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--learning_rate 3e-5 \
--num_train_epochs 1.0 \
--quantization_bit 4 \
--fp16
Web Demo:
python src/web_demo.py \
--model_name_or_path Qwen/Qwen-7B-Chat \
--template chatml
API deployment (OpenAI-compatible format):
python src/api_demo.py \
--model_name_or_path Qwen/Qwen-7B-Chat \
--template chatml
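A hedged usage sketch for the OpenAI-format endpoint above, using the pre-1.0 openai client that was current at the time; the host, port, and model name are assumptions, so adjust them to whatever api_demo.py reports at startup.

import openai

openai.api_base = "http://localhost:8000/v1"  # assumed local address of api_demo.py
openai.api_key = "none"                       # the local server does not check keys

resp = openai.ChatCompletion.create(
    model="Qwen-7B-Chat",  # assumed model name exposed by the server
    messages=[{"role": "user", "content": "你好"}],
)
print(resp.choices[0].message.content)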
Also, it would help downstream development if the tokenizer's decode method honored the skip_special_tokens argument; at the moment the argument has no effect. (Fixed in the latest version.)
def _decode(
    self,
    token_ids: Union[int, List[int]],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = None,
    **kwargs,
) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    return self.tokenizer.decode(token_ids)
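A minimal sketch of what the fix could look like, assuming the class inherits from transformers' PreTrainedTokenizer (so all_special_ids is available) and that self.tokenizer is the underlying tiktoken encoding, as in the snippet above:

def _decode(
    self,
    token_ids: Union[int, List[int]],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = None,
    **kwargs,
) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    if skip_special_tokens:
        # Drop ids registered as special tokens before handing off to tiktoken.
        token_ids = [i for i in token_ids if i not in self.all_special_ids]
    return self.tokenizer.decode(token_ids)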
First of all, thanks for open-sourcing the Qwen-7B model!
When using the chat version I get no output for longer inputs. My instruction is 4,722 characters long; after tokenization the input_ids length is 3,172. I changed the input-length setting in generation_config.json:
"max_context_size": 4096
But the model's response is an empty string. Single-step debugging confirmed it did not terminate early due to the input being too long; it entered the normal autoregressive decoding loop, but the first two tokens it generated happened to be the two tokens in stop_words_ids. The README says an 8K context is supported:
Support of 8K Context Length. Both Qwen-7B and Qwen-7B-Chat support the context length of 8K, which allows inputs with long contexts.
If I truncate the instruction to 3,265 characters it produces normal output again. What could be the cause? Is it simply that very long inputs degrade quality, or am I using the model incorrectly?
1. Following the README from start to finish, I cannot get it to run.
2. After downloading flash-attention, pip install csrc/layer_norm and pip install csrc/rotary both fail.
3. No streaming chat.
4. No web UI.
5. No explanation of how to load a local model. Where should the local model path go? Please provide a code sample (see the sketch after this list).
6. After setting up the environment as described, running python demo.py from a CMD prompt inside the project fails with an error around device_map="auto".
Summary: please provide clearer and more complete documentation (at minimum, following the README from start to finish should yield a working setup).
If this isn't improved, adoption will suffer: however many people praise the model, there are no real hands-on evaluations or video walkthroughs, because nobody can get it running from the README.
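Regarding item 5, a minimal sketch of loading from a local path, mirroring the from_pretrained usage shown earlier on this page; the directory name is an assumption for wherever the checkpoint was downloaded:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

local_path = "./Qwen-7B-Chat"  # hypothetical local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    local_path, device_map="auto", trust_remote_code=True
).eval()
model.generation_config = GenerationConfig.from_pretrained(local_path, trust_remote_code=True)

response, history = model.chat(tokenizer, "你好", history=None)
print(response)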
As stated in the title.
Are there any intentions of making 13B, 30B, or 60B models, or any kind of bigger open-source foundation models?
The prompt in the examples that activates tool calling caught my eye. Is there a community or tutorial on prompt engineering for Tongyi Qianwen, e.g. how to get the model to return output in JSON or Markdown format? Right now it takes many rounds of prompt tweaking to get this to work.
Thanks for your amazing work. By the way, may I ask what the padding token in your tokenizer is? Without it, I don't think I can fine-tune this model.
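A hedged sketch of one common workaround, assuming <|endoftext|> (id 151643, mentioned above) is acceptable as padding because padded positions are excluded via the attention mask and label mask anyway; this is not an official recommendation.

import torch

PAD_ID = 151643  # <|endoftext|>; assumption: safe to reuse since padding is masked out

def pad_batch(sequences, max_len):
    # Right-pad tokenized sequences and build the matching attention mask.
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad_len = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * pad_len)
        attention_mask.append([1] * len(seq) + [0] * pad_len)
    return torch.tensor(input_ids), torch.tensor(attention_mask)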
modeling_qwen.py, line 373
seq_end = key.size(0)
logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :]
should be
seq_start = key.size(1) - query.size(1)
seq_end = key.size(1)
logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :]
Function calling similar to OpenAI's.
How much corpus data was the Qwen-7B tokenizer trained on?
Can it be integrated with Gradio?
As stated in the title.
Running your demo fails: it runs out of GPU memory. Do I have to use a quantized version?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.85 GiB already allocated; 1.26 GiB free; 20.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
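A hedged sketch for the 24 GiB card above, on the assumption that the OOM comes from loading the weights in fp32 (~28 GB for 7B parameters); loading in half precision keeps the model around 14-15 GB:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.bfloat16 on Ampere and newer GPUs
    trust_remote_code=True,
).eval()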
Can it be used without a GPU?
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True, use_bf16=True).eval()
QWenLMHeadModel.__init__() got an unexpected keyword argument 'use_bf16'
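A hedged sketch of the likely fix: QWenLMHeadModel does not define a use_bf16 argument; the README examples use the bf16 flag (or the generic torch_dtype), so the call would look roughly like this:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
    bf16=True,                     # Qwen-specific flag used in the README examples
    # torch_dtype=torch.bfloat16,  # generic alternative
).eval()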
│ /root/.cache/huggingface/modules/transformers_modules/Qwen-7B/modeling_qwen.py:206 in __init__ │
│ │
│ 203 │ │ self.use_logn_attn = config.use_logn_attn │
│ 204 │ │ │
│ 205 │ │ logn_list = [math.log(i, self.seq_length) if i > self.seq_length else 1 for i in │
│ ❱ 206 │ │ self.logn_tensor = torch.Tensor(logn_list)[None, :, None, None] │
│ 207 │ │ self._ntk_cached = 1.0 │
│ 208 │ │ │
│ 209 │ │ self.attn_dropout = nn.Dropout(config.attn_pdrop) │
│ │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py:209 in │
│ new_tensor │
│ │
│ 206 def get_new_tensor_fn_for_dtype(dtype: torch.dtype) -> Callable: │
│ 207 │ def new_tensor(cls, *args) -> Tensor: │
│ 208 │ │ device = torch.device(get_accelerator().device_name(os.environ["LOCAL_RANK"])) │
│ ❱ 209 │ │ tensor = _orig_torch_empty(0, device=device).new_empty(*args) │
│ 210 │ │ if tensor.is_floating_point(): │
│ 211 │ │ │ tensor = tensor.to(dtype) │
│ 212 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: new_empty(): argument 'size' must be tuple of ints, but found element of type float at pos 2049
Please look into this.
tokenizer.encode(tokenizer.eos_token) != tokenizer.eos_token_id
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
tokenizer.save_pretrained('checkpoint')
Saving the tokenizer fails:
vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
TypeError: save_vocabulary() got an unexpected keyword argument 'filename_prefix'
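A minimal sketch of the signature change that would resolve this, assuming the custom tokenizer overrides save_vocabulary without the filename_prefix parameter that newer transformers versions pass in; the vocab file name and the attribute holding the raw BPE file are assumptions:

import os
from typing import Optional, Tuple

def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
    # Accept (and honor) the filename_prefix argument passed by newer transformers.
    prefix = filename_prefix + "-" if filename_prefix else ""
    vocab_file = os.path.join(save_directory, prefix + "qwen.tiktoken")  # assumed file name
    with open(vocab_file, "wb") as f:
        f.write(self._vocab_file_bytes)  # hypothetical attribute holding the raw BPE file
    return (vocab_file,)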
I added it to modeling_qwen.py, but the output sometimes seems to contain garbled characters.
The diff is below; any guidance would be appreciated.
diff --git a/modeling_qwen.py b/modeling_qwen.py
index cc58746..a0361d9 100644
--- a/modeling_qwen.py
+++ b/modeling_qwen.py
@@ -883,6 +883,7 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         history: Optional[HistoryType],
         system: str = "You are a helpful assistant.",
         append_history: bool = True,
+        stream: Optional[bool] = False,
     ) -> Tuple[str, HistoryType]:
 
         if history is None:
@@ -902,25 +903,39 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         )
 
         input_ids = torch.tensor([context_tokens]).to(self.device)
-        outputs = self.generate(
-            input_ids,
-            stop_words_ids=stop_words_ids,
-            return_dict_in_generate=False,
-        )
+        if stream:
+            from transformers_stream_generator.main import NewGenerationMixin, StreamGenerationConfig
+            self.__class__.generate = NewGenerationMixin.generate
+            self.__class__.sample_stream = NewGenerationMixin.sample_stream
+            stream_config = StreamGenerationConfig(**self.generation_config.to_dict(), do_stream=True)
 
-        response = decode_tokens(
-            outputs[0],
-            tokenizer,
-            raw_text_len=len(raw_text),
-            context_length=len(context_tokens),
-            chat_format=self.generation_config.chat_format,
-            verbose=False,
-        )
+            def stream_generator():
+                outputs = []
+                for token in self.generate(input_ids, stop_words_ids=stop_words_ids, return_dict_in_generate=False, generation_config=stream_config):
+                    outputs.append(token.item())
+                    yield tokenizer.decode(outputs, skip_special_tokens=True)
+
+            return stream_generator()
+        else:
+            outputs = self.generate(
+                input_ids,
+                stop_words_ids=stop_words_ids,
+                return_dict_in_generate=False,
+            )
+
+            response = decode_tokens(
+                outputs[0],
+                tokenizer,
+                raw_text_len=len(raw_text),
+                context_length=len(context_tokens),
+                chat_format=self.generation_config.chat_format,
+                verbose=False,
+            )
 
-        if append_history:
-            history.append((query, response))
+            if append_history:
+                history.append((query, response))
 
-        return response, history
+            return response, history
 
     def generate(
         self,
Should <|im_end|> be label-masked (label_id set to -100)? And should the \n that follows <|im_end|> be label-masked as well? Test input:
<|im_start|>system
system test<|im_end|>
<|im_start|>user
round 1 query<|im_end|>
<|im_start|>assistant
round 1 answer<|im_end|>
<|im_start|>user
round 2 query<|im_end|>
<|im_start|>assistant
round 2 answer<|im_end|>
Tokenizer output:
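A hedged sketch of one common convention (an assumption, not the official Qwen recipe): compute the loss only on assistant response tokens, keep the assistant turn's <|im_end|> unmasked so the model learns when to stop, and mask the separator \n together with the system/user side.

IGNORE_INDEX = -100  # label value ignored by the loss

def build_labels(segments):
    # segments: list of (token_ids, is_assistant_response) pairs for one dialogue.
    # Only assistant response tokens (including their trailing <|im_end|>) keep
    # their real ids as labels; everything else is masked with -100.
    input_ids, labels = [], []
    for token_ids, is_assistant in segments:
        input_ids.extend(token_ids)
        labels.extend(token_ids if is_assistant else [IGNORE_INDEX] * len(token_ids))
    return input_ids, labels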