qwenlm / qwen
The official repo of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.
License: Apache License 2.0
Just the main /v1/chat/completions endpoint would be enough.
Running both the example in the official README and demo.py in this repo raises the following error:
File "/root/.cache/huggingface/modules/transformers_modules/Qwen/Qwen-7B-Chat/44e46a0f02169a2c4790fbcccec82cd20f4df717/qwen_generation_utils.py", line 349, in __call__
scores[i, self.eos_token_id] = float(2**30)
RuntimeError: value cannot be converted to type at::Half without overflow
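A minimal sketch of why this overflows, plus one possible (hedged) workaround of capping the EOS boost at the dtype's maximum instead of 2**30; the attribute names follow the traceback above:

import torch

# float(2**30) is far beyond the largest value float16 can represent (~65504),
# so writing it into a half-precision `scores` tensor overflows.
print(torch.finfo(torch.float16).max)  # 65504.0

# Hedged workaround: cap the EOS logit boost at the dtype's maximum.
def boost_eos(scores: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    scores[:, eos_token_id] = torch.finfo(scores.dtype).max
    return scores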
Hi
I followed the flash attention installation steps you provided and installed it successfully.
At runtime the log also shows:
use flash_attn rotary
use flash_attn rms_norm
Testing on an A100, installing flash attention gives me less than a 5% inference speedup (per-token generation latency) compared with not installing it.
So I'd like to ask: in your internal tests, roughly how much speedup does flash attention bring?
Hi, I'm trying to run the model on an M1 Mac. Because of memory constraints I added an offload_folder and a torch_dtype; the code is as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
tokenizer = AutoTokenizer.from_pretrained("/Users/sniper/model/Qwen-7b-chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "/Users/sniper/model/Qwen-7b-chat", device_map="auto",
    offload_folder="offload", torch_dtype=torch.float16,
    trust_remote_code=True, fp16=True,
).eval()
model.generation_config = GenerationConfig.from_pretrained(
    "/Users/sniper/model/Qwen-7b-chat", trust_remote_code=True,
)

# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
But the chat call (second-to-last line) raises an error:
position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: MPS does not support cumsum op with int64 input
What could be the cause of this?
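A hedged workaround sketch, assuming the failure is only the int64 cumsum on the MPS backend: either let unsupported ops fall back to the CPU, or run the cumsum in a dtype MPS supports. Both options below are assumptions, not an official fix.

import os
# Option 1: allow unsupported MPS ops to fall back to CPU (set before importing torch).
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

# Option 2 (sketch): mirror the failing line but compute the cumsum in int32.
attention_mask = torch.ones(1, 8, dtype=torch.long, device="mps")
position_ids = (attention_mask.to(torch.int32).cumsum(-1) - 1).to(torch.long)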
As stated in the title.
Thanks for open-sourcing the pretrained models. I tried the chat model and it feels very capable, and I'm now trying to fine-tune it. While integrating, I noticed that the Qwen/Qwen-7B
tokenizer also implements the "avoid injection attacks" behavior; this shouldn't be needed for the non-chat model, right?
BTW: one occurrence of "OpenAI" in a warning was not updated: https://huggingface.co/Qwen/Qwen-7B/blob/65b57b1a586a38c959e91bb9dd5fc37cdb5c86fa/tokenization_qwen.py#L156
For example, the data format, how alignment is done, and so on.
I'm a complete beginner and would like to give it a try.
Very valuable work!
However, there isn't much material on how tool calling was evaluated. Could the team share the scale of the evaluation, whether training targeted specific APIs, and how well tool selection works on APIs the model has never seen? Hoping the developers can reply, thanks!
Error:
TypeError: cannot pickle 'builtins.CoreBPE' object
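A hedged sketch of where this usually comes from: 'builtins.CoreBPE' is the Rust object inside tiktoken that the Qwen tokenizer wraps, and it cannot be pickled, so the error typically appears when the tokenizer is sent to worker processes (e.g. datasets.map with num_proc > 1 or DataLoader workers). The data file below is hypothetical.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
dataset = load_dataset("json", data_files="train.json")  # hypothetical data file

def tokenize_fn(example):
    return tokenizer(example["text"])

# Workaround: keep the map single-process so the tokenizer is never pickled.
dataset = dataset.map(tokenize_fn, num_proc=1)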
Fine-tuning Qwen-7B with 24 GB of GPU memory.
Project link: https://github.com/yangjianxin1/Firefly
Training script:
torchrun --nproc_per_node={num_gpus} train_qlora.py --train_args_file train_args/qlora/qwen-7b-qlora.json
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
When tokenizing "<|endoftext|>", it is split into multiple tokens instead of the single token 151643.
Script:
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
print('encode <|endoftext|>: {}'.format(tokenizer.encode('<|endoftext|>')))
Tokenization result:
encode <|endoftext|>: [27, 91, 8691, 723, 427, 91, 29]
Hope the Qwen team can fix this.
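This looks related to the injection-attack guard mentioned earlier: special-token text in user input is encoded as plain text by default. A hedged sketch of recovering the single id, assuming the tokenizer exposes the tiktoken-style allowed_special keyword:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# Explicitly allow the special token when encoding it from text (assumed kwarg),
ids = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # expected: [151643]

# or skip the text round-trip entirely and append the id directly.
eos_id = 151643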
And documentation of the related deployment steps.
I can't install flash-attention with either Python 3.8 or 3.10. It fails with:
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
Has anyone else run into this?
Have the hardware requirements for training and evaluation been published? I need to assess hardware resource needs, but the markdown docs don't seem to state them explicitly. Has anyone seen this information?
First of all, thanks for open-sourcing the Qwen-7B model. I implemented QLoRA multi-turn dialogue fine-tuning on top of it; project: https://github.com/hiyouga/LLaMA-Efficient-Tuning
QLoRA instruction fine-tuning:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--model_name_or_path Qwen/Qwen-7B-Chat \
--do_train \
--dataset sharegpt_zh \
--template chatml \
--finetuning_type lora \
--lora_target c_attn \
--output_dir qwen_lora \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 100 \
--learning_rate 3e-5 \
--num_train_epochs 1.0 \
--quantization_bit 4 \
--fp16
Web Demo:
python src/web_demo.py \
--model_name_or_path Qwen/Qwen-7B-Chat \
--template chatml
API deployment (OpenAI-compatible format):
python src/api_demo.py \
--model_name_or_path Qwen/Qwen-7B-Chat \
--template chatml
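A hedged usage sketch for the OpenAI-format endpoint above, using the pre-1.0 openai client that was current at the time; the host, port, and model name are assumptions, so adjust them to whatever api_demo.py reports at startup.

import openai

openai.api_base = "http://localhost:8000/v1"  # assumed local address of api_demo.py
openai.api_key = "none"                       # the local server does not check keys

resp = openai.ChatCompletion.create(
    model="Qwen-7B-Chat",  # assumed model name exposed by the server
    messages=[{"role": "user", "content": "你好"}],
)
print(resp.choices[0].message.content)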
Also, it would help downstream development if the tokenizer's decode method honored the skip_special_tokens argument; at the moment the argument has no effect. (Fixed in the latest version.)
def _decode(
    self,
    token_ids: Union[int, List[int]],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = None,
    **kwargs,
) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    return self.tokenizer.decode(token_ids)
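A minimal sketch of what the fix could look like, assuming the class inherits from transformers' PreTrainedTokenizer (so all_special_ids is available) and that self.tokenizer is the underlying tiktoken encoding, as in the snippet above:

def _decode(
    self,
    token_ids: Union[int, List[int]],
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = None,
    **kwargs,
) -> str:
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    if skip_special_tokens:
        # Drop ids registered as special tokens before handing off to tiktoken.
        token_ids = [i for i in token_ids if i not in self.all_special_ids]
    return self.tokenizer.decode(token_ids)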
First of all, thanks for open-sourcing the Qwen-7B model!
When using the chat version I get no output for longer inputs. My instruction is 4,722 characters long; after tokenization the input_ids length is 3,172. I changed the input-length setting in generation_config.json:
"max_context_size": 4096
But the model's response is an empty string. Single-step debugging confirmed it did not terminate early due to the input being too long; it entered the normal autoregressive decoding loop, but the first two tokens it generated happened to be the two tokens in stop_words_ids. The README says an 8K context is supported:
Support of 8K Context Length. Both Qwen-7B and Qwen-7B-Chat support the context length of 8K, which allows inputs with long contexts.
If I truncate the instruction to 3,265 characters it produces normal output again. What could be the cause? Is it simply that very long inputs degrade quality, or am I using the model incorrectly?
1. Following the README from start to finish, I cannot get it to run.
2. After downloading flash-attention, pip install csrc/layer_norm and pip install csrc/rotary both fail.
3. No streaming chat.
4. No web UI.
5. No explanation of how to load a local model. Where should the local model path go? Please provide a code sample (see the sketch after this list).
6. After setting up the environment as described, running python demo.py from a CMD prompt inside the project fails with an error around device_map="auto".
Summary: please provide clearer and more complete documentation (at minimum, following the README from start to finish should yield a working setup).
If this isn't improved, adoption will suffer: however many people praise the model, there are no real hands-on evaluations or video walkthroughs, because nobody can get it running from the README.
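Regarding item 5, a minimal sketch of loading from a local path, mirroring the from_pretrained usage shown earlier on this page; the directory name is an assumption for wherever the checkpoint was downloaded:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

local_path = "./Qwen-7B-Chat"  # hypothetical local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    local_path, device_map="auto", trust_remote_code=True
).eval()
model.generation_config = GenerationConfig.from_pretrained(local_path, trust_remote_code=True)

response, history = model.chat(tokenizer, "你好", history=None)
print(response)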
As stated in the title.
Are there any intentions of making 13B, 30B, or 60B models, or any kind of bigger open-source foundation models?
The prompt in the examples that activates tool calling caught my eye. Is there a community or tutorial on prompt engineering for Tongyi Qianwen, e.g. how to get the model to return output in JSON or Markdown format? Right now it takes many rounds of prompt tweaking to get this to work.
Thanks for your amazing work. By the way, may I ask what the padding token in your tokenizer is? Without it, I don't think I can fine-tune this model.
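A hedged sketch of one common workaround, assuming <|endoftext|> (id 151643, mentioned above) is acceptable as padding because padded positions are excluded via the attention mask and label mask anyway; this is not an official recommendation.

import torch

PAD_ID = 151643  # <|endoftext|>; assumption: safe to reuse since padding is masked out

def pad_batch(sequences, max_len):
    # Right-pad tokenized sequences and build the matching attention mask.
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad_len = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * pad_len)
        attention_mask.append([1] * len(seq) + [0] * pad_len)
    return torch.tensor(input_ids), torch.tensor(attention_mask)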
modeling_qwen.py, line 373
seq_end = key.size(0)
logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :]
should be
seq_start = key.size(1) - query.size(1)
seq_end = key.size(1)
logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :]
Function calling similar to OpenAI's.
How much corpus data was the Qwen-7B tokenizer trained on?
Can it be integrated with Gradio?
As stated in the title.
Running your demo fails: it runs out of GPU memory. Do I have to use a quantized version?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.85 GiB already allocated; 1.26 GiB free; 20.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
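A hedged sketch for the 24 GiB card above, on the assumption that the OOM comes from loading the weights in fp32 (~28 GB for 7B parameters); loading in half precision keeps the model around 14-15 GB:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.bfloat16 on Ampere and newer GPUs
    trust_remote_code=True,
).eval()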
Can it be used without a GPU?
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True, use_bf16=True).eval()
QWenLMHeadModel.__init__() got an unexpected keyword argument 'use_bf16'
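A hedged sketch of the likely fix: QWenLMHeadModel does not define a use_bf16 argument; the README examples use the bf16 flag (or the generic torch_dtype), so the call would look roughly like this:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
    bf16=True,                     # Qwen-specific flag used in the README examples
    # torch_dtype=torch.bfloat16,  # generic alternative
).eval()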
│ /root/.cache/huggingface/modules/transformers_modules/Qwen-7B/modeling_qwen.py:206 in __init__ │
│ │
│ 203 │ │ self.use_logn_attn = config.use_logn_attn │
│ 204 │ │ │
│ 205 │ │ logn_list = [math.log(i, self.seq_length) if i > self.seq_length else 1 for i in │
│ ❱ 206 │ │ self.logn_tensor = torch.Tensor(logn_list)[None, :, None, None] │
│ 207 │ │ self._ntk_cached = 1.0 │
│ 208 │ │ │
│ 209 │ │ self.attn_dropout = nn.Dropout(config.attn_pdrop) │
│ │
│ /opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py:209 in │
│ new_tensor │
│ │
│ 206 def get_new_tensor_fn_for_dtype(dtype: torch.dtype) -> Callable: │
│ 207 │ def new_tensor(cls, *args) -> Tensor: │
│ 208 │ │ device = torch.device(get_accelerator().device_name(os.environ["LOCAL_RANK"])) │
│ ❱ 209 │ │ tensor = _orig_torch_empty(0, device=device).new_empty(*args) │
│ 210 │ │ if tensor.is_floating_point(): │
│ 211 │ │ │ tensor = tensor.to(dtype) │
│ 212 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: new_empty(): argument 'size' must be tuple of ints, but found element of type float at pos 2049
Please look into this.
tokenizer.encode(tokenizer.eos_token) != tokenizer.eos_token_id
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
tokenizer.save_pretrained('checkpoint')
Saving the tokenizer fails:
vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
TypeError: save_vocabulary() got an unexpected keyword argument 'filename_prefix'
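A minimal sketch of the signature change that would resolve this, assuming the custom tokenizer overrides save_vocabulary without the filename_prefix parameter that newer transformers versions pass in; the vocab file name and the attribute holding the raw BPE file are assumptions:

import os
from typing import Optional, Tuple

def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
    # Accept (and honor) the filename_prefix argument passed by newer transformers.
    prefix = filename_prefix + "-" if filename_prefix else ""
    vocab_file = os.path.join(save_directory, prefix + "qwen.tiktoken")  # assumed file name
    with open(vocab_file, "wb") as f:
        f.write(self._vocab_file_bytes)  # hypothetical attribute holding the raw BPE file
    return (vocab_file,)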
I added it to modeling_qwen.py, but the output sometimes seems to contain garbled characters.
The diff is below; any guidance would be appreciated.
diff --git a/modeling_qwen.py b/modeling_qwen.py
index cc58746..a0361d9 100644
--- a/modeling_qwen.py
+++ b/modeling_qwen.py
@@ -883,6 +883,7 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         history: Optional[HistoryType],
         system: str = "You are a helpful assistant.",
         append_history: bool = True,
+        stream: Optional[bool] = False,
     ) -> Tuple[str, HistoryType]:
 
         if history is None:
@@ -902,25 +903,39 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         )
 
         input_ids = torch.tensor([context_tokens]).to(self.device)
-        outputs = self.generate(
-            input_ids,
-            stop_words_ids=stop_words_ids,
-            return_dict_in_generate=False,
-        )
+        if stream:
+            from transformers_stream_generator.main import NewGenerationMixin, StreamGenerationConfig
+            self.__class__.generate = NewGenerationMixin.generate
+            self.__class__.sample_stream = NewGenerationMixin.sample_stream
+            stream_config = StreamGenerationConfig(**self.generation_config.to_dict(), do_stream=True)
 
-        response = decode_tokens(
-            outputs[0],
-            tokenizer,
-            raw_text_len=len(raw_text),
-            context_length=len(context_tokens),
-            chat_format=self.generation_config.chat_format,
-            verbose=False,
-        )
+            def stream_generator():
+                outputs = []
+                for token in self.generate(input_ids, stop_words_ids=stop_words_ids, return_dict_in_generate=False, generation_config=stream_config):
+                    outputs.append(token.item())
+                    yield tokenizer.decode(outputs, skip_special_tokens=True)
+
+            return stream_generator()
+        else:
+            outputs = self.generate(
+                input_ids,
+                stop_words_ids=stop_words_ids,
+                return_dict_in_generate=False,
+            )
+
+            response = decode_tokens(
+                outputs[0],
+                tokenizer,
+                raw_text_len=len(raw_text),
+                context_length=len(context_tokens),
+                chat_format=self.generation_config.chat_format,
+                verbose=False,
+            )
 
-        if append_history:
-            history.append((query, response))
+            if append_history:
+                history.append((query, response))
 
-        return response, history
+            return response, history
 
     def generate(
         self,
Should <|im_end|> be label-masked (label_id set to -100)? And should the \n that follows <|im_end|> be label-masked as well? Test input:
<|im_start|>system
system test<|im_end|>
<|im_start|>user
round 1 query<|im_end|>
<|im_start|>assistant
round 1 answer<|im_end|>
<|im_start|>user
round 2 query<|im_end|>
<|im_start|>assistant
round 2 answer<|im_end|>
Tokenizer output:
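A hedged sketch of one common convention (an assumption, not the official Qwen recipe): compute the loss only on assistant response tokens, keep the assistant turn's <|im_end|> unmasked so the model learns when to stop, and mask the separator \n together with the system/user side.

IGNORE_INDEX = -100  # label value ignored by the loss

def build_labels(segments):
    # segments: list of (token_ids, is_assistant_response) pairs for one dialogue.
    # Only assistant response tokens (including their trailing <|im_end|>) keep
    # their real ids as labels; everything else is masked with -100.
    input_ids, labels = [], []
    for token_ids, is_assistant in segments:
        input_ids.extend(token_ids)
        labels.extend(token_ids if is_assistant else [IGNORE_INDEX] * len(token_ids))
    return input_ids, labels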