thudm / glm-130b
GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)
License: Apache License 2.0
Thanks for sharing the great work!
I'm curious whether the model is suitable for open-ended generation, e.g. putting only [gMASK] at the beginning and letting the model complete the text.
Thank you again!
GCP prices 8x 40 GB A100s at 50% more than 4x 80 GB A100s. Would I be able to get the same results on the 4x 80 GB setup with a little tweaking of the default config?
I tried to run the GLM FasterTransformer benchmark-generation.sh (without loading a model checkpoint), but encountered the following error:
CUDA error: invalid argument
Exception raised from alloc_block at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:1037 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f8738f8063c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x25dd2 (0x7f8738fdfdd2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2b278 (0x7f8738fe5278 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2cd8c (0x7f8738fe6d8c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2d2f8 (0x7f8738fe72f8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x103 (0x7f873c3e00a3 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x35079fb (0x7f873c5179fb in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x3507a8f (0x7f873c517a8f in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x1d5c77f (0x7f878593677f in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::empty_memory_format::call(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1e5 (0x7f87856e3ac5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::empty(c10::ArrayRef<long>, c10::TensorOptions, c10::optional<c10::MemoryFormat>) + 0x1d3 (0x7f86cdd75643 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #11: fastertransformer::Allocator<(fastertransformer::AllocatorType)2>::malloc(unsigned long, bool) + 0xe6 (0x7f86cdd89046 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #12: fastertransformer::GlmContextDecoder<__half>::allocateBuffer() + 0x70 (0x7f86cddc6a00 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #13: fastertransformer::GlmContextDecoder<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, std::vector<fastertransformer::GlmDecoderLayerWeight<__half>*, std::allocator<fastertransformer::GlmDecoderLayerWeight<__half>*> > const*, fastertransformer::LayerNormWeight<__half> const*) + 0x1f0 (0x7f86cddcb8e0 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #14: fastertransformer::Glm<__half>::encode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) + 0x1517 (0x7f86cdda6f07 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #15: torch_ext::FTGlm<__half>::encode(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int) + 0xd44 (0x7f86cdd90134 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #16: torch_ext::GlmOp::encode(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, long) + 0x10f (0x7f86cdd6e31f in /root/FasterTransformer/build/lib/libth_glm.so)
frame #17: <unknown function> + 0x7344b (0x7f86cdd8a44b in /root/FasterTransformer/build/lib/libth_glm.so)
frame #18: <unknown function> + 0x69ee6 (0x7f86cdd80ee6 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #19: PyCFunction_Call + 0x54 (0x55f91235f914 in /opt/conda/bin/python)
frame #20: _PyObject_MakeTpCall + 0x31e (0x55f912362ebe in /opt/conda/bin/python)
frame #21: <unknown function> + 0x1b85de (0x55f9123e85de in /opt/conda/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x4d33 (0x55f9124043c3 in /opt/conda/bin/python)
frame #23: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #24: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #25: <unknown function> + 0x1b848c (0x55f9123e848c in /opt/conda/bin/python)
frame #26: PyObject_Call + 0x5e (0x55f912351b6e in /opt/conda/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x21bf (0x55f91240184f in /opt/conda/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #29: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #30: _PyObject_FastCallDict + 0x2fd (0x55f9123d1d2d in /opt/conda/bin/python)
frame #31: _PyObject_Call_Prepend + 0xcf (0x55f9123d229f in /opt/conda/bin/python)
frame #32: <unknown function> + 0x1a2329 (0x55f9123d2329 in /opt/conda/bin/python)
frame #33: _PyObject_MakeTpCall + 0x31e (0x55f912362ebe in /opt/conda/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x55f5 (0x55f912404c85 in /opt/conda/bin/python)
frame #35: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #36: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x947 (0x55f9123fffd7 in /opt/conda/bin/python)
frame #38: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #39: PyEval_EvalCodeEx + 0x39 (0x55f9123e7499 in /opt/conda/bin/python)
frame #40: PyEval_EvalCode + 0x1b (0x55f912482ecb in /opt/conda/bin/python)
frame #41: <unknown function> + 0x252f63 (0x55f912482f63 in /opt/conda/bin/python)
frame #42: <unknown function> + 0x26f033 (0x55f91249f033 in /opt/conda/bin/python)
frame #43: <unknown function> + 0x274022 (0x55f9124a4022 in /opt/conda/bin/python)
frame #44: PyRun_SimpleFileExFlags + 0x1b2 (0x55f9124a4202 in /opt/conda/bin/python)
frame #45: Py_RunMain + 0x36d (0x55f9124a477d in /opt/conda/bin/python)
frame #46: Py_BytesMain + 0x39 (0x55f9124a4939 in /opt/conda/bin/python)
frame #47: __libc_start_main + 0xf3 (0x7f87d07a30b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #48: <unknown function> + 0x1e8f39 (0x55f912418f39 in /opt/conda/bin/python)
Following is my environment:
Greetings,
Are there any plans for integrating GLM-130b in the transformers library? (it seems only the small glm-10b
is available at the moment)
We are trying to use the generated output to send additional queries to the model in batch mode and the current setup of the generate.sh
script is difficult to integrate with existing code, at least compared to Bloom and similar.
Thanks,
Alfredo
Hi,
I want to deploy GLM-130B for inference on two nodes with 8x RTX 3090 each (16 GPUs in total).
I think the memory is sufficient, but I'm not familiar with distributed inference.
Maybe I need to do the following things:
Could you provide me with some ideas or materials?
Thanks.
The document mentions the stable-training contribution, but its code and scripts have not been released. Is there a plan to open-source them?
Hello, I saw that GLM-130B incorporates instruction tuning in the style of ExT5. Does GLM-10B also include instruction tuning?
Excuse me, I get an error when I execute this statement. How can I solve it?
Hey, thx for releasing this amazing work!
Did you pre-train this model on code generation tasks?
Thanks for making such a powerful model widely available! Very impressive work to get it to run on a single node using all open source methods.
I took it for a spin on an 8x A100 40GB machine and got some nice results.
Have you tried running the model on a single A100 80GB or an H100? Can it run without off-loading the weights to CPU?
I looked at the low resource info and did some simple calculations and it looks like
If that's the case, it'd be great to know because single-card setups are even easier to work with than single-node, and the H100s are coming soon.
Related: #17
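As a rough aid for the single-card question, the weight footprint can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch only: it assumes a nominal 130e9 parameters (taken from the model name) and counts the weights alone, ignoring activations, the KV cache, and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count implied by the name "GLM-130B".
NUM_PARAMS = 130e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions only the 4-bit figure falls below 80 GB, and that is before any runtime overhead is added, so whether it fits in practice would still need to be measured.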
Hi, since \n characters are ignored, what would be the next best option to use instead when prompting GLM with in-context examples?
For example, for other models where \n is not ignored, we input prompts that look like this:
Passage: The triangle is above the red sphere.
The pink rectangle is to the left of the red sphere.
Question: Is the triangle to the left of the pink rectangle?
Answer: no
Passage: The chest is bigger than the suitcase.
The box is bigger than the suitcase.
The chest fits inside the box.
The suitcase is bigger than the box of chocolates.
The container fits inside the box.
Question: Does the suitcase fit in the box?
Answer: yes
Passage: Mary travelled to the bedroom.
Daniel travelled to the office.
Daniel journeyed to the hallway.
Mary travelled to the hallway.
Sandra travelled to the kitchen.
Mary travelled to the kitchen.
John journeyed to the garden.
Daniel went to the bathroom.
Question: Where is Sandra?
Answer: kitchen
Passage: The hallway is west of the kitchen.
The office is east of the kitchen.
Question: What is the kitchen west of?
Answer: office
Passage: This morning Fred moved to the school.
Julie went back to the cinema yesterday.
Mary travelled to the bedroom yesterday.
Fred journeyed to the bedroom yesterday.
Bill travelled to the kitchen yesterday.
This afternoon Fred journeyed to the office.
Fred travelled to the park this evening.
Mary went to the office this morning.
This afternoon Mary went back to the cinema.
This morning Julie travelled to the office.
Question: Where was Mary before the office?
Answer: bedroom
Passage: The hallway is north of the office.
The bathroom is south of the office.
Question: What is north of the office?
Answer:
What is the best practice for prompt construction with GLM, especially when there are in-context examples?
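One way to experiment with this is to make the separator a parameter of the prompt builder, so the same examples can be rendered with or without newlines. The sketch below is illustrative only: `build_few_shot_prompt`, `line_sep`, and `example_sep` are hypothetical names, and the space separator used in the demo is not a recommendation from the GLM authors.

```python
from typing import List, Sequence, Tuple

# (passage lines, question, answer)
Example = Tuple[Sequence[str], str, str]


def build_few_shot_prompt(
    examples: List[Example],
    query_passage: Sequence[str],
    query_question: str,
    line_sep: str = "\n",
    example_sep: str = "\n\n",
) -> str:
    """Render in-context examples plus an unanswered query into one prompt."""
    blocks = []
    for passage, question, answer in examples:
        blocks.append(line_sep.join([
            "Passage: " + line_sep.join(passage),
            "Question: " + question,
            "Answer: " + answer,
        ]))
    # The final block leaves the answer open for the model to complete.
    blocks.append(line_sep.join([
        "Passage: " + line_sep.join(query_passage),
        "Question: " + query_question,
        "Answer:",
    ]))
    return example_sep.join(blocks)


demo = [(
    ["The triangle is above the red sphere.",
     "The pink rectangle is to the left of the red sphere."],
    "Is the triangle to the left of the pink rectangle?",
    "no",
)]
prompt = build_few_shot_prompt(
    demo,
    ["The hallway is north of the office.",
     "The bathroom is south of the office."],
    "What is north of the office?",
    line_sep=" ",      # newline-free variant for a tokenizer that drops \n
    example_sep=" ",
)
print(prompt)
```

Keeping the separators as arguments makes it easy to compare the newline-free rendering against the original format on the same examples.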
Thank you for your awesome work!
When I followed the steps provided here, I hit this exception:
Traceback (most recent call last):
File "/FasterTransformer/examples/pytorch/glm/glm_server.py", line 101, in <module>
if not glm.load(ckpt_path=args.ckpt_path):
File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 319, in load
is_load = self.weights.load(ckpt_path, tensor_para_rank=self.tensor_para_rank,
File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 190, in load
scale.extend([module[f'transformer.layers.{i}.attention.query_key_value.weight_scale'].reshape(head_num, num_splits, size_per_head).permute(1, 0, 2).reshape(3, local_dim) for i in range(layer_num)])
File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 190, in <listcomp>
scale.extend([module[f'transformer.layers.{i}.attention.query_key_value.weight_scale'].reshape(head_num, num_splits, size_per_head).permute(1, 0, 2).reshape(3, local_dim) for i in range(layer_num)])
KeyError: 'transformer.layers.0.attention.query_key_value.weight_scale'
It seems that the state_dict is missing some keys.
Hi there!
Thanks for the great work!
How can we cite this work?
Thanks
Hi,
This work looks really interesting!
I am curious about the performance of GLM with the SOTA decoding method, i.e. contrastive search [1], in open-ended text generation. Could you provide some examples generated by GLM with contrastive search?
You can find a tutorial on how to apply contrastive search here (https://github.com/yxuansu/SimCTG#441-chinese-language-model).
Many thanks! :-)
[1] - Su et al., 2022. A Contrastive Framework for Neural Text Generation
Hi, I'm still looking into computing perplexity with GLM.
I just looked at the recent updates to evaluation/{dataset.py, tasks.py} for the language-model task.
The code at dataset.py:LanguageModelTaskDataset:297 is:
if idx == 0 or self.config.unidirectional:
    prompt, text = tokens[:1], tokens[1:]
else:
    prompt_length = self.config.max_seq_length - 1 - self.config.generation_length
    prompt, text = tokens[:prompt_length], tokens[prompt_length:]
# ..... skip ....
return {
    "tokens": np.array(prompt + [mask_id, sop_id] + text[:-1], dtype=np.int64),
    "targets": np.array(prompt + [mask_id] + text, dtype=np.int64),
    "position_ids": np.arange(0, seq_length, dtype=np.int64),
    "attention_mask": attention_mask < 0.5,
    "loss_masks": np.array([0] * (len(prompt) + 1) + [1] * len(text), dtype=np.int64),
}
At idx == 0, you take the full text as the prompt input and also as the output text.
That would lead to a much lower PPL, because the model has a full view of what it needs to predict.
Why not set the prompt to an empty list?
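For reference, perplexity over the scored positions is the exponential of the mean negative log-likelihood, with the loss mask deciding which positions count. The sketch below is generic rather than the repository's evaluation code: `masked_perplexity` is a hypothetical helper, and it takes per-token log-probabilities as plain floats.

```python
import math
from typing import Sequence


def masked_perplexity(token_logprobs: Sequence[float],
                      loss_mask: Sequence[int]) -> float:
    """exp(mean negative log-likelihood) over positions where loss_mask is 1."""
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have equal length")
    scored = [lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    if not scored:
        raise ValueError("loss_mask selects no tokens")
    return math.exp(-sum(scored) / len(scored))


# One masked-out prompt position followed by two scored text tokens,
# each predicted with probability 0.5.
logprobs = [0.0, math.log(0.5), math.log(0.5)]
mask = [0, 1, 1]
print(masked_perplexity(logprobs, mask))
```

Positions with mask 0 contribute nothing to the average, so what matters for the reported PPL is which tokens the model can attend to when predicting the positions with mask 1.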
Hi!
I am trying to set up GLM-130B with FasterTransformer and need to convert the GLM checkpoint files. Where can I get the model_optim_rng.pt file?
I'm also facing this error:
CMake Error at cmake/Modules/FindNCCL.cmake:153 (message): Found NCCL header version and library version do not match! (include: /home/ubuntu/anaconda3/envs/glm/include, library: /home/ubuntu/anaconda3/envs/glm/lib/libnccl.so) Please set NCCL_INCLUDE_DIR and NCCL_LIB_DIR manually. Call Stack (most recent call first): CMakeLists.txt:41 (find_package)
when I try to build with this command: cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
My main goal is to minimize inference time. I also set up the THUDM/GLM-130B main branch with MAX_OUTPUT_LENGTH=64, and it takes about 55 s to generate a response.
Machine specs: 8x V100 (32 GB each)
Thanks
Is there a download link for the INT4 version of GLM-130B?
I think it is cheaper.
Hello, can I apply to call this API?
url = 'https://wudao.aminer.cn/os/api/api/v2/completions_130B'
System configuration: CentOS 7, Python 3.9, Pytorch 1.10.1.
I used the low-resource inference based on the bminf module, but I get this error:
AttributeError: module 'bminf' has no attribute 'wrapper'
When I used the low-resource inference, I encountered this error:
ValueError: Missing keys for inference: ['mixins.rotary-embedding.rotary_emb.inv_freq'].
Before using the model for inference, I checked that all 60 checkpoint files were complete.
Hi, I really appreciate the great work!
Is there a straightforward way to adapt the code for multi-node inference?
I have 3 A100 machines, each with 3 GPUs with 40 GB of memory.
Does this code natively support multi-node inference? If so, where in the code should I configure it?
Thanks!
I need to use glm-10b with scripts/generate.sh, and I set MODEL_TYPE='glm-10b' in the config file in configs.
However, I still get the errors [Errno 2] No such file or directory: 'XXX/glm-10b-en/126000/mp_rank_01_model_states.pt' and [Errno 2] No such file or directory: 'XXX/glm-10b-en/126000/mp_rank_02_model_states.pt', possibly because the glm_130b settings are still being used.
How should the config file in configs be modified to use glm-10b instead of glm-130b?
Looking forward to your reply. Thanks.
Hi!
From what I have read, in machine translation it is more common to fine-tune on a small amount of parallel corpus, and I suspect this works better for low-resource languages. For languages that already have rich corpora, however, it seems difficult to improve performance this way.
I have checked GLM papers and found no performance analysis on the machine translation task. Is it possible to use GLM-130B to improve machine translation performance in English-Chinese translation tasks? Are there any experiments or best practices about this?
As the title says.
With no input text being processed, why is GPU utilization at 100%?
Fri Oct 21 11:05:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   38C    P0    53W / 300W |  20392MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   41C    P0    66W / 300W |  20392MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   40C    P0    59W / 300W |  20248MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   40C    P0    67W / 300W |  20248MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14678      C   /opt/conda/bin/python           20389MiB |
|    1   N/A  N/A     14679      C   /opt/conda/bin/python           20389MiB |
|    2   N/A  N/A     14682      C   /opt/conda/bin/python           20245MiB |
|    3   N/A  N/A     14686      C   /opt/conda/bin/python           20245MiB |
+-----------------------------------------------------------------------------+
Hello!
It seems the script for converting the tensor-parallel dimension fails.
For instance, running
python tools/convert_tp.py --input-folder "../glm/glm-130b-sat" --output-folder "../glm/four-div-glm-130b-sat" --target-tp 4
Yields
Traceback (most recent call last):
File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 154, in <module>
main(args)
File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 149, in main
torch.save(create_checkpoint(sd_list, i, original_tp, args.target_tp, args.quantization_bit_width), save_path)
File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 121, in create_checkpoint
new_sd[key], new_sd[f"{key}_scale"] = new_sd[key]
ValueError: too many values to unpack (expected 2)
Any advice here? Thanks 🙏🏻
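For readers unfamiliar with this error class: Python raises this ValueError whenever the right-hand side of a two-target assignment yields more than two items. The snippet below reproduces the message with a plain list. It illustrates the mechanism only and is not a diagnosis of convert_tp.py; `split_pair` is a hypothetical helper.

```python
def split_pair(value):
    """Unpack `value` into exactly two parts, as a two-target assignment does."""
    first, second = value
    return first, second


print(split_pair(("weights", "scales")))  # a 2-item value unpacks cleanly

try:
    split_pair([1, 2, 3])  # three items cannot fill two targets
except ValueError as err:
    message = str(err)
    print(message)
```

The traceback line therefore suggests that `new_sd[key]` held a single object with more than two elements along its first dimension rather than a (weight, scale) pair at that point, though confirming this would require inspecting the script.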
Thanks a lot for sharing the code.
I followed the steps mentioned here for running it locally without docker, but I am getting the following error.
Traceback (most recent call last):
File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/glm_server.py", line 105, in <module>
glm.init_model(512,# output_len,
File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 375, in init_model
self.cuda()
File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 359, in cuda
self.model = self.Glm(get_torch_default_comm(), self.rank, self.head_num, self.size_per_head, self.head_num * self.size_per_head * 8 // 3,
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
1. libth_glm.Glm(arg0: c10d::ProcessGroupNCCL, arg1: int, arg2: int, arg3: int, arg4: int, arg5: int, arg6: int, arg7: int, arg8: int, arg9: int, arg10: int, arg11: int, arg12: int, arg13: List[at::Tensor], arg14: List[at::Tensor], arg15: List[at::Tensor])
Hi!
I am trying to build an API for the GLM-130B model. So far, I have tried running the GLM model and a FastAPI server from the generate.sh script, with no success. I also tried loading the GLM model in FastAPI's startup event, again with no success. Is there any way to use the model to generate responses through an API?
Thanks
Hi thanks for the great work!
Is there a plan to share the code and data you used specifically for evaluating BIG-bench-lite?
It might be important for recreating the results given the decision points regarding prompt design etc.
Is there any tutorial/code on the finetuning of this GLM-130B model?
Hi,
Thanks for your work and open-source!
There's one point I'm confused about: In your paper (section 2.3, last paragraph), you said
For the [MASK] and multi-task objectives, we use a context window of 512 and concatenate four samples together to cater the 2,048-sequence-length
I wonder if there's a special attention mask to ensure that each sample should only attend to itself, and not attend to other samples?
(e.g., something like a block diagonal attention mask as the following, where each block corresponds to one sample, respectively?)
Otherwise it would be weird to concatenate multiple independent samples together just for computation efficiency, or am I missing something here? (Since there's no training code in the repo yet)
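The kind of mask being asked about can be sketched as follows. This is only an illustration of a block-diagonal mask over concatenated samples. Whether GLM-130B training used one is exactly the open question, and the sketch ignores any bidirectional or causal structure inside each sample. `block_diagonal_mask` is a hypothetical helper.

```python
from typing import List, Sequence


def block_diagonal_mask(sample_lengths: Sequence[int]) -> List[List[bool]]:
    """mask[i][j] is True only when positions i and j lie in the same sample."""
    total = sum(sample_lengths)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for length in sample_lengths:
        end = start + length
        for row in range(start, end):
            for col in range(start, end):
                mask[row][col] = True
        start = end
    return mask


# Small stand-in for four 512-token samples packed into 2,048 positions.
small = block_diagonal_mask([2, 3])
for row in small:
    print("".join("1" if allowed else "." for allowed in row))
```

Calling `block_diagonal_mask([512] * 4)` gives the 2,048 x 2,048 case described in the quoted sentence.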
Hi, thanks for releasing code and weights for GLM-130B.
The README says that GLM-130B was trained partly on the "1.2T Pile corpus for English". The Pile size is 825 GiB, or 0.886 TB.
Was there any English data used to train GLM-130B in addition to the Pile?
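The unit conversion in the question can be checked directly: a GiB is 2**30 bytes and a TB is 10**12 bytes. The short sketch below verifies only that arithmetic and makes no claim about what the README's "1.2T" figure measures.

```python
GIB = 2 ** 30   # bytes in one gibibyte
TB = 10 ** 12   # bytes in one terabyte


def gib_to_tb(gib: float) -> float:
    """Convert gibibytes to terabytes."""
    return gib * GIB / TB


print(round(gib_to_tb(825), 3))  # 0.886
```

So the difference between binary and decimal units does not by itself account for the gap between 0.886 TB and 1.2T.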
Can I apply the generate script scripts/generate.sh to the GLM-10B Chinese checkpoint in the GLM repository?
What are the device requirements for generation only (zero-shot)?
I have finetuned some smaller models like GLM-10b using SAT and prefix tuning. Is there a standard way I can use the GLM-130B model with the SAT toolset?
Hi!
I have checked #9, but those are all monolingual models. Are bilingual GLM models at smaller sizes, such as 1.5B or 2.7B, available?
Running generate.sh with the "model_glm_130b_int4.sh" configuration still reports an error. The machine has 157 GB of physical memory plus 195 GB of swap, and 4x V100 GPUs.
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
/workspace/generate.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-10-20_08:16:53
host : 8bdf70b6de4a
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1286)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1286
=====================================================
I'm developing a chat bot on top of GLM-130B.
Currently I'm using "[MASK]" at the end of the dialogue to generate the bot's response.
[gMASK] is too slow for me on my 8x V100 server.
Your GLM repo https://github.com/THUDM/GLM reports that [sMASK] can be used for sentence generation.
But I didn't find any documentation on it in this repo. Does GLM-130B support [sMASK] for sentence generation?
Do you have any plans to expose more GLM-130B APIs, such as computing LM perplexity, multiple-choice selection, or other features? Since you have already tested the model on Few-CLUE, there must be ways to use those features.
Hi, in your paper you describe storing the weights in INT8 but casting them to FP16 for the computation. At inference time, do you actually compute in INT8 (rather than FP16) to obtain a speedup, given that you are using FasterTransformer, which supports kernels that use INT8 tensor cores?
I have sent the add request many times but never received a response; every request has expired.
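The distinction in this question, integer storage versus floating-point compute, can be sketched with a simple weight-only scheme. The example below uses symmetric per-row absmax quantization and plain Python floats as a stand-in for FP16. It is a generic illustration, not the kernels used by the repository or by FasterTransformer, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_row(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one weight row to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize_row(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from stored integers."""
    return [q * scale for q in quantized]


def linear_row(quantized: Sequence[int], scale: float,
               activations: Sequence[float]) -> float:
    """Weight-only scheme: dequantize first, then do the dot product in float."""
    weights = dequantize_row(quantized, scale)
    return sum(w * a for w, a in zip(weights, activations))


row = [0.5, -1.0, 0.25]
q_row, row_scale = quantize_row(row)
print(q_row, row_scale)
print(linear_row(q_row, row_scale, [1.0, 2.0, 3.0]))
```

In a scheme like this the saving is in memory, since the arithmetic still runs in floating point. Computing directly on INT8 tensor cores would also require quantizing the activations, which is a different design.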
Great work!
Any plans to integrate the FasterTransformer recipe and code (https://github.com/THUDM/GLM-130B/blob/main/docs/inference-with-fastertransformer.md) with the Triton FasterTransformer backend (https://github.com/triton-inference-server/fastertransformer_backend)?
How can I run batch inference? Thanks!
When I load the INT4 model, I get the following error.
The run command is: bash scripts/generate.sh --input-source input.txt
I use two A6000 GPUs (2x 48 GB).
Traceback (most recent call last):
File "/ssd1/xingyum/GLM-130B/generate.py", line 210, in <module>
main(args)
File "/ssd1/xingyum/GLM-130B/generate.py", line 156, in main
model, tokenizer = initialize_model_and_tokenizer(args)
File "/ssd1/xingyum/GLM-130B/initialize.py", line 72, in initialize_model_and_tokenizer
load_checkpoint(model, args)
File "/home/xingyum/anaconda3/envs/vis/lib/python3.10/site-packages/SwissArmyTransformer/training/model_io.py", line 181, in load_checkpoint
missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
File "/home/xingyum/anaconda3/envs/vis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GLM130B:
size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([18816, 12288]) from checkpoint, the shape in current model is torch.Size([75264, 12288]).
size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([4608, 12288]) from checkpoint, the shape in current model is torch.Size([18432, 12288]).
size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([4608]) from checkpoint, the shape in current model is torch.Size([18432]).
size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 6144]).
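The shapes in this log are consistent with a checkpoint and a model that were split for different tensor-parallel sizes. As a hedged sketch, assuming the fused query_key_value weight has 3 x 12288 rows before splitting (the hidden size 12288 is taken from the log; the 3x factor is an assumption), the shard counts can be inferred like this:

```python
def shard_count(full_dim: int, shard_dim: int) -> int:
    """Number of equal shards of size shard_dim that make up full_dim."""
    if shard_dim <= 0 or full_dim % shard_dim != 0:
        raise ValueError("shard_dim must evenly divide full_dim")
    return full_dim // shard_dim


HIDDEN = 12288            # second dimension reported in the log
FULL_QKV = 3 * HIDDEN     # assumption: fused Q, K, V rows before splitting

checkpoint_shards = shard_count(FULL_QKV, 4608)    # shape stored in the checkpoint
model_shards = shard_count(FULL_QKV, 18432)        # shape the current model expects

print(checkpoint_shards, model_shards)  # 8 2
```

Under that assumption the checkpoint files look like an 8-way split while the two-GPU run builds a 2-way model. The dense weight agrees: 12288 / 1536 = 8 and 12288 / 6144 = 2.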
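Hello, I tried GLM-130B and CodeGeeX, and the results are impressive. Would you consider combining the two models into one? For example, by continuing to pre-train GLM-130B on the CodeGeeX dataset.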
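I have the following questions:
... would give higher throughput, so why was 4224 chosen in the end? Also,
BSZ = 176 * 24 = 4224, and 24 is exactly the data-parallel (dp) size, so does the 176 require gradient accumulation? Does gradient accumulation behave significantly differently when training large models compared with small ones?
The analysis is that the distribution shift may still be too drastic; first try training with plain text + reshuffle
warmup-samples-after-loading
What does this operation do? Does it gradually transition from balanced multi-task training to multi-task training with a weighted distribution?
I found that the tokenizer removes newlines (\n) by default. Is '\n' included in the training corpora?
I was trying to use '\n' to separate multiple samples (few-shot learning), and I am comparing with other models, so it is better not to change the prompt. Is it recommended to set the tokenizer's ignore_linebreak=False, so that '\n' is encoded as 20004?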
Thank you very much!