thudm / glm-130b Goto Github PK

GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)

License: Apache License 2.0

Shell 4.37% Python 92.01% Makefile 0.46% Cuda 3.15%

glm-130b's Issues

Is it possible/recommended to do open generation?

Thanks for sharing the great work!

I'm curious if it is appropriate to do open generation. E.g. put only [gMASK] at the beginning and then complete the text.

Thank you again!

4x 80gb A100 vs 8x 40gb A100

GCP prices 8x 40gb A100's at 50% more than 4x 80gb A100's. Would I be able to accomplish the same results with a little tweaking of the default config?

FasterTransformer benchmark-generation.sh bug

I try to run GLM FasterTransformer benchmark-generation.sh(without load model checkpoint)，but encounter a bug as follows:

CUDA error: invalid argument
Exception raised from alloc_block at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:1037 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f8738f8063c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x25dd2 (0x7f8738fdfdd2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2b278 (0x7f8738fe5278 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2cd8c (0x7f8738fe6d8c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2d2f8 (0x7f8738fe72f8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x103 (0x7f873c3e00a3 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x35079fb (0x7f873c5179fb in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x3507a8f (0x7f873c517a8f in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x1d5c77f (0x7f878593677f in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::empty_memory_format::call(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1e5 (0x7f87856e3ac5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::empty(c10::ArrayRef<long>, c10::TensorOptions, c10::optional<c10::MemoryFormat>) + 0x1d3 (0x7f86cdd75643 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #11: fastertransformer::Allocator<(fastertransformer::AllocatorType)2>::malloc(unsigned long, bool) + 0xe6 (0x7f86cdd89046 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #12: fastertransformer::GlmContextDecoder<__half>::allocateBuffer() + 0x70 (0x7f86cddc6a00 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #13: fastertransformer::GlmContextDecoder<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, std::vector<fastertransformer::GlmDecoderLayerWeight<__half>*, std::allocator<fastertransformer::GlmDecoderLayerWeight<__half>*> > const*, fastertransformer::LayerNormWeight<__half> const*) + 0x1f0 (0x7f86cddcb8e0 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #14: fastertransformer::Glm<__half>::encode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) + 0x1517 (0x7f86cdda6f07 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #15: torch_ext::FTGlm<__half>::encode(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int) + 0xd44 (0x7f86cdd90134 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #16: torch_ext::GlmOp::encode(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, long) + 0x10f (0x7f86cdd6e31f in /root/FasterTransformer/build/lib/libth_glm.so)
frame #17: <unknown function> + 0x7344b (0x7f86cdd8a44b in /root/FasterTransformer/build/lib/libth_glm.so)
frame #18: <unknown function> + 0x69ee6 (0x7f86cdd80ee6 in /root/FasterTransformer/build/lib/libth_glm.so)
frame #19: PyCFunction_Call + 0x54 (0x55f91235f914 in /opt/conda/bin/python)
frame #20: _PyObject_MakeTpCall + 0x31e (0x55f912362ebe in /opt/conda/bin/python)
frame #21: <unknown function> + 0x1b85de (0x55f9123e85de in /opt/conda/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x4d33 (0x55f9124043c3 in /opt/conda/bin/python)
frame #23: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #24: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #25: <unknown function> + 0x1b848c (0x55f9123e848c in /opt/conda/bin/python)
frame #26: PyObject_Call + 0x5e (0x55f912351b6e in /opt/conda/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x21bf (0x55f91240184f in /opt/conda/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #29: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #30: _PyObject_FastCallDict + 0x2fd (0x55f9123d1d2d in /opt/conda/bin/python)
frame #31: _PyObject_Call_Prepend + 0xcf (0x55f9123d229f in /opt/conda/bin/python)
frame #32: <unknown function> + 0x1a2329 (0x55f9123d2329 in /opt/conda/bin/python)
frame #33: _PyObject_MakeTpCall + 0x31e (0x55f912362ebe in /opt/conda/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x55f5 (0x55f912404c85 in /opt/conda/bin/python)
frame #35: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #36: _PyFunction_Vectorcall + 0x378 (0x55f9123e7818 in /opt/conda/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x947 (0x55f9123fffd7 in /opt/conda/bin/python)
frame #38: _PyEval_EvalCodeWithName + 0x2c3 (0x55f9123e6433 in /opt/conda/bin/python)
frame #39: PyEval_EvalCodeEx + 0x39 (0x55f9123e7499 in /opt/conda/bin/python)
frame #40: PyEval_EvalCode + 0x1b (0x55f912482ecb in /opt/conda/bin/python)
frame #41: <unknown function> + 0x252f63 (0x55f912482f63 in /opt/conda/bin/python)
frame #42: <unknown function> + 0x26f033 (0x55f91249f033 in /opt/conda/bin/python)
frame #43: <unknown function> + 0x274022 (0x55f9124a4022 in /opt/conda/bin/python)
frame #44: PyRun_SimpleFileExFlags + 0x1b2 (0x55f9124a4202 in /opt/conda/bin/python)
frame #45: Py_RunMain + 0x36d (0x55f9124a477d in /opt/conda/bin/python)
frame #46: Py_BytesMain + 0x39 (0x55f9124a4939 in /opt/conda/bin/python)
frame #47: __libc_start_main + 0xf3 (0x7f87d07a30b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #48: <unknown function> + 0x1e8f39 (0x55f912418f39 in /opt/conda/bin/python)

Following is my environment:

docker image: nvcr.io/nvidia/pytorch:21.09-py3 or nvcr.io/nvidia/pytorch:22.05-py3
GPU: 3090(Driver Version: 470.57.02 CUDA Version: 11.4) or A100(Driver Version: 470.57.02 CUDA Version: 11.7)
CUDA_LAUNCH_BLOCKING=1

Hugging Face transformers integration

Greetings,

Are there any plans for integrating GLM-130b in the transformers library? (it seems only the small glm-10b is available at the moment)

We are trying to use the generated output to send additional queries to the model in batch mode and the current setup of the generate.sh script is difficult to integrate with existing code, at least compared to Bloom and similar.

Thanks,

Alfredo

Inference with 3090*16

Hi,

I want to deploy GLM-130B to two 3090 * 8 nodes for inference (3090*16).

I think the memory is enough, but I'm not familiar with distributed inference.

Maybe I need to do the following things:

model parallel and pipeline parallel
a distributed API server
...

Could you provide me with some ideas or materials?

Thanks.

Is GLM with smaller model size like 1.5B, 2.7B available?

there are an plan to release the training part?

The stable training contribution mentioned in the document, while its code and script are not released. Is there an open source plan ?

GLM-10B和GLM-130B

你好，看到GLM-130B采用Ext5的方式加入了instruction tuning进行指令微调，请问GLM-10B也有引入instruction tuning吗？

about：conda install -y cmake numpy pybind11 pytorch torchvision cudatoolkit-dev cudnn

Excuse me, I will report an error when I execute this statement. How to solve it？

is code generation pretraining task

Hey, thx for releasing this amazing work!
I wonder do you pre-train this model on code generation tasks?

1xA100 80GB inference in INT4?

Thanks for making such a powerful model widely available! Very impressive work to get it to run on a single node using all open source methods.

I took it for a spin on an 8x A100 40GB machine and got some nice results.

Have you tried running the model on a single A100 80GB or an H100? Can it run without off-loading the weights to CPU?

I looked at the low resource info and did some simple calculations and it looks like

The FP16 model has 260GB of weights and runs smoothly on 320GB of VRAM (eg an 8x A100 40GB or 4x A100 80GB).
The INT4 model has 65GB of weights, so it should run smoothly on 65 * 320/260 = 80 GB VRAM.

If that's the case, it'd be great to know because single-card setups are even easier to work with than single-node, and the H100s are coming soon.

Good replacement for `\n`

Related: #17

Hi, since \n characters are ignored, what would be the next best option to use instead when prompting GLM with in-context examples?

For example, for other models where \n is not ignored, we input prompts that look like this:

Passage: The triangle is above the red sphere.
The pink rectangle is to the left of the red sphere.
Question: Is the triangle to the left of the pink rectangle?
Answer: no

Passage: The chest is bigger than the suitcase.
The box is bigger than the suitcase.
The chest fits inside the box.
The suitcase is bigger than the box of chocolates.
The container fits inside the box.
Question: Does the suitcase fit in the box?
Answer: yes

Passage: Mary travelled to the bedroom.
Daniel travelled to the office.
Daniel journeyed to the hallway.
Mary travelled to the hallway.
Sandra travelled to the kitchen.
Mary travelled to the kitchen.
John journeyed to the garden.
Daniel went to the bathroom.
Question: Where is Sandra?
Answer: kitchen

Passage: The hallway is west of the kitchen.
The office is east of the kitchen.
Question: What is the kitchen west of?
Answer: office

Passage: This morning Fred moved to the school.
Julie went back to the cinema yesterday.
Mary travelled to the bedroom yesterday.
Fred journeyed to the bedroom yesterday.
Bill travelled to the kitchen yesterday.
This afternoon Fred journeyed to the office.
Fred travelled to the park this evening.
Mary went to the office this morning.
This afternoon Mary went back to the cinema.
This morning Julie travelled to the office.
Question: Where was Mary before the office?
Answer: bedroom

Passage: The hallway is north of the office.
The bathroom is south of the office.
Question: What is north of the office?
Answer:

I was wondering what the best practice for prompt construction for GLM was, especially for the case where there are in-context examples.

Met exception when run the fastertransformers demo

Thank you for your awesome work!
when I follow the steps provided here, I just met the exception:

Traceback (most recent call last):
  File "/FasterTransformer/examples/pytorch/glm/glm_server.py", line 101, in <module>
    if not glm.load(ckpt_path=args.ckpt_path):
  File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 319, in load
    is_load = self.weights.load(ckpt_path, tensor_para_rank=self.tensor_para_rank,
  File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 190, in load
    scale.extend([module[f'transformer.layers.{i}.attention.query_key_value.weight_scale'].reshape(head_num, num_splits, size_per_head).permute(1, 0, 2).reshape(3, local_dim) for i in range(layer_num)])
  File "/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 190, in <listcomp>
    scale.extend([module[f'transformer.layers.{i}.attention.query_key_value.weight_scale'].reshape(head_num, num_splits, size_per_head).permute(1, 0, 2).reshape(3, local_dim) for i in range(layer_num)])
KeyError: 'transformer.layers.0.attention.query_key_value.weight_scale'

It seems that state_dict is missing some keys

How to cite the repo

Hi there!

Thanks for the great work!
Was wondering how can we cite this work ?

Thanks

GLM-130B + Contrastive Search?

Hi,

This work looks really interesting!

I am curious about the performance of GLM with the SOTA decoding method, i.e. contrastive search [1], in open-ended text generation. Could you provide some examples generated by GLM with contrastive search?

You can find a tutorial on how to apply contrastive search here (https://github.com/yxuansu/SimCTG#441-chinese-language-model).

Many thanks! :-)

[1] - Su et al., 2022. A Contrastive Framework for Neural Text Generation

language model evaluation at idx = 0

hi, I'm still looking for computing perplexity with GLM.

Just look into the recent updates on evaluation/{dataset.py, tasks.py} about language model task.

The code inside dataset.py:LanguageModelTaskDataset:297 is

        if idx == 0 or self.config.unidirectional:
            prompt, text = tokens[:1], tokens[1:]
        else:
            prompt_length = self.config.max_seq_length - 1 - self.config.generation_length
            prompt, text = tokens[:prompt_length], tokens[prompt_length:]

        # ..... skip ....
        return {
            "tokens": np.array(prompt + [mask_id, sop_id] + text[:-1], dtype=np.int64),
            "targets": np.array(prompt + [mask_id] + text, dtype=np.int64),
            "position_ids": np.arange(0, seq_length, dtype=np.int64),
            "attention_mask": attention_mask < 0.5,
            "loss_masks": np.array([0] * (len(prompt) + 1) + [1] * len(text), dtype=np.int64),
        }

at idx==0, you take the full text as prompt input and also the output text.
It would lead to absolutely lower PPL. Because model has a full view of what it needs to predict.
Why wouldn't set the prompt to empty list?

Inference with FasterTransformer with GLB-130B

Hi!
I am trying to configure the GLM-130B models with FasterTransformer and I need to convert glm ckpt files, so where i can get model_optim_rng.pt file?
And I'm facing this
CMake Error at cmake/Modules/FindNCCL.cmake:153 (message): Found NCCL header version and library version do not match! (include: /home/ubuntu/anaconda3/envs/glm/include, library: /home/ubuntu/anaconda3/envs/glm/lib/libnccl.so) Please set NCCL_INCLUDE_DIR and NCCL_LIB_DIR manually. Call Stack (most recent call first): CMakeLists.txt:41 (find_package)
while i'm trying to make build using this command cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..

My basic task is to minimize the inference time I also configured THUDM/GLM-130B main branch and I set MAX_OUTPUT_LENGTH=64 it takes about 55s to generate a response.
Machine Specs: (V100) 8 * 32GB
Thanks

INT4版本

请问有INT4版本的GLM-130B的下载地址吗？

M1Max 64G inference support? or even M2Max (in future)?

I think it is cheaper.

Hello, can I apply to call this API?

Hello, can I apply to call this API?
url = 'https://wudao.aminer.cn/os/api/api/v2/completions_130B'

AttributeError: module 'bminf' has no attribute 'wrapper’

System configuration: CentOS 7, Python 3.9, Pytorch 1.10.1.
I used the low-resource inference based on the bminf module. But, I get this error:

AttributeError: module 'bminf' has no attribute 'wrapper’

ValueError: Missing keys for inference: ['mixins.rotary-embedding.rotary_emb.inv_freq'].

When I used the low-resource inference, I encountered this error:

ValueError: Missing keys for inference: ['mixins.rotary-embedding.rotary_emb.inv_freq'].

Before I used this model for inference, I check the completeness of the 60 checkpoints.

How to use the code for multinode inference?

Hi really appreciate the great work!

I am wondering, is there a straightforward way to adapt the code for multinode inference?

I got 3 A100s each with 3 GPUs of 40GB memory.

Does this code naturally support multinode inference? If so where in the code shall I tune it?

Thanks!

The config for glm-10b

I need to use the glm-10b with scripts/generate.sh and set MODEL_TYPE='glm-10b' in the config file in configs.

However, there are still errors that [Errno 2] No such file or directory: 'XXX/glm-10b-en/126000/mp_rank_01_model_states.pt' and [Errno 2] No such file or directory: 'XXX/glm-10b-en/126000/mp_rank_02_model_states.pt' maybe because glm_130b is used.

How can be the config file modified in configs to use glm-10b instead of glm-130b?

Looking forward to reply. Thanks.

Use GLM-130B on machine translation task.

Hi!
After reading a lot of information, it seems that in the field of machine translation, it is more likely to use a small amount of parallel corpus for fine-tuning, and I feel that it may work better for some low-resource languages. But it seems that it is difficult to improve the performance of rich corpus languages.

I have checked GLM papers and found no performance analysis on the machine translation task. Is it possible to use GLM-130B to improve machine translation performance in English-Chinese translation tasks? Are there any experiments or best practices about this?

试了下，做开放性问答。感觉overfit啊

如题

No input text for generation, why is the GPU occupancy 100%?

Fri Oct 21 11:05:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   38C    P0    53W / 300W |  20392MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   41C    P0    66W / 300W |  20392MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   40C    P0    59W / 300W |  20248MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   40C    P0    67W / 300W |  20248MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14678      C   /opt/conda/bin/python           20389MiB |
|    1   N/A  N/A     14679      C   /opt/conda/bin/python           20389MiB |
|    2   N/A  N/A     14682      C   /opt/conda/bin/python           20245MiB |
|    3   N/A  N/A     14686      C   /opt/conda/bin/python           20245MiB |
+-----------------------------------------------------------------------------+

Tensor parallel dimension conversion script fails

Hello!

It seems the script for converting the tensor parallel dimension fails

Running for instance

python tools/convert_tp.py --input-folder "../glm/glm-130b-sat" --output-folder "../glm/four-div-glm-130b-sat" --target-tp 4

Yields

Traceback (most recent call last):
  File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 154, in <module>
    main(args)
  File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 149, in main
    torch.save(create_checkpoint(sd_list, i, original_tp, args.target_tp, args.quantization_bit_width), save_path)
  File "/extra/ucinlp1/dylan/GLM-130B/tools/convert_tp.py", line 121, in create_checkpoint
    new_sd[key], new_sd[f"{key}_scale"] = new_sd[key]
ValueError: too many values to unpack (expected 2)

Any advice here? Thanks 🙏🏻

FasterTransformer conda issue

Thanks a lot for sharing the code.
I followed the steps mentioned here for running it locally without docker, but I am getting the following error.

Traceback (most recent call last):
  File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/glm_server.py", line 105, in <module>
    glm.init_model(512,# output_len,
  File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 375, in init_model
    self.cuda()
  File "/projects/tir4/users/zhengbaj/exp/GLM-130B/FasterTransformer/examples/pytorch/glm/../../../examples/pytorch/glm/utils/glm.py", line 359, in cuda
    self.model = self.Glm(get_torch_default_comm(), self.rank, self.head_num, self.size_per_head, self.head_num * self.size_per_head * 8 // 3,
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. libth_glm.Glm(arg0: c10d::ProcessGroupNCCL, arg1: int, arg2: int, arg3: int, arg4: int, arg5: int, arg6: int, arg7: int, arg8: int, arg9: int, arg10: int, arg11: int, arg12: int, arg13: List[at::Tensor], arg14: List[at::Tensor], arg15: List[at::Tensor])

Building an API for GLM-130B

Hi!
I am trying to build an API for GLM-130B model. So far, I have tried to run GLM model and FastAPI server from generate.sh script with no success. I also tried to run the GLM model on the start_event of FastAPI with no success. Is there any way through which I can use the model to generate response through API.
Thanks

BIG-bench-lite evaluation code

Hi thanks for the great work!

Is there a plan on sharing the code and data you specifically used for evaluating BIG-bench-lite?

It might be important for recreating the results given the decision points regarding prompt design etc.

Question about finetuning

Is there any tutorial/code on the finetuning of this GLM-130B model?

Question about sample concatenation during training

Hi,

Thanks for your work and open-source!

There's one point I'm confused about: In your paper (section 2.3, last paragraph), you said

For the [MASK] and multi-task objectives, we use a context window of 512 and concatenate four samples together to cater the 2,048-sequence-length

I wonder if there's a special attention mask to ensure that each sample should only attend to itself, and not attend to other samples?
(e.g., something like a block diagonal attention mask as the following, where each block corresponds to one sample, respectively?)

Otherwise it would be weird to concatenate multiple independent samples together just for computation efficiency, or am I missing something here? (Since there's no training code in the repo yet)

Clarification on reported dataset size for the Pile (`1.2T` vs `825GiB`)

Hi, thanks for releasing code and weights for GLM-130B.

The README says that GLM-130B was trained partly on 1.2T Pile corpus for English. The Pile size is 825 GiB or 0.886 TB.

Was there any English data used to train GLM-130B in addition to the Pile?

Generate script

Can I apply the generate script scripts/generate.sh on GLM-10B Chinese checkpoint in the GLM repository？

What is the device requirement for only-generatation (zero-shot)

What is the device requirement for only generatation (zero-shot)

[Question] Can I finetune GLM-130B with SAT framework?

I have finetuned some smaller models like GLM-10b using SAT and prefix tuning. Is there a standard way I can use the GLM-130B model with the SAT toolset?

Is bilingual GLM with smaller model size like 1.5B, 2.7B available?

Hi!
I have checked #9. But there are all monolingual models. Is bilingual GLM with smaller model sizes like 1.5B, 2.7B available?

[Disscussion] Can we align GLM-130B to human like chatgpt?

Run generate.sh with "model_glm_130b_int4.sh" configuration, still reporting an error, memory 157G (physical memory) + 195G (virtual memory, swap), 4*V100 graphics card.

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
/workspace/generate.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-10-20_08:16:53
  host      : 8bdf70b6de4a
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 1286)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1286
=====================================================

Does GLM-130 B support [sMASK] for sentence generation?

I'm developing a chat bot on top of GLM-130B.

Currently I'm using "[MASK]" at the end of dialogue for bot's response generation.
[gMASK] is too slow for me on my 8xV100 server.

Your GLM repo https://github.com/THUDM/GLM reports [sMASK] could be used for sentence generation.
But I didn't find any doc in this repo. Does GLM-130 B support [sMASK] for sentence generation?

Do you have any plan to export more API of GLM-130 B ? Such as compute LM perplexity / Multiple choice selection or any other features? Since you have already test the model on Few-CLUE, there must be ways to utilize those features.

INT8 inference

Hi, in your paper you talk about using INT8 dtype to store the weights, but they are cast to FP16 for the calculation. I was just wondering if at inference time do you actually calculate in INT8 (rather than FP16) given that you are using fastertransformer and that has support kernels which use INT8 tensor cores, to obtain an improvement in speed

一直无法添加微信群？

我发送了好多次添加请求，都没有得到回应，都是已过期。

Triton FasterTransformer Backend

Great work!
Any plans to integrate the FasterTransformer recipe and code (https://github.com/THUDM/GLM-130B/blob/main/docs/inference-with-fastertransformer.md) with the Triton FasterTransformer backend (https://github.com/triton-inference-server/fastertransformer_backend)?

how to batch inference

How to batch inference? Thanks!

BIG-Bench evaluation?

BIG-Bench (paper, code) is a large and diverse collaborative benchmark testing multiple capabilities of LLMs. I think it would be very beneficial to community to see the evaluation of GLM on this benchmark

Mismatch error when load int4 model

When I load the int4 model, I get the following error;
The run command is: bash scripts/generate.sh --input-source input.txt
I use two a6000 graphics cards (2*48G)

Traceback (most recent call last):
  File "/ssd1/xingyum/GLM-130B/generate.py", line 210, in <module>
    main(args)
  File "/ssd1/xingyum/GLM-130B/generate.py", line 156, in main
    model, tokenizer = initialize_model_and_tokenizer(args)
  File "/ssd1/xingyum/GLM-130B/initialize.py", line 72, in initialize_model_and_tokenizer
    load_checkpoint(model, args)
  File "/home/xingyum/anaconda3/envs/vis/lib/python3.10/site-packages/SwissArmyTransformer/training/model_io.py", line 181, in load_checkpoint
    missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
  File "/home/xingyum/anaconda3/envs/vis/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GLM130B:
        size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([18816, 12288]) from checkpoint, the shape in current model is torch.Size([75264, 12288]).
        size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([4608, 12288]) from checkpoint, the shape in current model is torch.Size([18432, 12288]).
        size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([4608]) from checkpoint, the shape in current model is torch.Size([18432]).
        size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 6144]).

GLM-130B+CodeGeex

您好，试了GLM-130B和CodeGeex的效果，很惊艳。请问是否考虑将两个模型结合成一个模型？例如：在GLM-130B的基础上采用CodeGeex的数据集进行继续预训练。

请教关于训练日志中的一些问题

我有如下问题想请教：

LargeScale 是一个开源工具包吗？我在搜索引擎和github中没有找到直接信息
在测速中，更大的全局batch size 会有更大的吞吐，为什么最终会选择4224呢？另外，BSZ=176 * 24=4224`，24正好是dp数，那176 需要梯度累加吗？大模型训练上用梯度累加跟小模型上会有显著差异吗？
如下面所引用的，咱们中英文的数据都是纯文本，由多任务数据换回原中英文数据？对整个数据进行重新shuffle吗？这样会不会导致模型训练到重复的数据？这个shuffle对大模型训练的稳定作用这么大吗？

分析可能是 distribution 变动仍然太过剧烈，先换纯文本 + reshuffle 尝试训练

warmup-samples-after-loading 这个是什么操作？是从平衡的多任务，逐渐转换为带权重分布的多任务吗？
这里的loss 爆炸体现在loss 上是nan 吗？还是只是突然增加数量级？

Does GLM-130B support newline (\n)?

I found the tokenizer by default will remove newline (\n). Is '\n' included in the training corpora?
I was trying to use '\n' to separate multiple samples (few-shot learning), and I was comparing with other models so it is better to not change the prompt. Is it recommended to set the tokenizer ignore_linebreak=False? where '\n' will be encoded to 20004.
Thank you very much!

thudm / glm-130b Goto Github PK

glm-130b's Issues

Recommend Projects

Recommend Topics

Recommend Org