blacksamorez / tensor_parallel Goto Github PK

View Code? Open in Web Editor NEW

567.0 7.0 36.0 323 KB

Automatically split your PyTorch models on multiple GPUs for training & inference

License: MIT License

Python 100.00%

deep-learning machine-learning natural-language-processing nlp python pytorch pytorch-transformers

tensor_parallel's Introduction

tensor_parallel

🚀 Try new 40B LLMs demo in Kaggle

Run large PyTorch models on multiple GPUs in one line of code with potentially linear speedup.

import transformers
import tensor_parallel as tp
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-13b")
model = transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-13b")  # use opt-125m for testing

model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])  # <- each GPU has half the weights

inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"].to("cuda:0")
outputs = model.generate(inputs, num_beams=5)
print(tokenizer.decode(outputs[0])) # A cat sat on my lap for a few minutes ...

model(input_ids=inputs, labels=inputs).loss.backward()  # training works as usual

Installation

Latest stable version (recommended):

pip install tensor_parallel

Bleeding edge version:

pip install https://github.com/BlackSamorez/tensor_parallel/archive/main.zip

Usage

Simply wrap your PyTorch model with tp.tensor_parallel and use it normally. For best memory efficiency, call tp.tensor_parallel while the model is still on CPU.

Here are a few use cases:

examples/training_flan-t5-xl.ipynb - fine-tune full FLAN-T5 model on text summarization
tensor_parallel int8 LLM - adapter-tuning a large language model with LLM.8bit + tensor_parallel
TBA - defining custom parallelism strategy

Advanced parameters to tensor_parallel:

device_ids: List[device] - which devices to use; defaults to all available GPUs
output_device: device - model outputs will have this device
tensor_parallel_config: tp.Config - use custom parallelism strategy, see slicing_configs.py
distributed: bool - if True, use torch.distributed backend instead of threading (requires torchrun)
sharded: bool - if True, find all trainable parameters that weren't split by Tensor Parallelism and split them using ZeRO-3 algorithm.
- weights will be split between GPUs and re-assembled before each forward pass
- TL;DR use this when training to avoid duplicate parameters (enabled by default!)
- sharded_param_names: List[str] - parameter names that should be sharded this way, default = found automatically

Saving the model

To save a model such that it could be used in a non tensor_parallel context, you should use a save_tensor_parallel context wrapper.

import torch
import transformers
import tensor_parallel as tp

model = tp.tensor_parallel(
    transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-13b"), 
)

# A whole lot of trainig...

with tp.save_tensor_parallel(model):
    torch.save(model.state_dict(), "/tmp/")
    # or 
    model.save_pretrained("/tmp/")

Such code saves a model as if it was never split. It works by gathering model parts during state_dict creation.

Memory efficient dispatch

Normally, to normally create and dispatch a tensor_parallel model, one needs the whole model in memory. This can be troublesome, but there is another way.

It's possible to convert a state_dict of a basic model into the corresponding tensor_parallel state_dict using a helper function convert_state_dict. The state dict can then be dispatched and loaded into the model:

import accelerate
import transformers

import tensor_parallel as tp

# Initialize a weightless tensor_parallel model from MyModel
with accelerate.init_empty_weights():
    model = tp.TensorParallel(
        MyModel(),
        device_ids=[0, 1] # and prepare it to be put on GPUs 0 and 1
    )

# Load partial state_dict for MyModel
state_dict = torch.load("my_model_part_1_of_5.bin")

# Convert it into a tensor_parallel state_dict
tensor_parallel_state_dict = tp.convert_state_dict(
    state_dict,
    tensor_parallel_config=model.tensor_parallel_config,
    world_size=len(model.devices),
)

# Dispatch the partial state_dict (load_state_dict doesn't work with meta so here I use accelerate)
device_map = tp.infer_sharded_device_map(model)
for param_name, param in state_dict.items():
    module_name = param_name
    while len(module_name) > 0 and module_name not in device_map:
        module_name = ".".join(module_name.split(".")[:-1])
    param_device = device_map[module_name]
    accelerate.utils.set_module_tensor_to_device(model, param_name, param_device, value=param)

With this no more than one part of the model needs to be loaded into memory at once.

FAQ

Q: I don't have a multi-GPU server. Can I use tensor_parallel in Google Colab?
A: Colab has a single GPU, so there's no point in tensor parallelism. However, Kaggle offers two T4 for free to all phone-verified accounts.
Q: What is tensor parallelism?
A: You split each layer's weights into parts, multiply each part on a separate GPU, then gather results. Read more here
Q: Should I use TensorParallel or DataParallel?
A: TensorParallel for large models, DataParallel for smaller ones
Q: How does it compare against FullyShardedDataParallel and ZeRO?
A: ZeRO is better if you can fit a large batch, TensorParallel is better for small batches

Why use tensor_parallel ...

v.s. DeepSpeed and FairScale
- DeepSpeed has many parallelization strategies, but requires careful configuration
- tensor_parallel has one strategy that works with 1 line of code
- tensor_parallel works in a jupyter notebook
v.s. MegatronLM
- MegatronLM has great tensor parallelism for one model architecture
- tensor_parallel has good parallelism for any architecture
- tensor_parallel is way easier to install
v.s. parallelformers
- parallelformers is inference-only, tensor_parallel supports training
v.s. alpa
- alpa is a powerful tool for automatic distributed training / inference in JAX
- tensor_parallel works with PyTorch
v.s. Model.parallelize()
- both are easy to use, both fit large models
- in parallelize, one GPU works at a time
- in tensor_parallel, GPUs work in parallel

In short, use tensor_parallel for quick prototyping on a single machine. Use DeepSpeed+Megatron or alpa for million-dollar training runs.

Troubleshooting

If you experience NCCL errors, or random hanging, you may have some code errors that are not displayed properly. To debug these errors, we recommend restarting with export TENSOR_PARALLEL_USE_NATIVE=1 or on a single device.

If you found a bug or encountered a problem, please report it to our issue tracker. We will do our best to help, but it may take some time before we get to it. Please create issues only if your problem is specifically with tensor_parallel. For example, if you need help installing transformers or optimizing your code, please seek it elsewhere.

Code style

We use black and isort for all pull requests. Before committing your code, simply run black . && isort . and you will be fine.

tensor_parallel's People

Contributors

Stargazers

Watchers

tensor_parallel's Issues

Request to fix the content about parallelformers in README.

Hello! Thanks for great work.
I am an author of parallelformers, and I saw you mentioned parallelformers in README.

parallelformers implements a fixed [list of architectures](https://github.com/tunib-ai/parallelformers/tree/main/parallelformers/transformers)

First, I would like you to fix the link to https://github.com/tunib-ai/parallelformers/tree/main/parallelformers/policies. In fact parallelformers supports more than 60 architectures with pre-defined policies and users also can parallelize unsupported models with their own policy class if want like this. But in the link you mentioned, there are only parts of total supported models. It's not good to inform incorrect information to users.

Second, I found this code and #45. And you mentioned like 'this library supports any architecture automatically' in your README. Then why are you creating these configs even though tensor_parallel library can split any model 'automatically'? (and currently there are only 7 pre-defined configs. am I right?)

I am really interested in how could you implement tensor model parallelism for any model architecture automatically. I discussed this with MS DeepSpeed and HF transformers team for a year, and we couldn't find appropriate method for automation. We've tried to automate this process for some models, but it couldn't parallelize ALL models stably. because there are so many model architectures in HF transformers and some models have different structures with other models. (You can see this the discussion here) I also wonder how many models did you test with your library.

That's why parallelformers and deepspeed were implemented with model-specific policy classes. You can see injection_policy argument here. If you found better and stable methods than us, I would like to hear its mechanism and learn from you.

Third, we are creating new a library, https://github.com/EleutherAI/oslo for distributed model training and it supports TP, PP, DP, ZeRO, MoE and theirs mixture. (you can see the example here) parallelformers was developed with producer-consumer architecture for web server deployment, so it is not appropriate for model training. After developing the parallelformers, I decided to create a new library for model training. If you are interested in to connect with us, please feel free to let me know. I am interested in adding your own great TP algorithm to our library.

Thanks.

TensorParallelPreTrainedModel dislikes gradient checkpoints in some cases

In the T5 tutorial, gradient checkpointing works fine (and correctly) in a custom training loop, but running it inside Trainer causes "trying to backward a second time" error.

Curiously,

this happens both with and without sharding
model can be fully trained with a custom for loop with no issue
once you call trainer.train, all subsequent backward calls will trigger an error

Huggingface Accelerate

Hi all,

Thank you for the great work.

I was wondering if this library was compatible with Huggingface accelerate.prepare() for training?

Thank you,

Enrico

Example: finetune T5-XXL or mT5-XXL with prompt-tuning or similar

depends on: #14
pre-process (shard) T5 checkpoint so that it can be loaded into kaggle
- make sure you keep the pre-processing script or save a link to that for the readers!
push sharded model to HF
write a LoRA or prefix-tuning or IA3 wrapper for T5 or use adapter_transformers
train the model on something funny

Does tensor_parallel support the model inference concurrently or in multi-threads?

When I use tensor_parallel to make my model inference on two GPUs(the model is deployed on Flask pywsgi), I also use gunicorn to manage my flask app to make it can accept many request and then make my model inference concurrently, I test this on single GPU is ok, but as long as I use tensor_parallel lib, it will comes an error:tensor_parallel/cross_device_ops.py", line 78, in forward inputs = tuple(map(torch.Tensor.contiguous, inputs)) TypeError: descriptor 'contiguous' for 'torch._C._TensorBase' objects doesn't apply to a 'NoneType' object， it also raise this exception:TypeError: Caught TypeError in replica 0 on device 0., what shall I do? Can you analysis the problem?

Example: inference OPT-13B in kaggle with benchmarks

Model: https://huggingface.co/facebook/opt-13b
Smaller model for experiments: https://huggingface.co/facebook/opt-350m

It would be great to allow inference for this model

Steps:

write a config in ./tensor_parallel/slicing_configs.py
- make sure you split / gather KV similarly to BloomModel
find a way to shard a checkpoint such that the model is fully loaded with kaggle's limited ram
- please ping @me or @BlackSamorez on t5 checkpoint example
benchmark performance
- baseline: model.parallelize
- inference, batch size = 1 sequence
- forward no_grad, batch = 4x 64 tokens (if it fits!)
- forward+backward batch = 4x 64 tokens (if it fits!), make sure you model.gradient_checkpointing_enable() in both cases

What is the difference between this project and autotp of deepspeed?

It seems that in deepspeed there is an auto tensor parallel scheme to use.

How to use trained models?

Hi, I trained BLOOM using this library, but how do I load it for inference? Architecture BloomForCasualLM changed to TensorParalelForPretraining @BlackSamorez

Torch version requirement

Hi may I know the environmental requirement of this tool?

I have torch1.9-cu111 installed locally, but when I tried to install this tool, it automatically installed torch1.13 and uninstalled the original torch1.9.

And earlier there was another time when I have torch1.12-cu102 installed in advance and installed tensor-parallel afterwards, and it seems happy with torch 1.12.

So what is the minimum requirement of it? Having a hard time to work around all the compatibility thing amid torch/cuda/etc ....

model.generate() with inputs_embeds

Hi! A very easy-to-use library.

When I call model.generate(inputs_embeds=...) with inputs_embeds instead of input_ids, it does not seem to have been implemented.

*** ValueError: You passed `inputs_embeds` to `.generate()`, but the model class TensorParallelPreTrainedModel doesn't have its forwarding implemented. See the GPT2 implementation for an example (https://github.com/huggingface/transformers/pull/21405), and feel free to open a PR with it!

Can we have this feature? Thank you!

automatic releases

It can be possible to set up a script that automatically releases the newer version to PyPI
Consider this automation script as an example:

https://github.com/learning-at-home/go-libp2p-daemon/blob/f477ed00d1d8277e618c41c3122eb6989ea62264/.github/workflows/release.yml

When I try to do the tensor_parallel on NLLB from meta, there is an error

Here is my code:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import tensor_parallel as tp

tokenizer = AutoTokenizer.from_pretrained("/workspace/projects/nllb/nllb3.3b", use_auth_token=True, src_lang="deu_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("/workspace/projects/nllb/nllb3.3b", use_auth_token=True)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])

parallelize(model, num_gpus=2, fp16=True, verbose='detail')

model.to('cuda')

article = "Die Ware hat unter 20 Euro gekostet."
inputs = tokenizer(article, return_tensors="pt").to("cuda:0")

output1=translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["zho_Hans"], max_length=500)

output1=translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["zho_Hans"])

out=tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

print(out)

#############################################################
And here is the error:
Using automatic config: tensor parallel config not provided and no custom config registered for the model
The following patterns in state_rules were unused: ["re.compile('^model.decoder.embed_tokens.weight$')", "re.compile('^model.encoder.embed_tokens.weight$')"]
The following patterns in state_rules were unused: ["re.compile('^model.decoder.embed_tokens.weight$')", "re.compile('^model.encoder.embed_tokens.weight$')"]
Using ZeRO-3 sharding for 499712 non tensor-parallel parameters
Traceback (most recent call last):
File "test_nllb.py", line 14, in
output1=translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["zho_Hans"], max_length=500)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/workspace/tools/transformers/src/transformers/generation/utils.py", line 1518, in generate
return self.greedy_search(
File "/workspace/tools/transformers/src/transformers/generation/utils.py", line 2335, in greedy_search
outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensor_parallel/pretrained_model.py", line 78, in forward
return self.wrapped_model(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensor_parallel/sharding.py", line 95, in forward
return self.module(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensor_parallel/tensor_parallel.py", line 130, in forward
return parallel_apply(self.module_shards, inputs, kwargs_tup, self.devices)[self.output_device_index]
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 463, in reraise
raise exception
KeyError: Caught KeyError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/tools/transformers/src/transformers/models/m2m_100/modeling_m2m_100.py", line 1335, in forward
outputs = self.model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/tools/transformers/src/transformers/models/m2m_100/modeling_m2m_100.py", line 1220, in forward
last_hidden_state=encoder_outputs[0],
KeyError: 0

#############################################################
Do any guys know why😂😂😂😂

Does tensor_parallel support data parallel and tensor parallel hybrid training?

Example Question (got error) : Try new 40B LLMs demo in Kaggle

i just running jupyter example.
i got error like this

  File "/ssd/data01/ysh/test/.venv/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (6x8192 and 1x18874368)

i used A100 80GB * 2

and if i want to using huggingface trainer, how can i use? just change, for train loop to huggingface trainer?

why raised cuda error?

i used tiiuae/falcon-40b

and want to doing full fine-tuning by lima instruction dataset

model = tp.tensor_parallel(model, sharded=True)

just use like this and i have 1) A100 80GB * 2 and another server 2) A100 80GB * 4
but when i running code on 1) device or 2) device
raised error like this
Model parameters were moved to incorrect devices, did call on model.cuda() or model.to(device)? If so, please avoid doing that

why?

Great work！ and can this work with deepspeedzero?

great work.
seems more efficient than pp in my machine (8*2080ti 22g)
I am thinking about if this can work with zero or, forexample to combine 2 gpu to 1, than use zero to combine the 4 groups of gpu..

Does tensor_parallel support multi-node tensor parallel training?

Slow inference performance for large Llama models compared to naive MP

The inference speed of naive model parallel is much better than tensor parallel:

Setup: Llama-30b on 2080Ti 22G x4
Naive: 31.64s
4-way TP, main branch: 177.78s
4-way TP, llama branch: 102.22s

The code for naive inference

import torch
import time
import os
import json
import tensor_parallel
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import accelerate
from transformers.utils.bitsandbytes import replace_8bit_linear
from accelerate.hooks import remove_hook_from_module

model_name = 'models/llama-30b'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.half, device_map="balanced")

torch.cuda.empty_cache()
model = model.eval()
with torch.no_grad():
    batch = tokenizer(
        "DeepSpeed first included offloading capabilities with ZeRO-Offload, a system ",
        return_tensors="pt"
    )
    batch = {k: v.cuda(0) for k, v in batch.items()}
    print("Start")
    t0 = time.time()
    generated = model.generate(batch["input_ids"], attention_mask=batch["attention_mask"], max_length=200)
    t1 = time.time()
    print(f"Output generated in {(t1-t0):.2f} seconds")
    print(tokenizer.decode(generated[0]))

The code for TP:

import torch
import time
import os
import json
import tensor_parallel
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import accelerate
from transformers.utils.bitsandbytes import replace_8bit_linear
from accelerate.hooks import remove_hook_from_module

model_name = 'models/llama-30b'

tokenizer = AutoTokenizer.from_pretrained(model_name)
with accelerate.init_empty_weights():
    model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name)).half()
    model = tensor_parallel.TensorParallelPreTrainedModel(model)

device_map = tensor_parallel.infer_sharded_device_map(model) # <- The model is on meta device but we can sill deduce
                                                #    the target devices for each weight using this helper function
# Get nums parts
with open(f"{model_name}/pytorch_model.bin.index.json", "r") as index_file:
    shard_filenames = set(json.load(index_file)["weight_map"].values())

for shard_filename in sorted(shard_filenames):
    # Download a shard
    shard_path = f"{model_name}/{shard_filename}"
    print(shard_path)
    
    # Convert model shard
    converted_state_dict = tensor_parallel.convert_state_dict( # <- tensor_parallel helper function. 
        torch.load(shard_path),                   #    Creates a tensor_parallel checkpoint form a normal one
        model.tensor_parallel_config,
        world_size=4,
        for_pretrained=True,
    )    
    torch.save(converted_state_dict, "/tmp/shard.bin")
    del converted_state_dict
        
    # Dispatch the shard
    accelerate.load_checkpoint_in_model(
        model,
        checkpoint="/tmp/shard.bin",
        device_map=device_map,
    )

torch.cuda.empty_cache()
model = model.eval()
with torch.no_grad():
    batch = tokenizer(
        "DeepSpeed first included offloading capabilities with ZeRO-Offload, a system ",
        return_tensors="pt"
    )
    batch = {k: v.cuda(0) for k, v in batch.items()}
    print("Start")
    t0 = time.time()
    generated = model.generate(batch["input_ids"], attention_mask=batch["attention_mask"], max_length=200)
    t1 = time.time()
    print(f"Output generated in {(t1-t0):.2f} seconds")
    print(tokenizer.decode(generated[0]))

TypeError when multi-thread inference using tensor_parallel

When inference model with multiple thread, it will throw TypeError

  File "/data/miniconda3/envs/python39/lib/python3.9/concurrent/futures/_base.py", line 446, in result
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    return self.__get_result()
  File "/data/miniconda3/envs/python39/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/data/miniconda3/envs/python39/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
    result = self.fn(*self.args, **self.kwargs)
  File "/data/workspace/code/generate-demo/inference/starcoder_multithread.py", line 174, in gen_streamer
    outputs = model.generate(input_ids=input_ids, streamer=streamer, max_new_tokens=256, top_p=0.8, early_stopping=True, temperature=0.8, repetition_penalty=1.0)
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.greedy_search(
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/generation/utils.py", line 2248, in greedy_search
    outputs = self(
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/tensor_parallel/pretrained_model.py", line 82, in forward
    return self.wrapped_model(*args, **kwargs)
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/tensor_parallel/tensor_parallel.py", line 129, in forward
    return parallel_apply(self.module_shards, inputs, kwargs_tup, self.devices)[self.output_device_index]
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
start generate: 10
    raise exception
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/data/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
 ...........
TypeError: '>' not supported between instances of 'NoneType' and 'int'

Error in README.Md, hence not able to load model with limited memory.

Hello,

I'm trying to load MPT-7B, where I have a limited amount of memory and I'm trying to follow the instructions mentioned in the README, but it is throwing an error because it is trying to load the state_dict instead of the in tp.tensor_parallel nn.Model. Please provide an example that works, or suggest something that I need to change for it to work.

# Load partial state_dict for MyModel
state_dict = torch.load("my_model_part_1_of_5.bin")

# Convert it into a tensor_parallel state_dict
tensor_parallel_state_dict = tp.tensor_parallel(
    state_dict,
    tensor_parallel_config=model.tensor_parallel_config,
    world_size=len(model.devices),
)

-- Thank You

Cloud Tensor_parallel add multiple accelerator inference support with torch.distributed?

Our accelarator is not GPU , it's a XPU, only support torch.distributed mode with pytorch-xpu。
Cloud Tensor_parallel add multiple accelerator inference support with torch.distributed? Thanks!

tensor_parallel\src\tensor_parallel\factory.py

def tensor_parallel(
    module: nn.Module,
    device_ids: Optional[Sequence[Union[torch.device, str]]] = None,
    tensor_parallel_config: Optional[Config] = None,
    distributed: Optional[bool] = None,
    sharded: Optional[bool] = None,
    sharded_param_names: Optional[Collection[str]] = None,
    **kwargs,
) -> nn.Module:
........................
  num_trainable_parameters = sum(p.numel() for p in module.parameters() if p.requires_grad)
    distributed = distributed if distributed is not None else torch.distributed.is_initialized()

    if distributed:
        if device_ids is None:
            device_ids = [torch.device("cuda" if torch.cuda.is_available() else "cpu")]
        assert len(device_ids) == 1, "if distributed=True, please specify a single (current) device"
        assert not sharded, "distributed + sharded mode is not implemented, please keep one"

        return make_distributed_shard(module, device=torch.device(device_ids[0]), **kwargs)
    else:
......

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Hi, recently I was running LLaMA-2 with tensor-parallel inference through generate method and I encounter this problem.

Here is the error msg:

[2023-08-11 23:43:09,855][FK.general_util.evaluator][INFO] - ***** Running evaluation test.test *****                                                                                                   │|    0   N/A  N/A     53265      C   ...avishankar1/tc/bin/python    16531MiB |
[2023-08-11 23:43:09,855][FK.general_util.evaluator][INFO] -   Num examples = 1569                                                                                                                      │|    0   N/A  N/A   4088900      C   ...avishankar1/tc/bin/python    32031MiB |
[2023-08-11 23:43:09,856][FK.general_util.evaluator][INFO] -   Batch size = 1                                                                                                                           │|    1   N/A  N/A     53265      C   ...avishankar1/tc/bin/python      331MiB |
Evaluating:   0%|          | 0/1569 [00:00<?, ?it/s]                                                                                                                                                    │|    1   N/A  N/A   1331281      C   .../envs/torch2.0/bin/python    14639MiB |
Error executing job with overrides: ['ddp_eval=False']                                                                                                                                                  │|    1   N/A  N/A   4099212      C   ...avishankar1/tc/bin/python     2087MiB |
Traceback (most recent call last):                                                                                                                                                                      │|    2   N/A  N/A   1331282      C   .../envs/torch2.0/bin/python    14667MiB |
  File "/export/home2/fangkai/merit-v2/trainer_base_fsdp_v4.py", line 464, in <module>                                                                                                                  │|    2   N/A  N/A   4099212      C   ...avishankar1/tc/bin/python     3915MiB |
    main()                                                                                                                                                                                              │|    3   N/A  N/A   1546448      C   ...da3/envs/torch/bin/python    11611MiB |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/main.py", line 90, in decorated_main                                                                            │|    3   N/A  N/A   4099212      C   ...avishankar1/tc/bin/python    24771MiB |
    _run_hydra(                                                                                                                                                                                         │|    4   N/A  N/A   1060990      C   ...da3/envs/torch/bin/python    18553MiB |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra                                                                    │|    5   N/A  N/A   2376820      C   ...da3/envs/torch/bin/python    42357MiB |
    _run_app(                                                                                                                                                                                           │|    5   N/A  N/A   4099212      C   ...avishankar1/tc/bin/python      349MiB |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/_internal/utils.py", line 452, in _run_app                                                                      │|    6   N/A  N/A   4099212      C   ...avishankar1/tc/bin/python    32031MiB |
    run_and_report(                                                                                                                                                                                     │|    7   N/A  N/A    310278      C   ...nvs/retrieval/bin/python3     1333MiB |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/_internal/utils.py", line 216, in run_and_report                                                                │+-----------------------------------------------------------------------------+
    raise ex                                                                                                                                                                                            │(base) fangkai@scsehg:~$ nvi8
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report                                                                │Fri Aug 11 23:03:01 2023
    return func()                                                                                                                                                                                       │+-----------------------------------------------------------------------------+
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/_internal/utils.py", line 453, in <lambda>                                                                      │| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
    lambda: hydra.run(                                                                                                                                                                                  │|-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run                                                                           │| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    _ = ret.return_value                                                                                                                                                                                │| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value                                                                       │|                               |                      |               MIG M. |
    raise self._return_value                                                                                                                                                                            │|===============================+======================+======================|
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job                                                                            │|   0  NVIDIA RTX A6000    On   | 00000000:01:00.0 Off |                  Off |
    ret.return_value = task_function(task_cfg)                                                                                                                                                          │| 37%   67C    P2   124W / 300W |  48572MiB / 49140MiB |     43%      Default |
  File "/export/home2/fangkai/merit-v2/trainer_base_fsdp_v4.py", line 436, in main                                                                                                                      │|                               |                      |                  N/A |
    result = evaluate(cfg, model, tokenizer, prefix=prefix, _split=split)                                                                                                                               │+-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/merit-v2/general_util/evaluator.py", line 227, in evaluate_fn                                                                                                             │|   1  NVIDIA RTX A6000    On   | 00000000:24:00.0 Off |                  Off |
    outputs, pred_res = eval_forward_fn(batch)                                                                                                                                                          │| 30%   38C    P8    22W / 300W |   2438MiB / 49140MiB |      0%      Default |
  File "/export/home2/fangkai/merit-v2/general_util/evaluator.py", line 470, in __call__                                                                                                                │|                               |                      |                  N/A |
    decoding_outputs = self.model.generate(**batch, generation_config=self.generation_config)                                                                                                           │+-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                                            │|   2  NVIDIA RTX A6000    On   | 00000000:41:00.0 Off |                  Off |
    return func(*args, **kwargs)                                                                                                                                                                        │| 30%   41C    P8    29W / 300W |   3931MiB / 49140MiB |      0%      Default |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/transformers/generation/utils.py", line 1588, in generate                                                             │|                               |                      |                  N/A |
    return self.sample(                                                                                                                                                                                 │+-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/transformers/generation/utils.py", line 2642, in sample                                                               │|   3  NVIDIA RTX A6000    On   | 00000000:61:00.0 Off |                  Off |
    outputs = self(                                                                                                                                                                                     │| 30%   38C    P2    69W / 300W |  36398MiB / 49140MiB |      0%      Default |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                 │|                               |                      |                  N/A |
    return forward_call(*args, **kwargs)                                                                                                                                                                │+-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward                                                    │|   4  NVIDIA RTX A6000    On   | 00000000:81:00.0 Off |                  Off |
    outputs = self.model(                                                                                                                                                                               │| 30%   33C    P2    73W / 300W |  18555MiB / 49140MiB |      0%      Default |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                 │|                               |                      |                  N/A |
    return forward_call(*args, **kwargs)                                                                                                                                                                │+-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward                                                    │|   5  NVIDIA RTX A6000    On   | 00000000:A1:00.0 Off |                  Off |
    layer_outputs = decoder_layer(                                                                                                                                                                      │| 30%   44C    P2    99W / 300W |  42718MiB / 49140MiB |      0%      Default |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                 │|                               |                      |                  N/A |
    return forward_call(*args, **kwargs)                                                                                                                                                                │+-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward                                                    │|   6  NVIDIA RTX A6000    On   | 00000000:C1:00.0 Off |                  Off |
    hidden_states, self_attn_weights, present_key_value = self.self_attn(                                                                                                                               │| 36%   66C    P2   188W / 300W |  32047MiB / 49140MiB |    100%      Default |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                 │|                               |                      |                  N/A |
    return forward_call(*args, **kwargs)                                                                                                                                                                │+-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/merit-v2/models/llama.py", line 75, in _forward                                                                                                                           │|   7  NVIDIA RTX A6000    On   | 00000000:E1:00.0 Off |                  Off |
    query_states = self.q_proj(hidden_states)                                                                                                                                                           │| 30%   38C    P2    73W / 300W |   1335MiB / 49140MiB |     13%      Default |
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                 │|                               |                      |                  N/A |
    return forward_call(*args, **kwargs)                                                                                                                                                                │+-------------------------------+----------------------+----------------------+
  File "/export/home2/fangkai/anaconda3/envs/torch2.0/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward                                                                     │
    return F.linear(input, self.weight, self.bias)                                                                                                                                                      │+-----------------------------------------------------------------------------+
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)                       │| Processes:                                                                  |
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1346893) of binary: /export/home2/fangkai/anaconda3/envs/torch2.0/bin/python                               │|  GPU   GI   CI        PID   Type   Process name

It seems that the error happened at query projection.

The following is my initialization wrap:

def load_model_from_pretrained_tp(pretrained_model_name_or_path: str, *args, **kwargs):
    tp_sharded = kwargs.pop("tp_sharded", None)
    enable_flash_attention = kwargs.pop("enable_flash_attention", False)
    flash_attention_vanilla_torch = kwargs.pop("flash_attention_vanilla_torch", False)
    flash_attention_var_len = kwargs.pop("flash_attention_var_len", False)

    model = LlamaForCausalLM.from_pretrained(pretrained_model_name_or_path, *args, **kwargs)

    if enable_flash_attention:
        logger.info("⚡⚡⚡ enable llama flash attention.")

        layers = model.model.layers
        for layer in layers:
            llama_fast_attention_wrap(layer.self_attn, vanilla_torch=flash_attention_vanilla_torch, var_len=flash_attention_var_len)

    import tensor_parallel as tp
    import torch.distributed as dist

    n_gpus = torch.cuda.device_count()
    if not dist.is_initialized():
        model = tp.tensor_parallel(model, [torch.device(f"cuda:{i}") for i in range(n_gpus)], sharded=tp_sharded)
    else:
        model = tp.tensor_parallel(model, sharded=False)[0]
    return model

I noticed that you do not calling batch["input_ids"].to(device) method. When I remove this code I found that it will raise another error message that the inputs are on cpu.

version information:

transformers==4.31.0
torch==2.0.0
tensor-parallel==2.0.0

Thanks for your help very much!

GPT-2 broken starting in v1.2.5

Thanks for the cool package. As of version 1.2.5, I can't do a forward pass on GPT-2. Simple repro script:

import transformers
import tensor_parallel as tp

model = transformers.AutoModelForCausalLM.from_pretrained('gpt2-xl', cache_dir='/scr-ssd/em7')
tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2-xl', cache_dir='/scr-ssd/em7')
model = tp.tensor_parallel(model)

test_input = 'I enjoy walking with my cute dog'
tokens = tokenizer(test_input, return_tensors='pt').to('cuda:0')
tokens['labels'] = tokens['input_ids'].clone()
outputs = model(**tokens)

The output on 1.2.5 is:

The following patterns in state_rules were unused: ["re.compile('.*lm_head\\\\.weight$')", "re.compile('.*q_attn\\\\.weight$')", "re.compile('.*q_attn\\\\.bias$')"]
The following patterns in state_rules were unused: ["re.compile('.*lm_head\\\\.weight$')", "re.compile('.*q_attn\\\\.weight$')", "re.compile('.*q_attn\\\\.bias$')"]
Using ZeRO-3 sharding for 464000 non tensor-parallel parameters
Traceback (most recent call last):
  File "tp_test.py", line 16, in <module>
    outputs = model(**tokens)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/tensor_parallel/pretrained_model.py", line 78, in forward
    return self.wrapped_model(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/tensor_parallel/sharding.py", line 95, in forward
    return self.module(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/tensor_parallel/tensor_parallel.py", line 130, in forward
    return parallel_apply(self.module_shards, inputs, kwargs_tup, self.devices)[self.output_device_index]
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1076, in forward
    transformer_outputs = self.transformer(
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 900, in forward
    outputs = block(
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 391, in forward
    attn_outputs = self.attn(
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/tensor_parallel/wrapper.py", line 71, in forward
    output = self.tp_wrapped_module(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 313, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/iris/u/em7/code/direct-preference-optimization/env/lib/python3.8/site-packages/transformers/pytorch_utils.py", line 106, in forward
    x = x.view(size_out)
RuntimeError: shape '[1, 7, 2496]' is invalid for input of size 16800

On version 1.2.4, I get the output:

The following patterns in state_rules were unused: ["re.compile('.*lm_head\\\\.weight$')", "re.compile('.*q_attn\\\\.weight$')", "re.compile('.*q_attn\\\\.bias$')"]
The following patterns in state_rules were unused: ["re.compile('.*lm_head\\\\.weight$')", "re.compile('.*q_attn\\\\.weight$')", "re.compile('.*q_attn\\\\.bias$')"]
Using ZeRO-3 sharding for 464000 non tensor-parallel parameters                                                                                
tensor(4.7082, device='cuda:0', grad_fn=<NllLossBackward0>)

Peculiar Adam8bit performance

When applying adamw_bnb_8bit optimizer on T5 on 4x 1080Ti, i've noticed that gpu 0 wouldhave 100% utilization (and low power) about half of the time -- while other GPUs would do nothing. When switching the optimizer back to, say, adafactor, all GPUs would be utilized equally.

I hypothecize that Adam8bit uses a custom gpu kernel hard-coded such that all calculations are done on param[0]'s GPU.
One way around that would be to create one Adam8bit per GPU -- or maybe, if we're lucky, we can just use different param_groups.

Crucially, optimizer still works correctly - it's just slower.

set distributed=True, return AttributeError: 'NoneType' object

For example:

tp.tensor_parallel(module=model, device_ids=["cuda:0", "cuda:1"], distributed=True)

AttributeError: 'NoneType' object has no attribute 'make_distributed_shard'

When I try to do the tensor_parallel on NLLB from meta, there is an error:

README

Based on informal brainstorming with @IaroslavLisniak @BlackSamorez

header

TL:DR; parallelize anything in one line of code, training AND inference
badges for docs / support / etc
teaser code
—— pip install tensor_parallel
—— some HF model
—— import tensor_parallel as tp
—— tp.tensor_parallel(hf_model) — looks better than TensorParallelHuggingfaceModel(model) because it shows we're not just about transformers
—— must be very short, 10 lines is the absolute most it can do

what can it do? (examples)

rename ./notebooks -> ./examples
wrap any (custom) model
- example: stable_diffusion or coatnet
./examples/simple_opt13b.ipynb
- OPT-13B inference
- fwd/bwd with prompts
- compare performance
./examples/advanced_inference_flan-t5.ipynb
- larger model: Flan-T5-xxl / GPT-NeoX-20B / UL2 (20B)
- convert to 8-bit
- need to load a few layers at a time, otherwise it won't fit!
- talk to the model, show cool examples of in-context learning / question answering
./examples/finetune_t5.ipynb
- fine-tune a T5 with prefix-tuning or LoRA
./examples/custom_config.ipynb
- make it VERY clear that you can still wrap any model, configs are only needed for optimization
- write a config for some model step by step, example: bloom or gpt-2
- at the end, write a code that tests that the converted model is equivalent to the base model in forward and inference, see ./tests
./examples/train_with_torchrun.py
- finetune t5, but using distributed backend instead of threading
- explain how to run this in ./examples/readme.md

Support for PEFT LoRA and 4-bit quantization

Hi, would you plan to update this work to support PEFT LoRA and 4-bit quantization?

refactoring concerns

consider splitting slicer_wrapper into two files: config.py and actions.py
- actions.py contains types, apply_action, create_collective_ops, process_state_, process_attrs_, process_input, process_output, _TensorParallelWrapper and all private functions
- config.py contains the config file - and maybe extracts some get_default_config functions into private sub-functions to make that method shorter
consider renaming notebooks -> examples

GPU Contention

Do you in any way measure gpu contention, which also takes some time to join all the tensors which are distributed along the GPUs? Maybe for some very large models the case would be that it is better to use just one GPU as GPU contention in case of tensor parallelism would be very slow

Possibility to run on different GPUs

Is it possible to run this on multiple different models of GPUs?

Given the goal of minimizing GPU costs as much as possible, I can only choose between a GPU with large VRAM and low performance or a GPU with high performance and low VRAM. Would it be possible to purchase both types of GPUs, one with large VRAM and low performance and the other with high performance and low VRAM, and perform parallel inference using them? Could this potentially help improve inference speed to some extent while still obtaining the large VRAM required to run large models?

Would it suitable for the multi-GPU parallel inference for llama2?

Hi, I've built a chatbot using Llama2 on a machine equipped with four GPUs, each with 16GB of memory. However, it appears that only 'cuda:0' is currently being utilized. Consequently, we are experiencing high latency, approximately 60 seconds per question. I'm wondering if Tensor Parallel can help us leverage the other CUDA devices. I've attempted the following:

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", local_files_only=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", local_files_only=True,\
low_cpu_mem_usage=True, \
torch_dtype=torch.float16,\
load_in_4bit=True)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])

Please let me know if you have any suggestions or advice. Thanks in advance!

Feature request: revert tensor_parallel model back into normal state_dict

It would be nice to export a tensor-parallel model in such a way that its state can be loaded with a different number of ranks. Something like

model = tp.tensor_parallel(model)
restored_model = tp.undo_tensor_parallel(model, output_device='cpu')
torch.save(restored_model.state_dict())

Can I parallelize just one large layer?

Can I parallelize just one large layer across gpus and keep all other layers the same and work with distributed data parallel?

For example, a regular resnet18 model, which is small enough but the number of classes is 1 million, how do I parallelize just the classification layer?

Maybe use 'model_type' insted of 'architectures' for pretrained model architecture deduction

    quick sanity check: are we sure this should apply to T5ForConditionalGeneration and not, say, T5Model?

Originally posted by @justheuristic in #14 (comment)

Not work with 4bit quant

with de demo in readme 👍 🚀 Try new 20B LLMs demo in Kaggle

switch to using 4bit:

`with accelerate.init_empty_weights():
model = transformers.AutoModelForCausalLM.from_config(transformers.AutoConfig.from_pretrained(".../hf-LLaMA/13B")).half()

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
   # bnb_4bit_compute_dtype=torch.bfloat16,
  #  bnb_4bit_use_double_quant=True,
   # bnb_4bit_quant_type='nf4',
)

model = tp.TensorParallelPreTrainedModel(
    model,
    device_ids=["cuda:0", "cuda:1", "cuda:2"], 
)

model = replace_with_bnb_linear(model, ["lm_head"], None, nf4_config)
model.is_loaded_in_8bit = 1
model.is_loaded_in_4bit = True`

for inference:
it not works without '.cuda()'
RuntimeError: All tensors must be on devices[0]: 0

but works with
inputs = tokenizer("cat:", return_tensors="pt")["input_ids"].to("cuda:0")

for training with peft lora
it does not work.
with to("cuda:0")
File "/home/user/miniconda3/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias) RuntimeError: mat1 and mat2 shapes cannot be multiplied (10010x5120 and 1x4587520)

also not work with cuda():
RuntimeError: All tensors must be on devices[0]: 0

Add more predefined configs

We could use more predefined configs for better user experience.
Models to consider:

Support LLaMA Models, including HuggingFace-adapted variants

Facebook has released a new model known as LLaMA. It would be awesome to use tensor_parallel with it. Currently getting the following error when trying to use decapoda-research/llama-30b-hf:

RuntimeError: The size of tensor a (1110) must match the size of tensor b (6656) at non-singleton dimension 2

Issues if GPU > 2

Hey - nice library!

I've noticed anything with GPU > 2 begins to run into issues. For example, a job that requires 30GB on one, might require 15GB on two. If you use 8, each GPU has 15GB. The benefit of the parallelism seems to stop beyond two. Any ideas?

Thanks!

Error loading LLAMA model config

https://github.com/BlackSamorez/tensor_parallel/blob/31329adad4cc2d5674e353f26d1f6d574276bf12/src/tensor_parallel/slicing_configs.py#L401C23-L401C23

Error loading LLAMA model config
File "/root/anaconda3/envs/tt/lib/python3.10/site-packages/tensor_parallel/slicing_configs.py", line 403, in get_llama_config config.attr_rules[r".*self_attn$"]["num_key_value_heads"] = partial(split_num_heads, world_size=world_size) KeyError: '.*self_attn$'

2x slowdown using TP

Thanks for this very interesting lib @BlackSamorez! I just tried running your kaggle nb locally, using 2 x A6000s. I used the 'meta-llama/Llama-2-7b-hf' model. The call to model.generate with max_length=200 takes 12 seconds when using tensor_parallel with the 2 GPUs.

However, if I remove tensor_parallel and instead just use a single GPU, generation is over twice as fast, taking 5 seconds.

Is this slowdown expected, or am I doing something wrong?

New ideas

Also, maybe you should check out this site for some new paradigms
https://colossalai.org/docs/concepts/paradigms_of_parallelism/

Throw errors when trying to wrap models that are not supposed to be wrapped

Some models already rely on devices interactions. I propose we don't wrap them and throw error.
Possible examples:

Wrapping a model that is already wrapped
Wrapping an 'accelerate' model (_hf_hooks move tensors from device to device during forward pass, etc.)
Wrapping a data parallel model

tensor_parallel method distributed=True

Hey, really liking this library !

I'm wanting to benchmark the difference between running TP normally (as in the demo notebook) and adding the distributed=True flag to the tp.tensor_parallel method; which I think will use torch.distributed rather than the NCCL backend for process communication between devices, please correct me if I understood this wrong.

I can see that in the Docstring that torchrun is required, ie starting the script with 'torchrun --nproc_per_node=4'.
I have done this and also played around with init process group within the code in combination with torchrun.

To be honest, I'm struggling to understand how to implement this with distributed=True, especially since with this flag, the device_ids parameter requires only one GPU to be passed.
I'm working with 4x 3090 GPUs that have NVLink and the Falcon-40B-Instruct model.

Any guidance on how to set this up would be much appreciated.
Let me know if you need any additional info from me.

Thanks and keep going !

How to load lora weights？

After fine-tuning llama with lora, how to load through multi-gpu?
Are there any examples?

distributed TP model forward output's requires_grad is False

Hi, thanks for the nice work!

I've been trying to optimize the performance of the TP wrapper here, and the first thing that came to mind was balancing out the compute on each rank using distributed / multiprocessing (as opposed to threading).

I've been wrapping my model like the following

tp.tensor_parallel(model, distributed=True, device_ids=device_ids)

But it seems the final model output from the forward pass doesn't require grad anymore, which makes it impossible to call loss.backward().

Code below to reproduce

# torchrun --nproc-per-node 2 --master_port=1234 training/tp/standalone.py
import os

import tensor_parallel as tp
import torch
import transformers
from torch import distributed as dist


def get_local_rank():
    return int(os.getenv("LOCAL_RANK", -1))


def get_world_size():
    return int(os.getenv("WORLD_SIZE", 1))


dist.init_process_group(backend="nccl", rank=get_local_rank(), world_size=get_world_size())
pg = dist.distributed_c10d._get_default_group()
torch.cuda.set_device(get_local_rank())

current_device = torch.device(torch.cuda.current_device())
device_ids = [current_device]
print(device_ids)

model = transformers.LlamaForCausalLM.from_pretrained(
    <some_huggingface_llama_checkpoint>, torch_dtype=torch.bfloat16
)
model, _ = tp.tensor_parallel(model, distributed=True, device_ids=device_ids)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    <some_huggingface_llama_checkpoint>, use_fast=True
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tensors = tokenizer(["how are you?", "I love you and you too."], return_tensors="pt", padding=True)
tensors = {
    "input_ids": tensors["input_ids"],
    "attention_mask": tensors["attention_mask"],
    "labels": tensors["input_ids"],
}

with torch.enable_grad():
    model.train()
    tensors = {k: v.to(current_device) for k, v in tensors.items()}
    outputs = model(**tensors, return_dict=True, output_hidden_states=True)
    print(outputs.logits)
    print(outputs.logits.requires_grad) # False
    print(outputs.keys())

    hs = outputs.hidden_states
    s0 = hs[0]
    print(s0)
    print(s0.requires_grad) # False

Finally, pasting some of my system configs here

pip dump

tensor-parallel          2.0.0
termcolor                2.3.0
tiktoken                 0.4.0
tokenizers               0.13.3
tomli                    2.0.1
tomlkit                  0.11.8
toolz                    0.12.0
torch                    2.0.1
torchvision              0.15.2
tqdm                     4.65.0
transformers             4.30.0.dev0

hardware: 2 A100 gpus nvlink interconnect

I am trying to load a model on two gpus but only the first is being allocated. (both are visible)