A basic UI for running GPT-Neo 2.7B on low VRAM (3 GB VRAM minimum).
Expected speed on PCIe 3.0 with 3 GB VRAM: 0.8 s/token, or 20 s for 25 tokens.
Expected speed on PCIe 3.0 with 8 GB VRAM: 0.4 s/token, or 10 s for 25 tokens.
(With a 2000-token input.)
License: Apache License 2.0
When I try to test this low-VRAM method locally, I only get garbage output. I use the standard "unicorns" copypasta as input, but the output looks like this: The U. Toe I think it Toe there areح To re the whole. </p)
Sometimes it's even worse than this.
Here's how I tested:
First, I create the gptneo.pkl file. In one virtualenv, install this:
pip install torch==1.9.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers
Then run this script:
from transformers import GPTNeoForCausalLM
import pickle

# Load the full model, convert it to fp16, and pickle it for fast low-RAM loading later
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B").half()
with open("gptneo.pkl", "wb") as f:
    pickle.dump(model, f)
This script generates gptneo.pkl (5.1G).
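As an optional sanity check (still in the CPU virtualenv), the pickle can be reloaded to confirm it really holds the fp16 weights:
# Optional check: reload the pickle on the CPU and inspect dtype / parameter count
import pickle
with open("gptneo.pkl", "rb") as f:
    m = pickle.load(f)
print(next(m.parameters()).dtype)                    # expected: torch.float16
print(sum(p.numel() for p in m.parameters()) / 1e9)  # expected: ~2.7 (billion parameters)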
Then, in a second virtualenv, install the CUDA build:
pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers
Copy/move the gptneo.pkl file to this new virtualenv.
Run the following script:
import os
from transformers import GPTNeoForCausalLM, AutoTokenizer
import torch
import copy
import gc
import pickle
import torch.cuda.comm
import time
import logging

# logger is referenced in the gradient-checkpointing branch of the patched forward below
logger = logging.getLogger(__name__)

# Pickle file for low ram loading
if True:
    print("Setting up model, this will take a few minutes")
    with open('gptneo.pkl', 'rb') as f:
        model = pickle.load(f)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
breakmodel = True  # enable the CPU/GPU layer-swapping path in the patched forward
ram_blocks = 22    # number of transformer blocks (out of 32) whose weights stay in CPU RAM
from transformers import GPTNeoForCausalLM,GPTNeoModel
from transformers.modeling_outputs import BaseModelOutputWithPast
from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoAttentionMixin
#Define a new forward pass
def new_forward(
    self,
    input_ids=None,
    past_key_values=None,
    attention_mask=None,
    token_type_ids=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):
    global breakmodel
    if breakmodel:
        global ram_blocks
        # On the first call: move the embeddings and final layer norm to the GPU,
        # keep pinned CPU copies of the first ram_blocks transformer blocks, and
        # point those blocks' GPU parameters at a dummy tensor until they are needed.
        if not hasattr(self, 'extrastorage'):
            import copy
            setattr(self, "extrastorage", {})
            self.wte.to("cuda")
            self.wpe.to("cuda")
            self.ln_f.to("cuda")
            torch.cuda.empty_cache()

            for i in range(ram_blocks):
                self.h[i].to("cpu")
                self.extrastorage[i] = copy.deepcopy(self.h[i])
                smalltensor = torch.tensor(0).to("cuda")
                for param1 in self.h[i].parameters():
                    param1.data = smalltensor
                self.h[i].to("cuda")

            for i in range(ram_blocks, len(self.h)):
                self.h[i].to("cuda")

            for param in self.wte.parameters():
                param.requires_grad = False
                param.data = param.data.detach()
                gc.collect()
                torch.cuda.empty_cache()

            for param in self.wpe.parameters():
                param.requires_grad = False
                param.data = param.data.detach()
                gc.collect()
                torch.cuda.empty_cache()

            for i in range(len(self.h)):
                for param in self.h[i].parameters():
                    param.requires_grad = False
                    param.data = param.data.detach()
                    gc.collect()
                    torch.cuda.empty_cache()

            for param in self.ln_f.parameters():
                param.requires_grad = False

            # Pin the CPU copies so they can be copied to the GPU asynchronously
            for i in range(ram_blocks):
                for param in self.extrastorage[i].parameters():
                    param.requires_grad = False
                    param.data = param.data.detach().pin_memory()
                gc.collect()
                torch.cuda.empty_cache()

            for param1, param2 in zip(self.h[0].parameters(), self.extrastorage[0].parameters()):
                param1.data = param2.data.to("cuda", non_blocking=False).detach()
            for param1, param2 in zip(self.h[ram_blocks - 1].parameters(), self.extrastorage[ram_blocks - 1].parameters()):
                param1.data = param2.data.to("cuda", non_blocking=False).detach()

    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    elif input_ids is not None:
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])
        batch_size = input_ids.shape[0]
    elif inputs_embeds is not None:
        input_shape = inputs_embeds.size()[:-1]
        batch_size = inputs_embeds.shape[0]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

    device = input_ids.device if input_ids is not None else inputs_embeds.device

    if token_type_ids is not None:
        token_type_ids = token_type_ids.view(-1, input_shape[-1])
    if position_ids is not None:
        position_ids = position_ids.view(-1, input_shape[-1])

    if past_key_values is None:
        past_length = 0
        past_key_values = tuple([None] * len(self.h))
    else:
        past_length = past_key_values[0][0].size(-2)

    device = input_ids.device if input_ids is not None else inputs_embeds.device
    if position_ids is None:
        position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])

    # Attention mask.
    if attention_mask is not None:
        assert batch_size > 0, "batch_size has to be defined and > 0"
        global_attention_mask = attention_mask.view(batch_size, -1)
        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
        # this attention mask is more simple than the triangular masking of causal attention
        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
        global_attention_mask = global_attention_mask[:, None, None, :]

        # Since global_attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        global_attention_mask = global_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
        global_attention_mask = (1.0 - global_attention_mask) * -10000.0
    else:
        global_attention_mask = None

    # Local causal attention mask
    batch_size, seq_length = input_shape
    full_seq_length = seq_length + past_length
    local_attention_mask = GPTNeoAttentionMixin.create_local_attention_mask(
        batch_size, full_seq_length, self.config.window_size, device, attention_mask
    )

    # Prepare head mask if needed
    # 1.0 in head_mask indicate we keep the head
    # attention_probs has shape bsz x num_heads x N x N
    # head_mask has shape n_layer x batch x num_heads x N x N
    head_mask = self.get_head_mask(head_mask, self.config.num_layers)

    if inputs_embeds is None:
        inputs_embeds = self.wte(input_ids)
    position_embeds = self.wpe(position_ids)
    hidden_states = inputs_embeds + position_embeds

    if token_type_ids is not None:
        token_type_embeds = self.wte(token_type_ids)
        hidden_states = hidden_states + token_type_embeds

    hidden_states = self.drop(hidden_states)

    output_shape = input_shape + (hidden_states.size(-1),)

    presents = () if use_cache else None
    all_self_attentions = () if output_attentions else None
    all_hidden_states = () if output_hidden_states else None

    if breakmodel:
        # Side stream used to copy the next block's weights to the GPU while the current block runs
        copystream = torch.cuda.Stream(device=0, priority=-1)

    for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
        if breakmodel:
            if i in range(ram_blocks):
                # Reuse the GPU buffers of the already-executed block (i-1) as the destination
                # for block (i+1)'s weights, copied in from pinned CPU memory on the side stream
                index1 = (i + 1) % ram_blocks
                for param1, param2 in zip(self.h[index1].parameters(), self.h[(i - 1) % ram_blocks].parameters()):
                    param1.data = param2.data
                for param1, param2 in zip(self.h[index1].parameters(), self.extrastorage[index1].parameters()):
                    with torch.cuda.stream(copystream):
                        torch.cuda.comm.broadcast(param2.data, out=[param1.data])

        attn_type = self.config.attention_layers[i]
        attn_mask = global_attention_mask if attn_type == "global" else local_attention_mask

        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        if getattr(self.config, "gradient_checkpointing", False) and self.training:
            if use_cache:
                logger.warning(
                    "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting "
                    "`use_cache=False`..."
                )
                use_cache = False

            def create_custom_forward(module):
                def custom_forward(*inputs):
                    # None for past_key_value
                    return module(*inputs, use_cache, output_attentions)

                return custom_forward

            outputs = torch.utils.checkpoint.checkpoint(
                create_custom_forward(block),
                hidden_states,
                None,
                attn_mask,
                head_mask[i],
            )
        else:
            outputs = block(
                hidden_states,
                layer_past=layer_past,
                attention_mask=attn_mask,
                head_mask=head_mask[i],
                use_cache=use_cache,
                output_attentions=output_attentions,
            )

        hidden_states = outputs[0]
        if use_cache is True:
            presents = presents + (outputs[1],)

        if output_attentions:
            all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)

        if breakmodel:
            if i in range(ram_blocks):
                # Make sure the prefetch of the next block has finished before moving on
                torch.cuda.synchronize()

    if breakmodel:
        del copystream
        torch.cuda.empty_cache()

    hidden_states = self.ln_f(hidden_states)
    hidden_states = hidden_states.view(*output_shape)
    # Add last hidden state
    if output_hidden_states:
        all_hidden_states = all_hidden_states + (hidden_states,)

    if not return_dict:
        return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)

    return BaseModelOutputWithPast(
        last_hidden_state=hidden_states,
        past_key_values=presents,
        hidden_states=all_hidden_states,
        attentions=all_self_attentions,
    )
if breakmodel:
    model.eval().half().to("cpu")
    model.lm_head.to("cuda")
    model.transformer.wte.to("cuda")
    model.transformer.wpe.to("cuda")
    model.transformer.ln_f.to("cuda")
    gc.collect()

print(GPTNeoModel.forward)
print(new_forward)
# Monkey-patch the stock GPT-Neo forward with the layer-swapping version defined above
GPTNeoModel.forward = new_forward
print(GPTNeoModel.forward)
#@title Sampling settings (DO NOT SKIP)
#@markdown You can modify sampling settings here. Don't forget to run the cell again after changing. The number of generated tokens is subtracted from the context window size, don't set it high.
tail_free_sampling = 0.95 #@param {type:"number"}
top_k = 80 #@param {type:"number"}
top_p = 0.8 #@param {type:"number"}
temperature = 0.7#@param {type:"number"}
number_generated_tokens = 25#@param {type:"integer"}
repetition_penalty = 1.1 #@param {type:"number"}
repetition_penalty_range = 512 #@param {type:"number"}
repetition_penalty_slope = 3.33 #@param {type:"number"}
#@markdown If tail free sampling is enabled, top_p and top_k should probably not be used.
enable_tfs = True #@param {type:"boolean"}
enable_top_k = False #@param {type:"boolean"}
enable_top_p = False #@param {type:"boolean"}
if not enable_tfs:
    tail_free_sampling = None
if not enable_top_k:
    top_k = None
if not enable_top_p:
    top_p = None
#@markdown Temperatures seem to give results different from those in AID, so play around with it. Even 0.5 can give good results.
basic_prompt = "test " * 10
inputs = tokenizer(basic_prompt, return_tensors="pt",truncation=True,max_length=2000).to("cuda")
# The first forward call triggers the one-time setup inside new_forward (moving layers, pinning memory)
outputs = model(**inputs)
start_time = time.time()
with torch.no_grad():
    for i in range(1):
        outputs = model(**inputs)
print(time.time() - start_time)
del inputs,outputs
torch.cuda.empty_cache()
def more_text(inputtext):
    #return "Epictest"
    with torch.no_grad():
        with torch.cuda.amp.autocast(enabled=True):
            #start_time = time.time()
            context = 2000
            overhead = 50
            currpoint = len(inputtext)
            inputs = tokenizer(inputtext[-currpoint:], return_tensors="pt", truncation=True, max_length=context + overhead)
            if inputs.input_ids[0].size()[0] == context + overhead:
                low = 0
                high = len(inputtext)
                currpoint = 0
                # BINARY SEARCH FOR A POINT WHERE TOKENIZER RETURNS BETWEEN CONTEXT AND CONTEXT + OVERHEAD TOKENS
                while low <= high:
                    currpoint = (high + low) // 2
                    # If x is greater, ignore left half
                    inputs = tokenizer(inputtext[-currpoint:], return_tensors="pt", truncation=True, max_length=context + overhead)
                    if inputs.input_ids[0].size()[0] < context:
                        low = currpoint + 1
                    # If x is smaller, ignore right half
                    elif inputs.input_ids[0].size()[0] == context + overhead:
                        high = currpoint - 1
                    # means x is present at mid
                    else:
                        break
                ids = tokenizer(inputtext[-currpoint:], return_tensors="pt", truncation=True, max_length=context + overhead).input_ids
            else:
                ids = tokenizer(inputtext[-currpoint:], return_tensors="pt", truncation=True, max_length=context + overhead, padding='max_length').input_ids
            ids = ids[:, -context:]
            n_ids = ids.shape[1]
            if n_ids < 1:
                n_ids = 1
                ids = torch.tensor([[tokenizer.eos_token_id]])
            max_length = n_ids + number_generated_tokens
            gc.collect()
            basic_output = model.generate(
                ids.long().to("cuda"),
                do_sample=True,
                num_beams=1,
                min_length=max_length,
                max_length=max_length,
                temperature=temperature,
                top_k=top_k,
                top_p=top_p,
                repetition_penalty=repetition_penalty,
                repetition_penalty_range=repetition_penalty_range,
                repetition_penalty_slope=repetition_penalty_slope,
                use_cache=True,
                pad_token_id=tokenizer.eos_token_id,
                num_return_sequences=1
            ).long()
            gc.collect()
            torch.cuda.empty_cache()
            return tokenizer.decode(basic_output[0][-number_generated_tokens:])
#print(time.time() - start_time)
#print(number_generated_tokens)
initial_text = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
new_text = more_text(initial_text)
print(new_text)
print("DONE")
To create this script I basically just copypasted all the relevant parts from here:
https://github.com/arrmansa/Basic-UI-for-GPT-Neo-with-low-vram/blob/main/Basic%20UI.ipynb
Also, I changed two lines:
ram_blocks = 22
basic_prompt = "test " * 10
And then added the test text.
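For reference, my rough reasoning for ram_blocks = 22 (a back-of-the-envelope estimate of my own, not something from the repo; GPT-Neo 2.7B has 32 blocks with hidden size 2560):
# Rough fp16 VRAM estimate for the blocks kept on the GPU (back-of-the-envelope only)
hidden = 2560                                         # GPT-Neo 2.7B hidden size
n_blocks = 32                                         # total transformer blocks
ram_blocks = 22                                       # blocks kept in CPU RAM
block_mb = 12 * hidden ** 2 * 2 / 1024 ** 2           # ~150 MB per block in fp16
embed_mb = 50257 * hidden * 2 / 1024 ** 2             # wte (~245 MB; lm_head is tied to it)
print((n_blocks - ram_blocks) * block_mb + embed_mb)  # ~1750 MB of weights on the GPU, plus activations/KV cache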
When I run this script I get nonsensical output.
I've also tried removing .half() when generating the .pkl file. I get a file that is twice as big and takes much longer to load, but the result is the same. I've also fiddled with the tail_free_sampling, top_k, top_p and enable_* params, to no avail.
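One further check I could still do (my own idea, not something from the repo): run the stock code path with breakmodel disabled, on the CPU in fp32, and see whether the same prompt still produces garbage. That would tell whether the problem is in the layer-swapping forward or somewhere else:
# Hypothetical sanity check: disable the layer-swapping path and generate on the CPU.
# breakmodel must be False before the *first* forward call in a fresh process,
# otherwise the weights have already been shuffled between devices.
breakmodel = False
model.float().to("cpu")
ids = tokenizer(initial_text, return_tensors="pt").input_ids
out = model.generate(ids, do_sample=True, max_length=ids.shape[1] + 25,
                     temperature=0.7, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][ids.shape[1]:]))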