marella / ctransformers Goto Github PK

View Code? Open in Web Editor NEW

1.7K 20.0 136.0 63.93 MB

Python bindings for the Transformer models implemented in C/C++ using GGML library.

License: MIT License

Python 2.66% CMake 0.38% C++ 35.71% Shell 0.08% Cuda 11.67% C 43.38% Objective-C 2.73% Metal 3.39%

ai llm transformers ctransformers

ctransformers's Introduction

CTransformers

Python bindings for the Transformer models implemented in C/C++ using GGML library.

Also see ChatDocs

Supported Models
Installation
Usage
- 🤗 Transformers
- LangChain
- GPU
- GPTQ
Documentation
License

Supported Models

Models	Model Type	CUDA	Metal
GPT-2	`gpt2`
GPT-J, GPT4All-J	`gptj`
GPT-NeoX, StableLM	`gpt_neox`
Falcon	`falcon`	✅
LLaMA, LLaMA 2	`llama`	✅	✅
MPT	`mpt`	✅
StarCoder, StarChat	`gpt_bigcode`	✅
Dolly V2	`dolly-v2`
Replit	`replit`

Installation

pip install ctransformers

Usage

It provides a unified interface for all models:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))

Run in Google Colab

To stream the output, set stream=True:

for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)

You can load models from Hugging Face Hub directly:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use it with 🤗 Transformers, create model and tokenizer using:

from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)

Run in Google Colab

You can use 🤗 Transformers text generation pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))

You can use 🤗 Transformers generation parameters:

pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)

You can use 🤗 Transformers tokenizers:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.

LangChain

It is integrated into LangChain. See LangChain docs.

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)

Run in Google Colab

CUDA

Install CUDA libraries using:

pip install ctransformers[cuda]

ROCm

To enable ROCm support, install the ctransformers package using:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

Metal

To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

GPTQ

Note: This is an experimental feature and only LLaMA models are supported using ExLlama.

Install additional dependencies using:

pip install ctransformers[gptq]

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

Run in Google Colab

If model name or path doesn't contain the word gptq then specify model_type="gptq".

It can also be used with LangChain. Low-level APIs are not fully supported.

Documentation

Config

Parameter	Type	Description	Default
`top_k`	`int`	The top-k value to use for sampling.	`40`
`top_p`	`float`	The top-p value to use for sampling.	`0.95`
`temperature`	`float`	The temperature to use for sampling.	`0.8`
`repetition_penalty`	`float`	The repetition penalty to use for sampling.	`1.1`
`last_n_tokens`	`int`	The number of last tokens to use for repetition penalty.	`64`
`seed`	`int`	The seed value to use for sampling tokens.	`-1`
`max_new_tokens`	`int`	The maximum number of new tokens to generate.	`256`
`stop`	`List[str]`	A list of sequences to stop generation when encountered.	`None`
`stream`	`bool`	Whether to stream the generated text.	`False`
`reset`	`bool`	Whether to reset the model state before generating text.	`True`
`batch_size`	`int`	The batch size to use for evaluating tokens in a single prompt.	`8`
`threads`	`int`	The number of threads to use for evaluating tokens.	`-1`
`context_length`	`int`	The maximum context length to use.	`-1`
`gpu_layers`	`int`	The number of layers to run on GPU.	`0`

Note: Currently only LLaMA, MPT and Falcon models support the context_length parameter.

`class` `AutoModelForCausalLM`

`classmethod` `AutoModelForCausalLM.from_pretrained`

from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
    **kwargs
) → LLM

Loads the language model from a local file or remote repo.

Args:

model_path_or_repo_id: The path to a model file or directory or the name of a Hugging Face Hub model repo.
model_type: The model type.
model_file: The name of the model file in repo or directory.
config: AutoConfig object.
lib: The path to a shared library or one of avx2, avx, basic.
local_files_only: Whether or not to only look at local files (i.e., do not try to download the model).
revision: The specific model version to use. It can be a branch name, a tag name, or a commit id.
hf: Whether to create a Hugging Face Transformers model.

Returns: LLM object.

`class` `LLM`

`method` `LLM.init`

__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)

Loads the language model from a local file.

Args:

model_path: The path to a model file.
model_type: The model type.
config: Config object.
lib: The path to a shared library or one of avx2, avx, basic.

`property` LLM.bos_token_id

The beginning-of-sequence token.

`property` LLM.config

The config object.

`property` LLM.context_length

The context length of model.

`property` LLM.embeddings

The input embeddings.

`property` LLM.eos_token_id

The end-of-sequence token.

`property` LLM.logits

The unnormalized log probabilities.

`property` LLM.model_path

The path to the model file.

`property` LLM.model_type

The model type.

`property` LLM.pad_token_id

The padding token.

`property` LLM.vocab_size

The number of tokens in vocabulary.

`method` `LLM.detokenize`

detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]

Converts a list of tokens to text.

Args:

tokens: The list of tokens.
decode: Whether to decode the text as UTF-8 string.

Returns: The combined text of all tokens.

`method` `LLM.embed`

embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]

Computes embeddings for a text or list of tokens.

Note: Currently only LLaMA and Falcon models support embeddings.

Args:

input: The input text or list of tokens to get embeddings for.
batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1

Returns: The input embeddings.

`method` `LLM.eval`

eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None

Evaluates a list of tokens.

Args:

tokens: The list of tokens to evaluate.
batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1

`method` `LLM.generate`

generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]

Generates new tokens from a list of tokens.

Args:

tokens: The list of tokens to generate tokens from.
top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1
batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1
reset: Whether to reset the model state before generating text. Default: True

Returns: The generated tokens.

`method` `LLM.is_eos_token`

is_eos_token(token: int) → bool

Checks if a token is an end-of-sequence token.

Args:

token: The token to check.

Returns: True if the token is an end-of-sequence token else False.

`method` `LLM.prepare_inputs_for_generation`

prepare_inputs_for_generation(
    tokens: Sequence[int],
    reset: Optional[bool] = None
) → Sequence[int]

Removes input tokens that are evaluated in the past and updates the LLM context.

Args:

tokens: The list of input tokens.
reset: Whether to reset the model state before generating text. Default: True

Returns: The list of tokens to evaluate.

`method` `LLM.reset`

reset() → None

Deprecated since 0.2.27.

`method` `LLM.sample`

sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int

Samples a token from the model.

Args:

top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1

Returns: The sampled token.

`method` `LLM.tokenize`

tokenize(text: str, add_bos_token: Optional[bool] = None) → List[int]

Converts a text into list of tokens.

Args:

text: The text to tokenize.
add_bos_token: Whether to add the beginning-of-sequence token.

Returns: The list of tokens.

`method` `LLM.call`

__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]

Generates text from a prompt.

Args:

prompt: The prompt to generate text from.
max_new_tokens: The maximum number of new tokens to generate. Default: 256
top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1
batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1
stop: A list of sequences to stop generation when encountered. Default: None
stream: Whether to stream the generated text. Default: False
reset: Whether to reset the model state before generating text. Default: True

Returns: The generated text.

License

MIT

ctransformers's People

Contributors

Stargazers

Watchers

Forkers

leedaga aphexus dadeba segmond underwaterbepis laurids-reichardt sorokinvld gavinchen1314 ramstorage mcx ayatsenko-saguarocm colosieve dbpprt paixai abeusher abacaj hurricanejin kp-forks apollohuang1 sbmaruf joel29dec 6 drewwalkup spock-ai qeternity fruityloops1 wesley7137 chavesliu techthiyanes antonpolishko wdshin furqanrydhan kaidanov usmanafridi automindx rrsart2023 kvanlier kgrsajid jncraton keithlwy siddharth28392 liliang-cn arditecht ludoplex asdlei99 jllllll rohan-mehta-1024 jameschung2000 mohan-chinnappan-n frischifrisch willb0 rupesh510 cyrilmagsuci martincastellano mwksandman liormedan aman-saini-402 jlcan touristshaun newmedia2 bmedi wheynelau jan-karsten-kuhnke 5l1v3r1 bhargav muaiyadh readytodance laeljh s-ahmad461 murongtianfeng wingtangwong machinelearningzuu thanhpham1987 awesome-software fieri pti-dp-st001 victorlee0505 muharremokutan hbcbh1999 amesianx akshay1921 and270 agitronics tomchapin tonywhite11 furyhawk 1kamp1 sujeendran zhcharles liunix61 jdwebprogrammer ego tangtc1981 harshhappycrew murilocurti ahhr80 anhmike amirhossein1376 mattgpt ethicalsecurity-agency

ctransformers's Issues

Compiled Cublas version FileNotFound on windows

Hi,

I managed to install the project using CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers

But on model loading I'm getting:

FileNotFoundError: Could not find module 'C:\Users\User\dev\melMass\hal\python\.venv\Lib\site-packages\ctransformers\lib\local\ctransformers.dll' (or one of its dependencies). Try
using the full path with constructor syntax.

It is there but I guess something went wrong when compiling it:

❯ ls -s 'C:\Users\User\dev\melMass\hal\python\.venv\Lib\site-packages\ctransformers\lib\local\'
╭───┬───────────────────┬──────┬────────┬────────────────╮
│ # │       name        │ type │  size  │    modified    │
├───┼───────────────────┼──────┼────────┼────────────────┤
│ 0 │ ctransformers.dll │ file │ 1.0 MB │ 23 minutes ago │
╰───┴───────────────────┴──────┴────────┴────────────────╯

Full trace log

╭─────────────────────── Traceback (most recent call last) ────────────────────────╮
│ C:\Users\User\dev\melMass\hal\python\cli.py:17 in <module>                       │
│                                                                                  │
│   14 console.print("Loading model...", style="bold yellow")                      │
│   15 # Larger batch_size will process the prompt faster but will require more me │
│   16 # Also if you have enough GPU memory to fit the model, setting threads=1 ca │
│ ❱ 17 llm = AutoModelForCausalLM.from_pretrained(model_path, model_type='starcode │
│   18                                                                             │
│   19 console.print("Model loaded...", style="bold green")                        │
│   20                                                                             │
│                                                                                  │
│ C:\Users\User\dev\melMass\hal\python\.venv\lib\site-packages\ctransformers\hub.p │
│   154 │   │   │   │   local_files_only=local_files_only,                         │
│   155 │   │   │   )                                                              │
│   156 │   │                                                                      │
│ ❱ 157 │   │   return LLM(                                                        │
│   158 │   │   │   model_path=model_path,                                         │
│   159 │   │   │   model_type=model_type,                                         │
│   160 │   │   │   config=config.config,                                          │
│                                                                                  │
│ C:\Users\User\dev\melMass\hal\python\.venv\lib\site-packages\ctransformers\llm.p │
│ y:206 in __init__                                                                │
│                                                                                  │
│   203 │   │   if not Path(model_path).is_file():                                 │
│   204 │   │   │   raise ValueError(f"Model path '{model_path}' doesn't exist.")  │
│   205 │   │                                                                      │
│ ❱ 206 │   │   self._lib = load_library(lib)                                      │
│   207 │   │   self._llm = self._lib.ctransformers_llm_create(                    │
│   208 │   │   │   model_path.encode(),                                           │
│   209 │   │   │   model_type.encode(),                                           │
│                                                                                  │
│ C:\Users\User\dev\melMass\hal\python\.venv\lib\site-packages\ctransformers\llm.p │
│ y:102 in load_library                                                            │
│                                                                                  │
│    99 │   │   os.add_dll_directory(os.path.join(os.environ["CUDA_PATH"], "bin")) │
│   100 │                                                                          │
│   101 │   path = find_library(path)                                              │
│ ❱ 102 │   lib = CDLL(path)                                                       │
│   103 │                                                                          │
│   104 │   lib.ctransformers_llm_create.argtypes = [                              │
│   105 │   │   c_char_p,  # model_path                                            │
│                                                                                  │
│ C:\Users\User\.pyenv\pyenv-win\versions\3.10.11\lib\ctypes\__init__.py:374 in    │
│ __init__                                                                         │
│                                                                                  │
│   371 │   │   self._FuncPtr = _FuncPtr                                           │
│   372 │   │                                                                      │
│   373 │   │   if handle is None:                                                 │
│ ❱ 374 │   │   │   self._handle = _dlopen(self._name, mode)                       │
│   375 │   │   else:                                                              │
│   376 │   │   │   self._handle = handle                                          │
│   377                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: Could not find module
'C:\Users\User\dev\melMass\hal\python\.venv\Lib\site-packages\ctransformers\lib\loca
l\ctransformers.dll' (or one of its dependencies). Try using the full path with
constructor syntax.

System

Windows 11
CUDA 11.7
RTX 3090

High RAM usag when offloading to GPU layers.

Just updated my GPU from a 2080 to a 3090 and man does it makes things go brrrr lol.

Anyways, I notice a new strange behavor when I did. Instead of model + GPU taking close to what the model took in system ram... it now takes almost double the system ram. When offloading from say 8 to 100 using the model wizardLM-13B-Uncensored.ggmlv3.q4_0.bin I jump from 6-7 GB to almost 12 to 14 GB on system RAM - even more as I increase the number of GPU layers. I was under the impression that more GPU_Layers the less system memory it should be using not more?

    def load_chat_model( self, model = "wizardLM-13B-Uncensored.ggmlv3.q4_0.bin" ): 
        self.gptj = AutoModelForCausalLM.from_pretrained(
            f'models/{model}',
            model_type = 'llama', #mpt, llama
            reset = True,
            threads = 1, gpu_layers = 100,
            context_length = 2048, #8192, 2048
            batch_size = 2048,
            temperature = float( .65 ),
            repetition_penalty = float( 1.1 )
        )

While I have the RAM for it - just seems very very strange it should be taking even more system ram than ever before.

While not 100% related, I could be just simply doing something wrong with the settings, I had another issue where when I did offload to the GPU when I had my 2080 - things were slow. The fix for it was to increase the batch_size and that did improve the performance even just on 8 layers. Changing the batch in this case doesnt seem to change much for memory usage only the gpu_layers seem to be the issue.
#27 As noted here, I dont seem to get "out of memory" errors when I increase the GPU layers - it will jsut "oom" if I go past too many layers for my GPU VRAM relying on the system threads instead.

ctransformers 0.2.10
Windows 11
3090
CUDA supported
Python 3.10
32GB of RAM

RAM Usage after load + message | system ram before loading | difference
threads = 8,
CPU only: 14.0 - 7.3 = 7GB

threads = 1, gpu_layers = 50,
1T + GPU: 20.9 - 7.3 = 13GB

A little more testing I see it scales up to about an extra 5GB of data for the system RAM before it caps out increasing a little bit per layer between 1-50. Almost seems like it not releasing the "work load" that it was planning on sending to GPU.

Multi-threading concurrency, error reported

https://colab.research.google.com/drive/1c6v2UUTcvrRtJBCxLk67KCWS_pd_2hCA

segmentation fault (core dumped)

CuBLas Support

Do you have any ideas to support CuBlas to increase inference speed by offloading some layers to GPU

Issues loading RedPajama Models?

@marella Is it possible to load and run any of the RedPajama variants? I have tried several variations of the following in Google Colab, but all of them cause the session to crash:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained('keldenl/RedPajama-INCITE-Instruct-3B-v1-GGML', 
                                           model_type = 'gpt_neox', 
                                           #local_files_only = True,
                                           #lib = 'basic',
                                           )

Is this optimized for Apple M2 GPUs?

Hey there,

Is this optimized for Apple M2 hardware? Does it leverage the Apple Neural Engine or metal?

thanks!

Batching support

Looks like ggml supports batching
Can it be supported here too
Will it gain some performance?

Low level API to generate candidate tokens and probabilities

The sample() function only samples a token from the current state model. It would be nice to have a low level API to generate candidate tokens with probabilities to make it possible to implement custom deciding strategies.

Error when running ctransformers - std::runtime_Error

Hi, folks. Here's the code I'm using:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('/mnt/mydisk/AI/text/models/llama_cpp/gpt4all-lora-quantized-ggml.bin', model_type='llama', gpu_layers=2,  lib='avx') # I've also tried without lib='avx'

print(llm('AI is going to'))

Here's the error I'm getting:


terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file

What can I do?
Also, Koboldcpp works fine.

Few questions / issues

I just wanted to ask if you guys are planing to add MPT GPU support as well somet ime? I see its supported for LLAMA models.
Real reason for the ticket, I am having issue getting it to really use the GPU. Sometimes it works and sometimes it doenst. Not really sure how to explain, but:
A) Windows 11, 32 GB of RAM, Ryzen 5800x, 13B-HyperMantis.ggmlv3.q5_1.bin, set to LLAMA, 12 Threads (give or take what I set here for setting reasons), RTX 2080
B) Install the model the other day. Tested on the CPU was able to get results back under 20 to 25 seconds. Saw there was GPU suported, uninstall and reinstall with CUDA support. Tested the GPU off loading and it didnt seem to do much in my first round of testing. I had set it to 25 layers at the time. Didnt see any improvement in speed, but could see that the GPU was being used with higher memory access and GPU usage spiking, but never capping at max. Lower the count to 15 layers. Tested again. This time was able to hit 5 to 10. Went crazy and tested it as much as I could getting really good results. Today, I rebooted my machine and its acting like it did the other day at 25. Tried lower it from 15 to 10 or below ... but it doesnt seem to be using the GPU yet "acting" or "setting" up the usage for the GPU as I can see the memory and inflex of usage - but never fully topping out.

I could be totally using it wrong, but the fact it was working the other day and today it stop tells me something change on my computer, but I honstly couldnt tell you what did. Didnt perform any updates, but its also weird it didnt work before then all at once it did. Not sure if there is some type of support model it needs or not. Cuda is supported under torch check. Any help or inforamtion is welcome:) I understand this is not a common issue. Any places I can check/get values for to see if its really working would be great. Just seems odd.

CLBlast support for gpt-2 types (WizardCoder)?

There is CLBlast GPU support for GPT-2 based models on koboldcpp for example, where I can do prompt processing on the GPU VRAM for less prompt batching errors with my 16GB of CPU RAM. Does anyone know if this is possible with ctransformers?

Exllama support?

Hi,
exllama now can support 4k context size. Is that something you are planning on supporting in future version?

Error loading model

Hi,

I get the following error when I try to load the model:

python3.10/site-packages/ctransformers/lib/basic/libctransformers.so: cannot open shared object file: No such file or directory

using:
llm = AutoModelForCausalLM.from_pretrained('/models/gpt-2-1558M-ggml/ggml-model-q4_1.bin', model_type='gpt2', lib='basic')

I am running this on aarch64 Ubuntu 22.04 system

Please let me know how to fix this.

Thank you.

How to get the perplexity of the sequence

This project is crasy fast
But can it be applied not only to generate text but also to score it too

Is there is the way to measure the perplexity or naturalnes of the sequence, for example of a question-answer pair

Caching

Hey! Llama.cpp has a built-in state save/load mechanism which allows for fast startup on contexts that have been processed before. Is this possible to do with ctransformers somehow?

New Falcon ggllm.cpp format

Hi @marella

Have you seen the new GGCC model format of ggllm.cpp? This significantly improves the quality of its Falcon model support. There is also improvements to performance.

But it is a new format, incompatible with the old.

Are you planning to add support? https://github.com/cmp-nct/ggllm.cpp

I have multiple Falcon GGCC models up but right now they can only be used with ggllm.cpp on the command line. It'd be good to increase that support

Thanks

How to specify specific gpu ids to offfload to?

I'd like to specify only 2 gpus to offload to from n on the machine, is it possible to that. Thanks!

How to know which word is the end of the response?

Is there a better way to know what the last word of the response is?

I currently added this line in llm.py

                match = stop_regex.search(text)
                if match:
                    text = text[: match.start()]
                    yield '[<STOP>]'   <----
                    break

and then i do this in my backend

    generator = generate(llm, generation_config, user_prompt.strip())
    for word in generator:
        if word == "[<STOP>]":
            send(json.dumps({'end': 'true'}))
        else:
            send(json.dumps({'content': word})

Add ggml debug timing

This is a great python binding for ggml. I couldn't find a way to get ggml operation debug timing. Could it be added?

Segfault when trying to use call multiple times

Hi I was running this in a fastapi server generating text multiple times from the endpoint I end up getting segfault if I try to generate text before the previous generation has finished.

I am calling the llm directly:

llm("prompt goes here")

Tried a few models, streaming true + false, behavior seems to be the same. Have to wait for each generation to finish before trying to generate for other requests. Curious if this is expected behavior

Feature request: support for Huggingface Hub branches/revisions

I recently updated my Falcon GGMLs to the new GGCC format. But I kept the old ggml format models available in another branch, called ggmlv3

With Hugging Face Hub, it would be possible to download those like so:

model = AutoModelForCausalLM.from_pretrained("TheBloke/blah", revision="ggmlv3", ...)

My request is for the same to be supported in ctransformers, so branches can be used to differentiate between versions of models.

Thanks in advance.

Failed to Create LLM

I'm Running Wizard-Mega-13B-ggml using ctransformers on Google Colab. it Works Fine.

But After Sometimes, When I'm running the Same Code it Gives me This Error

Failed to create LLM 'llama' from '/root/.cache/huggingface/hub/models--TheBloke--wizard-mega-13B-GGML/snapshots/8aa1f0f288f8bb96426da430e2af19c9d70dd4ff/wizard-mega-13B.ggmlv3.q4_0.bin'.

My Code is Dead Simple,

from huggingface_hub import hf_hub_download
import time
from ctransformers import AutoModelForCausalLM
model_path = hf_hub_download(repo_id= "TheBloke/wizard-mega-13B-GGML", filename= "wizard-mega-13B.ggmlv3.q4_0.bin")
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1, "temperature": 0.5, 'stream' : False, 'threads': 1}
llm = AutoModelForCausalLM.from_pretrained(model_path, model_type="llama", lib='basic', **config)

# Rest of the Code for input pipeline

So, What may be the Problem for This?

how to apply stream into MPT-7B-Instruct-GGML model

I try to pass the arguments that are listed in the documentation, but I get nowhere.

handler = StdOutCallbackHandler()
llm = CTransformers(model='TheBloke/MPT-7B-Instruct-GGML',model_file='mpt-7b-instruct.ggmlv3.q4_0.bin' ,
model_type='mpt',config={"stream":True,
"max_new_tokens":256,
"threads":6},
callbacks=[StreamingStdOutCallbackHandler()]
)
llm(PROMPT_FOR_GENERATION_FORMAT.format(context=content,
query=query))
but it looks not working. It does not return a generator. instead it returns a string.
The model takes an extremely long time to think before it starts to print and the response speed is about the same as without the stream.

Redestribute cuda compatible version without the need to recompile locally

Hi and thanks for GPU integration. I made it work and that's cool.

But it requires the user to have cuda and all build tools installed with the specific version 11.7 (latest one supported by torch as of now), which makes the whole thing non user friendly.

I was thinking of precompiling this for many possible architectures then redestributing them.

Starcoder / Quantized Issues

Hey! Thanks for this library, I really appreciate the API and simplicity you are bringing to this, it's exactly what I was looking for in trying to integrate ggml models into python! (specifically into my library lambdaprompt.

One issue, it seems like there's something going wrong with starcoder quantized models.
For the full model, it seems to work great, and I'm getting the same outputs it seems.

What works (full model weights):

 ./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

as equivalent to:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
    '/workspaces/research/models/starcoder/starcoder-ggml.bin',
    model_type='starcoder')
print(llm("def fibo(", max_new_tokens=30, top_k=0, top_p=0.95, temperature=0.2))

Seem to give equivalent results!

What fails (quantized model weights):

However, when I change to the quantized model (to reproduce the same as this)

./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

I get a core dumped ggml error

Python 3.10.11 (main, Apr 12 2023, 14:46:22) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ctransformers import AutoModelForCausalLM
>>> llm = AutoModelForCausalLM.from_pretrained(
...     '/workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin',
...     model_type='starcoder')
GGML_ASSERT: /home/runner/work/ctransformers/ctransformers/models/ggml/src/ggml.c:4408: wtype != GGML_TYPE_COUNT
Aborted (core dumped)

RedPajama support

Please, very good and small model
https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1

ggml_new_tensor_impl: not enough space in the scratch memory pool

I have a real bug this time.

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 575702016, available 268435456)
Segmentation fault

This affects ggml implementation in llama.cpp so it is likely not a bug with ctransformers but an inherited bug from the ggml code. This appears on the second run, which to me means that there is a memory leak - the buffer is either not correctly sized and/or it is not cleared upon exit

Config

How do I specify model() parameters, from here: https://github.com/marella/ctransformers#config

Placing them inside model() does not raise an error, but they are ignored for my mpt model ie. I can enter whatever I want, output does not change at all.

Am I supposed to enter them under kwargs as a list? Is there an example somewhere?

EDIT: I changed gnenerate() to model(). I had generate() on my mind from another question, but it is the model parameters that I am trying to set

How to enable CTransformers load model_file from Google Drive?

I'm using from langchain.llms import CTransformers

I found error
ValidationError: 1 validation error for CTransformers
not found in '/root/.cache/huggingface/hub/

Performance on Apple silicon

Context: #1 (comment)

@bgonzalezfractal did you notice any performance improvement just by changing the threads parameter?

If you don't have the latest quantized models, you can go back to the previous commit using:

git checkout e707f99
git submodule update

Here you can run the build commands and check:

cmake -S . -B build
cmake --build build

Segmentation fault on m1 mac

Trying simple example on m1 mac:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/starcoderbase-GGML/starcoderbase-ggml-q4_0.bin",
    model_type="starcoder",
    lib="basic",
)

print(llm("Hi"))

leads to segmentation fault. Model works fine with ggml example code.

Text output

I am back to testing ctransformers, now with the quantized mpt-30b-instruct model and I have a really basic problem. Does ctransformers handle output in a different way to transformers?

I have been trying for a while to get it to print a block of text, but when using the native huggingface tokenizer I either get a streaming output from the for loop where each decoded token is placed in new line, or I get a list:

python mpt-30b_transformers_test.py
['Blue', ' can', ' be', ' associated', ' with', ' many', ' things', ',', ' such', ' as', ' sadness', ' or', ' depression', ' due', ' to', ' its', ' association', ' in', ' mood', 's', ' and', ' feelings', ';', ' however', ' it', '’', 's', ' also', ' been', ' proven', ' through', ' research', ' studies', ' conducted', ' by', ' psychologists', ' at', ' Harvard', ' University', ' that', ' looking', '/', 'sur', 'round', 'ing', ' yourself', ' within', ' this', ' shade', ' will', ' reduce', ' your', ' stress', ' levels', '.', '  ', 'It', ' is', ' a', ' very', ' cal', 'ming', ' colour', ' which', ' makes', ' sense', ' when', ' you', ' think', ' about', ' the', ' sky', ',', ' sea', ' etc', ' where', ' blue', ' dominates', ' as', ' these', ' are', ' places', ' we', ' often', ' go', ' to', ' relax', ' and', ' switch', ' off', ' from', ' our', ' busy', ' lives', '!']

I can wrangle this into text, but it is an extra step and a bit of a pita as not every token is a word or a free syntax symbol so I am wondering if there is a better way to get plain text output when using ctransformers with a HF tokenizer.

basic lib doesn't work with Ampere ARM processors

As the title suggest, if I try to set lib='basic' I get the following error:

/home/ubuntu/.local/lib/python3.10/site-packages/ctransformers/lib/basic/libctransformers.so: cannot open shared object file: No such file or directory

If I try to directly run libctransformers.so I get this error:

cannot execute binary file: Exec format error

No GPU support for MPT models?

New ggml llamacpp file format support

Hi and thanks for this beautiful work.

Are you planning on supporting the version 2 of the llamacpp files. I want to add the OpenAssistant model to my GPT4ALL-ui and can't find a python binding that supports it.

Here is the bloke's version of the model:
https://huggingface.co/TheBloke/OpenAssistant-SFT-7-Llama-30B-GGML/tree/main

Best regards

Outputting text sentence by sentence

I was wondering if it would be possible to run inference and recieve the output line by line, instead of all at once
or as "mangled tokens"

Ukrainian Upper case not appearing

@marella thanks, very usefull project
I use llama ggml model
With latest llama.cpp it says Привіт, but with ctransformers only ривіт
It is related to all words, not only first in a responce. Банан -> анан

I'm not sure if it's related to encoding or something else

nvm already is langchain compatible

REST API?

What's the best practice to interface ctransformers API and expose it as REST API?

I looked at https://github.com/jquesnelle/transformers-openai-api and tried to change it to use ctransformer, but then I stopped as the changes required increased making it hard(er) to keep in-sync with the original.

Support Microsoft Guidance

I am trying to use a 'custom tokenizer' but I am unable to see how can I invoke it. Also can we use a standard tokenizer from HF by pulling it or loading from the local path?

Types of model that ctransformer can load?

Asking out of curiosity. What are the model that I could load using ctransformers?

Falcon support?

I've been tracking the Falcon ggerganov/ggml#231 PR, and as I understand currently it won't work on a released version of ggml.

Any suggestions on how to test it config wise are appreciated, I'm assuming llama might not work based on other PRs.

Need a method to get the embeddings

Hi there. I am upgrading my bindings for the lord of llms tool and I now need to be able to vectorize text to embedding space of the current model. Is there a way to have access to the latent space of the model ? I input a text and get the encoder output in latent space?

Best regards

q2 quantization support

Hi there. The llamacpp is now supporting q2 quantization. Is there any chance this comes to ctransformers?

Model crashes when tokenizing 4 digit numbers

model.tokenize("1111")

crashes the python interpreter

if relevant, I'm using https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML

ctx limiter.

Hi I have notice that there is no option to define the context amount (max tokens) used for generation. Is the library using the maximum context by default?.

Compilation error on linux ubuntu

Hi there. I have installed cuda toolkit 11.7 using conda. Also gcc (tested multiple versions). Here with 9.4.0 I get this error:

  -- Build files have been written to: /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/_skbuild/linux-x86_64-3.10/cmake-build
  [1/6] Building C object CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o
  FAILED: CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o
  /home/200.3-PYTHON/commun/envs/lollms/bin/cc -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_DMMV_Y=1 -DGGML_USE_CUBLAS -DGGML_USE_K_QUANTS -DK_QUANTS_PER_ITERATION=2 -Dctransformers_EXPORTS -I/nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models -isystem /home/200.3-PYTHON/commun/envs/lollms/include -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -mfma -mavx2 -mf16c -mavx -pthread -MD -MT CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o -MF CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o.d -o CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o -c /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.c
  In file included from /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.c:1:
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:26:15: error: expected declaration specifiers or '...' before 'sizeof'
     26 | static_assert(sizeof(block_q2_K) == 2*sizeof(ggml_fp16_t) + QK_K/16 + QK_K/4, "wrong q2_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:26:79: error: expected declaration specifiers or '...' before string constant
     26 | static_assert(sizeof(block_q2_K) == 2*sizeof(ggml_fp16_t) + QK_K/16 + QK_K/4, "wrong q2_K block size/padding");
        |                                                                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:38:15: error: expected declaration specifiers or '...' before 'sizeof'
     38 | static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + 11 * QK_K / 64, "wrong q3_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:38:86: error: expected declaration specifiers or '...' before string constant
     38 | static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + 11 * QK_K / 64, "wrong q3_K block size/padding");
        |                                                                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:50:15: error: expected declaration specifiers or '...' before 'sizeof'
     50 | static_assert(sizeof(block_q4_K) == 2*sizeof(ggml_fp16_t) + 3*QK_K/64 + QK_K/2, "wrong q4_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:50:81: error: expected declaration specifiers or '...' before string constant
     50 | static_assert(sizeof(block_q4_K) == 2*sizeof(ggml_fp16_t) + 3*QK_K/64 + QK_K/2, "wrong q4_K block size/padding");
        |                                                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:63:15: error: expected declaration specifiers or '...' before 'sizeof'
     63 | static_assert(sizeof(block_q5_K) == 2*sizeof(ggml_fp16_t) + 3*QK_K/64 + QK_K/2 + QK_K/8, "wrong q5_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:63:90: error: expected declaration specifiers or '...' before string constant
     63 | static_assert(sizeof(block_q5_K) == 2*sizeof(ggml_fp16_t) + 3*QK_K/64 + QK_K/2 + QK_K/8, "wrong q5_K block size/padding");
        |                                                                                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:75:15: error: expected declaration specifiers or '...' before 'sizeof'
     75 | static_assert(sizeof(block_q6_K) == sizeof(ggml_fp16_t) + QK_K / 16 + 3*QK_K/4, "wrong q6_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:75:81: error: expected declaration specifiers or '...' before string constant
     75 | static_assert(sizeof(block_q6_K) == sizeof(ggml_fp16_t) + QK_K / 16 + 3*QK_K/4, "wrong q6_K block size/padding");
        |                                                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:83:15: error: expected declaration specifiers or '...' before 'sizeof'
     83 | static_assert(sizeof(block_q8_K) == sizeof(float) + QK_K + QK_K/16*sizeof(int16_t), "wrong q8_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:83:85: error: expected declaration specifiers or '...' before string constant
     83 | static_assert(sizeof(block_q8_K) == sizeof(float) + QK_K + QK_K/16*sizeof(int16_t), "wrong q8_K block size/padding");
        |                                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  [2/6] Building C object CMakeFiles/ctransformers.dir/models/ggml/ggml.c.o
  FAILED: CMakeFiles/ctransformers.dir/models/ggml/ggml.c.o

Is there any prevompiled version I can try to install directly? Or do you have an idea why this fails?

OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found

I am trying to use ctransformers to load the falcon-40b ggml model and I get the following error

OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/ctransformers/lib/avx2/libctransformers.so)

I am using an AWS Sagemaker notebook. I am new to this, please help in resolving this issue

Add support for "stop" in config

(In the readme at least) the config passed in CTransformers doesn't accept stop strings, which is a common feature.

Best backend yet!

Thanks for this cool library. I have integrated it into my gpt4All-ui

If you want, please come to the discord channel. You are welcome to suggest ideas or discuss the future of all of this stuff:
https://discord.gg/FAfwMshF

And to test the tool, you can find it here:
https://github.com/nomic-ai/gpt4all-ui/releases/tag/v0.0.8

marella / ctransformers Goto Github PK

ctransformers's Introduction

Supported Models

Installation

Usage

🤗 Transformers

LangChain

GPU

CUDA

ROCm

Metal

GPTQ

Documentation

Config

class AutoModelForCausalLM

classmethod AutoModelForCausalLM.from_pretrained

class LLM

method LLM.__init__

property LLM.bos_token_id

property LLM.config

property LLM.context_length

property LLM.embeddings

property LLM.eos_token_id

property LLM.logits

property LLM.model_path

property LLM.model_type

property LLM.pad_token_id

property LLM.vocab_size

method LLM.detokenize

method LLM.embed

method LLM.eval

method LLM.generate

method LLM.is_eos_token

method LLM.prepare_inputs_for_generation

method LLM.reset

method LLM.sample

method LLM.tokenize

method LLM.__call__

License

ctransformers's People

Contributors

Stargazers

Watchers

Forkers

ctransformers's Issues

System

What works (full model weights):

What fails (quantized model weights):

Recommend Projects

Recommend Topics

Recommend Org

`class` `AutoModelForCausalLM`

`classmethod` `AutoModelForCausalLM.from_pretrained`

`class` `LLM`

`method` `LLM.init`

`property` LLM.bos_token_id

`property` LLM.config

`property` LLM.context_length

`property` LLM.embeddings

`property` LLM.eos_token_id

`property` LLM.logits

`property` LLM.model_path

`property` LLM.model_type

`property` LLM.pad_token_id

`property` LLM.vocab_size

`method` `LLM.detokenize`

`method` `LLM.embed`

`method` `LLM.eval`

`method` `LLM.generate`

`method` `LLM.is_eos_token`

`method` `LLM.prepare_inputs_for_generation`

`method` `LLM.reset`

`method` `LLM.sample`

`method` `LLM.tokenize`

`method` `LLM.call`