
ctransformers's Introduction

Python bindings for Transformer models implemented in C/C++ using the GGML library.

Also see ChatDocs

Supported Models

Models                Model Type     CUDA    Metal
GPT-2                 gpt2
GPT-J, GPT4All-J      gptj
GPT-NeoX, StableLM    gpt_neox
Falcon                falcon
LLaMA, LLaMA 2        llama
MPT                   mpt
StarCoder, StarChat   gpt_bigcode
Dolly V2              dolly-v2
Replit                replit

Installation

pip install ctransformers

Usage

It provides a unified interface for all models:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))

Run in Google Colab

To stream the output, set stream=True:

for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)

You can load models from Hugging Face Hub directly:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use it with 🤗 Transformers, create the model and tokenizer using:

from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)

Run in Google Colab

You can use the 🤗 Transformers text-generation pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))

You can use 🤗 Transformers generation parameters:

pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)

You can use 🤗 Transformers tokenizers:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.

LangChain

It is integrated into LangChain. See LangChain docs.
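
A minimal sketch of the LangChain integration (the import path and config values below are assumptions and vary with your LangChain version; the model repo is the same example used above):

from langchain.llms import CTransformers  # may be langchain_community.llms in newer releases

llm = CTransformers(
    model="marella/gpt-2-ggml",
    model_type="gpt2",
    config={"max_new_tokens": 64, "temperature": 0.8},
)

print(llm("AI is going to"))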

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)

Run in Google Colab

CUDA

Install CUDA libraries using:

pip install ctransformers[cuda]

ROCm

To enable ROCm support, install the ctransformers package using:

CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers

Metal

To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

GPTQ

Note: This is an experimental feature and only LLaMA models are supported using ExLlama.

Install additional dependencies using:

pip install ctransformers[gptq]

Load a GPTQ model using:

llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

Run in Google Colab

If the model name or path doesn't contain the word gptq, specify model_type="gptq".

It can also be used with LangChain. Low-level APIs are not fully supported.

Documentation

Config

Parameter            Type       Description                                                       Default
top_k                int        The top-k value to use for sampling.                              40
top_p                float      The top-p value to use for sampling.                              0.95
temperature          float      The temperature to use for sampling.                              0.8
repetition_penalty   float      The repetition penalty to use for sampling.                       1.1
last_n_tokens        int        The number of last tokens to use for repetition penalty.          64
seed                 int        The seed value to use for sampling tokens.                        -1
max_new_tokens       int        The maximum number of new tokens to generate.                     256
stop                 List[str]  A list of sequences to stop generation when encountered.          None
stream               bool       Whether to stream the generated text.                             False
reset                bool       Whether to reset the model state before generating text.          True
batch_size           int        The batch size to use for evaluating tokens in a single prompt.   8
threads              int        The number of threads to use for evaluating tokens.               -1
context_length       int        The maximum context length to use.                                -1
gpu_layers           int        The number of layers to run on GPU.                               0

Note: Currently only LLaMA, MPT and Falcon models support the context_length parameter.
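
These values can be passed as keyword arguments when loading a model; a brief sketch (the path and numbers below are placeholders, not recommendations):

from ctransformers import AutoModelForCausalLM

# Config values can be passed as keyword arguments to from_pretrained.
llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/ggml-model.bin",   # placeholder path
    model_type="llama",
    context_length=2048,         # only LLaMA, MPT and Falcon models support this
    gpu_layers=50,
    threads=8,
    max_new_tokens=256,
    temperature=0.8,
)

print(llm("AI is going to"))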

class AutoModelForCausalLM


classmethod AutoModelForCausalLM.from_pretrained

from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
    **kwargs
) → LLM

Loads the language model from a local file or remote repo.

Args:

  • model_path_or_repo_id: The path to a model file or directory or the name of a Hugging Face Hub model repo.
  • model_type: The model type.
  • model_file: The name of the model file in repo or directory.
  • config: AutoConfig object.
  • lib: The path to a shared library or one of avx2, avx, basic.
  • local_files_only: Whether or not to only look at local files (i.e., do not try to download the model).
  • revision: The specific model version to use. It can be a branch name, a tag name, or a commit id.
  • hf: Whether to create a Hugging Face Transformers model.

Returns: LLM object.

class LLM

method LLM.__init__

__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)

Loads the language model from a local file.

Args:

  • model_path: The path to a model file.
  • model_type: The model type.
  • config: Config object.
  • lib: The path to a shared library or one of avx2, avx, basic.

property LLM.bos_token_id

The beginning-of-sequence token.


property LLM.config

The config object.


property LLM.context_length

The context length of the model.


property LLM.embeddings

The input embeddings.


property LLM.eos_token_id

The end-of-sequence token.


property LLM.logits

The unnormalized log probabilities.


property LLM.model_path

The path to the model file.


property LLM.model_type

The model type.


property LLM.pad_token_id

The padding token.


property LLM.vocab_size

The number of tokens in the vocabulary.


method LLM.detokenize

detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]

Converts a list of tokens to text.

Args:

  • tokens: The list of tokens.
  • decode: Whether to decode the text as UTF-8 string.

Returns: The combined text of all tokens.


method LLM.embed

embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]

Computes embeddings for a text or list of tokens.

Note: Currently only LLaMA and Falcon models support embeddings.

Args:

  • input: The input text or list of tokens to get embeddings for.
  • batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1

Returns: The input embeddings.
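
For example (a sketch assuming a local LLaMA GGML model, since only LLaMA and Falcon models support embeddings; the path is a placeholder):

from ctransformers import AutoModelForCausalLM

# Placeholder path; embeddings currently require a LLaMA or Falcon model.
llm = AutoModelForCausalLM.from_pretrained("/path/to/llama-ggml-model.bin", model_type="llama")

vector = llm.embed("AI is going to")  # List[float]
print(len(vector))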


method LLM.eval

eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None

Evaluates a list of tokens.

Args:

  • tokens: The list of tokens to evaluate.
  • batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1

method LLM.generate

generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]

Generates new tokens from a list of tokens.

Args:

  • tokens: The list of tokens to generate tokens from.
  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1
  • batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1
  • reset: Whether to reset the model state before generating text. Default: True

Returns: The generated tokens.
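
A sketch of how the low-level methods compose, tokenizing a prompt and streaming generated tokens back as text (the model path is a placeholder, and the explicit EOS check and token cap are defensive additions, not part of the documented API):

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="llama")  # placeholder path

tokens = llm.tokenize("AI is going to")
generated = 0
for token in llm.generate(tokens):
    # Stop on end-of-sequence or after a fixed number of tokens (defensive cap).
    if llm.is_eos_token(token) or generated >= 64:
        break
    print(llm.detokenize([token]), end="", flush=True)
    generated += 1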


method LLM.is_eos_token

is_eos_token(token: int) → bool

Checks if a token is an end-of-sequence token.

Args:

  • token: The token to check.

Returns: True if the token is an end-of-sequence token else False.


method LLM.prepare_inputs_for_generation

prepare_inputs_for_generation(
    tokens: Sequence[int],
    reset: Optional[bool] = None
) → Sequence[int]

Removes input tokens that are evaluated in the past and updates the LLM context.

Args:

  • tokens: The list of input tokens.
  • reset: Whether to reset the model state before generating text. Default: True

Returns: The list of tokens to evaluate.


method LLM.reset

reset() → None

Deprecated since 0.2.27.


method LLM.sample

sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int

Samples a token from the model.

Args:

  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1

Returns: The sampled token.


method LLM.tokenize

tokenize(text: str, add_bos_token: Optional[bool] = None) → List[int]

Converts text into a list of tokens.

Args:

  • text: The text to tokenize.
  • add_bos_token: Whether to add the beginning-of-sequence token.

Returns: The list of tokens.


method LLM.__call__

__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]

Generates text from a prompt.

Args:

  • prompt: The prompt to generate text from.
  • max_new_tokens: The maximum number of new tokens to generate. Default: 256
  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1
  • batch_size: The batch size to use for evaluating tokens in a single prompt. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1
  • stop: A list of sequences to stop generation when encountered. Default: None
  • stream: Whether to stream the generated text. Default: False
  • reset: Whether to reset the model state before generating text. Default: True

Returns: The generated text.
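
For example, stop sequences and sampling parameters can be combined in a single call (the prompt, path and values here are illustrative only):

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="llama")  # placeholder path

text = llm(
    "Question: What is AI?\nAnswer:",
    max_new_tokens=64,
    temperature=0.7,
    stop=["\n"],  # stop at the first newline
)
print(text)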

License

MIT

ctransformers's People

Contributors

abacaj, github-actions[bot], jameschung2000, jllllll, jncraton, marella


ctransformers's Issues

Compiled cuBLAS version: FileNotFoundError on Windows

Hi,

I managed to install the project using CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers

But on model loading I'm getting:

FileNotFoundError: Could not find module 'C:\Users\User\dev\melMass\hal\python\.venv\Lib\site-packages\ctransformers\lib\local\ctransformers.dll' (or one of its dependencies). Try
using the full path with constructor syntax.

It is there but I guess something went wrong when compiling it:

❯ ls -s 'C:\Users\User\dev\melMass\hal\python\.venv\Lib\site-packages\ctransformers\lib\local\'
╭───┬───────────────────┬──────┬────────┬────────────────╮
│ # │       name        │ type │  size  │    modified    │
├───┼───────────────────┼──────┼────────┼────────────────┤
│ 0 │ ctransformers.dll │ file │ 1.0 MB │ 23 minutes ago │
╰───┴───────────────────┴──────┴────────┴────────────────╯
Full trace log
╭─────────────────────── Traceback (most recent call last) ────────────────────────╮
 C:\Users\User\dev\melMass\hal\python\cli.py:17 in <module>                       
                                                                                  
   14 console.print("Loading model...", style="bold yellow")                      
   15 # Larger batch_size will process the prompt faster but will require more me │
   16 # Also if you have enough GPU memory to fit the model, setting threads=1 ca │
  17 llm = AutoModelForCausalLM.from_pretrained(model_path, model_type='starcode │
│   18                                                                             │
│   19 console.print("Model loaded...", style="bold green")                        │
│   20                                                                             │
│                                                                                  │
│ C:\Users\User\dev\melMass\hal\python\.venv\lib\site-packages\ctransformers\hub.p │
│   154 │   │   │   │   local_files_only=local_files_only,                         │
│   155 │   │   │   )                                                              │
│   156 │   │                                                                      │
│ ❱ 157 │   │   return LLM(                                                        │
│   158 │   │   │   model_path=model_path,                                         │
│   159 │   │   │   model_type=model_type,                                         │
│   160 │   │   │   config=config.config,                                          │
│                                                                                  │
│ C:\Users\User\dev\melMass\hal\python\.venv\lib\site-packages\ctransformers\llm.p │
│ y:206 in __init__                                                                │
│                                                                                  │
│   203 │   │   if not Path(model_path).is_file():                                 │
│   204 │   │   │   raise ValueError(f"Model path '{model_path}' doesn't exist.")  │
│   205 │   │                                                                      │
│ ❱ 206 │   │   self._lib = load_library(lib)                                      │
│   207 │   │   self._llm = self._lib.ctransformers_llm_create(                    │
│   208 │   │   │   model_path.encode(),                                           │
│   209 │   │   │   model_type.encode(),                                           │
│                                                                                  │
│ C:\Users\User\dev\melMass\hal\python\.venv\lib\site-packages\ctransformers\llm.p │
│ y:102 in load_library                                                            │
│                                                                                  │
│    99 │   │   os.add_dll_directory(os.path.join(os.environ["CUDA_PATH"], "bin")) │
│   100 │                                                                          │
│   101 │   path = find_library(path)                                              │
│ ❱ 102 │   lib = CDLL(path)                                                       │
│   103 │                                                                          │
│   104 │   lib.ctransformers_llm_create.argtypes = [                              │
│   105 │   │   c_char_p,  # model_path                                            │
│                                                                                  │
│ C:\Users\User\.pyenv\pyenv-win\versions\3.10.11\lib\ctypes\__init__.py:374 in    │
│ __init__                                                                         │
│                                                                                  │
│   371 │   │   self._FuncPtr = _FuncPtr                                           │
│   372 │   │                                                                      │
│   373 │   │   if handle is None:                                                 │
│ ❱ 374 │   │   │   self._handle = _dlopen(self._name, mode)                       │
│   375 │   │   else:                                                              │
│   376 │   │   │   self._handle = handle                                          │
│   377                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: Could not find module
'C:\Users\User\dev\melMass\hal\python\.venv\Lib\site-packages\ctransformers\lib\loca
l\ctransformers.dll' (or one of its dependencies). Try using the full path with
constructor syntax.

System

Windows 11
CUDA 11.7
RTX 3090

High RAM usage when offloading to GPU layers.

Just updated my GPU from a 2080 to a 3090, and man does it make things go brrrr lol.

Anyway, I noticed a strange new behavior when I did. Instead of the model + GPU taking close to what the model took in system RAM, it now takes almost double. When offloading from, say, 8 to 100 layers with the model wizardLM-13B-Uncensored.ggmlv3.q4_0.bin, I jump from 6-7 GB to almost 12-14 GB of system RAM, and even more as I increase the number of GPU layers. I was under the impression that the more GPU layers, the less system memory it should be using, not more?

    def load_chat_model( self, model = "wizardLM-13B-Uncensored.ggmlv3.q4_0.bin" ): 
        self.gptj = AutoModelForCausalLM.from_pretrained(
            f'models/{model}',
            model_type = 'llama', #mpt, llama
            reset = True,
            threads = 1, gpu_layers = 100,
            context_length = 2048, #8192, 2048
            batch_size = 2048,
            temperature = float( .65 ),
            repetition_penalty = float( 1.1 )
        )

While I have the RAM for it, it just seems very strange that it takes even more system RAM than ever before.

While not 100% related (I could simply be doing something wrong with the settings), I had another issue where offloading to the GPU on my 2080 was slow. The fix was to increase batch_size, which improved performance even with just 8 layers. Changing the batch size in this case doesn't seem to change memory usage much; only gpu_layers seems to be the issue.
As noted in #27, I don't seem to get "out of memory" errors when I increase the GPU layers; it will just OOM if I go past too many layers for my GPU VRAM, relying on the system threads instead.

ctransformers 0.2.10
Windows 11
3090
CUDA supported
Python 3.10
32GB of RAM

RAM usage after load + message | system RAM before loading | difference
threads = 8,
CPU only: 14.0 - 7.3 = 7GB

threads = 1, gpu_layers = 50,
1T + GPU: 20.9 - 7.3 = 13GB

With a little more testing, I see it scales up to about an extra 5 GB of system RAM before it caps out, increasing a little per layer between 1-50. It almost seems like it's not releasing the "workload" that it was planning on sending to the GPU.

cuBLAS Support

Do you have any plans to support cuBLAS to increase inference speed by offloading some layers to the GPU?

Issues loading RedPajama Models?

@marella Is it possible to load and run any of the RedPajama variants? I have tried several variations of the following in Google Colab, but all of them cause the session to crash:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained('keldenl/RedPajama-INCITE-Instruct-3B-v1-GGML', 
                                           model_type = 'gpt_neox', 
                                           #local_files_only = True,
                                           #lib = 'basic',
                                           )

Batching support

Looks like GGML supports batching.
Can it be supported here too?
Would it give some performance improvement?

Error when running ctransformers - std::runtime_error

Hi, folks. Here's the code I'm using:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('/mnt/mydisk/AI/text/models/llama_cpp/gpt4all-lora-quantized-ggml.bin', model_type='llama', gpu_layers=2,  lib='avx') # I've also tried without lib='avx'

print(llm('AI is going to'))

Here's the error I'm getting:


terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file

What can I do?
Also, Koboldcpp works fine.

Few questions / issues

  1. I just wanted to ask if you are planning to add MPT GPU support as well at some point? I see it's supported for LLaMA models.
  2. The real reason for the ticket: I am having issues getting it to really use the GPU. Sometimes it works and sometimes it doesn't. Not really sure how to explain, but:
    A) Windows 11, 32 GB of RAM, Ryzen 5800X, 13B-HyperMantis.ggmlv3.q5_1.bin, set to llama, 12 threads (give or take what I set here for setting reasons), RTX 2080
    B) Installed the model the other day. Tested on the CPU and got results back in under 20 to 25 seconds. Saw that GPU was supported, so I uninstalled and reinstalled with CUDA support. Tested GPU offloading and it didn't seem to do much in my first round of testing; I had set it to 25 layers at the time. Didn't see any improvement in speed, but could see the GPU was being used, with higher memory use and GPU usage spiking but never maxing out. Lowered the count to 15 layers and tested again; this time I was able to hit 5 to 10 seconds. Went crazy and tested it as much as I could, getting really good results. Today I rebooted my machine and it's acting like it did the other day at 25 layers. Tried lowering it from 15 to 10 or below... but it doesn't seem to be using the GPU, even though it "acts" like it's setting up GPU usage, as I can see the memory and usage fluctuating but never fully topping out.

I could be totally using it wrong, but the fact that it was working the other day and stopped today tells me something changed on my computer, though I honestly couldn't tell you what. I didn't perform any updates, and it's also weird that it didn't work before and then all at once it did. Not sure if there is some type of support model it needs or not. CUDA is supported according to the torch check. Any help or information is welcome :) I understand this is not a common issue. Any places I can check or values I can get to see if it's really working would be great. It just seems odd.

CLBlast support for gpt-2 types (WizardCoder)?

There is CLBlast GPU support for GPT-2-based models in koboldcpp, for example, where I can do prompt processing in GPU VRAM to avoid prompt-batching errors with my 16 GB of CPU RAM. Does anyone know if this is possible with ctransformers?

Exllama support?

Hi,
ExLlama can now support a 4k context size. Is that something you are planning to support in a future version?

Error loading model

Hi,

I get the following error when I try to load the model:

python3.10/site-packages/ctransformers/lib/basic/libctransformers.so: cannot open shared object file: No such file or directory

using:
llm = AutoModelForCausalLM.from_pretrained('/models/gpt-2-1558M-ggml/ggml-model-q4_1.bin', model_type='gpt2', lib='basic')

I am running this on an aarch64 Ubuntu 22.04 system.

Please let me know how to fix this.

Thank you.

How to get the perplexity of the sequence

This project is crazy fast.
But can it be applied not only to generate text but also to score it?

Is there a way to measure the perplexity or naturalness of a sequence, for example of a question-answer pair?
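
For illustration, here is the rough kind of scoring loop I have in mind, using only the documented low-level API (this is a sketch assuming llm.logits exposes the logits for the most recently evaluated position and that successive eval calls extend the same context; the model path is a placeholder):

import math

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/llama-ggml-model.bin", model_type="llama")  # placeholder

def avg_logprob(text):
    """Average log-probability per token of `text` under the model (rough sketch)."""
    tokens = llm.tokenize(text)
    total = 0.0
    for i in range(len(tokens) - 1):
        llm.eval(tokens[i : i + 1])  # feed one token, extending the current context
        logits = llm.logits          # scores over the vocabulary for the next position
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[tokens[i + 1]] - log_z
    return total / max(len(tokens) - 1, 1)

score = avg_logprob("What is the capital of France? Paris.")
print("avg log prob:", score, "perplexity:", math.exp(-score))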

Caching

Hey! Llama.cpp has a built-in state save/load mechanism which allows for fast startup on contexts that have been processed before. Is this possible to do with ctransformers somehow?

New Falcon ggllm.cpp format

Hi @marella

Have you seen the new GGCC model format from ggllm.cpp? It significantly improves the quality of its Falcon model support. There are also performance improvements.

But it is a new format, incompatible with the old.

Are you planning to add support? https://github.com/cmp-nct/ggllm.cpp

I have multiple Falcon GGCC models up, but right now they can only be used with ggllm.cpp on the command line. It'd be good to increase that support.

Thanks

How to know which word is the end of the response?

Is there a better way to know what the last word of the response is?

I currently added this line in llm.py

match = stop_regex.search(text)
if match:
    text = text[: match.start()]
    yield '[<STOP>]'  # <-- added line
    break

and then I do this in my backend:

    generator = generate(llm, generation_config, user_prompt.strip())
    for word in generator:
        if word == "[<STOP>]":
            send(json.dumps({'end': 'true'}))
        else:
            send(json.dumps({'content': word}))

Add ggml debug timing

This is a great Python binding for GGML. I couldn't find a way to get GGML operation debug timing. Could it be added?

Segfault when trying to use __call__ multiple times

Hi, I was running this in a FastAPI server, generating text multiple times from the endpoint. I end up getting a segfault if I try to generate text before the previous generation has finished.

I am calling the llm directly:

llm("prompt goes here")

Tried a few models, streaming true and false; the behavior seems to be the same. I have to wait for each generation to finish before trying to generate for other requests. Curious if this is expected behavior.

Feature request: support for Huggingface Hub branches/revisions

I recently updated my Falcon GGMLs to the new GGCC format, but I kept the old GGML format models available in another branch, called ggmlv3.

With Hugging Face Hub, it would be possible to download those like so:

model = AutoModelForCausalLM.from_pretrained("TheBloke/blah", revision="ggmlv3", ...)

My request is for the same to be supported in ctransformers, so branches can be used to differentiate between versions of models.

Thanks in advance.

Failed to Create LLM

I'm running Wizard-Mega-13B-GGML using ctransformers on Google Colab. It works fine.

But after some time, when I run the same code, it gives me this error:

Failed to create LLM 'llama' from '/root/.cache/huggingface/hub/models--TheBloke--wizard-mega-13B-GGML/snapshots/8aa1f0f288f8bb96426da430e2af19c9d70dd4ff/wizard-mega-13B.ggmlv3.q4_0.bin'.

My code is dead simple:

from huggingface_hub import hf_hub_download
import time
from ctransformers import AutoModelForCausalLM
model_path = hf_hub_download(repo_id= "TheBloke/wizard-mega-13B-GGML", filename= "wizard-mega-13B.ggmlv3.q4_0.bin")
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1, "temperature": 0.5, 'stream' : False, 'threads': 1}
llm = AutoModelForCausalLM.from_pretrained(model_path, model_type="llama", lib='basic', **config)

# Rest of the Code for input pipeline

So, what might be the problem here?

How to apply stream to the MPT-7B-Instruct-GGML model

I try to pass the arguments listed in the documentation, but I get nowhere.

handler = StdOutCallbackHandler()
llm = CTransformers(
    model='TheBloke/MPT-7B-Instruct-GGML',
    model_file='mpt-7b-instruct.ggmlv3.q4_0.bin',
    model_type='mpt',
    config={"stream": True, "max_new_tokens": 256, "threads": 6},
    callbacks=[StreamingStdOutCallbackHandler()],
)
llm(PROMPT_FOR_GENERATION_FORMAT.format(context=content, query=query))
But it doesn't seem to work: it does not return a generator; instead it returns a string.
The model takes an extremely long time to think before it starts to print, and the response speed is about the same as without streaming.

Redistribute a CUDA-compatible version without the need to recompile locally

Hi and thanks for GPU integration. I made it work and that's cool.

But it requires the user to have CUDA and all build tools installed, with the specific version 11.7 (the latest one supported by torch as of now), which makes the whole thing not very user friendly.

I was thinking of precompiling this for many possible architectures and then redistributing the builds.

Starcoder / Quantized Issues

Hey! Thanks for this library. I really appreciate the API and simplicity you are bringing to this; it's exactly what I was looking for in trying to integrate GGML models into Python (specifically into my library lambdaprompt).

One issue: it seems like there's something going wrong with StarCoder quantized models.
For the full model, it seems to work great, and I'm getting the same outputs.

What works (full model weights):

 ./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

as equivalent to:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
    '/workspaces/research/models/starcoder/starcoder-ggml.bin',
    model_type='starcoder')
print(llm("def fibo(", max_new_tokens=30, top_k=0, top_p=0.95, temperature=0.2))

Seem to give equivalent results!

What fails (quantized model weights):

However, when I change to the quantized model (to reproduce the same as this)

./build/bin/starcoder -m /workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin -p "def fibo(" --top_k 0 --top_p 0.95 --temp 0.2

I get a core dumped ggml error

Python 3.10.11 (main, Apr 12 2023, 14:46:22) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ctransformers import AutoModelForCausalLM
>>> llm = AutoModelForCausalLM.from_pretrained(
...     '/workspaces/research/models/starcoder/starcoder-ggml-q4_1.bin',
...     model_type='starcoder')
GGML_ASSERT: /home/runner/work/ctransformers/ctransformers/models/ggml/src/ggml.c:4408: wtype != GGML_TYPE_COUNT
Aborted (core dumped)

ggml_new_tensor_impl: not enough space in the scratch memory pool

I have a real bug this time.

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 575702016, available 268435456)
Segmentation fault

This affects the GGML implementation in llama.cpp, so it is likely not a bug in ctransformers but an inherited bug from the GGML code. It appears on the second run, which to me means that there is a memory leak: the buffer is either not correctly sized and/or it is not cleared upon exit.

Config

How do I specify model() parameters, from here: https://github.com/marella/ctransformers#config

Placing them inside model() does not raise an error, but they are ignored for my MPT model, i.e. I can enter whatever I want and the output does not change at all.

Am I supposed to enter them under kwargs as a list? Is there an example somewhere?

EDIT: I changed generate() to model(). I had generate() on my mind from another question, but it is the model parameters that I am trying to set.

Performance on Apple silicon

Context: #1 (comment)

@bgonzalezfractal did you notice any performance improvement just by changing the threads parameter?

If you don't have the latest quantized models, you can go back to the previous commit using:

git checkout e707f99
git submodule update

Here you can run the build commands and check:

cmake -S . -B build
cmake --build build

Segmentation fault on M1 Mac

Trying a simple example on an M1 Mac:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/starcoderbase-GGML/starcoderbase-ggml-q4_0.bin",
    model_type="starcoder",
    lib="basic",
)

print(llm("Hi"))

leads to a segmentation fault. The model works fine with the GGML example code.

Text output

I am back to testing ctransformers, now with the quantized mpt-30b-instruct model, and I have a really basic problem. Does ctransformers handle output differently than transformers?

I have been trying for a while to get it to print a block of text, but when using the native Hugging Face tokenizer I either get a streaming output from the for loop where each decoded token is placed on a new line, or I get a list:

python mpt-30b_transformers_test.py
['Blue', ' can', ' be', ' associated', ' with', ' many', ' things', ',', ' such', ' as', ' sadness', ' or', ' depression', ' due', ' to', ' its', ' association', ' in', ' mood', 's', ' and', ' feelings', ';', ' however', ' it', '’', 's', ' also', ' been', ' proven', ' through', ' research', ' studies', ' conducted', ' by', ' psychologists', ' at', ' Harvard', ' University', ' that', ' looking', '/', 'sur', 'round', 'ing', ' yourself', ' within', ' this', ' shade', ' will', ' reduce', ' your', ' stress', ' levels', '.', '  ', 'It', ' is', ' a', ' very', ' cal', 'ming', ' colour', ' which', ' makes', ' sense', ' when', ' you', ' think', ' about', ' the', ' sky', ',', ' sea', ' etc', ' where', ' blue', ' dominates', ' as', ' these', ' are', ' places', ' we', ' often', ' go', ' to', ' relax', ' and', ' switch', ' off', ' from', ' our', ' busy', ' lives', '!']

I can wrangle this into text, but it is an extra step and a bit of a pain, as not every token is a word or a standalone symbol, so I am wondering if there is a better way to get plain-text output when using ctransformers with an HF tokenizer.

basic lib doesn't work with Ampere ARM processors

As the title suggests, if I try to set lib='basic' I get the following error:

/home/ubuntu/.local/lib/python3.10/site-packages/ctransformers/lib/basic/libctransformers.so: cannot open shared object file: No such file or directory

If I try to directly run libctransformers.so I get this error:

cannot execute binary file: Exec format error

Ukrainian Upper case not appearing

@marella thanks, very useful project.
I use a LLaMA GGML model.
With the latest llama.cpp it says Привіт, but with ctransformers only ривіт.
It affects all words, not only the first in a response: Банан -> анан

I'm not sure if it's related to encoding or something else.

REST API?

What's the best practice for interfacing with the ctransformers API and exposing it as a REST API?

I looked at https://github.com/jquesnelle/transformers-openai-api and tried to change it to use ctransformers, but I stopped because the required changes kept growing, making it harder to keep in sync with the original.
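
The kind of wrapper I have in mind is something like this FastAPI sketch (the endpoint shape, model path and the lock are my own assumptions, not an official recipe; the lock simply serializes generation, since the underlying model is stateful):

import threading

from fastapi import FastAPI
from ctransformers import AutoModelForCausalLM

app = FastAPI()
llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="llama")  # placeholder
lock = threading.Lock()  # the LLM keeps internal state, so serialize access to it

@app.post("/generate")
def generate(prompt: str, max_new_tokens: int = 256):
    with lock:
        text = llm(prompt, max_new_tokens=max_new_tokens)
    return {"text": text}

Saved as server.py, it could then be run with something like uvicorn server:app.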

Support Microsoft Guidance

I am trying to use a 'custom tokenizer' but I am unable to see how I can invoke it. Also, can we use a standard tokenizer from HF by pulling it or loading it from a local path?

Falcon support?

I've been tracking the Falcon PR ggerganov/ggml#231, and as I understand it, it currently won't work on a released version of GGML.

Any suggestions on how to test it config-wise are appreciated; I'm assuming llama might not work based on other PRs.

Need a method to get the embeddings

Hi there. I am upgrading my bindings for the Lord of LLMs tool, and I now need to be able to vectorize text into the embedding space of the current model. Is there a way to access the latent space of the model, i.e. input a text and get the encoder output in latent space?

Best regards

q2 quantization support

Hi there. llama.cpp now supports q2 quantization. Is there any chance this comes to ctransformers?

ctx limiter

Hi, I have noticed that there is no option to define the context size (max tokens) used for generation. Is the library using the maximum context by default?

Compilation error on linux ubuntu

Hi there. I have installed CUDA toolkit 11.7 using conda, and also gcc (I tested multiple versions). With 9.4.0 I get this error:

  -- Build files have been written to: /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/_skbuild/linux-x86_64-3.10/cmake-build
  [1/6] Building C object CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o
  FAILED: CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o
  /home/200.3-PYTHON/commun/envs/lollms/bin/cc -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_DMMV_Y=1 -DGGML_USE_CUBLAS -DGGML_USE_K_QUANTS -DK_QUANTS_PER_ITERATION=2 -Dctransformers_EXPORTS -I/nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models -isystem /home/200.3-PYTHON/commun/envs/lollms/include -O3 -DNDEBUG -std=gnu11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -mfma -mavx2 -mf16c -mavx -pthread -MD -MT CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o -MF CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o.d -o CMakeFiles/ctransformers.dir/models/ggml/k_quants.c.o -c /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.c
  In file included from /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.c:1:
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:26:15: error: expected declaration specifiers or '...' before 'sizeof'
     26 | static_assert(sizeof(block_q2_K) == 2*sizeof(ggml_fp16_t) + QK_K/16 + QK_K/4, "wrong q2_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:26:79: error: expected declaration specifiers or '...' before string constant
     26 | static_assert(sizeof(block_q2_K) == 2*sizeof(ggml_fp16_t) + QK_K/16 + QK_K/4, "wrong q2_K block size/padding");
        |                                                                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:38:15: error: expected declaration specifiers or '...' before 'sizeof'
     38 | static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + 11 * QK_K / 64, "wrong q3_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:38:86: error: expected declaration specifiers or '...' before string constant
     38 | static_assert(sizeof(block_q3_K) == sizeof(ggml_fp16_t) + QK_K / 4 + 11 * QK_K / 64, "wrong q3_K block size/padding");
        |                                                                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:50:15: error: expected declaration specifiers or '...' before 'sizeof'
     50 | static_assert(sizeof(block_q4_K) == 2*sizeof(ggml_fp16_t) + 3*QK_K/64 + QK_K/2, "wrong q4_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:50:81: error: expected declaration specifiers or '...' before string constant
     50 | static_assert(sizeof(block_q4_K) == 2*sizeof(ggml_fp16_t) + 3*QK_K/64 + QK_K/2, "wrong q4_K block size/padding");
        |                                                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:63:15: error: expected declaration specifiers or '...' before 'sizeof'
     63 | static_assert(sizeof(block_q5_K) == 2*sizeof(ggml_fp16_t) + 3*QK_K/64 + QK_K/2 + QK_K/8, "wrong q5_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:63:90: error: expected declaration specifiers or '...' before string constant
     63 | static_assert(sizeof(block_q5_K) == 2*sizeof(ggml_fp16_t) + 3*QK_K/64 + QK_K/2 + QK_K/8, "wrong q5_K block size/padding");
        |                                                                                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:75:15: error: expected declaration specifiers or '...' before 'sizeof'
     75 | static_assert(sizeof(block_q6_K) == sizeof(ggml_fp16_t) + QK_K / 16 + 3*QK_K/4, "wrong q6_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:75:81: error: expected declaration specifiers or '...' before string constant
     75 | static_assert(sizeof(block_q6_K) == sizeof(ggml_fp16_t) + QK_K / 16 + 3*QK_K/4, "wrong q6_K block size/padding");
        |                                                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:83:15: error: expected declaration specifiers or '...' before 'sizeof'
     83 | static_assert(sizeof(block_q8_K) == sizeof(float) + QK_K + QK_K/16*sizeof(int16_t), "wrong q8_K block size/padding");
        |               ^~~~~~
  /nobackup/sa226037/627327.0/pip-install-ouvtejw5/ctransformers_b35eb18ba93148019a7fa6999f9eed8d/models/ggml/k_quants.h:83:85: error: expected declaration specifiers or '...' before string constant
     83 | static_assert(sizeof(block_q8_K) == sizeof(float) + QK_K + QK_K/16*sizeof(int16_t), "wrong q8_K block size/padding");
        |                                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  [2/6] Building C object CMakeFiles/ctransformers.dir/models/ggml/ggml.c.o
  FAILED: CMakeFiles/ctransformers.dir/models/ggml/ggml.c.o

Is there any precompiled version I can try to install directly? Or do you have an idea why this fails?

OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found

Hi

I am trying to use ctransformers to load the falcon-40b GGML model and I get the following error:

OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/ctransformers/lib/avx2/libctransformers.so)

I am using an AWS SageMaker notebook. I am new to this; please help me resolve this issue.
