mybigday / llama.rn
React Native binding of llama.cpp
License: MIT License
llama.cpp now supports parallel decoding in one context, so we can support it as well.
Breaking change: deprecate the stopCompletion
method and move its behavior to the return value of completion.
Is it possible to add a text streaming feature? It looks like you're loading a local cpp server. I wonder: does Swift support sockets for React Native? Inference is so slow on mobile devices right now; streaming would help the user know something is happening. I'm interested in contributing if you need contributors. I believe streaming is supported by llama.cpp in LangChain's implementation, but I'm not sure if that's custom.
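One way streaming could surface on the JS side is a per-token callback: the native layer emits an event per decoded token and the JS side accumulates them for the UI. This is only a sketch; `fakeDecode` below is a stand-in for the native inference loop, not the library's actual API.

```typescript
// Hypothetical streaming shape: one callback invocation per decoded token.
type TokenCallback = (token: string) => void;

function fakeDecode(prompt: string, onToken: TokenCallback): string {
  // Pretend the model streams these tokens one at a time.
  const tokens = ['Hello', ',', ' world', '!'];
  let text = '';
  for (const t of tokens) {
    onToken(t); // the UI can render each partial token immediately
    text += t;
  }
  return text;
}

const partials: string[] = [];
const full = fakeDecode('Say hi', (t) => partials.push(t));
```

With this shape no socket is needed: the React Native bridge can deliver the token events directly.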
It's just like the OpenAI text completion playground: start with a text area as the prompt.
We could also add an area for using a grammar.
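For the grammar area, llama.cpp accepts grammars in its GBNF format; a minimal example of what a user might paste into that area (this particular rule is illustrative, not from the project):

```
# GBNF sketch: constrain the model's output to "yes" or "no".
root ::= "yes" | "no"
```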
As it says on the tin: loading small 3B models, à la TinyLlama or StableLM, does not work. Tested models:
Attempting to call initLlama results in
Error: Failed to initialize context
Which I can only assume is here:
I do not know enough about native functions to investigate further.
In addition, stopCompletions() does not stop a completion on Android.
Thanks for your work, the project is fantastic otherwise.
can you add support to change it?
Not high priority.
It's probably easy to support because we implemented the context in cpp/rn-llama.hpp.
We won't try to integrate the CLBlast part for now, because not all Android devices support OpenCL.
We can watch ggerganov/llama.cpp#2059 / ggerganov/llama.cpp#2039 for a chance to land, so we can use the Vulkan backend.
Port llama.cpp/examples/llama-bench/llama-bench.cpp as a method so we can more easily collect benchmark results on mobile devices.
That way we can test model params more easily without using debug mode.
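A bench method would presumably report the same headline numbers as llama-bench: tokens per second for prompt processing (pp) and text generation (tg). A sketch of that summary math, with the interface name and fields being assumptions for illustration:

```typescript
// Hypothetical result shape mirroring llama-bench's pp/tg columns.
interface BenchResult {
  ppTokensPerSec: number; // prompt processing throughput
  tgTokensPerSec: number; // text generation throughput
}

// Convert raw token counts + elapsed milliseconds into tokens/second.
function summarizeBench(
  ppTokens: number, ppMs: number,
  tgTokens: number, tgMs: number,
): BenchResult {
  return {
    ppTokensPerSec: (ppTokens / ppMs) * 1000,
    tgTokensPerSec: (tgTokens / tgMs) * 1000,
  };
}

// e.g. 512 prompt tokens in 4s, 128 generated tokens in 6.4s
const result = summarizeBench(512, 4000, 128, 6400);
```

Returning a structured result like this from JS would make it easy to collect and compare numbers across devices without a debug build.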
I am having trouble updating the llama.cpp submodule myself. Could the project be updated? llama.cpp has added support for a few new base models that currently do not work in llama.rn.
Add utils to expose llama_tokenize
and llama_get_embeddings
to the JS side.
For the context, we need to provide the embedding param to enable that.
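The JS surface for these utils might look like the sketch below. The interface and the mock "native module" are assumptions for illustration; the real binding would forward these calls over the bridge to llama_tokenize / llama_get_embeddings in C++.

```typescript
// Hypothetical JS-side surface for the two native utils.
interface NativeLlama {
  tokenize(text: string): number[];      // -> token ids
  getEmbeddings(text: string): number[]; // -> embedding vector
}

// Stand-in native module for this sketch: "token ids" are word lengths,
// "embeddings" are a fixed-size fake vector.
const mockNative: NativeLlama = {
  tokenize: (text) => text.split(/\s+/).map((w) => w.length),
  getEmbeddings: (text) => [text.length, 0.0, 1.0],
};

function tokenize(ctx: NativeLlama, text: string): number[] {
  return ctx.tokenize(text);
}

function getEmbeddings(ctx: NativeLlama, text: string): number[] {
  // Note: the context must be initialized with the embedding param
  // enabled for real embeddings to be available.
  return ctx.getEmbeddings(text);
}

const toks = tokenize(mockNative, 'hello world');
const emb = getEmbeddings(mockNative, 'hi');
```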
Hello,
I've been working on getting the stablelm-2-zephyr-1_6b-Q8_0.gguf operational (link: https://huggingface.co/spaces/stabilityai/stablelm-2-1_6b-zephyr), especially since the 3B version seems to function quite well. However, I'm encountering an issue with the 1.6B version where it fails to initialize the context. Currently, I'm using the latest version of your master branch to compile the library. Is there a straightforward modification I can make on my end to resolve this?
from logs:
01-29 22:50:05.365 3017 20732 E RNLLAMA_LOG_ANDROID: llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 340, got 268
Thank you.
Shouldn't there be a function that allows the user to stop inference? It could be implemented as a callback function, just like in whisper.rn's realtimeInference().
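One possible shape, in the spirit of whisper.rn's realtime callback: let the per-token callback return `false` to stop generation. The decode loop below is a mock of the native side, not the library's real implementation.

```typescript
// Hypothetical: callback returns false to request a stop mid-generation.
type StoppableCallback = (token: string) => boolean;

function decodeUntilStopped(tokens: string[], onToken: StoppableCallback): string {
  let text = '';
  for (const t of tokens) {
    text += t;
    if (!onToken(t)) break; // user asked to stop; abort the decode loop
  }
  return text;
}

// Stop after the second token.
let count = 0;
const out = decodeUntilStopped(['a', 'b', 'c', 'd'], () => ++count < 2);
```

This avoids a separate stop method entirely, since the stop signal travels back through the same callback that delivers tokens.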
Hello again, I've received reports from users of ChatterUI that model loading fails on Xiaomi branded phones:
Confirmed not working:
I've also queried about other phones, and got a few responses for working devices.
Confirmed working:
Version used:
Logcat response on the tested Poco F5:
RNLLAMA_ANDROID_JNI: [RNLlama] is_model_loaded false
There aren't enough users to confirm this is a trend across all Xiaomi phones, but it is peculiar.
llama.cpp includes a LLaVA example (+ clip.cpp); we could use it to provide vision support. We may implement it after #30 is done.
Also, it would be great if we could make another package named clip.rn or react-native-clip, but currently I'm afraid we don't have the resources to maintain it, so just keep it in mind.
First of all, thanks for the hard work on bringing this project to the react-native ecosystem.
I have been using llama.rn for a few weeks now in my personal project:
https://github.com/Vali-98/ChatterUI
I was wondering if there is any interest in implementing OpenCL for android. I have attempted to work on it myself to little success, given my inexperience with native modules.
Support saving the prompt cache as a file so we can speed up context initialization + prompt processing.
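The win comes from prefix reuse: if a saved session shares a prefix with the new prompt, only the suffix needs re-processing. A sketch of that idea, with file I/O omitted; `savedPrompt` stands in for the contents of a session file (assumption), and the real implementation would compare token sequences, not characters:

```typescript
// Length of the shared prefix between a cached session and a new prompt.
// Everything inside this prefix can be skipped at prompt-processing time.
function reusablePrefixLength(savedPrompt: string, newPrompt: string): number {
  let i = 0;
  while (
    i < savedPrompt.length &&
    i < newPrompt.length &&
    savedPrompt[i] === newPrompt[i]
  ) i++;
  return i;
}

// A chat app typically re-sends the same system prompt every turn,
// so the whole system preamble gets skipped.
const saved = 'System: you are helpful.\nUser: hi';
const next = 'System: you are helpful.\nUser: what is 2+2?';
const skipped = reusablePrefixLength(saved, next);
```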
As mentioned in the title, setting a seed value does not make an output deterministic on Android.
llama.rn version: 0.3.0-rc.13
Model used: phi-2.Q3_K_M.gguf
Android Devices Tested on: Emulated Pixel 3a - Android 14
Params used:
{
  "frequency_penalty": 0,
  "grammar": "",
  "min_p": 0.07,
  "mirostat": 0,
  "mirostat_eta": 0.1,
  "mirostat_tau": 5,
  "n_predict": 288,
  "n_threads": 5,
  "presence_penalty": 0,
  "prompt": "",
  "repeat_penalty": 1,
  "seed": 2,
  "stop": ["User:", "### Response: "],
  "temperature": 1,
  "tfs_z": 1,
  "top_k": 0,
  "top_p": 1,
  "typical_p": 1
}
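What `"seed": 2` should guarantee is that the sampler's random decisions are reproducible: same seed, same output. Sketched below with a small deterministic PRNG (mulberry32) standing in for llama.cpp's sampler; this is only an illustration of the expected contract, not the library's code.

```typescript
// mulberry32: a tiny seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw n "sampling decisions" from a given seed.
function sample(seed: number, n: number): number[] {
  const rng = mulberry32(seed);
  return Array.from({ length: n }, () => rng());
}

// Two runs with the same seed must agree; a different seed must not.
const runA = sample(2, 5);
const runB = sample(2, 5);
```

The bug report above amounts to this property failing on Android: either the seed is not reaching the native sampler, or the sampler is being reseeded per run.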
Layla is a project that also integrates llama.cpp for mobile use:
https://github.com/l3utterfly/llama.cpp/tree/layla-build
After some quick testing, it does seem like Layla's fork of llama.cpp runs models far faster on Android than llama.rn, almost twice as fast in some cases with 7B models.
It would be wonderful if these improvements were added to llama.rn as well.
There is an issue in the README regarding model loading: it mentions the 'gguf' model but lacks clear instructions. Is file loading implemented yet? 'No model found' is always the result.