withcatai / node-llama-cpp

Run AI models locally on your machine with Node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level.

Home Page: https://withcatai.github.io/node-llama-cpp/

License: MIT License

Shell 0.04% C++ 6.97% TypeScript 88.28% JavaScript 0.54% CSS 3.36% CMake 0.82%
ai bindings catai llama llama-cpp llm nodejs prebuilt-binaries grammar gguf

node-llama-cpp's People

Contributors

giladgd, ido-pluto, itsuka-dev, jonholman, scenaristeur

node-llama-cpp's Issues

Failed to fetch llama.cpp release info RequestError [HttpError]: Not Found

I'm running the basic example provided in the README.md, but I get an error when the library tries to fetch the llama.cpp release info.

root@981b3099966c:/app# node example.js 
Prebuild binaries not found, falling back to to locally built binaries
Repo: ggerganov/llama.cpp
Release: /app/llama.cpp

⠋ Fetching llama.cpp info(node:75) ExperimentalWarning: The Fetch API is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
⠦ Fetching llama.cpp infoFailed to fetch llama.cpp release info RequestError [HttpError]: Not Found
    at /app/node_modules/@octokit/request/dist-node/index.js:112:21
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async requestWithGraphqlErrorHandling (/app/node_modules/@octokit/plugin-retry/dist-node/index.js:71:20)
    at async Job.doExecute (/app/node_modules/bottleneck/light.js:405:18) {
  status: 404,
  response: {
    url: 'https://api.github.com/repos/ggerganov/llama.cpp/releases/tags/%2Fapp%2Fllama.cpp',
    status: 404,
    headers: {
      'access-control-allow-origin': '*',
      'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset',
      'content-encoding': 'gzip',
      'content-length': '117',
      'content-security-policy': "default-src 'none'",
      'content-type': 'application/json; charset=utf-8',
      date: 'Thu, 31 Aug 2023 13:28:37 GMT',
      'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
      server: 'GitHub.com',
      'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
      vary: 'Accept-Encoding, Accept, X-Requested-With',
      'x-content-type-options': 'nosniff',
      'x-frame-options': 'deny',
      'x-github-api-version-selected': '2022-11-28',
      'x-github-media-type': 'github.v3; format=json',
      'x-github-request-id': '819E:7616:24960C:24DD3C:64F09584',
      'x-ratelimit-limit': '60',
      'x-ratelimit-remaining': '57',
      'x-ratelimit-reset': '1693492072',
      'x-ratelimit-resource': 'core',
      'x-ratelimit-used': '3',
      'x-xss-protection': '0'
    },
    data: {
      message: 'Not Found',
      documentation_url: 'https://docs.github.com/rest/releases/releases#get-a-release-by-tag-name'
    }
  },
  request: {
    method: 'GET',
    url: 'https://api.github.com/repos/ggerganov/llama.cpp/releases/tags/%2Fapp%2Fllama.cpp',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit.js/3.1.0 octokit-core.js/5.0.0 Node.js/18.0.0 (linux; arm64)'
    },
    request: { hook: [Function: bound bound register] }
  }
}
✖ Failed to fetch llama.cpp info
file:///app/dist/cli/commands/DownloadCommand.js:103
            throw new Error(`Failed to find release "${release}" of "${repo}"`);
                  ^

Error: Failed to find release "/app/llama.cpp" of "ggerganov/llama.cpp"
    at file:///app/dist/cli/commands/DownloadCommand.js:103:19
    at async withOra (file:///app/dist/utils/withOra.js:6:21)
    at async DownloadLlamaCppCommand (file:///app/dist/cli/commands/DownloadCommand.js:79:5)
    at async loadBin (file:///app/dist/utils/getBin.js:57:13)
    at async file:///app/dist/llamaEvaluator/LlamaBins.js:2:29

Node.js v18.0.0
root@981b3099966c:/app#

The env variable NODE_LLAMA_CPP_REPO_RELEASE is set:

root@981b3099966c:/app# echo $NODE_LLAMA_CPP_REPO_RELEASE
/app/llama.cpp

I have also set NODE_LLAMA_CPP_SKIP_DOWNLOAD, but then it fails to find the binaries:

root@981b3099966c:/app# echo $NODE_LLAMA_CPP_SKIP_DOWNLOAD
1
root@981b3099966c:/app# node example.js 
Prebuild binaries not found, falling back to to locally built binaries
file:///app/dist/utils/getBin.js:54
            throw new Error("No prebuild binaries found and NODE_LLAMA_CPP_SKIP_DOWNLOAD env var is set to true");
                  ^

Error: No prebuild binaries found and NODE_LLAMA_CPP_SKIP_DOWNLOAD env var is set to true
    at loadBin (file:///app/dist/utils/getBin.js:54:19)
    at async file:///app/dist/llamaEvaluator/LlamaBins.js:2:29

Node.js v18.0.0

The only way I can get it working is by downloading via the CLI:

RUN cd $ROOT_APPLICATION && npx --yes node-llama-cpp download --release latest --nodeTarget $NODE_VERSION
RUN cd $ROOT_APPLICATION && npx --yes node-llama-cpp build --nodeTarget $NODE_VERSION

Thanks

docs: cannot read properties of undefined (reading '_chatGrammar') on penalty example

What was unclear or otherwise insufficient?

First, in JS (not TypeScript), with the code at

https://withcatai.github.io/node-llama-cpp/guide/chat-session#repeat-penalty-customization
I get

const context = new LlamaContext({model});
      ^

SyntaxError: Identifier 'context' has already been declared

Next, with ctx in place of context, like

const ctx = new LlamaContext({model});
const session = new LlamaChatSession({
    ctx
});

I get

file:///dev/node-llama/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:61
    async prompt(prompt, { onToken, signal, maxTokens, temperature, topK, topP, grammar = this.context._chatGrammar, trimWhitespaceSuffix = false, repeatPenalty } = {}) {
                                                                                                       ^

TypeError: Cannot read properties of undefined (reading '_chatGrammar')
    at LlamaChatSession.prompt (file:///node-llama/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:61:104)
    at file:///dev/node-llama/penalty.js:21:26

Recommended Fix

something does not work

Additional Context

No response

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

Oh, I got it: you should remove

import {context} from "esbuild";
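
For anyone hitting the same thing, a minimal sketch of the working version, assuming the repeat-penalty example from the docs (the model path is a placeholder): the esbuild import was the conflicting `context` declaration, and LlamaChatSession looks for a property named `context`, so the shorthand `{ctx}` leaves it undefined.

import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const model = new LlamaModel({modelPath: "path/to/model.gguf"});
const context = new LlamaContext({model});
// pass the context under the property name the constructor expects
const session = new LlamaChatSession({context});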

ESM support?

Feature Description

Hey folks, first of all thanks for the fantastic work!

I'm a LangChain.js maintainer and we've had a few folks struggle to use node-llama-cpp through our integration with ESM (for example: langchain-ai/langchainjs#4181). They get errors like the following:

Error [ERR_REQUIRE_ESM]: require() of ES Module ../node_modules/node-llama-cpp/dist/index.js from ../node_modules/@langchain/community/dist/utils/llama_cpp.cjs not supported.
Instead change the require of index.js in ../node_modules/@langchain/community/dist/utils/llama_cpp.cjs to a dynamic import() which is available in all CommonJS modules.

I think it may be possible to fix using a solution like the one above, but would be nice to support ESM natively.
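
For reference, a minimal sketch of the dynamic import() workaround from a CommonJS file; the class names and options are the ones used elsewhere on this page, and the model path is a placeholder:

// llama-loader.cjs: CommonJS can't require() the ESM-only package, but dynamic import() works
async function createSession(modelPath) {
    const {LlamaModel, LlamaContext, LlamaChatSession} = await import("node-llama-cpp");

    const model = new LlamaModel({modelPath});
    const context = new LlamaContext({model});
    return new LlamaChatSession({context});
}

module.exports = {createSession};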

The Solution

Configure the package to export both .mjs and .cjs files.

Considered Alternatives

Some kind of dynamic import within the LangChain wrapper, but adding this would help others use node-llama-cpp directly.

Additional Context

I can try to have a look at this as well, but can't promise any timeline.

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

Error building with Cuda

Issue description

Can't build executable with Cuda support enabled

Expected Behavior

Running npx node-llama-cpp download --cuda should compile the release with cuda support

Actual Behavior

When running version 2.5.0 of the package, the build process seems to fail, although the NVIDIA toolkit can be found:

D:\Projekte\ai-test\node_modules\node-llama-cpp\llama\build\llama.cpp>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe"  --use-local-env -ccbin "D:\Software\Microsoft Visual Studio\2022\VC\Tools\MSVC\14.37.32822\bin\HostX64\x64" -x cu -I"D:\Projekte\ai-test\node_modules\node-addon-api" -I"C:\Users\tillw\.cmake-js\node-x64\v18.16.0\include\node" -I"D:\Projekte\ai-test\node_modules\node-llama-cpp\llama\llama.cpp\." -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include"     --keep-dir x64\Release -use_fast_math -maxrregcount=0   --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,sm_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] /EHsc -Xcompiler="/EHsc -Ob2"   -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W3 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdggml.dir\Release\ggml.pdb" -o ggml.dir\Release\ggml-cuda.obj "D:\Projekte\ai-test\node_modules\node-llama-cpp\llama\llama.cpp\ggml-cuda.cu"
  nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified

When trying to build with Cuda support on the current version, CMake seems unable to find the CUDA toolkit:

D:\Projekte\ai-test\node_modules\node-llama-cpp\llama\build\CMakeFiles\3.26.4\VCTargetsPath.vcxproj" (default target) (1) -
    (PrepareForBuild target) ->
      D:\Software\Microsoft Visual Studio\2022\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets(456,5): error MSB8020: The build tools for C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2 (Platform Toolset = 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2') cannot be found. To build using the C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2 build tools, please install C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2 build tools.  Alternatively, you may upgrade to the current Visual Studio tools by selecting the Project menu or right-click the solution, and then selecting "Retarget solution". [D:\Projekte\ai-test\node_modules\node-llama-cpp\llama\build\CMakeFiles\3.26.4\VCTargetsPath.vcxproj]

When building llama.cpp itself with CUDA support, it works without any problem.

Steps to reproduce

  • Initialize a new Node.js project by running npm init
  • Add the dependency by running npm install node-llama-cpp
  • Try to compile with CUDA support: npx node-llama-cpp download --cuda

My Environment

Dependency Version
Operating System Windows 11
CPU Ryzen 7 7800x3d
Node.js version 18.6
node-llama-cpp version b1378

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

Error node-llama-cpp build

I have manually downloaded llama.cpp in a container and set the env:

ENV NODE_LLAMA_CPP_REPO_RELEASE /app/llama.cpp
RUN cd $ROOT_APPLICATION && git clone https://github.com/ggerganov/llama.cpp.git

So in the container the env is correct, but when I try to build:

root@981b3099966c:/app# echo $NODE_LLAMA_CPP_REPO_RELEASE
/app/llama.cpp
root@981b3099966c:/app# npx --yes node-llama-cpp build --nodeTarget $NODE_VERSION
✖ Failed to compile llama.cpp
node-llama-cpp build

Compile the currently downloaded llama.cpp

Options:
  -h, --help        Show help                                                              [boolean]
  -a, --arch        The architecture to compile llama.cpp for                               [string]
  -t, --nodeTarget  The Node.js version to compile llama.cpp for. Example: v18.0.0          [string]
      --metal       Compile llama.cpp with Metal support. Can also be set via the NODE_LLAMA_CPP_MET
                    AL environment variable                               [boolean] [default: false]
      --cuda        Compile llama.cpp with CUDA support. Can also be set via the NODE_LLAMA_CPP_CUDA
                     environment variable                                 [boolean] [default: false]
  -v, --version     Show version number                                                    [boolean]

Error: "/root/.npm/_npx/bd61418a9647e51e/node_modules/node-llama-cpp/llama/llama.cpp" directory does not exist
    at compileLlamaCpp (file:///root/.npm/_npx/bd61418a9647e51e/node_modules/node-llama-cpp/dist/utils/compileLLamaCpp.js:13:19)
    at async file:///root/.npm/_npx/bd61418a9647e51e/node_modules/node-llama-cpp/dist/cli/commands/BuildCommand.js:47:9
    at async withOra (file:///root/.npm/_npx/bd61418a9647e51e/node_modules/node-llama-cpp/dist/utils/withOra.js:6:21)
    at async Object.BuildLlamaCppCommand [as handler] (file:///root/.npm/_npx/bd61418a9647e51e/node_modules/node-llama-cpp/dist/cli/commands/BuildCommand.js:42:5)
root@981b3099966c:/app# 

It seems to search the default location under npx...

Add method for dumping context and loading it.

Feature Description

As far as I can tell, there is currently no way to persist context. I would like to be able to dump the internal state of the model to a file or some other form of persistent storage, so I can load a chat sometime later and keep the context history.

This would solve the following use cases:

  • Continue a chat after a crash or restart.
  • Run multiple chats on a single model.

The Solution

Having the ability to access (dump) the internal data as a stream would allow one to save it to a file and then load it from there later to continue the chat.

Considered Alternatives

It might be possible to save the entire (text) history and load it next time but this seems quite inefficient and would take a long time to load.

Other solutions to solve the above use cases would be welcome.

Additional Context

Does anyone know if llama.cpp exposes such a feature, which would make it trivial to add to this project?

This project, https://github.com/kuvaus/LlamaGPTJ-chat, has a feature like this, and I believe it uses llama.cpp, so I am thinking there should be a way to do this.

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

bug: model parameter `threads` doesn't work

Issue description

It seems to me that the threads parameter doesn't work as expected.

Expected Behavior

If I have 24 CPUs and pass threads: 24, then all CPUs should be utilized. I tried calling the original llama.cpp with the argument -t 24 and it works as expected.

Actual Behavior

I pass threads: 24 or threads: 1 to the constructor and nothing changes: it always utilizes 4 CPUs above 80%, and sometimes 1-2 additional ones at 25-50% utilization.

Steps to reproduce

Pass different threads values to the model constructor and observe CPU utilization (for example, with htop).

My Environment

Dependency Version
Operating System Ubuntu 22.04.3 LTS
CPU AMD EPYC 7742 64-Core Processor
Node.js version v20.10.0
Typescript version no use
node-llama-cpp version 3.0.0-beta.1 & v2.8.1

Additional Context

./llama.cpp/main -m ./catai/models/phind-codellama-34b-q3_k_s -p "Please, write JavaScript function to sort array" -ins -t 24

Screenshot from 2023-12-07 23-03-02

const model = new LlamaModel({
    modelPath: "/root/catai/models/phind-codellama-34b-q3_k_s",
    threads: 1,
});

Screenshot from 2023-12-07 23-04-41

const model = new LlamaModel({
    modelPath: "/root/catai/models/phind-codellama-34b-q3_k_s",
    threads: 24,
});

Screenshot from 2023-12-07 23-05-59
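
For comparison, a hedged sketch that passes threads at the context level instead of on the model; other issues on this page construct LlamaContext with a threads option, and whether that changes CPU utilization for this setup is exactly what needs verifying:

const model = new LlamaModel({
    modelPath: "/root/catai/models/phind-codellama-34b-q3_k_s",
});
const context = new LlamaContext({
    model,
    threads: 24, // number of threads requested for evaluation
});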

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

feat: automatic batching

Also, automatically set the right contextSize and provide other good defaults to make the usage smoother.

  • Support configuring the context swapping size for infinite text generation (by default, it'll be automatic and dynamic depending on the prompt)

Not working as intended.

Issue description

Following the instructions as-is just doesn't work.

Expected Behavior

Working.

Actual Behavior

Not working.

Steps to reproduce

I follow the exact instructions at https://www.npmjs.com/package/node-llama-cpp (npm install and copy/paste code into .js file)

I try running it (only changing the .gguf line to use a file on my hard drive) and I get:

╭─arthur at aquarelle in ~/dev/ai/llmi/src on main✘✘✘ 23-10-22 - 19:43:11
╰─⠠⠵ node structure.js
(node:2540491) Warning: To load an ES module, set "type": "module" in the package.json or use the .mjs extension.
(Use `node --trace-warnings ...` to show where the warning was created)
/home/arthur/dev/ai/llmi/src/structure.js:2
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";
^^^^^^

SyntaxError: Cannot use import statement outside a module
    at internalCompileFunction (node:internal/vm:73:18)
    at wrapSafe (node:internal/modules/cjs/loader:1153:20)
    at Module._compile (node:internal/modules/cjs/loader:1197:27)
    at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)
    at Module.load (node:internal/modules/cjs/loader:1091:32)
    at Module._load (node:internal/modules/cjs/loader:938:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:83:12)
    at node:internal/main/run_main_module:23:47

Google recommends I change to:

//import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";
const { LlamaModel, LlamaContext, LlamaChatSession } = require('node-llama-cpp');

So I do that (am I wrong or is it impossible for the example given in the README to work...?), and I get:

╭─arthur at aquarelle in ~/dev/ai/llmi/src on main✘✘✘ 23-10-22 - 19:43:15
╰─⠠⠵ node structure.js
/home/arthur/dev/ai/llmi/src/structure.js:15
const a1 = await session.prompt(q1);
           ^^^^^

SyntaxError: await is only valid in async functions and the top level bodies of modules
    at internalCompileFunction (node:internal/vm:73:18)
    at wrapSafe (node:internal/modules/cjs/loader:1153:20)
    at Module._compile (node:internal/modules/cjs/loader:1197:27)
    at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)
    at Module.load (node:internal/modules/cjs/loader:1091:32)
    at Module._load (node:internal/modules/cjs/loader:938:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:83:12)
    at node:internal/main/run_main_module:23:47

Node.js v20.5.1

So I put the awaits inside an async:

// Async hell.
(async () => {

    const q1 = "Hi there, how are you?";
    console.log("User: " + q1);
    
    const a1 = await session.prompt(q1);
    console.log("AI: " + a1);
    
    
    const q2 = "Summerize what you said";
    console.log("User: " + q2);
    
    const a2 = await session.prompt(q2);
    console.log("AI: " + a2);
    
})();

now I get:

╭─arthur at aquarelle in ~/dev/ai/llmi/src on main✘✘✘ 23-10-22 - 19:44:31
╰─⠠⠵ node structure.js
/home/arthur/dev/ai/llmi/src/structure.js:3
const { LlamaModel, LlamaContext, LlamaChatSession } = require('node-llama-cpp');
                                                       ^

Error [ERR_REQUIRE_ESM]: require() of ES Module /home/arthur/dev/ai/llmi/src/node_modules/node-llama-cpp/dist/index.js from /home/arthur/dev/ai/llmi/src/structure.js not supported.
Instead change the require of index.js in /home/arthur/dev/ai/llmi/src/structure.js to a dynamic import() which is available in all CommonJS modules.
    at Object.<anonymous> (/home/arthur/dev/ai/llmi/src/structure.js:3:56) {
  code: 'ERR_REQUIRE_ESM'
}

Node.js v20.5.1
╭─arthur at aquarelle in ~/dev/ai/llmi/src on main✘✘✘ 23-10-22 - 19:44:51
╰─⠠⠵         

At that point I just give up...

(Note this is after nearly an hour trying to get this module to work with ts-node, and utterly failing, despite trying DOZENS of things from Google and ChatGPT... I use thousands of modules from npm in ts projects, and this is the first time I've had this much trouble... which is why I fell back to trying to run it with node (instead of ts-node) to simplify the issue, and as you can see above, even that fails...)

I'm at a loss...

Any help welcome.
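
For what it's worth, a minimal sketch that matches the README's ESM example, assuming the file is renamed to .mjs (or "type": "module" is added to package.json) and the model path is adjusted:

// structure.mjs: an ES module, so `import` and top-level `await` both work
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const model = new LlamaModel({modelPath: "/path/to/model.gguf"});
const context = new LlamaContext({model});
const session = new LlamaChatSession({context});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);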

My Environment

Latest Ubuntu, Node 20.5.1

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

langchain.js - throws error "disposed undefined"

Issue description

Properly use langchain.js with llama.cpp and embeddings (maybe it's a new feature?)

Expected Behavior

Use langchain.js to add some useful retrievers, like ones for memory or file loading.
I'm probably wrong somewhere, but I don't know where; can you give me some advice? Thanks!

Actual Behavior

When I try to use retrievers together with LlamaCppEmbeddings I get an error:

file:///Users/mrddter/langchain-llama/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:28
        if (contextSequence.disposed)
                            ^

TypeError: Cannot read properties of undefined (reading 'disposed')
    at new LlamaChatSession (file:///Users/mrddter/langchain-llama/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:28:29)
    at createLlamaSession (file:///Users/mrddter/langchain-llama/node_modules/langchain/node_modules/@langchain/community/dist/utils/llama_cpp.js:27:12)
    at new LlamaCpp (file:///Users/mrddter/langchain-llama/node_modules/langchain/node_modules/@langchain/community/dist/llms/llama_cpp.js:77:25)
    at file:///Users/mrddter/langchain-llama/index.js:10:15

Steps to reproduce

I used this test code:

import { VectorStoreRetrieverMemory } from 'langchain/memory'
import { LLMChain } from 'langchain/chains'
import { PromptTemplate } from 'langchain/prompts'
import { MemoryVectorStore } from 'langchain/vectorstores/memory'

import { LlamaCppEmbeddings } from 'langchain/embeddings/llama_cpp'
import { LlamaCpp } from 'langchain/llms/llama_cpp'

const embeddings = new LlamaCppEmbeddings({ modelPath: process.env.MODEL_PATH, batchSize: 1024 })
const model = new LlamaCpp({ modelPath: process.env.MODEL_PATH, batchSize: 1024 })

const vectorStore = new MemoryVectorStore(embeddings)
const memory = new VectorStoreRetrieverMemory({
  // 1 is how many documents to return, you might want to return more, eg. 4
  vectorStoreRetriever: vectorStore.asRetriever(1),
  memoryKey: 'history'
})

// First let's save some information to memory, as it would happen when
// used inside a chain.
await memory.saveContext({ input: 'My favorite food is pizza' }, { output: 'thats good to know' })
await memory.saveContext({ input: 'My favorite sport is soccer' }, { output: '...' })
await memory.saveContext({ input: "I don't the Celtics" }, { output: 'ok' })

// Now let's use the memory to retrieve the information we saved.
console.log(await memory.loadMemoryVariables({ prompt: 'what sport should i watch?' }))
/*
{ history: 'input: My favorite sport is soccer\noutput: ...' }
*/

// Now let's use it in a chain.

const prompt =
  PromptTemplate.fromTemplate(`The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Relevant pieces of previous conversation:
{history}

(You do not need to use these pieces of information if not relevant)

Current conversation:
Human: {input}
AI:`)
const chain = new LLMChain({ llm: model, prompt, memory })

const res1 = await chain.call({ input: "Hi, my name is Perry, what's up?" })
console.log({ res1 })
/*
{
  res1: {
    text: " Hi Perry, I'm doing great! I'm currently exploring different topics related to artificial intelligence like natural language processing and machine learning. What about you? What have you been up to lately?"
  }
}
*/

const res2 = await chain.call({ input: "what's my favorite sport?" })
console.log({ res2 })
/*
{ res2: { text: ' You said your favorite sport is soccer.' } }
*/

const res3 = await chain.call({ input: "what's my name?" })
console.log({ res3 })
/*
{ res3: { text: ' Your name is Perry.' } }
*/

My Environment

Dependency Version
Operating System macOS Ventura
CPU Apple M1
Node.js version 10.10.0
Typescript version only js
langchain version ^0.0.212
node-llama-cpp version ^3.0.0-beta.1

Additional Context

I also tried this snippet to use only embeddings via langchain.js, but without success:

import { LlamaCppEmbeddings } from 'langchain/embeddings/llama_cpp'

const embeddings = new LlamaCppEmbeddings({ modelPath: process.env.MODEL_PATH, batchSize: 1024 })
const res = await embeddings.embedQuery('Hello Llama!')
console.log(res)

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

feat: minP support

Feature Description

Setting minP allows for better results even at higher temperatures by rejecting tokens that are too unlikely. It is now supported in llama.cpp, and it makes both topP and topK almost superfluous.

The Solution

It should be done the same way as you did for maxP, I guess.
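
Purely as an illustration, a hypothetical sketch of what such an option could look like next to the existing sampling options of prompt(); minP here is the proposed name, not an existing parameter of this library:

const answer = await session.prompt("Write a haiku about llamas", {
    temperature: 0.8,
    topK: 40,
    topP: 0.95,
    minP: 0.05, // hypothetical: drop tokens whose probability is below 5% of the most likely token's
});
console.log(answer);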

Considered Alternatives

Don't do it if you don't want to.

Additional Context

No response

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

CUDA compilation failing with VS 2022 Community

Issue description

CUDA compilation fails with error MSB8020.

Expected Behavior

Compilation with CUDA should work

Actual Behavior

I get error MSB8020 when trying to recompile with CUDA.
I tested compiling with CUDA directly on the llama.cpp project with CMake, without error.

Steps to reproduce

Launching npx --no node-llama-cpp download --cuda

My Environment

Dependency Version
Operating System Windows 11
CPU AMD Ryzen 5 3600 6-Core Processor
Node.js version v18.18.2
node-llama-cpp version 2.7.3
CUDA version v12.2
Visual Studio version 2022 community

Additional Context

The logs:

vuith@THOMZ-FIXE MINGW64 ~/GitRepositories/node-llama-cpp-experiment (main)
$ npx --no node-llama-cpp download --cuda
Repo: ggerganov/llama.cpp
Release: b1378
CUDA: enabled

✔ Fetched llama.cpp info
✔ Removed existing llama.cpp directory
Cloning llama.cpp
Clone ggerganov/llama.cpp (local bundle)  100% ████████████████████████████████████████  0s
◷ Compiling llama.cpp
Not searching for unused variables given on the command line.
CMake Error at CMakeLists.txt:3 (project):
  Failed to run MSBuild command:

    C:/Program Files/Microsoft Visual Studio/2022/Community/MSBuild/Current/Bin/amd64/MSBuild.exe

  to get the value of VCTargetsPath:

    MSBuild version 17.7.2+d6990bcfa for .NET Framework
    Build started 22/10/2023 18:51:39.

    Project "C:\Users\vuith\GitRepositories\node-llama-cpp-experiment\node_modules\node-llama-cpp\llama\build\CMakeFiles\3.28.0-rc2\VCTargetsPath.vcxproj" on node 1 (default targets).
    C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets(456,5): error MSB8020: The build tools for C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2 (Platform Toolset = 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2') cannot be found. To build using the C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2 build tools, please install the C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2 build tools. Alternatively, you may upgrade to the current Visual Studio tools by selecting the Project menu or right-click the solution, and then selecting "Retarget solution". [C:\Users\vuith\GitRepositories\node-llama-cpp-experiment\node_modules\node-llama-cpp\llama\build\CMakeFiles\3.28.0-rc2\VCTargetsPath.vcxproj]
    Done building project "C:\Users\vuith\GitRepositories\node-llama-cpp-experiment\node_modules\node-llama-cpp\llama\build\CMakeFiles\3.28.0-rc2\VCTargetsPath.vcxproj" (default targets) -- FAILED.

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

Module not found: Default condition should be last one

Issue description

I ran into "Module not found: Default condition should be last one" error on my console from importing ChatPromptWrapper.

Expected Behavior

No compilation error.

Actual Behavior

Screenshot from 2023-09-02 12:44:25

Module not found: Default condition should be last one

Steps to reproduce

  1. Set up a Next.js project
  2. Create a Route Handler, e.g. app/api/local/route.ts
  3. Import ChatPromptWrapper

import { ChatPromptWrapper } from 'node-llama-cpp';

My Environment

Dependency Version
next 13.4.18
node-llama-cpp 2.1.2

Additional Context

Next.js uses webpack as its underlying bundler, and I suspect the ordering of the default condition in the package.json exports field matters for this specific bundler.
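
For illustration, a hedged sketch of the exports-map shape webpack expects (generic paths, not copied from node-llama-cpp's actual package.json); the "default" condition has to come after the more specific conditions, otherwise webpack raises exactly this error:

"exports": {
    ".": {
        "types": "./dist/index.d.ts",
        "import": "./dist/index.js",
        "default": "./dist/index.js"
    }
}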

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

Add llama.cpp build number as info

Feature Description

The website mentions that each release ships with "the most recent llama.cpp release at the time", offering the possibility to download newer releases as appropriate (with the corresponding caveats).

However, I could not find which release that was. I now manually figured out that release b1618 is the last one that works with the beta, as afterwards there was a breaking API change, but I had to test out quite a few to find "the newest".

It would be great to store this information somewhere – either on the repository (releases section would suffice) or in the distributable.

The Solution

Add the build number of llama.cpp that was used to create a node-llama-cpp release to some public place.

Considered Alternatives

Manually finding out the most recent working build number. Neither the files nor the releases section contains this number.

Additional Context

It doesn't need to be pretty, so there's no need to make it complicated. It's just good if the number is recorded somewhere!

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

Failed to load model

I am trying to load a local Llama2_7B model built using llama.cpp and then quantized to q4_0. When I attempt to load the model I get the error:

error loading model: unknown (magic, version) combination: 46554747, 00000001; is this really a GGML file?

To use a quantized model, are there parameters I need to set, or is something else preventing this model from loading? For my test I am just passing in the path.

Could not find a KV slot

Issue description

In version 2.8.3 I am receiving "could not find a KV slot for the batch (try reducing the size of the batch or increase the context)" error when performing a bunch of zero-shot prompts in a loop with the same context.

Expected Behavior

To be able to perform zero-shot prompts in a loop without failure.

Actual Behavior

I'm receiving the following error after a few iterations regardless of the number of layers or if I create multiple contexts.

Steps to reproduce

  1. follow instructions on my OSS project for setting up genai-gamelist
  2. update package.json to use 2.8.3 from 3.x beta for node-llama-cpp
  3. run the script

My Environment

Dependency Version
Operating System Ubuntu 22.04 LTS
CPU AMD Ryzen 9 5900X
Node.js version 18.x
Typescript version 5.3.3
node-llama-cpp version 2.8.3

Additional Context

After only upgrading to the 3.x beta, I am no longer able to reproduce the issue with the same script.

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

node-llama-cpp doesn't work with TypeScript

Issue description

node-llama-cpp doesn't work with TypeScript.

Expected Behavior

It should work with TypeScript.

Actual Behavior

When I install the node-llama-cpp package and run it, I get these errors:

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:3:45 - error TS1139: Type parameter declaration expected.

3 export declare class LlamaJsonSchemaGrammar<const T extends Readonly> extends LlamaGrammar {
~~~~~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:3:53 - error TS1005: ',' expected.

3 export declare class LlamaJsonSchemaGrammar<const T extends Readonly> extends LlamaGrammar {
~~~~~~~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:3:69 - error TS1005: ',' expected.

3 export declare class LlamaJsonSchemaGrammar<const T extends Readonly> extends LlamaGrammar {
~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:3:85 - error TS1109: Expression expected.

3 export declare class LlamaJsonSchemaGrammar<const T extends Readonly> extends LlamaGrammar {
~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:3:87 - error TS1109: Expression expected.

3 export declare class LlamaJsonSchemaGrammar<const T extends Readonly> extends LlamaGrammar {
~~~~~~~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:3:95 - error TS1434: Unexpected keyword or identifier.

3 export declare class LlamaJsonSchemaGrammar<const T extends Readonly> extends LlamaGrammar {
~~~~~~~~~~~~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:4:5 - error TS1128: Declaration or statement expected.

4 private readonly _schema;
~~~~~~~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:4:13 - error TS1128: Declaration or statement expected.

4 private readonly _schema;
~~~~~~~~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:5:23 - error TS1005: ',' expected.

5 constructor(schema: T);
~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:6:15 - error TS1005: ',' expected.

6 parse(json: string): GbnfJsonSchemaToType;
~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:6:24 - error TS1005: ';' expected.

6 parse(json: string): GbnfJsonSchemaToType;
~

node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:6:49 - error TS1005: '(' expected.

6 parse(json: string): GbnfJsonSchemaToType;
~

Found 12 errors in the same file, starting at: node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaJsonSchemaGrammar.d.ts:3

Steps to reproduce

Install it in a TypeScript project.
Without TypeScript, it works.

My Environment

"node-llama-cpp": "^2.8.1"
"typescript": "~4.6.3"
"ts-node": "~10.7.0",
my tsconfig

{
    "compileOnSave": false,
    "compilerOptions": {
        "baseUrl": "./",
        "outDir": "./dist",
        "sourceMap": true,
        "declaration": false,
        "module": "commonjs",
        "moduleResolution": "node",
        "experimentalDecorators": true,
        "skipLibCheck": true,
        "target": "es5",
        "skipLibCheck": true,
        "typeRoots": ["node_modules/@types"],
        "lib": ["es2018", "dom"]
    }
}
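
For context, the const modifier on a type parameter (the `const T extends ...` in the generated .d.ts quoted above) is TypeScript 5.0+ syntax, which a 4.6 compiler cannot parse. A minimal, illustrative-only snippet of that syntax (not the library's actual declaration):

// compiles with TypeScript >= 5.0; fails to parse on 4.x with errors like TS1139 above
declare function keepLiteral<const T>(value: T): T;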

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

feat: hide llama.cpp logs

Feature Description

Hide all the logs produced by llama.cpp about model running/parameters/etc

The Solution

Either ignore the output, or process it and expose it via a callback when running the model internally.

Considered Alternatives

In general, non-optional logging in (any) library is bad practice.

Additional Context

Thanks for your work ❤️

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

feat: function calling support in a chat session's `prompt` function

Make it possible to provide functions that the model can call as part of the response.

It should be as simple as something like that:

const res = await chatSession.prompt("What is the current weather?", {
    functions: {
        getWeather: {
            description: "Get the current weather for a location"
            params: {
                location: {
                    type: "string"
                }
            },
            handler({location}) {
                console.log("Providing fake weather for location:", location);

                return {
                    temperature: 32,
                    raining: true,
                    unit: "celsius"
                };
            }
        },
        getCurrentLocation: {
            description: "Get the current location",
            handler() {
                console.log("Providing fake location");

                return "New York, New York, United States".
            }
        }
    }
});
console.log(res);

If you have ideas of a text format I can use to prompt the model with, please share.
I'm looking for a format that can achieve all of these:

  • A way to give the model the list of possible functions, while utilizing as few tokens for this as possible.
    It can also be OK to give the model only a brief overview of the available functions and let it fetch more info about a function on demand.
  • Make it easy to stream regular text, while still distinguishing between regular text that the model writes and a function call it tries to do

I thought about implementing support for this format as part of GeneralChatPromptWrapper, but I'm not really sure whether this is the safest way to distinguish between text and function calling:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.

Available functions:
` ` `
function getWeather(params: {location: string});
function getCurrentLocation();
` ` `

You can call these functions by writing a text like that:
[[call: myFunction({param1: "value"})]]

### Human:
What is the current weather?

### Assistant:

Then when the model writes text, it may go like this:

I'll get the current location to fetch the weather for it.
[[call: getCurrentLocation()]]

I'll then detect the function call in the model response and evaluate this text:

[[result: "New York, New York, United States"]]

So the model can then continue to provide completion:

I'll now get the current weather for New York, New York, United States.
[[call: getWeather({location: "New York, New York, United States"})]]

I'll then detect the function call in the model response and evaluate this text:

[[result: {temperature: 32, raining: true, unit: "celsius"}]]

So the model can then continue to provide completion:

The current weather for New York, New York, United States is 32 degrees celsius and it's currently raining

I plan to use grammar tricks to make sure the model can only call existing functions and with the right parameter types.

How you can help

  • If you have an idea of a better format, please let me know
  • If you have an idea of a good way to implement this in the LlamaChat (see LlamaChatPromptWrapper) format, this would also be very helpful

I'm currently working on a major change in this module, so if you'd like to help with implementing any of this, please let me know beforehand so your work won't become incompatible with the new changes

Naming the `LlamaModel` anything but `model` makes the app crash when creating a new `LlamaContext`

Issue description

Like the title says, I can't create a context from a model variable that is not named model.

Expected Behavior

I should be able to name my variables however I want.

Actual Behavior

I can't.

Steps to reproduce

import "dotenv/config";

import {
  LlamaModel,
  LlamaContext,
} from "node-llama-cpp";

import path from "path";

const llama_root = process.env.LLAMA_MODEL_PATH;

const llamaModel = new LlamaModel({
  modelPath: path.join(llama_root, "models", "7B", "ggml-model_Q5_K_M"),
});
new LlamaContext({ llamaModel, threads: 4 });

console.log("Here!");

Crashes with:

 file:///.../llamabot/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaContext.js:13
    constructor({ model, prependBos = true, grammar, seed = model._contextOptions.seed, contextSize = model._contextOptions.contextSize, batchSize = model._contextOptions.batchSize, logitsAll = model._contextOptions.logitsAll, embedding = model._contextOptions.embedding, threads = model._contextOptions.threads }) {
                                                                  ^
 TypeError: Cannot read properties of undefined (reading '_contextOptions')
    at new LlamaContext (file:///.../llamabot/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaContext.js:13:67)
    at ...

but

    import "dotenv/config";

import {
  LlamaModel,
  LlamaContext,
} from "node-llama-cpp";

import path from "path";

const llama_root = process.env.LLAMA_MODEL_PATH;

const model = new LlamaModel({
  modelPath: path.join(llama_root, "models", "7B", "ggml-model_Q5_K_M"),
});
new LlamaContext({ model, threads: 4 });

console.log("Here!");
prints `Here!` as expected and exits without a crash.
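
For reference, a sketch that keeps the custom variable name: the object shorthand { llamaModel } creates a property called llamaModel, while the constructor destructures a property named model, so pass it explicitly:

const llamaModel = new LlamaModel({
  modelPath: path.join(llama_root, "models", "7B", "ggml-model_Q5_K_M"),
});
// same variable name, but exposed to LlamaContext under the expected `model` key
new LlamaContext({ model: llamaModel, threads: 4 });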

My Environment

Dependency Version
Operating System Windows 10
CPU Ryzen 9 (5900X)
Node.js version 1.1.9
Typescript version doesn't matter here, that's a runtime issue
node-llama-cpp version 2.8.2

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

feat: reuse context for different chat session history

Issue description

I am creating a question-and-answer bot, and I don't want to add my previous chat messages to the session, because adding them exhausts my context limit.

Expected Behavior

I want the ability to not save anything to the session, or the ability to clear the session context.

Actual Behavior

After some chat, the context quickly gets exhausted and an error is thrown.

Steps to reproduce

Chat for a while; the context quickly gets exhausted and an error is thrown.

My Environment

Dependency Version
Operating System
CPU Intel i9 / Apple M1
Node.js version x.y.zzz
Typescript version x.y.zzz
node-llama-cpp version x.y.zzz

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

Please add AMD ROCm support

Feature Description

I cannot use this with an AMD GPU.

The Solution

Support AMD GPUs.

Considered Alternatives

I don't want to sell my AMD GPU.

Additional Context

No response

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

ggml_allocr_alloc: not enough space in the buffer

Issue description

When I pass a prompt of 1000+ tokens, it fails with a ggml_allocr_alloc error.

Expected Behavior

It should allow consuming a prompt size equal to the model's context size.

Actual Behavior

AI: ggml_allocr_alloc: not enough space in the buffer (needed 308691360, largest block available 281657952)
GGML_ASSERT: /Users/pixelycia/Projects/node-llama-cpp/node_modules/node-llama-cpp/llama/llama.cpp/ggml-alloc.c:173: !"not enough space in the buffer"
zsh: abort node-llama-cpp chat -m ./models/mistral-7b-openorca.Q5_K_M.gguf -c 8192

Steps to reproduce

I was able to reproduce the error both through code and through the CLI:

  1. node-llama-cpp chat -m ./models/mistral-7b-openorca.Q5_K_M.gguf -c 8192
  2. Pass long prompt.

My Environment

Dependency Version
Operating System MacOS Sonoma 14.0
Node.js version v18.18.0
Typescript version 5.1.3
node-llama-cpp version ^2.5.1

Additional Context

I compiled the latest llama.cpp from the original repository and it works perfectly fine.

I tried clearing / re-downloading / re-building with the built-in "node-llama-cpp" CLI tool, with different releases and with and without Metal support; it fails with this error every time.

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

Set temperature request.

Enhancement request: I've based a LangChain module on this repo, and wondered if it would be possible to support setting the temperature?
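
For reference, a minimal sketch of setting the temperature per prompt; the option appears in the prompt() signature quoted in other issues on this page, while how a LangChain wrapper would expose it is a separate design question:

const answer = await session.prompt("Tell me a short story", {
    temperature: 0.7, // higher values give more varied output
    topK: 40,
    topP: 0.9,
});
console.log(answer);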

support commonjs

Feature Description

This library is currently integrated into langchain, and because it only supports ES modules, it limits the adoption of both langchain and this library in the ecosystem.

It's also important to note that you can't just switch the package.json of any NestJS project to "type": "module": most of the tooling and the @nestjs modules themselves are not compatible.

The Solution

The easiest approach would be to modify the build and package.json to support both CommonJS and ES module builds. It might also be possible to keep an ESM-only package if the file extension is set correctly to .mjs.

Considered Alternatives

There are no alternatives. In its current state, you cannot use langchain or this library at all with NestJS. I've also spent a considerable amount of time trying workarounds for webpack and non-webpack builds with NestJS.

Additional Context

This issue was raised previously, and the reason used to close it was that the Node.js ecosystem is moving towards ES modules. This is true, but it is still a long way away. The Node.js working group specifically created the ability to support both ES modules and CommonJS as a bridge to ES modules. By not providing that other build, you're severely limiting the reach of this library as well as langchain.

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

`(node:23396) UnhandledPromiseRejectionWarning: Error: AbortError`

Issue description

When aborting a prompt() an error is thrown.

Expected Behavior

I can abort a generation.

Actual Behavior

This is thrown:

(node:23396) UnhandledPromiseRejectionWarning: Error: AbortError
    at LlamaChatSession._evalTokens (file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:154:23)
    at async file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:106:72
    at async withLock (file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/utils/withLock.js:11:16)
    at async LlamaChatSession.promptWithMeta (file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:74:16)
    at async LlamaChatSession.prompt (file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:62:26)
(Use `Electron --trace-warnings ...` to show where the warning was created)
(node:23396) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:23396) UnhandledPromiseRejectionWarning: Error: AbortError
    at LlamaChatSession._evalTokens (file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:154:23)
    at async file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:106:72
    at async withLock (file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/utils/withLock.js:11:16)
    at async LlamaChatSession.promptWithMeta (file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:74:16)
(node:23396) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 2)
    at async LlamaChatSession.prompt (file:///Users/parismorgan/repo/foo/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/llamaEvaluator/LlamaChatSession.js:62:26)

which when I click in leads to this line:

for await (const chunk of evaluationIterator) {
            if (signal?.aborted)
                throw new AbortError();

and before that, this line:

const { text, stopReason, stopString, stopStringSuffix } = await this._evalTokens(this._ctx.encode(promptText), {
                onToken, signal, maxTokens, temperature, topK, topP, grammar, trimWhitespaceSuffix,
                repeatPenalty: repeatPenalty == false ? { lastTokens: 0 } : repeatPenalty
            });

Steps to reproduce

I first set up things like this:

const model = new LlamaModel({
  modelPath: '~/repo/gguf-models/llama-2-7b-chat.Q4_K_M.gguf',
  useMlock: true
})
const context = new LlamaContext({
  model,
  batchSize: 512,
  threads: 8,
  contextSize: 4096
})
session = new LlamaChatSession({
  context,
  systemPrompt: "This is a transcript of a never ending conversation between Paris and Siri. This is the personality of Siri: Siri is a knowledgeable and friendly AI. They are very curious and will ask the user a lot of questions about themselves and their life.\nSiri is a virtual assistant who lives on Paris's computer.",
  printLLamaSystemInfo: true,
  promptWrapper: new LlamaChatPromptWrapper()
})

Then I kick off a text generation:

const abortController = new AbortController()
ipcMain.on('message', async (_event, {message}) => {
  await session.prompt(message, {
    onToken,
    signal: abortController.signal
  })
})

And then I abort it:

ipcMain.on('stop-generation', () => {
  abortController.abort()
})

My Environment

Dependency Version
Operating System Mac
CPU Intel i9 / Apple M2 Pro
Node.js version v18.16.0
Typescript version 5.1.6
node-llama-cpp version 2.8.0
Electron version 25.3.0

Additional Context

Is the appropriate thing to do here just to wrap the code in a try catch? Or is this actually a bug? Thank you for any help!
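
Wrapping the call in a try/catch is one way to handle it; a minimal sketch, assuming the same ipcMain handlers as above, where the abort rejection is caught and treated as a normal stop:

ipcMain.on('message', async (_event, { message }) => {
  try {
    await session.prompt(message, {
      onToken,
      signal: abortController.signal
    })
  } catch (error) {
    // an aborted generation rejects the prompt() promise, so catch it here
    console.log('generation stopped:', error)
  }
})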

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

Error [ERR_REQUIRE_ESM]: require() of ES Module

Issue description

Error when launching a project with the lib.

Expected Behavior

Using the lib in an environment with NestJS and LangChain.js.

Actual Behavior

/srv/app/node_modules/.pnpm/[email protected][email protected]/node_modules/langchain/dist/llms/llama_cpp.cjs:4
2023-10-06T15:15:09.118856808Z const node_llama_cpp_1 = require("node-llama-cpp");
2023-10-06T15:15:09.118858668Z ^
2023-10-06T15:15:09.118860078Z Error [ERR_REQUIRE_ESM]: require() of ES Module /srv/app/node_modules/.pnpm/[email protected]/node_modules/node-llama-cpp/dist/index.js from /srv/app/node_modules/.pnpm/[email protected][email protected]/node_modules/langchain/dist/llms/llama_cpp.cjs not supported.
2023-10-06T15:15:09.118861714Z Instead change the require of index.js in /srv/app/node_modules/.pnpm/[email protected][email protected]/node_modules/langchain/dist/llms/llama_cpp.cjs to a dynamic import() which is available in all CommonJS modules.
2023-10-06T15:15:09.118863275Z at Object. (/srv/app/node_modules/.pnpm/[email protected][email protected]/node_modules/langchain/dist/llms/llama_cpp.cjs:4:26)
2023-10-06T15:15:09.118864924Z at Object. (/srv/app/node_modules/.pnpm/[email protected][email protected]/node_modules/langchain/llms/llama_cpp.cjs:1:18)
2023-10-06T15:15:09.118866441Z at Object. (/srv/app/dist/services/predict/llama-2/llama2-7B-predict.service.js:5:21)
2023-10-06T15:15:09.118867854Z at Object. (/srv/app/dist/app.module.js:13:37)
2023-10-06T15:15:09.118869223Z at Object. (/srv/app/dist/main.js:4:22)

Steps to reproduce

Add the lib and try to use it in a project with NestJS.
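
For reference, a minimal sketch of the dynamic import() workaround that the error message suggests, for code you control that has to stay CommonJS (the file and function names are illustrative, not taken from the project). Note that the require() inside LangChain's compiled llama_cpp.cjs would need the same change upstream, or the app would need to run as ESM:

// llama.service.js (CommonJS)
let llamaModulePromise;

function loadLlamaModule() {
  // import() is allowed from CommonJS and returns a promise of the ESM namespace
  if (!llamaModulePromise)
    llamaModulePromise = import("node-llama-cpp");
  return llamaModulePromise;
}

async function createSession(modelPath) {
  const { LlamaModel, LlamaContext, LlamaChatSession } = await loadLlamaModule();
  const model = new LlamaModel({ modelPath });
  const context = new LlamaContext({ model });
  return new LlamaChatSession({ context });
}

module.exports = { createSession };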

My Environment

| Dependency | Version |

Dependencies:

"@nestjs/common": "^10.2.7",
"@nestjs/core": "^10.2.7",
"@nestjs/platform-express": "^10.2.7",
"cmake-js": "^7.2.1",
"langchain": "^0.0.159",
"node-llama-cpp": "^2.5.1",
"reflect-metadata": "^0.1.13",
"rxjs": "^7.8.1"

Dev dependencies:

"@nestjs/cli": "^10.1.18",
"@nestjs/schematics": "^10.0.2",
"@nestjs/testing": "^10.2.7",
"@types/express": "^4.17.18",
"@types/jest": "^29.5.5",
"@types/node": "^20.8.2",
"@types/supertest": "^2.0.14",
"@typescript-eslint/eslint-plugin": "^6.7.4",
"@typescript-eslint/parser": "^6.7.4",
"eslint": "^8.50.0",
"eslint-config-prettier": "^9.0.0",
"eslint-plugin-prettier": "^5.0.0",
"jest": "^29.7.0",
"prettier": "^3.0.3",
"source-map-support": "^0.5.21",
"supertest": "^6.3.3",
"ts-jest": "^29.1.1",
"ts-loader": "^9.4.4",
"ts-node": "^10.9.1",
"tsconfig-paths": "^4.2.0",
"typescript": "^5.2.2"
| Operating System | ubuntu22.04 |
| Node.js version | v20.8.0 |
| Typescript version | 5.2.2 |
| node-llama-cpp version | 2.5.1 |

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

Cuda support is not working

I have built the binary as per the README and passed gpuLayers, but it is still not using the GPU.
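
For reference, a minimal sketch of the setup being described (the gpuLayers value here is an assumption; the model path matches the log below):

import { LlamaModel, LlamaContext } from "node-llama-cpp";

const model = new LlamaModel({
  modelPath: "models/codellama-7b.Q4_K_S.gguf",
  gpuLayers: 32 // after a --cuda build, these layers are expected to be offloaded to the GPU
});
const context = new LlamaContext({ model });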

llama_model_loader: loaded meta data with 17 key-value pairs and 291 tensors from models/codellama-7b.Q4_K_S.gguf (version GGUF V1 (support until nov 2023))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32016,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight f16      [  4096, 32016,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_v.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    6:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    8:            blk.0.ffn_down.weight q5_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   10:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   11:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:              blk.1.attn_v.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   15:         blk.1.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   17:            blk.1.ffn_down.weight q5_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   18:              blk.1.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   19:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   20:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   22:              blk.2.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   23:              blk.2.attn_v.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   24:         blk.2.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   26:            blk.2.ffn_down.weight q5_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   27:              blk.2.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   28:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   29:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   31:              blk.3.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   32:              blk.3.attn_v.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   33:         blk.3.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   35:            blk.3.ffn_down.weight q5_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   36:              blk.3.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   37:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   38:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   40:              blk.4.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   41:              blk.4.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   42:         blk.4.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   44:            blk.4.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   45:              blk.4.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   46:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   47:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   49:              blk.5.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   50:              blk.5.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   51:         blk.5.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   53:            blk.5.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   54:              blk.5.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   55:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   56:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   58:              blk.6.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   59:              blk.6.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   60:         blk.6.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   62:            blk.6.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   63:              blk.6.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   64:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   65:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   67:              blk.7.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   68:              blk.7.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   69:         blk.7.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   71:            blk.7.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   72:              blk.7.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   73:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   74:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   76:              blk.8.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   77:              blk.8.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   78:         blk.8.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   80:            blk.8.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   81:              blk.8.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   82:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   83:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   85:              blk.9.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   86:              blk.9.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   87:         blk.9.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   89:            blk.9.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   90:              blk.9.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   91:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   92:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   94:             blk.10.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   95:             blk.10.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   96:        blk.10.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   98:           blk.10.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   99:             blk.10.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  100:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  101:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  103:             blk.11.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  104:             blk.11.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  105:        blk.11.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  107:           blk.11.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  108:             blk.11.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  109:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  110:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  112:             blk.12.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  113:             blk.12.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  114:        blk.12.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  116:           blk.12.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  117:             blk.12.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  118:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  119:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  121:             blk.13.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  122:             blk.13.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  123:        blk.13.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  125:           blk.13.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  126:             blk.13.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  127:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  128:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  130:             blk.14.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  131:             blk.14.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  132:        blk.14.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  134:           blk.14.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  135:             blk.14.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  136:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  137:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  139:             blk.15.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  140:             blk.15.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  141:        blk.15.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  143:           blk.15.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  144:             blk.15.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  145:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  146:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  148:             blk.16.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  149:             blk.16.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  150:        blk.16.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  152:           blk.16.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  153:             blk.16.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  154:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  155:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  157:             blk.17.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  158:             blk.17.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  159:        blk.17.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  161:           blk.17.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  162:             blk.17.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  163:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  164:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  166:             blk.18.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  167:             blk.18.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  168:        blk.18.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  170:           blk.18.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  171:             blk.18.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  172:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  173:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  175:             blk.19.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  176:             blk.19.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  177:        blk.19.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  179:           blk.19.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  180:             blk.19.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  181:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  182:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  184:             blk.20.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  185:             blk.20.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  186:        blk.20.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  188:           blk.20.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  189:             blk.20.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  190:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  191:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  193:             blk.21.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  194:             blk.21.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  195:        blk.21.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  197:           blk.21.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  198:             blk.21.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  199:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  200:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  202:             blk.22.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  203:             blk.22.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  204:        blk.22.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  206:           blk.22.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  207:             blk.22.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  208:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  209:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  211:             blk.23.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  212:             blk.23.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  213:        blk.23.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  214:           blk.23.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  215:           blk.23.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  216:             blk.23.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  217:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  218:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  220:             blk.24.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  221:             blk.24.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  222:        blk.24.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  223:           blk.24.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  224:           blk.24.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  225:             blk.24.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  226:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  227:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  229:             blk.25.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  230:             blk.25.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  231:        blk.25.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  232:           blk.25.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  233:           blk.25.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  234:             blk.25.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  235:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  236:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  238:             blk.26.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  239:             blk.26.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  240:        blk.26.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  241:           blk.26.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  242:           blk.26.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  243:             blk.26.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  244:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  245:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  247:             blk.27.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  248:             blk.27.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  249:        blk.27.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  251:           blk.27.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  252:             blk.27.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  253:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  254:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  256:             blk.28.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  257:             blk.28.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  258:        blk.28.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  259:           blk.28.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  260:           blk.28.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  261:             blk.28.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  262:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  263:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  265:             blk.29.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  266:             blk.29.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  267:        blk.29.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  268:           blk.29.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  269:           blk.29.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  270:             blk.29.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  271:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  272:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  274:             blk.30.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  275:             blk.30.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  276:        blk.30.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  277:           blk.30.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  278:           blk.30.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  279:             blk.30.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  280:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  281:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  283:             blk.31.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  284:             blk.31.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  285:        blk.31.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  287:           blk.31.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  289:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       llama.rope.freq_base f32     
llama_model_loader: - kv  11:                          general.file_type u32     
llama_model_loader: - kv  12:                       tokenizer.ggml.model str     
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  16:               general.quantization_version u32     
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q4_0:    1 tensors
llama_model_loader: - type q4_K:  216 tensors
llama_model_loader: - type q5_K:    8 tensors
llm_load_print_meta: format         = GGUF V1 (support until nov 2023)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32016
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 16384
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 4096
llm_load_print_meta: n_head         = 32
llm_load_print_meta: n_head_kv      = 32
llm_load_print_meta: n_layer        = 32
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 11008
llm_load_print_meta: freq_base      = 1000000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 7B
llm_load_print_meta: model ftype    = mostly Q4_K - Small
llm_load_print_meta: model size     = 6.74 B
llm_load_print_meta: general.name   = LLaMA
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: mem required  = 3825.08 MB (+ 2048.00 MB per state)
..............................................................................................
llama_new_context_with_model: kv self size  = 2048.00 MB
llama_new_context_with_model: compute buffer total size =  281.41 MB
[2023-08-29T04:43:28.252]  [INFO] server: 🚀 server is listening on port 80

feat: max GPU layers param

Feature Description

Allow an option for LlamaModel to use all available GPU layers.

The Solution

new LlamaModel({
  gpuLayers: -1,  // use all available gpu layers
});

Considered Alternatives

Another number or symbol other than -1 would also work

Additional Context

Using -1 for GPU layers is standard in Python toolchains

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

Passing grammars

I can't find a way to pass along a grammar when setting things up, the way you can with llama.cpp.

Am I missing something, or is this just not supported yet?

I need it to force the model to generate JSON, which is extra convenient when running a node-based system...
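
For reference, a minimal sketch of what forcing JSON output could look like, assuming a LlamaGrammar export and the grammar option of prompt() (treat the exact names as assumptions against the installed version):

import { LlamaModel, LlamaContext, LlamaChatSession, LlamaGrammar } from "node-llama-cpp";

const model = new LlamaModel({ modelPath: "path/to/model.gguf" });
const context = new LlamaContext({ model });
const session = new LlamaChatSession({ context });

// built-in JSON grammar; a custom GBNF grammar string should also be usable here (assumption)
const grammar = await LlamaGrammar.getFor("json");

const answer = await session.prompt("Describe a llama as JSON", { grammar });
console.log(JSON.parse(answer));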

Thanks!

feat: Apply different LoRA dynamically

Feature Description

Allow changing the LoRA dynamically after loading a LLaMA model.

The Solution

See the llama_model_apply_lora_from_file() function in llama.cpp:

https://github.com/ggerganov/llama.cpp/blob/e9c13ff78114af6fc6a4f27cc8dcdda0f3d389fb/llama.h#L353C1-L359C1
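
A hypothetical sketch of what the JavaScript side of this could look like (the method and option names below are illustrative only; they do not exist in node-llama-cpp today):

const model = new LlamaModel({ modelPath: "path/to/base-model.gguf" });

// hypothetical wrapper around llama_model_apply_lora_from_file()
await model.applyLora({ loraPath: "path/to/adapter-a.bin", scale: 1.0 });

// ...later, switch to a different adapter without reloading the base model
await model.applyLora({ loraPath: "path/to/adapter-b.bin", scale: 1.0 });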

Considered Alternatives

None.

Additional Context

No response

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

Set repeat-penalty request

Feature Description

It would be nice to be able to pass the repeat-penalty to the model too.

The Solution

Simply add a new parameter for passing the repeat-penalty.
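
For reference, the 2.8.0 code excerpt in the abort issue above shows a repeatPenalty option being forwarded by prompt(), so a minimal sketch could look like the following (lastTokens comes from that excerpt; the other field name is an assumption, and session is assumed to be set up as in the other issues):

const answer = await session.prompt(question, {
  repeatPenalty: {
    lastTokens: 64, // how many recent tokens the penalty is applied to
    penalty: 1.1    // assumed name for the penalty factor itself
  }
});

// passing false disables it (the excerpt maps false to { lastTokens: 0 })
const unpenalized = await session.prompt(question, { repeatPenalty: false });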

Considered Alternatives

I tried doing it myself, but the repo won't compile properly.

Additional Context

No response

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

Issue with Webpack Compilation

Issue description

When bundling node-llama-cpp with webpack and Typescript, there's something weird happening: Webpack somehow appears to load the module as a promise. After that is resolved, everything works fine, but this makes the code extremely weird.

Expected Behavior

Bundling code with webpack should work out of the box as indicated in the getting started guide.

NOTE: I am using webpack because I'm working on an Electron app with Electron forge. I cannot "just" use TypeScript.

Actual Behavior

Destructuring the module import results in undefined values. Importing everything at once gives me a promise that, once awaited, actually gives me the module exports as it should. They then work fine. See code:

// --> All undefined upon running the app; throws error "LlamaModel is not a constructor" due to that
import { LlamaModel, LlamaContext, LlamaConversation } from 'node-llama-cpp'

// If I import everything as a single object ...
import Llama from 'node-llama-cpp'
console.log(Llama) // -> 'Promise<pending>'

Steps to reproduce

It works when I do the following mental gymnastics:

import Llama from 'node-llama-cpp'

;(Llama as any).then(mod => {
  const model = new mod.LlamaModel({ modelPath: 'path/to/model' })
  // `model` will be a proper loaded LlamaModel instance that can be used further down the road.
})

And I receive the proper output that indicates that llama.cpp has loaded successfully. I have not yet tried to prompt the model, but I can confirm that the model has been loaded into RAM successfully.
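
A slightly cleaner variant of the same workaround, assuming the code can run inside an async function: make the promise explicit with a dynamic import instead of casting the namespace to any.

// somewhere in an async setup function of the app
const { LlamaModel, LlamaContext, LlamaChatSession } = await import('node-llama-cpp')

const model = new LlamaModel({ modelPath: 'path/to/model' })
const context = new LlamaContext({ model })
const session = new LlamaChatSession({ context })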

My Environment

| Dependency | Version |
| Operating System | macOS Sonoma 14.2.1 |
| CPU | M2 Pro |
| Node.js version | 18.19.0 LTS |
| Typescript version | 5.3.3 |
| node-llama-cpp version | 2.8.3 |

Additional Context

It appears that webpack doesn't like something in the dist files of node-llama-cpp. However, I have not yet had any success finding the source.

All the other handling (such as bundling the node file, etc.) works flawlessly with the Electron Forge setup.

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time, but I can support (using donations) development.

Llama2 Template Error

Issue description

The Llama2 Templates appear to not work with llama2 models.

Expected Behavior

Using the LlamaChatPromptWrapper I would expect the model to produce a normal response.

Actual Behavior

When I use LlamaChatPromptWrapper it seems to get stuck and produce the following output:

GGML_ASSERT: node_modules/node-llama-cpp/llama/llama.cpp/ggml.c:4785: view_src == NULL || data_size + view_offs <= ggml_nbytes(view_src)

I suspect this is a result of it not understanding the template/stop tokens.

Steps to reproduce

Use the 7B model: https://huggingface.co/TheBloke/Llama-2-7B-GGUF

Run the following code:

import { LlamaModel, LlamaContext, LlamaChatSession, LlamaChatPromptWrapper } from "node-llama-cpp";

const modelPath = "llama-2-7b.Q4_K_M.gguf";

const model = new LlamaModel({ modelPath, gpuLayers: 64 });
const context = new LlamaContext({ model });
const session = new LlamaChatSession({ context, promptWrapper: new LlamaChatPromptWrapper() });


const q1 = "What is a llama?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

My Environment

| Dependency | Version |
| Operating System | Ubuntu 22.04 |
| Node.js version | 19.1.0 |
| Typescript version | 4.8.4 |
| node-llama-cpp version | 2.4.0 |

Additional Context

The GeneralChatPromptWrapper seems to work normally with the exception of adding "\n\n### :" to the stop tokens. Why does the general prompt wrapper work whereas the llama specific one doesn't? Is this an issue with the model file itself, e.g. bad conversion? Is there a better way to debug this?

Related: https://huggingface.co/TheBloke/Llama-2-7B-GGUF/discussions/1
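
For reference, a minimal sketch of the working configuration described above (reusing the context from the snippet in the steps to reproduce):

import { GeneralChatPromptWrapper } from "node-llama-cpp";

// works with this model, unlike LlamaChatPromptWrapper
const session = new LlamaChatSession({
  context,
  promptWrapper: new GeneralChatPromptWrapper()
});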

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

CLI does not work with Bun

Issue description

CLI does not work with Bun

Expected Behavior

bunx --no node-llama-cpp chat --model ./models/llama-2-7b.Q4_K_M.gguf

model should load correctly and enter a chat

Actual Behavior

bunx --no node-llama-cpp chat --model ./models/llama-2-7b.Q4_K_M.gguf

Returns: Not enough non-option arguments: got 0, need at least 1

Steps to reproduce

bun install node-llama-cpp
bunx --no node-llama-cpp chat --model ./models/llama-2-7b.Q4_K_M.gguf

My Environment

| Dependency | Version |
| Operating System | Ubuntu 22 |
| node-llama-cpp | ^2.8.5 |

Additional Context

It would be great if this module worked with Bun.

Bun is super fast and lets us also use CommonJS modules in the same script alongside ESM import-only modules like node-llama-cpp.

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

Node exits when n_tokens <= n_batch

Issue description

I have a webserver which is using node-llama-cpp under the hood. When I give short inputs, everything is fine, but if I enter too many tokens, llama.cpp errors with "n_tokens <= n_batch". The problem is not that it errors... the problem is that it effectively calls process.exit() and kills my entire webserver (next.js), instead of throwing an exception which I could catch.

Expected Behavior

When an error occurs (such as input exceeding the context length), it would be much better to throw an Error, rather than force exiting the node runtime. User input should never be able to terminate my server.

Actual Behavior

GGML_ASSERT: /home/platypii/code/node_modules/node-llama-cpp/llama/llama.cpp/llama.cpp:6219: n_tokens <= n_batch

And then the node process exits abruptly. It is not catchable.

Steps to reproduce

Use the basic node-llama-cpp usage example: create LlamaModel, create LlamaContext, create LlamaChatSession, call session.prompt with a long input (more than 512 tokens by default).

My Environment

| Dependency | Version |
| Operating System | Mac |
| CPU | Apple M3 |
| Node.js version | 20.9.0 |
| Typescript version | 5.2.2 |
| node-llama-cpp version | 2.8.3 |

Additional Context

Note: I can increase the batch/context size, that's not the issue. The issue is the node process exiting rather than throwing.

I tried to trace what was happening and add additional try/catch statements, but it appears to be happening inside the native addon.cpp implementation of eval. That's where I got stuck; I'm open to suggestions.
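
Until the assert is surfaced as a catchable error, a minimal sketch of a guard on the caller side (assuming the default batchSize of 512 mentioned above; context.encode() is the same call the library uses internally, and the prompt wrapper and system prompt add extra tokens on top, so the real limit is somewhat lower):

const batchSize = 512 // must match the batchSize the LlamaContext was created with
const tokens = context.encode(userInput)

if (tokens.length > batchSize) {
  // reject the request instead of letting the native assert kill the process
  throw new Error(`Input too long: ${tokens.length} tokens (batch size is ${batchSize})`)
}

const answer = await session.prompt(userInput, { onToken })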

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

Error when compiling with CUDA on Windows

Issue description

On Windows 10, when calling node-llama-cpp download --cuda, I receive an nvcc compiler error.

Expected Behavior

I expect the operation to complete successfully

Actual Behavior

The actual output of this command is below. Of note, nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified

Repo: ggerganov/llama.cpp
Release: b1204
CUDA: enabled

√ Fetched llama.cpp info
√ Removed existing llama.cpp directory
Cloning llama.cpp
Clone ggerganov/llama.cpp  100% ████████████████████████████████████████  0s
◷ Compiling llama.cpp
Not searching for unused variables given on the command line.
-- Selecting Windows SDK version 10.0.18362.0 to target Windows 10.0.19045.
-- The C compiler identification is MSVC 19.29.30151.0
-- The CXX compiler identification is MSVC 19.29.30151.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.40.1.windows.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Found CUDAToolkit: Y:/CUDA_12/include (found version "12.2.140")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: Y:/CUDA_12/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- x86 detected
-- Configuring done (15.7s)
-- Generating done (0.1s)
-- Build files have been written to: C:/Users/Ben/Development/test_app/node_modules/node-llama-cpp/llama/build
Microsoft (R) Build Engine version 16.11.2+f32259642 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.

  Checking Build System
  Generating build details from Git
  -- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.40.1.windows.1")
  Building Custom Rule C:/Users/Ben/Development/test_app/node_modules/node-llama-cpp/llama/llama.cpp/CMakeLists.txt
  Building Custom Rule C:/Users/Ben/Development/test_app/node_modules/node-llama-cpp/llama/llama.cpp/CMakeLists.txt
  Compiling CUDA source file ..\..\llama.cpp\ggml-cuda.cu...

  C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\build\llama.cpp>"Y:\CUDA_12\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64" -x cu   -I"C:\Users\User\Devel
  opment\test_app\node_modules\node-addon-api" -I"C:\Users\User\.cmake-js\node-x64\v19.9.0\include\node" -I"C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\llama.cpp\." -IY:\CUDA_12\include -IY:\CUDA_12\include     --keep-dir x64\Release  -maxrregcount=0
    --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,sm_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] /EHsc -Xcompiler="/EHsc -Ob2"   -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_K_
  QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_C
  UDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W3 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdggml.dir\Release\ggml.pdb" -o ggml.dir\Release\ggml-cuda.obj "C:\Users\User\Development\test_app\nod
  e_modules\node-llama-cpp\llama\llama.cpp\ggml-cuda.cu"
  nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\BuildCustomizations\CUDA 12.2.targets(799,9): error MSB3721: The command ""Y:\CUDA_12\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.
29.30133\bin\HostX64\x64" -x cu   -I"C:\Users\User\Development\test_app\node_modules\node-addon-api" -I"C:\Users\User\.cmake-js\node-x64\v19.9.0\include\node" -I"C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\llama.cpp\." -IY:\CUDA_12\include -IY:
\CUDA_12\include     --keep-dir x64\Release  -maxrregcount=0   --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,sm_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] /EHsc -Xcompiler="/EHsc
 -Ob2"   -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_
K_QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W3 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdggml.dir\Release\ggml.pdb" -o ggml.dir\Release\ggml-cuda.
obj "C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\llama.cpp\ggml-cuda.cu"" exited with code 1. [C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\build\llama.cpp\ggml.vcxproj]
Not searching for unused variables given on the command line.
-- Selecting Windows SDK version 10.0.18362.0 to target Windows 10.0.19045.
-- The C compiler identification is MSVC 19.29.30151.0
-- The CXX compiler identification is MSVC 19.29.30151.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.29.30133/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.40.1.windows.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Found CUDAToolkit: Y:/CUDA_12/include (found version "12.2.140")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: Y:/CUDA_12/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- x86 detected
-- Configuring done (15.9s)
-- Generating done (0.1s)
-- Build files have been written to: C:/Users/Ben/Development/test_app/node_modules/node-llama-cpp/llama/build
Microsoft (R) Build Engine version 16.11.2+f32259642 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.

  Checking Build System
  Generating build details from Git
  -- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.40.1.windows.1")
  Building Custom Rule C:/Users/Ben/Development/test_app/node_modules/node-llama-cpp/llama/llama.cpp/CMakeLists.txt
  Building Custom Rule C:/Users/Ben/Development/test_app/node_modules/node-llama-cpp/llama/llama.cpp/CMakeLists.txt
  Compiling CUDA source file ..\..\llama.cpp\ggml-cuda.cu...

  C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\build\llama.cpp>"Y:\CUDA_12\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64" -x cu   -I"C:\Users\User\Devel
  opment\test_app\node_modules\node-addon-api" -I"C:\Users\User\.cmake-js\node-x64\v19.9.0\include\node" -I"C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\llama.cpp\." -IY:\CUDA_12\include -IY:\CUDA_12\include     --keep-dir x64\Release  -maxrregcount=0
    --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,sm_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] /EHsc -Xcompiler="/EHsc -Ob2"   -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_K_
  QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_C
  UDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W3 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdggml.dir\Release\ggml.pdb" -o ggml.dir\Release\ggml-cuda.obj "C:\Users\User\Development\test_app\nod
  e_modules\node-llama-cpp\llama\llama.cpp\ggml-cuda.cu"
  nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\BuildCustomizations\CUDA 12.2.targets(799,9): error MSB3721: The command ""Y:\CUDA_12\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.
29.30133\bin\HostX64\x64" -x cu   -I"C:\Users\User\Development\test_app\node_modules\node-addon-api" -I"C:\Users\User\.cmake-js\node-x64\v19.9.0\include\node" -I"C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\llama.cpp\." -IY:\CUDA_12\include -IY:
\CUDA_12\include     --keep-dir x64\Release  -maxrregcount=0   --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,sm_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] /EHsc -Xcompiler="/EHsc
 -Ob2"   -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DNAPI_VERSION=7 -DGGML_USE_
K_QUANTS -DGGML_USE_CUBLAS -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W3 /nologo /O2 /FS   /MD /GR" -Xcompiler "/Fdggml.dir\Release\ggml.pdb" -o ggml.dir\Release\ggml-cuda.
obj "C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\llama.cpp\ggml-cuda.cu"" exited with code 1. [C:\Users\User\Development\test_app\node_modules\node-llama-cpp\llama\build\llama.cpp\ggml.vcxproj]
ERR! OMG Process terminated: 1

Steps to reproduce

Run node-llama-cpp download --cuda on Windows 10 with CUDA v12

My Environment

| Dependency | Version |
| Operating System | Windows 10 |
| Node.js version | 19.9.0 |
| node-llama-cpp version | 2.4.0 |

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

A way to get the answer while it's generating.

Feature Description

I'm generating answers from an AI model and the generation time is pretty slow, so I'd like to receive the answer as it is being generated by llama.cpp. I'm building a web chat, so streaming the output would make it much better.

The Solution

It'd be handy to use on() events.

Considered Alternatives

I would then handle the incoming chunks with React state. Events/emitters would do the job, or maybe another function instead of "prompt" that takes a callback? (A sketch of the callback idea follows below.)
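
For illustration, here is a minimal sketch of that callback idea. It assumes the onToken option that recent node-llama-cpp versions accept on session.prompt(); treat the exact option name, the chunk shape, and the model path placeholder as assumptions rather than confirmed API details:

import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const model = new LlamaModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = new LlamaContext({model});
const session = new LlamaChatSession({context});

// Stream the answer as it is generated: onToken receives token chunks,
// which we decode through the context and forward immediately
// (e.g. over a WebSocket to the web chat UI instead of stdout).
const answer = await session.prompt("Hi there, how are you?", {
    onToken(chunk) {
        process.stdout.write(context.decode(chunk));
    }
});

console.log("\nFull answer: " + answer);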

Additional Context

I'd also like to know which models are lightweight: I tried LLaMA 2 (the 7B version) and even on an i5-12600K it's pretty slow. Do you know of any models that are faster?

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

OMG Thank you

Feature Description

This is a gratitude Issue

The Solution

X

Considered Alternatives

X

Additional Context

Thank you so much for creating this ❤️

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

No CUDA toolset found error even though CUDAToolkit is found

Issue description

When I try to build for CUDA, the build fails. It reports that cuBLAS and the CUDA toolkit were found, but it still fails with "No CUDA toolset found". I have added all the necessary environment variables, but it still fails.

Expected Behavior

I am using CUDA Toolkit 12.2 and Visual Studio 2022 Community Edition. When I try to build, it is supposed to succeed, but it shows an error instead.

Expecting the build to succeed.

Actual Behavior

It shows an error like the one below when using this command: npx --no node-llama-cpp download --cuda
Environment variable: NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_GENERATOR_TOOLSET=v143
I think this environment variable may be the problematic one. The value above is the one Visual Studio uses, and the build only works when I set it like this, but I don't know whether the same value applies when CMake builds.

-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.2/include (found version "12.2.91")
-- cuBLAS found
CMake Error at xpack/store/@xpack-dev-tools/cmake/3.26.5-1.1/.content/share/cmake-3.26/Modules/CMakeDetermineCompilerId.cmake:501 (message):
No CUDA toolset found.
Call Stack (most recent call first):
xpack/store/@xpack-dev-tools/cmake/3.26.5-1.1/.content/share/cmake-3.26/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)

Steps to reproduce

Just run npx --no node-llama-cpp download --cuda and the error happens.

My Environment

| Dependency | Version |
| Operating System | Windows |
| CPU | Intel i9-13900K |
| Node.js version | x.y.zzz |
| Typescript version | x.y.zzz |
| node-llama-cpp version | 3.0.0-beta.1 |

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

Raw approach runs into odd response

I tried running the raw code, but the system appeared to freeze, so I added some logging. Here is the code I tried:

import {LlamaModel, LlamaContext} from "node-llama-cpp";

const model = new LlamaModel({
    modelPath: "/Users/nigel/git/llama2/llama.cpp/models/7B/ggml-model-q4_0.bin"
});

const context = new LlamaContext({model});

const q1 = "Hi there, how are you?";
console.log("You: " + q1);

const tokens = context.encode(q1);
const res = [];
for await (const chunk of context.evaluate(tokens)) {
    console.log('got chunk');
    res.push(chunk);

    // it's important not to concatenate the results as strings,
    // as doing so will break some characters (like some emojis) that are made of multiple tokens.
    // by using an array of tokens, we can decode them correctly together.
    const resString = context.decode(Uint32Array.from(res));
    console.log('chunk' + resString);
    const lastPart = resString.split("ASSISTANT:").reverse()[0];
    if (lastPart.includes("USER:"))
        break;
}

const a1 = context.decode(Uint32Array.from(res)).split("USER:")[0];
console.log("AI: " + a1);

The expected result was something like "I hope you are doing well.", but it actually produced, after multiple iterations:

got chunk
chunk I hope you are doing well. Unterscheidung von „Gesundheit“ und „Gesundheitssystem“ – eine kritische Analyse.
The healthcare system is a complex and multifaceted system that is constantly evolving. It is important to understand the different components of the healthcare system and how they work together to provide quality care to patients.
The healthcare system is made up of a variety of different components, including hospitals, clinics, pharmacies, and insurance companies. Each of these components plays a vital role in the overall functioning of the healthcare system.
Hospitals are the primary providers of healthcare services. They are responsible for providing inpatient and outpatient care to patients. Hospitals also provide a variety of other services, such as laboratory testing, radiology, and pharmacy services.
Clinics are smaller healthcare facilities that provide outpatient care to patients. Clinics are typically staffed by a team of doctors, nurses, and other healthcare professionals. Clinics may specialize in a particular area of medicine, such as cardiology or oncology.
Pharmacies are responsible for dispensing prescription medications to patients. Pharmacies also provide a variety of other services, such as counseling patients on the proper use of their medications and providing information on potential side effects.
Insurance companies are responsible 

It would have carried on but I was forced to ^C the code at that point.
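
As a purely illustrative variation, the same raw loop can be bounded so a runaway generation like the one above cannot hang the process. This sketch only reuses calls already shown in the snippet; the model path placeholder and the cap of 128 tokens are arbitrary assumptions:

import {LlamaModel, LlamaContext} from "node-llama-cpp";

const model = new LlamaModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = new LlamaContext({model});

const tokens = context.encode("Hi there, how are you?");
const maxTokens = 128; // arbitrary cap so generation cannot run away
const res = [];

for await (const chunk of context.evaluate(tokens)) {
    res.push(chunk);
    if (res.length >= maxTokens)
        break;

    // decode the accumulated tokens and stop once the model starts a new "USER:" turn
    const resString = context.decode(Uint32Array.from(res));
    const lastPart = resString.split("ASSISTANT:").reverse()[0];
    if (lastPart.includes("USER:"))
        break;
}

console.log("AI: " + context.decode(Uint32Array.from(res)));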

Other people's GGUF files cannot run

Issue description

error loading model: unknown model architecture: ''

Expected Behavior

While I was browsing Reddit, I found a GGUF model of phi-2 made by someone else, but when I download the model from the repository, it cannot run.

The Reddit page: https://www.reddit.com/r/LocalLLaMA/comments/18hnhd6/tutorial_how_to_run_phi2_locally_or_on_colab_for/?rdt=58275

The repository address: https://huggingface.co/radames/phi-2-quantized/tree/main

When I run npx --no node-llama-cpp chat --model ./models/model-v2-qk4.gguf or npx --no node-llama-cpp chat --model ./models/model-v2-q80.gguf, the error appears.

Actual Behavior

[screenshot of the terminal output]

In this screenshot, the mistral-xxx model runs successfully; it is from TheBloke.

Steps to reproduce

[screenshot of the commands being run]

I just ran the simple commands from the getting-started part, namely
npx --no node-llama-cpp chat --model
and
node index.js (using the code from the getting-started part)

My Environment

| Dependency | Version |
| Operating System | |
| CPU | Apple M1 |
| Node.js version | 18.16.0 |
| Typescript version | (using JavaScript) |
| node-llama-cpp version | 2.8.2 |

Additional Context

I don't know how to distinguish which models can be run; it's hard to tell.

Thank you in advance.

Relevant Features Used

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

Bun support

Feature Description

Currently, running the library using Bun throws an error:

[screenshot: the error thrown when running the library under Bun]

Is this Bun's fault or something else?

The Solution

It should run the same as it does under Node.js.

Considered Alternatives

Well, don't use Bun?

Additional Context

No response

Related Features to This Feature Request

  • Metal support
  • CUDA support
  • Grammar

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.

Error: Command npm run -s node-gyp-llama -- configure --arch=arm64 --target=v18.0.0 exited with code 1

I'm running the example script provided in the README.md using a local/offline copy of this library (which should work fine). I get this error the first time I call the script. I'm not using any specific environment; llama.cpp has been downloaded into the bindings folder under llama/llama.cpp:

.
├── example.js
├── lib
│   ├── AbortError.d.ts
│   ├── AbortError.js
│   ├── AbortError.js.map
│   ├── ChatPromptWrapper.d.ts
│   ├── ChatPromptWrapper.js
│   ├── ChatPromptWrapper.js.map
│   ├── chatWrappers
│   ├── cli
│   ├── commands.d.ts
│   ├── commands.js
│   ├── commands.js.map
│   ├── config.d.ts
│   ├── config.js
│   ├── config.js.map
│   ├── index.d.ts
│   ├── index.js
│   ├── index.js.map
│   ├── llamaEvaluator
│   ├── package.json
│   ├── types.d.ts
│   ├── types.js
│   ├── types.js.map
│   └── utils
└── llama
    ├── addon.cpp
    ├── binariesGithubRelease.json
    ├── binding.gyp
    ├── llama.cpp
    └── usedBin.json

The example script was

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "./lib/index.js";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "LaMA-2-7B-32K_GGUF", "LLaMA-2-7B-32K-Q3_K_L.gguf")
});
const context = new LlamaContext({model});
const session = new LlamaChatSession({context});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);
