Comments (2)
Right now the rpc-server can serve only one client at a time (I should add this to the README). I prefer to keep it that way because the code stays simple: we don't have to deal with multiple threads, synchronization, etc. Users can still run multiple instances against the same backend and overcommit backend memory if that is what they want.
from llama.cpp.
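The one-client-at-a-time design described above can be sketched as a plain blocking accept loop: each connection is served to completion before the next one is accepted, so no threads or locks are needed. The sketch below is an illustration of that pattern, not llama.cpp's actual code; `FakeListener`, `accept_next`, and `run_server` are hypothetical stand-ins for the real socket handling.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for a listening socket: accept_next() "blocks" until the
// next client connects; here it just pops a queue of client names.
struct FakeListener {
    std::vector<std::string> pending;   // queued clients
    size_t next = 0;
    const std::string* accept_next() {
        return next < pending.size() ? &pending[next++] : nullptr;
    }
};

// Single-client serve loop: the body runs to completion before the
// next accept, so requests from different clients never interleave
// and no synchronization is required.
std::vector<std::string> run_server(FakeListener& l) {
    std::vector<std::string> served;
    while (const std::string* c = l.accept_next()) {
        served.push_back(*c);           // real serve() would go here
    }
    return served;
}
```

Overcommitting is then just a deployment choice: start several such server processes against the same backend, each with its own loop.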
One other thought, maybe off topic: should we make a dedicated session-like structure for each client to organize the resources allocated by a connection? We could also hold the ggml_backend_t there.
from llama.cpp.
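The session idea above could look roughly like the following: one object owns everything allocated for a connection (the socket, and the `ggml_backend_t` handle), so teardown is a single destructor call. This is a hypothetical sketch, not an existing llama.cpp type; `rpc_session` is an invented name, and `dummy_backend` stands in for the real `ggml_backend_t` so the example is self-contained.

```cpp
#include <cassert>

// Placeholder for the real ggml backend handle, so the sketch compiles
// standalone; in llama.cpp this would be the actual ggml_backend_t.
struct dummy_backend { bool freed = false; };
using ggml_backend_t = dummy_backend*;

// Hypothetical per-client session: all resources allocated by one
// connection live here, so dropping the session releases everything.
struct rpc_session {
    int            sockfd;    // connection socket
    ggml_backend_t backend;   // backend owned by this client

    rpc_session(int fd, ggml_backend_t b) : sockfd(fd), backend(b) {}
    ~rpc_session() {
        backend->freed = true; // stands in for ggml_backend_free(backend)
    }
};
```

With a structure like this, serving a client is "construct session, run the request loop, let the session go out of scope", which keeps resource cleanup in one place even in the current single-client design.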
Related Issues (20)
- Bug: --chat-template seems to be broken now, no way to truly chat from the llama-cli HOT 3
- Bug: LoRA Finetuning fails for GPU offloading
- Bug: brew install on a Mac HOT 1
- Bug: Persistent hallucination even after re-running llama.cpp HOT 4
- win7 failed HOT 1
- Bug: JSON Schema - enum behind a $ref generates an object with unrestricted properties HOT 3
- Bug: llama-server crashes when started with --embeddings HOT 6
- Bug: similar sizes suggest some heavy shared component in all 38 `llama-*` binaries (which now weigh 14 GB in total) HOT 5
- [feature request] conversion to gguf in a more pure form. HOT 2
- Vulkan backend regression: gibberish output when layers offloaded to GPU HOT 2
- Bug: Cannot load GGUF file, it asks if it is GGML. HOT 1
- Bug: Crashes at the end of startup during first prompt processing HOT 23
- Bug: llama.cpp apparently exits with '[end of text]' before processing prompt if prompt is ~2048 tokens
- Add Support for Bamboo LLM
- Bug: ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 137438953504 HOT 2
- sh: 1: ./llama.cpp/llama-quantize: not found HOT 2
- Bug: abort on Android (pixel 8 pro) HOT 1
- Bug: [RPC] RPC apparently isn't honoring backend memory capacity et. al. HOT 3
- Feature Request: Provide means to quantify the restriction of RAM/VRAM usage for each GPU and system RAM.
- Feature Request: It would be convenient and faster if users could specify that the model data used for a RPC-server instance is already available by some fast(er) means (file system GGUF, whatever). HOT 1