Comments (2)
Right now the rpc-server can serve only one client at a time (I should add this to the README). I prefer to keep it that way because the code stays simple: we don't have to deal with multiple threads, synchronization, etc. Users can still run multiple instances against the same backend and overcommit backend memory if that is what they want.
from llama.cpp.
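The one-client-at-a-time design described above can be sketched as a plain blocking accept loop: each connection is served to completion before the next one is accepted, so no threads or locks are needed. The sketch below is an illustration of that pattern, not llama.cpp's actual code; `FakeListener`, `accept_next`, and `run_server` are hypothetical stand-ins for the real socket handling.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for a listening socket: accept_next() "blocks" until the
// next client connects; here it just pops a queue of client names.
struct FakeListener {
    std::vector<std::string> pending;   // queued clients
    size_t next = 0;
    const std::string* accept_next() {
        return next < pending.size() ? &pending[next++] : nullptr;
    }
};

// Single-client serve loop: the body runs to completion before the
// next accept, so requests from different clients never interleave
// and no synchronization is required.
std::vector<std::string> run_server(FakeListener& l) {
    std::vector<std::string> served;
    while (const std::string* c = l.accept_next()) {
        served.push_back(*c);           // real serve() would go here
    }
    return served;
}
```

Overcommitting is then just a deployment choice: start several such server processes against the same backend, each with its own loop.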
One other thought, maybe off topic: should we make a dedicated session-like structure for each client to organize the resources allocated by a connection? We could also hold the ggml_backend_t there.
from llama.cpp.
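The session idea above could look roughly like the following: one object owns everything allocated for a connection (the socket, and the `ggml_backend_t` handle), so teardown is a single destructor call. This is a hypothetical sketch, not an existing llama.cpp type; `rpc_session` is an invented name, and `dummy_backend` stands in for the real `ggml_backend_t` so the example is self-contained.

```cpp
#include <cassert>

// Placeholder for the real ggml backend handle, so the sketch compiles
// standalone; in llama.cpp this would be the actual ggml_backend_t.
struct dummy_backend { bool freed = false; };
using ggml_backend_t = dummy_backend*;

// Hypothetical per-client session: all resources allocated by one
// connection live here, so dropping the session releases everything.
struct rpc_session {
    int            sockfd;    // connection socket
    ggml_backend_t backend;   // backend owned by this client

    rpc_session(int fd, ggml_backend_t b) : sockfd(fd), backend(b) {}
    ~rpc_session() {
        backend->freed = true; // stands in for ggml_backend_free(backend)
    }
};
```

With a structure like this, serving a client is "construct session, run the request loop, let the session go out of scope", which keeps resource cleanup in one place even in the current single-client design.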
Related Issues (20)
- Bug: --chat-template seems to be broken now, no way to truly chat from the llama-cli HOT 3
- Bug: LoRA Finetuning fails for GPU offloading
- Bug: brew install on a Mac HOT 1
- Bug: Persistent hallucination even after re-running llama.cpp HOT 4
- win7 failed HOT 1
- Bug: JSON Schema - enum behind a $ref generates an object with unrestricted properties HOT 3
- Bug: llama-server crashes when started with --embeddings HOT 6
- Bug: similar sizes suggest some heavy shared component in all 38 `llama-*` binaries (which now weigh 14 GB in total) HOT 5
- [feature request] conversion to gguf in a more pure form. HOT 2
- Vulkan backend regression: gibberish output when layers offloaded to GPU HOT 2
- Bug: Cannot load GGUF file, it asks if it is GGML. HOT 1
- Bug: Crashes at the end of startup during first prompt processing HOT 23
- Bug: llama.cpp apparently exits with '[end of text]' before processing prompt if prompt is ~2048 tokens
- Add Support for Bamboo LLM
- Bug: ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 137438953504 HOT 2
- sh: 1: ./llama.cpp/llama-quantize: not found HOT 2
- Bug: abort on Android (pixel 8 pro) HOT 1
- Bug: [RPC] RPC apparently isn't honoring backend memory capacity et. al. HOT 3
- Feature Request: Provide means to quantify the restriction of RAM/VRAM usage for each GPU and system RAM.
- Feature Request: It would be convenient and faster if users could specify that the model data used for a RPC-server instance is already available by some fast(er) means (file system GGUF, whatever). HOT 1