Comments (1)
Managing a cache long term for a project or product is a nontrivial technical commitment: it requires policies and configuration parameters for size and history limits, code to manage the cached data, APIs to access it, and often, eventually (in the lifetime of a project), offloading the cache and passing configuration through to alternate plugin implementations.
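As a rough illustration of the kind of policy machinery described above, a size-limited cache with least-recently-used eviction might be sketched as follows (the `BoundedCache` name and its API are hypothetical, not anything in llama.cpp):

```python
from collections import OrderedDict

class BoundedCache:
    """Cache with a configurable entry limit; evicts least-recently-used items."""

    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self._store = OrderedDict()  # preserves insertion/recency order

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # drop the least recently used entry

# With a limit of 2, inserting a third entry evicts the oldest one.
cache = BoundedCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)  # "a" is evicted here
```

Even this toy version shows why the commitment is nontrivial: the eviction policy, the size limit, and the access API are all decisions that must be configured, documented, and maintained.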
Scripts, client apps, browsers, and middleware can already cache all or any part of LLM chats in many ways: logs, flat files, databases, in-memory stores, and so on. It is arguably a separate domain of concerns that could overcomplicate llama.cpp, and it could be a separate product.
from llama.cpp.
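To make the "flat files" option above concrete, here is a minimal sketch of client-side chat caching, assuming an on-disk JSON file is acceptable; `ChatCache` and its file name are illustrative and not part of llama.cpp or its server API:

```python
import hashlib
import json
from pathlib import Path

class ChatCache:
    """Minimal flat-file cache for chat replies, keyed by a hash of the messages."""

    def __init__(self, path="chat_cache.json"):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def _key(self, messages):
        # Hash the serialized messages so the key is stable across runs.
        blob = json.dumps(messages, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get(self, messages):
        return self.data.get(self._key(messages))

    def put(self, messages, reply):
        self.data[self._key(messages)] = reply
        self.path.write_text(json.dumps(self.data))

# Usage: check the cache before calling the model, store the reply afterwards.
cache = ChatCache()
msgs = [{"role": "user", "content": "hello"}]
if cache.get(msgs) is None:
    reply = "hi there"  # in practice, this would come from the model server
    cache.put(msgs, reply)
print(cache.get(msgs))  # prints "hi there"
```

Because this lives entirely in the client, the model runtime needs no cache policy of its own, which is the comment's point about keeping the concern outside llama.cpp.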
Related Issues (20)
- Converting finetune LLaVA model to gguf but while debugging getting result = self.mapping.get(key[:-len(suffix)]) as None
- No successful releases from CI in the last 2 days. HOT 5
- iGPU offloading Bug: Memory access fault by GPU node-1 (appeared once only)
- Research: Im writing a paper on our medical finetuned llava-v1.6, HOT 2
- Bug: Server ends up in infinite loop if number of requests in the batch is greater than parallel slots with system prompt HOT 2
- Refactor: Formalise Keys.General GGUF KV Store HOT 10
- Bug: embeddings endpoint broken HOT 1
- Bug: server /completion endpoint no longer accepts numeric tokens HOT 2
- Bug: Possible precision loss when using KV cache HOT 2
- Bug: 'scripts/run-with-preset.py` fails on `--tensor-split` option when run on non-GPU-enabled system HOT 2
- Bug: I use llama-b3091-bin-win-llvm-arm64.zip Run qwen2-0_5b-instruct-q8_0.gguf and it cannot start. Is it a compilation error of llama-b3091-bin-win-llvm-arm64.zip?
- Bug: Random output after the last update HOT 1
- Feature Request: Add Paligemma support HOT 1
- Bug: multithreading for requests,model infer service failed
- Bug: GGML_ASSERT: ggml.c:12793: ne2 == ne02 zsh: abort ./finetune --model-base --train-data ./Llama3-8B-Chinese-Chat-fintune/111.tx
- Bug: get-wikitext-103.sh seems not working HOT 2
- Bug: apparent pre-tokenization hash collision in get_vocab_base_pre() function in convert-hf-to-gguf.py script? Llama3 8B is mapped to smaug-bpe instead of llama-bpe
- Bug: CUDA error: out of memory - Phi-3 Mini 128k prompted with 20k+ tokens on 4GB GPU HOT 16
- Support for MatMul free LLMs
- ci : self-hosted runner issue