Comments (24)
This issue is definitely on my radar. At the moment, the Ollama embeddings API can only handle 1 request at a time. It's a known issue and I believe it's being prioritized (see ollama/ollama#358).
Some kind of progress indicator would be nice. Material UI has a Linear determinate progress bar component: https://mui.com/material-ui/react-progress/#linear-determinate. This should be pretty easy to implement (i.e. message passing between background script and main extension).
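For illustration, a rough sketch of the message passing this would involve (not the actual Lumos wiring; the "embedding-progress" message type and embedChunk helper are made up):
// Background script: report progress while looping over the chunks to embed.
for (let i = 0; i < documents.length; i++) {
  await embedChunk(documents[i]); // hypothetical per-chunk embedding call
  chrome.runtime.sendMessage({
    type: "embedding-progress",
    value: Math.round(((i + 1) / documents.length) * 100),
  });
}
// Extension UI: listen for the messages and drive the MUI progress bar.
// chrome.runtime.onMessage.addListener((msg) => {
//   if (msg.type === "embedding-progress") setProgress(msg.value);
// });
// <LinearProgress variant="determinate" value={progress} />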
Separately, I'm working on a small refactor to make things a little bit faster: #52
const vectorStore = await MemoryVectorStore.fromDocuments(
  documents,
  new OllamaEmbeddings({
    baseUrl: OLLAMA_BASE_URL,
    model: OLLAMA_MODEL,
  }),
);
I think fromDocuments/OllamaEmbeddings runs serially anyway, so it may need some wrapper beyond the server proxy.
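For example (a sketch only, untested), one could subclass OllamaEmbeddings and override embedDocuments so the per-chunk requests run concurrently instead of being awaited one by one:
// Import path varies by langchain version; this assumes @langchain/community.
const { OllamaEmbeddings } = require("@langchain/community/embeddings/ollama");

class ParallelOllamaEmbeddings extends OllamaEmbeddings {
  async embedDocuments(documents) {
    // Fire all embedding requests up front and await them together.
    return Promise.all(documents.map((doc) => this.embedQuery(doc)));
  }
}
Throughput would still be bounded by however many requests the server (or proxy pool) can actually service at once.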
Probably worth investigating yourself.
I really like the magic / "just works" aspect of using RAG/embeddings, but it would be nice if it were a bit faster somehow.
I mean for cloud, it definitely would help.
I just realized that's for OpenAI / Cohere
Have been pondering an IPFS global shared embeddings cache: the great big CAS in the sky.
It seems so wasteful for everyone to be doing this separately.
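Roughly the idea, as a sketch (the key scheme here is just an assumption): derive a content address from (embedding model, chunk text) and check the shared cache before embedding locally.
const { createHash } = require("crypto");

// Identical public content embedded with the same model maps to the same key.
function embeddingCacheKey(model, chunkText) {
  return createHash("sha256").update(`${model}\n${chunkText}`).digest("hex");
}

// e.g. look this key up in IPFS (or any CAS) first, and only fall back to a
// local Ollama embedding call on a cache miss.
console.log(embeddingCacheKey("nomic-embed-text", "some page chunk..."));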
I assume you saw this:
https://www.youtube.com/watch?v=Ml179HQoy9o
I wonder if one could make a little Node.js prototype server for this that simply wraps/proxies to multiple llama.cpp instances using mmap mode. It would be fun to see it purring and get an inkling of future performance.
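For what it's worth, here's roughly what I have in mind (a minimal sketch in plain Node.js, round-robin over hypothetical backend ports; not Lumos or Ollama code):
const http = require("http");

const backends = [11435, 11436, 11437]; // hypothetical instance ports
let next = 0;

http
  .createServer((req, res) => {
    // Pick the next backend round-robin and forward the request verbatim.
    const port = backends[next++ % backends.length];
    const upstream = http.request(
      { host: "127.0.0.1", port, path: req.url, method: req.method, headers: req.headers },
      (upstreamRes) => {
        res.writeHead(upstreamRes.statusCode, upstreamRes.headers);
        upstreamRes.pipe(res);
      }
    );
    req.pipe(upstream); // no error handling, just a toy
  })
  .listen(11434); // expose the default Ollama port to clients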
Looks like someone made an Ollama proxy server:
https://github.com/ParisNeo/ollama_proxy_server
I've never tinkered much with the Ollama memory settings (so job well done, Ollama team).
Can it use mmap?
Seems it can use mmap. At least for /api/generate there is a use_mmap option.
You can start multiple instances of Ollama by setting the OLLAMA_HOST environment variable, e.g.:
OLLAMA_HOST=0.0.0.0:11435 ollama serve
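On the use_mmap point, I believe it can be passed per request via the options field, e.g. (a sketch; model name and port are just examples):
await fetch("http://localhost:11435/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama2",             // example model
    prompt: "hello",
    stream: false,
    options: { use_mmap: true }, // per-request runner option
  }),
});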
Don't know if there's "anything on the table" regardless
Apparently the embeddings don't use the entire weights, so maybe there's a way. I'm very fuzzy on how those are created.
I patched the proxy server to allow CORS, but I am not having much luck, and suddenly it's giving me crazy Yoda responses :)
Maybe there's a way to start a pool of Ollama instances used just for embedding, which never load the full model weights or allocate too many buffers (of course mmap doesn't solve everything).
Proxy logs from running embedding requests through it:
[GIN] 2024/01/31 - 10:15:31 | 200 | 1.460126417s | 127.0.0.1 | POST "/api/embeddings"
127.0.0.1 - - [31/Jan/2024 10:15:31] "POST /api/embeddings HTTP/1.1" 200 -
127.0.0.1 - - [31/Jan/2024 10:15:31] "POST /api/embeddings HTTP/1.1" - -
[GIN] 2024/01/31 - 10:15:31 | 200 | 1.462187292s | 127.0.0.1 | POST "/api/embeddings"
127.0.0.1 - - [31/Jan/2024 10:15:31] "POST /api/embeddings HTTP/1.1" 200 -
127.0.0.1 - - [31/Jan/2024 10:15:31] "POST /api/embeddings HTTP/1.1" - -
[GIN] 2024/01/31 - 10:15:31 | 200 | 1.418453666s | 127.0.0.1 | POST "/api/embeddings"
127.0.0.1 - - [31/Jan/2024 10:15:31] "POST /api/embeddings HTTP/1.1" 200 -
127.0.0.1 - - [31/Jan/2024 10:15:31] "POST /api/embeddings HTTP/1.1" - -
[GIN] 2024/01/31 - 10:15:31 | 200 | 1.458442375s | 127.0.0.1 | POST "/api/embeddings"
127.0.0.1 - - [31/Jan/2024 10:15:31] "POST /api/embeddings HTTP/1.1" 200 -
127.0.0.1 - - [31/Jan/2024 10:15:31] "POST /api/embeddings HTTP/1.1" - -
[GIN] 2024/01/31 - 10:15:32 | 200 | 1.416166584s | 127.0.0.1 | POST "/api/embeddings"
127.0.0.1 - - [31/Jan/2024 10:15:32] "POST /api/embeddings HTTP/1.1" 200 -
127.0.0.1 - - [31/Jan/2024 10:15:32] "POST /api/embeddings HTTP/1.1" - -
[GIN] 2024/01/31 - 10:15:32 | 200 | 1.461118458s | 127.0.0.1 | POST "/api/embeddings"
127.0.0.1 - - [31/Jan/2024 10:15:32] "POST /api/embeddings HTTP/1.1" 200 -
127.0.0.1 - - [31/Jan/2024 10:15:32] "POST /api/embeddings HTTP/1.1" - -
[GIN] 2024/01/31 - 10:15:32 | 200 | 1.460722667s | 127.0.0.1 | POST "/api/embeddings"
127.0.0.1 - - [31/Jan/2024 10:15:32] "POST /api/embeddings HTTP/1.1" 200 -
127.0.0.1 - - [31/Jan/2024 10:15:32] "POST /api/embeddings HTTP/1.1" - -
[GIN] 2024/01/31 - 10:15:32 | 200 | 1.50568075s | 127.0.0.1 | POST "/api/embeddings"
Hacked the Ollama LangChain code to do things in parallel and sadly I'm not really getting any speedup. Maybe it's due to the thread settings for the servers, or maybe it's already pretty efficient at using the available cores?
I was getting around 250 ms per embed sequentially, which went up to around 1.5 s per embed in parallel (with N requests in flight, 1.5 s per request amortizes to roughly 1.5/N seconds per embed, so any real win depends on how many actually run concurrently).
Maybe there is a tiny speedup, shrug, but seemingly not a low-hanging-fruit substantial one.
@sublimator, thanks for testing out all of that stuff! For reference, how large is the content in your testing (i.e. how much text is on the page)? And about how many embeds before 250ms goes to 1.5s? In my testing, I didn't observe an increase in latency; it seems to be constant throughout the entire sequence of embeds.
If there's a comparable Wikipedia article, we can both work off of that for testing.
If there's a comparable Wikipedia article,
how large is the content in your testing
I was kind of using random pages
it seems to be constant throughout the entire sequence of embeds.
Did you use parallel processing though? By default the OllamaEmbeddings class does the requests sequentially. Let me dig it back up.
I hacked the OllamaEmbeddings class (just the compiled code in node_modules):
async _embed(strings) {
  console.log('hack is working!!');
  const embeddings = [];
  for await (const prompt of strings) {
    // Fire the request without awaiting it here, so requests run concurrently.
    const embedding = this.caller.call(() => this._request(prompt));
    embeddings.push(embedding);
  }
  // Wait for all in-flight requests at once.
  return await Promise.all(embeddings);
}

async embedDocuments(documents) {
  return this._embed(documents);
}
Maybe we should start a branch if we want to look at this seriously, but anyway, here are some more artifacts from my earlier investigations.
The ini file I used:
[DefaultServer]
url = http://localhost:11442
queue_size = 1
[SecondaryServer]
url = http://localhost:11435
queue_size = 1
[SecondaryServer1]
url = http://localhost:11436
queue_size = 2
[SecondaryServer2]
url = http://localhost:11437
queue_size = 3
[SecondaryServer3]
url = http://localhost:11438
queue_size = 4
[SecondaryServer4]
url = http://localhost:11439
queue_size = 5
[SecondaryServer5]
url = http://localhost:11440
queue_size = 6
[SecondaryServer6]
url = http://localhost:11441
queue_size = 7
Proxy server hacks (I might not have been using the right Python version, but all I wanted to change was the CORS stuff):
diff --git a/ollama_proxy_server/main.py b/ollama_proxy_server/main.py
index 87944d0..5f53a8e 100644
--- a/ollama_proxy_server/main.py
+++ b/ollama_proxy_server/main.py
@@ -47,9 +47,9 @@ def main():
parser.add_argument('--log_path', default="access_log.txt", help='Path to the access log file')
parser.add_argument('--users_list', default="authorized_users.txt", help='Path to the config file')
parser.add_argument('--port', type=int, default=8000, help='Port number for the server')
- parser.add_argument('-d', '--deactivate_security', action='store_true', const=True, default=False, help='Deactivates security')
+ parser.add_argument('-d', '--deactivate_security', action='store_true', default=False, help='Deactivates security')
args = parser.parse_args()
- servers = get_config(args.config)
+ servers = get_config(args.config)
authorized_users = get_authorized_users(args.users_list)
deactivate_security = args.deactivate_security
print("Ollama Proxy server")
@@ -58,22 +58,25 @@ def main():
class RequestHandler(BaseHTTPRequestHandler):
def add_access_log_entry(self, event, user, ip_address, access, server, nb_queued_requests_on_server, error=""):
log_file_path = Path(args.log_path)
-
+
if not log_file_path.exists():
with open(log_file_path, mode='w', newline='') as csvfile:
fieldnames = ['time_stamp', 'event', 'user_name', 'ip_address', 'access', 'server', 'nb_queued_requests_on_server', 'error']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
-
+
with open(log_file_path, mode='a', newline='') as csvfile:
fieldnames = ['time_stamp', 'event', 'user_name', 'ip_address', 'access', 'server', 'nb_queued_requests_on_server', 'error']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
row = {'time_stamp': str(datetime.datetime.now()), 'event':event, 'user_name': user, 'ip_address': ip_address, 'access': access, 'server': server, 'nb_queued_requests_on_server': nb_queued_requests_on_server, 'error': error}
writer.writerow(row)
-
+
def _send_response(self, response):
self.send_response(response.status_code)
self.send_header('Content-type', response.headers['content-type'])
+ self.send_header('Access-Control-Allow-Origin', '*')
+ self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
+ self.send_header('Access-Control-Allow-Headers', 'Authorization, Content-Type')
self.end_headers()
self.wfile.write(response.content)
@@ -81,6 +84,13 @@ def main():
self.log_request()
self.proxy()
+ def do_OPTIONS(self):
+ self.send_response(200, "ok")
+ self.send_header('Access-Control-Allow-Origin', '*')
+ self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
+ self.send_header('Access-Control-Allow-Headers', 'Authorization, Content-Type')
+ self.end_headers()
+
def do_POST(self):
self.log_request()
self.proxy()
@@ -93,7 +103,7 @@ def main():
return False
token = auth_header.split(' ')[1]
user, key = token.split(':')
-
+
# Check if the user and key are in the list of authorized users
if authorized_users.get(user) == key:
self.user = user
@@ -112,11 +122,11 @@ def main():
if not auth_header or not auth_header.startswith('Bearer '):
self.add_access_log_entry(event='rejected', user="unknown", ip_address=client_ip, access="Denied", server="None", nb_queued_requests_on_server=-1, error="Authentication failed")
else:
- token = auth_header.split(' ')[1]
+ token = auth_header.split(' ')[1]
self.add_access_log_entry(event='rejected', user=token, ip_address=client_ip, access="Denied", server="None", nb_queued_requests_on_server=-1, error="Authentication failed")
self.send_response(403)
self.end_headers()
- return
+ return
url = urlparse(self.path)
path = url.path
get_params = parse_qs(url.query) or {}
@@ -141,16 +151,17 @@ def main():
if path == '/api/generate':
que = min_queued_server[1]['queue']
client_ip, client_port = self.client_address
+ self.user = 'unknown'
self.add_access_log_entry(event="gen_request", user=self.user, ip_address=client_ip, access="Authorized", server=min_queued_server[0], nb_queued_requests_on_server=que.qsize())
que.put_nowait(1)
try:
response = requests.request(self.command, min_queued_server[1]['url'] + path, params=get_params, data=post_params)
self._send_response(response)
except Exception as ex:
- self.add_access_log_entry(event="gen_error",user=self.user, ip_address=client_ip, access="Authorized", server=min_queued_server[0], nb_queued_requests_on_server=que.qsize(),error=ex)
+ self.add_access_log_entry(event="gen_error",user=self.user, ip_address=client_ip, access="Authorized", server=min_queued_server[0], nb_queued_requests_on_server=que.qsize(),error=ex)
finally:
que.get_nowait()
- self.add_access_log_entry(event="gen_done",user=self.user, ip_address=client_ip, access="Authorized", server=min_queued_server[0], nb_queued_requests_on_server=que.qsize())
+ self.add_access_log_entry(event="gen_done",user=self.user, ip_address=client_ip, access="Authorized", server=min_queued_server[0], nb_queued_requests_on_server=que.qsize())
else:
# For other endpoints, just mirror the request.
response = requests.request(self.command, min_queued_server[1]['url'] + path, params=get_params, data=post_params)
And the script I used to start the pool of servers:
#!/bin/bash
# Start the Default Server
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11442 ollama serve &
# Start Secondary Servers
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11435 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11436 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11437 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11438 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11439 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11440 ollama serve &
OLLAMA_ORIGINS=* OLLAMA_HOST=0.0.0.0:11441 ollama serve &
ollama_proxy_server --config config.ini --port 11434 -d
how large is the content in your testing
Enough that there were a lot of embedding requests anyway.
This was one of the pages:
https://news.ycombinator.com/item?id=39197619
But I suspect it's grown since I was testing
I would try hacking a separate pool of servers just for the embeddings, with the proxy running on a non-default port. I'm still not sure WHEN the full model weights are loaded, but from the logging it could very well be lazily, when the generate API call is hit.
I hacked the OllamaEmbeddings class (just the compiled code in node_modules):
async _embed(strings) {
  console.log('hack is working!!');
  const embeddings = [];
  for await (const prompt of strings) {
    const embedding = this.caller.call(() => this._request(prompt));
    embeddings.push(embedding);
  }
  return await Promise.all(embeddings);
}

async embedDocuments(documents) {
  return this._embed(documents);
}
I feel like this change (or something similar) will be accepted in LangChainJS. See:
In any case, it didn't seem to help much in the big picture.
Maybe you can tweak the threading settings for each ollama instance or something
I mean, that might actually work somehow, because the shared embeddings would just share public data. And then you just need your little local LLM for processing the private queries. I guess it would get complicated by needing to trust the content -> embeddings mapping for a given model, but I suppose they could be signed. Shrug. Want FAST responses, and to waste less energy somehow.
https://ollama.com/library/nomic-embed-text
The latest version of Ollama seems to be much more performant when loading/unloading models. Having separate inference and embedding models will feel smoother. I'll close this issue after a couple more updates are implemented (e.g. batch embedding creation).
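For reference, a minimal sketch of what batch embedding creation could look like (this assumes Ollama's newer /api/embed endpoint, which takes an array input; not necessarily how it will land in Lumos):
// Embed several chunks in one request instead of one /api/embeddings call each.
async function embedBatch(chunks) {
  const res = await fetch("http://localhost:11434/api/embed", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", input: chunks }),
  });
  const { embeddings } = await res.json();
  return embeddings; // one vector per chunk, in the same order
}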