Comments (10)
Hi @semoal and @michaelfeil, the error should have been fixed, please give it a try. (Note: please remove the Hugging Face cache in ~/.cache/huggingface/hub and ~/.cache/huggingface/modules.)
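If it helps, a minimal sketch for clearing those cache directories from Python, assuming the default Hugging Face cache location (adjust the path if you set HF_HOME):

import shutil
from pathlib import Path

# remove the cached model weights (hub) and the cached remote modeling code (modules)
for sub in ("hub", "modules"):
    shutil.rmtree(Path.home() / ".cache" / "huggingface" / sub, ignore_errors=True)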
my script:
import asyncio

from infinity_emb import AsyncEmbeddingEngine, EngineArgs
from infinity_emb.transformer.utils import InferenceEngine
from infinity_emb.primitives import Device

sentences = ["Embed this is sentence via Infinity.", "Paris is in France."]

# Jina v2 Spanish embeddings; trust_remote_code is needed for the custom modeling code
embeddings_args = EngineArgs(
    model_name_or_path="jinaai/jina-embeddings-v2-base-es",
    engine=InferenceEngine.torch,
    device=Device.auto,
    trust_remote_code=True,
)

engine = AsyncEmbeddingEngine.from_args(embeddings_args)

async def main():
    async with engine:  # engine starts with engine.astart()
        embeddings, usage = await engine.embed(sentences=sentences)
        print(embeddings)

asyncio.run(main())
@semoal Seems to be an error with jinaai/jina-bert-v2-qk-devlin-norm-1e-2 in combination with torch.compile(model, dynamic=True).
Two things you should do now:
- Set export INFINITY_DISABLE_COMPILE=True
- Open an issue at Jina, reminding them that torch.compile fails for their custom modeling code.
Thanks @michaelfeil and @semoal, we're looking into it!
But you're right, let me rewrite it in torch format :)
@semoal Can you confirm this works when setting, via os.environ: INFINITY_DISABLE_OPTIMUM="TRUE" (for BetterTransformer) and INFINITY_DISABLE_COMPILE="TRUE"?
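For reference, a minimal sketch of setting both variables from Python; the variable names are the ones discussed here, and the assumption is that they must be set before infinity_emb reads them, so this goes at the very top of the script:

import os

# disable the BetterTransformer transform and torch.compile for this process
os.environ["INFINITY_DISABLE_OPTIMUM"] = "TRUE"
os.environ["INFINITY_DISABLE_COMPILE"] = "TRUE"

from infinity_emb import AsyncEmbeddingEngine, EngineArgs  # import after setting the env vars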
Yes, it's the JIT nature of torch.compile, please enable the warmup flag for that.
Confirmed that disabling the compile works, Michael, thanks for the quick feedback. Created an issue on the Jina HF board: https://huggingface.co/jinaai/jina-embeddings-v2-base-es/discussions/6
@bwanglzu A first pointer might be that torch inductor does not like the pythonic implementation of start = 2 ** (-(2 ** -(math.log2(n) - 3))); perhaps it can be torchified without a performance sacrifice: https://huggingface.co/jinaai/jina-bert-implementation/blob/f3ec4cf7de7e561007f27c9efc7148b0bd713f81/modeling_bert.py#L720 - compiling without dynamic=True might also be an idea. torch.compile gives a decent +15% throughput, so it would be a shame to drop it.
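To illustrate, a minimal sketch of that expression written with tensor ops so inductor can trace it (an illustration only, not the Jina implementation; alibi_start is a made-up name):

import math
import torch

def alibi_start(n: int) -> torch.Tensor:
    # tensor-op equivalent of: 2 ** (-(2 ** -(math.log2(n) - 3)))
    n_t = torch.tensor(float(n))
    return torch.exp2(-torch.exp2(-(torch.log2(n_t) - 3.0)))

# sanity check against the pure-Python expression
n = 8
assert abs(alibi_start(n).item() - 2 ** (-(2 ** -(math.log2(n) - 3)))) < 1e-6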
@semoal I might start to consolidate the inference engine options and introduce additional arguments, so that you don't have to deal with env variables (but CLI arguments instead) - would that be helpful for you?
Just tried it: it no longer crashes when receiving requests and correctly generates the embeddings, but I see a warning/error when initializing the model:
2024-02-27T17:13:25.816 app[17816011be4689] ord [info] INFO infinity_emb: Adding optimizations via Huggingface optimum. Disable by setting the env var `INFINITY_DISABLE_OPTIMUM` (acceleration.py:20)
2024-02-27T17:13:25.825 app[17816011be4689] ord [info] ERROR infinity_emb: BetterTransformer failed with The transformation of the model JinaBertModel to BetterTransformer failed while it should not. Please fill a bug report or open a PR to support this model at https://github.com/huggingface/optimum/ (acceleration.py:27)
  Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/infinity_emb/transformer/acceleration.py", line 25, in to_bettertransformer
      model = BetterTransformer.transform(model)
    File "/usr/lib/python3.10/contextlib.py", line 79, in inner
      return func(*args, **kwds)
    File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/transformation.py", line 270, in transform
      set_last_layer(model_fast)
    File "/usr/local/lib/python3.10/dist-packages/optimum/bettertransformer/transformation.py", line 166, in set_last_layer
      raise Exception(
  Exception: The transformation of the model JinaBertModel to BetterTransformer failed while it should not. Please fill a bug report or open a PR to support this model at https://github.com/huggingface/optimum/
2024-02-27T17:13:25.827 app[17816011be4689] ord [info] INFO infinity_emb: Switching to half() precision (cuda: fp16). Disable by setting the env var `INFINITY_DISABLE_HALF` (sentence_transformer.py:67)
2024-02-27T17:13:25.853 app[17816011be4689] ord [info] INFO infinity_emb: using torch.compile() (sentence_transformer.py:73)
2024-02-27T17:13:27.823 app[17816011be4689] ord [info] INFO infinity_emb: creating batching engine (batch_handler.py:385)
2024-02-27T17:13:27.825 app[17816011be4689] ord [info] INFO infinity_emb: ready to batch requests. (batch_handler.py:242)
2024-02-27T17:13:27.828 app[17816011be4689] ord [info] INFO infinity_emb: ♾️ Infinity - Embedding Inference Server (server.py:49)
  MIT License; Copyright (c) 2023 Michael Feil
  Version 0.0.25
  Open the Docs via Swagger UI: http://localhost:8000/docs
  Access model via 'GET': curl http://localhost:8000/models
2024-02-27T17:13:27.829 app[17816011be4689] ord [info] INFO: Application startup complete.
2024-02-27T17:13:27.830 app[17816011be4689] ord [info] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Yes, disabling works perfectly. I did notice one curious thing: with the optimizations disabled, the first request to /embeddings is much faster (roughly 10-20 s less; probably because I'm not warming up the model, should I?) than with optimizations enabled. This was all tested on a fly.io machine with an L40.
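A minimal warm-up sketch, assuming infinity's OpenAI-compatible /embeddings route and that `requests` is installed (host, port and model name are assumptions, adjust to your deployment):

import requests

# fire one dummy request right after startup so the first real request
# does not pay the torch.compile / autotuning cost
resp = requests.post(
    "http://localhost:8000/embeddings",
    json={"model": "jinaai/jina-embeddings-v2-base-es", "input": ["warmup"]},
    timeout=120,
)
resp.raise_for_status()
print("warmup done, dim:", len(resp.json()["data"][0]["embedding"]))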