infinity's Issues

"msg":"Input should be a valid list"

I copied the OpenAI curl example exactly as shown below:

curl https://localhost:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Your text string goes here",
    "model": "BAAI/bge-small-en-v1.5"
  }'

Infinity returned the error below: {"detail":[{"type":"list_type","loc":["body","input"],"msg":"Input should be a valid list".
I changed it to:

curl https://localhost:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["Your text string goes here"],
    "model": "BAAI/bge-small-en-v1.5"
  }'

It worked.

How can I modify Infinity's code to be fully compatible with the OpenAI API? Many thanks!
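For reference, one way to accept both input types (as the OpenAI API does) would be to widen the request model. This is only an illustrative sketch using a pydantic v1-style validator; the model name OpenAIEmbeddingInput is borrowed from later discussion in this thread and the actual fix in Infinity may differ:

from typing import List, Union

from pydantic import BaseModel, validator


class OpenAIEmbeddingInput(BaseModel):
    # accept either a bare string or a list of strings, as in the OpenAI spec
    input: Union[List[str], str]
    model: str = "default/not-specified"

    @validator("input")
    def ensure_list(cls, v):
        # normalize a single string to a one-element list internally
        return [v] if isinstance(v, str) else v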

Multi-model inference

It would be great if we could load multiple models in the same container and switch between them by model name.

cannot use rerank (BAAI/bge-base-en-v1.5)

"message": "InternalServerError: the loaded moded cannot fullyfill rerank.options are {'embed'} inherited from model_class=<class 'infinity_emb.transformer.embedder.sentence_transformer.SentenceTransformerPatched'>",

Content-Encoding: gzip

I wonder if it would make sense to support compressed requests, especially for /rerank, where the query and document list could contain many 1k or 2k chunks of text. The incoming request could easily exceed 20 or 30k. The HTTP server does not appear to handle gzipped request bodies when they are present.
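A minimal ASGI middleware sketch that would transparently decompress gzip request bodies before routing (illustrative only, not part of Infinity; the class name is made up):

import gzip

from starlette.types import ASGIApp, Receive, Scope, Send


class GzipRequestMiddleware:
    """Decompress request bodies sent with Content-Encoding: gzip."""

    def __init__(self, app: ASGIApp) -> None:
        self.app = app

    async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
        if scope["type"] == "http" and dict(scope["headers"]).get(b"content-encoding") == b"gzip":
            # drain the (possibly chunked) body, then decompress it once
            body, more_body = b"", True
            while more_body:
                message = await receive()
                body += message.get("body", b"")
                more_body = message.get("more_body", False)
            decompressed = gzip.decompress(body)

            async def receive_decompressed() -> dict:
                return {"type": "http.request", "body": decompressed, "more_body": False}

            await self.app(scope, receive_decompressed, send)
        else:
            await self.app(scope, receive, send)

It could then be registered on the FastAPI app with app.add_middleware(GzipRequestMiddleware).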

Update ResultKVStore

ResultKVStore currently uses an async event plus a dict.

This should be replaced with a plain asyncio.Future.
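A minimal sketch of what that could look like (illustrative; ResultFutureStore and its method names are made up, not the repo's API):

import asyncio
from typing import Any, Dict


class ResultFutureStore:
    """One asyncio.Future per item id, replacing the event + dict pattern."""

    def __init__(self) -> None:
        self._futures: Dict[str, asyncio.Future] = {}

    def register(self, item_id: str) -> asyncio.Future:
        # must be called from inside the running event loop;
        # the caller simply awaits the returned future
        fut = asyncio.get_running_loop().create_future()
        self._futures[item_id] = fut
        return fut

    def mark_done(self, item_id: str, result: Any) -> None:
        fut = self._futures.pop(item_id)
        if not fut.done():
            fut.set_result(result)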

unexpected keyword argument 'trust_remote_code'

  File "./infinity/libs/infinity_emb/infinity_emb/transformer/embedder/sentence_transformer.py", line 49, in __init__
    super().__init__(
TypeError: SentenceTransformer.__init__() got an unexpected keyword argument 'trust_remote_code'

SentenceTransformer does not have this argument. I saw that you were working on it recently; however, it seems that both the main branch and the revision branch have this issue.
Basic unit tests could cover this. Let me know if I can help with that.

Parity break with OpenAI API: /models

Reproduction:

curl -X GET https://infinity.semanticallyinvalid.net/models

Expected: a list of model dictionaries:

{
  "data": [
    {
      "id": "/thenlper/gte-small",
      "stats": {
        "queue_fraction": 0.0,
        "queue_absolute": 0,
        "results_pending": 0,
        "batch_size": 4096
      },
      "object": "model",
      "owned_by": "infinity",
      "created": 1708973209,
      "backend": "torch"
    }
  ],
  "object": "list"
}

Actual: a single dictionary:

{
  "data": {
    "id": "/thenlper/gte-small",
    "stats": {
      "queue_fraction": 0.0,
      "queue_absolute": 0,
      "results_pending": 0,
      "batch_size": 4096
    },
    "object": "model",
    "owned_by": "infinity",
    "created": 1708973209,
    "backend": "torch"
  },
  "object": "list"
}

I have worked around this here, but for this commit to be accepted upstream, the endpoint would need to return the expected list of models.

support embedding with "instructions"

Embedding models like bge_small/large and instructor_xl/base are designed to be used with instructions alongside the text to embed (especially for RAG use cases). If the embeddings API does not currently support this functionality, it would be great to add it; if it is already supported, it would be good to clarify this with an example. Thanks!
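For what it's worth, instruction-style models such as BGE can already be used by prepending the instruction on the client side; whether Infinity should do this server-side is the open question here. Illustrative sketch, assuming the server runs on localhost:7997 as in the curl examples above (the prefix below is the query instruction published on the BGE v1.5 model card; check the card for your model):

import requests

QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

query = "how to configure infinity for rag"
resp = requests.post(
    "http://localhost:7997/embeddings",
    json={
        "model": "BAAI/bge-small-en-v1.5",
        # the instruction is prepended to the query only; documents are embedded as-is
        "input": [QUERY_INSTRUCTION + query],
    },
)
print(resp.json()["data"][0]["embedding"][:5])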

Support for Optimum Inference?

Do you currently support inference of Optimum-converted models, e.g. through the ONNX Runtime? I tried a couple of pre-optimized HF models, e.g. https://huggingface.co/Xenova/bge-large-en-v1.5, but get the errors below (due to the different directory structure of these models):

docker run --rm -p 8081:80 michaelf34/infinity:latest --model-name-or-path Xenova/bge-large-en-v1.5 --port 80
WARNING  2024-01-11 16:49:24,197                      SentenceTransformer.py:805
         sentence_transformers.SentenceTransformer                              
         WARNING: No sentence-transformers model                                
         found with name                                                        
         /app/.cache/torch/Xenova_bge-large-en-v1.5.                            
         Creating a new one with MEAN pooling.                                  
ERROR:    Traceback (most recent call last):
  File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 566, in __aenter__
    await self._router.startup()
  File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 654, in startup
    await handler()
  File "/app/infinity_emb/infinity_server.py", line 67, in _startup
    app.model = AsyncEmbeddingEngine(
  File "/app/infinity_emb/engine.py", line 60, in __init__
    self._model, self._min_inference_t = select_model(
  File "/app/infinity_emb/inference/select_model.py", line 64, in select_model
    loaded_engine = unloaded_engine.value(model_name_or_path, device=device.value)
  File "/app/infinity_emb/transformer/embedder/sentence_transformer.py", line 47, in __init__
    super().__init__(model_name_or_path, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 97, in __init__
    modules = self._load_auto_model(model_path)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 806, in _load_auto_model
    transformer_model = Transformer(model_name_or_path)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 29, in __init__
    self._load_model(model_name_or_path, config, cache_dir)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 49, in _load_model
    self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
  File "/app/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/app/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3206, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory /app/.cache/torch/Xenova_bge-large-en-v1.5.

Support of Huggingface Transformers?

Hello,

I was wondering if you have plans to include support for the transformers library,

so that people can use open-source embedding models like e5 or bge from Hugging Face.

Kind thanks

Add disk caching for identical embeddings

Proposal to integrate: https://github.com/grantjenks/python-diskcache into BatchHandler, storing hashed embeddings

The idea would be:

  • after an item is added to the FiFOQueue, also add it to the CacheEngine via q_retrieve_early_in
  • new thread: async await on q_retrieve_early_out in that thread -> set the future
  • pre-processing in the BatchHandler -> filter cancelled futures

CacheEngine

  • q_retrieve_early_in
  • q_retrieve_early_out
  • q_fill_cache
  • new background thread 1:
    • reads q_retrieve_early_in and queries diskcache by hash(sentence)
    • if a cached value exists: add it to q_retrieve_early_out
  • new background thread 2:
    • reads q_fill_cache - gets embeddings and stores them in diskcache
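A rough sketch of the disk-cache part of the proposal above (illustrative only; function names and the cache location are made up, and this is not the BatchHandler integration itself):

import hashlib

import diskcache

cache = diskcache.Cache("/tmp/infinity-embedding-cache")


def _key(sentence: str, model: str) -> str:
    # include the model name so different models never share cache entries
    return hashlib.sha256(f"{model}::{sentence}".encode()).hexdigest()


def get_cached(sentence: str, model: str):
    # returns None on a cache miss
    return cache.get(_key(sentence, model))


def set_cached(sentence: str, model: str, embedding) -> None:
    cache.set(_key(sentence, model), embedding)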

Setup Auth Access?

I'm wondering if it's possible to add some sort of API key so that only authorized users can access the service.
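A minimal sketch of how this could look with a FastAPI dependency checking a static bearer token (illustrative; the env var name INFINITY_API_KEY is made up, not an existing setting):

import os

from fastapi import Depends, FastAPI, HTTPException, status
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

security = HTTPBearer()


def verify_api_key(credentials: HTTPAuthorizationCredentials = Depends(security)) -> None:
    expected = os.environ.get("INFINITY_API_KEY")
    if not expected or credentials.credentials != expected:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="invalid api key")


# attach the check to every route of the app
app = FastAPI(dependencies=[Depends(verify_api_key)])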

Making torch optional

Related to #5 - once another backend is available, make any dependency on torch optional.

Faster JSON Body decoding / Response encoding

Idea would be to use https://fastapi.tiangolo.com/advanced/custom-response/

1. Response encoding

Roughly a factor of 50x faster encoding, bringing end-to-end inference time down by about 3x.

2. Body.decode?

Default (json.loads):

SentenceTransformers latency: 21.628474655095488
Measuring latency via requests
Request latency: 21.012259179959074

orjson:

Measuring latency via SentenceTransformers
SentenceTransformers latency: 21.48577680112794
Measuring latency via requests
Request latency: 20.808999666944146

from fastapi import FastAPI, status, responses, Request
import orjson
...
    @app.post(f"{url_prefix}/embeddings", response_model=OpenAIEmbeddingResult)
    async def _embeddings(data: Request):
        """Encode Embeddings

        ```python
        import requests
        requests.post("https://..:8000/v1/embeddings",
            json={"model":"all-MiniLM-L6-v2","input":["A sentence to encode."]})
        ```
        """
        bh: BatchHandler = app.batch_handler

        if bh.is_overloaded():
            raise errors.OpenAIException(
                "model overloaded", code=status.HTTP_429_TOO_MANY_REQUESTS
            )

        # read the raw body and decode it with orjson instead of the default
        # json parsing, then validate against the OpenAIEmbeddingInput pydantic model
        try:
            body = await data.body()
            data = OpenAIEmbeddingInput(**orjson.loads(body))
        except Exception as ex:
            raise errors.OpenAIException(
                f"invalid input: {ex}", code=status.HTTP_422_UNPROCESSABLE_ENTITY
            )
        ...
        return responses.ORJSONResponse(content=res)

422 error if /embeddings input is a string

Hello,

There seems to be a discrepancy between Infinity's /embeddings API endpoint and OpenAI. OpenAI supports both string and string[] as input, while Infinity throws 422: Unprocessable Entity for a simple string input. This prevents us from using the Infinity endpoint with LangChain's OpenAI embeddings, as embedQuery sends a string input.

Any chance this could be updated to support both, in line with the OpenAI spec?

Appreciate your work!

Oleg

Torch + Cuda + Bert crashes abruptly on startup

We're trying to run this jina embed model with Infinity:

embeddings_args = EngineArgs(
    model_name_or_path="jinaai/jina-embeddings-v2-base-es",
    engine=InferenceEngine.torch,
    device=Device.auto,
    trust_remote_code=True,
)

This is our nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   31C    P0              70W / 500W |   1885MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       439      C   /usr/bin/python3.10                        1872MiB |
+---------------------------------------------------------------------------------------+

Crash:

hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)

ERROR 2024-02-26 20:42:13,177 infinity_emb ERROR: Failed batch_handler.py:320
running call_function <built-in function log2>(*(s0**3,), **{}):
must be real number, not SymFloat

from user code:
  File "/data/hf/modules/transformers_modules/jinaai/jina-bert-v2-qk-devlin-norm-1e-2/a0ba9b2e7e2613a74d8cba43f2bbd420699db17c/modeling_bert.py", line 728, in resume_in__get_alibi_head_slopes
    get_slopes_power_of_2(closest_power_of_2)
  File "/data/hf/modules/transformers_modules/jinaai/jina-bert-v2-qk-devlin-norm-1e-2/a0ba9b2e7e2613a74d8cba43f2bbd420699db17c/modeling_bert.py", line 715, in get_slopes_power_of_2
    start = 2 ** (-(2 ** -(math.log2(n) - 3)))

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1

Any suggestions @michaelfeil?

Adding max token budget per batch

We currently allow up to batch_size=64 by default. This can potentially lead to high memory usage, e.g. for jina-8k BERT -> 64x8192 tokens. It would be better to adjust dynamically and set a token budget, e.g. 64*512=32768 tokens per forward pass.
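A sketch of the packing logic (illustrative names, not the repo's implementation):

from typing import Iterable, List, Tuple


def pack_batches(
    items: Iterable[Tuple[str, int]],  # (sentence, token_count) pairs
    max_items: int = 64,
    token_budget: int = 64 * 512,      # e.g. 32768 tokens per forward pass
) -> List[List[str]]:
    batches, current, used = [], [], 0
    for sentence, n_tokens in items:
        # close the current batch if adding this item would exceed either limit
        if current and (len(current) >= max_items or used + n_tokens > token_budget):
            batches.append(current)
            current, used = [], 0
        current.append(sentence)
        used += n_tokens
    if current:
        batches.append(current)
    return batches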

Idea: add a parameter to configure number of decimals in JSON output

Please consider adding a parameter to set the number of decimals in the JSON output. This would help reduce network bandwidth requirements and the time spent parsing the output. It is relevant for users who do not need or want full precision, e.g. if the embedding values are quantized and/or in latency-critical applications.
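Illustrative sketch of what such a parameter could do before serialization (not an existing option):

import numpy as np


def round_embeddings(embeddings: np.ndarray, decimals: int = 5) -> list:
    # fewer digits per float means a smaller JSON payload and faster parsing
    return np.round(embeddings, decimals).tolist()


vec = np.array([0.123456789, -0.987654321])
print(round_embeddings(vec, decimals=4))  # approximately [0.1235, -0.9877]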

setting the pooling strategy - supported models?

I receive the error below when trying to use "hkunlp/instructor-xl". It seems related to the selection of the pooling mode.

Also, when running WhereIsAI/UAE-Large-V1 it seems to auto-select "MEAN" pooling, whereas the model authors recommend "cls".

Are these models currently supported?

===traceback
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO     2023-12-06 19:53:55,770 infinity_emb INFO: model=hkunlp/instructor-xl selected, using engine=<class 'infinity_emb.transformer.sentence_transformer.SentenceTransformerPatched'> and device=None    select_model.py:17
INFO     2023-12-06 19:53:55,772 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: hkunlp/instructor-xl    SentenceTransformer.py:66
ERROR:    Traceback (most recent call last):
  File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 566, in __aenter__
    await self._router.startup()
  File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 654, in startup
    await handler()
  File "/app/infinity_emb/infinity_server.py", line 182, in _startup
    model, min_inference_t = select_model_to_functional(
  File "/app/infinity_emb/inference/select_model.py", line 21, in select_model_to_functional
    init_engine = engine.value(model_name_or_path, device=device.value)
  File "/app/infinity_emb/transformer/sentence_transformer.py", line 52, in __init__
    super().__init__(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 95, in __init__
    modules = self._load_sbert_model(model_path)
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 840, in _load_sbert_model
    module = module_class.load(os.path.join(model_path, module_config['path']))
  File "/app/.venv/lib/python3.10/site-packages/sentence_transformers/models/Pooling.py", line 120, in load
    return Pooling(**config)
TypeError: Pooling.__init__() got an unexpected keyword argument 'pooling_mode_weightedmean_tokens'

shrink: docker image size by pruning venv

The docker image michaelf34/infinity:latest is about 6.5G uncompressed. Exploring this, I noticed inside the container:

# du -sh /root/rerank-test/.venv/ /app/.venv/
5.4G	/root/rerank-test/.venv/
5.6G	/app/.venv/

These add up to more than the 6.5G image size, though, so I am not sure what is going on. Something might have leaked past .dockerignore when this container was built.

REPOSITORY                               TAG       DIGEST                                                                    IMAGE ID       CREATED        SIZE
michaelf34/infinity                      latest    sha256:42f31eeb195eec83960f8b505887aa8f4da64c7cdeddafd6f2fd6a7cbd008162   a8e432629682   2 days ago     6.5GB

I may not be using the very latest image.

OpenAI compatible server embeddings endpoint not accepting list[str]

Thanks a lot for this handy library!

When trying it out with langchain + milvus, I'm observing a duplicate of abetlen/llama-cpp-python#547 .

Steps to Reproduce

  • Launched the prebuilt docker container with steps provided here.
  • Using the following milvus tutorial here, with the following additions:
import os
os.environ["OPENAI_API_KEY"] = "XXX"
os.environ["OPENAI_API_BASE"] = "http://embeddings-service:8080/v1"

Problem

As mentioned in the linked issue, the OpenAI embeddings API accepts both a string and an array of tokens. The server generated here also can't handle this input and fails with:

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised APIError: Invalid response object from API: '{"detail":[{"type":"string_type","loc":["body","input",0],"msg":"Input should be a valid string"

Batch handler priority queue can starve requests

Random scheduling has very high variance in latencies under any significant queuing scenarios.

I think this feedback is valid but let me know if I'm wrong.


If I understand correctly, the default queue size is 64000. Batch size defaults to 64.

Let's pretend batches take 100ms and we keep the default queue roughly at 50% capacity (requests keep streaming in and holding it constant):
My queue is 32000 and I submit 1 request with 1 sample to embed.
Naively, I might assume that my request will get handled in up to 50 seconds (32000/64).
In reality, my odds of being picked are independent and 64/32000 (1/500) each time: 1-(1-P)^n . So the probability I get picked after n batches is (1 - (499/500)^n).

Solving for n. Worst case I never get picked, half of the time I'll get picked by the 346th batch, and 1 percent of the time I won't get picked by the 2300th batch.

In this scenario, 1% of the time my request would take 100ms*2300 = 230s to clear. If I have more samples in my batch, they complicate the math and make the queueing mechanism more adversarial.
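A quick check of those numbers under the independence assumption above:

import math

p_per_batch = 64 / 32000  # 1/500 chance of being scheduled in any given batch
for target in (0.5, 0.99):
    n = math.log(1 - target) / math.log(1 - p_per_batch)
    print(f"{target:.0%} chance of being picked within {math.ceil(n)} batches")
# -> roughly 347 and 2301 batches, matching the ~346 and ~2300 figures above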


Smaller queue sizes / less consistent traffic would be less adversarial but are still susceptible to bad luck.
Using a lower volume set of numbers (but user sends 10 random samples): queue size fixed at 100, request has 10 samples in it, batch size is 10: 50% of the time my request would take 96 batches or more to clear, 1% of the time my request would take 196 batches or more to clear.

Naively I would assume my request would clear in 1s (10*100ms) but on average it will take at least 10s and 1% of the time it will take at least 19.6s.


https://www.wolframalpha.com/input?i=binomial+probability+calculator

Reranker model fails to load (maidalun1020/bce-reranker-base_v1) - no max token length is set

Hello, when trying to load this specific model, maidalun1020/bce-reranker-base_v1,
infinity_emb outputs the error below. Is there something missing in this model's config?

infinity_emb --model-name-or-path maidalun1020/bce-reranker-base_v1 --batch-size 16 --log-level info

INFO     2024-03-06 16:39:18,588 datasets INFO: PyTorch version 2.2.0 available.                                                                                        config.py:58
INFO:     Started server process [130475]
INFO:     Waiting for application startup.
INFO     2024-03-06 16:39:19,255 infinity_emb INFO: model=`maidalun1020/bce-reranker-base_v1` selected, using engine=`torch` and device=`None`                    select_model.py:54
INFO     2024-03-06 16:39:21,842 sentence_transformers.cross_encoder.CrossEncoder INFO: Use pytorch device: cuda                                                  CrossEncoder.py:82
INFO     2024-03-06 16:39:22,181 infinity_emb INFO: No optimizations via Huggingface optimum, it is disabled via env INFINITY_DISABLE_OPTIMUM                     acceleration.py:29
INFO     2024-03-06 16:39:22,182 infinity_emb INFO: Switching to half() precision (cuda: fp16). Disable by the setting the env var `INFINITY_DISABLE_HALF`               torch.py:60
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
INFO     2024-03-06 16:39:22,789 infinity_emb INFO: Getting timings for batch_size=16 and avg tokens per sentence=3                                               select_model.py:81
                 0.00     ms tokenization                                                                                                                                           
                 7.75     ms inference                                                                                                                                              
                 0.00     ms post-processing                                                                                                                                        
                 7.75     ms total                                                                                                                                                  
         embeddings/sec: 2064.93                                                                                                                                                    
ERROR:    Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 566, in __aenter__
    await self._router.startup()
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 654, in startup
    await handler()
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/infinity_server.py", line 57, in _startup
    app.model = AsyncEmbeddingEngine.from_args(engine_args)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/engine.py", line 45, in from_args
    engine = cls(**asdict(engine_args), _show_deprecation_warning=False)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/engine.py", line 36, in __init__
    self._model, self._min_inference_t, self._max_inference_t = select_model(
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/inference/select_model.py", line 83, in select_model
    loaded_engine.warmup(batch_size=engine_args.batch_size, n_tokens=512)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/transformer/abstract.py", line 97, in warmup
    return run_warmup(self, inp)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/transformer/abstract.py", line 105, in run_warmup
    embed = model.encode_core(feat)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/transformer/crossencoder/torch.py", line 75, in encode_core
    out_features = self.predict(
  File "/opt/conda/lib/python3.10/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py", line 332, in predict
    model_predictions = self.model(**features, return_dict=True)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 1208, in forward
    outputs = self.roberta(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 803, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1028) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [16, 1028].  Tensor sizes: [1, 514]

ERROR:    Application startup failed. Exiting.

Support for nomic-ai/nomic-embed-text-v1.5

Nomic AI's latest model, nomic-ai/nomic-embed-text-v1.5, requires einops as a dependency as well as sentence-transformers >2.4.0. It would be great if Infinity supported this model.

Asking to truncate to max_length but no maximum length

2024-02-29T06:34:41.018 app[17816011be4689] ord [info] Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Should we worry about this log? We've seen that you pushed commits that truncate to longest_first on torch, but they're not released yet; could that be related, Michael?

support for `revision`

Currently no revision is used. It would be helpful to add a revision option to download a specific commit from the Hugging Face repo.
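The underlying Hugging Face loaders already accept a revision argument, so exposing it should mostly be a matter of passing it through; illustrative example of the upstream API (not Infinity's CLI):

from transformers import AutoModel, AutoTokenizer

# pin the download to a specific branch, tag, or commit hash
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5", revision="main")
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5", revision="main")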

AMD ROCm docker images support (+ optimization)

I am planning to evaluate hardware-agnostic options:

  • adapt poetry setup for optional deps
  • build a Docker Image for AMD Mi250/300
  • optimize settings e.g. torch.compile()/float16 etc for AMD

Create llama-index `InfinityEmbeddings` like langchain's

Hi! Kudos for this project Michael! It is amazing.

We're migrating from a single repo with a RAG and a T40, to one repo with a RAG on just CPU and another service with our embedding models, rerankers... and we just start/stop this machine (which is expensive) when traffic arrives.

We have seen your tool and looks promising, and we're willing to contribute.

Did you consider supporting llama-index? I think we (my company) could work on the llama-index integration; we can ping you when it's done for review.

What do you think?

How is long text handled?

Hey,

I'm trying to understand what happens if we send a text that is longer than the model's max length.
Will it be truncated by the tokenizer?
If not, what happens when the model gets a text longer than its max length?

Thanks.
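For context, this is how truncation typically looks at the tokenizer level; whether Infinity enables it by default is exactly the question asked above (illustrative only):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
long_text = "word " * 10_000
# with truncation enabled, the input is cut to the model's max length (512 here)
enc = tok(long_text, truncation=True, max_length=tok.model_max_length)
print(len(enc["input_ids"]))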

infinity_emb failed at startup using `torch.compile` when installed via pip

commit hash: 296472e

I tried it on my Linux machine - Ubuntu 22.04 with CUDA 12.3 - and it failed.

% infinity_emb --device cuda --engine torch
2024-03-03 11:05:28.807 | WARNING  | fastembed.embedding:<module>:7 - DefaultEmbedding, FlagEmbedding, JinaEmbedding are deprecated. Use TextEmbedding instead.
INFO:     Started server process [4620]
INFO:     Waiting for application startup.
INFO     2024-03-03 11:05:29,079 infinity_emb INFO: model=`BAAI/bge-small-en-v1.5` selected, using engine=`torch` and device=`cuda`                               select_model.py:54
INFO     2024-03-03 11:05:29,378 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5              SentenceTransformer.py:106
INFO     2024-03-03 11:05:31,576 infinity_emb INFO: Adding optimizations via Huggingface optimum. Disable by setting the env var `INFINITY_DISABLE_OPTIMUM`       acceleration.py:20
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
INFO     2024-03-03 11:05:31,580 infinity_emb INFO: Switching to half() precision (cuda: fp16). Disable by the setting the env var                        sentence_transformer.py:67
         `INFINITY_DISABLE_HALF`
INFO     2024-03-03 11:05:31,586 infinity_emb INFO: using torch.compile()                                                                                 sentence_transformer.py:73
zsh: segmentation fault (core dumped)  infinity_emb --device cuda --engine torch
%

I found issue #115, and export INFINITY_DISABLE_COMPILE=TRUE works. But it is strange that the default setting fails.
