
Comments (8)

jkatz commented on June 5, 2024

Noting that text-embedding-3-small is 1536 dimensions and fits within the current indexing limit. The OpenAI docs also mention you can reduce the dimensions for text-embedding-3-large by passing the dimensions parameter (with options of 256 and 1024).

I'm not quickly finding this in the docs, but is text-embedding-3-large using 4-byte or 2-byte floats? I wonder if releasing the half patch would help with the indexing issue.
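The storage half of this question can be sketched quickly. The snippet below uses numpy's float16 as a stand-in for the proposed half type (an assumption; pgvector's actual on-disk format has additional per-tuple overhead) to show the size difference for a 1536-dimension embedding:

```python
import numpy as np

# A 1536-dimension embedding stored as 4-byte floats vs. 2-byte half floats.
# float16 here is only a stand-in for pgvector's proposed half type.
dim = 1536
rng = np.random.default_rng(0)
full = rng.random(dim).astype(np.float32)
half = full.astype(np.float16)

print(full.nbytes)  # 6144 bytes (4 bytes per dimension)
print(half.nbytes)  # 3072 bytes (2 bytes per dimension)

# The precision loss from the conversion is small relative to typical
# embedding magnitudes (values roughly in [-1, 1]).
max_err = float(np.abs(full - half.astype(np.float32)).max())
```

Halving the bytes per dimension roughly doubles how many dimensions fit under any fixed per-tuple size budget, which is why the half patch is relevant to the indexing limit.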

from pgvector.

ankane commented on June 5, 2024

@wmwsagara You can pass any dimensions parameter that's 2000 or less to generate vectors that can be indexed. From the announcement post, 1024 dimensions gets a lot of the benefit of the new model.

@jkatz Looking at the API response, it uses 4-byte floats. The half type may be included in 0.7.0, but I also want to look at other approaches.
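For vectors you already have, the effect of the dimensions parameter can be approximated client-side. Per OpenAI's announcement, shortened embeddings are produced by truncating and then L2-renormalizing; the helper below is a sketch of that idea (the function name and the example vector are illustrative, not part of any API):

```python
import math

def shorten_embedding(vec, dim):
    """Truncate an embedding to `dim` dimensions and L2-renormalize.

    A client-side sketch of what the API's `dimensions` parameter does
    for the text-embedding-3 models, per OpenAI's announcement.
    """
    truncated = vec[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

# Hypothetical 4-dim vector shortened to 2 dims that fit an index limit.
v = [0.5, 0.5, 0.5, 0.5]
short = shorten_embedding(v, 2)
```

Because the result is renormalized to unit length, cosine distance over the shortened vectors still behaves as expected.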


wmwsagara commented on June 5, 2024

Andrew (@ankane), please consider reopening this issue for the benefit of the wider pgvector user community.

  1. Which embedding model is suitable for a specific application?
    text-embedding-3-small:
    "text-embedding-3-small is our new highly efficient embedding model and provides a significant upgrade over its predecessor..."

NOTE: "new highly efficient embedding model"

Comparing text-embedding-ada-002 to text-embedding-3-small, ... the average score on a commonly used benchmark for English tasks (MTEB) has increased from 61.0% to 62.3%.

text-embedding-3-large:
"text-embedding-3-large is our new next generation larger embedding model and creates embeddings with up to 3072 dimensions."

NOTE: "new next generation larger embedding model"

text-embedding-3-large is our new best performing model. Comparing text-embedding-ada-002 to text-embedding-3-large: ... while on MTEB, the average score has increased from 61.0% to 64.6%.

Other models:

voyage-lite-02-instruct:
MTEB average score: 67.13 (best performer on the MTEB leaderboard)

https://docs.voyageai.com/embeddings/
Note, "But More advanced and specialized models are coming soon:
voyage-finance-2: coming soon
voyage-law-2: coming soon
voyage-multilingual-2: coming soon
voyage-healthcare-2: coming soon"

This suggests voyage-lite-02-instruct may not be good enough for finance, law, or healthcare.

Therefore, text-embedding-3-large (MTEB 64.6%) may still be better for finance, law, and healthcare than voyage-lite-02-instruct (MTEB 67.13%), unless tests prove otherwise.

Similarly, text-embedding-3-large (MTEB 64.6%) may be better for finance, law, and healthcare than text-embedding-3-small (MTEB 62.3%), unless tests prove otherwise.

Of course, from text-embedding-3-small one may retrieve 1024 dimensions, or use a lower-dimension model, and save storage and RAM; but what's the point if real-world tests show a significant loss of accuracy for a user?

See: https://huggingface.co/spaces/mteb/leaderboard

Is an MTEB average score of 64.6% (text-embedding-3-large) good enough? What if a future model proves to be over 95%?

Please note, I'm not saying higher dimensions are always better; what I'm stressing is that users must have the ability to test which model is suitable for them.

  2. Why two different maximum dimensions?

src/vector.h
#define VECTOR_MAX_DIM 16000

src/hnsw.h
#define HNSW_MAX_DIM 2000

Is there a technical reason that HNSW index cannot support VECTOR_MAX_DIM?
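One plausible reason (an assumption, not an official statement from the project) is PostgreSQL's 8 KB page size: index entries must fit in a page, while ordinary vector column values can be TOASTed and so tolerate a much higher maximum. A back-of-the-envelope check, with an assumed 8-byte vector header:

```python
# How the 2000-dim HNSW limit relates to PostgreSQL's 8 KB page size.
# The 8-byte header is an assumption for illustration; index tuples carry
# further per-tuple overhead on top of this.
BLCKSZ = 8192          # default PostgreSQL page size in bytes
FLOAT4 = 4             # pgvector stores each dimension as a 4-byte float

def vector_bytes(dim, header=8):
    return header + dim * FLOAT4

print(vector_bytes(2000))   # 8008 bytes: barely under one page
print(vector_bytes(3072))   # 12296 bytes: cannot fit in a single page
```

So at 4 bytes per dimension, roughly 2000 dimensions is already the edge of what a single page can hold, which would make VECTOR_MAX_DIM unreachable for in-page index tuples without a format change such as the half type.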


wmwsagara commented on June 5, 2024

Jonathan (@jkatz), please note the above comment. Thanks.


mateusmirandaalmeida commented on June 5, 2024

I don't understand why the issue was closed; the problem doesn't seem resolved. The indexes don't accept dimensions above 2,000, is that right?


wmwsagara commented on June 5, 2024

Yes, dimensions above 2,000 are still not supported.

The decision to ignore the elephant in the room (OpenAI's text-embedding-3-large) is puzzling.


jkatz commented on June 5, 2024

@wmwsagara The 2K indexing limit is not being ignored. There have been numerous discussions about this topic, both within the pgvector project and at events. At PGCon 2023, I gave a lightning talk discussing this exact point, as well as various strategies we can use in PostgreSQL to work beyond the limit. Also, as mentioned, there are several open branches that can help address it.

That said, @ankane did provide an alternative approach to using text-embedding-3-large. Additionally, you could consider doing some prefiltering of the data with PCA or other dimensionality-reduction techniques before storing it. You can also store it without an index and perform an exact nearest neighbor search. It all depends on your application, which I'm not sure you've provided information about.
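The PCA prefiltering idea can be sketched with plain numpy. Everything here is illustrative: the data is random and the dimensions are scaled down so the sketch runs instantly, but the same code applies to 3072-dim text-embedding-3-large vectors given enough samples (you need more samples than target components, and a real pipeline would fit PCA on a representative corpus):

```python
import numpy as np

# Illustrative stand-in for a corpus of embeddings: 400 vectors, 256 dims.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((400, 256)).astype(np.float32)

def fit_pca(X, n_components):
    """Fit PCA via SVD of the centered data; rows of vt are the components."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

# Project down to an indexable size before storing in pgvector.
mean, components = fit_pca(embeddings, 64)
reduced = (embeddings - mean) @ components.T
print(reduced.shape)  # (400, 64)
```

New query vectors get the same `(q - mean) @ components.T` projection at search time, so stored and query vectors live in the same reduced space.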

The stats you provided showed a 2-4% gain from using the larger dimensionality. I'm many years removed from my machine learning studies, so that could be quite a significant gain. With my database hat on, I read it as having to store 3x the amount of information for a small gain, whereas I could store less and increase some of my indexing tuning parameters to get better relevancy.

pgvector is open source, and ideas for how to support indexing larger vectors while still utilizing PostgreSQL's storage system (i.e. maintaining ACID compliance) are welcome.


xfalcox commented on June 5, 2024

We are already using text-embedding-3-large with pgvector in production by simply passing the dimensions parameter with 2000 as the value.

