
Comments (8)

jkatz commented on June 5, 2024

Noting that text-embedding-3-small is 1536 dimensions and fits within the current indexing limit. The OpenAI docs also mention you can reduce the dimensions for text-embedding-3-large by passing the dimensions parameter (with options of 256 and 1024).

I'm not quickly finding this in the docs, but is text-embedding-3-large using 4-byte or 2-byte floats? I wonder if releasing the half patch would help with the indexing issue.
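The storage half of this question can be sketched quickly. The snippet below uses numpy's float16 as a stand-in for the proposed half type (an assumption; pgvector's actual on-disk format has additional per-tuple overhead) to show the size difference for a 1536-dimension embedding:

```python
import numpy as np

# A 1536-dimension embedding stored as 4-byte floats vs. 2-byte half floats.
# float16 here is only a stand-in for pgvector's proposed half type.
dim = 1536
rng = np.random.default_rng(0)
full = rng.random(dim).astype(np.float32)
half = full.astype(np.float16)

print(full.nbytes)  # 6144 bytes (4 bytes per dimension)
print(half.nbytes)  # 3072 bytes (2 bytes per dimension)

# The precision loss from the conversion is small relative to typical
# embedding magnitudes (values roughly in [-1, 1]).
max_err = float(np.abs(full - half.astype(np.float32)).max())
```

Halving the bytes per dimension roughly doubles how many dimensions fit under any fixed per-tuple size budget, which is why the half patch is relevant to the indexing limit.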

from pgvector.

ankane commented on June 5, 2024

@wmwsagara You can pass any dimensions parameter that's 2000 or less to generate vectors that can be indexed. From the announcement post, 1024 dimensions gets a lot of the benefit of the new model.

@jkatz Looking at the API response, it uses 4-byte floats. The half type may be included in 0.7.0, but I also want to look at other approaches.
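For vectors you already have, the effect of the dimensions parameter can be approximated client-side. Per OpenAI's announcement, shortened embeddings are produced by truncating and then L2-renormalizing; the helper below is a sketch of that idea (the function name and the example vector are illustrative, not part of any API):

```python
import math

def shorten_embedding(vec, dim):
    """Truncate an embedding to `dim` dimensions and L2-renormalize.

    A client-side sketch of what the API's `dimensions` parameter does
    for the text-embedding-3 models, per OpenAI's announcement.
    """
    truncated = vec[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

# Hypothetical 4-dim vector shortened to 2 dims that fit an index limit.
v = [0.5, 0.5, 0.5, 0.5]
short = shorten_embedding(v, 2)
```

Because the result is renormalized to unit length, cosine distance over the shortened vectors still behaves as expected.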


wmwsagara commented on June 5, 2024

Andrew (@ankane), please consider reopening this issue for the benefit of the wider pgvector user community.

  1. Which embedding model is suitable for a specific application?
    text-embedding-3-small:
    "text-embedding-3-small is our new highly efficient embedding model and provides a significant upgrade over its predecessor..."

NOTE: "new highly efficient embedding model"

Comparing text-embedding-ada-002 to text-embedding-3-small, ... the average score on a commonly used benchmark for English tasks (MTEB) has increased from 61.0% to 62.3%.

text-embedding-3-large:
"text-embedding-3-large is our new next generation larger embedding model and creates embeddings with up to 3072 dimensions."

NOTE: "new next generation larger embedding model"

text-embedding-3-large is our new best performing model. Comparing text-embedding-ada-002 to text-embedding-3-large: ... while on MTEB, the average score has increased from 61.0% to 64.6%.

Other models:

voyage-lite-02-instruct:
MTEB average score: 67.13 (best performer on the MTEB leaderboard)

https://docs.voyageai.com/embeddings/
Note, "But More advanced and specialized models are coming soon:
voyage-finance-2: coming soon
voyage-law-2: coming soon
voyage-multilingual-2: coming soon
voyage-healthcare-2: coming soon"

This suggests voyage-lite-02-instruct may not be good enough for finance, law, or healthcare.

Therefore, text-embedding-3-large (MTEB 64.6%) may still be better for finance, law, and healthcare than voyage-lite-02-instruct (MTEB 67.13%), unless tests prove otherwise.

Similarly, text-embedding-3-large (MTEB 64.6%) may be better for finance, law, and healthcare than text-embedding-3-small (MTEB 62.3%), unless tests prove otherwise.

Of course, from text-embedding-3-small one may retrieve 1024 dimensions, or use a lower-dimension model, and save storage and RAM; but what's the point if real-world tests show a significant loss of accuracy for a user?

See: https://huggingface.co/spaces/mteb/leaderboard

Is an MTEB average score of 64.6% (text-embedding-3-large) good enough? What if a future model proves to be over 95%?

Please note, I'm not saying higher dimensions are always better; what I'm stressing is that users must have the ability to test which model is suitable for them.

  2. Why two different maximum dimensions?

src/vector.h
#define VECTOR_MAX_DIM 16000

src/hnsw.h
#define HNSW_MAX_DIM 2000

Is there a technical reason that HNSW index cannot support VECTOR_MAX_DIM?
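One plausible reason (an assumption, not an official statement from the project) is PostgreSQL's 8 KB page size: index entries must fit in a page, while ordinary vector column values can be TOASTed and so tolerate a much higher maximum. A back-of-the-envelope check, with an assumed 8-byte vector header:

```python
# How the 2000-dim HNSW limit relates to PostgreSQL's 8 KB page size.
# The 8-byte header is an assumption for illustration; index tuples carry
# further per-tuple overhead on top of this.
BLCKSZ = 8192          # default PostgreSQL page size in bytes
FLOAT4 = 4             # pgvector stores each dimension as a 4-byte float

def vector_bytes(dim, header=8):
    return header + dim * FLOAT4

print(vector_bytes(2000))   # 8008 bytes: barely under one page
print(vector_bytes(3072))   # 12296 bytes: cannot fit in a single page
```

So at 4 bytes per dimension, roughly 2000 dimensions is already the edge of what a single page can hold, which would make VECTOR_MAX_DIM unreachable for in-page index tuples without a format change such as the half type.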


wmwsagara commented on June 5, 2024

Jonathan (@jkatz), please note the above comment. Thanks.


mateusmirandaalmeida commented on June 5, 2024

I don't understand why the issue was closed; the problem doesn't seem resolved. The indexes don't accept dimensions above 2,000, is that right?


wmwsagara commented on June 5, 2024

Yes, dimensions above 2,000 are still not supported.

The decision to ignore the elephant in the room (OpenAI's text-embedding-3-large) is puzzling.


jkatz commented on June 5, 2024

@wmwsagara The 2K indexing limit is not being ignored. There have been numerous discussions about this topic, both within the pgvector project and at events. At PGCon 2023, I gave a lightning talk discussing this exact point, as well as various strategies we can use in PostgreSQL to work beyond the limit. Also, as mentioned, there are several open branches that can help address it.

That said, @ankane did provide an alternative approach to using text-embedding-3-large. Additionally, you could consider doing some prefiltering of the data with PCA or other dimensionality-reduction techniques before storing it. You can also store it without an index and perform an exact nearest neighbor search. It all depends on your application, which I'm not sure you've provided information about.
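The PCA prefiltering idea can be sketched with plain numpy. Everything here is illustrative: the data is random and the dimensions are scaled down so the sketch runs instantly, but the same code applies to 3072-dim text-embedding-3-large vectors given enough samples (you need more samples than target components, and a real pipeline would fit PCA on a representative corpus):

```python
import numpy as np

# Illustrative stand-in for a corpus of embeddings: 400 vectors, 256 dims.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((400, 256)).astype(np.float32)

def fit_pca(X, n_components):
    """Fit PCA via SVD of the centered data; rows of vt are the components."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

# Project down to an indexable size before storing in pgvector.
mean, components = fit_pca(embeddings, 64)
reduced = (embeddings - mean) @ components.T
print(reduced.shape)  # (400, 64)
```

New query vectors get the same `(q - mean) @ components.T` projection at search time, so stored and query vectors live in the same reduced space.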

The stats you provided showed a 2-4% gain from using the larger dimensionality. I'm many years removed from my machine learning studies, so that could be quite a significant gain. With my database hat on, I read it as having to store 3x the amount of information for a small gain, whereas I could store less and increase some of my indexing tuning parameters to get better relevancy.

pgvector is open source, and ideas for how to support indexing larger vectors while still utilizing PostgreSQL's storage system (i.e. maintaining ACID compliance) are welcome.


xfalcox commented on June 5, 2024

We are already using text-embedding-3-large with pgvector in production by simply passing the dimensions parameter with 2000 as the value.

