ankane / neighbor

Nearest neighbor search for Rails and Postgres

License: MIT License

Ruby 100.00%

nearest-neighbor-search

neighbor's Introduction

Neighbor

Nearest neighbor search for Rails and Postgres

Installation

Add this line to your application’s Gemfile:

gem "neighbor"

Choose An Extension

Neighbor supports two extensions: cube and vector. cube ships with Postgres, while vector supports more dimensions and approximate nearest neighbor search.

For cube, run:

rails generate neighbor:cube
rails db:migrate

For vector, install pgvector and run:

rails generate neighbor:vector
rails db:migrate

Getting Started

Create a migration

class AddEmbeddingToItems < ActiveRecord::Migration[7.1]
  def change
    add_column :items, :embedding, :cube
    # or
    add_column :items, :embedding, :vector, limit: 3 # dimensions
  end
end

Add to your model

class Item < ApplicationRecord
  has_neighbors :embedding
end

Update the vectors

item.update(embedding: [1.0, 1.2, 0.5])

Get the nearest neighbors to a record

item.nearest_neighbors(:embedding, distance: "euclidean").first(5)

Get the nearest neighbors to a vector

Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)

Distance

Supported values are:

euclidean
cosine
taxicab (cube only)
chebyshev (cube only)
inner_product (vector only)
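
As a rough reference for what these names mean, the distances can be written in plain Ruby. This is only a sketch: `DistanceSketch` is a hypothetical module, not part of Neighbor, and the values the database reports may follow different sign or scaling conventions.

```ruby
# Plain-Ruby reference versions of the listed distances (illustrative only).
module DistanceSketch
  module_function

  # Straight-line distance: square root of the summed squared differences.
  def euclidean(a, b)
    Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
  end

  # Sum of the element-wise products.
  def inner_product(a, b)
    a.zip(b).sum { |x, y| x * y }
  end

  # One minus the cosine similarity of the two vectors.
  def cosine(a, b)
    norms = Math.sqrt(inner_product(a, a)) * Math.sqrt(inner_product(b, b))
    1 - inner_product(a, b).fdiv(norms)
  end

  # Sum of the absolute differences (also called Manhattan distance).
  def taxicab(a, b)
    a.zip(b).sum { |x, y| (x - y).abs }
  end

  # Largest absolute difference along any single dimension.
  def chebyshev(a, b)
    a.zip(b).map { |x, y| (x - y).abs }.max
  end
end

a = [1.0, 2.0, 2.0]
b = [4.0, 6.0, 2.0]

DistanceSketch.euclidean(a, b)     # => 5.0
DistanceSketch.taxicab(a, b)       # => 7.0
DistanceSketch.chebyshev(a, b)     # => 4.0
DistanceSketch.inner_product(a, b) # => 20.0
```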

For cosine distance with cube, vectors must be normalized before being stored.

class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end
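
To illustrate what normalizing means here, the sketch below scales a vector to unit length in plain Ruby and checks a standard identity: for unit vectors, the squared Euclidean distance equals 2 - 2 * (cosine similarity), so ranking by Euclidean distance gives the same order as ranking by cosine distance. The helper names are hypothetical, and this is not a description of how Neighbor implements the option internally.

```ruby
# Scale a vector so that its Euclidean (L2) length is 1.
def normalize(vector)
  norm = Math.sqrt(vector.sum { |x| x * x })
  raise ArgumentError, "cannot normalize a zero vector" if norm.zero?

  vector.map { |x| x / norm }
end

# Sum of the element-wise products.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Squared straight-line distance between two vectors.
def squared_euclidean(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

u = normalize([3.0, 4.0])
v = normalize([1.0, 0.0])

# For unit vectors these two values agree up to floating-point error.
left  = squared_euclidean(u, v)
right = 2 - 2 * dot(u, v)
puts (left - right).abs < 1e-12 # prints "true"
```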

For inner product with cube, see this example.

Records returned from nearest_neighbors will have a neighbor_distance attribute

nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
nearest_item.neighbor_distance

Dimensions

The cube data type can have up to 100 dimensions by default. See the Postgres docs for how to increase this. The vector data type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.

For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.

class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end
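
When vectors are prepared in bulk before being written, a similar check can be done in plain Ruby. This is a minimal sketch with a hypothetical helper, `check_dimensions!`, that is not part of Neighbor; it only shows the kind of consistency the dimensions option is meant to guarantee.

```ruby
# Hypothetical helper: raise unless every vector has the expected length.
def check_dimensions!(vectors, expected)
  vectors.each_with_index do |vector, index|
    next if vector.length == expected

    raise ArgumentError,
          "vector #{index} has #{vector.length} dimensions, expected #{expected}"
  end
  vectors
end

check_dimensions!([[1.0, 1.2, 0.5], [0.9, 1.3, 1.1]], 3) # returns the vectors

begin
  check_dimensions!([[1.0, 1.2, 0.5], [0.9, 1.3]], 3)
rescue ArgumentError => e
  puts e.message # prints "vector 1 has 2 dimensions, expected 3"
end
```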

Indexing

For vector, add an approximate index to speed up queries. Create a migration with:

class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.1]
  def change
    add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
    # or
    add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
  end
end

Use :vector_cosine_ops for cosine distance and :vector_ip_ops for inner product.

Set the size of the dynamic candidate list with HNSW

Item.connection.execute("SET hnsw.ef_search = 100")

Or the number of probes with IVFFlat

Item.connection.execute("SET ivfflat.probes = 3")

OpenAI Embeddings

Generate a model

rails generate model Document content:text embedding:vector{1536}
rails db:migrate

And add has_neighbors

class Document < ApplicationRecord
  has_neighbors :embedding
end

Create a method to call the embeddings API

require "json"
require "net/http"

def fetch_embeddings(input)
  url = "https://api.openai.com/v1/embeddings"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("OPENAI_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    input: input,
    model: "text-embedding-ada-002"
  }

  response = Net::HTTP.post(URI(url), data.to_json, headers)
  JSON.parse(response.body)["data"].map { |v| v["embedding"] }
end

Pass your input

input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = fetch_embeddings(input)

Store the embeddings

documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)

And get similar documents

document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)

See the complete code

Disco Recommendations

You can use Neighbor for online item-based recommendations with Disco. We’ll use MovieLens data for this example.

Generate a model

rails generate model Movie name:string factors:cube
rails db:migrate

And add has_neighbors

class Movie < ApplicationRecord
  has_neighbors :factors, dimensions: 20, normalize: true
end

Fit the recommender

data = Disco.load_movielens
recommender = Disco::Recommender.new(factors: 20)
recommender.fit(data)

Store the item factors

movies = []
recommender.item_ids.each do |item_id|
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
end
Movie.insert_all!(movies) # use create! for Active Record < 6

And get similar movies

movie = Movie.find_by(name: "Star Wars (1977)")
movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)

See the complete code for cube and vector

Upgrading

0.2.0

The distance option has been moved from has_neighbors to nearest_neighbors, and there is no longer a default. If you use cosine distance, set:

class Item < ApplicationRecord
  has_neighbors normalize: true
end

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

Report bugs
Fix bugs and submit pull requests
Write, clarify, or fix documentation
Suggest or add new features

To get started with development:

git clone https://github.com/ankane/neighbor.git
cd neighbor
bundle install
createdb neighbor_test

# cube
bundle exec rake test

# vector
EXT=vector bundle exec rake test


neighbor's Issues

Multiple attributes

Is there any way to support multiple attributes?

For instance, a model may have embeddings and factors, both of which are used for different purposes. Is there any way then to do something like

class SomeModel < ApplicationRecord
  has_neighbors :embeddings, :factors
  
   ...
end

Is there a good way to support multiple embeddings tied to a particular record?

Let's say you have a collection of Documents and you store embeddings for these documents, but due to token constraints, you have to store multiple embeddings tied to each document.

Do you have a recommended way for supporting this with Neighbor?

Current ideas:

Have an Embeddings model with an embeddable association. I'm not sure Neighbor would support this out of the box, because it would have to support filtering on embeddable_type.
Have a DocumentEmbeddings model and run nearest neighbor searches on that, with some deduplication logic.
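
The deduplication step in the second idea could look roughly like the plain-Ruby sketch below. It assumes each search hit carries a document id and a distance; `ChunkHit` and `best_hit_per_document` are hypothetical names, not part of Neighbor.

```ruby
# Hypothetical shape of one search hit: the parent document and its distance.
ChunkHit = Struct.new(:document_id, :distance)

# Keep only the closest chunk for each document, then return the closest documents.
def best_hit_per_document(hits, limit)
  hits
    .group_by(&:document_id)
    .map { |_id, group| group.min_by(&:distance) }
    .sort_by(&:distance)
    .first(limit)
end

hits = [
  ChunkHit.new(1, 0.12),
  ChunkHit.new(2, 0.20),
  ChunkHit.new(1, 0.25),
  ChunkHit.new(3, 0.31),
  ChunkHit.new(2, 0.40)
]

best_hit_per_document(hits, 2).map(&:document_id) # => [1, 2]
```

Because several chunks can belong to the same document, the underlying search would need to fetch more chunk rows than the number of documents wanted.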

Ideas

Add support for Active Record 7 - rails7 branch
Add support for custom attribute - custom_attribute branch (figure out multiple attributes before merging)
Add support for multiple attributes

Ideas

Please create a new issue to discuss any ideas or share your own.

0.4.0

(breaking) Add type mapping for cube and vector columns without has_neighbors
(breaking) Remove default attribute name - no-default-attribute branch

Ideas

Add max_distance / threshold option

Limiting and ordering results

Problem

Unless I'm missing something, I believe that adding support for limiting and ordering is an important feature. Consider the model:

doc = Document.new(text: "Some text", embedding: [])

Currently if I run a nearest_neighbor search on the doc, it returns all documents per the default ordering in Rails.

puts doc.nearest_neighbors(:embedding, distance: "inner_product").map(&:neighbor_distance)
=> [
  0.7474747,
  0.4638648,
  0.8382633,
  0.9837744,
  0.9237373,
  0.8366281
]

With a small number of records, searching and sorting the results is not a problem, but on larger datasets it becomes a real performance issue.

Solution

What would address this problem (I feel) would be to add limit, order, and threshold options.

# Order results by specified columns
doc.nearest_neighbors(:embedding, distance: "inner_product", order: { neighbor_distance: :desc })

# Only return records with distance score > or < X (gte, gt, lte, lt)
doc.nearest_neighbors(:embedding, distance: "inner_product", threshold: { gte: 0.9 })

# Limit the number of records returned from the neighbor search
doc.nearest_neighbors(:embedding, distance: "inner_product", limit: 5)

While all these operations can obviously be performed with any returned result in memory, it would be way better to have them happen at the DB level.
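
To pin down the semantics being proposed, here is the in-memory equivalent as a plain-Ruby sketch, using the distances listed above. `Scored` and `filter_neighbors` are hypothetical names for illustration only; none of these options exist in Neighbor as written here. Note that the README examples chain `.first(5)` onto `nearest_neighbors`.

```ruby
# Hypothetical stand-in for a record that carries a neighbor_distance.
Scored = Struct.new(:id, :neighbor_distance)

# In-memory version of the proposed order, threshold, and limit options.
def filter_neighbors(records, order: :asc, gte: nil, lte: nil, limit: nil)
  result = records.select do |record|
    distance = record.neighbor_distance
    (gte.nil? || distance >= gte) && (lte.nil? || distance <= lte)
  end
  result = result.sort_by(&:neighbor_distance)
  result = result.reverse if order == :desc
  limit ? result.first(limit) : result
end

distances = [0.7474747, 0.4638648, 0.8382633, 0.9837744, 0.9237373, 0.8366281]
records = distances.each_with_index.map { |d, i| Scored.new(i + 1, d) }

filter_neighbors(records, order: :desc, gte: 0.9).map(&:neighbor_distance)
# => [0.9837744, 0.9237373]
filter_neighbors(records, limit: 2).map(&:id)
# => [2, 1]
```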

missing keyword: :distance

Ruby 3.2.2

I can't get past this error: missing keyword: :distance when calling .nearest_neighbors(:embeddings, distance: "euclidean")

Not sure whether it's my setup or what is going on...

Thanks in advance for any help.

Update OpenAI quickstart to use Cosine like their documentation

In the OpenAI section in README, inner_product is used. Let's use cosine instead:
https://platform.openai.com/docs/guides/embeddings/limitations-risks

`nearest_neighbors` on average of relational vectors?

This is in part a more generic question about embeddings and vectors, but I'm curious how it would apply to neighbor specifically.

Let's say I have a table of paragraphs, each with their appropriate embeddings.

Each paragraph belongs_to :chapter.

How can I use neighbor to find the nearest chapter for a given embedding?

It's my understanding (but I'm not 100% sure of this), that you could simply take the average of the paragraph embeddings for a given chapter, to get the embedding of that chapter. (e.g. if you were to calculate the embedding vectors for the whole chapter text, you'd end up with the same embeddings as averaging the embeddings of each individual paragraph).

For example I tried the following, but nearest_neighbors isn't defined on ActiveRecord::Relation

Chapter.joins(:paragraphs).nearest_neighbors("AVG(paragraphs.embedding)", [0.9, 1.3, 1.1], distance: "euclidean").first
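
The averaging step itself can be sketched in plain Ruby, assuming all paragraph embeddings have the same number of dimensions. `mean_vector` is a hypothetical helper. Averaging is a common heuristic, but the result is not guaranteed to match the embedding a model would produce for the full chapter text.

```ruby
# Element-wise mean of equally sized vectors.
def mean_vector(vectors)
  raise ArgumentError, "no vectors given" if vectors.empty?

  # transpose groups the values of each dimension together
  # (it raises IndexError if the vectors differ in length).
  vectors.transpose.map { |column| column.sum / column.length.to_f }
end

paragraph_embeddings = [
  [1.0, 2.0, 3.0],
  [3.0, 4.0, 5.0]
]

chapter_embedding = mean_vector(paragraph_embeddings) # => [2.0, 3.0, 4.0]
```

One option would then be to store the averaged vector in its own column on the chapters table and call nearest_neighbors on that column, rather than aggregating inside the query.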

Best practice for not loading embeddings throughout the app?

Embeddings can take up quite a lot of space. In fact, in my todo-list app they account for 96% of record size[1]. Considering I don't actually need access to these embeddings in the majority of cases, I wonder how we can avoid loading them except when we actually need them?

Idea 1: `default_scope`

The naive solution would be to create a default scope like so:

class Todo < ApplicationRecord
  default_scope { select(column_names - ["embedding"]) }
end

But 1) I dislike the idea of spelling out all column_names in each query like this, and 2) I expect it to break neighbors.

Idea 2: Separate table

Another solution would be to store the embeddings in a separate table and only joins(:embedding) when needed. This seems like a cleaner solution, but it breaks nearest_neighbors, etc.

What approach would you recommend to avoid loading the embedding column except when we explicitly need it?

[1] SQL query to calculate the size of the embeddings versus the rest of the table

SELECT
	avg_embedding_size,
	avg_total_row_size,
	(avg_embedding_size / avg_total_row_size) * 100 AS embedding_percentage
FROM (
	SELECT
		AVG(pg_column_size(embedding)) AS avg_embedding_size,
		AVG(pg_column_size(t.*)) AS avg_total_row_size
	FROM (
		SELECT
			*
		FROM
			todos
		ORDER BY
			completed_at DESC
		LIMIT 25) AS t) AS subquery

avg_embedding_size	avg_total_row_size	embedding_percentage
6141.7600000000000000	6361.2800000000000000	96.54912218924493183800

Is it possible to create a word embedding in Ruby without using OpenAI

I am currently working on a project that requires the creation of a word embedding for a given text. However, I am not sure if it is possible to do this in Ruby without using OpenAI. I have looked into various libraries such as gensim and fasttext, but I am not sure if they support creating word embeddings from scratch.

If anyone has any experience or advice on creating word embeddings in Ruby without using OpenAI, I would greatly appreciate any guidance or recommendations. Thank you in advance for your help.

Access to `neighbor_distance` when using `nearest_neighbors`

I found out that if I add attribute :neighbor_distance to my model, the distance from the query is set on the returned records. This is not exactly the max_distance option from the Ideas issue, but it is better than nothing.

class Document < ApplicationRecord
  has_neighbors :embedding

  attribute :neighbor_distance
end
