Comments (3)
I've added the example above to examples/: https://github.com/xhluca/bm25s/blob/main/examples/index_with_metadata.py
from bm25s.
Awesome! I only read the readme (whoops, ha). Will update my llama-index PR to account for this :)
This should already work with the current version of bm25s (0.1), because the corpus passed to the BM25 object is not the corpus passed to the index() method, but rather a "passthrough" that is only needed during retrieval.
Note, however, that saving will only work with JSON-serializable objects (e.g. dict, list).
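One way to catch problems before saving is to try encoding each corpus entry with the standard-library json module. This is only a sketch: `find_unserializable` is a hypothetical helper, not part of bm25s, and it assumes the corpus is written out with ordinary JSON encoding.

```python
import json


def find_unserializable(corpus):
    """Return the indices of corpus entries that json.dumps cannot encode."""
    bad = []
    for i, doc in enumerate(corpus):
        try:
            json.dumps(doc)
        except (TypeError, ValueError):
            bad.append(i)
    return bad


corpus = [
    {"text": "a cat is a feline and likes to purr", "metadata": {"source": "internet"}},
    # A set is not JSON serializable, so this entry should be flagged
    {"text": "a bird is a beautiful animal that can fly", "metadata": {"tags": {"animal"}}},
]

print(find_unserializable(corpus))
```

Entries that get flagged can usually be fixed by converting the offending values to lists, strings, or numbers before passing the corpus to the retriever.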
import bm25s
# Create your corpus here
corpus_json = [
{"text": "a cat is a feline and likes to purr", "metadata": {"source": "internet"}},
{"text": "a dog is the human's best friend and loves to play", "metadata": {"source": "encyclopedia"}},
{"text": "a bird is a beautiful animal that can fly", "metadata": {"source": "cnn"}},
{"text": "a fish is a creature that lives in water and swims", "metadata": {"source": "i made it up"}},
]
corpus_text = [doc["text"] for doc in corpus_json]
# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_tokens = bm25s.tokenize(corpus_text, stopwords="en")
# Create the BM25 model and index the corpus
retriever = bm25s.BM25(corpus=corpus_json)
retriever.index(corpus_tokens)
# Query the corpus
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query)
# Get top-k results as a tuple of (docs, scores). Because a corpus was passed to BM25, results holds corpus entries rather than doc ids. Both are arrays of shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, k=2)
for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")
# You can save the arrays to a directory...
retriever.save("animal_index_bm25")
# ...and load them when you need them
import bm25s
reloaded_retriever = bm25s.BM25.load("animal_index_bm25", load_corpus=True)
# set load_corpus=False if you don't need the corpus
Output:
Rank 1 (score: 1.06): {'text': 'a cat is a feline and likes to purr', 'metadata': {'source': 'internet'}}
Rank 2 (score: 0.48): {'text': 'a fish is a creature that lives in water and swims', 'metadata': {'source': 'i made it up'}}
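To make the "passthrough" idea concrete: conceptually, documents are ranked by their position in the tokenized corpus, and the corpus given to the BM25 object is only used to look up whatever sits at each position. The sketch below shows that lookup in plain Python, independent of bm25s. The `lookup` helper and the hard-coded indices are hypothetical, chosen to mirror the output above.

```python
def lookup(corpus, doc_ids):
    """Map each row of document indices to the corresponding corpus entries."""
    return [[corpus[i] for i in row] for row in doc_ids]


corpus_json = [
    {"text": "a cat is a feline and likes to purr", "metadata": {"source": "internet"}},
    {"text": "a dog is the human's best friend and loves to play", "metadata": {"source": "encyclopedia"}},
    {"text": "a bird is a beautiful animal that can fly", "metadata": {"source": "cnn"}},
    {"text": "a fish is a creature that lives in water and swims", "metadata": {"source": "i made it up"}},
]

# Hypothetical top-2 indices for a single query: shape (n_queries, k) = (1, 2)
doc_ids = [[0, 3]]

results = lookup(corpus_json, doc_ids)
for rank, doc in enumerate(results[0], start=1):
    print(f"Rank {rank}: {doc['text']} (source: {doc['metadata']['source']})")
```

Since the lookup is purely positional, the metadata list has to be in the same order as the texts that were tokenized and indexed.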