
bm25s's Introduction

BM25-Sparse⚡

BM25S is an ultrafast implementation of BM25 in pure Python, powered by Scipy sparse matrices

💻 GitHub 🏠 Homepage 📝 Technical Report 🤗 Blog Post

Welcome to bm25s, a library that implements BM25 in Python, allowing you to rank documents based on a query. BM25 is a widely used ranking function used for text retrieval tasks, and is a core component of search services like Elasticsearch.

It is designed to be:

  • Fast: bm25s is implemented in pure Python and leverages Scipy sparse matrices to store eagerly computed scores for all document tokens. This allows extremely fast scoring at query time, improving performance over popular libraries by orders of magnitude (see the benchmarks below and the conceptual sketch after this list).
  • Simple: bm25s is designed to be easy to use and understand. You can install it with pip and start using it in minutes. There are no dependencies on Java or PyTorch - all you need is Scipy and Numpy, plus optional lightweight dependencies for stemming.
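
To make the "eager scoring" idea concrete, here is a minimal, self-contained sketch of the approach (not the library's actual code): the BM25 contribution of every (token, document) pair is computed once at indexing time and stored in a Scipy sparse matrix, so query-time scoring reduces to summing a few precomputed rows. The toy counts, parameters, and Lucene-style idf below are purely illustrative.

# Conceptual sketch of eager sparse BM25 scoring (illustrative values, not library internals)
import numpy as np
from scipy import sparse

# toy data: term frequencies tf[token, doc] for 5 tokens over 3 documents
tf = np.array([
    [1, 0, 2],
    [0, 1, 0],
    [3, 0, 0],
    [0, 2, 1],
    [1, 1, 1],
], dtype=np.float64)

k1, b = 1.5, 0.75
doc_len = tf.sum(axis=0)
avg_len = doc_len.mean()
n_docs = tf.shape[1]
df = (tf > 0).sum(axis=1)                            # document frequency per token
idf = np.log(1 + (n_docs - df + 0.5) / (df + 0.5))   # Lucene-style idf

# Eagerly compute the BM25 contribution of every (token, document) pair...
denom = tf + k1 * (1 - b + b * doc_len / avg_len)
scores = idf[:, None] * tf * (k1 + 1) / denom
S = sparse.csr_matrix(scores)                        # ...and store it sparsely

# At query time, scoring is just a sum of precomputed rows
query_token_ids = [0, 3]
query_scores = np.asarray(S[query_token_ids].sum(axis=0)).ravel()
top_k = np.argsort(query_scores)[::-1][:2]
print(top_k, query_scores[top_k])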

Below, we compare bm25s with Elasticsearch in terms of speedup over rank-bm25, the most popular Python implementation of BM25. We measure the throughput in queries per second (QPS) on a few popular datasets from BEIR in a single-threaded setting.

[Figure: throughput comparison (speedup over rank-bm25) of bm25s and Elasticsearch on BEIR datasets]


Important

BM25S just got faster! We are currently testing out integration with numba, which would make it up to 2x faster for larger datasets! Learn more about it and share your thoughts in this discussion thread.
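
For context, here is a tentative sketch of what opting into the numba backend might look like, assuming the experimental API exposes it via a backend argument (this is an assumption; see the discussion thread for the actual interface):

# Hypothetical usage of the experimental numba backend; the final API may differ
# pip install numba
import bm25s

corpus_tokens = bm25s.tokenize(["a cat purrs", "a dog barks"])
retriever = bm25s.BM25(backend="numba")  # assumption: backend selected via this argument
retriever.index(corpus_tokens)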

Installation

You can install bm25s with pip:

pip install bm25s

You can also install the recommended (but optional) extra dependencies:

# Install all extra dependencies
pip install bm25s[full]

# If you want to use stemming for better results, you can install a stemmer
pip install PyStemmer

# To speed up the top-k selection process, you can install `jax`
pip install jax[cpu]

Quickstart

Here is a simple example of how to use bm25s:

import bm25s
import Stemmer  # optional: for stemming

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# optional: create a stemmer
stemmer = Stemmer.Stemmer("english")

# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Query the corpus
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)

for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")

# You can save the arrays to a directory...
retriever.save("animal_index_bm25")

# You can save the corpus along with the model
retriever.save("animal_index_bm25", corpus=corpus)

# ...and load them when you need them
import bm25s
reloaded_retriever = bm25s.BM25.load("animal_index_bm25", load_corpus=True)
# set load_corpus=False if you don't need the corpus

For an example that shows how to quickly index a 2M-document corpus (Natural Questions), check out examples/index_nq.py.

Flexibility

bm25s provides a flexible API that allows you to customize the BM25 model and the tokenization process. Here are some of the options you can use:

# You can provide a list of queries instead of a single query
queries = ["What is a cat?", "is the bird a dog?"]

# Provide your own stopwords list if you don't like the default one
stopwords = ["a", "the"]

# For stemming, use any function that is callable on each word list
stemmer_fn = lambda lst: [word for word in lst]

# Tokenize the queries
query_token_ids = bm25s.tokenize(queries, stopwords=stopwords, stemmer=stemmer_fn)

# If you want the tokenizer to return strings instead of token ids, you can do this
query_token_strs = bm25s.tokenize(queries, return_ids=False)

# You can use a different corpus for retrieval, e.g., titles instead of full docs
titles = ["About Cat", "About Dog", "About Bird", "About Fish"]

# You can also choose to only return the documents and omit the scores
results = retriever.retrieve(query_token_ids, corpus=titles, k=2, return_as="documents")

# The documents are returned as a numpy array of shape (n_queries, k)
for i in range(results.shape[1]):
    print(f"Rank {i+1}: {results[0, i]}")

Memory Efficient Retrieval

bm25s is designed to be memory efficient. You can use the mmap option to load the BM25 index as a memory-mapped file, which lets you query the index without loading it fully into memory. This is useful when you have a large index and want to save memory:

# Create a BM25 index
# ...

# let's say you have a large corpus
corpus = [
    "a very long document that is very long and has many words",
    "another long document that is long and has many words",
    # ...
]
# Save the BM25 index to a file
retriever.save("bm25s_very_big_index", corpus=corpus)

# Load the BM25 index as a memory-mapped file, which is memory efficient
# and reduces the overhead of loading the full index into memory
retriever = bm25s.BM25.load("bm25s_very_big_index", mmap=True)

For an example of how to run retrieval with mmap=True, check out examples/retrieve_nq.py.

Variants

You can use the following variants of BM25 in bm25s (see Kamphuis et al. 2020 for more details):

  • Original implementation (method="robertson") - we set idf>=0 to avoid negatives
  • ATIRE (method="atire")
  • BM25L (method="bm25l")
  • BM25+ (method="bm25+")
  • Lucene (method="lucene")

By default, bm25s uses method="lucene", which is Lucene's BM25 implementation (exact version). You can change the method by passing the method argument to the BM25 constructor:

# The IR book recommends default values of k1 between 1.2 and 2.0, and b=0.75
retriever = bm25s.BM25(method="robertson", k1=1.5, b=0.75)

# For BM25+, BM25L, you need a delta parameter (default is 0.5)
retriever = bm25s.BM25(method="bm25+", delta=1.5)

# You can also choose a different "method" for idf, while keeping the default for the rest
# for example, this is equivalent to rank-bm25 when `epsilon=0`
retriever = bm25s.BM25(method="atire", idf_method="robertson")
# and this is equivalent to bm25-pt
retriever = bm25s.BM25(method="atire", idf_method="lucene")

Hugging Face Integration

bm25s integrates naturally with Hugging Face's huggingface_hub, allowing you to save to and load from the model hub. This is useful for sharing BM25 indices and using community models.

First, make sure you have a valid access token for the Hugging Face model hub. This is needed to save models to the hub, or to load private models. Once you have created it, you can add it to your environment variables (e.g. in your .bashrc or .zshrc):

export HF_TOKEN="hf_..."

Now, let's install the huggingface_hub library:

pip install huggingface_hub

Let's see how to use BM25SHF.save_to_hub to save a BM25 index to the Hugging Face model hub:

import os
import bm25s
from bm25s.hf import BM25HF

# Create a BM25 index
retriever = BM25HF()
# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
corpus_tokens = bm25s.tokenize(corpus)
retriever.index(corpus_tokens)

# Set your username and token
user = "your-username"
token = os.environ["HF_TOKEN"]
retriever.save_to_hub(f"{user}/bm25s-animals", token=token, corpus=corpus)
# You can also save it publicly with private=False

Then, you can use the following code to load a BM25 index from the Hugging Face model hub:

import bm25s
from bm25s.hf import BM25HF

# Load a BM25 index from the Hugging Face model hub
user = "your-username"
retriever = BM25HF.load_from_hub(f"{user}/bm25s-animals")

# you can specify revision and load_corpus=True if needed
retriever = BM25HF.load_from_hub(
    f"{user}/bm25s-animals", revision="main", load_corpus=True
)

# if you want a low-memory usage, you can load as memory map with `mmap=True`
retriever = BM25HF.load_from_hub(
    f"{user}/bm25s-animals", load_corpus=True, mmap=True
)

# Query the corpus
query = "does the fish purr like a cat?"

# Tokenize the query
query_tokens = bm25s.tokenize(query)

# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, k=2)

For complete examples, check out the scripts in the examples directory of the repository.

Comparison

Here are some benchmarks comparing bm25s to other popular BM25 implementations. We compare the following implementations:

  • bm25s: Our implementation of BM25 in pure Python, powered by Scipy sparse matrices.
  • rank-bm25 (Rank): A popular Python implementation of BM25.
  • bm25_pt (PT): A Pytorch implementation of BM25.
  • elasticsearch (ES): Elasticsearch with BM25 configurations.

OOM means the implementation ran out of memory during the benchmark.

Throughput (Queries per second)

We compare the throughput of the BM25 implementations on various datasets. Throughput is measured in queries per second (QPS) on a single-threaded Intel Xeon CPU @ 2.70GHz (found on Kaggle). For BM25S, we take the average of 10 runs.

Dataset            BM25S    Elastic  BM25-PT  Rank-BM25
arguana            573.91   13.67    110.51   2
climate-fever      13.09    4.02     OOM      0.03
cqadupstack        170.91   13.38    OOM      0.77
dbpedia-entity     13.44    10.68    OOM      0.11
fever              20.19    7.45     OOM      0.06
fiqa               507.03   16.96    20.52    4.46
hotpotqa           20.88    7.11     OOM      0.04
msmarco            12.2     11.88    OOM      0.07
nfcorpus           1196.16  45.84    256.67   224.66
nq                 41.85    12.16    OOM      0.1
quora              183.53   21.8     6.49     1.18
scidocs            767.05   17.93    41.34    9.01
scifact            952.92   20.81    184.3    47.6
trec-covid         85.64    7.34     3.73     1.48
webis-touche2020   60.59    13.53    OOM      1.1

More detailed benchmarks can be found in the bm25-benchmarks repo.

Disk usage

bm25s is designed to be lightweight: it only requires wheels for numpy (18MB) and scipy (37MB), and the package itself is less than 100KB. After installation, the full virtual environment takes more space than rank-bm25, but less than pyserini and bm25_pt:

Package            Disk Usage
venv (no package)  45MB
rank-bm25          99MB
bm25s (ours)       479MB
bm25_pt            5346MB
pyserini           6976MB
elastic            1183MB

The disk usage of the virtual environments is calculated using the following command:

$ du -s *env-* --block-size=1MB
6976    conda-env-pyserini
5346    venv-bm25-pt
479     venv-bm25s
45      venv-empty
99      venv-rank-bm25

For pyserini, we use the recommended installation with conda environment to account for Java dependencies.

Optimized RAM usage

bm25s allows considerable memory savings through memory-mapping, which lets the index stay on disk and be loaded on demand.

When testing 6 arbitrary queries against an index built on MS MARCO (8.8M documents, 300M+ tokens), we observe the following:

Method         Load Index (s)  Retrieval (s)  RAM usage (GB)
Memory-mapped  0.62            0.18           0.90
In-memory      11.41           0.74           10.56

When you run bm25s on 1000 queries against the Natural Questions dataset (2M+ documents), memory usage is over 50% lower than with the in-memory version, with a trivial difference in speed. You can find more information in the GitHub repository.
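
A rough way to reproduce this kind of comparison yourself (a sketch, assuming psutil is installed and "bm25s_very_big_index" is an index you saved earlier, as in the memory-mapped example above):

import os
import time

import psutil
import bm25s

process = psutil.Process(os.getpid())

start = time.monotonic()
retriever = bm25s.BM25.load("bm25s_very_big_index", mmap=True)  # set mmap=False for the in-memory variant
load_time = time.monotonic() - start

# ... run your queries here with retriever.retrieve(...) ...

rss_gb = process.memory_info().rss / 1e9
print(f"Load time: {load_time:.2f}s, resident memory: {rss_gb:.2f} GB")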

Acknowledgement

  • The multilingual stopwords are sourced from the NLTK stopwords lists.
  • The numba implementation is inspired by the numba implementations originally proposed by baguetter and retriv.

Citation

If you use bm25s in your work, please cite it with the following BibTeX entry:

@misc{bm25s,
      title={BM25S: Orders of magnitude faster lexical search via eager sparse scoring}, 
      author={Xing Han Lù},
      year={2024},
      eprint={2407.03618},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.03618}, 
}

bm25s's People

Contributors

bm777, dantetemplar, tomaarsen, xhluca


bm25s's Issues

Using with postgres?

I'm using Supabase for one of my projects, and it has about 2.3M rows. Currently, the data is only fetched using certain attributes, as full-text search is pretty slow. Is there any way we can use bm25s with the existing infrastructure?

Thanks for your response.
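
Not an official integration, but one workable pattern (a sketch; the psycopg2 connection string, table, and column names are placeholders for your schema): pull the text out of Postgres once, index it with bm25s, and keep the row ids alongside so retrieved results map back to database rows.

import bm25s
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn.cursor() as cur:
    cur.execute("SELECT id, body FROM documents")  # placeholder table/columns
    rows = cur.fetchall()

row_ids = [r[0] for r in rows]
texts = [r[1] for r in rows]

retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(texts, stopwords="en"))

results, scores = retriever.retrieve(bm25s.tokenize("full text search"), k=10)
matching_ids = [row_ids[i] for i in results[0]]  # primary keys to fetch back from Postgres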

Order-based matching of corpus metadata to tokens

Hi! Thanks a lot for this nice little library, the timing is perfect :)

If I want to provide additional metadata in my corpus, how is it matched to the indexed corpus tokens at retrieval time? Is it entirely based on both structures having the same order such that the indices apply?

Just looking for a quick confirmation before using this in a real-world application :)

Quick example to illustrate:

import bm25s
import Stemmer

# corpus with metadata
corpus = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]

stemmer = Stemmer.Stemmer("english")

# build corpus without metadata
corpus_tokens = bm25s.tokenize([d['text'] for d in corpus], stopwords="en", stemmer=stemmer)
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)

for i in range(results.shape[1]):

    # doc is a dictionary with "id" and "text" - how are they matched?
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i + 1} (score: {score:.2f}): {doc}")

🚨Before submitting an issue, read this 🚨

There are many reasons why you might want to open an issue:

  • You have a question about how to use the library
  • You have an idea how the library could be improved, and would like to discuss it
  • You would like to highlight a general discussion
  • You found a bug
  • You would like to outline a new feature to add to the library

Please only open an issue for the last two cases: a bug (report or fix) or a concrete new feature. Everything else will be moved to Discussions.

Maybe use `time.monotonic` instead of `time.time`?

time.time() is commonly used to measure elapsed time, but it can be unreliable because it depends on the system clock, which can be adjusted. This can lead to incorrect time measurements.

Why time.monotonic?

time.monotonic() provides a clock that only moves forward and is unaffected by system clock changes, making it ideal for accurate time interval measurements, such as in timeouts and performance benchmarks.

Proposal

Replace time.time() with time.monotonic() in the codebase:

start_time = time.time()

stop_time = time.time()

paused_time = time.time()

self.results[name]["last"] = time.time()

...OK, there are too many to list; I will make a PR.
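
For illustration, a minimal timing helper along the lines of the proposal (a sketch, not code from the repo):

import time

class MonotonicTimer:
    """Context manager that measures elapsed time with a clock that cannot jump backwards."""

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.monotonic() - self.start


with MonotonicTimer() as t:
    sum(range(1_000_000))  # stand-in for indexing or retrieval work
print(f"elapsed: {t.elapsed:.4f} s")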

On-the-fly stemming

Right now, stemming is done after the strings are split and converted to IDs:

bm25s/bm25s/tokenization.py, lines 152 to 177 at commit 73c7dea:

# Step 2: Stem the tokens if a stemmer is provided
if stemmer is not None:
    if hasattr(stemmer, "stemWords"):
        stemmer_fn = stemmer.stemWords
    elif callable(stemmer):
        stemmer_fn = stemmer
    else:
        error_msg = "Stemmer must have a `stemWord` method, or be callable. For example, you can use the PyStemmer library."
        raise ValueError(error_msg)

    # Now, we use the stemmer on the token_to_index dictionary to get the stemmed tokens
    tokens_stemmed = stemmer_fn(unique_tokens)
    vocab = set(tokens_stemmed)
    vocab_dict = {token: i for i, token in enumerate(vocab)}
    stem_id_to_stem = {v: k for k, v in vocab_dict.items()}
    # We create a dictionary mapping the stemmed tokens to their index
    doc_id_to_stem_id = {
        token_to_index[token]: vocab_dict[stem]
        for token, stem in zip(unique_tokens, tokens_stemmed)
    }

    # Now, we simply need to replace the tokens in the corpus with the stemmed tokens
    for i, doc_ids in enumerate(tqdm(corpus_ids, desc="Stem Tokens", leave=leave, disable=not show_progress)):
        corpus_ids[i] = [doc_id_to_stem_id[doc_id] for doc_id in doc_ids]
else:
    vocab_dict = token_to_index

However, it can probably be done here instead:

bm25s/bm25s/tokenization.py, lines 141 to 142 at commit 73c7dea:

if token not in token_to_index:
    token_to_index[token] = len(token_to_index)

Probably would need:

token_to_stem = {}  # do we need this? maybe useful to keep, though stemmer_fn should be sufficient
token_to_index = {}  # this is used to convert tokens to stem id (the true id) on the fly
stem_to_index = {}  # only tracks stems and their ID (this is the true vocab dict)

# example: changing -> chang, changed -> chang
# chang's stem_id = 42
# stem_to_index = {"chang": 42}  --> real vocab_dict
# token_to_index = {"changing": 42, "changed": 42}

# ...

for ...:
  if token not in token_to_index:
    stem = stemmer_fn(token)
    if stem not in stem_to_index:
      stem_to_index[stem] = len(stem_to_index)
    stem_id = stem_to_index[stem]
    token_to_index[token] = stem_id  # the token should now map to the stem's ID
  token_id = token_to_index[token]
# ...
vocab_dict = stem_to_index

Thread safe search

Amazing work on this! Works great.

Is retrieval thread-safe? On a glance, it seems like it should be, but I have trouble using multi-threading in a notebook. It crashes most of the time, but when it works the results are correct.

I should add that I have this trouble irrespective of whether the backend is jax or numpy.

Other language than english for the stopwords list

Thanks for writing this repo.

This project currently supports only English stopwords out of the box, but there is the possibility to pass our own list of stopwords, for example for Serbian, French, or Chinese.

def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
    if stopwords in ["english", "en", True]:
        return STOPWORDS_EN
    elif stopwords in [None, False]:
        return []
    elif isinstance(stopwords, str):
        raise ValueError(
            f"{stopwords} not recognized. Only default English stopwords are currently supported. "
            "Please input a list of stopwords"
        )
    else:
        return stopwords


Could we add stopword lists for other languages to the repo by opening a PR, or do you plan to incorporate them yourself (or not at all)?

I could be open to adding other languages :)
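
In the meantime, the existing API already accepts an explicit list (as shown in the Flexibility section above), so non-English stopwords can be passed directly. A sketch with a tiny, purely illustrative French list:

import bm25s

french_stopwords = ["le", "la", "les", "de", "des", "et", "un", "une"]  # illustrative, not exhaustive

corpus = ["le chat aime ronronner", "le chien aime jouer"]
corpus_tokens = bm25s.tokenize(corpus, stopwords=french_stopwords)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)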

[Feature request] Document metadata and filtering

Hi, I've used bm25s on a fairly large production dataset, and I'm super-impressed by the speed!! Having fumbled around with rank_bm25 quite a bit and suffered through the pain of its slow speed and large memory usage, I would say the speed and memory efficiency of bm25s is absolutely mind-blowing.

As a suggestion, I think it might be useful to add support for document metadata and filtering. The metadata would be fields like "author", "title", "date", etc. which wouldn't be included in the keyword tokenization, but can be used for filtering the search results during query (e.g. only searching for documents from a specific author).

Thanks!
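
Until first-class filtering exists, one stopgap (a sketch building on the corpus-of-dicts pattern used elsewhere in this README; the author field is hypothetical) is to over-fetch and filter the returned documents on their metadata:

import bm25s

corpus = [
    {"text": "a cat is a feline and likes to purr", "author": "alice"},
    {"text": "a dog is the human's best friend and loves to play", "author": "bob"},
]
corpus_tokens = bm25s.tokenize([d["text"] for d in corpus], stopwords="en")

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Over-fetch (k larger than you need), then keep only documents whose metadata matches
results, scores = retriever.retrieve(bm25s.tokenize("cat"), corpus=corpus, k=2)
filtered = [doc for doc in results[0] if doc["author"] == "alice"]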

Updating an index for batch indexing

Hi! Is it possible to update an existing index, e.g. for batch indexing of larger-than-memory datasets?
Unfortunately, this does not work:

import Stemmer
import bm25s

batch_0 = [
    {"id": 0, "text": "a cat is a feline and likes to purr"},
    {"id": 1, "text": "a dog is the human's best friend and loves to play"},
]

batch_1 = [
    {"id": 2, "text": "a bird is a beautiful animal that can fly"},
    {"id": 3, "text": "a fish is a creature that lives in water and swims"},
]


def index_corpus(corpus_batch):
    corpus_tokens = bm25s.tokenize([d['text'] for d in corpus_batch], stopwords="en", stemmer=stemmer)
    retriever.index(corpus_tokens)


def query_corpus(query):
    query_tokens = bm25s.tokenize(query, stemmer=stemmer)

    all_batches = batch_0 + batch_1
    results, scores = retriever.retrieve(query_tokens, corpus=all_batches, k=2)

    for i in range(results.shape[1]):
        doc, score = results[0, i], scores[0, i]
        print(f"Rank {i + 1} (score: {score:.2f}): {doc}")


retriever = bm25s.BM25()
stemmer = Stemmer.Stemmer("english")

index_corpus(batch_0)
index_corpus(batch_1)

query_corpus("what is a fish?")

Pre-computed TF-IDF

Is it possible to pass a pre-computed TF-IDF matrix (with the shape [documents, vocabulary])?

Minor bug: `show_progress` not propagated in `BM25.index`

Just a minor bug: the show_progress setting is not getting propagated from BM25.index to BM25.build_index_from_ids. Currently __init__.py line 352-356:

            scores = self.build_index_from_ids(
                unique_token_ids=unique_token_ids,
                corpus_token_ids=corpus_token_ids,
                leave_progress=leave_progress
            )

should be

            scores = self.build_index_from_ids(
                unique_token_ids=unique_token_ids,
                corpus_token_ids=corpus_token_ids,
                leave_progress=leave_progress,
                show_progress=show_progress
            )

[Feature Request] Support attaching metadata to the corpus

It can be very helpful to attach metadata to a corpus, that is not indexed, but still returned during retrieval.

For example, a super naive approach:

corpus = [
  {"text": "Hello world", "metadata": {"source": "internet"}},
  ...
]

The main motivation for me is providing a more first-class integration in llama-index 😄 I can serialize the entire TextNode object to make saving/loading very smooth. But I think overall this would be a super valuable feature

Not Working for langchain Documents

from langchain.docstore.document import Document

doc = Document(page_content="text", metadata={"source": "local"})

Instead of a list of strings, I want to pass a list of LangChain documents, but it's not working.
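
As a workaround (a sketch, assuming the standard langchain Document with a page_content attribute, and that retrieve only uses corpus= for positional lookup): extract the text before tokenizing, and pass the original Document objects as the corpus so they come back at retrieval time.

import bm25s
from langchain.docstore.document import Document

docs = [
    Document(page_content="a cat is a feline and likes to purr", metadata={"source": "local"}),
    Document(page_content="a dog is the human's best friend", metadata={"source": "local"}),
]

corpus_tokens = bm25s.tokenize([d.page_content for d in docs], stopwords="en")
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

results, scores = retriever.retrieve(bm25s.tokenize("cat"), corpus=docs, k=1)
print(results[0, 0].metadata)  # the original Document, with its metadata, is returned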

Can the index be updated incrementally?

As the title says. In practice, there are many scenarios that require real-time incremental updates.

If possible, I would suggest taking inspiration from whoosh, tantivy, and similar projects, and packaging this as a reasonably complete low-level full-text search library.

Can you query without a tokenization step?

In the case where I already have an index built from the queries, I would like to retrieve the tokenized version of a given query from it. This use case can come up when doing BM25 evaluation across a matrix of known x and y types of objects.

x_corpus = [...]

y_corpus = [
    "fooo",
    "does the fish purr like a cat?"
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

class XEntity:
  corpus_tokens = bm25s.tokenize(x_corpus, stopwords="en", stemmer=stemmer)
  x_retriever = bm25s.BM25()
  x_retriever.index(corpus_tokens)

class YEntity:
  corpus_tokens = bm25s.tokenize(y_corpus, stopwords="en", stemmer=stemmer)
  y_retriever = bm25s.BM25()
  y_retriever.index(corpus_tokens)

corpus_tokens for y_corpus

Tokenized(ids=[[8], [10, 7, 9, 2, 11, 4, 13, 6, 5, 12], [7, 0, 3, 14, 1]], vocab={'creatur': 0, 'swim': 1, 'like': 2, 'live': 3, 'bird': 4, 'can': 5, 'anim': 6, 'fish': 7, 'fooo': 8, 'purr': 9, 'doe': 10, 'cat': 11, 'fli': 12, 'beauti': 13, 'water': 14})

If I tokenize independently...

a_query = "does the fish purr like a cat?"
Tokenized(ids=[[3, 1, 2, 0, 4]], vocab={'like': 0, 'fish': 1, 'purr': 2, 'doe': 3, 'cat': 4})

Given the results when tokenizing the index (looks like some optimizations happening), is there a way to get a subset from the index that represents the query as represented when the index was built?

query_from_y = precomputed_representation_of_a_query_without_tokenization_step
ranked_results = x_retriever.retrieve(query_from_y, k=5)

Capability Inquiry: Retrieving Specific JSON Records Based on Text

Hi, I am considering using the bm25s library for a project where I need to efficiently retrieve JSON records based on textual content matches. My data is structured in JSON format, with each record containing several fields.

Use Case

When I input a query, such as "mountain cycling", I want to retrieve the top K JSON records that best match this query based on the content of the 'chunk' field.

Example of json

    {
        "chunk_id": 1,
        "chunk": "mountain cycling",
        "vocabulary_id": "SPORTS001",
        "vocabulary_name": "Global Sports Vocabulary",
        "concept_code": "MTCYCL001",
        "concept_name": "Mountain Cycling",
        "domain": "Outdoor Sports",
        "validity": true,
        "source": "Sports Encyclopedia"
    },

Questions

  1. Does the BM25 library support indexing and retrieving directly from JSON structures like the ones provided above, particularly focusing on a specific field for text matching?

  2. Setup Advice: If direct JSON handling is supported, could you provide guidance or documentation on how to set up the library for this specific use case?
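
For what it's worth, both questions can be handled with the current API by indexing only the 'chunk' field and passing the full records as corpus=, so the complete JSON objects come back at retrieval time. A sketch using the field names from the example above:

import bm25s

records = [
    {"chunk_id": 1, "chunk": "mountain cycling", "domain": "Outdoor Sports"},
    {"chunk_id": 2, "chunk": "road running", "domain": "Outdoor Sports"},
]

corpus_tokens = bm25s.tokenize([r["chunk"] for r in records], stopwords="en")
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

results, scores = retriever.retrieve(bm25s.tokenize("mountain cycling"), corpus=records, k=1)
print(results[0, 0])  # the full JSON record that best matches the query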
