Comments (3)

xhluca commented on September 2, 2024

That would be pretty interesting! I think it's worth adding an example and tests for this. If it works well, I think it's a good idea to merge it. I also think mask would be a better argument name than filter, since it is more explicit about what it does.

from bm25s.

xhluca commented on September 2, 2024

@dl423 Thank you for the suggestion! I think it's an interesting idea. However, I'm not sure how it could be incorporated into the current API without changing how the corpus is currently filtered, which might require breaking changes.

For now, could you perhaps share an example of how that would work? We could add it to the examples dir, so users can follow best practices. If it's something that would be straightforward to add, I'm happy to review a PR with the new feature (perhaps as a util function that can be called before retrieve?) and relevant unit tests!

dl423 commented on September 2, 2024

@xhluca I've been thinking about how to implement the filter, and I settled on a simpler solution compared to what I last mentioned.

I was originally envisioning a filtering functionality where the query would be something like retriever.retrieve(query_tokens, k=2, filter={"author": "Charles Dickens"}). But implementing it might require significant changes to the code and could hurt performance. Supporting more advanced filtering, such as multiple conditions joined by AND or OR, would also be a challenge.

Instead, I'm now thinking of a simpler approach where a bitmask is passed to the retriever to do the filtering. The bitmask will be a list of 1's and 0's, each corresponding to a doc in the corpus. Only the docs corresponding to a 1 will be included in the search results.

Here's an example of what I mean:

# Suppose there are 5 docs in the corpus
bitmask = [1, 0, 1, 1, 0] 
retriever.retrieve(query_tokens, k=2, filter=bitmask)

Then only the first, third and fourth documents can appear in the search result.
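To make the intended behavior concrete, here is a minimal standalone sketch of mask-based top-k selection. It is not the bm25s implementation: masked_top_k is a hypothetical helper, the scores are made up, and it uses plain Python lists rather than whatever selection.py does internally.

```python
import heapq

def masked_top_k(scores, mask, k):
    """Return (doc_index, score) pairs for the k best docs whose mask entry is 1.

    Hypothetical helper for illustration only; not part of bm25s.
    """
    if len(scores) != len(mask):
        raise ValueError("mask must have one entry per document")
    # Keep only the documents that the mask lets through.
    allowed = [(i, s) for i, (s, m) in enumerate(zip(scores, mask)) if m]
    # Highest score first; ties go to the lower document index.
    return heapq.nsmallest(k, allowed, key=lambda pair: (-pair[1], pair[0]))

scores = [0.9, 1.5, 0.2, 0.7, 2.0]  # made-up relevance scores for 5 docs
bitmask = [1, 0, 1, 1, 0]
top = masked_top_k(scores, bitmask, k=2)
print(top)
```

Docs 1 and 4 (zero-based) have the highest scores here, but the mask excludes them, so the result is drawn only from docs 0, 2 and 3.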

Here's a practical use case for this kind of filtering:

  1. Suppose a corpus of docs is stored in a database (e.g. Postgres) along with their metadata such as author, title, etc.
  2. This corpus is indexed in bm25s and stored in the exact same order as in the database
  3. When I want to do a bm25 search on only the docs written by Charles Dickens, I would first construct a bitmask from the database, using a SQL query that returns 1 for each row (doc) whose author is Charles Dickens and 0 for every other row.
  4. Pass the bitmask to bm25s retriever, so the search results only contain docs written by Charles Dickens.
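Step 3 above could look roughly like the sketch below. It assumes a hypothetical docs table with id, author and title columns, and uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs without a server. The ORDER BY id matters: the mask only lines up with the index if the rows come back in the same order the corpus was indexed in.

```python
import sqlite3

# In-memory stand-in for the real database; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO docs (id, author, title) VALUES (?, ?, ?)",
    [
        (1, "Charles Dickens", "Oliver Twist"),
        (2, "Jane Austen", "Emma"),
        (3, "Charles Dickens", "Bleak House"),
        (4, "Charles Dickens", "Great Expectations"),
        (5, "Herman Melville", "Moby-Dick"),
    ],
)

def author_bitmask(conn, author):
    """Return one 0/1 entry per row, in id order (hypothetical helper)."""
    rows = conn.execute(
        "SELECT CASE WHEN author = ? THEN 1 ELSE 0 END FROM docs ORDER BY id",
        (author,),
    ).fetchall()
    return [row[0] for row in rows]

bitmask = author_bitmask(conn, "Charles Dickens")
print(bitmask)
```

The CASE expression is standard SQL, so the same query should carry over to Postgres with that driver's placeholder style. More complex conditions (AND, OR, ranges) would go into the WHEN clause, which keeps that logic in the database.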

Essentially, this approach leverages the powerful filtering capability already offered by a database system to do the real heavy-lifting for the filtering. This way, the filtering functionality in bm25s can be kept quite simple.

I expect this to be a relatively minor change that's mostly going to be made in selection.py. I will submit a PR once I'm done, thanks. :))
