Comments (3)
That would be pretty interesting! I think it's worth adding an example and tests for this; if it works well, I think it's a good idea to merge it. I think that the mask argument would be better than filter, as it is more explicit about what it accomplishes?
from bm25s.
@dl423 Thank you for the suggestion! I think it's an interesting idea; however, I'm not sure how that could be incorporated into the current API without changing how the corpus is currently filtered, which might require breaking changes.
For now, could you perhaps share an example of how that would work? We could add it to the examples dir, so users can follow best practices. If it's something that would be straightforward to add, I'm happy to review a PR with the new feature (perhaps as a util function that can be called before retrieve?) and relevant unit tests!
@xhluca I've been thinking about how to implement the filter, and I settled on a simpler solution compared to what I last mentioned.
I was originally envisioning a filtering functionality where the query would be something like retriever.retrieve(query_tokens, k=2, filter={"author": "Charles Dickens"}). But implementing it might require significant changes to the code, along with performance implications. Also, supporting more advanced filtering operations, such as multiple filter conditions joined by AND or OR, can be a challenge.
Instead, I'm now thinking of a simpler approach where a bitmask is passed to the retriever to do the filtering. The bitmask will be a list of 1's and 0's, each corresponding to a doc in the corpus. Only the docs corresponding to a 1 will be included in the search results.
Here's an example of what I mean:
# Suppose there are 5 docs in the corpus
bitmask = [1, 0, 1, 1, 0]
retriever.retrieve(query_tokens, k=2, filter=bitmask)
Then only the first, third and fourth documents can appear in the search result.
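To make the intended semantics concrete, here is a minimal sketch in plain Python of what bitmask filtering could do to a list of per-document scores. It is not the bm25s implementation: `masked_top_k` is a hypothetical helper name, and the scores are made-up values standing in for whatever BM25 scores the retriever computes.

```python
def masked_top_k(scores, bitmask, k):
    """Hypothetical helper: return (doc_index, score) for the top-k docs whose mask bit is 1."""
    if len(scores) != len(bitmask):
        raise ValueError("bitmask must have exactly one entry per document")
    # Keep only the docs that the bitmask allows.
    candidates = [(i, s) for i, (s, bit) in enumerate(zip(scores, bitmask)) if bit]
    # Highest score first; the sort is stable, so ties stay in corpus order.
    candidates.sort(key=lambda pair: -pair[1])
    return candidates[:k]


# Made-up scores for the 5 docs in the corpus (illustration only).
scores = [0.2, 0.9, 0.5, 0.7, 0.8]
bitmask = [1, 0, 1, 1, 0]
top = masked_top_k(scores, bitmask, k=2)
print(top)  # [(3, 0.7), (2, 0.5)]
```

The second and fifth docs have the highest raw scores here, but they are masked out, so the result contains only the fourth and third docs.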
Here's a practical use case for this kind of filtering:
- Suppose a corpus of docs is stored in a database (e.g. Postgres) along with their metadata such as author, title, etc.
- This corpus is indexed in bm25s and stored in the exact same order as in the database
- When I want to do a bm25 search on only the docs written by Charles Dickens, I would first construct a bitmask from the database -- using a SQL query that returns a 1 for each row (doc) whose author is Charles Dickens and a 0 for every other row.
- Pass the bitmask to bm25s retriever, so the search results only contain docs written by Charles Dickens.
Essentially, this approach leverages the powerful filtering capability already offered by a database system to do the real heavy-lifting for the filtering. This way, the filtering functionality in bm25s can be kept quite simple.
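As a rough sketch of the database side of this workflow, the snippet below builds the bitmask with a single SQL query. It uses an in-memory SQLite database from the Python standard library in place of Postgres so that it runs anywhere; the `docs` table, its columns, the sample rows, and the `build_bitmask` helper are all assumptions made up for illustration. The key requirement is the `ORDER BY`, which must reproduce the order in which the corpus was indexed.

```python
import sqlite3

# In-memory stand-in for the real database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO docs (id, author, title) VALUES (?, ?, ?)",
    [
        (1, "Charles Dickens", "Oliver Twist"),
        (2, "Jane Austen", "Emma"),
        (3, "Charles Dickens", "Bleak House"),
        (4, "Charles Dickens", "Hard Times"),
        (5, "Mary Shelley", "Frankenstein"),
    ],
)


def build_bitmask(conn, author):
    """Hypothetical helper: one 1/0 per row, in the same order the corpus was indexed."""
    rows = conn.execute(
        "SELECT CASE WHEN author = ? THEN 1 ELSE 0 END FROM docs ORDER BY id",
        (author,),
    )
    return [row[0] for row in rows]


bitmask = build_bitmask(conn, "Charles Dickens")
print(bitmask)  # [1, 0, 1, 1, 0]
```

Because the filter condition lives entirely in the SQL, combining conditions with AND or OR only changes the query, not anything on the bm25s side.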
I expect this to be a relatively minor change that's mostly going to be made in selection.py. I will submit a PR once I'm done, thanks. :))