Comments (4)
@pandu-k multiple highlights from multiple documents would be awesome. For example:
Imagine a query like 'what is the contract number, who are the signatories, and summarise the scope of work'.
In Document 1, the contract number might be on page 1 and the signatories on page 100. The scope of work might then be in Document 2 on page 12.
This would also be really useful for questions such as:
- 'compare these two documents'
- 'what are the similarities and differences between these sections of these documents'
- 'what is the difference between version 1 and version 1.5'
from marqo.
@aryanagarwal9 Maybe I misunderstand, but couldn't a smaller document size get the job done? You can also add overlap to your chunks if you are worried about missing context. For example:
"index_defaults": {
    "treat_urls_and_pointers_as_images": False,
    "model": "hf/all_datasets_v4_MiniLM-L6",
    "normalize_embeddings": True,
    "text_preprocessing": {
        "split_method": "sentence",
        "split_length": 2,
        "split_overlap": 1,
    }
}
See: https://docs.marqo.ai/0.0.18/API-Reference/indexes/#text-preprocessing-object
So if your podcast transcript is 100 "pages", this might become 100 Marqo "documents", and within each of these documents there will be n "chunks" (aka facets). Using the above settings, each chunk would be 2 sentences, with a stride/overlap of 1 sentence between them. We would then get 1 highlight per "page", which may be insufficient. But couldn't you just split your pages into something even smaller, such as paragraphs, to achieve the desired result?
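To make the chunking described above concrete, here is a minimal sketch of sentence windows with `split_length` 2 and `split_overlap` 1. It is not Marqo's actual implementation: the helper names `split_sentences` and `chunk_sentences` are hypothetical, the regex sentence splitter is a naive stand-in, and the sample text is invented for illustration.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences, split_length=2, split_overlap=1):
    # Each chunk holds `split_length` sentences; consecutive chunks share
    # `split_overlap` sentences, so the window advances by the difference.
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    stride = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + split_length]))
        # Stop once the window has reached the last sentence.
        if start + split_length >= len(sentences):
            break
    return chunks


# Invented sample "page" with four sentences.
page = (
    "Alice signs first. Bob signs second. "
    "The scope covers audits. Work starts in May."
)
chunks = chunk_sentences(split_sentences(page))
for chunk in chunks:
    print(chunk)
```

With four sentences this yields three overlapping two-sentence chunks, each of which would be a candidate highlight for its parent document.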
Hey @jess-lord, I don't think the above solution scales. If you have 30 PDFs with 100 pages each, you now have 3000 documents that will each return a highlight. You then need to find the answer you are looking for amongst these 3000 highlights using some other method, which defeats the original purpose of finding the actual highlight. If you were using an LLM, your token count/cost to process 3000 sentences per query would be high too (if not exceeding the limit).
@edmuthiah I was responding to the podcast use case, which I still think this covers because the facets can be retrieved independently of their "parent document". But for your use case (which I too am now bumping into) I agree. The only alternative I can come up with for the moment is tags and weighted queries.