[ENHANCEMENT] Search Operation Should Return Multiple Highlights. (marqo, OPEN)

marqo-ai commented on May 23, 2024

[ENHANCEMENT] Search Operation Should Return Multiple Highlights.

from marqo.

Comments (4)

edmuthiah commented on May 23, 2024

@pandu-k multiple highlights from multiple documents would be awesome. For example:

Imagine a query like 'what is the contract number, who are the signatories and summarise the scope of work'.

In Document 1, the contract number might be on page 1 and the signatories on page 100, while the scope of work might be in Document 2 on page 12.

This would also be really useful for questions such as:

'compare these two documents'
'what are the similarities and differences between these sections of these documents'
'what is the difference between version 1 and version 1.5'
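
To make the request concrete, here is a minimal sketch of what "multiple highlights from multiple documents" could mean. It does not call Marqo; the `hits` list, the `top_highlights` helper, and the `limit` and `min_score` parameters are all hypothetical, and the scores and texts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical chunk-level hits: (document id, page, chunk text, similarity score).
# These values are invented for illustration; they are not Marqo output.
hits = [
    ("doc1", 1, "Contract number: C-001", 0.91),
    ("doc1", 100, "Signed by A. Smith and B. Jones", 0.88),
    ("doc2", 12, "Scope of work: site survey and design", 0.86),
    ("doc1", 57, "General terms and conditions", 0.41),
    ("doc2", 3, "Table of contents", 0.20),
]


def top_highlights(hits, limit=3, min_score=0.5):
    """Keep the best-scoring chunks overall, then group them by document."""
    best = sorted(hits, key=lambda h: h[3], reverse=True)[:limit]
    grouped = defaultdict(list)
    for doc_id, page, text, score in best:
        if score >= min_score:
            grouped[doc_id].append({"page": page, "text": text, "score": score})
    return dict(grouped)


highlights = top_highlights(hits)
for doc_id, items in highlights.items():
    print(doc_id, [item["page"] for item in items])
```

With this toy data, Document 1 contributes two highlights (pages 1 and 100) and Document 2 contributes one (page 12), which is the shape of result the example query needs.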

jess-lord commented on May 23, 2024

@aryanagarwal9 Maybe I misunderstand, but couldn't a smaller document size get the job done? Can also add overlap on your chunks if worried about missing context. For example:
    "index_defaults": {
        "treat_urls_and_pointers_as_images": False,
        "model": "hf/all_datasets_v4_MiniLM-L6",
        "normalize_embeddings": True,
        "text_preprocessing": {
            "split_method": "sentence",
            "split_length": 2,
            "split_overlap": 1,
        }
    }
See: https://docs.marqo.ai/0.0.18/API-Reference/indexes/#text-preprocessing-object

So if your podcast transcript is 100 "pages", this might become 100 marqo "documents", and within each of these documents there will be n "chunks" (aka facets). Using the above settings, each chunk would be 2 sentences, with a stride/overlap of 1 sentence between them. We would then get 1 highlight per "page", which may be insufficient. But couldn't you just split your pages into something even smaller, such as paragraphs, to achieve the desired result?
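
As a rough illustration of what `split_length: 2` with `split_overlap: 1` produces, here is a small standalone sketch. It is not Marqo's actual splitter: the regex-based `split_sentences` and the `chunk_sentences` helper are assumptions written for this example, and the sample text is made up.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, split_length=2, split_overlap=1):
    """Group sentences into windows of split_length sharing split_overlap sentences."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    stride = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + split_length]))
        # Stop once a window has reached the final sentence.
        if start + split_length >= len(sentences):
            break
    return chunks


page = (
    "Welcome to the show. Today we talk about search. "
    "Our guest builds indexes. Thanks for listening."
)
chunks = chunk_sentences(split_sentences(page))
for chunk in chunks:
    print(chunk)
```

A four-sentence "page" yields three overlapping two-sentence chunks here, and each chunk is a candidate highlight for that document.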

edmuthiah commented on May 23, 2024

Hey @jess-lord, I don't think the above solution scales. If you have 30 PDFs with 100 pages each, you now have 3000 documents that will each return a highlight. You then need to find the answer you are looking for amongst these 3000 highlights using some other method, which defeats the original purpose of finding the actual highlight. If you were using an LLM, your token count/cost to process 3000 sentences per query would be high too (if not exceeding the limit).
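
The arithmetic behind this can be spelled out in a few lines. The document counts come from the comment above; `tokens_per_highlight` is an assumed average used only to show the order of magnitude, not a measured value.

```python
num_pdfs = 30
pages_per_pdf = 100
highlights_per_document = 1  # one highlight per document, as discussed in this thread

# Assumption for illustration only: average tokens in one highlighted chunk.
tokens_per_highlight = 40

num_documents = num_pdfs * pages_per_pdf
num_highlights = num_documents * highlights_per_document
tokens_per_query = num_highlights * tokens_per_highlight

print(num_documents, num_highlights, tokens_per_query)
```

Under that assumption, passing every highlight to an LLM would cost on the order of 120,000 tokens per query.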

jess-lord commented on May 23, 2024

@edmuthiah I was responding to the podcast use case, which I still think this covers because the facets can be retrieved independently of their "parent document". But for your use case (which I too am now bumping into) I agree. The only alternative I can come up with for the moment is tags and weighted queries.
