Comments (4)

edmuthiah commented on May 23, 2024

@pandu-k multiple highlights from multiple documents would be awesome. For example:

Imagine a query like 'what is the contract number, who are the signatories and summarise the scope of work'.

In Document 1, the contract number might be on page 1 and the signatories on page 100, while the scope of work might be in Document 2 on page 12.

This would also be really useful for questions such as:

  • 'compare these two documents'
  • 'what are the similarities and differences between these sections of these documents'
  • 'what is the difference between version 1 and version 1.5'
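
To make the request concrete, here is a rough sketch of the output shape this would need. It does not call Marqo at all: the `hits` list, its `doc`, `page`, `text`, and `score` keys, and the `top_highlights` helper are all hypothetical, made up purely for illustration.

```python
from collections import defaultdict


def top_highlights(hits, limit=3):
    """Keep the `limit` best-scoring highlights overall, grouped by document."""
    best = sorted(hits, key=lambda h: h["score"], reverse=True)[:limit]
    grouped = defaultdict(list)
    for hit in best:
        grouped[hit["doc"]].append((hit["page"], hit["text"]))
    return dict(grouped)


# Hypothetical hits for the contract query described above.
hits = [
    {"doc": "Document 1", "page": 1, "text": "contract number clause", "score": 0.91},
    {"doc": "Document 1", "page": 100, "text": "signatories clause", "score": 0.88},
    {"doc": "Document 2", "page": 12, "text": "scope of work clause", "score": 0.85},
    {"doc": "Document 2", "page": 40, "text": "unrelated appendix", "score": 0.20},
]
result = top_highlights(hits)
```

The point is that one query yields several highlights, and they are not limited to one per document.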

from marqo.

jess-lord commented on May 23, 2024

@aryanagarwal9 Maybe I misunderstand, but couldn't a smaller document size get the job done? You can also add overlap to your chunks if you're worried about missing context. For example:

    settings = {
        "index_defaults": {
            "treat_urls_and_pointers_as_images": False,
            "model": "hf/all_datasets_v4_MiniLM-L6",
            "normalize_embeddings": True,
            "text_preprocessing": {
                "split_method": "sentence",
                "split_length": 2,   # sentences per chunk
                "split_overlap": 1,  # sentences shared between neighbouring chunks
            },
        }
    }

See: https://docs.marqo.ai/0.0.18/API-Reference/indexes/#text-preprocessing-object

So if your podcast transcript is 100 "pages", this might become 100 Marqo "documents", and within each of these documents there will be n "chunks" (aka facets). With the settings above, each chunk would be 2 sentences, with a stride/overlap of 1 sentence between them. We would then get 1 highlight per "page", which may be insufficient. But couldn't you just split your pages into something even smaller, such as paragraphs, to achieve the desired result?
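
To illustrate what a split length of 2 with an overlap of 1 means, here is a minimal pure-Python sketch. It is not Marqo's actual splitter: the `chunk_sentences` helper is hypothetical, it assumes the text is already split into sentences, and Marqo may handle edge cases (such as a trailing partial chunk) differently.

```python
def chunk_sentences(sentences, split_length=2, split_overlap=1):
    """Slide a window of `split_length` sentences, sharing `split_overlap` with the previous window."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        # Stop once the window has reached the last sentence.
        if start + split_length >= len(sentences):
            break
    return chunks


page = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
chunks = chunk_sentences(page)
```

In this sketch a four-sentence "page" gives three overlapping chunks, and each chunk is a candidate highlight.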


edmuthiah commented on May 23, 2024

Hey @jess-lord, I don't think the above solution scales. If you have 30 PDFs with 100 pages each, you now have 3000 documents that will each return a highlight. You then need to find the answer you are looking for among these 3000 highlights using some other method, which defeats the original purpose of finding the actual highlight. If you were using an LLM, the token count/cost to process 3000 sentences per query would be high too (if not over the limit).
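
A back-of-the-envelope version of that arithmetic, as a sketch. The 30 PDFs and 100 pages come from the example above; the 40 tokens per highlight and the 4096-token context limit are assumptions picked only to show the order of magnitude, and `highlight_cost` is a hypothetical helper.

```python
def highlight_cost(num_pdfs, pages_per_pdf, tokens_per_highlight, context_limit):
    """Estimate how many highlights come back and whether they fit in one LLM prompt."""
    highlights = num_pdfs * pages_per_pdf  # one highlight per page-sized document
    total_tokens = highlights * tokens_per_highlight
    return highlights, total_tokens, total_tokens <= context_limit


highlights, total_tokens, fits = highlight_cost(
    num_pdfs=30,
    pages_per_pdf=100,
    tokens_per_highlight=40,  # assumed average, not measured
    context_limit=4096,       # assumed limit, varies by model
)
```

Under these assumptions that is 3000 highlights and 120,000 tokens per query, far beyond the assumed limit.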


jess-lord commented on May 23, 2024

@edmuthiah I was responding to the podcast use case, which I still think this covers because the facets can be retrieved independently of their "parent document". But for your use case (which I too am now bumping into) I agree. The only alternative I can come up with for the moment is tags and weighted queries.
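
Roughly what I mean by tags and weighted queries, as a sketch only. The `build_search_kwargs` helper, the `tags` field name, and the example terms and weights are hypothetical. It assumes your Marqo version accepts a dict of weighted terms for `q` and a `filter_string` argument on search, so check that against the docs for the version you run.

```python
def build_search_kwargs(weighted_terms, tag=None, limit=10):
    """Build search arguments: weighted query terms plus an optional tag filter."""
    kwargs = {"q": dict(weighted_terms), "limit": limit}
    if tag is not None:
        # Assumes each document was indexed with a `tags` field (hypothetical name).
        kwargs["filter_string"] = f"tags:{tag}"
    return kwargs


kwargs = build_search_kwargs(
    {"contract number": 1.0, "signatories": 1.0, "table of contents": -0.5},
    tag="contract",
)
# With a running Marqo instance this might be passed along as
# mq.index("my-index").search(**kwargs), where `mq` is a marqo.Client.
```

Positive weights pull results towards a term and negative weights push them away, while the tag filter narrows the search to a subset of documents before ranking.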

