Comments (6)

fulmicoton commented on August 25, 2024

We don't do anything. Only one IndexWriter per index is supported for the S3 stuff. That's also what the well-known data lakes do.

from quickwit.

fulmicoton commented on August 25, 2024

ManifestEntry is a struct that holds metadata about each file: its checksum, its size, and its file path, if I recall correctly. You can have a look at the struct in barrel.
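Going only by the description above, such an entry might look roughly like the sketch below. The field names and types are guesses for illustration, not the actual definition in barrel, and `total_size` is a hypothetical helper.

```rust
use std::path::PathBuf;

/// Illustrative only: field names and types are assumptions,
/// not the real struct from barrel.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ManifestEntry {
    /// Path of the file, relative to the split (assumed).
    file_path: PathBuf,
    /// Size of the file in bytes (assumed to be a u64).
    file_size_in_bytes: u64,
    /// Checksum of the file contents (assumed to be a u64).
    checksum: u64,
}

/// Hypothetical helper: sums the sizes of all files listed in a manifest.
fn total_size(entries: &[ManifestEntry]) -> u64 {
    entries.iter().map(|entry| entry.file_size_in_bytes).sum()
}
```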

what exactly do you mean by a split needs to be staged before uploading any of its files to the storage?

In tantivy and in Quickwit, we need to be tolerant to failure, at least to the extent of the fail-stop assumption.
As long as the memory is not corrupted, the system should still work as expected.

That means that if you kill your indexing CLI while it is running, you should not end up with a corrupted index.
You should not have dangling files on S3 either.

For this reason, before uploading any files, we need to record somewhere our intent to upload them.
When the CLI restarts, it checks which files are in the Staged state and removes them.
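To make the stage-then-upload idea concrete, here is a minimal sketch. It assumes a toy in-memory metastore; the `Metastore` type and its method names are hypothetical and not the Quickwit API. Only the Staged and Published states come from the discussion.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SplitState {
    Staged,
    Published,
}

/// Hypothetical in-memory stand-in for the metastore.
#[derive(Default)]
struct Metastore {
    splits: HashMap<String, SplitState>,
}

impl Metastore {
    /// Step 1: record the intent to upload, before any file is written.
    fn stage_split(&mut self, split_id: &str) {
        self.splits.insert(split_id.to_string(), SplitState::Staged);
    }

    /// Step 3: once every file is uploaded (step 2), mark the split as published.
    fn publish_split(&mut self, split_id: &str) -> Result<(), String> {
        match self.splits.get_mut(split_id) {
            Some(state) => {
                *state = SplitState::Published;
                Ok(())
            }
            None => Err(format!("split {} was never staged", split_id)),
        }
    }

    /// On restart: forget every split still in the Staged state and return
    /// their ids (sorted), so the caller can delete their files from storage.
    fn remove_staged_splits(&mut self) -> Vec<String> {
        let mut staged: Vec<String> = self
            .splits
            .iter()
            .filter(|(_, state)| **state == SplitState::Staged)
            .map(|(id, _)| id.clone())
            .collect();
        staged.sort();
        for id in &staged {
            self.splits.remove(id);
        }
        staged
    }
}
```

If the process is killed between staging and publishing, the split is still listed as Staged, so the cleanup on restart knows exactly which files may be dangling without listing the bucket.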

Another approach would be to list the files from the storage and remove every file in the directory that is not useful to the index according to the index meta. We do not use this approach because a user who added their own files to the directory could see them deleted by Quickwit. Besides, listing files on S3 can rapidly get slow, and it would require adding an extra method to the Storage trait.

guilload commented on August 25, 2024

We want to retrieve every split that overlaps the time range. So for a [50, 200) range, every split that has range.end >= 50 and range.start < 200 is a candidate.
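A minimal sketch of such an overlap test in Rust. Both conditions have to hold at the same time for a split to overlap the query. The function names are made up for illustration and are not the metastore API. The `>=` on the end bound follows the comparison as written above; if split ranges are treated as strictly half-open, `>` would be the tighter check.

```rust
use std::ops::Range;

/// Returns true if `split` is a candidate for `query`:
/// split.end >= query.start AND split.start < query.end.
fn is_candidate(split: &Range<i64>, query: &Range<i64>) -> bool {
    split.end >= query.start && split.start < query.end
}

/// Hypothetical helper: keeps the ids of the splits that are candidates,
/// preserving their input order.
fn candidate_splits<'a>(
    splits: &[(&'a str, Range<i64>)],
    query: &Range<i64>,
) -> Vec<&'a str> {
    splits
        .iter()
        .filter(|(_, range)| is_candidate(range, query))
        .map(|(id, _)| *id)
        .collect()
}
```

With split_1 = 0..100, split_2 = 100..200, split_3 = 50..150 and the query 50..200, all three splits are candidates.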

evanxg852000 commented on August 25, 2024

@fulmicoton @guilload On the last part, regarding the lock: how do we want to prevent multiple IndexWriters from corrupting the metastore file without locking? Are we just settling for one IndexWriter per index?

mosuka commented on August 25, 2024

@fulmicoton @guilload
What is the ManifestEntry introduced in the "Stage splits" code? Is there a detailed explanation or code sample somewhere?
Also, what exactly do you mean by "a split needs to be staged before uploading any of its files to the storage"?

mosuka commented on August 25, 2024

@fulmicoton @guilload
I have a question about list_splits.
Assume that the metastore contains splits with the following ranges.
If I call the list_splits function with the range below, which splits should the return value contain?

split_1 : std::ops::Range { start : 0, end : 100 }, SplitState::Published
split_2 : std::ops::Range { start : 100, end : 200 }, SplitState::Published
split_3 : std::ops::Range { start : 50, end : 150 }, SplitState::Published

list_splits( SplitState::Published, std::ops::Range { start : 50, end : 200 } )

Is the return value of the above list_splits call expected to be split_2 and split_3, or only split_3?
Can you tell me the specification of the time_range filter in list_splits?
