
Comments (17)

dmitry-salin commented on May 13, 2024

There is an interesting article from the creator of the VictoriaMetrics database. ClickHouse added WAL support as part of the polymorphic parts work, but it is a configurable parameter.

from tensorbase.

jinmingjian commented on May 13, 2024

@dmitry-salin thanks for your references! It is interesting to see that the WAL can spark this much discussion. TB does not use an LSM. I would like to leave some thoughts after reading your references. It is a great opportunity to share some ideas with you and our community.


dmitry-salin commented on May 13, 2024

Thanks @jinmingjian! I think a good option might be the ability to change ingestion settings per database instance or table. So if we want the best possible safety guarantees and are willing to pay for them with ingestion performance, every incoming data chunk can be persisted to disk. I also looked at HSE, one of the newly created storage engines based on trie-like data structures. It also does not use a WAL, but has API calls for forcing cached data to media (c0 data).

I used to play with ClickHouse and ported it to Windows as part of a side project. It is not a classic LSM; the merge happens mainly in the storage part, but it gives excellent compression rates and query performance (when reading large ranges of data from disk). Can you point me to the TB part that implements storage?


jinmingjian commented on May 13, 2024

I have quickly reviewed your references:

  1. The VictoriaMetrics article somewhat exaggerates the problem of the WAL. Modern SSDs can absolutely afford WAL-style writes.
  2. For an OLAP system, we expect data to arrive in batches. Try inserting rows one by one into ClickHouse and watch the performance... So if we optimize for batch-style writes, the WAL implementation is quite simple: a direct I/O (DIO) write for every incoming batch. For small single-row inserts this is slow (and may waste a little more space), but making the WAL fast would not save the system from poor performance anyway, so optimizing that case is not worth it.
  3. One of my more interesting ideas about the WAL: in a complex system we have several kinds of logs. Can we piggyback the WAL onto another log? :)
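The per-batch write in point 2 can be sketched as follows. This is a minimal illustration, not TB's actual code: the file name, the length-prefix framing, and `append_batch` are all assumptions, and buffered I/O plus `sync_data` stands in for real direct I/O, which needs platform-specific open flags.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Append one incoming batch to the WAL file and force it to media
/// before the write is acknowledged. Hypothetical sketch only.
fn append_batch(wal_path: &Path, batch: &[u8]) -> std::io::Result<()> {
    let mut wal = OpenOptions::new().create(true).append(true).open(wal_path)?;
    // A length prefix lets recovery detect a torn final record.
    wal.write_all(&(batch.len() as u64).to_le_bytes())?;
    wal.write_all(batch)?;
    // One sync per batch: cheap when batches are large, which is the
    // whole point of optimizing for batch-style ingestion.
    wal.sync_data()
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("tb_wal_sketch.log");
    let _ = std::fs::remove_file(&path);
    append_batch(&path, b"row1,row2,row3")?;
    println!("wal bytes: {}", std::fs::metadata(&path)?.len());
    std::fs::remove_file(&path)
}
```

With large batches, the cost of one sync is amortized over many rows; with single-row inserts the same code pays a full device flush per row, which is exactly the slow case described above.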


dmitry-salin commented on May 13, 2024
  1. Maybe you are right. I think the main point of the article is that the WAL is a separate data structure, which implies a kind of write amplification.
  2. Yes, ClickHouse is also optimized for relatively large batches, but in this default scenario it does not use a WAL: the compressed in-memory block is simply written to disk with DIO as-is.


jinmingjian commented on May 13, 2024

> So if we want the best possible safety guarantees and are willing to pay for them with ingestion performance, every incoming data chunk can be persisted to disk.

Cool! We seem to be on the same page. 😄 I will take a look at HSE.

> It is not a classic LSM; the merge happens mainly in the storage part, but it gives excellent compression rates and query performance.

Yes. Your point that "the merge happens mainly in storage" is understandable. This is also what makes the OLTP-style ingestion (one row at a time) I mentioned prohibitive for CH; that logic belongs to OLAP. But CH's delayed merging worsens the already heavy data movement on external storage. Under heavy big-data ingestion (which should be common for big data), I have observed severe performance degradation.

> Can you point me to the TB part that implements storage?

Thanks for your interest. I should roll out more reading material for the community, but time is a bit tight. I hope to make it available soon. Not sure if people in the community are interested in helping :)

Let me make a quick note before a more detailed explanation:

TB uses a data structure that we call a "Partition Tree" (currently we use sled as an implementation; the API is in the PartStore). The storage layer is thin for now and embedded in runtime::write. The data is written directly to the partition file. This keeps append-only write performance while avoiding the subsequent compaction overhead of an LSM structure.
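The append-only partition write described above might be sketched like this. All names here, including `PartStoreSketch` and the file layout, are illustrative and not TB's real `PartStore` API:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::PathBuf;

/// Minimal sketch of the "Partition Tree" idea: each partition key maps
/// to its own append-only file, so ingestion is a pure append with no
/// LSM-style compaction afterwards.
struct PartStoreSketch {
    root: PathBuf,
}

impl PartStoreSketch {
    /// Append a batch to its partition file; return the new file length.
    fn append(&self, partition: u64, data: &[u8]) -> std::io::Result<u64> {
        let path = self.root.join(format!("part_{partition}.dat"));
        let mut f = OpenOptions::new().create(true).append(true).open(&path)?;
        f.write_all(data)?;
        Ok(std::fs::metadata(&path)?.len())
    }
}

fn main() -> std::io::Result<()> {
    let store = PartStoreSketch { root: std::env::temp_dir() };
    let file = std::env::temp_dir().join("part_20210513.dat");
    let _ = std::fs::remove_file(&file);
    store.append(20210513, b"batch-a")?;
    let len = store.append(20210513, b"batch-b")?;
    println!("partition size: {len}");
    std::fs::remove_file(&file)
}
```

Because each batch lands as a pure append in its partition file, there is no LevelDB-style background compaction to pay for later, which matches the trade-off described above.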


jinmingjian commented on May 13, 2024
> Yes, ClickHouse is also optimized for relatively large batches, but in this default scenario it does not use a WAL: the compressed in-memory block is simply written to disk with DIO as-is.

I have not investigated the CH implementation thoroughly. But from simple observation, this seemingly does not hold, because common CH storage files are not aligned to the block size (512 B or 4 KiB), and batches smaller than the block size obviously cannot be written with DIO. DIO is not inherently fast; its real value is the full control it gives you. So it is not reasonable to use DIO blindly.
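The alignment constraint mentioned here is easy to state in code: O_DIRECT requires the transfer size (as well as the buffer address and file offset) to be a multiple of the device's logical block size, so any write that is not block-aligned must be padded up. A small sketch, where the 4 KiB block size is an assumption (real code should query the device):

```rust
/// Assumed logical block size for the sketch; query the device in real code.
const BLOCK: usize = 4096;

/// Round a write size up to the next block boundary, as O_DIRECT demands.
fn padded_len(len: usize) -> usize {
    (len + BLOCK - 1) / BLOCK * BLOCK
}

fn main() {
    // A 100-byte batch cannot be DIO-written as-is; it costs a full block:
    assert_eq!(padded_len(100), 4096);
    assert_eq!(padded_len(4096), 4096); // already aligned
    assert_eq!(padded_len(4097), 8192);
    println!("ok");
}
```

The rounding also shows the space overhead for small batches: a one-row insert still occupies a whole block, which is part of why blind DIO is a poor fit for small writes.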


GrapeBaBa commented on May 13, 2024

Any progress on this task?


jinmingjian commented on May 13, 2024

@GrapeBaBa this is not too hard, but no one is working on it yet. Personally, I hope it can be resolved soon, or at least this summer. Are you interested in it? :)


GrapeBaBa commented on May 13, 2024

> @GrapeBaBa this is not too hard, but no one is working on it yet. Personally, I hope it can be resolved soon, or at least this summer. Are you interested in it? :)

Have you done any design work for the WAL, such as the WAL record format, the WAL file format, storage (e.g. mmap, buffered I/O), or concurrency control? There are also many systems that can serve as references, such as the MySQL/PostgreSQL redo logs, the LevelDB WAL, the etcd WAL, and jraft log storage.
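For reference, a WAL record layout in the spirit of the LevelDB/etcd formats mentioned here typically carries a length and a checksum, so that replay can stop cleanly at a torn tail. A hypothetical sketch, where the trivial additive checksum stands in for a real CRC32:

```rust
/// Record layout (illustrative, not a proposal for TB's actual format):
///   | u32 payload_len | u32 checksum | payload bytes |
fn encode_record(payload: &[u8]) -> Vec<u8> {
    let checksum = payload.iter().fold(0u32, |a, &b| a.wrapping_add(b as u32));
    let mut rec = Vec::with_capacity(8 + payload.len());
    rec.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    rec.extend_from_slice(&checksum.to_le_bytes());
    rec.extend_from_slice(payload);
    rec
}

/// Decode one record from `buf`; `None` means a torn or corrupt tail,
/// which is where crash recovery stops replaying.
fn decode_record(buf: &[u8]) -> Option<(&[u8], usize)> {
    if buf.len() < 8 {
        return None;
    }
    let len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let sum = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    let payload = buf.get(8..8 + len)?;
    let actual = payload.iter().fold(0u32, |a, &b| a.wrapping_add(b as u32));
    if actual != sum {
        return None;
    }
    Some((payload, 8 + len))
}

fn main() {
    let rec = encode_record(b"insert batch #1");
    let (payload, consumed) = decode_record(&rec).unwrap();
    assert_eq!(payload, b"insert batch #1");
    assert_eq!(consumed, rec.len());
    // A truncated record (torn write) is rejected:
    assert!(decode_record(&rec[..rec.len() - 1]).is_none());
    println!("ok");
}
```

The file format is then just a sequence of such records; mmap vs. buffered I/O and the concurrency story are orthogonal choices on top of this framing.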


jinmingjian commented on May 13, 2024

Great! Please share your design. Yes, I have some ideas. We will not couple the current WAL with other techniques; it is just a pure local WAL for machine-level crashes. Wait a while and let me upgrade this issue to an RFC with more details.


GrapeBaBa commented on May 13, 2024

> Great! Please share your design. Yes, I have some ideas. We will not couple the current WAL with other techniques; it is just a pure local WAL for machine-level crashes. Wait a while and let me upgrade this issue to an RFC with more details.

That's great.


nautaa commented on May 13, 2024

The data in TensorBase is flushed directly to disk; is the purpose of the WAL to ensure the atomicity of data writes?
Would it be possible to first write a block to a log area on disk, and then write the block to the corresponding partition? That would be similar to the MySQL doublewrite buffer.
@jinmingjian


jinmingjian commented on May 13, 2024

@nautaa yeah. "Double write" is easy, but you should carefully implement the recovery process, reusing some marker in the meta store.
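A double-write scheme with that recovery step could be sketched as below. Everything here is illustrative: the surviving log file itself stands in for the "marker in the meta store", and real code must also fsync after each step for the durability argument to hold.

```rust
use std::fs;
use std::path::Path;

/// Double write: persist the block to the log area first, then write it
/// into the partition, and only then drop the log copy (the commit marker).
/// Sketch only; real code must sync after steps 1 and 2.
fn double_write(log: &Path, part: &Path, block: &[u8]) -> std::io::Result<()> {
    fs::write(log, block)?; // 1. durable copy in the log area
    fs::write(part, block)?; // 2. real write to the partition
    fs::remove_file(log) // 3. commit: the log copy is no longer needed
}

/// Crash recovery: a surviving log copy means step 2 may be torn,
/// so replay it into the partition before serving reads.
fn recover(log: &Path, part: &Path) -> std::io::Result<()> {
    if log.exists() {
        fs::copy(log, part)?;
        fs::remove_file(log)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    let (log, part) = (dir.join("dw.log"), dir.join("dw.part"));
    double_write(&log, &part, b"block-1")?;
    assert_eq!(fs::read(&part)?, b"block-1");
    // Simulate a crash between steps 1 and 2: only the log copy survives.
    fs::write(&log, b"block-2")?;
    recover(&log, &part)?;
    assert_eq!(fs::read(&part)?, b"block-2");
    fs::remove_file(&part)
}
```

The partition write can then be torn with impunity, because recovery always has an intact copy to replay, which is exactly the guarantee the MySQL doublewrite buffer provides.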


GrapeBaBa commented on May 13, 2024

What is the progress on the RFC? @jinmingjian


jinmingjian commented on May 13, 2024

@GrapeBaBa thanks for your concern. After some thought, I decided this issue is a little tricky, so I am postponing it until after #147. The good news is that summer 2021 is coming... 😄


jinmingjian commented on May 13, 2024

It seems that current mainstream WALs still use periodic flushing, which does not give a 100% guarantee. But our TB design guarantees durability 100%, even for machine crashes. We may check in a simple version first.
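The contrast between per-write syncing and periodic flushing can be expressed as a hypothetical config knob. All names here are illustrative, not TB's API: syncing on every append survives machine crashes, while periodic flushing only guarantees the already-synced prefix.

```rust
use std::fs::File;
use std::io::Write;

/// The two durability policies contrasted above (illustrative sketch).
enum SyncPolicy {
    EveryWrite,
    Periodic { every_n: usize },
}

struct WalWriter {
    file: File,
    policy: SyncPolicy,
    pending: usize, // records written but not yet forced to media
}

impl WalWriter {
    /// Append a record; the returned bool is true iff this record is
    /// now crash-durable (forced to media).
    fn append(&mut self, rec: &[u8]) -> std::io::Result<bool> {
        self.file.write_all(rec)?;
        self.pending += 1;
        let must_sync = match self.policy {
            SyncPolicy::EveryWrite => true,
            SyncPolicy::Periodic { every_n } => self.pending >= every_n,
        };
        if must_sync {
            self.file.sync_data()?;
            self.pending = 0;
        }
        Ok(must_sync)
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("tb_policy_sketch.log");
    let mut w = WalWriter {
        file: File::create(&path)?,
        policy: SyncPolicy::Periodic { every_n: 3 },
        pending: 0,
    };
    // With periodic flushing, only every third record is guaranteed on media;
    // a machine crash after the fourth append can lose it.
    let synced: Vec<bool> = (0..4).map(|_| w.append(b"rec").unwrap()).collect();
    assert_eq!(synced, [false, false, true, false]);
    std::fs::remove_file(&path)
}
```

A `SyncPolicy::EveryWrite` writer returns true for every append; that is the 100% guarantee, paid for with one device flush per write.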

