shannon-data / shannonbase

A MySQL HTAP Database, Open Source version of MySQL Heatwave, Powered by AI&ML.

License: Other

CMake 1.02% C++ 79.05% C 16.34% Batchfile 0.01% Perl 0.58% CSS 0.03% Shell 0.80% Makefile 1.26% M4 0.15% Python 0.01% JavaScript 0.24% Starlark 0.05% HTML 0.23% Roff 0.19% Awk 0.01% Dockerfile 0.01% Gnuplot 0.01% PowerShell 0.01% Vim Script 0.01% Emacs Lisp 0.01%
heatwave mysql ml-embeded htap in-memory-column-storage

shannonbase's Introduction

shannonbase's People

Contributors

hustjieke, ringsc, shannonbase


shannonbase's Issues

doc(shannon): Types supported in rapid engine

The following types are supported in the Shannon rapid engine (this list will be updated in the future).

  • Numeric
    1: Integer
    2: double

  • String
    1: char
    2: varchar
    3: text

  • Datetime
    1: date
    2: time
    3: datetime

feat(shannon): embedded ML functions

To provide native embedded machine learning functions, so that users can invoke them as stored procedures in ShannonBase.

Take the following as an example.

call ML.linear_regression(xxxx);

To implement the HeatWave AutoML abilities, such as:

  • In-database management of the entire model lifecycle
  • Fully automated ML with a minimal number of required parameters → no advanced ML/data science expertise needed
  • Data and ML models never leave the database
  • Familiar SQL interface
  • Performance and scalability

So far, several sys procedures have been added; their names carry the ML_ prefix.

mysql> SELECT routine_name FROM information_schema.routines WHERE routine_schema = 'sys';
Current database: ML_SCHEMA_root
+-------------------------------------+
| ROUTINE_NAME                        |
+-------------------------------------+
| extract_schema_from_file_name       |
| extract_table_from_file_name        |
| format_bytes                        |
| format_path                         |
| format_statement                    |
| format_time                         |
| list_add                            |
| list_drop                           |
| ps_is_account_enabled               |
| ps_is_consumer_enabled              |
| ps_is_instrument_default_enabled    |
| ps_is_instrument_default_timed      |
| ps_is_thread_instrumented           |
| ps_thread_account                   |
| ps_thread_id                        |
| ps_thread_stack                     |
| ps_thread_trx_info                  |
| quote_identifier                    |
| sys_get_config                      |
| version_major                       |
| version_minor                       |
| version_patch                       |
| create_synonym_db                   |
| diagnostics                         |
| execute_prepared_stmt               |
| ml_explain                          |
| ml_explain_row                      |
| ml_explain_table                    |
| ml_model_import                     |
| ml_model_load                       |
| ml_model_unload                     |
| ml_predict_row                      |
| ml_predict_table                    |
| ml_train                            |

feat(rapid): Load/Unload data from primary engine to secondary engine

1: Overview

The first step of processing AP workloads is to load the full base data into the rapid engine; rapid then starts the propagation operation automatically. When a table is loaded from InnoDB into the rapid engine, some meta information is also loaded into catalog tables, such as performance_schema.rpd_column, performance_schema.rpd_column_id, etc. A background thread is launched at system start to monitor the redo log; whenever a new DML operation is done, this background thread parses the incoming redo log and applies the changes to the IMCS.

When the load statement is executed, it performs the load operation. Overall, much like an insert into xxx select xxx statement, the system first does a table scan, either via an index or as a full table scan.

1: It scans the target table, which is usually an InnoDB table. One problem must be clarified first: which data is visible to the operation and which is not. Here, we define that only committed data is visible to the scan operation. In other words, an auto-committed transaction is used to do the table scan, and the transaction runs at the READ COMMITTED isolation level.

New data inserted while the table scan is running will not be seen by the operation; in fact, this cannot happen, because an exclusive MDL lock is used to prevent new rows from being inserted into the table while the load operation is running.

2: Besides the core functions, some system parameters are needed to monitor the load operation, for example, how much data has been loaded, how much remains, and so on. Some parallelism-related parameters will also be introduced, such as POD (degree of parallelism), etc. Therefore, a number of system parameters will be introduced.

2: Implementation

By now, InnoDB has only a few parallel table scan abilities, such as index-based counting operations and check table.

An InnoDB index is organized as a B+tree, and each node in the B+tree has a fixed size (generally 16K, denoted by UNIV_PAGE_SIZE in the source code). InnoDB reads data from disk into the buffer pool, and evicts old pages to load the latest pages, because the size of the buffer pool is NOT unlimited; usually the buffer pool is configured to 70%-80% of the physical memory size.

Data on disk is managed at three levels: (1) segment; (2) extent; (3) page. InnoDB does in-place updates, so when accessing a specific page, latch protection is required.

2.1: Full Table Scan
This works the same way InnoDB already does, so we do not discuss it further here. We mainly focus on parallel index scan.

2.2 Parallel index scan
The basic idea of parallel index scan is to divide the index B+tree into several workloads, each of which is scanned by ONE thread.

The index scan follows these steps:

  1. Start from the root node;
  2. Go left or right according to the comparison result (if the search key is less than the node value, the cursor goes left; otherwise, it goes right);
  3. Walk down the B+tree until a leaf node is reached;
  4. Check all the data in that leaf node; if found, return it, otherwise return NOT_FOUND.

If we do a FULL table scan, it visits all the nodes in the B+tree.

In ShannonBase, the semantics of the load operation are defined as follows: ShannonBase does a full table scan to find all the records in the target table, converts them from row-based tuples into column-based tuples, and then loads them into the rapid engine.

Now we can scan the B+tree in parallel by dividing the scan into sub-scans that run simultaneously. For example, scan operations are performed in parallel according to the POD and the height of the index.

ShannonBase uses Multi-version Concurrency Control (MVCC) to determine which row is visible or not.

2.2.1 InnoDB basic parallel scan
Before we start to support more parallel operations in InnoDB, we first re-examine what already exists. The relevant MySQL worklogs are listed below.


This list was obtained by searching the worklogs for the keyword "parallel". From the results we can see that parallel operations mainly include: (1) parallel query execution; (2) InnoDB parallel read of index. Several worklogs deserve attention: (1) WL#12978 InnoDB: Fix imbalance during parallel scan; (2) WL#11720 InnoDB: Parallel read of index; (3) WL#1092 Parallel load data infile and bulk insert.

For more information, ref to: https://dev.mysql.com/worklog/?sc=&sd=&k=parallel&s=

PS: to find out what changed in WL#11720, we can use git log --grep WL#11720 to figure out what changed in that commit. The commit id is: dbfc59ffaf8096a2dc5b76766fedb45ff2fb8cbf

/** The core idea is to find the left and right paths down the B+Tree.These
paths correspond to the scan start and scan end search. Follow the links
at the appropriate btree level from the left to right and split the scan
on each of these sub-tree root nodes.

If the user has set the maximum number of threads to use at say 4 threads
and there are 5 sub-trees at the selected level then we will split the 5th
sub-tree dynamically when it is ready for scan.

We want to allow multiple parallel range scans on different indexes at the
same time. To achieve this split out the scan  context (Scan_ctx) from the
execution context (Ctx). The Scan_ctx has the index  and transaction
information and the Ctx keeps track of the cursor for a specific thread
during the scan.

To start a scan we need to instantiate a Parallel_reader. A parallel reader
can contain several Scan_ctx instances and a Scan_ctx can contain several
Ctx instances. Its' the Ctx instances that are eventually executed.

This design allows for a single Parallel_reader to scan multiple indexes
at once.  Each index range scan has to be added via its add_scan() method.
This functionality is required to handle parallel partition scans because
partitions are separate indexes. This can be used to scan completely
different indexes and tables by one instance of a Parallel_reader.

To solve the imbalance problem we dynamically split the sub-trees as and
when required. e.g., If you have 5 sub-trees to scan and 4 threads then
it will tag the 5th sub-tree as "to_be_split" during phase I (add_scan()),
the first thread that finishes scanning the first set of 4 partitions will
then dynamically split the 5th sub-tree and add the newly created sub-trees
to the execution context (Ctx) run queue in the Parallel_reader. As the
other threads complete their sub-tree scans they will pick up more execution
contexts (Ctx) from the Parallel_reader run queue and start scanning the
sub-partitions as normal.

Note: The Ctx instances are in a virtual list. Each Ctx instance has a
range to scan. The start point of this range instance is the end point
of the Ctx instance scanning values less than its start point. A Ctx
will scan from [Start, End) rows. We use std::shared_ptr to manage the
reference counting, this allows us to dispose of the Ctx instances
without worrying about dangling pointers. */
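
The run-queue idea described in the quoted comment can be modeled compactly as follows. This is only a conceptual sketch, not the InnoDB Parallel_reader implementation; the Ctx, RunQueue, and scan_range names are illustrative.

#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One execution context: a range of the index to scan.
struct Ctx {
  std::size_t start;      // first unit of the range (illustrative units)
  std::size_t end;        // one past the last unit
  bool to_be_split;       // set when the range should be split dynamically
};

// A mutex-protected run queue of execution contexts.
class RunQueue {
 public:
  void push(const Ctx &ctx) {
    std::lock_guard<std::mutex> g(m_mutex);
    m_queue.push_back(ctx);
  }
  bool pop(Ctx &out) {
    std::lock_guard<std::mutex> g(m_mutex);
    if (m_queue.empty()) return false;
    out = m_queue.front();
    m_queue.pop_front();
    return true;
  }

 private:
  std::mutex m_mutex;
  std::deque<Ctx> m_queue;
};

// Workers pull contexts from the queue; a context tagged to_be_split is halved
// and both halves are pushed back for any worker to pick up. For brevity, an
// idle worker simply exits; a real implementation would wait for outstanding
// splits instead.
void run_workers(RunQueue &queue, std::size_t n_threads,
                 const std::function<void(const Ctx &)> &scan_range) {
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < n_threads; ++i) {
    workers.emplace_back([&queue, &scan_range] {
      Ctx ctx;
      while (queue.pop(ctx)) {
        if (ctx.to_be_split && ctx.end - ctx.start > 1) {
          const std::size_t mid = ctx.start + (ctx.end - ctx.start) / 2;
          queue.push({ctx.start, mid, false});
          queue.push({mid, ctx.end, false});
          continue;
        }
        scan_range(ctx);
      }
    });
  }
  for (auto &t : workers) t.join();
}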

2.2.2 InnoDB basic parallel scan implementation
The changes for parallel scan are mainly in these files: (1) sql/handler.h; (2) innobase/handler/ha_innodb.h; (3) innobase/handler/handler0alter.cc; (4) include/row0pread.h. The system parameters are mainly defined in srv/srv0srv.cc.

Limitation: parallel scan on a secondary index is NOT supported.

class Parallel_reader {
};

  /** Specifies the range from where to start the scan and where to end it. */
  struct Scan_range {
};

  /** Thread related context information. */
  struct Thread_ctx {
};

/** Parallel reader context. */
class Parallel_reader::Scan_ctx {
};


class handler {
...
/**
    Initializes a parallel scan. It creates a parallel_scan_ctx that has to
    be used across all parallel_scan methods. Also, gets the number of
    threads that would be spawned for parallel scan.
*/
  virtual int parallel_scan_init(void *&scan_ctx [[maybe_unused]],
                                 size_t *num_threads [[maybe_unused]],
                                 bool use_reserved_threads [[maybe_unused]]) {
    return 0;
  }

/**
    Run the parallel read of data.
*/
  virtual int parallel_scan(void *scan_ctx [[maybe_unused]],
                            void **thread_ctxs [[maybe_unused]],
                            Load_init_cbk init_fn [[maybe_unused]],
                            Load_cbk load_fn [[maybe_unused]],
                            Load_end_cbk end_fn [[maybe_unused]]) {
    return 0;
  }

/**
    End of the parallel scan.
*/
  virtual void parallel_scan_end(void *scan_ctx [[maybe_unused]]) { return; }
...

/** Start row of the scan range. */
struct Key_reader_row {
};

/** Parallel read implementation. */
template <typename T, typename R>
class Reader {
}; 

2.3: Data Format

  • Option 1:
    Each column is organized as a file when it is flushed to disk. The in-memory format of columns is also called an IMCU (In-Memory Column Unit). An IMCU consists of CUs (Column Units). A CU has two parts: (1) header, holding meta information; (2) data. The data can be further divided into a bunch of chunks.


All chunks are linked. The address of the first chunk can be found in the CU's header, and each chunk also contains the address of the next chunk.

A chunk consists of a header and data. The header contains the meta information of the chunk; the data part is where the real data is located.

class Chunk {
  public:
    class Chunk_header {
     private:
        byte* m_data;
        Chunk* m_next_chunk;
    };
 private:
   std::mutex  m_header_mutex;
   Chunk_header* m_header; 
};

class Cu {
public:
  class Cu_header {
     ....
     Compression_algorithm* m_comp_algo;
     Local_dictionary* m_dictionary;
     Chunk::Chunk_header* m_start_header;
  };

private:
  std::mutex m_header_mutex; 
  Cu_header* m_header;
  Chunk* m_chunks;
};

To get the first CU from the IMCS: an IMCS instance has a header, which holds a pointer to the address of the IMCU.

When new data comes in, it is stored in order; insertion sort can be used to keep it ordered, and binary search is used to find data.
But if the data is stored in a compressed format, we need a new algorithm to find data within the compressed data.

Now we go deeper and give more specific details of the data layout. Note that every piece of data written into a CU should have a transaction id attached to it, to mark which transaction it belongs to.

  • 1: Fixed length Data Type:
  • 2: Variable Length Data Type:

Let's discuss the data types above in detail.

1: Fixed-Length Data Type:


A record is divided into 3 parts. Part (1) is the info field, which describes the record: deleted or not, null or not, total length, etc.
For example: the 7th bit is the delete flag, the 6th bit is the null flag, and the remaining bits hold the total length of this data.
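
As a rough illustration of the info field described above (assuming a one-byte info field; the constant and helper names are hypothetical, not ShannonBase code):

#include <cstdint>

// Bit layout assumed from the description above: bit 7 = delete flag,
// bit 6 = null flag, remaining bits = total length of the data.
constexpr uint8_t kDeleteFlag = 1u << 7;
constexpr uint8_t kNullFlag = 1u << 6;
constexpr uint8_t kLenMask = 0x3F;

inline uint8_t make_info(bool deleted, bool is_null, uint8_t data_len) {
  return static_cast<uint8_t>((deleted ? kDeleteFlag : 0) |
                              (is_null ? kNullFlag : 0) |
                              (data_len & kLenMask));
}

inline bool info_deleted(uint8_t info) { return (info & kDeleteFlag) != 0; }
inline bool info_null(uint8_t info) { return (info & kNullFlag) != 0; }
inline uint8_t info_data_len(uint8_t info) { return info & kLenMask; }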

2: Variable-Length Data Type:

For a variable-length data type, a field is needed to indicate the length of the data. One issue we should pay attention to is the storage cost: if a record needs a lot of extra info to describe itself but the data itself is small, the effective payload ratio becomes too small.

The string data types are another story. They have extra attributes: (1) length; (2) charset information. In order to save space, dictionary compression is incorporated.

There are also three parts: (1) infos (2~4 bytes); (2) trx id (8 bytes); (3) data (variable length). If the data length is less than 65535, infos uses 2 bytes; otherwise 4 bytes are used.

(The payload ratio of this layout does not seem very efficient.)
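
A small helper makes the size rule above concrete (a sketch; the function names are not from the ShannonBase code):

#include <cstddef>

// 2-byte info field when the data length is less than 65535, otherwise 4 bytes.
inline std::size_t varlen_info_bytes(std::size_t data_len) {
  return (data_len < 65535) ? 2 : 4;
}

// Total record size: info field + 8-byte transaction id + payload.
inline std::size_t varlen_record_size(std::size_t data_len) {
  return varlen_info_bytes(data_len) + 8 + data_len;
}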

  • Option 2:
    Using some system columns to store the meta info. This makes the IMCS more amenable to SIMD and parallel processing.
    The info bits, trx id, rowid and smu_ptr are kept apart as independent columns.

Format 1 has been implemented; format 2 will be implemented in the next version.

2.4: Local Dictionary
ref to: data compression/dictionary compression

2.5: Related Code
The load operation, i.e. the semantics of a secondary engine load, is to load data from the primary engine into the secondary engine.

When the load statement (an ALTER statement) is executed, mysql_execute_command runs, and in this function lex->m_sql_cmd is a Sql_cmd_secondary_load_unload. Therefore Sql_cmd_secondary_load_unload::execute is executed.

Finally, execution reaches ha_rapid::load_table, and in this function it starts to scan the target table to read the corresponding data.

class ha_rapid : public handler {
 public:
  ha_rapid(handlerton *hton, TABLE_SHARE *table_share);

 private:
  int create(const char *, TABLE *, HA_CREATE_INFO *, dd::Table *) override {
    return HA_ERR_WRONG_COMMAND;
  }

  int open(const char *name, int mode, unsigned int test_if_locked,
           const dd::Table *table_def) override;

  int close() override { return 0; }

  int rnd_init(bool) override { return 0; }

  int rnd_next(unsigned char *) override { return HA_ERR_END_OF_FILE; }

  int rnd_pos(unsigned char *, unsigned char *) override {
    return HA_ERR_WRONG_COMMAND;
  }

  int info(unsigned int) override;

  ha_rows records_in_range(unsigned int index, key_range *min_key,
                           key_range *max_key) override;

  void position(const unsigned char *) override {}

  unsigned long index_flags(unsigned int, unsigned int, bool) const override;

  THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to,
                             thr_lock_type lock_type) override;

  Table_flags table_flags() const override;

  const char *table_type() const override { return "MOCK"; }

  int load_table(const TABLE &table) override;

  int unload_table(const char *db_name, const char *table_name,
                   bool error_if_not_loaded) override;

  THR_LOCK_DATA m_lock;
};

2.6: Meta Data Info

When a table is loaded into rapid, the information about the loaded table is added to catalog tables, performance_schema.rpd_xxxx. Taking performance_schema.rpd_column_id as an instance, it contains the following parts: (1) Id; (2) Table ID; (3) Column Name.
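
For illustration, the shape of one row of that catalog table could be sketched as follows (the struct is hypothetical; the real table is defined inside performance_schema):

#include <cstdint>
#include <string>

// Illustrative in-memory shape of one performance_schema.rpd_column_id row,
// following the three parts described above.
struct Rpd_column_id_row {
  uint64_t id;              // (1) column id
  uint64_t table_id;        // (2) id of the owning table
  std::string column_name;  // (3) column name
};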

3: Unload Data
The unload operation releases the loaded data from rapid; it is the opposite of the load operation. After the data is cleaned up, the meta information in performance_schema.rpd_xxx is also cleared.

We do not discuss this part in detail here.

Related Issue: #20

feat(sql-rapid): Optimization and Execution

A new type of engine is employed, which provides a new optimization tactic for the MySQL optimizer. In the common case, the MySQL optimizer will deliver AP workloads to the rapid engine due to its low IO cost and fast execution.

In order to achieve that, the MySQL optimizer should know the cost of a workload and the type of the workload. If a workload is classified as a TP workload, it will be executed in the primary engine (TP engine, InnoDB); on the other side, AP workloads will be executed in the secondary engine (AP engine, Rapid).

First of all, the workload type should be tagged correctly: which one is taken as a TP workload, and which one as an AP workload. After that, these workloads can be offloaded to the corresponding engine.

other(asan): ASan cannot proceed because the shadow memory range interleaves with an existing mapping

Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly.

==155884==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==155884==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==155884==This might be related to ELF_ET_DYN_BASE change in Linux 4.12.
==155884==See https://github.com/google/sanitizers/issues/856 for possible workarounds.
==155884==Process memory map follows:
	0x061155cae000-0x061156000000	
	0x061156000000-0x061156028000	/usr/lib/x86_64-linux-gnu/libc.so.6
	0x061156028000-0x0611561bd000	/usr/lib/x86_64-linux-gnu/libc.so.6
	0x0611561bd000-0x061156215000	/usr/lib/x86_64-linux-gnu/libc.so.6
	0x061156215000-0x061156216000	/usr/lib/x86_64-linux-gnu/libc.so.6
	0x061156216000-0x06115621a000	/usr/lib/x86_64-linux-gnu/libc.so.6
	0x06115621a000-0x06115621c000	/usr/lib/x86_64-linux-gnu/libc.so.6
	0x06115621c000-0x061156229000	
	0x061156400000-0x06115649a000	/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
	0x06115649a000-0x0611565ab000	/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
	0x0611565ab000-0x06115661a000	/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
	0x06115661a000-0x06115661b000	/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
	0x06115661b000-0x061156626000	/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
	0x061156626000-0x061156629000	/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
	0x061156629000-0x06115662c000	
	0x06115682b000-0x061156895000	/home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
	0x061156895000-0x0611569ad000	/home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
	0x0611569ad000-0x0611569f0000	/home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
	0x0611569f0000-0x0611569f2000	/home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
	0x0611569f2000-0x0611569ff000	/home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
	0x0611569ff000-0x061156a00000	
	0x061156a00000-0x061156ab2000	/home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
	0x061156ab2000-0x061156d11000	/home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
	0x061156d11000-0x061156de3000	/home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
	0x061156de3000-0x061156e3e000	/home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
	0x061156e3e000-0x061156e41000	/home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
	0x061156e41000-0x061156e44000	
	0x061156e44000-0x061156e6a000	/home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
	0x061156f19000-0x061156f27000	/usr/lib/x86_64-linux-gnu/libm.so.6
	0x061156f27000-0x061156fa3000	/usr/lib/x86_64-linux-gnu/libm.so.6
	0x061156fa3000-0x061156ffe000	/usr/lib/x86_64-linux-gnu/libm.so.6
	0x061156ffe000-0x061156fff000	/usr/lib/x86_64-linux-gnu/libm.so.6
	0x061156fff000-0x061157000000	/usr/lib/x86_64-linux-gnu/libm.so.6
	0x061157000000-0x061157024000	/usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
	0x061157024000-0x0611570fb000	/usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
	0x0611570fb000-0x06115712d000	/usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
	0x06115712d000-0x06115712e000	/usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
	0x06115712e000-0x061157132000	/usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
	0x061157132000-0x061157135000	/usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
	0x061157135000-0x0611579e9000	
	0x061157a56000-0x061157a6d000	
	0x061157a6d000-0x061157a70000	/usr/lib/x86_64-linux-gnu/libgcc_s.so.1
	0x061157a70000-0x061157a87000	/usr/lib/x86_64-linux-gnu/libgcc_s.so.1
	0x061157a87000-0x061157a8b000	/usr/lib/x86_64-linux-gnu/libgcc_s.so.1
	0x061157a8b000-0x061157a8c000	/usr/lib/x86_64-linux-gnu/libgcc_s.so.1
	0x061157a8c000-0x061157a8d000	/usr/lib/x86_64-linux-gnu/libgcc_s.so.1
	0x061157a8d000-0x061157a8f000	
	0x061157a8f000-0x061157a92000	/usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
	0x061157a92000-0x061157a98000	/usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
	0x061157a98000-0x061157a9a000	/usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
	0x061157a9a000-0x061157a9b000	/usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
	0x061157a9b000-0x061157a9c000	/usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
	0x061157a9c000-0x061157aba000	/home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
	0x061157aba000-0x061157b15000	/home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
	0x061157b15000-0x061157b32000	/home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
	0x061157b32000-0x061157b3c000	/home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
	0x061157b3c000-0x061157b40000	/home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
	0x061157b40000-0x061157b47000	/home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
	0x061157b51000-0x061157b5c000	
	0x061157b5c000-0x061157b5e000	/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
	0x061157b5e000-0x061157b88000	/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
	0x061157b88000-0x061157b93000	/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
	0x061157b93000-0x061157b94000	
	0x061157b94000-0x061157b96000	/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
	0x061157b96000-0x061157b98000	/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
	0x56cd9ab69000-0x56cd9eca2000	/home/lihao/workshop/shannon-bin/bin/mysqld
	0x56cd9eca2000-0x56cda3adf000	/home/lihao/workshop/shannon-bin/bin/mysqld
	0x56cda3adf000-0x56cda6b62000	/home/lihao/workshop/shannon-bin/bin/mysqld
	0x56cda6b63000-0x56cda6d31000	/home/lihao/workshop/shannon-bin/bin/mysqld
	0x56cda6d31000-0x56cda7bb7000	/home/lihao/workshop/shannon-bin/bin/mysqld
	0x56cda7bb7000-0x56cda8241000	
	0x7ffd8aac8000-0x7ffd8aae9000	[stack]
	0x7ffd8ab91000-0x7ffd8ab95000	[vvar]
	0x7ffd8ab95000-0x7ffd8ab97000	[vdso]
	0xffffffffff600000-0xffffffffff601000	[vsyscall]
==155884==End of process memory map.

You can use the following command to fix it:

sudo sysctl vm.mmap_rnd_bits=28

or refer to google/sanitizers#856 for more discussion.

feat(rapid): Background workers

We employ a new background thread, the ib_log_pop thread. It is used to propagate the changes from InnoDB to rapid (row to column).

void Populator::start_change_populate_threads() {
  if (!Populator::log_pop_thread_is_active()) {
    sys_log_rapid_thread =
      os_thread_create(rapid_populate_thread_key, 0, parse_log_func, log_sys);
    ShannonBase::Populate::sys_pop_started = true;
    sys_log_rapid_thread.start();
  }
}

feat(rapid): Repopulate the changes from the primary to the secondary

As we know, the analytical workloads run on the AP engine. ShannonBase uses a one-system, two-copies architecture, which means one copy of the data lives in the transactional processing engine (the primary engine) and another copy is used for analytical workloads (the secondary engine). Therefore, consistency between these two copies is of the utmost importance. We usually use data freshness to describe the difference between the data in the primary engine and the secondary engine.

The TP engine runs DMLs which make changes in the primary engine; but how do we notify the secondary engine and send it the changes, so that it keeps track of the latest data in the primary engine?

In this part, we will explore how the primary populates the changes to the secondary: what mechanism we use, and why we chose this way to implement change repopulation between the primary engine and the secondary engine.

  1. Overview

One of the most important issues of HTAP is data synchronization between the transactional processing engine and the analytical processing engine; without such synchronization, it would not be called an HTAP database. Consider the scenario below:

If a user buys some items, the system inserts some records into a transaction table; now the manager wants real-time analytical reports on sales or stock, etc. The traditional solution is to run some ETL jobs, move the data from the TP database to the AP database, and do analytical processing on the AP database. But now we can do all this work in one database instance, which is what HTAP does.

There are two solutions to implement HTAP, or more specifically, the change propagation method: a loosely-coupled architecture and an integrated-coupled architecture. ShannonBase uses the integrated-coupled architecture. In MySQL, we can use the binlog or the redo log to synchronize the changes. Before we discuss our solution, we first give some explanations of the redo log and the binlog.

  2. Redo log

The redo log explanation given by MySQL is as follows:

The redo log is a disk-based data structure used during crash recovery to correct data written by incomplete transactions. During normal operations, the redo log encodes requests to change table data that result from SQL statements or low-level API calls. Modifications that did not finish updating data files before an unexpected shutdown are replayed automatically during initialization and before connections are accepted. For information about the role of the redo log in crash recovery, see Section 15.18.2, “InnoDB Recovery”.

The redo log is physically represented on disk by redo log files. Data that is written to redo log files is encoded in terms of records affected, and this data is collectively referred to as redo. The passage of data through redo log files is represented by an ever-increasing LSN value. Redo log data is appended as data modifications occur, and the oldest data is truncated as the checkpoint progresses.

From the description above, we know that the redo log describes the changes made by SQL statements or API calls. Therefore, parsing the redo log can be a way to populate the changes from InnoDB to the rapid engine, and we can reuse some functions from the recovery module. An LSN, rapid_lsn, describes the position up to which the rapid engine has replayed, and it is persisted into the system tablespace.

  3. Binlog

The binary log contains “events” that describe database changes such as table creation operations or changes to table data. It also contains events for statements that potentially could have made changes (for example, a DELETE which matched no rows), unless row-based logging is used. The binary log also contains information about how long each statement took that updated data. The binary log has two important purposes:
For replication, the binary log on a replication source server provides a record of the data changes to be sent to replicas. The source sends the information contained in its binary log to its replicas, which reproduce those transactions to make the same data changes that were made on the source. See Section 17.2, “Replication Implementation”.

Certain data recovery operations require use of the binary log. After a backup has been restored, the events in the binary log that were recorded after the backup was made are re-executed. These events bring databases up to date from the point of the backup. See Section 7.5, “Point-in-Time (Incremental) Recovery”.

Another way to repopulate the changes from InnoDB to the rapid engine is to use the binlog. But from our point of view, the binlog has some drawbacks as a tool to repopulate changes: the binlog is a logical log, and using it involves writing the binlog and the relay log, then parsing the binlog and executing the statements parsed from it. It is a long call stack and not an efficient way.

Hence, it may fail to repopulate the changes in time under heavy workload scenarios.

  4. Which one do we use?

From the brief discussion above, we draw the conclusion that the redo log is probably the best choice for repopulating the changes from InnoDB to rapid.


4.1 Implementation
4.1.1 Basic concept

/** Redo log - single data structure with state of the redo log system.
In future, one could consider splitting this to multiple data structures. */
struct alignas(ut::INNODB_CACHE_LINE_SIZE) log_t {
...
  /** The recent written buffer.
  Protected by: locking sn not to add. */
  alignas(ut::INNODB_CACHE_LINE_SIZE) Link_buf<lsn_t> recent_written;

  /** Used for pausing the log writer threads.
  When paused, each user thread should write log as in the former version. */
  std::atomic_bool writer_threads_paused;

  /** Some threads waiting for the ready for write lsn by closer_event. */
  lsn_t current_ready_waiting_lsn;

  /** current_ready_waiting_lsn is waited using this sig_count. */
  int64_t current_ready_waiting_sig_count;

  /** The recent closed buffer.
  Protected by: locking sn not to add. */
  alignas(ut::INNODB_CACHE_LINE_SIZE) Link_buf<lsn_t> recent_closed;
  ...
  /** Maximum sn up to which there is free space in both the log buffer
  and the log files. This is limitation for the end of any write to the
  log buffer. Threads, which are limited need to wait, and possibly they
  hold latches of dirty pages making a deadlock possible.
  Protected by: writer_mutex (writes). */
  alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_sn_t buf_limit_sn;

  /** Up to this lsn, data has been written to disk (fsync not required).
  Protected by: writer_mutex (writes). */
  alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t write_lsn;
  ...
  alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t *flush_events;

  /** Number of entries in the array with events. */
  size_t flush_events_size;

  /** This event is in the reset state when a flush is running;
  a thread should wait for this without owning any of redo mutexes,
  but NOTE that to reset this event, the thread MUST own the writer_mutex */
  os_event_t old_flush_event;

  /** Up to this lsn data has been flushed to disk (fsynced). */
  alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t flushed_to_disk_lsn;
...
  alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t rapid_lsn;
  alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t *rapid_events;
  ...

In log_t, we add alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t rapid_lsn; to represent the position up to which the data has been repopulated to rapid, and initialize it in log_sys_create() and log_start().

The diagram of writing the redo log is omitted here; for more information, refer to: redo log structure.

The log record structure is as follows:

Type  + Space ID + Page Number + Body

We only consider the following redo log types, which are related to DML operations.

  • MLOG_REC_INSERT
  • MLOG_REC_CLUST_DELETE_MARK
  • MLOG_REC_UPDATE_IN_PLACE

After that, we reuse the recovery logic to parse the redo log and get the data content changed by the SQL statement, then apply it to the rapid engine.

recv_parse_log_recs(), recv_single_rec(), recv_multi_rec(), etc. are used to parse the redo log at the recovery stage. We can reuse this code to implement our change population logic.

4.1.2 Background thread

A new background worker is launched to do the real change repopulation job. When a new mlog record is generated, this worker is notified to parse the mlog and apply the changes to the rapid engine.

4.1.3 Persistence of rapid_lsn
rapid_lsn will be written to the trx_sys page of the system tablespace.

4.2 Implementation details

  • Background thread
    After the secondary_load command is executed, a new thread is launched immediately, which is called the repopulation thread.
void Populator::start_change_populate_threads(log_t* log) {
  Populator::log_rapid_thread =
      os_thread_create(rapid_populate_thread_key, 0, parse_log_func, log);

  ShannonBase::Populate::pop_started = true;
  Populator::log_rapid_thread.start();
}

and the thread function defined as below

static void parse_log_func (log_t *log_ptr) {
  current_thd = (current_thd == nullptr) ? new THD(false) : current_thd;
  THR_MALLOC = (THR_MALLOC == nullptr) ? &current_thd->mem_root : THR_MALLOC;

  os_event_reset(log_ptr->rapid_events[0]);
   // here we have a notifier: when checkpoint_lsn/flushed_lsn > rapid_lsn, start to pop
  while (pop_started.load(std::memory_order_seq_cst)) {
    auto stop_condition = [&](bool wait) {
      if (population_buffer->readAvailable()) {
        return true;
      }

      if (wait) { // do something while waiting
      }

      return false;
    };

    os_event_wait_for(log_ptr->rapid_events[0], MAX_LOG_POP_SPIN_COUNT,
                      std::chrono::microseconds{100}, stop_condition);

    byte* from_ptr = population_buffer->peek();
    byte* end_ptr = from_ptr + population_buffer->readAvailable();

    uint parsed_bytes = parse_log.parse_redo(from_ptr, end_ptr);
    population_buffer->remove(parsed_bytes);
  } // while (pop_started)

  pop_started.store(!pop_started, std::memory_order_seq_cst);
  THR_MALLOC = nullptr;
  if (current_thd) {
    delete current_thd;
    current_thd = nullptr;
  }
}

The thread waits for the event to be signaled. In the log_buffer_write function, when a new redo log record is written into the redo log buffer, a copy of it is also written into the population buffer. After the write finishes, the population thread is notified to apply the changes.

lsn_t log_buffer_write(log_t &log, const byte *str, size_t str_len,
                       lsn_t start_lsn) {

...

    log_sync_point("log_buffer_write_before_memcpy");

    /* This is the critical memcpy operation, which copies data
    from internal mtr's buffer to the shared log buffer. */
    std::memcpy(ptr, str, len);
    auto type = mlog_id_t(*ptr & ~MLOG_SINGLE_REC_FLAG);
    if (ShannonBase::Populate::pop_started &&
        ShannonBase::Populate::population_buffer && (
        type == MLOG_REC_INSERT )) {
      ShannonBase::Populate::population_buffer->writeBuff(str, len);
      os_event_set(log.rapid_events[0]);
    }
...

Note: rapid population buffer is a lock-free ring buffer.
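
A minimal sketch of such a single-producer/single-consumer ring buffer is shown below, mirroring the writeBuff()/readAvailable()/peek()/remove() shape used in the code above. It is an illustrative stand-in, not ShannonBase's actual population_buffer class (for instance, the real peek() above returns a pointer into the buffer, while this sketch copies bytes out).

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

class SpscRingBuffer {
 public:
  explicit SpscRingBuffer(std::size_t capacity) : m_buf(capacity) {}

  // Producer side (redo log writer): copy len bytes in, return bytes written.
  std::size_t writeBuff(const unsigned char *src, std::size_t len) {
    const std::size_t head = m_head.load(std::memory_order_acquire);
    const std::size_t tail = m_tail.load(std::memory_order_relaxed);
    len = std::min(len, m_buf.size() - (tail - head));
    for (std::size_t i = 0; i < len; ++i)
      m_buf[(tail + i) % m_buf.size()] = src[i];
    m_tail.store(tail + len, std::memory_order_release);
    return len;
  }

  // Consumer side (repopulation thread): bytes ready to be parsed.
  std::size_t readAvailable() const {
    return m_tail.load(std::memory_order_acquire) -
           m_head.load(std::memory_order_relaxed);
  }

  // Consumer side: copy up to len bytes out without consuming them.
  std::size_t peek(unsigned char *dst, std::size_t len) const {
    const std::size_t head = m_head.load(std::memory_order_relaxed);
    len = std::min(len, readAvailable());
    for (std::size_t i = 0; i < len; ++i)
      dst[i] = m_buf[(head + i) % m_buf.size()];
    return len;
  }

  // Consumer side: drop len bytes that have been fully parsed and applied.
  void remove(std::size_t len) {
    m_head.fetch_add(len, std::memory_order_release);
  }

 private:
  std::vector<unsigned char> m_buf;
  std::atomic<std::size_t> m_head{0};  // consumer position (monotonic)
  std::atomic<std::size_t> m_tail{0};  // producer position (monotonic)
};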

feat(rapid): Persistence of the rapid data

The rapid engine is an in-memory AP engine; all its data is stored in memory. But memory is a volatile storage medium: if ShannonBase hits a bug that leads to a crash, the rapid engine loses all of its data. At the recovery stage, ShannonBase reloads the data into rapid to make sure the instance returns to the correct state it had before the crash. Suppose rapid holds more than 500GB of data: at recovery, we would load those 500GB from the primary engine (InnoDB) into the rapid engine, which is an IO-intensive task and takes a long time. During the loading phase, the data is also converted from row format to column format.

Hence, we want rapid to have the ability to load data into the rapid engine directly, without format conversion. That saves time and IO.

ShannonBase will perform a flush operation periodically, just like a checkpoint in InnoDB. The data in rapid is persisted to disk in a column-oriented, compressed format, which takes less disk space and IO.

The on-disk data file format will be designed to optimize IO and disk space.

feat(rapid): Meta information of rapid and engine status

All meta information about rapid: which tables or columns are loaded into rapid, how many rows are loaded, how many GBs are used, etc.
We should provide an interface or command to get all the parameters the user is interested in.

In this part, we will describe all the system variables and system tables of rapid in detail.

  • 1: System Variables.

  • 2: System Tables.
    All the system tables are in the performance_schema, not in the information_schema.

The system tables used in rapid are listed below:
1: rpd_column_id, which keeps the loaded column info, such as its id and table id, etc.
2: rpd_columns, which is used to trace which columns have been loaded into the rapid engine.
3: rpd_table_id, which is used to trace which tables are loaded in.
4: ...

Related PR(s):
#25

Ref: MySQL HeatWave Sys Tables

feat(shannon): ML libs enabled.

To implement the HeatWave AutoML abilities, such as:

  • In-database Management of the Entire Model Lifecycle!
  • Fully automated ML with minimal number of required parameters → no advanced ML/data science expertise
    needed
  • Data and ML model never leave the database
  • Familiar SQL Interface
  • Performance and Scalability

feat(shannon): Table Recycle bin

To support a recycle bin, which is like a file recycle bin: if you enable this option, dropping a table (or similar operations) does not perform the real operation; it is a logical operation on the object. ShannonBase moves these objects into the recycle bin, and after the expiration time is reached, these tables are dropped automatically.

feat(rapid): compression algorithm

In the rapid engine, a local dictionary is currently used to store the strings we meet at the loading stage. But it consumes too much memory. Therefore, we want to employ a string compression algorithm to reduce the memory usage.
The strings we meet are first compressed into an integer or an integer array, and then these integers or arrays are stored to represent the strings. Take the following as an example:
the string abababababab is compressed to {0, 1, 3, 5, 2, 4} by the LZW algorithm; only this array is stored in the IMCS, not the whole string.

The algorithm can be: LZW, ZSTD, etc.
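
For illustration, a minimal LZW encoder could look like the sketch below. The dictionary here is seeded with the byte alphabet, so the emitted codes will not match the {0, 1, 3, 5, 2, 4} example above, which assumes a different initial dictionary; the function name is hypothetical.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Compress a string into a sequence of dictionary codes (LZW).
std::vector<uint32_t> lzw_encode(const std::string &input) {
  std::unordered_map<std::string, uint32_t> dict;
  for (int c = 0; c < 256; ++c)
    dict[std::string(1, static_cast<char>(c))] = static_cast<uint32_t>(c);
  uint32_t next_code = 256;

  std::vector<uint32_t> out;
  std::string w;
  for (char c : input) {
    const std::string wc = w + c;
    if (dict.count(wc) != 0) {
      w = wc;  // keep extending the current match
    } else {
      out.push_back(dict[w]);   // emit code for the longest known prefix
      dict[wc] = next_code++;   // learn the new sequence
      w = std::string(1, c);
    }
  }
  if (!w.empty()) out.push_back(dict[w]);
  return out;
}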

The comment of a column is used to choose the compression algorithm for that column. Take the following as an instance.

`N_COMMENT` varchar(152) COLLATE ascii_bin DEFAULT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED'

This column is encoded with SORTED, which uses LZ4 as its compression algorithm.

ref:
1: PR #65
2: https://dev.mysql.com/doc/heatwave/en/mys-hw-varlen-encoding.html
3: https://dev.mysql.com/doc/heatwave/en/mys-hw-dictionary-encoding.html

feat(shannon): Upgrade to 8.3.x

The base code line is 8.1.x, but the latest version is 8.3.x, with a lot of enhancements and security improvements added. Therefore, we should upgrade to 8.3.x to keep pace with the community version.

feat(InnoDB): Gets `trx_id` from InnoDB to SQL layer

1: Overview

In order to support MVCC in the Rapid engine, the most important thing is that the Rapid engine should use the same transaction and MVCC mechanism used in InnoDB, to make sure ShannonBase has consistent-read ability.

In order to achieve a consistent read view, all four isolation levels implemented in InnoDB use the transaction id to identify whether data is visible to a transaction or not.

Therefore, when we load data from the primary into the secondary, the transaction id is also retrieved from InnoDB to the SQL layer, and it is used to determine which transactions can see each row in the secondary engine.
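
As a rough sketch of how transaction-id based visibility works (loosely modeled on InnoDB's read view rules; the struct and field names here are illustrative, not InnoDB's actual ReadView, and the creating transaction's own changes are ignored for brevity):

#include <algorithm>
#include <cstdint>
#include <vector>

struct Read_view_sketch {
  uint64_t up_limit_id;              // smallest trx id active when the view was created
  uint64_t low_limit_id;             // next trx id to be assigned at view creation
  std::vector<uint64_t> active_ids;  // sorted ids of transactions active at view creation

  // A row version created by row_trx_id is visible to this view only if the
  // creating transaction had already committed when the view was taken.
  bool changes_visible(uint64_t row_trx_id) const {
    if (row_trx_id < up_limit_id) return true;     // committed before the view
    if (row_trx_id >= low_limit_id) return false;  // started after the view
    return !std::binary_search(active_ids.begin(), active_ids.end(),
                               row_trx_id);        // not active => committed
  }
};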

In MySQL, each table has a system column named trx_id. This column is an invisible field of a row in InnoDB that indicates which transaction created/updated the row. As we know, there are three fields invisible to the user: trx_id, row_id, and roll_ptr.

When we try to get data from InnoDB at the SQL layer, first of all a row template, row_prebuilt_t, is built. It is used to save CPU at the InnoDB layer. Each column has a template; see ha_innobase::build_template.

  /** Save CPU time with prebuilt/cached data structures */
  row_prebuilt_t *m_prebuilt;

Therefore, the naive idea for getting db_trx_id is as follows:
1: Add a ghost column at the SQL layer to store the transaction id we get. In the fill_columns_from_dd function, we add a new column at the end of the share. A class named Field_sys_trx_id is introduced to store the transaction id.

2: In the build_template function, we add a mysql_row_templ_t for this column.

3: In row_sel_field_store_in_mysql_format_func, the transaction id is stored into the MySQL format buffer.

4: Get its value at the SQL layer.

2: Implementation

  1. Adds the ghost column
    In order to support getting system columns, such as trx_id, row_id, and roll_ptr, from InnoDB to MySQL, ghost columns are added to store this data.

When opening a table, MySQL fills in the table definition by reading the table description from the data dictionary catalog tables, dd::xxx.

static bool fill_columns_from_dd(THD *thd, TABLE_SHARE *share,
                                 const dd::Table *tab_obj) {

  // Allocate space for fields in TABLE_SHARE. Adds one extra field.
  const uint fields_size = ((share->fields + 1 + 1) * sizeof(Field *));
  share->field = (Field **)share->mem_root.Alloc((uint)fields_size);
  memset(share->field, 0, fields_size);
...

  /* We don't add the extra field for system tables or at the bootstrap phase. */
  if (!thd->is_bootstrap_system_thread() && !is_internal_table){
    Field_sys_trx_id *sys_trx_id_field =
        new (*THR_MALLOC) Field_sys_trx_id(rec_pos, MAX_DB_TRX_ID_WIDTH);
    assert(sys_trx_id_field);
    sys_trx_id_field->set_hidden(dd::Column::enum_hidden_type::HT_HIDDEN_SE);

    share->field[field_nr] = sys_trx_id_field;
    //rec_pos += share->field[field_nr]->pack_length_in_rec();

    field_nr++;
    assert(share->fields + 1 == field_nr);
  } else
...
}

Here, a new ghost column class named Field_sys_trx_id is employed to store the transaction id of a row. This field will be used to build the row prebuilt template. Now let's give the definition of mysql_row_templ_t below.

/* A struct describing a place for an individual column in the MySQL
row format which is presented to the table handler in ha_innobase.
This template struct is used to speed up row transformations between
Innobase and MySQL. */

struct mysql_row_templ_t {
  ...
};

  2. Builds the template field
    After the table_share is filled in fill_columns_from_dd, ShannonBase will, at this stage, try to build the template fields according to the table column definitions.
/** Adds a field to a m_prebuilt struct 'template'.
 @return the field template */
static mysql_row_templ_t *build_template_field(
    row_prebuilt_t *prebuilt,  /*!< in/out: template */
    dict_index_t *clust_index, /*!< in: InnoDB clustered index */
    dict_index_t *index,       /*!< in: InnoDB index to use */
    TABLE *table,              /*!< in: MySQL table object */
    const Field *field,        /*!< in: field in MySQL table */
    ulint i,                   /*!< in: field index in InnoDB table */
    ulint v_no)                /*!< in: field index for virtual col */
{
...
  if (!strcmp (field->field_name, "DB_TRX_ID") ||
      !strcmp(field->field_name, "DB_ROLL_PTR")) {
    templ->type = DATA_SYS;
    templ->rec_field_no = templ->col_no ;
    assert (templ->col_no == 1 || templ->col_no == 2);
  } else {
    templ->type = col->mtype;
  }
...
  3. Stores the data to MySQL
    In the last step, the row templates are passed into InnoDB; at the InnoDB layer, the data is stored into the template. When control flow returns from InnoDB to the MySQL layer, the data we need is retrieved back to the MySQL layer.
/** Stores a non-SQL-NULL field in the MySQL format. The counterpart of this
function is row_mysql_store_col_in_innobase_format() in row0mysql.cc.*/

void row_sel_field_store_in_mysql_format_func(
    byte *dest, const mysql_row_templ_t *templ, const dict_index_t *index,
    IF_DEBUG(ulint field_no, ) const byte *data,
    ulint len IF_DEBUG(, ulint sec_field)) {
...
  switch (templ->type) {
    const byte *field_end;
    case DATA_INT: ... break;
    ...
    case DATA_SYS_CHILD:
    case DATA_SYS:
      /* These column types should never be shipped to MySQL. But, in Shannon,
         we will retrieve trx id to MySQL. */
      switch (prtype & DATA_SYS_PRTYPE_MASK) {
         case DATA_TRX_ID: {
             ib_id_t id = mach_read_from_6(data);
             memcpy(dest, &id, sizeof(ib_id_t));
             break;
         }
         case DATA_ROW_ID:
         case DATA_ROLL_PTR:
           assert(0);
           break;
      }
      break;
   ....
}

related PRs:
#15
#16

feat(shannon): Vector Engine

To build up the vector engine for ShannonBase. The naive idea is to do it just like InnoDB does. Nowadays, vector processing ability has become a fundamental capability for LLMs. PostgreSQL has a plugin, pgvector, to process vector data; but in MySQL we don't have this.

Therefore, we want to build a vector engine to process vector data. Take the following as an instance,

create table tb1(col1 int, col2 vector(10)) engine=vector;

or vector data type.
ref:
https://github.com/facebookresearch/faiss

feat(shannon): rapid status monitor

To add some status variables to monitor the status of rapid.
Such as:
In log_writer, MONITOR_LOG_WRITER_

   const auto wait_stats = waiting.wait(stop_condition);

    MONITOR_INC_WAIT_STATS(MONITOR_LOG_WRITER_, wait_stats);

Feature: SQL firewall.

Summary

To enable the SQL firewall in ShannonBase to protect it from SQL injection attacks.

1: Install the firewall.
2: Set the SQL rules.
3: Actions, including: ignore, disconnect, pass.

doc(shannon): The limitation of ShannonBase

In this part, we will list all the limitations of ShannonBase.
0: The supported version is greater than 8.0; MySQL 5.7 or lower is not available.
1: The Rapid engine is mandatorily installed. The Rapid engine is an in-memory column store, used to accelerate AP workloads.
2: Supported data types in the Rapid engine: (1) int, bigint, numeric, double, float; (2) date, time; (3) char, varchar, text.
3: A table with the rapid secondary engine MUST have at least ONE user-defined primary key, and only numeric key types are supported for now.

test(shannon): mtr

Goals:

To pass all mtr cases. If any cases cannot pass the test, we will fix them as soon as possible.

By now, we have passed the main suite.

PS:
If a case takes too much effort to fix, it is put into the disabled case list temporarily and will be fixed in the future.

disabled cases:

# shannonbase disabled: we don't support the MyISAM engine and don't test it anymore
main.mysqlpump_bugs : WL#123456 shannonbase disabled it needs perl 5.34.0
main.almost_full : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
main.mysql_load_data_local_dir : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
main.mysqld_safe : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
mian.shutdown_report : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.lock_impl_to_expl : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.log_5_7_9_case1 : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.log_5_7_9_case4 : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.log_5_7_9_case5 : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.log_5_7_9_case6 : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.log_first_rec_group : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.partition_upgrade_5727_lin_lctn_0_debug : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.partition_upgrade_5727_lin_lctn_1_debug : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.upgrade_compressed_recv : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed
innodb.upgrade_geometry_56 : WL#123456 2024-01-05 shannonbase disable mysisam will be fixed

feat(rapid): Impl an In-memory Column Store

Taking the Oracle IM column store as an example, we will implement our IM column store the way Oracle does.

Oracle IM

In order to provide HTAP services, just like MySQL HeatWave, an in-memory column store should be incorporated, which is used to handle the analytical workloads. Any AP workload will be offloaded to the Rapid engine (the in-memory column store).

The architecture of MySQL Heatwave is listed below. (copyright belongs to MySQL)


1: SQL Syntaxes

Firstly, the SQL syntaxes must be defined. All these syntaxes are the basis of all future works. This chapter introduces syntaxes of the following operations:

  • Create a table with secondary_engine=rapid.

  • Load data.

  • Process a query.

  • Monitor the system status.

To determine which SQL syntaxes to define, we must first figure out why we want to use ShannonBase: Firstly, we want users can port their workloads seamlessly from MySQL Heatwave to ShannonBase. Therefore, we try to adopt all SQL syntax that MySQL Heatwave uses (which is also used in MySQL version 8.1).

In addition, relevant changes will be implemented in the MySQL server layer. Following are examples showing some SQL syntaxes supported by ShannonBase.

Certain SQL grammars must be added in sql/sql_yacc.yy. The following uses the SELECT statement as an example:

select_stmt:
          query_expression
          {
            $$ = NEW_PTN PT_select_stmt($1);
          }
        | query_expression locking_clause_list
          {
            $$ = NEW_PTN PT_select_stmt(NEW_PTN PT_locking($1, $2),
                                        nullptr, true);
          }
        | query_expression_parens
          {
            $$ = NEW_PTN PT_select_stmt($1);
          }
        | select_stmt_with_into
        ;

After SQL syntaxes are added, new SQL items are created in yacc. These items will be processed in the MySQL server layer during query optimization.

Create a table with secondary_engine=rapid:

CREATE TABLE orders (id INT) SECONDARY_ENGINE = rapid;
ALTER TABLE orders SECONDARY_ENGINE = rapid;

Compared with the syntax for creating tables used in StoneDB V1.0, StoneDB V2.0 will support a new keyword SECONDARY_ENGINE that is adopted in MySQL 8.0.

Original CREATE statement syntax used in MySQL:

create_table_stmt:
          CREATE opt_temporary TABLE_SYM opt_if_not_exists table_ident
          '(' table_element_list ')' opt_create_table_options_etc
          {
            $$= NEW_PTN PT_create_table_stmt(YYMEM_ROOT, $2, $4, $5,
                                             $7,
                                             $9.opt_create_table_options,
                                             $9.opt_partitioning,
                                             $9.on_duplicate,
                                             $9.opt_query_expression);
          }

opt_create_table_options_etc:
          create_table_options
          opt_create_partitioning_etc
          {
            $$= $2;
            $$.opt_create_table_options= $1;
          }
        | opt_create_partitioning_etc
        ;

create_table_option:
          ENGINE_SYM opt_equal ident_or_text
          {
            $$= NEW_PTN PT_create_table_engine_option(to_lex_cstring($3));
          }
        | SECONDARY_ENGINE_SYM opt_equal NULL_SYM
          {
            $$= NEW_PTN PT_create_table_secondary_engine_option();
          }
        | SECONDARY_ENGINE_SYM opt_equal ident_or_text
          {
            $$= NEW_PTN PT_create_table_secondary_engine_option(to_lex_cstring($3));
          }

From the definition above, SECONDARY_ENGINE_SYM is already defined in create_table_option and should also be handled in class PT_create_table_stmt. More details are given in the design of SQL syntax support.

2: Load Data/Unload Data

This part mainly focuses on how to load data from InnoDB into the in-memory column store. This issue gives all the information about this part.

When a table with a secondary engine has been created, the next step is to load data into the secondary engine. After all the data we need has been loaded, we can do query processing. The load operation is performed via an ALTER TABLE statement with the SECONDARY_LOAD option.

ALTER TABLE tb_name SECONDARY_LOAD;
/**
  Represents ALTER TABLE SECONDARY_LOAD/SECONDARY_UNLOAD statements.
*/
class Sql_cmd_secondary_load_unload final : public Sql_cmd_common_alter_table {
};

/**
 * Loads a table from its primary engine into its secondary engine.
 *
 * This call assumes that MDL_SHARED_NO_WRITE/SECLOAD_SCAN_START_MDL lock
 * on the table have been acquired by caller. During its execution it may
 * downgrade this lock to MDL_SHARED_UPGRADEABLE/SECLOAD_PAR_SCAN_MDL.
 *
 * @param thd              Thread handler.
 * @param table            Table in primary storage engine.
 *
 * @return True if error, false otherwise.
 */
static bool secondary_engine_load_table(THD *thd, const TABLE &table) {
};

class ha_tianmu_secondary : public handler {
 public:
  ha_tianmu_secondary(handlerton *hton, TABLE_SHARE *table_share);

 private:
  int create(const char *, TABLE *, HA_CREATE_INFO *, dd::Table *) override;

  int open(const char *name, int mode, unsigned int test_if_locked,
           const dd::Table *table_def) override;

  int close() override { return 0; }

  int rnd_init(bool) override { return 0; }

  int rnd_next(unsigned char *) override { return HA_ERR_END_OF_FILE; }

  int rnd_pos(unsigned char *, unsigned char *) override {
    return HA_ERR_WRONG_COMMAND;
  }

  int info(unsigned int) override;

  ha_rows records_in_range(unsigned int index, key_range *min_key,
                           key_range *max_key) override;

  void position(const unsigned char *) override {}

  unsigned long index_flags(unsigned int, unsigned int, bool) const override;

  THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to,
                             thr_lock_type lock_type) override;

  Table_flags table_flags() const override;

  const char *table_type() const override { return "TIANMU_RAPID"; }

  int load_table(const TABLE &table) override;

  int unload_table(const char *db_name, const char *table_name,
                   bool error_if_not_loaded) override;

  THR_LOCK_DATA m_lock;
};

3: Data Format

This part describes the data format in memory: how we organize the data in memory. For more information, please refer to: #8

feat(shannon): Reading and writing open file format

To support reading and writing open file formats, such as Parquet, which will give ShannonBase lakehouse processing abilities. Nowadays, the volume of data is getting larger, and the types of data are becoming more versatile: structured data, semi-structured data, and unstructured data.

MySQL HeatWave, Databricks, etc. all support the data lakehouse.

bug: crash at dropping the loaded table

Search before asking

  • I had searched in the issues and found no similar issues.

Version

0.2.0-dev

What's Wrong?

The server crashed when dropping a loaded table: a table was loaded into the rapid engine, then we executed drop table, and after that the server crashed.

How to Reproduce?

1: alter table xxx secondary_load;
2: drop table xxx;

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

feat(shannon): Architecture of ShannonBase

Before the first

ShannonBase is an AI/ML-empowered, open source MySQL HTAP database, which utilizes AI/ML to enhance its capabilities.

With these features, ShannonBase will have autonomous tuning, workload prediction, automatic index choice, index recommendation, selectivity estimation, etc. (ref: a PhD dissertation from UC Berkeley by Zongheng Yang, and MySQL HeatWave).

In order to support analytical processing in MySQL, ShannonBase incorporates a secondary engine, an in-memory column engine that processes analytical workloads. The secondary engine is a new feature introduced in MySQL 8.x; it provides an interface to support multi-model, heterogeneous databases, and it synchronizes data from the primary engine (InnoDB) to the secondary engine (Rapid).

Now, our in-memory column sotre, also called Rapid.

It will be based on MySQL 8.1. It aims to achieve at least x times higher query performance on TPC-H than that of xxx, with a more concise architecture and a query optimizer that can intelligently offload query workloads to corresponding storage engines.

The design philosophy of ShannonBase Rapid is modularity and performance-cost balance. The following outlines the new features that will be implemented in ShannonBase. To learn details about each feature, see the relevant chapter.

ShannonBase Rapid will remain an open source project, a counterpart to the closed-source service MySQL HeatWave.

First, an in-memory column store (IMCS) will be used. Second, a cost-based query engine will be developed to automatically offload transactional and analytical workloads. Third, ShannonBase Rapid will provide a vectorized execution engine and support massively parallel processing. In this way, the execution performance of ShannonBase Rapid will be at least xxx times that of xxx.

ShannonBase will load data from InnoDB into the in-memory Rapid engine, just as MySQL HeatWave does.

"MySQL Analytics is an in-memory processing engine, data is only persisted in MySQL InnoDB storage engine."

This sentence functions as the basic rule and guideline for us when implementing ShannonBase Rapid. This design document introduces the main changes that will be made and gives an overview of the architecture of ShannonBase.

The main design goals of ShannonBase will include:

  1. Large Scale.
  2. Real Time.
  3. Highly Fresh Data Changes.
  4. Strong Data Consistency.
  5. Query Performance Capability.
  6. Single System Interface.
  7. Workload Isolation.

1: Overview of ShannonBase Rapid

ShannonBase is an integrated HTAP database that adopts hybrid row-column store and in-memory computing. It is fully compatible with MySQL version 8.1.

(Figure: the architecture overview of ShannonBase)

MySQL 8.0 provides the secondary engine framework, which intelligently routes TP workloads to the primary engine (InnoDB) and AP workloads to the secondary engine (Rapid), based on the workload type.

2: Query Engine

After the new SQL syntax is enabled, the server understands all the SQL statements. When the server receives an SQL statement, lexical and grammatical processing turns the SQL string into parse-tree classes such as PT_create_table_stmt. We will not discuss how distributed query plans for MPP are generated in this document; instead, we focus on ONE NODE and explain what happens on one node when processing an SQL statement.

In MySQL 8.0, when the cost of a query plan on the primary engine is greater than the threshold defined by the new system variable (secondary_engine_cost_threshold), the query optimization engine will offload this workload to the secondary engine, ensuring optimal processing efficiency.

In the last phase of query optimization, the query engine calls optimize_secondary_engine to determine which engine the workload is routed to for execution, performing the following three steps:

  1. Optimize the query in the usual way: unit->optimize().

  2. Estimate the cost spent by each engine to process the query: current_query_cost and accumulate_current_query_cost.

  3. If current_query_cost is greater than secondary_engine_cost_threshold, forward the workload to optimize_secondary_engine.

// Simplified: cheap statements stay on the primary engine; expensive ones
// are handed to the secondary engine.
if (current_query_cost < variables.secondary_engine_cost_threshold)
  return false;
return optimize_secondary_engine(thd);
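
To show only the decision itself, here is a purely illustrative, self-contained toy model of the thresholding: a cheap OLTP statement stays on InnoDB, while an expensive analytical statement is offloaded to Rapid. SystemVariables and should_offload are hypothetical names, not MySQL or ShannonBase classes.

#include <iostream>

struct SystemVariables {
  double secondary_engine_cost_threshold = 100000.0;  // server-configurable threshold
};

// True when the estimated statement cost justifies offloading to the secondary engine.
bool should_offload(double current_query_cost, const SystemVariables &vars) {
  return current_query_cost >= vars.secondary_engine_cost_threshold;
}

int main() {
  SystemVariables vars;
  std::cout << should_offload(42.0, vars) << std::endl;   // 0: cheap point query stays on InnoDB
  std::cout << should_offload(5.0e6, vars) << std::endl;  // 1: expensive analytical scan goes to Rapid
}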


In the future, after ShannonBase supports MPP, the way ShannonBase processes SQL statements will differ from that of centralized systems: a distributed query plan will be generated after query optimization is complete.

3: Execution Engine

As for the execution engine, ShannonBase will incorporate a vectorized execution engine and will support parallel query execution. A column-based AP system is a natural fit for vectorized execution, which has become a standard technique for improving the performance of AP workloads; RDBMS systems such as ClickHouse also use vectorized execution engines.

Several approaches are available to achieve vectorized execution (a minimal sketch follows this list):

  1. Use SIMD (single instruction, multiple data) instructions to rewrite the execution primitives.
  2. Fetch multiple tuples per iteration (batch-at-a-time) rather than one tuple per iteration.
  3. Rely on the compiler (e.g., GCC auto-vectorization) to generate vectorized code.
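
The following is a minimal, self-contained sketch combining approaches 2 and 3: the column is processed batch-at-a-time, and the tight loop over a plain array is the kind of code GCC/Clang can auto-vectorize into SIMD instructions. The names kBatchSize and sum_batch are illustrative only, not ShannonBase code.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Batch size: how many tuples are fetched per iteration instead of one.
constexpr std::size_t kBatchSize = 1024;

// A tight loop over a plain array; compilers can auto-vectorize this
// (e.g. with -O3 -march=native).
double sum_batch(const double *values, std::size_t n) {
  double acc = 0.0;
  for (std::size_t i = 0; i < n; ++i) acc += values[i];
  return acc;
}

int main() {
  std::vector<double> column(1'000'000, 1.5);      // one in-memory column
  double total = 0.0;
  for (std::size_t off = 0; off < column.size(); off += kBatchSize) {
    const std::size_t n = std::min(kBatchSize, column.size() - off);
    total += sum_batch(column.data() + off, n);    // batch-at-a-time
  }
  std::cout << "SUM = " << total << std::endl;     // 1.5e+06
}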

Some aggregation functions such as count(), sum(), and avg() can be executed in parallel. After a query plan is dispatched to a data node through the management node, the execution engine executes it in parallel: the job is divided into sub-jobs that are executed simultaneously by threads, as modeled in the sketch below. The framework of parallel execution is discussed in issue #xxxx; you can also refer to MySQL NDB Cluster.
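
The sketch below is a self-contained model of the sub-job idea for SUM/AVG: the column is split into chunks, each thread aggregates its own chunk, and the partial results are combined at the end. It only models the concept and is not ShannonBase's parallel executor; compile with -pthread.

#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
  std::vector<double> column(1'000'000, 2.0);      // one in-memory column
  const std::size_t workers = 4;                   // number of sub-jobs
  std::vector<double> partial(workers, 0.0);
  std::vector<std::thread> pool;

  const std::size_t chunk = column.size() / workers;
  for (std::size_t w = 0; w < workers; ++w) {
    const std::size_t begin = w * chunk;
    const std::size_t end = (w + 1 == workers) ? column.size() : begin + chunk;
    // Each thread aggregates its own chunk of the column.
    pool.emplace_back([&, w, begin, end] {
      partial[w] = std::accumulate(column.begin() + begin, column.begin() + end, 0.0);
    });
  }
  for (auto &t : pool) t.join();

  const double sum = std::accumulate(partial.begin(), partial.end(), 0.0);
  std::cout << "SUM = " << sum << ", AVG = " << sum / column.size() << std::endl;
}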

MySQL Cluster has a unique parallel query engine that gives a consistent, consolidated transactional view of the entire distributed, partitioned dataset. This makes designing and programming scalable distributed applications straightforward: the developer can focus entirely on the application logic and does not need to think about data distribution.

4: Rapid, An In-Memory Column Store Secondary Engine

Ref #2

5: Crash recovery

The Rapid engine is an in-memory column engine; memory is volatile, so the data in Rapid is lost when the engine crashes. Crash safety is therefore a key feature for any database, and without it Rapid cannot be used in production. Crash safety also enables fast recovery, because all the data in Rapid has been flushed to disk in a compressed format, which reduces disk usage (a minimal checkpoint sketch follows the list below). There are several scenarios:

  • 1: Normal Shutdown/Restart
    When the user runs service mysql stop to stop a MySQL instance, the processing workflow enters the shutdown stage.

  • 1.1: Shutdown

  • 1.2 Checkpoint

  • 2: Crash/Restart
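
As a minimal sketch of the "flush to disk in compressed format" idea during a checkpoint, the following writes one in-memory column chunk to a file with zlib compression. zlib, the file layout, and the checkpoint_chunk name are assumptions for illustration only; ShannonBase's actual checkpoint format is not specified here.

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>
#include <zlib.h>

// Compress one column chunk with zlib and write it to `path`, prefixed by the
// uncompressed size so the chunk can be decompressed again at recovery time.
bool checkpoint_chunk(const std::vector<double> &chunk, const std::string &path) {
  const uLong src_len = static_cast<uLong>(chunk.size() * sizeof(double));
  uLongf dst_len = compressBound(src_len);
  std::vector<Bytef> compressed(dst_len);
  if (compress(compressed.data(), &dst_len,
               reinterpret_cast<const Bytef *>(chunk.data()), src_len) != Z_OK)
    return false;

  std::ofstream out(path, std::ios::binary);
  const uint64_t raw_size = src_len;   // tiny header: uncompressed byte count
  out.write(reinterpret_cast<const char *>(&raw_size), sizeof(raw_size));
  out.write(reinterpret_cast<const char *>(compressed.data()),
            static_cast<std::streamsize>(dst_len));
  return out.good();
}

int main() {
  std::vector<double> chunk(100'000, 3.14);
  return checkpoint_chunk(chunk, "rapid_chunk.ckpt") ? 0 : 1;  // link with -lz
}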

7: Backup & Restore

8: Replication

9: Data Security

[bug]: make build failed in Mac M1

build error msg:

/Users/gaoriyao/go/src/github.com/hustjieke/ShannonBase/storage/rapid_engine/handler/ha_shannon_rapid.cc:1033:8: error: cannot initialize a member subobject of type 'unsigned long *' with an rvalue of type 'uint64 *' (aka 'unsigned long long *')
static MYSQL_SYSVAR_ULONG(rapid_populate_buffer_size_max,
       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/gaoriyao/go/src/github.com/hustjieke/ShannonBase/include/mysql/plugin.h:391:7: note: expanded from macro 'MYSQL_SYSVAR_ULONG'
      &varname,                                                             \
      ^~~~~~~~
14 warnings and 1 error generated.
make[2]: *** [storage/rapid_engine/CMakeFiles/shannon_rapid.dir/handler/ha_shannon_rapid.cc.o] Error 1
make[1]: *** [storage/rapid_engine/CMakeFiles/shannon_rapid.dir/all] Error 2
make: *** [all] Error 2

env:
Mac OS M1

code:

//static MYSQL_SYSVAR_ULONGLONG(rapid_populate_buffer_size_max,
static MYSQL_SYSVAR_ULONG(rapid_populate_buffer_size_max,
                          ShannonBase::Populate::population_buffer_size,
                          PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_READONLY,
                          "Number of populate buffer size that must not be 10% "
                          "rapid_populate_buffer size.",
                          NULL, nullptr,
                          ShannonBase::SHANNON_MAX_POPULATION_BUFFER_SIZE, 0,
                          ShannonBase::SHANNON_MAX_POPULATION_BUFFER_SIZE,0);

root cause:
SHANNON_MAX_POPULATION_BUFFER_SIZE is defined as uint64, but MYSQL_SYSVAR_ULONG is used; it should be changed to MYSQL_SYSVAR_ULONGLONG.

constexpr uint64 SHANNON_MAX_POPULATION_BUFFER_SIZE = 8 * SHANNON_MB;
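
Based on the root cause above, the fix would presumably switch to the ULONGLONG variant of the macro so that the default/min/max arguments match the uint64 constant. This is a sketch only; the final patch may differ.

// Use the ULONGLONG system-variable macro so the bound arguments match the
// uint64 type of SHANNON_MAX_POPULATION_BUFFER_SIZE.
static MYSQL_SYSVAR_ULONGLONG(rapid_populate_buffer_size_max,
                              ShannonBase::Populate::population_buffer_size,
                              PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_READONLY,
                              "Number of populate buffer size that must not be 10% "
                              "rapid_populate_buffer size.",
                              nullptr, nullptr,
                              ShannonBase::SHANNON_MAX_POPULATION_BUFFER_SIZE, 0,
                              ShannonBase::SHANNON_MAX_POPULATION_BUFFER_SIZE, 0);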
