shannon-data / shannonbase
A MySQL HTAP Database, Open Source version of MySQL Heatwave, Powered by AI&ML.
License: Other
0.2.0-dev
The server crashed when dropping a loaded table. After a table was loaded into the rapid engine, executing DROP TABLE crashed the server.
1: alter table xxx secondary_load;
2: drop table xxx;
It is a key module for HTAP: TP workloads and AP workloads must not interfere with each other, i.e., resource isolation.
Evaluate the available ML libraries and try to incorporate a suitable one, making its capabilities available in ShannonBase.
To support a recycle bin, similar to a file-system recycle bin. When this option is enabled, DROP TABLE and similar statements do not perform the real operation; they are logical operations on the object, and ShannonBase moves the affected objects to the recycle bin. Once the expiration time is reached, these tables are dropped automatically.
Summary
Add a new condition to decide whether the changes should be popped or not.
case SQLCOM_REPLACE_SELECT:
case SQLCOM_INSERT:
case SQLCOM_UPDATE:
- case SQLCOM_DELETE:
- prebuilt->m_to_pop_buff = true;
+ case SQLCOM_DELETE: {
+ prebuilt->m_to_pop_buff =
+ ShannonBase::shannon_loaded_tables->get(table->s->db.str,
+ table->s->table_name.str) ? true : false;
+ }break;
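The guard in the diff above can be modeled with an ordinary map keyed by (schema, table); the registry below is a hypothetical stand-in for the real ShannonBase::shannon_loaded_tables object, sketched only to show the lookup:

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for ShannonBase::shannon_loaded_tables: a registry of
// tables currently loaded into the rapid engine, keyed by (schema, table).
static std::map<std::pair<std::string, std::string>, bool> loaded_tables;

// Mirrors the diff: changes are popped only when the table is loaded into rapid.
static bool should_pop(const std::string &db, const std::string &table) {
  return loaded_tables.count({db, table}) != 0;
}
```

With this predicate, DML against an unloaded table no longer pushes changes into the pop buffer.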
Summary
When a new column type is defined in include/field_types.h, keep include/mysql/components/services/bits/stored_program_bits.h in sync with it.
As described in include/field_types.h:
/**
Column types for MySQL
Note: Keep include/mysql/components/services/bits/stored_program_bits.h in
sync with this
*/
enum enum_field_types {
MYSQL_TYPE_DECIMAL,
MYSQL_TYPE_TINY,
MYSQL_TYPE_SHORT,
MYSQL_TYPE_LONG,
MYSQL_TYPE_FLOAT,
...
...
Description for this feature.
Add types in include/mysql/components/services/bits/stored_program_bits.h
...
#define MYSQL_SP_ARG_TYPE_DB_TRX_ID (1ULL << 21)
#define MYSQL_SP_ARG_TYPE_INVALID (1ULL << 22)
#define MYSQL_SP_ARG_TYPE_BOOL (1ULL << 23)
#define MYSQL_SP_ARG_TYPE_JSON (1ULL << 24)
#define MYSQL_SP_ARG_TYPE_NEWDECIMAL (1ULL << 25)
#define MYSQL_SP_ARG_TYPE_ENUM (1ULL << 26)
#define MYSQL_SP_ARG_TYPE_SET (1ULL << 27)
#define MYSQL_SP_ARG_TYPE_TINY_BLOB (1ULL << 28)
#define MYSQL_SP_ARG_TYPE_MEDIUM_BLOB (1ULL << 29)
#define MYSQL_SP_ARG_TYPE_LONG_BLOB (1ULL << 30)
#define MYSQL_SP_ARG_TYPE_BLOB (1ULL << 31)
#define MYSQL_SP_ARG_TYPE_VAR_STRING (1ULL << 32)
#define MYSQL_SP_ARG_TYPE_STRING (1ULL << 33)
#define MYSQL_SP_ARG_TYPE_GEOMETRY (1ULL << 34)
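Note that the flags from bit 32 upward only work because the constants use 1ULL: shifting a plain 32-bit 1 by 32 is undefined behaviour. A quick self-contained check of how these one-bit-per-type masks combine:

```cpp
#include <cstdint>

// Bits 32..34 need a 64-bit constant: (1 << 32) on a 32-bit int is undefined
// behaviour, while 1ULL << 32 is well defined.
constexpr uint64_t SP_ARG_TYPE_VAR_STRING = 1ULL << 32;
constexpr uint64_t SP_ARG_TYPE_STRING = 1ULL << 33;
constexpr uint64_t SP_ARG_TYPE_GEOMETRY = 1ULL << 34;

// Flags combine and test with bitwise ops, one bit per field type.
constexpr bool has_flag(uint64_t mask, uint64_t flag) {
  return (mask & flag) != 0;
}
```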
We employ a new background thread, the ib_log_pop thread. It is used to pop the changes from InnoDB to rapid (row to column).
void Populator::start_change_populate_threads() {
if (!Populator::log_pop_thread_is_active()) {
sys_log_rapid_thread =
os_thread_create(rapid_populate_thread_key, 0, parse_log_func, log_sys);
ShannonBase::Populate::sys_pop_started = true;
sys_log_rapid_thread.start();
}
}
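The start/stop protocol of such a populate thread can be sketched with std::thread and an atomic flag; the names below are illustrative, not the real Populator API:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Illustrative sketch of a change-populate thread: it loops, draining
// row-format changes into the column store, until asked to stop.
static std::atomic<bool> pop_started{false};
static std::atomic<int> batches_popped{0};

static std::thread start_change_populate_thread() {
  pop_started.store(true);
  return std::thread([] {
    while (pop_started.load()) {
      ++batches_popped;  // stand-in for parsing redo and applying to rapid
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });
}

static void stop_change_populate_thread(std::thread &t) {
  pop_started.store(false);  // the worker observes the flag and exits its loop
  t.join();
}
```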
build error msg:
/Users/gaoriyao/go/src/github.com/hustjieke/ShannonBase/storage/rapid_engine/handler/ha_shannon_rapid.cc:1033:8: error: cannot initialize a member subobject of type 'unsigned long *' with an rvalue of type 'uint64 *' (aka 'unsigned long long *')
static MYSQL_SYSVAR_ULONG(rapid_populate_buffer_size_max,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/gaoriyao/go/src/github.com/hustjieke/ShannonBase/include/mysql/plugin.h:391:7: note: expanded from macro 'MYSQL_SYSVAR_ULONG'
&varname, \
^~~~~~~~
14 warnings and 1 error generated.
make[2]: *** [storage/rapid_engine/CMakeFiles/shannon_rapid.dir/handler/ha_shannon_rapid.cc.o] Error 1
make[1]: *** [storage/rapid_engine/CMakeFiles/shannon_rapid.dir/all] Error 2
make: *** [all] Error 2
env:
Mac OS M1
code:
//static MYSQL_SYSVAR_ULONGLONG(rapid_populate_buffer_size_max,
static MYSQL_SYSVAR_ULONG(rapid_populate_buffer_size_max,
ShannonBase::Populate::population_buffer_size,
PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_READONLY,
"Number of populate buffer size that must not be 10% "
"rapid_populate_buffer size.",
NULL, nullptr,
ShannonBase::SHANNON_MAX_POPULATION_BUFFER_SIZE, 0,
ShannonBase::SHANNON_MAX_POPULATION_BUFFER_SIZE,0);
root cause:
SHANNON_MAX_POPULATION_BUFFER_SIZE is defined as uint64, but MYSQL_SYSVAR_ULONG is used; it should be changed to MYSQL_SYSVAR_ULONGLONG.
constexpr uint64 SHANNON_MAX_POPULATION_BUFFER_SIZE = 8 * SHANNON_MB;
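The error is a pure type-identity issue: on LP64 targets such as macOS arm64 or Linux x86-64, unsigned long and unsigned long long have the same width but remain distinct types, so a uint64* cannot bind to the unsigned long* that MYSQL_SYSVAR_ULONG expects. A minimal reproduction (uint64 here is defined locally to match the type named in the compiler error):

```cpp
#include <type_traits>

// uint64 as reported in the error message: unsigned long long.
using uint64 = unsigned long long;

// Same width on LP64 platforms...
static_assert(sizeof(unsigned long) == sizeof(uint64),
              "LP64: both are 64 bits wide");
// ...but still distinct types, so pointers to them do not interconvert.
static_assert(!std::is_same_v<unsigned long, uint64>,
              "distinct types despite equal size");
static_assert(!std::is_convertible_v<uint64 *, unsigned long *>,
              "this is exactly the compile error in ha_shannon_rapid.cc");
```

Switching the macro to MYSQL_SYSVAR_ULONGLONG makes the declared variable type match the variable's actual type, which resolves the error.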
To support reading and writing open file formats, such as Parquet, which will give ShannonBase lakehouse processing abilities. Nowadays the volume of data keeps growing, and data types are becoming more versatile: structured, semi-structured, and unstructured. MySQL Heatwave, Databricks, etc. all support the data lakehouse.
Summary
To enable a SQL firewall in ShannonBase to protect it from SQL injection attacks.
1: Install the firewall.
2: Set the SQL rules.
3: Actions, including: ignore, disconnect, pass.
To build a vector engine for ShannonBase, with the native idea working just like InnoDB does. Nowadays, vector processing has become a foundational capability for LLMs. PostgreSQL has a plugin, pgvector, to process vector data, but MySQL does not.
Therefore, we want to build a vector engine to process vector data. Taking the following as an instance:
create table tb1(col1 int, col2 vector(10)) engine=vector;
or a vector data type.
ref:
https://github.com/facebookresearch/faiss
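Whatever storage format the engine chooses, its core primitive is a distance function over fixed-length float arrays, e.g. the L2 distance that faiss's flat index defaults to. A minimal sketch of that primitive and the brute-force search it enables:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared L2 distance between two equal-length vectors - the basic primitive
// a vector engine (or an index such as faiss's IndexFlatL2) is built on.
static float l2_sq(const std::vector<float> &a, const std::vector<float> &b) {
  assert(a.size() == b.size());
  float d = 0.0f;
  for (size_t i = 0; i < a.size(); ++i) {
    const float diff = a[i] - b[i];
    d += diff * diff;
  }
  return d;
}

// Brute-force nearest neighbour: what "ORDER BY distance LIMIT 1" would do
// over a vector column without an index.
static size_t nearest(const std::vector<std::vector<float>> &rows,
                      const std::vector<float> &query) {
  size_t best = 0;
  for (size_t i = 1; i < rows.size(); ++i)
    if (l2_sq(rows[i], query) < l2_sq(rows[best], query)) best = i;
  return best;
}
```

An index such as faiss trades this exact linear scan for approximate search over the same distance.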
To add some status variables to monitor the status of rapid, such as MONITOR_LOG_WRITER_ in log_writer:
const auto wait_stats = waiting.wait(stop_condition);
MONITOR_INC_WAIT_STATS(MONITOR_LOG_WRITER_, wait_stats);
Build GitHub Actions workflows for ShannonBase.
Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly.
==155884==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==155884==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==155884==This might be related to ELF_ET_DYN_BASE change in Linux 4.12.
==155884==See https://github.com/google/sanitizers/issues/856 for possible workarounds.
==155884==Process memory map follows:
0x061155cae000-0x061156000000
0x061156000000-0x061156028000 /usr/lib/x86_64-linux-gnu/libc.so.6
0x061156028000-0x0611561bd000 /usr/lib/x86_64-linux-gnu/libc.so.6
0x0611561bd000-0x061156215000 /usr/lib/x86_64-linux-gnu/libc.so.6
0x061156215000-0x061156216000 /usr/lib/x86_64-linux-gnu/libc.so.6
0x061156216000-0x06115621a000 /usr/lib/x86_64-linux-gnu/libc.so.6
0x06115621a000-0x06115621c000 /usr/lib/x86_64-linux-gnu/libc.so.6
0x06115621c000-0x061156229000
0x061156400000-0x06115649a000 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
0x06115649a000-0x0611565ab000 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
0x0611565ab000-0x06115661a000 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
0x06115661a000-0x06115661b000 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
0x06115661b000-0x061156626000 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
0x061156626000-0x061156629000 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
0x061156629000-0x06115662c000
0x06115682b000-0x061156895000 /home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
0x061156895000-0x0611569ad000 /home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
0x0611569ad000-0x0611569f0000 /home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
0x0611569f0000-0x0611569f2000 /home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
0x0611569f2000-0x0611569ff000 /home/lihao/workshop/shannon-bin/lib/private/libprotobuf-lite.so.3.19.4
0x0611569ff000-0x061156a00000
0x061156a00000-0x061156ab2000 /home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
0x061156ab2000-0x061156d11000 /home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
0x061156d11000-0x061156de3000 /home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
0x061156de3000-0x061156e3e000 /home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
0x061156e3e000-0x061156e41000 /home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
0x061156e41000-0x061156e44000
0x061156e44000-0x061156e6a000 /home/lihao/workshop/shannon-bin/lib/private/libcrypto.so.3
0x061156f19000-0x061156f27000 /usr/lib/x86_64-linux-gnu/libm.so.6
0x061156f27000-0x061156fa3000 /usr/lib/x86_64-linux-gnu/libm.so.6
0x061156fa3000-0x061156ffe000 /usr/lib/x86_64-linux-gnu/libm.so.6
0x061156ffe000-0x061156fff000 /usr/lib/x86_64-linux-gnu/libm.so.6
0x061156fff000-0x061157000000 /usr/lib/x86_64-linux-gnu/libm.so.6
0x061157000000-0x061157024000 /usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
0x061157024000-0x0611570fb000 /usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
0x0611570fb000-0x06115712d000 /usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
0x06115712d000-0x06115712e000 /usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
0x06115712e000-0x061157132000 /usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
0x061157132000-0x061157135000 /usr/lib/x86_64-linux-gnu/libasan.so.6.0.0
0x061157135000-0x0611579e9000
0x061157a56000-0x061157a6d000
0x061157a6d000-0x061157a70000 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
0x061157a70000-0x061157a87000 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
0x061157a87000-0x061157a8b000 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
0x061157a8b000-0x061157a8c000 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
0x061157a8c000-0x061157a8d000 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
0x061157a8d000-0x061157a8f000
0x061157a8f000-0x061157a92000 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
0x061157a92000-0x061157a98000 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
0x061157a98000-0x061157a9a000 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
0x061157a9a000-0x061157a9b000 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
0x061157a9b000-0x061157a9c000 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
0x061157a9c000-0x061157aba000 /home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
0x061157aba000-0x061157b15000 /home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
0x061157b15000-0x061157b32000 /home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
0x061157b32000-0x061157b3c000 /home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
0x061157b3c000-0x061157b40000 /home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
0x061157b40000-0x061157b47000 /home/lihao/workshop/shannon-bin/lib/private/libssl.so.3
0x061157b51000-0x061157b5c000
0x061157b5c000-0x061157b5e000 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x061157b5e000-0x061157b88000 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x061157b88000-0x061157b93000 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x061157b93000-0x061157b94000
0x061157b94000-0x061157b96000 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x061157b96000-0x061157b98000 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x56cd9ab69000-0x56cd9eca2000 /home/lihao/workshop/shannon-bin/bin/mysqld
0x56cd9eca2000-0x56cda3adf000 /home/lihao/workshop/shannon-bin/bin/mysqld
0x56cda3adf000-0x56cda6b62000 /home/lihao/workshop/shannon-bin/bin/mysqld
0x56cda6b63000-0x56cda6d31000 /home/lihao/workshop/shannon-bin/bin/mysqld
0x56cda6d31000-0x56cda7bb7000 /home/lihao/workshop/shannon-bin/bin/mysqld
0x56cda7bb7000-0x56cda8241000
0x7ffd8aac8000-0x7ffd8aae9000 [stack]
0x7ffd8ab91000-0x7ffd8ab95000 [vvar]
0x7ffd8ab95000-0x7ffd8ab97000 [vdso]
0xffffffffff600000-0xffffffffff601000 [vsyscall]
==155884==End of process memory map.
You can fix it with:
sudo sysctl vm.mmap_rnd_bits=28
or refer to google/sanitizers#856 for more discussion.
The base code line is 8.1.x, but the latest version is 8.3.x, with many enhancements and security improvements added. Therefore, we should upgrade to 8.3.x to keep pace with the community version.
To support parallel query processing in ShannonBase.
In this part, we list all the limitations of ShannonBase.
0: Only versions greater than 8.0 are supported; MySQL 5.7 or earlier is not available.
1: The Rapid engine must be installed. Rapid is an in-memory column store used to accelerate AP workloads.
2: Data types supported in the Rapid engine: (1) int, bigint, numeric, double, float; (2) date, time; (3) char, varchar, text.
3: A table with the rapid secondary engine MUST have at least ONE user-defined primary key, and currently only numeric key types are supported.
Summary
Use perf to evaluate the performance of ShannonBase, mainly focusing on memory, such as memory bandwidth usage, cache misses, etc., and to evaluate the impact of GCC options on performance.
The version 1 format of IMCS has been implemented; the version 2 format should now be implemented.
For more information, refer to #8
All meta information about rapid: which tables or columns are loaded into rapid, how many rows are loaded, how many GBs are used, etc.
We should provide an interface or command to get all the parameters the user is interested in.
In this part, we describe all the system variables and system tables of rapid in detail.
1: System Variables.
2: System Tables.
All the system tables are in performance_schema, not in information_schema.
The system tables used in rapid are listed below:
1: rpd_column_id, which keeps the loaded column info, such as its id and table id, etc.
2: rpd_columns, which traces which columns have been loaded into the rapid engine.
3: rpd_table_id, which traces which tables have been loaded.
4: ...
Related PR(s):
#25
In order to support MVCC in the Rapid engine, the most important thing is that Rapid uses the same transaction & MVCC mechanism as InnoDB, to make sure ShannonBase has consistent-read ability.
To achieve a consistent read view, all four isolation levels implemented in InnoDB use the transaction id to identify whether data is visible to a transaction.
Therefore, when we load data from the primary engine into the secondary engine, the transaction id is also retrieved from InnoDB to the SQL layer, and is used to determine which transactions can see a given row in the secondary engine.
In MySQL, each table has a system column named trx_id. This column stores an invisible field of a row in InnoDB that indicates which transaction created/updated the row. As we know, there are three fields invisible to users: trx_id, row_id, and roll_ptr.
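The visibility rule that trx_id feeds into can be sketched with a simplified model of an InnoDB-style read view (this is the rule only, not the real ReadView class, which also tracks a lower bound below which all ids are trivially visible):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified model of an InnoDB-style read view: a row version written by
// trx_id is visible iff that transaction had committed when the view was
// created.
struct ReadView {
  uint64_t low_limit;            // ids >= this had not started at snapshot time
  std::vector<uint64_t> active;  // ids still active (uncommitted) at creation

  bool visible(uint64_t trx_id) const {
    if (trx_id >= low_limit) return false;  // started after the snapshot
    // Committed before the snapshot unless it was still in the active set.
    return std::find(active.begin(), active.end(), trx_id) == active.end();
  }
};
```

The secondary engine can apply this same predicate to the trx_id it received alongside each row.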
When we try to get data from InnoDB at the SQL layer, first of all a row template, row_prebuilt_t, is built. It is used to save CPU at the InnoDB layer. Each column has a template; see ha_innobase::build_template.
/** Save CPU time with prebuilt/cached data structures */
row_prebuilt_t *m_prebuilt;
Therefore, the naive idea for getting db_trx_id is as follows:
1: Add a ghost column at the SQL layer to store the transaction id we get. In the fill_columns_from_dd function, we add a new column to the end of share. A new class, Field_sys_trx_id, is introduced to store the transaction id.
2: In the build_template function, we add a mysql_row_templ_t for this column.
3: In row_sel_field_store_in_mysql_format_func, store the transaction id in it.
4: Get its value up to the SQL layer.
To bring trx_id, row_id, and roll_ptr from InnoDB to MySQL, ghost columns are added to store these data. When opening a table, MySQL fills up the table definition by reading the table description from the data-dictionary catalog tables, dd::xxx.
static bool fill_columns_from_dd(THD *thd, TABLE_SHARE *share,
const dd::Table *tab_obj) {
// Allocate space for fields in TABLE_SHARE. Adds one extra field.
const uint fields_size = ((share->fields + 1 + 1) * sizeof(Field *));
share->field = (Field **)share->mem_root.Alloc((uint)fields_size);
memset(share->field, 0, fields_size);
...
/* We don't add the extra field for system tables or at the bootstrap phase. */
if (!thd->is_bootstrap_system_thread() && !is_internal_table){
Field_sys_trx_id *sys_trx_id_field =
new (*THR_MALLOC) Field_sys_trx_id(rec_pos, MAX_DB_TRX_ID_WIDTH);
assert(sys_trx_id_field);
sys_trx_id_field->set_hidden(dd::Column::enum_hidden_type::HT_HIDDEN_SE);
share->field[field_nr] = sys_trx_id_field;
//rec_pos += share->field[field_nr]->pack_length_in_rec();
field_nr++;
assert(share->fields + 1 == field_nr);
} else
...
}
Here, a new ghost column class named Field_sys_trx_id is employed to store the transaction id of a row. This field will be used to build a row prebuilt template. Now, the definition of mysql_row_templ_t is given below.
/* A struct describing a place for an individual column in the MySQL
row format which is presented to the table handler in ha_innobase.
This template struct is used to speed up row transformations between
Innobase and MySQL. */
struct mysql_row_templ_t {
...
};
The table_share is filled in fill_columns_from_dd; at this stage, ShannonBase builds the template fields according to the table column definitions.
/** Adds a field to a m_prebuilt struct 'template'.
@return the field template */
static mysql_row_templ_t *build_template_field(
row_prebuilt_t *prebuilt, /*!< in/out: template */
dict_index_t *clust_index, /*!< in: InnoDB clustered index */
dict_index_t *index, /*!< in: InnoDB index to use */
TABLE *table, /*!< in: MySQL table object */
const Field *field, /*!< in: field in MySQL table */
ulint i, /*!< in: field index in InnoDB table */
ulint v_no) /*!< in: field index for virtual col */
{
...
if (!strcmp (field->field_name, "DB_TRX_ID") ||
!strcmp(field->field_name, "DB_ROLL_PTR")) {
templ->type = DATA_SYS;
templ->rec_field_no = templ->col_no ;
assert (templ->col_no == 1 || templ->col_no == 2);
} else {
templ->type = col->mtype;
}
...
/** Stores a non-SQL-NULL field in the MySQL format. The counterpart of this
function is row_mysql_store_col_in_innobase_format() in row0mysql.cc.*/
void row_sel_field_store_in_mysql_format_func(
byte *dest, const mysql_row_templ_t *templ, const dict_index_t *index,
IF_DEBUG(ulint field_no, ) const byte *data,
ulint len IF_DEBUG(, ulint sec_field)) {
...
switch (templ->type) {
const byte *field_end;
case DATA_INT: ... break;
...
case DATA_SYS_CHILD:
case DATA_SYS:
/* These column types should never be shipped to MySQL. But, in Shannon,
we will retrieve trx id to MySQL. */
switch (prtype & DATA_SYS_PRTYPE_MASK) {
case DATA_TRX_ID:
id = mach_read_from_6(data);
memcpy(dest, &id, sizeof(ib_id_t));
break;
case DATA_ROW_ID:
case DATA_ROLL_PTR:
assert(0);
break;
}
break;
....
}
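The mach_read_from_6 call above reads DB_TRX_ID out of its on-disk representation: a 6-byte big-endian integer. A simplified stand-alone equivalent, kept only to make the byte layout explicit:

```cpp
#include <cstdint>

// Simplified equivalent of InnoDB's mach_read_from_6(): DB_TRX_ID is stored
// on disk as a 6-byte big-endian integer, most-significant byte first.
static uint64_t read_trx_id_6(const unsigned char *b) {
  uint64_t id = 0;
  for (int i = 0; i < 6; ++i) id = (id << 8) | b[i];
  return id;
}
```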
Summary
Now we parse all the mlog records, but we should only care about three types of operations: (1) insert; (2) delete; (3) update.
The other operations are not our concern. Taking create database or drop database as examples, the mlog records generated by these operations should not be added into sys_pop_buff.
A new type, MTR_LOG_WITH_POP, is added: if an mlog record is not of this type, we do not add it into the pop buffer; otherwise, it is added.
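The filtering rule amounts to a predicate over record types; the enum values below are illustrative only, not InnoDB's real mlog_id_t:

```cpp
// Illustrative record types - not InnoDB's real mlog_id_t values.
enum class MlogType { kInsert, kUpdate, kDelete, kCreateDb, kDropDb, kOther };

// Only DML changes are propagated to the rapid pop buffer; DDL such as
// CREATE/DROP DATABASE is filtered out.
static bool should_add_to_pop_buff(MlogType t) {
  switch (t) {
    case MlogType::kInsert:
    case MlogType::kUpdate:
    case MlogType::kDelete:
      return true;
    default:
      return false;
  }
}
```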
To provide native embedded machine-learning functions, so that users can call them as stored procedures in ShannonBase.
Taking the following as an instance:
call ML.linear_regression(xxxx);
To implement the Heatwave AutoML abilities, such as:
In-database management of the entire model lifecycle.
Fully automated ML with a minimal number of required parameters → no advanced ML/data-science expertise needed.
Data and ML models never leave the database.
Familiar SQL interface.
Performance and scalability.
By now, some sys procedures have been added; their names carry the ML_ prefix.
mysql> SELECT routine_name FROM information_schema.routines WHERE routine_schema = 'sys';
Current database: ML_SCHEMA_root
+-------------------------------------+
| ROUTINE_NAME |
+-------------------------------------+
| extract_schema_from_file_name |
| extract_table_from_file_name |
| format_bytes |
| format_path |
| format_statement |
| format_time |
| list_add |
| list_drop |
| ps_is_account_enabled |
| ps_is_consumer_enabled |
| ps_is_instrument_default_enabled |
| ps_is_instrument_default_timed |
| ps_is_thread_instrumented |
| ps_thread_account |
| ps_thread_id |
| ps_thread_stack |
| ps_thread_trx_info |
| quote_identifier |
| sys_get_config |
| version_major |
| version_minor |
| version_patch |
| create_synonym_db |
| diagnostics |
| execute_prepared_stmt |
| ml_explain |
| ml_explain_row |
| ml_explain_table |
| ml_model_import |
| ml_model_load |
| ml_model_unload |
| ml_predict_row |
| ml_predict_table |
| ml_train |
In the rapid engine, a local dictionary is currently used to store the strings we meet during the loading stage, but it consumes too much memory. Therefore, we want to employ a string-compression algorithm to reduce memory usage.
The strings we meet are first compressed into an integer or an integer array; then these integers or arrays are stored to represent the strings. Taking the following as an example:
the string abababababab can be compressed to {0, 1, 3, 5, 2, 4} by the LZW algorithm; only this array is stored in IMCS, not the whole string.
The algorithm can be: LZW, ZSTD, etc.
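A minimal LZW sketch, with the dictionary seeded from the distinct bytes of the input; note that the exact codes emitted depend on how the dictionary is initialized, so they need not match the array quoted above, but the compression effect is the same:

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal LZW: returns the code sequence that replaces the input string.
// The dictionary is seeded with every distinct byte of the input.
static std::vector<int> lzw_compress(const std::string &s) {
  std::map<std::string, int> dict;
  int next_code = 0;
  for (char c : s)
    if (dict.emplace(std::string(1, c), next_code).second) ++next_code;

  std::vector<int> out;
  std::string w;
  for (char c : s) {
    const std::string wc = w + c;
    if (dict.count(wc)) {
      w = wc;  // keep growing the current phrase
    } else {
      out.push_back(dict[w]);   // emit the code for the known prefix
      dict[wc] = next_code++;   // learn the new phrase
      w = std::string(1, c);
    }
  }
  if (!w.empty()) out.push_back(dict[w]);
  return out;
}
```

For repetitive strings the code array is much shorter than the input, which is exactly the memory saving the issue is after.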
The comment of a column will be used to choose the compression algorithm for that column. Taking the following as an instance:
`N_COMMENT` varchar(152) COLLATE ascii_bin DEFAULT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED'
This column is encoded with SORTED, which uses LZ4 as its compression algorithm.
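Selecting the codec from the column comment can be a simple key/value scan of the comment string; this is a sketch only, and the real RAPID_COLUMN grammar in Heatwave has more options than shown here:

```cpp
#include <string>

// Extracts the encoding name from a column comment such as
// 'RAPID_COLUMN=ENCODING=SORTED'. Returns "" when no encoding is requested.
static std::string rapid_encoding(const std::string &comment) {
  const std::string key = "RAPID_COLUMN=ENCODING=";
  const auto pos = comment.find(key);
  if (pos == std::string::npos) return "";
  // The encoding name runs until a separator or the end of the comment.
  auto end = comment.find_first_of(" ,;", pos + key.size());
  if (end == std::string::npos) end = comment.size();
  return comment.substr(pos + key.size(), end - pos - key.size());
}
```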
ref:
1: PR #65
2: https://dev.mysql.com/doc/heatwave/en/mys-hw-varlen-encoding.html
3: https://dev.mysql.com/doc/heatwave/en/mys-hw-dictionary-encoding.html
To enable the multi-version ability of rapid.
Now that ML_TRAIN has been implemented and the framework of the ML libraries is done, the remaining ML_xxx routines can be implemented easily using the framework I have built:
ml_explain_row
ml_explain_table
ml_explain
ml_model_import
ml_model_load
ml_model_unload
ml_predict_row
ml_predict_table
ml_score
For more details about the internals, visit: https://github.com/Shannon-Data/ShannonBase/wiki/Internals-of-ShannonBase#4-automl
Try to incorporate coroutines.
The following types are supported in the Shannon rapid engine (to be updated in the future):
Numeric
1: Integer
2: double
String
1: char
2: varchar
3: text
Datetime
1: date
2: time
3: datetime
Taking the Oracle In-Memory column store as an example, we will implement our IM column store as Oracle does.
In order to provide HTAP services, just like MySQL Heatwave, an in-memory column store is incorporated to handle the analytical workloads. Any AP workload will be offloaded to the Rapid engine (the in-memory column store).
The architecture of MySQL Heatwave is shown below. (Copyright belongs to MySQL.)
Firstly, the SQL syntaxes must be defined. All these syntaxes are the basis of all future works. This chapter introduces syntaxes of the following operations:
Create a table with secondary_engine=rapid.
Load data.
Process a query.
Monitor the system status.
To determine which SQL syntaxes to define, we must first figure out why we want to use ShannonBase. Firstly, we want users to be able to port their workloads seamlessly from MySQL Heatwave to ShannonBase. Therefore, we try to adopt all the SQL syntax that MySQL Heatwave uses (which is also used in MySQL 8.1).
In addition, relevant changes will be implemented in the MySQL server layer. Following are examples showing some SQL syntaxes supported by ShannonBase.
Certain SQL grammar rules must be added in sql/sql_yacc.yy. The following uses the SELECT statement as an example:
select_stmt:
query_expression
{
$$ = NEW_PTN PT_select_stmt($1);
}
| query_expression locking_clause_list
{
$$ = NEW_PTN PT_select_stmt(NEW_PTN PT_locking($1, $2),
nullptr, true);
}
| query_expression_parens
{
$$ = NEW_PTN PT_select_stmt($1);
}
| select_stmt_with_into
;
After SQL syntaxes are added, new SQL items are created in yacc. These items will be processed in the MySQL server layer during query optimization.
Create a table with secondary_engine=rapid:
CREATE TABLE orders (id INT) SECONDARY_ENGINE = rapid;
ALTER TABLE orders SECONDARY_ENGINE = rapid;
Compared with the syntax for creating tables used in StoneDB V1.0, StoneDB V2.0 will support a new keyword SECONDARY_ENGINE that is adopted in MySQL 8.0.
Original CREATE statement syntax used in MySQL:
create_table_stmt:
CREATE opt_temporary TABLE_SYM opt_if_not_exists table_ident
'(' table_element_list ')' opt_create_table_options_etc
{
$$= NEW_PTN PT_create_table_stmt(YYMEM_ROOT, $2, $4, $5,
$7,
$9.opt_create_table_options,
$9.opt_partitioning,
$9.on_duplicate,
$9.opt_query_expression);
}
opt_create_table_options_etc:
create_table_options
opt_create_partitioning_etc
{
$$= $2;
$$.opt_create_table_options= $1;
}
| opt_create_partitioning_etc
;
create_table_option:
ENGINE_SYM opt_equal ident_or_text
{
$$= NEW_PTN PT_create_table_engine_option(to_lex_cstring($3));
}
| SECONDARY_ENGINE_SYM opt_equal NULL_SYM
{
$$= NEW_PTN PT_create_table_secondary_engine_option();
}
| SECONDARY_ENGINE_SYM opt_equal ident_or_text
{
$$= NEW_PTN PT_create_table_secondary_engine_option(to_lex_cstring($3));
}
From the definition above, SECONDARY_ENGINE_SYM is already defined in create_table_option, and should also be handled in class PT_create_table_stmt. See the design of SQL syntax support for more information.
This part mainly focuses on how to load data from InnoDB into the in-memory column store. This issue gives all the information about this part.
Once a table with a secondary engine is created, the next step is to load data into the secondary engine. After all the data we need has been loaded, we can do query processing. The load operation is performed via an ALTER TABLE statement with the SECONDARY_LOAD option:
ALTER TABLE tb_name SECONDARY_LOAD;
/**
Represents ALTER TABLE SECONDARY_LOAD/SECONDARY_UNLOAD statements.
*/
class Sql_cmd_secondary_load_unload final : public Sql_cmd_common_alter_table {
};
/**
* Loads a table from its primary engine into its secondary engine.
*
* This call assumes that MDL_SHARED_NO_WRITE/SECLOAD_SCAN_START_MDL lock
* on the table have been acquired by caller. During its execution it may
* downgrade this lock to MDL_SHARED_UPGRADEABLE/SECLOAD_PAR_SCAN_MDL.
*
* @param thd Thread handler.
* @param table Table in primary storage engine.
*
* @return True if error, false otherwise.
*/
static bool secondary_engine_load_table(THD *thd, const TABLE &table) {
};
class ha_tianmu_secondary : public handler {
public:
ha_tianmu_secondary(handlerton *hton, TABLE_SHARE *table_share);
private:
int create(const char *, TABLE *, HA_CREATE_INFO *, dd::Table *) override;
int open(const char *name, int mode, unsigned int test_if_locked,
const dd::Table *table_def) override;
int close() override { return 0; }
int rnd_init(bool) override { return 0; }
int rnd_next(unsigned char *) override { return HA_ERR_END_OF_FILE; }
int rnd_pos(unsigned char *, unsigned char *) override {
return HA_ERR_WRONG_COMMAND;
}
int info(unsigned int) override;
ha_rows records_in_range(unsigned int index, key_range *min_key,
key_range *max_key) override;
void position(const unsigned char *) override {}
unsigned long index_flags(unsigned int, unsigned int, bool) const override;
THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to,
thr_lock_type lock_type) override;
Table_flags table_flags() const override;
const char *table_type() const override { return "TIANMU_RAPID"; }
int load_table(const TABLE &table) override;
int unload_table(const char *db_name, const char *table_name,
bool error_if_not_loaded) override;
THR_LOCK_DATA m_lock;
};
This part describes the data format in memory and how we organize the data in memory. For more information, please refer to: #8
There has been in each release.
In the Field_longlong type definition:
class Field_longlong : public Field_num {
...
double val_real() const override;
...
}
The function val_real() is marked as override; it is not inherited by Field_sys_trx_id. In native MySQL 8.0, this function is marked as final.
The first step in processing AP workloads is to load the full base data into the rapid engine; then rapid starts the propagation operation automatically. When a table is loaded from InnoDB into the rapid engine, some meta information is also loaded into catalog tables, such as performance_schema.rpd_column, performance_schema.rpd_column_id, etc. A background thread is launched at system start to monitor the redo log; when a new DML operation is done, this background thread parses the incoming redo log and applies the changes to IMCS.
When the load statement is executed, it performs the load operation. Overall, just like an insert into xxx select xxx statement, the system first does a table scan via an index scan or a full table scan.
1: It scans the target table, usually an InnoDB table. Here, one problem must be clarified first: which data is visible to the operation and which is not. We define that only committed data is visible to the scan operation. In other words, we use an auto-committed transaction to do the table scan, and the transaction runs at the read committed isolation level.
New data inserted while we are scanning will not be seen by the operation; in fact, this cannot happen, because an exclusive MDL lock is used to prevent new rows from being inserted into the table while the loading operation is running.
2: Besides the core functions, there must be some system parameters to monitor the load operations, for example: how much data has been loaded? How much remains? And so on. Some parallelism-related parameters will also be introduced, such as DOP (degree of parallelism), etc.
By now, InnoDB has only a few parallel table-scan abilities, such as index-based counting operations and CHECK TABLE.
An InnoDB index is organized as a B+tree, and each node in the B+tree has a fixed size (generally 16K, denoted by UNIV_PAGE_SIZE in the source code). InnoDB reads data from disk into the buffer pool, and evicts old pages to make room for the latest pages, because the buffer pool is NOT unlimited; usually the buffer pool is configured to 70%-80% of the physical memory size.
Data on disk is managed in three levels: (1) segment; (2) cluster; (3) page. InnoDB does in-place updates, so when accessing a specific page, latch protection is required.
2.1: Full Table Scan
As InnoDB does; we will not discuss it further here. We mainly focus on parallel index scan.
2.2 Parallel index scan
The basic idea of parallel index scan is to divide the index B+tree into several workloads, each scanned by ONE thread.
The index table scan follows these steps:
If we do a FULL table scan, it checks all the nodes in the B+tree.
In ShannonBase, the load operation is defined as follows: ShannonBase does a full table scan to find all the records in the target table, transfers them from row-based tuples into column-based tuples, and loads them into the rapid engine.
Now we can scan the B+tree in parallel by dividing the scanning operations; for example, scans are performed in parallel according to the DOP and the height of the index.
ShannonBase uses Multi-version Concurrency Control (MVCC) to determine which row is visible or not.
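The divide-the-tree idea can be illustrated with a flat array standing in for the leaf level of the B+tree: split the key range into DOP chunks and scan each chunk with one thread, then merge the partial results:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Illustration of splitting one scan into DOP parallel sub-scans: the vector
// stands in for the leaf level of the B+tree; each thread scans one chunk.
static long parallel_scan_sum(const std::vector<int> &rows, int dop) {
  std::vector<long> partial(dop, 0);
  std::vector<std::thread> workers;
  const size_t chunk = (rows.size() + dop - 1) / dop;
  for (int t = 0; t < dop; ++t) {
    workers.emplace_back([&, t] {
      const size_t lo = t * chunk;
      const size_t hi = std::min(rows.size(), lo + chunk);
      for (size_t i = lo; i < hi; ++i) partial[t] += rows[i];
    });
  }
  for (auto &w : workers) w.join();
  // Merge the per-thread results, as the coordinator would.
  return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

The real Parallel_reader additionally splits uneven sub-trees dynamically, as described in the worklog excerpt below.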
2.2.1 InnoDB basic parallel scan
Before we start to support more parallel operations in InnoDB, we first re-examine the existing work. The relevant MySQL worklogs are listed below.
This result was found by searching for the keyword "parallel". From the results we can see that parallel operations mainly include: (1) parallel query execution; (2) InnoDB parallel read of index. There are several main worklogs we should notice: (1) WL#12978 InnoDB: Fix imbalance during parallel scan; (2) WL#11720 InnoDB: Parallel read of index; (3) WL#1092 Parallel load data infile and bulk insert.
For more information, refer to: https://dev.mysql.com/worklog/?sc=&sd=&k=parallel&s=
PS: to find out what changed in WL#11720, we can use git log --grep WL#11720 to see the changes in that commit. Commit id: dbfc59ffaf8096a2dc5b76766fedb45ff2fb8cbf
/** The core idea is to find the left and right paths down the B+Tree.These
paths correspond to the scan start and scan end search. Follow the links
at the appropriate btree level from the left to right and split the scan
on each of these sub-tree root nodes.
If the user has set the maximum number of threads to use at say 4 threads
and there are 5 sub-trees at the selected level then we will split the 5th
sub-tree dynamically when it is ready for scan.
We want to allow multiple parallel range scans on different indexes at the
same time. To achieve this split out the scan context (Scan_ctx) from the
execution context (Ctx). The Scan_ctx has the index and transaction
information and the Ctx keeps track of the cursor for a specific thread
during the scan.
To start a scan we need to instantiate a Parallel_reader. A parallel reader
can contain several Scan_ctx instances and a Scan_ctx can contain several
Ctx instances. It's the Ctx instances that are eventually executed.
This design allows for a single Parallel_reader to scan multiple indexes
at once. Each index range scan has to be added via its add_scan() method.
This functionality is required to handle parallel partition scans because
partitions are separate indexes. This can be used to scan completely
different indexes and tables by one instance of a Parallel_reader.
To solve the imbalance problem we dynamically split the sub-trees as and
when required. e.g., If you have 5 sub-trees to scan and 4 threads then
it will tag the 5th sub-tree as "to_be_split" during phase I (add_scan()),
the first thread that finishes scanning the first set of 4 partitions will
then dynamically split the 5th sub-tree and add the newly created sub-trees
to the execution context (Ctx) run queue in the Parallel_reader. As the
other threads complete their sub-tree scans they will pick up more execution
contexts (Ctx) from the Parallel_reader run queue and start scanning the
sub-partitions as normal.
Note: The Ctx instances are in a virtual list. Each Ctx instance has a
range to scan. The start point of this range instance is the end point
of the Ctx instance scanning values less than its start point. A Ctx
will scan from [Start, End) rows. We use std::shared_ptr to manage the
reference counting, this allows us to dispose of the Ctx instances
without worrying about dangling pointers.
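The dynamic-splitting run queue described in the comment above can be sketched as follows. The `Ctx`/`RunQueue` types here are simplified stand-ins for InnoDB's `Parallel_reader` machinery, kept only to show how a sub-tree tagged "to_be_split" is divided lazily by whichever worker picks it up.

```cpp
#include <cstdint>
#include <deque>
#include <memory>
#include <mutex>

// Simplified execution context: a half-open scan range [start, end).
struct Ctx {
  uint64_t start, end;
  bool to_be_split{false};
};

class RunQueue {
 public:
  void push(std::shared_ptr<Ctx> ctx) {
    std::lock_guard<std::mutex> g(m_mutex);
    m_queue.push_back(std::move(ctx));
  }

  // Workers call this; a context tagged to_be_split is divided into two
  // smaller ranges that go back on the queue instead of being scanned.
  std::shared_ptr<Ctx> pop() {
    std::lock_guard<std::mutex> g(m_mutex);
    while (!m_queue.empty()) {
      auto ctx = m_queue.front();
      m_queue.pop_front();
      if (ctx->to_be_split) {
        uint64_t mid = ctx->start + (ctx->end - ctx->start) / 2;
        m_queue.push_back(std::make_shared<Ctx>(Ctx{ctx->start, mid, false}));
        m_queue.push_back(std::make_shared<Ctx>(Ctx{mid, ctx->end, false}));
        continue;  // hand out one of the halves on the next iteration
      }
      return ctx;  // shared_ptr reference counting avoids dangling pointers
    }
    return nullptr;  // queue drained: worker exits
  }

 private:
  std::mutex m_mutex;
  std::deque<std::shared_ptr<Ctx>> m_queue;
};
```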
2.2.2 InnoDB basic parallel scan implementation
The changes for parallel scan are mainly in these files: (1) sql/handler.h; (2) innobase/handler/ha_innodb.h; (3) innobase/handler/handle0alter.cc; (4) include/row0pread.h. The related system parameters are mainly defined in srv/srv0srv.cc.
Limitation: parallel scan can NOT be performed on a secondary index.
class Parallel_reader {
};
/** Specifies the range from where to start the scan and where to end it. */
struct Scan_range {
};
/** Thread related context information. */
struct Thread_ctx {
};
/** Parallel reader context. */
class Parallel_reader::Scan_ctx {
};
class handler {
...
/**
Initializes a parallel scan. It creates a parallel_scan_ctx that has to
be used across all parallel_scan methods. Also, gets the number of
threads that would be spawned for parallel scan.
*/
virtual int parallel_scan_init(void *&scan_ctx [[maybe_unused]],
size_t *num_threads [[maybe_unused]],
bool use_reserved_threads [[maybe_unused]]) {
return 0;
}
/**
Run the parallel read of data.
*/
virtual int parallel_scan(void *scan_ctx [[maybe_unused]],
void **thread_ctxs [[maybe_unused]],
Load_init_cbk init_fn [[maybe_unused]],
Load_cbk load_fn [[maybe_unused]],
Load_end_cbk end_fn [[maybe_unused]]) {
return 0;
}
/**
End of the parallel scan.
*/
virtual void parallel_scan_end(void *scan_ctx [[maybe_unused]]) { return; }
...
/** Start row of the scan range. */
struct Key_reader_row {
};
/** Parallel read implementation. */
template <typename T, typename R>
class Reader {
};
2.3: Data Format
All chunks are linked. The address of the first chunk is kept in the Cu's header, and each chunk contains the address of the next one.
A chunk consists of a header and data. The header contains the metadata of the chunk; the data part is where the real data is located.
class Chunk {
public:
class Chunk_header {
private:
byte* m_data;
Chunk* m_next_chunk;
};
private:
std::mutex m_header_mutex;
Chunk_header* m_header;
};
class Cu {
public:
class Cu_header {
....
Compression_algorithm* m_comp_algo;
Local_dictionary* m_dictionary;
Chunk::Chunk_header* m_start_header;
};
private:
std::mutex m_header_mutex;
Cu_header* m_header;
Chunk* m_chunks;
};
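The chunk linkage above can be illustrated with a minimal traversal sketch. The `Chunk` struct here is a simplified stand-in for the real class (which also carries a header and data pointer); only the next-pointer walk is shown.

```cpp
#include <cstddef>

// Simplified chunk: each chunk points at the next one in the Cu.
struct Chunk {
  Chunk *m_next_chunk{nullptr};
};

// Walk the chunk list starting from the first chunk recorded in the
// Cu's header, following each chunk's next pointer.
size_t count_chunks(const Chunk *first) {
  size_t n = 0;
  for (const Chunk *c = first; c != nullptr; c = c->m_next_chunk) ++n;
  return n;
}
```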
The first Cu is obtained from the IMCS. An IMCS instance has a header, which holds a pointer to the address of the IMCU.
When new data comes in, it is stored in order; insertion sort keeps it ordered, and binary search is used to find data.
If the data is stored in compressed format, however, a new algorithm is needed to locate it within the compressed data.
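The ordered-insert-plus-binary-search scheme for uncompressed data can be sketched as below. Real Cu data is typed and may be compressed; plain `uint64_t` values stand in for records here purely for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

class OrderedChunk {
 public:
  // Insertion keeps the vector sorted (insertion sort on arrival).
  void insert(uint64_t v) {
    auto pos = std::lower_bound(m_data.begin(), m_data.end(), v);
    m_data.insert(pos, v);
  }

  // Lookups use binary search over the ordered data.
  bool contains(uint64_t v) const {
    return std::binary_search(m_data.begin(), m_data.end(), v);
  }

 private:
  std::vector<uint64_t> m_data;
};
```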
Now we go deeper and give more specific details of the data layout. Note that every datum written into a CU carries a transaction id marking which transaction it belongs to.
Let's discuss these data types in detail.
1: Fix Length Data Type:
A record is divided into 3 parts. (1) infos: describes the record, such as whether it is deleted, whether it is null, its total length, etc.
For example: the 7th bit is the delete flag, the 6th bit is the null flag, and the remaining bits hold the total length of the data.
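The bit layout above can be sketched with a few helpers. The constant names (`INFO_DELETE_FLAG`, etc.) are hypothetical, chosen only for this illustration.

```cpp
#include <cstdint>

// One-byte infos field: bit 7 = delete flag, bit 6 = null flag,
// low 6 bits = total length of the data.
constexpr uint8_t INFO_DELETE_FLAG = 1u << 7;
constexpr uint8_t INFO_NULL_FLAG = 1u << 6;
constexpr uint8_t INFO_LEN_MASK = 0x3F;  // low 6 bits

inline uint8_t make_info(bool deleted, bool is_null, uint8_t len) {
  uint8_t info = len & INFO_LEN_MASK;
  if (deleted) info |= INFO_DELETE_FLAG;
  if (is_null) info |= INFO_NULL_FLAG;
  return info;
}

inline bool info_deleted(uint8_t info) { return info & INFO_DELETE_FLAG; }
inline bool info_null(uint8_t info) { return info & INFO_NULL_FLAG; }
inline uint8_t info_length(uint8_t info) { return info & INFO_LEN_MASK; }
```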
2: Var Length Data Type:
For a variable-length data type, a field indicating the length of the data is needed. One issue deserving attention here is the storage cost: if a record needs a lot of extra info to describe itself but the data itself is small, the effective-data ratio becomes too small.
The string data type is another story. It has several attributes describing it: (1) length; (2) charset information. To save space, dictionary compression is incorporated.
There are also three parts: (1) infos (2~4 bytes); (2) trx id (8 bytes); (3) data (variable). If the data length is no more than 65535, infos uses 2 bytes; otherwise 4 bytes are used.
(The payload ratio of this layout does not seem efficient.)
The info bit, trx id, rowid, and smu_ptr are stored as independent columns. Format 1 has been implemented; format 2 will be implemented in the next version.
2.4: Local Dictionary
ref to: data compression/dictionary compression
2.5: Related Code
The load operation, the semantics of load in a secondary engine, loads data from the primary engine to the secondary engine.
When the load statement (an alter statement) is executed, mysql_execute_command runs; inside it, lex->m_sql_cmd is Sql_cmd_secondary_load_unload, so Sql_cmd_secondary_load_unload::execute is invoked.
Finally, execution reaches ha_rapid::load_table, which scans the target table to read the corresponding data.
class ha_rapid : public handler {
public:
ha_rapid(handlerton *hton, TABLE_SHARE *table_share);
private:
int create(const char *, TABLE *, HA_CREATE_INFO *, dd::Table *) override {
return HA_ERR_WRONG_COMMAND;
}
int open(const char *name, int mode, unsigned int test_if_locked,
const dd::Table *table_def) override;
int close() override { return 0; }
int rnd_init(bool) override { return 0; }
int rnd_next(unsigned char *) override { return HA_ERR_END_OF_FILE; }
int rnd_pos(unsigned char *, unsigned char *) override {
return HA_ERR_WRONG_COMMAND;
}
int info(unsigned int) override;
ha_rows records_in_range(unsigned int index, key_range *min_key,
key_range *max_key) override;
void position(const unsigned char *) override {}
unsigned long index_flags(unsigned int, unsigned int, bool) const override;
THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to,
thr_lock_type lock_type) override;
Table_flags table_flags() const override;
const char *table_type() const override { return "MOCK"; }
int load_table(const TABLE &table) override;
int unload_table(const char *db_name, const char *table_name,
bool error_if_not_loaded) override;
THR_LOCK_DATA m_lock;
};
2.6: Meta Data Info
When a table is loaded into rapid, information about the loaded table is added to the catalog tables, performance_schema.rpd_xxxx. Taking performance_schema.rpd_column_id as an example, it contains the following parts: (1) Id; (2) Table ID; (3) Column Name.
3: Unload Data
The unload operation releases the loaded data from rapid; it is the opposite of the load operation. After the data is cleaned up, the meta information in performance_schema.rpd_xxx is also cleared.
We do not discuss this part in further detail here.
Related Issue: #20
ShannonBase is an AI/ML empowered open source MySQL HTAP database, which utilizes AI/ML to enhance its own capabilities.
With these features, ShannonBase will have autonomous tuning, workload prediction, automatic index choice, index recommendation, selectivity estimation, etc. (ref: a PhD dissertation from UC Berkeley by Zongheng Yang, and MySQL Heatwave).
In order to support analytical processing in MySQL, ShannonBase incorporates a secondary engine, an in-memory column engine that processes analytical workloads. The secondary engine is a new feature introduced in MySQL 8.x; it provides an interface to support multi-model, heterogeneous databases. Data is synchronized from the primary engine (InnoDB) to the secondary engine (Rapid).
Our in-memory column store is called Rapid.
It will be based on MySQL 8.1. It aims to achieve at least x times higher query performance on TPC-H than that of xxx, with a more concise architecture and a query optimizer that can intelligently offload query workloads to corresponding storage engines.
The design philosophy of ShannonBase Rapid is modularity and performance-cost balance. The following outlines the new features that will be implemented in ShannonBase. To learn details about each feature, see the relevant chapter.
ShannonBase Rapid will remain an open source project, a counterpart of the closed-source service, MySQL Heatwave.
At first, an in-memory column store (IMCS) will be used. Secondly, a cost-based query engine will be developed to automatically offload transactional and analytics workloads. Thirdly, ShannonBase Rapid will provide a vectorized execution engine and support massive parallel processing. In this way, the execution performance of ShannonBase Rapid will be at least xxx times as that of xxx.
ShannonBase will load the data into memory from InnoDB to Rapid, just the same as MySQL Heatwave does.
"MySQL Analytics is an in-memory processing engine; data is only persisted in the MySQL InnoDB storage engine."
This sentence functions as the basic rule and guideline for us when implementing ShannonBase Rapid. This design document introduces the main changes that will be made and gives you an overview of the architecture of ShannonBase.
The main design goals of ShannonBase will include:
ShannonBase is an integrated HTAP database that adopts hybrid row-column store and in-memory computing. It is fully compatible with MySQL version 8.1.
The architecture overview of ShannonBase
MySQL 8.0 provides the secondary engine mechanism, which intelligently routes TP workloads to the primary engine (InnoDB) and AP workloads to the secondary engine (Rapid), based on the workload type.
After all new SQL syntaxes are enabled, the server will understand all the SQL statements. When the server receives an SQL statement, the SQL string will create some SQL classes, such as PT_create_table_stmt after lexical processing and grammatical processing. We will not discuss how distributed query plans in MPP are generated in this document. Instead, we focus on ONE NODE and try to explain what happens in ONE node when processing an SQL statement.
In MySQL 8.0, when the cost of a query plan on the primary engine is greater than the threshold defined by the new system variable (secondary_engine_cost_threshold), the query optimization engine will offload this workload to the secondary engine, ensuring optimal processing efficiency.
At the last phase of query optimization, the query engine calls optimize_secondary_engine to determine which engine the workload is routed to for execution, by performing the following three steps:
Use the original processing way: unit->optimize()
.
Estimate the cost spent by each engine to process the query: current_query_cost
and accumulate_current_query_cost
.
If current_query_cost
is greater than secondary_engine_cost_threshold
, forward the workload to optimize_secondary_engine.
if (current_query_cost < variables.secondary_engine_cost_threshold)
return false;
optimize_secondary_engine;
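The decision in the snippet above reduces to a single threshold comparison. As a hedged sketch (the real code lives inside the optimizer and consults the THD's system variables), it can be written as:

```cpp
// Offload decision: the workload stays on the primary engine unless its
// estimated cost reaches the secondary_engine_cost_threshold variable.
bool should_offload_to_secondary(double current_query_cost,
                                 double secondary_engine_cost_threshold) {
  // Below the threshold the primary (InnoDB) plan is kept as-is;
  // at or above it, optimize_secondary_engine takes over.
  return current_query_cost >= secondary_engine_cost_threshold;
}
```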
In future, after ShannonBase achieves MPP, the way that ShannonBase processes SQL statements will be different from centralized systems. A distributed query plan will be generated after query optimization is complete.
As for the execution engine, a vectorized execution engine will be incorporated in ShannonBase. ShannonBase will support parallel query and vectorized execution. A column-based AP system is a natural fit for a vectorized execution engine, which has become a standard feature for improving the performance of AP workloads. RDBMS systems such as ClickHouse also use vectorized execution engines.
Two ways are available to achieve vectorized execution:
Some aggregation functions such as count(), sum(), and avg() can be executed in parallel mode. After a query plan is dispatched on a data node through the management node, the execution engine executes this query plan in parallel: the job is divided into sub-jobs that are executed simultaneously by threads. The framework of parallel execution is discussed in issue #xxxx. You can refer to MySQL NDB Cluster.
MySQL Cluster has a unique parallel query engine. It gives a consistent consolidated transactional view of the entire distributed partitioned dataset. This makes designing and programming scaleable distributed applications straightforward and extremely simple. The developer can completely focus on the application logic and doesn’t need to think about data distribution.
Ref #2
The Rapid engine is an in-memory column engine; memory is volatile storage, so the data in rapid is lost when the engine crashes. Crash safety is therefore a key feature for any database; without it, the engine cannot be used in production. Crash safety also brings fast-recovery ability, because all the data in rapid has been flushed to disk in compressed format (the compressed format reduces disk usage). There are several scenarios:
1: Normal Shutdown/Restart
When the user runs the service mysql stop command to stop a mysql instance, the processing workflow enters the shutdown stage.
1.1: Shutdown
1.2 Checkpoint
2: Crash/Restart
To support indexes in the rapid engine: although Rapid is a memory engine, the system further enhances point-query performance by adding indexes.
To pass all mtr cases. If any case fails, we will fix it as soon as possible.
So far, we have passed the main suite.
PS:
If a case takes too much effort to fix, it is put into the disabled-case list temporarily and fixed in the future.
disabled cases:
#shannonbase disabled: we don't support the myisam engine and don't test it anymore
main.mysqlpump_bugs : WL#123456 shannonbase disabled it needs perl 5.34.0
main.almost_full : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
main.mysql_load_data_local_dir : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
main.mysqld_safe : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
main.shutdown_report : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.lock_impl_to_expl : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.log_5_7_9_case1 : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.log_5_7_9_case4 : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.log_5_7_9_case5 : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.log_5_7_9_case6 : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.log_first_rec_group : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.partition_upgrade_5727_lin_lctn_0_debug : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.partition_upgrade_5727_lin_lctn_1_debug : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.upgrade_compressed_recv : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
innodb.upgrade_geometry_56 : WL#123456 2024-01-05 shannonbase disabled myisam; will be fixed
A new type of engine is employed, which provides a new optimization tactic for the MySQL optimizer. In the common case, the MySQL optimizer delivers AP workloads to the rapid engine due to its low IO cost and fast execution.
To achieve that, the MySQL optimizer must know the cost of a workload and its type. If a workload is classified as a TP workload, it is executed in the primary engine (TP engine, InnoDB); an AP workload is executed in the secondary engine (AP engine, Rapid).
First of all, the workload type should be tagged correctly: which one counts as a TP workload, and which one as an AP workload. After that, each workload can be offloaded to the corresponding engine.
As we know, analytical workloads run on the AP engine. ShannonBase uses a one-system-two-copies architecture: one copy of the data lives in the transactional processing engine (the primary engine) and another copy serves analytical workloads (the secondary engine). Consistency between these two copies is therefore the most important thing. We usually use data freshness to describe the difference between the data in the primary engine and the secondary engine.
The TP engine runs DMLs that change data in the primary engine; the question in practice is how to notify the secondary engine and send it the changes so that it keeps track of the latest data in the primary engine.
In this part, we explore how the primary propagates changes to the secondary: what mechanism we use, and why we chose this way to implement change propagation between the primary engine and the secondary engine.
One of the most important issues of HTAP is data synchronization between the transactional processing engine and the analytical processing engine; without synchronization, it could not be called an HTAP database. Consider the scenario below:
A user buys some items, and the system inserts records into a transaction table; now the manager wants real-time analytical reports on sales, stock, etc. The traditional solution is to run ETL jobs, moving the data from a TP database to an AP database and doing the analytical processing there. With HTAP, we can do all of this in one database instance.
There are two architectures to implement HTAP, or more specifically, the change-propagation method: the loosely-coupled architecture and the integrated-coupled architecture. ShannonBase uses the integrated-coupled architecture. In MySQL, we can use the binlog or the redo log to synchronize changes. Before discussing our solution, we first give some explanation of the redo log and the binlog.
The redo log explanation given by MySQL is as follows:
The redo log is a disk-based data structure used during crash recovery to correct data written by incomplete transactions. During normal operations, the redo log encodes requests to change table data that result from SQL statements or low-level API calls. Modifications that did not finish updating data files before an unexpected shutdown are replayed automatically during initialization and before connections are accepted. For information about the role of the redo log in crash recovery, see Section 15.18.2, “InnoDB Recovery”.
The redo log is physically represented on disk by redo log files. Data that is written to redo log files is encoded in terms of records affected, and this data is collectively referred to as redo. The passage of data through redo log files is represented by an ever-increasing LSN value. Redo log data is appended as data modifications occur, and the oldest data is truncated as the checkpoint progresses.
From the description above, we know that the redo log describes the changes made by SQL statements or API calls. Therefore, parsing the redo logs can be a way to propagate the changes from InnoDB to the rapid engine, and we can reuse some functions from the recovery module. An LSN, rapid_lsn, describes the point up to which the rapid engine has replayed; it is persisted into the system tablespace.
The binary log contains “events” that describe database changes such as table creation operations or changes to table data. It also contains events for statements that potentially could have made changes (for example, a DELETE which matched no rows), unless row-based logging is used. The binary log also contains information about how long each statement took that updated data. The binary log has two important purposes:
For replication, the binary log on a replication source server provides a record of the data changes to be sent to replicas. The source sends the information contained in its binary log to its replicas, which reproduce those transactions to make the same data changes that were made on the source. See Section 17.2, “Replication Implementation”.
Certain data recovery operations require use of the binary log. After a backup has been restored, the events in the binary log that were recorded after the backup was made are re-executed. These events bring databases up to date from the point of the backup. See Section 7.5, “Point-in-Time (Incremental) Recovery”.
Another way to propagate the changes from InnoDB to the rapid engine is to use the binlog. From our point of view, however, the binlog has drawbacks as a change-propagation tool: it is a logical log, and the process involves writing the binlog and the relay log, then parsing the binlog and executing the statements parsed from it. That is a long call stack and not an efficient path.
Hence, under heavy workloads it may fail to propagate the changes in time.
From the brief discussion above, we draw the conclusion that the redo log is probably the best choice for propagating changes from InnoDB to rapid.
4.1 Implementation
4.1.1 Basic concept
/** Redo log - single data structure with state of the redo log system.
In future, one could consider splitting this to multiple data structures. */
struct alignas(ut::INNODB_CACHE_LINE_SIZE) log_t {
...
/** The recent written buffer.
Protected by: locking sn not to add. */
alignas(ut::INNODB_CACHE_LINE_SIZE) Link_buf<lsn_t> recent_written;
/** Used for pausing the log writer threads.
When paused, each user thread should write log as in the former version. */
std::atomic_bool writer_threads_paused;
/** Some threads waiting for the ready for write lsn by closer_event. */
lsn_t current_ready_waiting_lsn;
/** current_ready_waiting_lsn is waited using this sig_count. */
int64_t current_ready_waiting_sig_count;
/** The recent closed buffer.
Protected by: locking sn not to add. */
alignas(ut::INNODB_CACHE_LINE_SIZE) Link_buf<lsn_t> recent_closed;
...
/** Maximum sn up to which there is free space in both the log buffer
and the log files. This is limitation for the end of any write to the
log buffer. Threads, which are limited need to wait, and possibly they
hold latches of dirty pages making a deadlock possible.
Protected by: writer_mutex (writes). */
alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_sn_t buf_limit_sn;
/** Up to this lsn, data has been written to disk (fsync not required).
Protected by: writer_mutex (writes). */
alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t write_lsn;
...
alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t *flush_events;
/** Number of entries in the array with events. */
size_t flush_events_size;
/** This event is in the reset state when a flush is running;
a thread should wait for this without owning any of redo mutexes,
but NOTE that to reset this event, the thread MUST own the writer_mutex */
os_event_t old_flush_event;
/** Up to this lsn data has been flushed to disk (fsynced). */
alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t flushed_to_disk_lsn;
...
alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t rapid_lsn;
alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t *rapid_events;
...
In log_t, we add alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t rapid_lsn; to represent the point up to which data has been propagated to rapid, and initialize it in log_sys_create() and log_start().
the diagram of writing redo log
For more information, refer to the redo log structure.
The log record structure is as follows:
Type + Space ID + Page Number + Body
We only consider the following redo log types, which relate to DML operations.
We then reuse the recovery logic to parse the redo log, get the data content of the SQL statement, and apply the data to the rapid engine.
recv_parse_log_recs(), recv_single_rec, recv_multi_rec, etc. are used to parse the redo log at recovery stage. We can reuse this code to implement our change-population logic.
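The "Type + Space ID + Page Number + Body" layout can be illustrated with a minimal parse sketch. Note the assumptions: real InnoDB records use compressed (variable-width) integer encodings, while this sketch assumes fixed 1-, 4-, and 4-byte header fields purely to show the parse-and-dispatch shape.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified redo record: Type + Space ID + Page Number + Body.
struct RedoRecord {
  uint8_t type;
  uint32_t space_id;
  uint32_t page_no;
  const uint8_t *body;
};

// Returns bytes consumed by the header, or 0 if the buffer is too short.
size_t parse_record(const uint8_t *p, const uint8_t *end, RedoRecord *rec) {
  if (end - p < 9) return 0;  // 1 + 4 + 4 byte header (assumed widths)
  rec->type = p[0];
  std::memcpy(&rec->space_id, p + 1, 4);
  std::memcpy(&rec->page_no, p + 5, 4);
  rec->body = p + 9;  // record body follows the header
  return 9;
}
```

A population loop would call this repeatedly over the buffered redo bytes and dispatch on `rec.type`.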
4.1.2 Background thread
A new background worker is launched to do the real change-propagation job. When a new mlog record is generated, this worker is notified to parse the mlog and apply the changes to the rapid engine.
4.1.3 Persistence of rapid_lsn
rapid_lsn will be written to the trx_sys page of the system tablespace.
4.2 Implementation details
When the secondary_load command is executed, a new thread, called the population thread, is launched immediately:
void Populator::start_change_populate_threads(log_t* log) {
  Populator::log_rapid_thread =
      os_thread_create(rapid_populate_thread_key, 0, parse_log_func, log);
  ShannonBase::Populate::pop_started = true;
  Populator::log_rapid_thread.start();
}
The thread function is defined as below:
static void parse_log_func (log_t *log_ptr) {
current_thd = (current_thd == nullptr) ? new THD(false) : current_thd;
THR_MALLOC = (THR_MALLOC == nullptr) ? &current_thd->mem_root : THR_MALLOC;
os_event_reset(log_ptr->rapid_events[0]);
// here we have a notifier: start pop when checkpoint_lsn/flushed_lsn > rapid_lsn
while (pop_started.load(std::memory_order_seq_cst)) {
auto stop_condition = [&](bool wait) {
if (population_buffer->readAvailable()) {
return true;
}
if (wait) { // do something while waiting
}
return false;
};
os_event_wait_for(log_ptr->rapid_events[0], MAX_LOG_POP_SPIN_COUNT,
std::chrono::microseconds{100}, stop_condition);
byte* from_ptr = population_buffer->peek();
byte* end_ptr = from_ptr + population_buffer->readAvailable();
uint parsed_bytes = parse_log.parse_redo(from_ptr, end_ptr);
population_buffer->remove(parsed_bytes);
} // while (pop_started)
pop_started.store(false, std::memory_order_seq_cst);
THR_MALLOC = nullptr;
if (current_thd) {
delete current_thd;
current_thd = nullptr;
}
}
The thread waits for the event to be signaled. In the log_buffer_write function, when a new redo log record is written into the redo log buffer, a copy is also written into the population buffer. After the write finishes, the population thread is notified to apply the changes.
lsn_t log_buffer_write(log_t &log, const byte *str, size_t str_len,
lsn_t start_lsn) {
...
log_sync_point("log_buffer_write_before_memcpy");
/* This is the critical memcpy operation, which copies data
from internal mtr's buffer to the shared log buffer. */
std::memcpy(ptr, str, len);
auto type = mlog_id_t(*ptr & ~MLOG_SINGLE_REC_FLAG);
if (ShannonBase::Populate::pop_started &&
ShannonBase::Populate::population_buffer && (
type == MLOG_REC_INSERT )) {
ShannonBase::Populate::population_buffer->writeBuff(str, len);
os_event_set(log.rapid_events[0]);
}
...
Note: rapid population buffer is a lock-free ring buffer.
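The writeBuff/readAvailable/peek/remove interface used in the code above can be sketched with a minimal single-producer/single-consumer buffer. This is only an illustration of the idea: it uses a linear buffer with atomic indices and deliberately omits wrap-around handling (which a real ring buffer needs so that peek can return a contiguous region).

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

class RingBuffer {
 public:
  explicit RingBuffer(size_t cap) : m_buf(cap) {}

  // Producer side (the log writer): append bytes if space remains.
  bool writeBuff(const uint8_t *src, size_t len) {
    size_t w = m_write.load(std::memory_order_relaxed);
    if (w + len > m_buf.size()) return false;  // illustration: no wrap
    std::memcpy(m_buf.data() + w, src, len);
    m_write.store(w + len, std::memory_order_release);  // publish bytes
    return true;
  }

  // Consumer side (the population thread).
  size_t readAvailable() const {
    return m_write.load(std::memory_order_acquire) -
           m_read.load(std::memory_order_relaxed);
  }
  const uint8_t *peek() const {
    return m_buf.data() + m_read.load(std::memory_order_relaxed);
  }
  void remove(size_t n) { m_read.fetch_add(n, std::memory_order_release); }

 private:
  std::vector<uint8_t> m_buf;
  std::atomic<size_t> m_read{0}, m_write{0};
};
```

With one producer and one consumer, the release/acquire pair on `m_write` is what makes the copied bytes visible to the population thread without a lock.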
Summary
Adds nightly and weekly git actions.
The Rapid engine is an in-memory AP engine; all its data is stored in memory, which is volatile. If ShannonBase hits a bug and crashes, the rapid engine loses all its data. At recovery stage, ShannonBase reloads the data into rapid and makes sure the instance returns to the correct state it had before the crash. Suppose rapid holds more than 500GB of data: at recovery stage we would load those 500GB from the primary engine (InnoDB) into the Rapid engine, an IO-intensive task that takes a long time, and the loading phase also converts the data from row format to column format.
Hence, we want rapid to be able to load the data directly, without format conversion. That saves both time and IO.
ShannonBase performs a flush operation periodically, just like a checkpoint in InnoDB. The data in rapid is persisted to disk in columnar, compressed format, which takes less disk space and IO.
The on-disk data file format is designed to optimize IO and disk space.
Summary
ref to : EECS-2022-194 Machine Learning for Query Optimization
.
Summary
To ship to the ARM and RISC-V platforms, and perform verification.
ver0.2.0-beta
session 1:
alter table tt secondary_load;
session2 :
insert into tt values(xxxx); batch insertion.
Then, in session 1, execute a select statement: a deadlock occurs.
A deadlock can occur when changes are being propagated into rapid while queries are executing on rapid.
1: session1:
CALL insert_into_tt1(); // trying to insert 1000 rows into a table
2: session2:
select count(1) from tt1;
Summary
For adding more mtr cases.
Ref: Oracle