mariadb / webscalesql-5.6
This project forked from facebookarchive/webscalesql-5.6
Web Scale SQL, Version 5.6, based upon the MySQL-5.6 community releases.
License: GNU General Public License v2.0
This test case:
CREATE TABLE t1 (
i INT,
t TINYINT,
s SMALLINT,
m MEDIUMINT,
b BIGINT,
pk MEDIUMINT AUTO_INCREMENT PRIMARY KEY,
UNIQUE KEY b_t (b,t)
) ENGINE=rocksdb;
INSERT INTO t1 (i,t,s,m,b) VALUES (1,2,3,4,5),(1000,100,10000,1000000,1000000000000000000),(5,100,10000,1000000,100000000000000000),(2,3,4,5,6),(3,4,5,6,7),(101,102,103,104,105),(10001,103,10002,10003,10004),(10,11,12,13,14),(11,12,13,14,15),(12,13,14,15,16);
SELECT b+t FROM t1 WHERE (b,t) IN ( SELECT b, t FROM t1 WHERE i>1 ) ORDER BY b+t;
crashes like this:
#2 0x000000000073cd10 in handle_fatal_signal (sig=11) at /home/psergey/dev-git/webscalesql-5.6/sql/signal_handler.cc:248
#3 <signal handler called>
#4 __memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:135
#5 0x0000000000abc9c4 in RDBSE_KEYDEF::get_primary_key_tuple (this=0x7facc0058630, pk_descr=0x7facc005b470, key=0x7facc0056050 "", pk_buffer=0x7facc0076b00 "") at /home/psergey/dev-git/webscalesql-5.6/storage/rocksdb/rdb_datadic.cc:156
#6 0x0000000000ab75f1 in ha_rocksdb::index_read_map (this=0x7facc00533e0, buf=0x7facc00536d0 "\301\f", key=0x7facc0055e00 "", keypart_map=3, find_flag=HA_READ_KEY_EXACT) at /home/psergey/dev-git/webscalesql-5.6/storage/rocksdb/ha_rocksdb.cc:1452
#7 0x000000000065886f in handler::ha_index_read_map (this=0x7facc00533e0, buf=0x7facc00536d0 "\301\f", key=0x7facc0055e00 "", keypart_map=3, find_flag=HA_READ_KEY_EXACT) at /home/psergey/dev-git/webscalesql-5.6/sql/handler.cc:2767
#8 0x00000000007ba96d in join_read_always_key (tab=0x7facc0084f28) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:2206
#9 0x00000000007b8875 in sub_select (join=0x7facc0058dc0, join_tab=0x7facc0084f28, end_of_records=false) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:1256
#10 0x00000000007b9388 in evaluate_join_record (join=0x7facc0058dc0, join_tab=0x7facc0084c28) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:1601
#11 0x00000000007b8990 in sub_select (join=0x7facc0058dc0, join_tab=0x7facc0084c28, end_of_records=false) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:1276
#12 0x00000000007b828b in do_select (join=0x7facc0058dc0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:933
#13 0x00000000007b6254 in JOIN::exec (this=0x7facc0058dc0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:194
#14 0x0000000000813f60 in mysql_execute_select (thd=0x3946320, select_lex=0x39486c0, free_join=true) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_select.cc:1100
#15 0x0000000000814244 in mysql_select (thd=0x3946320, tables=0x7facc00050b8, wild_num=0, fields=..., conds=0x7facc0058818, order=0x3948888, group=0x39487c0, having=0x0, select_options=2147748608, result=0x7facc0058d98, unit=0x3948078, select_lex=0x39486c0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_select.cc:1221
#16 0x0000000000812146 in handle_select (thd=0x3946320, result=0x7facc0058d98, setup_tables_done_option=0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_select.cc:110
#17 0x00000000007ec7ae in execute_sqlcom_select (thd=0x3946320, all_tables=0x7facc00050b8) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:5124
#18 0x00000000007e5548 in mysql_execute_command (thd=0x3946320) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:2660
#19 0x00000000007eede3 in mysql_parse (thd=0x3946320, rawbuf=0x7facc0004c50 "SELECT b+t FROM t1 WHERE (b,t) IN ( SELECT b, t FROM t1 WHERE i>1 ) ORDER BY b+t", length=80, parser_state=0x7face0ab3dd0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:6266
#20 0x00000000007e26eb in dispatch_command (command=COM_QUERY, thd=0x3946320, packet=0x3959f01 "", packet_length=80) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:1336
#21 0x00000000007e18f8 in do_command (thd=0x3946320) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:1038
...
I suspect this happens because of a mismatch between the SQL layer and RocksDB-SE about whether secondary unique keys include primary key columns.
RocksDB-SE keys always include the primary key columns (i.e. they are extended keys).
However, the SQL layer assumes that unique secondary indexes do not include the PK columns.
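To illustrate the suspected failure mode, here is a minimal sketch (the function names are illustrative, not the real rdb_datadic.cc code) of the extended-key layout and of PK extraction from the key tail. If the SQL layer builds a lookup key without the PK suffix, a naive extraction reads past the buffer, which is effectively what the memcpy in the backtrace does:

```python
# Hypothetical sketch of the extended-key layout assumed by RocksDB-SE.
INDEX_ID_LEN = 4

def pack_sec_key(index_id, sec_cols, pk_cols):
    """Secondary keys always carry the PK columns as a suffix ("extended keys")."""
    return (index_id.to_bytes(INDEX_ID_LEN, "big")
            + b"".join(sec_cols) + b"".join(pk_cols))

def get_primary_key_tuple(key, sec_len, pk_len):
    """Extract the PK suffix; raise instead of reading past the buffer."""
    start = INDEX_ID_LEN + sec_len
    if len(key) < start + pk_len:
        raise ValueError("key lacks PK suffix: SQL layer / SE mismatch")
    return key[start:start + pk_len]

# Full extended key: extraction works.
full = pack_sec_key(7, [b"\x00" * 8, b"\x01"], [b"\x02\x02\x02"])
assert get_primary_key_tuple(full, 9, 3) == b"\x02\x02\x02"

# Key built without the PK suffix (what the SQL layer sends for a unique
# secondary index): naive extraction would overrun.
short = pack_sec_key(7, [b"\x00" * 8, b"\x01"], [])
try:
    get_primary_key_tuple(short, 9, 3)
    crashed = False
except ValueError:
    crashed = True
assert crashed
```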
Try this:
create table t1 (a int primary key auto_increment) engine=rocksdb;
insert into t1 values(NULL);
insert into t1 values(NULL);
mysqladmin shutdown
Restart the server, and:
MySQL [test]> insert into t1 values(NULL);
ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'
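The ER_DUP_ENTRY above happens because the auto-increment counter is not recovered on restart. A minimal sketch of the likely fix (assumed behavior, not the real ha_rocksdb code): on table open, seek to the largest stored PK and resume the counter from there. A sorted list stands in for the RocksDB key space:

```python
# Sketch: recover the auto_increment counter on table open by looking at
# the largest stored PK, instead of restarting from 1 after a restart.
import bisect

class FakeTable:
    def __init__(self):
        self.keys = []          # sorted PK values; stands in for RocksDB
        self.next_auto_inc = 1

    def open(self):
        # What the engine should do on startup: seek to the last PK.
        self.next_auto_inc = (self.keys[-1] + 1) if self.keys else 1

    def insert(self, pk=None):
        if pk is None:
            pk = self.next_auto_inc
        if pk in self.keys:
            raise ValueError(f"Duplicate entry '{pk}' for key 'PRIMARY'")
        bisect.insort(self.keys, pk)
        self.next_auto_inc = max(self.next_auto_inc, pk + 1)

t = FakeTable()
t.insert(); t.insert()          # two rows: PKs 1 and 2

# Simulate restart: the in-memory counter is lost unless open() recovers it.
t.next_auto_inc = 1
t.open()
t.insert()                      # no ER_DUP_ENTRY; inserts PK 3
assert t.keys == [1, 2, 3]
```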
This should avoid doing a table copy when adding an index. See http://dev.mysql.com/doc/refman/5.5/en/innodb-create-index.html
We need an online physical backup tool, similar to xtrabackup for InnoDB. It should meet at least the following features/conditions.
Nice to have:
As noticed by Yoshinori:
create table r1 (id int auto_increment primary key, value int) engine=rocksdb;
begin;
insert into r1 values (1, 1);
insert into r1 values (2, 2);
insert into r1 values (3, 3);
...
insert into r1 values (99999, 99999);
rollback;
After the rollback, a query on the table still returned 99099 rows.
Currently, ha_rocksdb::write_row() updates all indexes, even those with no changed columns. We should update only the indexes whose values actually changed.
A question: are there any rocksdb status counters that could be used to observe this?
How do we do DROP TABLE and DROP INDEX? Using a column family per index makes this easy: just drop the CF. But I assume we must support N tables or indexes per column family, so we need a compaction filter to eventually remove deleted rows, along with an option to trigger manual compaction after dropping a large table or index to reclaim disk space faster.

Assuming each RocksDB key has the table/index ID as its prefix, do we track IDs for valid objects, or IDs for dropped objects whose key-value pairs must still be removed? If we track valid IDs, then compaction filters must always do a bloom-filter check or hash-table lookup to confirm that the IDs in the merged rows are valid, and that can be a big CPU overhead. If we track IDs for dropped objects, then we have a new state management problem: this state must be persisted to survive a mysqld restart, and we must also figure out when an ID can be removed from the set. But the benefit is that we only need the compaction filter when that set of IDs is non-empty. As an optimization, we can avoid the compaction filter checks for merges between L0 and L1, because those must be as fast as possible.
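The "track dropped IDs" option can be sketched as follows. This is an illustrative model, not the real RocksDB API (which is CompactionFilter::Filter() in C++); names like DropFilter and compact() are invented for the sketch:

```python
# Sketch: a compaction filter that removes key-value pairs whose index-ID
# prefix belongs to a dropped table/index.
INDEX_ID_LEN = 4

class DropFilter:
    def __init__(self, dropped_ids):
        self.dropped_ids = set(dropped_ids)   # must be persisted across restart

    def needed(self):
        # Only install the filter when there is something to remove.
        return bool(self.dropped_ids)

    def filter(self, key):
        """Return True if the key should be removed during compaction."""
        index_id = int.from_bytes(key[:INDEX_ID_LEN], "big")
        return index_id in self.dropped_ids

def compact(kvs, flt):
    return {k: v for k, v in kvs.items() if not flt.filter(k)}

def key(index_id, suffix):
    return index_id.to_bytes(INDEX_ID_LEN, "big") + suffix

kvs = {key(1, b"a"): b"row1", key(2, b"a"): b"row2", key(2, b"b"): b"row3"}
flt = DropFilter({2})                 # index 2 was dropped
out = compact(kvs, flt)
assert list(out.values()) == [b"row1"]
```

Note the benefit named above falls out of needed(): when no objects have been dropped, the filter is not installed and compaction pays no per-key cost.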
For background see http://dev.mysql.com/doc/refman/5.6/en/show-engine.html
We need to change what is in SHOW ENGINE ROCKSDB STATUS. Right now it has live sst files which is a huge list with leveled compaction. For now I prefer to have it list the output from compaction stats. That probably needs to use one of:
db->GetProperty("rocksdb.stats", ...
db->GetProperty("rocksdb.cfstats", ...
*************************** 1. row ***************************
Type: ROCKSDB
Name: live_files
Status: cf=default name=/4908814.sst size=97853952
cf=default name=/4908812.sst size=97879865
cf=default name=/4908807.sst size=97833748
cf=default name=/4905498.sst size=1865749
cf=default name=/4905500.sst size=2670668
I just noticed primary keys are stored twice in RocksDB SE, as both key and value, even though they do not need to be stored in the value.
How to repeat:
#10 column primary key
create table r1 (
id1 bigint not null,
id2 bigint not null,
id3 bigint not null,
id4 bigint not null,
id5 bigint not null,
id6 bigint not null,
id7 bigint not null,
id8 bigint not null,
id9 bigint not null,
id10 bigint not null,
id11 bigint not null,
id12 bigint not null,
primary key (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10)
) engine=rocksdb charset=latin1 collate=latin1_bin;
#1 column primary key
create table r2 (
id1 bigint not null,
id2 bigint not null,
id3 bigint not null,
id4 bigint not null,
id5 bigint not null,
id6 bigint not null,
id7 bigint not null,
id8 bigint not null,
id9 bigint not null,
id10 bigint not null,
id11 bigint not null,
id12 bigint not null,
primary key (id1)
) engine=rocksdb charset=latin1 collate=latin1_bin;
Insert 2 million rows into these tables, with values sequentially increasing from 1 to 2 million (all columns have the same value per row) -- the data can be generated like this.
for(my $i=1; $i<= 2000000; $i++) {
for(my $j=0; $j < 12; $j++) {
if($j < 11) {
print "$i,";
}else {
print "$i\n";
}
}
}
Then make compaction happen. Size in KB:

              rocksdb    innodb
id1-id10 pk:   353764    264241
id1 pk:        212984    260046

So basically long primary keys increase on-disk space.
I also checked via gdb (setting breakpoint at rocksdb::WriteBatch::Put()) and confirmed.
insert into r1 values (1,1,1,1,1,1,1,1,1,1,1,1);
=>
(gdb) p key.size_
$1 = 84
(gdb) p value.size_
$2 = 96
insert into r2 values (1,1,1,1,1,1,1,1,1,1,1,1);
=>
(gdb) p key.size_
$3 = 12
(gdb) p value.size_
$4 = 96
I expected value.size_ to be 12 and 84, respectively. Since both key fields and value fields are stored, the duplicated fields can be removed from the value.
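A minimal sketch of the proposed fix: store only the non-PK columns in the value, since all PK columns are already in the key. The encoding here is a simplified fixed-width one (8 bytes per bigint, no index-ID prefix), so the exact byte counts differ slightly from the gdb numbers above, which include key-format overhead:

```python
# Sketch: drop PK columns from the value because the key already has them.
COL_LEN = 8

def encode_row(all_cols, pk_cols):
    key = b"".join(v.to_bytes(COL_LEN, "big") for v in all_cols[:pk_cols])
    # Today: the value duplicates every column.
    value_now = b"".join(v.to_bytes(COL_LEN, "big") for v in all_cols)
    # Proposed: the value carries only columns not already in the key.
    value_new = b"".join(v.to_bytes(COL_LEN, "big") for v in all_cols[pk_cols:])
    return key, value_now, value_new

row = [1] * 12                       # insert into r1/r2 values (1,1,...,1)

k1, now1, new1 = encode_row(row, pk_cols=10)   # r1: 10-column PK
k2, now2, new2 = encode_row(row, pk_cols=1)    # r2: 1-column PK

assert (len(now1), len(now2)) == (96, 96)      # columns stored twice today
assert (len(new1), len(new2)) == (16, 88)      # after removing PK duplicates
```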
We can get an estimate for the row count ('Rows:' below is always 10,000 today) via the RocksDB property "rocksdb.estimate-num-keys".
Data_length and index_length are best defined when not using an index-organized table. For an index-organized table like InnoDB, the data_length column includes the size of the PK index (including all columns) and the index_length column has the size of all secondary indexes. We can treat RocksDB in the same way. However, we aren't using file-per-table or file-per-index for RocksDB so we need to determine whether we can get estimates on index sizes.
*************************** 1. row ***************************
Name: ai
Engine: ROCKSDB
Version: 10
Row_format: Dynamic
Rows: 1000
Avg_row_length: 0
Data_length: 0
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: NULL
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
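One way to get the per-index size estimates discussed above without file-per-index: each index owns a contiguous key range, so something like RocksDB's GetApproximateSizes() over those ranges can attribute bytes to the PK (Data_length) and to secondary indexes (Index_length). A sketch, with a dict standing in for RocksDB and an invented approximate_size() in place of the real call:

```python
# Sketch: estimate Data_length / Index_length from per-index key ranges.
INDEX_ID_LEN = 4

def key(index_id, suffix):
    return index_id.to_bytes(INDEX_ID_LEN, "big") + suffix

def approximate_size(kvs, lo, hi):
    """Stand-in for db->GetApproximateSizes(Range(lo, hi))."""
    return sum(len(k) + len(v) for k, v in kvs.items() if lo <= k < hi)

def table_sizes(kvs, pk_id, sec_ids):
    rng = lambda i: (key(i, b""), key(i + 1, b""))
    data_length = approximate_size(kvs, *rng(pk_id))
    index_length = sum(approximate_size(kvs, *rng(i)) for i in sec_ids)
    return data_length, index_length

kvs = {key(1, b"pk-a"): b"v" * 10, key(1, b"pk-b"): b"v" * 10,
       key(2, b"sec-a"): b""}
data_len, idx_len = table_sizes(kvs, pk_id=1, sec_ids=[2])
assert data_len == 2 * (8 + 10)   # two PK entries: 8-byte key + 10-byte value
assert idx_len == 9               # one secondary entry: 9-byte key, empty value
```

These are estimates (compaction and compression make exact accounting impossible without scanning), which matches how the other columns in SHOW TABLE STATUS are already approximate.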
I get this link error:
...a39a3e/bin/ld: cannot find -lsnappy
From the command line I see that librocksdb.a has a full path, but not so for snappy and bz2:
... /home/mcallaghan/rocksdb/librocksdb.a -lsnappy -lbz2
And there is magic in storage/rocksdb/CMakeLists.txt to do that for librocksdb.a, but not for the others:
FIND_LIBRARY(Rocksdb_LIBS NAMES rocksdb
PATHS ${Rocksdb_LIB_PATHS} ${Rocksdb_LIB} $ENV{ROCKSDB_LIBRARIES})
IF(EXISTS ${Rocksdb_LIBS})
GET_FILENAME_COMPONENT(LINK_DIR ${Rocksdb_LIBS} ABSOLUTE)
ELSE()
MESSAGE(SEND_ERROR "No rocksdb library directory found! Consider using environment variable ROCKSDB_LIBRARIES=...")
RETURN()
ENDIF()
...
MYSQL_ADD_PLUGIN(rocksdb_se ${ROCKSDB_SOURCES} STORAGE_ENGINE STATIC_ONLY MANDATORY LINK_LIBRARIES ${LINK_DIR} snappy bz2)
I think the workaround is to remove "snappy bz2" from the line above and build RocksDB without support for bz2 and snappy.
We currently provide read committed. We need to define and implement repeatable read. See https://github.com/MariaDB/webscalesql-5.6/wiki/Cursor-Isolation
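A minimal sketch of repeatable read on top of RocksDB-style snapshots: a transaction takes one snapshot at its first read and serves every later read from it. The MVCC store below is a stand-in for RocksDB's sequence numbers; see the Cursor-Isolation wiki page linked above for the real design discussion:

```python
# Sketch: repeatable read via a per-transaction snapshot.
class Store:
    def __init__(self):
        self.seq = 0
        self.versions = {}               # key -> [(seq, value), ...]

    def put(self, k, v):
        self.seq += 1
        self.versions.setdefault(k, []).append((self.seq, v))

    def get(self, k, snapshot=None):
        snap = self.seq if snapshot is None else snapshot
        live = [v for s, v in self.versions.get(k, []) if s <= snap]
        return live[-1] if live else None

class RepeatableReadTxn:
    def __init__(self, store):
        self.store, self.snapshot = store, None

    def get(self, k):
        if self.snapshot is None:        # GetSnapshot() on first read
            self.snapshot = self.store.seq
        return self.store.get(k, self.snapshot)

db = Store()
db.put(b"a", b"v1")
txn = RepeatableReadTxn(db)
assert txn.get(b"a") == b"v1"
db.put(b"a", b"v2")                      # concurrent write
assert txn.get(b"a") == b"v1"            # repeatable read: still the old value
assert db.get(b"a") == b"v2"             # read committed would see the new one
```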
This is a placeholder for now.
Need to make a copy of mysql-test/suite/leveldb for rocksdb and run it with rocksdb.
SHOW TABLE STATUS shows Auto_increment=0. This is because ha_rocksdb::info(HA_STATUS_AUTO) doesn't do anything.
This includes per-column-family options and configuration options.
I assume we will put many indexes in one column family, as a column family per index doesn't scale to a large number of indexes. When that is done we need a method for specifying the index columns that will be used in the column family. This assumes that we can use bloom filters on a prefix of the index key and that the format of an index key within RocksDB is:
<index ID - hidden column> <index column 1> <index column 2> ...
In that case the options are:
Note that
This requires much more discussion.
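The prefix-bloom idea above can be sketched like this: keys are <index ID> <index column 1> <index column 2> ..., and a per-CF option names how many leading columns form the bloom-filter prefix (in real RocksDB this is the prefix_extractor option; the set-based "bloom filter" below is a simplification with no false positives):

```python
# Sketch: bloom filter on a prefix of the index key.
INDEX_ID_LEN = 4
COL_LEN = 8

def make_key(index_id, cols):
    return (index_id.to_bytes(INDEX_ID_LEN, "big")
            + b"".join(c.to_bytes(COL_LEN, "big") for c in cols))

def prefix_of(key, prefix_cols):
    """The prefix extractor: index ID plus the first prefix_cols columns."""
    return key[:INDEX_ID_LEN + prefix_cols * COL_LEN]

class PrefixBloom:
    def __init__(self, prefix_cols):
        self.prefix_cols, self.prefixes = prefix_cols, set()

    def add(self, key):
        self.prefixes.add(prefix_of(key, self.prefix_cols))

    def may_contain(self, key):
        return prefix_of(key, self.prefix_cols) in self.prefixes

bloom = PrefixBloom(prefix_cols=1)       # per-CF option: 1 index column in prefix
bloom.add(make_key(5, [10, 20]))
assert bloom.may_contain(make_key(5, [10, 99]))      # same (index, col1) prefix
assert not bloom.may_contain(make_key(5, [11, 20]))  # different first column
```

The design point this exposes: the prefix length must be fixed per column family, which is why mixing indexes with different useful prefixes in one CF needs discussion.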
Example output is listed below. There is work-in-progress to get a "show processlist" like feature into RocksDB so we can track per-thread states. Output from that can be displayed during SHOW ENGINE ROCKSDB TRANSACTION STATUS.
The other request is to get the age of the per-transaction or per-statement snapshots and iterators in use by a given connection. There is work-in-progress in RocksDB to track that. But given that the GetSnapshot call is done by our code (on the MySQL side), we can also read the clock when it is done and track the snapshot age ourselves.
Long-open snapshots block purge in InnoDB, which can be bad. Long-open snapshots in RocksDB mean that old versions of some keys cannot be dropped, which creates more CPU overhead during compaction. Long-open iterators mean that files replaced by compaction won't be deleted until the iterators that reference them are closed.
mysql -h udb5784.prn2 -P3307 -A -e 'show engine innodb transaction status\G'
*************************** 1. row ***************************
Type: InnoDB
Name:
#2014-12-06 08:16:56 7f9722cc4700 INNODB TRANSACTION MONITOR OUTPUT
Trx id counter 168494781347
Purge done for trx's n:o < 168494781347 undo n:o < 0 state: running but idle
History list length 1675
Lock stats: 0 deadlocks, 0 lock wait timeouts
Commits: 67286553 all, 67126992 with undo
Rollback: 619397 total, 37201404 partial
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 17872, OS thread handle 0x7f9722cc4700, query id 178298354 2401:db00:11:2a:face:0:3b:0 admin3 cleaning up
show engine innodb transaction status
---TRANSACTION 168494781345, not started
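Tracking snapshot age on the MySQL side, as suggested above, can be sketched with a small wrapper: record the clock when GetSnapshot is called and report per-connection ages for SHOW ENGINE ROCKSDB TRANSACTION STATUS. The class and method names are illustrative:

```python
# Sketch: track per-connection snapshot age on the MySQL side.
import time

class SnapshotTracker:
    def __init__(self):
        self.taken_at = {}               # connection id -> timestamp

    def get_snapshot(self, conn_id):
        # Real code would also call db->GetSnapshot() here.
        self.taken_at[conn_id] = time.monotonic()

    def release_snapshot(self, conn_id):
        self.taken_at.pop(conn_id, None)

    def ages(self, now=None):
        now = time.monotonic() if now is None else now
        return {c: now - t for c, t in self.taken_at.items()}

tracker = SnapshotTracker()
tracker.get_snapshot(conn_id=42)
taken = tracker.taken_at[42]
ages = tracker.ages(now=taken + 5.0)
assert ages[42] == 5.0                   # a 5-second-old snapshot
tracker.release_snapshot(42)
assert tracker.ages() == {}
```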
This requires more discussion but many of us are in favor of it.
It might be time to use the binlog as the source of truth, to avoid the complexity and inefficiency of keeping RocksDB and the binlog synchronized via internal XA. There are two modes for this. The first mode is durable: fsync is done after writing the binlog and the RocksDB WAL. The other mode is non-durable: fsync might only be done once per second, and we rely on lossless semisync to recover. Binlog as source of truth might have been discussed on a MariaDB mailing list many years ago - https://lists.launchpad.net/maria-developers/msg01998.html
Some details are at http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html
The new protocol will be:
When lossless semisync is used we skip steps 2 and 4. When lossless semisync is not used we do step 2 and skip 3. Step 4 is optional. Recovery in this case is done by:
When running in non durable mode, then on a crash one of the following is true where the relation describes which one has more commits:
It will help a lot to have RocksDB perf counters in SHOW GLOBAL STATUS output for RocksDB + MySQL. For info on SHOW STATUS see http://dev.mysql.com/doc/refman/5.0/en/show-status.html.
For InnoDB, see innodb_status_variables in https://github.com/MariaDB/webscalesql-5.6/blob/webscalesql-5.6.19/storage/innobase/handler/ha_innodb.cc
For RocksDB, see rocksdb_status_variables in https://github.com/MariaDB/webscalesql-5.6/blob/webscalesql-5.6.19/storage/rocksdb/ha_rocksdb.cc
For RocksDB entry points, see include/rocksdb/options.h and in db/db_bench.cc the function PrintStats that calls db->GetProperty("rocksdb.stats",...)
bin/mysql -A -e 'show global status' | grep -i rocksdb
rocksdb_bytes_written 2372229291762
rocksdb_bytes_read 7763473524122
rocksdb_number_keys_written 25970022279
rocksdb_number_keys_read 160304813639
rocksdb_number_keys_updated 0
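Exporting the counters shown above amounts to mapping RocksDB ticker names onto SHOW GLOBAL STATUS variable names. A sketch, where the ticker names and the get_ticker_count() source are stand-ins for RocksDB's Statistics object (the real ticker identifiers differ):

```python
# Sketch: surface RocksDB ticker counters as SHOW GLOBAL STATUS rows.
TICKERS = {
    "bytes.written": "rocksdb_bytes_written",
    "bytes.read": "rocksdb_bytes_read",
    "number.keys.written": "rocksdb_number_keys_written",
    "number.keys.read": "rocksdb_number_keys_read",
}

def status_variables(get_ticker_count):
    """Map RocksDB tickers onto SHOW GLOBAL STATUS variable names."""
    return {mysql_name: get_ticker_count(ticker)
            for ticker, mysql_name in TICKERS.items()}

# Values taken from the example output above.
fake_stats = {"bytes.written": 2372229291762, "bytes.read": 7763473524122,
              "number.keys.written": 25970022279, "number.keys.read": 160304813639}
rows = status_variables(fake_stats.get)
assert rows["rocksdb_bytes_written"] == 2372229291762
assert len(rows) == 4
```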
I think this depends on basic support for IS.table_statistics getting ported to WebScaleSQL.
I want the following for RocksDB. This is similar to innodb_flush_log_at_trx_commit, which uses an integer set to 0, 1 or 2.
We need a) and b) before doing c) so this can be done in two diffs. Doing c) requires a background thread. See srv_master_thread in innodb source code for tips on doing that -- both thread start and thread stop on db shutdown.
Additionally it would be good to have a per-session variable that can override these values. For the per-session variable see bulk_load in ha_rocksdb.cc at https://github.com/MariaDB/webscalesql-5.6/blob/webscalesql-5.6.19/storage/rocksdb/ha_rocksdb.cc#L128
Whether to sync the RocksDB WAL is specified by the RocksDB user in WriteOptions::sync
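A sketch of the innodb_flush_log_at_trx_commit-style policy described above, under these assumed semantics: 1 = sync the WAL at every commit (WriteOptions::sync = true), 2 = write at commit but let a background thread sync roughly once per second, 0 = rely entirely on the background flush. The Wal class and commit() helper are invented for illustration:

```python
# Sketch: a 0/1/2 flush-at-commit policy for the RocksDB WAL.
class Wal:
    def __init__(self):
        self.writes = 0
        self.syncs = 0

    def write(self, sync):
        self.writes += 1
        if sync:
            self.syncs += 1

    def background_sync(self):          # run ~once per second by a bg thread
        self.syncs += 1

def commit(wal, flush_log_at_trx_commit):
    if flush_log_at_trx_commit == 1:
        wal.write(sync=True)            # WriteOptions::sync = true
    elif flush_log_at_trx_commit == 2:
        wal.write(sync=False)
    # 0: no per-commit write or sync; the background thread handles both

wal = Wal()
for _ in range(3):
    commit(wal, flush_log_at_trx_commit=2)
wal.background_sync()
assert (wal.writes, wal.syncs) == (3, 1)  # 3 writes, one batched sync

wal2 = Wal()
commit(wal2, flush_log_at_trx_commit=1)
assert (wal2.writes, wal2.syncs) == (1, 1)  # durable mode: sync per commit
```

The background thread needed for modes 0 and 2 is the same thread discussed above (see srv_master_thread in the InnoDB source for the start/stop pattern).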