mariadb / webscalesql-5.6
This project forked from facebookarchive/webscalesql-5.6
Web Scale SQL, Version 5.6, based upon the MySQL-5.6 community releases.
License: GNU General Public License v2.0
This test case:
CREATE TABLE t1 (
i INT,
t TINYINT,
s SMALLINT,
m MEDIUMINT,
b BIGINT,
pk MEDIUMINT AUTO_INCREMENT PRIMARY KEY,
UNIQUE KEY b_t (b,t)
) ENGINE=rocksdb;
INSERT INTO t1 (i,t,s,m,b) VALUES (1,2,3,4,5),(1000,100,10000,1000000,1000000000000000000),(5,100,10000,1000000,100000000000000000),(2,3,4,5,6),(3,4,5,6,7),(101,102,103,104,105),(10001,103,10002,10003,10004),(10,11,12,13,14),(11,12,13,14,15),(12,13,14,15,16);
SELECT b+t FROM t1 WHERE (b,t) IN ( SELECT b, t FROM t1 WHERE i>1 ) ORDER BY b+t;
crashes like this:
#2 0x000000000073cd10 in handle_fatal_signal (sig=11) at /home/psergey/dev-git/webscalesql-5.6/sql/signal_handler.cc:248
#3 <signal handler called>
#4 __memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:135
#5 0x0000000000abc9c4 in RDBSE_KEYDEF::get_primary_key_tuple (this=0x7facc0058630, pk_descr=0x7facc005b470, key=0x7facc0056050 "", pk_buffer=0x7facc0076b00 "") at /home/psergey/dev-git/webscalesql-5.6/storage/rocksdb/rdb_datadic.cc:156
#6 0x0000000000ab75f1 in ha_rocksdb::index_read_map (this=0x7facc00533e0, buf=0x7facc00536d0 "\301\f", key=0x7facc0055e00 "", keypart_map=3, find_flag=HA_READ_KEY_EXACT) at /home/psergey/dev-git/webscalesql-5.6/storage/rocksdb/ha_rocksdb.cc:1452
#7 0x000000000065886f in handler::ha_index_read_map (this=0x7facc00533e0, buf=0x7facc00536d0 "\301\f", key=0x7facc0055e00 "", keypart_map=3, find_flag=HA_READ_KEY_EXACT) at /home/psergey/dev-git/webscalesql-5.6/sql/handler.cc:2767
#8 0x00000000007ba96d in join_read_always_key (tab=0x7facc0084f28) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:2206
#9 0x00000000007b8875 in sub_select (join=0x7facc0058dc0, join_tab=0x7facc0084f28, end_of_records=false) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:1256
#10 0x00000000007b9388 in evaluate_join_record (join=0x7facc0058dc0, join_tab=0x7facc0084c28) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:1601
#11 0x00000000007b8990 in sub_select (join=0x7facc0058dc0, join_tab=0x7facc0084c28, end_of_records=false) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:1276
#12 0x00000000007b828b in do_select (join=0x7facc0058dc0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:933
#13 0x00000000007b6254 in JOIN::exec (this=0x7facc0058dc0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_executor.cc:194
#14 0x0000000000813f60 in mysql_execute_select (thd=0x3946320, select_lex=0x39486c0, free_join=true) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_select.cc:1100
#15 0x0000000000814244 in mysql_select (thd=0x3946320, tables=0x7facc00050b8, wild_num=0, fields=..., conds=0x7facc0058818, order=0x3948888, group=0x39487c0, having=0x0, select_options=2147748608, result=0x7facc0058d98, unit=0x3948078, select_lex=0x39486c0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_select.cc:1221
#16 0x0000000000812146 in handle_select (thd=0x3946320, result=0x7facc0058d98, setup_tables_done_option=0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_select.cc:110
#17 0x00000000007ec7ae in execute_sqlcom_select (thd=0x3946320, all_tables=0x7facc00050b8) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:5124
#18 0x00000000007e5548 in mysql_execute_command (thd=0x3946320) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:2660
#19 0x00000000007eede3 in mysql_parse (thd=0x3946320, rawbuf=0x7facc0004c50 "SELECT b+t FROM t1 WHERE (b,t) IN ( SELECT b, t FROM t1 WHERE i>1 ) ORDER BY b+t", length=80, parser_state=0x7face0ab3dd0) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:6266
#20 0x00000000007e26eb in dispatch_command (command=COM_QUERY, thd=0x3946320, packet=0x3959f01 "", packet_length=80) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:1336
#21 0x00000000007e18f8 in do_command (thd=0x3946320) at /home/psergey/dev-git/webscalesql-5.6/sql/sql_parse.cc:1038
...
I suspect this happens because of a mismatch between the SQL layer and RocksDB-SE about whether secondary unique keys include primary key columns.
RocksDB-SE keys always include the primary key columns (i.e. they are extended keys).
However, the SQL layer assumes that unique secondary indexes do not include the PK columns.
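To illustrate the suspected failure mode, here is a minimal sketch (the function names are illustrative, not the real rdb_datadic.cc code) of the extended-key layout and of PK extraction from the key tail. If the SQL layer builds a lookup key without the PK suffix, a naive extraction reads past the buffer, which is effectively what the memcpy in the backtrace does:

```python
# Hypothetical sketch of the extended-key layout assumed by RocksDB-SE.
INDEX_ID_LEN = 4

def pack_sec_key(index_id, sec_cols, pk_cols):
    """Secondary keys always carry the PK columns as a suffix ("extended keys")."""
    return (index_id.to_bytes(INDEX_ID_LEN, "big")
            + b"".join(sec_cols) + b"".join(pk_cols))

def get_primary_key_tuple(key, sec_len, pk_len):
    """Extract the PK suffix; raise instead of reading past the buffer."""
    start = INDEX_ID_LEN + sec_len
    if len(key) < start + pk_len:
        raise ValueError("key lacks PK suffix: SQL layer / SE mismatch")
    return key[start:start + pk_len]

# Full extended key: extraction works.
full = pack_sec_key(7, [b"\x00" * 8, b"\x01"], [b"\x02\x02\x02"])
assert get_primary_key_tuple(full, 9, 3) == b"\x02\x02\x02"

# Key built without the PK suffix (what the SQL layer sends for a unique
# secondary index): naive extraction would overrun.
short = pack_sec_key(7, [b"\x00" * 8, b"\x01"], [])
try:
    get_primary_key_tuple(short, 9, 3)
    crashed = False
except ValueError:
    crashed = True
assert crashed
```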
Try this:
create table t1 (a int primary key auto_increment) engine=rocksdb;
insert into t1 values(NULL);
insert into t1 values(NULL);
mysqladmin shutdown
Restart the server, and:
MySQL [test]> insert into t1 values(NULL);
ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'
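The ER_DUP_ENTRY above happens because the auto-increment counter is not recovered on restart. A minimal sketch of the likely fix (assumed behavior, not the real ha_rocksdb code): on table open, seek to the largest stored PK and resume the counter from there. A sorted list stands in for the RocksDB key space:

```python
# Sketch: recover the auto_increment counter on table open by looking at
# the largest stored PK, instead of restarting from 1 after a restart.
import bisect

class FakeTable:
    def __init__(self):
        self.keys = []          # sorted PK values; stands in for RocksDB
        self.next_auto_inc = 1

    def open(self):
        # What the engine should do on startup: seek to the last PK.
        self.next_auto_inc = (self.keys[-1] + 1) if self.keys else 1

    def insert(self, pk=None):
        if pk is None:
            pk = self.next_auto_inc
        if pk in self.keys:
            raise ValueError(f"Duplicate entry '{pk}' for key 'PRIMARY'")
        bisect.insort(self.keys, pk)
        self.next_auto_inc = max(self.next_auto_inc, pk + 1)

t = FakeTable()
t.insert(); t.insert()          # two rows: PKs 1 and 2

# Simulate restart: the in-memory counter is lost unless open() recovers it.
t.next_auto_inc = 1
t.open()
t.insert()                      # no ER_DUP_ENTRY; inserts PK 3
assert t.keys == [1, 2, 3]
```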
This should avoid doing a table copy when adding an index. See http://dev.mysql.com/doc/refman/5.5/en/innodb-create-index.html
We need an online physical backup tool, similar to xtrabackup for InnoDB. It should meet at least the following features/conditions.
Nice to have:
As noticed by Yoshinori:
create table r1 (id int auto_increment primary key, value int) engine=rocksdb;
begin;
insert into r1 values (1, 1);
insert into r1 values (2, 2);
insert into r1 values (3, 3);
...
insert into r1 values (99999, 99999);
rollback;
After the rollback, a query on the table still returned 99099 rows.
Currently, ha_rocksdb::write_row() updates all indexes, even those with no changed columns. We should update only the indexes whose values actually changed.
A question: are there any rocksdb status counters that could be used to observe this?
How do we do DROP TABLE and DROP INDEX? Using a column family per index makes this easy: just drop the CF. But I assume we must support N tables or indexes per column family, so we need a compaction filter to eventually remove deleted rows, along with an option to trigger manual compaction after dropping a large table or index to reclaim disk space faster.

Assuming each RocksDB key has the table/index ID as its prefix, do we track IDs for valid objects, or IDs for dropped objects whose key-value pairs must still be removed? If we track valid IDs, then compaction filters must always do a bloom-filter check or hash-table lookup to confirm that the IDs in the merged rows are valid, and that can be a big CPU overhead. If we track IDs for dropped objects, then we have a new state management problem: this state must be persisted to survive a mysqld restart, and we must also figure out when an ID can be removed from the set. But the benefit is that we only need the compaction filter when that set of IDs is non-empty. As an optimization, we can avoid the compaction filter checks for merges between L0 and L1, because those must be as fast as possible.
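The "track dropped IDs" option can be sketched as follows. This is an illustrative model, not the real RocksDB API (which is CompactionFilter::Filter() in C++); names like DropFilter and compact() are invented for the sketch:

```python
# Sketch: a compaction filter that removes key-value pairs whose index-ID
# prefix belongs to a dropped table/index.
INDEX_ID_LEN = 4

class DropFilter:
    def __init__(self, dropped_ids):
        self.dropped_ids = set(dropped_ids)   # must be persisted across restart

    def needed(self):
        # Only install the filter when there is something to remove.
        return bool(self.dropped_ids)

    def filter(self, key):
        """Return True if the key should be removed during compaction."""
        index_id = int.from_bytes(key[:INDEX_ID_LEN], "big")
        return index_id in self.dropped_ids

def compact(kvs, flt):
    return {k: v for k, v in kvs.items() if not flt.filter(k)}

def key(index_id, suffix):
    return index_id.to_bytes(INDEX_ID_LEN, "big") + suffix

kvs = {key(1, b"a"): b"row1", key(2, b"a"): b"row2", key(2, b"b"): b"row3"}
flt = DropFilter({2})                 # index 2 was dropped
out = compact(kvs, flt)
assert list(out.values()) == [b"row1"]
```

Note the benefit named above falls out of needed(): when no objects have been dropped, the filter is not installed and compaction pays no per-key cost.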
For background see http://dev.mysql.com/doc/refman/5.6/en/show-engine.html
We need to change what is in SHOW ENGINE ROCKSDB STATUS. Right now it has live sst files which is a huge list with leveled compaction. For now I prefer to have it list the output from compaction stats. That probably needs to use one of:
db->GetProperty("rocksdb.stats", ...
db->GetProperty("rocksdb.cfstats", ...
*************************** 1. row ***************************
Type: ROCKSDB
Name: live_files
Status: cf=default name=/4908814.sst size=97853952
cf=default name=/4908812.sst size=97879865
cf=default name=/4908807.sst size=97833748
cf=default name=/4905498.sst size=1865749
cf=default name=/4905500.sst size=2670668
I just noticed primary keys are stored twice in RocksDB SE, as both key and value, even though they do not need to be stored in the value.
How to repeat:
#10 column primary key
create table r1 (
id1 bigint not null,
id2 bigint not null,
id3 bigint not null,
id4 bigint not null,
id5 bigint not null,
id6 bigint not null,
id7 bigint not null,
id8 bigint not null,
id9 bigint not null,
id10 bigint not null,
id11 bigint not null,
id12 bigint not null,
primary key (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10)
) engine=rocksdb charset=latin1 collate=latin1_bin;
#1 column primary key
create table r2 (
id1 bigint not null,
id2 bigint not null,
id3 bigint not null,
id4 bigint not null,
id5 bigint not null,
id6 bigint not null,
id7 bigint not null,
id8 bigint not null,
id9 bigint not null,
id10 bigint not null,
id11 bigint not null,
id12 bigint not null,
primary key (id1)
) engine=rocksdb charset=latin1 collate=latin1_bin;
Insert 2 million rows into these tables, with values sequentially increasing from 1 to 2 million (all columns have the same value per row) -- the data can be generated like this.
for(my $i=1; $i<= 2000000; $i++) {
for(my $j=0; $j < 12; $j++) {
if($j < 11) {
print "$i,";
}else {
print "$i\n";
}
}
}
Then make compaction happen. Size in KB:

              rocksdb    innodb
id1-id10 pk:   353764    264241
id1 pk:        212984    260046

So basically long primary keys increase on-disk space.
I also checked via gdb (setting breakpoint at rocksdb::WriteBatch::Put()) and confirmed.
insert into r1 values (1,1,1,1,1,1,1,1,1,1,1,1);
=>
(gdb) p key.size_
$1 = 84
(gdb) p value.size_
$2 = 96
insert into r2 values (1,1,1,1,1,1,1,1,1,1,1,1);
=>
(gdb) p key.size_
$3 = 12
(gdb) p value.size_
$4 = 96
I expected value.size_ to be 12 and 84, respectively. Since both key fields and value fields are stored, the duplicated fields can be removed from the value.
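A minimal sketch of the proposed fix: store only the non-PK columns in the value, since all PK columns are already in the key. The encoding here is a simplified fixed-width one (8 bytes per bigint, no index-ID prefix), so the exact byte counts differ slightly from the gdb numbers above, which include key-format overhead:

```python
# Sketch: drop PK columns from the value because the key already has them.
COL_LEN = 8

def encode_row(all_cols, pk_cols):
    key = b"".join(v.to_bytes(COL_LEN, "big") for v in all_cols[:pk_cols])
    # Today: the value duplicates every column.
    value_now = b"".join(v.to_bytes(COL_LEN, "big") for v in all_cols)
    # Proposed: the value carries only columns not already in the key.
    value_new = b"".join(v.to_bytes(COL_LEN, "big") for v in all_cols[pk_cols:])
    return key, value_now, value_new

row = [1] * 12                       # insert into r1/r2 values (1,1,...,1)

k1, now1, new1 = encode_row(row, pk_cols=10)   # r1: 10-column PK
k2, now2, new2 = encode_row(row, pk_cols=1)    # r2: 1-column PK

assert (len(now1), len(now2)) == (96, 96)      # columns stored twice today
assert (len(new1), len(new2)) == (16, 88)      # after removing PK duplicates
```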
We can get an estimate for the row count ('Rows:' below is always 10,000 today) via the RocksDB property "rocksdb.estimate-num-keys".
Data_length and index_length are best defined when not using an index-organized table. For an index-organized table like InnoDB, the data_length column includes the size of the PK index (including all columns) and the index_length column has the size of all secondary indexes. We can treat RocksDB in the same way. However, we aren't using file-per-table or file-per-index for RocksDB so we need to determine whether we can get estimates on index sizes.
*************************** 1. row ***************************
Name: ai
Engine: ROCKSDB
Version: 10
Row_format: Dynamic
Rows: 1000
Avg_row_length: 0
Data_length: 0
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: NULL
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
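One way to get the per-index size estimates discussed above without file-per-index: each index owns a contiguous key range, so something like RocksDB's GetApproximateSizes() over those ranges can attribute bytes to the PK (Data_length) and to secondary indexes (Index_length). A sketch, with a dict standing in for RocksDB and an invented approximate_size() in place of the real call:

```python
# Sketch: estimate Data_length / Index_length from per-index key ranges.
INDEX_ID_LEN = 4

def key(index_id, suffix):
    return index_id.to_bytes(INDEX_ID_LEN, "big") + suffix

def approximate_size(kvs, lo, hi):
    """Stand-in for db->GetApproximateSizes(Range(lo, hi))."""
    return sum(len(k) + len(v) for k, v in kvs.items() if lo <= k < hi)

def table_sizes(kvs, pk_id, sec_ids):
    rng = lambda i: (key(i, b""), key(i + 1, b""))
    data_length = approximate_size(kvs, *rng(pk_id))
    index_length = sum(approximate_size(kvs, *rng(i)) for i in sec_ids)
    return data_length, index_length

kvs = {key(1, b"pk-a"): b"v" * 10, key(1, b"pk-b"): b"v" * 10,
       key(2, b"sec-a"): b""}
data_len, idx_len = table_sizes(kvs, pk_id=1, sec_ids=[2])
assert data_len == 2 * (8 + 10)   # two PK entries: 8-byte key + 10-byte value
assert idx_len == 9               # one secondary entry: 9-byte key, empty value
```

These are estimates (compaction and compression make exact accounting impossible without scanning), which matches how the other columns in SHOW TABLE STATUS are already approximate.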
I get this link error:
...a39a3e/bin/ld: cannot find -lsnappy
From the command line I see that librocksdb.a has a full path, but not so for snappy and bz2:
... /home/mcallaghan/rocksdb/librocksdb.a -lsnappy -lbz2
And there is magic in storage/rocksdb/CMakeLists.txt to do that for librocksdb.a, but not for the others:
FIND_LIBRARY(Rocksdb_LIBS NAMES rocksdb
PATHS ${Rocksdb_LIB_PATHS} ${Rocksdb_LIB} $ENV{ROCKSDB_LIBRARIES})
IF(EXISTS ${Rocksdb_LIBS})
GET_FILENAME_COMPONENT(LINK_DIR ${Rocksdb_LIBS} ABSOLUTE)
ELSE()
MESSAGE(SEND_ERROR "No rocksdb library directory found! Consider using environment variable ROCKSDB_LIBRARIES=...")
RETURN()
ENDIF()
...
MYSQL_ADD_PLUGIN(rocksdb_se ${ROCKSDB_SOURCES} STORAGE_ENGINE STATIC_ONLY MANDATORY LINK_LIBRARIES ${LINK_DIR} snappy bz2)
I think the workaround is to remove "snappy bz2" from the line above and build RocksDB without support for bz2 and snappy.
We currently provide read committed. We need to define and implement repeatable read. See https://github.com/MariaDB/webscalesql-5.6/wiki/Cursor-Isolation
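A minimal sketch of repeatable read on top of RocksDB-style snapshots: a transaction takes one snapshot at its first read and serves every later read from it. The MVCC store below is a stand-in for RocksDB's sequence numbers; see the Cursor-Isolation wiki page linked above for the real design discussion:

```python
# Sketch: repeatable read via a per-transaction snapshot.
class Store:
    def __init__(self):
        self.seq = 0
        self.versions = {}               # key -> [(seq, value), ...]

    def put(self, k, v):
        self.seq += 1
        self.versions.setdefault(k, []).append((self.seq, v))

    def get(self, k, snapshot=None):
        snap = self.seq if snapshot is None else snapshot
        live = [v for s, v in self.versions.get(k, []) if s <= snap]
        return live[-1] if live else None

class RepeatableReadTxn:
    def __init__(self, store):
        self.store, self.snapshot = store, None

    def get(self, k):
        if self.snapshot is None:        # GetSnapshot() on first read
            self.snapshot = self.store.seq
        return self.store.get(k, self.snapshot)

db = Store()
db.put(b"a", b"v1")
txn = RepeatableReadTxn(db)
assert txn.get(b"a") == b"v1"
db.put(b"a", b"v2")                      # concurrent write
assert txn.get(b"a") == b"v1"            # repeatable read: still the old value
assert db.get(b"a") == b"v2"             # read committed would see the new one
```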
This is a placeholder for now.
Need to make a copy of mysql-test/suite/leveldb for rocksdb and run it with rocksdb.
SHOW TABLE STATUS shows Auto_increment=0. This is because ha_rocksdb::info(HA_STATUS_AUTO) doesn't do anything.
This includes per-column-family options and configuration options.
I assume we will put many indexes in one column family, as a column family per index doesn't scale to a large number of indexes. When that is done we need a method for specifying the index columns that will be used in the column family. This assumes that we can use bloom filters on a prefix of the index key and that the format of an index key within RocksDB is:
<index ID - hidden column> <index column 1> <index column 2> ...
In that case the options are:
Note that
This requires much more discussion.
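The prefix-bloom idea above can be sketched like this: keys are <index ID> <index column 1> <index column 2> ..., and a per-CF option names how many leading columns form the bloom-filter prefix (in real RocksDB this is the prefix_extractor option; the set-based "bloom filter" below is a simplification with no false positives):

```python
# Sketch: bloom filter on a prefix of the index key.
INDEX_ID_LEN = 4
COL_LEN = 8

def make_key(index_id, cols):
    return (index_id.to_bytes(INDEX_ID_LEN, "big")
            + b"".join(c.to_bytes(COL_LEN, "big") for c in cols))

def prefix_of(key, prefix_cols):
    """The prefix extractor: index ID plus the first prefix_cols columns."""
    return key[:INDEX_ID_LEN + prefix_cols * COL_LEN]

class PrefixBloom:
    def __init__(self, prefix_cols):
        self.prefix_cols, self.prefixes = prefix_cols, set()

    def add(self, key):
        self.prefixes.add(prefix_of(key, self.prefix_cols))

    def may_contain(self, key):
        return prefix_of(key, self.prefix_cols) in self.prefixes

bloom = PrefixBloom(prefix_cols=1)       # per-CF option: 1 index column in prefix
bloom.add(make_key(5, [10, 20]))
assert bloom.may_contain(make_key(5, [10, 99]))      # same (index, col1) prefix
assert not bloom.may_contain(make_key(5, [11, 20]))  # different first column
```

The design point this exposes: the prefix length must be fixed per column family, which is why mixing indexes with different useful prefixes in one CF needs discussion.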
Example output is listed below. There is work-in-progress to get a "show processlist" like feature into RocksDB so we can track per-thread states. Output from that can be displayed during SHOW ENGINE ROCKSDB TRANSACTION STATUS.
The other request is to get the age of the per-transaction or per-statement snapshots and iterators in use by a given connection. There is work-in-progress in RocksDB to track that. But given that the GetSnapshot call is done by our code (on the MySQL side), we can also read the clock when it is done and track the snapshot age ourselves.
Long-open snapshots block purge in InnoDB, which can be bad. Long-open snapshots in RocksDB mean that old versions of some keys cannot be dropped, which creates more CPU overhead during compaction. Long-open iterators mean that files replaced by compaction won't be deleted until the iterators that reference them are closed.
mysql -h udb5784.prn2 -P3307 -A -e 'show engine innodb transaction status\G'
*************************** 1. row ***************************
Type: InnoDB
Name:
#2014-12-06 08:16:56 7f9722cc4700 INNODB TRANSACTION MONITOR OUTPUT
Trx id counter 168494781347
Purge done for trx's n:o < 168494781347 undo n:o < 0 state: running but idle
History list length 1675
Lock stats: 0 deadlocks, 0 lock wait timeouts
Commits: 67286553 all, 67126992 with undo
Rollback: 619397 total, 37201404 partial
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 17872, OS thread handle 0x7f9722cc4700, query id 178298354 2401:db00:11:2a:face:0:3b:0 admin3 cleaning up
show engine innodb transaction status
---TRANSACTION 168494781345, not started
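Tracking snapshot age on the MySQL side, as suggested above, can be sketched with a small wrapper: record the clock when GetSnapshot is called and report per-connection ages for SHOW ENGINE ROCKSDB TRANSACTION STATUS. The class and method names are illustrative:

```python
# Sketch: track per-connection snapshot age on the MySQL side.
import time

class SnapshotTracker:
    def __init__(self):
        self.taken_at = {}               # connection id -> timestamp

    def get_snapshot(self, conn_id):
        # Real code would also call db->GetSnapshot() here.
        self.taken_at[conn_id] = time.monotonic()

    def release_snapshot(self, conn_id):
        self.taken_at.pop(conn_id, None)

    def ages(self, now=None):
        now = time.monotonic() if now is None else now
        return {c: now - t for c, t in self.taken_at.items()}

tracker = SnapshotTracker()
tracker.get_snapshot(conn_id=42)
taken = tracker.taken_at[42]
ages = tracker.ages(now=taken + 5.0)
assert ages[42] == 5.0                   # a 5-second-old snapshot
tracker.release_snapshot(42)
assert tracker.ages() == {}
```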
This requires more discussion but many of us are in favor of it.
It might be time to use the binlog as the source of truth, to avoid the complexity and inefficiency of keeping RocksDB and the binlog synchronized via internal XA. There are two modes for this. The first mode is durable: fsync is done after writing the binlog and the RocksDB WAL. The other mode is non-durable: fsync might only be done once per second, and we rely on lossless semisync to recover. Binlog as source of truth might have been discussed on a MariaDB mailing list many years ago - https://lists.launchpad.net/maria-developers/msg01998.html
Some details are at http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html
The new protocol will be:
When lossless semisync is used we skip steps 2 and 4. When lossless semisync is not used we do step 2 and skip 3. Step 4 is optional. Recovery in this case is done by:
When running in non durable mode, then on a crash one of the following is true where the relation describes which one has more commits:
It will help a lot to have RocksDB perf counters in SHOW GLOBAL STATUS output for RocksDB + MySQL. For info on SHOW STATUS see http://dev.mysql.com/doc/refman/5.0/en/show-status.html.
For InnoDB, see innodb_status_variables in https://github.com/MariaDB/webscalesql-5.6/blob/webscalesql-5.6.19/storage/innobase/handler/ha_innodb.cc
For RocksDB, see rocksdb_status_variables in https://github.com/MariaDB/webscalesql-5.6/blob/webscalesql-5.6.19/storage/rocksdb/ha_rocksdb.cc
For RocksDB entry points, see include/rocksdb/options.h and in db/db_bench.cc the function PrintStats that calls db->GetProperty("rocksdb.stats",...)
bin/mysql -A -e 'show global status' | grep -i rocksdb
rocksdb_bytes_written 2372229291762
rocksdb_bytes_read 7763473524122
rocksdb_number_keys_written 25970022279
rocksdb_number_keys_read 160304813639
rocksdb_number_keys_updated 0
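Exporting the counters shown above amounts to mapping RocksDB ticker names onto SHOW GLOBAL STATUS variable names. A sketch, where the ticker names and the get_ticker_count() source are stand-ins for RocksDB's Statistics object (the real ticker identifiers differ):

```python
# Sketch: surface RocksDB ticker counters as SHOW GLOBAL STATUS rows.
TICKERS = {
    "bytes.written": "rocksdb_bytes_written",
    "bytes.read": "rocksdb_bytes_read",
    "number.keys.written": "rocksdb_number_keys_written",
    "number.keys.read": "rocksdb_number_keys_read",
}

def status_variables(get_ticker_count):
    """Map RocksDB tickers onto SHOW GLOBAL STATUS variable names."""
    return {mysql_name: get_ticker_count(ticker)
            for ticker, mysql_name in TICKERS.items()}

# Values taken from the example output above.
fake_stats = {"bytes.written": 2372229291762, "bytes.read": 7763473524122,
              "number.keys.written": 25970022279, "number.keys.read": 160304813639}
rows = status_variables(fake_stats.get)
assert rows["rocksdb_bytes_written"] == 2372229291762
assert len(rows) == 4
```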
I think this depends on basic support for IS.table_statistics getting ported to WebScaleSQL.
I want the following for RocksDB. This is similar to innodb_flush_log_at_trx_commit, which uses an integer set to 0, 1 or 2.
We need a) and b) before doing c) so this can be done in two diffs. Doing c) requires a background thread. See srv_master_thread in innodb source code for tips on doing that -- both thread start and thread stop on db shutdown.
Additionally it would be good to have a per-session variable that can override these values. For the per-session variable see bulk_load in ha_rocksdb.cc at https://github.com/MariaDB/webscalesql-5.6/blob/webscalesql-5.6.19/storage/rocksdb/ha_rocksdb.cc#L128
Whether to sync the RocksDB WAL is specified by the RocksDB user in WriteOptions::sync
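A sketch of the innodb_flush_log_at_trx_commit-style policy described above, under these assumed semantics: 1 = sync the WAL at every commit (WriteOptions::sync = true), 2 = write at commit but let a background thread sync roughly once per second, 0 = rely entirely on the background flush. The Wal class and commit() helper are invented for illustration:

```python
# Sketch: a 0/1/2 flush-at-commit policy for the RocksDB WAL.
class Wal:
    def __init__(self):
        self.writes = 0
        self.syncs = 0

    def write(self, sync):
        self.writes += 1
        if sync:
            self.syncs += 1

    def background_sync(self):          # run ~once per second by a bg thread
        self.syncs += 1

def commit(wal, flush_log_at_trx_commit):
    if flush_log_at_trx_commit == 1:
        wal.write(sync=True)            # WriteOptions::sync = true
    elif flush_log_at_trx_commit == 2:
        wal.write(sync=False)
    # 0: no per-commit write or sync; the background thread handles both

wal = Wal()
for _ in range(3):
    commit(wal, flush_log_at_trx_commit=2)
wal.background_sync()
assert (wal.writes, wal.syncs) == (3, 1)  # 3 writes, one batched sync

wal2 = Wal()
commit(wal2, flush_log_at_trx_commit=1)
assert (wal2.writes, wal2.syncs) == (1, 1)  # durable mode: sync per commit
```

The background thread needed for modes 0 and 2 is the same thread discussed above (see srv_master_thread in the InnoDB source for the start/stop pattern).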