martinsumner / leveled
A pure Erlang Key/Value store - based on a LSM-tree, optimised for HEAD requests
License: Apache License 2.0
Initially leveled has been developed to support any Erlang term as a Key or Bucket. In Riak, these will also be binaries.
The openness to any Erlang term, though, made defining an end key for things such as an all-bucket query hard - as it was not possible to be sure what would be bigger than any term.
So matching on an EndKey doesn't use a simple > guard; it must use the ugly concept of leveled_codec:endkey_passed.
Really, keys should just be forced to be binaries, and then nil can be used in end keys as something greater than all binaries.
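As a sketch of the concept (not the exact leveled_codec implementation), the end-key check has to be a function rather than a guard, with a special atom standing in for "no end key":

```erlang
%% Sketch only - the real leveled_codec:endkey_passed/2 handles
%% composite ledger keys; this just shows why a plain > guard is
%% not enough when `all` must mean "no end key".
endkey_passed(all, _CheckingKey) ->
    false;
endkey_passed(EndKey, CheckingKey) ->
    EndKey < CheckingKey.
```

If keys were forced to be binaries, a query could instead construct an end key guaranteed to sort after every real key, and use an ordinary comparison.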
recent_aae_expiry test failed:
%%% tictac_SUITE ==> {failed,{{badmatch,false},
[{tictac_SUITE,recent_aae_expiry,1,
[{file,"/Users/martinsumner/dbroot/leveled/_build/test/lib/leveled/test/end_to_end/tictac_SUITE.erl"},
{line,885}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1533}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1053}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,985}]}]}}
Failed 1 tests. Passed 22 tests.
I'm assuming this is intermittent at present. Re-testing to confirm. The speculative assumption is that it may be a timer-sleep timing issue - @Licenser has noted problems elsewhere with this.
I don't think this test would have been run more than a couple of times before I added it.
The binary bucket list type query (which underpins list_buckets) doesn't check if the key is active before returning a bucket
Penciller follows the original advertised behaviour of leveldb in increasing the size of each level by an order of magnitude (actually a factor of 8).
In Basho's modified leveldb this isn't followed. In part this is because of the use of overlapping files at Level 1, but it remains narrow beyond Level 2, increasing by only a factor of 2 from Level 2 to Level 3 - but then by a factor of 12 from Level 3 to Level 4 and beyond.
Level | Level Size | Cumulative Size | Cumulative with AAE |
---|---|---|---|
0 | 360 | 360 | 720 |
1 | 2,160 | 2,520 | 5,040 |
2 | 2,940 | 5,460 | 10,920 |
3 | 6,144 | 11,604 | 23,208 |
4 | 122,880 | 134,484 | 268,968 |
5 | 2,362,232 | 2,496,716 | 4,993,432 |
6 | not limited | not limited | not limited |
I haven't been able to construct a model to understand this - to try and reason about what the right shape of the merged tree is. But some experimentation seems worthwhile.
Sizes in leveled penciller are currently defined as a list of {level, FileCount}
tuples:
-define(LEVEL_SCALEFACTOR, [{0, 0}, {1, 8}, {2, 64}, {3, 512}, {4, 4096}, {5, 32768}, {6, 262144}, {7, infinity}]).
Level 5 allows for 1 trillion keys. Level 6 and Level 7 seem a little superfluous in this case.
An alternative might be:
-define(LEVEL_SCALEFACTOR, [{0, 0}, {1, 4}, {2, 16}, {3, 64}, {4, 512}, {5, 4096}, {6, 32768}, {7, infinity}]).
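As a quick sanity check when comparing alternatives like these, the total bounded file count implied by a scale factor list can be summed with a small fold, skipping the infinity level (cumulative_files/1 is an illustrative name, not part of leveled):

```erlang
%% Sum the bounded per-level file counts in a
%% LEVEL_SCALEFACTOR-style list, ignoring non-integer sizes such
%% as infinity.
cumulative_files(ScaleFactor) ->
    lists:foldl(
        fun({_Level, N}, Acc) when is_integer(N) -> Acc + N;
           ({_Level, _Unbounded}, Acc) -> Acc
        end,
        0,
        ScaleFactor).

%% cumulative_files([{0, 0}, {1, 8}, {2, 64}, {3, 512},
%%                   {4, 4096}, {5, 32768}, {6, 262144},
%%                   {7, infinity}]) returns 299592.
```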
See martinsumner/kv_index_tictactree#24
Need to make sure simple leveled functionality exists to verify indexes, and prompt a keystore to be rebuilt whilst a store is offline.
For the first part, need to return a folder that will first fold over objects (folding over the Journal, not the Ledger), building an AAE tree for the index entries of active objects. It should then fold over all index entries in the Ledger to provide a second AAE tree, and then compare the two AAE trees - logging a warning if they don't match. Use SnapPreFold = true for both folds, so they can be taken at a consistent point in time.
For the second part need to be able to call book_offlinerebuild(RootPath, ArchivePath), and have that archive the old Ledger, and then open the store with an empty Ledger, and then close again once it is rebuilt.
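The first part might be shaped something like the following - the function names here (journal_index_tree_fold/1, ledger_index_tree_fold/1, compare_trees/2) are illustrative placeholders, not the actual leveled or leveled_tictac API:

```erlang
%% Hypothetical sketch of the verification flow: build one AAE
%% tree from the Journal and one from the Ledger, then compare.
verify_index(Bookie) ->
    %% Both folds requested with SnapPreFold = true, so the two
    %% snapshots are taken at a consistent point in time.
    {async, JournalFolder} = journal_index_tree_fold(Bookie),
    {async, LedgerFolder} = ledger_index_tree_fold(Bookie),
    JournalTree = JournalFolder(),
    LedgerTree = LedgerFolder(),
    case compare_trees(JournalTree, LedgerTree) of
        [] ->
            ok;
        DirtySegments ->
            {warning, {trees_differ, DirtySegments}}
    end.
```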
Currently leveled will only compile for OTP20 by disabling warnings-as-errors. Once compiled, it appears to show some signs of measurable speed-ups.
OTP21 comes with potentially greater enhancements.
Need to find some way of supporting R16 through to OTP21 in the short term. Either convert gen_fsm to gen_server or to https://gitlab.com/Project-FiFo/gen_fsm_compat
It may be that compaction has a memory leak
Load: cpu 17 Memory: total 23476453 binary 17417838
procs 18034 processes 5858881 code 15786
runq 0 atom 719 ets 56323
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6938.30.0> user_drv '-' 120060 715148488 2465279 erlang:bif_return_trap/1
<6938.4742.1815> leveled_penciller:init/1 '-' 0 59419528 0 gen_server:loop/6
<6938.11643.4457> leveled_penciller:init/1 '-' 0 49516424 0 gen_server:loop/6
<6938.763.0> leveled_penciller:init/1 '-' 1934877 41263840 0 gen_server:loop/6
<6938.1503.0> leveled_penciller:init/1 '-' 2262527 41263840 0 gen_server:loop/6
<6938.1885.0> leveled_penciller:init/1 '-' 1594216 41263840 0 gen_server:loop/6
<6938.1157.0> leveled_penciller:init/1 '-' 2293928 34386688 0 gen_server:loop/6
<6938.1250.2186> leveled_penciller:init/1 '-' 0 34386688 0 gen_server:loop/6
<6938.6873.1992> leveled_penciller:init/1 '-' 0 34386688 0 gen_server:loop/6
<6938.15673.2775> leveled_penciller:init/1 '-' 0 34386688 0 gen_server:loop/6
Since increasing the pace of compaction, top shows the beam taking > 60% of memory on some nodes (normally for equivalent tests it would be about 10%).
The constant database files are not genuine constant database files, as the CRC check is wrapped in to the CDB file. This means that the journal file could not be interrogated by another CDB library.
This is unnecessary. The CRC is at the wrong layer of abstraction: it should be added prior to pushing by the Inker, and when scanning, the Inker should provide a validate function to the CDB process that does the CRC check.
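A validate function of the kind proposed might look something like this - a sketch only, assuming the CRC is a 32-bit prefix on the stored value (the framing here is illustrative, not the current Journal format):

```erlang
%% Illustrative validate fun the Inker could pass to the CDB
%% process: check a 32-bit CRC prefix against the value that
%% follows it.
ValidateFun =
    fun(<<CRC:32/integer, Value/binary>>) ->
            case erlang:crc32(Value) of
                CRC -> {ok, Value};
                _ -> {error, crc_failure}
            end;
       (_Other) ->
            {error, bad_framing}
    end.
```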
I see that leveled can store terms attached to an index - are there any optimisations to use it instead of having special keys for it?
It is expected that the memory consumed by processes grows with the database - in essence in line with the number of SST processes. However, I do not expect the binary heap or the ETS memory to grow perpetually. Yet (from about 5 hours into a load test)...
sudo rel/riak/bin/riak-admin top -interval 300 -sort memory -lines 10
===============================================================================================================================
'[email protected]' 22:22:27
Load: cpu 0 Memory: total 11535910 binary 3587950
procs 21334 processes 6067068 code 16414
runq 13 atom 727 ets 1809045
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6816.1867.0> leveled_penciller:init/1 '-' 7517654176 49516496 0 gen_server:loop/6
<6816.760.0> leveled_penciller:init/1 '-' 7425504054 34386688 0 gen_server:loop/6
<6816.761.0> leveled_penciller:init/1 '-' 7547983471 28655728 0 gen_server:loop/6
<6816.1147.0> leveled_penciller:init/1 '-' 7555039012 23880072 0 gen:do_call/4
<6816.1882.0> leveled_penciller:init/1 '-' 7529623803 23880032 1 gen_server:loop/6
<6816.1127.0> leveled_penciller:init/1 '-' 7577043611 23879928 0 gen_server:loop/6
<6816.1523.0> leveled_penciller:init/1 '-' 7558402565 19900272 1 gen:do_call/4
<6816.1494.0> leveled_penciller:init/1 '-' 7567906914 19900096 0 gen_server:loop/6
<6816.1493.0> leveled_penciller:init/1 '-' 7596748048 16583744 1 gen:do_call/4
<6816.1501.0> leveled_penciller:init/1 '-' 7324700860 7998096 0 gen:do_call/4
===============================================================================================================================
'[email protected]' 22:27:27
Load: cpu 92 Memory: total 11690010 binary 3611813
procs 21684 processes 6171523 code 16414
runq 11 atom 727 ets 1832505
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6816.1494.0> leveled_penciller:init/1 '-' 104484942 49516600 1 gen:do_call/4
<6816.1151.0> leveled_penciller:init/1 '-' 97723216 34386864 1 gen:do_call/4
<6816.1501.0> leveled_penciller:init/1 '-' 100634488 34386760 0 gen_server:loop/6
<6816.759.0> leveled_penciller:init/1 '-' 102692730 28655728 0 gen_server:loop/6
<6816.1867.0> leveled_penciller:init/1 '-' 100708933 23880104 1 io:wait_io_mon_reply/2
<6816.760.0> leveled_penciller:init/1 '-' 101042794 19900200 1 gen_server:loop/6
<6816.1127.0> leveled_penciller:init/1 '-' 101072696 16583712 0 gen:do_call/4
<6816.1523.0> leveled_penciller:init/1 '-' 101710445 16583712 0 gen:do_call/4
<6816.761.0> leveled_penciller:init/1 '-' 101727371 16583584 0 dict:maybe_expand_aux/2
<6816.1147.0> leveled_penciller:init/1 '-' 101437091 16583568 0 gen_server:loop/6
===============================================================================================================================
'[email protected]' 22:32:27
Load: cpu 92 Memory: total 11933060 binary 3745714
procs 22050 processes 6256184 code 16414
runq 13 atom 727 ets 1857837
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6816.1867.0> leveled_penciller:init/1 '-' 105129452 41263944 1 gen_server:loop/6
<6816.1873.0> leveled_penciller:init/1 '-' 100625652 41263840 0 gen_server:loop/6
<6816.1493.0> leveled_penciller:init/1 '-' 105661484 34386688 0 gen_server:loop/6
<6816.1882.0> leveled_penciller:init/1 '-' 103609840 34386688 0 gen_server:loop/6
<6816.1523.0> leveled_penciller:init/1 '-' 103802488 23880072 0 gen:do_call/4
<6816.1151.0> leveled_penciller:init/1 '-' 100007855 23879928 0 gen_server:loop/6
<6816.760.0> leveled_penciller:init/1 '-' 99651317 16583744 1 gen:do_call/4
<6816.759.0> leveled_penciller:init/1 '-' 101804936 16583568 0 gen_server:loop/6
<6816.1494.0> leveled_penciller:init/1 '-' 100101865 13820424 1 dict:on_bucket/3
<6816.1501.0> leveled_penciller:init/1 '-' 100864688 13819792 0 gen_server:loop/6
===============================================================================================================================
'[email protected]' 22:37:26
Load: cpu 93 Memory: total 12082859 binary 3719957
procs 22387 processes 6412275 code 16414
runq 20 atom 727 ets 1878931
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6816.1501.0> leveled_penciller:init/1 '-' 104484691 49516568 0 gen:do_call/4
<6816.759.0> leveled_penciller:init/1 '-' 106003999 49516424 0 gen_server:loop/6
<6816.1147.0> leveled_penciller:init/1 '-' 102794123 41263840 0 gen_server:loop/6
<6816.1867.0> leveled_penciller:init/1 '-' 99114913 34386864 1 gen:do_call/4
<6816.1493.0> leveled_penciller:init/1 '-' 102038118 34386688 0 gen_server:loop/6
<6816.761.0> leveled_penciller:init/1 '-' 102368120 28655728 0 gen_server:loop/6
<6816.1127.0> leveled_penciller:init/1 '-' 102123399 28655728 0 gen_server:loop/6
<6816.1523.0> leveled_penciller:init/1 '-' 104283115 16583568 0 gen_server:loop/6
<6816.1882.0> leveled_penciller:init/1 '-' 103704673 13819864 0 gen_server:loop/6
<6816.1151.0> leveled_penciller:init/1 '-' 104185132 13819792 0 gen_server:loop/6
Tests intermittently failing on binary_to_term:
Failure/Error: {error,badarg,
[{erlang,binary_to_term,
[<<131,80,0,0,8,62,120,1,173,212,79,104,211,80,
28,192,241,95,99,218,174,130,214,162,204,185,
191,8,211,58,80,73,154,216,54,206,226,16,166,
7,61,136,162,7,17,70,179,116,38,166,166,14,
210,193,192,127,7,209,57,84,80,152,32,3,81,17,
167,162,162,94,220,64,173,48,16,188,168,245,
224,65,180,238,160,34,195,50,231,208,131,66,
241,23,58,205,59,172,239,33,244,246,114,249,
240,229,247,126,121,105,0,168,211,57,157,215,
192,99,152,224,219,148,237,54,83,182,206,225,
209,22,187,12,203,86,1,234,79,155,192,111,77,
245,203,58,159,228,117,78,3,95,178,219,54,250,
82,26,212,24,86,143,97,25,118,191,6,1,43,211,
149,206,100,204,236,65,13,120,43,155,78,179,
208,82,25,141,32,202,85,11,109,88,101,130,23,
75,69,9,213,5,213,82,27,47,205,170,10,170,139,
171,165,182,212,150,85,73,64,181,185,106,234,
134,178,26,137,161,218,240,191,106,198,132,
249,229,13,16,4,33,98,66,0,103,137,39,81,144,
254,94,42,159,92,52,55,170,115,184,41,207,191,
88,60,100,239,230,126,226,151,87,123,26,72,
118,204,110,65,37,88,22,93,184,158,2,191,24,
87,155,247,230,167,29,54,52,58,193,98,37,81,
113,217,16,133,205,61,192,222,251,197,252,71,
132,61,133,239,63,152,176,20,113,225,54,10,
188,113,23,194,147,123,244,247,206,88,194,32,
177,96,89,142,186,112,11,5,158,167,33,124,224,
66,103,209,41,126,242,246,4,11,94,167,16,87,
23,164,192,19,117,106,251,215,99,191,157,9,
183,143,55,177,216,168,66,244,46,161,176,249,
1,236,221,169,47,63,137,176,111,234,78,144,5,
199,68,162,119,25,5,110,13,34,124,111,243,212,
85,103,194,135,126,141,176,97,98,39,42,252,
110,14,5,183,108,53,252,238,138,31,207,156,49,
144,97,177,113,57,230,94,220,90,74,111,232,38,
246,94,127,124,113,6,97,254,92,31,19,86,4,217,
133,69,10,188,254,184,122,184,52,88,66,214,
215,250,114,136,213,171,68,136,139,91,74,97,
11,71,176,247,97,226,245,160,211,59,106,60,
155,27,150,254,189,17,130,66,204,183,194,195,
131,20,192,237,48,194,197,222,221,5,252,242,
76,150,98,44,88,36,55,98,5,165,120,108,88,245,
111,147,135,145,229,222,244,238,99,178,50,177,
104,171,41,172,71,197,222,161,158,15,57,7,126,
180,240,21,11,150,162,196,27,177,134,2,207,20,
212,163,223,46,159,69,214,123,102,199,13,38,
27,39,246,97,37,133,109,108,194,222,182,120,
34,143,48,63,114,237,19,11,142,202,68,111,45,
5,174,73,32,188,165,243,212,121,132,253,219,
63,79,39,59,246,255,1,160,51,103,175>>],
[]},
{leveled_sst,deserialise_block,2,
[{file,
"/Users/martinsumner/dbroot/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,875}]},....
but:
Eshell V7.3 (abort with ^G)
1> B = <<131,80,0,0,8,62,120,1,173,212,79,104,211,80,
1> 28,192,241,95,99,218,174,130,214,162,204,185,
1> 191,8,211,58,80,73,154,216,54,206,226,16,166,
1> 7,61,136,162,7,17,70,179,116,38,166,166,14,
1> 210,193,192,127,7,209,57,84,80,152,32,3,81,17,
1> 167,162,162,94,220,64,173,48,16,188,168,245,
1> 224,65,180,238,160,34,195,50,231,208,131,66,
1> 241,23,58,205,59,172,239,33,244,246,114,249,
1> 240,229,247,126,121,105,0,168,211,57,157,215,
1> 192,99,152,224,219,148,237,54,83,182,206,225,
1> 209,22,187,12,203,86,1,234,79,155,192,111,77,
1> 245,203,58,159,228,117,78,3,95,178,219,54,250,
1> 82,26,212,24,86,143,97,25,118,191,6,1,43,211,
1> 149,206,100,204,236,65,13,120,43,155,78,179,
1> 208,82,25,141,32,202,85,11,109,88,101,130,23,
1> 75,69,9,213,5,213,82,27,47,205,170,10,170,139,
1> 171,165,182,212,150,85,73,64,181,185,106,234,
1> 134,178,26,137,161,218,240,191,106,198,132,
1> 249,229,13,16,4,33,98,66,0,103,137,39,81,144,
1> 254,94,42,159,92,52,55,170,115,184,41,207,191,
1> 88,60,100,239,230,126,226,151,87,123,26,72,
1> 118,204,110,65,37,88,22,93,184,158,2,191,24,
1> 87,155,247,230,167,29,54,52,58,193,98,37,81,
1> 113,217,16,133,205,61,192,222,251,197,252,71,
1> 132,61,133,239,63,152,176,20,113,225,54,10,
1> 188,113,23,194,147,123,244,247,206,88,194,32,
1> 177,96,89,142,186,112,11,5,158,167,33,124,224,
1> 66,103,209,41,126,242,246,4,11,94,167,16,87,
1> 23,164,192,19,117,106,251,215,99,191,157,9,
1> 183,143,55,177,216,168,66,244,46,161,176,249,
1> 1,236,221,169,47,63,137,176,111,234,78,144,5,
1> 199,68,162,119,25,5,110,13,34,124,111,243,212,
1> 85,103,194,135,126,141,176,97,98,39,42,252,
1> 110,14,5,183,108,53,252,238,138,31,207,156,49,
1> 144,97,177,113,57,230,94,220,90,74,111,232,38,
1> 246,94,127,124,113,6,97,254,92,31,19,86,4,217,
1> 133,69,10,188,254,184,122,184,52,88,66,214,
1> 215,250,114,136,213,171,68,136,139,91,74,97,
1> 11,71,176,247,97,226,245,160,211,59,106,60,
1> 155,27,150,254,189,17,130,66,204,183,194,195,
1> 131,20,192,237,48,194,197,222,221,5,252,242,
1> 76,150,98,44,88,36,55,98,5,165,120,108,88,245,
1> 111,147,135,145,229,222,244,238,99,178,50,177,
1> 104,171,41,172,71,197,222,161,158,15,57,7,126,
1> 180,240,21,11,150,162,196,27,177,134,2,207,20,
1> 212,163,223,46,159,69,214,123,102,199,13,38,
1> 27,39,246,97,37,133,109,108,194,222,182,120,
1> 34,143,48,63,114,237,19,11,142,202,68,111,45,
1> 5,174,73,32,188,165,243,212,121,132,253,219,
1> 63,79,39,59,246,255,1,160,51,103,175>>.
<<131,80,0,0,8,62,120,1,173,212,79,104,211,80,28,192,241,
95,99,218,174,130,214,162,204,185,191,8,211,...>>
2> binary_to_term(B).
[{{i,"Bucket",{"t1_int",6796},"Key4"},
{4,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",6910},"Key2"},
{2,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",6952},"Key13"},
{13,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",7326},"Key19"},
{19,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",7958},"Key30"},
{30,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",7996},"Key27"},
{27,{active,infinity},no_lookup,null}},
{{o,"Bucket0002","Key000103",null},
{16,{active,infinity},{51688,4139757173},{90488841,64}}},
{{o,"Bucket0002","Key000141",null},
{26,{active,infinity},{52931,509399537},{85047520,64}}},
{{o,"Bucket0002","Key000319",null},
{17,{active,infinity},{49074,3838963121},{31388405,64}}},
{{o,"Bucket0002","Key000332",null},
{41,{active,infinity},{16213,3714603754},{2555955,64}}},
{{o,"Bucket0002","Key000446",null},
{31,{active,infinity},{868,3980760685},{29284998,64}}},
{{o,"Bucket0002","Key000593",null},
{15,{active,infinity},{57368,1005355259},{87802653,64}}},
{{o,"Bucket0002","Key000696",null},
{20,{active,infinity},{53640,2267113555},{116370703,64}}},
{{o,"Bucket0002","Key000713",null},
{25,{active,infinity},{9231,2733590192},{8190631,64}}},
{{o,"Bucket0002","Key000719",null},
{30,{active,infinity},{43636,668770567},{40470639,64}}},
{{o,"Bucket0002","Key000847",null},
{46,{active,infinity},{4521,4086939046},{76641903,64}}},
{{o,"Bucket0002","Key000904",null},
{49,{active,infinity},{14980,2113833726},{103075733,64}}},
{{o,"Bucket0002","Key000926",null},
{23,{active,infinity},{56958,2329034167},{79194566,64}}},
{{o,"Bucket0003","Key000099",null},
{27,{active,infinity},{44071,3730207213},{32177719,64}}},
{{o,"Bucket0003","Key000113",null},
{37,{active,infinity},{47515,122434715},{47608167,64}}},
{{o,"Bucket0003","Key000143",null},
{44,{active,infinity},{354,3219089045},{45878992,64}}},
{{o,"Bucket0003","Key000362",null},
{45,{active,infinity},{62430,2146476174},{93147816,...}}},
{{o,"Bucket0003","Key000384",null},
{38,{active,infinity},{7197,...},{...}}},
{{o,"Bucket0003","Key000642",null},
{22,{active,...},{...},...}}]
3>
The original intention for supporting hashtree rebuilds was to modify the fold command called when prompting a rebuild, so that it could use a fold that was optimised within leveled - i.e. rather than use fold_objects, use a fold that would return just Keys and Clocks, the minimum information necessary to produce the rebuild.
However, Riak has now added the kv_sweeper. The sweeper now makes it harder to use a fold other than fold_objects as:
There are two possible changes which can be considered for resolving this.
In both cases the riak_kv_sweeper_fold:fold_req_fun/4 will need to be changed to handle the potential variance in response.
The second solution makes riak_kv_sweeper more generically efficient, in that all sweeps will be optimised where they do not require access to object values, but less efficient specifically in the case of hashtree rebuilds.
Allow compression in both SST and CDB to use snappy or lz4 (but default to some pure Erlang library compression)
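A sketch of how a pluggable method might look, with term_to_binary's built-in compressed option (zlib, so dependency-free though not strictly pure Erlang) as the default, and snappy/lz4 clauses left to a NIF library; serialise_block/2 and the method atoms are illustrative, not a proposed API:

```erlang
%% Illustrative compression switch for block serialisation.
serialise_block(Term, native) ->
    %% zlib compression built in to term_to_binary
    term_to_binary(Term, [{compressed, 6}]);
serialise_block(Term, none) ->
    term_to_binary(Term).
%% A snappy or lz4 clause would delegate to the relevant NIF
%% library rather than to anything in OTP.

deserialise_block(Bin) ->
    binary_to_term(Bin).
```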
Hello,
leveled is very interesting. A pure Erlang riak-core backend? Whaoo!
I didn't know about CDB datastores. Their page explains that they use 32-bit addresses, hence a 4GB limit.
Does this limit apply to a whole leveled node? Or is a leveled node a growing set of several CDB files?
Can one leveled node achieve some TB of standalone storage?
Can a 100-ish node leveled-powered riak-core cluster achieve some PB of cluster storage?
Happy design and coding
The use of Tags to indicate types, and the changing of behaviour by type (such as on recovery, inker compaction or penciller compaction), is a potentially powerful feature in leveled.
However, the implementation is a muddle within leveled_codec at the moment. It is not clear what is required for a new type, or what can be done with types.
What is the current status of leveled? Can it start to be used in small production deployments? Is there any release planned soon?
When running rebar3 ct on otp-18.3 (built from the erlang.org/downloads otp-18.3 src tar), all tests pass. The same command run with otp-19.2 fails one test:
%%% basic_SUITE ==> {failed,{{badmatch,false},
[{basic_SUITE,fetchput_snapshot,1,
[{file,"/Users/russell/dev/e/NHS/leveled/_build/test/lib/leveled/test/end_to_end/basic_SUITE.erl"},
{line,304}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1529}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1045}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,977}]}]}}
This issue is just a reminder to investigate.
When running in head_only mode, key listing will not work: when accumulating keys, only {B, K} is passed to the FoldFun (so the SubKey is ignored).
book_isempty currently only supports checking for empty when the buckets and keys are binary. It also doesn't work in head_only mode if, for every {B, K}, the first {B, K, SK} is not an active object.
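The arity mismatch can be sketched as follows - the fun shapes are illustrative of the problem, not the exact leveled accumulator code:

```erlang
%% Key listing passes only Bucket and Key to the fold fun...
FoldKeysFun =
    fun(B, K, Acc) -> [{B, K} | Acc] end.
%% ...so in head_only mode, where entries are really {B, K, SK},
%% the SubKey is lost. A head_only-aware fold would need to pass
%% the full triple:
FoldHeadsFun =
    fun({B, K, SK}, Acc) -> [{B, K, SK} | Acc] end.
```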
The bookie actor will use a non-normal terminate for destroy, but this is a potentially normal activity e.g. fallback partition closing.
eunit tests don't cope with the lack of O_SYNC support in OTP 16.
Wasn't sure how to handle this; however, I may be able to copy some of the stuff done by @Licenser using platform_define in rebar.config and ifdef.
Really should tackle this, as it absolutely must run in OTP16 at present to work in riak_kv. I've been a bit lazy not checking this each time before merging.
Also see #82
Since I started running tests regularly on a macbook, I started seeing another test intermittently fail - space_clear_on_delete in basic_SUITE. I initially put this down to timeouts - but that isn't the case. There is a genuine reason why the test fails, which is maybe a bug, but seemingly a safe one.
The penciller keeps up to the last 40K or so keys and values it has received in its L0 cache, until it is prompted (by a combination of hitting a threshold and some jitter magic) by a flush from above to write a new L0 file.
When a penciller is shut down, and if that shutdown is normal, it tries to write that cache to a new L0 file (assuming there isn't one already present), with the filename for that L0 file being set using the current Manifest SQN + 1. On startup it loads all the files from the manifest, and then looks for any L0 file at manifest SQN + 1. If there isn't one (perhaps because the shutdown wasn't smooth) - no worries. The penciller will start up from the highest SQN found in the sst files in the manifest, and then load the remainder from the journal (which is the WAL and the source of truth). So the penciller will restart correctly, but with a slower startup.
The space_clear_on_delete test loads a load of data, then deletes it all. It then checks that the ledger files have been removed. However, this relies on the last breadth functionality. To make sure compaction is triggered, it does a Penciller shutdown - which should empty the cache of any remaining deletes to a L0 file. It then restarts, causing the Penciller to find the L0 file... and within a few seconds of the restart the penciller will prompt a compaction from L0 to L1 - which will merge all the deletes and ultimately remove all the files.
The issue emerges because the clerk which performs a merge updates the persisted manifest before prompting the penciller to update its local manifest SQN (note - I think this is probably a good thing).
This means that a manifest could be persisted at SQN X during the shutdown, whilst the penciller shutting down thinks the SQN is X - 1, and so stores its last breadth as a L0 file at X, not X + 1. The L0 file is ignored at startup, the missing cache is rebuilt from the journal - and the penciller starts in the correct state.
However:
So the test does a listdir - and unexpectedly finds files still left after all the shutdown/open/shutdown routine.
A follow-up to the discussion in #82 about using EQC to test.
Placeholder for now, need to do some volume tests first.
Compaction of the Journal works functionally (according to ct tests). Leaving the database running overnight leads to the database being compacted. However, what happens if compaction coincides with load? Do the vnodes jitter between themselves sufficiently, or might they all try and compact at once?
Perhaps there may be other strategies; maybe every vnode should choose a compaction hour at random, and compaction should be a continuous process - but with only 1/24th of vnodes doing compaction at once?
Scoring for compaction may need to be addressed as well. Originally the idea was that a 20% saving was enough to justify compaction - but should this be set higher, especially as the non-active journal could be split onto separate mount points and run on cold/cheap/big-volume disks?
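The random-hour idea is simple to sketch; choose_compaction_hour/0 and should_compact/1 are illustrative of the proposal, not existing leveled behaviour:

```erlang
%% Each vnode picks a compaction hour at random when it starts,
%% so on average only 1/24th of vnodes are compacting at any time.
choose_compaction_hour() ->
    rand:uniform(24) - 1.    % an hour in 0..23

%% Run compaction continuously, but only during this vnode's
%% chosen hour.
should_compact(ChosenHour) ->
    {_Date, {Hour, _Min, _Sec}} = calendar:local_time(),
    Hour =:= ChosenHour.
```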
Question has been asked about using leveled with multiple processes. This issue is just to clarify the answer on this.
Fundamentally, leveled has been designed to have a single actor (the bookie gen_server) fronting all requests. There are multiple processes behind that single actor (the Penciller, the Inker, the Clerks, and a separate process for each file), so that background work does not lock the Bookie - but each request to the database must sit in a single queue until the bookie is free to serve it.
If there is intended to be a long-running request (e.g. to support a fold), a read-only point-in-time snapshot can be taken to support that request offline of the main bookie (so other requests only have to wait on the time to make the snapshot, not run the fold). The snapshots are point-in-time though, so won't be updated as new PUT requests are received by the Bookie.
There exists the potential for read requests to be managed in parallel to write requests - which would give more throughput in an environment with many CPU cores. However, the primary goal of the leveled project is for it to be used as a Riak backend, where there will always be multiple leveled instances on the machine - i.e. concurrency to make full use of the capacity of the machine is expected to be handled at the Riak level.
It would be possible, without too many code changes, to have a read-only PID to provide parallel access to leveled. A process could be started with access to the leveled_bookie's ETS table, and direct access to the Penciller and Inker PIDs. However, working on this is not a priority at present. As there still needs to be a single Penciller, without significant rework, this would only provide limited parallelisation for reads.
Leveled, depending on object and database size, can still potentially support thousands of ops per second. If the limitation of the single fronting PID (the bookie) is a constraint, then either splitting requests over multiple leveled instances (and of course Riak can do that for you), or an alternative database would probably be better options than trying to modify leveled for greater parallelisation.
All testing so far has been focused on 18.3.
Running the rebar3 ct test for riak_SUITE handoff in 18.3 it reports:
40000 objects loaded in 6.330256 seconds
But in 19.3 it reports
40000 objects loaded in 429.411223 seconds
!!!!
It would be better in many cases for an all object fold to be done by scanning the journal (checking to see if the Journal entry is up to date in the penciller), rather than by scanning the ledger and fetching only the required objects from the Journal.
The assumption is that read-ahead, as the Inker folds over the Journal, should mean that most of the activity can be achieved through a continuous read without disk seeks (also assuming the SQN checks in the Ledger are largely served out of cache).
1) leveled_sst:indexed_list_mixedkeys_bitflip_test/0
Failure/Error: {error,
{badmatch,
<<0,0,1,79,0,0,1,76,0,0,2,213,0,0,1,150,0,0,0,0,
59,140,84,131,162,160,253,130,174,188,53,201,
234,135,231,234,110,129,106,204,111,222,232,207,
109,139,62,139,31,175,241,239,181,235,45,232,67,
248,217,143,157,224,110,178,233,165,229,158,176,
198,240,212,94,175,109,138,123,254,18,225,178,
136,235,195,81,221,193,131,80,0,0,8,15,120,1,
173,213,185,78,195,64,16,6,224,177,179,118,14,
168,184,4,66,226,172,160,64,89,95,68,17,5,162,
69,188,2,145,137,141,188,216,178,41,236,72,169,
169,169,40,168,104,225,9,104,233,121,34,90,102,
52,91,35,138,233,182,250,52,251,207,111,111,5,0,
219,133,91,168,12,28,83,130,127,213,205,203,188,
45,92,60,182,122,102,234,54,237,149,160,174,243,
101,92,168,212,43,220,12,252,116,222,154,69,158,
193,192,212,247,166,54,237,50,131,97,221,204,
170,166,41,187,199,12,84,221,85,213,95,226,29,
128,163,25,77,16,245,197,208,105,9,30,78,26,5,
168,158,138,169,55,172,134,33,170,135,98,234,11,
171,113,132,106,34,165,186,1,171,122,130,234,
154,148,218,59,99,53,160,10,236,136,169,207,54,
87,74,224,72,74,85,159,118,86,234,192,150,148,
234,221,114,93,207,17,237,75,161,254,37,163,26,
81,71,10,237,43,123,127,82,55,197,212,87,86,147,
49,170,23,82,234,96,221,54,128,118,117,32,166,
30,219,4,104,89,187,98,234,19,171,17,125,89,99,
49,245,141,213,152,254,131,19,41,117,248,206,
170,166,109,141,196,212,47,155,0,53,235,68,76,
253,182,9,208,172,129,148,58,250,176,42,109,107,
42,166,254,176,26,208,172,27,82,234,10,190,186,
244,106,133,148,235,190,148,186,234,91,149,102,
221,251,183,250,240,11,141,195,85,149,246,213,0,
203,131,80,0,0,8,20,120,1,173,212,189,74,195,
112,20,5,240,107,136,197,210,86,171,181,26,4,
157,68,29,252,202,151,181,186,185,56,56,185,11,
... (several hundred further bytes of the corrupted binary slot omitted) ...
32,158>>},
[{leveled_sst,crc_check_slot,1,
[{file,
"/Users/russell/dev/e/NHS/learn_level/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,1525}]},
{leveled_sst,binaryslot_get,4,
[{file,
"/Users/russell/dev/e/NHS/learn_level/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,1374}]},
{leveled_sst,test_binary_slot,4,
[{file,
"/Users/russell/dev/e/NHS/learn_level/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,2351}]},
{leveled_sst,indexed_list_mixedkeys_bitflip_test,0,
[{file,
"/Users/russell/dev/e/NHS/learn_level/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,2326}]}]}
Output:
142 tests, 1 failures
```
Some OTP naivety on my part, treating terminate like init in reverse, has led to situations where close messages will receive an ACK before the close work has finished - and this may cause issues in fast close/delete/restart scenarios.
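The ordering problem can be sketched with Python threads standing in for OTP processes (names and structure are illustrative, not leveled's code): if the ACK is sent before the close work runs, the caller may proceed to delete/restart while that work is still in flight.

```python
import queue
import threading

def close_worker(inbox, ack, log, ack_before_work):
    """Simulate a server handling a 'close' message.

    ack_before_work=True mirrors the buggy ordering: the caller is ACKed
    before the close work (e.g. flushing/deleting files) has finished.
    """
    msg = inbox.get()
    assert msg == "close"
    if ack_before_work:
        ack.put("ok")              # caller may now delete/restart...
        log.append("close_work")   # ...while the work is still in flight
    else:
        log.append("close_work")   # finish the work first
        ack.put("ok")              # only then ACK

def run(ack_before_work):
    inbox, ack, log = queue.Queue(), queue.Queue(), []
    t = threading.Thread(target=close_worker,
                         args=(inbox, ack, log, ack_before_work))
    t.start()
    inbox.put("close")
    ack.get()                # caller proceeds as soon as the ACK arrives
    done_at_ack = bool(log)  # was the close work done when we got the ACK?
    t.join()
    return done_at_ack

print(run(ack_before_work=False))  # True - the safe ordering
print(run(ack_before_work=True))   # may print False - the race
```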
An SST file of 32K keys is made up of 256 slots of 128 Keys/MD each. Each slot is then sub-divided into 4 blocks of up to 32 Keys/MD.
The blocks are flat lists, stored using term_to_binary with compression enabled. The whole slot is then stored as a collection of blocks, pointers and lengths and has a CRC checksum added.
Currently for each fetch from a slot, the whole slot is read and CRC-checked in order to fetch the blocks required (which is normally only one of the blocks). Only the required blocks are then split out of the in-memory slot and examined using binary_to_term followed by lists:nth (as the index should have been specific about which block the key may be in).
If the block lengths and pointers were cached (at a cost of 1 bit per key), then reading the whole slot, and performing the CRC across the whole slot, can be avoided. The blocks are compressed using zlib, and the zlib format mandates its own checksum, so the fetch from disk (although almost certainly page cache) will still be check-summed.
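The layout and the checksum point above can be illustrated with a short Python sketch (the arithmetic and names are illustrative, not leveled's code). Part 1 locates the slot and block holding the Nth key of a 32K-key file; part 2 shows that zlib's mandated Adler-32 checksum catches corruption in a single compressed block, without a slot-wide CRC.

```python
import zlib

# Part 1: locate the slot/block for the Nth key in the layout described:
# 32768 keys = 256 slots x 128 keys, each slot = 4 blocks x 32 keys.
KEYS_PER_SLOT = 128
KEYS_PER_BLOCK = 32

def locate(key_index):
    slot = key_index // KEYS_PER_SLOT
    within_slot = key_index % KEYS_PER_SLOT
    block = within_slot // KEYS_PER_BLOCK
    pos_in_block = within_slot % KEYS_PER_BLOCK
    return slot, block, pos_in_block

print(locate(0))      # (0, 0, 0)
print(locate(1000))   # (7, 3, 8)

# Part 2: zlib's own checksum guards each compressed block.
block = zlib.compress(b"key/metadata" * 32)
corrupt = block[:-1] + bytes([block[-1] ^ 0xFF])   # flip a checksum byte
try:
    zlib.decompress(corrupt)
    detected = False
except zlib.error:
    detected = True
print(detected)   # True - corruption caught without a slot-wide CRC
```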
A new SST file has the blockindex_cache pre-loaded. An opened one (following a restart) doesn't - it loads due to keys being requested.
When leveled is used as an AAE store, there are no key fetches, and so the blockindex_cache is not loaded.
Perhaps load in the background after startup? Perhaps populate on first fetch_range?
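The two warming strategies can be sketched as follows (a hypothetical illustration in Python, not leveled's blockindex_cache code): build each slot's index entry lazily on first request, or eagerly walk all slots, e.g. in the background after startup.

```python
class SlotIndexCache:
    """Toy per-slot index cache, counting how often entries are built."""

    def __init__(self, n_slots):
        self.entries = [None] * n_slots
        self.builds = 0

    def _build(self, slot):          # stand-in for parsing the slot header
        self.builds += 1
        return {"slot": slot}

    def get(self, slot):             # lazy: build on first request only
        if self.entries[slot] is None:
            self.entries[slot] = self._build(slot)
        return self.entries[slot]

    def warm_all(self):              # eager: e.g. in the background after open
        for slot in range(len(self.entries)):
            self.get(slot)

cache = SlotIndexCache(4)
cache.get(2)
cache.get(2)
print(cache.builds)   # 1 - the second fetch hits the cache
cache.warm_all()
print(cache.builds)   # 4 - the remaining slots are built once each
```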
When implementing metrics to measure timings in leveled_cdb, the mechanism used there seemed less clunky than the existing sampling technique. It also made it easier to detect and resolve an inefficient part of the code.
Refactor all metric logging and sampling to follow the same pattern as leveled_cdb.
There are now 5 lines of code not covered by tests. Get them covered.
So far the process of completing test coverage has proven useful in at least forcing documentation of hard to reach places. So I think in this case 100% test coverage is a worthwhile target
Need to have a general tidy-up that will assist the dialyzer (e.g. use specs), and also help the reader (improve inline commenting)
Make it clearer what startup options there are.
Maybe try and get it working with cuttlefish in riak_kv
Can you describe what happens when a process holding a snapshot crashes? Will the snapshot be released?
A quick glance at the code shows that a penciller snapshot is started, and closed at the end of the runner operation. But what happens if the fold function, or the process that executes the runner, crashes? How is the timeout managed?
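One common answer in OTP systems is a monitor-based release: the store monitors the holder and drops the snapshot on a 'DOWN' message rather than relying on an explicit close. A minimal sketch of that pattern in Python terms (the Store/holder names are hypothetical, not leveled's API):

```python
class Store:
    """Toy store tracking one snapshot per holder."""

    def __init__(self):
        self.snapshots = {}                # holder_id -> snapshot state

    def take_snapshot(self, holder_id):
        self.snapshots[holder_id] = {"open": True}

    def on_holder_down(self, holder_id):   # analogous to a 'DOWN' message
        self.snapshots.pop(holder_id, None)

store = Store()
store.take_snapshot("runner-1")
store.on_holder_down("runner-1")   # holder crashed mid-fold
print(len(store.snapshots))        # 0 - snapshot released despite the crash
```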
Currently two important inputs to the journal compaction calculation are not configurable:
%% Sliding scale to allow preference of longer runs up to maximum
-define(SINGLEFILE_COMPACTION_TARGET, 40.0).
-define(MAXRUN_COMPACTION_TARGET, 70.0).
These should be configurable. I suspect that they are also set too low - perhaps they should be 50/75 by default. 25% is a significant recovery percentage for a long journal compaction run.
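A possible reading of the sliding scale is sketched below (the formula, the maximum run length, and the linear interpolation are all assumptions for illustration, not leveled's actual calculation): the score a run must beat scales from the single-file target up to the maximum-run target as the run gets longer.

```python
# Assumed constants, mirroring the defines quoted above.
SINGLEFILE_COMPACTION_TARGET = 40.0
MAXRUN_COMPACTION_TARGET = 70.0
MAX_RUN_LENGTH = 4   # hypothetical maximum run length

def score_target(run_length):
    """Linearly interpolate the target score for a run of the given length."""
    if MAX_RUN_LENGTH == 1:
        return SINGLEFILE_COMPACTION_TARGET
    step = (MAXRUN_COMPACTION_TARGET - SINGLEFILE_COMPACTION_TARGET) \
        / (MAX_RUN_LENGTH - 1)
    return SINGLEFILE_COMPACTION_TARGET + step * (run_length - 1)

print(score_target(1))   # 40.0 - a single file must beat the lower target
print(score_target(4))   # 70.0 - the longest runs are preferred
```

Raising the two constants to 50/75, as suggested, would shift this whole line upward without changing its shape.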
ct tests never try to create a third level in the merge tree - and the code blows up when trying to select a file in Level 2 to merge into Level 3.
2017-03-08 09:42:48.905 [error] <0.910.0> CRASH REPORT Process <0.910.0> with 0 neighbours exited with reason: bad argument in call to erlang:length({idxt,65,{{[{{i,<<"test">>,{<<"dateofbirth_bin">>,<<"1938-11-16|K7TxEg==">>},<<1,129,57,239>>},...},...],...},...}}) in leveled_pmanifest:mergefile_selector/2 line 261 in gen_server:terminate/6 line 744
The CDB file has an index of positions and counts with 256 members. This uses a list, but the code never needs to scan over this list, only to find specific indexes. This would be more efficient with an alternative data type.
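The cost difference can be illustrated in Python (a sketch of the complexity argument, not leveled's CDB code): reaching the Nth entry of a cons-style list takes N hops, while an array-backed structure reaches it directly.

```python
def nth_by_scan(lst, n):
    """Reach entry n the way a cons list forces: one hop at a time."""
    hops = 0
    it = iter(lst)
    entry = None
    for _ in range(n + 1):
        entry = next(it)
        hops += 1
    return entry, hops

# 256 (hash, position) entries, standing in for the CDB index.
index = [(h, h * 8) for h in range(256)]

entry, hops = nth_by_scan(index, 255)
print(hops)                  # 256 - hops needed to reach the last entry
print(entry == index[255])   # True - an array yields the same entry in O(1)
```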
Not all snapshots are being released in volume tests; some need to be timed out - but these don't seem to correlate with any query failures. Why might a snapshot not be released?
There is no is_empty check in Leveled, so as a workaround the riak_kv_leveled_backend does a bucket list (because bucket lists are fast) and looks for a non-zero number of buckets. However, bucket lists are not fast if there are lots of buckets - for example, where there is a bucket for each segment of an AAE tree.
Make an is-empty check that is always fast.
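One way to make the check always fast is to fold over keys but stop at the first active one, rather than enumerating every bucket. A minimal sketch of that idea (the key representation is an assumption for illustration):

```python
def is_empty(keys):
    """keys: iterable of (key, active) pairs, e.g. a lazy ledger fold.

    any() short-circuits, so this stops at the first active key found -
    constant work for a non-empty store, regardless of bucket count.
    """
    return not any(active for _key, active in keys)

print(is_empty([]))                              # True  - nothing at all
print(is_empty([("k1", False), ("k2", True)]))   # False - stops at k2
```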
The Ledger Cache is kept small (up to about 2000 entries) as when a clone is required, this cache goes through ets:to_list to allow a snapshot to exist in the clone.
For short-lived 2i queries, this is probably a poor trade off. It would probably be better to run the query against the ETS table and push the results downstream to the clone, rather than push the whole cache.
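The trade-off can be sketched in Python, with a dict standing in for the ETS table (names and sizes are illustrative, not leveled's code): copying the whole cache moves every entry, while running the query first pushes only the matching keys downstream.

```python
# A ~2000-entry cache, standing in for the Ledger Cache ets table.
cache = {f"key{i}": {"idx": i % 3} for i in range(2000)}

def snapshot_whole_cache():
    """Current flow: an ets:to_list-style full copy for the clone."""
    return dict(cache)

def snapshot_query_results(idx_value):
    """Proposed flow: run the 2i-style match first, push only results."""
    return [k for k, v in cache.items() if v["idx"] == idx_value]

print(len(snapshot_whole_cache()))      # 2000 entries copied to the clone
print(len(snapshot_query_results(0)))   # 667 - only matching keys pushed
```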
Why?
F............
Failures:
1) leveled_iclerk:schedule_test/0
Failure/Error: ?assertMatch(true, SecondsToCompaction1 =< 84780)
expected: = true
got: false
%% /Users/martinsumner/dbroot/leveled/_build/test/lib/leveled/src/leveled_iclerk.erl:708:in `leveled_iclerk:schedule_test/0`
Output: Seconds to compaction 3543
Seconds to compaction 84815
Currently the Penciller expects to receive the Bookie's Ledger Cache as some sort of leveled_tree implementation (accompanied by an index of its entries). To achieve this the Bookie first converts the ets table into a list, then into the tree, then sends the tree in the push_mem message - and then forgets about it.
How does Erlang handle this? When transferring objects between gen_server/fsm processes we have seen delays proportional to the size of the object - it isn't just a transfer of a reference. Could the Penciller instead expect an ets table, and do this conversion itself, so that there is no need for both Bookie and Penciller to hold a copy of the converted tree (and no need for the potential overhead of passing the tree)?
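The two flows can be sketched side by side (a Python illustration of the proposal, not leveled's code): either the Bookie converts and sends a copy, or the Penciller receives a handle to the shared table and converts once itself.

```python
# Stands in for the ets table holding the Ledger Cache.
cache_table = {"a": 1, "c": 3, "b": 2}

def bookie_push_tree():
    """Current flow: Bookie converts (ets -> list -> tree) and sends
    the result in push_mem, so both sides briefly hold a copy."""
    return sorted(cache_table.items())

def penciller_pull(table):
    """Proposed flow: Penciller receives the table reference and does
    the conversion itself - one conversion, no tree passed around."""
    return sorted(table.items())

# Either flow yields the same tree; only where it is built differs.
print(bookie_push_tree() == penciller_pull(cache_table))   # True
```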
Experiment required to prove TicTac caches.
New style of PUT required which provides vclock change and a partition ID. That vclock change should then be reflected in a cached TicTac tree for that partition ID.
There should be an ability to request the cached TicTac tree for any partition - by root, buckets or segments.
There should be an ability to return keys and clocks by segment ID.
The cached TicTac tree should recover back to a consistent point on startup.
There should be a capability to compare the cached TicTac tree with a snapshot of the store - and mark any differences as dirty segments.
There should be a capability to repair a group of dirty segments (where the hash represented in the cache may not reflect the accumulated hash on disk).
There should be a capability to trigger a build of a TicTac cached tree if on startup the tree is empty, but the store is not empty.
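The requirements above can be sketched in miniature (assumed mechanics, not a TicTac specification: one hash per segment, updated by XOR so that re-applying a change cancels it, with comparison yielding the dirty segments):

```python
import zlib

SEGMENTS = 64   # hypothetical tree width

def segment(key):
    """Map a key to its segment (crc32 stands in for the real hash)."""
    return zlib.crc32(key) % SEGMENTS

def apply_change(tree, key, old_clock_hash, new_clock_hash):
    """Fold a vclock change into the cached tree: XOR out the old
    clock's hash and XOR in the new one."""
    s = segment(key)
    tree[s] ^= old_clock_hash ^ new_clock_hash
    return tree

def dirty_segments(cached, rebuilt):
    """Compare the cached tree with one rebuilt from a store snapshot."""
    return [s for s in range(SEGMENTS) if cached[s] != rebuilt[s]]

cached = [0] * SEGMENTS
cached = apply_change(cached, b"bucket/key1", 0, 0xABCD)  # new key written
rebuilt = [0] * SEGMENTS     # snapshot rebuild that missed the write
print(dirty_segments(cached, rebuilt))   # the one segment holding key1
```

A "repair" of a dirty segment would then mean fetching keys and clocks for that segment ID and recomputing its hash from the store.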
During early volume tests there was an issue whereby the Penciller Clerk would take the same work twice. It would do the work the second time as expected, but when the Penciller called back to prompt deletions, the dictionary for that Manifest SQN had already been emptied.
This caused the Penciller's Clerk to crash. As this is not a supervised process, the vnode carried on working, but without merging new entries. This worked fine until the Penciller process hit high memory watermarks and crashed. The absence of the Penciller process caused the Bookie to crash on the next call, and then this crashed the vnode process.
Riak Core then worked as expected, the vnode restarted, this restarted the bookie, which reloaded the lost penciller state from the ledger - and everything went back to normal.
In this test two vnodes were impacted at the start of the test, and the impact of the restart can be seen at the end.
This was initially resolved by placing a soft lock in the Penciller through the State#state.work_ongoing boolean - so that if work is ongoing it should never tee up more work.
However, this has still happened once since then.
There are a number of things to fix:
The problem could be made to go away by swapping the dict:fetch for a dict:find, but I think it is right for this to crash, as it is an unexpected event
Manifest files aren't deleted after we're finished with them. After a 24 hour test have over 2000 ledger manifest files per partition - these need to be tidied up.
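The dict:fetch versus dict:find distinction in the first fix above can be put in Python terms (a sketch of the semantics, not leveled's code): one lookup crashes on a missing key, the other returns a sentinel and lets execution continue.

```python
manifest_work = {1: "merge"}   # stands in for the Manifest SQN dictionary

def fetch(sqn):
    """Like dict:fetch - raises on a missing key, surfacing the bug."""
    return manifest_work[sqn]

def find(sqn):
    """Like dict:find - returns 'error' instead of crashing."""
    return manifest_work.get(sqn, "error")

print(find(2))        # 'error' - silently swallows the unexpected state
try:
    fetch(2)
except KeyError:
    print("crashed")  # the crash makes the unexpected event visible
```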
Mistakenly believed that macros were "compiled" in a sense they're not. So within leveled_log, every time a log is called it does a dict:from_list on the list.
At least it would be better to keep it as a list and do lists:keyfind.
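The two approaches can be contrasted in Python (an illustration of the cost argument; the log entries are made up): rebuilding a dict from the list on every call versus a single keyfind-style scan of the list as it stands.

```python
LOGS = [("B0001", "info", "Bookie starting"),
        ("P0001", "info", "Penciller starting"),
        ("I0001", "info", "Inker starting")]

def log_via_fresh_dict(ref):
    """Current behaviour: a dict:from_list-style rebuild on every call."""
    return {k: (lvl, txt) for k, lvl, txt in LOGS}[ref]

def log_via_keyfind(ref):
    """Proposed: lists:keyfind-style scan of the list in place."""
    for k, lvl, txt in LOGS:
        if k == ref:
            return (lvl, txt)
    return None

print(log_via_fresh_dict("P0001"))   # ('info', 'Penciller starting')
print(log_via_keyfind("P0001"))      # same result, nothing rebuilt per call
```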
Leveled not compiling on OTP 16 due to use of array:array() in type.
There needs to be some logging of how fast the different parts of the CDB get process are. Perhaps large cycle counts are common, or reading the index is inefficient - there is no clue at the moment as to where time may be being spent, and whether this is therefore working efficiently.