martinsumner / leveled
A pure Erlang Key/Value store - based on a LSM-tree, optimised for HEAD requests
License: Apache License 2.0
Initially leveled has been developed to support any Erlang term as a Key or Bucket. In Riak, these will also be binaries.
The openness to any Erlang term, though, made defining an end key for things such as an all-bucket query hard - as it was not possible to be sure what would be bigger than any term.
So matching on an EndKey doesn't use a simple > guard; it must use the ugly concept of leveled_codec:endkey_passed.
Really, keys should just be forced to be binaries, and then nil can be used in end keys as something greater than all binaries.
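As a sketch of the concept (not the exact leveled_codec implementation), the end-key check has to be a function rather than a guard, with a special atom standing in for "no end key":

```erlang
%% Sketch only - the real leveled_codec:endkey_passed/2 handles
%% composite ledger keys; this just shows why a plain > guard is
%% not enough when `all` must mean "no end key".
endkey_passed(all, _CheckingKey) ->
    false;
endkey_passed(EndKey, CheckingKey) ->
    EndKey < CheckingKey.
```

If keys were forced to be binaries, a query could instead construct an end key guaranteed to sort after every real key, and use an ordinary comparison.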
recent_aae_expiry test failed:
%%% tictac_SUITE ==> {failed,{{badmatch,false},
[{tictac_SUITE,recent_aae_expiry,1,
[{file,"/Users/martinsumner/dbroot/leveled/_build/test/lib/leveled/test/end_to_end/tictac_SUITE.erl"},
{line,885}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1533}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1053}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,985}]}]}}
Failed 1 tests. Passed 22 tests.
I'm assuming this is intermittent at present. Re-testing to confirm. The speculative assumption is that it may be a timer-sleep timing issue - @Licenser has noted problems elsewhere with this.
I don't think this test would have been run more than a couple of times before I added it.
The binary bucket list type query (which underpins list_buckets) doesn't check if the key is active before returning a bucket
Penciller follows the original advertised behaviour of leveldb in increasing the size of each level by an order of magnitude (actually a factor of 8).
In Basho's modified leveldb this isn't followed. In part this is because of the use of overlapping files at Level 1, but it remains narrow beyond Level 2, increasing by only a factor of 2 from Level 2 to Level 3 - but then by a factor of 12 from Level 3 to Level 4 and beyond.
Level | Level Size | Cumulative Size | Cumulative with AAE |
---|---|---|---|
0 | 360 | 360 | 720 |
1 | 2,160 | 2,520 | 5,040 |
2 | 2,940 | 5,460 | 10,920 |
3 | 6,144 | 11,604 | 23,208 |
4 | 122,880 | 134,484 | 268,968 |
5 | 2,362,232 | 2,496,716 | 4,993,432 |
6 | not limited | not limited | not limited |
I haven't been able to construct a model to understand this - to try and reason about what the right shape of the merged tree is. But some experimentation seems worthwhile.
Sizes in leveled penciller are currently defined as a list of {level, FileCount}
tuples:
-define(LEVEL_SCALEFACTOR, [{0, 0}, {1, 8}, {2, 64}, {3, 512}, {4, 4096}, {5, 32768}, {6, 262144}, {7, infinity}]).
Level 5 allows for 1 trillion keys. Level 6 and Level 7 seem a little superfluous in this case.
An alternative might be:
-define(LEVEL_SCALEFACTOR, [{0, 0}, {1, 4}, {2, 16}, {3, 64}, {4, 512}, {5, 4096}, {6, 32768}, {7, infinity}]).
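As a quick sanity check when comparing alternatives like these, the total bounded file count implied by a scale factor list can be summed with a small fold, skipping the infinity level (cumulative_files/1 is an illustrative name, not part of leveled):

```erlang
%% Sum the bounded per-level file counts in a
%% LEVEL_SCALEFACTOR-style list, ignoring non-integer sizes such
%% as infinity.
cumulative_files(ScaleFactor) ->
    lists:foldl(
        fun({_Level, N}, Acc) when is_integer(N) -> Acc + N;
           ({_Level, _Unbounded}, Acc) -> Acc
        end,
        0,
        ScaleFactor).

%% cumulative_files([{0, 0}, {1, 8}, {2, 64}, {3, 512},
%%                   {4, 4096}, {5, 32768}, {6, 262144},
%%                   {7, infinity}]) returns 299592.
```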
See martinsumner/kv_index_tictactree#24
Need to make sure simple leveled functionality exists to verify indexes, and prompt a keystore to be rebuilt whilst a store is offline.
For the first part, need to return a folder that will first fold over objects (folding over the Journal, not the Ledger), building an AAE tree for the index entries of active objects. It should then fold over all index entries in the Ledger to provide a second AAE tree, and then compare the two AAE trees - logging a warning if they don't match. Use SnapPreFold = true for both folds, so they can be taken at a consistent point in time.
For the second part need to be able to call book_offlinerebuild(RootPath, ArchivePath), and have that archive the old Ledger, and then open the store with an empty Ledger, and then close again once it is rebuilt.
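The first part might be shaped something like the following - the function names here (journal_index_tree_fold/1, ledger_index_tree_fold/1, compare_trees/2) are illustrative placeholders, not the actual leveled or leveled_tictac API:

```erlang
%% Hypothetical sketch of the verification flow: build one AAE
%% tree from the Journal and one from the Ledger, then compare.
verify_index(Bookie) ->
    %% Both folds requested with SnapPreFold = true, so the two
    %% snapshots are taken at a consistent point in time.
    {async, JournalFolder} = journal_index_tree_fold(Bookie),
    {async, LedgerFolder} = ledger_index_tree_fold(Bookie),
    JournalTree = JournalFolder(),
    LedgerTree = LedgerFolder(),
    case compare_trees(JournalTree, LedgerTree) of
        [] ->
            ok;
        DirtySegments ->
            {warning, {trees_differ, DirtySegments}}
    end.
```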
Currently leveled will only compile for OTP20 by disabling warnings-as-errors. Once compiled, it appears to show some signs of measurable speed-ups.
OTP21 comes with potentially greater enhancements.
Need to find some way of supporting R16 through to OTP21 in the short term. Either convert gen_fsm to gen_server or to https://gitlab.com/Project-FiFo/gen_fsm_compat
It may be that compaction has a memory leak
Load: cpu 17 Memory: total 23476453 binary 17417838
procs 18034 processes 5858881 code 15786
runq 0 atom 719 ets 56323
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6938.30.0> user_drv '-' 120060 715148488 2465279 erlang:bif_return_trap/1
<6938.4742.1815> leveled_penciller:init/1 '-' 0 59419528 0 gen_server:loop/6
<6938.11643.4457> leveled_penciller:init/1 '-' 0 49516424 0 gen_server:loop/6
<6938.763.0> leveled_penciller:init/1 '-' 1934877 41263840 0 gen_server:loop/6
<6938.1503.0> leveled_penciller:init/1 '-' 2262527 41263840 0 gen_server:loop/6
<6938.1885.0> leveled_penciller:init/1 '-' 1594216 41263840 0 gen_server:loop/6
<6938.1157.0> leveled_penciller:init/1 '-' 2293928 34386688 0 gen_server:loop/6
<6938.1250.2186> leveled_penciller:init/1 '-' 0 34386688 0 gen_server:loop/6
<6938.6873.1992> leveled_penciller:init/1 '-' 0 34386688 0 gen_server:loop/6
<6938.15673.2775> leveled_penciller:init/1 '-' 0 34386688 0 gen_server:loop/6
Since increasing the pace of compaction, top shows the beam taking > 60% of memory on some nodes (normally for equivalent tests it would be about 10%).
The constant database files are not genuine constant database files, as the CRC check is wrapped in to the CDB file. This means that the journal file could not be interrogated by another CDB library.
This is unnecessary. The CRC is at the wrong layer of abstraction: it should be added prior to pushing by the Inker, and when scanning, the Inker should provide a validate function to the CDB process that does the CRC check.
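A validate function of the kind proposed might look something like this - a sketch only, assuming the CRC is a 32-bit prefix on the stored value (the framing here is illustrative, not the current Journal format):

```erlang
%% Illustrative validate fun the Inker could pass to the CDB
%% process: check a 32-bit CRC prefix against the value that
%% follows it.
ValidateFun =
    fun(<<CRC:32/integer, Value/binary>>) ->
            case erlang:crc32(Value) of
                CRC -> {ok, Value};
                _ -> {error, crc_failure}
            end;
       (_Other) ->
            {error, bad_framing}
    end.
```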
I see that leveled can store terms attached to an index - are there any optimisations to use it instead of having special keys for it?
It is expected that the memory consumed by processes grows with the database - in essence in line with the number of SST processes. However, I do not expect the binary heap or the ETS memory to grow perpetually. Yet (from about 5 hours into a load test)...
sudo rel/riak/bin/riak-admin top -interval 300 -sort memory -lines 10
===============================================================================================================================
'[email protected]' 22:22:27
Load: cpu 0 Memory: total 11535910 binary 3587950
procs 21334 processes 6067068 code 16414
runq 13 atom 727 ets 1809045
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6816.1867.0> leveled_penciller:init/1 '-' 7517654176 49516496 0 gen_server:loop/6
<6816.760.0> leveled_penciller:init/1 '-' 7425504054 34386688 0 gen_server:loop/6
<6816.761.0> leveled_penciller:init/1 '-' 7547983471 28655728 0 gen_server:loop/6
<6816.1147.0> leveled_penciller:init/1 '-' 7555039012 23880072 0 gen:do_call/4
<6816.1882.0> leveled_penciller:init/1 '-' 7529623803 23880032 1 gen_server:loop/6
<6816.1127.0> leveled_penciller:init/1 '-' 7577043611 23879928 0 gen_server:loop/6
<6816.1523.0> leveled_penciller:init/1 '-' 7558402565 19900272 1 gen:do_call/4
<6816.1494.0> leveled_penciller:init/1 '-' 7567906914 19900096 0 gen_server:loop/6
<6816.1493.0> leveled_penciller:init/1 '-' 7596748048 16583744 1 gen:do_call/4
<6816.1501.0> leveled_penciller:init/1 '-' 7324700860 7998096 0 gen:do_call/4
===============================================================================================================================
'[email protected]' 22:27:27
Load: cpu 92 Memory: total 11690010 binary 3611813
procs 21684 processes 6171523 code 16414
runq 11 atom 727 ets 1832505
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6816.1494.0> leveled_penciller:init/1 '-' 104484942 49516600 1 gen:do_call/4
<6816.1151.0> leveled_penciller:init/1 '-' 97723216 34386864 1 gen:do_call/4
<6816.1501.0> leveled_penciller:init/1 '-' 100634488 34386760 0 gen_server:loop/6
<6816.759.0> leveled_penciller:init/1 '-' 102692730 28655728 0 gen_server:loop/6
<6816.1867.0> leveled_penciller:init/1 '-' 100708933 23880104 1 io:wait_io_mon_reply/2
<6816.760.0> leveled_penciller:init/1 '-' 101042794 19900200 1 gen_server:loop/6
<6816.1127.0> leveled_penciller:init/1 '-' 101072696 16583712 0 gen:do_call/4
<6816.1523.0> leveled_penciller:init/1 '-' 101710445 16583712 0 gen:do_call/4
<6816.761.0> leveled_penciller:init/1 '-' 101727371 16583584 0 dict:maybe_expand_aux/2
<6816.1147.0> leveled_penciller:init/1 '-' 101437091 16583568 0 gen_server:loop/6
===============================================================================================================================
'[email protected]' 22:32:27
Load: cpu 92 Memory: total 11933060 binary 3745714
procs 22050 processes 6256184 code 16414
runq 13 atom 727 ets 1857837
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6816.1867.0> leveled_penciller:init/1 '-' 105129452 41263944 1 gen_server:loop/6
<6816.1873.0> leveled_penciller:init/1 '-' 100625652 41263840 0 gen_server:loop/6
<6816.1493.0> leveled_penciller:init/1 '-' 105661484 34386688 0 gen_server:loop/6
<6816.1882.0> leveled_penciller:init/1 '-' 103609840 34386688 0 gen_server:loop/6
<6816.1523.0> leveled_penciller:init/1 '-' 103802488 23880072 0 gen:do_call/4
<6816.1151.0> leveled_penciller:init/1 '-' 100007855 23879928 0 gen_server:loop/6
<6816.760.0> leveled_penciller:init/1 '-' 99651317 16583744 1 gen:do_call/4
<6816.759.0> leveled_penciller:init/1 '-' 101804936 16583568 0 gen_server:loop/6
<6816.1494.0> leveled_penciller:init/1 '-' 100101865 13820424 1 dict:on_bucket/3
<6816.1501.0> leveled_penciller:init/1 '-' 100864688 13819792 0 gen_server:loop/6
===============================================================================================================================
'[email protected]' 22:37:26
Load: cpu 93 Memory: total 12082859 binary 3719957
procs 22387 processes 6412275 code 16414
runq 20 atom 727 ets 1878931
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
-------------------------------------------------------------------------------------------------------------------------------
<6816.1501.0> leveled_penciller:init/1 '-' 104484691 49516568 0 gen:do_call/4
<6816.759.0> leveled_penciller:init/1 '-' 106003999 49516424 0 gen_server:loop/6
<6816.1147.0> leveled_penciller:init/1 '-' 102794123 41263840 0 gen_server:loop/6
<6816.1867.0> leveled_penciller:init/1 '-' 99114913 34386864 1 gen:do_call/4
<6816.1493.0> leveled_penciller:init/1 '-' 102038118 34386688 0 gen_server:loop/6
<6816.761.0> leveled_penciller:init/1 '-' 102368120 28655728 0 gen_server:loop/6
<6816.1127.0> leveled_penciller:init/1 '-' 102123399 28655728 0 gen_server:loop/6
<6816.1523.0> leveled_penciller:init/1 '-' 104283115 16583568 0 gen_server:loop/6
<6816.1882.0> leveled_penciller:init/1 '-' 103704673 13819864 0 gen_server:loop/6
<6816.1151.0> leveled_penciller:init/1 '-' 104185132 13819792 0 gen_server:loop/6
Tests intermittently failing on binary_to_term:
Failure/Error: {error,badarg,
[{erlang,binary_to_term,
[<<131,80,0,0,8,62,120,1,173,212,79,104,211,80,
28,192,241,95,99,218,174,130,214,162,204,185,
191,8,211,58,80,73,154,216,54,206,226,16,166,
7,61,136,162,7,17,70,179,116,38,166,166,14,
210,193,192,127,7,209,57,84,80,152,32,3,81,17,
167,162,162,94,220,64,173,48,16,188,168,245,
224,65,180,238,160,34,195,50,231,208,131,66,
241,23,58,205,59,172,239,33,244,246,114,249,
240,229,247,126,121,105,0,168,211,57,157,215,
192,99,152,224,219,148,237,54,83,182,206,225,
209,22,187,12,203,86,1,234,79,155,192,111,77,
245,203,58,159,228,117,78,3,95,178,219,54,250,
82,26,212,24,86,143,97,25,118,191,6,1,43,211,
149,206,100,204,236,65,13,120,43,155,78,179,
208,82,25,141,32,202,85,11,109,88,101,130,23,
75,69,9,213,5,213,82,27,47,205,170,10,170,139,
171,165,182,212,150,85,73,64,181,185,106,234,
134,178,26,137,161,218,240,191,106,198,132,
249,229,13,16,4,33,98,66,0,103,137,39,81,144,
254,94,42,159,92,52,55,170,115,184,41,207,191,
88,60,100,239,230,126,226,151,87,123,26,72,
118,204,110,65,37,88,22,93,184,158,2,191,24,
87,155,247,230,167,29,54,52,58,193,98,37,81,
113,217,16,133,205,61,192,222,251,197,252,71,
132,61,133,239,63,152,176,20,113,225,54,10,
188,113,23,194,147,123,244,247,206,88,194,32,
177,96,89,142,186,112,11,5,158,167,33,124,224,
66,103,209,41,126,242,246,4,11,94,167,16,87,
23,164,192,19,117,106,251,215,99,191,157,9,
183,143,55,177,216,168,66,244,46,161,176,249,
1,236,221,169,47,63,137,176,111,234,78,144,5,
199,68,162,119,25,5,110,13,34,124,111,243,212,
85,103,194,135,126,141,176,97,98,39,42,252,
110,14,5,183,108,53,252,238,138,31,207,156,49,
144,97,177,113,57,230,94,220,90,74,111,232,38,
246,94,127,124,113,6,97,254,92,31,19,86,4,217,
133,69,10,188,254,184,122,184,52,88,66,214,
215,250,114,136,213,171,68,136,139,91,74,97,
11,71,176,247,97,226,245,160,211,59,106,60,
155,27,150,254,189,17,130,66,204,183,194,195,
131,20,192,237,48,194,197,222,221,5,252,242,
76,150,98,44,88,36,55,98,5,165,120,108,88,245,
111,147,135,145,229,222,244,238,99,178,50,177,
104,171,41,172,71,197,222,161,158,15,57,7,126,
180,240,21,11,150,162,196,27,177,134,2,207,20,
212,163,223,46,159,69,214,123,102,199,13,38,
27,39,246,97,37,133,109,108,194,222,182,120,
34,143,48,63,114,237,19,11,142,202,68,111,45,
5,174,73,32,188,165,243,212,121,132,253,219,
63,79,39,59,246,255,1,160,51,103,175>>],
[]},
{leveled_sst,deserialise_block,2,
[{file,
"/Users/martinsumner/dbroot/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,875}]},....
but:
Eshell V7.3 (abort with ^G)
1> B = <<131,80,0,0,8,62,120,1,173,212,79,104,211,80,
1> 28,192,241,95,99,218,174,130,214,162,204,185,
1> 191,8,211,58,80,73,154,216,54,206,226,16,166,
1> 7,61,136,162,7,17,70,179,116,38,166,166,14,
1> 210,193,192,127,7,209,57,84,80,152,32,3,81,17,
1> 167,162,162,94,220,64,173,48,16,188,168,245,
1> 224,65,180,238,160,34,195,50,231,208,131,66,
1> 241,23,58,205,59,172,239,33,244,246,114,249,
1> 240,229,247,126,121,105,0,168,211,57,157,215,
1> 192,99,152,224,219,148,237,54,83,182,206,225,
1> 209,22,187,12,203,86,1,234,79,155,192,111,77,
1> 245,203,58,159,228,117,78,3,95,178,219,54,250,
1> 82,26,212,24,86,143,97,25,118,191,6,1,43,211,
1> 149,206,100,204,236,65,13,120,43,155,78,179,
1> 208,82,25,141,32,202,85,11,109,88,101,130,23,
1> 75,69,9,213,5,213,82,27,47,205,170,10,170,139,
1> 171,165,182,212,150,85,73,64,181,185,106,234,
1> 134,178,26,137,161,218,240,191,106,198,132,
1> 249,229,13,16,4,33,98,66,0,103,137,39,81,144,
1> 254,94,42,159,92,52,55,170,115,184,41,207,191,
1> 88,60,100,239,230,126,226,151,87,123,26,72,
1> 118,204,110,65,37,88,22,93,184,158,2,191,24,
1> 87,155,247,230,167,29,54,52,58,193,98,37,81,
1> 113,217,16,133,205,61,192,222,251,197,252,71,
1> 132,61,133,239,63,152,176,20,113,225,54,10,
1> 188,113,23,194,147,123,244,247,206,88,194,32,
1> 177,96,89,142,186,112,11,5,158,167,33,124,224,
1> 66,103,209,41,126,242,246,4,11,94,167,16,87,
1> 23,164,192,19,117,106,251,215,99,191,157,9,
1> 183,143,55,177,216,168,66,244,46,161,176,249,
1> 1,236,221,169,47,63,137,176,111,234,78,144,5,
1> 199,68,162,119,25,5,110,13,34,124,111,243,212,
1> 85,103,194,135,126,141,176,97,98,39,42,252,
1> 110,14,5,183,108,53,252,238,138,31,207,156,49,
1> 144,97,177,113,57,230,94,220,90,74,111,232,38,
1> 246,94,127,124,113,6,97,254,92,31,19,86,4,217,
1> 133,69,10,188,254,184,122,184,52,88,66,214,
1> 215,250,114,136,213,171,68,136,139,91,74,97,
1> 11,71,176,247,97,226,245,160,211,59,106,60,
1> 155,27,150,254,189,17,130,66,204,183,194,195,
1> 131,20,192,237,48,194,197,222,221,5,252,242,
1> 76,150,98,44,88,36,55,98,5,165,120,108,88,245,
1> 111,147,135,145,229,222,244,238,99,178,50,177,
1> 104,171,41,172,71,197,222,161,158,15,57,7,126,
1> 180,240,21,11,150,162,196,27,177,134,2,207,20,
1> 212,163,223,46,159,69,214,123,102,199,13,38,
1> 27,39,246,97,37,133,109,108,194,222,182,120,
1> 34,143,48,63,114,237,19,11,142,202,68,111,45,
1> 5,174,73,32,188,165,243,212,121,132,253,219,
1> 63,79,39,59,246,255,1,160,51,103,175>>.
<<131,80,0,0,8,62,120,1,173,212,79,104,211,80,28,192,241,
95,99,218,174,130,214,162,204,185,191,8,211,...>>
2> binary_to_term(B).
[{{i,"Bucket",{"t1_int",6796},"Key4"},
{4,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",6910},"Key2"},
{2,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",6952},"Key13"},
{13,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",7326},"Key19"},
{19,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",7958},"Key30"},
{30,{active,infinity},no_lookup,null}},
{{i,"Bucket",{"t1_int",7996},"Key27"},
{27,{active,infinity},no_lookup,null}},
{{o,"Bucket0002","Key000103",null},
{16,{active,infinity},{51688,4139757173},{90488841,64}}},
{{o,"Bucket0002","Key000141",null},
{26,{active,infinity},{52931,509399537},{85047520,64}}},
{{o,"Bucket0002","Key000319",null},
{17,{active,infinity},{49074,3838963121},{31388405,64}}},
{{o,"Bucket0002","Key000332",null},
{41,{active,infinity},{16213,3714603754},{2555955,64}}},
{{o,"Bucket0002","Key000446",null},
{31,{active,infinity},{868,3980760685},{29284998,64}}},
{{o,"Bucket0002","Key000593",null},
{15,{active,infinity},{57368,1005355259},{87802653,64}}},
{{o,"Bucket0002","Key000696",null},
{20,{active,infinity},{53640,2267113555},{116370703,64}}},
{{o,"Bucket0002","Key000713",null},
{25,{active,infinity},{9231,2733590192},{8190631,64}}},
{{o,"Bucket0002","Key000719",null},
{30,{active,infinity},{43636,668770567},{40470639,64}}},
{{o,"Bucket0002","Key000847",null},
{46,{active,infinity},{4521,4086939046},{76641903,64}}},
{{o,"Bucket0002","Key000904",null},
{49,{active,infinity},{14980,2113833726},{103075733,64}}},
{{o,"Bucket0002","Key000926",null},
{23,{active,infinity},{56958,2329034167},{79194566,64}}},
{{o,"Bucket0003","Key000099",null},
{27,{active,infinity},{44071,3730207213},{32177719,64}}},
{{o,"Bucket0003","Key000113",null},
{37,{active,infinity},{47515,122434715},{47608167,64}}},
{{o,"Bucket0003","Key000143",null},
{44,{active,infinity},{354,3219089045},{45878992,64}}},
{{o,"Bucket0003","Key000362",null},
{45,{active,infinity},{62430,2146476174},{93147816,...}}},
{{o,"Bucket0003","Key000384",null},
{38,{active,infinity},{7197,...},{...}}},
{{o,"Bucket0003","Key000642",null},
{22,{active,...},{...},...}}]
3>
The original intention for supporting hashtree rebuilds was to modify the fold command called when prompting a rebuild, so that it could use a fold that was optimised within leveled - i.e. rather than use fold_objects, use a fold that would return just Keys and Clocks, the minimum information necessary to produce the rebuild.
However, Riak has now added the kv_sweeper. The sweeper now makes it harder to use a fold other than fold_objects as:
There are two possible changes which can be considered for resolving this.
In both cases the riak_kv_sweeper_fold:fold_req_fun/4 will need to be changed to handle the potential variance in response.
The second solution makes riak_kv_sweeper more generically efficient, in that all sweeps will be optimised where they do not require access to object values, but less efficient specifically in the case of hashtree rebuilds.
Allow compression in both SST and CDB to use snappy or lz4 (but default to some pure Erlang library compression)
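A sketch of how a pluggable method might look, with term_to_binary's built-in compressed option (zlib, so dependency-free though not strictly pure Erlang) as the default, and snappy/lz4 clauses left to a NIF library; serialise_block/2 and the method atoms are illustrative, not a proposed API:

```erlang
%% Illustrative compression switch for block serialisation.
serialise_block(Term, native) ->
    %% zlib compression built in to term_to_binary
    term_to_binary(Term, [{compressed, 6}]);
serialise_block(Term, none) ->
    term_to_binary(Term).
%% A snappy or lz4 clause would delegate to the relevant NIF
%% library rather than to anything in OTP.

deserialise_block(Bin) ->
    binary_to_term(Bin).
```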
Hello,
leveled is very interesting. A pure Erlang riak-core backend? Whaoo!
I didn't know about CDB datastores. Their page explains that they use 32-bit addresses, hence a 4GB limit.
Does this limit apply to a whole leveled node? Or is a leveled node a growing set of several CDB files?
Can one leveled node achieve some TB of standalone storage?
Can a 100-ish node leveled-powered riak-core cluster achieve some PB of cluster storage?
Happy design and coding
The use of Tags to indicate types, and the changing of behaviour by type (such as on recovery, inker compaction or penciller compaction), is a potentially powerful feature in leveled.
However, the implementation is a muddle within leveled_codec at the moment. It is not clear what is required for a new type, or what can be done with types.
What is the current status of leveled? Can it start to be used in small production deployments? Is there any release planned soon?
When running rebar3 ct on otp-18.3 (built from the erlang.org/downloads otp-18.3 src tar), all tests pass. The same command run with otp-19.2 fails one test:
%%% basic_SUITE ==> {failed,{{badmatch,false},
[{basic_SUITE,fetchput_snapshot,1,
[{file,"/Users/russell/dev/e/NHS/leveled/_build/test/lib/leveled/test/end_to_end/basic_SUITE.erl"},
{line,304}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1529}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1045}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,977}]}]}}
This issue is just a reminder to investigate.
When running in head_only mode, key listing will not work: when accumulating keys, only {B, K} is passed to the FoldFun (so the SubKey is ignored).
book_isempty currently only supports checking for empty when the buckets and keys are binary. It also doesn't work in head_only mode if, for every {B, K}, the first {B, K, SK} is not an active object.
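The arity mismatch can be sketched as follows - the fun shapes are illustrative of the problem, not the exact leveled accumulator code:

```erlang
%% Key listing passes only Bucket and Key to the fold fun...
FoldKeysFun =
    fun(B, K, Acc) -> [{B, K} | Acc] end.
%% ...so in head_only mode, where entries are really {B, K, SK},
%% the SubKey is lost. A head_only-aware fold would need to pass
%% the full triple:
FoldHeadsFun =
    fun({B, K, SK}, Acc) -> [{B, K, SK} | Acc] end.
```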
The bookie actor will use a non-normal terminate for destroy, but this is a potentially normal activity e.g. fallback partition closing.
eunit tests don't cope with the lack of O_SYNC support in OTP 16.
Wasn't sure how to handle this; however, I may be able to copy some of the stuff done by @Licenser using platform_define in rebar.config and ifdef.
Really should tackle this, as it absolutely must run in OTP16 at present to work in riak_kv. I've been a bit lazy not checking this each time before merging.
Also see #82
Since I started running tests regularly on a macbook, I started seeing another test intermittently fail - space_clear_on_delete in basic_SUITE. I initially put this down to timeouts - but that isn't the case. There is a genuine reason why the test fails, which is maybe a bug, but seemingly a safe one.
The penciller keeps up to the last 40K or so keys and values it has received in its L0 cache, until it is prompted (by a combination of hitting a threshold and some jitter magic) by a flush from above to write a new L0 file.
When a penciller is shut down, and if that shutdown is normal, it tries to write that cache to a new L0 file (assuming there isn't one already present), with the filename for that L0 file being set using the current Manifest SQN + 1. On startup it loads all the files from the manifest, and then looks for any L0 file at manifest SQN + 1. If there isn't one (perhaps because the shutdown wasn't smooth) - no worries. The penciller will start up from the highest SQN found in the sst files in the manifest, and then load the remainder from the journal (which is the WAL and the source of truth). So the penciller will restart correctly, but with a slower startup.
The space_clear_on_delete test loads a load of data, then deletes it all. It then checks that the ledger files have been removed. However, this relies on the last breadth functionality. To make sure compaction is triggered, it does a Penciller shutdown - which should empty the cache of any remaining deletes to a L0 file. It then restarts, causing the Penciller to find the L0 file... and within a few seconds of the restart the penciller will prompt a compaction from L0 to L1 - which will merge all the deletes and ultimately remove all the files.
The issue emerges because the clerk which performs a merge updates the persisted manifest before prompting the penciller to update its local manifest SQN (note - I think this is probably a good thing).
This means that a manifest could be persisted at SQN X during the shutdown, whilst the penciller shutting down thinks the SQN is X - 1, and so stores its last breadth as a L0 file at X, not X + 1. The L0 file is ignored at startup, the missing cache is rebuilt from the journal - and the penciller starts in the correct state.
However:
So the test does a listdir - and unexpectedly finds files still left after all the shutdown/open/shutdown routine.
A follow-up to the discussion in #82 about using EQC to test.
Placeholder for now, need to do some volume tests first.
Compaction of the Journal works functionally (according to ct tests). Leaving the database running overnight leads to the database being compacted. However, what happens if compaction coincides with load? Do the vnodes jitter between themselves sufficiently, or might they all try and compact at once?
Perhaps there may be other strategies; maybe every vnode should choose a compaction hour at random, and compaction should be a continuous process - but with only 1/24th of vnodes doing compaction at once?
Scoring for compaction may need to be addressed as well. Originally the idea was that a 20% saving was enough to justify compaction - but should this be set higher, especially as the non-active journal could be split onto separate mount points and run on cold/cheap/big-volume disks?
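The random-hour idea is simple to sketch; choose_compaction_hour/0 and should_compact/1 are illustrative of the proposal, not existing leveled behaviour:

```erlang
%% Each vnode picks a compaction hour at random when it starts,
%% so on average only 1/24th of vnodes are compacting at any time.
choose_compaction_hour() ->
    rand:uniform(24) - 1.    % an hour in 0..23

%% Run compaction continuously, but only during this vnode's
%% chosen hour.
should_compact(ChosenHour) ->
    {_Date, {Hour, _Min, _Sec}} = calendar:local_time(),
    Hour =:= ChosenHour.
```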
Question has been asked about using leveled with multiple processes. This issue is just to clarify the answer on this.
Fundamentally, leveled has been designed to have a single actor (the bookie gen_server) fronting all requests. There are multiple processes behind that single actor (the Penciller, the Inker, the Clerks, and a separate process for each file), so that background work does not lock the Bookie - but each request to the database must sit in a single queue until the bookie is free to serve it.
If there is intended to be a long-running request (e.g. to support a fold), a read-only point-in-time snapshot can be taken to support that request offline of the main bookie (so other requests only have to wait on the time to make the snapshot, not run the fold). The snapshots are point-in-time though, so won't be updated as new PUT requests are received by the Bookie.
There exists the potential for read requests to be managed in parallel to write requests - which would give more throughput in an environment with many CPU cores. However, the primary goal of the leveled project is for it to be used as a Riak backend, where there will always be multiple leveled instances on the machine - i.e. concurrency to make full use of the capacity of the machine is expected to be handled at the Riak level.
It would be possible, without too many code changes, to have a read-only PID to provide parallel access to leveled. A process could be started with access to the leveled_bookie's ETS table, and direct access to the Penciller and Inker PIDs. However, working on this is not a priority at present. As there still needs to be a single Penciller, without significant rework, this would only provide limited parallelisation for reads.
Leveled, depending on object and database size, can still potentially support thousands of ops per second. If the limitation of the single fronting PID (the bookie) is a constraint, then either splitting requests over multiple leveled instances (and of course Riak can do that for you), or an alternative database would probably be better options than trying to modify leveled for greater parallelisation.
All testing so far has been focused on 18.3.
Running the rebar3 ct test for riak_SUITE handoff in 18.3 it reports:
40000 objects loaded in 6.330256 seconds
But in 19.3 it reports
40000 objects loaded in 429.411223 seconds
!!!!
It would be better in many cases for an all object fold to be done by scanning the journal (checking to see if the Journal entry is up to date in the penciller), rather than by scanning the ledger and fetching only the required objects from the Journal.
The assumption is that read-ahead, as the Inker folds over the Journal, should mean that most of the activity can be achieved through a continuous read without disk seeks (also assuming the SQN checks in the Ledger are largely served out of cache).
1) leveled_sst:indexed_list_mixedkeys_bitflip_test/0
Failure/Error: {error,
{badmatch,
<<0,0,1,79,0,0,1,76,0,0,2,213,0,0,1,150,0,0,0,0,
59,140,84,131,162,160,253,130,174,188,53,201,
234,135,231,234,110,129,106,204,111,222,232,207,
109,139,62,139,31,175,241,239,181,235,45,232,67,
248,217,143,157,224,110,178,233,165,229,158,176,
198,240,212,94,175,109,138,123,254,18,225,178,
136,235,195,81,221,193,131,80,0,0,8,15,120,1,
173,213,185,78,195,64,16,6,224,177,179,118,14,
168,184,4,66,226,172,160,64,89,95,68,17,5,162,
69,188,2,145,137,141,188,216,178,41,236,72,169,
169,169,40,168,104,225,9,104,233,121,34,90,102,
52,91,35,138,233,182,250,52,251,207,111,111,5,0,
219,133,91,168,12,28,83,130,127,213,205,203,188,
45,92,60,182,122,102,234,54,237,149,160,174,243,
101,92,168,212,43,220,12,252,116,222,154,69,158,
193,192,212,247,166,54,237,50,131,97,221,204,
170,166,41,187,199,12,84,221,85,213,95,226,29,
128,163,25,77,16,245,197,208,105,9,30,78,26,5,
168,158,138,169,55,172,134,33,170,135,98,234,11,
171,113,132,106,34,165,186,1,171,122,130,234,
154,148,218,59,99,53,160,10,236,136,169,207,54,
87,74,224,72,74,85,159,118,86,234,192,150,148,
234,221,114,93,207,17,237,75,161,254,37,163,26,
81,71,10,237,43,123,127,82,55,197,212,87,86,147,
49,170,23,82,234,96,221,54,128,118,117,32,166,
30,219,4,104,89,187,98,234,19,171,17,125,89,99,
49,245,141,213,152,254,131,19,41,117,248,206,
170,166,109,141,196,212,47,155,0,53,235,68,76,
253,182,9,208,172,129,148,58,250,176,42,109,107,
42,166,254,176,26,208,172,27,82,234,10,190,186,
244,106,133,148,235,190,148,186,234,91,149,102,
221,251,183,250,240,11,141,195,85,149,246,213,0,
203,131,80,0,0,8,20,120,1,173,212,189,74,195,
112,20,5,240,107,136,197,210,86,171,181,26,4,
157,68,29,252,202,151,181,186,185,56,56,185,11,
... (several hundred further bytes of the corrupted binary slot omitted) ...
32,158>>},
[{leveled_sst,crc_check_slot,1,
[{file,
"/Users/russell/dev/e/NHS/learn_level/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,1525}]},
{leveled_sst,binaryslot_get,4,
[{file,
"/Users/russell/dev/e/NHS/learn_level/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,1374}]},
{leveled_sst,test_binary_slot,4,
[{file,
"/Users/russell/dev/e/NHS/learn_level/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,2351}]},
{leveled_sst,indexed_list_mixedkeys_bitflip_test,0,
[{file,
"/Users/russell/dev/e/NHS/learn_level/leveled/_build/test/lib/leveled/src/leveled_sst.erl"},
{line,2326}]}]}
Output:
142 tests, 1 failures
```
Some OTP naivety on my part, treating terminate like init in reverse, has led to situations where close messages will receive an ACK before the close work has finished - and this may cause issues in fast close/delete/restart scenarios.
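The ordering problem can be sketched with Python threads standing in for OTP processes (names and structure are illustrative, not leveled's code): if the ACK is sent before the close work runs, the caller may proceed to delete/restart while that work is still in flight.

```python
import queue
import threading

def close_worker(inbox, ack, log, ack_before_work):
    """Simulate a server handling a 'close' message.

    ack_before_work=True mirrors the buggy ordering: the caller is ACKed
    before the close work (e.g. flushing/deleting files) has finished.
    """
    msg = inbox.get()
    assert msg == "close"
    if ack_before_work:
        ack.put("ok")              # caller may now delete/restart...
        log.append("close_work")   # ...while the work is still in flight
    else:
        log.append("close_work")   # finish the work first
        ack.put("ok")              # only then ACK

def run(ack_before_work):
    inbox, ack, log = queue.Queue(), queue.Queue(), []
    t = threading.Thread(target=close_worker,
                         args=(inbox, ack, log, ack_before_work))
    t.start()
    inbox.put("close")
    ack.get()                # caller proceeds as soon as the ACK arrives
    done_at_ack = bool(log)  # was the close work done when we got the ACK?
    t.join()
    return done_at_ack

print(run(ack_before_work=False))  # True - the safe ordering
print(run(ack_before_work=True))   # may print False - the race
```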
An SST file of 32K keys is made up of 256 slots of 128 Keys/MD each. Each slot is then sub-divided into 4 blocks of up to 32 Keys/MD.
The blocks are flat lists, stored using term_to_binary with compression enabled. The whole slot is then stored as a collection of blocks, pointers and lengths and has a CRC checksum added.
Currently for each fetch from a slot, the whole slot is read and CRC-checked in order to fetch the blocks required (which is normally only one of the blocks). Only the required blocks are then split out of the in-memory slot and examined using binary_to_term followed by lists:nth (as the index should have been specific about which block the key may be in).
If the block lengths and pointers were cached (at a cost of 1 bit per key), then reading the whole slot, and performing the CRC across the whole slot, can be avoided. The blocks are compressed using zlib, and the zlib format mandates its own checksum, so the fetch from disk (although almost certainly page cache) will still be check-summed.
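The layout and the checksum point above can be illustrated with a short Python sketch (the arithmetic and names are illustrative, not leveled's code). Part 1 locates the slot and block holding the Nth key of a 32K-key file; part 2 shows that zlib's mandated Adler-32 checksum catches corruption in a single compressed block, without a slot-wide CRC.

```python
import zlib

# Part 1: locate the slot/block for the Nth key in the layout described:
# 32768 keys = 256 slots x 128 keys, each slot = 4 blocks x 32 keys.
KEYS_PER_SLOT = 128
KEYS_PER_BLOCK = 32

def locate(key_index):
    slot = key_index // KEYS_PER_SLOT
    within_slot = key_index % KEYS_PER_SLOT
    block = within_slot // KEYS_PER_BLOCK
    pos_in_block = within_slot % KEYS_PER_BLOCK
    return slot, block, pos_in_block

print(locate(0))      # (0, 0, 0)
print(locate(1000))   # (7, 3, 8)

# Part 2: zlib's own checksum guards each compressed block.
block = zlib.compress(b"key/metadata" * 32)
corrupt = block[:-1] + bytes([block[-1] ^ 0xFF])   # flip a checksum byte
try:
    zlib.decompress(corrupt)
    detected = False
except zlib.error:
    detected = True
print(detected)   # True - corruption caught without a slot-wide CRC
```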
A new SST file has the blockindex_cache pre-loaded. An opened one (following a restart) doesn't - it loads due to keys being requested.
When leveled is used as an AAE store, there are no key fetches, and so the blockindex_cache is not loaded.
Perhaps load in the background after startup? Perhaps populate on first fetch_range?
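The two warming strategies can be sketched as follows (a hypothetical illustration in Python, not leveled's blockindex_cache code): build each slot's index entry lazily on first request, or eagerly walk all slots, e.g. in the background after startup.

```python
class SlotIndexCache:
    """Toy per-slot index cache, counting how often entries are built."""

    def __init__(self, n_slots):
        self.entries = [None] * n_slots
        self.builds = 0

    def _build(self, slot):          # stand-in for parsing the slot header
        self.builds += 1
        return {"slot": slot}

    def get(self, slot):             # lazy: build on first request only
        if self.entries[slot] is None:
            self.entries[slot] = self._build(slot)
        return self.entries[slot]

    def warm_all(self):              # eager: e.g. in the background after open
        for slot in range(len(self.entries)):
            self.get(slot)

cache = SlotIndexCache(4)
cache.get(2)
cache.get(2)
print(cache.builds)   # 1 - the second fetch hits the cache
cache.warm_all()
print(cache.builds)   # 4 - the remaining slots are built once each
```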
When implementing metrics to measure timings in leveled_cdb, the mechanism used there seemed less clunky than the existing sampling technique. It also made it easier to detect and resolve an inefficient part of the code.
Refactor all metric logging and sampling to follow the same pattern as leveled_cdb.
There are now 5 lines of code not covered by tests. Get them covered.
So far the process of completing test coverage has proven useful in at least forcing documentation of hard to reach places. So I think in this case 100% test coverage is a worthwhile target
Need to have a general tidy-up that will assist the dialyzer (e.g. use specs), and also help the reader (improve inline commenting)
Make it clearer what startup options there are.
Maybe try and get it working with cuttlefish in riak_kv
Can you describe what happens when a process holding a snapshot crashes? Will the snapshot be released?
A quick glance at the code shows that a penciller snapshot is started, and closed at the end of the runner operation. But what happens if the fold function, or the process that executes the runner, crashes? How is the timeout managed?
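One common answer in OTP systems is a monitor-based release: the store monitors the holder and drops the snapshot on a 'DOWN' message rather than relying on an explicit close. A minimal sketch of that pattern in Python terms (the Store/holder names are hypothetical, not leveled's API):

```python
class Store:
    """Toy store tracking one snapshot per holder."""

    def __init__(self):
        self.snapshots = {}                # holder_id -> snapshot state

    def take_snapshot(self, holder_id):
        self.snapshots[holder_id] = {"open": True}

    def on_holder_down(self, holder_id):   # analogous to a 'DOWN' message
        self.snapshots.pop(holder_id, None)

store = Store()
store.take_snapshot("runner-1")
store.on_holder_down("runner-1")   # holder crashed mid-fold
print(len(store.snapshots))        # 0 - snapshot released despite the crash
```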
Currently two important inputs to the journal compaction calculation are not configurable:
%% Sliding scale to allow preference of longer runs up to maximum
-define(SINGLEFILE_COMPACTION_TARGET, 40.0).
-define(MAXRUN_COMPACTION_TARGET, 70.0).
These should be configurable. I suspect that they are also set too low - perhaps they should be 50/75 by default. 25% is a significant recovery percentage for a long journal compaction run.
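A possible reading of the sliding scale is sketched below (the formula, the maximum run length, and the linear interpolation are all assumptions for illustration, not leveled's actual calculation): the score a run must beat scales from the single-file target up to the maximum-run target as the run gets longer.

```python
# Assumed constants, mirroring the defines quoted above.
SINGLEFILE_COMPACTION_TARGET = 40.0
MAXRUN_COMPACTION_TARGET = 70.0
MAX_RUN_LENGTH = 4   # hypothetical maximum run length

def score_target(run_length):
    """Linearly interpolate the target score for a run of the given length."""
    if MAX_RUN_LENGTH == 1:
        return SINGLEFILE_COMPACTION_TARGET
    step = (MAXRUN_COMPACTION_TARGET - SINGLEFILE_COMPACTION_TARGET) \
        / (MAX_RUN_LENGTH - 1)
    return SINGLEFILE_COMPACTION_TARGET + step * (run_length - 1)

print(score_target(1))   # 40.0 - a single file must beat the lower target
print(score_target(4))   # 70.0 - the longest runs are preferred
```

Raising the two constants to 50/75, as suggested, would shift this whole line upward without changing its shape.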
ct tests never try to create a third level in the merge tree - and the code blows up when trying to select a file in Level 2 to merge into Level 3.
2017-03-08 09:42:48.905 [error] <0.910.0> CRASH REPORT Process <0.910.0> with 0 neighbours exited with reason: bad argument in call to erlang:length({idxt,65,{{[{{i,<<"test">>,{<<"dateofbirth_bin">>,<<"1938-11-16|K7TxEg==">>},<<1,129,57,239>>},...},...],...},...}}) in leveled_pmanifest:mergefile_selector/2 line 261 in gen_server:terminate/6 line 744
The CDB file has an index of positions and counts with 256 members. This uses a list, but the code never needs to scan over this list, only to find specific indexes. This would be more efficient with an alternative data type.
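The cost difference can be illustrated in Python (a sketch of the complexity argument, not leveled's CDB code): reaching the Nth entry of a cons-style list takes N hops, while an array-backed structure reaches it directly.

```python
def nth_by_scan(lst, n):
    """Reach entry n the way a cons list forces: one hop at a time."""
    hops = 0
    it = iter(lst)
    entry = None
    for _ in range(n + 1):
        entry = next(it)
        hops += 1
    return entry, hops

# 256 (hash, position) entries, standing in for the CDB index.
index = [(h, h * 8) for h in range(256)]

entry, hops = nth_by_scan(index, 255)
print(hops)                  # 256 - hops needed to reach the last entry
print(entry == index[255])   # True - an array yields the same entry in O(1)
```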
Not all snapshots are being released in volume tests; some need to be timed out - but these don't seem to correlate with any query failures. Why might a snapshot not be released?
There is no is_empty check in Leveled, so as a workaround the riak_kv_leveled_backend does a bucket list (because bucket lists are fast) and looks for a non-zero number of buckets. However, bucket lists are not fast if there are lots of buckets - for example, where there is a bucket for each segment of an AAE tree.
Make an is-empty check that is always fast.
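One way to make the check always fast is to fold over keys but stop at the first active one, rather than enumerating every bucket. A minimal sketch of that idea (the key representation is an assumption for illustration):

```python
def is_empty(keys):
    """keys: iterable of (key, active) pairs, e.g. a lazy ledger fold.

    any() short-circuits, so this stops at the first active key found -
    constant work for a non-empty store, regardless of bucket count.
    """
    return not any(active for _key, active in keys)

print(is_empty([]))                              # True  - nothing at all
print(is_empty([("k1", False), ("k2", True)]))   # False - stops at k2
```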
The Ledger Cache is kept small (up to about 2000 entries) as when a clone is required, this cache goes through ets:to_list to allow a snapshot to exist in the clone.
For short-lived 2i queries, this is probably a poor trade off. It would probably be better to run the query against the ETS table and push the results downstream to the clone, rather than push the whole cache.
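The trade-off can be sketched in Python, with a dict standing in for the ETS table (names and sizes are illustrative, not leveled's code): copying the whole cache moves every entry, while running the query first pushes only the matching keys downstream.

```python
# A ~2000-entry cache, standing in for the Ledger Cache ets table.
cache = {f"key{i}": {"idx": i % 3} for i in range(2000)}

def snapshot_whole_cache():
    """Current flow: an ets:to_list-style full copy for the clone."""
    return dict(cache)

def snapshot_query_results(idx_value):
    """Proposed flow: run the 2i-style match first, push only results."""
    return [k for k, v in cache.items() if v["idx"] == idx_value]

print(len(snapshot_whole_cache()))      # 2000 entries copied to the clone
print(len(snapshot_query_results(0)))   # 667 - only matching keys pushed
```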
Why?
F............
Failures:
1) leveled_iclerk:schedule_test/0
Failure/Error: ?assertMatch(true, SecondsToCompaction1 =< 84780)
expected: = true
got: false
%% /Users/martinsumner/dbroot/leveled/_build/test/lib/leveled/src/leveled_iclerk.erl:708:in `leveled_iclerk:schedule_test/0`
Output: Seconds to compaction 3543
Seconds to compaction 84815
Currently the Penciller expects to receive the Bookie's Ledger Cache as some sort of leveled_tree implementation (accompanied by an index of its entries). To achieve this the Bookie first converts the ets table into a list, then into the tree, then sends the tree in the push_mem message - and then forgets about it.
How does Erlang handle this? When transferring objects between gen_server/fsm processes we have seen delays proportional to the size of the object - it isn't just a transfer of a reference. Could the Penciller instead expect an ets table, and do this conversion itself, so that there is no need for both Bookie and Penciller to hold a copy of the converted tree (and no need for the potential overhead of passing the tree)?
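The two flows can be sketched side by side (a Python illustration of the proposal, not leveled's code): either the Bookie converts and sends a copy, or the Penciller receives a handle to the shared table and converts once itself.

```python
# Stands in for the ets table holding the Ledger Cache.
cache_table = {"a": 1, "c": 3, "b": 2}

def bookie_push_tree():
    """Current flow: Bookie converts (ets -> list -> tree) and sends
    the result in push_mem, so both sides briefly hold a copy."""
    return sorted(cache_table.items())

def penciller_pull(table):
    """Proposed flow: Penciller receives the table reference and does
    the conversion itself - one conversion, no tree passed around."""
    return sorted(table.items())

# Either flow yields the same tree; only where it is built differs.
print(bookie_push_tree() == penciller_pull(cache_table))   # True
```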
Experiment required to prove TicTac caches.
New style of PUT required which provides vclock change and a partition ID. That vclock change should then be reflected in a cached TicTac tree for that partition ID.
There should be an ability to request the cached TicTac tree for any partition - by root, buckets or segments.
There should be an ability to return keys and clocks by segment ID.
The cached TicTac tree should recover back to a consistent point on startup.
There should be a capability to compare the cached TicTac tree with a snapshot of the store - and mark any differences as dirty segments.
There should be a capability to repair a group of dirty segments (where the hash represented in the cache may not reflect the accumulated hash on disk).
There should be a capability to trigger a build of a TicTac cached tree if on startup the tree is empty, but the store is not empty.
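The requirements above can be sketched in miniature (assumed mechanics, not a TicTac specification: one hash per segment, updated by XOR so that re-applying a change cancels it, with comparison yielding the dirty segments):

```python
import zlib

SEGMENTS = 64   # hypothetical tree width

def segment(key):
    """Map a key to its segment (crc32 stands in for the real hash)."""
    return zlib.crc32(key) % SEGMENTS

def apply_change(tree, key, old_clock_hash, new_clock_hash):
    """Fold a vclock change into the cached tree: XOR out the old
    clock's hash and XOR in the new one."""
    s = segment(key)
    tree[s] ^= old_clock_hash ^ new_clock_hash
    return tree

def dirty_segments(cached, rebuilt):
    """Compare the cached tree with one rebuilt from a store snapshot."""
    return [s for s in range(SEGMENTS) if cached[s] != rebuilt[s]]

cached = [0] * SEGMENTS
cached = apply_change(cached, b"bucket/key1", 0, 0xABCD)  # new key written
rebuilt = [0] * SEGMENTS     # snapshot rebuild that missed the write
print(dirty_segments(cached, rebuilt))   # the one segment holding key1
```

A "repair" of a dirty segment would then mean fetching keys and clocks for that segment ID and recomputing its hash from the store.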
During early volume tests there was an issue whereby the Penciller Clerk would take the same work twice. It would do the work the second time as expected, but when the Penciller called back to prompt deletions, the dictionary for that Manifest SQN had already been emptied.
This caused the Penciller's Clerk to crash. As this is not a supervised process, the vnode carried on working, but without merging new entries. This worked fine until the Penciller process hit high memory watermarks and crashed. The absence of the Penciller process caused the Bookie to crash on the next call, and then this crashed the vnode process.
Riak Core then worked as expected, the vnode restarted, this restarted the bookie, which reloaded the lost penciller state from the ledger - and everything went back to normal.
In this test two vnodes were impacted at the start of the test, and the impact of the restart can be seen at the end.
This was initially resolved by placing a soft lock in the Penciller through the State#state.work_ongoing boolean - so that if work is ongoing it should never tee up more work.
However, this has still happened once since then.
There are a number of things to fix:
The problem could be made to go away by swapping the dict:fetch for a dict:find, but I think it is right for this to crash, as it is an unexpected event
Manifest files aren't deleted after we're finished with them. After a 24 hour test have over 2000 ledger manifest files per partition - these need to be tidied up.
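The dict:fetch versus dict:find distinction in the first fix above can be put in Python terms (a sketch of the semantics, not leveled's code): one lookup crashes on a missing key, the other returns a sentinel and lets execution continue.

```python
manifest_work = {1: "merge"}   # stands in for the Manifest SQN dictionary

def fetch(sqn):
    """Like dict:fetch - raises on a missing key, surfacing the bug."""
    return manifest_work[sqn]

def find(sqn):
    """Like dict:find - returns 'error' instead of crashing."""
    return manifest_work.get(sqn, "error")

print(find(2))        # 'error' - silently swallows the unexpected state
try:
    fetch(2)
except KeyError:
    print("crashed")  # the crash makes the unexpected event visible
```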
Mistakenly believed that macros were "compiled" in a sense they're not. So within leveled_log, every time a log is called it does a dict:from_list on the list.
At least it would be better to keep it as a list and do lists:keyfind.
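The two approaches can be contrasted in Python (an illustration of the cost argument; the log entries are made up): rebuilding a dict from the list on every call versus a single keyfind-style scan of the list as it stands.

```python
LOGS = [("B0001", "info", "Bookie starting"),
        ("P0001", "info", "Penciller starting"),
        ("I0001", "info", "Inker starting")]

def log_via_fresh_dict(ref):
    """Current behaviour: a dict:from_list-style rebuild on every call."""
    return {k: (lvl, txt) for k, lvl, txt in LOGS}[ref]

def log_via_keyfind(ref):
    """Proposed: lists:keyfind-style scan of the list in place."""
    for k, lvl, txt in LOGS:
        if k == ref:
            return (lvl, txt)
    return None

print(log_via_fresh_dict("P0001"))   # ('info', 'Penciller starting')
print(log_via_keyfind("P0001"))      # same result, nothing rebuilt per call
```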
Leveled not compiling on OTP 16 due to use of array:array() in type.
There needs to be some logging of how fast the different parts of the CDB get process are. Perhaps large cycle counts are common, or reading the index is inefficient - there is no clue at the moment as to where time may be being spent, and whether this is therefore working efficiently.