basho / riak_core Goto Github PK

Distributed systems infrastructure used by Riak.

License: Apache License 2.0

Makefile 0.02% Erlang 99.98%

riak_core's Introduction

Riak Core

Riak Core is the distributed systems framework that forms the basis of how Riak distributes data and scales. More generally, it can be thought of as a toolkit for building distributed, scalable, fault-tolerant applications.

For some introductory reading on Riak Core (that’s not pure code), there’s an old but still valuable blog post on the Basho Blog that’s well worth your time.

Contributing

We love community code, bug fixes, and other forms of contribution. We use GitHub Issues and Pull Requests for contributions to this and all other code. To get started:

Fork this repository.
Clone your fork or add the remote if you already have a clone of the repository.
Create a topic branch for your change.
Make your change and commit. Use a clear and descriptive commit message, spanning multiple lines if detailed explanation is needed.
Push to your fork of the repository and then send a pull request.
A Riak committer will review your patch and merge it into the main repository or send you feedback.

Issues, Questions, and Bugs

There are numerous ways to file issues or start conversations around something Core related

The Riak Users List is the main place for all discussion around Riak.
There is a Riak Core-specific mailing list for issues and questions that pertain to Core but not Riak.
#riak on Freenode is a very active channel and can get you some real-time help if a lot of instances.
If you've found a bug in Riak Core, file a clear, concise, explanatory issue against this repo.

riak_core's People

Contributors

Stargazers

Watchers

Forkers

archaelus travelping argv0 sonian djui lemenkov websterclay rampage nicom edwardt andrewoswald oier eqrandom rzezeski footyfish dangerdawson evax goj b grourk bosky101 konradkaplita eriksoe nivertech jbrisbin iw truesef-dev jmshoffs0812 cthulhuology errord alinpopa jaejin xiufan fygrave mdediana russelldb pharkmillups manpages markhamstra fujin jerith tomlanyon mihawk jcrabtree johnptoohey deadzen snwight whitenode rickowens nyarly ektormak croland ricardobcl d63432 migue jametong janpieper shadidchowdhury tsloughter cmeiklejohn lemonhall leonlee mykook kvakvs tmcgilchrist campherkurt milindparikh hscic plumcube windeye h0ngcha0 neeraj9 kyorai algking yogishbaliga chinnurtb feiquanbifeng bones-rpc lumost sebastian uilxy runway20 quviq mbbx6spp denofiend redis-sd waptang javierarrieta platbox andreashasse ronnylt marsleezm licenser tothlac varnerac cybergrind joeljacobson sagungargs zhangjiayin marquisthunder

riak_core's Issues

legacy_vnode_routing should default to true for new installs

The legacy_vnode_routing app.config setting controls whether the new 1.1 vnode proxy code is used or the pre-1.1 vnode master code.

For new installs it should be set true - make sure it is set in the rel/files/app.config setting.

Once clusters are all upgraded to 1.1.0 and above the config item should be added manually.

insufficient_vnodes possible soon after node join

Today I have been able to reproduce the error below on every run of the yokozuna test. Currently, during yokozuna application startup, a KV get is performed. This get is performed after it waits for the riak_kv service but the preflist still comes up empty, i.e. [].

The issue is that at the moment the get is performed dev4 (which has just joined a 3-node cluster) owns none of the ring but is the only node in the riak_core_node_watcher. Thus UpNodes will consist of only dev4 and nothing from the ring will match--ultimately resulting in an empty preflist. The second paste below shows a print from riak_core_apl showing the ring with no dev4 owner and a node watcher with only dev4.

So the node watcher lags behind the ring update causing the node to temporarily have [] preflist for everything. I'm not sure this is easily solvable. We could add another stage to a node transition that lets it get ready before taking on requests but I'm not really sure what that all entails. I just wanted to dump my findings here before I lost the motivation to do so.

Error

** Reason for termination ==                                                                                                                                                                                                                
** {{case_clause,{<<"_yz_default">>,{error,{insufficient_vnodes,0,need,2}}}},[{yz_schema,get,1,    [{file,"src/yz_schema.erl"},{line,39}]},{yz_index,local_create,2,[{file,"src/yz_index.erl"},{line,75}]},{yz_events,add_index,2,[{file,"src/y\
z_events.erl"},{line,97}]},{yz_events,'-add_indexes/2-lc$^0/1-0-',2,[{file,"src/yz_events.erl"},{line,102}]},{yz_events,add_indexes,2,[{file,"src/yz_events.erl"},{line,102}]},{yz_events,sync_indexes,4,[{file,"src/yz_events.erl"},{line,\
234}]},{yz_events,handle_cast,2,[{file,"src/yz_events.erl"},{line,76}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]}]}

riak_core_apl print out

2012-09-11 17:59:29.774 [warning] <0.1792.0>@riak_kv_get_fsm:prepare:162 KV {<<"_yz_schema">>,<<"_yz_default">>} UpNodes ['[email protected]'] node_watcher tab: [{{by_service,yokozuna},['[email protected]']},{{by_service,riak_pipe},['[email protected]']},{{by_node,'[email protected]'},[yokozuna,riak_kv,riak_pipe]},{{by_service,riak_kv},['[email protected]']},{'[email protected]',63514619969}]
2012-09-11 17:59:29.775 [warning] <0.1792.0>@riak_core_apl:get_apl_ann:79 APL ANN: ['[email protected]'] [{1050454301831586472458898473514828420377701515264,'[email protected]'},{1073290264914881830555831049026020342559825461248,'[email protected]'},{1096126227998177188652763624537212264741949407232,'[email protected]'}] [{1118962191081472546749696200048404186924073353216,'[email protected]'},{1141798154164767904846628775559596109106197299200,'[email protected]'},{1164634117248063262943561351070788031288321245184,'[email protected]'},{1187470080331358621040493926581979953470445191168,'[email protected]'},{1210306043414653979137426502093171875652569137152,'[email protected]'},{1233142006497949337234359077604363797834693083136,'[email protected]'},{1255977969581244695331291653115555720016817029120,'[email protected]'},{1278813932664540053428224228626747642198940975104,'[email protected]'},{1301649895747835411525156804137939564381064921088,'[email protected]'},{1324485858831130769622089379649131486563188867072,'[email protected]'},{1347321821914426127719021955160323408745312813056,'[email protected]'},{1370157784997721485815954530671515330927436759040,'[email protected]'},{1392993748081016843912887106182707253109560705024,'[email protected]'},{1415829711164312202009819681693899175291684651008,'[email protected]'},{1438665674247607560106752257205091097473808596992,'[email protected]'},{0,'[email protected]'},{22835963083295358096932575511191922182123945984,'[email protected]'},{45671926166590716193865151022383844364247891968,'[email protected]'},{68507889249886074290797726533575766546371837952,'[email protected]'},{91343852333181432387730302044767688728495783936,'[email protected]'},{114179815416476790484662877555959610910619729920,'[email protected]'},{137015778499772148581595453067151533092743675904,'[email protected]'},{159851741583067506678528028578343455274867621888,'[email protected]'},{182687704666362864775460604089535377456991567872,'[email protected]'},{205523667749658222872393179600727299639115513856,'[email protected]'},{228359630832953580969325755111919221821239459840,'[email protected]'},{251195593916248939066258330623111144003363405824,'[email protected]'},{274031556999544297163190906134303066185487351808,'[email protected]'},{296867520082839655260123481645494988367611297792,'[email protected]'},{319703483166135013357056057156686910549735243776,'[email protected]'},{342539446249430371453988632667878832731859189760,'[email protected]'},{365375409332725729550921208179070754913983135744,'[email protected]'},{388211372416021087647853783690262677096107081728,'[email protected]'},{411047335499316445744786359201454599278231027712,'[email protected]'},{433883298582611803841718934712646521460354973696,'[email protected]'},{456719261665907161938651510223838443642478919680,'[email protected]'},{479555224749202520035584085735030365824602865664,'[email protected]'},{502391187832497878132516661246222288006726811648,'[email protected]'},{525227150915793236229449236757414210188850757632,'[email protected]'},{548063113999088594326381812268606132370974703616,'[email protected]'},{570899077082383952423314387779798054553098649600,'[email protected]'},{593735040165679310520246963290989976735222595584,'[email protected]'},{616571003248974668617179538802181898917346541568,'[email protected]'},{639406966332270026714112114313373821099470487552,'[email protected]'},{662242929415565384811044689824565743281594433536,'[email protected]'},{685078892498860742907977265335757665463718379520,'[email protected]'},{707914855582156101004909840846949587645842325504,'[email protected]'},{730750818665451459101842416358141509827966271488,'[email protected]'},{753586781748746817198774991869333432010090217472,'[email protected]'},{776422744832042175295707567380525354192214163456,'[email protected]'},{799258707915337533392640142891717276374338109440,'[email protected]'},{822094670998632891489572718402909198556462055424,'[email protected]'},{844930634081928249586505293914101120738586001408,'[email protected]'},{867766597165223607683437869425293042920709947392,'[email protected]'},{890602560248518965780370444936484965102833893376,'[email protected]'},{913438523331814323877303020447676887284957839360,'[email protected]'},{936274486415109681974235595958868809467081785344,'[email protected]'},{959110449498405040071168171470060731649205731328,...},...]
2012-09-11 17:59:29.776 [warning] <0.1792.0>@riak_kv_get_fsm:prepare:164 KV {<<"_yz_schema">>,<<"_yz_default">>} preflist [] node_watcher tab: [{{by_service,yokozuna},['[email protected]']},{{by_service,riak_pipe},['[email protected]']},{{by_node,'[email protected]'},[yokozuna,riak_kv,riak_pipe]},{{by_service,riak_kv},['[email protected]']},{'[email protected]',63514619969}]

Capability Negotiation

Provide an API for registering capabilities and supported protocol versions. Implement automatic capability negotiation across the cluster, picking the most preferred protocol version that all nodes speak for each registered capability.

Ability to delete custom bucket properties

Right now, while we have the ability in riak_core_bucket to set bucket properties:

%% @spec set_bucket(riak_object:bucket(), BucketProps::riak_core_bucketprops()) -> ok
%% @doc Set the given BucketProps in Bucket.
set_bucket(Name, BucketProps0) ->
    case validate_props(BucketProps0, riak_core:bucket_validators(), []) of
        {ok, BucketProps} ->
            F = fun(Ring, _Args) ->
                        OldBucket = get_bucket(Name),
                        NewBucket = merge_props(BucketProps, OldBucket),
                        {new_ring, riak_core_ring:update_meta({bucket,Name},
                                                              NewBucket,
                                                              Ring)}
                end,
            {ok, _NewRing} = riak_core_ring_manager:ring_trans(F, undefined),
            ok;
        {error, Details} ->
            lager:error("Bucket validation failed ~p~n", [Details]),
            {error, Details}
    end.

We do not have the ability to delete custom bucket properties. Since these are stored in the ring, this adds up when creating hundreds/thousands of buckets with custom properties. A use case that has been seen is where certain buckets require the search post-commit hook, and others do not.

Right now, even if you delete all the keys and attempt to set the bucket properties back to default, it still will be kept in the ring.

It should be feasible to delete the dict key for a certain bucket's properties, and have it revert back to default properties.

riak_core_handoff_sender not accepted {error, vnode_shutdown} message

2012-03-30 15:09:23 =ERROR REPORT====
Error in process <0.29050.246> on node '[email protected]' with exit value: {{badmatch,{error
,vnode_shutdown}},[{riak_core_handoff_sender,'-start_fold/5-fun-1-',9}]}

format _stat_ts in connection manager

#313

Considerations on the usability of various reports of ring size.

There are a few places users typically look to learn a cluster's ring size, for example: app.config file, status, /stats

The presence of the ring_creation_size setting in the app.conifg file as well as its presence in status (and elsewhere) can be misleading for those interested not in the ring creation size, but the size of the ring of the current running cluster. The information desired in that case is instead ring_num_partitions.

The purpose of this ticket is to raise the question of how the usability and reporting could be improved to further express that ring_creation_size is not the actual ring size, and should not be used as such. Certainly these 'variables' are what they say they are, and should be treated as such, but could improvements be made so that mistakes could be avoided in cases where, for whatever reason (frequently an error), they do not agree.

Should the reporting of ring_creation_size be deprecated and removed for clarity?
Are there any portions of code relying on ring_creation_size and ring_num_partitions to match that shouldn't be?
Should Riak check for a disagreement in these values on startup and log or report it?
Should the riak-admin ringready or another command be augmented to report a disagreement in these values?

Handoff reported as started even if rejected by

In 1.1 incoming handoff can be rejected if a node has too much going on. When outbound handoff occurs the 'Starting handoff' message is logged before the connection takes place. Move the logging after the connection is established.

Chicken & egg problem with vnode proxy process start & registration?

I have this theory that there's a chicken & egg problem with the vnode proxy process startup.

Riak is started.
Vnode operations arrive at the riak_core_vnode_master
For each operation, forwarding via a registered name is attempted. However, the vnode proxy process hasn't registered its name yet, so gen_fsm:send_event/2 fails with a badarg error.
The riak_kv app finishes starting.

However, because step 3 fails frequently enough, the riak_core app's supervisor's maximum limit is hit before the riak_kv app has started.

I dunno if that theory is correct or not, but here's the log to show the sequence of events.

2012-03-22 03:49:25.096 [info] <0.7.0> Application lager started on node '[email protected]'
2012-03-22 03:49:25.295 [info] <0.7.0> Application riak_core started on node '[email protected]'
2012-03-22 03:49:25.299 [info] <0.7.0> Application riak_control started on node '[email protected]'
2012-03-22 03:49:25.300 [info] <0.7.0> Application basho_metrics started on node '[email protected]'
2012-03-22 03:49:25.304 [info] <0.7.0> Application cluster_info started on node '[email protected]'
2012-03-22 03:49:25.458 [info] <0.1283.0>@riak_core:wait_for_application:396 Waiting for application riak_pipe to start (0 seconds).
2012-03-22 03:49:25.461 [info] <0.7.0> Application riak_pipe started on node '[email protected]'
2012-03-22 03:49:25.498 [info] <0.7.0> Application inets started on node '[email protected]'
2012-03-22 03:49:25.502 [info] <0.7.0> Application mochiweb started on node '[email protected]'
2012-03-22 03:49:25.520 [info] <0.7.0> Application erlang_js started on node '[email protected]'
2012-03-22 03:49:25.530 [info] <0.7.0> Application luke started on node '[email protected]'
2012-03-22 03:49:25.559 [info] <0.1283.0>@riak_core:wait_for_application:390 Wait complete for application riak_pipe (0 seconds)
2012-03-22 03:49:25.606 [error] <0.1347.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.608 [error] <0.1347.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.731 [info] <0.7.0> Application bitcask started on node '[email protected]'
2012-03-22 03:49:25.761 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1347.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.765 [error] <0.1415.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.766 [error] <0.1415.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.767 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1415.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.769 [error] <0.1416.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.772 [error] <0.1416.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.773 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1416.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.774 [error] <0.1447.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.775 [error] <0.1447.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.776 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1447.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.778 [error] <0.1455.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.780 [error] <0.1455.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.786 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1455.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.807 [error] <0.1464.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.808 [error] <0.1464.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.810 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1464.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.811 [error] <0.1467.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.812 [error] <0.1467.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.813 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1467.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.815 [error] <0.1470.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.817 [error] <0.1470.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.818 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1470.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.819 [error] <0.1485.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.821 [error] <0.1485.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.822 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1485.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.825 [error] <0.1494.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.826 [error] <0.1494.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.827 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1494.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.832 [error] <0.1520.0> gen_server riak_kv_vnode_master terminated with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.836 [error] <0.1520.0> CRASH REPORT Process riak_kv_vnode_master with 0 neighbours crashed with reason: bad argument in gen_fsm:send_event/2
2012-03-22 03:49:25.837 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1520.0> exit with reason bad argument in gen_fsm:send_event/2 in context child_terminated
2012-03-22 03:49:25.838 [error] <0.1346.0> Supervisor riak_kv_sup had child riak_kv_vnode_master started with riak_core_vnode_master:start_link(riak_kv_vnode, riak_kv_legacy_vnode) at <0.1520.0> exit with reason reached_max_restart_intensity in context shutdown
2012-03-22 03:49:25.996 [info] <0.2710.0>@riak_core:wait_for_application:396 Waiting for application riak_kv to start (0 seconds).
2012-03-22 03:49:26.003 [info] <0.7.0> Application riak_kv started on node '[email protected]'
2012-03-22 03:49:26.008 [info] <0.7.0> Application riak_kv exited with reason: shutdown

Allow node leaves to happen when still joining

It would be great if a user could leave a node even if it is still in the process of joining the cluster. As it is now, a node that joins a cluster must wait to be handed all of its data before it can leave. In other words: wait until a bunch of data is transfered to the node, just so the node can hand all the data off to other nodes and exit. Not particularly efficient.

The key to this change is to ensure that data already transfered to the node is handed off before leaving. Ideally, we should hand it back to the original owner if that's safe. The hard part is ensuring that the ring continues to meet guarantees (ie. all replicas on distinct nodes) without falling back to rediagonializing the ring. Which largely explains why this issue was punted on thus far. Some of the newer/planned claim work should help address this.

Dependencies cleanup?

Hi,

Some of us riak_core users have been discussing the list of dependencies riak_core has, and to our best understanding some of them don't exactly belong to riak_core itself (webmachine, mochiweb, protobuffs, etc.).

Do you guys think there is a chance we can do some cleanup in that area?

Thanks

replace vnode process name atom generation with gproc?

Hi,

Have you considered using gproc instead of generating atoms like proxy_my_vnode_319703483166135013357056057156686910549735243776? With gproc, you can register those processes as {vnode_proxy, 319703483166135013357056057156686910549735243776}. With gproc you also get some additional features like being able to await for a registration and such.

I have been using gproc in almost every project and it has proven to be incredibly useful.

Core console mentions riak-admin script

For example, when called through riak's 'riak-admin transfer_limit' command, for example.
However, the admin script for the actual core app will likely be different. At the very least we should make it possible for a core app to customize that and maybe other messages.

Starting riak_core as application, how-to

Hej,

I was wondering if the riak_core_util:start_app_deps(riak_core) in riak_core_app:start/2 isn't creating a chicken'n'egg problem? When I try to start riak_core (for testing) from a shell by application:start(riak_core), it throws exception because the other applications (crypto, webmachine) are not running yet. So the code of start_app_deps/1, which itself uses the same app file key as ERTS for getting a list of applications is never reached. And when starting the deps applications before, there is no use for it.

So I wonder if I missed something when starting the riak_core app up. I currently use:

erl -pa ebin deps/*/ebin \
    -boot start_sasl \
    -eval "application:start(crypto), application:start(webmachine), application:start(riak_core)."

If the handoff port is taken, handoffs fail with uninformative errors

Note that if two of our ports (SSL and handoff was the wild case) are the same on the receiving node, this is all you get on the sender:

2012-10-05 17:17:47.823 [error] <0.17731.38>@riak_core_handoff_sender:start_fold:207
ownership_handoff transfer of riak_kv_vnode from '[email protected]'
456719261665907161938651510223838443642478919680 to '[email protected]'
456719261665907161938651510223838443642478919680 failed because of TCP recv timeout

these repeat constantly. On the receiver, you get nothing at all.

We should catch this on the receiver and provide better error messages. If there's a catchable error on the sender, we should improve the message there as well.

Handoff for kv never happens, and nodes that join can never leave and must be force-removed.

Race in vnode worker pool

When a riak_core_vnode_worker finishes work, it sends checkin messages to both poolboy and riak_core_vnode_worker_pool. The latter maintains a queue of work to be handled when there's room in the pool. As soon as RCVWP gets the checkin message, it asks poolboy if there is a worker available (expecting that the worker just checked in will now be available).

The problem is that poolboy may receive RCVWP's message before receiving the worker's checkin message. If this happens, it will tell RCVWP that the pool is full. RCVWP then sticks in the 'queueing' state until it receives another checkin message from a worker. Since another checkin may never arrive, the pool may become frozen.

The test defined by worker_pool_pulse.erl on the bwf-pool-race branch of riak_core demonstrates this race. Under PULSE execution, the test will fail with deadlock.

In order to run the test, you will need the pulse_otp beams from https://github.com/Quviq/pulse_otp on your path. You will also need to compile poolboy with PULSE annotations - the bwf-pool-race branch of basho/poolboy provides this.

Since I think PULSE's graphical output is quite cool, I'll include it here:

Key:

"root" (red): test process
"spawn_opt" (blue): riak_core_vnode_worker_pool
"spawn_opt1" (cyan): poolboy fsm
"spawn_opt2" (green): poolboy_sup
"spawn_opt3" (magenta): poolboy_worker/riak_core_vnode_worker

The problem is illustrated by the last four steps of "spawn_opt1", the poolboy process (in cyan), and the last six steps of "spawn_opt", the riak_core_vnode_worker_pool_process (in blue). The $gen_all_state_event that is sent to spawn_opt from the last step in spawn_opt3 is one of the two checkin messages, and the $gen_sync_event sent to spawn_opt1 is the other. The $gen_event sent from spawn_opt to spawn_opt1 is the checkout message, and since spawn_opt1 receives that message before the checkin is delivered, you can see it reply {_,full}. After that, everything is stuck waiting with no way forward.

Now fixing begins…

riak_core_stat:produce_stats() is never rescheduled to be called after a crash

If there is an error during the calculation of riak_core stats, it dies as expected but stats are never re-calculated(until node restart).

Example from the field

Error from log on 6/10:

Error in process <0.14823.27> on node 'OMMITTED' with exit value: {badarg,[{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[{file,"src/riak_core_stat.erl"},{line,168}]},{riak_core_stat,'-vnodeq_stats/0-lc$^0/1-0-',1,[{file,"src/riak_core_stat.erl"},{line,169}]},{riak_core_stat,vnodeq_stats...

Report of last time stats were calculated(Run on 6/12):

 {'OMMITTED',{{2013,6,10},{20,31,40}}}]
Reported running_vnodes: 631
Actual running_vnodes: 83

riak-admin transfers exits with status 1 when there are no transfers.

$./bin/riak start
$./bin/riak-admin transfers
No transfers active

Active Transfers:


$echo $?
1

generic UDT transport

prototype erlang transport based on UDT for replication and risk_core_sender/recvr. Die TCP!

inbound handoffs never cleanup

There are cases where the handoff sender, or its connection, may
disappear and the receiver doesn't know about it. In this case it
will sit forever in {active, once} waiting for the sender. This has
been seen in production. If you get the handoff status for all nodes
you'll see all inbound handoff.

rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).

[{'[email protected]',[{{undefined,undefined},
                                                undefined,inbound,active,[]},
                                               {{undefined,undefined},undefined,inbound,active,[]}]},
 {'[email protected]',[{{undefined,undefined},
                                                undefined,inbound,active,[]},
                                               {{undefined,undefined},undefined,inbound,active,[]}]},
 {[email protected]',[{{undefined,undefined},
                                                undefined,inbound,active,[]},
                                               {{undefined,undefined},undefined,inbound,active,[]}]},
...

This stalls handoff because each node has reached the concurrency
limit. Since these inbounds will never be reaped they will block
forever. Calling the force handoff API will not resume handoff
because it still respects the concurrency limit. To resume you must
kill the handoffs via the following code snippet.

f(Members).
Members = riak_core_ring:all_members(element(2, riak_core_ring_manager:get_raw_ring())).
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [0])).
rp(rpc:multicall(Members, riak_core_handoff_manager, set_concurrency, [2])).

This is related to #153 where the same situation was
fixed for the sender (by setting timeouts on recv) but the handoff
receiver was never fixed.

The easiest way to fix this is to add a timeout to the receiver so
that inactive connections will be noticed and reaped. This timeout
could be set to some arbitrary limit like 5-10 minutes since it's a
msg per object. The assumption being that if an object takes 10
minutes then chances are something is wrong and it's better to
restart. Handoff will still stall but only for a limited period of
time and it won't require manual intervention to resume.

configurable handoff sender

Hi,

I was wondering whether you would consider a patch that would make handoff sender configurable, that is to allow different ways for handoffs to be collected and sent.

riak-admin wait-for-service riak_kv <bad_node_name> does not return error [JIRA: RIAK-2802]

Running riak-admin wait-for-service riak_kv with a bad node name (i.e. localhost) return:

riak_kv is not up: []

No indication that this is a bad node name. Suggest putting logic in to check node name for cluster membership or ability to ping.

Identified in 1045 during upgrade from 0.14 to 1.1.1 (https://help.basho.com/tickets/1045)

Make cluster admin a two-phase (plan/commit) process

The way Riak currently provides cluster administration through riak-admin and Riak Control is a fire-and-pray model. For example, you issue riak-admin join to join a node to a cluster, and the join is immediately scheduled. It is impossible to atomically add multiple nodes at once, or to perform both joins and leaves at the same time. Instead, Riak immediately calculates the new partition ownership and begins transferring data to the new node. Furthermore, it is impossible to see how the change will affect the cluster until after issuing the join/leave, but by that point there is no way to cancel/stop things if the changes are undesired (eg. join a node and suddenly 128+ partitions become schedule during peak traffic).

Let's move away from this approach, and move towards a two-phase approach to cluster changes. Rather than having join/leaves/etc happen immediately, issuing such commands should instead stage pending changes to the cluster. A user should then be able to issue a command such as riak-admin plan to print out the staged changes -- the list of changes, the cluster membership/ring ownership resulting from these changes, the number of transfers that would be required, etc. If the user is satisfied, they could then riak-admin commit the plan, and the entire plan would be issued, and transfers scheduled. Otherwise, the user could continue to add/leave additional nodes, or riak-admin clear to clear the entire set of staged changes.

No open source license specified

I'm interested in using riak-core for an open source project that would (hopefully) be MIT-licensed. I checked the Readme and associated documentation but could not find a stated OSL. Assuming riak-core is intended for open source use, can Basho issue a license for it accordingly?

New stats suggestion: VM distribution port statistics

Suggested indirectly via @kocolosk in his presentation at http://www.erlang-factory.com/upload/presentations/552/kocoloski-erlang-writ-large.pdf slides 13-15.

These stats are per TCP port. Mapping the port to the remote Erlang node name would be tremendously useful, then exposing each group of TCP stats in a JSON object keyed by the remote node name?

OTP docs: http://www.snookles.com/erlang-docs/R15B01/lib/kernel-2.15.1/doc/html/inet.html#getstat-2

Non-blocking send (2 of 2)

This is a longer-term issue for tracking changes related to distributed Erlang message sending to remote nodes: it is possible for the sending process, using an innocent-looking '!' operator, to block for seconds (or minutes!) waiting for the net_kernel to re-connect to a machine that has crashed (or otherwise not responding to network communication).

A separate ticket, "Non-blocking send (1 of 2)", will be created to address the most vulnerable riak_core process (operation forwarding during handoff) in a quick & ruthless way. We need a more general (and probably less ruthless) method of dealing with this messaging problem while still providing adequate flow control.

([email protected])1> io:format("~s\n", [element(2,process_info(pid(0,1872,0), backtrace))]).
Program counter: 0x0000000103696d08 (gen:do_call/4 + 576)
CP: 0x0000000000000000 (invalid)
arity = 0

0x0000000108491a78 Return addr 0x00000001045eb060 (gen_server:call/3 + 128)
y(0)     #Ref<0.0.0.50173>
y(1)     '[email protected]'
y(2)     []
y(3)     infinity
y(4)     {connect,normal,'[email protected]'}
y(5)     '$gen_call'
y(6)     <0.20.0>

0x0000000108491ab8 Return addr 0x0000000103650aa8 (erlang:dsend/2 + 632)
y(0)     infinity
y(1)     {connect,normal,'[email protected]'}
y(2)     net_kernel
y(3)     Catch 0x00000001045eb060 (gen_server:call/3 + 128)

0x0000000108491ae0 Return addr 0x00000001073ffb58 (gen_fsm:send_event/2 + 616)
y(0)     {'$gen_event',{riak_vnode_req_v1,68507889249886074290797726533575766546371837952,{fsm,undefined,<12357.20683.0>},{riak_kv_put_req_v1,{<<5 bytes>>,<<5 bytes>>},{r_object,<<5 bytes>>,<<5 bytes>>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<5 bytes>>]],[],[],[],[],[],[],[],[[<<12 bytes>>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<11 bytes>>,49,85,50,97,72,119,83,119,80,101,77,84,80,114,107,82,84,107,109,107,55,99]],[[<<5 bytes>>]],[],[[<<20 bytes>>|{1364,605613,10619}]],[],[[<<11 bytes>>]]}}},<<14 bytes>>}],[{<<8 bytes>>,{3,63531822777}},{<<8 bytes>>,{2,63531824381}},{<<8 bytes>>,{4,63531824813}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined},30829294,63531824813,[]}}}
y(1)     '[email protected]'
y(2)     proxy_riak_kv_vnode_68507889249886074290797726533575766546371837952

0x0000000108491b00 Return addr 0x0000000106dac0b8 (riak_core_vnode_master:do_proxy_cast/2 + 352)

0x0000000108491b08 Return addr 0x00000001073f7330 (riak_core_vnode:vnode_handoff_command/3 + 728)

Fix riak_core_connection_mgr_stats timestamp formatting

riak-repl clusterstats
RPC to 'riak@brawl' failed: {'EXIT',
{function_clause,
[{riak_core_connection_mgr_stats,
format_stat,
[{riak_conn_mgr_stats_stat_ts,
1365529127}],
[{file,
"src/riak_core_connection_mgr_stats.erl"},
{line,61}]},
{riak_core_connection_mgr_stats,
'-get_consolidated_stats/0-lc$^0/1-0-',
1,
[{file,
"src/riak_core_connection_mgr_stats.erl"},
{line,54}]},
{riak_core_connection_mgr_stats,
get_consolidated_stats,
0,
[{file,
"src/riak_core_connection_mgr_stats.erl"},
{line,54}]},
{riak_repl_console,
clusterstats,1,
[{file,
"src/riak_repl_console.erl"},
{line,203}]},
{rpc,
'-handle_call_call/6-fun-0-',
5,
[{file,
"rpc.erl"},
{line,203}]}]}}

ring mgr crash creates confused cluster

NOTE: This should be an extremely rare incident and is easily fixed
via a riak-admin call or a node restart (see end of issue for
work-around).

If, for whatever reason, the ring mgr crashes it results in the node
thinking it is a cluster of it's own. However, it will still have all
it's distributed Erlang connections in place and the other nodes will
consider it as part of the cluster.

To be preemptive, I don't think allowing unexpected msgs is the fix
(like we do with so many other servers). In fact, I think an unknown
msg should still crash the mgr.

After talking with @jtuple it sounds like the fix has to do with
changing the mgr to be the process responsible for reading the ring.

member_status before confused cluster

$ dev1/bin/riak-admin member_status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:4 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

$ dev2/bin/riak-admin member_status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:4 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

now crash the ring mgr

dev1/bin/riak attach
> gen_server:call(riak_core_ring_manager, fooey).
... crash report ...

check member_status again

dev1/bin/riak-admin member_status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid     100.0%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:1 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

dev2/bin/riak-admin member_status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:4 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

investigate ring

> {ok, RawRing} = riak_core_ring_manager:get_raw_ring().
> sets:to_list(sets:from_list([Owner || {_,Owner} <- riak_core_ring:all_owners(RawRing)])).
['[email protected]']

work-around

The work-around is to join the node with the crashed ring mgr to any
other node in the cluster. It's important that it the join is
initiated by the node with the crashed mgr because joins can only be
performed by nodes not already part of a cluster (and in this case,
the in-memory parts of Riak think the node is not a member of the
cluster).

Also, if the node restarts it will read the correct ring off disk and
you don't have to do anything special.

dev/dev1/bin/riak-admin join '[email protected]'
Sent join request to [email protected]

dev/dev1/bin/riak-admin member_status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
-------------------------------------------------------------------------------

dev2/bin/riak-admin member_status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
valid      25.0%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:4 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

gen_server call review

While trying to start Riak under Valgrind, I encountered a gen_server:call() timeout exception which causes an application startup failure which prevents Riak from starting successfullly.

The culprit for my particular case is riak_core_stat_cache.erl:

 register_app(App, {M, F, A}, TTL) ->
    gen_server:call(?SERVER, {register, App, {M, F, A}, TTL}).

This is likely one of many uses of gen_server:call/2 when we really ought to be using the 3-arity version with an infinity timeout.

Handoff capability support

Adds capability for handoff encoding, and options for zlib and raw.

Map-reduce jobs fail during rolling upgrade to 1.1

Map reduce jobs fail against a mixed 1.1/1.0 cluster or 1.1/0.14 cluster. Mixed clusters are common during rolling upgrades, where nodes are upgraded one at a time until the entire cluster is at the new version (1.1 in this case).

In addition to the jobs failing, the Riak 1.1 nodes will print several instances of the following error and then eventually shutdown:

2012-02-22 00:56:16.429 [error] <0.1615.0>
gen_server riak_pipe_vnode_master terminated with reason: no function clause matching
riak_core_vnode_master:handle_call(
  {return_vnode,
    {riak_vnode_req_v1,479555224749202520035584085735030365824602865664,
      {raw,#Ref<6584.0.7263.108930>,...},...}},
  {<6604.7292.421>,#Ref<6604.0.6697.250987>},
  {state,undefined,undefined,riak_pipe_vnode,undefined})

2012-02-22 00:56:16.431 [error] <0.1615.0> CRASH REPORT
Process riak_pipe_vnode_master with 0 neighbours crashed with reason:
no function clause matching riak_core_vnode_master:handle_call(
  {return_vnode,
    {riak_vnode_req_v1,479555224749202520035584085735030365824602865664,
      {raw,#Ref<6584.0.7263.108930>,...},...}},
  {<6604.7292.421>,#Ref<6604.0.6697.250987>},
  {state,undefined,undefined,riak_pipe_vnode,undefined})

The error arises from Riak 1.1 no longer implementing riak_core_vnode_master:handle_call({return_vnode, ...). Yet, pre-1.1 nodes will still send messages to Riak 1.1 nodes in a mixed cluster that must be handled by this now-missing clause.

Extend node names in transfers output

Currently in the output of riak-admin transfers, node names are truncated at 16 characters. For example:

riak-admin transfers 

'[email protected]' waiting to handoff 3 partitions 
'[email protected]' waiting to handoff 1 partitions 
'[email protected]' waiting to handoff 1 partitions 
'[email protected]' waiting to handoff 1 partitions 
'[email protected]' waiting to handoff 1 partitions 
'[email protected]' waiting to handoff 1 partitions 
'[email protected]' waiting to handoff 1 partitions

Active Transfers:

transfer type: hinted_handoff 
vnode type: riak_kv_vnode 
partition: 93865151022383844456719261665907161364247891968 
started: 2012-12-03 05:56:52 [209.44 s ago] 
last update: 2012-12-03 06:00:20 [1.69 s ago] 
objects transferred: 228668

1430 Objs/s 
riak@long-node-c =======================> riak@long-node-c 
1.17 MB/s

transfer type: hinted_handoff 
vnode type: riak_kv_vnode 
partition: 90716193865151022383844364244567192616657891968 
started: 2012-12-03 05:56:52 [209.44 s ago] 
last update: 2012-12-03 06:00:21 [157.72 ms ago] 
objects transferred: 231697

1412 Objs/s 
riak@long-node-c =======================> riak@long-node-c 
1.17 MB/s

Can the node name length be extended to 32 characters?

Riak take a lot longer to start up on systems with ulimit set to a million. (centos/vagrant/virtualbox)

Riak take a lot longer to start up on under virtualbox with ulimit set to a million. (centos/vagrant/virtualbox)

Ping me for the vagrant gist (i cant post it here.)

[vagrant@localhost riak-ee]$ ulimit -n 1000000
[vagrant@localhost riak-ee]$ time dev1/bin/riak start
real 1m18.837s
user 0m1.167s
sys 0m26.828s

[vagrant@localhost riak-ee]$ ulimit -n 4096
[vagrant@localhost riak-ee]$ time dev1/bin/riak start
real 0m2.330s
user 0m0.833s
sys 0m0.066s
[vagrant@localhost riak-ee]$

riak_core_service_mgr:unregister should take version arg

At present, you can register a protocol id multiple times, with different arguments and modules. However, the unregister API will remove all registered modules of that protocol id. So, there is no way to unregister discrete versions. Please add that ability to the API.

Do not fail handoffs silently if the remote node doesn't have the appropriately named service

https://github.com/basho/riak_core/blob/master/src/riak_core_vnode_manager.erl#L628

This line makes handoffs fail silently and therefore hard to debug. If this line also emitted a debug/info log, it'd be so much easier to not waste a couple of hours :)

If you like this idea, I can send a pull req.

Log handoff sender on node receiving handoff

Right now, we currently do not log which node sent the handoff on the receiving node:

2012-06-12 19:32:01.295 [info] <0.28019.343>@riak_core_handoff_receiver:handle_info:71 Handoff receiver for partition 35681192317648997026457149236237378409568665600 exited after processing 299 objects

It would be useful to know which node was sending the partition on the receiving node.

add HTTP resource for transfers status

In reference to Mark Smith's comments on this blog post.

The transfer status information should be exposed via HTTP so that external tools may easily obtain the data in it's raw form. Currently a user would either have to write some Erlang or parse human-friendly output on the command line. Neither option is very accessible to most users.

Implementation Note: I imagine just pulling back the raw status proplist and exposing that as JSON would do the trick.

Expose "services" command over HTTP

For automated health-check tools (nagios, load-balancers, etc) it would be incredibly useful expose which services are up on a node. A webmachine resource that fronts riak_core_node_watcher:services() (or an addition to the stats resource) would greatly help this.

riak_core_vnode_manager blocks on vnode startup

riak_core_vnode_manager is unresponsive while vnodes are being started by the vnode_sup (as it blocks in start_child). This should be done asynchronously as it causes timeouts in other parts of the system - e.g. stats.

Intermittent crash on startup

Intermittent crash on startup - 6 node cluster, 32 vnodes.

 {"Kernel pid terminated",application_controller,"{application_start_failure,riak_core,{bad_return,{{riak_core_app,start,[normal,[]]},{'EXIT',{{function_clause,[{orddict,fetch,['[email protected]',[{'[email protected]',[{{riak_core,staged_joins},[true,false]},{{riak_core,vnode_routing},[proxy,legacy]}]}]],[{file,[111,114,100,100,105,99,116,46,101,114,108]},{line,72}]},{riak_core_capability,renegotiate_capabilities,1,[{file,[115,114,99,47,114,105,97,107,95,99,111,114,101,95,99,97,112,97,98,105,108,105,116,121,46,101,114,108]},{line,417}]},{riak_core_capability,handle_call,3,[{file,[115,114,99,47,114,105,97,107,95,99,111,114,101,95,99,97,112,97,98,105,108,105,116,121,46,101,114,108]},{line,216}]},{gen_server,handle_msg,5,[{file,[103,101,110,95,115,101,114,118,101,114,46,101,114,108]},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,[112,114,111,99,95,108,105,98,46,101,114,108]},{line,227}]}]},{gen_server,call,[riak_core_capability,{register,{riak_core,vnode_routing},{capability,[proxy,legacy],legacy,{riak_core,legacy_vnode_routing,[{true,legacy},{false,proxy}]}}},infinity]}}}}}}"}

chstate & chstate_2 records extraction

Hi,

is there any chance to get those ring records extracted to some (even internal) include file? We are constructing some bogus rings and therefore need direct access to those structures. Right now we just copy-pasted them, but that doesn't sound like a good solution long term.

Guard against reading corrupt/empty ring files

Right now, we currently attempt to read the latest ring file, regardless if it is corrupt, empty, et cetera. This has been seen several times, and typically we delete the most recent ring file, and use the previous one.

Riak itself could attempt to implement this functionality.

Don't read ring files from the future

If a ringfile is written with a sufficiently skewed timestamp, subsequent ring updates will be ignored on node restart until the ring file timestamp is no longer is in the future. This can cause very surprising behaviour,

A simple way to replicate is simply to change the 'year' part of the ringfile's timestamp to a year in the future, change some ring configs, and restart the node.

Better behaviour would at least be to warn about a ring file from the future, or refuse to load it at all.

Notify user of stale rings

When a node starts, if there is a stale or mismatching ring file on the machine, the node dies with an uninformative error. This can happen when attempting to change the nodenames of a cluster, or when an install has brought up an unconfigured node, as in: basho/node_package#26

Example error:

19:54:20.781 [info] Application lager started on node '[email protected]'^M
19:54:20.864 [info] Upgrading legacy ring^M
19:54:21.015 [error] gen_server riak_core_capability terminated with reason: no function clause matching orddict:fetch('[email protected]', [{'[email protected]',[{{riak_core,staged_joins},[true,false]},{{riak_core,vnode_routing},[proxy,...]},...]},...]) line 72^M
/usr/lib64/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
Erlang has closed

It'd be nice if we caught this and exited with a sensible suggestion. E.G:

"Stale ring file found, you may wish to remove it or move it aside."

Vnodes keeps dying when trying to grow cluster beyond five nodes

We are receiving the following vnodes errors causing transfers to not finish. We receive the following error:

2013-05-31 17:39:54.395 [error] <0.152.0>@riak_core_handoff_manager:handle_info:311 An outbound handoff of partition riak_kv_vnode 1164634117248063262943561351070788031288321245184 was terminated because the vnode died
2013-05-31 17:39:54.547 [error] <0.152.0>@riak_core_handoff_manager:handle_info:311 An outbound handoff of partition riak_kv_vnode 1347321821914426127719021955160323408745312813056 was terminated because the vnode died
2013-05-31 17:39:54.552 [error] <0.18737.27>@riak_core_handoff_sender:start_fold:210

Riak can't listen on ipv6 addresses

I'm trying to deploy a riak cluster on machines with no IPv4 addresses.

I seem to be able to configure the http setting to listen on {"::", 8098}, but handoff_ip doesn't seem to be able to be given an IPv6 address.

When I configure:

{handoff_port, 8099 },
{handoff_ip, "::" },

riak won't start and I get the following error in riak console:

Attempting to restart script through sudo -u riak
Exec: /usr/lib64/riak/erts-5.8.5/bin/erlexec -boot /usr/lib64/riak/releases/1.1.1/riak             -embedded -config /etc/riak/app.config             -pa /usr/lib64/riak/basho-patches             -args_file /etc/riak/vm.args -- console
Root: /usr/lib64/riak
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]

** /usr/lib64/riak/lib/observer-0.9.10/ebin/etop_txt.beam hides /usr/lib64/riak/lib/basho-patches/etop_txt.beam
** Found 1 name clashes in code paths
04:50:49.545 [info] Application lager started on node 'riak@test1'
04:50:49.584 [error] Supervisor riak_core_handoff_listener_sup had child riak_core_handoff_listener started with riak_core_handoff_listener:start_link() at undefined exit with reason bad argument in call to erlang:list_to_integer("::") in gen_nb_server:'-convert/1-lc$^0/1-0-'/1 in context start_error
/usr/lib64/riak/lib/os_mon-2.2.7/priv/bin/memsup: Erlang has closed.
                                                                     Erlang has closed
                                                                                      {"Kernel pid terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}

Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})

I assume this is convert() in the bottom of riak_core/src/gen_nb_server.erl, where it tries to naively split on a period ('.'). I believe if this were to simply be changed to split on either a period or a colon (for IPv6 addresses) it should work...

Non-blocking send (1 of 2)

handoff of fallback data can be delayed or never performed

Summary

During startup, a node with fallback data on it's secondary partitions
will not start those secondary partitions. This means that nodes with
fallback data may delay or even never handoff the fallback data.
Depending on the ring and preflist this means that replicas may exist
on the same node for a longer period of time than required.

Delayed Handoff

In the case where the home node of the fallback partitions is still
down then read/write load will cause the secondaries to spin back up
and handoff will commence when the home node comes alive.

Missed Handoff

In the case where the home node of the fallbacks partitions is up when
the node(s) with the fallback partitions come up then the secondaries
will not be started by read/write load since all the primaries are
up. This means that, depending on the ring and preflist, 2 replicas
could exist on the same node until read repair or a write occurs.
That is, since the secondary partitions with fallback data are never
started handoff is missed and read repair + new writes must account
for all of the fallback data that was written while the home node
was down.

Steps to Reproduce

create devrel - make devrel, for d in dev/dev*; do $d/bin/riak start; done
join nodes - for d in dev/dev*; do $d/bin/riak-admin join [email protected]; done
wait for everything to settle (members + transfers)
stop dev4 - ./dev/dev4/bin/riak stop
insert data - for i in $(seq 1 100); do curl -X PUT -H 'content-type: text/plain' http://localhost:8091/riak/test/$i -d "hello"; done
check data - for i in $(seq 1 100); do curl http://localhost:8091/riak/test/$i 2> /dev/null && echo; done | wc -l
stop dev1-dev3 for d in dev/dev{1,2,3}; do $d/bin/riak stop; done
start dev4 - ./dev/dev4/bin/riak start
verify keys don't exist - for i in $(seq 1 100); do curl http://localhost:8094/riak/test/$i 2> /dev/null && echo; done | less
start dev1-dev3 - for d in dev/dev*; do $d/bin/riak start; done
wait a little, check transfers - ./dev/dev1/bin/riak-admin transfers

After step 11 there will be some transfers listed but after a while
they will complete. Those transfers are for stuff other than the
fallback data. To be sure you can wait until no transfers are
reported, stop dev1 - dev3, and then try to read the keys from
dev4--they won't be there.

stop dev4 - ./dev/dev4/bin/riak stop
read the keys - for i in $(seq 1 100); do curl http://localhost:8091/riak/test/$i 2> /dev/null && echo; done | less
check for transfers - ./dev/dev1/bin/riak-admin transfers

There should be multiple partitions pending transfer to dev4 now.
This is because the read caused the secondary vnodes to be read from
preflist and spun up.

riak_core should not directly reference `riak-admin` commands in output

https://github.com/basho/riak_core/blob/master/src/riak_core_console.erl#L179 and 9 other console output messages reference riak-admin. This is pretty confusing if you make a riak_core app that has nothing to do with riak, and try and use one of the console commands.

Nondeterministic failure of core_vnode_eqc:prop_simple()

I can't tell if this is due to the QC model not taking wall-clock time into account or if this is a real bug.

To reproduce:

% make test
% erl -pz .eunit deps/*/ebin
> C1 = [[{init,{setup,{qcst,[],[],undefined,[0,182687704666362864775460604089535377456991567872,365375409332725729550921208179070754913983135744,548063113999088594326381812268606132370974703616,730750818665451459101842416358141509827966271488,913438523331814323877303020447676887284957839360,1096126227998177188652763624537212264741949407232,1278813932664540053428224228626747642198940975104],[],false,0,[]}}},{set,{var,1},{call,core_vnode_eqc,prepare,[0]}},{set,{var,2},{call,core_vnode_eqc,start_vnode,[0]}},{set,{var,3},{call,mock_vnode,crash,[{0,nonode@nohost}]}},{set,{var,4},{call,mock_vnode,neverreply,[[{0,nonode@nohost}]]}},{set,{var,5},{call,mock_vnode,neverreply,[[{0,nonode@nohost}]]}},{set,{var,6},{call,mock_vnode,stop,[{0,nonode@nohost}]}},{set,{var,7},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},{set,{var,8},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},{set,{var,9},{call,core_vnode_eqc,start_vnode,[182687704666362864775460604089535377456991567872]}},{set,{var,10},{call,mock_vnode,get_counter,[{182687704666362864775460604089535377456991567872,nonode@nohost}]}},{set,{var,11},{call,mock_vnode,get_index,[{182687704666362864775460604089535377456991567872,nonode@nohost}]}},{set,{var,12},{call,mock_vnode,crash,[{182687704666362864775460604089535377456991567872,nonode@nohost}]}},{set,{var,13},{call,core_vnode_eqc,start_vnode,[730750818665451459101842416358141509827966271488]}},{set,{var,14},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},{set,{var,15},{call,core_vnode_eqc,latereply,[[{730750818665451459101842416358141509827966271488,nonode@nohost}]]}},{set,{var,16},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},{set,{var,17},{call,mock_vnode,get_crash_reason,[{730750818665451459101842416358141509827966271488,nonode@nohost}]}},{set,{var,18},{call,mock_vnode,get_counter,[{182687704666362864775460604089535377456991567872,nonode@nohost}]}},{set,{var,19},{call,core_vnode_eqc,start_vnode,[1278813932664540053428224228626747642198940975104]}},{set,{var,20},{call,mock_vnode,get_counter,[{182687704666362864775460604089535377456991567872,nonode@nohost}]}},{set,{var,21},{call,mock_vnode,get_index,[{182687704666362864775460604089535377456991567872,nonode@nohost}]}},{set,{var,22},{call,mock_vnode,get_index,[{1278813932664540053428224228626747642198940975104,nonode@nohost}]}},{set,{var,23},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},{set,{var,24},{call,core_vnode_eqc,latereply,[[{1278813932664540053428224228626747642198940975104,nonode@nohost}]]}}],[{res,[]}]].
> core_vnode_eqc:setup_simple().
> [true = eqc:check(core_vnode_eqc:prop_simple(), C1) || _ <- lists:seq(1,50*1000)].

After some period of time, I see:
OK, passed the test.
OK, passed the test.
Failed!
[{init,{setup,{qcst,[],[],undefined,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[],false,0,[]}}},
{set,{var,1},{call,core_vnode_eqc,prepare,[0]}},
{set,{var,2},{call,core_vnode_eqc,start_vnode,[0]}},
{set,{var,3},{call,mock_vnode,crash,[{0,nonode@nohost}]}},
{set,{var,4},{call,mock_vnode,neverreply,[[{0,nonode@nohost}]]}},
{set,{var,5},{call,mock_vnode,neverreply,[[{0,nonode@nohost}]]}},
{set,{var,6},{call,mock_vnode,stop,[{0,nonode@nohost}]}},
{set,{var,7},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},
{set,{var,8},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},
{set,{var,9},
{call,core_vnode_eqc,start_vnode,
[182687704666362864775460604089535377456991567872]}},
{set,{var,10},
{call,mock_vnode,get_counter,
[{182687704666362864775460604089535377456991567872,
nonode@nohost}]}},
{set,{var,11},
{call,mock_vnode,get_index,
[{182687704666362864775460604089535377456991567872,
nonode@nohost}]}},
{set,{var,12},
{call,mock_vnode,crash,
[{182687704666362864775460604089535377456991567872,
nonode@nohost}]}},
{set,{var,13},
{call,core_vnode_eqc,start_vnode,
[730750818665451459101842416358141509827966271488]}},
{set,{var,14},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},
{set,{var,15},
{call,core_vnode_eqc,latereply,
[[{730750818665451459101842416358141509827966271488,
nonode@nohost}]]}},
{set,{var,16},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},
{set,{var,17},
{call,mock_vnode,get_crash_reason,
[{730750818665451459101842416358141509827966271488,
nonode@nohost}]}},
{set,{var,18},
{call,mock_vnode,get_counter,
[{182687704666362864775460604089535377456991567872,
nonode@nohost}]}},
{set,{var,19},
{call,core_vnode_eqc,start_vnode,
[1278813932664540053428224228626747642198940975104]}},
{set,{var,20},
{call,mock_vnode,get_counter,
[{182687704666362864775460604089535377456991567872,
nonode@nohost}]}},
{set,{var,21},
{call,mock_vnode,get_index,
[{182687704666362864775460604089535377456991567872,
nonode@nohost}]}},
{set,{var,22},
{call,mock_vnode,get_index,
[{1278813932664540053428224228626747642198940975104,
nonode@nohost}]}},
{set,{var,23},{call,riak_core_vnode_master,all_nodes,[mock_vnode]}},
{set,{var,24},
{call,core_vnode_eqc,latereply,
[[{1278813932664540053428224228626747642198940975104,
nonode@nohost}]]}}]
History: [{{setup,{qcst,[],[],undefined,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[],false,0,[]}},
<0.3891.0>},
{{stopped,{qcst,[],[],<0.3891.0>,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[],false,0,[]}},
ok},
{{running,{qcst,[0],
[{0,0}],
<0.3891.0>,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[{0,undefined}],
false,0,[]}},
ok},
{{running,{qcst,[0],
[{0,0}],
<0.3891.0>,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[{0,0}],
false,0,[]}},
ok},
{{running,{qcst,[0],
[{0,1}],
<0.3891.0>,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[{0,0}],
false,0,[]}},
ok},
{{running,{qcst,[0],
[{0,2}],
<0.3891.0>,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[{0,0}],
false,0,[]}},
stopped},
{{running,{qcst,[],
[{0,0}],
<0.3891.0>,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[{0,undefined}],
false,0,[]}},
[]},
{{running,{qcst,[],
[{0,0}],
<0.3891.0>,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[{0,undefined}],
false,0,[]}},
[<0.3896.0>,<0.3895.0>]}]
State: {qcst,[],
[{0,0}],
<0.3891.0>,
[0,182687704666362864775460604089535377456991567872,
365375409332725729550921208179070754913983135744,
548063113999088594326381812268606132370974703616,
730750818665451459101842416358141509827966271488,
913438523331814323877303020447676887284957839360,
1096126227998177188652763624537212264741949407232,
1278813932664540053428224228626747642198940975104],
[{0,undefined}],
false,0,[]}
Result: {postcondition,false}
res: failed
{postcondition,false} /= ok
** exception error: no match of right hand side value false