uwiger / locks
A scalable, deadlock-resolving resource locker
License: Mozilla Public License 2.0
I was testing locks using my test case. I believe there is a bug in the lock_info handling of locks_server and locks_agent, which may cause a deadlock.
My test case has 3 concurrent clients/agents, namely C1, C2, and C3, and 3 locks: [1], [2], and [3].
Here is how the bug happened (in sketch):
C1, C2, and C3 competed for the locks.
Due to the deadlock-resolving algorithm, C1 and C2 eventually acquired all locks and finished.
In the resolution process, C3 received the lock_info of [2] (via locks_agent:send_indirects/1) even though C3 hadn't yet reached the point of requesting it, which means C3 was not in [2]'s queue.
The locks_server then removed its local lock_info entry for [2], since the queue was now empty.
This effectively reset the vsn of the lock_info.
C3 then started requesting [2], but the locks_server responded with lock_info that had a lower vsn than what C3 had already been told. Thus C3 got stuck.
I've tried a fix that avoids removing lock_info entries in locks_server, but it seems to fail the test in other ways. Maybe this breaks the algorithm?
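To make the failure mode concrete, here is a compressed, illustrative sketch of the vsn reset. This is not locks' actual code or data structures; the map layout and vsn numbers are made up for illustration:

```erlang
%% Illustrative only -- not locks' real code. The server versions each
%% lock's queue; an agent ignores lock_info whose vsn is not newer than
%% the one it already holds.
S0 = #{},                                       % no entry for [2] yet
S1 = S0#{[2] => #{vsn => 5, queue => [c1]}},    % C1/C2 activity bumps the vsn
%% C3 learns {[2], vsn 5} indirectly via send_indirects/1,
%% without ever being in [2]'s queue.
S2 = maps:remove([2], S1),                      % queue empties -> entry dropped
%% C3 now requests [2]; the fresh entry restarts at vsn 1 ...
S3 = S2#{[2] => #{vsn => 1, queue => [c3]}},
%% ... and C3 discards the reply because 1 < 5, so it waits forever.
```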
https://github.com/uwiger/locks/blob/master/src/locks.erl#L49 returns {<0.101.0>,{ok,[]}}, but according to the spec the second element should be lock_result() :: {lock_status(), deadlocks()}, where lock_status() :: have_all_locks | have_none.
I can change https://github.com/uwiger/locks/blob/master/src/locks_agent.erl#L312 so it matches one of these and send a PR, but I'm not sure which would be semantically correct: have_all_locks or have_none?
-spec lock_objects(pid(), objs()) -> ok.
%%
lock_objects(Agent, Objects) ->
    lists:foreach(fun({Obj, Mode}) when Mode == read; Mode == write ->
                          lock_nowait(Agent, Obj, Mode);
                     ({Obj, Mode, Where}) when Mode == read; Mode == write ->
                          lock_nowait(Agent, Obj, Mode, Where);
                     ({Obj, Mode, Where, Req})
                        when (Mode == read orelse Mode == write)
                             andalso (Req == all
                                      orelse Req == any
                                      orelse Req == majority
                                      orelse Req == majority_alive
                                      orelse Req == all_alive) ->
                          lock_nowait(Agent, Obj, Mode, Where);
                     (L) ->
                          error({illegal_lock_pattern, L})
                  end, Objects).
I have a quick question about the way locks are shared or exclusive.
Suppose I hold a write lock on [Resource, OID] and request a read lock on [Resource]. Will clients be able to handle reads while the client that requested the write is modifying the resource? Or will the write lock exclude all reads on the resource?
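For reference, the scenario can be written directly against the locks API (a sketch assuming a running locks application; whether the second lock call blocks is exactly the question being asked):

```erlang
%% Sketch of the scenario (assumes application:ensure_all_started(locks)
%% has already succeeded; [resource, oid] is an illustrative object id).
{A1, _} = locks:begin_transaction(),
{ok, _} = locks:lock(A1, [resource, oid], write),  % write lock on the child

{A2, _} = locks:begin_transaction(),
%% Does this proceed (the read is on the parent [resource] only),
%% or does the write lock on the child [resource, oid] exclude it?
Res = locks:lock(A2, [resource], read).
```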
I have two actors that run at approximately the same time. Each of them begins a transaction and acquires a read lock on the same oid(). Then the first tries to upgrade its read lock to a write lock. The second does the same, and the application crashes immediately:
Logs of the first actor:
Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.4 (abort with ^G)
(n1@dch-mbp)1> application:ensure_all_started(locks).
{ok,[locks]}
(n1@dch-mbp)2> {Agent, TrRes} = locks:begin_transaction().
{<0.46.0>,{ok,[]}}
(n1@dch-mbp)3> locks:lock(Agent, [table], read).
{ok,[]}
(n1@dch-mbp)4> locks:lock(Agent, [table], write).
=ERROR REPORT==== 21-Oct-2015::14:45:19 ===
** Generic server locks_server terminating
** Last message in was {'$gen_cast',{surrender,[table],<0.55.0>}}
** When Server state == {st,{locks_server_locks,locks_server_agents},
{dict,2,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]},
{{[],[],[],[],[],[],[],
[[<0.55.0>|#Ref<0.0.0.76>]],
[],[],[],[],[],[],
[[<0.46.0>|#Ref<0.0.0.69>]],
[]}}},
<0.44.0>}
** Reason for termination ==
** {function_clause,[{locks_server,queue_entries_,
[[{entry,<0.55.0>,<0.53.0>,4,direct}]],
[{file,"src/locks_server.erl"},{line,211}]},
{locks_server,queue_entries_,1,
[{file,"src/locks_server.erl"},{line,214}]},
{locks_server,queue_entries_,1,
[{file,"src/locks_server.erl"},{line,214}]},
{locks_server,queue_entries_,1,
[{file,"src/locks_server.erl"},{line,212}]},
{locks_server,queue_entries,1,
[{file,"src/locks_server.erl"},{line,207}]},
{locks_server,notify,3,
[{file,"src/locks_server.erl"},{line,193}]},
{locks_server,handle_cast,2,
[{file,"src/locks_server.erl"},{line,142}]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,604}]}]}
=INFO REPORT==== 21-Oct-2015::14:45:19 ===
application: locks
exited: shutdown
type: temporary
** exception error: {cannot_lock_objects,[{req,[table],
read,
['n1@dch-mbp'],
0,all},
{req,[table],write,['n1@dch-mbp'],1,all}]}
in function locks_agent:await_reply/1 (src/locks_agent.erl, line 397)
in call from locks_agent:lock_/6 (src/locks_agent.erl, line 380)
(n1@dch-mbp)5> application:which_applications().
[{stdlib,"ERTS CXC 138 10","1.19.4"},
{kernel,"ERTS CXC 138 10","2.16.4"}]
Logs of the second actor:
Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.4 (abort with ^G)
(n2@dch-mbp)1>
User switch command
--> r 'n1@dch-mbp'
--> c
Eshell V5.10.4 (abort with ^G)
(n1@dch-mbp)1> {Agent, TrRes} = locks:begin_transaction().
{<0.55.0>,{ok,[]}}
(n1@dch-mbp)2> locks:lock(Agent, [table], read).
{ok,[]}
(n1@dch-mbp)3> locks:lock(Agent, [table], write).
** exception error: {cannot_lock_objects,[{req,[table],
read,
['n1@dch-mbp'],
0,all},
{req,[table],write,['n1@dch-mbp'],1,all}]}
in function locks_agent:await_reply/1 (src/locks_agent.erl, line 397)
in call from locks_agent:lock_/6 (src/locks_agent.erl, line 380)
I am new to locks, so I am trying to learn how it works. In some sense I need lock-upgrade functionality, which is why I was curious about this. Maybe I'm missing something, and what I did goes against the very basics of what locks should do.
Am I right that the try-catch block at https://github.com/uwiger/locks/blob/master/src/locks_agent.erl#L228 prevents the loop from being tail-call optimised?
I am trying to understand how locks works, so I don't yet fully understand its architecture. As far as I can see, agents are supposed to live for the same amount of time as their transactions. If I'm right, could the agent blow its stack during that time?
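For context, here is a minimal self-contained sketch of why a recursive call inside a try ... catch body cannot be a tail call in Erlang, and the usual try ... of workaround. The module and function names are made up for illustration:

```erlang
-module(tco_demo).
-export([loop_no_tco/1, loop_tco/1]).

%% The recursive call sits inside the protected part of try ... catch,
%% so the try frame must stay on the stack to catch a possible
%% exception: each iteration grows the stack.
loop_no_tco(0) -> ok;
loop_no_tco(N) ->
    try
        loop_no_tco(N - 1)
    catch
        _:_ -> error
    end.

%% Only step/1 is protected; the recursive call in the 'of' branch
%% runs after the protected scope ends, so it is a proper tail call.
loop_tco(0) -> ok;
loop_tco(N) ->
    try step(N) of
        ok -> loop_tco(N - 1)
    catch
        _:_ -> error
    end.

step(_N) -> ok.
```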
locks_leader makes the assumption that all connected nodes are running the 'locks' application. If a node not running 'locks' connects to a node running a locks_leader process, the locks_leader process deadlocks.
Start node 'a'.
Start the 'locks' application.
Start a locks_leader process.
Observe that the locks_leader process on node 'a' is the leader and responsive.
Start named node 'b'. Connect to 'a'.
Observe that the locks_leader process on node 'a' is now stuck in safe_loop and no longer responds to normal messages.
locks_leader receives nodeup message, processed on line 558 of locks_leader.erl
The new node is not in nodes, so include_node (line 693) is called.
include_node calls locks_agent:lock_nowait
locks_agent sends a {locks_agent, _, 'waiting'} message, handled on line 571
Process gives up leadership, causing it to enter safe_loop, but a response from the newly connected node will never come since it is not running 'locks'.
I'm not sure what to do next. I don't know the locks application well enough to attempt a fix. Any guidance would be helpful.
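One possible direction, offered only as an untested sketch (the function name and placement are hypothetical, not part of locks): before include_node is called for a newly connected node, check that the node is actually running the locks application, using only standard OTP calls:

```erlang
-module(locks_guard).
-export([remote_runs_locks/1]).

%% Hypothetical guard, untested against locks itself: returns true
%% only if the remote node reports the locks application as running.
remote_runs_locks(Node) ->
    case rpc:call(Node, application, which_applications, []) of
        Apps when is_list(Apps) ->
            lists:keymember(locks, 1, Apps);
        _Error ->            % {badrpc, _}: node down or unreachable
            false
    end.
```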
How could we cancel/release a lock before the end of the transaction? Any idea how it could be implemented?
I was playing with locks to see if I could use it to help me synchronize shutdown of a process that may have in-flight new requests. On a single node I started three lock agents. I first took out a write lock with one, then requested read locks with the other two, and they blocked (as expected).
Then I ended the transaction for the agent that held the write lock with end_transaction/1, and both of the blocked read locks crashed.
To replicate, I added the following test to locks_tests.erl, added it to run_test_/0, and ran it:
one_lock_wrr_clients() ->
    L = [?MODULE, ?LINE],
    script([1,2,3],
           [{1, ?LINE, locks, lock, ['$agent', L, write], match({ok,[]})},
            {2, ?LINE, locks, lock_nowait, ['$agent', L, read], match(ok)},
            {3, ?LINE, locks, lock_nowait, ['$agent', L, read], match(ok)},
            {1, ?LINE, locks, end_transaction, ['$agent'], match(ok)},
            {2, ?LINE, locks, await_all_locks, ['$agent'],
             match({have_all_locks, []})},
            {3, ?LINE, locks, await_all_locks, ['$agent'],
             match({have_all_locks, []})}
           ]).
I am not confident that I should see {have_all_locks, []} for both calls to await_all_locks/1, but that doesn't really matter, because I get the following crash:
=ERROR REPORT==== 16-Oct-2015::01:14:18 ===
locks_agent: aborted
reason: function_clause
trace: [{locks_agent,lock_holder,
[[{w,[{entry,<0.81.0>,<0.80.0>,3,direct},
{entry,<0.79.0>,<0.78.0>,2,direct}]}]],
[{file,"src/locks_agent.erl"},{line,1080}]},
{locks_agent,'-pp_locks/1-lc$^0/1-0-',1,
[{file,"src/locks_agent.erl"},{line,1077}]},
{locks_agent,all_locks_status,1,
[{file,"src/locks_agent.erl"},{line,1068}]},
{locks_agent,check_if_done,2,
[{file,"src/locks_agent.erl"},{line,933}]},
{locks_agent,handle_call,3,
[{file,"src/locks_agent.erl"},{line,509}]},
{locks_agent,handle_msg,2,
[{file,"src/locks_agent.erl"},{line,266}]},
{locks_agent,loop,1,[{file,"src/locks_agent.erl"},{line,261}]},
{locks_agent,agent_init,3,
[{file,"src/locks_agent.erl"},{line,228}]}]
ERROR: {mismatch,
[exit,
{{function_clause,
[{locks_agent,agent_init,3,
[{file,"src/locks_agent.erl"},{line,235}]}]},
{gen_server,call,[<0.79.0>,await_all_locks,infinity]}},
normal,[]]}
Even though this crash was created with a lock_nowait/await_all_locks pair, it is exactly the same crash I received in my original testing.
The analysis I have done so far makes me think this is related to lock upgrades from read to write. Are lock upgrades supposed to happen automatically? Am I fundamentally misunderstanding the API? I don't know yet if this crash occurs if each agent is on a different node.
locks pulls in plain_fsm, which in turn pulls in esl/edown, which doesn't compile under new Erlang versions as per esl/edown#32. The script in priv/check_edown.script doesn't remove this dependency under current rebar and rebar3.
Is the source code of the bench application used in your presentation [1] available? And is there any usage in production you're aware of, by the way?
[1] http://www.erlang-factory.com/static/upload/media/1402393233214750euc14wigerlocks.pdf
Every time ask_candidates/2 is called, it monitors all candidates: https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L238
However, it never demonitors them, so duplicated or unexpected 'DOWN' messages may arrive.
Spawning a middle-man process, the way gen_server:multi_call does, or demonitoring in collect_replies/1, would solve this problem.
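To illustrate the suggested demonitor approach, here is a minimal self-contained sketch (the module, function, and message shapes are made up; the real fix would live in collect_replies/1). The [flush] option also discards any 'DOWN' message that raced into the mailbox before the demonitor:

```erlang
-module(monitor_demo).
-export([ask/1]).

%% Monitor a candidate, wait for its reply, then demonitor with
%% [flush] so a stray 'DOWN' already in the mailbox is discarded.
%% Without the demonitor, repeated calls (as in ask_candidates/2)
%% accumulate monitors and leftover 'DOWN' messages.
ask(Pid) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {ask, self(), Ref},
    receive
        {reply, Ref, Answer} ->
            erlang:demonitor(Ref, [flush]),
            {ok, Answer};
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, Reason}
    after 5000 ->
            erlang:demonitor(Ref, [flush]),
            {error, timeout}
    end.
```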
After I started locks on 2 nodes, everything worked as expected: one node became the leader, and the locks_leader process on both nodes was in gen_server mode (current function gen_server/loop). When I started a 3rd node, the first 2 nodes became unresponsive to gen_server calls (the current function of the locks_leader process on these 2 nodes was locks_leader/safe_loop). locks_leader was waiting for a have_all_locks message but not getting it.
I modified the check_if_done function (line 787) in locks_agent.erl to resolve it:
check_if_done(#state{pending = Pending} = State, Msgs) ->
    case ets:info(Pending, size) of
        0 ->
            Msg = {have_all_locks, []},
            notify_msgs([Msg|Msgs], have_all(State));
        _ ->
            check_if_done_(State, Msgs)
    end.
After the change, the leader node is in gen_server/loop, but the other 2 nodes are in locks_leader/safe_loop.
I also modified get_locks (line 1194) to handle the case when the ets table is empty:
get_locks([H|T], Ls) ->
    case ets_lookup(Ls, H) of
        [L] -> [L | get_locks(T, Ls)];
        []  -> get_locks(T, Ls)
    end;
get_locks([], _) ->
    [].
After all that, I'm still having issues and continue to debug. Could you please check and tell me whether I am on the right path? Thank you.
rebar compile
==> examples (compile)
==> locks (compile)
src/locks_agent.erl: error in parse transform 'locks_watcher': {{badmatch,
{ok,
{locks_watcher,
[{abstract_code,
no_abstract_code}]}}},
[{locks_watcher,get_exprs,2,
[{file,"src/locks_watcher.erl"},
{line,111}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,71}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,96}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,100}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,95}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,98}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,98}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,100}]}]}
make: *** [compile] Error 1
In README.md there are links to edown-generated docs, e.g. doc/locks.md, but the md files are not there.
I am curious how the implementation compares with Raft. Have you had a look at it?
Also, can we dynamically add a member to the cluster and remove it again? How many participants can it handle?
I found an unexpected behaviour of locks_leader; I confirmed it using test_cb.erl. The repro is very simple: run test_cb on nodes A and B, so that each process is a leader; once the nodes are connected, both test_cb processes will become the leader.

What prevents tagging a release right now? Would be cool to have locks available on hex.pm soon :)
Trying to use the locks app (master branch, c9b585a) I ran into an interesting failure in the scenario described below.
There were 4 nodes alive, connected to each other: A, B, C and D. Node D was the leader.
At some point a new node E was started; it discovered the other running nodes and connected to them.
Before node E even connected to the other nodes, it decided it was a leader.
Once node E connected to the other nodes, it sent its leadership info to them. For all 3 non-leaders A, B and C, node E's locks_leader callback elected(State, Election, Pid) was called with the Pid of the "joined" node's (A, B and C) process. In turn, the locks_leader callback surrendered(State, Synch, Election) was called on nodes A, B and C.
When new leader E connected to old leader D, a netsplit resolution happened. Node D won: its locks_leader callback elected(State, Election, undefined) was called, and all other nodes (A, B, C and E) received a notification in the callback surrendered(State, Synch, Election), so node E was informed that it was no longer a leader.
Since then, all locks_leader:call/2 calls made on nodes A, B and C ended in a timeout. The same call made on D and E worked as usual with no errors. So it seems the internal state of the locks_leader on the "passive" nodes A, B and C was corrupted by the fighting leaders D and E...