uwiger / locks Goto Github PK

View Code? Open in Web Editor NEW

202.0 202.0 26.0 336 KB

A scalable, deadlock-resolving resource locker

License: Mozilla Public License 2.0

Erlang 100.00%

locks's People

Contributors

Stargazers

Watchers

locks's Issues

Deadlock due to outdated lock_info

I was testing locks using my testcase.
I believe that there is a bug in the lock_info handling of locks_server and locks_agent, which may cause deadlock.

My testcase has 3 concurrent clients/agents, namely C1, C2, and C3, and 3 locks, [1], [2], and [3].

C1 requests locks in the order of [[1], [2], [3]]
C2 requests locks in the order of [[2], [3], [1]]
C3 requests locks in the order of [[3], [1], [2]]

Here is how the bug happened (in sketch):

C1, C2, and C3 competed on locks.
Due to the deadlock resolving algorithm, C1, C2 eventually acquired all locks and finished.
In the resolution process, C3 got lock_info of [2] (due to locks_agent:send_indirects/1)
even C3 hadn't reach the point of requesting it, which means C3 was not in [2]'s queue.
The locks_server remove the local lock_info entry of [2] since the queue is empty now.
This effectively resets the vsn of the lock_info.
C3 started requesting [2], but the locks_server would respond with lock_info that
had lower vsn than what C3 was told with. Thus C3 got stuck.

I've tried to fix by not removing lock_info entries in locks_server, but my fix seems to fail the test in other ways. Maybe this breaks the algorithm?

Message lost of `locks_running` -- may cause deadlock

locks_watcher is notified by

locks/src/locks_server.erl

Line 64 in 8e9b2e3

catch locks_watcher ! locks_running,

However, the message could be lost if a locks_server is started after

locks/src/locks_watcher.erl

Line 23 in 8e9b2e3

case whereis(locks_server) of

but the message is sent before the name locks_watcher is registered.

locks/src/locks_watcher.erl

Line 26 in 8e9b2e3

try register(locks_watcher, self()),

Incorrect spec

https://github.com/uwiger/locks/blob/master/src/locks.erl#L49 returns {<0.101.0>,{ok,[]}}, but according to spec second element should be lock_result() :: {lock_status(), deadlocks()} where lock_status() :: have_all_locks | have_none.
I can change https://github.com/uwiger/locks/blob/master/src/locks_agent.erl#L312 so it will match one of these and send PR, but not sure what will be semantically correct, have_all_locks or have_none?

In lock_objects the last lock_nowait is missing the last parameter, Req?

-spec lock_objects(pid(), objs()) -> ok.
%%
lock_objects(Agent, Objects) ->
lists:foreach(fun({Obj, Mode}) when Mode == read; Mode == write ->
lock_nowait(Agent, Obj, Mode);
({Obj, Mode, Where}) when Mode == read; Mode == write ->
lock_nowait(Agent, Obj, Mode, Where);
({Obj, Mode, Where, Req})
when (Mode == read orelse Mode == write)
andalso (Req == all
orelse Req == any
orelse Req == majority
orelse Req == majority_alive
orelse Req == all_alive) ->
lock_nowait(Agent, Obj, Mode, Where);
(L) ->
error({illegal_lock_pattern, L})
end, Objects).

read vs write locks

I have a quick question corresponding to the way locks are shared or exclusive.

Suppose I have write locks on [Resource, OID] and requesting a read lock on [Resource]. Will the clients be able to handle reads while the client that requested a write is modifying the resource? Or the write lock will exclude all reads on the resource?

Application locks has stopped on double write-lock

I have two actors which works approximately in the same time. Each of them begins transaction. Each of them acquires read lock on the same oid(). Then first tries to upgrade read lock to write lock. Second does the same and application crashes immediately:

Logs of the first actor:

Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.4  (abort with ^G)
(n1@dch-mbp)1> application:ensure_all_started(locks).
{ok,[locks]}
(n1@dch-mbp)2> {Agent, TrRes} = locks:begin_transaction().
{<0.46.0>,{ok,[]}}
(n1@dch-mbp)3> locks:lock(Agent, [table], read).
{ok,[]}
(n1@dch-mbp)4> locks:lock(Agent, [table], write).
=ERROR REPORT==== 21-Oct-2015::14:45:19 ===                                                                                                                                                                [20/376]
** Generic server locks_server terminating 
** Last message in was {'$gen_cast',{surrender,[table],<0.55.0>}}
** When Server state == {st,{locks_server_locks,locks_server_agents},
                            {dict,2,16,16,8,80,48,
                                  {[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                   [],[]},
                                  {{[],[],[],[],[],[],[],
                                    [[<0.55.0>|#Ref<0.0.0.76>]],
                                    [],[],[],[],[],[],
                                    [[<0.46.0>|#Ref<0.0.0.69>]],
                                    []}}},
                            <0.44.0>}
** Reason for termination == 
** {function_clause,[{locks_server,queue_entries_,
                                   [[{entry,<0.55.0>,<0.53.0>,4,direct}]],
                                   [{file,"src/locks_server.erl"},{line,211}]},
                     {locks_server,queue_entries_,1,
                                   [{file,"src/locks_server.erl"},{line,214}]},
                     {locks_server,queue_entries_,1,
                                   [{file,"src/locks_server.erl"},{line,214}]},
                     {locks_server,queue_entries_,1,
                                   [{file,"src/locks_server.erl"},{line,212}]},
                     {locks_server,queue_entries,1,
                                   [{file,"src/locks_server.erl"},{line,207}]},
                     {locks_server,notify,3,
                                   [{file,"src/locks_server.erl"},{line,193}]},
                     {locks_server,handle_cast,2,
                                   [{file,"src/locks_server.erl"},{line,142}]},
                     {gen_server,handle_msg,5,
                                 [{file,"gen_server.erl"},{line,604}]}]}

=INFO REPORT==== 21-Oct-2015::14:45:19 ===
    application: locks
    exited: shutdown
    type: temporary
** exception error: {cannot_lock_objects,[{req,[table],
                                               read,
                                               ['n1@dch-mbp'],
                                               0,all},
                                          {req,[table],write,['n1@dch-mbp'],1,all}]}
     in function  locks_agent:await_reply/1 (src/locks_agent.erl, line 397)
     in call from locks_agent:lock_/6 (src/locks_agent.erl, line 380)
(n1@dch-mbp)5> application:which_applications().
[{stdlib,"ERTS  CXC 138 10","1.19.4"},
 {kernel,"ERTS  CXC 138 10","2.16.4"}]

Logs of the second actor:

Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.4  (abort with ^G)
(n2@dch-mbp)1> 
User switch command
 --> r 'n1@dch-mbp'
 --> c
Eshell V5.10.4  (abort with ^G)
(n1@dch-mbp)1> {Agent, TrRes} = locks:begin_transaction().
{<0.55.0>,{ok,[]}}
(n1@dch-mbp)2> locks:lock(Agent, [table], read).
{ok,[]}
(n1@dch-mbp)3> locks:lock(Agent, [table], write).
** exception error: {cannot_lock_objects,[{req,[table],
                                               read,
                                               ['n1@dch-mbp'],
                                               0,all},
                                          {req,[table],write,['n1@dch-mbp'],1,all}]}
     in function  locks_agent:await_reply/1 (src/locks_agent.erl, line 397)
     in call from locks_agent:lock_/6 (src/locks_agent.erl, line 380)

I am new to locks so I am trying to learn how it works. In some sense I need lock upgrade functionality, that is why I was curious how it works. Maybe I miss something and what I did goes against very basics of what locks should do.

Possible lack of tail recursion optiomisation

Am I right that try-catch block https://github.com/uwiger/locks/blob/master/src/locks_agent.erl#L228 prevents loop from tail recursion optimisation?
I am trying to understand how locks works, thus I am not fully understand it's architecture. As far as I can see agents supposed to live the same amount of time as transactions does. Can it blow it's stack for this time if I am right?

locks_leader cannot coexist with nodes not running locks application

locks_leader makes the assumption that all connected nodes are running the 'locks' application. If a node not running 'locks' connects to a node running a locks_leader process, the locks_leader process deadlocks.

Steps to reproduce

Start node 'a'.
Start the 'locks' application.
Start a locks_leader process.

Observe that the locks_leader process on node 'a' is the leader and responsive.

Start named node 'b'. Connect to 'a'.

Observe that the locks_leader process on node 'a' is now stuck in safe_loop and no longer responds to normal messages.

Investigation

locks_leader receives nodeup message, processed on line 558 of locks_leader.erl
The new node is not in nodes, so include_node (line 693) is called.
include_node calls locks_agent:lock_nowait
locks_agent sends a {locks_agent, _, 'waiting'} message, handled on line 571
Process gives up leadership, causing it to enter safe_loop, but a response from the newly connected node will never come since it is not running 'locks'.

I'm not sure what to do next. I don't know the locks application well enough to attempt a fix. Any guidance would be helpful.

cancel lock? what's needed

How could we cancel/release a lock before the end of the transaction? Any idea how it could be implemented?

Crash releasing write lock with outstanding read lockers

I was playing with locks to see if I could use it to help me synchronize shutdown of a process that may have in-flight new requests. On a single node I started three lock agents. I first took out a write lock with one, and then requested read locks with the other two, and they blocked (as expected).

Then, I ended the transaction for the agent that had the write lock with end_transaction/1, and both of the blocking read locks crashed.

To replicate, I added the following test to locks_tests.erl, added it to run_test_/0, and ran it:

one_lock_wrr_clients() ->
    L = [?MODULE, ?LINE],
    script([1,2,3],
           [{1, ?LINE, locks, lock, ['$agent', L, write], match({ok,[]})},
            {2, ?LINE, locks, lock_nowait, ['$agent', L, read], match(ok)},
            {3, ?LINE, locks, lock_nowait, ['$agent', L, read], match(ok)},
            {1, ?LINE, locks, end_transaction, ['$agent'], match(ok)},
            {2, ?LINE, locks, await_all_locks, ['$agent'],
             match({have_all_locks, []})},
            {3, ?LINE, locks, await_all_locks, ['$agent'],
             match({have_all_locks, []})}
           ]).

I am not confident that I should see {have_all_locks, []} for both calls to await_all_locks/1, but that doesn't really matter, because I get the following crash:

=ERROR REPORT==== 16-Oct-2015::01:14:18 ===
    locks_agent: aborted
    reason: function_clause
    trace: [{locks_agent,lock_holder,
                         [[{w,[{entry,<0.81.0>,<0.80.0>,3,direct},
                               {entry,<0.79.0>,<0.78.0>,2,direct}]}]],
                         [{file,"src/locks_agent.erl"},{line,1080}]},
            {locks_agent,'-pp_locks/1-lc$^0/1-0-',1,
                         [{file,"src/locks_agent.erl"},{line,1077}]},
            {locks_agent,all_locks_status,1,
                         [{file,"src/locks_agent.erl"},{line,1068}]},
            {locks_agent,check_if_done,2,
                         [{file,"src/locks_agent.erl"},{line,933}]},
            {locks_agent,handle_call,3,
                         [{file,"src/locks_agent.erl"},{line,509}]},
            {locks_agent,handle_msg,2,
                         [{file,"src/locks_agent.erl"},{line,266}]},
            {locks_agent,loop,1,[{file,"src/locks_agent.erl"},{line,261}]},
            {locks_agent,agent_init,3,
                         [{file,"src/locks_agent.erl"},{line,228}]}]
ERROR: {mismatch,
           [exit,
            {{function_clause,
                 [{locks_agent,agent_init,3,
                      [{file,"src/locks_agent.erl"},{line,235}]}]},
             {gen_server,call,[<0.79.0>,await_all_locks,infinity]}},
            normal,[]]}

Even though this crash was created with a lock_nowait/await_all_locks pair, it is exactly the same crash I received in my original testing.

The analysis I have done so far makes me think this is related to lock upgrades from read to write. Are lock upgrades supposed to happen automatically? Am I fundamentally misunderstanding the API? I don't know yet if this crash occurs if each agent is on a different node.

Application bound to a deprecated edown version

locks pulls in plain_fsm which in turn pulls in esl/edown, which doesn't compile under new Erlang versions as per esl/edown#32 . The script in priv/check_edown.script doesn't remove this dependency under current rebar and rebar3.

benchmark sourcecode ?

is the source code of the bench application used in your presentation [1] available? is there any usage in prod you're aware of btw?

[1] http://www.erlang-factory.com/static/upload/media/1402393233214750euc14wigerlocks.pdf

ask_candidates/2 never demonitors

Every time ask_candidates/2 is called, it monitors all candidates: https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L238

However, it never demonitors and duplicated or unexpected 'DOWN' messages may happen.

Spawning a middle-man process like how gen_server:multi_call does it or demonitor in collect_replies/1 can solve this problem.

When using more than on 2 nodes, first 2 nodes are hanged and not responding to gen_server calls

After I started 2 locks on 2 nodes, everything is working as expected: One node become a leader. locks_leader process on both nodes is in gen_server mode(current function is gen_server/loop). When I starting 3rd node, first 2 nodes become unresponsive to gen_server calls(current function on lock_leader process on this 2 nodes is lock_leader/safe_loop). lock_leader is waiting message have_all_locks, but not getting it.

I modify check_if_done function(line 787) in locks_agent.erl to resolve it

check_if_done(#state{pending = Pending} = State, Msgs) ->
case ets:info(Pending, size) of
0 ->
Msg = {have_all_locks, []},
notify_msgs([Msg|Msgs], have_all(State));
_ ->
check_if_done_(State, Msgs)
end.

After change, Leader node is in gen_server/loop, but other 2 nodes are in gen_leader/safe_loop.

Modified function get_locks (line 1194) to handle case when ets table is empty
get_locks([H|T], Ls) ->
case ets_lookup(Ls, H) of
[L] -> [L | get_locks(T, Ls)];
[] -> get_locks(T, Ls)
end;
get_locks([], _) ->
[].

After all that, I still having issues and continue to debug. Could you please check, and tell me if I am on the right path. Thank you

compile error in r16

rebar compile
==> examples (compile)
==> locks (compile)
src/locks_agent.erl: error in parse transform 'locks_watcher': {{badmatch,
{ok,
{locks_watcher,
[{abstract_code,
no_abstract_code}]}}},
[{locks_watcher,get_exprs,2,
[{file,"src/locks_watcher.erl"},
{line,111}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,71}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,96}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,100}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,95}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,98}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,98}]},
{locks_watcher,transform,1,
[{file,"src/locks_watcher.erl"},
{line,100}]}]}
make: *** [compile] Error 1

Generated edown files missing

In README.md there are links to edown generated docs, e.g. doc/locks.md, but the md files are not there.

how the leader election compares to raft?

I am curious how the implementation compares with raft? Did you have a look on it?

Also can we add dynamically a member and remove it from the cluster? How many participants can it handle?

No-one becomes a leader after netsplit

I found an unexpected behaviour of locks_leaders; I confirmed it using test_cb.erl. The repro is very simple:

start test_cb on nodes A and B; each process is a leader
connect A with B; one of test_cb processes will become the leader
disconnect A and B; each process is a leader again
connect A with B; both processes will be "elected" at the same time, after which both will surrender.

test_cb node A logs
test_cb node B logs

release tag ?

What prevent to tag a release right now? Would be cool to have locks available soon hex.pm :)

locks_leader:call/2 timeouts after joining new node

Trying to use locks app (master branch, c9b585a) I've got interesting failure in a scenario described below.

There were 4 nodes alive connected to each other - A, B, C and D. Node D was a leader.
One time new node E was started, it discovered other running nodes and connected to them.
Before new node E even connected to other nodes, it decided it was a leader.

Once node E connected to other nodes, it sent its leadership info to them. For all 3 non-leaders A, B and C node E locks_leader’s callback elected(State, Election, Pid) was called with "Pid" of the “joined” node (A, B and C) process. In its turn, node’s A, B and C locks_leader’s callback surrendered(State, Synch, Election) was called.

When new leader E connected to old leader D, netsplit happened. Node D won, it’s locks_leader’s callback elected(State, Election, undefined) was called and all other nodes (A, B, C and E) received notification in a callback surrendered(State, Synch, Election), so node E was informed that it was not a leader anymore.

Since then all calls locks_leader:call/2 made in nodes A, B and C ended up with timeout. Same call made in D and E worked as usual with no errors. So it seems that internal state of locks leader of the "passive" nodes A, B and C was compromised by fighting leaders D and E...

uwiger / locks Goto Github PK

locks's People

Contributors

Stargazers

Watchers

Forkers

locks's Issues

Steps to reproduce

Investigation

Recommend Projects

Recommend Topics

Recommend Org