derekkraan / horde
Horde is a distributed Supervisor and Registry backed by DeltaCrdt
License: MIT License
When starting a new child with Horde.Supervisor.start_child/2 and the child is already started, the error {:error, {:already_started, pid}} is returned, but pid is always nil.
Having the pid in the returned error is commonly used to perform operations on the process, for example: try to start the process -> already started -> GenServer.call(pid, whatever).
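The pattern in question looks roughly like this (a sketch; MyApp.DistributedSupervisor, child_spec, and :do_work are made-up names for illustration):

```elixir
case Horde.Supervisor.start_child(MyApp.DistributedSupervisor, child_spec) do
  {:ok, pid} ->
    GenServer.call(pid, :do_work)

  {:error, {:already_started, pid}} ->
    # With the bug described above, `pid` is nil here, so this call fails
    # instead of reaching the already-running process.
    GenServer.call(pid, :do_work)
end
```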
Please fill out this short survey to help me understand how Horde is being used in the wild.
Is there a mechanism implemented in this library by which a process can hand off its state when it is being restarted on another node? A bit like what Swarm has implemented?
I think this should "just work", in which case I can remove the checks that prevent this from happening, but I will need to figure out a good way to test this case first.
Why it should work: each process is linked to the various nodes of Horde.Registry. In the event of a network partition, these processes will each receive an exit signal. These processes should interpret that as the Registry having died, and re-register themselves. Then, when the different parts of the Registry are rejoined, all keys should have been added again and the Registry should be complete.
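The re-registration idea described above could be sketched like this (illustrative only; MyApp.Worker and MyApp.DistributedRegistry are made-up names, and treating every :EXIT as a Registry death is a simplification):

```elixir
defmodule MyApp.Worker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  @impl true
  def init(name) do
    # Trap exits so the link to the Registry delivers a message
    # instead of killing this process.
    Process.flag(:trap_exit, true)
    Horde.Registry.register(MyApp.DistributedRegistry, name, nil)
    {:ok, name}
  end

  @impl true
  def handle_info({:EXIT, _from, _reason}, name) do
    # Interpret the exit signal as the Registry having died, and
    # re-register so the rejoined Registry ends up complete again.
    Horde.Registry.register(MyApp.DistributedRegistry, name, nil)
    {:noreply, name}
  end
end
```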
Currently Horde.Supervisor only supports permanent processes. For feature parity with DynamicSupervisor, supporting transient and temporary processes is needed.
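For context, the restart type is declared in the child spec. A minimal sketch (MyTask is an illustrative name):

```elixir
# :permanent - always restarted (the only type Horde currently supports)
# :transient - restarted only if it terminates abnormally
# :temporary - never restarted
child_spec = %{
  id: MyTask,
  start: {MyTask, :start_link, []},
  restart: :transient
}
```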
Hi,
This is not so much an issue as a documentation improvement/completion request.
I wanted to give Horde a try, after having spent some time with Swarm. I'm not very experienced with distributed development in Elixir, but I'm not an Elixir newbie. I got stuck quite quickly when trying to define both a Registry and a Supervisor: how do I register a worker that is going to be supervised at the same time? Do I need to use a :via tuple, or is there some other way? What about join(): should it be executed on each node, or is it to be used in case we have several supervisors/registries? How do I start multiple nodes and create a globally distributed supervisor/registry? Etc.
As I believe I'm not the only person in this situation, I think some code snippets with basic patterns would be really helpful. For example:
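One commonly suggested pattern for the supervised-and-registered case is a :via tuple pointing at Horde.Registry (a sketch under assumptions, not official documentation; all module and key names are illustrative):

```elixir
defmodule MyApp.Worker do
  use GenServer

  def start_link(name) do
    # Registering via the name: option means registration happens
    # wherever Horde.Supervisor decides to start the process.
    GenServer.start_link(__MODULE__, name, name: via_tuple(name))
  end

  def via_tuple(name), do: {:via, Horde.Registry, {MyApp.DistributedRegistry, name}}

  @impl true
  def init(name), do: {:ok, name}
end

# Started on any node; Horde decides where it actually runs:
# Horde.Supervisor.start_child(MyApp.DistributedSupervisor, {MyApp.Worker, "worker-1"})
```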
Thanks in advance
The UniformQuorumDistribution strategy, and any similar ones where has_quorum?/1 can return false (and choose_node therefore nil), breaks the :start_child call in SupervisorImpl.
defmodule Horde.UniformQuorumDistribution do
  def choose_node(identifier, members) do
    if has_quorum?(members) do
      Horde.UniformDistribution.choose_node(identifier, members)
    else
      nil
    end
  end
end
and
defmodule Horde.SupervisorImpl do
  def handle_call({:start_child, child_spec} = msg, from, %{node_id: this_node_id} = state) do
    case state.distribution_strategy.choose_node(child_spec.id, state.members) do
      {^this_node_id, _} ->
        {reply, new_state} = add_child(child_spec, state)
        {:reply, reply, new_state}

      {other_node_id, _} ->
        proxy_to_node(other_node_id, msg, from, state)
    end
  end
end
If a cluster is still starting up, there will always be a point at which state.members is smaller than whatever constitutes a quorum.
Is there perhaps a way to delay the starting of child processes until has_quorum?/1 returns true for the cluster?
What would be the best way to deal with the startup and quorum?
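One way the implementation could guard against this (a sketch, not the actual Horde code; :no_quorum is a made-up error term) is an explicit nil clause in the case:

```elixir
def handle_call({:start_child, child_spec} = msg, from, %{node_id: this_node_id} = state) do
  case state.distribution_strategy.choose_node(child_spec.id, state.members) do
    nil ->
      # No quorum yet: reply with an error instead of crashing on the
      # unmatched nil. Callers can retry once the cluster is up.
      {:reply, {:error, :no_quorum}, state}

    {^this_node_id, _} ->
      {reply, new_state} = add_child(child_spec, state)
      {:reply, reply, new_state}

    {other_node_id, _} ->
      proxy_to_node(other_node_id, msg, from, state)
  end
end
```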
Hi Derek.
I have a cluster setup with libcluster, on my local machine, as such:
fixed_example: [
strategy: Cluster.Strategy.Epmd,
config: [
hosts: [:"[email protected]", :"[email protected]", :"[email protected]"],
secret: "myapp_cluster"
]
]
Where 192.168.43.226 is my Wi-Fi IP address.
I have a horde cluster that uses the results of Node.list() to connect to the other nodes.
This works perfectly fine. I have a process that should only run on one instance of the cluster and I can confirm this works wonderfully.
If I call {:ok, members} = Horde.Cluster.members(MyApp.HordeSupervisor)
I get an :ok tuple and three members that each have the :alive
distinction.
However, when I turn my wifi off, and the network splits into 3 partitions, then the response from running {:ok, members} = Horde.Cluster.members(MyApp.HordeSupervisor)
on each node returns an :ok tuple with only one member and it is still alive.
The problem then is that Horde.UniformQuorumDistribution uses the members list to determine quorum, and since the members list has updated to include only the one node each partition can see, each partition thinks it has quorum, instead of being killed.
Do you have any idea if this is the expected behaviour? The problem has to be related to the members list updating, but I am not sure where to find this or what could be causing it.
Hi,
I have tried running your example on ubuntu 18.04.
The two nodes start fine, but both display the hello message with their own node details rather than just one of them (I cut and pasted the start code from the example README into each terminal).
Killing one of the nodes does not cause the other node to pick up the lost process.
Running Node.list shows that both are connected to each other.
Am I doing something wrong?
regards
Tom
Currently the option for number of partitions is ignored. As a stop-gap measure we should check it and make sure it is not greater than 1.
I am loving Horde and learning a lot along the way.
I successfully get the HelloWorld
application to run with three different IEX instances and can see that only one instance of SayHello
is running. However, when I ask each one the :how_many
the count is different on each node, being the number of times that node has executed the :say_hello
function.
I thought it would be simple to extend the example application to work with CRDTs and share the count, so that how_many? would return the total number of times the function has been called in the cluster. It doesn't seem to be that simple, and the documentation for DeltaCrdt doesn't make it clear how adding neighbours from other nodes works.
It would be great if the example application could be extended to show how the various nodes will use DeltaCrdt to share data between them.
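A very rough sketch of the idea, assuming the delta_crdt ~> 0.5 API (check the version you have; MyApp.Crdt and the counter key are made up, and neighbours must be set on each node pointing at the others for bidirectional sync):

```elixir
# Start an add-wins last-write-wins map CRDT on this node.
{:ok, _crdt} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, name: MyApp.Crdt, sync_interval: 100)

# Tell this CRDT about its peers on the other nodes (repeat on each node).
DeltaCrdt.set_neighbours(MyApp.Crdt, [{MyApp.Crdt, :"[email protected]"}])

# Read the shared counter and bump it; the new value converges on all
# nodes eventually (note this read-modify-write is not atomic).
count = Map.get(DeltaCrdt.read(MyApp.Crdt), :say_hello_count, 0)
DeltaCrdt.mutate(MyApp.Crdt, :add, [:say_hello_count, count + 1])
```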
I think this is a bug.
TLDR: The Distributed registry doesn't retrieve state when joining a cluster after running set_members
.
This is in my application.ex
[
...,
{Cluster.Supervisor, [Application.get_env(:libcluster, :topologies)]},
{Horde.Registry, name: FamilyFive.HordeRegistry, keys: :unique},
{Horde.Supervisor,
name: FamilyFive.HordeSupervisor, strategy: :one_for_one, children: []},
...
]
And this is a Tracker module that runs after the participating nodes are retrieved or changed:
Horde.Cluster.set_members(
  FamilyFive.HordeRegistry,
  Enum.map(nodes, fn n -> {FamilyFive.HordeRegistry, n} end)
)
Horde.Cluster.set_members(
FamilyFive.HordeSupervisor,
Enum.map(nodes, fn n -> {FamilyFive.HordeSupervisor, n} end)
)
When I start a new process:
Horde.Supervisor.start_child(
FamilyFive.HordeSupervisor,
FamilyFive.PushNotifications.PushNotificationsScheduler
)
The FamilyFive.PushNotifications.PushNotificationsScheduler
includes the following start_link to be registered:
GenServer.start_link(__MODULE__, [],
name: {:via, Horde.Registry, {FamilyFive.HordeRegistry, __MODULE__}}
)
Then when I start up a new node and run this on the new node:
Horde.Registry.lookup(
FamilyFive.HordeRegistry,
FamilyFive.PushNotifications.PushNotificationsScheduler
)
It returns :undefined, even though the lookup returns the pid correctly on the other node. This inconsistency is not resolved automatically.
If I start the two nodes first and then start the process on one of the nodes, the lookup works. But if I first start one node, start the process, and only then spin up the new node and run the lookup, it doesn't work.
I think it might have to do with the fact that I don't specify the members on startup, but I don't know the members at startup (they are only known when libcluster queries the DNS).
See: https://asciinema.org/a/r19vUhFueoPRbf7IaPB92ntpu
I'm starting a two node cluster and then force a net split. During the net split, I am unable to start a second process with the same name (the supervisor won't let me). This means that I cannot have availability on both sides of the split.
However, if I tweak the memberships during the split (which I probably shouldn't be doing), I am able to start the same process on both sides. However, if I then heal the split and update the member list to contain both nodes this conflict is never resolved and I end up with two processes running.
Kind of the same behaviour can be observed when a node shuts down: https://asciinema.org/a/RDS52ntJYw6gCDXLN2rmfYkKc but in this case I lose the process entirely.
In both the cases of a net split, as well as a shutdown (since the scenario is pretty much the same for the other nodes in the cluster) I would have expected the process to be migrated to the other node. Or at least that I would have been able to start it.
I'm not fully sure what the expected behaviour is supposed to be, but I think this may be worth looking into.
First I’d like to say that this is great work.
I spent several days with Swarm before realizing it didn’t fit my needs because it doesn’t supervise processes.
Since Horde allows proper distributed supervision of subtrees, it seems to meet my needs perfectly, so thank you!
On to the real question...
Maybe I'm missing an option somewhere, but I can't seem to find a way to tell Horde to leave processes running wherever they currently are when a node is added to the cluster.
I’m currently seeing the following behavior:
:ok
It would be nice if I could leave those processes running on node A.
I’m using Horde so that I can have a hot standby node ready to take over should node A leave the cluster unexpectedly.
That works, but the current behavior means there is a tiny bit of instability when my failover node comes online, or anytime one of my containers gets restarted really.
Currently we use .start_link
in a couple of places in Horde.Registry
and I'd like to move these to a proper supervision tree.
See the PR: elixir-lang/elixir#8963
Describe the bug
Hi,
The documentation has a lot of references to DynamicSupervisor, but Horde.Supervisor seems to act like a Supervisor, which is a little confusing.
The return value also seems to be inconsistent, returning a nil pid on :already_started.
To Reproduce
defmodule MyApp.Supervisor do
# Automatically defines child_spec/1
use Supervisor
def start_link(init_arg) do
Supervisor.start_link(__MODULE__, init_arg)
end
@impl true
def init(_init_arg) do
children = [
{MyServer, []}
]
Supervisor.init(children, strategy: :one_for_one)
end
end
defmodule MyServer do
use GenServer
def start_link(state) do
GenServer.start_link(__MODULE__, state)
end
## Callbacks
@impl true
def init(stack) do
{:ok, stack}
end
end
{:ok, _} = DynamicSupervisor.start_link(name: MyApp.CusterDynamicSupervisor, strategy: :one_for_one)
DynamicSupervisor.start_child(MyApp.CusterDynamicSupervisor, {MyApp.Supervisor, []})
# {:ok, #PID<0.419.0>}
DynamicSupervisor.start_child(MyApp.CusterDynamicSupervisor, {MyApp.Supervisor, []})
# {:ok, #PID<0.422.0>}
DynamicSupervisor.which_children(MyApp.CusterDynamicSupervisor)
# [
# {:undefined, #PID<0.419.0>, :supervisor, [MyApp.Supervisor]},
# {:undefined, #PID<0.422.0>, :supervisor, [MyApp.Supervisor]}
# ]
{:ok, _} = Supervisor.start_link([], name: MyApp.CusterSupervisor, strategy: :one_for_one)
# Supervisor.start_child(MyApp.Horde, {MyApp.CusterSupervisor, []})
# {:ok, #PID<0.419.0>}
# Supervisor.start_child(MyApp.Horde, {MyApp.CusterSupervisor, []})
# {:error, {:already_started, #PID<0.419.0>}}
# Supervisor.which_children(MyApp.Horde)
# [{MyApp.Supervisor, #PID<0.419.0>, :supervisor, [MyApp.Supervisor]}]
{:ok, _} = Horde.Supervisor.start_link(name: MyApp.CusterDistributedSupervisor, strategy: :one_for_one)
Horde.Supervisor.start_child(MyApp.CusterDistributedSupervisor, {MyApp.Supervisor, []})
# {:ok, #PID<0.435.0>}
Horde.Supervisor.start_child(MyApp.CusterDistributedSupervisor, {MyApp.Supervisor, []})
# {:error, {:already_started, nil}}
Horde.Supervisor.which_children(MyApp.CusterDistributedSupervisor)
# [{MyApp.Supervisor, #PID<0.435.0>, :supervisor, [MyApp.Supervisor]}]
Environment
Erlang/OTP: 21
Elixir: ~> 1.8.1
Horde: ~> 0.5.0-rc.7
macOS: 10.14.3 (18D109)
Additional context
I am developing a MMORPG server for educational purposes.
What I want to do is create a distributed dynamic supervisor that will supervise several Worlds, each of which will supervise several Channels. A channel can be considered a parallel dimension of the world in which players cannot see each other but can still send group messages and private messages, share a guild, and so on.
Hi,
Unless I misunderstood something, I believe there are some issues in joining/leaving hordes. Below are the traces and comments on the experiments I have done so far. As it is all about the same topic, I am opening just one issue.
For what follows I have a XYZ.Utils
module with two helper functions that will be used later:
def start_worker(name) do
  cs = %{id: name, start: {XYZ.Worker, :start_link, [[client_id: name]]}}
  Horde.Supervisor.start_child(XYZ.DSup, cs)
end

def lookup(name) do
  Horde.Registry.lookup(XYZ.Worker.via_tuple(name))
end
and a XYZ.Worker module, which is basically a GenServer with a few API functions that are not relevant here.
All examples start from a fresh situation, with two nodes started in two terminals with
ERL_AFLAGS="-name [email protected] -setcookie abcd" iex -S mix
and
ERL_AFLAGS="-name [email protected] -setcookie abcd" iex -S mix
Then nd1 connects to nd2 with Node.connect(:"[email protected]").
Joining the hordes from nd1 to nd2 works fine:
iex([email protected])1> Horde.Cluster.join_hordes(XYZ.DSup, {XYZ.DSup, :"[email protected]"})
:ok
iex([email protected])2> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
:ok
However, the system returns :ok
if you specify a node name that does not exist:
iex([email protected])9> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
:ok
Same holds for supervisors. Returning false
would provide the same
semantics as Node.connect()
.
leave_hordes(): registry on removed node is not cleaned properly
Start with a fresh situation, join the hordes and start a process that will run on nd1. Check on both nodes that the process is available:
iex([email protected])11> XYZ.Utils.lookup("a1")
#PID<0.215.0>
and on nd2
iex([email protected])5> XYZ.Utils.lookup("a1")
#PID<18754.215.0>
Good. Now leave the horde on nd2
and check for cluster members:
iex([email protected])7> Horde.Cluster.leave_hordes(XYZ.DReg)
:ok
iex([email protected])8> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
<<137, 122, 137, 240, 81, 186, 67, 134, 134, 190, 218, 93, 121, 158, 66, 210,
155>> => {#PID<18754.173.0>, #PID<18754.176.0>}
}}
and on nd1
iex([email protected])13> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
<<137, 122, 137, 240, 81, 186, 67, 134, 134, 190, 218, 93, 121, 158, 66, 210,
155>> => {#PID<0.173.0>, #PID<0.176.0>}
}}
This is ok. Yet, if we look up the name on nd2, we still get the process pid:
iex([email protected])11> XYZ.Utils.lookup("a1")
#PID<18754.215.0>
Is this wanted behaviour? If so, I would understand the registry on nd2 keeping the processes that are running on nd2, but here the process runs on the other node, and this behaviour seems inappropriate. Otherwise, it may imply the cleanup is not done on the removed node.
NB: I understand CRDTs are eventually consistent. I re-ran the lookup several times (after 10s, 30s and maybe 1 min) and the process is still registered.
This is not really an issue but more a remark: if one registry is removed, say the one running on nd2 as in the previous example, running Horde.Cluster.members(XYZ.DReg) on nd2 sometimes still shows both registries for a while (subjectively, between 10s and 30s in my experiments). I ran into this once, then had to retry several times before getting it again. My system's CPU load was flat and there was no process registered. I believe this is due to some CRDT synchronisation, yet it seems long. What would happen if a process were to be registered after the node had left a horde but before everything is cleaned up?
nd1
and nd2
are freshly started and hordes are joined. Leaving the
supervision horde on nd2
seems to work fine:
iex([email protected])4> Horde.Cluster.leave_hordes(XYZ.DSup)
:ok
After a while, however, we get the following error:
iex([email protected])5>
2018-06-27 06:54:50.103 - pid=<0.164.0> [error] - GenServer XYZ.DSup terminating
** (stop) exited in: GenServer.stop(#PID<0.165.0>, :force_shutdown, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently
associated with the given name, possibly because its application isn't started
(elixir) lib/gen_server.ex:789: GenServer.stop/3
(horde) lib/horde/supervisor.ex:146: Horde.Supervisor.terminate/2
(stdlib) gen_server.erl:648: :gen_server.try_terminate/3
(stdlib) gen_server.erl:833: :gen_server.terminate/10
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :force_shutdown
State: %Horde.Supervisor.State{distribution_strategy: Horde.UniformDistribution,
members: %{<<27, 95, 41, 179, 113, 231, 211, 136, 166, 166, 226, 32, 200, 173, 190, 5, 62>> =>
{:shutting_down, {#PID<0.164.0>, #PID<0.165.0>, #PID<0.166.0>, #PID<0.169.0>}},
<<216, 132, 252, 114, 237, 23, 140, 220, 238, 8, 30, 25, 202, 105, 106, 195, 70>> =>
{:alive, {#PID<15878.164.0>, #PID<15878.165.0>, #PID<15878.166.0>, #PID<15878.169.0>}}},
members_pid: #PID<0.166.0>,
node_id: <<27, 95, 41, 179, 113, 231, 211, 136, 166, 166, 226, 32, 200, 173, 190, 5, 62>>,
processes: %{}, processes_pid: #PID<0.169.0>, processes_updated_at: 5,
processes_updated_counter: 5, shutting_down: true, supervisor_pid: #PID<0.165.0>}
Again, the system's load is flat and there is no process being supervised.
This may mean that leaving the horde takes too long.
nd1 and nd2 are freshly started and the registry hordes are joined. No process is registered:
iex([email protected])2> Node.connect(:"[email protected]")
true
iex([email protected])3> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
:ok
When listing the members of the cluster on nd1
, I get:
iex([email protected])7> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
<<83, 195, 54, 96, 102, 20, 156, 116, 131, 106, 17, 103, 11, 204, 251, 33,
138>> => {#PID<15833.173.0>, #PID<15833.176.0>},
<<189, 133, 217, 206, 174, 187, 181, 173, 175, 99, 162, 160, 87, 102, 179,
153, 55>> => {#PID<0.173.0>, #PID<0.176.0>}
}}
<<83, 195, ...>>
is the part running on nd2
.
OK. Now, let's kill nd2 (with Control-C), restart it, and re-join the registry horde:
iex([email protected])2> Node.connect(:"[email protected]")
true
iex([email protected])3> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
:ok
I get the following messages on nd2 every 5 seconds, forever:
2018-06-29 06:27:35.825 - pid=<0.32.0> [error] - Discarding message
{delta, {<0.176.0>,#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>
#{'__struct__'=>'Elixir.DeltaCrdt.CausalContext',dots=>
#{'__struct__'=>'Elixir.MapSet',map=>#{},version=>2},maxima=>#{}},keys=>#{'__struct__'=>
'Elixir.MapSet',map=>#{},version=>2},state=>#{}}},1}
from <0.176.0> to <0.176.0> in an old incarnation (1) of this node (2)
2018-06-29 06:27:35.826 - pid=<0.32.0> [error] - Discarding message
{delta,{<0.173.0>,#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalContext',dots=>#{'__struct__'=>'Elixir.MapSet',map=>#{{317600142,0}=>
[],{376039348,0}=>[],{629303426,0}=>[]},version=>2},maxima=>#{317600142=>
0,376039348=>0,629303426=>0}},keys=>#{'__struct__'=>'Elixir.MapSet',map=>
#{<<83,195,54,96,102,20,156,116,131,106,17,103,11,204,251,33,138>>=>
[],<<124,86,99,184,16,232,64,101,19,160,234,151,99,96,179,111,197>>=>
[],<<189,133,217,206,174,187,181,173,175,99,162,160,87,102,179,153,55>>=>[]},version=>2},
state=>#{<<83,195,54,96,102,20,156,116,131,106,17,103,11,204,251,33,138>>=>
#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>
#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.173.0>,<0.176.0>},1530242807895109565}=>[]},version=>2},
state=>#{{{<0.173.0>,<0.176.0>},1530242807895109565}=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>
'Elixir.MapSet',map=>#{{317600142,0}=>[]},version=>2}}}},
<<124,86,99,184,16,232,64,101,19,160,234,151,99,96,179,111,197>>=>
#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>
#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.173.0>,<0.176.0>},1530242849488856679}=>[]},version=>2},
state=>#{{{<0.173.0>,<0.176.0>},1530242849488856679}=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalDotSet',causal_context=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalContext',dots=>#{'__struct__'=>'Elixir.MapSet',map=>#{},version=>2},maxima=>#{}},
state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{376039348,0}=>[]},version=>2}}}},
<<189,133,217,206,174,187,181,173,175,99,162,160,87,102,179,153,55>>=>
#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>
#{'__struct__'=>'Elixir.MapSet',map=>#{{{<15833.173.0>,<15833.176.0>},
1530242805313776549}=>[]},version=>2},
state=>#{{{<15833.173.0>,<15833.176.0>},1530242805313776549}=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,
state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{629303426,0}=>[]},version=>2}}}}}}},2}
from <0.173.0> to <0.173.0> in an old incarnation (1) of this node (2)
If I check the list of members on nd1, I get:
iex([email protected])8> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
<<83, 195, 54, 96, 102, 20, 156, 116, 131, 106, 17, 103, 11, 204, 251, 33,
138>> => {#PID<15833.173.0>, #PID<15833.176.0>},
<<124, 86, 99, 184, 16, 232, 64, 101, 19, 160, 234, 151, 99, 96, 179, 111,
197>> => {#PID<15833.173.0>, #PID<15833.176.0>},
<<189, 133, 217, 206, 174, 187, 181, 173, 175, 99, 162, 160, 87, 102, 179,
153, 55>> => {#PID<0.173.0>, #PID<0.176.0>}
}}
<<83, 195, ...>>
is still here and points to the processes on the
killed node. This shows that the old registry was not removed when
nd2
was killed.
NB: I tried the same with supervisors, which seem to work fine. The processes are restarted on the remaining nodes and, while the reference to the old supervisor is still present as well, it is labelled :dead. Re-joining generates no issue. By the way, I'm wondering why you keep this information?
If you need clarifications on this report, let me know.
Cheers
Currently every node is connected to every other node (via delta_crdt). It should be possible to define other topologies to make synchronization more efficient.
Useful for #21
Apologies for raising an issue as this is more about clarification!
When you gracefully shut a Horde node down, the processes it has are moved to a new node according to the documentation, but does the state of the processes also move with them, in the same way as Swarm does?
thanks
Tom
Hi Derek,
We are planning to switch from Swarm to Horde, but after doing some tests I saw that when I restart a node, the dead node is not removed from the Horde registry members list and I receive this error: Discarding message {delta ..., which I believe is because Horde is trying to send deltas to a dead node.
Restarting nodes causes the registry members list to grow, and the number of errors also increases.
I don't know if it is a bug or I'm doing something wrong.
Starting a single node with
iex --name [email protected] --cookie asdf -S mix
This is using current master and Elixir 1.7.4 and Erlang 20.2.3
It results in the following:
18:21:04.398 [info] Starting Horde.RegistryImpl with name HelloWorld.HelloRegistry
18:21:04.401 [error] Process HelloWorld.HelloRegistry.RegistryCrdt (#PID<0.515.0>) terminating
** (ErlangError) Erlang error: :timeout_value
(stdlib) gen_server.erl:389: :gen_server.loop/7
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Initial Call: DeltaCrdt.CausalCrdt.init/1
Ancestors: [HelloWorld.HelloRegistry.Supervisor, HelloWorld.Supervisor, #PID<0.508.0>]
Message Queue Length: 0
Messages: []
Links: [#PID<0.516.0>, #PID<0.517.0>, #PID<0.518.0>, #PID<0.510.0>]
Dictionary: [rand_seed: {%{bits: 58, jump: #Function<8.15449617/1 in :rand.mk_alg/1>, next: #Function<5.15449617/1 in :rand.mk_alg/1>, type: :exrop, uniform: #Function<6.15449617/1 in :rand.mk_alg/1>, uniform_n: #Function<7.15449617/2 in :rand.mk_alg/1>, weak_low_bits: 1}, [37638961097853609 | 49241742922068666]}]
Trapping Exits: true
Status: :running
Heap Size: 610
Stack Size: 27
Reductions: 394
Only by reverting to v0.3.0 and forcing delta_crdt to 0.3.1 can I get the example to start.
Please see: https://asciinema.org/a/AKOZTh1jvI1NTa1KK2XMEIBT1
The test app I'm running can be found here: https://github.com/frekw/horde-test
For Horde.Registry the member list seems to be replicated (as expected), but for the supervisor it's not.
I could very well be doing something wrong, but at least I don't think I am.
According to the Horde docs and function spec, terminate_child/2 should use the child pid as the child_id to terminate, like DynamicSupervisor does.
But this does not work; looking into the tests, it seems we need to use the id of the process (which works). That is (maybe) fine, but it should at least be documented.
I get the following error if I do an :init.stop()
:
[error] GenServer FamilyFive.HordeSupervisor.Crdt terminating
** (ArgumentError) argument error
:erlang.send(FamilyFive.HordeSupervisor, {:crdt_update, [{:add, {:process, FamilyFive.PushNotifications.PushNotificationsScheduler}, {nil, %{id: FamilyFive.PushNotifications.PushNotificationsScheduler, modules: [FamilyFive.PushNotifications.PushNotificationsScheduler], restart: :permanent, shutdown: 5000, start: {FamilyFive.PushNotifications.PushNotificationsScheduler, :start_link, []}, type: :worker}}}]})
(horde) lib/horde/supervisor_supervisor.ex:12: anonymous fn/2 in Horde.SupervisorSupervisor.init/1
(delta_crdt) lib/delta_crdt/causal_crdt.ex:201: DeltaCrdt.CausalCrdt.update_state_with_delta/3
(delta_crdt) lib/delta_crdt/causal_crdt.ex:166: DeltaCrdt.CausalCrdt.handle_cast/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:operation, {:add, [process: FamilyFive.PushNotifications.PushNotificationsScheduler, nil: %{id: FamilyFive.PushNotifications.PushNotificationsScheduler, modules: [FamilyFive.PushNotifications.PushNotificationsScheduler], restart: :permanent, shutdown: 5000, start: {FamilyFive.PushNotifications.PushNotificationsScheduler, :start_link, []}, type: :worker}]}}}
State: %DeltaCrdt.CausalCrdt{crdt_module: DeltaCrdt.AWLWWMap, crdt_state: %DeltaCrdt.AWLWWMap{dots: %{434327776 => 3, 704243284 => 3, 883112317 => 8}, value: %{{:member, {FamilyFive.HordeSupervisor, :"[email protected]"}} => %{{1, 1555133645008603000} => #MapSet<[{883112317, 7}]>}, {:member, {FamilyFive.HordeSupervisor, :"[email protected]"}} => %{{1, 1555133616375684000} => #MapSet<[{883112317, 3}]>}, {:member_node_info, {FamilyFive.HordeSupervisor, :"[email protected]"}} => %{{%Horde.Supervisor.Member{name: {FamilyFive.HordeSupervisor, :"[email protected]"}, status: :alive}, 1555133645053704000} => #MapSet<[{704243284, 1}]>}, {:member_node_info, {FamilyFive.HordeSupervisor, :"[email protected]"}} => %{{%Horde.Supervisor.Member{name: {FamilyFive.HordeSupervisor, :"[email protected]"}, status: :shutting_down}, 1555133652587732000} => #MapSet<[{883112317, 8}]>}, {:process, FamilyFive.PushNotifications.PushNotificationsScheduler} => %{{{{FamilyFive.HordeSupervisor, :"[email protected]"}, %{id: FamilyFive.PushNotifications.PushNotificationsScheduler, restart: :permanent, start: {FamilyFive.PushNotifications.PushNotificationsScheduler, :start_link, []}}}, 1555133621006383000} => #MapSet<[{883112317, 4}]>}}}, merkle_tree: %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: %{{:member, {FamilyFive.HordeSupervisor, :"[email protected]"}} => <<58, 230>>}, hash: "bq"}}, hash: <<133, 107>>}}, hash: "Ҝ"}}, hash: <<61, 214>>}, nil}, hash: <<113, 155>>}}, hash: <<123, 28>>}}, hash: <<155, 93>>}}, hash: <<185, 109>>}}, hash: "xP"}}, hash: <<128, 89>>}, nil}, hash: <<30, 123>>}}, hash: <<236, 29>>}, 
nil}, hash: <<47, 165>>}, nil}, hash: <<19, 32>>}, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: %{{:process, FamilyFive.PushNotifications.PushNotificationsScheduler} => <<103, 206>>}, hash: <<243, 50>>}, nil}, hash: <<22, 60>>}}, hash: <<28, 43>>}}, hash: <<198, 240>>}, nil}, hash: <<238, 154>>}}, hash: <<192, 201>>}}, hash: <<42, 203>>}, nil}, hash: <<26, 208>>}}, hash: "Ż"}}, hash: <<46, 131>>}, nil}, hash: <<29, 99>>}}, hash: <<169, 248>>}}, hash: <<34, 181>>}}, hash: <<173, 218>>}}, hash: <<254, 33>>}, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: %{{:member, {FamilyFive.HordeSupervisor, :"[email protected]"}} => <<58, 230>>}, hash: <<247, 98>>}, nil}, hash: <<14, 194>>}, nil}, hash: "Fe"}, nil}, hash: <<9, 233>>}, nil}, hash: <<174, 79>>}, nil}, hash: "mF"}, nil}, hash: <<150, 233>>}, nil}, hash: <<199, 102>>}}, hash: <<108, 147>>}, nil}, hash: <<186, 84>>}, nil}, hash: <<57, 171>>}}, hash: <<240, 164>>}, nil}, hash: "{w"}, nil}, hash: "6m"}, nil}, hash: <<193, 79>>}}, hash: <<199, 211>>}, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: 
{%MerkleTree{children: {%MerkleTree{children: %{{:member_node_info, {FamilyFive.HordeSupervisor, :"[email protected]"}} => <<227, 223>>}, hash: <<171, 117>>}, nil}, hash: <<159, 14>>}, nil}, hash: <<73, 252>>}}, hash: <<24, 213>>}, nil}, hash: "#K"}}, hash: <<202, 243>>}, nil}, hash: <<136, 44>>}, nil}, hash: <<45, 21>>}}, hash: <<28, 9>>}, nil}, hash: <<206, 111>>}, nil}, hash: "4!"}, nil}, hash: <<224, 43>>}, nil}, hash: <<223, 59>>}}, hash: <<227, 8>>}}, hash: <<244, 112>>}, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: %{{:member_node_info, {FamilyFive.HordeSupervisor, :"[email protected]"}} => <<252, 229>>}, hash: <<6, 197>>}, nil}, hash: "*?"}, nil}, hash: <<193, 91>>}}, hash: <<153, 32>>}, nil}, hash: <<251, 11>>}}, hash: "hb"}}, hash: <<197, 44>>}}, hash: <<118, 222>>}}, hash: <<242, 202>>}}, hash: <<78, 192>>}, nil}, hash: <<12, 139>>}, nil}, hash: <<143, 159>>}, nil}, hash: <<138, 226>>}}, hash: <<139, 112>>}, nil}, hash: "q\v"}}, hash: <<160, 148>>}}, hash: "}V"}, name: FamilyFive.HordeSupervisor.Crdt, neighbours: #MapSet<[{FamilyFive.HordeSupervisor.Crdt, :"[email protected]"}]>, node_id: 883112317, on_diffs: #Function<0.101987752/1 in Horde.SupervisorSupervisor.init/1>, sequence_number: 0, storage_module: nil, sync_interval: 100}
Originally posted by @jfrolich in #89 (comment)
Which would be implemented by UniformQuorumDistribution
and UniformDistribution
and any new distribution strategies.
A transient process will be (correctly or incorrectly) restarted on another node when the node on which it was started is removed from the cluster, regardless of whether it has finished or not.
This function is inconsistent with Horde.Registry.lookup/2
. Making it consistent will add a large cost to the call.
The standard library Registry
doesn't offer this function so I think we can safely deprecate and then remove.
Related to #16
So here's as much information as I have gathered so far.
{:via, Horde.Registry, {Engine.HordeRegistry, {__MODULE__, struct.action_id}}}
child_spec =
{WatchFolderRunner, {workflow_id, watch_folder}}
|> Supervisor.child_spec(id: "#{WatchFolderRunner}_#{watch_folder.action_id}")
Horde.Supervisor.start_child(
Engine.HordeSupervisor,
child_spec
)
I think this change may have been introduced around rc7 or rc8 with the change to one CRDT. I have not validated it though.
Hi Derek,
This is an issue on registries.
My xyz
application actually starts a distributed registry XYZ.DReg
and a distributed supervisor XYZ.DSup
in its supervision tree. At boot, it also starts a random number of dummy Tasks, so there is a good chance that the pids of the same processes running on each node are different (for _ <- 1..:rand.uniform(100), do: Task.async(fn -> nil end)).
The scenario is as follows.
Let's start two nodes (each starting the application) in two terminals:
ERL_AFLAGS="-name [email protected] -setcookie abcd" iex -S mix
and
ERL_AFLAGS="-name [email protected] -setcookie abcd" iex -S mix
Let's check distributed registry members:
iex([email protected])1> Horde.Cluster.members(XYZ.DReg)
{:ok, %{"Ef/Vdd/0tI2LWdFuVpVFWg==" => {#PID<0.198.0>, #PID<0.202.0>}}}
and
iex([email protected])1> Horde.Cluster.members(XYZ.DReg)
{:ok, %{"kAu9t9EZlbteNeJnWMLNZA==" => {#PID<0.197.0>, #PID<0.201.0>}}}
Now, let's join the registry horde from node nd1
and recheck for members:
iex([email protected])3> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
true
iex([email protected])4> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
"Ef/Vdd/0tI2LWdFuVpVFWg==" => {#PID<0.198.0>, #PID<0.202.0>},
"kAu9t9EZlbteNeJnWMLNZA==" => {#PID<16274.197.0>, #PID<16274.201.0>}
}}
and on nd2
:
iex([email protected])2> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
"Ef/Vdd/0tI2LWdFuVpVFWg==" => {#PID<16320.198.0>, #PID<16320.202.0>},
"kAu9t9EZlbteNeJnWMLNZA==" => {#PID<0.197.0>, #PID<0.201.0>}
}}
Things are looking good for a normal situation. So let's kill node nd2
(or shut it down with :init.stop
) and recheck for members on nd1
:
iex([email protected])5> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
"Ef/Vdd/0tI2LWdFuVpVFWg==" => {#PID<0.198.0>, #PID<0.202.0>},
"kAu9t9EZlbteNeJnWMLNZA==" => {#PID<16274.197.0>, #PID<16274.201.0>}
}}
The reference to the registry that was running on the dead node is still present. If you join a third node nd3
to the cluster and to the XYZ.DReg
horde, it will be propagated as well. I believe it will have an impact if one alive node tries to sync with this dead node.
Cheers
Maurycy
I'm running a script that is creating a bunch of GenServers registered with Horde.Registry. The script isn't attempting to do anything in parallel, just working through a list one at a time. Sometimes, after running for a few minutes, I will observe a hang and see the following crash:
** (exit) exited in: GenServer.call(MyApp.EntityRegistry, :get_keys_ets_table, 5000)
** (EXIT) time out
(elixir) lib/gen_server.ex:989: GenServer.call/3
(horde) lib/horde/registry.ex:95: Horde.Registry.lookup/2
(gambit) lib/my_app/persistence/entity.ex:56: MyApp.Survey.Major.ensure_started/1
(gambit) lib/my_app/survey/major.ex:4: MyApp.Survey.Major.update/2
(gambit) lib/my_app/core_sync/import.ex:294: MyApp.CoreSync.Import.save_major_translation/2
(elixir) lib/enum.ex:1327: Enum."-map/2-lists^map/1-0-"/2
(elixir) lib/enum.ex:1327: Enum."-map/2-lists^map/1-0-"/2
(gambit) lib/my_app/core_sync/import.ex:70: anonymous fn/2 in MyApp.CoreSync.Import.process_majors/0
(elixir) lib/enum.ex:1940: Enum."-reduce/3-lists^foldl/2-0-"/3
(gambit) lib/my_app/core_sync/import.ex:66: MyApp.CoreSync.Import.process_majors/0
(gambit) lib/my_app/core_sync/import.ex:15: MyApp.CoreSync.Import.run/1
(mix) lib/mix/task.ex:331: Mix.Task.run_task/3
(mix) lib/mix/cli.ex:79: Mix.CLI.run_task/2
(elixir) lib/code.ex:767: Code.require_file/2
Right now I'm testing locally with a single node, so I don't think this is an issue with the CRDT.
Looks like there's been a lot of work done since 0.3.0...what would be your recommendation here? Are the RC releases "safe" to use?
I'm running on master right now.
I start with a horde of two nodes (A and B). I terminate B (by sending a SIGTERM to the Erlang node that hosts it). Then I start B up again (with the same Erlang node name). The restarted node begins displaying errors that look like this:
[error] Discarding message {delta,{<0.260.0>,<0.260.0>,#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalContext',dots=>#{'__struct__'=>'Elixir.MapSet',map=>#{{347099519,0}=>[],{623221198,0}=>[],{653145801,0}=>[]},version=>2},maxima=>#{347099519=>0,623221198=>0,653145801=>0}},keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{<<"3AbGdJ3PVn8ZdlUdX0G50w==">>=>[],<<"FuXp0DUpIe/hVYUrxRkuiw==">>=>[],<<"MQr4dFCEp2++AZuDqgLPrw==">>=>[]},version=>2},state=>#{<<"3AbGdJ3PVn8ZdlUdX0G50w==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.260.0>,<0.264.0>},1533685998533388000}=>[]},version=>2},state=>#{{{<0.260.0>,<0.264.0>},1533685998533388000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{623221198,0}=>[]},version=>2}}}},<<"FuXp0DUpIe/hVYUrxRkuiw==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<17676.260.0>,<17676.264.0>},1533685948061668000}=>[]},version=>2},state=>#{{{<17676.260.0>,<17676.264.0>},1533685948061668000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{347099519,0}=>[]},version=>2}}}},<<"MQr4dFCEp2++AZuDqgLPrw==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.260.0>,<0.264.0>},1533686028650670000}=>[]},version=>2},state=>#{{{<0.260.0>,<0.264.0>},1533686028650670000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{653145801,0}=>[]},version=>2}}}}}}},2} from <0.260.0> to <0.260.0> in an old incarnation (3) of this node (1)
My guess is that the CRDT running on A never got the memo that B disappeared, and is still trying to send it messages. The restarted B node is (rightfully) not accepting those messages.
I was able to verify that, in this case, Registry.terminate never got called, and so wasn't able to initiate graceful cleanup of its CRDTs. So if I make sure Registry's terminate callback gets called (see dazuma@811a351) that seems to fix it. But that's probably not foolproof either; I'm sure there are ways to kill a node brutally and not give terminate a chance to execute.
Please see: https://asciinema.org/a/lOdfe5xgtt7PO7Qwynu8THKru
If two nodes have been part of the same Horde, PIDs that run on nodes that were part of the previous overlay network configuration are still considered valid. However, if an actual netsplit occurs, the process will be restarted on a node using the new configuration.
I would have expected cluster membership changes to be treated the same as a netsplit, i.e the process to have been started on the other node as well as soon as the Horde memberships were reconfigured.
I'm starting processes under a Horde.Supervisor, and using :via
tuples to register them with a Horde.Registry. I'm encountering a frequent race condition that appears to look like this:

1. Node A calls Horde.Supervisor.start_child to start a process that also registers itself using name: {:via, Horde.Registry, ...}.
2. Horde starts the process on another node, where it registers itself.
3. Node A immediately tries to message the new process, passing the :via tuple to GenServer.call. Unfortunately, the new registry entry probably has not yet propagated back to node A, so the call crashes. Indeed, without polling the registry, I can't find a good way to determine when the call is safe to make.

I've been poking through the horde source and can't see an obvious way to solve this one, without forcing horde to start the process on the local node. (Is there a way to do that?) Currently, I'm using the returned supervisor pid, iterating its children and looking for the recipient pid directly for those "immediate" calls. But it seems a clumsy way to proceed. Wondering if you have any solutions/suggestions. At the very least, it might be worth documenting this since it's likely to be a common problem.
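One hedged workaround sketch for the propagation race described above: poll the local registry until the entry appears, then call the resolved pid directly. `RetryLookup`, the retry counts, and `lookup_fun` are hypothetical names, not Horde API; `lookup_fun` stands in for `fn -> Horde.Registry.lookup(MyApp.Registry, key) end`.

```elixir
defmodule RetryLookup do
  # Poll until the registry entry has propagated to this node, then return
  # the pid; gives up with an error after `retries` attempts.
  def await(lookup_fun, retries \\ 50, sleep_ms \\ 10) do
    case lookup_fun.() do
      [{pid, _value} | _] ->
        {:ok, pid}

      [] when retries > 0 ->
        Process.sleep(sleep_ms)
        await(lookup_fun, retries - 1, sleep_ms)

      [] ->
        {:error, :not_registered}
    end
  end
end
```

Usage would look like `{:ok, pid} = RetryLookup.await(fn -> Horde.Registry.lookup(MyApp.Registry, key) end)` followed by `GenServer.call(pid, msg)`, trading a bounded polling delay for the crash.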
Sometimes the following exception can be raised:
[error] GenServer :"horde_82228185.Crdt" terminating
** (ArgumentError) argument error
:erlang.send(:horde_82228185, {:crdt_update, [{:add, {:process, :proc_12}, {nil, %{id: :proc_12, modules: [Task], restart: :permanent, shutdown: 10, start: {Task, :start_link, [#Function<0.66858309/0 in SupervisorTest.__ex_unit_setup_0/1>]}, type: :worker}}}]})
(horde) lib/horde/supervisor_supervisor.ex:14: anonymous fn/2 in Horde.SupervisorSupervisor.init/1
(delta_crdt) lib/delta_crdt/causal_crdt.ex:239: DeltaCrdt.CausalCrdt.update_state_with_delta/3
(delta_crdt) lib/delta_crdt/causal_crdt.ex:204: DeltaCrdt.CausalCrdt.handle_cast/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:operation, {:add, [process: :proc_12, nil: %{id: :proc_12, modules: [Task], restart: :permanent, shutdown: 10, start: {Task, :start_link, [#Function<0.66858309/0 in SupervisorTest.__ex_unit_setup_0/1>]}, type: :worker}]}}}
This often happens in tests (e.g. the supervisor test of this project) and it seems to be caused by this line. When it happens, the process registered with the name root_name is already dead, and an argument error is raised by send/2.
A simple
on_diffs: fn diffs ->
  try do
    send(root_name, {:crdt_update, diffs})
  rescue
    _ -> nil
  end
end
is enough, but I don't know if this is the right approach here, or whether there is a better, more structured solution.
What do you think?
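One possible alternative, as a sketch rather than Horde's actual approach: resolve the name with Process.whereis/1 first. send/2 to a pid (as opposed to a registered name) never raises, even if the process has just died, so no try/rescue is needed. `SafeNotify` is a hypothetical name, not part of Horde.

```elixir
defmodule SafeNotify do
  # Deliver CRDT diffs only if the named process is still registered,
  # avoiding the ArgumentError raised by send/2 on a dead name.
  def notify(root_name, diffs) do
    case Process.whereis(root_name) do
      nil -> :noop
      pid -> send(pid, {:crdt_update, diffs})
    end
  end
end
```

There is still a race between whereis and send, but a message sent to a stale pid is silently dropped rather than raising, which may be the behaviour we want here anyway.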
Hi Derek.
We're continuing our process of getting to production with a live horde cluster. I am excited as always.
We were testing our cluster (I work remotely, but the guys in the office tried it out) and while the cluster seems to do what is expected, they did notice some errors appearing in the logs.
```
** (stop) exited in: GenServer.call(MyApp.HordeSupervisor, :members, 5000)
    ** (EXIT) time out
        (elixir) lib/gen_server.ex:989: GenServer.call/3
        (hive) lib/my_app/horde/cluster.ex:99: MyApp.HordeCluster.members/0
        (hive) lib/my_app/horde/cluster.ex:71: anonymous fn/0 in MyApp.HordeCluster.start_horde_children/0
        (elixir) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
        (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Function: #Function<1.74019401/0 in MyApp.HordeCluster.start_horde_children/0>
Args: []
```
This error comes from a line that is called with:
{:ok, members} = Horde.Cluster.members(MyApp.HordeSupervisor)
We start our cluster with the following children:
{Horde.Supervisor, name: MyApp.HordeSupervisor,
strategy: :one_for_one,
distribution_strategy: strategy(),
children: []},
{Horde.Registry, name: MyApp.HordeRegistry, keys: :unique},
{MyApp.Horde.NodeMonitor, name: MyApp.Horde.NodeMonitor}
Do you have any idea what might be causing this? It doesn't seem to have a detrimental effect on the functioning of the cluster.
The other issues they discovered were:
GenServer Hive.HordeRegistry terminating
(FunctionClauseError) no function clause matching in Horde.RegistryImpl.handle_cast/2
(horde) lib/horde/registry_impl.ex:91: Horde.RegistryImpl.handle_cast({:request_to_join_hordes, {:registry, "Jw9eSu4x4E4N5B8f87h2YA==", #PID<30324.493.0>, false, {#PID<30324.447.0>, #Reference<30324.165341209.964952065.63890>}
}}, %Horde.RegistryImpl.State
{keys_ets_table: :"keys_Elixir.MyApp.HordeRegistry", keys_pid: #PID<0.1540.0>, members_pid: #PID<0.1532.0>, node_id: "Coxg9YdgjfgDRewOBedkrw==", nodes: [:"[email protected]", :"[email protected]"], pids_ets_table: :"pids_Elixir.MyApp.HordeRegistry", processes_updated_at: 0, processes_updated_counter: 0, registry_ets_table: MyApp.HordeRegistry, registry_pid: #PID<0.1536.0>}
and
GenServer Hive.HordeSupervisor.Registry terminating
(FunctionClauseError) no function clause matching in anonymous fn/1 in Horde.SupervisorImpl.monitor_supervisors/2
I am hoping that either these are known to you and not something to worry about, or that it might help you. If you recognise any of these as legitimate issues with our implementation, please let me know.
Thanks for the great project!
For feature parity with Elixir's Registry, Horde.Registry
needs to support keys: :duplicate
.
Creating this issue as a placeholder for the conversation we were having today. The tldr; of that conversation was:
dkraan [12:31 PM]
So Horde doesn't do any redistributing when a node joins the cluster, but when a node leaves the cluster its processes will be evenly distributed across the remaining nodes.
It would be good if processes in horde could be rebalanced on node up/node join, and not only on node down. The thought so far was that there may be a pluggable way to expose this type of functionality to the library user.
Current module based supervisor implementation does not allow setting supervisor options like max restarts, timers and so on, that's because of:
My suggestion is to make the module-based supervisor the standard one. If that is too complicated, maybe make child_spec overridable in the macro?
Which compares the number of nodes in the cluster with a static number to determine if the cluster has quorum.
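The static-quorum check described above can be sketched roughly as follows. The member shape and the `StaticQuorum` name are hypothetical, not Horde's actual member struct: the idea is simply that quorum holds only when the number of alive members reaches a statically configured minimum.

```elixir
defmodule StaticQuorum do
  # True when at least `minimum` members are alive; a partition holding fewer
  # than `minimum` nodes would refuse to start processes.
  def has_quorum?(members, minimum) do
    Enum.count(members, fn {_name, status} -> status == :alive end) >= minimum
  end
end
```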
When using the new 0.5.0 set_members
functionality to join or leave a cluster, it's possible to update all members in the supervisor; in the registry's case, however, once a member has been added it stays added. See the test case below:
test "can join and unjoin supervisor with set_members" do
{:ok, _} = Horde.Supervisor.start_link(name: :sup6, strategy: :one_for_one)
{:ok, _} = Horde.Supervisor.start_link(name: :sup7, strategy: :one_for_one)
assert :ok = Horde.Cluster.set_members(:sup6, [:sup6, :sup7])
{:ok, members} = Horde.Cluster.members(:sup6)
assert 2 = Enum.count(members)
assert :ok = Horde.Cluster.set_members(:sup6, [:sup6])
{:ok, members} = Horde.Cluster.members(:sup6)
assert 1 = Enum.count(members)
end
test "can join and unjoin registry with set_members" do
{:ok, _} = Horde.Registry.start_link(name: :reg4, keys: :unique)
{:ok, _} = Horde.Registry.start_link(name: :reg5, keys: :unique)
assert :ok = Horde.Cluster.set_members(:reg4, [:reg4, :reg5])
{:ok, members} = Horde.Cluster.members(:reg4)
assert 2 = Enum.count(members)
assert :ok = Horde.Cluster.set_members(:reg4, [:reg4])
{:ok, members} = Horde.Cluster.members(:reg4)
assert 1 = Enum.count(members)
end
The supervisor test case works as expected; however, the registry test case fails, since both members are still there.
The current example app uses start_child
before the cluster has been formed, resulting in "hijacking" the process whenever a new node is added. I'd like to change this to show the more correct flow of first connecting to the cluster and only then starting the child process(es).
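The corrected flow could gate start_child on cluster formation. A minimal sketch, assuming Horde.Cluster.members/1 returns {:ok, members} and that the app knows how many members to expect; `ClusterBoot` and `expected` are my names, not Horde API:

```elixir
defmodule ClusterBoot do
  # True once the cluster reports at least `expected` members, i.e. the horde
  # has been formed and it should be safe to call Horde.Supervisor.start_child/2
  # without the child being "hijacked" when another node joins.
  def formed?({:ok, members}, expected), do: Enum.count(members) >= expected
  def formed?(_other, _expected), do: false
end
```

An app would poll `ClusterBoot.formed?(Horde.Cluster.members(MyApp.HordeSupervisor), 3)` (or react to membership events) before starting its children.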
In (for example) Horde.Supervisor, we start a number of processes that are named. This can create a possible leak in the atom table. Horde should be modified to avoid dynamically creating named processes.
Where is the best place to call Horde.Cluster.join_hordes?
I'm getting this:
[email protected]: [2019-05-05 20:46:08.883] [info] Starting Horde.SupervisorImpl with name PulsarSolrIndexer.HordeSupervisor
[email protected]: [2019-05-05 20:49:52.806] [info] Starting PulsarSolrIndexer v0.6.0
[email protected]: [2019-05-05 20:49:52.828] [info] Starting Horde.SupervisorImpl with name PulsarSolrIndexer.HordeSupervisor
[email protected]: [2019-05-05 20:49:52.836] [info] Starting Horde.RegistryImpl with name PulsarSolrIndexer.GRegistry
[email protected]: [2019-05-05 20:49:52.840] [info] Starting Elixir.Server.Retry
[email protected]: [2019-05-05 20:49:52.863] [info] cleaning task running table
[email protected]: [2019-05-05 20:49:52.864] [info] Elixir.SQSConsumer started
[email protected]: [2019-05-05 20:49:52.869] [info] ConsumerSupervisor: min_demand = 3, max_demand: 10
[email protected]: [2019-05-05 20:49:55.323] [error] GenServer PulsarSolrIndexer.HordeSupervisor.MembersCrdt terminating
** (FunctionClauseError) no function clause matching in DeltaCrdt.CausalCrdt.handle_info/2
(delta_crdt) lib/delta_crdt/causal_crdt.ex:70: DeltaCrdt.CausalCrdt.handle_info({:add_neighbours, [#PID<19377.302.0>]}, %DeltaCrdt.CausalCrdt{crdt_module: DeltaCrdt.AWLWWMap, crdt_state: %DeltaCrdt.AWLWWMap{dots: %{396184569 => 1}, value: %{"BckHx0Z3wtMTLPsCsjmj/A==" => %{{{:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}, 1557082192832923218} => #MapSet<[{396184569, 1}]>}}}, max_sync_size: 200, merkle_map: %MerkleMap{map: %{"BckHx0Z3wtMTLPsCsjmj/A==" => {:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}}, merkle_tree: %MerkleMap.MerkleTree{object: {<<212, 241, 216, 124>>, {<<99, 0, 49, 45>>, {<<223, 107, 254, 235>>, {<<147, 120, 219, 150>>, {<<102, 149, 191, 171>>, {<<253, 8, 127, 129>>, [], {<<17, 181, 215, 160>>, [], {"A`\\E", {<<20, 210, 168, 102>>, [], {<<24, 112, 40, 47>>, {<<21, 196, 32, 43>>, {<<65, 152, 231, 228>>, {<<167, 1, 238, 191>>, [], {<<234, 243, 169, 248>>, {<<68, 214, 139, 175>>, {<<172, 29, 229, 84>>, {<<80, 131, 188, 213>>, {<<224, 210, 216, 84>>, {<<76, 132, ...>>, [], {...}}, []}, []}, []}, []}, []}}, []}, []}, []}}, []}}}, []}, []}, []}, []}, []}}}, name: PulsarSolrIndexer.HordeSupervisor.MembersCrdt, neighbour_monitors: %{}, neighbours: #MapSet<[]>, node_id: 396184569, on_diffs: #Function<4.83987349/1 in DeltaCrdt.CausalCrdt.init/1>, outstanding_syncs: %{}, sequence_number: 0, storage_module: nil, sync_interval: 5})
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:add_neighbours, [#PID<19377.302.0>]}
State: %DeltaCrdt.CausalCrdt{crdt_module: DeltaCrdt.AWLWWMap, crdt_state: %DeltaCrdt.AWLWWMap{dots: %{396184569 => 1}, value: %{"BckHx0Z3wtMTLPsCsjmj/A==" => %{{{:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}, 1557082192832923218} => #MapSet<[{396184569, 1}]>}}}, max_sync_size: 200, merkle_map: %MerkleMap{map: %{"BckHx0Z3wtMTLPsCsjmj/A==" => {:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}}, merkle_tree: %MerkleMap.MerkleTree{object: {<<212, 241, 216, 124>>, {<<99, 0, 49, 45>>, {<<223, 107, 254, 235>>, {<<147, 120, 219, 150>>, {<<102, 149, 191, 171>>, {<<253, 8, 127, 129>>, [], {<<17, 181, 215, 160>>, [], {"A`\\E", {<<20, 210, 168, 102>>, [], {<<24, 112, 40, 47>>, {<<21, 196, 32, 43>>, {<<65, 152, 231, 228>>, {<<167, 1, 238, 191>>, [], {<<234, 243, 169, 248>>, {<<68, 214, 139, 175>>, {<<172, 29, 229, 84>>, {<<80, 131, 188, 213>>, {<<224, 210, 216, 84>>, {<<76, 132, ...>>, [], {...}}, []}, []}, []}, []}, []}}, []}, []}, []}}, []}}}, []}, []}, []}, []}, []}}}, name: PulsarSolrIndexer.HordeSupervisor.MembersCrdt, neighbour_monitors: %{}, neighbours: #MapSet<[]>, node_id: 396184569, on_diffs: #Function<4.83987349/1 in DeltaCrdt.CausalCrdt.init/1>, outstanding_syncs: %{}, sequence_number: 0, storage_module: nil, sync_interval: 5}
[email protected]: [2019-05-05 20:49:55.326] [error] GenServer PulsarSolrIndexer.HordeSupervisor terminating
** (stop) exited in: GenServer.call(PulsarSolrIndexer.HordeSupervisor.MembersCrdt, {:operation, {:add, ["BckHx0Z3wtMTLPsCsjmj/A==", {:shutting_down, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}]}}, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(elixir) lib/gen_server.ex:979: GenServer.call/3
(horde) lib/horde/supervisor_impl.ex:69: Horde.SupervisorImpl.handle_call/3
(stdlib) gen_server.erl:661: :gen_server.try_handle_call/4
(stdlib) gen_server.erl:690: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.311.0>): :horde_shutting_down
State: %Horde.SupervisorImpl.State{distribution_strategy: Horde.UniformDistribution, members: %{"BckHx0Z3wtMTLPsCsjmj/A==" => {:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}}, members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, node_id: "BckHx0Z3wtMTLPsCsjmj/A==", pid: #PID<0.307.0>, processes: %{}, processes_pid: #PID<0.306.0>, processes_updated_at: 0, processes_updated_counter: 0, shutting_down: false}
Client #PID<0.311.0> is alive
(stdlib) gen.erl:169: :gen.do_call/4
(elixir) lib/gen_server.ex:986: GenServer.call/3
(horde) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
(elixir) lib/enum.ex:769: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:769: Enum.each/2
(stdlib) gen_server.erl:673: :gen_server.try_terminate/3
(stdlib) gen_server.erl:858: :gen_server.terminate/10
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
[email protected]: [2019-05-05 20:49:55.326] [error] GenServer #PID<0.311.0> terminating
** (stop) exited in: GenServer.call(PulsarSolrIndexer.HordeSupervisor, :horde_shutting_down, 5000)
** (EXIT) exited in: GenServer.call(PulsarSolrIndexer.HordeSupervisor.MembersCrdt, {:operation, {:add, ["BckHx0Z3wtMTLPsCsjmj/A==", {:shutting_down, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}]}}, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(elixir) lib/gen_server.ex:989: GenServer.call/3
(horde) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
(elixir) lib/enum.ex:769: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:769: Enum.each/2
(stdlib) gen_server.erl:673: :gen_server.try_terminate/3
(stdlib) gen_server.erl:858: :gen_server.terminate/10
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.304.0>, :shutdown}
Great work on Horde! I just got finished porting a project over from Swarm, and it seems to be working much more stably under Horde—mostly, I think, because of the graceful shutdown decoupled from node shutdown. I've also found Horde's source much easier to follow than Swarm's, so kudos.
One of the requirements of my app is handoff of state when a process gets restarted on another node using Horde.Supervisor. (Consider, for example, an app running on Kubernetes where containers can be killed due to scaling down or as part of a rolling update, and you want long-running processes to migrate to other live containers along with their state.) This is something that Swarm does, and I imagine it may be a fairly common requirement.
I got this working for my app by creating a third horde type called "Handoff" that basically just wraps another CRDT with a mapping from process name to saved state. (My implementation borrows heavily from the current Horde.Registry module.) Processes that are about to be terminated can save their state to the Handoff CRDT, and when they start up they can check for saved state on the CRDT.
I was wondering if you have particular plans for a feature like this in Horde, and if you have any ideas or feedback on implementation. I could even try cleaning up my implementation as a pull request if you're interested.
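The handoff pattern described above can be sketched with a plain Agent standing in for the hypothetical "Handoff" horde; nothing below is Horde API. The worker restores parked state in init/1 and parks it again in terminate/2.

```elixir
defmodule HandoffStore do
  # Stand-in for the Handoff CRDT: maps process name -> saved state.
  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def save(name, state), do: Agent.update(__MODULE__, &Map.put(&1, name, state))
  def take(name), do: Agent.get_and_update(__MODULE__, &Map.pop(&1, name))
end

defmodule HandoffWorker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  def init(name) do
    # Trap exits so terminate/2 runs on a graceful :shutdown.
    Process.flag(:trap_exit, true)
    # Restore any state a previous incarnation parked, or start fresh.
    state = HandoffStore.take(name) || %{name: name, count: 0}
    {:ok, state}
  end

  def handle_cast(:increment, state), do: {:noreply, %{state | count: state.count + 1}}

  def terminate(_reason, state) do
    # Park the state for the next incarnation (possibly on another node).
    HandoffStore.save(state.name, state)
  end
end
```

In a real cluster the Agent would be replaced by a replicated store (e.g. a DeltaCrdt), since the next incarnation will usually start on a different node.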
Example:
iex(1)> child_spec = %{id: {A, "a"}, start: {Agent, :start_link, [fn -> 0 end]}}
iex(2)> DynamicSupervisor.start_child(MyApp.DynamicSupervisor, child_spec)
{:ok, #PID<0.836.0>}
iex(3)> Horde.Supervisor.start_child(MyApp.DistributedSupervisor, child_spec)
[error] GenServer MyApp.DistributedSupervisor terminating
** (Protocol.UndefinedError) protocol String.Chars not implemented for {A, "a"}. This protocol is implemented for...
This is problematic when the worker definition is provided by an external lib.
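Until the underlying bug is fixed, one hypothetical workaround is to normalize the id to a string before handing the spec to Horde; `safe_id` is my name, not a Horde helper. This loses the tuple shape of the id but keeps it unique.

```elixir
# Convert a composite child id to a string so that a to_string call on the id
# cannot raise Protocol.UndefinedError for tuples.
safe_id = fn
  id when is_binary(id) or is_atom(id) -> id
  id -> inspect(id)
end

spec = %{id: {A, "a"}, start: {Agent, :start_link, [fn -> 0 end]}}
spec = %{spec | id: safe_id.(spec.id)}
```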
Hi Derek,
Still struggling with this. I can run the hello_world example fine, but when I try to incorporate Horde.Registry or Horde.Supervisor into my own application supervision tree, I get the following error:
** (Mix) Could not start application core: Core.Application.start(:normal, []) returned an error: shutdown: failed to start child: Horde.LocalRegistry
** (EXIT) an exception was raised:
** (FunctionClauseError) no function clause matching in Keyword.get/3
(elixir) lib/keyword.ex:179: Keyword.get({#PID<0.186.0>, :members_updated}, :notify, nil)
(delta_crdt) lib/delta_crdt/causal_crdt.ex:64: DeltaCrdt.CausalCrdt.start_link/3
(horde) lib/horde/registry.ex:141: Horde.Registry.init/1
(stdlib) gen_server.erl:365: :gen_server.init_it/2
(stdlib) gen_server.erl:333: :gen_server.init_it/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
The code being run to do this is:
def start(_type, _args) do
import Supervisor.Spec
# List all child processes to be supervised
children = [
%{
id: Phoenix.PubSub.PG2.Dalek.Core,
start: {Phoenix.PubSub.PG2, :start_link, [:core, []]}
},
{Horde.Registry, [name: Horde.LocalRegistry]},
{Horde.Supervisor, [name: Horde.LocalSupervisor, strategy: :one_for_one]},
worker(Core.Scheduler, []),
# etc. with other children
If I remove the Horde child processes, all is fine, but adding them causes the error above. I am obviously using it wrong, but three hours spent trying to work out why is making me pull my hair out!
Any help gratefully received, as Horde (rather than Swarm) is exactly the functionality I am looking for.
Tom
So I connect two nodes with set_members
which results in the following on both nodes:
```
[
  {App.DistributedSupervisor, :"[email protected]"},
  {App.DistributedSupervisor, :"[email protected]"}
]
[
  {App.ProcessRegistry, :"[email protected]"},
  {App.ProcessRegistry, :"[email protected]"}
]
```
I start a process and list all processes in the registry on both nodes which gives me:
```iex([email protected])3> Horde.Registry.processes(App.ProcessRegistry)
%{
"process-id" => {#PID<0.674.0>,
nil}
}
iex([email protected])3> Horde.Registry.processes(App.ProcessRegistry)
%{
"process-id" => {#PID<23104.674.0>,
nil}
}```
When I then use `set_members` again for `DistributedSupervisor` it works on the first node, however the second node raises this exception:
```iex([email protected])4> Horde.Cluster.set_members(App.DistributedSupervisor, [App.DistributedSupervisor])
13:05:56.638 [debug] Found 1 processes on dead nodes [ pid=<0.420.0> line=500 function=processes_on_dead_nodes/1 module=Horde.SupervisorImpl file=lib/horde/supervisor_impl.ex application=horde ]
** (exit) exited in: GenServer.call(App.DistributedSupervisor, {:set_members, [App.DistributedSupervisor]}, 5000)
** (EXIT) an exception was raised:
** (ArithmeticError) bad argument in arithmetic expression
:erlang.rem(1766195506, 0)
(horde) lib/horde/uniform_distribution.ex:17: Horde.UniformDistribution.choose_node/2
(horde) lib/horde/supervisor_impl.ex:475: anonymous fn/3 in Horde.SupervisorImpl.handle_topology_changes/1
(elixir) lib/enum.ex:2934: Enum.filter_list/2
(horde) lib/horde/supervisor_impl.ex:474: Horde.SupervisorImpl.handle_topology_changes/1
(horde) lib/horde/supervisor_impl.ex:395: Horde.SupervisorImpl.set_members/2
(horde) lib/horde/supervisor_impl.ex:102: Horde.SupervisorImpl.handle_call/3
(stdlib) gen_server.erl:661: :gen_server.try_handle_call/4
(elixir) lib/gen_server.ex:989: GenServer.call/3```