derekkraan / horde
Horde is a distributed Supervisor and Registry backed by DeltaCrdt
License: MIT License
When starting a new child with Horde.Supervisor.start_child/2 and the child is already started, the error {:error, {:already_started, pid}} is returned, but pid is always nil.
Having the pid in the returned error is commonly used to perform operations on the process, for example: try to start the process -> already started -> GenServer.call(pid, whatever).
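The pattern in question looks roughly like this (a sketch; MyApp.DistributedSupervisor, child_spec, and :do_work are made-up names for illustration):

```elixir
case Horde.Supervisor.start_child(MyApp.DistributedSupervisor, child_spec) do
  {:ok, pid} ->
    GenServer.call(pid, :do_work)

  {:error, {:already_started, pid}} ->
    # With the bug described above, `pid` is nil here, so this call fails
    # instead of reaching the already-running process.
    GenServer.call(pid, :do_work)
end
```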
Please fill out this short survey to help me understand how Horde is being used in the wild.
Is there a mechanism implemented in this library by which a process can hand off its state when it is being restarted on another node? A bit like what Swarm has implemented?
I think this should "just work", in which case I can remove the checks that prevent this from happening, but I will need to figure out a good way to test this case first.
Why it should work: each process is linked to the various nodes of Horde.Registry. In the event of a network partition, these processes will each receive an exit signal. These processes should interpret that as the Registry having died, and re-register themselves. Then, when the different parts of the Registry are rejoined, all keys should have been added again and the Registry should be complete.
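The re-registration idea described above could be sketched like this (illustrative only; MyApp.Worker and MyApp.DistributedRegistry are made-up names, and treating every :EXIT as a Registry death is a simplification):

```elixir
defmodule MyApp.Worker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  @impl true
  def init(name) do
    # Trap exits so the link to the Registry delivers a message
    # instead of killing this process.
    Process.flag(:trap_exit, true)
    Horde.Registry.register(MyApp.DistributedRegistry, name, nil)
    {:ok, name}
  end

  @impl true
  def handle_info({:EXIT, _from, _reason}, name) do
    # Interpret the exit signal as the Registry having died, and
    # re-register so the rejoined Registry ends up complete again.
    Horde.Registry.register(MyApp.DistributedRegistry, name, nil)
    {:noreply, name}
  end
end
```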
Currently Horde.Supervisor only supports permanent processes. For feature parity with DynamicSupervisor, supporting transient and temporary processes is needed.
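For context, the restart type is declared in the child spec. A minimal sketch (MyTask is an illustrative name):

```elixir
# :permanent - always restarted (the only type Horde currently supports)
# :transient - restarted only if it terminates abnormally
# :temporary - never restarted
child_spec = %{
  id: MyTask,
  start: {MyTask, :start_link, []},
  restart: :transient
}
```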
Hi,
This is not so much an issue as a documentation improvement/completion request.
I wanted to give Horde a try, after having spent some time with Swarm. I'm not very experienced with distributed development in Elixir, but I'm not an Elixir newbie. I got stuck quite quickly when trying to define both a Registry and a Supervisor: how do I register a worker that is going to be supervised at the same time? Do I need to use a :via tuple, or is there some other way? What about join(): should it be executed on each node, or is it to be used in case we have several supervisors/registries? How do I start multiple nodes and create a globally distributed supervisor/registry? Etc.
As I believe I'm not the only person in this situation, I think some code snippets with basic patterns would be really helpful. For example:
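One commonly suggested pattern for the supervised-and-registered case is a :via tuple pointing at Horde.Registry (a sketch under assumptions, not official documentation; all module and key names are illustrative):

```elixir
defmodule MyApp.Worker do
  use GenServer

  def start_link(name) do
    # Registering via the name: option means registration happens
    # wherever Horde.Supervisor decides to start the process.
    GenServer.start_link(__MODULE__, name, name: via_tuple(name))
  end

  def via_tuple(name), do: {:via, Horde.Registry, {MyApp.DistributedRegistry, name}}

  @impl true
  def init(name), do: {:ok, name}
end

# Started on any node; Horde decides where it actually runs:
# Horde.Supervisor.start_child(MyApp.DistributedSupervisor, {MyApp.Worker, "worker-1"})
```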
Thanks in advance
The UniformQuorumDistribution strategy, and any similar ones where has_quorum?/1 can return false (and choose_node therefore nil), breaks the :start_child call in SupervisorImpl.
defmodule Horde.UniformQuorumDistribution do
  def choose_node(identifier, members) do
    if has_quorum?(members) do
      Horde.UniformDistribution.choose_node(identifier, members)
    else
      nil
    end
  end
end
and
defmodule Horde.SupervisorImpl do
  def handle_call({:start_child, child_spec} = msg, from, %{node_id: this_node_id} = state) do
    case state.distribution_strategy.choose_node(child_spec.id, state.members) do
      {^this_node_id, _} ->
        {reply, new_state} = add_child(child_spec, state)
        {:reply, reply, new_state}

      {other_node_id, _} ->
        proxy_to_node(other_node_id, msg, from, state)
    end
  end
end
If a cluster is still starting up, there will always be a point at which state.members is smaller than whatever constitutes a quorum.
Is there perhaps a way to delay the starting of child processes until has_quorum?/1 returns true for the cluster?
What would be the best way to deal with the startup and quorum?
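One way the implementation could guard against this (a sketch, not the actual Horde code; :no_quorum is a made-up error term) is an explicit nil clause in the case:

```elixir
def handle_call({:start_child, child_spec} = msg, from, %{node_id: this_node_id} = state) do
  case state.distribution_strategy.choose_node(child_spec.id, state.members) do
    nil ->
      # No quorum yet: reply with an error instead of crashing on the
      # unmatched nil. Callers can retry once the cluster is up.
      {:reply, {:error, :no_quorum}, state}

    {^this_node_id, _} ->
      {reply, new_state} = add_child(child_spec, state)
      {:reply, reply, new_state}

    {other_node_id, _} ->
      proxy_to_node(other_node_id, msg, from, state)
  end
end
```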
Hi Derek.
I have a cluster setup with libcluster, on my local machine, as such:
fixed_example: [
strategy: Cluster.Strategy.Epmd,
config: [
hosts: [:"[email protected]", :"[email protected]", :"[email protected]"],
secret: "myapp_cluster"
]
]
Where 192.168.43.226 is my Wi-Fi IP address.
I have a horde cluster that uses the results of Node.list() to connect to the other nodes.
This works perfectly fine. I have a process that should only run on one instance of the cluster and I can confirm this works wonderfully.
If I call {:ok, members} = Horde.Cluster.members(MyApp.HordeSupervisor)
I get an :ok tuple and three members that each have the :alive
distinction.
However, when I turn my wifi off, and the network splits into 3 partitions, then the response from running {:ok, members} = Horde.Cluster.members(MyApp.HordeSupervisor)
on each node returns an :ok tuple with only one member and it is still alive.
The problem then is that Horde.UniformQuorumDistribution uses the members list to determine quorum, and since the members list has updated to include only the one node each partition can see, each partition thinks it has quorum, instead of being killed.
Do you have any idea if this is the expected behaviour? The problem has to be related to the members list updating, but I am not sure where to find this or what could be causing it.
Hi,
I have tried running your example on ubuntu 18.04.
The two nodes start fine, but both display the hello message with their own node details rather than just one of them (I cut and pasted the start code from the example README into each terminal).
Killing one of the nodes does not cause the other node to pick up the lost process.
Running Node.list shows that both are connected to each other.
Am I doing something wrong?
regards
Tom
Currently the option for number of partitions is ignored. As a stop-gap measure we should check it and make sure it is not greater than 1.
I am loving Horde and learning a lot along the way.
I successfully get the HelloWorld
application to run with three different IEX instances and can see that only one instance of SayHello
is running. However, when I ask each one the :how_many
the count is different on each node, being the number of times that node has executed the :say_hello
function.
I thought it would be simple to extend the example application to work with CRDTs and share the count, so that how_many? would return the total number of times the function has been called in the cluster. It doesn't seem to be that simple, and the documentation for DeltaCrdt doesn't make it clear how adding neighbours from other nodes works.
It would be great if the example application could be extended to show how the various nodes will use DeltaCrdt to share data between them.
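A very rough sketch of the idea, assuming the delta_crdt ~> 0.5 API (check the version you have; MyApp.Crdt and the counter key are made up, and neighbours must be set on each node pointing at the others for bidirectional sync):

```elixir
# Start an add-wins last-write-wins map CRDT on this node.
{:ok, _crdt} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, name: MyApp.Crdt, sync_interval: 100)

# Tell this CRDT about its peers on the other nodes (repeat on each node).
DeltaCrdt.set_neighbours(MyApp.Crdt, [{MyApp.Crdt, :"[email protected]"}])

# Read the shared counter and bump it; the new value converges on all
# nodes eventually (note this read-modify-write is not atomic).
count = Map.get(DeltaCrdt.read(MyApp.Crdt), :say_hello_count, 0)
DeltaCrdt.mutate(MyApp.Crdt, :add, [:say_hello_count, count + 1])
```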
I think this is a bug.
TLDR: The Distributed registry doesn't retrieve state when joining a cluster after running set_members
.
This is in my application.ex
[
...,
{Cluster.Supervisor, [Application.get_env(:libcluster, :topologies)]},
{Horde.Registry, name: FamilyFive.HordeRegistry, keys: :unique},
{Horde.Supervisor,
name: FamilyFive.HordeSupervisor, strategy: :one_for_one, children: []},
...
]
And this is a Tracker module that runs after the participating nodes are retrieved or changed:
Horde.Cluster.set_members(
  FamilyFive.HordeRegistry,
  Enum.map(nodes, fn n -> {FamilyFive.HordeRegistry, n} end)
)
Horde.Cluster.set_members(
FamilyFive.HordeSupervisor,
Enum.map(nodes, fn n -> {FamilyFive.HordeSupervisor, n} end)
)
When I start a new process:
Horde.Supervisor.start_child(
FamilyFive.HordeSupervisor,
FamilyFive.PushNotifications.PushNotificationsScheduler
)
The FamilyFive.PushNotifications.PushNotificationsScheduler
includes the following start_link to be registered:
GenServer.start_link(__MODULE__, [],
name: {:via, Horde.Registry, {FamilyFive.HordeRegistry, __MODULE__}}
)
Then when I start up a new node and run this on the new node:
Horde.Registry.lookup(
FamilyFive.HordeRegistry,
FamilyFive.PushNotifications.PushNotificationsScheduler
)
It returns :undefined, even though the lookup returns the pid correctly on the other node. This inconsistency is not resolved automatically.
If I start the two nodes first and then start the process on one of the nodes, the lookup works. But if I first start one node, start the process, and only then spin up the new node and run the lookup, it doesn't work.
I think it might have to do with the fact that I don't specify the members on startup, but I don't know the members at startup (they are only known when libcluster queries the DNS).
See: https://asciinema.org/a/r19vUhFueoPRbf7IaPB92ntpu
I'm starting a two node cluster and then force a net split. During the net split, I am unable to start a second process with the same name (the supervisor won't let me). This means that I cannot have availability on both sides of the split.
However, if I tweak the memberships during the split (which I probably shouldn't be doing), I am able to start the same process on both sides. However, if I then heal the split and update the member list to contain both nodes this conflict is never resolved and I end up with two processes running.
Kind of the same behaviour can be observed when a node shuts down: https://asciinema.org/a/RDS52ntJYw6gCDXLN2rmfYkKc but in this case I lose the process entirely.
In both the cases of a net split, as well as a shutdown (since the scenario is pretty much the same for the other nodes in the cluster) I would have expected the process to be migrated to the other node. Or at least that I would have been able to start it.
I'm not fully sure what the expected behaviour is supposed to be, but I think this may be worth looking into.
First I’d like to say that this is great work.
I spent several days with Swarm before realizing it didn’t fit my needs because it doesn’t supervise processes.
Since Horde allows proper distributed supervision of subtrees, it seems to meet my needs perfectly, so thank you!
On to the real question...
Maybe I'm missing an option somewhere, but I can't seem to find a way to tell Horde to leave processes running wherever they currently are when a node is added to the cluster.
I’m currently seeing the following behavior:
:ok
It would be nice if I could leave those processes running on node A.
I’m using Horde so that I can have a hot standby node ready to take over should node A leave the cluster unexpectedly.
That works, but the current behavior means there is a tiny bit of instability when my failover node comes online, or anytime one of my containers gets restarted really.
Currently we use .start_link
in a couple of places in Horde.Registry
and I'd like to move these to a proper supervision tree.
See the PR: elixir-lang/elixir#8963
Describe the bug
Hi,
The documentation has a lot of references to DynamicSupervisor, but Horde.Supervisor seems to act like a Supervisor, which is a little confusing.
The return value also seems to be inconsistent, returning a nil pid on :already_started.
To Reproduce
defmodule MyApp.Supervisor do
# Automatically defines child_spec/1
use Supervisor
def start_link(init_arg) do
Supervisor.start_link(__MODULE__, init_arg)
end
@impl true
def init(_init_arg) do
children = [
{MyServer, []}
]
Supervisor.init(children, strategy: :one_for_one)
end
end
defmodule MyServer do
use GenServer
def start_link(state) do
GenServer.start_link(__MODULE__, state)
end
## Callbacks
@impl true
def init(stack) do
{:ok, stack}
end
end
{:ok, _} = DynamicSupervisor.start_link(name: MyApp.CusterDynamicSupervisor, strategy: :one_for_one)
DynamicSupervisor.start_child(MyApp.CusterDynamicSupervisor, {MyApp.Supervisor, []})
# {:ok, #PID<0.419.0>}
DynamicSupervisor.start_child(MyApp.CusterDynamicSupervisor, {MyApp.Supervisor, []})
# {:ok, #PID<0.422.0>}
DynamicSupervisor.which_children(MyApp.CusterDynamicSupervisor)
# [
# {:undefined, #PID<0.419.0>, :supervisor, [MyApp.Supervisor]},
# {:undefined, #PID<0.422.0>, :supervisor, [MyApp.Supervisor]}
# ]
{:ok, _} = Supervisor.start_link([], name: MyApp.CusterSupervisor, strategy: :one_for_one)
# Supervisor.start_child(MyApp.Horde, {MyApp.CusterSupervisor, []})
# {:ok, #PID<0.419.0>}
# Supervisor.start_child(MyApp.Horde, {MyApp.CusterSupervisor, []})
# {:error, {:already_started, #PID<0.419.0>}}
# Supervisor.which_children(MyApp.Horde)
# [{MyApp.Supervisor, #PID<0.419.0>, :supervisor, [MyApp.Supervisor]}]
{:ok, _} = Horde.Supervisor.start_link(name: MyApp.CusterDistributedSupervisor, strategy: :one_for_one)
Horde.Supervisor.start_child(MyApp.CusterDistributedSupervisor, {MyApp.Supervisor, []})
# {:ok, #PID<0.435.0>}
Horde.Supervisor.start_child(MyApp.CusterDistributedSupervisor, {MyApp.Supervisor, []})
# {:error, {:already_started, nil}}
Horde.Supervisor.which_children(MyApp.CusterDistributedSupervisor)
# [{MyApp.Supervisor, #PID<0.435.0>, :supervisor, [MyApp.Supervisor]}]
Environment
Erlang/OTP: 21
Elixir: ~> 1.8.1
Horde: ~> 0.5.0-rc.7
macOS: 10.14.3 (18D109)
Additional context
I am developing a MMORPG server for educational purposes.
What I want to do is create a distributed dynamic supervisor that will supervise several Worlds, each of which will supervise several Channels. A channel can be considered a parallel dimension of the world in which players cannot see each other but can still send group messages and private messages, share a guild, and so on.
Hi,
Unless I misunderstood something, I believe there are some issues in joining/leaving hordes. Below are the traces and comments on the experiments I have done so far. As it is all about the same topic, I am opening just one issue.
For what follows I have a XYZ.Utils
module with two helper functions that will be used later:
def start_worker(name) do
  cs = %{id: name, start: {XYZ.Worker, :start_link, [[client_id: name]]}}
  Horde.Supervisor.start_child(XYZ.DSup, cs)
end

def lookup(name) do
  Horde.Registry.lookup(XYZ.Worker.via_tuple(name))
end
and a XYZ.Worker module, which is basically a GenServer with a few API functions that are not relevant here.
All examples start from a fresh situation, with two nodes started in two terminals with
ERL_AFLAGS="-name [email protected] -setcookie abcd" iex -S mix
and
ERL_AFLAGS="-name [email protected] -setcookie abcd" iex -S mix
Then nd1 connects to nd2 with Node.connect(:"[email protected]").
Joining the hordes from nd1 to nd2 works fine:
iex([email protected])1> Horde.Cluster.join_hordes(XYZ.DSup, {XYZ.DSup, :"[email protected]"})
:ok
iex([email protected])2> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
:ok
However, the system returns :ok
if you specify a node name that does not exist:
iex([email protected])9> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
:ok
Same holds for supervisors. Returning false
would provide the same
semantics as Node.connect()
.
leave_hordes(): registry on removed node is not cleaned properly
Start with a fresh situation, join the hordes and start a process that will run on nd1. Check on both nodes that the process is available:
iex([email protected])11> XYZ.Utils.lookup("a1")
#PID<0.215.0>
and on nd2
iex([email protected])5> XYZ.Utils.lookup("a1")
#PID<18754.215.0>
Good. Now leave the horde on nd2
and check for cluster members:
iex([email protected])7> Horde.Cluster.leave_hordes(XYZ.DReg)
:ok
iex([email protected])8> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
<<137, 122, 137, 240, 81, 186, 67, 134, 134, 190, 218, 93, 121, 158, 66, 210,
155>> => {#PID<18754.173.0>, #PID<18754.176.0>}
}}
and on nd1
iex([email protected])13> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
<<137, 122, 137, 240, 81, 186, 67, 134, 134, 190, 218, 93, 121, 158, 66, 210,
155>> => {#PID<0.173.0>, #PID<0.176.0>}
}}
This is ok. Yet, if we look up the name on nd2, we still get the process pid:
iex([email protected])11> XYZ.Utils.lookup("a1")
#PID<18754.215.0>
Is this wanted behaviour? If so, I would understand the registry on nd2 keeping the processes that are running on nd2, but here the process runs on the other node, and this behaviour seems inappropriate. Otherwise, it may imply the cleanup is not done on the removed node.
NB: I understand CRDTs are eventually consistent. I re-ran the lookup several times (after 10s, 30s and maybe 1 min) and the process is still registered.
This is not really an issue but more a remark: if one registry is removed, say the one running on nd2 as in the previous example, running Horde.Cluster.members(XYZ.DReg) on nd2 sometimes still shows both registries for a while (subjectively, between 10s and 30s in my experiments). I ran into this once, then had to retry several times before getting it again. My system's CPU load was flat and there was no process registered. I believe this is due to some CRDT synchronisation, yet it seems long. What would happen if a process were to be registered after the node had left a horde but before everything is cleaned up?
nd1
and nd2
are freshly started and hordes are joined. Leaving the
supervision horde on nd2
seems to work fine:
iex([email protected])4> Horde.Cluster.leave_hordes(XYZ.DSup)
:ok
After a while, however, we get the following error:
iex([email protected])5>
2018-06-27 06:54:50.103 - pid=<0.164.0> [error] - GenServer XYZ.DSup terminating
** (stop) exited in: GenServer.stop(#PID<0.165.0>, :force_shutdown, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently
associated with the given name, possibly because its application isn't started
(elixir) lib/gen_server.ex:789: GenServer.stop/3
(horde) lib/horde/supervisor.ex:146: Horde.Supervisor.terminate/2
(stdlib) gen_server.erl:648: :gen_server.try_terminate/3
(stdlib) gen_server.erl:833: :gen_server.terminate/10
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :force_shutdown
State: %Horde.Supervisor.State{distribution_strategy: Horde.UniformDistribution,
members: %{<<27, 95, 41, 179, 113, 231, 211, 136, 166, 166, 226, 32, 200, 173, 190, 5, 62>> =>
{:shutting_down, {#PID<0.164.0>, #PID<0.165.0>, #PID<0.166.0>, #PID<0.169.0>}},
<<216, 132, 252, 114, 237, 23, 140, 220, 238, 8, 30, 25, 202, 105, 106, 195, 70>> =>
{:alive, {#PID<15878.164.0>, #PID<15878.165.0>, #PID<15878.166.0>, #PID<15878.169.0>}}},
members_pid: #PID<0.166.0>,
node_id: <<27, 95, 41, 179, 113, 231, 211, 136, 166, 166, 226, 32, 200, 173, 190, 5, 62>>,
processes: %{}, processes_pid: #PID<0.169.0>, processes_updated_at: 5,
processes_updated_counter: 5, shutting_down: true, supervisor_pid: #PID<0.165.0>}
Again, the system's load is flat and there is no process being supervised.
This may mean that leaving the horde takes too long.
nd1 and nd2 are freshly started and the registry hordes are joined. No process is registered:
iex([email protected])2> Node.connect(:"[email protected]")
true
iex([email protected])3> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
:ok
When listing the members of the cluster on nd1
, I get:
iex([email protected])7> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
<<83, 195, 54, 96, 102, 20, 156, 116, 131, 106, 17, 103, 11, 204, 251, 33,
138>> => {#PID<15833.173.0>, #PID<15833.176.0>},
<<189, 133, 217, 206, 174, 187, 181, 173, 175, 99, 162, 160, 87, 102, 179,
153, 55>> => {#PID<0.173.0>, #PID<0.176.0>}
}}
<<83, 195, ...>>
is the part running on nd2
.
OK. Now, let's kill nd2 (with Control-C), restart it, and re-join the registry horde:
iex([email protected])2> Node.connect(:"[email protected]")
true
iex([email protected])3> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
:ok
I get the following messages on nd2 every 5 seconds, forever:
2018-06-29 06:27:35.825 - pid=<0.32.0> [error] - Discarding message
{delta, {<0.176.0>,#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>
#{'__struct__'=>'Elixir.DeltaCrdt.CausalContext',dots=>
#{'__struct__'=>'Elixir.MapSet',map=>#{},version=>2},maxima=>#{}},keys=>#{'__struct__'=>
'Elixir.MapSet',map=>#{},version=>2},state=>#{}}},1}
from <0.176.0> to <0.176.0> in an old incarnation (1) of this node (2)
2018-06-29 06:27:35.826 - pid=<0.32.0> [error] - Discarding message
{delta,{<0.173.0>,#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalContext',dots=>#{'__struct__'=>'Elixir.MapSet',map=>#{{317600142,0}=>
[],{376039348,0}=>[],{629303426,0}=>[]},version=>2},maxima=>#{317600142=>
0,376039348=>0,629303426=>0}},keys=>#{'__struct__'=>'Elixir.MapSet',map=>
#{<<83,195,54,96,102,20,156,116,131,106,17,103,11,204,251,33,138>>=>
[],<<124,86,99,184,16,232,64,101,19,160,234,151,99,96,179,111,197>>=>
[],<<189,133,217,206,174,187,181,173,175,99,162,160,87,102,179,153,55>>=>[]},version=>2},
state=>#{<<83,195,54,96,102,20,156,116,131,106,17,103,11,204,251,33,138>>=>
#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>
#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.173.0>,<0.176.0>},1530242807895109565}=>[]},version=>2},
state=>#{{{<0.173.0>,<0.176.0>},1530242807895109565}=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>
'Elixir.MapSet',map=>#{{317600142,0}=>[]},version=>2}}}},
<<124,86,99,184,16,232,64,101,19,160,234,151,99,96,179,111,197>>=>
#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>
#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.173.0>,<0.176.0>},1530242849488856679}=>[]},version=>2},
state=>#{{{<0.173.0>,<0.176.0>},1530242849488856679}=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalDotSet',causal_context=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalContext',dots=>#{'__struct__'=>'Elixir.MapSet',map=>#{},version=>2},maxima=>#{}},
state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{376039348,0}=>[]},version=>2}}}},
<<189,133,217,206,174,187,181,173,175,99,162,160,87,102,179,153,55>>=>
#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>
#{'__struct__'=>'Elixir.MapSet',map=>#{{{<15833.173.0>,<15833.176.0>},
1530242805313776549}=>[]},version=>2},
state=>#{{{<15833.173.0>,<15833.176.0>},1530242805313776549}=>#{'__struct__'=>
'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,
state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{629303426,0}=>[]},version=>2}}}}}}},2}
from <0.173.0> to <0.173.0> in an old incarnation (1) of this node (2)
If I check the list of members on nd1, I get:
iex([email protected])8> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
<<83, 195, 54, 96, 102, 20, 156, 116, 131, 106, 17, 103, 11, 204, 251, 33,
138>> => {#PID<15833.173.0>, #PID<15833.176.0>},
<<124, 86, 99, 184, 16, 232, 64, 101, 19, 160, 234, 151, 99, 96, 179, 111,
197>> => {#PID<15833.173.0>, #PID<15833.176.0>},
<<189, 133, 217, 206, 174, 187, 181, 173, 175, 99, 162, 160, 87, 102, 179,
153, 55>> => {#PID<0.173.0>, #PID<0.176.0>}
}}
<<83, 195, ...>>
is still here and points to the processes on the
killed node. This shows that the old registry was not removed when
nd2
was killed.
NB: I tried the same with supervisors, which seem to work fine. The processes are restarted on the remaining nodes and, while the reference to the old supervisor is still present as well, it is labelled :dead. Re-joining generates no issue. By the way, I'm wondering why you keep this information?
If you need clarifications on this report, let me know.
Cheers
Currently every node is connected to every other node (via delta_crdt). It should be possible to define other topologies to make synchronization more efficient.
Useful for #21
Apologies for raising an issue as this is more about clarification!
When you gracefully shut a Horde node down, the processes it has are moved to a new node according to the documentation, but does the state of the processes also move with them, in the same way as Swarm does?
thanks
Tom
Hi Derek,
We are planning to switch from Swarm to Horde, but after doing some tests I saw that when I restart a node, the dead node is not removed from the Horde registry members list and I receive this error: Discarding message {delta ..., which I believe is because Horde is trying to send deltas to a dead node.
Restarting nodes causes the registry members list to grow, and the number of errors also increases.
I don't know if it is a bug or I'm doing something wrong.
Starting a single node with
iex --name [email protected] --cookie asdf -S mix
This is using current master and Elixir 1.7.4 and Erlang 20.2.3
It results in the following:
18:21:04.398 [info] Starting Horde.RegistryImpl with name HelloWorld.HelloRegistry
18:21:04.401 [error] Process HelloWorld.HelloRegistry.RegistryCrdt (#PID<0.515.0>) terminating
** (ErlangError) Erlang error: :timeout_value
(stdlib) gen_server.erl:389: :gen_server.loop/7
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Initial Call: DeltaCrdt.CausalCrdt.init/1
Ancestors: [HelloWorld.HelloRegistry.Supervisor, HelloWorld.Supervisor, #PID<0.508.0>]
Message Queue Length: 0
Messages: []
Links: [#PID<0.516.0>, #PID<0.517.0>, #PID<0.518.0>, #PID<0.510.0>]
Dictionary: [rand_seed: {%{bits: 58, jump: #Function<8.15449617/1 in :rand.mk_alg/1>, next: #Function<5.15449617/1 in :rand.mk_alg/1>, type: :exrop, uniform: #Function<6.15449617/1 in :rand.mk_alg/1>, uniform_n: #Function<7.15449617/2 in :rand.mk_alg/1>, weak_low_bits: 1}, [37638961097853609 | 49241742922068666]}]
Trapping Exits: true
Status: :running
Heap Size: 610
Stack Size: 27
Reductions: 394
Only by reverting to v0.3.0 and forcing delta_crdt to 0.3.1 can I get the example to start.
Please see: https://asciinema.org/a/AKOZTh1jvI1NTa1KK2XMEIBT1
The test app I'm running can be found here: https://github.com/frekw/horde-test
For Horde.Registry the member list seems to be replicated (as expected), but for the supervisor it's not.
I could very well be doing something wrong, but at least I don't think I am.
According to the Horde docs and function spec, terminate_child/2 should use the child pid as the child_id to terminate, like DynamicSupervisor does.
But this does not work; looking into the tests, it seems we need to use the id of the process (which works). That is (maybe) fine, but it should at least be documented.
I get the following error if I do an :init.stop()
:
[error] GenServer FamilyFive.HordeSupervisor.Crdt terminating
** (ArgumentError) argument error
:erlang.send(FamilyFive.HordeSupervisor, {:crdt_update, [{:add, {:process, FamilyFive.PushNotifications.PushNotificationsScheduler}, {nil, %{id: FamilyFive.PushNotifications.PushNotificationsScheduler, modules: [FamilyFive.PushNotifications.PushNotificationsScheduler], restart: :permanent, shutdown: 5000, start: {FamilyFive.PushNotifications.PushNotificationsScheduler, :start_link, []}, type: :worker}}}]})
(horde) lib/horde/supervisor_supervisor.ex:12: anonymous fn/2 in Horde.SupervisorSupervisor.init/1
(delta_crdt) lib/delta_crdt/causal_crdt.ex:201: DeltaCrdt.CausalCrdt.update_state_with_delta/3
(delta_crdt) lib/delta_crdt/causal_crdt.ex:166: DeltaCrdt.CausalCrdt.handle_cast/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:operation, {:add, [process: FamilyFive.PushNotifications.PushNotificationsScheduler, nil: %{id: FamilyFive.PushNotifications.PushNotificationsScheduler, modules: [FamilyFive.PushNotifications.PushNotificationsScheduler], restart: :permanent, shutdown: 5000, start: {FamilyFive.PushNotifications.PushNotificationsScheduler, :start_link, []}, type: :worker}]}}}
State: %DeltaCrdt.CausalCrdt{crdt_module: DeltaCrdt.AWLWWMap, crdt_state: %DeltaCrdt.AWLWWMap{dots: %{434327776 => 3, 704243284 => 3, 883112317 => 8}, value: %{{:member, {FamilyFive.HordeSupervisor, :"[email protected]"}} => %{{1, 1555133645008603000} => #MapSet<[{883112317, 7}]>}, {:member, {FamilyFive.HordeSupervisor, :"[email protected]"}} => %{{1, 1555133616375684000} => #MapSet<[{883112317, 3}]>}, {:member_node_info, {FamilyFive.HordeSupervisor, :"[email protected]"}} => %{{%Horde.Supervisor.Member{name: {FamilyFive.HordeSupervisor, :"[email protected]"}, status: :alive}, 1555133645053704000} => #MapSet<[{704243284, 1}]>}, {:member_node_info, {FamilyFive.HordeSupervisor, :"[email protected]"}} => %{{%Horde.Supervisor.Member{name: {FamilyFive.HordeSupervisor, :"[email protected]"}, status: :shutting_down}, 1555133652587732000} => #MapSet<[{883112317, 8}]>}, {:process, FamilyFive.PushNotifications.PushNotificationsScheduler} => %{{{{FamilyFive.HordeSupervisor, :"[email protected]"}, %{id: FamilyFive.PushNotifications.PushNotificationsScheduler, restart: :permanent, start: {FamilyFive.PushNotifications.PushNotificationsScheduler, :start_link, []}}}, 1555133621006383000} => #MapSet<[{883112317, 4}]>}}}, merkle_tree: %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: %{{:member, {FamilyFive.HordeSupervisor, :"[email protected]"}} => <<58, 230>>}, hash: "bq"}}, hash: <<133, 107>>}}, hash: "Ҝ"}}, hash: <<61, 214>>}, nil}, hash: <<113, 155>>}}, hash: <<123, 28>>}}, hash: <<155, 93>>}}, hash: <<185, 109>>}}, hash: "xP"}}, hash: <<128, 89>>}, nil}, hash: <<30, 123>>}}, hash: <<236, 29>>}, 
nil}, hash: <<47, 165>>}, nil}, hash: <<19, 32>>}, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: %{{:process, FamilyFive.PushNotifications.PushNotificationsScheduler} => <<103, 206>>}, hash: <<243, 50>>}, nil}, hash: <<22, 60>>}}, hash: <<28, 43>>}}, hash: <<198, 240>>}, nil}, hash: <<238, 154>>}}, hash: <<192, 201>>}}, hash: <<42, 203>>}, nil}, hash: <<26, 208>>}}, hash: "Ż"}}, hash: <<46, 131>>}, nil}, hash: <<29, 99>>}}, hash: <<169, 248>>}}, hash: <<34, 181>>}}, hash: <<173, 218>>}}, hash: <<254, 33>>}, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: %{{:member, {FamilyFive.HordeSupervisor, :"[email protected]"}} => <<58, 230>>}, hash: <<247, 98>>}, nil}, hash: <<14, 194>>}, nil}, hash: "Fe"}, nil}, hash: <<9, 233>>}, nil}, hash: <<174, 79>>}, nil}, hash: "mF"}, nil}, hash: <<150, 233>>}, nil}, hash: <<199, 102>>}}, hash: <<108, 147>>}, nil}, hash: <<186, 84>>}, nil}, hash: <<57, 171>>}}, hash: <<240, 164>>}, nil}, hash: "{w"}, nil}, hash: "6m"}, nil}, hash: <<193, 79>>}}, hash: <<199, 211>>}, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: 
{%MerkleTree{children: {%MerkleTree{children: %{{:member_node_info, {FamilyFive.HordeSupervisor, :"[email protected]"}} => <<227, 223>>}, hash: <<171, 117>>}, nil}, hash: <<159, 14>>}, nil}, hash: <<73, 252>>}}, hash: <<24, 213>>}, nil}, hash: "#K"}}, hash: <<202, 243>>}, nil}, hash: <<136, 44>>}, nil}, hash: <<45, 21>>}}, hash: <<28, 9>>}, nil}, hash: <<206, 111>>}, nil}, hash: "4!"}, nil}, hash: <<224, 43>>}, nil}, hash: <<223, 59>>}}, hash: <<227, 8>>}}, hash: <<244, 112>>}, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {nil, %MerkleTree{children: {%MerkleTree{children: {%MerkleTree{children: %{{:member_node_info, {FamilyFive.HordeSupervisor, :"[email protected]"}} => <<252, 229>>}, hash: <<6, 197>>}, nil}, hash: "*?"}, nil}, hash: <<193, 91>>}}, hash: <<153, 32>>}, nil}, hash: <<251, 11>>}}, hash: "hb"}}, hash: <<197, 44>>}}, hash: <<118, 222>>}}, hash: <<242, 202>>}}, hash: <<78, 192>>}, nil}, hash: <<12, 139>>}, nil}, hash: <<143, 159>>}, nil}, hash: <<138, 226>>}}, hash: <<139, 112>>}, nil}, hash: "q\v"}}, hash: <<160, 148>>}}, hash: "}V"}, name: FamilyFive.HordeSupervisor.Crdt, neighbours: #MapSet<[{FamilyFive.HordeSupervisor.Crdt, :"[email protected]"}]>, node_id: 883112317, on_diffs: #Function<0.101987752/1 in Horde.SupervisorSupervisor.init/1>, sequence_number: 0, storage_module: nil, sync_interval: 100}
Originally posted by @jfrolich in #89 (comment)
Which would be implemented by UniformQuorumDistribution
and UniformDistribution
and any new distribution strategies.
A transient process will be (correctly or incorrectly) restarted on another node when the node on which it was started is removed from the cluster, regardless of whether it has finished or not.
This function is inconsistent with Horde.Registry.lookup/2
. Making it consistent will add a large cost to the call.
The standard library Registry
doesn't offer this function so I think we can safely deprecate and then remove.
Related to #16
So here's as much information as I have gathered so far.
{:via, Horde.Registry, {Engine.HordeRegistry, {__MODULE__, struct.action_id}}}
child_spec =
{WatchFolderRunner, {workflow_id, watch_folder}}
|> Supervisor.child_spec(id: "#{WatchFolderRunner}_#{watch_folder.action_id}")
Horde.Supervisor.start_child(
Engine.HordeSupervisor,
child_spec
)
I think this change may have been introduced around rc7 or rc8 with the change to one CRDT. I have not validated it though.
Hi Derek,
This is an issue on registries.
My xyz
application actually starts a distributed registry XYZ.DReg
and a distributed supervisor XYZ.DSup
in its supervision tree. At boot, it also starts a random number of dummy Tasks, so there is a good chance that the pids of the same processes running on each node are different (for _ <- 1..:rand.uniform(100), do: Task.async(fn -> nil end)).
The scenario is as follows.
Let's start two nodes (each starting the application) in two terminals:
ERL_AFLAGS="-name [email protected] -setcookie abcd" iex -S mix
and
ERL_AFLAGS="-name [email protected] -setcookie abcd" iex -S mix
Let's check distributed registry members:
iex([email protected])1> Horde.Cluster.members(XYZ.DReg)
{:ok, %{"Ef/Vdd/0tI2LWdFuVpVFWg==" => {#PID<0.198.0>, #PID<0.202.0>}}}
and
iex([email protected])1> Horde.Cluster.members(XYZ.DReg)
{:ok, %{"kAu9t9EZlbteNeJnWMLNZA==" => {#PID<0.197.0>, #PID<0.201.0>}}}
Now, let's join the registry horde from node nd1
and recheck for members:
iex([email protected])3> Horde.Cluster.join_hordes(XYZ.DReg, {XYZ.DReg, :"[email protected]"})
true
iex([email protected])4> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
"Ef/Vdd/0tI2LWdFuVpVFWg==" => {#PID<0.198.0>, #PID<0.202.0>},
"kAu9t9EZlbteNeJnWMLNZA==" => {#PID<16274.197.0>, #PID<16274.201.0>}
}}
and on nd2
:
iex([email protected])2> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
"Ef/Vdd/0tI2LWdFuVpVFWg==" => {#PID<16320.198.0>, #PID<16320.202.0>},
"kAu9t9EZlbteNeJnWMLNZA==" => {#PID<0.197.0>, #PID<0.201.0>}
}}
Things are looking good for a normal situation. So let's kill node nd2
(or shut it down with :init.stop
) and recheck for members on nd1
:
iex([email protected])5> Horde.Cluster.members(XYZ.DReg)
{:ok,
%{
"Ef/Vdd/0tI2LWdFuVpVFWg==" => {#PID<0.198.0>, #PID<0.202.0>},
"kAu9t9EZlbteNeJnWMLNZA==" => {#PID<16274.197.0>, #PID<16274.201.0>}
}}
The reference to the registry that was running on the dead node is still present. If you join a third node nd3
to the cluster and to the XYZ.DReg
horde, it will be propagated as well. I believe it will have an impact if one alive node tries to sync with this dead node.
Cheers
Maurycy
I'm running a script that is creating a bunch of GenServers registered with Horde.Registry. The script isn't attempting to do anything in parallel, just working through a list one at a time. Sometimes, after running for a few minutes, I will observe a hang and see the following crash:
** (exit) exited in: GenServer.call(MyApp.EntityRegistry, :get_keys_ets_table, 5000)
** (EXIT) time out
(elixir) lib/gen_server.ex:989: GenServer.call/3
(horde) lib/horde/registry.ex:95: Horde.Registry.lookup/2
(gambit) lib/my_app/persistence/entity.ex:56: MyApp.Survey.Major.ensure_started/1
(gambit) lib/my_app/survey/major.ex:4: MyApp.Survey.Major.update/2
(gambit) lib/my_app/core_sync/import.ex:294: MyApp.CoreSync.Import.save_major_translation/2
(elixir) lib/enum.ex:1327: Enum."-map/2-lists^map/1-0-"/2
(elixir) lib/enum.ex:1327: Enum."-map/2-lists^map/1-0-"/2
(gambit) lib/my_app/core_sync/import.ex:70: anonymous fn/2 in MyApp.CoreSync.Import.process_majors/0
(elixir) lib/enum.ex:1940: Enum."-reduce/3-lists^foldl/2-0-"/3
(gambit) lib/my_app/core_sync/import.ex:66: MyApp.CoreSync.Import.process_majors/0
(gambit) lib/my_app/core_sync/import.ex:15: MyApp.CoreSync.Import.run/1
(mix) lib/mix/task.ex:331: Mix.Task.run_task/3
(mix) lib/mix/cli.ex:79: Mix.CLI.run_task/2
(elixir) lib/code.ex:767: Code.require_file/2
Right now I'm testing locally with a single node, so I don't think this is an issue with the CRDT.
Looks like there's been a lot of work done since 0.3.0...what would be your recommendation here? Are the RC releases "safe" to use?
I'm running on master right now.
I start with a horde of two nodes (A and B). I terminate B (by sending a SIGTERM to the Erlang node that hosts it). Then I start B up again (with the same Erlang node name). The restarted node begins displaying errors that look like this:
[error] Discarding message {delta,{<0.260.0>,<0.260.0>,#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalContext',dots=>#{'__struct__'=>'Elixir.MapSet',map=>#{{347099519,0}=>[],{623221198,0}=>[],{653145801,0}=>[]},version=>2},maxima=>#{347099519=>0,623221198=>0,653145801=>0}},keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{<<"3AbGdJ3PVn8ZdlUdX0G50w==">>=>[],<<"FuXp0DUpIe/hVYUrxRkuiw==">>=>[],<<"MQr4dFCEp2++AZuDqgLPrw==">>=>[]},version=>2},state=>#{<<"3AbGdJ3PVn8ZdlUdX0G50w==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.260.0>,<0.264.0>},1533685998533388000}=>[]},version=>2},state=>#{{{<0.260.0>,<0.264.0>},1533685998533388000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{623221198,0}=>[]},version=>2}}}},<<"FuXp0DUpIe/hVYUrxRkuiw==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<17676.260.0>,<17676.264.0>},1533685948061668000}=>[]},version=>2},state=>#{{{<17676.260.0>,<17676.264.0>},1533685948061668000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{347099519,0}=>[]},version=>2}}}},<<"MQr4dFCEp2++AZuDqgLPrw==">>=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotMap',causal_context=>nil,keys=>#{'__struct__'=>'Elixir.MapSet',map=>#{{{<0.260.0>,<0.264.0>},1533686028650670000}=>[]},version=>2},state=>#{{{<0.260.0>,<0.264.0>},1533686028650670000}=>#{'__struct__'=>'Elixir.DeltaCrdt.CausalDotSet',causal_context=>nil,state=>#{'__struct__'=>'Elixir.MapSet',map=>#{{653145801,0}=>[]},version=>2}}}}}}},2} from <0.260.0> to <0.260.0> in an old incarnation (3) of this node (1)
My guess is that the CRDT running on A never got the memo that B disappeared, and is still trying to send it messages. The restarted B node is (rightfully) not accepting those messages.
I was able to verify that, in this case, Registry.terminate never got called, and so wasn't able to initiate graceful cleanup of its CRDTs. So if I make sure Registry's terminate callback gets called (see dazuma@811a351) that seems to fix it. But that's probably not foolproof either; I'm sure there are ways to kill a node brutally and not give terminate a chance to execute.
Please see: https://asciinema.org/a/lOdfe5xgtt7PO7Qwynu8THKru
If two nodes have been part of the same Horde, PIDs that run on nodes that were part of the previous overlay network configuration are still considered valid. However, if an actual netsplit occurs, the process will be restarted on a node using the new configuration.
I would have expected cluster membership changes to be treated the same as a netsplit, i.e the process to have been started on the other node as well as soon as the Horde memberships were reconfigured.
I'm starting processes under a Horde.Supervisor, and using :via
tuples to register them with a Horde.Registry. I'm encountering a frequent race condition that appears to look like this:

1. Node A calls Horde.Supervisor.start_child to start a process that also registers itself using name: {:via, Horde.Registry, ...}.
2. Horde starts the process on another node, where it registers itself.
3. Node A immediately tries to message the new process, passing the :via tuple to GenServer.call. Unfortunately, the new registry entry probably has not yet propagated back to node A, so the call crashes. Indeed, without polling the registry, I can't find a good way to determine when the call is safe to make.

I've been poking through the horde source and can't see an obvious way to solve this one, without forcing horde to start the process on the local node. (Is there a way to do that?) Currently, I'm using the returned supervisor pid, iterating its children and looking for the recipient pid directly for those "immediate" calls. But it seems a clumsy way to proceed. Wondering if you have any solutions/suggestions. At the very least, it might be worth documenting this since it's likely to be a common problem.
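One hedged workaround sketch for the propagation race described above: poll the local registry until the entry appears, then call the resolved pid directly. `RetryLookup`, the retry counts, and `lookup_fun` are hypothetical names, not Horde API; `lookup_fun` stands in for `fn -> Horde.Registry.lookup(MyApp.Registry, key) end`.

```elixir
defmodule RetryLookup do
  # Poll until the registry entry has propagated to this node, then return
  # the pid; gives up with an error after `retries` attempts.
  def await(lookup_fun, retries \\ 50, sleep_ms \\ 10) do
    case lookup_fun.() do
      [{pid, _value} | _] ->
        {:ok, pid}

      [] when retries > 0 ->
        Process.sleep(sleep_ms)
        await(lookup_fun, retries - 1, sleep_ms)

      [] ->
        {:error, :not_registered}
    end
  end
end
```

Usage would look like `{:ok, pid} = RetryLookup.await(fn -> Horde.Registry.lookup(MyApp.Registry, key) end)` followed by `GenServer.call(pid, msg)`, trading a bounded polling delay for the crash.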
Sometimes the following exception can be raised:
[error] GenServer :"horde_82228185.Crdt" terminating
** (ArgumentError) argument error
:erlang.send(:horde_82228185, {:crdt_update, [{:add, {:process, :proc_12}, {nil, %{id: :proc_12, modules: [Task], restart: :permanent, shutdown: 10, start: {Task, :start_link, [#Function<0.66858309/0 in SupervisorTest.__ex_unit_setup_0/1>]}, type: :worker}}}]})
(horde) lib/horde/supervisor_supervisor.ex:14: anonymous fn/2 in Horde.SupervisorSupervisor.init/1
(delta_crdt) lib/delta_crdt/causal_crdt.ex:239: DeltaCrdt.CausalCrdt.update_state_with_delta/3
(delta_crdt) lib/delta_crdt/causal_crdt.ex:204: DeltaCrdt.CausalCrdt.handle_cast/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:operation, {:add, [process: :proc_12, nil: %{id: :proc_12, modules: [Task], restart: :permanent, shutdown: 10, start: {Task, :start_link, [#Function<0.66858309/0 in SupervisorTest.__ex_unit_setup_0/1>]}, type: :worker}]}}}
This often happens in tests (e.g. the supervisor test of this project) and it seems to be caused by this line. When it happens, the process registered with the name root_name is already dead, and an argument error is raised by send/2.
A simple
on_diffs: fn diffs ->
  try do
    send(root_name, {:crdt_update, diffs})
  rescue
    _ -> nil
  end
end
is enough, but I don't know if this is the right approach here, or whether there is a better, more structured solution.
What do you think?
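One possible alternative, as a sketch rather than Horde's actual approach: resolve the name with Process.whereis/1 first. send/2 to a pid (as opposed to a registered name) never raises, even if the process has just died, so no try/rescue is needed. `SafeNotify` is a hypothetical name, not part of Horde.

```elixir
defmodule SafeNotify do
  # Deliver CRDT diffs only if the named process is still registered,
  # avoiding the ArgumentError raised by send/2 on a dead name.
  def notify(root_name, diffs) do
    case Process.whereis(root_name) do
      nil -> :noop
      pid -> send(pid, {:crdt_update, diffs})
    end
  end
end
```

There is still a race between whereis and send, but a message sent to a stale pid is silently dropped rather than raising, which may be the behaviour we want here anyway.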
Hi Derek.
We're continuing our process of getting to production with a live horde cluster. I am excited as always.
We were testing our cluster (I work remotely, but the guys in the office tried it out) and while the cluster seems to do what is expected, they did notice some errors appearing in the logs.
```
** (stop) exited in: GenServer.call(MyApp.HordeSupervisor, :members, 5000)
    ** (EXIT) time out
        (elixir) lib/gen_server.ex:989: GenServer.call/3
        (hive) lib/my_app/horde/cluster.ex:99: MyApp.HordeCluster.members/0
        (hive) lib/my_app/horde/cluster.ex:71: anonymous fn/0 in MyApp.HordeCluster.start_horde_children/0
        (elixir) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
        (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Function: #Function<1.74019401/0 in MyApp.HordeCluster.start_horde_children/0>
Args: []
```
This error comes from a line that is called with:
{:ok, members} = Horde.Cluster.members(MyApp.HordeSupervisor)
We start our cluster with the following children:
{Horde.Supervisor, name: MyApp.HordeSupervisor,
strategy: :one_for_one,
distribution_strategy: strategy(),
children: []},
{Horde.Registry, name: MyApp.HordeRegistry, keys: :unique},
{MyApp.Horde.NodeMonitor, name: MyApp.Horde.NodeMonitor}
Do you have any idea what might be causing this? It doesn't seem to have a detrimental effect on the functioning of the cluster.
The other issues they discovered were:
GenServer Hive.HordeRegistry terminating
(FunctionClauseError) no function clause matching in Horde.RegistryImpl.handle_cast/2
(horde) lib/horde/registry_impl.ex:91: Horde.RegistryImpl.handle_cast({:request_to_join_hordes, {:registry, "Jw9eSu4x4E4N5B8f87h2YA==", #PID<30324.493.0>, false, {#PID<30324.447.0>, #Reference<30324.165341209.964952065.63890>}
}}, %Horde.RegistryImpl.State
{keys_ets_table: :"keys_Elixir.MyApp.HordeRegistry", keys_pid: #PID<0.1540.0>, members_pid: #PID<0.1532.0>, node_id: "Coxg9YdgjfgDRewOBedkrw==", nodes: [:"[email protected]", :"[email protected]"], pids_ets_table: :"pids_Elixir.MyApp.HordeRegistry", processes_updated_at: 0, processes_updated_counter: 0, registry_ets_table: MyApp.HordeRegistry, registry_pid: #PID<0.1536.0>}
and
GenServer Hive.HordeSupervisor.Registry terminating
(FunctionClauseError) no function clause matching in anonymous fn/1 in Horde.SupervisorImpl.monitor_supervisors/2
I am hoping that either these are known to you and not something to worry about, or that it might help you. If you recognise any of these as legitimate issues with our implementation, please let me know.
Thanks for the great project!
For feature parity with Elixir's Registry, Horde.Registry
needs to support keys: :duplicate
.
Creating this issue as a placeholder for the conversation we were having today. The tldr; of that conversation was:
dkraan [12:31 PM]
So Horde doesn't do any redistributing when a node joins the cluster, but when a node leaves the cluster its processes will be evenly distributed across the remaining nodes.
It would be good if processes in horde could be rebalanced on node up/node join, and not only on node down. The thought so far was that there may be a pluggable way to expose this type of functionality to the library user.
Current module based supervisor implementation does not allow setting supervisor options like max restarts, timers and so on, that's because of:
My suggestion is to make the module-based supervisor the standard one. If that is too complicated, maybe make child_spec overridable in the macro?
Which compares the number of nodes in the cluster with a static number to determine if the cluster has quorum.
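The static-quorum check described above can be sketched roughly as follows. The member shape and the `StaticQuorum` name are hypothetical, not Horde's actual member struct: the idea is simply that quorum holds only when the number of alive members reaches a statically configured minimum.

```elixir
defmodule StaticQuorum do
  # True when at least `minimum` members are alive; a partition holding fewer
  # than `minimum` nodes would refuse to start processes.
  def has_quorum?(members, minimum) do
    Enum.count(members, fn {_name, status} -> status == :alive end) >= minimum
  end
end
```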
When using the new 0.5.0 set_members
functionality to join or leave a cluster, it's possible to update all members in the supervisor; in the registry's case, however, once a member has been added it stays added. See the test case below:
test "can join and unjoin supervisor with set_members" do
{:ok, _} = Horde.Supervisor.start_link(name: :sup6, strategy: :one_for_one)
{:ok, _} = Horde.Supervisor.start_link(name: :sup7, strategy: :one_for_one)
assert :ok = Horde.Cluster.set_members(:sup6, [:sup6, :sup7])
{:ok, members} = Horde.Cluster.members(:sup6)
assert 2 = Enum.count(members)
assert :ok = Horde.Cluster.set_members(:sup6, [:sup6])
{:ok, members} = Horde.Cluster.members(:sup6)
assert 1 = Enum.count(members)
end
test "can join and unjoin registry with set_members" do
{:ok, _} = Horde.Registry.start_link(name: :reg4, keys: :unique)
{:ok, _} = Horde.Registry.start_link(name: :reg5, keys: :unique)
assert :ok = Horde.Cluster.set_members(:reg4, [:reg4, :reg5])
{:ok, members} = Horde.Cluster.members(:reg4)
assert 2 = Enum.count(members)
assert :ok = Horde.Cluster.set_members(:reg4, [:reg4])
{:ok, members} = Horde.Cluster.members(:reg4)
assert 1 = Enum.count(members)
end
The supervisor test case works as expected; however, the registry test case fails, since both members are still there.
The current example app uses start_child
before the cluster has been formed, resulting in "hijacking" the process whenever a new node is added. I'd like to change this to show the more correct flow of first connecting to the cluster and only then starting the child process(es).
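The corrected flow could gate start_child on cluster formation. A minimal sketch, assuming Horde.Cluster.members/1 returns {:ok, members} and that the app knows how many members to expect; `ClusterBoot` and `expected` are my names, not Horde API:

```elixir
defmodule ClusterBoot do
  # True once the cluster reports at least `expected` members, i.e. the horde
  # has been formed and it should be safe to call Horde.Supervisor.start_child/2
  # without the child being "hijacked" when another node joins.
  def formed?({:ok, members}, expected), do: Enum.count(members) >= expected
  def formed?(_other, _expected), do: false
end
```

An app would poll `ClusterBoot.formed?(Horde.Cluster.members(MyApp.HordeSupervisor), 3)` (or react to membership events) before starting its children.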
In (for example) Horde.Supervisor, we start a number of processes that are named. This can create a possible leak in the atom table. Horde should be modified to avoid dynamically creating named processes.
Where is the best place to call Horde.Cluster.join_hordes?
I'm getting this:
[email protected]: [2019-05-05 20:46:08.883] [info] Starting Horde.SupervisorImpl with name PulsarSolrIndexer.HordeSupervisor
[email protected]: [2019-05-05 20:49:52.806] [info] Starting PulsarSolrIndexer v0.6.0
[email protected]: [2019-05-05 20:49:52.828] [info] Starting Horde.SupervisorImpl with name PulsarSolrIndexer.HordeSupervisor
[email protected]: [2019-05-05 20:49:52.836] [info] Starting Horde.RegistryImpl with name PulsarSolrIndexer.GRegistry
[email protected]: [2019-05-05 20:49:52.840] [info] Starting Elixir.Server.Retry
[email protected]: [2019-05-05 20:49:52.863] [info] cleaning task running table
[email protected]: [2019-05-05 20:49:52.864] [info] Elixir.SQSConsumer started
[email protected]: [2019-05-05 20:49:52.869] [info] ConsumerSupervisor: min_demand = 3, max_demand: 10
[email protected]: [2019-05-05 20:49:55.323] [error] GenServer PulsarSolrIndexer.HordeSupervisor.MembersCrdt terminating
** (FunctionClauseError) no function clause matching in DeltaCrdt.CausalCrdt.handle_info/2
(delta_crdt) lib/delta_crdt/causal_crdt.ex:70: DeltaCrdt.CausalCrdt.handle_info({:add_neighbours, [#PID<19377.302.0>]}, %DeltaCrdt.CausalCrdt{crdt_module: DeltaCrdt.AWLWWMap, crdt_state: %DeltaCrdt.AWLWWMap{dots: %{396184569 => 1}, value: %{"BckHx0Z3wtMTLPsCsjmj/A==" => %{{{:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}, 1557082192832923218} => #MapSet<[{396184569, 1}]>}}}, max_sync_size: 200, merkle_map: %MerkleMap{map: %{"BckHx0Z3wtMTLPsCsjmj/A==" => {:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}}, merkle_tree: %MerkleMap.MerkleTree{object: {<<212, 241, 216, 124>>, {<<99, 0, 49, 45>>, {<<223, 107, 254, 235>>, {<<147, 120, 219, 150>>, {<<102, 149, 191, 171>>, {<<253, 8, 127, 129>>, [], {<<17, 181, 215, 160>>, [], {"A`\\E", {<<20, 210, 168, 102>>, [], {<<24, 112, 40, 47>>, {<<21, 196, 32, 43>>, {<<65, 152, 231, 228>>, {<<167, 1, 238, 191>>, [], {<<234, 243, 169, 248>>, {<<68, 214, 139, 175>>, {<<172, 29, 229, 84>>, {<<80, 131, 188, 213>>, {<<224, 210, 216, 84>>, {<<76, 132, ...>>, [], {...}}, []}, []}, []}, []}, []}}, []}, []}, []}}, []}}}, []}, []}, []}, []}, []}}}, name: PulsarSolrIndexer.HordeSupervisor.MembersCrdt, neighbour_monitors: %{}, neighbours: #MapSet<[]>, node_id: 396184569, on_diffs: #Function<4.83987349/1 in DeltaCrdt.CausalCrdt.init/1>, outstanding_syncs: %{}, sequence_number: 0, storage_module: nil, sync_interval: 5})
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:add_neighbours, [#PID<19377.302.0>]}
State: %DeltaCrdt.CausalCrdt{crdt_module: DeltaCrdt.AWLWWMap, crdt_state: %DeltaCrdt.AWLWWMap{dots: %{396184569 => 1}, value: %{"BckHx0Z3wtMTLPsCsjmj/A==" => %{{{:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}, 1557082192832923218} => #MapSet<[{396184569, 1}]>}}}, max_sync_size: 200, merkle_map: %MerkleMap{map: %{"BckHx0Z3wtMTLPsCsjmj/A==" => {:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}}, merkle_tree: %MerkleMap.MerkleTree{object: {<<212, 241, 216, 124>>, {<<99, 0, 49, 45>>, {<<223, 107, 254, 235>>, {<<147, 120, 219, 150>>, {<<102, 149, 191, 171>>, {<<253, 8, 127, 129>>, [], {<<17, 181, 215, 160>>, [], {"A`\\E", {<<20, 210, 168, 102>>, [], {<<24, 112, 40, 47>>, {<<21, 196, 32, 43>>, {<<65, 152, 231, 228>>, {<<167, 1, 238, 191>>, [], {<<234, 243, 169, 248>>, {<<68, 214, 139, 175>>, {<<172, 29, 229, 84>>, {<<80, 131, 188, 213>>, {<<224, 210, 216, 84>>, {<<76, 132, ...>>, [], {...}}, []}, []}, []}, []}, []}}, []}, []}, []}}, []}}}, []}, []}, []}, []}, []}}}, name: PulsarSolrIndexer.HordeSupervisor.MembersCrdt, neighbour_monitors: %{}, neighbours: #MapSet<[]>, node_id: 396184569, on_diffs: #Function<4.83987349/1 in DeltaCrdt.CausalCrdt.init/1>, outstanding_syncs: %{}, sequence_number: 0, storage_module: nil, sync_interval: 5}
[email protected]: [2019-05-05 20:49:55.326] [error] GenServer PulsarSolrIndexer.HordeSupervisor terminating
** (stop) exited in: GenServer.call(PulsarSolrIndexer.HordeSupervisor.MembersCrdt, {:operation, {:add, ["BckHx0Z3wtMTLPsCsjmj/A==", {:shutting_down, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}]}}, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(elixir) lib/gen_server.ex:979: GenServer.call/3
(horde) lib/horde/supervisor_impl.ex:69: Horde.SupervisorImpl.handle_call/3
(stdlib) gen_server.erl:661: :gen_server.try_handle_call/4
(stdlib) gen_server.erl:690: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.311.0>): :horde_shutting_down
State: %Horde.SupervisorImpl.State{distribution_strategy: Horde.UniformDistribution, members: %{"BckHx0Z3wtMTLPsCsjmj/A==" => {:alive, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}}, members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, node_id: "BckHx0Z3wtMTLPsCsjmj/A==", pid: #PID<0.307.0>, processes: %{}, processes_pid: #PID<0.306.0>, processes_updated_at: 0, processes_updated_counter: 0, shutting_down: false}
Client #PID<0.311.0> is alive
(stdlib) gen.erl:169: :gen.do_call/4
(elixir) lib/gen_server.ex:986: GenServer.call/3
(horde) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
(elixir) lib/enum.ex:769: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:769: Enum.each/2
(stdlib) gen_server.erl:673: :gen_server.try_terminate/3
(stdlib) gen_server.erl:858: :gen_server.terminate/10
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
[email protected]: [2019-05-05 20:49:55.326] [error] GenServer #PID<0.311.0> terminating
** (stop) exited in: GenServer.call(PulsarSolrIndexer.HordeSupervisor, :horde_shutting_down, 5000)
** (EXIT) exited in: GenServer.call(PulsarSolrIndexer.HordeSupervisor.MembersCrdt, {:operation, {:add, ["BckHx0Z3wtMTLPsCsjmj/A==", {:shutting_down, %{members_pid: #PID<0.305.0>, name: PulsarSolrIndexer.HordeSupervisor, pid: #PID<0.307.0>, processes_pid: #PID<0.306.0>}}]}}, :infinity)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
(elixir) lib/gen_server.ex:989: GenServer.call/3
(horde) lib/horde/signal_shutdown.ex:21: anonymous fn/1 in Horde.SignalShutdown.terminate/2
(elixir) lib/enum.ex:769: Enum."-each/2-lists^foreach/1-0-"/2
(elixir) lib/enum.ex:769: Enum.each/2
(stdlib) gen_server.erl:673: :gen_server.try_terminate/3
(stdlib) gen_server.erl:858: :gen_server.terminate/10
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.304.0>, :shutdown}
Great work on Horde! I just got finished porting a project over from Swarm, and it seems to be working much more stably under Horde—mostly, I think, because of the graceful shutdown decoupled from node shutdown. I've also found Horde's source much easier to follow than Swarm's, so kudos.
One of the requirements of my app is handoff of state when a process gets restarted on another node using Horde.Supervisor. (Consider, for example, an app running on Kubernetes where containers can be killed due to scaling down or as part of a rolling update, and you want long-running processes to migrate to other live containers along with their state.) This is something that Swarm does, and I imagine it may be a fairly common requirement.
I got this working for my app by creating a third horde type called "Handoff" that basically just wraps another CRDT with a mapping from process name to saved state. (My implementation borrows heavily from the current Horde.Registry module.) Processes that are about to be terminated can save their state to the Handoff CRDT, and when they start up they can check for saved state on the CRDT.
I was wondering if you have particular plans for a feature like this in Horde, and if you have any ideas or feedback on implementation. I could even try cleaning up my implementation as a pull request if you're interested.
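The handoff pattern described above can be sketched with a plain Agent standing in for the hypothetical "Handoff" horde; nothing below is Horde API. The worker restores parked state in init/1 and parks it again in terminate/2.

```elixir
defmodule HandoffStore do
  # Stand-in for the Handoff CRDT: maps process name -> saved state.
  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def save(name, state), do: Agent.update(__MODULE__, &Map.put(&1, name, state))
  def take(name), do: Agent.get_and_update(__MODULE__, &Map.pop(&1, name))
end

defmodule HandoffWorker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  def init(name) do
    # Trap exits so terminate/2 runs on a graceful :shutdown.
    Process.flag(:trap_exit, true)
    # Restore any state a previous incarnation parked, or start fresh.
    state = HandoffStore.take(name) || %{name: name, count: 0}
    {:ok, state}
  end

  def handle_cast(:increment, state), do: {:noreply, %{state | count: state.count + 1}}

  def terminate(_reason, state) do
    # Park the state for the next incarnation (possibly on another node).
    HandoffStore.save(state.name, state)
  end
end
```

In a real cluster the Agent would be replaced by a replicated store (e.g. a DeltaCrdt), since the next incarnation will usually start on a different node.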
Example:
iex(1)> child_spec = %{id: {A, "a"}, start: {Agent, :start_link, [fn -> 0 end]}}
iex(2)> DynamicSupervisor.start_child(MyApp.DynamicSupervisor, child_spec)
{:ok, #PID<0.836.0>}
iex(3)> Horde.Supervisor.start_child(MyApp.DistributedSupervisor, child_spec)
[error] GenServer MyApp.DistributedSupervisor terminating
** (Protocol.UndefinedError) protocol String.Chars not implemented for {A, "a"}. This protocol is implemented for...
This is problematic when the worker definition is provided by an external lib.
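Until the underlying bug is fixed, one hypothetical workaround is to normalize the id to a string before handing the spec to Horde; `safe_id` is my name, not a Horde helper. This loses the tuple shape of the id but keeps it unique.

```elixir
# Convert a composite child id to a string so that a to_string call on the id
# cannot raise Protocol.UndefinedError for tuples.
safe_id = fn
  id when is_binary(id) or is_atom(id) -> id
  id -> inspect(id)
end

spec = %{id: {A, "a"}, start: {Agent, :start_link, [fn -> 0 end]}}
spec = %{spec | id: safe_id.(spec.id)}
```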
Hi Derek,
Still struggling with this. I can run the hello_world example fine, but when I try to incorporate Horde.Registry or Horde.Supervisor into my own application supervision tree, I get the following error:
** (Mix) Could not start application core: Core.Application.start(:normal, []) returned an error: shutdown: failed to start child: Horde.LocalRegistry
** (EXIT) an exception was raised:
** (FunctionClauseError) no function clause matching in Keyword.get/3
(elixir) lib/keyword.ex:179: Keyword.get({#PID<0.186.0>, :members_updated}, :notify, nil)
(delta_crdt) lib/delta_crdt/causal_crdt.ex:64: DeltaCrdt.CausalCrdt.start_link/3
(horde) lib/horde/registry.ex:141: Horde.Registry.init/1
(stdlib) gen_server.erl:365: :gen_server.init_it/2
(stdlib) gen_server.erl:333: :gen_server.init_it/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
The code being run to do this is:
def start(_type, _args) do
import Supervisor.Spec
# List all child processes to be supervised
children = [
%{
id: Phoenix.PubSub.PG2.Dalek.Core,
start: {Phoenix.PubSub.PG2, :start_link, [:core, []]}
},
{Horde.Registry, [name: Horde.LocalRegistry]},
{Horde.Supervisor, [name: Horde.LocalSupervisor, strategy: :one_for_one]},
worker(Core.Scheduler, []),
# etc. with other children
If I remove the Horde child processes, all is fine, but adding them causes the error above. I am obviously using it wrong, but three hours spent trying to work out why is making me pull my hair out!
Any help gratefully received, as Horde (rather than Swarm) is exactly the functionality I am looking for.
Tom
So I connect two nodes with set_members
which results in the following on both nodes:
```
[
  {App.DistributedSupervisor, :"[email protected]"},
  {App.DistributedSupervisor, :"[email protected]"}
]
[
  {App.ProcessRegistry, :"[email protected]"},
  {App.ProcessRegistry, :"[email protected]"}
]
```
I start a process and list all processes in the registry on both nodes which gives me:
```iex([email protected])3> Horde.Registry.processes(App.ProcessRegistry)
%{
"process-id" => {#PID<0.674.0>,
nil}
}
iex([email protected])3> Horde.Registry.processes(App.ProcessRegistry)
%{
"process-id" => {#PID<23104.674.0>,
nil}
}```
When I then use `set_members` again for `DistributedSupervisor` it works on the first node, however the second node raises this exception:
```iex([email protected])4> Horde.Cluster.set_members(App.DistributedSupervisor, [App.DistributedSupervisor])
13:05:56.638 [debug] Found 1 processes on dead nodes [ pid=<0.420.0> line=500 function=processes_on_dead_nodes/1 module=Horde.SupervisorImpl file=lib/horde/supervisor_impl.ex application=horde ]
** (exit) exited in: GenServer.call(App.DistributedSupervisor, {:set_members, [App.DistributedSupervisor]}, 5000)
** (EXIT) an exception was raised:
** (ArithmeticError) bad argument in arithmetic expression
:erlang.rem(1766195506, 0)
(horde) lib/horde/uniform_distribution.ex:17: Horde.UniformDistribution.choose_node/2
(horde) lib/horde/supervisor_impl.ex:475: anonymous fn/3 in Horde.SupervisorImpl.handle_topology_changes/1
(elixir) lib/enum.ex:2934: Enum.filter_list/2
(horde) lib/horde/supervisor_impl.ex:474: Horde.SupervisorImpl.handle_topology_changes/1
(horde) lib/horde/supervisor_impl.ex:395: Horde.SupervisorImpl.set_members/2
(horde) lib/horde/supervisor_impl.ex:102: Horde.SupervisorImpl.handle_call/3
(stdlib) gen_server.erl:661: :gen_server.try_handle_call/4
(elixir) lib/gen_server.ex:989: GenServer.call/3```