
Comments (4)

josevalim commented on May 25, 2024


timbuchwaldt commented on May 25, 2024

Hi José,
thanks for your quick reply. After a quick fight with Firenest I was able to start the benchmark.
With two partitions it doesn't max out all schedulers, but it still takes ages (I stopped the benchmark after 2 minutes for 500k subscriptions); the reduction counts clearly show the two PIDPartition processes as the ones taking the time.

Increasing the number of partitions increases the load: with 8 partitions I see the same pattern of 8/8 schedulers maxed out, and the runtime is equivalent: https://dsh.re/aa50c


timbuchwaldt commented on May 25, 2024

Sooo... I figured it out; damn synthetic benchmarks.

I had subscribed the test processes all to the same channels, so I basically had 10 channels with 50k subscribers each. Each deletion then means a lookup in the pid ETS table to find the process's subscriptions (fast, as it's a lookup by key) plus 10 match_delete calls. Each match_delete finds the topic key in ETS quickly, but then has to iterate through all 50k entries under that key to find the pid, and this happens 10 times for every process.
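To make that concrete, here is a minimal sketch of the pattern (my own illustration, not phoenix_pubsub's actual table layout): a duplicate_bag keyed by topic, where deleting one subscriber via match_delete has to walk every row stored under that key.

  # Illustration only: a duplicate_bag table keyed by topic, as assumed above.
  tab = :ets.new(:topic_subs, [:duplicate_bag, :public])

  # 50k distinct subscriber pids, all stored under the same topic key.
  pids = for _ <- 1..50_000, do: spawn(fn -> :ok end)
  for pid <- pids, do: :ets.insert(tab, {"topic:1", pid})

  # Removing one subscriber matches against every row under "topic:1",
  # so draining all 50k subscribers visits O(n^2) rows in total.
  {micros, _} =
    :timer.tc(fn ->
      for pid <- pids, do: :ets.match_delete(tab, {"topic:1", pid})
    end)

  IO.puts("cleanup took #{div(micros, 1000)} ms")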

If my (rather bad) mathematics is correct, the worst-case number of key-by-key lookups would be (500000^2 + 500000)/2, which is roughly 125 billion lookups.
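Spelled out, that estimate is just the sum n + (n - 1) + ... + 1 with n = 500,000, since each successive deletion has one entry fewer to scan:

  # Quick sanity check of the estimate above.
  n = 500_000
  div(n * n + n, 2)
  #=> 125_000_250_000  (roughly 125 billion)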

I have now rebuilt my sample with a random set of 100 channels to subscribe to, which changes the picture drastically. Sadly the benchmarking becomes a lot more complex, since picking the random channels makes subscribing slow.

In conclusion, I suspect there is a huge bottleneck in the cleanup when lots of subscribers subscribe to the same topic, since ETS then has to go through all entries sharing that topic key to find the exact element to delete.

Edit: I see one viable workaround if one still wants lots of subscribers on a single key: subscribe each process to key + rand(0, n), and send to all of them by sending to every key from key+0 up to key+n.
This distributes the duplicates in the table evenly across the n keys, so each key only holds 1/n of the subscribers, which drastically reduces the impact of this issue. Since sending is a simple lookup, the overhead on the sending side is negligible.

Edit^2:
I implemented my workaround:

  def distributed_publish(id, message) do
    for i <- 1..128 do
      Firenest.PubSub.broadcast(FFL.PubSub, id <> "-part" <> Integer.to_string(i), message)
    end
  end

  def distributed_subscribe(id) do
    distributed_id = id <> "-part" <> Integer.to_string(:rand.uniform(128))
    Firenest.PubSub.subscribe(FFL.PubSub, distributed_id)
  end

This improves the behaviour drastically (in the graph, the first bump is starting all instances, then waiting, then terminating them all):

Compared to not distributing:

I would be happy to implement this a bit more nicely if there is interest, possibly by hashing the key or something similar.
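For example, roughly along these lines (only a sketch; hashing the subscriber pid is just one possible way to pick the shard deterministically):

  # Sketch: derive the shard by hashing the subscribing pid instead of
  # drawing a random number, so repeated subscribes land on the same shard.
  @parts 128

  def distributed_subscribe(id) do
    part = :erlang.phash2(self(), @parts) + 1
    Firenest.PubSub.subscribe(FFL.PubSub, id <> "-part" <> Integer.to_string(part))
  end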


chrismccord commented on May 25, 2024

Now that we have sharding in place, this should be much improved. Thanks!

