This is an RFC for a new (optional) optimisation that is crucial for us.
I'm going to work on implementing this today on a fork and will hopefully have a PR, but wanted to document the rationale and design first.
## Problem
Right now, if history/recovery is enabled, Centrifugo will store up to `history_size` messages for every channel published to in the last `history_lifetime` seconds.
This makes designs with a high degree of fan-out impractically expensive.
In my case I have about 5k messages a second being published (after fan-out), where some messages might be published into tens of thousands of channels - one for each "watcher". Even with no subscribers active and modest history settings, that means I need tens or hundreds of GB of Redis memory that is all doing nothing - never read by anyone.
In the vast majority of cases, a user who is watched by tens of thousands of people will likely have only a very small fraction of them currently online and subscribed to Centrifugo; most users with few watchers will have zero watchers online most of the time. I expect this optimisation to reduce Redis memory requirements (and operations per second) by several orders of magnitude, with almost no change to app behaviour (see the Edge Case section).
## Solution Overview
I propose an optimisation that ensures we simply don't store history when it is not needed by "active" subscribers. "Active" means: online now, or online within the last `history_lifetime` seconds (and so possibly about to reconnect, expecting recovery).
The logic below is almost transparent to the application - it behaves identically to the current logic, but only uses Redis memory for data that is actually delivered.
At a high level it's simple:
- check if `engine.publish()` actually delivered to anyone
- check if history already exists (i.e. something was listening in the last `history_lifetime`)
- add the message to the history key if and only if at least one of the above checks indicates an active subscriber for that channel
## Implementation
This should be relatively simple. If I were designing from scratch I'd be tempted not to have separate `publish` and `addHistory` methods in the `engine` interface, as they are fairly closely linked. Indeed, if you made that change it would be possible to implement this entire request in the `publish` method, potentially with a Lua script running entirely inside Redis.
But in the interest of minimising changes to core Centrifugo code, I propose we implement it with these minimal changes:
- change `Engine.publish`'s signature to `publish(chID ChannelID, message []byte) (bool, error)`, where the bool return value is `true` when we delivered to at least one active subscriber and `false` otherwise.
- change `addHistory`'s options struct to include a new option: `OnlySaveIfActive bool`. If this is set to `true` then `addHistory` should only add the new event if the history key already exists.
## Engine Compatibility
- The changes to `publish` are efficient to implement in the memory engine, as it has state about subscribers.
- They are also efficient in the Redis engine, since `PUBLISH` returns the number of active subscriptions the message was delivered to (http://redis.io/commands/publish).
- Even if a future engine can't know this efficiently - e.g. it uses a third-party service that doesn't return that info, or queues publishes asynchronously - it can still just return `true` for every call, and you end up with exactly the same behaviour as now - suboptimal but totally correct.
- The semantics of the current client code are that we only recover on disconnection, so messages delivered before a client first connects are never read by the client anyway. Thus it's safe to simply not store them at all if no client is connected now and none has been for the last `history_lifetime`. Clients end up with identical deliverability guarantees - indeed they couldn't tell the difference with and without this.
## Edge Case (resolved with #148)
I've said this is *almost* completely transparent to the client, which is true. There is one case in which it is not, which I regard as an edge case, and acceptable given Centrifugo's "best effort" general design.
Given the proposed optimisation, consider the following:
- Client connects and subscribes to `foo`.
- No messages are published before the client loses connection (phone goes into a tunnel, for example).
- While the client is offline, message `A` is published into `foo`.
- Since there are no active subscribers (the dropped connection's pubsub state has been cleaned up) AND no existing history, this message is dropped, not saved.
- Client reconnects within `history_lifetime` with the `recover` option, but doesn't get `A` because it was not saved.
Personally I don't consider this a big problem. However, given that it is a behaviour change, we could consider making this optimisation configurable in case others find it unacceptable.
## Self-subscriptions
The one other case, which doesn't affect me but should be pointed out, is that if you are using the websocket API to allow publishing (i.e. `publish: true` in config), then these messages will always be saved due to the client API design - a client can't publish to a channel unless they have a subscription to it, which means `publish()` will always see at least one active subscription.
This is no worse than current behaviour, and is probably expected/desirable anyway, so I don't think it's an argument against the optimisation.