This is on master, as of 6eb1dd2.
Buckle up.
We're seeing weird behavior when sending to a particular machine in our cluster of two.
Three clients subscribe to diego.staging.start
. One is just nats-sub '>'
for debugging. The others are actual consumers taking in messages and handling them. They normally subscribe with a queue, but we collected data with and without it.
We're doing a NATS request on diego.staging.start
and expect to see a response from one (with queue) or both (without) consumers.
There are two NATS machines in our cluster, and two consumers.
Without queueing
We're seeing that, without a queue, if we send it to a particular machine, all clients receive the message twice. Sending to the other NATS machine is fine, ands both consumers receive it once.
Publish command:
nats-request -n 2 -s 'nats://nats:[email protected]:4222' diego.staging.start '{"app_id":"a","task_id":"b","download_uri":"http://example.com"}'
nats-request -n 2 -s 'nats://nats:[email protected]:4222' diego.staging.start '{"app_id":"a","task_id":"b","download_uri":"http://example.com"}'
Sending to 10.244.0.6
:
The message is received by each connection twice.
Logs on 10.244.0.6:
"Client connection created", [192.168.50.1, 56959], 61]
[["CONNECT OP", "CONNECT {\"verbose\":false,\"pedantic\":false,\"user\":\"nats\",\"pass\":\"nats\"}"], "c: 61"]
[["PING OP"], "c: 61"]
[["PING OP"], "c: 61"]
[["SUB OP", "SUB _INBOX.646f990eb6a4b7079b510ee383 2"], "c: 61"]
[["PUB OP", "PUB diego.staging.start _INBOX.646f990eb6a4b7079b510ee383 64"], "c: 61"]
[["Processing Client msg: 1", "diego.staging.start", "_INBOX.646f990eb6a4b7079b510ee383", "{\"app_id\":\"a\",\"task_id\":\"b\",\"download_uri\":\"http://example.com\"}"], "c: 61"]
["Ignoring route, already processed", 61]
["Ignoring route, already processed", 61]
["Ignoring route, already processed", 61]
["Ignoring route, already processed", 61]
["Ignoring route, already processed", 61]
[["MSG OP", "MSG _INBOX.646f990eb6a4b7079b510ee383 RSID:61:2 2"], "c: 1"]
[["Processing Router msg: 43", "_INBOX.646f990eb6a4b7079b510ee383", "", "{}"], "c: 1"]
[["MSG OP", "MSG _INBOX.646f990eb6a4b7079b510ee383 RSID:61:2 2"], "c: 1"]
[["Processing Router msg: 44", "_INBOX.646f990eb6a4b7079b510ee383", "", "{}"], "c: 1"]
[["MSG OP", "MSG _INBOX.646f990eb6a4b7079b510ee383 RSID:61:2 2"], "c: 1"]
[["Processing Router msg: 45", "_INBOX.646f990eb6a4b7079b510ee383", "", "{}"], "c: 1"]
[["MSG OP", "MSG _INBOX.646f990eb6a4b7079b510ee383 RSID:61:2 2"], "c: 1"]
[["Processing Router msg: 46", "_INBOX.646f990eb6a4b7079b510ee383", "", "{}"], "c: 1"]
["Client connection closed", [192.168.50.1, 56959], 61]
Logs on 10.244.2.6:
[["SUB OP", "SUB _INBOX.646f990eb6a4b7079b510ee383 RSID:61:2"], "c: 1"]
[["SUB OP", "SUB _INBOX.646f990eb6a4b7079b510ee383 RSID:61:2"], "c: 6"]
[["MSG OP", "MSG diego.staging.start RSID:4:2 _INBOX.646f990eb6a4b7079b510ee383 64"], "c: 1"]
[["Processing Router msg: 82", "diego.staging.start", "_INBOX.646f990eb6a4b7079b510ee383", "{\"app_id\":\"a\",\"task_id\":\"b\",\"download_uri\":\"http://example.com\"}"], "c: 1"]
[["MSG OP", "MSG diego.staging.start QRSID:2:1 _INBOX.646f990eb6a4b7079b510ee383 64"], "c: 6"]
[["Processing Router msg: 34", "diego.staging.start", "_INBOX.646f990eb6a4b7079b510ee383", "{\"app_id\":\"a\",\"task_id\":\"b\",\"download_uri\":\"http://example.com\"}"], "c: 6"]
[["PUB OP", "PUB _INBOX.646f990eb6a4b7079b510ee383 2"], "c: 30"]
[["Processing Client msg: 88", "_INBOX.646f990eb6a4b7079b510ee383", "", "{}"], "c: 30"]
["Ignoring route, already processed", 30]
[["PUB OP", "PUB _INBOX.646f990eb6a4b7079b510ee383 2"], "c: 30"]
[["Processing Client msg: 89", "_INBOX.646f990eb6a4b7079b510ee383", "", "{}"], "c: 30"]
["Ignoring route, already processed", 30]
[["PUB OP", "PUB _INBOX.646f990eb6a4b7079b510ee383 2"], "c: 31"]
[["Processing Client msg: 88", "_INBOX.646f990eb6a4b7079b510ee383", "", "{}"], "c: 31"]
["Ignoring route, already processed", 31]
[["PUB OP", "PUB _INBOX.646f990eb6a4b7079b510ee383 2"], "c: 31"]
[["Processing Client msg: 89", "_INBOX.646f990eb6a4b7079b510ee383", "", "{}"], "c: 31"]
["Ignoring route, already processed", 31]
nats-sub '>'
[#418] Received on [diego.staging.start] : '{"app_id":"a","task_id":"b","download_uri":"http://example.com"}'
[#419] Received on [diego.staging.start] : '{"app_id":"a","task_id":"b","download_uri":"http://example.com"}'
[#420] Received on [_INBOX.646f990eb6a4b7079b510ee383] : '{}'
[#421] Received on [_INBOX.646f990eb6a4b7079b510ee383] : '{}'
[#422] Received on [_INBOX.646f990eb6a4b7079b510ee383] : '{}'
[#423] Received on [_INBOX.646f990eb6a4b7079b510ee383] : '{}'
Sending to 10.244.2.6
:
The message is received by each client once (as expected).
Logs on 10.244.0.6:
[["SUB OP", "SUB _INBOX.b27c82c8c1f1b7562a166b2b95 RSID:68:2"], "c: 3"]
[["SUB OP", "SUB _INBOX.b27c82c8c1f1b7562a166b2b95 RSID:68:2"], "c: 1"]
Logs on 10.244.2.6
(which we DID send to):
["Client connection created", [192.168.50.1, 56933], 68]
[["CONNECT OP", "CONNECT {\"verbose\":false,\"pedantic\":false,\"user\":\"nats\",\"pass\":\"nats\"}"], "c: 68"]
[["PING OP"], "c: 68"]
[["PING OP"], "c: 68"]
[["SUB OP", "SUB _INBOX.b27c82c8c1f1b7562a166b2b95 2"], "c: 68"]
[["PUB OP", "PUB diego.staging.start _INBOX.b27c82c8c1f1b7562a166b2b95 64"], "c: 68"]
[["Processing Client msg: 1", "diego.staging.start", "_INBOX.b27c82c8c1f1b7562a166b2b95", "{\"app_id\":\"a\",\"task_id\":\"b\",\"download_uri\":\"http://example.com\"}"], "c: 68"]
[["PUB OP", "PUB _INBOX.b27c82c8c1f1b7562a166b2b95 2"], "c: 31"]
[["Processing Client msg: 87", "_INBOX.b27c82c8c1f1b7562a166b2b95", "", "{}"], "c: 31"]
[["PUB OP", "PUB _INBOX.b27c82c8c1f1b7562a166b2b95 2"], "c: 30"]
[["Processing Client msg: 87", "_INBOX.b27c82c8c1f1b7562a166b2b95", "", "{}"], "c: 30"]
["Client connection closed", [192.168.50.1, 56933], 68]
With Queueing
With a queue, our nats-sub
always sees the request message, but if we send it to the same particular machine, no consumers receive the message.
Publish command:
nats-request -n 1 -s 'nats://nats:[email protected]:4222' diego.staging.start '{"app_id":"a","task_id":"b","download_uri":"http://example.com"}'
nats-request -n 1 -s 'nats://nats:[email protected]:4222' diego.staging.start '{"app_id":"a","task_id":"b","download_uri":"http://example.com"}'
I've captured the full logs of both machines for the full sequence; comments are inline showing the timing of the command.
Sending to 10.244.0.6
nats-sub
[#424] Received on [diego.staging.start] : '{"app_id":"a","task_id":"b","download_uri":"http://example.com"}'
# no response from consumers
Logs on 10.244.0.6
[&{Host:10.244.0.6 Port:4222 Trace:true Debug:true NoLog:false NoSigs:false Logtime:false MaxConn:65536 Username:nats Password:nats Authorization: PingInterval:2m0s MaxPingsOut:2 HTTPPort:0 SslTimeout:0.5 AuthTimeout:15 MaxControlLine:1024 MaxPayload:1048576 ClusterHost:10.244.0.6 ClusterPort:4223 ClusterUsername:nats ClusterPassword:nats ClusterAuthTimeout:15 Routes:[nats-route://nats:[email protected]:4223] ProfPort:0 PidFile:/var/vcap/sys/run/nats/nats.pid LogFile:/var/vcap/sys/log/nats/nats.log}]
["DEBUG is on"]
["TRACE is on"]
["Starting gnatsd version 0.4.6"]
["Listening for route connections on 10.244.0.6:4223"]
["Listening for client connections on 10.244.0.6:4222"]
["gnatsd is ready"]
["Trying to connect to route on 10.244.2.6:4223"]
["Error trying to connect to route: dial tcp 10.244.2.6:4223: connection refused"]
["Route connection created", [10.0.2.15, 43494], 1]
["Route sent local subscriptions", 1]
[["CONNECT OP", "CONNECT {\"verbose\":false,\"pedantic\":false,\"user\":\"nats\",\"pass\":\"nats\",\"ssl_required\":false,\"name\":\"2319669868152a7ed5f753d50352ba41\"}"], "c: 1"]
["Registering remote route", "2319669868152a7ed5f753d50352ba41"]
[["SUB OP", "SUB diego.staging.start diego.stagers QRSID:2:1"], "c: 1"]
[["SUB OP", "SUB > RSID:3:2"], "c: 1"]
["Trying to connect to route on 10.244.2.6:4223"]
["Route connection created", [10.244.2.6, 4223], 2]
["Route connect msg sent", [10.244.2.6, 4223], 2]
["Route sent local subscriptions", 2]
["Detected duplicate remote route", "2319669868152a7ed5f753d50352ba41", [10.244.2.6, 4223], 2]
["Router connection closed", [10.244.2.6, 4223], 2]
["Attempting reconnect for solicited route", 2]
[["SUB OP", "SUB diego.staging.start diego.stagers QRSID:2:1"], "c: 2"]
[["SUB OP", "SUB > RSID:3:2"], "c: 2"]
["Trying to connect to route on 10.244.2.6:4223"]
["Route connection created", [10.244.2.6, 4223], 3]
["Route connect msg sent", [10.244.2.6, 4223], 3]
["Route sent local subscriptions", 3]
["Registering remote route", "2319669868152a7ed5f753d50352ba41"]
[["SUB OP", "SUB diego.staging.start diego.stagers QRSID:2:1"], "c: 3"]
[["SUB OP", "SUB > RSID:3:2"], "c: 3"]
# Sending to 10.244.0.6 (consumers did not receive anything):
[["SUB OP", "SUB diego.staging.start diego.stagers QRSID:6:1"], "c: 1"]
[["SUB OP", "SUB diego.staging.start diego.stagers QRSID:6:1"], "c: 3"]
["Client Ping Timer", [10.0.2.15, 43494], 1]
[["PING OP"], "c: 1"]
[["PONG OP"], "c: 1"]
["Client Ping Timer", [10.244.2.6, 4223], 3]
[["PING OP"], "c: 3"]
[["PONG OP"], "c: 3"]
["Client connection created", [192.168.50.1, 57202], 4]
[["CONNECT OP", "CONNECT {\"verbose\":false,\"pedantic\":false,\"user\":\"nats\",\"pass\":\"nats\"}"], "c: 4"]
[["PING OP"], "c: 4"]
[["PING OP"], "c: 4"]
[["SUB OP", "SUB _INBOX.542fde2bc4851d06c7d46ac09e 2"], "c: 4"]
[["PUB OP", "PUB diego.staging.start _INBOX.542fde2bc4851d06c7d46ac09e 64"], "c: 4"]
[["Processing Client msg: 1", "diego.staging.start", "_INBOX.542fde2bc4851d06c7d46ac09e", "{\"app_id\":\"a\",\"task_id\":\"b\",\"download_uri\":\"http://example.com\"}"], "c: 4"]
["Ignoring route, already processed", 4]
["Ignoring route, already processed", 4]
# Sending to 10.244.2.6 (functioned properly):
["Client connection closed", [192.168.50.1, 57202], 4]
[["SUB OP", "SUB _INBOX.dd5458fa34dca53c342e7834df RSID:7:2"], "c: 3"]
[["SUB OP", "SUB _INBOX.dd5458fa34dca53c342e7834df RSID:7:2"], "c: 1"]
Sending to 10.244.2.6
nats-sub
[#425] Received on [diego.staging.start] : '{"app_id":"a","task_id":"b","download_uri":"http://example.com"}'
# consumer responds:
[#426] Received on [_INBOX.dd5458fa34dca53c342e7834df] : '{}'
Logs on 10.244.2.6
[&{Host:10.244.2.6 Port:4222 Trace:true Debug:true NoLog:false NoSigs:false Logtime:false MaxConn:65536 Username:nats Password:nats Authorization: PingInterval:2m0s MaxPingsOut:2 HTTPPort:0 SslTimeout:0.5 AuthTimeout:15 MaxControlLine:1024 MaxPayload:1048576 ClusterHost:10.244.2.6 ClusterPort:4223 ClusterUsername:nats ClusterPassword:nats ClusterAuthTimeout:15 Routes:[nats-route://nats:[email protected]:4223] ProfPort:0 PidFile:/var/vcap/sys/run/nats/nats.pid LogFile:/var/vcap/sys/log/nats/nats.log}]
["DEBUG is on"]
["TRACE is on"]
["Starting gnatsd version 0.4.6"]
["Listening for route connections on 10.244.2.6:4223"]
["Listening for client connections on 10.244.2.6:4222"]
["gnatsd is ready"]
["Trying to connect to route on 10.244.0.6:4223"]
["Route connection created", [10.244.0.6, 4223], 1]
["Route connect msg sent", [10.244.0.6, 4223], 1]
["Route sent local subscriptions", 1]
["Registering remote route", "6d10583879befb490c2b15e4bbdb1259"]
["Client connection created", [10.0.2.15, 45175], 2]
[["CONNECT OP", "CONNECT {\"user\":\"nats\",\"pass\":\"nats\",\"verbose\":true,\"pedantic\":true}"], "c: 2"]
[["SUB OP", "SUB diego.staging.start diego.stagers 1"], "c: 2"]
["Client connection created", [192.168.50.1, 57086], 3]
[["CONNECT OP", "CONNECT {\"verbose\":false,\"pedantic\":false,\"user\":\"nats\",\"pass\":\"nats\"}"], "c: 3"]
[["SUB OP", "SUB > 2"], "c: 3"]
[["PING OP"], "c: 3"]
["Route connection created", [10.0.2.15, 43845], 4]
["Route sent local subscriptions", 4]
[["CONNECT OP", "CONNECT {\"verbose\":false,\"pedantic\":false,\"user\":\"nats\",\"pass\":\"nats\",\"ssl_required\":false,\"name\":\"6d10583879befb490c2b15e4bbdb1259\"}"], "c: 4"]
["Detected duplicate remote route", "6d10583879befb490c2b15e4bbdb1259", [10.0.2.15, 43845], 4]
["Router connection closed", [10.0.2.15, 43845], 4]
["Route connection created", [10.0.2.15, 43852], 5]
["Route sent local subscriptions", 5]
[["CONNECT OP", "CONNECT {\"verbose\":false,\"pedantic\":false,\"user\":\"nats\",\"pass\":\"nats\",\"ssl_required\":false,\"name\":\"6d10583879befb490c2b15e4bbdb1259\"}"], "c: 5"]
["Registering remote route", "6d10583879befb490c2b15e4bbdb1259"]
# Sending to 10.244.0.6 (consumers did not receive anything):
["Client connection created", [10.0.2.15, 55657], 6]
[["CONNECT OP", "CONNECT {\"user\":\"nats\",\"pass\":\"nats\",\"verbose\":true,\"pedantic\":true}"], "c: 6"]
[["SUB OP", "SUB diego.staging.start diego.stagers 1"], "c: 6"]
["Client Ping Timer", [10.244.0.6, 4223], 1]
[["PING OP"], "c: 1"]
[["PONG OP"], "c: 1"]
["Client Ping Timer", [10.0.2.15, 45175], 2]
[["PONG OP"], "c: 2"]
[["PING OP"], "c: 3"]
["Client Ping Timer", [192.168.50.1, 57086], 3]
[["PONG OP"], "c: 3"]
["Client Ping Timer", [10.0.2.15, 43852], 5]
[["PING OP"], "c: 5"]
[["PONG OP"], "c: 5"]
[["SUB OP", "SUB _INBOX.542fde2bc4851d06c7d46ac09e RSID:4:2"], "c: 1"]
[["SUB OP", "SUB _INBOX.542fde2bc4851d06c7d46ac09e RSID:4:2"], "c: 5"]
[["MSG OP", "MSG diego.staging.start RSID:3:2 _INBOX.542fde2bc4851d06c7d46ac09e 64"], "c: 1"]
[["Processing Router msg: 1", "diego.staging.start", "_INBOX.542fde2bc4851d06c7d46ac09e", "{\"app_id\":\"a\",\"task_id\":\"b\",\"download_uri\":\"http://example.com\"}"], "c: 1"]
["Client Ping Timer", [10.0.2.15, 55657], 6]
[["PONG OP"], "c: 6"]
# Sending to 10.244.2.6 (worked properly):
["Client connection created", [192.168.50.1, 57203], 7]
[["CONNECT OP", "CONNECT {\"verbose\":false,\"pedantic\":false,\"user\":\"nats\",\"pass\":\"nats\"}"], "c: 7"]
[["PING OP"], "c: 7"]
[["PING OP"], "c: 7"]
[["SUB OP", "SUB _INBOX.dd5458fa34dca53c342e7834df 2"], "c: 7"]
[["PUB OP", "PUB diego.staging.start _INBOX.dd5458fa34dca53c342e7834df 64"], "c: 7"]
[["Processing Client msg: 1", "diego.staging.start", "_INBOX.dd5458fa34dca53c342e7834df", "{\"app_id\":\"a\",\"task_id\":\"b\",\"download_uri\":\"http://example.com\"}"], "c: 7"]
[["PUB OP", "PUB _INBOX.dd5458fa34dca53c342e7834df 2"], "c: 2"]
[["Processing Client msg: 1", "_INBOX.dd5458fa34dca53c342e7834df", "", "{}"], "c: 2"]
["Client connection closed", [192.168.50.1, 57203], 7]