Comments (5)
We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.
The labels on this GitHub issue will be updated when the story is started.
from cf-rabbitmq-release.
Hi Matthias,
Sorry to hear you had cluster downtime. Your idea sounds like a great improvement indeed. Would you mind submitting the PR? We'll review it next week.
Thank you,
@mkuratczyk thanks for getting back to me! I created the PR in #89
The RabbitMQ app might fail to start for legitimate reasons, the most common one being unavailable peers that are needed to complete the Mnesia sync, which is a boot step that runs before the rabbitmq app starts. In this common scenario, BOSH should continue with the next instance and not stop. A failing post-start script would prevent BOSH from continuing with the next instance in the group.
If there were a genuine issue causing the RabbitMQ app to crash and not start, this should bring the entire Erlang VM down, meaning a changed PID that Monit and ultimately BOSH would notice, correctly marking the canary node as failing. Before v3.6.13, however, a crashing RabbitMQ app did not bring the Erlang VM down. I'm wondering if you were experiencing this known and already fixed issue: rabbitmq/rabbitmq-common#237
To guard against the RabbitMQ app crashing after the Mnesia sync completes, it would make sense to run a node & cluster check in a post-deploy script, which we already do. There's also rabbitmqctl await_cluster_formation 3 --timeout 120, available since RabbitMQ v3.7.6, which might be of interest here.
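The "wait for N nodes, give up after a timeout" idea mentioned above can be sketched in a few lines of Python. This is only an illustration of the polling pattern, not how rabbitmqctl or the release's scripts implement it. The `count_online` callable is a hypothetical stand-in for whatever probe reports the number of reachable cluster members, and the function name, `interval`, `clock`, and `sleep` parameters are assumptions made for this example.

```python
import time
from typing import Callable


def await_online_nodes(
    count_online: Callable[[], int],  # hypothetical probe, supplied by the caller
    expected: int,
    timeout: float = 120.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until at least `expected` nodes are online or `timeout` elapses.

    Returns True as soon as the probe reports enough nodes, and False once
    the deadline has passed without that happening.
    """
    deadline = clock() + timeout
    while True:
        if count_online() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A post-deploy style check could call this with `expected=3` and `timeout=120` and exit non-zero when it returns False. Because post-deploy runs only after every instance has been updated, a timeout here would not block BOSH from moving on to the next instance the way a failing post-start script would.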
cc @nodo @albertoleal @MarcialRosales @michaelklishin
@nodo About your idea in #89 to investigate the logs:
- Node 1 looks fine as far as I can tell (see log output in #89 (comment))
- Node 3 looks as fine as Node 1
- Node 2 outputs the following error messages in its rabbitmq@*.log:
=ERROR REPORT==== 1-Jun-2018::22:33:33 ===
** Generic server <0.6673.4975> terminating
** Last message in was {'$gen_cast',terminate}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,
<0.16961.4977>,<0.16429.4977>,<0.16961.4977>,
<<"100.106.163.21:56114 -> 100.106.195.20:5672">>,
{lstate,<0.17527.4977>,false},
none,2,
{[{1,none,{<7325.9869.1>,1548}}],[]},
{user,<<"ueWpLOMKblniObwR">>,
[management,policymaker],
[{rabbit_auth_backend_internal,none}]},
<<"96621d9c-4ec4-4540-82c2-3ca8920b6409">>,<<>>,
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],
[[<7325.9869.1>|
{resource,
<<"96621d9c-4ec4-4540-82c2-3ca8920b6409">>,
queue,<<"<REDACTED-QUEUE-NAME>">>}]],
[],[],[],[],[],[],[],[],[],[],[],[]}}},
{state,
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],
[[<7325.9869.1>|#Ref<0.1.3680763905.11413>]],
[],[],[],[],[],[],[],[],[],[],[],[]}}},
erlang},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
{set,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],
[<7325.9869.1>],
[],[],[],[],[],[],[],[],[],[],[],[]}}},
<0.15941.4977>,
{state,fine,20000,undefined},
false,1,
{{0,nil},{0,nil}},
[],
{{0,nil},{0,nil}},
[{<<"consumer_cancel_notify">>,bool,true},
{<<"publisher_confirms">>,bool,true},
{<<"basic.nack">>,bool,true},
{<<"authentication_failure_close">>,bool,true},
{<<"connection.blocked">>,bool,true},
{<<"exchange_exchange_bindings">>,bool,true}],
none,0,none,flow,[]}
** Reason for termination ==
** {badarg,
[{ets,insert,
[channel_metrics,
{<0.6673.4975>,
[{pid,<0.6673.4975>},
{transactional,false},
{confirm,false},
{consumer_count,0},
{messages_unacknowledged,1},
{messages_unconfirmed,0},
{messages_uncommitted,0},
{acks_uncommitted,0},
{prefetch_count,0},
{global_prefetch_count,0},
{state,running},
{garbage_collection,
[{max_heap_size,0},
{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,10}]}]}],
[]},
{rabbit_core_metrics,channel_stats,2,
[{file,"src/rabbit_core_metrics.erl"},{line,144}]},
{rabbit_channel,emit_stats,2,
[{file,"src/rabbit_channel.erl"},{line,2023}]},
{rabbit_event,if_enabled,3,[{file,"src/rabbit_event.erl"},{line,138}]},
{rabbit_channel,terminate,2,
[{file,"src/rabbit_channel.erl"},{line,634}]},
{gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1147}]},
{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
** In 'terminate' callback with reason ==
** normal
This occurs multiple times for different queues.
I'm afraid I'm not allowed to share the whole logs (since they might contain customer information).
Is this enough to make a judgement, for example whether this looks like a known bug?
Otherwise, I'll try to see whether we can work with Pivotal support again (where we can safely pass logs).
post deploy logs:
node1:
Running node checks at Fri Jun 1 22:39:03 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun 1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@05938b8fd4a523bd529c136ca66c20c2
Error: {killed,{gen_server2,call,[<10723.16611.1>,{info,[pid]},infinity]}}
RabbitMQ application is not running
node2:
Running node checks at Fri Jun 1 22:39:02 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun 1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@f98e1c2c6e9f9a6f7617830ceb0b7e04
Error: operation node_health_check on node rabbit@f98e1c2c6e9f9a6f7617830ceb0b7e04 timed out
RabbitMQ application is not running
node3:
Running node checks at Fri Jun 1 22:39:02 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun 1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c
Error: operation node_health_check on node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c timed out
RabbitMQ application is not running
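When comparing post-deploy output across several nodes, it can help to reduce it to one record per node. The following is a minimal sketch that assumes the logs are pasted in exactly the shape shown above (a `nodeN:` header followed by the check output). The `NodeReport` type and `parse_post_deploy` function are hypothetical names introduced for this example and are not part of cf-rabbitmq-release.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeReport:
    name: str
    node_checks_passed: bool = False
    app_reported_down: bool = False
    errors: List[str] = field(default_factory=list)


def parse_post_deploy(text: str) -> List[NodeReport]:
    """Group post-deploy log lines by their 'nodeN:' header (assumed format)."""
    reports: List[NodeReport] = []
    current: Optional[NodeReport] = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"(node\d+):", line)
        if header:
            # A new 'nodeN:' header starts a new record.
            current = NodeReport(name=header.group(1))
            reports.append(current)
        elif current is None:
            # Ignore anything before the first node header.
            continue
        elif line.startswith("Node checks") and line.endswith("passed"):
            current.node_checks_passed = True
        elif line.startswith("Error:"):
            current.errors.append(line[len("Error:"):].strip())
        elif line == "RabbitMQ application is not running":
            current.app_reported_down = True
    return reports
```

Applied to the output above, each of the three records would show the node checks passing while the cluster check reports an error and "RabbitMQ application is not running".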