Code Monkey home page Code Monkey logo

Comments (5)

cf-gitbot avatar cf-gitbot commented on September 1, 2024

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

from cf-rabbitmq-release.

mkuratczyk avatar mkuratczyk commented on September 1, 2024

Hi Matthias,

Sorry to hear you had a cluster downtime. Your idea sounds great improvement indeed. Would you mind submitting the PR? We'll review it next week.

Thank you,

from cf-rabbitmq-release.

MatthiasWinzeler avatar MatthiasWinzeler commented on September 1, 2024

@mkuratczyk thanks for getting back to me! I created the PR in #89

from cf-rabbitmq-release.

gerhard avatar gerhard commented on September 1, 2024

The RabbitMQ app might not be able to start for genuine reasons, the most common one being a unavailable peers to complete Mnesia sync which is a boot step prior to starting rabbitmq app. In this common scenario, BOSH should continue with the next instance and not stop. A failing post-start script would prevent BOSH from continuing with the next instance in the group.

If there was a genuine issue for the RabbitMQ app crashing and not starting, then this should bring the entire Erlang VM down, meaning a changing PID that Monit and ultimately BOSH would notice and correctly mark the canary node as failing. A crashing RabbitMQ app would not bring the Erlang VM down until v3.6.13, I'm wondering if you were experiencing this known & already fixed issue: rabbitmq/rabbitmq-common#237

To guard against RabbitMQ app crashing after Mnesia sync-ing completes, it would make sense to run a node & cluster check in a post-deploy, which we already do. There's also rabbitmqctl await_cluster_formation 3 --timeout 120 available since RabbitMQ v3.7.6 which might be of interest here.

cc @nodo @albertoleal @MarcialRosales @michaelklishin

from cf-rabbitmq-release.

MatthiasWinzeler avatar MatthiasWinzeler commented on September 1, 2024

@nodo About your idea in #89 to investigate the logs:

  • Node 1 looks fine as far as I can tell (see log output in #89 (comment))
  • Node 3 looks fine as Node 1
  • Node 2 outputs the following error messages in its rabbitmq@*.log:
=ERROR REPORT==== 1-Jun-2018::22:33:33 ===
** Generic server <0.6673.4975> terminating
** Last message in was {'$gen_cast',terminate}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,
                         <0.16961.4977>,<0.16429.4977>,<0.16961.4977>,
                         <<"100.106.163.21:56114 -> 100.106.195.20:5672">>,
                         {lstate,<0.17527.4977>,false},
                         none,2,
                         {[{1,none,{<7325.9869.1>,1548}}],[]},
                         {user,<<"ueWpLOMKblniObwR">>,
                          [management,policymaker],
                          [{rabbit_auth_backend_internal,none}]},
                         <<"96621d9c-4ec4-4540-82c2-3ca8920b6409">>,<<>>,
                         {dict,1,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],
                            [[<7325.9869.1>|
                              {resource,
                               <<"96621d9c-4ec4-4540-82c2-3ca8920b6409">>,
                               queue,<<"<REDACTED-QUEUE-NAME>">>}]],
                            [],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {state,
                          {dict,1,16,16,8,80,48,
                           {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                           {{[],[],[],
                             [[<7325.9869.1>|#Ref<0.1.3680763905.11413>]],
                             [],[],[],[],[],[],[],[],[],[],[],[]}}},
                          erlang},
                         {dict,0,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {dict,0,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {set,1,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],
                            [<7325.9869.1>],
                            [],[],[],[],[],[],[],[],[],[],[],[]}}},
                         <0.15941.4977>,
                         {state,fine,20000,undefined},
                         false,1,
                         {{0,nil},{0,nil}},
                         [],
                         {{0,nil},{0,nil}},
                         [{<<"consumer_cancel_notify">>,bool,true},
                          {<<"publisher_confirms">>,bool,true},
                          {<<"basic.nack">>,bool,true},
                          {<<"authentication_failure_close">>,bool,true},
                          {<<"connection.blocked">>,bool,true},
                          {<<"exchange_exchange_bindings">>,bool,true}],
                         none,0,none,flow,[]}
** Reason for termination ==
** {badarg,
       [{ets,insert,
            [channel_metrics,
             {<0.6673.4975>,
              [{pid,<0.6673.4975>},
               {transactional,false},
               {confirm,false},
               {consumer_count,0},
               {messages_unacknowledged,1},
               {messages_unconfirmed,0},
               {messages_uncommitted,0},
               {acks_uncommitted,0},
               {prefetch_count,0},
               {global_prefetch_count,0},
               {state,running},
               {garbage_collection,
                   [{max_heap_size,0},
                    {min_bin_vheap_size,46422},
                    {min_heap_size,233},
                    {fullsweep_after,65535},
                    {minor_gcs,10}]}]}],
            []},
        {rabbit_core_metrics,channel_stats,2,
            [{file,"src/rabbit_core_metrics.erl"},{line,144}]},
        {rabbit_channel,emit_stats,2,
            [{file,"src/rabbit_channel.erl"},{line,2023}]},
        {rabbit_event,if_enabled,3,[{file,"src/rabbit_event.erl"},{line,138}]},
        {rabbit_channel,terminate,2,
            [{file,"src/rabbit_channel.erl"},{line,634}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1147}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
** In 'terminate' callback with reason ==
** normal

This occurs multiple times for different queues.

I'm afraid I'm not allowed to share the whole logs (since it might contain customer information).
Does this suffice to make a judgement, for example looks like a known bug?
Otherwise, I try to see whether we can work with Pivotal support again (where we can safely pass logs).

post deploy logs:

node1:

Running node checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@05938b8fd4a523bd529c136ca66c20c2
Error: {killed,{gen_server2,call,[<10723.16611.1>,{info,[pid]},infinity]}}
RabbitMQ application is not running

node2:

Running node checks at Fri Jun  1 22:39:02 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@f98e1c2c6e9f9a6f7617830ceb0b7e04
Error: operation node_health_check on node rabbit@f98e1c2c6e9f9a6f7617830ceb0b7e04 timed out
RabbitMQ application is not running

node3:

Running node checks at Fri Jun  1 22:39:02 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c
Error: operation node_health_check on node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c timed out
RabbitMQ application is not running

from cf-rabbitmq-release.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.