
Comments (2)

sampathP commented on August 13, 2024

Hi,
Please help me understand the situation.
In your environment, among the compute nodes, you have node-12.local and node-13.local (the reserve host). One or more masakari-hostmonitors in the cluster then sent a false notification saying that node-12.local was down, although node-12.local was not down and nova-compute was still running.
When masakari-controller receives a node-failure notification, it first disables the compute node (node-12.local) and then tries to evacuate the VMs on the failed node (node-12.local) to the reserve host (node-13.local). As your log shows, masakari-controller worked as expected. However, nova refused to evacuate because it expected the nova-compute state on the failed node (node-12.local) to be down, but it was up. As a result, the recovery process terminated abnormally.
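
To make the sequence above concrete, here is a minimal Python sketch of the flow as I understand it. It is a toy model only: `ComputeHost`, `evacuate`, and `recover` are hypothetical names invented for illustration and are not masakari or nova code. The point it shows is that disabling a compute service changes its admin status, not its up/down state, so the evacuation is still refused while nova-compute keeps reporting in.

```python
from dataclasses import dataclass, field


class EvacuationRefused(Exception):
    """Raised when the source host's compute service is still reporting up."""


@dataclass
class ComputeHost:
    # Hypothetical model of a compute node, for illustration only.
    name: str
    service_enabled: bool = True   # admin status (enabled/disabled)
    service_up: bool = True        # liveness state (up/down)
    vms: list = field(default_factory=list)


def evacuate(failed: ComputeHost, target: ComputeHost) -> None:
    # Stand-in for nova's check: evacuation requires the source to be down.
    if failed.service_up:
        raise EvacuationRefused(f"compute service on {failed.name} is still up")
    target.vms.extend(failed.vms)
    failed.vms.clear()


def recover(failed: ComputeHost, reserve: ComputeHost) -> str:
    # Step 1: disable the reported host. This does not change service_up.
    failed.service_enabled = False
    # Step 2: try to move its VMs to the reserve host.
    try:
        evacuate(failed, reserve)
    except EvacuationRefused:
        # Abnormal termination: the host stays disabled, nothing is rolled back.
        return "aborted"
    return "recovered"
```

With a false notification (`service_up` still `True`), `recover` returns `"aborted"` and leaves the reported host disabled, which matches the state described in this issue.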

If the above understanding is correct, the next question is who sent the false host-down notification, and why.
[1] Can you please find the full DB record for uuid=a0752992-61f9-447e-b9bf-5d4099d09be9?
It is in the vmha.notification_list table on your [db host ip or hostname].
[2] Can you please check masakari-hostmonitor.log on the other nodes in the cluster for the above notification?

Please let me know if you need more information on how to get these details from your environment.

On abnormal termination, current masakari does not return the reserve host to its original state, because that would be a problem if recovery failed after some VMs had already been evacuated to the reserve host. It also does not re-enable the failed node (node-12.local). In this case, the operator has to check the situation, and may then re-enable the failed node (node-12.local) through the nova API and re-add the reserve host (node-13.local).
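
As a rough illustration of the re-enable step, the sketch below builds the OpenStack CLI command that flips a compute service back to enabled. It assumes python-openstackclient is installed and admin credentials are loaded; the helper name `enable_compute_service_cmd` is made up for this example. Re-adding the reserve host is not shown, since how that is done depends on your masakari setup.

```python
import shlex


def enable_compute_service_cmd(host: str, binary: str = "nova-compute") -> list:
    # Builds the argv for: openstack compute service set --enable <host> <service>
    return ["openstack", "compute", "service", "set", "--enable", host, binary]


# Print the command for review before running it by hand.
print(shlex.join(enable_compute_service_cmd("node-12.local")))
```

Check first that the node is really healthy; the host name passed in should be whichever node was disabled in your case.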

from masakari.

johnavp1989 commented on August 13, 2024

Hello,

Sorry for the late response. We haven't encountered this issue since the initial deployment, so I'm not sure this is really relevant anymore, but I wanted to provide you with the details you requested. Your assessment is mostly correct, except that node-12 is the reserve host and node-13 was the failed host.

Here's the DB entry:

*************************** 3. row ***************************
id: 7
create_at: 2016-11-04 07:45:54
update_at: 2016-11-04 07:49:54
delete_at: 2016-11-04 07:49:54
deleted: 0
notification_id: a0752992-61f9-447e-b9bf-5d4099d09be9
notification_type: rscGroup
notification_regionID: RegionOne
notification_hostname: node-13.local
notification_uuid:
notification_time: 2016-11-04 07:45:53
notification_eventID: 1
notification_eventType: 2
notification_detail: 2
notification_startTime: 2016-11-04 07:45:53
notification_endTime: NULL
notification_tzname: 'UTC', 'UTC'
notification_daylight: 0
notification_cluster_port: 226.94.1.1:5405
progress: 2
recover_by: 0
iscsi_ip: NULL
controle_ip: 172.17.1.20
recover_to: node-12.local
3 rows in set (0.00 sec)
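
In case it helps anyone reading the record above, here is a small Python sketch that parses this kind of vertical row output into a dict. It only uses an excerpt of the fields shown, and `parse_vertical_row` is a helper written for this example, not part of masakari. It does not attempt to interpret the numeric codes (progress, eventType, and so on).

```python
from datetime import datetime

# Excerpt of the row above, copied verbatim.
RECORD = """\
id: 7
create_at: 2016-11-04 07:45:54
update_at: 2016-11-04 07:49:54
notification_id: a0752992-61f9-447e-b9bf-5d4099d09be9
notification_hostname: node-13.local
notification_uuid:
notification_endTime: NULL
notification_cluster_port: 226.94.1.1:5405
recover_to: node-12.local
"""


def parse_vertical_row(text: str) -> dict:
    row = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip separators such as the "*** 3. row ***" banner
        # Split on the first colon only, so timestamps and ip:port stay intact.
        key, _, value = line.partition(":")
        value = value.strip()
        row[key.strip()] = None if value == "NULL" else value
    return row


row = parse_vertical_row(RECORD)
fmt = "%Y-%m-%d %H:%M:%S"
elapsed = datetime.strptime(row["update_at"], fmt) - datetime.strptime(row["create_at"], fmt)
print(row["notification_hostname"], "->", row["recover_to"], "after", elapsed)
```

This shows the notification was raised for node-13.local with node-12.local as the recovery target, and that the record was last updated four minutes after it was created.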

Unfortunately I no longer have the logs from this time as they've been rotated.

If we experience this issue again I'll provide the logs.

