cookbook-openstack-network's Issues

OCF resource should not replicate DHCP

The resource script for neutron-ha-tool replicates DHCP on start. It should not do this, because neutron already takes care of it by default through the dhcp_agents_per_network parameter.

rcopenstack-neutron-ovs-cleanup required before reboot

On our network nodes, rcopenstack-neutron-ovs-cleanup had to be run before the reboot for networking to work after the node started; otherwise the node hung in crowbar_join.
@seife please provide more details from the maintenance, if you have any.

Neutron-ha-tool improvements

Discussion on neutron-ha-tool

Problems

Problems that exist within the system and that we need to live with, because changing them would require a major redesign.

  • neutron-ha-tool currently does not monitor router connectivity; it relies on the status each agent reports. During short outages of the management network, or rabbit outages that prevent agents from sending their heartbeat, the tool might wrongly decide to move routers.
  • not all routers are equal - a small number of routers are responsible for the majority of the load

Goals

  • make sure that the tool does not cause more trouble than it prevents, and that it does not amplify small outages
  • make sure busy routers are treated with care, so that they do not all land on the same agent
  • neutron-ha-tool has to be made aware of the actual environment
    • some routers are responsible for most of the traffic
      • special treatment for busy routers
        • check load on target agent before migrating busy router - on the first iteration it can be as simple as counting the busy routers per agent, and making sure busy routers are evenly distributed.
        • can we know in advance which are the busy routers - can we tag them?
        • maybe having placement preferences for routers (Router 1 should live in agent1 or agent2) explicitly hinting the scheduling of routers
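
As a first iteration of the "count busy routers per agent" idea, a minimal sketch could look like the following. The function name and the input shapes (a router-to-agent mapping and a set of busy router ids) are assumptions made for illustration; they are not part of neutron-ha-tool.

```python
from collections import Counter


def pick_target_agent(agent_ids, router_to_agent, busy_routers, exclude=()):
    """Return the candidate agent that hosts the fewest busy routers.

    All inputs are hypothetical: agent_ids is a list of agent ids,
    router_to_agent maps router id -> hosting agent id, and busy_routers
    is a set of router ids that are considered busy.
    """
    busy_per_agent = Counter(
        agent
        for router, agent in router_to_agent.items()
        if router in busy_routers
    )
    candidates = [a for a in agent_ids if a not in exclude]
    if not candidates:
        raise ValueError("no candidate agents available")
    # Ties are broken by agent id so the choice is deterministic.
    return min(candidates, key=lambda a: (busy_per_agent[a], a))
```

The agent being evacuated would be passed in exclude, so a busy router never gets scheduled back onto it.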

Metrics

The tool has to know the layout of the routers and agents somehow.

  • How do we know how busy an agent is?
    • maybe we should treat an agent as busy if it already has one busy router.
  • How do we know if a router is a busy router?
    • is it a fixed list?
    • what if the landscape is rebuilt?
    • can we do some real-time measurements, like if a router has more than X ports, we treat it as a busy one?
    • shall we assign some metric to the busyness? Each landscape has one busy router, but the prod landscape is the most important.
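
One way to combine the fixed list, the port-count threshold and a busyness metric is sketched below. The threshold of 50 ports, the weight of 10 and the pinned_weights mapping are placeholder assumptions, not measured values.

```python
def router_weight(router_id, port_count, pinned_weights=None,
                  port_threshold=50, busy_weight=10):
    """Return a busyness weight for one router.

    A manually pinned weight (the "fixed list") wins. Otherwise a router
    with more than port_threshold ports counts as busy and gets
    busy_weight, and every other router counts as 1.
    """
    pinned = (pinned_weights or {}).get(router_id)
    if pinned is not None:
        return pinned
    return busy_weight if port_count > port_threshold else 1


def agent_load(port_counts_on_agent, pinned_weights=None, **kwargs):
    """Sum the weights of all routers hosted on one agent.

    port_counts_on_agent maps router id -> number of ports.
    """
    return sum(
        router_weight(router, ports, pinned_weights, **kwargs)
        for router, ports in port_counts_on_agent.items()
    )
```

A pinned weight would let the busy router of the prod landscape count for more than the busy routers of the other landscapes.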

Scheduling

Once we find that an agent needs to be evacuated, how do we use the metrics above to schedule router migrations?

  • If we move the busiest routers first, we might amplify a small outage and cause more trouble.

Actions

  • find out how frequently the agent reports its heartbeat
    • See report_interval and agent_down_time. report_interval defaults to 30s and agent_down_time defaults to 75s.
  • find a way to associate attributes with routers: we need per-router data that helps hint the scheduling of routers when an outage happens. Previously we wanted to use the description field to mark the AZ of the router, but we might want to give the description some structure.
  • periodically check agent liveness while migrating routers: this would prevent short rabbit outages from causing a large number of routers to be moved
    • A separate issue has been created: #12
  • make sure busy routers are distributed as evenly as possible across agents
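
If the description field ends up carrying structured hints, parsing it could be as simple as the sketch below. The key=value;key=value format and the key names az and busy are purely hypothetical; no such convention exists yet.

```python
def parse_router_hints(description):
    """Parse 'key=value;key=value' hints from a router description.

    Parts without '=' (for example free text) are ignored, so existing
    descriptions keep working.
    """
    hints = {}
    for part in (description or "").split(";"):
        key, sep, value = part.partition("=")
        if sep and key.strip():
            hints[key.strip()] = value.strip()
    return hints
```

For example, parse_router_hints("az=az1; busy=true") gives {"az": "az1", "busy": "true"}, while a description without any hints gives an empty dict.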

neutron-ha-tool depends on paramiko

In neutron-ha-tool's HA mode we don't use SSH at all, yet paramiko is still a hard requirement.

Either:

  • fix packaging
  • make the dependency dynamic, so it's not required for its regular run
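
For the second option, a sketch of a dynamic dependency is to import the SSH library only inside the code path that needs it. The helper name load_ssh_module is an assumption for illustration; importlib.import_module is standard library.

```python
import importlib


def load_ssh_module(name="paramiko"):
    """Import the SSH library on demand.

    Called only from code paths that really use SSH, so a regular run
    works even when the library is not installed.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        raise RuntimeError(
            "%s is required for SSH-based operations but is not installed"
            % name
        )
```

With this, a missing paramiko only becomes an error when an SSH-based operation is actually requested.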

AZ aware router placement

We know which network nodes live close to which AZs. We also know which routers belong to which AZs. We want to use this information to move routers to network nodes that are close to their compute nodes.

neutron-ha-tool can:

  • evacuate an agent (--l3-agent-evacuate)
  • move a list of routers (--router-list-file)
  • move to a specific agent (--target-agent-id)

The idea is to re-use neutron-ha-tool for this job.

QnA

  • How do we know which routers belong to which AZs?
  • How do we know which agents (network nodes) belong to which AZs?
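
Once both questions are answered, finding the routers to move is straightforward. The sketch below assumes the two mappings (router to AZ, agent to AZ) are already available as plain dicts; how to obtain them is exactly what is still open.

```python
def misplaced_routers(router_to_agent, router_az, agent_az):
    """Return {az: [router, ...]} for routers hosted outside their AZ.

    router_to_agent maps router id -> hosting agent id, router_az maps
    router id -> AZ name, agent_az maps agent id -> AZ name. All three
    inputs are assumed to be given.
    """
    result = {}
    for router, agent in sorted(router_to_agent.items()):
        wanted = router_az.get(router)
        if wanted is None:
            continue  # no AZ known for this router, leave it where it is
        if agent_az.get(agent) != wanted:
            result.setdefault(wanted, []).append(router)
    return result
```

Each per-AZ list could then be moved with --router-list-file, with --target-agent-id pointing at an agent in that AZ.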

[placeholder] network issues on 22 Apr

On 22 Apr (Saturday) a network outage happened. We suspect neutron-ha-tool moved some of the high-traffic routers to network node 1, which became unstable due to the high load. On 24 Apr some routers were moved off network node 1, and that seemed to make the landscape stable.

Investigation

  • Find out what triggered the move of the router on 22 Apr - was it neutron-ha-tool?
  • The control cluster restarted rabbitmq on 22 Apr at 10:55 UTC. At that point all services lost their connection to the messaging service, the neutron l3-agents among them. After rabbitmq had restarted successfully, the l3-agents reconnected. At that moment neutron-ha-tool checked the status of the l3-agents, but not all of them were fully reconnected and online again, so neutron-ha-tool triggered the migration of 121 routers.
  • Find out what caused rabbit downtime
  • no root cause found. See comment below on actions taken.

Takeaways

  • Only some routers are responsible for the traffic (This is the topic of #9)
  • Using the number of routers per agent is not a good balancing strategy, as it would not prevent all the chatty routers from being hosted by a single agent (topic of #9)
  • Make it clear within the help text of neutron-ha-tool that the HOST parameter of --l3-agent-evacuate refers to a host name, not a host UUID. (separate issue created for this, #13)
  • If we are using a router list by saying --router-list-file and the routers are not found, make sure that the user is notified accordingly. (SUSE-Cloud#18 Covers this)
  • When migrating routers, ports might not become active within the time available. In this case the following stacktrace will be printed: (SUSE-Cloud#19 Covers this item)
2017-04-24 09:28:05,093 neutron-ha-tool ERROR    Failed to migrate router=61aea97d-4711-4fb3-8fa8-43b9fa4503d9 from agent=6443bcf9-18ae-456a-b340-fcc565c0cd67 to agent=b3b9971c-24c2-4b80-91b5-4d103c261a81
Traceback (most recent call last):
  File "/usr/bin/neutron-ha-tool", line 627, in migrate_router_safely
    wait_for_router, delete_namespace)
  File "/usr/bin/neutron-ha-tool", line 671, in migrate_router
    wait_router_migrated(qclient, router_id, target['host'])
  File "/usr/bin/neutron-ha-tool", line 732, in wait_router_migrated
    (router_id, ", ".join(remaining_ports)))
RuntimeError: Some ports are not ACTIVE on router_id=61aea97d-4711-4fb3-8fa8-43b9fa4503d9: [8820e14e-b362-4883-8e36-5b09b2e6f112]
2017-04-24 09:28:05,094 neutron-ha-tool INFO     0 routers were evacuated from L3 agent d00-25-b5-a0-03-63
2017-04-24 09:28:05,094 neutron-ha-tool ERROR    1 errors encountered during evacuation

This output does not make clear that the failure is a timeout, and that it might not be an error after all.

neutron-ha-tool rebalance over all l3-agents

During our maintenance in production we had routers distributed like this:

node1: 66
node2: 67
node3: 67
node4: 64
node5: 0

We triggered a neutron-ha-tool --l3-agent-rebalance and that resulted in:

node1: 33
node2: 67
node3: 67
node4: 64
node5: 34

We triggered it again, but the distribution did not change. We had assumed that neutron-ha-tool would distribute routers equally across all l3-agents, not just take them from one.

Fix is #6
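
To illustrate the expected behaviour, here is a minimal sketch of an even rebalance plan computed from per-agent router counts. It is not the fix from #6 and it ignores router busyness; the function name and the (source, target, count) output shape are assumptions.

```python
def plan_rebalance(counts):
    """Return (source, target, n) moves that even out router counts.

    counts maps agent name -> number of hosted routers. After applying
    the moves, no two agents differ by more than one router.
    """
    if not counts:
        return []
    agents = sorted(counts)
    base, extra = divmod(sum(counts.values()), len(agents))
    # The most loaded agents keep the remainder, which minimizes moves.
    by_load = sorted(agents, key=lambda a: (-counts[a], a))
    target = {a: base + (1 if i < extra else 0)
              for i, a in enumerate(by_load)}
    surplus = [[a, counts[a] - target[a]]
               for a in agents if counts[a] > target[a]]
    deficit = [[a, target[a] - counts[a]]
               for a in agents if counts[a] < target[a]]
    moves = []
    while surplus and deficit:
        n = min(surplus[0][1], deficit[0][1])
        moves.append((surplus[0][0], deficit[0][0], n))
        surplus[0][1] -= n
        deficit[0][1] -= n
        if surplus[0][1] == 0:
            surplus.pop(0)
        if deficit[0][1] == 0:
            deficit.pop(0)
    return moves
```

For the starting distribution above (66, 67, 67, 64, 0) this plans 13, 14, 14 and 11 routers to move from node1 to node4 onto node5, ending at 53, 53, 53, 53 and 52.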

Timeout reported as a failure during migration

When migrating routers, ports might not become active within the time available. In this case the following stacktrace will be printed:

2017-04-24 09:28:05,093 neutron-ha-tool ERROR    Failed to migrate router=61aea97d-4711-4fb3-8fa8-43b9fa4503d9 from agent=6443bcf9-18ae-456a-b340-fcc565c0cd67 to agent=b3b9971c-24c2-4b80-91b5-4d103c261a81
Traceback (most recent call last):
  File "/usr/bin/neutron-ha-tool", line 627, in migrate_router_safely
    wait_for_router, delete_namespace)
  File "/usr/bin/neutron-ha-tool", line 671, in migrate_router
    wait_router_migrated(qclient, router_id, target['host'])
  File "/usr/bin/neutron-ha-tool", line 732, in wait_router_migrated
    (router_id, ", ".join(remaining_ports)))
RuntimeError: Some ports are not ACTIVE on router_id=61aea97d-4711-4fb3-8fa8-43b9fa4503d9: [8820e14e-b362-4883-8e36-5b09b2e6f112]
2017-04-24 09:28:05,094 neutron-ha-tool INFO     0 routers were evacuated from L3 agent d00-25-b5-a0-03-63
2017-04-24 09:28:05,094 neutron-ha-tool ERROR    1 errors encountered during evacuation
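
One possible way to make the timeout visible is a dedicated exception type that the caller logs as a warning without a traceback. This is only a sketch: the names PortsNotActiveTimeout, wait_for_ports and migrate_and_report are hypothetical and do not match the functions in the traceback above.

```python
import logging
import time

log = logging.getLogger("neutron-ha-tool-sketch")


class PortsNotActiveTimeout(Exception):
    """Ports did not become ACTIVE within the allowed time."""


def wait_for_ports(get_inactive_ports, timeout, interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until no inactive ports remain, or raise on timeout.

    get_inactive_ports is a callable returning the ids of ports that
    are not ACTIVE yet.
    """
    deadline = clock() + timeout
    while True:
        remaining = get_inactive_ports()
        if not remaining:
            return
        if clock() >= deadline:
            raise PortsNotActiveTimeout(
                "timed out after %ss waiting for ports: %s"
                % (timeout, ", ".join(remaining)))
        sleep(interval)


def migrate_and_report(router_id, wait):
    """Run wait() and report a timeout as a warning, not as an error."""
    try:
        wait()
    except PortsNotActiveTimeout as exc:
        log.warning("router=%s not confirmed yet: %s", router_id, exc)
        return "timeout"
    return "migrated"
```

Timeouts could then be counted separately from real errors in the final summary.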

Improve help message

Make it clear within the help text of neutron-ha-tool that the HOST parameter of --l3-agent-evacuate refers to a host name, not a host UUID.

Fix is #6

Check agent liveness before migration

When we do the migration, check if the agent is still dead before migrating away any routers from it. If we find that the agent is alive again, stop migrating routers away.

Fix is #6
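
The idea can be sketched as a loop that re-checks the agent before every single router. This is an illustration and not the code from #6; agent_is_alive and migrate are hypothetical callables supplied by the caller.

```python
def evacuate(routers, agent_is_alive, migrate):
    """Migrate routers one by one off an agent.

    Liveness is re-checked before each router. As soon as the source
    agent reports alive again, the evacuation stops and the remaining
    routers stay where they are. Returns the routers that were moved.
    """
    moved = []
    for router in routers:
        if agent_is_alive():
            break
        migrate(router)
        moved.append(router)
    return moved
```

With this, a short rabbit outage costs at most the routers moved before the agent's heartbeat came back.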

neutron-ha-tool is packaged in openstack-neutron

At the moment neutron-ha-tool is part of the openstack-neutron package, but this has several disadvantages:

  • we can't update the tool without updating neutron (which is not always wanted)
  • if we release a maintenance update for openstack-neutron, then SAP has to rebase their own neutron package
