
ansible-lxc-rpc's Introduction

## NOTE: This repo and all development have moved to Stackforge

### As per all Stackforge projects, bug tracking and release management will be handled in Launchpad, and code reviews will be managed in Gerrit.

ansible-lxc-rpc's People

Contributors

andymcc, apsu, bjoernt, cfarquhar, claco, cloudnull, davidwittman, git-harry, hughsaunders, ionosphere80, jacobwagner, jcannava, mancdaz, matthewoliver, mattt416, miguelgrinberg, nrb, odyssey4me, samyaple, stevelle, woodardchristop


ansible-lxc-rpc's Issues

Address addition of new user_variables

If we add new user_variables, we need a sensible way of notifying the end user, as these won't be in the "/etc/rpc_deploy/user_variables.yml" file and will cause runs to fail.

We currently do a version check on the inventory vs environment vs rpc_user_config files; we should probably consider doing the same for the user_variables file.
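
A minimal sketch of what such a check could look like, assuming a hypothetical version key inside user_variables.yml and a matching expected value shipped with the playbooks (neither variable exists today; the names are illustrative):

```yaml
# Hypothetical pre-flight check for /etc/rpc_deploy/user_variables.yml.
# Both variable names below are illustrative and do not exist in the repo.
- name: Fail early when user_variables.yml is older than the playbooks expect
  fail:
    msg: >
      user_variables.yml is at version {{ user_variables_version | default('unknown') }}
      but these playbooks expect {{ expected_user_variables_version }};
      please merge in the new variables before re-running.
  when: user_variables_version | default(0) != expected_user_variables_version
```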

safe upgrade is upgrading the kernel

When the setup-common role runs, a safe-upgrade is performed to ensure all of the packages that are determined to be safely upgradable are up to date. Sadly, the ansible apt module uses aptitude to perform this upgrade, which attempts to bring in new kernels as they become available. Presently we are depending on kernel 3.13.0-34-generic, while the new stable kernel for Ubuntu 14.04.1 is 3.13.0-35-generic. It would seem that this is going to be a never-ending battle of incremental kernel releases. If we are not going to be deploying a repository of frozen packages at this time, we may need to change the kernel check to key on the version series and allow for kernel releases of >= -34.
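
If we go that route, the check could be relaxed to a minimum version rather than an exact match; a rough sketch using the ansible_kernel fact (the comparison below is illustrative and would need testing against real kernel version strings):

```yaml
# Sketch: require at least the 3.13.0-34 kernel instead of pinning an
# exact release, so routine ABI bumps (e.g. -35) do not break the check.
- name: Verify the host kernel meets the minimum version
  fail:
    msg: "Kernel {{ ansible_kernel }} is older than the required 3.13.0-34"
  when: ansible_kernel | version_compare('3.13.0-34', '<')
```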

Some containers are using unconfined apparmor profiles

Issue by Apsu
Monday Jul 28, 2014 at 16:18 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/239


Some containers, such as neutron-agents, nova-compute and cinder-volumes, are disabling apparmor by setting their profile to "unconfined". This is primarily because we have not yet figured out the right way to provide access to all the host resources and capabilities ( https://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.2/capfaq-0.2.txt ) that these containers require.

We should figure out how to use apparmor correctly so we can at least make a passable attempt at locking these containers down as much as possible, in line with the rest of the cluster containers.
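
As a starting point, a hedged sketch of pointing a container at a named profile instead of "unconfined"; the profile name lxc-openstack is purely a placeholder, and the real profile with the required capabilities still has to be written:

```yaml
# Sketch only: switch the container from "unconfined" to a named AppArmor
# profile. "lxc-openstack" is a placeholder profile that would still need
# to be written and loaded on the host; container_name/physical_host are
# assumed inventory variables.
- name: Set the container AppArmor profile
  lineinfile:
    dest: "/var/lib/lxc/{{ container_name }}/config"
    regexp: "^lxc.aa_profile"
    line: "lxc.aa_profile = lxc-openstack"
  delegate_to: "{{ physical_host }}"
```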

All git repos should be set to a TAG or SHA.

Issue by cloudnull
Monday Aug 25, 2014 at 15:06 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/459


All of the git repos that we use need to be set to a tag or SHA so that we never install anything from source that is unexpected. Additionally, we should create a tarball of all sources we are installing and add them to our frozen repo. Installing from a tarball, and falling back to git repos only if needed, will not only speed up installations (clones can take a long time and vary due to uncontrollable network conditions) but should also give us a mechanism for moving into a CDC environment. IMO, having the repo contain a source directory, holding our tar'd up git repos, from which software is downloaded and installed will be the first step towards fixing the CDC situation where no internet access is available. (A sketch of pinning one of these checkouts follows the list below.)

Here are all of the git repos we have and the respective branches that need to be stabilized:

Cinder:

Cinder Git Branch:

  • inventory/group_vars/cinder_all.yml:65:git_install_branch: stable/icehouse

Glance:

Glance Git Branch:

  • inventory/group_vars/glance_all.yml:75:git_install_branch: stable/icehouse

Heat:

Heat Git Branch:
  • inventory/group_vars/heat_all.yml:70:git_install_branch: stable/icehouse

Horizon:

Horizon Git Branch:

  • inventory/group_vars/horizon.yml:46:git_install_branch: stable/icehouse

Keystone:

Keystone Git Branch:

  • inventory/group_vars/keystone_all.yml:60:git_install_branch: stable/icehouse

Neutron:

Neutron Git Branch:

  • inventory/group_vars/neutron_all.yml:83:git_install_branch: stable/icehouse

Nova:

Nova Git Branch:

  • inventory/group_vars/nova_all.yml:79:git_install_branch: stable/icehouse

RaxMon:

RaxMon Git Branch:

  • etc/rpc_deploy/user_variables.yml:112:maas_repo_version: master

Holland:

Holland Git Branch:

  • playbooks/rpc_support.yml:34: holland_release: "{{ rpc_support_holland_branch|default('v1.0.10') }}"
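
As a rough illustration of the direction proposed above, a hedged sketch of a checkout pinned to a tag or SHA with a frozen-tarball fallback; the tarball URL layout and the glance-specific names are placeholders, not existing repo conventions:

```yaml
# Sketch: prefer a frozen source tarball, and only fall back to a git
# clone pinned to an exact tag/SHA. Paths and URL layout are placeholders.
- name: Fetch the frozen source tarball
  get_url:
    url: "{{ rpc_repo_url }}/sources/glance-{{ git_install_branch }}.tgz"
    dest: "/opt/glance-{{ git_install_branch }}.tgz"
  register: source_tarball
  ignore_errors: yes

- name: Clone from git only when the tarball is unavailable
  git:
    repo: "{{ git_repo }}"
    dest: "/opt/glance_{{ git_install_branch }}"
    version: "{{ git_install_branch }}"  # must be a tag or SHA, never a moving branch
  when: source_tarball | failed
```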

SSH Cert generation for nova fails

@cloudnull @jacobwagner

This fails:

TASK: [nova_compute_sshkey_setup | Sync authorized_keys file] *****************
failed: [node17.uk.com] => {"cmd": "rsync --delay-updates -FF --compress --archive --rsh 'ssh -o StrictHostKeyChecking=no' --out-format='<>%i %n%L' /tmp/authorized_keys [email protected]:/var/lib/nova/.ssh/authorized_keys", "failed": true, "rc": 255}
msg: Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,password).
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(226) [sender=3.1.0]

This won't work unless we force-sync keys to all nodes from the node you're running it on (e.g. I usually use ssh-agent forwarding to make this work, but that doesn't apply here since it will ssh to localhost first).

Additionally, it looks like it would fail on a second run (even if it succeeded on the first), since the shell-out prompts you if the key already exists (which it will), and ansible won't respond to the prompt.

I think we should really reconsider this approach. With the original patch we discussed, the end result was that the same key was used on all nodes, which simplifies things for no real downside: since all the keys have to be synced anyway, access to the nova user on any node gives you access to the nova user on every other compute node, regardless of whether they have the same or different keys.

With separate keys for each compute node, the scale-out isn't great, since each public key has to be copied to all other compute nodes and the authorized_keys file has to be synced to every compute node, for no real upside. Additionally, the authorized_keys file is recreated on every run; generating individual keys adds complexity for no real benefit.
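
For reference, a minimal sketch of the single-shared-key approach, assuming the key pair is generated once on the deployment host and then pushed to every compute node (all paths and task wording are illustrative):

```yaml
# Sketch: one key pair for the nova user across all compute hosts, so no
# host-to-host copying or ssh-agent forwarding is needed. Paths are
# illustrative.
- name: Generate a single nova migration key on the deployment host
  command: ssh-keygen -t rsa -N "" -f /etc/rpc_deploy/nova_migration_key creates=/etc/rpc_deploy/nova_migration_key
  delegate_to: localhost
  run_once: yes

- name: Install the shared private key for the nova user
  copy:
    src: /etc/rpc_deploy/nova_migration_key
    dest: /var/lib/nova/.ssh/id_rsa
    owner: nova
    group: nova
    mode: "0600"

- name: Authorize the shared public key for the nova user
  authorized_key:
    user: nova
    key: "{{ lookup('file', '/etc/rpc_deploy/nova_migration_key.pub') }}"
```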

Move the elasticsearch storage backend

We should move the elasticsearch storage backend to something outside of the container's LV. The current proposal is to bind mount /openstack/elasticsearch into the container and use that as the backend storage for elasticsearch.
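
A minimal sketch of the proposed bind mount, using the standard lxc.mount.entry mechanism (the task layout and the container variables are illustrative):

```yaml
# Sketch: keep elasticsearch data on the host filesystem and bind mount it
# into the container, so the data no longer lives on the container LV.
- name: Create the elasticsearch data directory on the host
  file:
    path: /openstack/elasticsearch
    state: directory

- name: Bind mount the directory into the container
  lineinfile:
    dest: "/var/lib/lxc/{{ container_name }}/config"
    line: "lxc.mount.entry = /openstack/elasticsearch var/lib/elasticsearch none bind 0 0"
  delegate_to: "{{ physical_host }}"
```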

nova-consoleauth needs to use memcached

This is required since we run multiple memcached servers. Whilst we specify memcached_servers in nova.conf, it is under the [keystone_authtoken] section and is not found by nova-consoleauth.

On a side note, should we specify memcached_servers to point to the memcached VIP, or to each individual memcached server?
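
Whatever we decide on the VIP-vs-individual-servers question, the change itself is small; a sketch with the ini_file module, assuming a memcached_servers list variable (the variable name is illustrative):

```yaml
# Sketch: expose memcached_servers under [DEFAULT] so nova-consoleauth
# can find it, instead of only under [keystone_authtoken].
- name: Configure memcached servers for nova-consoleauth
  ini_file:
    dest: /etc/nova/nova.conf
    section: DEFAULT
    option: memcached_servers
    value: "{{ memcached_servers | join(',') }}"
```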

Hosts file can generate duplicate entry weirdness

If you remove containers and re-run ansible, duplicate host entries will be created, leading to very inconsistent and weird results:

root@533812-node16:~/ansible-lxc-rpc/rpc_deployment# grep 222 /etc/hosts
10.241.0.222 infra3_horizon_container-b0288493
10.241.0.222 infra3_heat_apis_container-a7ee79ee
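
One way to keep this idempotent is to replace any existing line for a container instead of appending a new one; a hedged sketch (container_address is an assumed variable):

```yaml
# Sketch: guarantee a single /etc/hosts line per container name, so
# re-runs and rebuilt containers update the entry rather than duplicating it.
- name: Ensure a single hosts entry per container
  lineinfile:
    dest: /etc/hosts
    regexp: "\\s{{ container_name }}$"
    line: "{{ container_address }} {{ container_name }}"
```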

Glance image replication, or other such mechanisms

Issue by byronmccollum
Thursday Aug 28, 2014 at 03:20 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/484


Glance image replicator redux...

Registering an image lands the bits on only one infra node. Spawning an instance from that image could cause the nova scheduler to retry multiple times until the image fetch call goes to the infra node containing the image. If the scheduler retry count is >= the number of infra nodes, this should eventually succeed (after unnecessary delay and an artificially induced lumpy compute distribution), but with any amount of load it's quite possible that all scheduler retries go to infra nodes without the image.

So, are there plans to reintroduce the glance image replicator, or some other such mechanism? Or is the preferred / recommended configuration to use Swift if you have more than one infra node?

[Cinder] iSCSI doesn't require running nova-compute outside of a container after all!

Issue by Apsu
Wednesday Aug 06, 2014 at 16:13 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/314


After lots of digging around and reading kernel code, trying to figure out how to fix iscsitarget's crackheadedness, I discovered that there's another iscsi targeting system built into recent kernels such as 3.13. If we load the scsi_transport_iscsi module, tgtd can talk to it from inside of a container with no problem! The initiator still only needs the iscsi_tcp module.

I made a crappy asciinema recording to demonstrate here: https://asciinema.org/a/11317

We should be able to just add the scsi_transport_iscsi module to cinder's module list and turn is_metal back off by default.
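
The change described above would be roughly the following (a sketch; the persistence step is an assumption about how we'd want to wire it):

```yaml
# Sketch: load the in-kernel iSCSI target transport on the cinder host so
# tgtd can be used from inside the container; the initiator side still
# only needs iscsi_tcp.
- name: Load the scsi_transport_iscsi kernel module
  modprobe:
    name: scsi_transport_iscsi
    state: present

- name: Persist the module across reboots
  lineinfile:
    dest: /etc/modules
    line: scsi_transport_iscsi
```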

Flat network type seems broken on un-containerized compute nodes

When using a flat network type, the linux bridge agent bombs because it attempts to bridge the already-bridged network.

Error:

2014-09-08 02:51:13.332 19696 INFO neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent [-] Port tapc76d1465-17 updated. Details: {u'admin_state_up': True, u'network_id': u'bb4573bd-220e-491e-8b0a-01b84c5a6551', u'segmentation_id':
 None, u'physical_network': u'vlan', u'device': u'tapc76d1465-17', u'port_id': u'c76d1465-172a-43ec-98cd-ca4f6daa444f', u'network_type': u'flat'}
2014-09-08 02:51:14.604 19696 ERROR neutron.plugins.linuxbridge.agent.linuxbridge_neutron_agent [-] Unable to add br-vlan to brqbb4573bd-22! Exception:
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'brctl', 'addif', 'brqbb4573bd-22', 'br-vlan']
Exit code: 1
Stdout: ''
Stderr: "device br-vlan is a bridge device itself; can't enslave a bridge device to a bridge device.\n"

While this issue still needs verification, the fix would likely require reworking the ml2.ini plugin configuration so that flat network types are bound to an unbridged interface.

metadata_host left unset and defaults to '$my_ip'

I don't think this is an issue, since we're using neutron only and metadata_agent.ini is configured with a correct nova_metadata_ip ({{internal_vip_address}}); however, it may be worth setting metadata_host to also resolve to {{internal_vip_address}} in nova.conf just for consistency's sake.
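
If we do set it, the change would be a one-liner; a sketch with the ini_file module, reusing the existing internal_vip_address variable:

```yaml
# Sketch: pin metadata_host to the internal VIP for consistency with
# nova_metadata_ip in metadata_agent.ini, rather than defaulting to $my_ip.
- name: Set metadata_host in nova.conf
  ini_file:
    dest: /etc/nova/nova.conf
    section: DEFAULT
    option: metadata_host
    value: "{{ internal_vip_address }}"
```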

Frozen Repo missing package for haproxy install

failed: [node1] => (item=haproxy,hatop,vim-haproxy) => {"failed": true, "item": "haproxy,hatop,vim-haproxy"}
stderr: E: Failed to fetch http://dc0e2a2ef0676c3453b1-31bb9324d3aeab0d08fa434012c1e64d.r5.cf1.rackcdn.com/ubuntu/pool/main/liby/libyaml/libyaml-0-2_0.1.4-3ubuntu3_amd64.deb  404  Not Found [IP: 80.239.178.186 80]

Time out on initial Network check is probably too long

Issue by andymcc
Friday Aug 29, 2014 at 13:20 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/485


The initial timeout check is 60 seconds, which seems too long; we could do this at 10 seconds, since it only exists to determine whether we should reboot the container or not. The post-reboot check can stay longer (as it is now, 60 seconds).

This adds a lot of time to the initial run, and if the network is already up (before we even restart it, e.g. because it was already set up on a previous run) then it won't take 60 seconds to confirm.
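
The change itself is tiny; a sketch of the shorter pre-restart check (the exact host/port being probed is illustrative, not the current task):

```yaml
# Sketch: a 10 second pre-restart connectivity check; the post-restart
# check keeps its existing 60 second timeout.
- name: Check container connectivity before deciding whether to restart it
  wait_for:
    host: "{{ container_address }}"
    port: 22
    timeout: 10
  register: initial_network_check
  ignore_errors: yes
```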

README out of date

Issue by jcourtois
Monday Aug 18, 2014 at 15:35 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/398


The README hasn't been updated for about a month. Since then, at least one container (nova-spice-console) has been added, another has been removed (nova-compute), and a number of other changes have been made, at least some of which are probably material.

Can someone give the README a close reading and edit it for accuracy and usability?

Incorrect hosts groups listed in maas_local.yml

For some checks, we reference the _container hosts group (cinder_api_container as an example). The problem here is that if that hosts group is moved onto bare metal then the checks will not get deployed.

All Package repos need to be replaced with the frozen repo

Issue by cloudnull
Monday Aug 25, 2014 at 14:58 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/458


The repo URLs need to be updated to use the CNAME; we presently have:

These repos need to use the rpc_repo_url variable:

Related Issues:

dnsmasq package not available in pinned apt repo

Issue by johnmarkschofield
Wednesday Aug 20, 2014 at 16:36 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/426


When using the pinned apt repo:

root@schof-aio:~# cat /etc/apt/sources.list
deb [arch=amd64] http://dc0e2a2ef0676c3453b1-31bb9324d3aeab0d08fa434012c1e64d.r5.cf1.rackcdn.com LA main
root@schof-aio:~#

The dnsmasq package is not available:

root@schof-aio:~# apt-get update
Hit http://mirror.jmu.edu trusty InRelease
Hit http://mirror.jmu.edu trusty/main amd64 Packages
Hit http://mirror.jmu.edu trusty/main i386 Packages
Hit http://dc0e2a2ef0676c3453b1-31bb9324d3aeab0d08fa434012c1e64d.r5.cf1.rackcdn.com LA InRelease
Ign http://mirror.jmu.edu trusty/main Translation-en_US
Ign http://mirror.jmu.edu trusty/main Translation-en
Hit http://dc0e2a2ef0676c3453b1-31bb9324d3aeab0d08fa434012c1e64d.r5.cf1.rackcdn.com LA/main amd64 Packages
Ign http://dc0e2a2ef0676c3453b1-31bb9324d3aeab0d08fa434012c1e64d.r5.cf1.rackcdn.com LA/main Translation-en_US
Ign http://dc0e2a2ef0676c3453b1-31bb9324d3aeab0d08fa434012c1e64d.r5.cf1.rackcdn.com LA/main Translation-en
Reading package lists... Done
root@schof-aio:~# apt-get install dnsmasq
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package dnsmasq is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  dnsmasq-base

E: Package 'dnsmasq' has no installation candidate
root@schof-aio:~#

This causes the openstack-common.yml playbook to fail:

TASK: [container_common | Ensure container packages are installed] ************
<10.51.50.1> ESTABLISH CONNECTION FOR USER: root
<10.51.50.1> REMOTE_MODULE apt pkg=libpq-dev,dnsmasq,dnsmasq-utils state=present
<10.51.50.1> EXEC ['ssh', '-C', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10', u'10.51.50.1', u"/bin/sh -c 'LC_CTYPE=en_US.UTF-8 LANG=en_US.UTF-8 /usr/bin/python'"]
failed: [infra1] => (item=libpq-dev,dnsmasq,dnsmasq-utils) => {"failed": true, "item": "libpq-dev,dnsmasq,dnsmasq-utils"}
msg: No package matching 'dnsmasq' is available

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/openstack-common.retry

infra1                     : ok=21   changed=3    unreachable=0    failed=1

OpenStackInstaller: 2014-08-19 20:00:33,021 - CRITICAL: Failed running playbook 'playbooks/openstack/openstack-common.yml' 3 times. Aborting...

This is a release-blocker.

Glance requires an entry for snet endpoint even when not using swift for the image backend.

Issue by cloudnull
Monday Aug 25, 2014 at 20:43 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/463


When running a new installation, the system attempts to lay down a hosts entry for a service_net endpoint even when using a file backend. The desired outcome would be that all swift-related options are optional and are only parsed when the backend is set to swift. (A sketch of one way to guard this is at the end of this issue.)

The task that is attempting to execute:
https://github.com/rcbops/ansible-lxc-rpc/blob/ead8d1d698aa2820f53a20c753f4f6db03e44b64/rpc_deployment/roles/glance_snet_override/tasks/main.yml

Valid Endpoints:
https://github.com/rcbops/ansible-lxc-rpc/blob/ead8d1d698aa2820f53a20c753f4f6db03e44b64/rpc_deployment/inventory/group_vars/glance_all.yml#L98-L116

Glance Options:
https://github.com/rcbops/ansible-lxc-rpc/blob/master/etc/rpc_deploy/user_variables.yml#L56-L65

  • Running the code results in an error whether glance_swift_enable_snet is true or false:
    glance_swift_enable_snet: false
TASK: [glance_snet_override | Remove hosts entry if glance_swift_enable_snet is False] ***
fatal: [infra1_glance_container-dbd3a387] => One or more undefined variables: 'dict object' has no attribute 'SomeRegion'

FATAL: all hosts have already failed -- aborting

glance_swift_enable_snet: true

TASK: [glance_snet_override | Add hosts entry if glance_swift_enable_snet is True] ***
fatal: [infra1_glance_container-dbd3a387] => One or more undefined variables: 'dict object' has no attribute 'SomeRegion'

FATAL: all hosts have already failed -- aborting

Related Issue: #260
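
One plausible way to get the desired behaviour is to guard the override role on the configured backend; a hedged sketch (the variable name glance_default_store is assumed here, not verified against the repo):

```yaml
# Sketch: only run the snet hosts-entry tasks when glance actually uses
# swift, so file-backed installs never evaluate the swift endpoint data.
- include: roles/glance_snet_override/tasks/main.yml
  when: glance_default_store is defined and glance_default_store == 'swift'
```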

Cinder.conf Template Errors

There is an error when writing the backend sections. Also, the enabled_backends value doesn't seem right. It is using the key instead of the value from rpc_user_config.yml.

Error


< TASK: cinder_common | Setup Cinder Config >


    \   ^__^
     \  (oo)\_______
        (__)\       )\/\
            ||----w |
            ||     ||

fatal: [cinder1_cinder_volumes_container-f30a3071] => {'msg': "One or more undefined variables: 'unicode object' has no attribute 'items'", 'failed': True}
fatal: [cinder1_cinder_volumes_container-f30a3071] => {'msg': 'One or more items failed.', 'failed': True, 'changed': False, 'results': [{'msg': "One or more undefined variables: 'unicode object' has no attribute 'items'", 'failed': True}]}

FATAL: all hosts have already failed -- aborting
cinder.conf Template (partial)

...
{% if cinder_backends is defined %}

enabled_backends={% for backend in cinder_backends|dictsort %}{{ backend.0 }}{% if not loop.last %},{% endif %}{% endfor %}

{% for backend_section in cinder_backends|dictsort %}
[{{ backend_section.0 }}]
{% for key, value in backend_section.1.items() %}
{{ key }}={{ value }}
{% endfor %}

{% endfor %}
{% endif %}
...
Cinder.conf Rendered (partial)

...
enabled_backends=limit_container_types

Then it just blows up around here; see the error above...

...
rpc_user_config.yml (relevant section)

# User defined Storage Hosts, this should be a required group
storage_hosts:
  cinder1:
    ip: 172.29.236.1
    # "container_vars" can be set outside of all other options as
    # host specific optional variables.
    container_vars:
      # In this example we are defining what cinder volumes are
      # on a given host.
      cinder_backends:
        # if the "limit_container_types" argument is set, within
        # the top level key of the provided option the inventory
        # process will perform a string match on the container name with
        # the value found within the "limit_container_types" argument.
        # If any part of the string found within the container
        # name the options are appended as host_vars inside of inventory.
        limit_container_types: cinder_volume
        lvm:
          volume_group: cinder-volumes
          driver: cinder.volume.drivers.lvm.LVMISCSIDriver
          backend_name: LVM_iSCSI

Rabbit FQDN Issues returning

Issue by andymcc
Friday Aug 29, 2014 at 14:29 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/486


Rabbit sometimes fails to start because it can't connect to itself (when using an FQDN):

TASK: [rabbit_common | Install rabbit packages] *******************************
failed: [node12.domain.com_rabbit_mq_container-48e7bee8] => (item=rabbitmq-server)

Since the package install also starts the service, this fails; the logs show the following:

rabbit@node12:

  • unable to connect to epmd (port 4369) on node12: address (cannot connect to host/port)

hosts shows the following:
root@node12:~# cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 node12.domain.com_rabbit_mq_container-48e7bee8

Moving the "Fix /etc/hosts" entry in the rabbit_common to be above the install will fix this.

Pinned Repo broken for linux-image-extra-virtual

Issue by johnmarkschofield
Tuesday Aug 19, 2014 at 17:26 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/415


The vhost_net kernel module is installed by default in the Ubuntu Trusty image on mycloud.rackspace.com. In the VagrantCloud Trusty image and the official Ubuntu image, that kernel module is not present.

The fix is to install the linux-image-extra package, which includes the vhost_net module. To avoid specifying kernel versions in my scripts, I install linux-image-extra-virtual.

With the pinned repo, I am unable to install any of the linux-image-extra-* packages. With a standard ubuntu repo, I can.

So I believe we have mismatches between different kernel packages present in the repo. All need to be updated to a consistent version.

Should be able to limit setup plays to a set of hosts or containers

Issue by hughsaunders
Friday Aug 22, 2014 at 11:38 GMT
Originally opened as https://github.com/rcbops/ansible-lxc-rpc-orig/issues/449


Currently the destroy-containers.yml has variables host_group and container_group that can be supplied with -e to limit the containers that will be removed.

It would be useful if all the playbooks directly included by host-setup.yml followed the same convention, so that, for example, all the galera containers could be rebuilt, or all the galera containers on a single node, etc. (A sketch of the pattern follows the list below.)

Playbooks included by host-setup.yml:

  • include: setup-common.yml #could use host_group
  • include: build-containers.yml #already uses host_group, could use container_group
  • include: restart-containers.yml #already uses host_group, could use container_group
  • include: host-common.yml #could use host_group
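
Following the pattern destroy-containers.yml already uses, each included play could take an optional override; a sketch (the fallback group name is illustrative):

```yaml
# Sketch: allow -e host_group=... / -e container_group=... to narrow the
# play, exactly as destroy-containers.yml already does. The fallback group
# name is illustrative.
- hosts: "{{ container_group | default(host_group | default('all_containers')) }}"
  user: root
  tasks:
    - name: Show which hosts would be targeted
      debug: msg="Running against {{ inventory_hostname }}"
```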

Glance notification mechanism

We configure the notification mechanism (notification_driver) in glance-api.conf to use a message queue. Without a consumer (typically ceilometer), this queue could become rather large.
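
Until a consumer is deployed, one option is simply to turn the driver off; a sketch (whether noop is the right default is exactly the question this issue raises):

```yaml
# Sketch: disable glance notifications while nothing consumes them, so the
# queue cannot grow without bound.
- name: Set the glance notification driver
  ini_file:
    dest: /etc/glance/glance-api.conf
    section: DEFAULT
    option: notification_driver
    value: noop
```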

Missing Values In `metadata_agent.ini` Config

Some values (nova_metadata_ip and metadata_proxy_shared_secret) are not getting set / substituted. I looked at vars/nova_all.yml and everything seems correct. Not sure why the literal template strings aren't being parsed.

[DEFAULT]
auth_url = http://172.29.236.4:5000/v2.0
auth_region = RegionOne
admin_tenant_name = service
admin_user = neutron
admin_password = herpderp
nova_metadata_ip = {{internal_vip_address}}
metadata_proxy_shared_secret = {{nova_metadata_proxy_secret}}
metadata_workers = 10
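
One plausible cause of literal {{ ... }} strings surviving into the rendered file is the file being laid down with copy rather than template; a hedged sketch of the templated version (file ownership and names are illustrative):

```yaml
# Sketch: render metadata_agent.ini with the template module so the Jinja2
# expressions are substituted; a plain copy would leave the literal
# {{ ... }} strings in place, which matches the symptom above.
- name: Drop metadata_agent.ini
  template:
    src: metadata_agent.ini
    dest: /etc/neutron/metadata_agent.ini
    owner: neutron
    group: neutron
```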
