beyondtheclouds / enos

Experimental eNvironment for OpenStack :monkey:

Home Page: https://beyondtheclouds.github.io/enos/

License: GNU General Public License v3.0

Python 82.01% Ruby 0.68% Shell 15.11% Jinja 2.20%

chameleon docker grid5000 kolla-ansible openstack reproducible-research vagrant virtualbox

enos's Issues

Streamline an alternative engine for enos based on vagrant

Make it incrementally, first step would be :

A static approach with 4 nodes (minimum : 1 controller, 1 network, 2 computes)

Bonus : make it compatible with vagrant-g5k

Integrating ENOS in the OpenStack CI

It might be valuable to write an internship subject to investigate how ENOS can be used to run automatic performance regression tests in the CI.
Note that Orange Labs is strongly interested in this aspect.

Memcached crashed with stable/ocata

Kolla-ansible failed to launch memcached with the following error:

Running command: '/usr/bin/memcached -vv -l 10.44.0.18 -p 11211 -c 5000'
failed to set rlimit for open files. Try starting as root or requesting smaller maxconns value.

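The problem comes from -c 5000, which is too large: as the error message says, memcached cannot set its open-file limit that high unless it is started as root.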
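A minimal sketch of a pre-flight check, assuming the failure comes from a non-root process being unable to raise its open-file limit above the hard limit (RLIMIT_NOFILE). The helper name `can_request` is hypothetical and is not part of enos or kolla-ansible.

```python
import resource


def can_request(maxconns):
    """Return True if a non-root process could set its open-file limit
    to `maxconns` (the value passed to memcached with -c).

    Assumption: the limit that matters is the hard RLIMIT_NOFILE of the
    process that launches memcached.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return hard == resource.RLIM_INFINITY or maxconns <= hard


print("-c 5000 is acceptable here:", can_request(5000))
```

Run inside the memcached container, this would show whether 5000 is within reach or whether a smaller value has to be configured.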

vbox : VMs get a name that doesn't match their final role

If we set such a section in reservation.yaml :

resources:
  medium:
    compute: 1
    network: 1
  large:
    control: 1

then, the following section is generated in the env file :

rsc:
compute:

!!python/object:execo.host.Host {address: network-0, keyfile: /home/mat/dev/enos/.vagrant/machines/network-0/libvirt/private_key,
port: null, user: root}
control:
!!python/object:execo.host.Host {address: control-0, keyfile: /home/mat/dev/enos/.vagrant/machines/control-0/libvirt/private_key,
port: null, user: root}
medium:
!!python/object:execo.host.Host {address: control-0, keyfile: /home/mat/dev/enos/.vagrant/machines/control-0/libvirt/private_key,
port: null, user: root}
!!python/object:execo.host.Host {address: compute-0, keyfile: /home/mat/dev/enos/.vagrant/machines/compute-0/libvirt/private_key,
port: null, user: root}
!!python/object:execo.host.Host {address: network-0, keyfile: /home/mat/dev/enos/.vagrant/machines/network-0/libvirt/private_key,
port: null, user: root}
network:
!!python/object:execo.host.Host {address: compute-0, keyfile: /home/mat/dev/enos/.vagrant/machines/compute-0/libvirt/private_key,
port: null, user: root}

we can see the confusing situation where the VM named compute-0 has the role "network" and the VM named network-0 has the role "compute".
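One way to avoid the mismatch is to derive each VM name from the role it will finally hold. The sketch below assumes the `resources` section is loaded as a nested dict (flavor, then role, then count); `plan_machines` is a hypothetical helper, not the provider's actual code.

```python
def plan_machines(resources):
    """Return one entry per VM, named after the role it will hold."""
    machines = []
    counters = {}
    # Sorting makes the naming deterministic across runs.
    for flavor in sorted(resources):
        for role in sorted(resources[flavor]):
            for _ in range(resources[flavor][role]):
                index = counters.get(role, 0)
                counters[role] = index + 1
                machines.append(
                    {"name": f"{role}-{index}", "flavor": flavor, "role": role}
                )
    return machines


resources = {
    "medium": {"compute": 1, "network": 1},
    "large": {"control": 1},
}
for machine in plan_machines(resources):
    print(machine["name"], machine["flavor"], machine["role"])
```

With this approach the name and the role cannot diverge, because the name is computed from the role rather than assigned independently.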

Chameleon provider

I propose to implement :

a generic openstack provider
a specific chameleon provider
- Ideally the corresponding python code could be integrated in execo under an execo_cc module. There will be specific code for the reservation, polling the api ...

Generic Workflow :

Make use of a debian8 image (add a new one if needed)
create a dedicated network, a router linked to the external network provided by chameleon
boot the server in this network (1 NIC), flavors are described in the reservation.yaml (similar to vbox provider)
the init phase returns the ip range of the network, from which we can take some virtual ips in the next phase; only one nic is used.

How ansible accesses the servers

Solution 1 : one frontend (1)

Boot a master server and associate a floating ip, ssh into it
Run enos from there using the private ips of the server. Maybe using a script to launch at boot time on the master server.

Cons :

require one extra node (quota limit)
need to propagate the user env to the master server

Pro :

Easy (maybe provide a specific script to bootstrap the first VM boot)
No latency between local machine and the platform

We need to figure out if we make 2 leases for the bare metal case.
- if one lease is used, this means that the code running on the master server is static (in the sense it already knows the machines, the network ..)
- if two leases are used :
  - the first is used to reserve the master server
  - the second is created, as usual, by the code running on the master server. This fits the model we have in enos better

Solution 2 : one ssh gateway

Run enos from your local machine
Pick one node to act as gateway.

either configure an ssh_config accordingly (that will use this node as a proxy to access the others)
or generate the right ssh parameters in the inventory [1] that will proxy all ansible connections through this host

Hints: need to pass this information to the inventory generator. For now we are using the execo.Host structure. We should probably get rid of it now and use our own structure that will help to express everything.

Caveats :

limited by the local machine capacity
latency between local machine and the platform

Pro :

This is only ansible configuration
This fits the model of Enos

Solution 3 : all nodes are accessible

Run enos from your local machine
Enos boots VMs and associates floating ips
Run ansible as usual using the floating ips

Caveats :

make sure the public ips won't be used by kolla to configure service
floating ips are limited
limited by the local machine capacity
latency between local machine and the platform

Pros :

minor modifications to the inventory generator

Solution 4 : ?

Note on virtual ips

We need to tell neutron to accept traffic from/to virtual ips. By default, traffic to a virtual ip is blocked. This can be changed by updating the corresponding port in neutron and setting the allowed_address_pairs extension:

neutron port-update 9b02dbf2-5353-42b4-9d90-80595c4909fa --allowed_address_pairs list=true type=dict ip_address=10.0.2.253

To be generic, we should probably allow this for a full range of IPs, using the cidr of the subnet on every port.
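A minimal sketch of the request body such a port update could carry, assuming the Neutron deployment accepts a CIDR as the `ip_address` of an allowed address pair. The function name is hypothetical, and sending the request (for instance through an OpenStack client) is left out on purpose.

```python
import ipaddress


def allowed_address_pairs_body(subnet_cidr):
    """Build the body of a port update that allows a whole subnet.

    `subnet_cidr` must be a proper network address (no host bits set),
    otherwise ipaddress raises ValueError.
    """
    network = ipaddress.ip_network(subnet_cidr, strict=True)
    return {"port": {"allowed_address_pairs": [{"ip_address": str(network)}]}}


print(allowed_address_pairs_body("10.0.2.0/24"))
```

The same body would be applied to every port of the dedicated network, so that any virtual ip picked later in the subnet is accepted.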

Note on registry

Maybe not for the first iteration, but we could consider a dedicated volume to store the registry data (similar to what we have on G5K), except that we don't need the ceph dependencies to be installed.

Notable difference when working with bare-metal

Reservation

We'll have to reimplement another reservation logic similar to g5k.

Network isolation

Network isolation is available for bare metal on CC 2.
At first sight, we could reuse most of the code above (kvm version). We just need to be careful about how the private network is created (following the rules given in the documentation).

[1]: Something like that in ansible :

[control]
enos-2 ansible_ssh_user=debian ansible_host=10.0.2.61
[compute]
enos-0 ansible_ssh_user=debian ansible_host=10.0.2.60

[all:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no -o ProxyCommand="ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]"'
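The inventory of footnote [1] could be generated along these lines. This is only a sketch: `proxy_ssh_args` and `render_inventory` are hypothetical helpers, and the gateway user and host are parameters because the real values depend on the deployment.

```python
def proxy_ssh_args(gateway_user, gateway_host):
    """Build the ssh arguments that tunnel every connection through the gateway."""
    proxy = (
        "ssh -W %h:%p -o StrictHostKeyChecking=no "
        f"{gateway_user}@{gateway_host}"
    )
    return f'-o StrictHostKeyChecking=no -o ProxyCommand="{proxy}"'


def render_inventory(groups, gateway_user, gateway_host, ssh_user="debian"):
    """Render an INI inventory; `groups` maps a group name to (name, ip) pairs."""
    lines = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")
        for name, address in hosts:
            lines.append(f"{name} ansible_ssh_user={ssh_user} ansible_host={address}")
        lines.append("")
    lines.append("[all:vars]")
    args = proxy_ssh_args(gateway_user, gateway_host)
    lines.append(f"ansible_ssh_common_args='{args}'")
    return "\n".join(lines)


groups = {
    "control": [("enos-2", "10.0.2.61")],
    "compute": [("enos-0", "10.0.2.60")],
}
# "gateway.example.org" is a placeholder, not a real host.
print(render_inventory(groups, "debian", "gateway.example.org"))
```

Only the inventory generator changes in this design; the playbooks themselves stay untouched, which is the main appeal of solution 2.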

How to create the resources

python calls to the openstack api (first poc)
dedicated ansible modules
terraform template
-> Blocking point : for chameleon, reservation_id isn't available as scheduler_hint
https://www.terraform.io/docs/providers/openstack/r/compute_instance_v2.html#scheduler_hints

[tc] add an option to not apply rules on hosts

Use case :

You don't want to slow down the monitoring traffic to/from the monitoring node.

multisite -> deployment model abstraction

The idea of this proposition is to allow different deployment models for multisite deployment to be natively supported by Enos. This goes from :

1. one site / one region / one cloud
2. multiple sites / one region / one cloud
3. multiple sites / multiple regions / one cloud
4. multiple sites / multiple clouds

Currently we support 1 and 2.
3 could be supported by a wrapper on top of Enos but requires some patches to Kolla :
( https://review.openstack.org/#/c/431588/ and https://review.openstack.org/#/c/431658/)

4 could also be supported by a wrapper on top of Enos.

The main difference I see between these deployments is how Enos understands groups.
One proposition to implement this has been given here :
https://github.com/BeyondTheClouds/Wiki/wiki/CR-030117-deployment-model

`tc` task is broken by #73

related docopt is missing on tc task

backups are made in `enos` subdirectory

This location is likely not the same as the current directory created in the current directory.

admin-openrc file should reuse kolla values

The values OS_AUTH_URL and OS_REGION_NAME in the template admin-openrc.j2 may be wrong. The template should reuse the variables openstack_region_name and keystone_admin_url provided by kolla-ansible.

Galera patch is outdated

I tested it with Kolla master and branch 3.0.1, each time mariadb fails to start.
So I assume galera.cnf needs an update.

vlans / subnets are now accessible through the g5k api

It seems that vlan/subnet information is now available through the API.
As a consequence we probably don't need the g5k_networks.yml file anymore.
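A minimal sketch of how such an API description could replace the static file, assuming the site properties have been loaded into a dict with the `kavlans` shape visible in the Rennes dump of this issue. `kavlan_network` is a hypothetical helper and the HTTP call to the API is deliberately left out.

```python
import ipaddress

# Excerpt of the properties shown for the Rennes site in this issue.
RENNES = {
    "kavlans": {
        "4": {"gateway": "10.24.63.254", "network": "10.24.0.0/18"},
        "5": {"gateway": "10.24.127.254", "network": "10.24.64.0/18"},
    }
}


def kavlan_network(site, vlan_id):
    """Return (network, gateway) for a vlan of the given site description."""
    entry = site["kavlans"][str(vlan_id)]
    network = ipaddress.ip_network(entry["network"])
    gateway = ipaddress.ip_address(entry["gateway"])
    return network, gateway


network, gateway = kavlan_network(RENNES, 4)
print(network, gateway)
```

The vlan identifiers are strings in the description, hence the `str(vlan_id)` conversion.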

>> pp root.sites[:rennes]
#<Resource:0x3fe95a0a2994 uri="/3.0/sites/rennes"
  RELATIONSHIPS
    clusters, deployments, jobs, metrics, network_equipments, parent, pdus, self, servers, status, version, versions, vlans
  PROPERTIES
    "compilation_server"=>false
    "description"=>"Grid5000 Rennes site"
    "email_contact"=>"[email protected]"
    "frontend_ip"=>"172.16.111.106"
    "g5ksubnet"=>{"gateway"=>"10.159.255.254", "network"=>"10.156.0.0/14"}
    "kavlan_ip_range"=>"10.24.0.0/14"
    "kavlans"=>{"1"=>
      {"gateway"=>"172.16.111.101", "network"=>"192.168.192.0/20"},
     "16"=>{"gateway"=>"10.27.255.254", "network"=>"10.27.192.0/18"},
     "2"=>{"gateway"=>"172.16.111.102", "network"=>"192.168.208.0/20"},
     "3"=>{"gateway"=>"172.16.111.103", "network"=>"192.168.224.0/20"},
     "4"=>{"gateway"=>"10.24.63.254", "network"=>"10.24.0.0/18"},
     "5"=>{"gateway"=>"10.24.127.254", "network"=>"10.24.64.0/18"},
     "6"=>{"gateway"=>"10.24.191.254", "network"=>"10.24.128.0/18"},
     "7"=>{"gateway"=>"10.24.255.254", "network"=>"10.24.192.0/18"},
     "8"=>{"gateway"=>"10.25.63.254", "network"=>"10.25.0.0/18"},
     "9"=>{"gateway"=>"10.25.127.254", "network"=>"10.25.64.0/18"},
     "default"=>{"gateway"=>"172.16.111.254", "network"=>"172.16.96.0/20"}}
    "latitude"=>48.1
    "location"=>"Rennes, France"
    "longitude"=>-1.6667
    "name"=>"Rennes"
    "production"=>true
    "renater_ip"=>"192.168.4.19"
    "security_contact"=>"[email protected]"
    "storage5k"=>true
    "sys_admin_contact"=>"[email protected]"
    "type"=>"site"
    "uid"=>"rennes"
    "user_support_contact"=>"[email protected]"
    "virt_ip_range"=>"10.156.0.0/14"
    "web"=>"http://www.irisa.fr"
    "version"=>"50f72bc5970f734edadb7337e7fd406ad1952c4c">

Allow `bench` to be used on an existing deployment

Currently, rally is installed and configured during the up phase. When working with an existing openstack deployment, this phase will likely not be called. As a result we should maybe defer this installation.

[g5K] Get an "all in one" deployment

write init-os using a dedicated ansible playbook

As we are moving many parts of the code to ansible in our workflow (e.g. #116), I have the feeling we should use Ansible as well for the init-os phase. And, who knows, writing the openstack provider using ansible doesn't sound that absurd (see #83).

Missing square bracket in enos backup documentation makes docopt crash

There is a missing square bracket in enos backup documentation, around --backup-dir:

usage: enos backup [--backup_dir=BACKUP_DIR  [-e ENV|--env=ENV]
                                           ^

This makes docopt crash and prevents backup.
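This kind of mistake can be caught before docopt ever parses the usage string. The sketch below uses only the standard library; `unbalanced_bracket` is a hypothetical helper, not something docopt provides.

```python
def unbalanced_bracket(usage):
    """Return the index of the first unbalanced bracket, or None if balanced.

    For an opener that is never closed, the index of the earliest such
    opener is returned.
    """
    closers = {"]": "[", ")": "("}
    stack = []
    for position, char in enumerate(usage):
        if char in "[(":
            stack.append((char, position))
        elif char in closers:
            if not stack or stack[-1][0] != closers[char]:
                return position
            stack.pop()
    return stack[0][1] if stack else None


broken = "usage: enos backup [--backup_dir=BACKUP_DIR  [-e ENV|--env=ENV]"
fixed = "usage: enos backup [--backup_dir=BACKUP_DIR] [-e ENV|--env=ENV]"
print(unbalanced_bracket(broken), unbalanced_bracket(fixed))
```

Running such a check over every task docstring in a unit test would flag the problem at commit time rather than at backup time.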

Check DataPlane latencies after Traffic Shaping at the host level

Once TC network constraints have been defined at the NICs level (whatever the number of NICs), what is the latency we can expect at the level of the DataPlane (i.e. from the VMs that are executed on the hosts).

NICs are down and routes are missing with the vagrant plugin

On NixOS, during a vagrant/vbox deployment, all vboxnet* NICs are down and routes through these NICs are missing.

Setting auto_config: true in the vagrant file seems to fix the bug.

--force-kolla-deploy and --force-deploy

Hi all,

It would be great to have two additional options in kolla.
--force-kolla-deploy:
This option should delete containers on the different nodes and invoke enos to redeploy the selected openstack as if it were the first time. This option is mandatory when you want to perform several trials of an experiment without redeploying everything (kadeploy + kolla)

--force-deploy:
This option should (re)deploy everything (kadeploy + kolla)

Backup influx failed when enable_monitoring=false

fatal: [graphene-6-kavlan-4.nancy.grid5000.fr]: FAILED! => {"changed": true, "cmd": ["docker", "stop", "influx"], "delta": "0:00:00.023152", "end": "2016-11-04 18:38:19.542934", "failed": true, "rc": 1, "
start": "2016-11-04 18:38:19.519782", "stderr": "Error response from daemon: No such container: influx", "stdout": "", "stdout_lines": [], "warnings": []}

add when: enable_monitoring to the tasks

Upgrade execo to latest version v2.6.1

There are maybe at least these breaking changes :

get_oar_job_vlan now returns an array

Problems on installation

Actually, I have a problem with the installation. When I run this command from a frontend
pip install git+git://github.com/BeyondTheClouds/enos@master#egg=enos
I get this log at the end

creating /usr/local/lib/python2.7/dist-packages/enos

error: could not create '/usr/local/lib/python2.7/dist-packages/enos': Permission denied

----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-ojF8sT/enos/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-_lzaQi-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip-build-ojF8sT/enos
Storing debug log for failure in /home/jaddarrous/.pip/pip.log

The message says that I do not have permission. Knowing that sudo can't be used on the frontend, I reserved an instance and I ran the command again with sudo-g5k and I got this error log:

import pytz as _pytz

ImportError: No module named pytz

error in setup command: Error parsing /tmp/pip-build-eyIa3H/positional/setup.cfg: ImportError: No module named pytz

----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-eyIa3H/positional/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-Pz__bH-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip-build-eyIa3H/positional
Storing debug log for failure in /root/.pip/pip.log

Did I miss something?

Support of OSProfiler

Add support of OSProfiler into kolla-ansible (See, enos-scenarios/osprofiler)
Merge modification of TER-17 students
Add documentation

Patch files aren't up to date

Some findings :

Patching mariadb as we did for stable/mitaka doesn't work anymore
Patching site.yml shouldn't be necessary anymore since it works around a limitation of ansible 1.9.x

Before diving into this we should think twice about whether using node_custom_config is relevant.

benchs : some variables are deprecated

Some variables remain in the ansible/group_vars/all.yml but are now unused :

# will be copied on the rally host to launch scenarios
rally_scenarios_dir: "{{ playbook_dir }}/../../rally/"
rally_scenarios_list: "all-scenarios.txt.sample"
rally_times: 1
rally_concurrency: 1

A g5k deployment without compute nodes fails with " KeyError: 'compute' "

At line 169 of provider/g5k.py the provider expects some compute nodes in order to bind them on the /tmp directory.

When there are no compute nodes, the script fails.
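A defensive lookup would avoid the KeyError. The code at line 169 is not reproduced in this issue, so the sketch below only assumes that roles are stored in a dict mapping a role name to a list of hosts; `nodes_for_role` is a hypothetical helper.

```python
def nodes_for_role(roles, role):
    """Return the hosts holding `role`, or an empty list if none were requested."""
    return roles.get(role, [])


roles = {"control": ["control-0"], "network": ["network-0"]}

# With no 'compute' key the loop body simply never runs,
# where roles["compute"] would raise KeyError: 'compute'.
for host in nodes_for_role(roles, "compute"):
    print("would prepare /tmp on", host)
```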

[Doc]

This is missing : https://enos.readthedocs.io/en/latest/analysis/index.html#post-mortem

Change Host structure

In Enos we rely on execo.Host.
It's desirable to replace it with a home-made structure since :

it's GPLv3 code
it's not very flexible in what we would like to express with Enos (e.g. a host accessible through a ProxyCommand)

This will have some side effects on every provider and the extra.to_ansible_group function

For the record: upgrading to ansible 2.2.0 could be problematic

see :
ansible/ansible-modules-core#5558

we use the return value in the bench phase

[Doc][g5k] force user to install a newer version of pip

pip install -U pip --user should do the trick

bench broken : Wrong variable name in `enos/ansible/roles/bench/tasks/main.yml`

scenario_type is deprecated in favor of bench.type.

The error didn't show up earlier because in some cases the env from a previous deployment is reloaded and still includes scenario_type.

running two benchs leads to a nested directory structure for the logs

https://github.com/BeyondTheClouds/kolla-g5k/blame/92e4ab657b860b32e7d0bf801e25c9626658ea51/ansible/roles/bench/tasks/logs.yml#L7

A second call will create a _data subdirectory instead of copying the contents.
As a result, indexing the logs in the results vm will fail.

I suggest to

explicitly create /tmp/kolla-logs
explicitly copy the content of /var/lib/docker/volumes/kolla_logs/_data inside

Subsequent calls should overwrite previous ones, which is a more desirable behaviour.
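The suggested behaviour (create the target explicitly, then copy the contents into it) can be sketched in Python as follows. The actual role is an ansible task, so this only illustrates the semantics; `collect_logs` is a hypothetical helper, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import shutil
from pathlib import Path


def collect_logs(source, destination):
    """Copy the contents of `source` into `destination`.

    The destination is created explicitly, so a second call overwrites
    the files in place instead of nesting a `_data` directory.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return destination
```

On a node this would be called as `collect_logs("/var/lib/docker/volumes/kolla_logs/_data", "/tmp/kolla-logs")`, using the two paths mentioned in this issue.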

Can we emulate network degradations (packet loss)

See https://wiki.linuxfoundation.org/networking/netem
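netem can express packet loss next to delay on the same qdisc. Below is a minimal sketch of how the command line could be assembled; `netem_command` is a hypothetical helper, and how enos would actually apply the rule on the hosts is not covered here.

```python
def netem_command(device, delay_ms=None, loss_percent=None):
    """Build a `tc qdisc add ... netem` command as an argument list."""
    if delay_ms is None and loss_percent is None:
        raise ValueError("at least one of delay_ms or loss_percent is required")
    command = ["tc", "qdisc", "add", "dev", device, "root", "netem"]
    if delay_ms is not None:
        command += ["delay", f"{delay_ms}ms"]
    if loss_percent is not None:
        command += ["loss", f"{loss_percent}%"]
    return command


print(" ".join(netem_command("eth0", delay_ms=50, loss_percent=1)))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without going through a shell.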

pip install git+git://github.com/BeyondTheClouds/enos@master#egg=enos

Do we currently support this way of installing the tool ?
I may be wrong but I think that, by default, there will be an issue with the inventory directory as well as the rally directory.

test

Reapply GPLv3 licence

never change licencing in a rush :)

Since we are importing ansible and execo we'll have to stick to GPLv3 licence.

Add ACM 3R flag on the ENOS documentation and mainpage

Following our discussion on Slack, I strongly support the action of putting on the mainpage of ENOS as well as the readthedocs pages, the fact that ENOS has been designed in order to favor the 3R as defined by ACM
https://www.acm.org/publications/policies/artifact-review-badging
Using ENOS will help researchers to get all the mandatory materials to get such agreements.

Add support for flat provider network

Neutron is currently deployed with the default parameters provided by Kolla. As a consequence tenant networks are enabled by default. In the perspective of running Openstack over Openstack, I'm tempted to switch to a simpler model for the neutron deployment and use a flat provider network. IPs would be taken from the kavlan ip pool. Tenant networks and floating ips would no longer be supported. This way we avoid the cost of the overlay network for the under cloud (by extension we could also avoid it for the over cloud if needed). This configuration should be eased by the fact that kolla/newton allows custom config to be placed along with the deployment files.

inventory files aren't up to date

Add annotation for relevant event into grafana

Grafana lets you add annotations that draw vertical lines to mark specific events[1].

To do so, we have to log relevant events into the "event" table of influxdb[2].

We can do something generic by logging every step of ansible thanks to ansible hooks[3].

[1] http://docs.grafana.org/reference/annotations/
[2] http://maxchadwick.xyz/blog/grafana-influxdb-annotations
[3] https://docs.ansible.com/ansible/developing_plugins.html
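A minimal sketch of the record such a hook could write, expressed in InfluxDB line protocol. The measurement name and the `title` and `text` field names are assumptions made for illustration; they would have to match whatever the Grafana annotation query selects.

```python
def escape_field(value):
    """Escape a string field value for InfluxDB line protocol."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def event_line(title, text, timestamp_ns, measurement="events"):
    """Build one line-protocol record describing an event.

    `timestamp_ns` is the event time in nanoseconds since the epoch.
    """
    return (
        f'{measurement} title="{escape_field(title)}",'
        f'text="{escape_field(text)}" {timestamp_ns}'
    )


print(event_line("deploy", "kolla started", 1478284699000000000))
```

An ansible callback plugin could call `event_line` at the start of each play or task and post the result to the InfluxDB write endpoint.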

Remove reservation.yaml.topology.sample

The documentation explains well enough how to use topology.

Factorize package installations into ansible/common

apt-https-transport
pip (right version)

Inventory file seems a good place to put this information : https://docs.ansible.com/ansible/intro_inventory.html#host-variables
Another solution would be to rely on the variable loading mechanisms and place this information in ansible/hosts_vars//..., but this will lead to the creation of a bunch of files.

[vagrant provider] vbox provider should be named virtualbox -> review load_config

To reproduce :

provider: 
  type: vagrant
  option: vbox

Leads to

The provider 'vbox' could not be found, but was requested to
back the machine 'enos-127'. Please use a provider that exists.

Note that provider:vagrant works. In this case vagrant is called without any env variables and defaults to virtualbox.
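One possible fix in load_config is to translate the short name before handing it to vagrant. The sketch below assumes the configuration is loaded as a dict in which `provider` is either a plain string or a mapping with an `option` key, as in the example of this issue; the alias table and function name are hypothetical.

```python
PROVIDER_ALIASES = {"vbox": "virtualbox"}  # short name -> vagrant provider name
DEFAULT_PROVIDER = "virtualbox"


def vagrant_provider(config):
    """Return the provider name to pass to vagrant."""
    provider = config.get("provider")
    option = provider.get("option") if isinstance(provider, dict) else None
    if option is None:
        return DEFAULT_PROVIDER
    # Unknown names are passed through unchanged (e.g. libvirt).
    return PROVIDER_ALIASES.get(option, option)


print(vagrant_provider({"provider": {"type": "vagrant", "option": "vbox"}}))
```

Renaming the option to `virtualbox` in the documentation would work as well; the alias simply keeps existing configuration files valid.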

Enos bench

Based on the previous version, I've prototyped a new way to launch benchs :

msimonin/kolla-g5k@c23d41a

Let's see if it fits our needs :)

On the index page of Enos documentation add a link to Enos github

There is no link to the source of Enos on Enos documentation.