Bloomberg Clustered Private Cloud distribution

License: Apache License 2.0

chef-bcpc's Introduction

chef-bcpc

chef-bcpc is a set of Chef cookbooks and Ansible playbooks that build a highly-available OpenStack cloud.

The cloud consists of a variety of head nodes (OpenStack controller services, Ceph Mons/Mgrs, etcd, RabbitMQ), work nodes (hypervisors) and storage nodes (Ceph OSDs).

Each type of head node runs its core services in a highly-available manner and the roles of these nodes can be converged into a smaller set of hosts. In addition, the roles of work nodes and storage nodes can also be converged together.

Getting Started

The following instructions will get chef-bcpc up and running on your local machine for development and testing purposes.

Prerequisites

OS X or Linux
Quad-core CPU that supports VT-x or AMD-V virtualization extensions
32 GB of memory
128 GB of free disk space
Vagrant 2.1+
VirtualBox 5.2+ or KVM + libvirtd
Packer 1.4+
git, curl, rsync, ssh, jq, make, ansible

NOTE: It is likely possible to build an environment with 16GB of RAM or less if one is willing to make slight modifications to the virtual topology and/or change some of the build settings and overrides. However, we've opted to spec the minimum requirements slightly more aggressively and target hosts with 32GB RAM or more to provide the best out-of-the-box experience.

Local Build

Choose the topology and hardware configuration of your cluster. You can choose from existing configurations in virtual/topology, or build your own. hardware.yml and topology.yml are used by default. To view a list of tested topologies and hardware configurations please see virtual/README.
Set the variables in virtual/vagrantbox.json. The variable vagrant_box specifies the Vagrant box we use to build the virtual environment, and vagrant_box_version specifies the version of the Vagrant box. These variables are specified per Ansible inventory group of hosts, and must have a "default" group as is done in the default vagrantbox.json.
If one would like to build a pre-provisioned custom Packer box and use it as the base box to create the virtual environment, the steps below should be followed:
- Create virtual/packer/config/variables.json and set the variables. Depending on the virtual machine provider, an example can be found at variables.json.libvirt.example or variables.json.virtualbox.example. This step is essential for building a Packer box that is used as the base box image for the virtual environment. The variables bcc_apt_key_url, bcc_apt_url and vagrant_cacert are optional; all others must be set. The variable kernel_version specifies the Linux kernel version for the Packer box, while base_box, base_box_version, and base_box_provider specify the official Vagrant box used as a baseline for the Packer box, upon which further modifications are made. Finally, the variable output_packer_box_name specifies the name to use when adding the output Packer box to Vagrant.
- Alternatively, if one has S3 set up and would like to download/upload a Packer box, virtual/packer/config/s3.json can be set up to leverage a pre-built Packer box. An example can be found at s3.json.libvirt.example or s3.json.virtualbox.example. Run the make targets make download-packer-box and make upload-packer-box to download or upload a Packer box, respectively.
- Run the make target make create-packer-box. This will create a Packer box and add it to Vagrant with the name specified by output_packer_box_name.
- Set the variables in virtual/vagrantbox.json accordingly. When a local custom box built by Packer is used, the variable vagrant_box needs to be set to the name of the Packer box (i.e. the same as output_packer_box_name in virtual/packer/config/variables.json), and vagrant_box_version should be set to 0.
- After these steps, make create all will always use the Packer box, unless virtual/vagrantbox.json specifies otherwise.
- If the Packer box needs to be updated, we recommend first cleaning up the old Packer box. To clean up a Packer box, first make sure no VM is using it by running make destroy, and then run make destroy-packer-box.
To make changes to the virtual topology without dirtying the tree, copy the hardware.yml and topology.yml to files named hardware.overrides.yml and topology.overrides.yml, respectively, and make changes to them instead.
If a proxy server is required for internet access, set the variables bcc_http_proxy_url and bcc_https_proxy_url respectively in virtual/packer/config/variables.json.
If additional CA certificates are required (e.g. for a proxy), set the variables TBD
From the root of the chef-bcpc git repository run the following commands:

Download and install Packer (the example below uses version 1.6.6)

wget https://releases.hashicorp.com/packer/1.6.6/packer_1.6.6_linux_amd64.zip -O ~/packer_1.6.6_linux.zip
sudo apt install unzip
sudo unzip ~/packer_1.6.6_linux.zip -d /usr/local/bin

Create a Python virtual environment (virtualenv) and activate it

python3 -mvenv venv
source venv/bin/activate
pip install 'pip>=19.1.1' wheel
pip install PyYaml ansible netaddr pyOpenSSL 'cryptography>=3.0,<38.0.0'

To create a libvirt build (the default), first install the following packages and plugins:

sudo apt-get install build-essential dnsmasq libguestfs-tools libvirt-dev pkg-config qemu-utils
vagrant plugin install vagrant-libvirt vagrant-mutate

Alternatively, to create a VirtualBox build, install the following plugin and set the following environment variables:

vagrant plugin install vagrant-vbguest
export VAGRANT_DEFAULT_PROVIDER=virtualbox
export VAGRANT_VAGRANTFILE=Vagrantfile.virtualbox

Use the following commands to create a virtual build:

make generate-chef-databags
make create-packer-box
make create all

To clean up the build:

make destroy
make destroy-packer-box

You may also want to change the CPU model from qemu64 to kvm64 in ansible/playbooks/roles/common/defaults/main/chef.yml:

chef_environment:
  name: virtual
  override_attributes:
    bcpc:
       nova:
         cpu_config:
           cpu_mode: custom
           cpu_model: kvm64

To switch from the default libvirt provider to the virtualbox provider, as far as the build is concerned, you can just remove the mutated libvirt box and then set the VAGRANT_DEFAULT_PROVIDER and VAGRANT_VAGRANTFILE environment variables as described above. However, since you must also make sure that the two hypervisors don't both try to control the CPU virtualization facilities, it is best to remove the mutated box and then simply reboot your development host.

This would look something like this:

$ rm -rf ~/.vagrant.d/boxes/bento-VAGRANTSLASH-ubuntu-20.04/202206.03.0/libvirt/
$ rm -rf ~/.vagrant.d/boxes/bento-VAGRANTSLASH-ubuntu-22.04/202206.13.0/libvirt/
$ sudo reboot

Hardware Deployment

TBD

Contributing

Currently, most development is done by a team at Bloomberg L.P. but we would like to build a community around this project. PRs and issues are welcomed. If you are interested in joining the team at Bloomberg L.P. please see available opportunities at the Bloomberg L.P. careers site.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Built With

chef-bcpc is built with the following open source software:

Thanks to all of these communities for producing this software!

chef-bcpc's Issues

VirtualBox Version Check Fails

On systems where $VBM isn't defined before virtualbox_env.sh:check_version() is run, the version check fails with:

$ ./tests/automated_install.sh 
#### Setup configuration files
#### Setup VB's and Bootstrap
./virtualbox_env.sh: line 22: --version: command not found
ERROR: VirtualBox 4.3.8r92456 is less than 4.3.x!
  Only VirtualBox >= 4.3.x is officially supported.
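
The log above shows the check rejecting 4.3.8 because the version command never ran. As a rough illustration only (this is not the logic in virtualbox_env.sh, and the function names and the minimum version tuple are assumptions made for the example), a version check can parse the string into integers and fail loudly when the command produced no output:

```python
import re

# Assumed minimum, taken from the ">= 4.3.x" wording in the error message.
MINIMUM = (4, 3)


def parse_vbox_version(raw):
    """Turn a string such as '4.3.8r92456' into the tuple (4, 3, 8)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", raw.strip())
    if match is None:
        # An empty string is what an unset $VBM would effectively produce.
        raise ValueError(f"unrecognised VirtualBox version: {raw!r}")
    return tuple(int(part) for part in match.groups())


def is_supported(raw, minimum=MINIMUM):
    """Compare only as many components as the minimum specifies."""
    return parse_vbox_version(raw)[:len(minimum)] >= minimum


print(is_supported("4.3.8r92456"))
```

Comparing integer tuples avoids the string comparison pitfalls that a shell test can hit, and the explicit ValueError separates "too old" from "could not determine the version at all".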

automated_install.sh script error on Mac due to difference in "nc" syntax

"nc" command on Mac doesn't seem to recognize the -q option used in the scripts.

Need a way to carve out IP's from float net

Today, the Chef environment variable:
['bcpc']['floating']['available_subnet'] = "192.168.43.128/25"

can be used to set a specific IP range as the float range. However, our hypervisor machines sit in our float range, so we would like to carve out a /24 from the range.

Perhaps we could add a new variable:
['bcpc']['floating']['exclude_subnet']
Or some other IP set operation.

The nova-manage command to delete a range from an already setup range is, for example:
nova-manage floating delete 191.168.0.0/24
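
The "IP set operation" mentioned above can be sketched with Python's standard ipaddress module. This is only an illustration of the subtraction, not a proposed cookbook change: the function name and the excluded /27 are hypothetical, chosen so that the excluded block fits inside the example /25.

```python
import ipaddress


def carve_out(available, excluded):
    """Return the networks left in `available` once `excluded` is removed.

    Raises ValueError if `excluded` is not contained in `available`.
    """
    pool = ipaddress.ip_network(available)
    hole = ipaddress.ip_network(excluded)
    return sorted(pool.address_exclude(hole))


# available_subnet from the issue; the excluded block is a made-up example.
remaining = carve_out("192.168.43.128/25", "192.168.43.192/27")
print([str(net) for net in remaining])
```

Each remaining network could then be registered as its own floating range, which avoids creating the full range and deleting part of it afterwards.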

SMART monitoring is disabled by default

I disabled SMART monitoring by default in the diamond recipe because we found it spammed syslog with voluminous condition check errors from the hpsa driver (HP Storage Array). I'd like to make this a preference from the user, or find a way to avoid the spammage. If the latter then I'll re-enable the SMART monitoring.

Workaround Havana upstream novncproxy issue and new websockify package

nova-novncproxy does not currently start with the latest proposed packages for Havana due to an older version of websockify.

In /var/log/upstart/nova-novncproxy.log:

TypeError: __init__() got an unexpected keyword argument 'no_parent'

See https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1228490 for more info.

There is a ppa which has a new upstream websockify package:

# cat /etc/apt/sources.list.d/gdahlman-havana-precise.list 
deb http://ppa.launchpad.net/gdahlman/havana/ubuntu devel main
deb-src http://ppa.launchpad.net/gdahlman/havana/ubuntu devel main
# apt-get update
# apt-get upgrade websockify
# service nova-novncproxy restart

Doing add-apt-repository ppa:gdahlman/havana doesn't directly work for me as the distro name needs to be devel. YMMV.

DNS Tenancy to Domain conversion doesn't handle dots in tenancy names

If I have a tenancy like "New Site.Com", it is not properly handled in the mysql function. The subdomain winds up looking like "new-site.com.bcpc.whatever.com" which is wrong. It should look more like "new-site-com.bcpc.whatever.com" so it's not creating new subdomains willy nilly.
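
The intended mapping can be illustrated outside MySQL. The sketch below is an assumption about the desired rule (lowercase, with every run of characters other than letters and digits collapsed to a single hyphen); the function name is hypothetical and the real fix would live in the MySQL function.

```python
import re


def tenancy_to_label(tenancy):
    """Collapse a tenancy name into a single DNS label."""
    label = re.sub(r"[^a-z0-9]+", "-", tenancy.lower())
    # Labels must not start or end with a hyphen.
    return label.strip("-")


print(tenancy_to_label("New Site.Com") + ".bcpc.whatever.com")
```

Because dots are replaced along with spaces, a tenancy name can never introduce an extra level into the zone.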

Include ethtool in default install for troubleshooting

Ethtool is useful for troubleshooting network issues. We should include it by default.

Chef apt packages at apt.opscode.com incompatible with latest Ubuntu 12.04 LTS

If you try to install Chef using apt and pointing at an apt mirror, it fails due to incompatible dependencies. For example:

The following packages have unmet dependencies:
chef-server : Depends: chef-server-api (>= 10.18.2) but it is not going to be installed
Depends: chef-solr (>= 10.18.2) but it is not going to be installed
Depends: chef-expander (>= 10.18.2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Using aptitude to try to get to the bottom of it, it appears the chef packages are incompatible with tzdata version 2013g-0ubuntu0.12.04 which was released on Oct 13th.

DNS slows down to be unusable as MySQL load climbs

With our view for PowerDNS, we see that as MySQL load climbs the view can take unacceptably long to run. Further, if we get a lot of DNS lookups for addresses not in DNS, PowerDNS's caching does little to help.

Unify VM and physical machine Cobblering

Today, VMs are booted using enroll_cobbler.sh, whereas bare-metal machines are booted using cluster-enroll-cobbler.sh. This presents a logical disconnect for new adopters of the project as they mature through the stack, since both scripts do largely the same thing.

Load default Ubuntu images with glance

So, AFAICT Cirros is a total pos. Pretty much unusable beyond making sure that the basic settings of your OpenStack installation are running.

As such, it would be amazing if we could load the default Ubuntu 12.04 image into glance.

I'll try to take a stab at adding this as I already have it loaded locally.

Upgrade-able chef deployments

The focus of Chef is on writing idempotent scripts which can be re-run with no effect. This is helpful unless you have a cluster of services upon which you and the open source community are iterating like madmen.

We are faced with the need to upgrade a working, production cluster in-place, and I think that this need will become even more common in the future as we provide bcpc with production SLAs in more places.

My recommendation is two-fold:
I
Our recipes should completely build/rebuild configs, so we don't have to sprinkle not-ifs and only-ifs around the recipes. And we break up configurations which manage multiple services (e.g. HAP and DNS) into a master template and a partial template. Following DRY principles, we would take haproxy.conf.erb and break it up into a master template which then iterates through attributes that look like this:

default[:bcpc][:hap_services] = [
  { :name => "ldap-389ds",
    :src_ip => -> () { node[:bcpc][:management][:vip] },
    :src_port => 389,
    :dst_ip => -> () { node[:bcpc][:management][:vip] },
    :dst_port => 389,
    :listen_options => ['timeout  client 1h',
                        'timeout  server 1h',
                        'mode tcp',
                        'balance leastconn',
                        'option  tcplog',
                        'option tcpka'],
    :server_options => "check inter 1s rise 1 fall 1 observe layer4",
    :servers => -> () { get_servers }
  },
  # ...
]

It then iterates over the array and passes each entry to the partial template, which looks like this:

<%="listen #{name} #{src_ip}:#{src_port}"%>
  <% listen_options.each do |opt| -%>
     <%= opt %>
  <% end -%>
<% servers.each do |server| -%>
  <%= "server #{server.hostname} #{dst_ip}:#{dst_port} #{server_options}" %>
<% end -%>

This has three benefits: it is DRY; it is declarative, so we have better visibility into which ports belong to which service; and, most importantly, we can inject new services either by adding to the main attributes file or through an upgrade attributes file that pushes another service onto the array.
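
To make the data-driven idea concrete outside of Chef, here is a minimal Python sketch of the same pattern: a list of service descriptions rendered by one small function, with an "upgrade" simply appending another entry. The addresses, host names and the second service are invented for the example, and the output only mirrors the shape of the partial template above; it is not a tested HAProxy configuration.

```python
def render_listen_block(service):
    """Render one listen section from a service description."""
    lines = [f"listen {service['name']} {service['src_ip']}:{service['src_port']}"]
    lines += [f"  {opt}" for opt in service["listen_options"]]
    lines += [
        f"  server {host} {service['dst_ip']}:{service['dst_port']} "
        f"{service['server_options']}"
        for host in service["servers"]
    ]
    return "\n".join(lines)


def render_config(services):
    """Master template: iterate over every declared service."""
    return "\n\n".join(render_listen_block(s) for s in services)


# Hypothetical values standing in for node attributes.
services = [{
    "name": "ldap-389ds",
    "src_ip": "10.0.100.5", "src_port": 389,
    "dst_ip": "10.0.100.5", "dst_port": 389,
    "listen_options": ["mode tcp", "balance leastconn"],
    "server_options": "check inter 1s rise 1 fall 1 observe layer4",
    "servers": ["head1", "head2"],
}]

# An upgrade file would only need to push another entry onto the list.
services.append({
    "name": "example-service",
    "src_ip": "10.0.100.5", "src_port": 8080,
    "dst_ip": "10.0.100.5", "dst_port": 8080,
    "listen_options": ["mode http"],
    "server_options": "check",
    "servers": ["head1"],
})

print(render_config(services))
```

The rendering code never changes when a service is added, which is the property that makes in-place upgrades easier to reason about.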

II
The second, and probably more contentious piece is that we need to move to a better git feature branch management model. We no longer have a single code base running in production, so we should treat our source management accordingly.
I heartily recommend the git-flow (http://nvie.com/posts/a-successful-git-branching-model/) workflow which will allow us to make branches for new, experimental features as well as make branches for installations into our production system. This way we could hot-patch a specific branch when we need to add new features, but we could also completely recreate a production environment if needed.

insecure_private_key: No such file or directory

Took me a few minutes to figure this out from the Vagrant documentation

Seems like we would want to attempt to fetch that from Github if it is not found in $HOME/.vagrant.d/insecure_private_key

I'll send over a PR later; this issue is merely to remind me.

Accommodate VIP movement when restarting haproxy

I've seen an intermittent bug when standing up a new cluster of headnodes where the keystone service catalog entries (keystone recipe), the glance upload of the cirros image (glance recipe), or the nova secgroup setup (nova-setup recipe) sometimes get re-run even though the entries are already there. I've only seen it occur when running chef-client again on an existing headnode after new headnodes are added (i.e. when the chef-client run is just regenerating configs and restarting processes that need references to all existing headnodes, like haproxy and mysql).

I think this is due to the fact that when haproxy restarts, the VIP may move (if the keepalived healthcheck occurs in the window between stop and start of haproxy). Since this bug can cause the chef recipes to duplicate entries (which isn't harmful per se, just sloppy), we should probably give the cluster a small amount of time to settle the VIP before hammering away at the openstack APIs for setup (since it's likely that the guard commands like not_if statements are failing and then the subsequent commands under the guard succeed).

I'm still not 100% sure this is what's happening, but I have a patch (hack) that I can't repro this bug under. I'll commit it but keep this open in case that's not the culprit.
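
One generic way to "give the cluster time to settle" is to require several consecutive successful health checks before proceeding. The sketch below is not the patch mentioned above; the function name, parameters and defaults are assumptions, and `check` stands for any callable that reports whether the API behind the VIP answered.

```python
import time


def wait_until_stable(check, required_successes=3, interval=1.0, timeout=30.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Return True once `check` succeeds `required_successes` times in a row.

    A single failure resets the streak, so a VIP that is still moving between
    head nodes does not count as settled. Returns False on timeout.
    """
    deadline = clock() + timeout
    streak = 0
    while clock() < deadline:
        streak = streak + 1 if check() else 0
        if streak >= required_successes:
            return True
        sleep(interval)
    return False


# Simulated probe results: the VIP flaps once before settling.
answers = iter([False, True, False, True, True, True])
print(wait_until_stable(lambda: next(answers), sleep=lambda _: None))
```

Running the guard commands only after such a wait would reduce the chance that a not_if check fails transiently while the command it guards then succeeds.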

Upgrade Ceph to Firefly (v0.80.x)

We are currently on Dumpling (v0.67.x). Firefly isn't final just yet, but we should think about our timing and updating to Firefly (v0.80.x).

https://ceph.com/docs/master/release-notes/#v0-80-firefly

beaver service restart hangs on initialization

When enrolling a node via Chef, restarting the beaver service hung until I killed off the beaver processes on that node. Opening an issue to remember to investigate whether this is repeatable or not.

[2013-08-17T21:12:41-04:00] INFO: Processing service[beaver] action restart (bcpc::beaver line 80)
[2013-08-17T21:15:52-04:00] INFO: service[beaver] restarted

Percona is Breaking Build

The current version of Percona (5.5.37-25.10-756.precise) breaks the build as all local requests fail as follows:

Access denied for user 'root'@'localhost' (using password: YES)

This is regardless of root having host % in mysql.user. The same cookbooks work with version 5.5.34-25.9-607.precise.

Document sourcedir for `vagrant up`

Another minor thing that took me a minute to figure out:

When running vagrant up manually, ensure you are in /path/to/chef-bcpc/vbox

DNS SOA records do not have valid nameserver

Our DNS records are currently using localhost for the nameserver field. From RFC 1035 section 3.3.13. "SOA RDATA format", the MNAME field of the SOA record should be: "The <domain-name> of the name server that was the original or primary source of data for this zone."

Thanks go to Erdal Gerda for noticing this!

SMP performance with VirtualBox

FWIW, I've enabled SMP on my local VirtualBox (4.3.10) on Mac OS X 10.9.mumble - the performance/corruption issues that we saw a year ago don't seem to recur. So, I'd like to re-enable SMP in vbox_create.sh - just up each bcpc-vm to have 2 CPUs - specifically, set CLUSTER_VM_CPUs to 2. The responsiveness of the head nodes is significantly better with 2 VCPUs. If others could try it out as well, that'd be great. =)

On a mac, sed -i cmd file should be sed -i -e cmd file

This issue is seen with sed on Mac.

For example, sed -i 's/vb.gui = true/vb.gui = false/' Vagrantfile will raise an error.

Change this to: sed -i -e 's/vb.gui = true/vb.gui = false/' Vagrantfile, which should work fine.

vbox_create fails on OS X Mavericks

Chriss-MacBook-Pro:chef-bcpc cmorgan$ VBoxManage hostonlyif create
0%...
Progress state: NS_ERROR_FAILURE
VBoxManage: error: Failed to create the host-only adapter
VBoxManage: error: VBoxNetAdpCtl: Error while adding new interface: failed to open /dev/vboxnetctl: No such file or directory

VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component HostNetworkInterface, interface IHostNetworkInterface
VBoxManage: error: Context: "int handleCreate(HandlerArg*, int, int*)" at line 68 of file VBoxManageHostonly.cpp

DHCP lease for fixed IPs too long

DNSMasq hands out fixed IPs with lease times of a week (nova.conf.erb: dhcp_lease_time=604800). For low-churn situations where the number of VMs doesn't approach the size of the DHCP pool, you wouldn't notice, but when machines turn over often, the DHCP pool runs out of unleased IPs. This causes a denial of service where no new VM in the same DHCP pool gets an IP at startup until leases start expiring. This is most obvious in the startup messages when CloudInit complains that the eth0 interface is not configured and times out waiting for it.

I propose setting the DHCP lease time to something less than an hour, or an hour at most. Every tenant gets their own DNSMASQ instance, and lease renewal should be relatively cheap.
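
A back-of-the-envelope model shows why the lease time matters. It assumes, as the description above implies, that every launched VM ties up one address for a full lease period even if the VM is deleted sooner; the pool size of 253 addresses is a hypothetical figure for a /24, and the function name is made up for the example.

```python
def max_launches_per_hour(pool_size, lease_seconds):
    """Highest steady launch rate that never exhausts the DHCP pool.

    In steady state the number of held leases is rate * lease duration,
    so the rate must stay below pool_size / lease duration.
    """
    return pool_size * 3600 / lease_seconds


POOL_SIZE = 253  # hypothetical pool, roughly one /24

week = max_launches_per_hour(POOL_SIZE, 604800)  # current dhcp_lease_time
hour = max_launches_per_hour(POOL_SIZE, 3600)    # proposed upper bound

print(f"one-week lease: {week:.2f} launches/hour")
print(f"one-hour lease: {hour:.2f} launches/hour")
```

Under these assumptions, cutting the lease from a week to an hour raises the sustainable churn by a factor of 168, the number of hours in a week.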

Need a script to create router instances for VirtualBox setup

Since vbox_create.sh just creates three basic VMs, we need to automate the creation of the utility router instance that provides DHCP and routing capabilities. It might actually be good to tie this into a scripted cobbler setup so that we can just PXE boot the VMs with the correct Ubuntu images via a preseed file. (Perhaps have yet-another-VM that does Cobbler? Or, get pfSense to do it? Or?)

I've got a start on a VirtualBox script to create a pfSense VM, but I need to confirm and automate this. Quick note is that the VirtualBox bridging doesn't work for me with pfsense-2.0.3 (kernel panic when FreeBSD 8.1 sees en1 bridged), so I need to upgrade to the latest pfsense 2.1 snapshot and enable VirtIO to bridge to en1. Random notes for those following along at home:

http://snapshots.pfsense.org/FreeBSD_RELENG_8_3/amd64/pfSense_HEAD/nanobsd/
http://doc.pfsense.org/index.php/VirtIO_Driver_Support

$VBM modifyvm $vm --nic4 bridged --bridgeadapter4 "en1: Wi-Fi (AirPort)" --nictype4 virtio

Graphite fails to Redirect

If one goes to https://<VIP>/graphite today, one gets a complaint about SSL redirection being broken:

Bad Request

Your browser sent a request that this server could not understand.
Reason: You're speaking plain HTTP to an SSL-enabled server port.
Instead use the HTTPS scheme to access this URL, please.

    Hint: https://bogus_host_without_reverse_dns:8888/

Apache/2.2.22 (Ubuntu) Server at bogus_host_without_reverse_dns Port 8888

This is not the right thing and should be fixed...

ceph-mon and ceph-mds are not starting on reboot

ceph-mon and ceph-mds are not starting on head node reboot.

Immediate remediation:

service ceph-mon start id=`hostname`
service ceph-mds start id=`hostname`

Investigating.

Recipe compile error in /var/chef/cache/cookbooks/apt/providers/repository.rb

The bootstrap_chef.sh phase of bringing up the bootstrap node of CHEF-BCPC fails with this:

================================================================================
Recipe Compile Error in /var/chef/cache/cookbooks/apt/providers/repository.rb
================================================================================

NameError
---------
undefined local variable or method `use_inline_resources' for #<Class:0x7f460e1a2d58>

Cookbook Trace:
---------------
  /var/chef/cache/cookbooks/apt/providers/repository.rb:20:in `class_from_file'

Relevant File Content:
----------------------
/var/chef/cache/cookbooks/apt/providers/repository.rb:

 13:  # Unless required by applicable law or agreed to in writing, software
 14:  # distributed under the License is distributed on an "AS IS" BASIS,
 15:  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 16:  # See the License for the specific language governing permissions and
 17:  # limitations under the License.
 18:  #
 19:  
 20>> use_inline_resources
 21:  
 22:  def whyrun_supported?
 23:    true
 24:  end
 25:  
 26:  # install apt key from keyserver
 27:  def install_key_from_keyserver(key, keyserver)
 28:    execute "install-key #{key}" do
 29:      if !node['apt']['key_proxy'].empty?

[Mon, 10 Jun 2013 11:58:07 -0400] ERROR: Running exception handlers
[Mon, 10 Jun 2013 11:58:07 -0400] FATAL: Saving node information to /var/chef/cache/failed-run-data.json
[Mon, 10 Jun 2013 11:58:07 -0400] ERROR: Exception handlers complete
[Mon, 10 Jun 2013 11:58:07 -0400] FATAL: Stacktrace dumped to /var/chef/cache/chef-stacktrace.out
[Mon, 10 Jun 2013 11:58:07 -0400] FATAL: NameError: undefined local variable or method `use_inline_resources' for #<Class:0x7f460e1a2d58>

Rethink ceph-fs usage due to cold-start issues

If you have a cluster with a single head node, when you restart that machine, the ceph-mds won't restart by default...and if you run:

# service ceph-mds start id=`hostname`

It will come up, but the mds gets stuck in "replay" state. This is likely due to a bug in ceph-mds not handling a cold-start scenario properly. =(

Furthermore, since /mnt is now in /etc/fstab as a ceph-fs mount, we also fail on bootup in a cold-start scenario as ceph-fuse hangs. Ubuntu is smart enough to detect it is hanging, but requires you to hit a button to proceed skipping the mount.

ubuntu@bcpc-vm1:~$ ceph -v
ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
ubuntu@bcpc-vm1:~$ ceph mds stat
e28: 1/1/1 up {0=bcpc-vm1=up:replay}

Given the mds hang, I can't enroll any new machines in my cluster. It is probably worth trying out the new cuttlefish release, but...well...

CC @pchandra @cbaenziger

vagrant 1.2.7 doesn't permit VMs to use .1 address

Vagrant 1.2.7 no longer allows VMs to be statically assigned the .1 address. We can change it to .3 instead. (This did work with Vagrant 1.2.2.)

See hashicorp/vagrant#1750 for upstream "fix".

There are errors in the configuration of this machine. Please fix
the following errors and try again:

vm:
* Static IPs cannot end in ".1" since that address is always
reserved for the router. Please use another ending.
* Static IPs cannot end in ".1" since that address is always
reserved for the router. Please use another ending.
* Static IPs cannot end in ".1" since that address is always
reserved for the router. Please use another ending.

DNS: Stop generating NS records for tenancy subdomains

It appears that we are generating NS records for all of our tenancy subdomains. This appears to be the offending bit in powerdns.rb powerdns-table-records_forward-view:

SELECT domains.id+500 AS id, domains.id AS domain_id, domains.name AS name, 'NS' AS type, '#{node[:bcpc][:management][:vip]}' AS content, 300 AS ttl, NULL AS prio, NULL AS change_date FROM domains WHERE id > (SELECT MAX(id) FROM domains_static) UNION

I'd like to first verify with the DNS guys, but I'm pretty sure we don't need this.

VM's don't know their tenancy is part of their DNS name

After 7210d70, the VMs no longer know their proper DNS name.

The metadata server returns the . via http://169.254.169.254/latest/meta-data/public-hostname. Also, stock Ubuntu VM images end up setting the VM's FQDN to . (also missing the tenancy name in the actual FQDN).

automated_install.sh script error on Mac

Running the automated_install.sh script generates errors in the "sed -i" commands executed in the script.

For example: sed -i 's/vb.gui = true/vb.gui = false/' Vagrantfile

This is specific to the behavior of sed on Mac and can be fixed by using either

sed -i.bu 's/vb.gui = true/vb.gui = false/' Vagrantfile (or)

sed -i '' 's/vb.gui = true/vb.gui = false/' Vagrantfile
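
A different way to sidestep the platform difference entirely is to do the substitution without sed. The sketch below swaps in plain Python for the in-place edit; it is an alternative technique rather than what the scripts do, and the function name is hypothetical.

```python
from pathlib import Path


def replace_in_file(path, old, new):
    """Replace every literal occurrence of `old` with `new` in the file.

    Returns True if the file was changed. The match is a plain string
    comparison, not a regular expression.
    """
    target = Path(path)
    original = target.read_text()
    updated = original.replace(old, new)
    if updated != original:
        target.write_text(updated)
    return updated != original
```

A call such as replace_in_file("Vagrantfile", "vb.gui = true", "vb.gui = false") behaves the same on Mac and Linux and leaves no backup file behind.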

Pip Breaks Behind MITM Proxies

For those using the Hadoop branch, you will now hit the following bug when run behind a MITM proxy without a trusted cert, as Pip has gone SSL-only:

Successfully installed pip2pi pip
Cleaning up...
+ /usr/local/bin/pip install setuptools --no-use-wheel --upgrade
Cannot fetch index base URL https://pypi.python.org/simple/
Could not find any downloads that satisfy the requirement setuptools in /usr/lib/python2.7/dist-packages
Downloading/unpacking setuptools
Cleaning up...
No distributions at all found for setuptools in /usr/lib/python2.7/dist-packages
Storing debug log for failure in /home/vagrant/.pip/pip.log

Discussion I have found so far can further be found at:

Switch to omnibus Chef installers

I'm too busy to create a branch and PR right now...but here's a first-cut patch to switch to the Chef omnibus installers. This installs Chef 10 on the client and Chef 11 on the server. The chef-client run fails as the latest Chef 10 client Omnibus packages don't create a 'chef' user which chef-client cookbook expects. Oy.

The Chef 11 client packages require typing in a password for the knife configure --initial run...which is lame.

diff --git a/Vagrantfile b/Vagrantfile
index c62f13e..578a576 100644
--- a/Vagrantfile
+++ b/Vagrantfile
@@ -11,14 +11,12 @@ $local_mirror = nil

 if $local_mirror.nil?
   $repos_script = <<EOH
-    echo "deb http://apt.opscode.com precise-0.10 main" > /etc/apt/sources.list.d/opscode.list
 EOH
 else
   $repos_script = <<EOH
     sed -i s/archive.ubuntu.com/#{$local_mirror}/g /etc/apt/sources.list
     sed -i s/security.ubuntu.com/#{$local_mirror}/g /etc/apt/sources.list
     sed -i s/^deb-src/\#deb-src/g /etc/apt/sources.list
-    echo "deb http://#{$local_mirror}/chef precise-0.10 main" > /etc/apt/sources.list.d/opscode.list
 EOH
 end

diff --git a/cookbooks/bcpc/files/default/build_bins.sh b/cookbooks/bcpc/files/default/build_bins.sh
index f113f2b..2ae75e6 100755
--- a/cookbooks/bcpc/files/default/build_bins.sh
+++ b/cookbooks/bcpc/files/default/build_bins.sh
@@ -29,6 +29,19 @@ if [ -z `gem list --local fpm | grep fpm | cut -f1 -d" "` ]; then
   gem install fpm --no-ri --no-rdoc
 fi

+# Fetch chef client and server debs
+CHEF_CLIENT_URL=https://opscode-omnibus-packages.s3.amazonaws.com/ubuntu/12.04/x86_64/chef_10.30.4-1.ubuntu.12.04_amd64.deb
+#CHEF_CLIENT_URL=https://opscode-omnibus-packages.s3.amazonaws.com/ubuntu/12.04/x86_64/chef_11.10.4-1.ubuntu.12.04_amd64.deb
+CHEF_SERVER_URL=https://opscode-omnibus-packages.s3.amazonaws.com/ubuntu/12.04/x86_64/chef-server_11.0.11-1.ubuntu.12.04_amd64.deb
+if [ ! -f chef-client.deb ]; then
+   $CURL -o chef-client.deb ${CHEF_CLIENT_URL}
+fi
+
+if [ ! -f chef-server.deb ]; then
+   $CURL -o chef-server.deb ${CHEF_SERVER_URL}
+fi
+FILES="chef-client.deb chef-server.deb $FILES"
+
 # Build kibana3 installable bundle
 if [ ! -f kibana3.tgz ]; then
     git clone https://github.com/elasticsearch/kibana.git kibana3
diff --git a/setup_chef_cookbooks.sh b/setup_chef_cookbooks.sh
index 7ed81ae..7205d21 100755
--- a/setup_chef_cookbooks.sh
+++ b/setup_chef_cookbooks.sh
@@ -26,7 +26,7 @@ if [[ -f .chef/knife.rb ]]; then
   knife client delete $USER -y || true
   mv .chef/ ".chef_found_$(date +"%m-%d-%Y %H:%M:%S")"
 fi
-echo -e ".chef/knife.rb\nhttp://$BOOTSTRAP_IP:4000\n\n\n\n\n\n.\n" | knife configure --initial
+echo -e ".chef/knife.rb\nhttps://$BOOTSTRAP_IP\n\n\n/etc/chef-server/chef-webui.pem\n\n/etc/chef-server/chef-validator.pem\n.\n" | knife configure --initial

 cp -p .chef/knife.rb .chef/knife-proxy.rb

diff --git a/setup_chef_server.sh b/setup_chef_server.sh
index 033324f..1ac722c 100755
--- a/setup_chef_server.sh
+++ b/setup_chef_server.sh
@@ -16,39 +16,22 @@ if [[ -z "$CURL" ]]; then
    exit
 fi

-if [[ ! -f /etc/apt/sources.list.d/opscode.list ]]; then
-  cp opscode.list /etc/apt/sources.list.d/
-fi
-
-# When rerunning a bootstrap, the 'apt-get update' gets very slow if
-# the bootstrap node happens to be our apt mirror, so only do this if
-# the package we're after is not installed at all
-#
-# See http://askubuntu.com/questions/44122/upgrade-a-single-package-with-apt-get
-#
-if dpkg -s opscode-keyring 2>/dev/null | grep -q Status.*installed; then
-  echo opscode-keyring is installed
-else 
-  apt-get update
-  apt-get --allow-unauthenticated -y install opscode-keyring
-  apt-get update
-fi
-
 if dpkg -s chef 2>/dev/null | grep -q Status.*installed; then
   echo chef is installed
 else
-  DEBCONF_DB_FALLBACK=File{$(pwd)/debconf-chef.conf} DEBIAN_FRONTEND=noninteractive apt-get -y --force-yes install chef
+  dpkg -i cookbooks/bcpc/files/default/bins/chef-client.deb
 fi

 if dpkg -s chef-server 2>/dev/null | grep -q Status.*installed; then
   echo chef-server is installed
 else
-  DEBCONF_DB_FALLBACK=File{$(pwd)/debconf-chef.conf} DEBIAN_FRONTEND=noninteractive apt-get -y --force-yes install chef-server
+  dpkg -i cookbooks/bcpc/files/default/bins/chef-server.deb
+  sudo chef-server-ctl reconfigure
 fi

-
-chmod +r /etc/chef/validation.pem
-chmod +r /etc/chef/webui.pem
+chmod +r /etc/chef-server/admin.pem
+chmod +r /etc/chef-server/chef-validator.pem
+chmod +r /etc/chef-server/chef-webui.pem

 # copy our ssh-key to be authorized for root
 if [[ -f $HOME/.ssh/authorized_keys && ! -f /root/.ssh/authorized_keys ]]; then

Vagrant-built VMs don't PXE boot on default VirtualBox 4.2 due to built-in DHCP server

When building on a fresh install of VirtualBox 4.2, the bootstrap node installed fine but the VM picked up an IP address of 192.168.56.101. That IP is from the range VirtualBox hands out from its default DHCP server, which doesn't carry the PXE information. The quick fix was to delete the DHCP server from VirtualBox.
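
The quick fix can be sketched with VBoxManage. This is a minimal sketch rather than a tested procedure: the helper name dhcp_netnames is made up for illustration, and it assumes the output of `VBoxManage list dhcpservers` contains one `NetworkName:` line per server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `VBoxManage list dhcpservers` output on stdin
# and print the network name of each DHCP server it lists.
dhcp_netnames() {
    awk '/^NetworkName:/ { print $2 }'
}

# Usage sketch (commented out because it changes VirtualBox state):
# VBoxManage list dhcpservers | dhcp_netnames | while read -r net; do
#     VBoxManage dhcpserver remove --netname "$net"
# done
```

Note that the loop removes every DHCP server VirtualBox knows about, which may be more than you want on a host that runs other VMs.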

New version of rabbit breaks 'guest' login

Running a fresh install of bcpc includes RabbitMQ 3.3. The platform installs fine, but doesn't work properly because processes can no longer connect as 'guest' to Rabbit. This appears to be intentional:

http://www.rabbitmq.com/blog/2014/04/02/breaking-things-with-rabbitmq-3-3/

The suggested workaround on that page (adding an empty loopback_users to re-enable guest) doesn't appear to work. After some poking about, I was able to get openstack back up by:

Modifying the rabbitmq recipe to put 'ostack' into the data bag instead of 'guest' PRIOR to cheffing the head node.
Creating an "ostack" user in rabbit and giving it '.*' '.*' '.*' permissions on '/'. It was not created automatically.
Changing the line rabbitmq_userid=guest to rabbitmq_userid=ostack (or adding it if it is missing) in:

/etc/glance/glance-api.conf
/etc/cinder/cinder.conf
/etc/nova/nova.conf

I'm sure that's not exhaustive. I can see in the rabbit logs that some things still can't connect, and I'm not able to log into the rabbitmq management page, but this at least allows me to bring up a VM.
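
The manual steps above could be scripted roughly as follows. This is a sketch under assumptions, not what the cookbook does: set_rabbit_userid is a hypothetical helper, RABBIT_PASSWORD is a placeholder, and the option name rabbitmq_userid is taken as written in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point one config file at a different RabbitMQ user.
# Rewrites an existing rabbitmq_userid= line; otherwise appends one at the
# end of the file (which may be the wrong section, so check afterwards).
set_rabbit_userid() {
    local file=$1 user=$2
    if grep -q '^rabbitmq_userid=' "$file"; then
        # Write through a temp file, then copy back to keep owner and mode.
        sed "s/^rabbitmq_userid=.*/rabbitmq_userid=${user}/" "$file" > "${file}.tmp" &&
            cat "${file}.tmp" > "$file" &&
            rm -f "${file}.tmp"
    else
        printf 'rabbitmq_userid=%s\n' "$user" >> "$file"
    fi
}

# Usage sketch (commented out; needs root and a running RabbitMQ):
# rabbitmqctl add_user ostack "$RABBIT_PASSWORD"
# rabbitmqctl set_permissions -p / ostack '.*' '.*' '.*'
# for f in /etc/glance/glance-api.conf /etc/cinder/cinder.conf /etc/nova/nova.conf; do
#     set_rabbit_userid "$f" ostack
# done
```

The affected services would still need a restart afterwards to pick up the new user.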

Install sshpass on bootstrap node by default

nodessh.sh won't work without sshpass installed.

vagrant@bcpc-bootstrap:~/chef-bcpc$ ./nodessh.sh Test-Laptop 10.0.100.11 -
Error: sshpass required for this tool. You should be able to 'sudo apt-get install sshpass' to get it

We should install it by default.
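
One way a bootstrap script could do this is sketched below. The helper name have_cmd is hypothetical, and where the check should live in the bootstrap scripts is left open.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the named command is on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Usage sketch (commented out because it installs a package):
# if ! have_cmd sshpass; then
#     sudo apt-get -y install sshpass
# fi
```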

Migrate all the scripts in the repo root to a subdir

Just opening a placeholder for anyone feeling motivated to move the automation scripts to a subdir, since it's getting crowded IMHO. It'll take a little bit of review and testing since some may need a little re-writing to accommodate relative dirs, other assumptions, etc.

Ceph monitor logs fill up root volume quickly

The Ceph monitor is writing lines of the following form to its log file about every 100ms, which fills the root volume quickly:

2013-09-22 09:01:01.465453 7fcf55fb3700  1 mon.bcpc-vm1@0(leader).paxos(paxos active c 2260..2937) is_readable now=2013-09-22 09:01:01.465455 lease_expire=0.000000 has v0 lc 2937
2013-09-22 09:01:01.465490 7fcf55fb3700  1 mon.bcpc-vm1@0(leader).paxos(paxos active c 2260..2937) is_readable now=2013-09-22 09:01:01.465492 lease_expire=0.000000 has v0 lc 2937
2013-09-22 09:01:01.612839 7fcf557b2700  1 mon.bcpc-vm1@0(leader).paxos(paxos active c 2260..2937) is_readable now=2013-09-22 09:01:01.612840 lease_expire=0.000000 has v0 lc 2937
2013-09-22 09:01:02.795764 7fcf557b2700  1 mon.bcpc-vm1@0(leader).paxos(paxos active c 2260..2937) is_readable now=2013-09-22 09:01:02.795765 lease_expire=0.000000 has v0 lc 2937

At run-time, the following command reduces the paxos file logging:

$ ceph tell mon.* injectargs '--debug_paxos 0/5'

To make it permanent, the ceph.conf change would be:

[mon]
        debug paxos = 0/5

I'm not sure if we should file an upstream issue or just incorporate this into our scripts.

Netlink errors in keepalived

When bringing up a headnode, keepalived is happy (the VRRP_Script lines below) and then 2-3 minutes later, it logs some Netlink: filter function error messages. This happens reliably for me when I'm testing in VMs, so I think it's an issue. Googling around says that once you see them, you should restart keepalived to make sure it's still working properly. In testing, once I restart keepalived, those messages don't pop back up (I waited 30+ mins and didn't see anything).

Aug 18 11:41:02 bcpc-vm2 Keepalived_vrrp: VRRP_Script(chk_haproxy) succeeded
Aug 18 11:41:02 bcpc-vm2 Keepalived_vrrp: VRRP_Script(chk_ceph) succeeded
Aug 18 11:42:55 bcpc-vm2 Keepalived_vrrp: Netlink: filter function error
Aug 18 11:42:55 bcpc-vm2 Keepalived_healthcheckers: Netlink: filter function error
Aug 18 11:42:56 bcpc-vm2 Keepalived_healthcheckers: Netlink: filter function error
Aug 18 11:42:56 bcpc-vm2 Keepalived_vrrp: Netlink: filter function error
Aug 18 11:43:14 bcpc-vm2 Keepalived_vrrp: Netlink: filter function error
Aug 18 11:43:14 bcpc-vm2 Keepalived_healthcheckers: Netlink: filter function error
Aug 18 11:43:14 bcpc-vm2 Keepalived_healthcheckers: Netlink: filter function error
Aug 18 11:43:14 bcpc-vm2 Keepalived_vrrp: Netlink: filter function error

Mac bootstrap VM create warning

After the machine is booted and ready, I see this warning:

[bootstrap] The guest additions on this VM do not match the installed version of
VirtualBox! In most cases this is fine, but in rare cases it can
prevent things such as shared folders from working properly. If you see
shared folder errors, please make sure the guest additions within the
virtual machine match the version of VirtualBox you have installed on
your host and reload your VM.

Guest Additions Version: 4.1.12
VirtualBox Version: 4.3

td-agent complains on restart

2014-05-06 17:03:12 -0400 [warn]: out_record_reformer: output_tag is deprecated. Use tag option instead.

Prob want to just change that.
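
A sketch of the rename, assuming the directive sits on its own line in the td-agent configuration (for example /etc/td-agent/td-agent.conf, which is an assumption about this deployment). The helper rename_output_tag is hypothetical and only prints the rewritten file, so the result can be reviewed before replacing anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the given config with each `output_tag`
# directive renamed to `tag`, keeping indentation and the value as-is.
rename_output_tag() {
    sed -E 's/^([[:space:]]*)output_tag([[:space:]])/\1tag\2/' "$1"
}

# Usage sketch:
# rename_output_tag /etc/td-agent/td-agent.conf | diff /etc/td-agent/td-agent.conf -
```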

Creating multiple head nodes fails with ceph-mon aborts

When creating multiple head nodes with Ceph Cuttlefish (confirmed to still exist with 0.61.7), ceph-mon on the additional headnodes will abort on startup.

On the new monitor node, do:

# ceph-mon --cluster=ceph --id=<hostname> --public_addr=<storage_ip> -f

Then once it gets quorum, ctrl-c it, then rerun chef-client.

There is an upstream issue filed with Ceph that should resolve the underlying issue. We will close this issue when we have confirmed that the upstream issue is resolved.

DNS for clusters may prove troublesome

There are open questions about two DNS commits:
add VIP as preferred DNS server - 67d805f
and
create DNS entries for hypervisors and enable recursion - 987e878

The concerns are recorded as code review comments on those commits. None of them is causing a catastrophic problem at this time.

Upgrade hypervisors to trusty (14.04)

We should think about when we want to upgrade from precise/12.04 to trusty/14.04.

One known problem that should be fixed by the upstream packages in 14.04 is #54, where keepalived can drop the VIP under load. I expect there are more.

Thoughts?

Google's DNS Server Seems To Creep Into Networks Table

On a new cluster, the nova database, table networks, is getting the column dns1 set to 8.8.4.4. Strangely, this isn't seen anywhere in our setup for the nova-networks.

For example, the environment file had no mention of this DNS server:
ubuntu@foohost:/chef-bcpc$ knife environment show foo_env | grep 8.8.4.4
ubuntu@foohost:/chef-bcpc$

And we don't pass the DNS server in the recipe:
bash-3.2$ grep -i network ./cookbooks/bcpc/recipes/nova-setup.rb
nova-manage network create --label fixed --fixed_range_v4=#{node[:bcpc][:fixed][:cidr]} --num_networks=#{node[:bcpc][:fixed][:num_networks]} --multi_host=T --network_size=#{node[:bcpc][:fixed][:network_size]} --vlan=#{node[:bcpc][:fixed][:vlan_start]}
only_if ". /root/adminrc; nova-manage network list | grep \"No networks found\""

Perhaps we need to use --dns1 and --dns2, as the default in nova/network/manager.py is 8.8.4.4:
    cfg.StrOpt('flat_network_dns',
               default='8.8.4.4',
               help='Dns for simple network'),
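
If passing the DNS servers explicitly turns out to be the fix, the command could look roughly like this. The helper network_create_cmd is hypothetical, it only prints the command, and it leaves out the other options from the recipe for brevity. The addresses in the usage line are placeholders, not values from any real environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a nova-manage call like the one in the recipe
# with explicit DNS servers appended. All values are supplied by the caller.
network_create_cmd() {
    local cidr=$1 dns1=$2 dns2=$3
    printf 'nova-manage network create --label fixed --fixed_range_v4=%s --multi_host=T --dns1=%s --dns2=%s\n' \
        "$cidr" "$dns1" "$dns2"
}

# Usage sketch with placeholder addresses:
# network_create_cmd 10.0.0.0/24 10.0.100.5 10.0.100.6
```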

keystone dies on idle cluster

A large cluster on real hardware was left idle for nearly two weeks. Keystone died on all head nodes. 'sudo service keystone status' reports 'stop/waiting' and in Kibana3 I can see muttering about tokens being revoked.

Ruby locale issues w/ chef-client

I don't profess to be a ruby or chef expert (or novice, for that matter).

When sshing into the nodes to run chef-client (to test updated recipes), I hit the following error:

[2013-06-05T18:06:16-05:00] FATAL: ArgumentError: package[python-ujson_1.30-1_amd64.deb] (bcpc::beaver line 34) had an error: ArgumentError: invalid byte sequence in US-ASCII

I was able to fix it by explicitly setting the locale prior to running chef-client:

export LC_ALL=en_US.UTF-8

This isn't a very specific bug, but I figure I'd file something in case anyone else hits this problem or knows how I've misconfigured the system.

Warning message: ERROR: RuntimeError: Please set EDITOR environment variable

During a cluster-assign-role.sh run, I noticed this error message:
ERROR: RuntimeError: Please set EDITOR environment variable
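
knife raises this error when it wants to open an editor and EDITOR is not set. A likely workaround is to give it a default before the script runs; vi here is only an assumed choice.

```shell
#!/usr/bin/env bash
# Keep the caller's EDITOR if there is one; otherwise fall back to vi.
export EDITOR="${EDITOR:-vi}"
```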

Issues with keepalived dropping VIP (running on VirtualBox).

I'm not sure what the right fix for this is, so I thought I would submit an issue and perhaps generate discussion.

I see occasional issues with keepalived dropping the VIP. It seems like it happens when the system is busy, particularly ceph. I am able to reproduce this easily by attempting an upload to glance:

    glance --insecure image-create --name=ubuntu-12.04 --is-public=True --container-format=bare --disk-format=raw --file ubuntu-12.04-server-cloudimg-amd64-disk1.raw

This will fail about half-way through with:

    Error communicating with http://10.0.100.5:9292 [Errno 32] Broken pipe

The root cause of that failure is that the IP was dropped. In fact, this was so consistent that I was only able to complete the upload by disabling keepalived and adding the IP myself :).

The laptop running the VMs is pretty beefy (16GB RAM, SSD, quad-core), but I'm assuming that all the ceph-osds in action is a bit too much, and get_monstatus starts taking longer than the timeout. I ran time get_monstatus continuously during a fresh image upload and saw wild variation (unedited):

...

real    0m1.076s
user    0m0.048s
sys 0m0.036s

real    0m4.813s
user    0m0.048s
sys 0m0.008s

real    0m10.869s
user    0m0.052s
sys 0m0.012s

real    0m0.922s
user    0m0.052s
sys 0m0.008s

I think it was at this point where the IP was dropped on this upload (makes sense!). This actually isn't inconsistent with what I've seen with ceph before, where querying the monitor can take a bit of time.

Note that I think this is highly unlikely to be an issue when deploying to physical machines.
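
To quantify how often the check overruns, the `time` output above can be compared against a threshold. This is a sketch only: to_seconds and count_slow are hypothetical helpers, and the threshold passed in is an assumption, since the actual keepalived script timeout is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a bash `time` value such as 0m10.869s
# into seconds with three decimals.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, p, "m")          # p[1] = minutes, p[2] = seconds + "s"
        sub(/s$/, "", p[2])
        printf "%.3f\n", p[1] * 60 + p[2]
    }' < /dev/null
}

# Hypothetical helper: read `time` output on stdin and print how many
# "real" samples exceed the given limit in seconds.
count_slow() {
    local limit=$1 n=0 label value
    while read -r label value; do
        [ "$label" = "real" ] || continue
        if awk -v s="$(to_seconds "$value")" -v l="$limit" \
               'BEGIN { exit !(s > l) }' < /dev/null; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Usage sketch, with 5 seconds as an assumed limit:
# count_slow 5 < timings.txt
```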