
terraform-equinix-metal-anthos-on-vsphere's Introduction


Automated Anthos Installation via Terraform for Equinix Metal

These files will allow you to use Terraform to deploy Google Cloud's Anthos GKE on-prem on VMware vSphere on Equinix Metal's Bare Metal Cloud offering.

Terraform will create an Equinix Metal project complete with a Linux machine for routing, a vSphere cluster installed on a minimum of 3 ESXi hosts with vSAN storage, and an Anthos GKE on-prem admin and user cluster registered to Google Cloud. You can also use an existing Equinix Metal project; check this section for instructions.

Environment Diagram

Users are responsible for providing their own VMware software, Equinix Metal account, and Anthos subscription as described in this readme.

The build (with default settings) typically takes 70-75 minutes.

This repo is by no means meant for production purposes. A production cluster is possible, but it needs some modifications. If you wish to create a production deployment, please consult with Equinix Metal Support via a support ticket.

Join us on Slack

We use Slack as our primary communication tool for collaboration. You can join the Equinix Metal Community Slack group by going to slack.equinixmetal.com and submitting your email address. You will receive a message with an invite link. Once you enter the Slack group, join the #google-anthos channel! Feel free to introduce yourself there, but know it's not mandatory.

Latest Updates

Starting with version v0.2.0, this module is published in the Terraform registry at https://registry.terraform.io/modules/equinix/anthos-on-vsphere/metal/latest.

For current releases, with Git tags, see https://github.com/packet-labs/google-anthos/releases. Historic changes are listed here by date.

9-25-2020

  • GKE on-prem 1.5.0-gke.27 has been released and has been successfully tested

7-29-2020

  • Several Terraform formatting and normalization changes aligned with the packet-labs GitHub organization have been performed
  • GKE on-prem 1.4.1-gke.1 patch release has been successfully tested

6-25-2020

  • Support for GKE on-prem 1.4 added

6-8-2020

  • Added a check_capacity.py to manually perform a capacity check with Equinix Metal before building

6-03-2020

  • 1.3.2-gke.1 patch release has been successfully tested
  • Option to use Equinix Metal gen 3 (c3.medium.x86 for esxi and c3.small.x86 for router) along with ESXi 6.7

5-04-2020

  • 1.3.1-gke.0 patch release has been successfully tested

3-31-2020

  • The terraform is fully upgraded to work with Anthos GKE on-prem version 1.3.0-gke.16
  • There is now an option to use an existing Equinix Metal project rather than create a new one (default behavior is to create a new project)
  • We no longer require a private SSH key for the environment to be saved at ~/.ssh/id_rsa. The Terraform will generate an SSH key pair and save it to ~/.ssh/<project_name>-. A .bak of the same file name is also created so that the key will be available after a terraform destroy command.

Prerequisites

To use these Terraform files, you need to have the following Prerequisites:

  • An Anthos subscription
  • A white listed GCP project and service account.
  • An Equinix Metal org-id and API key
  • If you are new to Equinix Metal, you will need to request an "Entitlement Increase" by working with Equinix Metal Support (for example, via a support ticket or the community Slack). Your message should be:
    • I am working with the Google Anthos Terraform deployment (github.com/equinix/terraform-metal-anthos-on-vsphere). I need an entitlement increase to allow the creation of five or more VLANs. Can you please assist?
  • VMware vCenter Server 6.7U3 - VMware vCenter Server Appliance ISO obtained from VMware
  • VMware vSAN Management SDK 6.7U3 - Virtual SAN Management SDK for Python, also from VMware

Associated Equinix Metal Costs

The default variables make use of 4 c2.medium.x86 servers. These servers are $1 per hour list price (resulting in a total solution price of roughly $4 per hour). Additionally, if you would like to use Intel processors for the ESXi hosts, the m2.xlarge.x86 is a validated configuration. This would increase the usable RAM from 192GB to a whopping 1.15TB. These servers are $2 per hour list price (resulting in a total solution price of roughly $7 per hour).

You can also deploy just 2 c2.medium.x86 servers for $2 per hour instead.

Tested GKE on-prem versions

The Terraform has been successfully tested with the following versions of GKE on-prem:

  • 1.1.2-gke.0*
  • 1.2.0-gke.6*
  • 1.2.1-gke.4*
  • 1.2.2-gke.2*
  • 1.3.0-gke.16
  • 1.3.1-gke.0
  • 1.3.2-gke.1
  • 1.4.0-gke.13
  • 1.4.1-gke.1
  • 1.5.0-gke.27

To simplify setup, this is designed to use the bundled Seesaw load balancer. No other load balancer support is planned at this time.

Select the version of Anthos you wish to install by setting the anthos_version variable in your terraform.tfvars file.

*Due to a known bug in the BundledLb EAP version, the script will automatically detect when using the EAP version and automatically delete the secondary LB in each group (admin and user cluster) to prevent the bug from occurring.

Setup your GCS object store

You will need a GCS object store in order to download closed source packages such as vCenter and the vSAN SDK. (See below for an S3 compatible object store option)

The setup will use a service account with Storage Admin permissions to download the needed files. You can create this service account on your own or use the helper script described below.
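If you create this service account manually, a minimal sketch with the gcloud CLI looks like the following; the service account name is illustrative and the key path mirrors what the helper script produces, so adjust both as needed:

PROJECT_ID="my-anthos-project"
# Create an illustrative service account for reading the GCS bucket
gcloud iam service-accounts create storage-reader \
    --project "$PROJECT_ID" \
    --display-name "GCS reader for Anthos on vSphere downloads"
# Grant Storage Admin so the setup can read the closed-source packages
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member "serviceAccount:storage-reader@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role "roles/storage.admin"
# Export a key; the output path is an assumption modeled on the helper script's storage-reader-key.json
gcloud iam service-accounts keys create anthos/gcp_keys/storage-reader-key.json \
    --iam-account "storage-reader@${PROJECT_ID}.iam.gserviceaccount.com"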

You will need to layout the GCS structure to look like this:

https://storage.googleapis.com:
    |
    |__ bucket_name/folder/
        |
        |__ VMware-VCSA-all-6.7.0-14367737.iso
        |
        |__ vsanapiutils.py
        |
        |__ vsanmgmtObjects.py

Your VMware ISO name may vary depending on which build you download. These files can be downloaded from My VMware; once logged in to "My VMware", download the vCenter Server Appliance ISO and the vSAN Management SDK listed in the Prerequisites above.

You will need to find the two individual Python files in the vSAN SDK zip file and place them in the GCS bucket as shown above.
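For example, assuming gsutil is installed and the bucket and folder already exist, the upload might look like this:

# Upload the vCenter ISO and the two vSAN SDK Python files to the bucket/folder used by the Terraform
gsutil cp VMware-VCSA-all-6.7.0-14367737.iso gs://bucket_name/folder/
gsutil cp vsanapiutils.py vsanmgmtObjects.py gs://bucket_name/folder/
# Verify the layout matches the structure shown above
gsutil ls gs://bucket_name/folder/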

Download/Create your GCP Keys for your service accounts and activate APIs for your project

The GKE on-prem install requires several service accounts and keys to be created. See the Google documentation for more details. You can create these keys manually, or use a provided helper script to make the keys for you.

The Terraform files expect the keys to use the following naming convention, matching that of the Google documentation:

  • register-key.json
  • connect-key.json
  • stackdriver-key.json
  • whitelisted-key.json

If doing so manually, you must create each of these keys and place them in a folder named gcp_keys within the anthos folder. The service accounts also need to have IAM roles assigned to each of them. To do this manually, you'll need to follow the instructions from Google.

GKE on-prem also requires several APIs to be activated on your target project.
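If you prefer to activate the APIs by hand, a hedged sketch with gcloud follows; the exact set of services depends on your GKE on-prem version, so treat this list as illustrative and confirm it against Google's documentation (the helper script described below handles this for you):

PROJECT_ID="my-anthos-project"
# Illustrative API list; verify against the Google GKE on-prem docs for your version
gcloud services enable --project "$PROJECT_ID" \
    anthos.googleapis.com \
    gkeconnect.googleapis.com \
    gkehub.googleapis.com \
    cloudresourcemanager.googleapis.com \
    container.googleapis.com \
    stackdriver.googleapis.com \
    monitoring.googleapis.com \
    logging.googleapis.com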

Much easier (and recommended) is to use the helper script located in the anthos directory called create_service_accounts.sh to create these keys, assign the IAM roles, and activate the APIs. The script will allow you to log into GCP with your user account and select your Anthos white listed project. You'll also have an option to create a GCP service account to read from the GCS bucket. If you choose this option, you will create a storage-reader-key.json.

You can run this script as follows:

anthos/create_service_accounts.sh

Prompts will guide you through the setup.

Install Terraform

Terraform is just a single binary. Visit their download page, choose your operating system, make the binary executable, and move it into your path.

Here is an example for macOS:

curl -LO https://releases.hashicorp.com/terraform/0.12.18/terraform_0.12.18_darwin_amd64.zip 
unzip terraform_0.12.18_darwin_amd64.zip 
chmod +x terraform 
sudo mv terraform /usr/local/bin/ 
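You can confirm the binary is installed and on your PATH with:

terraform version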

Download this project

To download this project, run the following command:

git clone https://github.com/equinix/terraform-metal-anthos-on-vsphere.git

Initialize Terraform

Terraform uses modules to deploy infrastructure. In order to initialize the modules, you simply run terraform init -upgrade. This should download five modules into a hidden directory, .terraform.
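For example (the validate step is optional and not required by this project, but it is a quick sanity check that the configuration parses):

terraform init -upgrade
terraform validate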

Modify your variables

There are many variables which can be set to customize your install within 00-vars.tf and 30-anthos-vars.tf. The default variables bring up a 3-node vSphere cluster and a Linux router using Equinix Metal's c2.medium.x86 servers. Change each default variable at your own risk.

There are some variables you must set in a terraform.tfvars file. You need to set auth_token & organization_id to connect to Equinix Metal and the project_name which will be created in Equinix Metal. You will need to set anthos_gcp_project_id for your GCP Project ID. You will need a GCS bucket to download "Closed Source" packages such as vCenter. The GCS related variable is gcs_bucket_name. You need to provide the vCenter ISO file name as vcenter_iso_name.

The Anthos variables include anthos_version and anthos_user_cluster_name.

Here is a quick command plus sample values to start the file for you (make sure you adjust the variables to match your environment; pay special attention that the vcenter_iso_name matches what's in your bucket):

cat <<EOF >terraform.tfvars 
auth_token = "cefa5c94-e8ee-4577-bff8-1d1edca93ed8" 
organization_id = "42259e34-d300-48b3-b3e1-d5165cd14169" 
project_name = "anthos-packet-project-1"
anthos_gcp_project_id = "my-anthos-project" 
gcs_bucket_name = "bucket_name/folder" 
vcenter_iso_name = "VMware-VCSA-all-6.7.0-XXXXXXX.iso" 
anthos_version = "1.3.0-gke.16"
anthos_user_cluster_name = "packet-cluster-1"
EOF

Using an S3 compatible object store (optional)

You have the option to use an S3 compatible object store in place of GCS in order to download closed source packages such as vCenter and the vSAN SDK. Minio, an open source object store, works great for this.

You will need to layout the S3 structure to look like this:

https://s3.example.com: 
    | 
    |__ vmware 
        | 
        |__ VMware-VCSA-all-6.7.0-14367737.iso
        | 
        |__ vsanapiutils.py
        | 
        |__ vsanmgmtObjects.py

These files can be downloaded from My VMware; once logged in to "My VMware", download the vCenter Server Appliance ISO and the vSAN Management SDK listed in the Prerequisites above.

You will need to find the two individual Python files in the vSAN SDK zip file and place them in the S3 bucket as shown above.
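For example, with the MinIO client (mc) and placeholder endpoint and credentials, the upload might look like this (the alias name is illustrative, and older mc releases use "mc config host add" instead of "mc alias set"):

# Register the S3 endpoint under a local alias, then create the bucket and upload the files
mc alias set metal-s3 https://s3.example.com <ACCESS_KEY> <SECRET_KEY>
mc mb metal-s3/vmware
mc cp VMware-VCSA-all-6.7.0-14367737.iso vsanapiutils.py vsanmgmtObjects.py metal-s3/vmware/
# Verify the layout matches the structure shown above
mc ls metal-s3/vmware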

For the cluster build to use the S3 option, you'll need to change your variable file by adding s3_boolean = "true" and including s3_url, s3_bucket_name, s3_access_key, and s3_secret_key in place of the GCS variables.

Here is the create variable file command again, modified for S3:

cat <<EOF >terraform.tfvars 
auth_token = "cefa5c94-e8ee-4577-bff8-1d1edca93ed8" 
organization_id = "42259e34-d300-48b3-b3e1-d5165cd14169" 
project_name = "anthos-packet-project-1"
anthos_gcp_project_id = "my-anthos-project" 
s3_boolean = "true"
s3_url = "https://s3.example.com" 
s3_bucket_name = "vmware" 
s3_access_key = "4fa85962-975f-4650-b603-17f1cb9dee10" 
s3_secret_key = "becf3868-3f07-4dbb-a6d5-eacfd7512b09" 
vcenter_iso_name = "VMware-VCSA-all-6.7.0-XXXXXXX.iso" 
anthos_version = "1.5.0-gke.27"
anthos_user_cluster_name = "packet-cluster-1"
EOF 

Deploy the Equinix Metal vSphere cluster and Anthos GKE on-prem cluster

All that is left to do now is deploy the cluster:

terraform apply --auto-approve 

This should end with output similar to this:

Apply complete! Resources: 50 added, 0 changed, 0 destroyed. 
 
Outputs: 

KSA_Token_Location = The user cluster KSA Token (for logging in from GCP) is located at ./ksa_token.txt
SSH_Key_Location = An SSH Key was created for this environment, it is saved at ~/.ssh/project_2-20200331215342-key
VPN_Endpoint = 139.178.85.91
VPN_PSK = @1!64v7$PLuIIir9TPIJ
VPN_Password = n3$xi@S*ZFgUbB5k
VPN_User = vm_admin
vCenter_Appliance_Root_Password = *XjryDXx*P8Y3c1$
vCenter_FQDN = vcva.packet.local
vCenter_Password = 3@Uj7sor7v3I!4eo

The above Outputs will be used later for setting up the VPN. You can copy/paste them to a file now, or get the values later from the file terraform.tfstate which should have been automatically generated as a side-effect of the "terraform apply" command.

Checking Capacity in an Equinix Metal Facility (optional)

Before attempting to create the cluster, it is a good idea to do a quick capacity check to be sure there are enough devices at your chosen Equinix Metal facility.

We've included a check_capacity.py file to be run prior to a build. The file will read your terraform.tfvars file to use your selected host sizes and quantities or use the defaults if you've not set any.

Running the check_capacity.py file requires that you have python3 installed on your system.

Running the test is done with a simple command:

python3 check_capacity.py

The output will confirm which values it checked capacity for and display the results:

Using the default value for facility: dfw2
Using the default value for router_size: c2.medium.x86
Using the user variable for esxi_size: c3.medium.x86
Using the user variable for esxi_host_count: 3



Is there 1 c2.medium.x86 instance available for the router in dfw2?
Yes

Are there 3 c3.medium.x86 instances available for ESXi in dfw2?
Yes

Size of the vSphere Cluster

The code supports deploying a single ESXi server or a 3+ node vSAN cluster. Default settings are for 3 ESXi nodes with vSAN.

When a single ESXi server is deployed, the datastore is extended to use all available disks on the server. The linux router is still deployed as a separate system.

To do a single ESXi server deployment, set the following variables in your terraform.tfvars file:

esxi_host_count             = 1
anthos_datastore            = "datastore1"
anthos_user_master_replicas = 1

This has been tested with the c2.medium.x86. It may work with other systems as well, but it has not been fully tested. We have not tested the maximum vSAN cluster size. Cluster size of 2 is not supported.

Using Equinix Metal Gen 3 Hardware and ESXi 6.7

Equinix Metal is actively rolling out new hardware in multiple locations which supports ESXi 6.7. Until the gen 3 hardware is more widely available, we'll not make gen 3 hardware the default, but we provide the option to use it.

Costs

The gen 3 c3.medium.x86 is $0.10 more than the c2.medium.x86 but benefits from a higher clock speed, and its storage is better utilized to create a larger vSAN datastore.

The c3.small.x86 is $0.50 less expensive than the c2.medium.x86. Therefore in a standard build, with 3 ESXi servers and 1 router, the net costs should be $0.20 lower than when using gen 2 devices.

Known Issues

ESXi 6.7 deployed on c3.medium.x86 may result in an alarm in vCenter which states Host TPM attestation alarm. The Equinix Metal team is looking into this, but it's thought to be a cosmetic issue.

Upon using terraform destroy --auto-approve to clean up an install, the VLANs may not get cleaned up properly.

Instructions to use gen 3

Using gen 3 requires modifying the terraform.tfvars file to include a few new variables:

esxi_size      = "c3.medium.x86"
vmware_os      = "vmware_esxi_6_7"
router_size    = "c3.small.x86"

These simple additions will cause the script to use the gen 3 hardware.

Connect to the Environment via VPN

By connecting via VPN, you will be able to access vCenter plus the admin workstation, cluster VMs, and any services exposed via the seesaw load balancers.

An L2TP IPsec VPN is set up for the environment. There is an L2TP IPsec VPN client for every platform; you'll need to reference your operating system's documentation on how to connect to an L2TP IPsec VPN.

Mac how to configure L2TP IPsec VPN

NOTE- On a Mac, for manual VPN setup, use the values from the Outputs (or from the generated file terraform.tfstate):

  • "Server Address" = VPN_Endpoint
  • "Account Name" = VPN_User
  • "User Authentication: Password" = VPN_Password
  • "Machine Authentication: Shared Secret" = VPN_PSK

Chromebook how to configure L2TP IPsec VPN

Make sure to enable all traffic to use the VPN (i.e., do not enable split tunneling) on your L2TP client. NOTE- On a Mac, this option is under the "Advanced..." dialog when your VPN is selected (under System Preferences > Network Settings).

Some corporate networks block outbound L2TP traffic. If you are experiencing issues connecting, you may try a guest network or personal hotspot.

Windows 10 is known to be very finicky with L2TP IPsec VPN. If you are on a Windows 10 client and experience issues getting the VPN to work, consider using OpenVPN instead. These instructions may help with setting up OpenVPN on the edge-gateway.

Connect to the clusters

You will need to ssh into the router/gateway and from there ssh into the admin workstation where the kubeconfig files of your clusters are located. NOTE- This can be done with or without establishing the VPN first.

ssh -i ~/.ssh/<private-ssh-key-created-by-project> root@VPN_Endpoint
ssh -i /root/anthos/ssh_key ubuntu@admin-workstation

The kubeconfig files for the admin and user clusters are located under ~/cluster. You can, for example, check the nodes of the admin cluster with the following command:

kubectl --kubeconfig ~/cluster/kubeconfig get nodes
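The user cluster has its own kubeconfig in the same directory. The exact filename is an assumption based on the user cluster name, so list the directory first:

ls ~/cluster
# Assumed filename pattern <user-cluster-name>-kubeconfig; adjust to what the listing shows
kubectl --kubeconfig ~/cluster/packet-cluster-1-kubeconfig get nodes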

Connect to the vCenter

Connecting to the vCenter requires that the VPN be established. Once the VPN is connected, launch a browser to https://vcva.packet.local/ui. You'll need to accept the self-signed certificate, and then enter the vCenter_Username and vCenter_Password provided in the Outputs of the run of "terraform apply" (or alternatively from the generated file terraform.tfstate). NOTE- use the vCenter_Password and not the vCenter_Appliance_Root_Password. NOTE- on a Mac, you may find that the Chrome browser will not allow the connection. If so, try using Firefox.

Exposing k8s services

Currently services can be exposed on the bundled Seesaw load balancer(s) on VIPs within the VM Private Network (172.16.3.0/24 by default). By default we exclude the last 98 usable IPs of the 172.16.3.0/24 subnet from the DHCP range (172.16.3.156-172.16.3.254). You can change this number by adjusting the reserved_ip_count field in the VM Private Network JSON in 00-vars.tf.

At this point services are not exposed to the public internet; you must connect via VPN to access the VIPs and services. One could adjust iptables on the edge-gateway to forward ports/IPs to a VIP.
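As a hedged example of exposing a service on a VIP from that reserved block, run something like the following from the admin workstation against the user cluster; the VIP, service name, selector, and kubeconfig filename are all illustrative:

cat <<EOF | kubectl --kubeconfig ~/cluster/packet-cluster-1-kubeconfig apply -f -
apiVersion: v1
kind: Service
metadata:
  name: hello-lb                      # illustrative name
spec:
  type: LoadBalancer
  loadBalancerIP: 172.16.3.200        # pick an unused VIP from the reserved range
  selector:
    app: hello                        # assumes a matching deployment already exists
  ports:
    - port: 80
      targetPort: 8080
EOF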

Cleaning the environment

To clean up a created environment (or a failed one), run terraform destroy --auto-approve.

If this does not work for some reason, you can manually delete each of the resources created in Equinix Metal (including the project) and then delete your terraform state file, rm -f terraform.tfstate.
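Putting those two steps together (the .backup file is Terraform's standard state backup and is safe to remove along with the state):

terraform destroy --auto-approve
# If resources linger, delete them in the Equinix Metal console, then clear local state:
rm -f terraform.tfstate terraform.tfstate.backup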

Skipping the Anthos GKE on-prem cluster creation steps

If you wish to create the environment (including deploying the admin workstation and Anthos prerequisites) but skip the cluster creation (so that you can practice creating a cluster on your own), add anthos_deploy_clusters = "False" to your terraform.tfvars file. This will still run the prerequisites for the GKE on-prem install, including setting up the admin workstation.

To create just the vSphere environment and skip all Anthos related steps, add anthos_deploy_workstation_prereqs = false.

Note that anthos_deploy_clusters uses a string of either "True" or "False" while anthos_deploy_workstation_prereqs uses a boolean of true or false. This is because the anthos_deploy_clusters variable is used within a bash script, while anthos_deploy_workstation_prereqs is used by Terraform, which supports booleans.
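For example, to append the skip flags to an existing terraform.tfvars (both options shown; uncomment only what you need):

cat <<EOF >>terraform.tfvars
# Skip cluster creation but still prepare the admin workstation:
anthos_deploy_clusters = "False"
# Or skip all Anthos steps and build only the vSphere environment:
# anthos_deploy_workstation_prereqs = false
EOF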

See anthos/cluster/bundled-lb-admin-uc1-config.yaml.sample to see what the Anthos parameters are when the default settings are used to create the environment.

Use an existing Equinix Metal project

If you have an existing Equinix Metal project, you can use it, assuming the project has at least 5 available VLANs. An Equinix Metal project has a limit of 12 VLANs, and this setup uses 5 of them.

To get your Project ID, navigate to the project from the console.equinixmetal.com console, click on PROJECT SETTINGS, and copy the PROJECT ID.

Add the following variables to your terraform.tfvars:

create_project                    = false
project_id                        = "YOUR-PROJECT-ID"

Changing Anthos GKE on-prem cluster defaults

Check the 30-anthos-vars.tf file for additional values (including number of user worker nodes and vCPU/RAM settings for the worker nodes) which can be set via the terraform.tfvars file.

Google Anthos Documentation

Once Anthos is deployed on Equinix Metal, all of the documentation for using Google Anthos is located on the Anthos Documentation Page.

Troubleshooting

Some common issues and fixes.

Error: The specified project contains insufficient public IPv4 space to complete the request. Please e-mail [email protected].

Should be resolved in https://github.com/equinix/terraform-metal-anthos-on-vsphere/commit/f6668b1359683eb5124d6ab66457f3680072651a

Due to recent changes to the Equinix Metal API, new organizations may be unable to use the Terraform to build ESXi servers. Equinix Metal is aware of the issue and is planning some fixes. In the meantime, if you hit this issue, email [email protected] and request that your organization be white listed to deploy ESXi servers with the API. You should reference this project (https://github.com/equinix/terraform-metal-anthos-on-vsphere) in your email.

At times the Equinix Metal API fails to recognize that the ESXi host can be enabled for Layer 2 networking (more accurately, mixed/hybrid mode). The Terraform will exit and you'll see:

Error: POST https://api.packet.net/ports/e2385919-fd4c-410d-b71c-568d7a517896/disbond: 422 This device is not enabled for Layer 2. Please contact support for more details. 

  on 04-esx-hosts.tf line 1, in resource "packet_device" "esxi_hosts":
   1: resource "packet_device" "esxi_hosts" {

If this happens, you can issue terraform apply --auto-approve again and the problematic ESXi host(s) should be deleted and recreated again properly. Or you can perform terraform destroy --auto-approve and start over again.

null_resource.download_vcenter_iso (remote-exec): E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)

Occasionally the Ubuntu automatic unattended upgrades will run at an unfortunate time and lock apt while the script is attempting to run.

Should this happen, the best resolution is to clean up your deployment and try again.

SSH_AUTH_SOCK: dial unix /tmp/ssh-vPixj98asT/agent.11502: connect: no such file or directory

A failed deployment which results in the following output:

Error: Error connecting to SSH_AUTH_SOCK: dial unix /tmp/ssh-vPixj98asT/agent.11502: connect: no such file or directory



Error: Error connecting to SSH_AUTH_SOCK: dial unix /tmp/ssh-vPixj98asT/agent.11502: connect: no such file or directory



Error: Error connecting to SSH_AUTH_SOCK: dial unix /tmp/ssh-vPixj98asT/agent.11502: connect: no such file or directory



Error: Error connecting to SSH_AUTH_SOCK: dial unix /tmp/ssh-vPixj98asT/agent.11502: connect: no such file or directory

This could be because you are using a terminal multiplexer such as screen or tmux and the SSH agent is not running. It may be corrected by starting an SSH agent (for example, with ssh-agent bash) prior to running the terraform apply --auto-approve command.
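A minimal sketch of starting an agent and loading the project key before re-running the apply (the key path is the one reported in the SSH_Key_Location output):

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/<private-ssh-key-created-by-project>
terraform apply --auto-approve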

terraform-equinix-metal-anthos-on-vsphere's Issues

Script failure is silenced

Noticed another issue where the script fails to execute but Terraform just says it exited with a non-zero status without too much detail. I had to SSH into the node and re-run the script to figure out it had forgotten to provide the storage-reader JSON file.

This is easy to reproduce: just remove one of the JSON key files (the storage-reader would be the easiest), run the script, and watch it die with no details.

Got to research whether this is a Terraform thing.

The default for the `facility` variable (ny5) does not seem to exist

I noticed the default packetlabs facility in variables.tf was updated to ny5: https://github.com/packet-labs/google-anthos/blob/2e4bc665fefe8b9f7b1a1cd9414a2d2cae42537b/variables.tf#L99-L101

I tried using the new facility, but got an error:

Error: Error reserving IP address block: POST https://api.packet.net/projects/<redacted>/ips: 422 ny5 is not a valid facility                                       
                                                                                                                                                                                              
  on 02-network-resources.tf line 1, in resource "packet_reserved_ip_block" "ip_blocks":                                                                                                      
   1: resource "packet_reserved_ip_block" "ip_blocks" {                                                                                                                                                                                                                                                                                                                                     
Error: Error reserving IP address block: POST https://api.packet.net/projects/<redacted>/ips: 422 ny5 is not a valid facility                                       
                                                                                                                                                                                              
  on 02-network-resources.tf line 8, in resource "packet_reserved_ip_block" "esx_ip_blocks":                                                                                                  
   8: resource "packet_reserved_ip_block" "esx_ip_blocks" {                     

Am I doing something wrong here? I checked the web console to see if the facility showed up in the UI and didn't see it there either. Is this facility available for normal users yet? Thanks!

Create option to use existing Packet project

@c0dyhi11
We should create boolean/if logic to allow the choice to create a Packet Project or use an existing one.

Known limitation of using an existing project-

  • A Packet project supports a max of 12 VLANs, and our build uses 5. Therefore there would be a maximum of 2 of these builds per project.

no eligible ssds in vsan_claim.py

I captured the following log output while attempting to provision this module with the following non-default Packet device settings:

esxi_size       = "c3.medium.x86"
router_size     = "c2.medium.x86"
esxi_host_count = "1"
facility        = "dfw2"
Traceback (most recent call last):
  File "vsan_claim.py", line 97, in <module>
    smallerSize = min([disk.capacity.block * disk.capacity.blockSize for disk in ssds])
ValueError: min() arg is an empty sequence

By adding a print statement in vsan_claim.py, we can see that none of the ssd=True disks were status='eligible' (only one disk is shown here for brevity; each of the three included the error message that the disk contained existing partitions):

88 diskmap = {host: {'cache':[],'capacity':[]} for host in hosts}
89 cacheDisks = []
90 capacityDisks = []
91
92 for host in hosts:
93     print(repr(hostProps[host]['configManager.vsanSystem'].QueryDisksForVsan())) # added for debugging
94     ssds = [result.disk for result in hostProps[host]['configManager.vsanSystem'].QueryDisksForVsan() if
95         result.state == 'eligible' and result.disk.ssd]
97     smallerSize = min([disk.capacity.block * disk.capacity.blockSize for disk in ssds])
  (vim.vsan.host.DiskResult) {
      dynamicType = <unset>,
      dynamicProperty = (vmodl.DynamicProperty) [],
      disk = (vim.host.ScsiDisk) {
         dynamicType = <unset>,
         dynamicProperty = (vmodl.DynamicProperty) [],
         deviceName = '/vmfs/devices/disks/naa.500a075127d5116d',
         deviceType = 'disk',
         key = 'key-vim.host.ScsiDisk-0200000000500a075127d5116d4d5446444441',
         uuid = '0200000000500a075127d5116d4d5446444441',
         descriptor = (vim.host.ScsiLun.Descriptor) [
            (vim.host.ScsiLun.Descriptor) {
               dynamicType = <unset>,
               dynamicProperty = (vmodl.DynamicProperty) [],
               quality = 'highQuality',
               id = 'naa.500a075127d5116d'
            },
            (vim.host.ScsiLun.Descriptor) {
               dynamicType = <unset>,
               dynamicProperty = (vmodl.DynamicProperty) [],
               quality = 'mediumQuality',
               id = 'vml.0200000000500a075127d5116d4d5446444441'
            },
            (vim.host.ScsiLun.Descriptor) {
               dynamicType = <unset>,
               dynamicProperty = (vmodl.DynamicProperty) [],
               quality = 'mediumQuality',
               id = '0200000000500a075127d5116d4d5446444441'
            }
         ],
         canonicalName = 'naa.500a075127d5116d',
         displayName = 'Local ATA Disk (naa.500a075127d5116d)',
         lunType = 'disk',
         vendor = 'ATA     ',
         model = 'MTFDDAK480TDC   ',
         revision = 'F003',
         scsiLevel = 6,
         serialNumber = 'unavailable',
         durableName = <unset>,
         alternateName = (vim.host.ScsiLun.DurableName) [],
         standardInquiry = (byte) [],
         queueDepth = 32,
         operationalState = (str) [
            'ok'
         ],
         capabilities = (vim.host.ScsiLun.Capabilities) {
            dynamicType = <unset>,
            dynamicProperty = (vmodl.DynamicProperty) [],
            updateDisplayNameSupported = true
         },
         vStorageSupport = 'vStorageUnknown',
         protocolEndpoint = false,
         perenniallyReserved = <unset>,
         clusteredVmdkSupported = <unset>,
         capacity = (vim.host.DiskDimensions.Lba) {
            dynamicType = <unset>,
            dynamicProperty = (vmodl.DynamicProperty) [],
            blockSize = 512,
            block = 937703088
         },
         devicePath = '/vmfs/devices/disks/naa.500a075127d5116d',
         ssd = true,
         localDisk = true,
         physicalLocation = (str) [],
         emulatedDIXDIFEnabled = false,
         vsanDiskInfo = <unset>,
         scsiDiskType = <unset>
      },
      state = 'ineligible',
      vsanUuid = '',
      error = (vim.fault.DiskHasPartitions) {
         dynamicType = <unset>,
         dynamicProperty = (vmodl.DynamicProperty) [],
         msg = "Existing partitions found on disk 'naa.500a075127d5116d'.",
         faultCause = <unset>,
         faultMessage = (vmodl.LocalizableMessage) [],
         device = 'naa.500a075127d5116d'
      },
      degraded = <unset>
   }

Occasionally the ESX01 server fails to communicate with vCenter

On several occasions I've observed an ESX host (so far it has always been ESX01, but that may be coincidence) disconnect from the vCenter but continue to run VMs fine.

This will cause trouble when trying to deploy the Anthos cluster. Manually logging in and resetting the management agents on the ESXi server seems to remedy this.

If we see this continue to happen often then we should try to determine the underlying reason. We can attempt to write a simple script that will reset the agents proactively on each of the ESXi servers before attempting to do the cluster install, though I'm not sure that will solve the issue until we know what triggers it.

move admin workstation creation over to gkeadm

1.3 will bring a new method for bringing up the admin workstation: gkeadm. We should align with the product direction and use gkeadm instead of terraform for the admin workstation bring up moving forward.

latest supported anthos version

I would like to be able to specify the "latest supported version" of Anthos. Is there an automated way of obtaining this info?

If not, I propose that there should be a text file, perhaps named ANTHOS_SUPPORTED_VERSIONS.txt. Its contents would be one line per supported version, sorted by version number.

E.g.,

1.1.2-gke.0
1.2.0-gke.6
1.2.1-gke.4
1.2.2-gke.2
1.3.0-gke.16
1.3.1-gke.0
1.3.2-gke.1
1.4.0-gke.13
1.4.1-gke.1
1.5.0-gke.27

This is basically the same as what's in the README, only in raw form rather than markdown.

Add sudo to openssl command to get around permissions issue

The following errors are happening during openssl CA certificate creation.

null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): Can't load /home/ubuntu/.rnd into RNG
null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): 140461474767296:error:2406F079:random number generator:RAND_load_file:Cannot open file:../crypto/rand/randfile.c:88:Filename=/home/ubuntu/.rnd

There are some openssl issues, and it looks like this is a permissions issue according to openssl/openssl#7754

I think updating the following line From:
https://github.com/packet-labs/google-anthos/blob/40af0e953984c6a735b3e0bcf9c39e36f7af6176/anthos/cluster/finish_cluster.sh#L22

To:

sudo openssl \

Will solve the issue.

Working with n2.xlarges

I attempted to do an ESXi single server deployment on an n2.xlarge; however, I hit an issue with the API call to move from layer 3 to hybrid.

Error: Failed to convert device eae44273-c7e1-404e-aa6d-4284445545c3 from layer3 to hybrid. New type was layer3

anthos 1.5.0-gke.27 errors in terraform apply

I have been unable to get Anthos 1.5.0-gke.27 to pass "terraform apply" without errors.

Here are some of the error messages from the log.

        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): [K    - [FATAL] Hosts for AntiAffinityGroups: Anti-affinity groups enabled with available
        null_resource.anthos_deploy_cluster[0] (remote-exec): [0m[0mnull_resource.anthos_deploy_cluster (remote-exec): Some validation results were FATAL. Check report above.
        null_resource.anthos_deploy_cluster[0] (remote-exec): [0m[0mnull_resource.anthos_deploy_cluster (remote-exec): Failed to create root cluster: unable to create node Machine Deployments: creating or updating machine deployment "gke-admin-node" in namespace "default": timed out waiting for the condition
        null_resource.anthos_deploy_cluster[0] (remote-exec): [0m[0mnull_resource.anthos_deploy_cluster (remote-exec): error: stat /home/ubuntu/cluster/kpresubmit-500-kubeconfig: no such file or directory
        null_resource.anthos_deploy_cluster[0] (remote-exec): [1m[31mError: [0m[0m[1merror executing "/tmp/terraform_522095460.sh": Process exited with status 1[0m

put private vars in a separate file

Currently terraform.tfvars contains a mixture of private info (e.g., org ID and API key) and non-private info (number of nodes, machine model). It would be better to segregate the two types of info into separate files so that examples of non-private info can be created and shared.

Enable a single ESXi host deployment

Create the option to have a single ESXi host (meaning no VSAN and disks need to be automatically aggregated into a single datastore).

The router will still exist as its own server. Thus we should do some testing on some of the smaller servers.

move this project to equinix/terraform-metal-anthos-on-vsphere

Move this project to its new home.

Requirements:

  • Before moving this project, we will want to get #113
    • convert the project to use terraform-provider-metal instead of terraform-provider-packet
  • Open a tracking issue for migration notes
  • Update in-repo web and git links
  • fix any packet-labs or google-anthos references

vlans may fail to be added to ports after network type conversion

I captured the following log output while attempting to provision this module with the following non-default Packet device settings:

esxi_size       = "c3.medium.x86"
router_size     = "c2.medium.x86"
esxi_host_count = "1"
facility        = "dfw2"
null_resource.apply_esx_network_config[0] (remote-exec): Trying to connect to ESX Host . . .
null_resource.apply_esx_network_config[0] (remote-exec): Connected to ESX Host !
null_resource.apply_esx_network_config[0] (remote-exec): Removing vNic: vmk0
null_resource.apply_esx_network_config[0] (remote-exec): Removing vNic: vmk1
null_resource.apply_esx_network_config[0] (remote-exec): Removing Port Group: VM Network
null_resource.apply_esx_network_config[0] (remote-exec): Removing Port Group: Private Network
null_resource.apply_esx_network_config[0] (remote-exec): Removing Port Group: Management Network
null_resource.apply_esx_network_config[0] (remote-exec): Removing vSwitch: vSwitch0
null_resource.apply_esx_network_config[0] (remote-exec): Updating vSwitch Uplinks...
null_resource.apply_esx_network_config[0] (remote-exec): Trying to connect to ESX Host . . .
null_resource.apply_esx_network_config[0] (remote-exec): Connected to ESX Host !
null_resource.apply_esx_network_config[0] (remote-exec): Found correct vSwitch.
null_resource.apply_esx_network_config[0] (remote-exec): Found bond0 port id
null_resource.apply_esx_network_config[0] (remote-exec): Found eth0 port id, but...
null_resource.apply_esx_network_config[0] (remote-exec): This is not the port you're looking for...
null_resource.apply_esx_network_config[0] (remote-exec): Found eth1 port id
null_resource.apply_esx_network_config[0] (remote-exec): Removing vLan 1220 from unbonded port
null_resource.apply_esx_network_config[0]: Still creating... [10s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Removing vLan 1230 from unbonded port
null_resource.apply_esx_network_config[0]: Still creating... [20s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Removing vLan 1221 from unbonded port
null_resource.apply_esx_network_config[0]: Still creating... [30s elapsed]
null_resource.apply_esx_network_config[0]: Still creating... [40s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Removing vLan 1222 from unbonded port
null_resource.apply_esx_network_config[0]: Still creating... [50s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Removing vLan 1228 from unbonded port
null_resource.apply_esx_network_config[0]: Still creating... [1m0s elapsed]
null_resource.apply_esx_network_config[0]: Still creating... [1m10s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Rebonding Ports...
null_resource.apply_esx_network_config[0]: Still creating... [1m20s elapsed]
null_resource.apply_esx_network_config[0]: Still creating... [1m30s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Adding vLan 1220 to bond
null_resource.apply_esx_network_config[0]: Still creating... [1m40s elapsed]
null_resource.apply_esx_network_config[0]: Still creating... [1m50s elapsed]
null_resource.apply_esx_network_config[0]: Still creating... [2m0s elapsed]
null_resource.apply_esx_network_config[0]: Still creating... [2m10s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Adding vLan 1230 to bond
null_resource.apply_esx_network_config[0]: Still creating... [2m20s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Adding vLan 1221 to bond
null_resource.apply_esx_network_config[0]: Still creating... [2m30s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Adding vLan 1222 to bond
null_resource.apply_esx_network_config[0]: Still creating... [2m40s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Traceback (most recent call last):
null_resource.apply_esx_network_config[0] (remote-exec):   File "/root/update_uplinks.py", line 64, in <module>
null_resource.apply_esx_network_config[0] (remote-exec):     main()
null_resource.apply_esx_network_config[0] (remote-exec):   File "/root/update_uplinks.py", line 60, in main
null_resource.apply_esx_network_config[0] (remote-exec):     host_network_system.UpdateVirtualSwitch(vswitchName=options.vswitch, spec=vss_spec)
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/VmomiSupport.py", line 706, in <lambda>
null_resource.apply_esx_network_config[0] (remote-exec):     self.f(*(self.args + (obj,) + args), **kwargs)
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/VmomiSupport.py", line 512, in _InvokeMethod
null_resource.apply_esx_network_config[0] (remote-exec):     return self._stub.InvokeMethod(self, info, args)
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/SoapAdapter.py", line 1351, in InvokeMethod
null_resource.apply_esx_network_config[0] (remote-exec):     resp = conn.getresponse()
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/lib/python3.6/http/client.py", line 1356, in getresponse
null_resource.apply_esx_network_config[0] (remote-exec):     response.begin()
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/lib/python3.6/http/client.py", line 307, in begin
null_resource.apply_esx_network_config[0] (remote-exec):     version, status, reason = self._read_status()
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
null_resource.apply_esx_network_config[0] (remote-exec):     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/lib/python3.6/socket.py", line 586, in readinto
null_resource.apply_esx_network_config[0] (remote-exec):     return self._sock.recv_into(b)
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
null_resource.apply_esx_network_config[0] (remote-exec):     return self.read(nbytes, buffer)
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/lib/python3.6/ssl.py", line 874, in read
null_resource.apply_esx_network_config[0] (remote-exec):     return self._sslobj.read(len, buffer)
null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/lib/python3.6/ssl.py", line 631, in read
null_resource.apply_esx_network_config[0] (remote-exec):     v = self._sslobj.read(len, buffer)
null_resource.apply_esx_network_config[0] (remote-exec): ConnectionResetError: [Errno 104] Connection reset by peer
null_resource.apply_esx_network_config[0]: Still creating... [2m50s elapsed]
null_resource.apply_esx_network_config[0] (remote-exec): Adding vLan 1228 to bond

provide bulletproof packet deletion script

terraform destroy often fails, leaving Packet resources allocated and potentially racking up charges. It would be helpful to provide users with a bulletproof script to delete Packet resources for a cluster.

get integration tests passing

Integration tests are currently failing with:


module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec): Trying to connect to ESX Host . . .
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec): Connected to ESX Host !
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec): Successfully created vSwitch  vSwitch1
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec): Successfully created PortGroup  Management
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec): Traceback (most recent call last):
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):   File "/root/bootstrap/esx_host_networking.py", line 353, in <module>
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):     main()
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):   File "/root/bootstrap/esx_host_networking.py", line 211, in main
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):     create_port_group(host_network_system, "Management", vswitch_name, subnet['vlan'])
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):   File "/root/bootstrap/esx_host_networking.py", line 76, in create_port_group
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):     host_network_system.AddPortGroup(portgrp=port_group_spec)
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/VmomiSupport.py", line 706, in <lambda>
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):     self.f(*(self.args + (obj,) + args), **kwargs)
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/VmomiSupport.py", line 512, in _InvokeMethod
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):     return self._stub.InvokeMethod(self, info, args)
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/SoapAdapter.py", line 1397, in InvokeMethod
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):     raise obj # pylint: disable-msg=E0702
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec): pyVmomi.VmomiSupport.AlreadyExists: (vim.fault.AlreadyExists) ***
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):    dynamicType = <unset>,
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):    dynamicProperty = (vmodl.DynamicProperty) [],
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):    msg = "The specified key, name, or identifier 'pgName' already exists.",
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):    faultCause = <unset>,
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):    faultMessage = (vmodl.LocalizableMessage) [],
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec):    name = 'pgName'
module.vsphere.null_resource.apply_esx_network_config[0] (remote-exec): ***

module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): Trying to connect to ESX Host . . .
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): There was a connection Error to host: 147.75.70.58. Sleeping 10 seconds and trying again.
module.vsphere.null_resource.apply_esx_network_config[1]: Still creating... [2m50s elapsed]
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): Trying to connect to ESX Host . . .
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): Connected to ESX Host !
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): Successfully created vSwitch  vSwitch1
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): Successfully created PortGroup  Management
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): Traceback (most recent call last):
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):   File "/root/bootstrap/esx_host_networking.py", line 353, in <module>
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):     main()
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):   File "/root/bootstrap/esx_host_networking.py", line 211, in main
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):     create_port_group(host_network_system, "Management", vswitch_name, subnet['vlan'])
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):   File "/root/bootstrap/esx_host_networking.py", line 76, in create_port_group
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):     host_network_system.AddPortGroup(portgrp=port_group_spec)
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/VmomiSupport.py", line 706, in <lambda>
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):     self.f(*(self.args + (obj,) + args), **kwargs)
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/VmomiSupport.py", line 512, in _InvokeMethod
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):     return self._stub.InvokeMethod(self, info, args)
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):   File "/usr/local/lib/python3.6/dist-packages/pyVmomi/SoapAdapter.py", line 1397, in InvokeMethod
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):     raise obj # pylint: disable-msg=E0702
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): pyVmomi.VmomiSupport.AlreadyExists: (vim.fault.AlreadyExists) ***
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):    dynamicType = <unset>,
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):    dynamicProperty = (vmodl.DynamicProperty) [],
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):    msg = "The specified key, name, or identifier 'pgName' already exists.",
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):    faultCause = <unset>,
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):    faultMessage = (vmodl.LocalizableMessage) [],
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec):    name = 'pgName'
module.vsphere.null_resource.apply_esx_network_config[1] (remote-exec): ***

Prevent unattended-upgrades from blocking apt-get install

As noted in the README.md Troubleshooting tips:

Occasionally the Ubuntu automatic unattended-upgrades will run at an unfortunate time and lock apt while the script is attempting to run.

null_resource.download_vcenter_iso (remote-exec): E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)

Currently, should this happen, the best resolution is to clean up your deployment and try again.

Prevent this collision by disabling unattended-upgrades until the installation is complete. This may require preconfiguring the package (cloud-init) so it doesn't activate on install.

Provide issue reporting tips

We should update the README.md or create an GH Issue template that offers advice on reporting issues.

This should advise users to include a snippet of the error messages. If Terraform provided a stack trace, including that would be helpful. These should be sanitized of API tokens or IP addresses that may appear in the output.

There are ways to get more detailed output from Terraform by setting environment variables: https://www.terraform.io/docs/internals/debugging.html

The Packet API can also emit its API calls with environment variables (this includes the API token, cookies, and other sensitive details so be extra careful): PACKNGO_DEBUG=1

Some common occurrences have been catalogued here:
https://github.com/packet-labs/google-anthos#troubleshooting. If any problem becomes all too common and outside of quick-fix territory, we should chronicle it there.

Some issues may be filed in alternate repositories, such as github.com/packethost/packngo, github.com/packethost/terraform-provider-packet, or in Anthos projects.

Error: error executing "/tmp/terraform_1532445389.sh": Process exited with status 32

Just tried this on a fresh project.

Here is a bit more detailed TF output, although there were no errors prior to the line in the subject.

packet_port_vlan_attachment.esxi_priv_vlan_attach[3]: Still creating... [1m20s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[2]: Still creating... [50s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[8]: Still creating... [1m0s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[9]: Still creating... [1m20s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[0]: Still creating... [1m20s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[3]: Creation complete after 1m29s [id=48fe5df4-fe75-46a7-b79e-0ca7969e2db5:8e26abb6-6f03-40d1-a62a-9d75b97528e7]
packet_port_vlan_attachment.esxi_priv_vlan_attach[0]: Creation complete after 1m23s [id=16afcb24-9629-4333-a2d6-9d133849d0cc:9a44c9f0-ce98-4736-baf6-8f0200ac139c]
packet_port_vlan_attachment.esxi_priv_vlan_attach[2]: Still creating... [1m0s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[8]: Still creating... [1m10s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[9]: Still creating... [1m30s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[2]: Still creating... [1m10s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[8]: Creation complete after 1m19s [id=66baef95-94d5-4e8a-9fcc-95d9e624c200:9a44c9f0-ce98-4736-baf6-8f0200ac139c]
packet_port_vlan_attachment.esxi_priv_vlan_attach[9]: Still creating... [1m40s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[2]: Still creating... [1m20s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[9]: Still creating... [1m50s elapsed]
packet_port_vlan_attachment.esxi_priv_vlan_attach[9]: Creation complete after 1m51s [id=16afcb24-9629-4333-a2d6-9d133849d0cc:0e99fe28-aa4a-485a-85c4-66bcf2e0932a]
packet_port_vlan_attachment.esxi_priv_vlan_attach[2]: Creation complete after 1m25s [id=48fe5df4-fe75-46a7-b79e-0ca7969e2db5:e8e6d4b6-13b3-491f-8fc5-5a2a37c68db2]

Error: error executing "/tmp/terraform_1532445389.sh": Process exited with status 32

Timeouts result in Terraform provider crashes

As reported in the #google-anthos channel of the Packet Community Slack, the VMware ESX provisioning phases take a longer amount of time and subsequently time out more frequently.

Several members of the channel have identified a common error being reported during this phase, in providers 2.7.5 and 2.10.0 alike, perhaps others:

2020-07-08T11:04:33.716Z [DEBUG] plugin.terraform-provider-packet_v2.7.5_x4: 2020/07/08 11:04:33 [DEBUG] GET https://api.packet.net/devices/dbeddf64-1394-4427-b1cc-fcbedf831683?include=project,facility
2020-07-08T11:04:34.044Z [DEBUG] plugin.terraform-provider-packet_v2.7.5_x4: 2020/07/08 11:04:34 [TRACE] Waiting 10s before next try
2020-07-08T11:04:34.127Z [DEBUG] plugin.terraform-provider-packet_v2.7.5_x4: 2020/07/08 11:04:34 [WARN] WaitForState timeout after 1h0m0s
2020-07-08T11:04:34.127Z [DEBUG] plugin.terraform-provider-packet_v2.7.5_x4: 2020/07/08 11:04:34 [WARN] WaitForState starting 30s refresh grace period
2020-07-08T11:04:34.130Z [DEBUG] plugin.terraform-provider-packet_v2.7.5_x4: panic: interface conversion: interface {} is nil, not string
2020-07-08T11:04:34.130Z [DEBUG] plugin.terraform-provider-packet_v2.7.5_x4: 
2020-07-08T11:04:34.130Z [DEBUG] plugin.terraform-provider-packet_v2.7.5_x4: goroutine 262 [running]:

Reserve some IPs from VM Private Net for services VIPs

Users should assign services VIPs from the VM Private Subnet (172.16.3.0/24 by default). However currently we use the entire subnet as the DHCP range and reserve known IPs as needed.

We should constrict the range to say 150 IPs (up to .150) which would leave 154 IPs (in a /24) for VIPs. That's more than enough dynamic IPs and pool for VIP assignment.

Doing this to all subnets within user_data.py would be the easiest way to achieve this.

Uniform Standards Request: Experimental

Hello!

We believe this repository is Experimental and therefore needs the following files updated:

If you feel the repository should be maintained or end of life or that you'll need assistance to create these files, please let us know by filing an issue with https://github.com/packethost/standards.

Packet maintains a number of public repositories that help customers to run various workloads on Packet. These repositories are in various states of completeness and quality, and being public, developers often find them and start using them. This creates problems:

  • Developers using low-quality repositories may infer that Packet generally provides a low-quality experience.
  • Many of our repositories are put online with no formal communication with, or training for, customer success. This leads to a below-average support experience when things do go wrong.
  • We spend a huge amount of time supporting users through various channels when, with better upfront planning, documentation, and testing, much of this support work could be eliminated.

To that end, we propose three tiers of repositories: Private, Experimental, and Maintained.

As a resource and example of a maintained repository, we've created https://github.com/packethost/standards. This is also where you can file any requests for assistance or modification of scope.

The Goal

Our repositories should be the example from which adjacent, competing, projects look for inspiration.

Each repository should not look entirely different from other repositories in the ecosystem, having a different layout, a different testing model, or a different logging model, for example, without reason or recommendation from the subject matter experts from the community.

We should share our improvements with each ecosystem while seeking and respecting the feedback of these communities.

Whether or not strict guidelines have been provided for the project type, our repositories should ensure that the same components are offered across the board. How these components are provided may vary, based on the conventions of the project type. GitHub provides general guidance on this which they have integrated into their user experience.

Support for different datastore types

With vSAN having issues when selecting different instance types, and some instance types having very limited storage, I think it's a good idea to add support for NFS. I am currently doing this by hand: creating a new storage volume, attaching it to the router machine, and hosting an NFS server there allows for a datastore of any size.

For reference, here's the doc for attaching storage volume to machine: https://www.packet.com/resources/guides/elastic-block-storage/
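For anyone wanting to script the manual steps above, here is a rough sketch (not part of this module) that mounts an NFS export onto the cluster with govc; the server address, export path, and cluster name are placeholders, and the govc flags should be verified against your govc version:

# Sketch: mount an NFS datastore exported by the router onto the ESXi cluster.
# Assumes GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD are already set in the environment.
import subprocess

NFS_SERVER = "172.16.0.1"        # hypothetical: the router's private address
NFS_EXPORT = "/nfs/datastore"    # hypothetical export path on the router
DATASTORE = "nfs-datastore-01"
CLUSTER = "cluster1"             # placeholder inventory name; adjust to your cluster

subprocess.run(
    [
        "govc", "datastore.create",
        "-type", "nfs",
        "-name", DATASTORE,
        "-remote-host", NFS_SERVER,
        "-remote-path", NFS_EXPORT,
        CLUSTER,
    ],
    check=True,
)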

Create an option to use OpenVPN

L2TP VPN can be problematic, especially for Windows clients.

We should explore an option to install an OpenVPN server on the edge gateway as an alternative.

Move resource pool create from upload_ova.sh to deploy_vcva.py

We should move line 12, govc pool.create '*/${vmware_resource_pool}', to deploy_vcva.py for users who are not completing the Anthos prerequisites.

This is the more logical place for it anyhow.

We may also want to refactor upload_ova.sh into a .py to keep things consistent.
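As a hedged sketch of what the moved step might look like inside deploy_vcva.py, assuming the resource pool name is passed in via an environment variable (an assumption, not how the script is wired today):

# Sketch: create the resource pool from deploy_vcva.py instead of upload_ova.sh.
import os
import subprocess


def create_resource_pool(resource_pool):
    """Mirror line 12 of upload_ova.sh, using the same '*/<pool>' path wildcard."""
    subprocess.run(["govc", "pool.create", f"*/{resource_pool}"], check=True)


if __name__ == "__main__":
    # VMWARE_RESOURCE_POOL is a hypothetical variable name for this sketch.
    create_resource_pool(os.environ.get("VMWARE_RESOURCE_POOL", "anthos-pool"))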

terraform apply should fail on error

Currently, when "terraform apply" exits, there is no easy way to know whether the operation was truly successful (yielding a functioning cluster). It is common for terraform apply to exit with status 0 (success) even if some steps failed, leading to a non-functional cluster.

Ideally, the command should exit 0 only if the cluster was correctly created.

If this cannot be achieved, then the sources should include some kind of test script that can be run to validate the cluster.

Short of that, at least a script could be provided to scan the logfile for known fatal errors, while ignoring known "harmless" errors.
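A minimal sketch of that last fallback, assuming the apply output has been captured to a file (for example with terraform apply -no-color 2>&1 | tee apply.log); the fatal and harmless patterns are illustrative and would need to be tuned against real failure logs from this module:

# Sketch: scan a captured terraform apply log for known fatal errors.
import re
import sys

FATAL_PATTERNS = [
    r'Error: error executing "/tmp/terraform_\d+\.sh"',
    r'\[FAILURE\]',
    r'no such file or directory',
]
HARMLESS_PATTERNS = [
    r'Still creating\.\.\.',   # progress noise, not an error
]


def scan(path):
    fatal = []
    with open(path) as f:
        for line in f:
            if any(re.search(p, line) for p in HARMLESS_PATTERNS):
                continue
            if any(re.search(p, line) for p in FATAL_PATTERNS):
                fatal.append(line.rstrip())
    return fatal


if __name__ == "__main__":
    errors = scan(sys.argv[1] if len(sys.argv) > 1 else "apply.log")
    for e in errors:
        print(e)
    sys.exit(1 if errors else 0)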

Implement code to clean up assigned IP blocks

Now that we have the IP block assignment fix in, we can implement code to give back the IP blocks after the ESXi hosts are converted to Layer 2, so that the IPs are returned.

This would reduce costs ($0.04 per hour!!) and, more importantly, return the IPs so they can be used on other projects.

I've tested manually deleting the IP blocks after the project was created, and it had no impact on the environment or on the destroy procedure.
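An illustrative sketch of the cleanup, calling the Packet REST API directly; the tag used to decide which reservations are safe to release is an assumption about how the module could label them, not something it does today:

# Sketch: release ESXi public IP reservations after the hosts move to Layer 2.
import os
import requests

API = "https://api.packet.net"
HEADERS = {"X-Auth-Token": os.environ["PACKET_AUTH_TOKEN"]}


def release_esxi_ip_blocks(project_id):
    resp = requests.get(f"{API}/projects/{project_id}/ips", headers=HEADERS)
    resp.raise_for_status()
    for block in resp.json().get("ip_addresses", []):
        # Only release blocks marked as created for the ESXi hosts (assumed tag).
        if "anthos-esxi" in (block.get("tags") or []):
            requests.delete(f"{API}/ips/{block['id']}", headers=HEADERS).raise_for_status()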

Add enableha=true to load balancer config

1.3 will introduce a new field within loadbalancerconfig to enable HA, aptly named enableha.

We should enable this for releases 1.3 and beyond, but if this field is present in a 1.2 deployment, the config will be invalid.

Some sed-fu should be applied to make this work.
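One way the version gate could work, sketched in Python rather than sed; the config file name and YAML layout are assumptions based on this issue, not the repo's actual templates:

# Sketch: only inject enableha for GKE on-prem 1.3+ so 1.2 configs stay valid.
def maybe_enable_ha(config_text, anthos_version):
    """Append enableha: true under loadbalancerconfig when the version allows it."""
    major, minor = (int(x) for x in anthos_version.split(".")[:2])
    if (major, minor) < (1, 3):
        return config_text  # field would be rejected by 1.2 and earlier
    return config_text.replace(
        "loadbalancerconfig:",
        "loadbalancerconfig:\n  enableha: true",
        1,
    )


if __name__ == "__main__":
    with open("admin-cluster.yaml") as f:      # hypothetical config file name
        text = f.read()
    print(maybe_enable_ha(text, "1.3.0-gke.16"))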

Deployment fails when respecting the README.MD

The changes introduced with the vsphere module break the entire repository.
Variables in TF are undocumented or broken (object_store_tool is undocumented, while "s3_boolean" is not doing anything anymore). Also, I was not able to deploy and continuously got NSX errors like:
Error: remote-exec provisioner error

with module.vsphere.null_resource.apply_esx_network_config[0],
on .terraform/modules/vsphere/main.tf line 408, in resource "null_resource" "apply_esx_network_config":
408: provisioner "remote-exec" {

error executing "/tmp/terraform_433087771.sh": Process exited with status 1
This seems to be the python3 code from the vSphere module (main.tf line 408...)

Would it be possible to bring the README up to date and fully document the changes? Also, could you test this with Anthos on VMware version 1.7 (the most recent; 1.5 is rather old)?

VCVA deployment times out

When deploying the vCenter virtual appliance, it always fails with a message like this before timing out:

null_resource.deploy_vcva (remote-exec): Error:     Problem Id: None                                                                 
null_resource.deploy_vcva (remote-exec): Component key: setnet     Detail:
null_resource.deploy_vcva (remote-exec): Failed to set the time via NTP. Details:                                                    
null_resource.deploy_vcva (remote-exec): Failed to sync to NTP servers.. Code:                                                       
null_resource.deploy_vcva (remote-exec): com.vmware.applmgmt.err_ntp_sync_failed                                                     
null_resource.deploy_vcva (remote-exec): Could not set up time synchronization.                                                      
null_resource.deploy_vcva (remote-exec): Resolution: Verify that provided ntp                                                        
null_resource.deploy_vcva (remote-exec): servers are valid.                                                                          
null_resource.deploy_vcva (remote-exec):  [FAILED] Task: MonitorDeploymentTask:                                                      
null_resource.deploy_vcva (remote-exec): Monitoring Deployment execution failed                                                      
null_resource.deploy_vcva (remote-exec): at 13:45:13                                                                                 
null_resource.deploy_vcva (remote-exec): ========================================                                                    
null_resource.deploy_vcva (remote-exec): Error message: The appliance REST API                                                       
null_resource.deploy_vcva (remote-exec): was not yet available from the target                                                       
null_resource.deploy_vcva (remote-exec): VCSA 'vcva'because 'Failed to query                                                         
null_resource.deploy_vcva (remote-exec): deployment status for appliance vcva                                                        
null_resource.deploy_vcva (remote-exec): after trying all ip addresses'. The VCSA                                                    
null_resource.deploy_vcva (remote-exec): might still be starting up.                                                                 
null_resource.deploy_vcva (remote-exec): =============== 13:45:13 ===============                                                    
null_resource.deploy_vcva (remote-exec): Result and Log File Information...                                                          
null_resource.deploy_vcva (remote-exec): WorkFlow log directory:
null_resource.deploy_vcva (remote-exec): /tmp/vcsaCliInstaller-2021-02-01-13-12-fzgl6ein/workflow_1612185129050

Details:
vCenter ISO: VMware-VCSA-all-6.7.0-15132721.iso
esxi_size = c3.medium.x86

I have worked around it by changing the VCVA appliance ntp to time.google.com:

sed -i 's/time\.nist\.gov/time.google.com/' templates/vcva_template.json

This is not a permanent fix, as ideally the deployment would fail at this stage instead of continuing until a timeout. I will also add a PR doing this.
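A possible pre-flight check, sketched below, would probe the configured NTP server over UDP/123 before kicking off the VCVA installer and abort early if it does not answer; the default server name is only a placeholder:

# Sketch: fail fast if the NTP server in the VCVA template is unreachable.
import socket
import sys


def ntp_reachable(server, timeout=5):
    """Send a minimal SNTP client request and wait for any reply."""
    packet = b"\x1b" + 47 * b"\0"  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(packet, (server, 123))
            s.recvfrom(48)
            return True
        except OSError:
            return False


if __name__ == "__main__":
    server = sys.argv[1] if len(sys.argv) > 1 else "time.google.com"
    if not ntp_reachable(server):
        print(f"NTP server {server} is not responding; aborting VCVA deploy")
        sys.exit(1)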

provide working config examples

There could be a directory of concrete configurations that are known to work. These would be helpful for users and could be used in automated tests. It would also help to have separate files for private vars vs. non-private vars, as suggested in issue 115. If this is done, then only the non-private side would be shared, of course.

Create a CI/CD workflow for contributions and periodic testing of the default branch

This project is very complex, spanning the creation and provisioning of physical hardware, virtual networking, operating systems, and licensed software installations, with Anthos and Kubernetes clusters eventually coming out of the 1h+ provisioning process.

In order to allow this project to evolve quickly while encouraging users to safely depend on this module, we must introduce continuous integration and continuous delivery practices.

PRs should be verifiable using a CI/CD that does the following things:

  • builds the full environment
  • tears down the full environment (including, and especially, when the tests fail)
  • tear down (sweepers) should be run at the start of each build job to ensure that deletions are successful
    • sweepers look for a predefined resource name prefix and/or the presence of well-known tags
    • we could benefit from a generic all-purpose sweeper script, possibly leveraging the packet-cli here (see the sketch after this list)
    • an all-purpose sweeper could be one that removes all resources from a given project (by name); the Packet API does not allow a project to be deleted until all resources within it have been deleted
  • the build pipeline must be limited to one build at a time
    • so sweepers do not remove infrastructure needed for new tests
    • so provisioned hardware is kept to a minimum
  • Merges are blocked until the CI has completed successfully (this can be skipped in some cases, such as documentation changes)
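A rough sketch of the generic sweeper idea, using the Packet REST API to delete devices in a dedicated CI project; it only covers devices (VLANs, IP reservations, and the project itself would need the same treatment), and the name prefix and environment variables are assumptions:

# Sketch: sweep CI-created devices from a dedicated project before a new build.
# Pagination is not handled here; a real sweeper would follow the API's paging.
import os
import requests

API = "https://api.packet.net"
HEADERS = {"X-Auth-Token": os.environ["PACKET_AUTH_TOKEN"]}


def sweep_devices(project_id, name_prefix="ci-anthos-"):
    resp = requests.get(f"{API}/projects/{project_id}/devices", headers=HEADERS)
    resp.raise_for_status()
    for dev in resp.json().get("devices", []):
        if dev["hostname"].startswith(name_prefix):   # well-known CI prefix (assumed)
            requests.delete(f"{API}/devices/{dev['id']}", headers=HEADERS).raise_for_status()


if __name__ == "__main__":
    sweep_devices(os.environ["PACKET_PROJECT_ID"])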

Errors may be transparent to users in an otherwise successful Terraform build. The build phase must verify that Terraform and the resulting environment are in working order (a minimal health-check sketch follows this list):

  • Verifies that terraform apply succeeded
  • Verifies that all scripts succeeded (this could be configured within each script and terraform provisioner error handling)
  • Verifies that the ESXi environment is healthy
  • Verifies that the Anthos host environment is healthy
  • Verifies that Anthos guest environments are created successfully
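A minimal health-check sketch for the last two items, assuming kubectl and a kubeconfig written by the deployment are available to the pipeline (the kubeconfig path is a placeholder):

# Sketch: gate the pipeline on every node of a deployed cluster reporting Ready.
import json
import subprocess
import sys

KUBECONFIG = "kubeconfig"  # hypothetical path to the cluster kubeconfig


def all_nodes_ready(kubeconfig):
    out = subprocess.run(
        ["kubectl", "--kubeconfig", kubeconfig, "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    nodes = json.loads(out)["items"]
    for node in nodes:
        conditions = {c["type"]: c["status"] for c in node["status"]["conditions"]}
        if conditions.get("Ready") != "True":
            return False
    return bool(nodes)


if __name__ == "__main__":
    sys.exit(0 if all_nodes_ready(KUBECONFIG) else 1)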

For now, CD means:

  • the default branch is also tested when updated.
  • the terraform module is tagged and published so that users may safely and easily pin to a previous version of this project

Additionally, this requires:

Make dnsmasq changes more resilient

Maybe it's too much of a one-off, but I ran into an issue where Terraform exited with an error when running 33-anthos-deploy-admin-workstation.tf. When I restart, things break pretty badly, because replace_tf_vars.py ran before the failure and added an entry in dhcp.conf for the admin workstation. When I re-run Terraform it executes the script again, which adds the line again, and then dnsmasq no longer starts because it sees two reservations for the same IP. The entire ESXi cluster then starts to fall apart, because hosts are added to vCenter by FQDN, and once dnsmasq is down their names can no longer be resolved. A quick check that the line being added is not already the last line in the file would make the script more resilient (see the sketch below).
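A minimal sketch of that idempotency check, generalized slightly to look at the whole file rather than just the last line; the config path and reservation line in the usage comment are guesses, not the module's actual values:

# Sketch: append a dnsmasq directive only if it is not already present.
def append_if_missing(path, line):
    line = line.rstrip("\n")
    try:
        with open(path) as f:
            existing = {l.rstrip("\n") for l in f}
    except FileNotFoundError:
        existing = set()
    if line in existing:
        return False          # already reserved; don't break dnsmasq with a duplicate
    with open(path, "a") as f:
        f.write(line + "\n")
    return True

# e.g. append_if_missing("/etc/dnsmasq.d/dhcp.conf",
#                        "dhcp-host=00:50:56:00:00:01,admin-workstation,172.16.3.20")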

anthos 1.4.1-gke.1 errors during terraform apply

I can't get "terraform apply" to work with Anthos 1.4.1-gke.1. Typically the apply exits with status 0 but the cluster is not functional. I can see error messages in the log, like this:

null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): - [FAILURE] ping (availability): Following IPs could already be in use: [172.16.3.17]

and

apply.log.04:null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): error: stat /home/ubuntu/cluster/djfong-841-kubeconfig: no such file or directory

can't destroy after a failed deployment

After a failed provision, the reserved IP block lookup can reference an invalid index, preventing terraform destroy from succeeding.

$ terraform apply
...
Error: Device in non-active state "failed"

  on 04-esx-hosts.tf line 1, in resource "packet_device" "esxi_hosts":
   1: resource "packet_device" "esxi_hosts" {
$ terraform destroy
...
Error: Error in function call

  on 04-esx-hosts.tf line 15, in resource "packet_device" "esxi_hosts":
  15:     reservation_ids = [element(packet_reserved_ip_block.esx_ip_blocks.*.id, count.index)]
    |----------------
    | count.index is 0
    | packet_reserved_ip_block.esx_ip_blocks is empty tuple

Call to function "element" failed: cannot use element function with an empty
list.

Using VMware vSphere 6.7 for ESXi hosts

I tried to modify the variable "vmware_os" to use the more recent 6.7 version, but that failed.
Is there a way to use the more recent 6.7 version for the vSphere hosts? Given that 6.5 is more than 5 years old by now, switching to 6.7 would be the far better option.
Also, Anthos fully supports vSphere 6.7 as of version 1.2.0.

We are not reserving IPs for the seesaw nor VIPs

Currently we do not reserve any IPs from the private network DHCP pool when we assign IPs/VIPs for the seesaw.

We risk an IP collision between the deployed Anthos VMs and the seesaw IPs/VIPs.
