
cos's Issues

Remove instance key as required parameter

Currently an instance key is required to be configured and available in AWS.
This was done to ease debugging and issue investigation during development.

With the AWS Systems Manager an alternative exists that avoids the use of SSH keys.
Furthermore, no bastion setup is needed in order to interact with the instance.

Nevertheless, the AWS Systems Manager has to be supported by the AMI.

Missing network device dependency

Because the nomad service unit does not declare a dependency on the network device, nomad is started during system boot before the network is up and therefore cannot connect to the cluster.
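A common fix for this class of problem is to order the unit after the network is actually online; a minimal sketch for the nomad unit file (path and existing unit contents assumed):

```ini
# /etc/systemd/system/nomad.service (ordering sketch, path assumed)
[Unit]
Description=Nomad
# wait until the network is actually up, not just the device configured
Wants=network-online.target
After=network-online.target
```

Note that `network-online.target` must itself be reachable on the AMI (e.g. via `NetworkManager-wait-online` or `systemd-networkd-wait-online`) for the ordering to have an effect.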

Upgrade: Restart Policy for systemd services

Summary

To support rolling upgrades of the cluster orchestration system the restart policy for the nomad and consul services needs to be enhanced.

Expectation

In case of a normal shutdown of the application, the service should not restart automatically.
In case of an abnormal shutdown of the application, the service should be restarted automatically by systemd.

Hint
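One way to express exactly this in systemd (a sketch, assuming plain service units for nomad and consul) is `Restart=on-failure`, which restarts on a non-zero exit code or signal but not on a clean stop:

```ini
[Service]
# restart only on abnormal termination (non-zero exit code, unclean signal, timeout);
# a clean `systemctl stop` or zero exit code does not trigger a restart
Restart=on-failure
RestartSec=5
```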

Packer build of Amazon Linux 2 fails

Calling packer build -var 'aws_region=us-east-1' -var 'ami_regions=us-east-1' nomad-consul-docker-ecr.json

Fails with:

==> amazon-linux-ami2: Error modify AMI attributes: InvalidAMIAttributeItemValue: Invalid attribute item value " " for userId item type.
==> amazon-linux-ami2:  status code: 400, request id: 672be7fd-f89b-49ba-a76c-426341955d8d
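The error occurs in the launch-permission modification step. A likely cause (an assumption, not verified here) is a blank entry reaching `ami_users` in the Packer template, e.g. when a user variable defaults to an empty string:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "ami_users": ["{{user `ami_account_ids`}}"]
  }]
}
```

If `ami_account_ids` is unset or whitespace, AWS rejects the resulting `" "` as an invalid `userId`, which matches the message above. Guarding the variable or omitting `ami_users` when empty would be the fix.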

Nomad does not start with newest AMI version

Problem

The Linux AMI2 build based on the current state of master is buggy.
On instances based on this AMI, nomad does not start at all.

● nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mi 2018-09-26 09:25:15 UTC; 2h 9min ago
     Docs: https://nomadproject.io/docs/
  Process: 4462 ExecStart=/opt/nomad/bin/nomad agent -config /opt/nomad/config -data-dir /opt/nomad/data (code=exited, status=1/FAILURE)
 Main PID: 4462 (code=exited, status=1/FAILURE)

Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: Started Nomad.
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: Starting Nomad...
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration nomad[4462]: No configuration loaded from /opt/nomad/config
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration nomad[4462]: ==> Must specify either server, client or dev mode for the agent.
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: Unit nomad.service entered failed state.
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: nomad.service failed.

Every container is accessible over ingress ALB

All traffic on the ingress ALB is routed to the services subnet of the public-services data-center, where fabio runs on each node.
Currently fabio is able to route traffic to every location inside the VPC, and will do so as soon as a job registers itself at consul using the tagging-mechanism.

For example, prometheus is running on the backoffice nodes. Fabio will route traffic to them, since prometheus currently registers at consul using the tagging-mechanism.

Don't use external repositories to get images/ binaries for Nomad

Why

We want to restrict the internet access of the nomad-masters (leader). That's why they are inside a subnet that only has access to AWS services. This restriction is made by allowing only routes to AWS services as specified at: https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

Problem - binaries/images from non-ECR sources

The fabio binary is loaded directly from GitHub, but there is no route that allows egress access to GitHub.

Refactor amazon-ecr-credential-helper usage

The configuration is scattered across several places, and it is unclear what is necessary and actually used.

The docker config.json exists in two places:

  • ~/.docker/config.json
  • /etc/docker/config.json

In addition, a yum package is now available for Amazon Linux 2 LTS:

This will facilitate the binary installation.

EBS attachment support

We need to be able to attach user-defined EBS volumes to the nomad nodes.
It should be possible to attach more than one EBS volume to the nodes of a specific data-center.
Mounting of the attached volumes should happen automatically during instance creation.
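A minimal Terraform sketch of what such support could look like (resource wiring and variable names are assumptions, not the module's actual API; for ASG-based nodes the attachment would rather happen via user-data at boot):

```hcl
# sketch: one user-defined volume attached to a data-center node (names assumed)
resource "aws_ebs_volume" "data" {
  availability_zone = "${var.availability_zone}"
  size              = 100 # GiB
}

resource "aws_volume_attachment" "data" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.data.id}"
  instance_id = "${var.instance_id}"
}
```

The automatic mounting part (mkfs on first boot, fstab entry) would still have to be handled by the cloud-init script.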

Prometheus Port on Backoffice is Closed

Root-Cause

The backoffice data-center nodes are not allowed to communicate with each other over port 4646.
But this port is needed in order to be able to scrape metrics from them.

Task

  • Allow the backoffice-nodes to access each other over port 4646.
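A sketch of the corresponding Terraform rule (resource and variable names are assumptions about the module's internals):

```hcl
# sketch: allow the backoffice nodes to reach each other on the nomad HTTP port
resource "aws_security_group_rule" "nomad_http_self" {
  type              = "ingress"
  from_port         = 4646
  to_port           = 4646
  protocol          = "tcp"
  self              = true # source is the same security group, i.e. the backoffice nodes themselves
  security_group_id = "${var.backoffice_sg_id}" # assumed variable name
}
```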

Upgrade nomad to 0.8.0

Released just a few days ago ... we should upgrade.
Changelog: https://github.com/hashicorp/nomad/blob/v0.8.0/CHANGELOG.md

Interesting improvements:

  • core: Servers can now service client HTTP endpoints [GH-3892] ... would solve #9
  • cli: Node status and filesystem related commands do not require direct network access to the Nomad client nodes [GH-3892]
  • ui: All views poll for changes using long-polling via blocking queries [GH-3936]

Cleanup ami module

The recommended, mainly used, and tested module is ami2.
There are no known use cases that require preserving AMI creation via the module ami.

Hence this issue should be used to remove the unused module.

Access to logs does not work

Problem

Calling nomad logs -stderr -f -job ping_service locally (even with an active sshuttle)
does not show any logs.
When the command is executed directly on the server, it works.

Docker ports open to 0.0.0.0/0

There is no need to have the docker ports (20000...32000) open to the world.
At a minimum they can be restricted to the CIDR of the used VPC; better still, connect the security groups of the nodes accordingly.

Refactoring

cidr_blocks -> source_security_group_id
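The refactoring could look like the following Terraform sketch (resource and variable names are assumptions):

```hcl
# before: open to a CIDR range
#   cidr_blocks = ["${var.vpc_cidr}"]
# after: only members of the referencing security group may connect (names assumed)
resource "aws_security_group_rule" "docker_ports_ingress" {
  type                     = "ingress"
  from_port                = 20000
  to_port                  = 32000
  protocol                 = "tcp"
  source_security_group_id = "${var.alb_sg_id}" # e.g. the ingress ALB's security group
  security_group_id        = "${var.node_sg_id}"
}
```

Referencing a security group instead of a CIDR block keeps the rule correct even when the VPC layout changes.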

Remove EFS and S3 dependencies in CloudInit of Nomad Client Nodes

Issue

Currently there are two variables that are used to modify the cloud-init script from the outside.
This adds an unnecessary dependency on components that are not needed for the COS deployment. Thus the COS is less usable and harder to set up, since all dependencies have to be satisfied.

Goal

Remove the unneeded dependencies and the corresponding variables.
https://github.com/MatthiasScholz/cos/blob/master/modules/nomad-datacenter/user-data-nomad-client.sh#L53

Make instance count configurable on DC level

It would be nice if the number of nodes per data-center were adjustable.
For example, the backoffice dc might need only a fraction of the nodes required for the private- or public-services dc.

In nw-separation setup nomad fails to download the images

04/14/18 08:58:48 UTC Restarting Task restarting in 30.185113191s
04/14/18 08:58:48 UTC Driver Failure failed to initialize task "ping_service_task" for alloc "b9f34abf-f20c-27ff-6341-5c1040f9476f": Failed to pull 307557990628.dkr.ecr.us-east-1.amazonaws.com/service/ping-service:0.0.7: API error (500): {"message":"Get https://307557990628.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}

Provide possibility to inject userdata

Some use-cases (e.g. mounting an EFS mount-target) are best implemented via user-data when the instance is created.
With the current module API it is not possible to add steps to the user-data of the nomad nodes.

It would be nice to have this option.
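A sketch of how the module API could expose this (variable name and template wiring are assumptions):

```hcl
# sketch: module variable for extra user-data steps (variable name assumed)
variable "extra_user_data" {
  description = "Shell snippet appended to the generated cloud-init script"
  default     = ""
}

# appended at the end of the rendered user-data template
locals {
  user_data = "${data.template_file.user_data.rendered}\n${var.extra_user_data}"
}
```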

dnsmasq not used for DNS based service-discovery

Since not all services support consul based service-discovery, we added dnsmasq on the instances.
As specified at https://www.consul.io/docs/guides/forwarding.html dnsmasq can be used to intercept all queries to the consul domain to 127.0.0.1:8600.
There at 127.0.0.1:8600 the consul-agent is running which is able to provide a service-discovery based on the service-catalog.

But dnsmasq is not configured correctly to intercept these local calls.
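As described in the linked consul forwarding guide, the intended configuration is a single server line that forwards the .consul domain to the local consul agent:

```
# /etc/dnsmasq.d/10-consul (per the consul DNS forwarding guide)
server=/consul/127.0.0.1#8600
```

All other queries continue to be resolved via the regular upstream resolvers.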

Support AWS System Manager Session Manager

Summary

Making use of the AWS Session Manager will allow deprecating the bastion setup used to debug cluster issues where direct instance access is needed.

Using the AWS Session Manager provides better security and less infrastructure to maintain and pay for.

Details

  • ensure the AWS Session Manager agent is installed on the instances ( included by default in Amazon Linux 2 )
  • ensure the instance is allowed to interact with AWS Session Manager ( instance profile )
  • clean up the documentation to advertise AWS Session Manager over the bastion setup ( +sshuttle )

Add https listener for ui-ALB's

Problem

  • Currently the communication over the UI-ALBs to the COS components uses HTTP. Thus no encryption in transit is in place.

Unable to build AMI with packer

When installing consul, sudo yum update -y is called.
During this process the OS is not able to load some packages.

amazon-linux-ami2: Updated:
amazon-linux-ami2: amazon-linux-extras.noarch 0:1.4-1.amzn2
amazon-linux-ami2: aws-cfn-bootstrap.noarch 0:1.4-30.amzn2
amazon-linux-ami2: dhclient.x86_64 12:4.2.5-58.amzn2.3.2
amazon-linux-ami2: dhcp-common.x86_64 12:4.2.5-58.amzn2.3.2
amazon-linux-ami2: dhcp-libs.x86_64 12:4.2.5-58.amzn2.3.2
amazon-linux-ami2: dotnet-host.x86_64 0:2.1.0_preview2_26411_07-1
amazon-linux-ami2: ec2-utils.noarch 0:0.5-1.amzn2.0.1
amazon-linux-ami2: kernel-tools.x86_64 0:4.14.33-59.34.amzn2
amazon-linux-ami2: mssql-server.x86_64 0:14.0.3025.34-3
amazon-linux-ami2:
amazon-linux-ami2: Failed:
amazon-linux-ami2: msodbcsql17.x86_64 0:17.0.1.1-1 msodbcsql17.x86_64 0:17.1.0.1-1
amazon-linux-ami2: mssql-tools.x86_64 0:17.0.1.1-1 mssql-tools.x86_64 0:17.1.0.1-1

Restrict access to nomad-servers

see nomad/servers.tf

 # HACK: Still everything open for the nomad-servers. Has to be closed.
  allowed_inbound_cidr_blocks = ["0.0.0.0/0"]
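A sketch of the fix, restricting inbound access to the VPC instead of the world (the variable name is an assumption):

```hcl
# restrict inbound access to the VPC instead of 0.0.0.0/0 (variable name assumed)
allowed_inbound_cidr_blocks = ["${var.vpc_cidr_block}"]
```

Even better would be referencing the security groups of the nomad clients and consul servers directly, as proposed for the docker ports.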

Update Nomad/Consul ami to use consul 1.3.1

The newest released consul version is 1.3.1.
Released on November 13.

With this upgrade, besides bugfixes, we also get features like Connect Envoy Support (part of v1.3.0).

Cached Consul Node Id

This issue is similar to this one: #25.

The data cached by consul is baked into the snapshot and hence reused when the snapshot gets instantiated a second time.

Names of nomad and consul instances are unclear

Current state

nomad-client (dc public-services): COS-public-services-shiner
consul: COS-consul-shiner
nomad-server: nomad-example-server

Problem

  1. The names can't be long, since they are used inside the official nomad-module as a name prefix, which introduces size limits (32 characters for target groups, 64 for security groups)
  2. To avoid collisions the name should contain a random part.

Proposal

  1. Keep random part at the end
  2. Restrict the data-center name (only within these names) to a maximum of 10 characters.
  3. Add abbrev. for node-type
  • nomad-client: NMC
  • nomad-server: NMS
  • consul: SDCFG (Service Discovery + Configuration)

Examples

COS-SDCFG-consul-shiner
COS-NMC-public-ser-shiner
COS-NMC-private-se-shiner
COS-NMC-backoffice-shiner
COS-NMC-content-se-shiner
COS-NMS-leader-shiner

UDP blocked for docker ports

Some services (e.g. thanos) need to be able to communicate with their cluster components using UDP.
Currently only TCP traffic is allowed across the docker ports.

--> open UDP.
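A sketch of the additional Terraform rule, mirroring the existing TCP rule for UDP (resource and variable names are assumptions):

```hcl
# sketch: allow UDP alongside TCP on the docker port range (names assumed)
resource "aws_security_group_rule" "docker_ports_udp" {
  type                     = "ingress"
  from_port                = 20000
  to_port                  = 32000
  protocol                 = "udp"
  source_security_group_id = "${var.node_sg_id}" # nodes talk to each other
  security_group_id        = "${var.node_sg_id}"
}
```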

Fix get_nomad_client_info.sh

Currently it returns only the public-services client node, since it works on instance tags that are no longer valid.

get_nomad_client_info.sh
2018-04-14 10:52:29 [INFO] [get_nomad_client_info.sh] aws ec2 describe-instances --region us-east-1 --profile playground --filter Name=tag:Name,Values=COS-NMC-public-ser-coyote Name=instance-state-name,Values=running
ISTANCE_ID INSTANCE_IP INSTANCE_IP (private)
52.91.138.241 i-0263e643f56bf7faa 10.128.50.55

Script to calculate cidr-blocks for egress_aws NatGW

Why

We want to restrict the internet access of the nomad-masters (leader). That's why they are inside a subnet that only has access to AWS services. This restriction is made by allowing only routes to AWS services as specified at: https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

Problem - access to ECR requires many of the IPs specified at https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

This results in more than 50 route entries for a route table, while the limit per route table is 50.
Of course a limit increase can be requested, but due to the potential performance impact it is not recommended to do so.

With #6 we worked around the issue by widening the CIDRs to /8. As a long-term solution, however, we need more restrictive CIDRs (i.e. /16).
To generate these correctly (including merging them) and optimally (least number of rules possible), we need a sophisticated script.
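A starting point for such a script (a sketch; fetching and filtering the AWS ip-ranges.json is left out) is Python's `ipaddress` module, which can collapse adjacent and overlapping networks into the minimal covering set of rules:

```python
import ipaddress

def merge_cidrs(cidrs):
    """Collapse a list of CIDR strings into the minimal covering set of networks."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

# two adjacent /25 blocks merge into a /24, which then merges
# with the neighbouring /24 into a single /23 route entry
print(merge_cidrs(["10.0.0.0/25", "10.0.0.128/25", "10.0.1.0/24"]))
# → ['10.0.0.0/23']
```

`collapse_addresses` only merges networks that form a valid, aligned supernet, so it never widens a rule beyond what the inputs cover.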

Nomad is not able to pull from DockerHub

When trying to deploy a docker image from docker-hub nomad responds with the following error message:

failed to initialize task "ping_service_task" for alloc "8f46a473-90de-3e96-71bb-149ad2916453": Failed to find docker auth for repo "thobe/ping_service": docker-credential-ecr-login with input "thobe/ping_service" failed with stderr: credentials not found in native keychain

Example job file:

# job>group>task>service
# container for tasks or task-groups that nomad should run
job "ping_service" {
  datacenters = ["public-services"]
  #,"private-services","content-connector","backoffice"]
  type = "service"

  meta {
    my-key = "example"
  }

  # The group stanza defines a series of tasks that should be co-located on the same Nomad client.
  # Any task within a group will be placed on the same client.
  group "ping_service_group" {
    count = 1

    # restart-policy
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }

    ephemeral_disk {
      migrate = false
      size    = "50"
      sticky  = false
    }

    # The task stanza creates an individual unit of work, such as a Docker container, web application, or batch processing.
    task "ping_service_task" {
      driver = "docker"
      # config was split into two stanzas before; merged into one
      config {
        # Docker Hub:
        image = "thobe/ping_service:0.0.9"
        port_map = {
          http = 8080
        }
      }

      logs {
        max_files     = 2
        max_file_size = 10
      }

      resources {
        cpu    = 100 # MHz
        memory = 20 # MB
        network {
          mbits = 10
          port "http" {
          }
        }
      }

      # The service stanza instructs Nomad to register the task as a service using the service discovery integration
      service {
        name = "ping-service"
        tags = ["urlprefix-/ping"] # fabio
        port = "http"
        check {
          name     = "Ping-Service Alive State"
          port     = "http"
          type     = "http"
          method   = "GET"
          path     = "/ping"
          interval = "10s"
          timeout  = "2s"
        }
       }

      env {
        SERVICE_NAME       = "${NOMAD_DC}"
        PROVIDER           = "ping-service"
        # uncomment to enable sd over consul
        CONSUL_SERVER_ADDR = "172.17.0.1:8500"
        #PROVIDER_ADDR = "ping-service:25000"
      }
    }
  }
}

Add possibility to tag instances

Being able to tag datacenter nodes makes it easier to distinguish between different types of them.
This helps especially if you want to find the right target nodes for node draining in a script-based manner.

i.e. for a version upgrade
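A sketch of how the module could expose this (variable name and wiring are assumptions):

```hcl
# sketch: user-defined tags forwarded to the data-center instances (names assumed)
variable "additional_instance_tags" {
  description = "Extra tags applied to all nodes of this data-center"
  default     = {}
}
```

A caller could then pass e.g. `additional_instance_tags = { cos_version = "0.8.0" }` and later filter on that tag when selecting nodes to drain.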

Provide ingress ALB connection to backoffice DC nodes.

It makes sense to separate user ingress-traffic from ops requests (i.e. to the monitoring system).
A natural separation can be done via load-balancer.
To enable this, it has to be possible to connect an ALB to the ASG of the backoffice nodes.

Upgrade to nomad 1.0.2

Time is passing and new versions are available for all of our dependencies.

Update:

  • nomad
  • consul
  • fabio
  • terraform modules
  • testing

Establish CI/CD System

Overview

Running a CI/CD system will improve confidence in changes introduced into the repository, due to the automated execution of verification steps like validity checks, linting, and test execution.

Details

Two systems are currently in evaluation:
