
cos's Issues

Remove instance key as required parameter

Currently an instance key is required to be configured and available in AWS.
This was done to ease debugging and issue investigation during development.

With the AWS Systems Manager an alternative exists that avoids the use of SSH keys.
Furthermore, no bastion setup is needed in order to interact with the instance.

Nevertheless, the AWS Systems Manager has to be supported by the AMI.

Missing network device dependency

Because the nomad service unit does not declare a dependency on the network device, nomad is started during system boot before the network is up and therefore cannot connect to the cluster.
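A common fix for this class of problem is to order the unit after the network is actually online; a minimal sketch for the nomad unit file (path and existing unit contents assumed):

```ini
# /etc/systemd/system/nomad.service (ordering sketch, path assumed)
[Unit]
Description=Nomad
# wait until the network is actually up, not just the device configured
Wants=network-online.target
After=network-online.target
```

Note that `network-online.target` must itself be reachable on the AMI (e.g. via `NetworkManager-wait-online` or `systemd-networkd-wait-online`) for the ordering to have an effect.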

Upgrade: Restart Policy for systemd services

Summary

To support rolling upgrades of the cluster orchestration system the restart policy for the nomad and consul services needs to be enhanced.

Expectation

In case of a normal shutdown of the application, the service should not restart automatically.
In case of an abnormal shutdown of the application, the service should be restarted automatically by systemd.

Hint
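One way to express exactly this in systemd (a sketch, assuming plain service units for nomad and consul) is `Restart=on-failure`, which restarts on a non-zero exit code or signal but not on a clean stop:

```ini
[Service]
# restart only on abnormal termination (non-zero exit code, unclean signal, timeout);
# a clean `systemctl stop` or zero exit code does not trigger a restart
Restart=on-failure
RestartSec=5
```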

Packer build of Amazon Linux 2 fails

Calling packer build -var 'aws_region=us-east-1' -var 'ami_regions=us-east-1' nomad-consul-docker-ecr.json

Fails with:

==> amazon-linux-ami2: Error modify AMI attributes: InvalidAMIAttributeItemValue: Invalid attribute item value " " for userId item type.
==> amazon-linux-ami2:  status code: 400, request id: 672be7fd-f89b-49ba-a76c-426341955d8d
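The error occurs in the launch-permission modification step. A likely cause (an assumption, not verified here) is a blank entry reaching `ami_users` in the Packer template, e.g. when a user variable defaults to an empty string:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "ami_users": ["{{user `ami_account_ids`}}"]
  }]
}
```

If `ami_account_ids` is unset or whitespace, AWS rejects the resulting `" "` as an invalid `userId`, which matches the message above. Guarding the variable or omitting `ami_users` when empty would be the fix.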

Nomad does not start with newest AMI version

Problem

The Linux AMI2 build based on the current state of master is buggy.
On instances based on this AMI, nomad does not start at all.

● nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mi 2018-09-26 09:25:15 UTC; 2h 9min ago
     Docs: https://nomadproject.io/docs/
  Process: 4462 ExecStart=/opt/nomad/bin/nomad agent -config /opt/nomad/config -data-dir /opt/nomad/data (code=exited, status=1/FAILURE)
 Main PID: 4462 (code=exited, status=1/FAILURE)

Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: Started Nomad.
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: Starting Nomad...
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration nomad[4462]: No configuration loaded from /opt/nomad/config
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration nomad[4462]: ==> Must specify either server, client or dev mode for the agent.
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: Unit nomad.service entered failed state.
Sep 26 09:25:15 ip-10-124-52-121.us-east-2-integration systemd[1]: nomad.service failed.

Every container is accessible over ingress ALB

All traffic on the ingress ALB is routed to the services subnet of the public-services data-center, where fabio runs on each node.
Currently fabio is able to route traffic to every location inside the VPC, and will do so as soon as a job registers itself at consul using the tagging-mechanism.

For example, prometheus is running on the backoffice nodes. Fabio will route traffic to them, since prometheus currently registers at consul using the tagging-mechanism.

Don't use external repositories to get images/ binaries for Nomad

Why

We want to restrict the internet access of the nomad-masters (leader). That's why they are inside a subnet that only has access to AWS services. This restriction is made by allowing only routes to AWS services as specified at: https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

Problem - binaries/images from non-ECR sources

The fabio binary is loaded directly from GitHub, but there is no route that allows egress access to GitHub.

Refactor amazon-ecr-credential-helper usage

The configuration is scattered across several places, and it is unclear what is necessary and actually used.

The docker config.json exists in two places:

  • ~/.docker/config.json
  • /etc/docker/config.json

In addition, a yum package is now available for Amazon Linux 2 LTS:

This will facilitate the binary installation.

EBS attachment support

We need to be able to attach user-defined EBS volumes to the nomad nodes.
It should be possible to attach more than one EBS volume to the nodes of a specific data-center.
Mounting of the attached volumes should happen automatically during instance creation.
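A minimal Terraform sketch of what such support could look like (resource wiring and variable names are assumptions, not the module's actual API; for ASG-based nodes the attachment would rather happen via user-data at boot):

```hcl
# sketch: one user-defined volume attached to a data-center node (names assumed)
resource "aws_ebs_volume" "data" {
  availability_zone = "${var.availability_zone}"
  size              = 100 # GiB
}

resource "aws_volume_attachment" "data" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.data.id}"
  instance_id = "${var.instance_id}"
}
```

The automatic mounting part (mkfs on first boot, fstab entry) would still have to be handled by the cloud-init script.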

Prometheus Port on Backoffice is Closed

Root-Cause

The backoffice data-center nodes are not allowed to communicate with each other over port 4646.
But this port is needed in order to be able to scrape metrics from them.

Task

  • Allow the backoffice-nodes to access each other over port 4646.
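A sketch of the corresponding Terraform rule (resource and variable names are assumptions about the module's internals):

```hcl
# sketch: allow the backoffice nodes to reach each other on the nomad HTTP port
resource "aws_security_group_rule" "nomad_http_self" {
  type              = "ingress"
  from_port         = 4646
  to_port           = 4646
  protocol          = "tcp"
  self              = true # source is the same security group, i.e. the backoffice nodes themselves
  security_group_id = "${var.backoffice_sg_id}" # assumed variable name
}
```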

Upgrade nomad to 0.8.0

Released just a few days ago ... we should upgrade.
Changelog: https://github.com/hashicorp/nomad/blob/v0.8.0/CHANGELOG.md

Interesting improvements:

  • core: Servers can now service client HTTP endpoints [GH-3892] ... would solve #9
  • cli: Node status and filesystem related commands do not require direct network access to the Nomad client nodes [GH-3892]
  • ui: All views poll for changes using long-polling via blocking queries [GH-3936]

Cleanup ami module

The recommended, mainly used, and tested module is ami2.
There are no known use cases that require preserving AMI creation via the module ami.

Hence this issue should be used to remove the unused module.

Access to logs does not work

Problem

Calling nomad logs -stderr -f -job ping_service locally (even with an active sshuttle)
does not show any logs.
When the command is executed directly on the server, it works.

Docker ports open to 0.0.0.0/0

There is no need to have the docker ports (20000...32000) open to the world.
At a minimum they can be restricted to the CIDR of the used VPC; better still, connect the security groups of the nodes accordingly.

Refactoring

cidr_blocks -> source_security_group_id
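The refactoring could look like the following Terraform sketch (resource and variable names are assumptions):

```hcl
# before: open to a CIDR range
#   cidr_blocks = ["${var.vpc_cidr}"]
# after: only members of the referencing security group may connect (names assumed)
resource "aws_security_group_rule" "docker_ports_ingress" {
  type                     = "ingress"
  from_port                = 20000
  to_port                  = 32000
  protocol                 = "tcp"
  source_security_group_id = "${var.alb_sg_id}" # e.g. the ingress ALB's security group
  security_group_id        = "${var.node_sg_id}"
}
```

Referencing a security group instead of a CIDR block keeps the rule correct even when the VPC layout changes.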

Remove EFS and S3 dependencies in CloudInit of Nomad Client Nodes

Issue

Currently there are two variables that are used to modify the cloud-init script from the outside.
This adds an unnecessary dependency on components that are not needed for the COS deployment. Thus the COS is less usable and harder to set up, since all dependencies have to be satisfied.

Goal

Remove the unneeded dependencies and the corresponding variables.
https://github.com/MatthiasScholz/cos/blob/master/modules/nomad-datacenter/user-data-nomad-client.sh#L53

Make instance count configurable on DC level

It would be nice if the number of nodes per data-center were adjustable.
For example, the backoffice dc might need only a fraction of the nodes required for the private- or public-services dc.

In nw-separation setup nomad fails to download the images

04/14/18 08:58:48 UTC Restarting Task restarting in 30.185113191s
04/14/18 08:58:48 UTC Driver Failure failed to initialize task "ping_service_task" for alloc "b9f34abf-f20c-27ff-6341-5c1040f9476f": Failed to pull 307557990628.dkr.ecr.us-east-1.amazonaws.com/service/ping-service:0.0.7: API error (500): {"message":"Get https://307557990628.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}

Provide possibility to inject userdata

Some use-cases (e.g. mounting an EFS mount-target) are best implemented via user-data when the instance is created.
With the current module API it is not possible to add steps to the user-data of the nomad nodes.

It would be nice to have this option.
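A sketch of how the module API could expose this (variable name and template wiring are assumptions):

```hcl
# sketch: module variable for extra user-data steps (variable name assumed)
variable "extra_user_data" {
  description = "Shell snippet appended to the generated cloud-init script"
  default     = ""
}

# appended at the end of the rendered user-data template
locals {
  user_data = "${data.template_file.user_data.rendered}\n${var.extra_user_data}"
}
```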

dnsmasq not used for DNS based service-discovery

Since not all services support consul based service-discovery, we added dnsmasq on the instances.
As specified at https://www.consul.io/docs/guides/forwarding.html dnsmasq can be used to intercept all queries to the consul domain to 127.0.0.1:8600.
There at 127.0.0.1:8600 the consul-agent is running which is able to provide a service-discovery based on the service-catalog.

But dnsmasq is not configured correctly to intercept these local calls.
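As described in the linked consul forwarding guide, the intended configuration is a single server line that forwards the .consul domain to the local consul agent:

```
# /etc/dnsmasq.d/10-consul (per the consul DNS forwarding guide)
server=/consul/127.0.0.1#8600
```

All other queries continue to be resolved via the regular upstream resolvers.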

Support AWS System Manager Session Manager

Summary

Making use of the AWS Session Manager will allow deprecating the bastion setup used to debug cluster issues where direct instance access is needed.

Using the AWS Session Manager provides better security and less infrastructure to maintain and pay for.

Details

  • ensure the AWS Session Manager agent is installed on the instances ( included by default in Amazon Linux 2 )
  • ensure the instance is allowed to interact with AWS Session Manager ( instance profile )
  • clean up the documentation to advertise AWS Session Manager over the bastion setup ( +sshuttle )

Add https listener for ui-ALB's

Problem

  • Currently the communication over the UI-ALBs to the COS components uses HTTP. Thus no encryption in transit is in place.

Unable to build AMI with packer

When installing consul, sudo yum update -y is called.
During this process the OS is not able to load some packages.

amazon-linux-ami2: Updated:
amazon-linux-ami2: amazon-linux-extras.noarch 0:1.4-1.amzn2
amazon-linux-ami2: aws-cfn-bootstrap.noarch 0:1.4-30.amzn2
amazon-linux-ami2: dhclient.x86_64 12:4.2.5-58.amzn2.3.2
amazon-linux-ami2: dhcp-common.x86_64 12:4.2.5-58.amzn2.3.2
amazon-linux-ami2: dhcp-libs.x86_64 12:4.2.5-58.amzn2.3.2
amazon-linux-ami2: dotnet-host.x86_64 0:2.1.0_preview2_26411_07-1
amazon-linux-ami2: ec2-utils.noarch 0:0.5-1.amzn2.0.1
amazon-linux-ami2: kernel-tools.x86_64 0:4.14.33-59.34.amzn2
amazon-linux-ami2: mssql-server.x86_64 0:14.0.3025.34-3
amazon-linux-ami2:
amazon-linux-ami2: Failed:
amazon-linux-ami2: msodbcsql17.x86_64 0:17.0.1.1-1 msodbcsql17.x86_64 0:17.1.0.1-1
amazon-linux-ami2: mssql-tools.x86_64 0:17.0.1.1-1 mssql-tools.x86_64 0:17.1.0.1-1

Restrict access to nomad-servers

see nomad/servers.tf

 # HACK: Still everything open for the nomad-servers. Has to be closed.
  allowed_inbound_cidr_blocks = ["0.0.0.0/0"]
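A sketch of the fix, restricting inbound access to the VPC instead of the world (the variable name is an assumption):

```hcl
# restrict inbound access to the VPC instead of 0.0.0.0/0 (variable name assumed)
allowed_inbound_cidr_blocks = ["${var.vpc_cidr_block}"]
```

Even better would be referencing the security groups of the nomad clients and consul servers directly, as proposed for the docker ports.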

Update Nomad/Consul ami to use consul 1.3.1

The newest released consul version is 1.3.1.
Released on November 13.

With this upgrade, besides bugfixes, we also get features like Connect Envoy Support (part of v1.3.0).

Cached Consul Node Id

This issue is similar to this one: #25.

The data cached by consul is baked into the snapshot and hence reused when the snapshot gets instantiated a second time.

Names of nomad and consul instances are unclear

Current state

nomad-client (dc public-services): COS-public-services-shiner
consul: COS-consul-shiner
nomad-server: nomad-example-server

Problem

  1. The names can't be long, since they are used inside the official nomad-module as a name prefix, which introduces size limits (32 characters for target groups, 64 for security groups)
  2. To avoid collisions the name should contain a random part.

Proposal

  1. Keep random part at the end
  2. Restrict the data-center name (only within these names) to a maximum of 10 characters.
  3. Add abbrev. for node-type
  • nomad-client: NMC
  • nomad-server: NMS
  • consul: SDCFG (Service Discovery + Configuration)

Examples

COS-SDCFG-consul-shiner
COS-NMC-public-ser-shiner
COS-NMC-private-se-shiner
COS-NMC-backoffice-shiner
COS-NMC-content-se-shiner
COS-NMS-leader-shiner

UDP blocked for docker ports

Some services (e.g. thanos) need to be able to communicate with their cluster components using UDP.
Currently only TCP traffic is allowed across the docker ports.

--> open UDP.
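A sketch of the additional Terraform rule, mirroring the existing TCP rule for UDP (resource and variable names are assumptions):

```hcl
# sketch: allow UDP alongside TCP on the docker port range (names assumed)
resource "aws_security_group_rule" "docker_ports_udp" {
  type                     = "ingress"
  from_port                = 20000
  to_port                  = 32000
  protocol                 = "udp"
  source_security_group_id = "${var.node_sg_id}" # nodes talk to each other
  security_group_id        = "${var.node_sg_id}"
}
```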

Fix get_nomad_client_info.sh

Currently it returns only the public-services client node, since it works on instance tags that are no longer valid.

get_nomad_client_info.sh
2018-04-14 10:52:29 [INFO] [get_nomad_client_info.sh] aws ec2 describe-instances --region us-east-1 --profile playground --filter Name=tag:Name,Values=COS-NMC-public-ser-coyote Name=instance-state-name,Values=running
ISTANCE_ID INSTANCE_IP INSTANCE_IP (private)
52.91.138.241 i-0263e643f56bf7faa 10.128.50.55

Script to calculate cidr-blocks for egress_aws NatGW

Why

We want to restrict the internet access of the nomad-masters (leader). That's why they are inside a subnet that only has access to AWS services. This restriction is made by allowing only routes to AWS services as specified at: https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

Problem - access to ECR requires many of the IPs specified at https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html

This results in more than 50 route entries for a route table, while the limit per route table is 50.
Of course a limit increase can be requested, but due to the potential performance impact it is not recommended to do so.

With #6 we worked around the issue by widening the CIDRs to /8. As a long-term solution, however, we need more restrictive CIDRs (i.e. /16).
To generate these correctly (including merging them) and optimally (least number of rules possible), we need a sophisticated script.
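A starting point for such a script (a sketch; fetching and filtering the AWS ip-ranges.json is left out) is Python's `ipaddress` module, which can collapse adjacent and overlapping networks into the minimal covering set of rules:

```python
import ipaddress

def merge_cidrs(cidrs):
    """Collapse a list of CIDR strings into the minimal covering set of networks."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

# two adjacent /25 blocks merge into a /24, which then merges
# with the neighbouring /24 into a single /23 route entry
print(merge_cidrs(["10.0.0.0/25", "10.0.0.128/25", "10.0.1.0/24"]))
# → ['10.0.0.0/23']
```

`collapse_addresses` only merges networks that form a valid, aligned supernet, so it never widens a rule beyond what the inputs cover.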

Nomad is not able to pull from DockerHub

When trying to deploy a docker image from docker-hub nomad responds with the following error message:

failed to initialize task "ping_service_task" for alloc "8f46a473-90de-3e96-71bb-149ad2916453": Failed to find docker auth for repo "thobe/ping_service": docker-credential-ecr-login with input "thobe/ping_service" failed with stderr: credentials not found in native keychain

Example job file:

# job>group>task>service
# container for tasks or task-groups that nomad should run
job "ping_service" {
  datacenters = ["public-services"]
  #,"private-services","content-connector","backoffice"]
  type = "service"

  meta {
    my-key = "example"
  }

  # The group stanza defines a series of tasks that should be co-located on the same Nomad client.
  # Any task within a group will be placed on the same client.
  group "ping_service_group" {
    count = 1

    # restart-policy
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }

    ephemeral_disk {
      migrate = false
      size    = "50"
      sticky  = false
    }

    # The task stanza creates an individual unit of work, such as a Docker container, web application, or batch processing.
    task "ping_service_task" {
      driver = "docker"
      # config was split into two stanzas before; merged into one
      config {
        # Docker Hub:
        image = "thobe/ping_service:0.0.9"
        port_map = {
          http = 8080
        }
      }

      logs {
        max_files     = 2
        max_file_size = 10
      }

      resources {
        cpu    = 100 # MHz
        memory = 20 # MB
        network {
          mbits = 10
          port "http" {
          }
        }
      }

      # The service stanza instructs Nomad to register the task as a service using the service discovery integration
      service {
        name = "ping-service"
        tags = ["urlprefix-/ping"] # fabio
        port = "http"
        check {
          name     = "Ping-Service Alive State"
          port     = "http"
          type     = "http"
          method   = "GET"
          path     = "/ping"
          interval = "10s"
          timeout  = "2s"
        }
       }

      env {
        SERVICE_NAME       = "${NOMAD_DC}"
        PROVIDER           = "ping-service"
        # uncomment to enable sd over consul
        CONSUL_SERVER_ADDR = "172.17.0.1:8500"
        #PROVIDER_ADDR = "ping-service:25000"
      }
    }
  }
}

Add possibility to tag instances

Being able to tag datacenter nodes makes it easier to distinguish between different types of them.
This helps especially if you want to find the right target nodes for node draining in a script-based manner.

i.e. for a version upgrade
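A sketch of how the module could expose this (variable name and wiring are assumptions):

```hcl
# sketch: user-defined tags forwarded to the data-center instances (names assumed)
variable "additional_instance_tags" {
  description = "Extra tags applied to all nodes of this data-center"
  default     = {}
}
```

A caller could then pass e.g. `additional_instance_tags = { cos_version = "0.8.0" }` and later filter on that tag when selecting nodes to drain.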

Provide ingress ALB connection to backoffice DC nodes.

It makes sense to separate user ingress-traffic from ops requests (i.e. to the monitoring system).
A natural separation can be done via load-balancer.
To enable this, it has to be possible to connect an ALB to the ASG of the backoffice nodes.

Upgrade to nomad 1.0.2

Time is passing and new versions are available for all of our dependencies.

Update:

  • nomad
  • consul
  • fabio
  • terraform modules
  • testing

Establish CI/CD System

Overview

Running a CI/CD system will improve confidence in changes introduced into the repository, due to the automated execution of verification steps like validity checks, linting, and test execution.

Details

Two systems are currently in evaluation:
