runatlantis / terraform-gce-atlantis
A set of Terraform configurations for running Atlantis on @googlecloud Compute Engine
License: Apache License 2.0
This is helpful for pinning the required providers and Terraform version as prerequisites:
terraform {
  required_version = ">= 0.13"
  required_providers {
    # ...
  }
}
example https://github.com/cloudposse/terraform-aws-ecr/blob/master/versions.tf
Note that you get this input out of the box with #57, using module.this.tags.
We currently restrict our versions.tf to 1.2.0 or higher.
As many users might still be running 0.13.0, for example, we should check whether our code is compatible with that Terraform version.
Currently, users of our platform are required to pull the latest
version of the atlantis image. We propose adding the ability to pull the latest prerelease image.
Best practice is to use # comments instead of // comments.
This is only one example.
When a new version of the COS image is available, it will now trigger an update of the instance template (as we always pull the latest version).
We should ignore any changes to this, and possibly allow a user to override or set their own version of the COS image, as updating the instance template will trigger Atlantis to be recreated. (This is painful, especially if the change is applied through Atlantis itself.)
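One way this could be made overridable, sketched with a hypothetical variable and local (these names are not the module's actual interface):

```hcl
# Hypothetical variable: lets a user pin a specific COS image.
variable "machine_image" {
  type        = string
  description = "Optional pinned COS image self link; defaults to the latest stable image."
  default     = null
}

# Latest stable COS image, used only when no pin is provided.
data "google_compute_image" "cos" {
  family  = "cos-stable"
  project = "cos-cloud"
}

locals {
  # coalesce() skips the null pin and falls back to the data-source lookup.
  source_image = coalesce(var.machine_image, data.google_compute_image.cos.self_link)
}
```

With a pin set, a new COS release no longer changes the template's inputs, so no replacement is planned.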
The plan I got:
# module.atlantis.google_compute_instance_template.default must be replaced
+/- resource "google_compute_instance_template" "default" {
~ id = "projects/xxxxxxxx/global/instanceTemplates/atlantis-20230210130232854100000001" -> (known after apply)
~ labels = { # forces replacement
~ "container-vm" = "cos-stable-101-17162-127-5" -> "cos-stable-101-17162-127-8"
}
~ metadata_fingerprint = "djl8in4QXsc=" -> (known after apply)
- min_cpu_platform = "" -> null
~ name = "atlantis-20230210130232854100000001" -> (known after apply)
~ self_link = "https://www.googleapis.com/compute/v1/projects/xxxxxxxx/global/instanceTemplates/atlantis-20230210130232854100000001" -> (known after apply)
tags = [
"atlantis-wl45dn",
]
~ tags_fingerprint = "" -> (known after apply)
# (8 unchanged attributes hidden)
+ confidential_instance_config {
+ enable_confidential_compute = (known after apply)
}
~ disk {
~ device_name = "persistent-disk-0" -> (known after apply)
~ interface = "SCSI" -> (known after apply)
- labels = {} -> null
~ mode = "READ_WRITE" -> (known after apply)
- resource_policies = [] -> null
~ source_image = "projects/cos-cloud/global/images/cos-stable-101-17162-127-5" -> "https://www.googleapis.com/compute/v1/projects/cos-cloud/global/images/cos-stable-101-17162-127-8" # forces replacement
~ type = "PERSISTENT" -> (known after apply)
# (4 unchanged attributes hidden)
}
~ disk {
~ boot = false -> (known after apply)
~ interface = "SCSI" -> (known after apply)
- labels = {} -> null
- resource_policies = [] -> null
+ source_image = (known after apply)
~ type = "PERSISTENT" -> (known after apply)
# (5 unchanged attributes hidden)
}
~ network_interface {
+ ipv6_access_type = (known after apply)
~ name = "nic0" -> (known after apply)
~ network = "https://www.googleapis.com/compute/v1/projects/xxxxxxxx/global/networks/network" -> (known after apply)
- queue_count = 0 -> null
+ stack_type = (known after apply)
~ subnetwork = "https://www.googleapis.com/compute/v1/projects/xxxxxxxx/regions/europe-west4/subnetworks/subnetwork" -> "projects/xxxxxxxx/regions/europe-west4/subnetworks/subnetwork"
# (1 unchanged attribute hidden)
}
~ scheduling {
- min_node_cpus = 0 -> null
# (4 unchanged attributes hidden)
}
# (2 unchanged blocks hidden)
}
Currently, users of Google Cloud Platform (GCP) Virtual Machines (VMs) are unable to specify the machine type when creating or modifying a VM (we use a default value: n2-standard-2). We propose adding the ability for users to specify the machine type of their GCP VM.
This would give users greater control over the resources allocated to their VM, allowing them to optimize for performance, cost, or other factors.
How will users be able to determine the appropriate machine type for their workload? Will there be any guidance provided, or will they need to determine this on their own?
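A minimal sketch of the proposed input (the variable name and default follow the issue text, but the wiring is an assumption):

```hcl
variable "machine_type" {
  type        = string
  description = "Machine type of the Atlantis VM."
  default     = "n2-standard-2" # the current hard-coded default
}

resource "google_compute_instance_template" "default" {
  machine_type = var.machine_type
  # ...
}
```

Guidance on sizing could simply point users to the GCP machine-type families documentation rather than being baked into the module.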
Project-wide SSH keys are stored in the project metadata under Compute Engine. They can be used to log in to all instances within a project, which eases SSH key management, but if a key is compromised the potential security risk impacts every instance in the project.
We currently allow project-wide SSH keys (by suppressing the checkov rule) in the instance template. Preferably this should be made configurable through a variable.
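A sketch of how this could be exposed, using the standard `block-project-ssh-keys` instance metadata key (the variable name is hypothetical):

```hcl
# Hypothetical variable to make this behaviour configurable.
variable "block_project_ssh_keys" {
  type        = bool
  description = "Blocks project-wide SSH keys from being usable on this instance."
  default     = true
}

resource "google_compute_instance_template" "default" {
  # ...
  metadata = {
    block-project-ssh-keys = var.block_project_ssh_keys
  }
}
```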
Input environment variables necessary to bootstrap Atlantis shouldn't be exposed in the Google Cloud UI, as these values contain sensitive information.
Preferably, populate an atlantis.env file that contains the data passed down to var.env_vars, and use envFile in the container spec to persist these environment variables into the Atlantis container.
It seems that running terraform destroy
is prevented because the compute instance does not delete its persistent storage, causing the error:
Error: Error when reading or editing Project Service <PROJECT-ID>/compute.googleapis.com: Error disabling service
"compute.googleapis.com" for project "<PROJECT-ID>": Error waiting for api to disable: Error code 9, message: [Error in service
'compute.googleapis.com': Could not turn off service, as it still has resources in use.
Not sure if this is intentional, but I think if the delete_rule
is changed to ON_PERMANENT_INSTANCE_DELETION
it would allow you to run terraform destroy
while still not deleting the disk when you merely shut down the instance.
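A sketch of the suggested change on the stateful disk (the resource and device names are illustrative): NEVER keeps the disk even through terraform destroy, which blocks disabling the Compute API, whereas ON_PERMANENT_INSTANCE_DELETION removes it only on permanent instance deletion, not on a stop/restart.

```hcl
resource "google_compute_instance_group_manager" "default" {
  # ...

  stateful_disk {
    device_name = "atlantis-disk-0" # illustrative device name
    # Delete the disk only when the instance is permanently deleted
    # (e.g. terraform destroy), not when it is merely stopped.
    delete_rule = "ON_PERMANENT_INSTANCE_DELETION"
  }
}
```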
Consider spot_machine_enabled
instead of
Consider block_project_ssh_keys_enabled
instead of
Module usage
module "atlantis" {
source = "bschaatsbergen/atlantis/gce"
version = "0.1.5"
# insert the 7 required variables
}
Then a resource like this would have a fully qualified address of module.atlantis.google_compute_instance_template.atlantis,
which is redundant.
Consider using default
for your resource names instead of reusing atlantis,
which would result in a fully qualified address of module.atlantis.google_compute_instance_template.default.
Trying to deploy Atlantis on GCE using IAP example.
Container is failing to start because of the filesystem error:
[ 70.188299] konlet-startup[1686]: 2023/03/13 14:05:19 Attempting to unmount device /dev/sdb at /mnt/disks/gce-containers-mounts/gce-persistent-disks/atlantis-disk-0.
[ 70.190423] konlet-startup[1686]: 2023/03/13 14:05:19 Unmounted /mnt/disks/gce-containers-mounts/gce-persistent-disks/atlantis-disk-0
[ 70.190530] konlet-startup[1686]: 2023/03/13 14:05:19 Found 1 volume mounts in container declaration.
[ 70.197085] konlet-startup[1686]: 2023/03/13 14:05:19 Running filesystem checker on device /dev/disk/by-id/google-atlantis-disk-0...
[ 70.199060] konlet-startup[1686]: 2023/03/13 14:05:19 Error: Failed to start container: Volume atlantis-disk-0: Filesystem check failed: Failed to execute command [fsck.ext4 -p /dev/disk/by-id/google-atlantis-disk-0]: exit status 8, details: /dev/disk/by-id/google-atlantis-disk-0 is mounted.
[ 70.199171] konlet-startup[1686]: e2fsck: Cannot continue, aborting.
Also noticed that the chown command fails:
...
[ 17.150286] systemd-networkd[334]: vethdf3fdbb: Gained carrier
[ 17.156971] systemd-networkd[334]: docker0: Gained carrier
[ 18.323133] systemd-networkd[334]: docker0: Gained IPv6LL
[ 18.962881] systemd-networkd[334]: vethdf3fdbb: Gained IPv6LL
[ 39.220362] chown[929]: chown: cannot access '/mnt/disks/gce-containers-mounts/gce-persistent-disks/atlantis-disk-0': No such file or directory
[ 55.846631] konlet-startup[627]: 2023/03/13 14:05:04 Received ImagePull response: ({"status":"Pulling from runatlantis/atlantis","id":"latest"}
...
Atlantis has no external database. Atlantis stores Terraform plan files on disk. If Atlantis loses that data in between a plan and apply cycle, then users will have to re-run plan. Because of this, we want to provision a persistent disk for Atlantis.
Right now, the Atlantis container runs as privileged, but it is unclear why we need to have that set. If we can get it to run as non-privileged, that would be optimal -- otherwise, we should document exactly why it needs to run as privileged so that users understand that need more thoroughly.
https://github.com/bschaatsbergen/terraform-gce-atlantis/blob/main/examples/complete/main.tf#L96-L107 mentions a small section about defining a dns record set (adding an A record) with a managed zone entry.
It would be helpful to mention somewhere that the current implementation is very GCP native or make this part optional.
More context: https://atlantis-community.slack.com/archives/C5MGGAV0C/p1677427493186339
Thanks!
The Slack token, for example, is currently only configurable through the atlantis configuration file. Any idea how to implement it with the current setup?
The smoothest option is to add another cloudinit.write_files
entry, right?
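A sketch of such a write_files entry, assuming the token is injected via templatefile() rather than hard-coded (the path is illustrative; slack-token is the Atlantis server flag name in its config-file form):

```yaml
#cloud-config
write_files:
  - path: /etc/atlantis/config.yaml   # illustrative path
    permissions: "0600"
    content: |
      # Atlantis server config file; keys mirror the CLI flag names.
      slack-token: ${slack_token}     # injected via templatefile(), not committed
```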
Consider adding a minimal example using the Terraform Registry syntax:
module "atlantis" {
source = "bschaatsbergen/atlantis/gce"
version = "0.1.5"
# insert the 7 required variables
}
Ref https://registry.terraform.io/modules/bschaatsbergen/atlantis/gce/latest
The provider should only be in the root module, not in the consumable module.
By removing it, you may also be able to remove the project_id input.
https://registry.terraform.io/modules/terraform-google-modules/container-vm/google/latest
Also, the latest version is 3.1.0 and this module uses 2.x.
It's limiting to only support a conditional for the container image.
It would be good to expose the entire container input, or at least the full container image.
For example, what if I wanted to use a custom image or a custom tag?
You may want to expose more inputs to the upstream module as well.
You could also avoid having to add additional logic for the dev tag (#3).
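A sketch of exposing the full image reference (the variable name and default are assumptions, not the module's current interface):

```hcl
# Hypothetical input exposing the full image reference, including registry and tag.
variable "image" {
  type        = string
  description = "Full container image to deploy, e.g. a custom registry, image, or tag."
  default     = "ghcr.io/runatlantis/atlantis:latest" # illustrative default
}

module "container" {
  source  = "terraform-google-modules/container-vm/google"
  version = "~> 3.1"

  container = {
    image = var.image
  }
}
```

This makes custom images, custom tags, and the dev tag all the same code path: the user just passes a different string.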
machine_image
pinning was introduced in #112, but even with the value set, the instance is replaced when a new COS image comes out.
I believe the issue is in the locals:
labels = merge(var.labels, { "container-vm" = module.container.vm_container_label })
Didn't have time to dig into it, but I believe that if we want to pass the correct label to the module, we need to parse machine_image.
Alternatively, we should update the way local.labels
is generated.
Just raising an issue, in case someone wants to take a look. Might dig into it myself when I have a moment.
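One hypothetical fix along those lines, deriving the label from the pinned image instead of the module output (untested; assumes machine_image is an image path or self link ending in the image name):

```hcl
locals {
  # When machine_image is pinned, derive the "container-vm" label from it so a
  # new COS release no longer changes the labels and forces replacement.
  container_vm_label = (
    var.machine_image != null
    ? basename(var.machine_image) # e.g. "cos-stable-101-17162-127-5"
    : module.container.vm_container_label
  )

  labels = merge(var.labels, { "container-vm" = local.container_vm_label })
}
```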
Redis locking would allow an HA setup using multiple atlantis servers.
https://www.runatlantis.io/docs/server-configuration.html#locking-db-type
Related issue terraform-aws-modules/terraform-aws-atlantis#322
Cloud Memorystore would be the Redis equivalent of AWS ElastiCache.
This module would not need to implement the Redis cluster itself, and it may not even need a count on the GCE instance, since the count could be added to the module call instead.
An example with a Cloud Memorystore instance whose values are fed into this module would be enough to allow people to use it:
resource "google_redis_instance" "cache" {
name = "memory-cache"
memory_size_gb = 1
}
module "atlantis" {
source = "bschaatsbergen/atlantis/gce"
count = 4
name = "atlantis-${count.index}"
env_vars = {
ATLANTIS_LOCKING_DB_TYPE = "redis"
ATLANTIS_REDIS_HOST = ""
ATLANTIS_REDIS_PORT = ""
# could be randomly generated when the redis cluster is born
ATLANTIS_REDIS_PASSWORD = ""
# ...
}
# ...
}
Currently it seems impossible to deploy this module into a project that uses a shared VPC, as the module tries to create a firewall rule and a route (which is not allowed cross-project, and also should not happen).
It also seems that the subnetwork_project
changes after each apply, making the module non-idempotent.
Currently, users of Google Cloud Platform (GCP) Virtual Machines (VMs) are only able to encrypt the attached disks of their VMs using a Google-managed key that is stored in Cloud Key Management Service (KMS).
Increased security: By allowing users to bring their own KMS key, they have the option to use a key that is stored in their own KMS, rather than relying on a Google-managed key. This can provide an additional layer of security, as the user has full control over the management and rotation of their own key.
Industry best practices: Many organizations have established security policies that require the use of customer-managed keys for data encryption. Allowing users to bring their own KMS key would enable GCP users to comply with these policies and align with industry best practices for data security.
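A sketch of how a customer-managed key could be wired into the template's disks (the variable name is hypothetical; the `disk_encryption_key` block is the provider's mechanism for CMEK on instance templates):

```hcl
# Hypothetical input for a customer-managed encryption key.
variable "disk_kms_key_self_link" {
  type        = string
  description = "Self link of a Cloud KMS key used to encrypt the attached disks."
  default     = null
}

resource "google_compute_instance_template" "default" {
  # ...
  disk {
    # ...
    # Only emit the block when a key is supplied; otherwise keep the
    # Google-managed default.
    dynamic "disk_encryption_key" {
      for_each = var.disk_kms_key_self_link == null ? [] : [1]
      content {
        kms_key_self_link = var.disk_kms_key_self_link
      }
    }
  }
}
```

Note the service agent for Compute Engine would also need encrypt/decrypt permission on the key.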
I suppose common mistakes are made even though the examples are very detailed - we should provide an FAQ per example covering commonly made mistakes.
As discussed in #64
Users might apply similar tags to instances, firewall rules, and routes to allow/deny traffic.
The tag we currently use, atlantis,
is very generic and could conflict if there are multiple Atlantis solutions running in the project.
Currently I can think of only 2 solutions: keep the atlantis
tag, or randomize it: atlantis-${random_string.tag.result},
for example.
See tflint, tfsec, checkov.
These can be run in pre-commit hooks and as a PR check.
As users might like to provide their own startup script instead (see #41), we should move the commands that we execute in the startup script to cloud-init.
This would not break any existing functionality (running a chown on the new GCE persistent disk mount) and allows a user to bring their own startup script.
https://registry.terraform.io/modules/terraform-google-modules/container-vm/google/latest?tab=inputs
Also, you might as well move all the atlantis.tf contents to main.tf to keep things together, since it's not different enough to warrant a separate file name, no?
https://github.com/bschaatsbergen/terraform-gce-atlantis/blob/main/startup-script.sh
Use set -e
Use /usr/bin/env bash instead of /bin/bash
Use shellcheck to see if it catches anything
Run shellcheck in a GitHub Action whenever a shell script is modified
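A minimal skeleton applying those suggestions (the mount path is an illustrative placeholder, not the module's actual path):

```shell
#!/usr/bin/env bash
# Fail fast on errors, unset variables, and failed pipeline stages.
set -euo pipefail

# Illustrative placeholder for the persistent-disk mount path.
MOUNT_PATH="/tmp/atlantis-disk-example"

mkdir -p "${MOUNT_PATH}"
echo "prepared ${MOUNT_PATH}"
```

Running shellcheck over the script locally and in CI then catches unquoted expansions and similar issues before they reach an instance.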
Cloud-init performs a chown with the UID 100 (the atlantis user) on the GCE Persistent Disk mount path - we should document this so that users are aware of it when hand-rolling their own Docker image.
See the null label
https://github.com/cloudposse/terraform-null-label
This is the mixin that's used across all cloudposse modules.
https://github.com/cloudposse/terraform-null-label/blob/master/exports/context.tf
This allows using standard inputs such as namespace, tenant, environment, name, attributes, tags, etc
Example of usage
https://github.com/cloudposse/terraform-aws-ecr
https://github.com/cloudposse/terraform-aws-ecr/blob/master/context.tf
So this results in all name arguments set to this value
name = module.this.id
Consider outputting the entire module.atlantis
output "atlantis" {
value = module.atlantis
description = "All of the outputs of the upstream terraform-google-modules/container-vm/google module"
}
It would be useful to make the domain
input optional and allow the module to either reserve a static external IP or take one as an input.
It would make getting up and running a bit faster for testing purposes.
Currently, users of our platform are required to pull either the latest
or prerelease-latest
version of the atlantis image. We propose adding the ability to pull the latest dev image.
The interface
env_vars = [
{
name = "var"
value = "value"
}
]
is more intuitive like this:
env_vars = {
var = "value"
}
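A sketch of the map-based interface and the conversion back to what the container spec expects (local and variable names follow the issue but are otherwise assumptions):

```hcl
# Map-based interface; keys are variable names, values are their values.
variable "env_vars" {
  type        = map(string)
  description = "Environment variables passed to the Atlantis container."
  default     = {}
}

locals {
  # Convert back to the name/value objects the container spec expects.
  container_env = [for k, v in var.env_vars : { name = k, value = v }]
}
```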
Spot instances can be useful for running Atlantis on GCP in combination with a persistent data disk because they allow you to take advantage of lower prices for compute resources while still being able to store your data on a durable, high-performance disk.
By using a PD-SSD to store your data, you can easily restart Atlantis on a new spot instance if the original instance is terminated, without losing any data. This can help you save money on compute costs while still being able to run Atlantis reliably.
Requires #11 to be completed first.
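A sketch of the scheduling change this implies (the toggle name matches the `spot_machine_enabled` suggestion above; GCP requires preemptible with no automatic restart when using the SPOT provisioning model):

```hcl
# Hypothetical toggle for running Atlantis on a Spot VM.
variable "spot_machine_enabled" {
  type    = bool
  default = false
}

resource "google_compute_instance_template" "default" {
  # ...
  scheduling {
    provisioning_model = var.spot_machine_enabled ? "SPOT" : "STANDARD"
    preemptible        = var.spot_machine_enabled
    automatic_restart  = !var.spot_machine_enabled
  }
}
```

Because the data disk is stateful, a preempted instance can be replaced and remount the same disk without losing plan files.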
There are multiple examples here, but here is just one.
Which should be:
tty : true
This helps with consistency and readability, since HCL is formatted with terraform fmt
whereas JSON is not.
For testing this module.
Example https://github.com/cloudposse/terraform-aws-ecr/tree/master/test/src
Consider overriding the startup script path.
And consider overriding the entire metadata_startup_script
input to avoid the templatefile
(if there are benefits to it, I don't know).
I'm seeing something strange where the "live plan" view in the console UI sometimes doesn't display the full plan (especially for larger plans). I used your Terraform module for the GCE setup. If I bypass IAP, everything displays correctly.
I know it was mentioned here that this is due to IAP stripping off the bearer authorization header.
Any ideas for a workaround? Could I put nginx in front of the Atlantis Docker container to deal with the authorization header issue somehow?
See https://github.com/cloudposse/terraform-aws-ecr/tree/master/examples/complete for inspiration