tinkerbell / hook

In-memory Operating System Installation Environment for Executing Tinkerbell Workflows

License: Apache License 2.0

hook's Introduction

Tinkerbell

License

Tinkerbell is licensed under the Apache License, Version 2.0. See LICENSE for the full license text. Some of the projects used by the Tinkerbell project may be governed by a different license; please refer to each project's specific license.

Tinkerbell is part of the CNCF Projects.

Community

The Tinkerbell community meets bi-weekly on Tuesday. The meeting details can be found here.

Community Resources:

What's Powering Tinkerbell?

The Tinkerbell stack consists of several microservices, and a gRPC API:

Tink

Tink is the short-hand name for the tink-server and tink-worker. tink-worker and tink-server communicate over gRPC, and are responsible for processing workflows. The CLI is the user-interactive piece for creating workflows and their building blocks, templates and hardware data.

Smee

Smee is Tinkerbell's DHCP server. It handles DHCP requests, hands out IPs, and serves up iPXE. It uses the Tinkerbell client to pull and push hardware data. It only responds to a predefined set of MAC addresses so it can be deployed in an existing network without interfering with existing DHCP infrastructure.

Hegel

Hegel is the metadata service used by Tinkerbell and OSIE. It collects data from both and transforms it into a JSON format to be consumed as metadata.

OSIE

OSIE is Tinkerbell's default in-memory installation environment for bare metal. It installs operating systems and handles deprovisioning.

Hook

Hook is the newly introduced alternative to OSIE. It's the next iteration of the in-memory installation environment to handle operating system installation and deprovisioning.

PBnJ

PBnJ is an optional microservice that can communicate with baseboard management controllers (BMCs) to control power and boot settings.

Building

Use make help. The most interesting targets are make all (or just make) and make images. make all builds all the binaries for your host OS and CPU to enable running directly. make images will build all the binaries for Linux/x86_64 and build docker images with them.

Configuring OpenTelemetry

Rather than adding a bunch of command line options or a config file, OpenTelemetry is configured via environment variables. The most relevant ones are below; for others, see https://github.com/equinix-labs/otel-init-go

Currently this is just for tracing. Metrics still need to be discussed with the community.

Env Variable	Required	Default
`OTEL_EXPORTER_OTLP_ENDPOINT`	n	localhost
`OTEL_EXPORTER_OTLP_INSECURE`	n	false
`OTEL_LOG_LEVEL`	n	info

To work with a local opentelemetry-collector, try the following. For examples of how to set up the collector to relay to various services take a look at otel-cli

export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
export OTEL_EXPORTER_OTLP_INSECURE=true
./cmd/tink-server/tink-server <stuff>

Website

For complete documentation, please visit the Tinkerbell project hosted at tinkerbell.org.

hook's Issues

Hook doesn't create required files/folder for docker trusted certificates

We are trying to build a custom HookOS following https://anywhere.eks.amazonaws.com/docs/reference/baremetal/bare-custom-hookos/ with the latest Hook version, v.0.7.0. We updated the bootkit and hook-docker folders with our custom functionality:

fmt.Println("Create docker cert directory - debug")
err = os.MkdirAll("/etc/docker/certs.d/<OUR IP>", os.ModeDir)
if err != nil {
        fmt.Println("Error creating dir")
        panic(err)
}

According to the logs on the iPXE-booted machine, the code executed correctly. However, no such directory exists on the host itself.

Would it be possible to get more information on how to add a trusted Docker certificate, so that we can authenticate to our local Docker registry?

Include the checksums as a release artifact

Please include the checksums as a release artifact. This would make its usage easier.

Also, to make it easier to update the chart, it should be in a way that is compatible with the helm chart values.yaml declaration, or the chart values.yaml should be simplified to use the tarball checksum instead of the tarball content checksums.

Push-based publish job is failing

Expected Behaviour

Current publish job is referencing a non-existing s3 bucket and does not have credentials, example failure: https://github.com/tinkerbell/hook/actions/runs/559080308

Example architecture

My initial idea is to have two main containers in our LinuxKit image:

Docker with a bind to /var/run:

  - name: docker-osie
    image: docker:19.03.8-dind
    capabilities:
     - all
    net: host
    mounts:
     - type: cgroup
       options: ["rw","nosuid","noexec","nodev","relatime"]
    binds:
     - /etc/resolv.conf:/etc/resolv.conf
     - /var/lib/docker:/var/lib/docker
     - /lib/modules:/lib/modules
     - /var/run:/var/run
     - /etc/docker/daemon.json:/etc/docker/daemon.json
    command: ["/usr/local/bin/docker-init", "/usr/local/bin/dockerd"]
    runtime:
      mkdir: ["/var/lib/docker"]

We can interact with it:

ctr -n services.linuxkit t exec --exec-id test docker-osie docker pull nginx

Custom Go binary:

  - name: workflow
    image: workflow:beta
    binds:
     - /var/run:/var/run
     - /proc/cmdline:/proc/cmdline

The custom go binary will use the docker SDK to speak to the docker.sock in /var/run...

The final piece of the puzzle is getting the registry certificate into this, there are two options I can see:

Users build their own custom kernel/initramfs with that cert
We have an onboot container that gets certificates and puts them somewhere on the filesystem that we can pass into the docker container.

[Feature Request] Let the user choose which architecture to build

The target architecture(s) (amd64 or arm64) should be configurable.

Expected Behaviour

I would like to be able to choose which architecture I want to build without patching rules.mk.

Current Behaviour

The targets actually build both amd64 and arm64.

Possible Solution

The dirty way, patching rules.mk:

diff --git a/rules.mk b/rules.mk
index b2c5133..717e1da 100644
--- a/rules.mk
+++ b/rules.mk
@@ -22,7 +22,7 @@ ifeq ($(ARCH),aarch64)
 ARCH = arm64
 endif
 
-arches := amd64 arm64
+arches := amd64
 modes := rel dbg
 
 hook-bootkit-deps := $(wildcard hook-bootkit/*)
@@ -87,7 +87,7 @@ push-hook-bootkit push-hook-docker:
        docker buildx build --platform $$platforms --push -t $(ORG)/$(container):$T $(container)
 
 .PHONY: dist
-dist: out/$T/rel/amd64/hook.tar out/$T/rel/arm64/hook.tar ## Build tarballs for distribution
+dist: out/$T/rel/amd64/hook.tar ## Build tarballs for distribution
 dbg-dist: out/$T/dbg/$(ARCH)/hook.tar ## Build debug enabled tarball
 dist dbg-dist:
        for f in $^; do

Context

Three reasons:

I only need one architecture, the one I am using
I want to reduce my build time by skipping arm64
I want to avoid possible build failures outside of my scope

chroot raise segfault when deploying centos6u9

Hi guys,
I am deploying centos6u9.
When chroot runs in the /tinkerbell/cexec action, it fails with a segfault (error 15).
I think it's the vsyscall problem.
Adding vsyscall=emulate to the kernel command line (/proc/cmdline) should work,
but I don't know how to set it in linuxkit.

Much appreciated.

[bootkit] purpose of func `metaDataQuery` and `container_uuid`

Anyone know any context/purpose for the metaDataQuery function? It looks like it queries the metadata server (hegel) and just pulls out the id field. Then uses this value as an env var container_uuid when starting the Tink-worker. A quick grep of the Tink code bases doesn't come up with any references to this container_uuid field.

Also, it looks like this id string is the same as the WORKER_ID in /proc/cmdline. This feels like something that could potentially be removed entirely from bootkit?

CC @thebsdbox

This is from the sandbox:
"id": "0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94"
worker_id=0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94

metadata (formatted for readability)

{
    "id": "0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94",
    "metadata": {
        "facility": {
            "facility_code": "onprem",
            "plan_slug": "c2.medium.x86",
            "plan_version_slug": ""
        },
        "instance": {},
        "state": "provisioning"
    },
    "network": {
        "interfaces": [
            {
                "dhcp": {
                    "arch": "x86_64",
                    "ip": {
                        "address": "192.168.56.43",
                        "netmask": "255.255.255.0"
                    },
                    "mac": "08:00:27:9e:f5:3a"
                },
                "netboot": {
                    "allow_pxe": true,
                    "allow_workflow": true
                }
            }
        ]
    }
}

/proc/cmdline (formatted for readability)

ip=dhcp
modules=loop,squashfs,sd-mod,usb-storage
alpine_repo=http://192.168.56.4:8080/misc/osie/current/repo-x86_64/main
modloop=http://192.168.56.4:8080/misc/osie/current/modloop-x86_64
tinkerbell=http://192.168.56.4
syslog_host=192.168.56.4
parch=x86_64
packet_action=workflow
packet_state=provisioning
docker_registry=192.168.56.4
grpc_authority=192.168.56.4:42113
grpc_cert_url=http://192.168.56.4:42114/cert
instance_id=
registry_username=admin
registry_password=Admin1234
packet_base_url=http://192.168.56.4:8080/workflow
worker_id=0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94
packet_bootdev_mac=08:00:27:9e:f5:3a
facility=onprem
plan=c2.medium.x86
manufacturer=
slug=
initrd=initramfs-x86_64
console=tty0
console=ttyS1,115200
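To make the comparison above concrete, here is a minimal Go sketch that parses a kernel command line and checks whether worker_id matches the id field of the metadata. It is not the bootkit code; parseCmdline and metadataID are hypothetical helpers, and the parser assumes space-separated key=value tokens without quoting:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseCmdline splits a kernel command line into key -> values.
// Keys may repeat (console= does), so every value is kept in order.
// Quoted values containing spaces are not handled in this sketch.
func parseCmdline(cmdline string) map[string][]string {
	params := make(map[string][]string)
	for _, field := range strings.Fields(cmdline) {
		key, value, _ := strings.Cut(field, "=")
		params[key] = append(params[key], value)
	}
	return params
}

// metadataID extracts only the top-level "id" field from the metadata JSON.
func metadataID(raw []byte) (string, error) {
	var md struct {
		ID string `json:"id"`
	}
	if err := json.Unmarshal(raw, &md); err != nil {
		return "", err
	}
	return md.ID, nil
}

func main() {
	// Sample values taken from the sandbox output quoted above.
	cmdline := "ip=dhcp worker_id=0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94 console=tty0 console=ttyS1,115200"
	metadata := []byte(`{"id": "0eba0bf8-3772-4b4a-ab9f-6ebe93b90a94"}`)

	params := parseCmdline(cmdline)
	id, err := metadataID(metadata)
	if err != nil {
		fmt.Println("invalid metadata:", err)
		return
	}
	workerIDs := params["worker_id"]
	same := len(workerIDs) == 1 && workerIDs[0] == id
	fmt.Println("worker_id matches metadata id:", same)
}
```

If the two values always match, reading worker_id from /proc/cmdline would avoid the extra metadata query, which is the simplification the question hints at.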

Add support for virtio scsi

I want to use the virtio scsi transport but the current hook linux kernel is missing the CONFIG_SCSI_VIRTIO=y setting.

Expected Behaviour

When I use the following vagrant snippet I was expecting to see a /dev/sda device.

      config.vm.provider :libvirt do |lv, config|
        lv.storage :file, :size => '40G', :bus => 'scsi', :discard => 'unmap', :cache => 'unsafe'

Current Behaviour

There is no /dev/sda device.

Possible Solution

Compile linux with CONFIG_SCSI_VIRTIO=y.

Steps to Reproduce (for bugs)

modify the sandbox vagrantfile to use lv.storage :file, :size => '40G', :bus => 'scsi', :discard => 'unmap', :cache => 'unsafe'

Intel I225-LM not detected (old kernel issue most likely)

Expected Behaviour

Hook detects and loads the igc module on systems with Intel I225-LM NICs present.

Current Behaviour

Hook boots but seemingly fails to detect Intel I225-LM devices and the igc module does not get loaded. This results in no network connectivity if a system is connected to the provisioning network via an interface with this chipset.

Possible Solution

Update the kernel. Hook uses 5.10.57 which is many releases behind upstream 5.10 LTS (currently at 5.10.78) and likely the cause here. Ideally, hook should move to using mainline releases for better compatibility with new hardware as it makes it into new kernel releases.

Steps to Reproduce (for bugs)

Obtain a system with an Intel I225 interface. We tested the Minisforum HX90. Configure it so PXE is enabled.
Establish tinkerbell services and connect device with I225 adapter to network with access to services
Create a hardware profile and workflow for the device
Power on device
System will chainload iPXE from PXE and hook will load via iPXE
Hook will begin to boot but eventually stall because it has no network connectivity: no module was loaded to enable the I225 NIC.

Context

We are intending to use tinkerbell to deploy many client devices that unfortunately only have a single I225-LM NIC on them. We can sidestep this issue by not using hook and doing automated OS installs however that is going to be a slower option than using hook to deploy disk images. Ideally, hook should function on modern hardware.

To be sure, both Fedora 34 and Debian 11 were also tested on the system with an I225-LM NIC and both were able to detect the NIC and load the igc module. Whatever issue prevents hook from supporting this device seems to have been fixed in later LTS kernels and mainline.

Questions

Is there a specific reason hook has been held back to such a dated LTS kernel release? This definitely is going to hamper support of new hardware.

I noticed some patches so I could see those requiring some work to validate against or port to a newer kernel. I could see time being a constraint here. Maintaining a kernel build is certainly not a zero time commitment.

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS): Fedora Silverblue and CoreOS
How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details: podman containers

cc: @jkl92 @storrgie

include a version file in Hook

It would be nice to be able to cat a file while Hook is running to know what version it is. This would help when debugging issues reported by users.

README.md needs updates

There are some broken links to the sandbox in the README.
Since the sandbox now defaults to using hook, we can probably clean out a lot of the information about
how to switch a sandbox over to using it.

Related to #45

Make deterministic device paths available

Hey gang! Another left-field question from me.. I'm working with machines whose BIOSs sometimes switch the bios order of their disks around on reboot. PITA, right? I usually work around this by using /dev/disk/by-path/ to refer to the exact disks I want to target, but this isn't working in tink-node / actions-image2disk, because /dev/disk doesn't exist... (below). Any ideas how I can refer to specific disks without the BIOS messing up the order?

We could definitely benefit from having disks available like this. linuxkit doesn't have support for this by default. It also doesn't have udev. I believe it has mdev though. We will need to figure out how to get this working in Hook.

https://cloud-native.slack.com/archives/C01SRB41GMT/p1669253401496759

why is dhcpcd being added to onboot?

why is dhcpcd being added to onboot?

hook/hook.yaml

Lines 14 to 16 in 7b64378

    
  - name: dhcpcd
    image: linuxkit/dhcpcd:v0.8
    command: ["/sbin/dhcpcd", "--nobackground", "-f", "/dhcpcd.conf", "-1"]

The commit that introduced those changes does not describe why.

Should the only difference between hook.yaml and hook_debug.yaml be that the latter has the sshd service? If so, can the two files reflect that?

Automate GitHub release workflow

This will help standardize our way out of issues like #75

This process should handle re-releases where there is a - suffix on the filename.

CI failures due to missing s3 credentials

CI is currently failing during the publish step with the following error:

s3cmd sync ./hook-2e5df3fffd5c16a24153262d4179f059123994c9.tar.gz s3://s.gianarb.it/hook/2e5df3fffd5c16a24153262d4179f059123994c9.tar.gz
ERROR: /home/runner/.s3cfg: None
ERROR: Configuration file not available.

The missing s3 credentials need to be added. I'm not too familiar with the history of this codebase, but at first glance I also find it odd that we're uploading artifacts to s.gianarb.it - so perhaps that is the main issue here?

ebpf warnings

The Hook kernel generates a hundred or so lines of warnings about ebpf. We could possibly just disable it in the kernel config, or investigate it in more detail.

Expected Behaviour

A nice clean dmesg output on boot :-)

Current Behaviour

A lot of scrolling messages about ebpf, potentially pushing away any other warning messages.

How to install from official ISO image

All the examples given use the image2disk action, which works with the raw img. I am wondering how I can use the ISO file. I converted the ISO with qemu-img convert, but that does not work as expected. One approach I can think of is creating a VM, installing from the ISO, and then using the resulting img file, but that is not very straightforward.

Figure out how to build kernel as part of CI/CD and hook it to the os yaml file

As you may know, if you tried to build tinkie by yourself or if you read README.md and kernel/README.md, this is the workflow more or less.

if you need to rebuild a kernel, you can do:

$ cd kernel
$ make build_5.10.x
 ---> e08367d496ed
Step 33/35 : COPY --from=kernel-build /out/* /
 ---> 48cbcdadd35d
Step 34/35 : LABEL org.opencontainers.image.revision=cb1336fcb12243daba4c1658e89926a244b702a3
 ---> Running in 72019d18547c
Removing intermediate container 72019d18547c
 ---> 085a74b4689d
Step 35/35 : LABEL org.opencontainers.image.source=https://github.com/linuxkit/linuxkit
 ---> Running in 7c079a5f2c14
Removing intermediate container 7c079a5f2c14
 ---> 9bda170b7231
Successfully built 9bda170b7231
Successfully tagged linuxkit/kernel:5.10.11-9d7ae34f30663ec464ac3696aa1c0251865fd9ff-amd64
Tagging linuxkit/alpine@sha256:bca9e72a17c1c74cf7b28a397c643048fd055fa096a3edfe1d16365a5307b9d9 as linuxkit/alpine:e2391e0b164c57db9f6c4ae110ee84f766edc430

At that point, I usually tag the image I get, and I push it to my registry: docker.io/gianarb.

$ docker tag linuxkit/kernel:5.10.11-1ce7f0bd892ab1e3b15f1d2ae9197401eca03654-amd64 gianarb/tinkie-kernel:5.10.x
$ docker push gianarb/tinkie-kernel:5.10.x
The push refers to repository [docker.io/gianarb/tinkie-kernel]
49d9db376899: Layer already exists
5.10.x: digest: sha256:557e71bfd56932c3f4d5e22e8fec82cf92cf9b340e880f723ed8b65e47aa5d95 size: 530

Now that I have a new version of the kernel, I can build tinkie:

# from the project root
make dist

It will use the new kernel I just pushed. If the kernel tag has changed, it should change here as well:

https://github.com/gianarb/tinkie/blob/master/tinkie.yaml#L2

This project works manually, but ideally, I would like to figure out how we can make it all at once. Technically we do not need to build kernel and tinkie at the same time very often, but right now that's the case. I expect it to be a bit more independent in the near future.

boot w/ the dev-dist build

I built with dev-dist. The boot got stuck.

Expected Behaviour

I expect the boot to succeed and to see a shell.

Current Behaviour

The boot gets stuck. Please see the attached picture. It gets stuck after init.

Possible Solution

Steps to Reproduce (for bugs)

build dev-dist
put the vmlinuz and initramfs into the current dir
boot client
client stuck

Tidy up the readme

Just a note to tidy up the readme

how to recognize the lvm on my disk

Hi guys,
I built a system using linuxkit.
It cannot recognize the LVM (lv, vg, pv) on my disk. What other services should I install?

here is my yaml

kernel:
  image: quay.io/tinkerbell/hook-kernel:5.10.85-5604bb0dc1cdb6263770a82bf91cbf7e00ffdd5c
  cmdline: "console=tty0 console=ttyS0 console=ttyAMA0 console=ttysclp0"

init:
  - linuxkit/init:v0.8
  - linuxkit/runc:v0.8
  - linuxkit/containerd:v0.8
  - linuxkit/ca-certificates:v0.8

onboot:
  - name: sysctl
    image: linuxkit/sysctl:v0.8
  - name: sysfs
    image: linuxkit/sysfs:v0.8
  - name: dhcpcd
    image: linuxkit/dhcpcd:v0.8
    command: ["/sbin/dhcpcd", "--nobackground", "-f", "/dhcpcd.conf", "-1"]
    binds.add:
      - /var/lib/dhcpcd:/var/lib/dhcpcd
      - /run:/run
    runtime:
      mkdir:
        - /var/lib/dhcpcd

services:
  - name: getty
    image: linuxkit/getty:v0.8
    binds.add:
      - /etc/profile.d/local.sh:/etc/profile.d/local.sh
    env:
      - INSECURE=true
  - name: rngd
    image: linuxkit/rngd:v0.8
  - name: dhcpcd
    image: linuxkit/dhcpcd:v0.8
    binds.add:
      - /var/lib/dhcpcd:/var/lib/dhcpcd
      - /run:/run
    runtime:
      mkdir:
        - /var/lib/dhcpcd
  - name: ntpd
    image: linuxkit/openntpd:v0.8
    binds:
      - /var/run:/var/run
  - name: hook-docker
    image: quay.io/tinkerbell/hook-docker:latest
    capabilities:
      - all
    net: host
    pid: host
    mounts:
      - type: cgroup
        options: ["rw", "nosuid", "noexec", "nodev", "relatime"]
    binds:
      - /dev/console:/dev/console
      - /dev:/dev
      - /etc/resolv.conf:/etc/resolv.conf
      - /lib/modules:/lib/modules
      - /var/run/docker:/var/run
      - /var/run/images:/var/lib/docker
      - /var/run/worker:/worker
    runtime:
      mkdir:
        - /var/run/images
        - /var/run/docker
        - /var/run/worker
  - name: hook-bootkit
    image: quay.io/tinkerbell/hook-bootkit:latest
    capabilities:
      - all
    net: host
    mounts:
      - type: cgroup
        options: ["rw", "nosuid", "noexec", "nodev", "relatime"]
    binds:
      - /var/run/docker:/var/run
    runtime:
      mkdir:
        - /var/run/docker

files:
  - path: etc/profile.d/local.sh
    contents: |
      alias docker='ctr -n services.linuxkit tasks exec --tty --exec-id cmd hook-docker docker'
      alias docker-shell='ctr -n services.linuxkit tasks exec --tty --exec-id shell hook-docker sh'
    mode: "0644"

trust:
  org:
    - linuxkit
    - library

Much appreciated.

Improved access to Hook logs

As far as I can tell, Hook does not ship its logs off of the machine in any way. This means that troubleshooting issues with things like Hook starting up or workflows require the user to access the console (no other way to access the machine that I'm aware of) and run docker commands. It would be nice to be able to optionally ship logs somewhere outside the machine, at least to start. OSIE did this via Syslog to Boots. I'm not saying/proposing we do this, per se, but that is an option.

(Something for the proposals repo) We probably need a more cohesive approach in the whole stack too.
tink workflow events and tink workflow status only tell us high level what is going on. Then, currently with Hook, the user has to access the console to debug issues. Something via a tink command would be one way to go about it. tink logs or similar.

Need to add an automated workflow to build kconfig image referenced in the kernel README.md file

Expected Behaviour

The kernel readme references a non-existing kconfig image: https://github.com/tinkerbell/hook/blob/master/kernel/README.md

We should automate the building/publishing of this image based on changes to the kernel version we are targeting.

This will require:

a new quay.io repository to be created (quay.io/tinkerbell/hook-kconfig)
The readme to reference the new image
A new push job added to the github workflows to build/push the hook-kconfig image when needed

git tag "latest" behaves in a mutable way.

We currently use the git tag "latest" to point to the top of tree commit. We update (delete and re-create) this tag with every merged PR. This effectively makes the latest tag mutable.

As far as I understand the latest tag's sole purpose is to enable the creation of a single GitHub release pointing to the top of tree. Deleting and creating the latest tag for every new top of tree commit causes some unexpected behavior for clients interacting with this repo.

I find the creation of a GitHub release pointing to the top of tree to be valuable but I think we should revisit the way in which we generate these releases.

Expected Behaviour

I expect git tags to always point to the same commit.

Current Behaviour

Possible Solution

One possible option could be to create a unique tag for all top of tree commits. Then from there a new GitHub release is created. When a new top of tree commit comes along, we delete the previous top of tree commit tag and the previous GitHub release and start the process over. Create a new unique tag and an associated GitHub release.

Move sshd from debug to official but behind feature flag

I know OSIE has SSH installed, and I think it is a very useful feature to have. I use it in my homelab because I am not installing an operating system in all my hardware, some of it stays ephemeral and only runs OSIE itself. For those, I enable ssh passing a cmdline ssh.key="string" because that's how AlpineLinux works, as you can read from their documentation

https://wiki.alpinelinux.org/wiki/PXE_boot#Guide_to_options

I pilot what gets passed to OSIE via metadata.facility.facility_code. I find it convenient. Any idea about how we can do something like that?

/cc. @thebsdbox

question around use of `cpio`

I'm curious about the motivations around using cpio here in the Makefile?
The artifacts that linuxkit builds with the cli flag -format kernel+initrd are sufficient to successfully netboot a machine.

I'm also asking because I haven't been able to produce locally built artifacts (make dev-dist and make dist) that successfully start their network interfaces on boot. I've only tested this on virtualbox and vmware fusion machines. When I just use the artifacts that linuxkit produces I don't see any issues.

CC @thebsdbox

i915 drivers are needed for NUC/BRIX

Expected Behaviour

Hook should support the i915 (Intel Graphics) adapters found on Intel NUCs and Gigabyte Brix (typical lab equipment).

Current Behaviour

Hook fails to load successfully.

Possible Solution

Rebuild kernel with Device_DRM_i915 (included in kernel)

Device_Drivers --> Graphics_Support --> <*>Direct Rendering Manager
Device_Drivers --> Graphics_Support --> <*> Intel 8xx/9xx/G3x/G4x/HD Graphics (NEW)

ci: Build failure due to quay.io Docker image registry login failure

CI is failing because GitHub Actions is attempting to login to quay.io (Metal's Docker image registry) when it does not have access to the registry's credentials (apparently in certain build contexts, it does not have access to said credentials).

Refer to: https://github.com/tinkerbell/hook/runs/4208516467?check_suite_focus=true

Expected Behaviour

CI to work.

Current Behaviour

CI fails in certain contexts.

Possible Solution

Only login to quay.io in certain build contexts. Refer to the following example for reference:
https://github.com/packethost/packet-hardware/pull/31/files#diff-944291df2c9c06359d37cc8833d182d705c9e8c3108e7cfe132d61a06e9133ddR43-R48

Steps to Reproduce (for bugs)

Run CI.

What version of linuxkit is used?

I tried to set up hook doing the following:

git clone https://github.com/linuxkit/linuxkit.git
cd linuxkit
git checkout 79b32dc # the latest commit
make
cp bin/linuxkit ~/go/bin
cd ../
git clone https://github.com/tinkerbell/hook.git
cd hook
git checkout 6f97067 # the latest commit
make dist

I got the error

linuxkit build -docker -disable-content-trust -pull -format kernel+initrd -name hook-x86_64 -dir out hook.yaml
flag provided but not defined: -disable-content-trust

I tried [email protected] but got

linuxkit build -docker -disable-content-trust -pull -format kernel+initrd -name hook-x86_64 -dir out hook.yaml
flag provided but not defined: -docker

What version of linuxkit is used with hook?

Validate if tink-worker configuration works

Tink worker uses Viper, which means that technically it can read a configuration file at /etc/tinkerbell/tink-worker.yaml.

But I never tried it and I am not sure how it works. This is a prerequisite for #3

Rename noname to tinkie

As discussed in the Slack conversation, we would like to name this new component of tinkerbell tinkie, which is a short form for Tink(erbell) Installation Environment.

Rename default branch to main

Expected Behaviour

default branch for the repository should be main rather than master

Linuxkit errors out when trying to run `make image-amd64`

I had to remove -docker from the following line in order to proceed

hook/Makefile

Line 25 in 6f97067

    
    linuxkit build -docker -disable-content-trust -pull -format kernel+initrd -name hook-x86_64 -dir out $(LINUXKIT_CONFIG)

Here is the error:

# make image-amd64
mkdir -p out
linuxkit build -docker -disable-content-trust -pull -format kernel+initrd -name hook-x86_64 -dir out hook.yaml
flag provided but not defined: -docker
USAGE: linuxkit build [options] <file>[.yml] | -

Options:
  -decompress-kernel
    	Decompress the Linux kernel (default false)
  -dir string
    	Directory for output files, default current directory
  -disable-content-trust
    	Skip image trust verification specified in trust section of config (default false)
  -format value
    	Formats to create [ aws docker dynamic-vhd gcp iso-bios iso-efi kernel+initrd kernel+iso kernel+squashfs qcow2-bios qcow2-efi raw-bios raw-efi rpi3 tar tar-kernel-initrd vhd vmdk ]
  -name string
    	Name to use for output files
  -o string
    	File to use for a single output, or '-' for stdout
  -pull
    	Always pull images
  -size string
    	Size for output image, if supported and fixed size (default "1024M")
make: *** [Makefile:25: image-amd64] Error 2

I was not able to get nix-shell working so I installed linux-kit v0.8
That may be the problem...

Pensando driver in Kernel

I was having a chat with @mmlb about the kernel drivers Equinix Metal relies on. Pensando is one of them; it is used to interact with Smart NICs.

@thebsdbox pointed out that something should be already available https://github.com/linuxkit/linuxkit/blob/4cdf6bc56dd43227d5601218eaccf53479c765b9/kernel/config-5.6.x-x86_64#L2180

For general information: LinuxKit offers a pre-compiled kernel and a workflow to load kernel modules, or to build your own kernel or modules. The documentation on how to do it is here: https://github.com/linuxkit/linuxkit/blob/4cdf6bc56dd43227d5601218eaccf53479c765b9/docs/kernels.md

wget https://s.gianarb.it/hook/hook-master.tar.gz throws 404

Expected Behaviour

tar should be available

Current Behaviour

404

Evaluate kexec

At the moment we run the tink-docker image with all the privileges to allow reboot. We need to determine the path for allowing kexec to a provisioned OS.

cc/ @mmlb

Wrong /dev/null permission making ubuntu jammy deployment impossible

When switching the sandbox project to deploy ubuntu jammy, running apt update with the cexec action fails due to not having permission to write to /dev/null

Expected Behaviour

apt update should run successfully when deploying the ubuntu jammy image. For that to work, the permissions on /dev/null need to be 666.

Current Behaviour

apt update in the cexec action fails when deploying the ubuntu jammy image because it can't write to /dev/null: the permissions on /dev/null are 660.

Possible Solution

First I updated the cexec container to mount /dev as rw so I could update the permissions from the template. Then I switched to a more general approach and updated hook-docker to set the correct permissions:

hook-docker/main.go
────────────────────────────────────────────────────────────────────────────────────────────────────────────

──────────────────┐
31: func main() { │
──────────────────┘
 31 ⋮ 31 │    fmt.Println("Starting Tink-Docker")
 32 ⋮ 32 │    go rebootWatch()
 33 ⋮ 33 │
    ⋮ 34 │    fmt.Println("Make /dev/null writeable for all users!")
    ⋮ 35 │    cmd := exec.Command("chmod", "666", "/dev/null")
    ⋮ 36 │    cmd.Stdout = os.Stdout
    ⋮ 37 │    cmd.Stderr = os.Stderr
    ⋮ 38 │    err := cmd.Run()
    ⋮ 39 │    if err != nil {
    ⋮ 40 │        panic(err)
    ⋮ 41 │    }
    ⋮ 42 │
 34 ⋮ 43 │    // Parse the cmdline in order to find the urls for the repository and path to the cert
 35 ⋮ 44 │    content, err := ioutil.ReadFile("/proc/cmdline")
 36 ⋮ 45 │    if err != nil {

──────────────────┐
74: func main() { │
──────────────────┘
 65 ⋮ 74 │    }
 66 ⋮ 75 │
 67 ⋮ 76 │    // Build the command, and execute
 68 ⋮    │    cmd := exec.Command("/usr/local/bin/docker-init", "/usr/local/bin/dockerd")
    ⋮ 77 │    cmd = exec.Command("/usr/local/bin/docker-init", "/usr/local/bin/dockerd")
 69 ⋮ 78 │    cmd.Stdout = os.Stdout
 70 ⋮ 79 │    cmd.Stderr = os.Stderr
 71 ⋮ 80 │    err = cmd.Run()

While I got it working I don't know if there are better ways to solve this problem.

Steps to Reproduce (for bugs)

Try deploying ubuntu jammy image with the sandbox

Context

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS): Linux
How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details: Sandbox & docker-compose deploying on bare-metal
Link to your project or a code example to reproduce issue:

Adding the lvm2 package to the kernel Dockerfile leads to a kernel build failure

Hi guys,
I want to add lvm2 to my Hook kernel.
The kernel's Dockerfile looks like this:

FROM linuxkit/alpine:e2391e0b164c57db9f6c4ae110ee84f766edc430 AS kernel-build
RUN apk add \
    argp-standalone \
    automake \
    bash \
    bc \
    binutils-dev \
    bison \
    build-base \
    curl \
    diffutils \
    flex \
    git \
    gmp-dev \
    gnupg \
    installkernel \
    kmod \
    elfutils-dev \
    linux-headers \
    mpc1-dev \
    mpfr-dev \
    ncurses-dev \
    openssl-dev \
    patch \
    rsync \
    sed \
    squashfs-tools \
    tar \
    xz \
    xz-dev \
    zlib-dev \
    lvm2 # add lvm2

Expected Behaviour

The kernel build succeeds.

Current Behaviour

docker buildx build --platform linux/amd64 --build-arg KERNEL_VERSION=5.10.85 --build-arg KERNEL_SERIES=5.10.x --build-arg EXTRA= --build-arg DEBUG= --label org.opencontainers.image.source=https://github.com/linuxkit/linuxkit --label org.opencontainers.image.revision=fd8ca1864dacef7ac819016dbbb2ad52433d0c0d --no-cache -t xxxxxxx .
[+] Building 5.1s (6/24)
=> [internal] load build definition from Dockerfile 0.2s
=> => transferring dockerfile: 32B 0.0s
=> [internal] load .dockerignore 0.3s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/linuxkit/alpine:e2391e0b164c57db9f6c4ae110ee84f766edc430 2.5s
=> CACHED [kernel-build 1/19] FROM docker.io/linuxkit/alpine:e2391e0b164c57db9f6c4ae110ee84f766edc430@sha256:bca9e72a17c1c74cf7b28a397c643048fd055fa096a3edfe1d16365a5307b9d9 0.0s
=> [internal] load build context 0.7s
=> => transferring context: 8.63kB 0.0s
=> ERROR [kernel-build 2/19] RUN apk add argp-standalone automake bash bc binutils-dev bison build-base curl diffutils flex git gmp-de 2.1s

[kernel-build 2/19] RUN apk add argp-standalone automake bash bc binutils-dev bison build-base curl diffutils flex git gmp-dev gnupg installkernel kmod elfutils-dev linux-headers mpc1-dev mpfr-dev ncurses-dev openssl-dev patch rsync sed squashfs-tools tar xz xz-dev zlib-dev lvm2 util-linux udev ntfs-3g:
#5 1.367 lvm2 (missing):
#5 1.367 ERROR: unsatisfiable constraints:
#5 1.369 required by: world[lvm2]
#5 1.369 ntfs-3g (missing):
#5 1.369 required by: world[ntfs-3g]

failed to solve: rpc error: code = Unknown desc = executor failed running [/bin/sh -c apk add argp-standalone automake bash bc binutils-dev bison build-base curl diffutils flex git gmp-dev gnupg installkernel kmod elfutils-dev linux-headers mpc1-dev mpfr-dev ncurses-dev openssl-dev patch rsync sed squashfs-tools tar xz xz-dev zlib-dev lvm2 util-linux udev ntfs-3g]: exit code: 2

Is the way I added the package wrong?

Any help is greatly appreciated.

Kernel publish job is having an issue

Expected Behaviour

The kernel publish job shows success but fails to upload the multi-arch image to Quay. This needs further troubleshooting: https://github.com/tinkerbell/hook/actions/runs/558935852

Add alias docker="ctr -n services.linuxkit t exec -t --exec-id test docker docker" to the default releases of Hook

Expected Behaviour

For convenience, the docker-cli should be available on a tink-worker (docker ps -a / docker logs tink-server / etc).

Current Behaviour

One has to first create an alias docker before being able to interact with the container runtime on a tink-worker

Possible Solution

Possibly a way to include an alias is described here: linuxkit/linuxkit#2364

Steps to Reproduce (for bugs)

n/a

Context

We've had serious issues troubleshooting why tink-workers were not starting their assigned workflows, and every time a machine booted into LinuxKit we had to add this alias to see the actual status of the machine. To put it mildly, it was cumbersome :D

How to enable docker insecure-registries on OSIE

I have an internal Docker registry that uses plain HTTP. Whenever we use internally tagged images in Actions, the installation process gets stuck. When I then tried to pull the image from the console, it said "http: server gave HTTP response to HTTPS client".

Expected Behaviour

The image can be pulled successfully.

Current Behaviour

Error response from daemon

Possible Solution

Pass arguments from the server to Hook.

Steps to Reproduce (for bugs)

build or retag the action images and push them to local registries
local registries are using http protocol
use these images in action

Add multiple architectures

LinuxKit supports a platform flag that uses qemu to build for multiple architectures.
We should use that to cover more architectures, for example arm/v6 and arm/v7.

add it to tar.gz

Running `make dev-dist` w/ linuxkit v0.8 and docker-ce v5.20.x; returns errors about unknown flags of the respective tools

Expected Behaviour

make generates the expected files vmlinuz-x86_64 and initramfs-x86_64

Current Behaviour

The build stops with errors describing that -docker is an unknown flag; similar for -load

Possible Solution

sed -e 's/ -docker//g' -i hook/Makefile and sed -e 's/ -load/ --load/g' -i hook/Makefile ;)

Steps to Reproduce (for bugs)

Clone repo
make dev-dist

Context

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS):
Ubuntu Server 20.04.x w/ linuxkit v0.8 and docker-ce v5.20.x
How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
N/A; this issue occurs during generation of Hook
Link to your project or a code example to reproduce issue:

changes to `kernel/` directory require `validation` check

The validation check is required for all PRs but does not run for changes only in the kernel/ directory. As a result, PRs like #144 cannot meet the merge requirements.

Expected Behaviour

A change only in the kernel/ directory is still able to satisfy all PR checks.

Hook dynamic runtime driver support

hook release filename changed

Expected Behaviour

The default sandbox configuration should work with hook, since it is set as default.
Release assets should not be updated/replaced without a new release being made. It is extremely confusing when release assets change and the date is not updated to reflect that there was indeed a recent change.

Current Behaviour

Sandbox will not be able to pull hook since it references OSIE_DOWNLOAD_URL="https://github.com/tinkerbell/hook/releases/download/5.10.57/hook-x86_64.tar.gz" and that file no longer exists. The x86_64 release for hook now lives at https://github.com/tinkerbell/hook/releases/download/5.10.57/hook_x86_64.tar.gz. This name change also seems to break the osie/lastmile.sh script in sandbox since filenames have changed inside the hook release archive.

We may have also experienced additional issues with the latest hook asset available on the release page. It seemed like hook wasn't properly trusting or able to connect to the sandbox registry after we made some modifications to work with the current version of hook available on the release page.

Possible Solution

Keeping consistent with filenames for hook would be very beneficial if other things depend on that naming. Additionally, some documentation for general hook configuration and troubleshooting would be helpful.

Steps to Reproduce (for bugs)

Examine deploy/compose/.env in sandbox
Compare default OSIE URL in .env against URL for current hook release
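The comparison in the steps above can be sketched as a small probe that tries both asset names. The helper names (`candidateURLs`, `firstAvailable`, `headOK`) are hypothetical, and the sketch assumes that a HEAD request to a release download URL ends in 200 once redirects are followed.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// releaseBase is the download prefix quoted in the report above.
const releaseBase = "https://github.com/tinkerbell/hook/releases/download"

// candidateURLs returns both asset names seen for a release: the
// underscore form first, then the older hyphen form.
func candidateURLs(version, arch string) []string {
	return []string{
		fmt.Sprintf("%s/%s/hook_%s.tar.gz", releaseBase, version, arch),
		fmt.Sprintf("%s/%s/hook-%s.tar.gz", releaseBase, version, arch),
	}
}

// firstAvailable returns the first URL for which exists reports true.
func firstAvailable(urls []string, exists func(string) bool) (string, bool) {
	for _, u := range urls {
		if exists(u) {
			return u, true
		}
	}
	return "", false
}

// headOK reports whether a HEAD request to url ends in 200 OK.
func headOK(url string) bool {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Head(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	urls := candidateURLs("5.10.57", "x86_64")
	if u, ok := firstAvailable(urls, headOK); ok {
		fmt.Println("found:", u)
	} else {
		fmt.Println("no release asset found under either name")
	}
}
```

A probe like this only works around the rename on the consumer side; it does not replace the request for stable asset names.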

Context

We are attempting to develop an Ansible role to deploy the Tinkerbell sandbox, and this issue has perplexed us quite a bit. It is our opinion that release artifacts should not change without a separate, explicit acknowledgement that the release has changed.

Your Environment

Operating System and version (e.g. Linux, Windows, MacOS): Fedora 34 and Debian
How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details: Custom ansible.
Link to your project or a code example to reproduce issue: https://github.com/jkl92/sandbox/tree/ansible-podman_container

Write a tink-worker-init binary that sets the stage for the tink worker to run

Hey!

As we discussed recently, and looking at #2, you figured out how to reliably interact with docker.sock from inside a service.

The box has to be provisioned with:

A tink-worker configuration file in YAML format that can be passed to the tink-worker container
The registry TLS certificate, downloaded and placed in the right directory

We ended up with the idea of building a binary that will run before the docker-init script in LinuxKit and will set up those bits.

Graceful reboot

Possibilities:

Pass pid:host to tink-docker which may allow a reboot syscall to work.
Some third container shenanigans that can signal to the OS.
ACPI call ... No clue how this would work.. sounds cool though.

Allow for custom data

We need to add functionality to Hook that facilitates custom OSIE data for multiple versions.

Add a License file

There is no license file for this repository, but I'm guessing it should be Apache 2.0 as the rest of the project is?

	- name: dhcpcd
	image: linuxkit/dhcpcd:v0.8
	command: ["/sbin/dhcpcd", "--nobackground", "-f", "/dhcpcd.conf", "-1"]