oci-systemd-hook's Introduction

OCI systemd hooks

oci-systemd-hook enables users to run systemd in Docker and OCI-compatible runtimes such as runc without requiring the --privileged flag.

This project produces a C binary that can be used with runc and Docker (with minor code changes). If you clone this branch and build/install oci-systemd-hook, a binary named oci-systemd-hook should be placed in /usr/libexec/oci/hooks.d.

When Docker or OCI runc containers are run with this executable installed, oci-systemd-hook is called after the container is provisioned and just before it is started. If the CMD to run inside the container is init or systemd, the hook configures the container to run a systemd environment. For all other CMDs, the hook simply exits.
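The CMD check can be sketched in C as follows. This is a minimal illustration, not the hook's actual source: the helper names cmd_basename and cmd_is_systemd are hypothetical, and the sketch assumes the decision is made on the final path component of the command alone.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return the final path component of cmd, e.g. "/sbin/init" -> "init". */
const char *cmd_basename(const char *cmd)
{
    const char *slash = strrchr(cmd, '/');
    return slash ? slash + 1 : cmd;
}

/* Hypothetical helper: true when the container command looks like systemd. */
bool cmd_is_systemd(const char *cmd)
{
    if (cmd == NULL || *cmd == '\0')
        return false;
    const char *name = cmd_basename(cmd);
    return strcmp(name, "init") == 0 || strcmp(name, "systemd") == 0;
}
```

Under that assumption, an entrypoint such as /init would also match, even when the program behind it is not systemd.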

When oci-systemd-hook detects systemd inside of the container it does the following:

  • Mounts a tmpfs on /run and /tmp
  • If there is content in the container image's /run and /tmp, that content is copied onto the tmpfs
  • Creates an /etc/machine-id based on the container's UUID
  • Mounts the host's /sys/fs/cgroup file systems read-only into the container
  • Mounts /sys/fs/cgroup/systemd read/write into the container

When the container stops, these file systems are unmounted.

systemd is expected to run within the container without the --privileged option. However, you still need to specify a special --stop-signal. By default, Docker sends SIGTERM to PID 1 when stopping a container, but systemd does not shut down properly when it receives SIGTERM. systemd needs to receive the RTMIN+3 signal to shut down properly.
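For a supervisor that signals the container's PID 1 directly instead of going through docker stop, the signal number can be computed as in the sketch below. The function names are hypothetical. With glibc, SIGRTMIN is resolved at run time rather than being a compile-time constant, so the value is computed in a function instead of a static initializer.

```c
#include <signal.h>
#include <sys/types.h>

/* Signal number corresponding to --stop-signal=RTMIN+3. */
int systemd_stop_signal(void)
{
    return SIGRTMIN + 3;
}

/*
 * Ask the systemd instance running as `pid` to shut down.
 * Returns 0 on success, -1 with errno set on failure (see kill(2)).
 */
int request_systemd_shutdown(pid_t pid)
{
    return kill(pid, systemd_stop_signal());
}
```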

Usage

If you created a container image based on a Dockerfile like the following:

cat Dockerfile
FROM fedora:latest
ENV container docker
RUN yum -y update && yum -y install httpd && yum clean all
RUN systemctl mask dnf-makecache.timer && systemctl enable httpd
CMD [ "/sbin/init" ]

(The systemctl mask dnf-makecache.timer is a workaround for a container base image bug)

You should then be able to execute the following commands:

docker build -t httpd .
docker run -ti --stop-signal=RTMIN+3 httpd

If you run this hook along with the oci-register-machine OCI hook, you will be able to show the container's journal information on the host using journalctl.

journalctl -M CONTAINER_UUID

To use this directly with runc, modify or add the following to config.json.

    "hooks": {
        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook"
            }
        ]
    },
...
    "process": {
        "capabilities": {
...
                "CAP_AUDIT_WRITE",
                "CAP_KILL",
                "CAP_NET_BIND_SERVICE",
                "CAP_MKNOD",
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_SYS_CHROOT"
...
        "env": [
            "container=oci",

Disabling oci-systemd-hook

To disable oci-systemd-hook for a particular run, which is primarily useful in an Atomic Host environment, the environment variable 'oci-systemd-hook' can be set to 'disabled'. This prevents oci-systemd-hook from being run for that invocation. A sample usage is:

docker run --env oci-systemd-hook=disabled -it --rm  fedora /bin/bash
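A check for this switch could look like the sketch below. It is an illustration only: the function name hook_disabled_by_env is hypothetical, and it assumes the container environment is available as a NULL-terminated array of "KEY=VALUE" strings, which is the form the env entries take in config.json.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * env is a NULL-terminated array of "KEY=VALUE" strings.
 * Returns true when oci-systemd-hook is set to exactly "disabled".
 */
bool hook_disabled_by_env(const char *const *env)
{
    static const char needle[] = "oci-systemd-hook=disabled";

    if (env == NULL)
        return false;
    for (size_t i = 0; env[i] != NULL; i++) {
        if (strcmp(env[i], needle) == 0)
            return true;
    }
    return false;
}
```

Matching the whole "KEY=VALUE" string keeps the check strict: a value such as disabledX or a differently named variable does not disable the hook.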

To build and install

Prior to installing oci-systemd-hook, install the following packages on your Linux distribution:

  • autoconf
  • automake
  • gcc
  • git
  • go-md2man
  • libmount-devel
  • libselinux-devel
  • yajl-devel

In Fedora, you can use this command:

 yum -y install \
    autoconf \
    automake \
    gcc \
    git \
    go-md2man \
    libmount-devel \
    libselinux-devel \
    yajl-devel

Then clone this branch and follow these steps:

git clone https://github.com/projectatomic/oci-systemd-hook
cd oci-systemd-hook
autoreconf -i
./configure --libexecdir=/usr/libexec/oci/hooks.d
make
make install

oci-systemd-hook's People

Contributors

acudovs, adelton, brahim-raddahi-is4u, cevich, cgwalters, giuseppe, jwessel, mrunalp, mscherer, rh-ulrich-o, rhatdan, tomsweeneyredhat, weizhang555, wking

oci-systemd-hook's Issues

Provide a way to override setting tmpfs

I am unsure of who the best people would be to fix this particular issue but I figured I would start here. After upgrading to Fedora 25 and switching to the Fedora provided Docker version I started having issues running Plex in a container due to permission errors (specifically exec permissions on files in /run).

After digging I found that it was related to oci-systemd-hook, which mounts /run as a noexec tmpfs. The main issue that I am hitting is that the image (docker.io/linuxserver/plex) uses s6-overlay with an entrypoint of /init. From what I can tell, this causes oci-systemd-hook to think that it is booting a systemd container and to set /run to tmpfs.

Is it possible to override the hook on a per-container basis? While I could remove the hook completely I do want to test some use cases that use systemd in a container so it would be beneficial to keep if possible.

Thanks!

Trying to run docker in a systemd container faces cgroup path error

When trying to run docker in a systemd container with oci-systemd-hook, things go wrong with the cgroup path.

# docker run -ti busybox
docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:393: container init caused \"rootfs_linux.go:57: mounting \\\"cgroup\\\" to rootfs \\\"/var/lib/docker/devicemapper/mnt/e83e8ed9017e5dfe76d9bc6473d6d630902bde19a5cdecebb4cc26025381673d/rootfs\\\" at \\\"/sys/fs/cgroup\\\" caused \\\"stat /sys/fs/cgroup/44a49d3dbe07bc2e36acafd22e025c10d670d67718ecba0fc7df7aa611e6971a: no such file or directory\\\"\"".

But if we skip the cgroup mount in the hook, docker runs fine.

So I want to ask @rhatdan: why should we mount cgroups into the container when starting a systemd container?

I removed CAP_NET_RAW from hook-config.json but the container still has cap_net_raw

I'm using oci-systemd-hook together with oci-add-hooks and Docker Swarm on CentOS 7.

I removed CAP_NET_RAW from the hook-config.json suggested at https://github.com/projectatomic/oci-systemd-hook
but when I run "capsh --print" in the Docker container, "cap_net_raw" is still printed out.

{
        "hooks": {
                "prestart": [
                        {
                                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                                "args": [ "prestart" ]
                        }
                ],
                "poststop": [
                        {
                                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                                "args": [ "poststop" ]
                        }
                ]
        },
        "process": {
                "env": [
                        "container=oci"
                ],
                "capabilities": [
                        "CAP_AUDIT_WRITE",
                        "CAP_KILL",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_MKNOD",
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_SYS_CHROOT"
                ]
        }
}
$ capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+i
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=10000(user_name)
gid=10000(group_name)
groups=10000(group_name_1),10002(group_name_2)
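The same bounding-set check can be done from C with the prctl(2) PR_CAPBSET_READ operation, which may help when capsh is not installed in the image. A minimal Linux-only sketch; the function name cap_in_bounding_set is hypothetical.

```c
#include <linux/capability.h>
#include <sys/prctl.h>

/*
 * Returns 1 if `cap` is in the calling thread's capability bounding set,
 * 0 if it is not, and -1 (errno set) if the kernel rejects the request.
 */
int cap_in_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_READ, cap, 0, 0, 0);
}
```

Calling cap_in_bounding_set(CAP_NET_RAW) inside the container should agree with the "Bounding set" line printed by capsh.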

oci-systemd-hook: systemdhook <error>: parse error: trailing garbage#012

We are running kubernetes 1.5.5 with docker-1.12.6 and have lots of pods/services/rcs in our environment. For the last 2 days we have been facing an issue when starting any new pod. It seems to be related to the two values below, set in oci-systemd-hook/systemdhook.c:

#define BUFLEN 1024
#define CONFIGSZ 65536

We are getting the following error:
oci-systemd-hook: systemdhook <error>: parse error: trailing garbage#012
and now we have started getting "Config file too big", because we added a new service to the kubernetes cluster. All of the env for the cluster is added to the container's config.json, so the file size increases with each service.
We cross-checked the container's JSON size and it is around 64KB.
Is there any way to set this size when initializing this hook in docker?
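One way to avoid a fixed compile-time limit such as CONFIGSZ is to read the configuration into a buffer that grows on demand. The sketch below is only an illustration of that idea, not the project's code or its eventual fix. The function name read_all and the 4096-byte starting size are assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read all of `fp` into a heap buffer that doubles as needed, so the input
 * is not capped by a compile-time constant. On success returns a
 * NUL-terminated buffer (caller frees) and stores the byte count in *len.
 * Returns NULL on read or allocation failure.
 */
char *read_all(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    for (;;) {
        /* Keep one byte spare for the terminating NUL. */
        size_t n = fread(buf + used, 1, cap - used - 1, fp);
        used += n;
        if (n == 0)
            break;              /* end of file or read error */
        if (used == cap - 1) {  /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[used] = '\0';
    *len = used;
    return buf;
}
```

A hook reading its state from standard input could call read_all(stdin, &len) and pass the result to the JSON parser. A real implementation would probably still enforce some generous upper bound to guard against runaway input.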

systemd fails to run in container due to mounting issues

Description of problem:
When running a container systemd doesn't complete initialisation and systemctl cannot be used to start/stop services in the container, complaining of a dbus error.

Version-Release number of selected component (if applicable):
docker-1.13.1-7.git14cc629.fc26.x86_64
docker-client-4.0.6-5.fc26.noarch
docker-rhel-push-plugin-1.13.1-7.git14cc629.fc26.x86_64
docker-v1.10-migrator-1.13.1-7.git14cc629.fc26.x86_64
docker-common-1.13.1-7.git14cc629.fc26.x86_64
oci-systemd-hook-0.1.7-1.git1788cf2.fc26.x86_64

How reproducible:
Always

Steps to Reproduce:

  1. mkdir http-test ; cd http-test
  2. cat > Dockerfile <<EOF
    FROM fedora:latest
    ENV container oci
    RUN dnf -y install httpd; dnf clean all ; systemctl enable httpd
    STOPSIGNAL SIGRTMIN+3
    EXPOSE 80
    CMD ["/sbin/init"]
    EOF
  3. docker build -t http-test .
  4. docker run -dt --name http-test http-test

Actual results:

docker logs http-test 
Failed to determine whether /sys is a mount point: Operation not permitted
Failed to determine whether /proc is a mount point: Operation not permitted
Failed to determine whether /dev is a mount point: Operation not permitted
Failed to determine whether /dev/shm is a mount point: Operation not permitted
Failed to determine whether /run is a mount point: Operation not permitted
Failed to determine whether /sys/fs/cgroup is a mount point: Operation not permitted
Failed to determine whether /sys/fs/cgroup/systemd is a mount point: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.
Freezing execution.

Expected results:

 docker top http-test 
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                11860               11843               0                   10:15               ?                   00:00:00            /sbin/init
root                11954               11860               0                   10:15               ?                   00:00:00            /usr/lib/systemd/systemd-journald
root                12154               11860               1                   10:15               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              12155               12154               0                   10:15               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              12156               12154               0                   10:15               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              12157               12154               0                   10:15               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              12162               12154               0                   10:15               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
apache              12181               12154               0                   10:15               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
dbus                12186               11860               0                   10:15               ?                   00:00:00            /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root                12188               11860               0                   10:16               ?                   00:00:00            /usr/lib/systemd/systemd-logind

Additional info:
The hook works with SYS_ADMIN caps enabled for the container.

docker run -dt --cap-add SYS_ADMIN --name http-test http-test

Bugzilla ticket: https://bugzilla.redhat.com/show_bug.cgi?id=1443922

Insane cgroup mounts in /sys/fs/cgroup/systemd/libpod_parent with podman+bind mounts

  1. Firstly, this issue is only triggered when using bind mounts with podman.
  2. When a bind mount is used, the cgroups mounted in /sys/fs/cgroup/systemd/libpod_parent recurse further after each stop/start cycle. In fact there are 86(!) mounts as below, but only four unique mount paths. These directories don't exist and the cgroup cannot be unmounted. Furthermore, after the third time the container cannot be started. No other container can be started (even those without bind mounts).
  3. @mheon in containers/podman#507 suggests that the bug is here as these crazy mount paths are not created by podman.
  4. Crazy mounts: https://github.com/projectatomic/libpod/files/1991848/cgroup.zip
  5. Any pointers as to who/what is recursively adding to the cgroup mount path?

Reproducer:

# this kludge is necessary otherwise we cannot MS_MOVE the mount
mount --make-private /tmp
podman run --name bobby_silver -v /volumes/podman/home:/home:z --entrypoint /sbin/init fedora:28
podman stop bobby_silver
podman start bobby_silver
podman stop bobby_silver
podman start bobby_silver # third time fails

single:
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr

doubled:

cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)

tripled:
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)

quadrupled:
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr/libpod_parent/libpod-cd8be22a52efaed7e2790d2eb3421c00542c3eb9763bfe715c3ad23647c419e0/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)

can not get the container's journal information on the host, using journalctl.

I created a container image based on a Dockerfile like the following:

cat Dockerfile
FROM fedora:latest
ENV container docker
RUN yum -y update && yum -y install httpd && yum clean all
RUN systemctl mask dnf-makecache.timer && systemctl enable httpd
CMD [ "/sbin/init" ]

Then execute the following commands

# systemctl status systemd-machined.service 
● systemd-machined.service - Virtual Machine and Container Registration Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-machined.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2018-10-29 11:15:25 +08; 28s ago
     Docs: man:systemd-machined.service(8)
           http://www.freedesktop.org/wiki/Software/systemd/machined
 Main PID: 2577 (systemd-machine)
   Status: "Processing requests..."
    Tasks: 1
   Memory: 424.0K
   CGroup: /system.slice/systemd-machined.service
           └─2577 /usr/lib/systemd/systemd-machined

Oct 29 11:15:25 localhost.localdomain systemd[1]: Starting Virtual Machine and Container Registration Service...
Oct 29 11:15:25 localhost.localdomain systemd[1]: Started Virtual Machine and Container Registration Service.
Oct 29 11:15:25 localhost.localdomain systemd-machined[2577]: New machine 6c4741198c33df0e25ecbff31fa9634f.
# cat /etc/oci-register-machine.conf 
# Disable oci-register-machine by setting the disabled field to true
disabled : false
# docker build -t httpd .
# docker run -tid --stop-signal=RTMIN+3 httpd

But I can not get the container's journal information on the host, using journalctl.

# machinectl list
MACHINE                          CLASS     SERVICE
6c4741198c33df0e25ecbff31fa9634f container docker 

1 machines listed.
# journalctl -M 6c4741198c33df0e25ecbff31fa9634f
No journal files were found.
-- No entries --

Any ideas are welcome.


my docker version is:

# docker version
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-75.git8633870.el7.centos.x86_64
 Go version:      go1.9.4
 Git commit:      8633870/1.13.1
 Built:           Fri Sep 28 19:45:08 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-75.git8633870.el7.centos.x86_64
 Go version:      go1.9.4
 Git commit:      8633870/1.13.1
 Built:           Fri Sep 28 19:45:08 2018
 OS/Arch:         linux/amd64
 Experimental:    false

mount using :shared may cause crashes

docker run -it --rm --cap-add sys_admin -v /var/lib/lxc/:/var/lib/lxc/:ro,shared -v /var/lib/lxc/lxcfs/proc/meminfo:/proc/meminfo:rw -m 100m centos:tag /sbin/init
This causes a crash:

Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:393: container init caused \"process_linux.go:376: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\"".

Configure via container labels?

In #76 we talked about configuration a bit.

How about supporting configuration via labels? It feels really like a terrible hack to me to strcmp() the container's argv[0]. Something like:

LABEL org.projectatomic.systemd=yes
?
Then we could have:
LABEL org.projectatomic.systemd.journal=persistent
too?

running prestart hook 0 caused "signal: segmentation fault (core dumped)"

When trying to use oci-systemd-hook with runc container, I get:

Nov 29 12:55:48 gce-agents runc[23785]: process_linux.go:329: running prestart hook 0 caused "signal: segmentation fault (core dumped): "
Nov 29 12:55:48 gce-agents runc[23830]: open /run/runc/ga/state.json: no such file or directory

I've compiled oci-systemd-hook from master branch of this repo. And I'm using this config file: config.json.template.txt

Mounting the docker socket inside a container does not work

I'm trying to mount the docker socket of my host machine into a docker container:

docker run -it --rm --privileged -v /run/docker.sock:/run/docker.sock [...]

This works like a charm without oci-systemd-hook.

Unfortunately the container does not start with oci-systemd-hook:

/usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:334: running prestart hook 2 caused \"error running hook: exit status 1, stdout: , stderr: \"".

journalctl extract:

Aug 04 11:21:21 leo oci-systemd-hook[11861]: systemdhook <debug>: rootfs=/var/lib/docker/btrfs/subvolumes/ef68e2c17a956a8fd7b0e8d1def13c2454fadd4aa8ccae1b6198fd113c882ca0
Aug 04 11:21:21 leo oci-systemd-hook[11861]: systemdhook <debug>: gidMappings not found in config
Aug 04 11:21:21 leo oci-systemd-hook[11861]: systemdhook <debug>: GID: 0
Aug 04 11:21:21 leo oci-systemd-hook[11861]: systemdhook <debug>: uidMappings not found in config
Aug 04 11:21:21 leo oci-systemd-hook[11861]: systemdhook <debug>: UID: 0
Aug 04 11:21:21 leo oci-systemd-hook[11861]: systemdhook <error>: Failed to move mount /var/lib/docker/btrfs/subvolumes/ef68e2c17a956a8fd7b0e8d1def13c2454fadd4aa8ccae1b6198fd113c882ca0//run/docker.sock to /tmp/ocitmp.sr55QE//docker.sock: I
Aug 04 11:21:21 leo oci-systemd-hook[11861]: systemdhook <error>: Failed to move /run/docker.sock to /tmp/ocitmp.sr55QE: Invalid argument

I'm running oci-systemd-hook-0.1.11-1.git1ac958a.fc26.x86_64 on Fedora 26.
I could reproduce this also on Fedora 25 with oci-systemd-hook-0.1.6-1.gitfe22236.fc25.x86_64.

If you require any further information, I will be happy to provide it.

Cannot move mount from /tmp/ocitmp.XXXX to .../merged/run

On Fedora 28 /tmp is mounted as shared.

When doing the move mount from /tmp/ocitmp.XXXX to the container overlay it fails with EINVAL.

Steps:

  1. Create a systemd-based container with bind mount. (The issue does not happen if the container does not have bind mounts)
podman create --name test_1 --entrypoint /sbin/init -v /volumes/test/home:/home:z --env container=podman fedora:28
podman start test_1
oci-systemd-hook[5870]: systemdhook <error>: 4962ee46e281: Failed to move mount /tmp/ocitmp.jIxv5p to /var/lib/containers/storage/overlay/5348f52873a3f5340e3461d5fb15cbf56acd48a73989673dfd0d1a9107e462b4/merged/run: Invalid argument
  2. Setting /tmp to private makes this work twice, but leads to other problems with containers and bind mounts, namely cgroup debris: containers/podman#730
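Whether a mount point such as /tmp has shared propagation can be read from /proc/self/mountinfo, where a shared mount carries a "shared:N" optional field before the " - " separator. The sketch below checks a single mountinfo line. It is an illustration only: the function name mountinfo_line_is_shared is hypothetical and the hook is not known to perform this check.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Inspect one line of /proc/self/mountinfo. Returns true when the line
 * describes `mount_point` and carries a "shared:N" optional field.
 */
bool mountinfo_line_is_shared(const char *line, const char *mount_point)
{
    char mnt[4096];
    int consumed = 0;

    /* Fields: mount id, parent id, major:minor, root, mount point, options. */
    if (sscanf(line, "%*d %*d %*s %*s %4095s %*s%n", mnt, &consumed) != 1)
        return false;
    if (consumed == 0 || strcmp(mnt, mount_point) != 0)
        return false;

    /* Optional fields run from here up to the " - " separator. */
    const char *opt = line + consumed;
    const char *sep = strstr(opt, " - ");
    const char *hit = strstr(opt, " shared:");
    return sep != NULL && hit != NULL && hit < sep;
}
```

Feeding each line of /proc/self/mountinfo to this function with "/tmp" as the mount point would tell a caller whether the temporary directory sits on a shared mount before attempting the move.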

Failed to run /sbin/init in a container following fedora wiki instructions

Hi,
I was recently searching for running systemd in a docker container, and I have found systemd-containers on fedora wiki. I followed the instructions but got the following output:

[root@localhost Dockertest]# docker run -ti --tmpfs /run --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro fedora:systemd                                                                                          
systemd 238 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Detected virtualization container-other.
Detected architecture x86-64.

Welcome to Fedora 28 (Twenty Eight)!

Set hostname to <1417cece4bd8>.
Initializing machine ID from random generator.
Failed to create /system.slice/docker-1417cece4bd88d245ef53de8f422f07fffcd2cd50dba0a069b0464a333a92829.scope/init.scope control group: Permission denied                                                          
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object, freezing.
Freezing execution

My system is a fedora 28 VM, my docker version is docker-1.13.1-51.git4032bd5.fc28.x86_64, and the dockerfile used to build is,

FROM registry.fedoraproject.org/fedora:28
ENV container=oci
RUN dnf -y install httpd; dnf clean all; systemctl enable httpd
STOPSIGNAL SIGRTMIN+3
EXPOSE 80
CMD [ "/sbin/init" ]

I also saw a similar error on https://bugzilla.redhat.com/show_bug.cgi?id=1402264, but I did not install oci-systemd-hook at all on my host. Any suggestions for how to approach/fix the problem? Thanks =).

support `readonly: false` containers

When trying to use rw rootfs:

$ sudo runc run ga
process_linux.go:330: running prestart hook 1 caused "exit status 1: "
...
    "root": {
        "path": "rootfs",
        "readonly": false
    },
...
    "hooks": {
        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook"
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine"
            }
        ],
        "poststop": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook"
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine"
            }
        ]
    },
...

running prestart hook 0 caused error

Hello,

oci-systemd-hook on latest RHEL atomic host doesn't work with runc.

# atomic host status
State: idle
Deployments:
● rhel-atomic-host:rhel-atomic-host/7/x86_64/standard
             Version: 7.3.6 (2017-06-23 16:20:45)
              Commit: e073a47baa605a99632904e4e05692064302afd8769a15290d8ebe8dbfd3c81b
# rpm -q oci-systemd-hook runc
oci-systemd-hook-0.1.7-4.gite533efa.el7.x86_64
runc-1.0.0-6.gite800860.el7.x86_64

This is the error I'm getting:

runc[15209]: container_linux.go:259: starting container process caused "process_linux.go:345: container init caused \"process_linux.go:328: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""

Config.json template is here: https://github.com/pschiffe/gce-system-container/blob/master/image/config.json.template

Can you help? How to debug?

invalid machine-id

The new function shortid, which was introduced for logging purposes (I think), causes the id passed to the "prestart" function to be only 12 characters long. This is what is eventually written to "/etc/machine-id" in the container.

Now, if systemd inside the spawned container tries to read this id, it regards it as "invalid" because it's not long enough, resulting in this boot message:

Initializing machine ID from random generator.

The result is that a new machine-id is created on every boot and the journal is written to that directory (so it's not ending up in the mounted /var/log/journal/machine-id directory on the host).

I think we end up here:
https://github.com/systemd/systemd/blob/master/src/core/machine-id-setup.c
at the line that says:

/* Hmm, so, the id currently stored is not useful, then let's generate one */

If I change this:

return strndup(id, 12);

to this:

return strndup(id, 32);

then it is accepted as a valid machine-id and the message "Initializing machine ID from random generator." disappears ...

So I guess this requires a little bit of refactoring to still have a shortid for logging purposes and a valid machine-id at the same time
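The refactoring could keep two separate prefixes of the container ID, one for log lines and one for /etc/machine-id. The sketch below is a hypothetical illustration, not a patch against systemdhook.c: the names id_prefix and machine_id_is_valid are made up, and the validity check assumes the documented machine-id format of 32 lowercase hexadecimal characters.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MACHINE_ID_LEN 32 /* length systemd expects in /etc/machine-id */
#define SHORT_ID_LEN   12 /* abbreviated ID used in log messages */

/* Copy at most n leading characters of id; the caller frees the result. */
char *id_prefix(const char *id, size_t n)
{
    size_t len = strlen(id);
    if (len < n)
        n = len;
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, id, n);
    out[n] = '\0';
    return out;
}

/* True when s is exactly 32 lowercase hexadecimal characters. */
bool machine_id_is_valid(const char *s)
{
    if (s == NULL || strlen(s) != MACHINE_ID_LEN)
        return false;
    for (size_t i = 0; i < MACHINE_ID_LEN; i++) {
        if (!isdigit((unsigned char)s[i]) && !(s[i] >= 'a' && s[i] <= 'f'))
            return false;
    }
    return true;
}
```

With this split, id_prefix(id, SHORT_ID_LEN) serves logging while id_prefix(id, MACHINE_ID_LEN) produces the value to write into the container.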

cgroup remount avc when running systemd container

Using the latest master of git + RHEL 7.3.3. This is fixed in RHBZ#1439382, but still present in the latest upstream master git.

# docker run -d rhel7 init
# ausearch -m AVC
----
time->Wed Apr 5 16:36:37 2017
type=SYSCALL msg=audit(1491424597.983:276): arch=c000003e syscall=165 success=no exit=-13 a0=55f7275fcad2 a1=55f7275f8ce0 a2=55f7275fcad2 a3=100002f items=0 ppid=13067 pid=13086 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="systemd" exe="/usr/lib/systemd/systemd" subj=system_u:system_r:svirt_lxc_net_t:s0:c11,c931 key=(null)
type=AVC msg=audit(1491424597.983:276): avc: denied { remount } for pid=13086 comm="systemd" scontext=system_u:system_r:svirt_lxc_net_t:s0:c11,c931 tcontext=system_u:object_r:svirt_sandbox_file_t:s0:c11,c931 tclass=filesystem
----

Looks like the cgroup mount is messed up.
# diff -u /tmp/good /tmp/bad
--- /tmp/good 2017-04-10 10:23:39.945725760 -0400
+++ /tmp/bad 2017-04-10 10:22:42.046725760 -0400

- tmpfs on /sys/fs/cgroup type tmpfs (ro,relatime,seclabel,mode=755)
-cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
-cgroup on /sys/fs/cgroup/memory type cgroup (ro,relatime,memory)
-cgroup on /sys/fs/cgroup/blkio type cgroup (ro,relatime,blkio)
-cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,relatime,cpuacct,cpu)
-cgroup on /sys/fs/cgroup/devices type cgroup (ro,relatime,devices)
-cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,relatime,net_prio,net_cls)
-cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,relatime,hugetlb)
-cgroup on /sys/fs/cgroup/pids type cgroup (ro,relatime,pids)
-cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,relatime,perf_event)
-cgroup on /sys/fs/cgroup/freezer type cgroup (ro,relatime,freezer)
-cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,relatime,cpuset)
+cgroup on /sys/fs/cgroup/systemd type cgroup (ro,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
+cgroup on /sys/fs/cgroup/systemd/system.slice/docker-170545ba00877062942beb8146971906363b34d278f7117262814e92ae2baa22.scope type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)

-v container propagates cgroup mounts to other containers

All using podman.

  1. Start a container Alice that doesn't use -v
## Alice:
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9075fe81752a0a9383e587ba9af6de76d546cfec3f3d23683d1de165c69ed96f/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
  2. Start container Bobby using -v (notice the weird doubled path, and that these mounts are on the host and not the container):
## Host:
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
## Bobby:
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
  3. Go back to Alice:
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9075fe81752a0a9383e587ba9af6de76d546cfec3f3d23683d1de165c69ed96f/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)
cgroup on /sys/fs/cgroup/systemd/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr/libpod_parent/libpod-9ffaff1cdcab235dc1dabdb25d6d1e209f044957b02b533874e0aaf17c0200db/ctr type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,name=systemd)

Observe that Bobby's host mounts have propagated to Alice.

Unable to run systemd inside RHEL-7.6 base image on Debian unstable

$ dpkg --list | grep systemd
ii  dbus-user-session               1.12.10-1                             amd64        simple interprocess messaging system (systemd --user integration)
ii  libnss-systemd:amd64            239-11                                amd64        nss module providing dynamic user and group name resolution
ii  libpam-systemd:amd64            239-11                                amd64        system and service manager - PAM module
ii  libsystemd0:amd64               239-11                                amd64        systemd utility library
ii  systemd                         239-11                                amd64        system and service manager
ii  systemd-sysv                    239-11                                amd64        system and service manager - SysV links

# skopeo/skopeo copy --src-tls-verify=false docker://registry.access.redhat.com/rhel7:7.6 dir:rootfs

# runc spec

	"hooks": {
		"prestart": [
			{
				"path": "/usr/libexec/oci/hooks.d/oci-systemd-hook"
			}
		]
	},
...
		"args": [
			"/usr/sbin/init"
		],
# runc run -d  root
# runc ps root
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.4  43108  4940 ?        Ss   17:54   0:01 /usr/sbin/init
root       247  0.0  0.2  43108  2692 ?        Ss   18:28   0:00 (journald)

Any idea on how to debug this thing?

cgroup bind mount action in oci hook may increase security risk

I use oci-systemd-hook to run a systemd service in a container, and found that inside this container I can see all cgroup resources, including the cgroups of other, ordinary containers.

root@aaa:/# findmnt
|-/sys sysfs
| '-/sys/fs/cgroup tmpfs
| |-/sys/fs/cgroup/systemd cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/pids cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/net_cls,net_prio cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/freezer cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/devices cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/hugetlb cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/files cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/perf_event cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/memory cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/cpuset cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/blkio cgroup[/docker/abc]
| |
| |-/sys/fs/cgroup/cpu,cpuacct cgroup[/docker/abc]
| |
| '-/sys/fs/cgroup tmpfs
| |-/sys/fs/cgroup/systemd cgroup cgroup rw
| |-/sys/fs/cgroup/pids cgroup cgroup ro
| |-/sys/fs/cgroup/net_cls,net_prio cgroup cgroup ro
| |-/sys/fs/cgroup/freezer cgroup cgroup ro
| |-/sys/fs/cgroup/devices cgroup cgroup ro
| |-/sys/fs/cgroup/hugetlb cgroup cgroup ro
| |-/sys/fs/cgroup/files cgroup cgroup ro
| |-/sys/fs/cgroup/perf_event cgroup cgroup ro
| |-/sys/fs/cgroup/memory cgroup cgroup ro
| |-/sys/fs/cgroup/cpuset cgroup cgroup ro
| |-/sys/fs/cgroup/blkio cgroup cgroup ro
| `-/sys/fs/cgroup/cpu,cpuacct cgroup cgroup ro

root@aaa:/sys/fs/cgroup/cpu# ll docker
total 0
drwxr-xr-x. 6 root root 0 Dec 19 07:38 ./
drwxr-xr-x. 6 root root 0 Nov 17 12:25 ../
drwxr-xr-x. 2 root root 0 Dec 19 13:48 4641782f7b104132cef7ecd688bfe42aa2b5b6a84d5dea70c12165b3af294342/
drwxr-xr-x. 2 root root 0 Dec 19 13:44 4cbe4aebed912e0cbfbbe34e347483771fd5ae0a50ef5f39dda0bf1f7f4ba462/
drwxr-xr-x. 2 root root 0 Dec 19 08:38 9857caa62a8fcc07a2b133879d8e5d4cb4deac4c1f872f0682a82ff280f35211/
-rw-r--r--. 1 root root 0 Nov 17 12:27 cgroup.clone_children
--w--w--w-. 1 root root 0 Nov 17 12:27 cgroup.event_control
-rw-r--r--. 1 root root 0 Nov 17 12:27 cgroup.procs
-rw-r--r--. 1 root root 0 Nov 17 12:27 cpu.cfs_period_us
-rw-r--r--. 1 root root 0 Nov 17 12:27 cpu.cfs_quota_us
-rw-r--r--. 1 root root 0 Nov 17 12:27 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Nov 17 12:27 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Nov 17 12:27 cpu.shares
-r--r--r--. 1 root root 0 Nov 17 12:27 cpu.stat
-r--r--r--. 1 root root 0 Nov 17 12:27 cpuacct.stat
-rw-r--r--. 1 root root 0 Nov 17 12:27 cpuacct.usage
-r--r--r--. 1 root root 0 Nov 17 12:27 cpuacct.usage_percpu
drwxr-xr-x. 2 root root 0 Dec 19 07:43 d1f6317187a78e5a3f07b11a079f7d7d0e2583f4c60e6e166059c8265faa0f47/
-rw-r--r--. 1 root root 0 Nov 17 12:27 notify_on_release
-rw-r--r--. 1 root root 0 Nov 17 12:27 tasks

This looks like a security risk, so I'm curious: why must the cgroups be mounted again?
What does systemd truly need?
Should we mount only the control directory (/sys/fs/cgroup/systemd) and not the other cgroup subsystems?
Or is there another way to run systemd without this risk?
@rhatdan @mrunalp @cgwalters @TomSweeneyRedHat
Many thanks.
cc:@hqhq
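
The exposure shown in the listing above can be checked with a short script. This is a rough sketch that assumes container cgroup directories are named by a 64-character hexadecimal ID, as in the `ll docker` output. The function name `list_container_cgroups` is hypothetical.

```shell
#!/bin/sh
# list_container_cgroups DIR
# Prints the name of every subdirectory of DIR that looks like a
# container ID (exactly 64 lowercase hex characters).
list_container_cgroups() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        if printf '%s\n' "$name" | grep -Eq '^[0-9a-f]{64}$'; then
            printf '%s\n' "$name"
        fi
    done
}
```

Run inside the container as `list_container_cgroups /sys/fs/cgroup/cpu/docker`. More than one line of output means the cgroups of sibling containers are visible.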

Explain security of rw mount

We are considering running this in our production environment, but there is a concern about security.

Is mounting the host's cgroup file systems read-only / read-write into the container secure?

Can one use this functionality to hack into the host system in some way?

oci-systemd-hook run different names, or force it on a container

This is a question or possible issue. My best guess is this is not getting the oci-systemd-hook setup.

I am trying to run docker.io/gitlab/gitlab-ce on CentOS 7.5 with podman, migrating from docker, where it ran without issues.

The GitLab image is based on Ubuntu 16.04 and has the CMD "/assets/wrapper". This wrapper script runs some chef commands for pre-setup and then calls the "init" command. The init command fails with:

Running handlers:
There was an error running gitlab-ctl reconfigure:

execute[init q] (runit::sysvinit line 28) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of init q ----
STDOUT: 
STDERR: Couldn't find an alternative telinit implementation to spawn.
---- End output of init q ----
Ran init q returned 1

So init is not starting up. My question: is there a way to force the oci-systemd-hook to run even if the CMD is not "init"?
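
For context, the README states that the hook only configures the container when the CMD is init or systemd, and exits for any other CMD. The sketch below approximates that decision in shell. It is an illustration of the documented behavior, not the hook's actual C code. Matching on the last path component is an assumption, and `is_systemd_cmd` is a hypothetical name.

```shell
#!/bin/sh
# is_systemd_cmd CMD
# Succeeds when the last path component of CMD is "init" or "systemd",
# mirroring the behavior described in the README (assumed, not verified
# against the hook source).
is_systemd_cmd() {
    case "${1##*/}" in
        init|systemd) return 0 ;;
        *) return 1 ;;
    esac
}
```

Under this reading, `is_systemd_cmd /sbin/init` succeeds while `is_systemd_cmd /assets/wrapper` fails, which would explain why a wrapper script that later calls init does not get the systemd environment.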

oci-systemd-hook leaks logs by default to container host /var/log/journal

For a long time I watched /var/log/journal grow beyond its configured limits, with new subdirectories piling up in that directory...

Found the reason: it is the docker containers that run systemd, together with the RHEL dockerd and oci-systemd-hook. In CI test rounds we run and kill tens of docker containers per day, and the killed containers leave their journal logs behind in /var/log/journal/ on the host. What's worse, the host's journal does not seem to rotate these leftover logs, nor does it count them in "journalctl --disk-usage". The contradictory disk usage reported by "du" and "journalctl --disk-usage" was driving me mad as well (and may be worth a separate issue report). Anyway, the killed containers leave files named "/var/log/journal//system.journal" on the host, and the journal rotation rules probably consider them open and so do not clean them up, which causes the host journal to grow beyond the configured limits.

Versions:
oci-systemd-hook 1:0.1.8-4.1.gite533efa.el7
which was brought in as a dependency of docker-1.12.6-48.git0fdc778-el7

(I'm not sure it was a wise decision to automatically "leak" journal logs from containers to the host. Yes, it is a design decision from Red Hat, and I understand it is intentional. Still, containers were supposed to be isolated environments, so I would have preferred a default that does not leak journal logs (or anything else) from containers to the host. Why make the journal a special case in the simple host-container isolation paradigm?)

Unfortunately, the version of oci-systemd-hook shipped by Red Hat does not support '--env oci-systemd-hook=disabled' ... so the only workaround is to periodically clean up /var/log/journal manually?
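
A manual cleanup could start from a dry-run listing like the one below. This is a sketch under the assumption that every subdirectory of the journal directory other than the host's own machine-id directory was created for a container. The function name `list_foreign_journals` is hypothetical, and nothing is deleted.

```shell
#!/bin/sh
# list_foreign_journals JOURNAL_DIR HOST_MACHINE_ID
# Prints every subdirectory of JOURNAL_DIR whose name differs from
# HOST_MACHINE_ID. Dry run only: it lists candidates and removes nothing.
list_foreign_journals() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        if [ "$name" != "$2" ]; then
            printf '%s\n' "${d%/}"
        fi
    done
}
```

On the host this would be called as `list_foreign_journals /var/log/journal "$(cat /etc/machine-id)"`. Directories of containers that are still running appear in the output too, so the list needs review before anything is removed.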

Install instructions

I see mentions, in a Red Hat blog post about this project, that some small changes are required in Docker to use this hook. There is no mention of how to integrate them into a Docker host. Are there special steps? Running this example on Ubuntu results in an error:

FROM fedora:latest
ENV container docker
RUN yum -y update && yum -y install httpd && yum clean all
RUN systemctl mask dnf-makecache.timer && systemctl enable httpd
CMD [ "/sbin/init" ]
-> % docker run -ti --stop-signal=RTMIN+3 httpd
Failed to mount tmpfs at /run: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.
Freezing execution.

Recent changes to /var/log mounting break containers expecting build to populate /var/log

Since #42 has been merged I figured it best to open a separate issue to track this.

This affects both RHEL and Fedora with that commit.

Steps to replicate:

cat > Dockerfile.mariadb << EOF
FROM centos:latest
STOPSIGNAL SIGRTMIN+3
 
RUN yum -y install mariadb-server && yum clean all
 
RUN systemctl enable mariadb
 
VOLUME /var/lib/mysql
 
CMD ["/sbin/init"]
EOF
 
docker volume create --name localtest-mdb
docker build -f Dockerfile.mariadb -t localtest-mdb .
docker run -dt -v localtest-mdb:/var/lib/mysql --name localtest-mdb localtest-mdb
docker exec -t localtest-mdb /bin/bash -c 'for i in {1..30}; do if systemctl is-active mariadb ; then break  ; else sleep 1 ; fi done;'
docker exec -t localtest-mdb mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'testuser'@'%' IDENTIFIED BY 'testpassword' WITH GRANT OPTION;"
docker stop localtest-mdb
docker rm localtest-mdb

After the container is started with docker run, the mariadb service is expected to start successfully, but it fails with the error that /var/log/mariadb does not exist.

This directory was created via the mariadb-server rpm in the docker build but then does not get transferred into the /var/log tmpfs created by the hook.

This breaks any service that expects /var/log/ to already exist when run in a container using systemd and the hook.
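
One possible workaround, sketched here rather than taken from the project, is to have systemd recreate the missing directories at boot through tmpfiles.d. The function below prints one `d` line per subdirectory of a log directory, preserving mode and ownership. It assumes GNU `stat`, and it assumes that systemd-tmpfiles runs inside the container before the affected service starts. The name `tmpfiles_for_logdirs` is hypothetical.

```shell
#!/bin/sh
# tmpfiles_for_logdirs LOG_DIR
# For each subdirectory of LOG_DIR, prints a tmpfiles.d "d" line that
# recreates it under /var/log with the same mode, owner, and group.
tmpfiles_for_logdirs() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        # GNU stat: %a = octal mode, %U = owner name, %G = group name
        printf 'd /var/log/%s %s -\n' "$name" "$(stat -c '0%a %U %G' "$d")"
    done
}
```

During the image build, the output of `tmpfiles_for_logdirs /var/log` could be written to a file under /etc/tmpfiles.d/ so that the directories reappear on the tmpfs at container start.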

Error: OCI runtime error: error executing hook `/usr/libexec/oci/hooks.d/oci-systemd-hook` (exit code: 1)

cat > Dockerfile << 'EOF'

FROM fedora

RUN dnf -y install httpd; dnf clean all; systemctl enable httpd

EXPOSE 80

CMD [ "/sbin/init" ]
EOF
podman build -t systemd .
podman run -ti -p 8080:80 systemd

Yields

Error: OCI runtime error: error executing hook `/usr/libexec/oci/hooks.d/oci-systemd-hook` (exit code: 1)

Versions:
oci-systemd-hook-0.2.0-2.git05e6923.fc31.x86_64
podman-3.2.3-2.fc34.x86_64
