Comments (8)

JunFugithub commented on May 23, 2024

Hi, sorry to bother you again. I recently redeployed FfDL on Google Cloud, but one of the pods, ibmcloud-object-storage-deployer, which runs in the kube-system namespace, fails with the following output.

+ DRIVER_LOCATION=/host/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
+ KUBELET_SVC_CONFIG=/host/lib/systemd/system/kubelet.service
+ apt-get -y update
Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [83.2 kB]
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security/universe Sources [32.0 kB]
Get:6 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [133 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-updates/universe Sources [167 kB]
Get:8 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [1367 B]
Get:9 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [281 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [900 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [6931 B]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [599 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [10.7 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [3655 B]
Fetched 2381 kB in 1s (1754 kB/s)
Reading package lists...
+ apt-get -y install s3fs
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  ca-certificates file fuse libasn1-8-heimdal libcurl3-gnutls libfuse2
  libgssapi3-heimdal libhcrypto4-heimdal libheimbase1-heimdal
  libheimntlm0-heimdal libhx509-5-heimdal libicu60 libkrb5-26-heimdal
  libldap-2.4-2 libldap-common libmagic-mgc libmagic1 libnghttp2-14 libpsl5
  libroken18-heimdal librtmp1 libsasl2-2 libsasl2-modules libsasl2-modules-db
  libsqlite3-0 libssl1.1 libwind0-heimdal libxml2 mime-support openssl
  publicsuffix xz-utils
Suggested packages:
  libsasl2-modules-gssapi-mit | libsasl2-modules-gssapi-heimdal
  libsasl2-modules-ldap libsasl2-modules-otp libsasl2-modules-sql
The following NEW packages will be installed:
  ca-certificates file fuse libasn1-8-heimdal libcurl3-gnutls libfuse2
  libgssapi3-heimdal libhcrypto4-heimdal libheimbase1-heimdal
  libheimntlm0-heimdal libhx509-5-heimdal libicu60 libkrb5-26-heimdal
  libldap-2.4-2 libldap-common libmagic-mgc libmagic1 libnghttp2-14 libpsl5
  libroken18-heimdal librtmp1 libsasl2-2 libsasl2-modules libsasl2-modules-db
  libsqlite3-0 libssl1.1 libwind0-heimdal libxml2 mime-support openssl
  publicsuffix s3fs xz-utils
0 upgraded, 33 newly installed, 0 to remove and 33 not upgraded.
Need to get 13.3 MB of archives.
After this operation, 52.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libssl1.1 amd64 1.1.0g-2ubuntu4.3 [1130 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 openssl amd64 1.1.0g-2ubuntu4.3 [532 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 ca-certificates all 20180409 [151 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.1 [184 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic1 amd64 1:5.32-2ubuntu0.1 [68.4 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 file amd64 1:5.32-2ubuntu0.1 [22.1 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/main amd64 libicu60 amd64 60.2-3ubuntu3 [8054 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsqlite3-0 amd64 3.22.0-1 [496 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libxml2 amd64 2.9.4+dfsg1-6.1ubuntu1.2 [663 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic/main amd64 mime-support all 3.60ubuntu1 [30.1 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic/main amd64 xz-utils amd64 5.2.2-1.3 [83.8 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic/main amd64 libfuse2 amd64 2.9.7-1ubuntu1 [80.9 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic/main amd64 fuse amd64 2.9.7-1ubuntu1 [24.5 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic/main amd64 libpsl5 amd64 0.19.1-5build1 [41.8 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic/main amd64 publicsuffix all 20180223.1310-1 [97.6 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic/main amd64 libroken18-heimdal amd64 7.5.0+dfsg-1 [41.3 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic/main amd64 libasn1-8-heimdal amd64 7.5.0+dfsg-1 [175 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic/main amd64 libheimbase1-heimdal amd64 7.5.0+dfsg-1 [29.3 kB]
Get:19 http://archive.ubuntu.com/ubuntu bionic/main amd64 libhcrypto4-heimdal amd64 7.5.0+dfsg-1 [85.9 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic/main amd64 libwind0-heimdal amd64 7.5.0+dfsg-1 [47.8 kB]
Get:21 http://archive.ubuntu.com/ubuntu bionic/main amd64 libhx509-5-heimdal amd64 7.5.0+dfsg-1 [107 kB]
Get:22 http://archive.ubuntu.com/ubuntu bionic/main amd64 libkrb5-26-heimdal amd64 7.5.0+dfsg-1 [206 kB]
Get:23 http://archive.ubuntu.com/ubuntu bionic/main amd64 libheimntlm0-heimdal amd64 7.5.0+dfsg-1 [14.8 kB]
Get:24 http://archive.ubuntu.com/ubuntu bionic/main amd64 libgssapi3-heimdal amd64 7.5.0+dfsg-1 [96.5 kB]
Get:25 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsasl2-modules-db amd64 2.1.27~101-g0780600+dfsg-3ubuntu2 [14.8 kB]
Get:26 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsasl2-2 amd64 2.1.27~101-g0780600+dfsg-3ubuntu2 [49.2 kB]
Get:27 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libldap-common all 2.4.45+dfsg-1ubuntu1.1 [16.6 kB]
Get:28 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libldap-2.4-2 amd64 2.4.45+dfsg-1ubuntu1.1 [155 kB]
Get:29 http://archive.ubuntu.com/ubuntu bionic/main amd64 libnghttp2-14 amd64 1.30.0-1ubuntu1 [77.8 kB]
Get:30 http://archive.ubuntu.com/ubuntu bionic/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-1 [54.2 kB]
Get:31 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libcurl3-gnutls amd64 7.58.0-2ubuntu3.5 [212 kB]
Get:32 http://archive.ubuntu.com/ubuntu bionic/main amd64 libsasl2-modules amd64 2.1.27~101-g0780600+dfsg-3ubuntu2 [48.7 kB]
Get:33 http://archive.ubuntu.com/ubuntu bionic/universe amd64 s3fs amd64 1.82-1 [200 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 13.3 MB in 2s (8077 kB/s)
Selecting previously unselected package libssl1.1:amd64.
(Reading database ... 4458 files and directories currently installed.)
Preparing to unpack .../00-libssl1.1_1.1.0g-2ubuntu4.3_amd64.deb ...
Unpacking libssl1.1:amd64 (1.1.0g-2ubuntu4.3) ...
Selecting previously unselected package openssl.
Preparing to unpack .../01-openssl_1.1.0g-2ubuntu4.3_amd64.deb ...
Unpacking openssl (1.1.0g-2ubuntu4.3) ...
Selecting previously unselected package ca-certificates.
Preparing to unpack .../02-ca-certificates_20180409_all.deb ...
Unpacking ca-certificates (20180409) ...
Selecting previously unselected package libmagic-mgc.
Preparing to unpack .../03-libmagic-mgc_1%3a5.32-2ubuntu0.1_amd64.deb ...
Unpacking libmagic-mgc (1:5.32-2ubuntu0.1) ...
Selecting previously unselected package libmagic1:amd64.
Preparing to unpack .../04-libmagic1_1%3a5.32-2ubuntu0.1_amd64.deb ...
Unpacking libmagic1:amd64 (1:5.32-2ubuntu0.1) ...
Selecting previously unselected package file.
Preparing to unpack .../05-file_1%3a5.32-2ubuntu0.1_amd64.deb ...
Unpacking file (1:5.32-2ubuntu0.1) ...
Selecting previously unselected package libicu60:amd64.
Preparing to unpack .../06-libicu60_60.2-3ubuntu3_amd64.deb ...
Unpacking libicu60:amd64 (60.2-3ubuntu3) ...
Selecting previously unselected package libsqlite3-0:amd64.
Preparing to unpack .../07-libsqlite3-0_3.22.0-1_amd64.deb ...
Unpacking libsqlite3-0:amd64 (3.22.0-1) ...
Selecting previously unselected package libxml2:amd64.
Preparing to unpack .../08-libxml2_2.9.4+dfsg1-6.1ubuntu1.2_amd64.deb ...
Unpacking libxml2:amd64 (2.9.4+dfsg1-6.1ubuntu1.2) ...
Selecting previously unselected package mime-support.
Preparing to unpack .../09-mime-support_3.60ubuntu1_all.deb ...
Unpacking mime-support (3.60ubuntu1) ...
Selecting previously unselected package xz-utils.
Preparing to unpack .../10-xz-utils_5.2.2-1.3_amd64.deb ...
Unpacking xz-utils (5.2.2-1.3) ...
Selecting previously unselected package libfuse2:amd64.
Preparing to unpack .../11-libfuse2_2.9.7-1ubuntu1_amd64.deb ...
Unpacking libfuse2:amd64 (2.9.7-1ubuntu1) ...
Selecting previously unselected package fuse.
Preparing to unpack .../12-fuse_2.9.7-1ubuntu1_amd64.deb ...
Unpacking fuse (2.9.7-1ubuntu1) ...
Selecting previously unselected package libpsl5:amd64.
Preparing to unpack .../13-libpsl5_0.19.1-5build1_amd64.deb ...
Unpacking libpsl5:amd64 (0.19.1-5build1) ...
Selecting previously unselected package publicsuffix.
Preparing to unpack .../14-publicsuffix_20180223.1310-1_all.deb ...
Unpacking publicsuffix (20180223.1310-1) ...
Selecting previously unselected package libroken18-heimdal:amd64.
Preparing to unpack .../15-libroken18-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libroken18-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libasn1-8-heimdal:amd64.
Preparing to unpack .../16-libasn1-8-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libasn1-8-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libheimbase1-heimdal:amd64.
Preparing to unpack .../17-libheimbase1-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libheimbase1-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libhcrypto4-heimdal:amd64.
Preparing to unpack .../18-libhcrypto4-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libhcrypto4-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libwind0-heimdal:amd64.
Preparing to unpack .../19-libwind0-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libwind0-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libhx509-5-heimdal:amd64.
Preparing to unpack .../20-libhx509-5-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libhx509-5-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libkrb5-26-heimdal:amd64.
Preparing to unpack .../21-libkrb5-26-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libkrb5-26-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libheimntlm0-heimdal:amd64.
Preparing to unpack .../22-libheimntlm0-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libheimntlm0-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libgssapi3-heimdal:amd64.
Preparing to unpack .../23-libgssapi3-heimdal_7.5.0+dfsg-1_amd64.deb ...
Unpacking libgssapi3-heimdal:amd64 (7.5.0+dfsg-1) ...
Selecting previously unselected package libsasl2-modules-db:amd64.
Preparing to unpack .../24-libsasl2-modules-db_2.1.27~101-g0780600+dfsg-3ubuntu2_amd64.deb ...
Unpacking libsasl2-modules-db:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Selecting previously unselected package libsasl2-2:amd64.
Preparing to unpack .../25-libsasl2-2_2.1.27~101-g0780600+dfsg-3ubuntu2_amd64.deb ...
Unpacking libsasl2-2:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Selecting previously unselected package libldap-common.
Preparing to unpack .../26-libldap-common_2.4.45+dfsg-1ubuntu1.1_all.deb ...
Unpacking libldap-common (2.4.45+dfsg-1ubuntu1.1) ...
Selecting previously unselected package libldap-2.4-2:amd64.
Preparing to unpack .../27-libldap-2.4-2_2.4.45+dfsg-1ubuntu1.1_amd64.deb ...
Unpacking libldap-2.4-2:amd64 (2.4.45+dfsg-1ubuntu1.1) ...
Selecting previously unselected package libnghttp2-14:amd64.
Preparing to unpack .../28-libnghttp2-14_1.30.0-1ubuntu1_amd64.deb ...
Unpacking libnghttp2-14:amd64 (1.30.0-1ubuntu1) ...
Selecting previously unselected package librtmp1:amd64.
Preparing to unpack .../29-librtmp1_2.4+20151223.gitfa8646d.1-1_amd64.deb ...
Unpacking librtmp1:amd64 (2.4+20151223.gitfa8646d.1-1) ...
Selecting previously unselected package libcurl3-gnutls:amd64.
Preparing to unpack .../30-libcurl3-gnutls_7.58.0-2ubuntu3.5_amd64.deb ...
Unpacking libcurl3-gnutls:amd64 (7.58.0-2ubuntu3.5) ...
Selecting previously unselected package libsasl2-modules:amd64.
Preparing to unpack .../31-libsasl2-modules_2.1.27~101-g0780600+dfsg-3ubuntu2_amd64.deb ...
Unpacking libsasl2-modules:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Selecting previously unselected package s3fs.
Preparing to unpack .../32-s3fs_1.82-1_amd64.deb ...
Unpacking s3fs (1.82-1) ...
Setting up libicu60:amd64 (60.2-3ubuntu3) ...
Setting up libnghttp2-14:amd64 (1.30.0-1ubuntu1) ...
Setting up mime-support (3.60ubuntu1) ...
Setting up libldap-common (2.4.45+dfsg-1ubuntu1.1) ...
Setting up libpsl5:amd64 (0.19.1-5build1) ...
Setting up libfuse2:amd64 (2.9.7-1ubuntu1) ...
Setting up libsasl2-modules-db:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Setting up libsasl2-2:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Setting up libroken18-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-1) ...
Setting up libxml2:amd64 (2.9.4+dfsg1-6.1ubuntu1.2) ...
Setting up libmagic-mgc (1:5.32-2ubuntu0.1) ...
Setting up libmagic1:amd64 (1:5.32-2ubuntu0.1) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Setting up publicsuffix (20180223.1310-1) ...
Setting up libssl1.1:amd64 (1.1.0g-2ubuntu4.3) ...
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install the Term::ReadLine module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)
debconf: falling back to frontend: Teletype
Setting up xz-utils (5.2.2-1.3) ...
update-alternatives: using /usr/bin/xz to provide /usr/bin/lzma (lzma) in auto mode
update-alternatives: warning: skip creation of /usr/share/man/man1/lzma.1.gz because associated file /usr/share/man/man1/xz.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/unlzma.1.gz because associated file /usr/share/man/man1/unxz.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzcat.1.gz because associated file /usr/share/man/man1/xzcat.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzmore.1.gz because associated file /usr/share/man/man1/xzmore.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzless.1.gz because associated file /usr/share/man/man1/xzless.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzdiff.1.gz because associated file /usr/share/man/man1/xzdiff.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzcmp.1.gz because associated file /usr/share/man/man1/xzcmp.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzgrep.1.gz because associated file /usr/share/man/man1/xzgrep.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzegrep.1.gz because associated file /usr/share/man/man1/xzegrep.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzfgrep.1.gz because associated file /usr/share/man/man1/xzfgrep.1.gz (of link group lzma) doesn't exist
Setting up libheimbase1-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up openssl (1.1.0g-2ubuntu4.3) ...
Setting up libsqlite3-0:amd64 (3.22.0-1) ...
Setting up libsasl2-modules:amd64 (2.1.27~101-g0780600+dfsg-3ubuntu2) ...
Setting up ca-certificates (20180409) ...
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install the Term::ReadLine module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)
debconf: falling back to frontend: Teletype
Updating certificates in /etc/ssl/certs...
133 added, 0 removed; done.
Setting up fuse (2.9.7-1ubuntu1) ...
Setting up libwind0-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libasn1-8-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libhcrypto4-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up file (1:5.32-2ubuntu0.1) ...
Setting up libhx509-5-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libkrb5-26-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libheimntlm0-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libgssapi3-heimdal:amd64 (7.5.0+dfsg-1) ...
Setting up libldap-2.4-2:amd64 (2.4.45+dfsg-1ubuntu1.1) ...
Setting up libcurl3-gnutls:amd64 (7.58.0-2ubuntu3.5) ...
Setting up s3fs (1.82-1) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Processing triggers for ca-certificates (20180409) ...
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
+ cp /root/bin/s3fs /host/usr/local/bin/
cp: cannot create regular file '/host/usr/local/bin/': Not a directory

I looked at ffdl/ibmcloud-object-storage-deployer:v0.1 on Docker Hub, but there is no Dockerfile there. Thanks in advance for any ideas.

from ffdl.

animeshsingh commented on May 23, 2024

@sboagibm @fplk Please look into this

fplk commented on May 23, 2024

GKE should use Container-Optimized OS underneath (see https://cloud.google.com/container-optimized-os/), and it is possible that the open source driver will not work on it without modification. If you want to deploy to GKE, you would first have to make sure https://github.com/IBM/ibmcloud-object-storage-plugin works there. Since I don't have access to GKE, I cannot test or fix this for you.
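
The final `cp` error in the log above is consistent with GNU `cp` being given a destination ending in `/` that does not exist as a directory from the container's point of view. A minimal sketch of a guard that would report this more clearly is shown below. The function name `install_to_host_dir` is hypothetical and not part of the deployer image; only the two paths in the example call are taken from the log.

```shell
# Hypothetical helper (not part of the deployer image).
# Copies a file into a host-mounted directory only if that directory
# exists and is writable, and prints a specific reason otherwise.
install_to_host_dir() {
    src="$1"
    dest_dir="$2"
    if [ ! -f "$src" ]; then
        echo "source file not found: $src" >&2
        return 1
    fi
    if [ ! -d "$dest_dir" ]; then
        echo "destination is not a directory: $dest_dir" >&2
        return 2
    fi
    if [ ! -w "$dest_dir" ]; then
        echo "destination is not writable: $dest_dir" >&2
        return 3
    fi
    cp "$src" "$dest_dir/"
}

# Example call with the paths from the log (run inside the deployer container):
# install_to_host_dir /root/bin/s3fs /host/usr/local/bin
```

A return code of 2 would point to a missing host directory, and 3 to a read-only mount. Either case would need a different install location on that node image rather than a retry.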

Regarding the general FfDL setup, it should deploy cleanly against IBM Cloud. Unfortunately, I have two hard deadlines at the end of the week, so I cannot look into deployment on Minikube and DIND right now. DIND 1.10 worked a while ago; I briefly tried to deploy against 1.12 not too long ago and also ran into problems.

JunFugithub commented on May 23, 2024

Thanks for the advice on locating the issue. I'm still working on it.

sboagibm commented on May 23, 2024

@JunFugithub For minikube, could you look for the statefulset that is created, run kubectl get sts/xxxxx -o yaml, and send the results? For DIND, please do the same, and also run kubectl describe on the failed pod.
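
The two requests above can be sketched as one small shell function. This is only an illustration: the function name `collect_learner_diagnostics` and the `KUBECTL` override variable are assumptions made for this example, and the pod name relies on the standard StatefulSet naming scheme `<statefulset-name>-<ordinal>`.

```shell
# Hypothetical helper. KUBECTL can be overridden (e.g. KUBECTL=echo)
# to print the commands instead of running them against a cluster.
collect_learner_diagnostics() {
    sts_name="$1"
    namespace="${2:-default}"
    kubectl_bin="${KUBECTL:-kubectl}"
    if [ -z "$sts_name" ]; then
        echo "usage: collect_learner_diagnostics <statefulset-name> [namespace]" >&2
        return 1
    fi
    # Full StatefulSet spec and status as YAML.
    "$kubectl_bin" get statefulset "$sts_name" -n "$namespace" -o yaml
    # Events for the first replica; StatefulSet pods are named <name>-<ordinal>.
    "$kubectl_bin" describe pod "${sts_name}-0" -n "$namespace"
}

# Example (substitute the real learner statefulset name for xxxxx):
# collect_learner_diagnostics xxxxx > learner-diagnostics.txt
```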

JunFugithub commented on May 23, 2024
  • minikube
    A few things need to be mentioned about the minikube setup.
  1. I created a PV and PVC (learner-1) backed by hostPath instead of the previous NFS, and I manually edited the PVC section of the lhelper deployment and the learner statefulset, because I suspected NFS was part of the problem.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: 2018-12-20T14:53:36Z
  generation: 2
  labels:
    service: dlaas-learner
    training_id: training-pTOewHymR
    user_id: test-user
  name: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1
  namespace: default
  resourceVersion: "22353"
  selfLink: /apis/apps/v1/namespaces/default/statefulsets/learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1
  uid: 0785179e-0467-11e9-a165-c2aacdd61c5f
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      service: dlaas-learner
      training_id: training-pTOewHymR
      user_id: test-user
  serviceName: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/nvidiaGPU: '{ "AllocationPriority": "Dense"
          }'
        scheduler.alpha.kubernetes.io/tolerations: '[ { "key": "dedicated", "operator":
          "Equal", "value": "gpu-task" } ]'
      creationTimestamp: null
      labels:
        service: dlaas-learner
        training_id: training-pTOewHymR
        user_id: test-user
    spec:
      automountServiceAccountToken: false
      containers:
      - command:
        - bash
        - -c
        - "export PATH=/usr/local/bin/:$PATH; cp /entrypoint-files/*.sh /usr/local/bin/;
          chmod +x /usr/local/bin/*.sh;\n\t\t\tif [ ! -f /job/load-model.exit ]; then\n\t\t\t\twhile
          [ ! -f /job/load-model.start ]; do sleep 2; done ;\n\t\t\t\tdate \"+%s%N\"
          | cut -b1-13 > /job/load-model.start_time ;\n\t\t\t\t\n\t\t\techo \"Starting
          Training $TRAINING_ID\"\n\t\t\tmkdir -p \"$MODEL_DIR\" ;\n\t\t\tpython -m
          zipfile -e $RESULT_DIR/_submitted_code/model.zip $MODEL_DIR  ;\n\t\t\t\techo
          $? > /job/load-model.exit ;\n\t\t\tfi\n\t\t\techo \"Done load-model\" ;\n\t\t\tif
          [ ! -f /job/learner.exit ]; then\n\t\t\t\twhile [ ! -f /job/learner.start
          ]; do sleep 2; done ;\n\t\t\t\tdate \"+%s%N\" | cut -b1-13 > /job/learner.start_time
          ;\n\t\t\t\t\n\t\t\tfor i in ${!ALERTMANAGER*} ${!DLAAS*} ${!ETCD*} ${!GRAFANA*}
          ${!HOSTNAME*} ${!KUBERNETES*} ${!MONGO*} ${!PUSHGATEWAY*}; do unset $i;
          done;\n\t\t\texport LEARNER_ID=$((${DOWNWARD_API_POD_NAME##*-} + 1)) ;\n\t\t\tmkdir
          -p $RESULT_DIR/learner-$LEARNER_ID ;\n\t\t\tmkdir -p $CHECKPOINT_DIR ;bash
          -c 'train.sh >> $JOB_STATE_DIR/latest-log 2>&1 ; exit ${PIPESTATUS[0]}'
          ;\n\t\t\t\techo $? > /job/learner.exit ;\n\t\t\tfi\n\t\t\techo \"Done learner\"
          ;\n\t\t\tif [ ! -f /job/store-logs.exit ]; then\n\t\t\t\twhile [ ! -f /job/store-logs.start
          ]; do sleep 2; done ;\n\t\t\t\tdate \"+%s%N\" | cut -b1-13 > /job/store-logs.start_time
          ;\n\t\t\t\t\n\t\t\techo Calling copy logs.\n\t\t\tmv -nf $LOG_DIR/* $RESULT_DIR/learner-$LEARNER_ID
          ;\n\t\t\tERROR_CODE=$? ;\n\t\t\techo $ERROR_CODE > $RESULT_DIR/learner-$LEARNER_ID/.log-copy-complete
          ;\n\t\t\tbash -c 'exit $ERROR_CODE' ;\n\t\t\t\techo $? > /job/store-logs.exit
          ;\n\t\t\tfi\n\t\t\techo \"Done store-logs\" ;\n\t\twhile true; do sleep
          2; done ;"
        env:
        - name: DATA_DIR
          value: /mnt/data/tf_training_data
        - name: LOG_DIR
          value: /job/logs
        - name: RESULT_DIR
          value: /mnt/results/tf_trained_model/training-pTOewHymR
        - name: MODEL_DIR
          value: /job/model-code
        - name: TRAINING_COMMAND
          value: 'python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz   --trainLabelsFile
            ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz   --testLabelsFile
            ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001   --trainingIters
            2000 '
        - name: TRAINING_ID
          value: training-pTOewHymR
        - name: GPU_COUNT
          value: "0.000000"
        - name: DOWNWARD_API_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: DOWNWARD_API_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: LEARNER_NAME_PREFIX
          value: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        - name: TRAINING_ID
          value: training-pTOewHymR
        - name: NUM_LEARNERS
          value: "1"
        - name: JOB_STATE_DIR
          value: /job
        - name: CHECKPOINT_DIR
          value: /mnt/results/tf_trained_model/_wml_checkpoints
        - name: RESULT_BUCKET_DIR
          value: /mnt/results/tf_trained_model
        image: tensorflow/tensorflow:1.5.0-py3
        imagePullPolicy: IfNotPresent
        name: learner
        ports:
        - containerPort: 22
          protocol: TCP
        - containerPort: 2222
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 1048576k
            nvidia.com/gpu: "0"
          requests:
            cpu: 500m
            memory: 1048576k
            nvidia.com/gpu: "0"
        securityContext:
          capabilities:
            drop:
            - CHOWN
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETPCAP
            - NET_RAW
            - MKNOD
            - SETFCAP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/data/tf_training_data
          name: cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        - mountPath: /mnt/results/tf_trained_model
          name: cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        - mountPath: /job
          name: jobdata
          subPath: training-pTOewHymR
        - mountPath: /entrypoint-files
          name: learner-entrypoint-files
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: regcred
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: gpu-task
      volumes:
      - flexVolume:
          driver: ibm/ibmc-s3fs
          options:
            bucket: tf_training_data
            cache-size-gb: "0"
            chunk-size-mb: "52"
            curl-debug: "false"
            debug-level: warn
            endpoint: http://192.168.64.25:31971
            ensure-disk-free: "0"
            kernel-cache: "true"
            multireq-max: "20"
            parallel-count: "5"
            region: us-standard
            s3fs-fuse-retry-count: "30"
            stat-cache-size: "100000"
            tls-cipher-suite: DEFAULT
          secretRef:
            name: cossecretdata-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        name: cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1
      - flexVolume:
          driver: ibm/ibmc-s3fs
          options:
            bucket: tf_trained_model
            cache-size-gb: "0"
            chunk-size-mb: "52"
            curl-debug: "false"
            debug-level: warn
            endpoint: http://192.168.64.25:31971
            ensure-disk-free: "2048"
            kernel-cache: "false"
            multireq-max: "20"
            parallel-count: "2"
            region: us-standard
            s3fs-fuse-retry-count: "30"
            stat-cache-size: "100000"
            tls-cipher-suite: DEFAULT
          secretRef:
            name: cossecretresults-3a77bbc9-7418-44d6-7797-e697a1d43fd1
        name: cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1
      - configMap:
          defaultMode: 420
          name: learner-entrypoint-files
        name: learner-entrypoint-files
      - name: jobdata
        persistentVolumeClaim:
          claimName: learner-1
  updateStrategy:
    type: OnDelete
status:
  collisionCount: 0
  currentRevision: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-7df856b884
  observedGeneration: 2
  replicas: 1
  updateRevision: learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-5dc4cfdf78
  updatedReplicas: 1
  2. The same ibmcloud-object-storage-deployer pod failure also occurs on minikube, right from the start. I initially ignored it because all the other pods, which are deployed in a specific namespace (generally default), worked. From @fplk's advice, I realized that neither s3fs nor the IBM volume plugin was working inside minikube. I tried again yesterday and followed the steps in this doc, copying the ibmc-s3fs volume plugin to /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs inside minikube. However, I don't know how to install s3fs manually inside minikube, whose OS is probably Buildroot 2018.05. (As I understand it, the ibmcloud-object-storage-deployer pod is supposed to handle this step, but I don't know why it failed.) The pod warnings follow. It failed again, though with a subtle difference from the previous attempt.
$ kubectl describe po learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0

Events:
  Type     Reason                 Age               From               Message
  ----     ------                 ----              ----               -------
  Normal   Scheduled              9m                default-scheduler  Successfully assigned learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0 to minikube
  Normal   SuccessfulMountVolume  9m                kubelet, minikube  MountVolume.SetUp succeeded for volume "learner-entrypoint-files"
  Normal   SuccessfulMountVolume  9m                kubelet, minikube  MountVolume.SetUp succeeded for volume "hostpathtest"
  Warning  FailedMount            5m (x2 over 7m)   kubelet, minikube  Unable to mount volumes for pod "learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0_default(06ed9d78-0529-11e9-a165-c2aacdd61c5f)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0". list of unmounted volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1]. list of unattached volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 learner-entrypoint-files jobdata]
  Warning  FailedMount            3m (x11 over 9m)  kubelet, minikube  MountVolume.SetUp failed for volume "cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed:
  Warning  FailedMount            3m (x11 over 9m)  kubelet, minikube  MountVolume.SetUp failed for volume "cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed:

# truncated, same results over and over again
$ minikube logs
Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.648527   17241 driver-call.go:258] mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed:
Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.649222   17241 nestedpendingoperations.go:267] Operation for "\"flexvolume-ibm/ibmc-s3fs/06ed9d78-0529-11e9-a165-c2aacdd61c5f-cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\" (\"06ed9d78-0529-11e9-a165-c2aacdd61c5f\")" failed. No retries permitted until 2018-12-21 14:08:31.649187316 +0000 UTC m=+474.175452305 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\" (UniqueName: \"flexvolume-ibm/ibmc-s3fs/06ed9d78-0529-11e9-a165-c2aacdd61c5f-cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\") pod \"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0\" (UID: \"06ed9d78-0529-11e9-a165-c2aacdd61c5f\") : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: "
Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.953671   17241 driver-call.go:258] mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed:
Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.954160   17241 nestedpendingoperations.go:267] Operation for "\"flexvolume-ibm/ibmc-s3fs/06ed9d78-0529-11e9-a165-c2aacdd61c5f-cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\" (\"06ed9d78-0529-11e9-a165-c2aacdd61c5f\")" failed. No retries permitted until 2018-12-21 14:08:31.954118575 +0000 UTC m=+474.480383265 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\" (UniqueName: \"flexvolume-ibm/ibmc-s3fs/06ed9d78-0529-11e9-a165-c2aacdd61c5f-cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1\") pod \"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0\" (UID: \"06ed9d78-0529-11e9-a165-c2aacdd61c5f\") : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: "
Dec 21 14:06:30 minikube kubelet[17241]: W1221 14:06:30.227089   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-trainer-858b8ccf95-fpttp due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:06:34 minikube kubelet[17241]: E1221 14:06:34.923885   17241 kubelet.go:1635] Unable to mount volumes for pod "learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0_default(06ed9d78-0529-11e9-a165-c2aacdd61c5f)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0". list of unmounted volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1]. list of unattached volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 learner-entrypoint-files jobdata]; skipping pod
Dec 21 14:06:34 minikube kubelet[17241]: E1221 14:06:34.924016   17241 pod_workers.go:186] Error syncing pod 06ed9d78-0529-11e9-a165-c2aacdd61c5f ("learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0_default(06ed9d78-0529-11e9-a165-c2aacdd61c5f)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"learner-3a77bbc9-7418-44d6-7797-e697a1d43fd1-0". list of unmounted volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1]. list of unattached volumes=[cosinputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 cosoutputmount-3a77bbc9-7418-44d6-7797-e697a1d43fd1 learner-entrypoint-files jobdata]
Dec 21 14:06:42 minikube kubelet[17241]: W1221 14:06:42.227372   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-ui-55f5754ffb-d8msw due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:19 minikube kubelet[17241]: W1221 14:07:19.227941   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/jobmonitor-3a77bbc9-7418-44d6-7797-e697a1d43fd1-6c7d4d484942m5x due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:19 minikube kubelet[17241]: W1221 14:07:19.231865   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/lhelper-3a77bbc9-7418-44d6-7797-e697a1d43fd1-f7c7d96c5-4qqhh due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:20 minikube kubelet[17241]: W1221 14:07:20.229235   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-restapi-6fc48bd5b5-wdwbr due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:21 minikube kubelet[17241]: W1221 14:07:21.227647   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-lcm-6d96b5767b-g2nn6 due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:23 minikube kubelet[17241]: W1221 14:07:23.226763   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-trainingdata-c57f5cddd-bsfm4 due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:31 minikube kubelet[17241]: W1221 14:07:31.227217   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-trainer-858b8ccf95-fpttp due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:07:47 minikube kubelet[17241]: W1221 14:07:47.226982   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-ui-55f5754ffb-d8msw due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:08:23 minikube kubelet[17241]: W1221 14:08:23.233435   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/lhelper-3a77bbc9-7418-44d6-7797-e697a1d43fd1-f7c7d96c5-4qqhh due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:08:25 minikube kubelet[17241]: W1221 14:08:25.230748   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-trainingdata-c57f5cddd-bsfm4 due to secrets "regcred" not found.  The image pull may not succeed.
Dec 21 14:08:27 minikube kubelet[17241]: W1221 14:08:27.226867   17241 kubelet_pods.go:878] Unable to retrieve pull secret default/regcred for default/ffdl-restapi-6fc48bd5b5-wdwbr due to secrets "regcred" not found.  The image pull may not succeed.
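
Because the mount errors are interleaved with many regcred warnings, a tally per severity and source file can make the log easier to read. The following is a minimal sketch, assuming the kubelet lines use the standard klog header (severity letter, mmdd, time, pid, file:line). The function name `tally_kubelet_log` is made up for illustration, and the sample lines are copied from the log above with long messages shortened.

```python
import re
from collections import Counter

# klog header: severity letter, mmdd, time, pid, file:line, then "] message".
KLOG = re.compile(
    r"\b([IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ ([\w.-]+):\d+\] (.*)"
)


def tally_kubelet_log(lines):
    """Count kubelet log lines per (severity, source file)."""
    counts = Counter()
    for line in lines:
        match = KLOG.search(line)
        if match:
            severity, source, _message = match.groups()
            counts[(severity, source)] += 1
    return counts


# Sample lines taken from the log above; messages are shortened.
sample = [
    "Dec 21 14:06:29 minikube kubelet[17241]: E1221 14:06:29.648527   17241 "
    "driver-call.go:258] mount command failed, status: Failure",
    "Dec 21 14:06:30 minikube kubelet[17241]: W1221 14:06:30.227089   17241 "
    "kubelet_pods.go:878] Unable to retrieve pull secret default/regcred",
    "Dec 21 14:06:34 minikube kubelet[17241]: E1221 14:06:34.924016   17241 "
    "pod_workers.go:186] Error syncing pod",
]

counts = tally_kubelet_log(sample)
for (severity, source), n in counts.most_common():
    print(severity, source, n)
```

To run it on a full log, pass the output of `minikube logs` line by line instead of `sample`.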

Steps to reproduce:

$ minikube start --insecure-registry 9.0.0.0/8 --insecure-registry 10.0.0.0/8 \
                 --cpus 4 \
                 --memory 4096 --disk-size=40g \
                 --vm-driver=hyperkit --apiserver-ips 127.0.0.1 --apiserver-name localhost --logtostderr
$ make deploy-plugin
$ make quickstart-deploy
$ make test-push-data-s3
$ make test-job-submit
  • dind
    With dind, everything went well up to the training step. When I ran make test-job-submit, it got stuck. No new pods were created afterwards, so I can provide neither the YAML of the learner StatefulSet nor a description of the failing pods. The command printed FAILED,\n Error 200: OK. I suspected a problem with the request, so I pulled the logs from the pod ffdl-restapi-xx and found an rpc error there, but I have no idea what caused it.
$ kubectl logs ffdl-restapi-7f5c57c77d-lp4k2
time="2018-12-21T14:31:06Z" level=debug msg="Log level set to 'debug'"
time="2018-12-21T14:31:06Z" level=debug msg="Milli CPU is: 60"
time="2018-12-21T14:31:06Z" level=info msg="GetTrainingDataMemInMB() returns 300"
time="2018-12-21T14:31:06Z" level=debug msg="Training Data Mem in MB is: 300"
time="2018-12-21T14:31:06Z" level=debug msg="No config file 'config-dev.yml' found. Using environment variables only."
{"level":"info","msg":"DLaaS REST API v1 serving on :8080","time":"2018-12-21T14:31:10Z"}
{"level":"info","method":"POST","msg":"Started handling request","remote":"127.0.0.1:40906","request":"/v1/models?version=2017-02-13","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"Enter into auth handler","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"request: \u0026{Method:POST URL:/v1/models?version=2017-02-13 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[application/json] Authorization:[Basic dGVzdC11c2VyOnRlc3Q=] Content-Type:[multipart/form-data; boundary=79f1ce044563b1c04bbc0fef5a4af5484d5361472883bc5af5b39e48168e] X-Watson-Userinfo:[bluemix-instance-id=test-user] Accept-Encoding:[gzip] User-Agent:[Go-http-client/1.1]] Body:0xc420374e00 GetBody:\u003cnil\u003e ContentLength:-1 TransferEncoding:[chunked] Close:false Host:localhost:32605 Form:map[] PostForm:map[] MultipartForm:\u003cnil\u003e Trailer:map[] RemoteAddr:127.0.0.1:40906 RequestURI:/v1/models?version=2017-02-13 TLS:\u003cnil\u003e Cancel:\u003cnil\u003e Response:\u003cnil\u003e ctx:0xc420374e40}","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"Writing to header in callBefore \"Access-Control-Allow-Origin: *\"","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"wmlTenantID: ","time":"2018-12-21T14:43:46Z"}
{"level":"debug","msg":"X-DLaaS-UserID: test-user","time":"2018-12-21T14:43:46Z"}
{"Accept":["application/json"],"Accept-Encoding":["gzip"],"Authorization":["Basic dGVzdC11c2VyOnRlc3Q="],"Content-Type":["multipart/form-data; boundary=79f1ce044563b1c04bbc0fef5a4af5484d5361472883bc5af5b39e48168e"],"User-Agent":["Go-http-client/1.1"],"X-Dlaas-Userid":["test-user"],"X-Watson-Userinfo":["bluemix-instance-id=test-user"],"level":"debug","msg":"Request headers:","time":"2018-12-21T14:43:46Z"}
{"caller_info":"server/models_impl.go:63 postModel -","level":"debug","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"postModel invoked: map[Accept:[application/json] Authorization:[Basic dGVzdC11c2VyOnRlc3Q=] Content-Type:[multipart/form-data; boundary=79f1ce044563b1c04bbc0fef5a4af5484d5361472883bc5af5b39e48168e] X-Watson-Userinfo:[bluemix-instance-id=test-user] Accept-Encoding:[gzip] X-Dlaas-Userid:[test-user] User-Agent:[Go-http-client/1.1]]","time":"2018-12-21T14:43:46Z","user_id":"test-user"}
{"caller_info":"server/models_impl.go:59 postModel -","level":"debug","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"Loading Manifest","time":"2018-12-21T14:43:46Z","user_id":"test-user"}
{"level":"info","msg":"dialing to target with scheme: \"\"","time":"2018-12-21T14:43:47Z"}
{"level":"info","msg":"ccResolverWrapper: sending new addresses to cc: [{ffdl-trainer.default.svc.cluster.local:80 0  \u003cnil\u003e}]","time":"2018-12-21T14:43:47Z"}
{"level":"info","msg":"ClientConn switching balancer to \"pick_first\"","time":"2018-12-21T14:43:47Z"}
{"level":"info","msg":"pickfirstBalancer: HandleSubConnStateChange: 0xc420281ea0, CONNECTING","time":"2018-12-21T14:43:47Z"}
{"level":"info","msg":"pickfirstBalancer: HandleSubConnStateChange: 0xc420281ea0, READY","time":"2018-12-21T14:43:47Z"}
{"caller_info":"server/manifest.go:237 manifest2TrainingRequest -","level":"debug","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"EMExtractionSpec ImageTag: ","time":"2018-12-21T14:43:47Z","user_id":"test-user"}
{"caller_info":"server/models_impl.go:117 postModel -","error":"rpc error: code = Canceled desc = context canceled","level":"error","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"Trainer service call failed","time":"2018-12-21T14:43:56Z","user_id":"test-user"}
{"caller_info":"server/models_impl.go:857 error500 -","level":"error","model_filename":"manifest_testrun.yml","module":"rest-api","msg":"Returning 500 error: ","time":"2018-12-21T14:43:56Z","user_id":"test-user"}
{"level":"info","measure#rest-api.latency":9943356700,"method":"POST","msg":"Completed handling request","remote":"127.0.0.1:40906","request":"/v1/models?version=2017-02-13","status":500,"text_status":"Internal Server Error","time":"2018-12-21T14:43:56Z","took":9943356700}
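
To pick the failure out of a debug-level log like the one above, one option is to keep only the JSON entries at error level and convert the `took` field to seconds. This is a rough sketch, not part of FfDL. It assumes `took` is in nanoseconds, which matches the request starting at 14:43:46 and completing at 14:43:56. The function name `summarize_restapi_log` is hypothetical, and the sample lines are shortened copies of entries from the log.

```python
import json


def summarize_restapi_log(lines):
    """Return (errors, durations) extracted from ffdl-restapi log lines.

    errors: (time, msg, error) for each JSON entry with level == "error".
    durations: (request, seconds) for each JSON entry with a "took" field,
    assuming "took" is in nanoseconds (an assumption, not documented here).
    """
    errors, durations = [], []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip the key=value style lines at the top of the log
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if not isinstance(entry, dict):
            continue
        if entry.get("level") == "error":
            errors.append((entry.get("time"), entry.get("msg"), entry.get("error")))
        if "took" in entry:
            durations.append((entry.get("request"), entry["took"] / 1e9))
    return errors, durations


# Shortened copies of three lines from the log above.
sample = [
    'time="2018-12-21T14:31:06Z" level=debug msg="Milli CPU is: 60"',
    '{"caller_info":"server/models_impl.go:117 postModel -",'
    '"error":"rpc error: code = Canceled desc = context canceled",'
    '"level":"error","msg":"Trainer service call failed",'
    '"time":"2018-12-21T14:43:56Z"}',
    '{"level":"info","method":"POST","msg":"Completed handling request",'
    '"request":"/v1/models?version=2017-02-13","status":500,"took":9943356700}',
]

errors, durations = summarize_restapi_log(sample)
for item in errors:
    print("error:", item)
for request, seconds in durations:
    print("request:", request, "seconds:", seconds)
```

On these lines the only error entry is the `Trainer service call failed` message with `context canceled`, and the request took just under 10 seconds. That may point to a timeout or cancellation on the caller's side, though the log alone does not confirm it.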

Yesterday @fplk mentioned the dind version. I downloaded dind 1.10.9 but gave up on it because the installation failed.

FYI, there are two more things I'd like to mention. I left the environment variable SHARED_VOLUME_STORAGE_CLASS empty under both the minikube and dind VMs; I hope that is unrelated. The S3 service works well: after running make test-push-data-s3, I checked the S3 buckets and they do contain the training data, for both dind and minikube.

@sboagibm Thanks a lot for any suggestions.


fplk commented on May 23, 2024

I apologize for the delay due to the holidays. I think I can reproduce the error you encountered and have been able to get it working. A couple of things:

a) The scripts in https://github.com/IBM/FfDL/tree/master/bin/dind_scripts should largely work, except that you need to update the DIND version in launch_kubernetes.sh. RawGit is deprecated and 1.9 is old, so I successfully used https://github.com/kubernetes-sigs/kubeadm-dind-cluster/releases/download/v0.1.0/dind-cluster-v1.13.sh; just replace all occurrences in the file accordingly.

b) Here are my steps:

ssh root@<machine>.sl.cloud9.ibm.com
apt install -y git software-properties-common
mkdir -p /home/ffdlr/go/src/github.com/IBM/ && cd $_ && git clone https://github.com/IBM/FfDL.git && cd FfDL
# Replace DIND version as explained in (a)
cd bin/dind_scripts/
chmod +x create_user.sh
. create_user.sh
# Enter new password and get kicked out

ssh ffdlr@<machine>.sl.cloud9.ibm.com
cd /home/ffdlr/go/src/github.com/IBM/FfDL/bin/dind_scripts/
sudo chmod +x experimental_master.sh
. experimental_master.sh

Build your own manifest with:

name: tf_convolutional_network_tutorial
description: Convolutional network model using tensorflow
version: "1.0"
gpus: 0
cpus: 0.5
memory: 1Gb
learners: 1

# Object stores that allow the system to retrieve training data.
data_stores:
  - id: sl-internal-os
    type: mount_cos
    training_data:
      container: REPLACE_INPUT_BUCKET
    training_results:
      container: REPLACE_OUTPUT_BUCKET
    connection:
      auth_url: REPLACE_ENDPOINT
      user_name: REPLACE_KEY_ID
      password: REPLACE_KEY

framework:
  name: tensorflow
  version: "1.5.0-py3"
  command: >
    python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
      --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
      --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
      --trainingIters 2000
  # Change trainingIters to 20000 if you want your model to have over 80% Accuracy rate.

evaluation_metrics:
  type: tensorboard
  in: "$JOB_STATE_DIR/logs/tb"
  # (Eventual) Available event types: 'images', 'distributions', 'histograms', 'images'
  # 'audio', 'scalars', 'tensors', 'graph', 'meta_graph', 'run_metadata'
  #  event_types: [scalars]

From within /home/ffdlr/go/src/github.com/IBM/FfDL/etc/examples/tf-model, run:

DLAAS_URL=http://10.192.0.3:31826 DLAAS_USERNAME=test-user DLAAS_PASSWORD=test /home/ffdlr/go/src/github.com/IBM/FfDL/cli/bin/ffdl-linux train mymanifest.yml .
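
The manifest above still contains the REPLACE_* placeholders, which have to be filled in before it is saved as mymanifest.yml. One possible way to do that is sketched below. The helper name `fill_manifest` and all example values are made up for illustration; they are not real buckets, endpoints, or credentials. Matching whole tokens with a regular expression keeps REPLACE_KEY from clobbering the start of REPLACE_KEY_ID, which a chain of plain string replacements could do.

```python
import re

# A placeholder is REPLACE_ followed by upper-case letters and underscores.
PLACEHOLDER = re.compile(r"REPLACE_[A-Z_]+")


def fill_manifest(template, values):
    """Substitute every REPLACE_* token in a manifest template.

    values maps each full token (for example "REPLACE_KEY_ID") to its
    replacement. A KeyError is raised if the template contains a token
    that values does not define.
    """
    def substitute(match):
        return values[match.group(0)]

    return PLACEHOLDER.sub(substitute, template)


# Excerpt of the data_stores section from the manifest above.
template = """\
    training_data:
      container: REPLACE_INPUT_BUCKET
    training_results:
      container: REPLACE_OUTPUT_BUCKET
    connection:
      auth_url: REPLACE_ENDPOINT
      user_name: REPLACE_KEY_ID
      password: REPLACE_KEY
"""

# Illustrative values only (hypothetical, not real credentials).
values = {
    "REPLACE_INPUT_BUCKET": "my-input-bucket",
    "REPLACE_OUTPUT_BUCKET": "my-output-bucket",
    "REPLACE_ENDPOINT": "http://s3.example.invalid",
    "REPLACE_KEY_ID": "my-key-id",
    "REPLACE_KEY": "my-secret",
}

filled = fill_manifest(template, values)
print(filled)
```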

c) This should work, but is currently not the most user-friendly way of setting things up. I can try to push some changes to improve usability - it looks from the outside like the manifest creation gets garbled up somewhere, but until then this should get you running.

d) Minikube is a suboptimal environment due to unfixed storage bugs on their side. FfDL should be able to run against GKE in principle, but I'm not sure the open source S3 driver will work against that. I think the storage team supports DIND and IBM Cloud and their architecture should work against any cloud provider, but you would have to test the driver manually and PR minor changes if it does not work against your target provider out of the box. Or use a different driver.
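
For the replacement described in (a), a small script can swap every reference to the DIND cluster script in one pass. This is only a sketch: the contents of launch_kubernetes.sh and its old URL are not shown in this thread, so `old_script` below is a hypothetical stand-in and the function name `swap_dind` is made up. Only the new URL is taken from (a).

```python
import re

# Any URL whose file name looks like dind-cluster-v<version>.sh.
DIND_URL = re.compile(r"https?://[^\s'\"]*dind-cluster-v[\d.]+\.sh")
# Bare references to the script file name, for example in chmod or exec lines.
DIND_NAME = re.compile(r"dind-cluster-v[\d.]+\.sh")

NEW_NAME = "dind-cluster-v1.13.sh"
NEW_URL = (
    "https://github.com/kubernetes-sigs/kubeadm-dind-cluster/"
    "releases/download/v0.1.0/" + NEW_NAME
)


def swap_dind(script_text):
    """Point every DIND script URL and file name at the v1.13 release."""
    text = DIND_URL.sub(lambda _m: NEW_URL, script_text)
    # The second pass also touches the new URL, but rewrites it to itself.
    return DIND_NAME.sub(lambda _m: NEW_NAME, text)


# Hypothetical stand-in for launch_kubernetes.sh (real contents not shown here).
old_script = (
    "wget https://example.invalid/some/path/dind-cluster-v1.9.sh\n"
    "chmod +x dind-cluster-v1.9.sh\n"
)

new_script = swap_dind(old_script)
print(new_script)
```

Reviewing the result with a diff before overwriting the real file is advisable, since the actual script may reference the version in ways this pattern does not cover.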


fplk commented on May 23, 2024

OK, with #158 I can deploy FfDL with the typical 4 commands from the README against DIND on macOS and Linux as well as IBM Cloud. Please let me know if it helps.

