k8s processing documentation for SiPeCaM project
License: MIT License

Overview of k8s cluster in AWS:
Change the names of the service and deployment of MAD-Mex in:
https://github.com/CONABIO/kube_sipecam/tree/master/minikube_sipecam/deployments/MAD_Mex
As a reference, use:
It would be really useful to create a GitHub Action that pushes the docker images whenever the Dockerfiles in kube_sipecam/dockerfiles/ change.
Reference:
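A minimal sketch of such a workflow, assuming the images go to Docker Hub (the workflow name, image name, Dockerfile path and secret names are hypothetical):

name: push-docker-images
on:
  push:
    paths:
      - "dockerfiles/**"   # trigger only when Dockerfiles change
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build image
        # hypothetical image name and Dockerfile path
        run: docker build -t conabio/kube-sipecam-audio:latest dockerfiles/audio/0.4.0
      - name: Push image
        run: |
          echo "${{ secrets.DOCKERHUB_PASSWORD }}" | docker login -u "${{ secrets.DOCKERHUB_USER }}" --password-stdin
          docker push conabio/kube-sipecam-audio:latest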
In the past, the annotation volume.beta.kubernetes.io/storage-class was used instead of the storageClassName attribute. This annotation still works; however, it won't be supported in a future Kubernetes release.
ref:
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class
So, when using aws-efs as the provisioner for storage classes, the metadata.annotations section needs to be updated when creating the PVC:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: efs
  namespace: kubeflow
  annotations:
    volume.beta.kubernetes.io/storage-class: "aws-efs"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
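With recent Kubernetes versions, the same claim can be written with the storageClassName attribute instead of the deprecated annotation; a sketch of the equivalent manifest:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: efs
  namespace: kubeflow
spec:
  storageClassName: aws-efs   # replaces the volume.beta.kubernetes.io/storage-class annotation
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi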
Or substitute another provisioner (or follow the suggestions of eterna2 in the Kubeflow Slack chat... I checked whether I had saved those suggestions and didn't find them, but they were based on node selectors), because
https://github.com/kubernetes-retired/external-storage/tree/master/aws/efs
looks like it will be retired...
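One replacement option (an assumption, not tested here) is the AWS EFS CSI driver, whose StorageClass would look like:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com   # provisioner installed by the aws-efs-csi-driver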
Using a docker run command for the nvcr.io/nvidia/tensorflow:19.03-py3 docker image
I got:
The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TensorFlow. NVIDIA recommends the use of the following flags:
nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
Maybe I need to mount a volume like:
volumeMounts:
  - name: efs-pvc
    mountPath: "/shared_volume"
  - name: dshm
    mountPath: /dev/shm
volumes:
  - name: efs-pvc
    persistentVolumeClaim:
      claimName: efs
  - name: dshm
    emptyDir:
      medium: Memory
in ??
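For reference, a minimal sketch of where these sections would sit in a pod spec (the image and the volume names come from the notes above; the pod and container names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: tf-shm-test   # hypothetical name
spec:
  containers:
    - name: tensorflow
      image: nvcr.io/nvidia/tensorflow:19.03-py3
      volumeMounts:
        - name: efs-pvc
          mountPath: "/shared_volume"
        - name: dshm
          mountPath: /dev/shm   # backs /dev/shm with the memory-medium emptyDir
  volumes:
    - name: efs-pvc
      persistentVolumeClaim:
        claimName: efs
    - name: dshm
      emptyDir:
        medium: Memory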
The first test using kale from the command line with the Docker image was successful, but using the jupyter extension wasn't. This is possibly related to running jupyter lab as a user other than root.
Follow:
to build:
Add the scikit-learn package in:
pip install --user scikit-learn
Check new functionality in kale 0.5.1 (already built into the hsi Dockerfile)
It will be useful to give potential developers of processing systems a "Dockerfile standard" so their systems can be integrated into the kube_sipecam framework.
It could be something like:
FROM ubuntu:bionic
ENV TIMEZONE America/Mexico_City
ENV JUPYTERLAB_VERSION 2.1.4
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV DEBIAN_FRONTEND noninteractive
ENV DEB_BUILD_DEPS="sudo nano less git python3-dev python3-pip python3-setuptools curl wget"
ENV DEB_PACKAGES=""
ENV PIP_PACKAGES_KALE="click==7.0 six==1.12.0 setuptools==41.0.0 urllib3==1.24.2 kubeflow-kale==0.5.0"
RUN apt-get update && export DEBIAN_FRONTEND && \
echo $TIMEZONE > /etc/timezone && apt-get install -y tzdata
RUN apt-get update && apt-get install -y $DEB_BUILD_DEPS $DEB_PACKAGES && pip3 install --upgrade pip
RUN curl -sL https://deb.nodesource.com/setup_12.x | sudo -E bash - && apt-get install -y nodejs
RUN pip3 install jupyter "jupyterlab<2.0.0" --upgrade
RUN jupyter notebook --generate-config && sed -i "s/#c.NotebookApp.password = .*/c.NotebookApp.password = u'sha1:115e429a919f:21911277af52f3e7a8b59380804140d9ef3e2380'/" /root/.jupyter/jupyter_notebook_config.py
RUN pip3 install $PIP_PACKAGES_KALE --upgrade
RUN jupyter labextension install kubeflow-kale-launcher
#install package, for example:
RUN pip3 install "git+https://github.com/CONABIO/geonode.git#egg=geonode_conabio&subdirectory=python3_package_for_geonode"
VOLUME ["/shared_volume"]
#create url like:
ENV NB_PREFIX geonodeurl
#use url in:
ENTRYPOINT ["/usr/local/bin/jupyter", "lab", "--ip=0.0.0.0", "--no-browser", "--allow-root", "--LabApp.allow_origin='*'", "--LabApp.base_url=geonodeurl"]
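To build and try the image locally, something like (the tag is hypothetical): docker build -t sipecam/dockerfile-standard:0.1 . followed by docker run -p 8888:8888 sipecam/dockerfile-standard:0.1 should expose jupyter lab on port 8888.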
Add lines like the following in odc_kale/0.1.0_1.7.0_0.5.0/Dockerfile and odc_kale/0.1.0_1.8.3_0.5.0/Dockerfile:
#some configs for antares & datacube
RUN ln -sf /shared_volume/.antares ~/.antares && \
    ln -sf /shared_volume/.datacube.conf ~/.datacube.conf
So there's no need to create these files every time a new madmex-odc-kale container is run.
Need to check dependencies in the Dockerfile.
There are errors related to the versions of kale 0.3.4 and tensorflow 1.14.0.
For example:
ERROR: nbclient 0.1.0 has requirement nbformat>=5.0, but you'll have nbformat 4.4.0 which is incompatible.
ERROR: kfp 0.1.40 has requirement click==7.0, but you'll have click 7.1.1 which is incompatible.
Version 0.4.0 of kale also produces errors
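A possible workaround (an assumption, not verified) is to pin the conflicting packages explicitly before installing kale, e.g. pip3 install click==7.0 "nbformat>=5.0".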
It was seen after doing tests that it is not necessary to distinguish between having the next line:
and not having it in the deployment:
At least using the example for torch:
the kubeflow + kale run was successful.
So I could either delete the file
or use this file to compile the notebook via kale and avoid problems in kubernetes with not finding nodes with GPUs (because setting the parameter nvidia.com/gpu: 1 inside the limits block causes this message).
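For reference, this is the kind of limits block being discussed, as a minimal sketch (the pod name, container name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: torch
      image: pytorch/pytorch   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1   # the parameter that requires nodes with GPUs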
Check
https://github.com/CONABIO/kube_sipecam/blob/master/dockerfiles/audio/0.4.0/Dockerfile
When the image is deployed, the next output is produced (these look like permission errors from running jupyter lab as the non-root user miuser, which lacks write access to the jupyterlab installation directory):
Fail to get yarn configuration. {"type":"error","data":"Could not write file "/usr/local/lib/python3.6/dist-packages/jupyterlab/yarn-error.log": "EACCES: permission denied, open '/usr/local/lib/python3.6/dist-packages/jupyterlab/yarn-error.log'""}
{"type":"error","data":"An unexpected error occurred: "EACCES: permission denied, scandir '/home/miuser/.config/yarn/link'"."}
{"type":"info","data":"Visit https://yarnpkg.com/en/docs/cli/config for documentation about this command."}
TensorRT for high-performance inference, see blog:
Github:
https://github.com/NVIDIA/TensorRT
Not sure when and how I got errors like:
tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-24 13:32:09.746769: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Maybe using tfx? Or some of the dependencies of tfx?... One way to start solving the previous error is to use the docker image in https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt :
docker pull nvcr.io/nvidia/tensorrt:20.03-py3
If Docker image 0.21.4 is used as the base image, the next output is obtained:
2020-04-24 17:41:30.028801: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-04-24 17:41:30.029009: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-04-24 17:41:30.029035: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
usage: run_executor.py [-h] --executor_class_path EXECUTOR_CLASS_PATH
[--temp_directory_path TEMP_DIRECTORY_PATH]
(--inputs INPUTS | --inputs-base64 INPUTS_BASE64)
(--outputs OUTPUTS | --outputs-base64 OUTPUTS_BASE64)
(--exec-properties EXEC_PROPERTIES | --exec-properties-base64 EXEC_PROPERTIES_BASE64)
[--write-outputs-stdout]
run_executor.py: error: the following arguments are required: --executor_class_path
If the tfx base Docker image is used as the base image, the next output is obtained:
Extracting Bazel installation...
[bazel release 3.0.0]
Usage: bazel <command> <options> ...
Available commands:
analyze-profile Analyzes build profile data.
aquery Analyzes the given targets and queries the action graph.
build Builds the specified targets.
canonicalize-flags Canonicalizes a list of bazel options.
clean Removes output files and optionally stops the server.
coverage Generates code coverage report for specified test targets.
...
Getting more help:
bazel help <command>
Prints help and options for <command>.
bazel help startup_options
Options for the JVM hosting bazel.
bazel help target-syntax
Explains the syntax for specifying targets.
bazel help info-keys
Displays a list of keys used by the info command.
Need to choose which tfx base docker image will be used in the audio processing kubeflow pipelines.
See:
Check https://skaffold.dev/docs/ for CI/CD pipelines in kubernetes
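A minimal sketch of what a skaffold.yaml for this repo could look like (the image name, build context and manifest paths are hypothetical):

apiVersion: skaffold/v2beta5
kind: Config
build:
  artifacts:
    - image: conabio/audio-processing   # hypothetical image name
      context: dockerfiles/audio/0.4.0
deploy:
  kubectl:
    manifests:
      - deployments/audio/*.yaml   # hypothetical manifests path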
Error using datacube 1.8.0
pyproj.exceptions.CRSError: Invalid projection: PROJCS["unnamed",GEOGCS["WGS 84",DATUM["unknown",SPHEROID["WGS84",6378137,6556752.3141]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic_2SP"],PARAMETER["standard_parallel_1",17.5],PARAMETER["standard_parallel_2",29.5],PARAMETER["latitude_of_origin",12],PARAMETER["central_meridian",-102],PARAMETER["false_easting",2500000],PARAMETER["false_northing",0]]: (Internal Proj Error: proj_create: buildCS: missing UNIT)
Check:
opendatacube/datacube-core#880
For:
Development has been done in
https://github.com/CONABIO/kube_sipecam/tree/master/deployments/MAD_Mex
using minikube, kubeflow and kale.
Create the dir minikube_sipecam
under the root dir of this repo to hold the explanation of this development.
Primarily this dir will hold all the documentation for the requirements and instructions to deploy the system. It will help to:
- be a proof of concept and local deployment of the kube sipecam processing system.
- adopt the kube sipecam processing system and familiarize with the pipelines.