k8s processing documentation for SiPeCaM project
License: MIT License

Overview of k8s cluster in AWS:
Change the names of the service and deployment of MAD-Mex in:
https://github.com/CONABIO/kube_sipecam/tree/master/minikube_sipecam/deployments/MAD_Mex
As a reference, use:
It would be really useful to create a GitHub Action that pushes the docker images whenever the Dockerfiles in kube_sipecam/dockerfiles/ change.
Reference:
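A minimal sketch of such a workflow, assuming the images go to Docker Hub (the workflow name, image name, Dockerfile path and secret names are hypothetical):

name: push-docker-images
on:
  push:
    paths:
      - "dockerfiles/**"   # trigger only when Dockerfiles change
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build image
        # hypothetical image name and Dockerfile path
        run: docker build -t conabio/kube-sipecam-audio:latest dockerfiles/audio/0.4.0
      - name: Push image
        run: |
          echo "${{ secrets.DOCKERHUB_PASSWORD }}" | docker login -u "${{ secrets.DOCKERHUB_USER }}" --password-stdin
          docker push conabio/kube-sipecam-audio:latest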
In the past, the annotation volume.beta.kubernetes.io/storage-class was used instead of the storageClassName attribute. This annotation still works; however, it won't be supported in a future Kubernetes release.
ref:
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class
So, when using aws-efs as the provisioner for storage classes, the metadata.annotations section needs to be updated when creating the PVC:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: efs
  namespace: kubeflow
  annotations:
    volume.beta.kubernetes.io/storage-class: "aws-efs"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
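With recent Kubernetes versions, the same claim can be written with the storageClassName attribute instead of the deprecated annotation; a sketch of the equivalent manifest:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: efs
  namespace: kubeflow
spec:
  storageClassName: aws-efs   # replaces the volume.beta.kubernetes.io/storage-class annotation
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi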
Or substitute another provisioner (or follow the suggestions of eterna2 in the Kubeflow Slack chat... I checked whether I had saved those suggestions and didn't find them, but they were based on node selectors), because
https://github.com/kubernetes-retired/external-storage/tree/master/aws/efs
looks like it will be retired...
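One replacement option (an assumption, not tested here) is the AWS EFS CSI driver, whose StorageClass would look like:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com   # provisioner installed by the aws-efs-csi-driver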
Using a docker run command for the nvcr.io/nvidia/tensorflow:19.03-py3 docker image
I got:
The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TensorFlow. NVIDIA recommends the use of the following flags:
nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
Maybe I need to mount a volume like:
volumeMounts:
  - name: efs-pvc
    mountPath: "/shared_volume"
  - name: dshm
    mountPath: /dev/shm
volumes:
  - name: efs-pvc
    persistentVolumeClaim:
      claimName: efs
  - name: dshm
    emptyDir:
      medium: Memory
in ??
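For reference, a minimal sketch of where these sections would sit in a pod spec (the image and the volume names come from the notes above; the pod and container names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: tf-shm-test   # hypothetical name
spec:
  containers:
    - name: tensorflow
      image: nvcr.io/nvidia/tensorflow:19.03-py3
      volumeMounts:
        - name: efs-pvc
          mountPath: "/shared_volume"
        - name: dshm
          mountPath: /dev/shm   # backs /dev/shm with the memory-medium emptyDir
  volumes:
    - name: efs-pvc
      persistentVolumeClaim:
        claimName: efs
    - name: dshm
      emptyDir:
        medium: Memory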
The first test using kale from the command line with the Docker image was successful, but using the jupyter extension wasn't. This is possibly related to running jupyter lab as a user other than root.
Follow:
to build:
Add the scikit-learn package in:
pip install --user scikit-learn
Check new functionality in kale 0.5.1 (already built into the hsi Dockerfile)
It will be useful to give potential developers of processing systems a "Dockerfile standard" so their systems can be integrated into the kube_sipecam framework.
It could be something like:
FROM ubuntu:bionic
ENV TIMEZONE America/Mexico_City
ENV JUPYTERLAB_VERSION 2.1.4
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV DEBIAN_FRONTEND noninteractive
ENV DEB_BUILD_DEPS="sudo nano less git python3-dev python3-pip python3-setuptools curl wget"
ENV DEB_PACKAGES=""
ENV PIP_PACKAGES_KALE="click==7.0 six==1.12.0 setuptools==41.0.0 urllib3==1.24.2 kubeflow-kale==0.5.0"
RUN apt-get update && export DEBIAN_FRONTEND && \
echo $TIMEZONE > /etc/timezone && apt-get install -y tzdata
RUN apt-get update && apt-get install -y $DEB_BUILD_DEPS $DEB_PACKAGES && pip3 install --upgrade pip
RUN curl -sL https://deb.nodesource.com/setup_12.x | sudo -E bash - && apt-get install -y nodejs
RUN pip3 install jupyter "jupyterlab<2.0.0" --upgrade
RUN jupyter notebook --generate-config && sed -i "s/#c.NotebookApp.password = .*/c.NotebookApp.password = u'sha1:115e429a919f:21911277af52f3e7a8b59380804140d9ef3e2380'/" /root/.jupyter/jupyter_notebook_config.py
RUN pip3 install $PIP_PACKAGES_KALE --upgrade
RUN jupyter labextension install kubeflow-kale-launcher
#install package, for example:
RUN pip3 install "git+https://github.com/CONABIO/geonode.git#egg=geonode_conabio&subdirectory=python3_package_for_geonode"
VOLUME ["/shared_volume"]
#create url like:
ENV NB_PREFIX geonodeurl
#use url in:
ENTRYPOINT ["/usr/local/bin/jupyter", "lab", "--ip=0.0.0.0", "--no-browser", "--allow-root", "--LabApp.allow_origin='*'", "--LabApp.base_url=geonodeurl"]
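To build and try the image locally, something like (the tag is hypothetical): docker build -t sipecam/dockerfile-standard:0.1 . followed by docker run -p 8888:8888 sipecam/dockerfile-standard:0.1 should expose jupyter lab on port 8888.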
Add lines like the following in odc_kale/0.1.0_1.7.0_0.5.0/Dockerfile and odc_kale/0.1.0_1.8.3_0.5.0/Dockerfile:
#some configs for antares & datacube
RUN ln -sf /shared_volume/.antares ~/.antares && \
    ln -sf /shared_volume/.datacube.conf ~/.datacube.conf
So there's no need to create these files every time a new madmex-odc-kale container is run.
Need to check dependencies in the Dockerfile.
There are errors related to the versions of kale 0.3.4 and tensorflow 1.14.0.
For example:
ERROR: nbclient 0.1.0 has requirement nbformat>=5.0, but you'll have nbformat 4.4.0 which is incompatible.
ERROR: kfp 0.1.40 has requirement click==7.0, but you'll have click 7.1.1 which is incompatible.
Version 0.4.0 of kale also produces errors
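A possible workaround (an assumption, not verified) is to pin the conflicting packages explicitly before installing kale, e.g. pip3 install click==7.0 "nbformat>=5.0".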
It was seen after doing tests that it is not necessary to distinguish between having the next line:
and not having it in the deployment:
At least using the example for torch:
the kubeflow + kale run was successful.
So I could either delete the file
or use this file to compile the notebook via kale and avoid problems in kubernetes with not finding nodes with GPUs (because setting the parameter nvidia.com/gpu: 1 inside the limits block causes this message).
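For reference, this is the kind of limits block being discussed, as a minimal sketch (the pod name, container name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: torch
      image: pytorch/pytorch   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1   # the parameter that requires nodes with GPUs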
Check
https://github.com/CONABIO/kube_sipecam/blob/master/dockerfiles/audio/0.4.0/Dockerfile
When the image is deployed, the next output is produced (these look like permission errors from running jupyter lab as the non-root user miuser, which lacks write access to the jupyterlab installation directory):
Fail to get yarn configuration. {"type":"error","data":"Could not write file "/usr/local/lib/python3.6/dist-packages/jupyterlab/yarn-error.log": "EACCES: permission denied, open '/usr/local/lib/python3.6/dist-packages/jupyterlab/yarn-error.log'""}
{"type":"error","data":"An unexpected error occurred: "EACCES: permission denied, scandir '/home/miuser/.config/yarn/link'"."}
{"type":"info","data":"Visit https://yarnpkg.com/en/docs/cli/config for documentation about this command."}
TensorRT for high-performance inference, see blog:
Github:
https://github.com/NVIDIA/TensorRT
Not sure when and how I got errors like:
tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-24 13:32:09.746769: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Maybe using tfx? Or some of the dependencies of tfx?... One way to start solving the previous error is to use the docker image in https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt :
docker pull nvcr.io/nvidia/tensorrt:20.03-py3
If Docker image 0.21.4 is used as the base image, the next output is obtained:
2020-04-24 17:41:30.028801: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-04-24 17:41:30.029009: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-04-24 17:41:30.029035: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
usage: run_executor.py [-h] --executor_class_path EXECUTOR_CLASS_PATH
[--temp_directory_path TEMP_DIRECTORY_PATH]
(--inputs INPUTS | --inputs-base64 INPUTS_BASE64)
(--outputs OUTPUTS | --outputs-base64 OUTPUTS_BASE64)
(--exec-properties EXEC_PROPERTIES | --exec-properties-base64 EXEC_PROPERTIES_BASE64)
[--write-outputs-stdout]
run_executor.py: error: the following arguments are required: --executor_class_path
If the tfx base Docker image is used as the base image, the next output is obtained:
Extracting Bazel installation...
[bazel release 3.0.0]
Usage: bazel <command> <options> ...
Available commands:
analyze-profile Analyzes build profile data.
aquery Analyzes the given targets and queries the action graph.
build Builds the specified targets.
canonicalize-flags Canonicalizes a list of bazel options.
clean Removes output files and optionally stops the server.
coverage Generates code coverage report for specified test targets.
...
Getting more help:
bazel help <command>
Prints help and options for <command>.
bazel help startup_options
Options for the JVM hosting bazel.
bazel help target-syntax
Explains the syntax for specifying targets.
bazel help info-keys
Displays a list of keys used by the info command.
Need to choose which tfx base docker image will be used in the audio processing kubeflow pipelines.
See:
Check https://skaffold.dev/docs/ for CI/CD pipelines in kubernetes
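A minimal sketch of what a skaffold.yaml for this repo could look like (the image name, build context and manifest paths are hypothetical):

apiVersion: skaffold/v2beta5
kind: Config
build:
  artifacts:
    - image: conabio/audio-processing   # hypothetical image name
      context: dockerfiles/audio/0.4.0
deploy:
  kubectl:
    manifests:
      - deployments/audio/*.yaml   # hypothetical manifests path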
Error using datacube 1.8.0
pyproj.exceptions.CRSError: Invalid projection: PROJCS["unnamed",GEOGCS["WGS 84",DATUM["unknown",SPHEROID["WGS84",6378137,6556752.3141]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic_2SP"],PARAMETER["standard_parallel_1",17.5],PARAMETER["standard_parallel_2",29.5],PARAMETER["latitude_of_origin",12],PARAMETER["central_meridian",-102],PARAMETER["false_easting",2500000],PARAMETER["false_northing",0]]: (Internal Proj Error: proj_create: buildCS: missing UNIT)
Check:
opendatacube/datacube-core#880
For:
Development has been done in
https://github.com/CONABIO/kube_sipecam/tree/master/deployments/MAD_Mex
using minikube, kubeflow and kale.
Create the dir minikube_sipecam
under the root dir of this repo to hold the explanation of this development.
Primarily this dir will hold all the documentation for the requirements and instructions to deploy the system. It will help to:
- be a proof of concept and local deployment of the kube sipecam processing system.
- adopt the kube sipecam processing system and familiarize with the pipelines.