
lablup / backend.ai-kernels

Repository of Backend.AI-enabled container recipes

Home Page: https://www.backend.ai

License: GNU Lesser General Public License v3.0

Shell 1.14% Python 3.36% PHP 0.08% R 0.03% Julia 0.30% MATLAB 0.06% Dockerfile 30.66% Roff 11.54% CSS 0.15% Jupyter Notebook 51.65% JavaScript 0.26% HTML 0.58% Vue 0.19%
alpine-linux deep-learning docker programming-languages repl sandbox sorna ubuntu

backend.ai-kernels's People

Contributors

achimnol, adrysn, dependabot[bot], gofeel, hephaex, inureyes, kmkwon94, kyujin-cho, lizable, tink-expo, xyloon


backend.ai-kernels's Issues

Sandboxed Execution

This issue is delegated from lablup/sorna-agent#1.
(It is too long to type in the first line of commit messages...)

New "log" stream in addition to stdout/stderr streams

Some languages offer standardized logging (e.g., Python's logging module and Julia's info(), warn() functions). Let's wrap them and provide a prettier output by distinguishing them via a separate type of stream: "log". (Currently the new PUSH/PULL agent protocol only recognizes "stdout", "stderr", "media", "finished", "waiting-input" message types.)
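A minimal sketch of how such wrapping could look, assuming a handler that forwards Python's standard logging records as the proposed "log" message type. `StreamRelayHandler` and `send` are hypothetical names; `send` stands in for a ZeroMQ PUSH socket's send_multipart.

```python
import logging

# Hypothetical handler: route logging records into a separate "log" stream
# instead of mixing them into stderr. `send` stands in for send_multipart.
class StreamRelayHandler(logging.Handler):
    def __init__(self, send):
        super().__init__()
        self.send = send

    def emit(self, record):
        # b'log' is the proposed new message type alongside b'stdout'/b'stderr'.
        self.send([b'log', self.format(record).encode('utf8')])

messages = []
log = logging.getLogger('demo')
log.propagate = False  # keep records out of the normal stderr path
log.addHandler(StreamRelayHandler(messages.append))
log.warning('disk almost full')
```

The front-end can then style `log` frames differently from plain console output.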

Automatic resource usage measurement with example codes

Many deep learning programs require a lot of memory and computation time.
We need an automated way to measure the peak memory usage and computation time of a given example code, for better capacity planning and scheduler design.

R Language Support

  • Basic REPL support
  • Pre-install a list of essential packages
    • qqplot
    • more?

NodeJS Support

  • Basic REPL implementation
  • Pre-install a list of essential packages

Upgrade all query-mode kernels to use new PUSH/PULL-based agent protocol

  • Python 3 (this is the reference implementation)
  • Python 3 - TensorFlow
  • Python 3 - TensorFlow GPU
  • Python 3 - Caffe
  • Python 2
  • PHP
  • R
  • Julia
  • Javascript
  • Lua
  • Haskell
  • Octave
  • Git shell (command part)

For Python kernels, we also need to update sorna-media package to v0.3.

Minimum features:

  • Bind PUSH/PULL ZeroMQ sockets on TCP ports 2000 and 2001
  • Console outputs should be sent as multipart messages in this format: [b'stdout', b'utf8-encoded-text'] and [b'stderr', b'utf8-encoded-text'].
  • No more separate exception handler. Just print language-native traceback to stderr.
  • Send a multipart message [b'finished', b''] when execution is done.
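The framing above can be sketched in plain Python (socket wiring omitted); a real kernel would bind the PUSH/PULL ZeroMQ sockets on TCP ports 2000/2001 and hand these frames to send_multipart(). The `frame` helper is a hypothetical name.

```python
# Hypothetical helper building the multipart payloads the protocol requires.
def frame(msg_type: bytes, text: str = '') -> list:
    # Console output: [b'stdout'|b'stderr', b'utf8-encoded-text'];
    # completion:     [b'finished', b''].
    return [msg_type, text.encode('utf8')]

out = frame(b'stdout', 'hello\n')
err = frame(b'stderr', 'Traceback (most recent call last): ...')
done = frame(b'finished')
```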

Optional features:

  • Strip tracebacks to include only user-provided code.
  • Provide interactive input function using input socket (like self.handle_input in Python 3 impl.)
    • If possible, override the most commonly used, standard input function.

Tips:

  • Copy and modify test_run.py in the python3 kernel directory to test the main programs before building Docker containers, for faster debugging and development iteration.

Git support

Add a custom Git command shell for Git tutorial courses.

Allow callbacks in nodejs4 kernel

Egoing reported that he could not see the result of the following code:

var 입력한비밀번호 = '1111';    // entered password
var 소금의크기 = 32;            // salt size
var 암호화반복횟수 = 10000;     // hash iterations
var 암호의길이 = 32;            // key length
var crypto = require('crypto');
crypto.randomBytes(소금의크기, function(오류, 소금){  // (err, salt)
    crypto.pbkdf2(입력한비밀번호, 소금, 암호화반복횟수, 암호의길이, 'sha512', function(오류, 생성된암호){  // (err, derivedKey)
        console.log(생성된암호.toString('hex'));
    });
});

This happens because the current nodejs kernel only runs the synchronous part of the code before sending the execution result; callbacks scheduled by the user code are executed later.
We need a "blocking" mechanism that waits until all user callbacks finish, while temporarily removing the existing sorna-side callbacks from the event loop.

As a result, I found a small, hacky open-source project that uses a C++ addon to access the uv_run() function, and patched it to implement a blocking call that waits until all callbacks finish:
abbr/deasync#53

Then, I added unref()/ref() support to the zeromq.node project:
JustinTulloss/zeromq.node#503

Now we can implement a proper blocking call for nodejs4 kernel.

Go Language Support

Add Go language support.

  • Basic REPL for Golang
  • Pre-install a list of essential packages
    • availability of go get-like functionality?

Lua support

  • Basic Lua execution support
  • Useful packages?

Improve Jail development environment

Jail should be compiled on Linux (preferably the same Ubuntu version as the REPL kernels use), so native Docker environments require a separate Ubuntu image setup.
Let's add some helper scripts for building new jail binaries.

Add some reasonable default configs for kernel images

  • Shell
    • Uncomment set convert-meta off in /etc/inputrc to allow output of 8-bit characters
    • Run locale-gen en_US.UTF-8 and set the LANG environment variable so that bash handles multi-byte UTF-8 characters correctly (e.g., backspace should delete a whole Unicode character as a single char).
  • Vim
    • Add terminal encoding and indentations to /etc/vim/vimrc.local
    • Note: /etc/vim/vimrc and /usr/share/vim/vim74/debian.vim already have syntax highlighting, eol, and nocompatible settings.

More to come.

Automate image build and deploy process

... so that other people can easily update the service images.
Currently, we use only a single sorna instance, sorna.lablup, but this should be extended to cover multiple instances via docker-registry.lablup.

Support interrupt of ongoing executions

During development, engineers often interrupt ongoing executions when they realize something is going wrong. Jupyter Notebook also supports interrupts by sending a SIGINT signal from the notebook server to the kernel process. Let's support it.

There are some issues to consider:

  • A SIGINT signal may not be caught if it is delivered in the middle of a system call, since Python 3.5+ retries interrupted I/O syscalls. ref: PEP-475
  • Abrupt interruption may leave the process in an inconsistent state (e.g., somewhere between sending ZeroMQ multipart output messages).
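The interrupt flow can be sketched as below. With Python's default SIGINT handler installed, the signal surfaces as a KeyboardInterrupt in the main thread; note that PEP 475 retries an interrupted syscall only when the handler does not raise, which is exactly what the default handler does. The timer simulating the agent-side interrupt is an illustration, not the actual agent mechanism.

```python
import os
import signal
import threading
import time

# Ensure the default handler is installed: it raises KeyboardInterrupt.
signal.signal(signal.SIGINT, signal.default_int_handler)

def run_user_code():
    try:
        while True:
            time.sleep(0.05)  # stand-in for long-running user code
    except KeyboardInterrupt:
        return 'interrupted'

# Simulate the agent delivering SIGINT shortly after execution starts.
threading.Timer(0.2, os.kill, (os.getpid(), signal.SIGINT)).start()
result = run_user_code()
```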

Limit CPU core count in containers

TensorFlow kernels do not work on high-end servers due to the process/thread limits in our jail.
This is probably caused by the sysconf(_SC_NPROCESSORS_ONLN) library call reporting the full host CPU count instead of the Docker-allocated cpuset.
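The mismatch can be demonstrated from Python on Linux: os.cpu_count() mirrors the sysconf-style full CPU count, while os.sched_getaffinity(0) reflects the cpuset actually granted to the process (what `docker run --cpuset-cpus` restricts). A sketch:

```python
import os

# What sysconf(_SC_NPROCESSORS_ONLN)-style APIs report: every online host CPU.
host_cpus = os.cpu_count()

# What this process may actually use: the scheduler affinity mask,
# which Docker's cpuset allocation narrows inside a container.
allowed_cpus = len(os.sched_getaffinity(0))

# Libraries sizing thread pools should use the affinity mask, not cpu_count().
```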

Jail policy implementation for sockets to external networks

Phase 1: Allow some fixed hosts for lectures and tutorials, such as:

  • Allow connections to DNS servers (TCP/UDP port 53)
    • We could probably cache DNS resolution results to restrict connections to the resolved hosts.
  • Allow connections to the following hosts:
    • github.com, bitbucket.org (TCP 22, 80, 443): source-code repository services
    • httpbin.org (TCP 80, 443): HTTP protocol test server
    • example.com, example.org, example.net, and possibly other IANA-managed sample domains (TCP 80, 443): example domain site reserved by IANA

Phase 2: Allow customization

  • Allow additional host/port pairs specified per client-side session ID via Sorna API.

TensorFlow + Keras

Keras is a wrapper around existing DL libraries.
Let's add support for it as two separate kernel images: tf + keras and theano + keras.
The tf + keras image will be an upgrade of the current python3-tensorflow images.

SQL support

An sqlite-based data manipulation course.
(Demand exists at research/consulting firms, even among people without programming skills.)

Julia Support

  • Basic REPL implementation
  • Pre-install a list of essential packages

Suppress/remove font cache building when first using matplotlib

/home/joongi/venv-ipython/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

Remove or suppress the above warning messages when a fresh kernel first uses matplotlib.
Maybe we could run the font-cache building process during docker builds.
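One way to do this, assuming it is acceptable to pay the cost at build time: importing matplotlib.font_manager once is enough to populate the font cache, so a Dockerfile could run a one-off `RUN python -c "import matplotlib.font_manager"`. A guarded sketch (matplotlib may not be installed in every environment):

```python
import importlib.util

# Warm matplotlib's font cache at image build time so the first interactive
# use doesn't show the fc-list warning. Guarded because matplotlib may be
# absent in this environment.
if importlib.util.find_spec('matplotlib') is not None:
    import matplotlib.font_manager  # scanning fonts populates the cache
    cache_warmed = True
else:
    cache_warmed = False
```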

Limit the CPU cores exposed to kernel containers

Some C/C++ libraries used by kernels' 3rd-party packages implicitly spawn as many threads as there are available CPU cores, which exceeds the default child process/thread limit (32) on servers with many cores. This causes crashes or indefinite hangs of kernels. 😞

  • Limit the maximum number of CPU cores per container (default is 1 and customizable by each kernel via image labels)
  • (in lablup/sorna-agent) Implement a CPU core allocation policy that takes NUMA nodes and core occupancy of existing kernels into account.

Add unit-tests for new/updated images

Write a set of parametrized test suites that use language/version-specific example codes to test the basic ZeroMQ REPL functionality of new/updated images.
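Such a suite might be parametrized like the sketch below. `EXAMPLES` and `run_in_kernel` are hypothetical names; a real implementation would launch the image and send each snippet over the kernel's ZeroMQ REPL sockets, collecting the b'stdout' frames.

```python
# Map each image to a hello-world snippet and its expected stdout.
EXAMPLES = {
    'python3': ('print("hello")',  'hello\n'),
    'r3':      ('cat("hello\\n")', 'hello\n'),
    'lua5':    ('print("hello")',  'hello\n'),
}

def run_in_kernel(image, code):
    # Stub: a real version would start the container and talk ZeroMQ.
    return EXAMPLES[image][1]

failures = [image for image, (code, expected) in EXAMPLES.items()
            if run_in_kernel(image, code) != expected]
```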

Limit stdin functions (e.g., input) in request-reply kernels

Some users have tried input() in Python kernels during code-golf sessions at conferences. In such cases they saw "unexpected" timeouts because most request-reply based kernels cannot handle user input.

Until we have a nice user-input handling in the front-ends, we need to explicitly disable them and show error messages to the user.
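For the Python kernels, one possible approach is to replace the builtin so the user gets an immediate, descriptive error instead of a silent timeout. A sketch (the replacement function name is illustrative):

```python
import builtins

# Fail fast instead of hanging: request-reply kernels have no stdin.
def _input_disabled(prompt=''):
    raise RuntimeError('input() is not supported in this kernel: '
                       'interactive stdin is unavailable in request-reply mode.')

builtins.input = _input_disabled

try:
    input('your name? ')
    got_error = False
except RuntimeError:
    got_error = True
```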

Update TensorFlow kernels to use latest CPU features

On AWS p2.xlarge instance, TF kernels give the following warnings:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

Migrate to Alpine Linux for smaller images

My initial tests show that it's viable to migrate to Alpine Linux for our kernel images.

  • sorna-jail works well if compiled inside Alpine Linux. (Binaries compiled in Ubuntu have glibc-specific symbol references: __fprintf_chk, __vfprintf_chk.)
  • The image size of a Python 3.6 base kernel (without numpy/scipy but including jail support) is only 112 MB! (previously it was ~600 MB)

The core ideas to reduce image size are:

  • Alpine Linux itself. The base image is less than 8 MB, containing just a busybox, the package manager apk, and a few utilities such as scanelf. libc is transparently replaced with musl.
  • Remove build-only dependencies at the end of a single RUN command so that each image layer has a small footprint.
  • The package manager (apk) in Alpine Linux provides a concept of "virtual" package installation, so we can easily purge a set of packages. Also, most Alpine Linux packages are built independently with minimal cross-dependencies.

Challenges remaining:

  • Migrate CUDA kernels to be based on our Alpine images. I think this is theoretically possible, but there may be unexpected glibc-specific dependencies in the CUDA binaries.

Potentially reduce image sizes by not installing recommended packages

apt-get has a CLI argument --no-install-recommends that skips installing recommended packages and installs only the main dependencies. This could reduce the kernel image sizes a lot.
Let's test this.

NOTE: Since we have basic unit tests for docker images, it is sufficient to check that the tests pass after rebuilding the images with --no-install-recommends.

PHP Support

  • Basic REPL implementation
  • Pre-install a list of essential packages
    • Some ZEND packages?
