
lablup / backend.ai-kernels

Repository of Backend.AI-enabled container recipes

Home Page: https://www.backend.ai

License: GNU Lesser General Public License v3.0

Shell 1.14% Python 3.36% PHP 0.08% R 0.03% Julia 0.30% MATLAB 0.06% Dockerfile 30.66% Roff 11.54% CSS 0.15% Jupyter Notebook 51.65% JavaScript 0.26% HTML 0.58% Vue 0.19%
alpine-linux deep-learning docker programming-languages repl sandbox sorna ubuntu

backend.ai-kernels's People

Contributors

achimnol, adrysn, dependabot[bot], gofeel, hephaex, inureyes, kmkwon94, kyujin-cho, lizable, tink-expo, xyloon


backend.ai-kernels's Issues

Sandboxed Execution

This issue is delegated from lablup/sorna-agent#1.
(It is too long to type in the first line of commit messages...)

New "log" stream in addition to stdout/stderr streams

Some languages offer standardized logging (e.g., Python's logging module and Julia's info(), warn() functions). Let's wrap them and provide a prettier output by distinguishing them via a separate type of stream: "log". (Currently the new PUSH/PULL agent protocol only recognizes "stdout", "stderr", "media", "finished", "waiting-input" message types.)
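A minimal sketch of how such wrapping could look, assuming a handler that forwards Python's standard logging records as the proposed "log" message type. `StreamRelayHandler` and `send` are hypothetical names; `send` stands in for a ZeroMQ PUSH socket's send_multipart.

```python
import logging

# Hypothetical handler: route logging records into a separate "log" stream
# instead of mixing them into stderr. `send` stands in for send_multipart.
class StreamRelayHandler(logging.Handler):
    def __init__(self, send):
        super().__init__()
        self.send = send

    def emit(self, record):
        # b'log' is the proposed new message type alongside b'stdout'/b'stderr'.
        self.send([b'log', self.format(record).encode('utf8')])

messages = []
log = logging.getLogger('demo')
log.propagate = False  # keep records out of the normal stderr path
log.addHandler(StreamRelayHandler(messages.append))
log.warning('disk almost full')
```

The front-end can then style `log` frames differently from plain console output.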

Automatic resource usage measurement with example codes

Many deep learning programs require a lot of memory and computation time.
We need an automated way to measure the peak memory usage and computation time of a given example code, for better capacity planning and scheduler design.

R Language Support

  • Basic REPL support
  • Pre-install a list of essential packages
    • qqplot
    • more?

NodeJS Support

  • Basic REPL implementation
  • Pre-install a list of essential packages

Upgrade all query-mode kernels to use new PUSH/PULL-based agent protocol

  • Python 3 (this is the reference implementation)
  • Python 3 - TensorFlow
  • Python 3 - TensorFlow GPU
  • Python 3 - Caffe
  • Python 2
  • PHP
  • R
  • Julia
  • Javascript
  • Lua
  • Haskell
  • Octave
  • Git shell (command part)

For Python kernels, we also need to update sorna-media package to v0.3.

Minimum features:

  • Bind PUSH/PULL ZeroMQ sockets on TCP ports 2000 and 2001
  • Console outputs should be sent as multipart messages in this format: [b'stdout', b'utf8-encoded-text'] and [b'stderr', b'utf8-encoded-text'].
  • No more separate exception handler. Just print language-native traceback to stderr.
  • Send a multipart message [b'finished', b''] when execution is done.
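The framing above can be sketched in plain Python (socket wiring omitted); a real kernel would bind the PUSH/PULL ZeroMQ sockets on TCP ports 2000/2001 and hand these frames to send_multipart(). The `frame` helper is a hypothetical name.

```python
# Hypothetical helper building the multipart payloads the protocol requires.
def frame(msg_type: bytes, text: str = '') -> list:
    # Console output: [b'stdout'|b'stderr', b'utf8-encoded-text'];
    # completion:     [b'finished', b''].
    return [msg_type, text.encode('utf8')]

out = frame(b'stdout', 'hello\n')
err = frame(b'stderr', 'Traceback (most recent call last): ...')
done = frame(b'finished')
```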

Optional features:

  • Strip tracebacks to include only user-provided code.
  • Provide interactive input function using input socket (like self.handle_input in Python 3 impl.)
    • If possible, override the most commonly used, standard input function.

Tips:

  • Copy and modify test_run.py in the python3 kernel directory to test the main programs before building Docker containers, for faster debugging and development iteration.

Git support

Add a custom Git command shell for Git tutorial courses.

Allow callbacks in nodejs4 kernel

Egoing reported that he could not see the result of the following code:

var 입력한비밀번호 = '1111';    // entered password
var 소금의크기 = 32;            // salt size
var 암호화반복횟수 = 10000;     // hash iterations
var 암호의길이 = 32;            // key length
var crypto = require('crypto');
crypto.randomBytes(소금의크기, function(오류, 소금){  // (err, salt)
    crypto.pbkdf2(입력한비밀번호, 소금, 암호화반복횟수, 암호의길이, 'sha512', function(오류, 생성된암호){  // (err, derivedKey)
        console.log(생성된암호.toString('hex'));
    });
});

This happens because the current nodejs kernel only runs the synchronous part of the code before sending the execution result; callbacks scheduled by the user code are executed later.
We need a "blocking" mechanism that waits until all user callbacks finish, while temporarily removing the existing sorna-side callbacks from the event loop.

As a result, I found a small, hacky open-source project that uses a C++ addon to access the uv_run() function, and patched it to implement a blocking call that waits until all callbacks finish:
abbr/deasync#53

Then, I added unref()/ref() support to the zeromq.node project:
JustinTulloss/zeromq.node#503

Now we can implement a proper blocking call for nodejs4 kernel.

Go Language Support

Add Go language support.

  • Basic REPL for Golang
  • Pre-install a list of essential packages
    • availability of go get-like functionality?

Lua support

  • Basic Lua execution support
  • Useful packages?

Improve Jail development environment

Jail should be compiled on Linux (preferably the same Ubuntu version as the REPL kernels use), so native Docker environments require a separate Ubuntu image setup.
Let's add some helper scripts for building new jail binaries.

Add some reasonable default configs for kernel images

  • Shell
    • Uncomment set convert-meta off in /etc/inputrc to allow output of 8-bit characters
    • Run locale-gen en_US.UTF-8 and set the LANG environment variable so that bash handles multi-byte UTF-8 characters correctly (e.g., backspace should delete a whole Unicode character as a single char).
  • Vim
    • Add terminal encoding and indentations to /etc/vim/vimrc.local
    • Note: /etc/vim/vimrc and /usr/share/vim/vim74/debian.vim already have syntax highlighting, eol, and nocompatible settings.

More to come.

Automate image build and deploy process

... so that other people can easily update the service images.
Currently, we use only a single sorna instance, sorna.lablup, but this should be extended to cover multiple instances via docker-registry.lablup.

Support interrupt of ongoing executions

During development, engineers often interrupt ongoing executions when they realize something is going wrong. Jupyter Notebook also supports interrupts by sending a SIGINT signal from the notebook server to the kernel process. Let's support it.

There are some issues to consider:

  • A SIGINT signal may not be caught if it is delivered in the middle of a system call, since Python 3.5+ retries interrupted I/O syscalls. ref: PEP-475
  • Abrupt interruption may leave the process in an inconsistent state (e.g., somewhere between sending ZeroMQ multipart output messages).
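The interrupt flow can be sketched as below. With Python's default SIGINT handler installed, the signal surfaces as a KeyboardInterrupt in the main thread; note that PEP 475 retries an interrupted syscall only when the handler does not raise, which is exactly what the default handler does. The timer simulating the agent-side interrupt is an illustration, not the actual agent mechanism.

```python
import os
import signal
import threading
import time

# Ensure the default handler is installed: it raises KeyboardInterrupt.
signal.signal(signal.SIGINT, signal.default_int_handler)

def run_user_code():
    try:
        while True:
            time.sleep(0.05)  # stand-in for long-running user code
    except KeyboardInterrupt:
        return 'interrupted'

# Simulate the agent delivering SIGINT shortly after execution starts.
threading.Timer(0.2, os.kill, (os.getpid(), signal.SIGINT)).start()
result = run_user_code()
```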

Limit CPU core count in containers

TensorFlow kernels do not work on high-end servers due to the process/thread limits in our jail.
This is probably caused by the sysconf(_SC_NPROCESSORS_ONLN) library call reporting the full host CPU count instead of the Docker-allocated cpuset.
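The mismatch can be demonstrated from Python on Linux: os.cpu_count() mirrors the sysconf-style full CPU count, while os.sched_getaffinity(0) reflects the cpuset actually granted to the process (what `docker run --cpuset-cpus` restricts). A sketch:

```python
import os

# What sysconf(_SC_NPROCESSORS_ONLN)-style APIs report: every online host CPU.
host_cpus = os.cpu_count()

# What this process may actually use: the scheduler affinity mask,
# which Docker's cpuset allocation narrows inside a container.
allowed_cpus = len(os.sched_getaffinity(0))

# Libraries sizing thread pools should use the affinity mask, not cpu_count().
```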

Jail policy implementation for sockets to external networks

Phase 1: Allow some fixed hosts for lectures and tutorials, such as:

  • Allow connections to DNS servers (TCP/UDP port 53)
    • We could probably cache DNS resolution results to restrict connections to the resolved hosts.
  • Allow connections to the following hosts:
    • github.com, bitbucket.org (TCP 22, 80, 443): source-code repository services
    • httpbin.org (TCP 80, 443): HTTP protocol test server
    • example.com, example.org, example.net, and possibly other IANA-managed sample domains (TCP 80, 443): example domain site reserved by IANA

Phase 2: Allow customization

  • Allow additional host/port pairs specified per client-side session ID via Sorna API.

TensorFlow + Keras

Keras is a wrapper around existing DL libraries.
Let's add support for it as two separate kernel images: tf + keras and theano + keras.
The tf + keras image will be an upgrade of the current python3-tensorflow images.

SQL support

An sqlite-based data manipulation course.
(Demand exists at research/consulting firms, even among people without programming skills.)

Julia Support

  • Basic REPL implementation
  • Pre-install a list of essential packages

Suppress/remove font cache building when first using matplotlib

/home/joongi/venv-ipython/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

Remove or suppress the above warning messages when a fresh kernel first uses matplotlib.
Maybe we could run the font-cache building process during docker builds.
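One way to do this, assuming it is acceptable to pay the cost at build time: importing matplotlib.font_manager once is enough to populate the font cache, so a Dockerfile could run a one-off `RUN python -c "import matplotlib.font_manager"`. A guarded sketch (matplotlib may not be installed in every environment):

```python
import importlib.util

# Warm matplotlib's font cache at image build time so the first interactive
# use doesn't show the fc-list warning. Guarded because matplotlib may be
# absent in this environment.
if importlib.util.find_spec('matplotlib') is not None:
    import matplotlib.font_manager  # scanning fonts populates the cache
    cache_warmed = True
else:
    cache_warmed = False
```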

Limit the CPU cores exposed to kernel containers

Some C/C++ libraries used by kernels' 3rd-party packages implicitly spawn as many threads as there are available CPU cores, which exceeds the default child process/thread limit (32) on servers with many cores. This causes crashes or indefinite hangs of kernels. 😞

  • Limit the maximum number of CPU cores per container (default is 1 and customizable by each kernel via image labels)
  • (in lablup/sorna-agent) Implement a CPU core allocation policy that takes NUMA nodes and core occupancy of existing kernels into account.

Add unit-tests for new/updated images

Write a set of parametrized test suites that use language/version-specific example codes to test the basic ZeroMQ REPL functionality of new/updated images.
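Such a suite might be parametrized like the sketch below. `EXAMPLES` and `run_in_kernel` are hypothetical names; a real implementation would launch the image and send each snippet over the kernel's ZeroMQ REPL sockets, collecting the b'stdout' frames.

```python
# Map each image to a hello-world snippet and its expected stdout.
EXAMPLES = {
    'python3': ('print("hello")',  'hello\n'),
    'r3':      ('cat("hello\\n")', 'hello\n'),
    'lua5':    ('print("hello")',  'hello\n'),
}

def run_in_kernel(image, code):
    # Stub: a real version would start the container and talk ZeroMQ.
    return EXAMPLES[image][1]

failures = [image for image, (code, expected) in EXAMPLES.items()
            if run_in_kernel(image, code) != expected]
```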

Limit stdin functions (e.g., input) in request-reply kernels

Some users have tried input() in Python kernels during code-golf sessions at conferences. In such cases they saw "unexpected" timeouts because most request-reply based kernels cannot handle user input.

Until we have a nice user-input handling in the front-ends, we need to explicitly disable them and show error messages to the user.
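For the Python kernels, one possible approach is to replace the builtin so the user gets an immediate, descriptive error instead of a silent timeout. A sketch (the replacement function name is illustrative):

```python
import builtins

# Fail fast instead of hanging: request-reply kernels have no stdin.
def _input_disabled(prompt=''):
    raise RuntimeError('input() is not supported in this kernel: '
                       'interactive stdin is unavailable in request-reply mode.')

builtins.input = _input_disabled

try:
    input('your name? ')
    got_error = False
except RuntimeError:
    got_error = True
```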

Update TensorFlow kernels to use latest CPU features

On AWS p2.xlarge instance, TF kernels give the following warnings:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

Migrate to Alpine Linux for smaller images

My initial tests show that it's viable to migrate to Alpine Linux for our kernel images.

  • sorna-jail works well if compiled inside Alpine Linux. (Binaries compiled in Ubuntu have glibc-specific symbol references: __fprintf_chk, __vfprintf_chk.)
  • The image size of a Python 3.6 base kernel (without numpy/scipy but including jail support) is only 112 MB! (previously it was ~600 MB)

The core ideas to reduce image size are:

  • Alpine Linux itself. The base image is less than 8 MB, containing just a busybox, the package manager apk, and a few utilities such as scanelf. libc is transparently replaced with musl.
  • Remove build-only dependencies at the end of a single RUN command so that each image layer has a small footprint.
  • The package manager (apk) in Alpine Linux provides a concept of "virtual" package installation, so we can easily purge a set of packages. Also, most Alpine Linux packages are built independently with minimal cross-dependencies.

Challenges remaining:

  • Migrate CUDA kernels to be based on our Alpine images. I think this is theoretically possible, but there may be unexpected glibc-specific dependencies in the CUDA binaries.

Potentially reduce image sizes by not installing recommended packages

apt-get has a CLI argument --no-install-recommends that skips installing recommended packages and installs only the main dependencies. This could reduce the kernel image sizes a lot.
Let's test this.

NOTE: Since we have basic unit tests for docker images, it is sufficient to check that the tests pass after rebuilding the images with --no-install-recommends.

PHP Support

  • Basic REPL implementation
  • Pre-install a list of essential packages
    • Some ZEND packages?
