openpai-runtime's Introduction

Microsoft OpenPAI Runtime

Runtime component for deep learning workload

To better support deep learning workloads, OpenPAI implements "PAI Runtime", a module that provides runtime support to job containers.

One major feature of PAI Runtime is the instantiation of runtime environment variables. PAI Runtime provides several built-in runtime environment variables, including the container's role name and index, and the IP addresses and ports of all containers used in the job. With these environment variables and Framework Controller, users can onboard custom workloads (e.g., MPI, TensorBoard) without involving or modifying the OpenPAI platform itself. OpenPAI also lets users define custom runtime environment variables tailored to their workloads.
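As an illustration of how a job script might consume these variables, here is a minimal Python sketch. The variable names (PAI_CURRENT_TASK_ROLE_NAME, PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX, PAI_HOST_IP_<role>_<index>) are assumptions that do not appear in this README, and describe_container is a hypothetical helper; check the OpenPAI documentation for the names your cluster actually sets.

```python
import os
from typing import Mapping, Optional


def describe_container(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the current container's role, index and IP from env vars.

    All variable names below are assumptions, not taken from this README.
    """
    env = os.environ if env is None else env
    role = env.get("PAI_CURRENT_TASK_ROLE_NAME", "")
    index = env.get("PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX", "0")
    # Assumed naming pattern for per-container addresses.
    host_ip = env.get(f"PAI_HOST_IP_{role}_{index}")
    return {"role": role, "index": int(index), "host_ip": host_ip}
```

Passing the environment as a parameter keeps the helper easy to exercise outside a job container; inside a container it can be called with no arguments.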

Another major feature of OpenPAI Runtime is the "PAI runtime plugin", which lets users customize runtime behavior for a job container. Essentially, a plugin is a generic way for users to inject code during container initialization or termination. OpenPAI implements several built-in plugins, including a storage plugin that mounts a remote storage service inside the job containers, an ssh plugin that supports ssh access to each container, and a failure analysis plugin that analyzes the failure reason when a container fails. We expect more features to be implemented through the plugin mechanism.

Features

  1. Prepare OpenPAI runtime environment variables
  2. Failure analysis: report possible job failure reason based on the failure pattern
  3. Storage plugin: used to auto mount remote storage according to storage config
  4. SSH plugin: used to support ssh access to job container
  5. Cmd plugin: used to run customized commands before/after job

How to build

To build the openpai-runtime Docker image, run: docker build -f ./build/openpai-runtime.dockerfile .

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

openpai-runtime's People

Contributors

abuccts, binyang2014, dependabot[bot], fanyangcs, hzy46, microsoftopensource, siaimes, suiguoxin, wangdian, ydye, yqwang-ms

openpai-runtime's Issues

How can I use my own runtime?

I have rebuilt my runtime image and pushed it to my repository on Docker Hub, but I don't know how to change the rest-server config to use it. Can anyone help?

Init blocked in image_checker.py>connectionpool.py, while the job remains waiting.

init.log:

+ CHILD_PROCESS=UNKNOWN
+ trap exit_handler EXIT
+ PAI_WORK_DIR=/usr/local/pai
+ PAI_CONFIG_DIR=/usr/local/pai-config
+ PAI_INIT_DIR=/usr/local/pai/init.d
+ PAI_RUNTIME_DIR=/usr/local/pai/runtime.d
+ PAI_LOG_DIR=/usr/local/pai/logs/a1333840-bc83-4844-b34d-a514d3ecb2e6
+ PAI_SECRET_DIR=/usr/local/pai/secrets
+ PAI_USER_EXTENSION_SECRET_DIR=/usr/local/pai/user-extension-secrets
+ PAI_TOKEN_SECRET_DIR=/usr/local/pai/token-secrets
+ chmod a+rw /usr/local/pai/logs/a1333840-bc83-4844-b34d-a514d3ecb2e6
+ find /usr/local/pai/logs/a1333840-bc83-4844-b34d-a514d3ecb2e6 -maxdepth 1 -type f '!' -name init.log
+ LOG_FILES=
+ '[[' '!' -z  ]]
+ find /usr/local/pai -maxdepth 1 -mindepth 1 '!' -name logs -exec rm -rf '{}' ';'
rm: can't remove '/usr/local/pai/user-extension-secrets/..data': Read-only file system
rm: can't remove '/usr/local/pai/user-extension-secrets/userExtensionSecrets.yaml': Read-only file system
rm: can't remove '/usr/local/pai/user-extension-secrets/..2021_10_31_09_21_19.825915050/userExtensionSecrets.yaml': Read-only file system
+ mv ./__init__.py ./common ./init ./init.d ./package_cache ./plugins ./requirements.txt ./runtime ./runtime.d /usr/local/pai
+ cd /usr/local/pai
+ '[[' true '=' true ]]
+ CHILD_PROCESS=FRAMEWORK_BARRIER
+ echo 'frameworkbarrier start'
frameworkbarrier start
+ '[[' -f /var/run/secrets/kubernetes.io/serviceaccount/token ]]
+ unset KUBE_APISERVER_ADDRESS
+ /usr/local/pai/init.d/frameworkbarrier
+ tee /usr/local/pai/logs/a1333840-bc83-4844-b34d-a514d3ecb2e6/barrier.log
I1031 09:21:22.909826      15 barrier.go:211] Initializing frameworkbarrier
I1031 09:21:22.910057      15 barrier.go:214] With Config: 
kubeApiServerAddress: ""
kubeConfigFilePath: ""
frameworkNamespace: default
frameworkName: c584ad33319d4553e4054ab6777cf38d
barrierCheckIntervalSec: 10
barrierCheckTimeoutSec: 600
W1031 09:21:22.910094      15 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1031 09:21:22.911898      15 barrier.go:227] Running frameworkbarrier
I1031 09:21:22.934270      15 barrier.go:322] BarrierPassed: 1/1 Tasks are ready with not nil PodIP.
I1031 09:21:22.934299      15 barrier.go:267] BarrierSucceeded: All Tasks are ready with not nil PodIP.
I1031 09:21:22.938656      15 barrier.go:350] Succeeded to dump the Framework object to local file: ./framework.json
I1031 09:21:22.938780      15 barrier.go:451] Succeeded to generate the injector script to local file: ./injector.sh
I1031 09:21:22.938962      15 barrier.go:459] ExitCode: 0: Exit with success.
+ CHILD_PROCESS=ERROR_SPEC
+ cp /usr/local/pai-config/runtime-exit-spec.yaml /usr/local/pai/runtime.d
+ CHILD_PROCESS=ENV_GENERATOR
+ python /usr/local/pai/init.d/framework_parser.py genenv framework.json
2021-10-31 09:21:23,007 - INFO - framework_parser.py:196 - loading json from framework.json
2021-10-31 09:21:23,008 - INFO - framework_parser.py:107 - task roles: {'taskrole': {'number': 1, 'ports': {'schedulePortStart': 15000, 'schedulePortEnd': 40000, 'ports': {'ssh': {'count': 1}, 'http': {'count': 1}}}}}
+ CHILD_PROCESS=CONFIG_GENERATOR
+ python /usr/local/pai/init.d/framework_parser.py genconf framework.json
2021-10-31 09:21:23,059 - INFO - framework_parser.py:196 - loading json from framework.json
+ CHILD_PROCESS=PLUGIN_INITIALIZER
+ python /usr/local/pai/init.d/initializer.py /usr/local/pai/runtime.d/job_config.yaml /usr/local/pai/secrets/secrets.yaml /usr/local/pai/user-extension-secrets/userExtensionSecrets.yaml /usr/local/pai/token-secrets/token /usr/local/pai/plugins /usr/local/pai/runtime.d taskrole
2021-10-31 09:21:23,162 - INFO - initializer.py:235 - loading yaml from /usr/local/pai/runtime.d/job_config.yaml
2021-10-31 09:21:23,170 - INFO - initializer.py:173 - Starting to prepare plugin ssh
2021-10-31 09:21:23,266 - INFO - initializer.py:52 - 2021-10-31 09:21:23,266 - INFO - init.py:118 - Ssh runtime plugin perpared
2021-10-31 09:21:23,277 - INFO - initializer.py:173 - Starting to prepare plugin teamwise_storage
2021-10-31 09:21:23,802 - INFO - initializer.py:52 - This plugin is deprecated, will ignore this plugin
+ CHILD_PROCESS=PORT_CONFLICT_CHECKER
+ python /usr/local/pai/init.d/port.py /usr/local/pai/runtime.d/runtime_env.sh
2021-10-31 09:21:23,943 - INFO - port.py:71 - runtime env from /usr/local/pai/runtime.d/runtime_env.sh
+ CHILD_PROCESS=DOCKER_IMAGE_CHECKER
+ python /usr/local/pai/init.d/image_checker.py /usr/local/pai/runtime.d/job_config.yaml /usr/local/pai/secrets/secrets.yaml
2021-10-31 09:21:24,142 - INFO - image_checker.py:266 - get job config from /usr/local/pai/runtime.d/job_config.yaml
2021-10-31 09:21:24,148 - INFO - image_checker.py:276 - Start checking docker image
2021-10-31 09:21:24,150 - DEBUG - connectionpool.py:975 - Starting new HTTPS connection (1): index.docker.io:443
2021-10-31 09:21:25,214 - DEBUG - connectionpool.py:461 - https://index.docker.io:443 "HEAD /v2/ HTTP/1.1" 401 0
2021-10-31 09:21:25,219 - DEBUG - connectionpool.py:975 - Starting new HTTPS connection (1): index.docker.io:443
2021-10-31 09:21:26,286 - DEBUG - connectionpool.py:461 - https://index.docker.io:443 "HEAD /v2/cs231666/docker/manifests/v1.1.9 HTTP/1.1" 401 0
2021-10-31 09:21:26,293 - DEBUG - connectionpool.py:975 - Starting new HTTPS connection (1): auth.docker.io:443
2021-10-31 09:21:27,348 - DEBUG - connectionpool.py:461 - https://auth.docker.io:443 "GET /token?service=registry.docker.io&scope=repository%3Acs231666%2Fdocker%3Apull HTTP/1.1" 200 None
2021-10-31 09:21:27,352 - DEBUG - connectionpool.py:975 - Starting new HTTPS connection (1): index.docker.io:443

index.docker.io returned 401, even though this image does not require authentication. Pulling the same image directly on the host machine succeeded.

I don't know why index.docker.io returned 401; the problem appeared only intermittently. Because no authentication information was filled in when the job was submitted, the initialization process blocked. This may be a bug related to the requests or urllib3 package.

For openpai-runtime, we may need to handle this case in the source code: when index.docker.io returns 401 and the authentication information is empty, return False immediately.
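The suggestion above can be sketched as follows. This is not the code in image_checker.py: image_check_passes and the head callable are hypothetical, and the callable stands in for whatever HTTP client the checker uses. The sketch also passes a timeout, because the log ends right after a new connection is opened, which suggests the request never returned.

```python
from typing import Callable, Optional, Tuple

# Hypothetical transport: head(url, auth, timeout) -> HTTP status code.
HeadFn = Callable[[str, Optional[Tuple[str, str]], float], int]


def image_check_passes(head: HeadFn, manifest_url: str,
                       auth: Optional[Tuple[str, str]] = None,
                       timeout: float = 10.0) -> bool:
    """Return True only when the registry confirms the manifest exists."""
    # The timeout bounds how long a stalled connection can block init.
    status = head(manifest_url, auth, timeout)
    if status == 401 and auth is None:
        # No credentials were supplied, so give up instead of waiting.
        return False
    return status == 200
```

With the requests library, the callable could wrap requests.head(url, auth=auth, timeout=timeout) and return the response's status_code; without a timeout argument, requests waits indefinitely for a response.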

jinja2==2.11.3 conflicts with the newest markupsafe, which causes the tensorboard plugin init to crash.

+ python /usr/local/pai/init.d/initializer.py /usr/local/pai/runtime.d/job_config.yaml /usr/local/pai/secrets/secrets.yaml /usr/local/pai/user-extension-secrets/userExtensionSecrets.yaml /usr/local/pai/token-secrets/token /usr/local/pai/plugins /usr/local/pai/runtime.d taskrole
2022-04-14 11:30:23,269 - INFO - initializer.py:235 - loading yaml from /usr/local/pai/runtime.d/job_config.yaml
2022-04-14 11:30:23,284 - INFO - initializer.py:173 - Starting to prepare plugin ssh
2022-04-14 11:30:23,466 - INFO - initializer.py:52 - 2022-04-14 11:30:23,466 - INFO - init.py:118 - Ssh runtime plugin perpared
2022-04-14 11:30:23,481 - INFO - initializer.py:173 - Starting to prepare plugin teamwise_storage
2022-04-14 11:30:23,933 - INFO - initializer.py:52 - This plugin is deprecated, will ignore this plugin
2022-04-14 11:30:24,071 - INFO - initializer.py:173 - Starting to prepare plugin tensorboard
2022-04-14 11:30:24,129 - INFO - initializer.py:52 - Traceback (most recent call last):
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - File "/usr/local/pai/plugins/tensorboard/init.py", line 23, in <module>
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - from jinja2 import Template
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - File "/usr/local/lib/python3.7/site-packages/jinja2/__init__.py", line 12, in <module>
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - from .environment import Environment
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - File "/usr/local/lib/python3.7/site-packages/jinja2/environment.py", line 25, in <module>
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - from .defaults import BLOCK_END_STRING
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - File "/usr/local/lib/python3.7/site-packages/jinja2/defaults.py", line 3, in <module>
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - from .filters import FILTERS as DEFAULT_FILTERS  # noqa: F401
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - File "/usr/local/lib/python3.7/site-packages/jinja2/filters.py", line 13, in <module>
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - from markupsafe import soft_unicode
2022-04-14 11:30:24,130 - INFO - initializer.py:52 - ImportError: cannot import name 'soft_unicode' from 'markupsafe' (/usr/local/lib/python3.7/site-packages/markupsafe/__init__.py)
2022-04-14 11:30:24,135 - ERROR - initializer.py:56 - failed to run /usr/local/pai/plugins/tensorboard/init.py, error code is 1
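The traceback shows Jinja2 2.11.3 importing soft_unicode from markupsafe. MarkupSafe removed soft_unicode in release 2.1.0, and Jinja2 3.0 and later no longer import it, so the crash appears whenever an older Jinja2 is installed next to a newer MarkupSafe. The check below is a minimal sketch of that rule; parse_version and has_soft_unicode_conflict are hypothetical helpers, not part of openpai-runtime.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.11.3' into (2, 11, 3), keeping at most three numeric parts."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def has_soft_unicode_conflict(jinja2_version: str, markupsafe_version: str) -> bool:
    """True when Jinja2 still imports soft_unicode but MarkupSafe dropped it."""
    jinja2_needs_it = parse_version(jinja2_version) < (3, 0)
    markupsafe_dropped_it = parse_version(markupsafe_version) >= (2, 1)
    return jinja2_needs_it and markupsafe_dropped_it
```

Likely remedies are pinning markupsafe below 2.1 (for example 2.0.1) alongside jinja2==2.11.3 in requirements.txt, or upgrading Jinja2 to 3.x; which one suits the runtime image is for the maintainers to decide.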

openpai/openpai-runtime: barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized

root@pai-worker1:/etc/kubernetes# docker logs b441c50e30fa

+ CHILD_PROCESS=UNKNOWN
+ trap exit_handler EXIT
+ PAI_WORK_DIR=/usr/local/pai
+ PAI_CONFIG_DIR=/usr/local/pai-config
+ PAI_INIT_DIR=/usr/local/pai/init.d
+ PAI_RUNTIME_DIR=/usr/local/pai/runtime.d
+ PAI_LOG_DIR=/usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df
+ PAI_SECRET_DIR=/usr/local/pai/secrets
+ PAI_USER_EXTENSION_SECRET_DIR=/usr/local/pai/user-extension-secrets
+ PAI_TOKEN_SECRET_DIR=/usr/local/pai/token-secrets
+ chmod a+rw /usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df
+ find /usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df -maxdepth 1 -type f '!' -name init.log
+ LOG_FILES=
+ '[[' '!' -z ]]
+ find /usr/local/pai -maxdepth 1 -mindepth 1 '!' -name logs -exec rm -rf '{}' ';'
rm: can't remove '/usr/local/pai/user-extension-secrets/..data': Read-only file system
rm: can't remove '/usr/local/pai/user-extension-secrets/userExtensionSecrets.yaml': Read-only file system
rm: can't remove '/usr/local/pai/user-extension-secrets/..2021_11_19_03_28_31.917251457/userExtensionSecrets.yaml': Read-only file system
+ mv ./__init__.py ./common ./init ./init.d ./package_cache ./plugins ./requirements.txt ./runtime ./runtime.d /usr/local/pai
+ cd /usr/local/pai
+ '[[' true '=' true ]]
+ CHILD_PROCESS=FRAMEWORK_BARRIER
+ echo 'frameworkbarrier start'
frameworkbarrier start
+ '[[' -f /var/run/secrets/kubernetes.io/serviceaccount/token ]]
+ unset KUBE_APISERVER_ADDRESS
+ /usr/local/pai/init.d/frameworkbarrier
+ tee /usr/local/pai/logs/22000be4-a8f3-4e19-965c-61521c5402df/barrier.log
I1119 03:28:37.319208 14 barrier.go:211] Initializing frameworkbarrier
I1119 03:28:37.319449 14 barrier.go:214] With Config:
kubeApiServerAddress: ""
kubeConfigFilePath: ""
frameworkNamespace: default
frameworkName: a98ec660fe68eb3b5fa6afc70c42b464
barrierCheckIntervalSec: 10
barrierCheckTimeoutSec: 600
W1119 03:28:37.319492 14 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I1119 03:28:37.321382 14 barrier.go:227] Running frameworkbarrier
W1119 03:28:37.332512 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:28:47.334425 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:28:57.334495 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:29:07.334570 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:29:17.334535 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:29:27.334474 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:29:37.334299 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:29:47.334596 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:29:57.334489 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:30:07.334696 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized
W1119 03:30:17.334543 14 barrier.go:253] Failed to get Framework object from ApiServer: Unauthorized

I don't know what else I should configure.

Runtime exit handler error

Sometimes the runtime exit handler reports:
Error:2020/07/31 07:49:23 logger.go:49: failed to get truncate exit info, err: failed to truncate the exit info
Error:2020/07/31 07:49:23 logger.go:49: runtime failed to handle exit info fatal: dumping summary info: failed to truncate the exit info

Need to investigate why this happens

Image check doesn't handle 429 response correctly

When many tasks call the Docker registry API concurrently, the registry returns 429 (too many requests).
The runtime should handle this response correctly: do not fail the image check unless we can be sure it is an error.
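One way to approach this is to retry with exponential backoff and treat a persistent 429 as "unknown" rather than as a failed check. The sketch below assumes that policy; head_with_retry, image_check_verdict, the attempt count and the delays are all hypothetical and not taken from image_checker.py.

```python
import time
from typing import Callable


def head_with_retry(head: Callable[[], int], max_attempts: int = 5,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call head() until it returns something other than 429.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    returns the last status seen once max_attempts is exhausted.
    """
    status = 429
    for attempt in range(max_attempts):
        status = head()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return status


def image_check_verdict(status: int) -> str:
    """Map a registry status to a verdict; only 404 proves the image is missing."""
    if status == 200:
        return "ok"
    if status == 404:
        return "missing"
    # Includes 429: throttling says nothing about whether the image exists.
    return "unknown"
```

A caller could let the job continue on "unknown" and fail it only on "missing", which matches the rule of failing the check only when the error is certain.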

[Bug report] Distributed Job init failed.

protocolVersion: 2
name: siaimes
type: job
jobRetryCount: 0
prerequisites:
  - type: dockerimage
    uri: 'siaimes/pytorch1.10.0:v1.0.0'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 6
      memoryMB: 73994
    commands:
      - env
  taskrole_1:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 6
      memoryMB: 73994
    commands:
      - env
defaults:
  virtualCluster: k80
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
        userssh:
          type: custom
          value: ''
    - plugin: teamwise_storage
      parameters:
        storageConfigNames:
          - share
  hivedScheduler:
    taskRoles:
      taskrole:
        skuNum: 1
        skuType: gpu-machine-k80
      taskrole_1:
        skuNum: 1
        skuType: gpu-machine-k80

error log:

.......
+ CHILD_PROCESS=ERROR_SPEC
+ cp /usr/local/pai-config/runtime-exit-spec.yaml /usr/local/pai/runtime.d
+ CHILD_PROCESS=ENV_GENERATOR
+ python /usr/local/pai/init.d/framework_parser.py genenv framework.json
2022-03-05 07:01:14,448 - INFO - framework_parser.py:196 - loading json from framework.json
2022-03-05 07:01:14,448 - INFO - framework_parser.py:107 - task roles: {'taskrole': {'number': 1, 'ports': {'schedulePortStart': 15000, 'schedulePortEnd': 40000, 'ports': {'ssh': {'count': 1}, 'http': {'count': 1}}}}, 'taskrole1': {'number': 1, 'ports': {'schedulePortStart': 15000, 'schedulePortEnd': 40000, 'ports': {'ssh': {'count': 1}, 'http': {'count': 1}}}}}
+ CHILD_PROCESS=CONFIG_GENERATOR
+ python /usr/local/pai/init.d/framework_parser.py genconf framework.json
2022-03-05 07:01:14,511 - INFO - framework_parser.py:196 - loading json from framework.json
+ CHILD_PROCESS=PLUGIN_INITIALIZER
+ python /usr/local/pai/init.d/initializer.py /usr/local/pai/runtime.d/job_config.yaml /usr/local/pai/secrets/secrets.yaml /usr/local/pai/user-extension-secrets/userExtensionSecrets.yaml /usr/local/pai/token-secrets/token /usr/local/pai/plugins /usr/local/pai/runtime.d taskrole1
2022-03-05 07:01:14,595 - INFO - initializer.py:235 - loading yaml from /usr/local/pai/runtime.d/job_config.yaml
Traceback (most recent call last):
  File "/usr/local/pai/init.d/initializer.py", line 268, in <module>
    main()
  File "/usr/local/pai/init.d/initializer.py", line 253, in main
    args.runtime_path, args.task_role)
  File "/usr/local/pai/init.d/initializer.py", line 148, in init_plugins
    plugin_configs = collect_plugin_configs(jobconfig, taskrole)
  File "/usr/local/pai/init.d/initializer.py", line 102, in collect_plugin_configs
    if 'prerequisites' in jobconfig['taskRoles'][taskrole] and 'prerequisites' in jobconfig:
KeyError: 'taskrole1'
+ exit_handler
+ EXIT_CODE=1
+ '[[' 1 -eq 0 ]]
+ echo 'start execute exit handler'
start execute exit handler
+ echo 'child process is PLUGIN_INITIALIZER, exit code is 1'
child process is PLUGIN_INITIALIZER, exit code is 1
+ '[[' PLUGIN_INITIALIZER '=' FRAMEWORK_BARRIER ]]
+ '[[' PLUGIN_INITIALIZER '=' PORT_CONFLICT_CHECKER ]]
+ '[[' PLUGIN_INITIALIZER '=' DOCKER_IMAGE_CHECKER ]]
+ echo 'Unknown exit code, platform error'
Unknown exit code, platform error
+ exit 1

According to the error message, I changed taskrole_1 to taskrole1 and the job started normally.

Here taskrole_1 is the name generated automatically by the front end, while taskrole1 is the name the back end passes to the initializer, so the two are incompatible.
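Until the naming is made consistent, the lookup in collect_plugin_configs could tolerate the mismatch. The sketch below assumes the back end drops underscores from task role names, which is only inferred from this single log (taskrole_1 became taskrole1); resolve_task_role is a hypothetical helper, not existing runtime code.

```python
def resolve_task_role(job_config: dict, reported_name: str) -> str:
    """Find the taskRoles key that corresponds to the name the back end reports.

    Assumption: the only difference between the two names is that
    underscores were dropped somewhere along the way.
    """
    task_roles = job_config.get("taskRoles", {})
    if reported_name in task_roles:
        return reported_name
    wanted = reported_name.replace("_", "")
    matches = [name for name in task_roles if name.replace("_", "") == wanted]
    if len(matches) == 1:
        return matches[0]
    # Zero or several candidates: keep the original failure mode.
    raise KeyError(reported_name)
```

With the job config above, resolve_task_role(jobconfig, "taskrole1") yields "taskrole_1", so jobconfig['taskRoles'][...] no longer raises KeyError.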

[Proposal] openpai-runtime interface

Current situation:

Currently, openpai-runtime is tightly coupled with PAI and Framework Controller.
We have split the code, but some logic is still mixed. To use the runtime, we need to use PAI and Framework Controller.
We need to decouple the runtime from these components to allow an independent release cycle and efficient development.

Current problems:

  • Adding a feature requires changes across many repos.
  • The runtime cannot be used by other projects; third-party users must modify the runtime code to customize it.
  • The runtime can only work with PAI and Framework Controller.

Methods:

Treat all PAI-related logic as runtime plugins. The openpai-runtime repo then keeps only the main logic; PAI-related code is treated as PAI-specific runtime plugins and maintained in the PAI repo.

To implement this, we introduce two concepts: init-plugin and runtime-plugin. An init-plugin is maintained by developers and generates the executable code that runs in the runtime-plugin. End users do not need to know anything about init-plugins.

A runtime-plugin is used by end users, who use it to run commands before or after the actual job commands.

Workflow for openpai-runtime.

  1. Start the init container, read the init-plugin spec, and run the init-plugins sequentially.
  2. Read the runtime-plugin spec and generate the runtime executable file.
  3. Start the user container, run the runtime executable, and start the user commands.
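Step 1 of this workflow could look roughly like the Python sketch below. It assumes each init-plugin is a dict with "name" and "command" keys, and that a non-zero exit code aborts the remaining plugins; the abort policy and the run_init_plugins helper are assumptions, not something the proposal specifies.

```python
import shlex
import subprocess
from typing import List, Mapping


def run_init_plugins(plugins: List[Mapping[str, str]], workdir: str = ".") -> List[str]:
    """Run each plugin command in order and return the names that finished.

    Assumption: the first plugin that exits non-zero stops the sequence.
    """
    finished = []
    for plugin in plugins:
        # shlex.split turns the command string into an argv list (no shell).
        result = subprocess.run(shlex.split(plugin["command"]), cwd=workdir)
        if result.returncode != 0:
            raise RuntimeError(
                f"init plugin {plugin['name']} exited with {result.returncode}")
        finished.append(plugin["name"])
    return finished
```

Running the commands without a shell keeps quoting predictable; a plugin that needs shell features would have to invoke the shell explicitly in its command.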

Implementation

Init plugin config spec

Init-plugins run in the init container, with the init.d folder as their working directory. These plugins perform preparation steps such as rendering user commands. Here is a sample spec for init-plugins; the plugins run sequentially:

initPlugins:
- name: frameworkBarrier
  command: 'frameworkBarrier framework.json'
- name: frameworkParser
  command: 'python frameworkParser.py framework.json'
- name: imageChecker
  command: 'python imageChecker.py'
- name: portChecker
  command: 'python portChecker.py portListFile'
- name: userCommandRender
  command: 'python command_render.py'

This spec can be passed to the runtime through the INIT_CONFIG env, or as a file named init_plugins.yaml under PAI_CONFIG_DIR. We first try to parse the INIT_CONFIG env. If this env is empty, we try to read init_plugins.yaml. If init_plugins.yaml is absent, the default config init_plugins_default.yaml is used.
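The lookup order can be sketched in Python as follows. The proposal does not say where init_plugins_default.yaml lives, so this sketch assumes it sits in the same directory as init_plugins.yaml; load_init_config is a hypothetical helper that returns the raw YAML text and leaves parsing to the caller.

```python
import os
from typing import Mapping, Optional, Tuple


def load_init_config(config_dir: str,
                     env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (source, yaml_text) following the INIT_CONFIG lookup order."""
    env = os.environ if env is None else env
    inline = env.get("INIT_CONFIG", "").strip()
    if inline:
        return "env", inline
    # Assumption: the default file is stored next to init_plugins.yaml.
    for filename in ("init_plugins.yaml", "init_plugins_default.yaml"):
        path = os.path.join(config_dir, filename)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                return filename, f.read()
    raise FileNotFoundError("no init-plugin config found")
```

Returning the source alongside the text makes it easy to log which of the three locations was used.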

Assumption about init-plugin

We believe the init-plugin config rarely changes, and each cluster has only one configured init-plugin config. So we prefer to put the init-plugin config into the Docker image or a k8s ConfigMap.

Runtime plugin & secret & exitSpec & env

The runtime spec needs to be passed through the RUNTIME_CONFIG env, or as a file at PAI_CONFIG_DIR/runtime_plugin.yaml.

The spec for runtime plugin

commands: ["ls  -al"]
runtimePlugin:
- plugin: ssh
  parameters:
    jobssh: true
- plugin: teamwise_storage
  parameters:
    storageConfigNames:
      - confignfs

The secret file should be stored at ${PAI_CONFIG_DIR}/secret.yaml and the exit spec at ${PAI_CONFIG_DIR}/runtime-exit-spec.yaml. For environment variables that should be passed to the user container, put them into ${PAI_RUNTIME_DIR}/env.

Development & Usage

Customize runtime

In the init container, we try to run the scripts under the /usr/local/pai/init.d folder. To customize your init container, put your scripts under /usr/local/pai/init.d.

The way to build the init container:

FROM openpairuntime/openpai-runtime:latest
COPY src/* /usr/local/pai/init.d/

If an init-plugin writes to a file, it is the developer's responsibility to make sure the file is at the correct path and does not overwrite anything. It is also the developer's responsibility to maintain the customized config file and make sure it works.

Use openPAI runtime

apiVersion: v1
kind: Pod
metadata:
  name: job
  namespace: default
spec:
  initContainers:
  - name: init
    image: openpairuntime/openpai-runtime:latest
    env:
    - name: RUNTIME_CONFIG
      # Literal block scalar (|-) keeps newlines so the embedded YAML stays parseable.
      value: |-
        commands: ["ls  -al",  "echo hi"]
        runtimePlugin:
        - plugin: ssh
          parameters:
            jobssh: true
        - plugin: teamwise_storage
          parameters:
            storageConfigNames:
            - confignfs
    - name: INIT_CONFIG
      value: |-
        initPlugins:
        - name: frameworkBarrier
          command: pai/frameworkBarrier framework.json
        - name: frameworkParser
          command: frameworkParser.py framework.json
    volumeMounts:
    - name: pai-vol
      mountPath: '/usr/local/pai'
    - name: 'job-secrets'
      mountPath: '/usr/local/pai/config/secrets.yaml'
    - name: 'job-exit-spec'
      mountPath: '/usr/local/pai/config/runtime-exit-spec.yaml'
  containers:
  - name: app
    image: ubuntu:latest
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
    command: ['/usr/local/pai/runtime']
    volumeMounts:
    - name: pai-vol
      mountPath: '/usr/local/pai'
    - name: 'job-secrets'
      mountPath: '/usr/local/pai/config/secrets.yaml'
    - name: 'job-exit-spec'
      mountPath: '/usr/local/pai/config/runtime-exit-spec.yaml'
  volumes:
  - name: pai-vol
    emptyDir: {}
  - name: 'job-secrets'
    secret:
      secretName: 'job-secrets'
  - name: 'job-exit-spec'
    configMap:
      name: runtime-exit-spec-configuration

Result

After this change, the runtime repo will keep only the common plugins: imageChecker, userCommandRender, portConflictChecker, and envGenerator. Each plugin will have a clear interface so that developers can reuse it.
PAI-related plugins, such as frameworkBarrier and frameworkParser, will move to the PAI repo.

Interfaces:

ENV: INIT_CONFIG, RUNTIME_CONFIG
File: ${PAI_CONFIG_DIR}/init_plugins.yaml, ${PAI_CONFIG_DIR}/secret.yaml, ${PAI_CONFIG_DIR}/runtime-exit-spec.yaml, ${PAI_CONFIG_DIR}/runtime_plugin.yaml

Pro:
New runtime requirements can be implemented as either an init-plugin or a runtime-plugin. There is no need to change the runtime code if the feature is PAI specific.
The runtime can be reused by other projects.

Con:

  • This is a new interface, so there is much work to do. Complex data and config are passed through env, which is not friendly for end users.
  • With the new config, the job spec may be larger than before. (Another plugin could provide the task spec, for example by calling an API to get the task spec and writing it to some path.)

TBD

  • How to customize the image build. Allowing users to customize the init container requires copying files into the Docker image, so we should provide a pattern for building a new runtime.

Enhance git plugin

When running a large-scale job, cloning code can fail with: requested URL returned error: 429. The git plugin needs to handle this case and let the task always retry.
