
jupyter-server / gateway_provisioners

Provides remote kernel provisioners for various resource-managed clusters.

Home Page: https://gateway-provisioners.readthedocs.io

License: Other

Makefile 2.89% Python 79.21% Shell 5.36% R 2.55% Scala 4.96% JavaScript 0.64% Jinja 1.62% Dockerfile 2.75%
docker docker-swarm hadoop-yarn jupyter jupyter-client jupyter-enterprise-gateway jupyter-kernel-gateway jupyter-kernels jupyter-server kubernetes

gateway_provisioners's People

Contributors

betterlevi, blink1073, bsdz, dependabot[bot], echarles, elibixby, kevin-bates, kiersten-stokes, mmmommm, pre-commit-ci[bot]


gateway_provisioners's Issues

TimeoutError is not handled properly

Description

TimeoutError is not handled properly, and 'Invalid response 504' is displayed.

Normally, I would expect error messages to be displayed in the Jupyter UI when an error occurs, but status code 504 does not appear to be handled properly, and the message 'Invalid response 504' is displayed instead. When I changed the exception to RuntimeError, the error message was displayed correctly.

I haven't checked other places, so this may be happening elsewhere as well.

Reproduce

To reproduce the issue, make the launch of the kernel's Kubernetes pod fail so that handle_launch_timeout is called from container.py.

Expected behavior

The appropriate error message is displayed.
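A minimal sketch of the workaround described above, with illustrative names rather than the actual gateway_provisioners API: re-raising the launch timeout as a RuntimeError lets the message reach the client instead of an opaque 'Invalid response 504'.

```python
import asyncio


async def handle_launch_timeout(kernel_id: str, timeout: int) -> None:
    reason = (
        f"Kernel pod for kernel '{kernel_id}' failed to start "
        f"within {timeout} seconds."
    )
    # Raising TimeoutError here surfaced as "Invalid response 504";
    # RuntimeError carries the message through to the client.
    raise RuntimeError(reason)


async def start_and_report(kernel_id: str) -> str:
    try:
        await handle_launch_timeout(kernel_id, 60)
    except RuntimeError as err:
        return str(err)  # the message now reaches the caller/UI
    return ""
```

Whether RuntimeError is the right exception type (versus mapping TimeoutError to a proper HTTP response upstream) is a design decision for the maintainers.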

KubernetesProvisioner.pre_launch throws KeyError when launching a kernel

Search similar

I've searched existing issues and did not find a similar one.

Description

I'm using nbconvert to execute an .ipynb file as a notebook with the k8s_python kernel. When I run the command jupyter nbconvert --to notebook --execute Untitled.ipynb, it throws this error:

[NbConvertApp] Instantiating kernel 'Kubernetes Python' with kernel provisioner: kubernetes-provisioner
[NbConvertApp] ResponseManager is bound to port 8877 for remote kernel connection information.
[NbConvertApp] ERROR | 'env'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/jupyter_client/manager.py", line 87, in wrapper
    out = await method(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/jupyter_client/manager.py", line 436, in _async_start_kernel
    kernel_cmd, kw = await self._async_pre_start_kernel(**kw)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/jupyter_client/manager.py", line 401, in _async_pre_start_kernel
    kw = await self.provisioner.pre_launch(**kw)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/gateway_provisioners/k8s.py", line 88, in pre_launch
    kwargs["env"][key] = os.environ[key]
    ~~~~~~^^^^^^^
KeyError: 'env'

Reproduce

  1. execute command jupyter nbconvert --to notebook --execute Untitled.ipynb
  2. Untitled.ipynb content is:
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb001297-d64e-474a-8665-02c3f8388a6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(1)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Kubernetes Python",
   "language": "python",
   "name": "k8s_python"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

Expected behavior

I checked this method and found that one line was removed in commit 4c82a80, which may be what causes this error; I'm not sure why it was deleted. (A screenshot of the commit diff was attached.)
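If the failure is simply that the 'env' entry is absent from the keyword arguments nbconvert passes in, a defensive guard along these lines would avoid the KeyError. This is a hypothetical sketch, not the actual pre_launch code in gateway_provisioners/k8s.py; it only illustrates the guard.

```python
import os


def copy_env_vars(kwargs: dict, keys: list) -> dict:
    # setdefault ensures the 'env' dict exists even when the caller
    # (e.g. nbconvert) did not supply one, avoiding KeyError: 'env'
    env = kwargs.setdefault("env", {})
    for key in keys:
        if key in os.environ:
            env[key] = os.environ[key]
    return kwargs
```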

CLI - add option to allow overwrite of existing kernel specs

Currently, the various kernel spec installers (jupyter-xxxx-spec install) unconditionally overwrite existing kernel specifications of the same name. This is how KernelSpecManager.install_kernel_spec() behaves. We should make the default behavior NOT to overwrite existing kernel specs and, prior to calling install_kernel_spec(), use KernelSpecManager.get_kernel_spec() to see if the kernel spec already exists.

Note that KernelSpecManager.get_kernel_spec() will log a warning message if the spec doesn't exist prior to raising NoSuchKernel - which we'd have no control over - so it might be best to gather up the existing kernelspecs and look for those with the same name and resource directory, although these could produce warnings for other, unrelated kernelspecs (e.g., if their provisioner is not found in the existing environment). So perhaps we should just bypass KernelSpecManager altogether and see if a kernel.json file exists in the candidate directory.
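The bypass suggested above could look roughly like this. This is a sketch: the function name and the assumption that specs live in a per-kernel-name subdirectory containing kernel.json are illustrative, not the installer's actual layout handling.

```python
import os


def kernel_spec_exists(kernel_spec_root: str, kernel_name: str) -> bool:
    """Return True if a kernel.json already exists for kernel_name.

    Checks the candidate directory directly, avoiding KernelSpecManager
    and the warnings it logs for unrelated or broken kernelspecs.
    """
    candidate = os.path.join(kernel_spec_root, kernel_name.lower(), "kernel.json")
    return os.path.isfile(candidate)
```

The installer would call this before install_kernel_spec() and refuse to proceed unless an explicit --replace (or similar) flag is given.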

Add helm chart for host applications

Since Gateway Provisioners is not a stand-alone application, it would be helpful to create a set of helm chart files (those defined in EG would be sufficient) in which the application information is also a variable. These would give users the ability to build an "application image" (using Kernel Gateway or Jupyter Server; note that Enterprise Gateway is not compatible at this time) that can be deployed into a Kubernetes cluster with the appropriate service accounts and relationships configured. These charts should be nearly identical to, but strict subsets of, those in EG.

Add tests

The current set of tests came from the kernel provider POC and has since been removed - so this becomes an "Add tests" item.

This item will comprise a number of pull requests and sub-items to resolve. At some point, we can decide to close this issue and open new issues for each item; otherwise, the pull requests should be named relative to the item they address.

  • Add an initial framework. This will include the mock infrastructure (or a majority thereof) (#46)
  • Add an initial test to demonstrate framework usage (#46)
  • Add tests for DistributedProvisioner
  • Add tests for CLI application
  • Add test for Kubernetes namespace management
  • Add test for YARN scheduler interaction

Fix docs build issue

It looks like the docs builds have been breaking due to an issue in the sequence diagram located in docs/source/contributors/sequence-diagrams.md. Here's a capture of the build failure followed by the error text:

dot code 'seqdiag {\n edge_length = 180;\n span_height = 15;\n WebApplication [label = "Web Application"];\n HostApplication [label = "Host Application"];\n KernelManager [label = "Kernel Manager"];\n Provisioner;\n Kernel;\n ResourceManager [label = "Resource Manager"];\n\n === Kernel Launch ===\n\n WebApplication -> HostApplication [label = "https POST api/kernels "];\n HostApplication -> KernelManager [label = "start_kernel() "];\n KernelManager -> Provisioner [label = "launch_process() "];\n\n Provisioner -> Kernel [label = "launch kernel"];\n Provisioner -> ResourceManager [label = "confirm startup"];\n Kernel --> Provisioner [label = "connection info"];\n ResourceManager --> Provisioner [label = "state & host info"];\n Provisioner --> KernelManager [label = "complete connection info"];\n KernelManager -> Kernel [label = "TCP socket requests"];\n Kernel --> KernelManager [label = "TCP socket handshakes"];\n KernelManager --> HostApplication [label = "kernel-id"];\n HostApplication --> WebApplication [label = "api/kernels response"];\n\n === Websocket Negotiation ===\n\n WebApplication -> HostApplication [label = "ws GET api/kernels"];\n HostApplication -> Kernel [label = "kernel_info_request message"];\n Kernel --> HostApplication [label = "kernel_info_reply message"];\n HostApplication --> WebApplication [label = "websocket upgrade response"];\n}': 'ImageDraw' object has no attribute 'textsize'

We should probably consider moving to the built-in mermaid support.
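If the docs move to the built-in mermaid support, the kernel-launch portion of the diagram might translate roughly as follows. This is a sketch derived from the seqdiag source quoted in the error text above; it has not been verified against the docs build.

```mermaid
sequenceDiagram
    participant WebApplication as Web Application
    participant HostApplication as Host Application
    participant KernelManager as Kernel Manager
    participant Provisioner
    participant Kernel
    participant ResourceManager as Resource Manager

    WebApplication->>HostApplication: https POST api/kernels
    HostApplication->>KernelManager: start_kernel()
    KernelManager->>Provisioner: launch_process()
    Provisioner->>Kernel: launch kernel
    Provisioner->>ResourceManager: confirm startup
    Kernel-->>Provisioner: connection info
    ResourceManager-->>Provisioner: state & host info
    Provisioner-->>KernelManager: complete connection info
    KernelManager->>Kernel: TCP socket requests
    Kernel-->>KernelManager: TCP socket handshakes
    KernelManager-->>HostApplication: kernel-id
    HostApplication-->>WebApplication: api/kernels response
```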

Enhance k8s container status

This is just my opinion based on my experience. Thank you for the wonderful product :)

Problem

The current code that monitors the containers only checks the pod's phase, resulting in error messages that are not user-friendly.

I believe that by having get_container_status return information other than pod_status, we can display more appropriate errors to the users.

Proposed Solution

This is quite simplified, but here's the idea. I'm returning stringified JSON from the overridden method, but there might be a better way.

k8s.py

import datetime
import json
from typing import Optional

from kubernetes import client
from overrides import overrides

# json.dumps cannot encode datetime values, so define a custom encoder
class DatetimeJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        return super().default(obj)  # defer to the base class for other types

@overrides
def get_container_status(self, iteration: Optional[str]) -> str:
    # Locates the kernel pod using the kernel_id selector.  Note that we also include 'component=kernel'
    # in the selector so that executor pods (when Spark is in use) are not considered.
    # If the phase indicates Running, the pod's IP is used for the assigned_ip.
    kernel_label_selector = f"kernel_id={self.kernel_id},component=kernel"
    ret = client.CoreV1Api().list_namespaced_pod(
        namespace=self.kernel_namespace, label_selector=kernel_label_selector
    )
    if ret and ret.items:
        # if ret.items is not empty, return the stringified JSON of the pod data
        pod_dict = ret.items[0].to_dict()
        return json.dumps(pod_dict, cls=DatetimeJSONEncoder)
    else:
        self.log.warning(f"kernel server pod not found in namespace '{self.kernel_namespace}'")
        return ""
Additional context

This might be specific to my environment, but by waiting while the pod is in the ContainerCreating state (or while no error has occurred and ContainersReady is false), kernels started working properly even without a kernel-image-puller.

Here's a much-simplified example:

@overrides
async def confirm_remote_startup(self):
    # ... existing retry loop elided ...
    pod_info = self.get_container_status(str(i))
    # if pod_info is an empty string or None, the container was not found
    if pod_info:
        pod_info_json = json.loads(pod_info)
        status = pod_info_json["status"]
        pod_phase = status["phase"].lower()
        if pod_phase == "running":
            # ... existing "running" handling elided ...
            pass
        elif "conditions" in status:
            for condition in status["conditions"]:
                # check whether the container is still being created
                if "containerStatuses" in status and (
                    status["containerStatuses"][0]["state"]["waiting"]["reason"]
                    == "ContainerCreating"
                ):
                    self.log.info("Container is creating ...")
                    continue
                if (
                    condition["type"] == "ContainersReady"
                    and condition["status"] != "True"
                ):
                    self.log.warning("Containers are not ready; waiting 1 second.")
                    await asyncio.sleep(1)
                    continue
Cleanup Other section

Per #43 this issue tracks the work necessary to clean up the Other section of the docs. Per this comment, the troubleshooting guide contains stale material we don't need to document, and the resources section still has some EG content.

[PROPOSAL] Rename repository to gateway_provisioners

As some of you know, I'm not happy with the name remote_provisioners for the following reason: It's way too generic. Other folks may want to create their own sets of provisioners that happen to work remotely and I feel this repo will be the target of such confusion. I'd rather not "steal the name", nor do I think any repo should be named remote_provisioners for the same reason.

I believe the name gateway_provisioners includes both history and relevance. While it's true that you do not need a gateway server to host remote provisioners, it's very likely the case that you will. This is because these provisioners essentially require the invoker to be on the same network as where the kernel will run. As a result, many folks will need to "hop through" a gateway server that has been provisioned relative to the target network. Yes, applications like JupyterHub that spawn notebook servers in Kubernetes will not require a gateway server, but many others will want to deploy a gateway server in the cloud, in which case Gateway Provisioners would be handy.

By changing to Gateway Provisioners, we also imply a well-known family of provisioners, all of which happen to be remote (for now), but perhaps not always.

How to deal with toree-launcher jar file distribution

The scala-based kernels use Apache Toree as their kernel and we have a toree-launcher, analogous to the other Python and R launchers, that services the Toree kernel. Because the launcher is essentially a jar file, we need to provide the ability to distribute it and reference/copy it from/to the kernel.json and/or bin/run.sh files when building scala-based kernelspecs.

Fix YARN test

Looks like the YARN tests are timing out because the Application ID is not getting conveyed. Since this is a mocked environment, something has probably introduced a side effect, and it should be relatively straightforward to locate the issue.

Here's a build failure log: https://github.com/jupyter-server/gateway_provisioners/actions/runs/6187008029/job/16795884972

followed by some relevant text:

------------------------------ Captured log call -------------------------------
INFO     traitlets:yarn.py:293 YarnProvisioner: kernel launched. YARN RM: my-yarn-cluster.acme.com:7777, pid: 42, Kernel ID: 64bea58e-8b2d-4bd7-a916-17d52f02a35f, cmd: '['--public-key:MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCu87zbdvcyLCV7Ownj5nMbwDYpIZtxOoYONKIO7A3ulZosjRmRqzDPUivKBnK7fp5DKOEpBZlnyxc89Xglg/Zbneu4QY8AQ5oZntFTegUFbF1zk/KzQUSDFRHiPqxKs1C7WlJU4wOSbnpvZOmUuOXJZKULsFGrzLgXMH09RNS+3QIDAQAB', '--response-address:10.213.5.248:8880', '--port-range:0..0', '--kernel-id:64bea58e-8b2d-4bd7-a916-17d52f02a35f']'
ERROR    traitlets:remote_provisioner.py:339 KernelID: '64bea58e-8b2d-4bd7-a916-17d52f02a35f' launch timeout due to: Application ID is None. Failed to submit a new application to YARN within 30 seconds.  Check server log for more information.

The link for the GNU make page now produces a 404

Description

Looks like the link referenced in the docs for GNU make has moved or been removed.

Reproduce

The check links CI job produces this message:

FAILED docs/source/contributors/devinstall.md::/home/runner/work/gateway_provisioners/gateway_provisioners/docs/source/contributors/devinstall.md <a href=https://www.gnu.org/software/make/>
1 failed, 4 deselected, 2 warnings in 0.42s

Expected behavior

The doc build should succeed and the link to GNU make should be valid.

Fix sharing of server_listener code between Python and R launchers

One of the differences introduced in this repo from EG is that the code that was duplicated between the Python and R launchers is now shared. As a result, the assembly of kernel specifications needs to take this into account within the tooling and currently does not.

Why is this named `gateway` provisioners?

Why is this named gateway provisioners and not simply kernel provisioners? The intent is that it can be used in any jupyter-client solution.

Similarly, reading the docs, I see mentions of Host, Server, etc., which confuse me at some points. A nomenclature introduction would be great; I can try to open a PR if there is interest.

cc/ @kevin-bates

Cleanup Operators Guide

Per #43 this issue tracks the work necessary to clean up the Operators Guide. Per this comment, a majority of the effort will be in the Deploying Gateway Provisioners, but the Configuring Gateway Provisioners topic could also use work.

Add documentation

This will likely consist of bringing over applicable sections of the Enterprise Gateway docs. We should be able to leverage much of those docs and apply general substitutions like ProcessProxy -> Provisioner, EG_ to GP_, etc. At a glance, it looks like most of the Developers and Contributors sections would apply.

Make `_determine_next_host` an async function

Problem

We are creating a custom kernel provisioner, and we have an API for finding a host based on certain parameters, but this API takes up to 20 s to return the host. This wait blocks the server and makes Lab unusable during those 20 seconds. We tried making the _determine_next_host function async ourselves by running the host-fetching API in asyncio's run_in_executor and awaiting the _determine_next_host call in the launch_kernel function.

A simple reproducer for this is adding an await asyncio.sleep(10) in the DistributedProvisioner._determine_next_host function. The result is that the kernel never reaches an alive state.

Proposed Solution

We would like a way to asynchronously determine the next host.
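The workaround described above can be sketched as follows: run the blocking host-selection call in the default executor so the event loop stays responsive while waiting. The function names are illustrative stand-ins, not the actual provisioner API.

```python
import asyncio
import time


def fetch_host_from_api() -> str:
    # Stand-in for the slow (~20 s) host-selection API call.
    time.sleep(0.1)
    return "worker-node-1"


async def determine_next_host() -> str:
    loop = asyncio.get_running_loop()
    # The blocking call runs in a worker thread; the server keeps
    # handling other requests in the meantime.
    return await loop.run_in_executor(None, fetch_host_from_api)
```

Making _determine_next_host itself a coroutine (and awaiting it in launch_kernel) would let custom provisioners plug in this pattern directly.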

Additional context

I'm using gateway_provisioners v0.2.0

Port Enterprise Gateway's CustomResourceProcessProxy to CustomResourceProvisioner, etc.

Enterprise Gateway's CustomResourceProcessProxy (and its SparkOperatorProcessProxy subclass) need to be ported to this repository as kernel provisioners.

Since process proxies and kernel provisioners are extremely similar, this is primarily an exercise in massaging the names and configuration references to be in line with those used in this repository. Also, since these will be subclasses of KubernetesProvisioner, we should be able to extend the CLI tooling to enable kernelspec creation, etc.

More flexible kubernetes cluster auth

Hey thanks for the great package.

It would be nice if there were more flexible options for authenticating to the k8s master for the KubernetesProvisioner.

For example, in my deployment I have developers log in to JupyterHub with OAuth, and then I use KubeSpawner.auth_state_hook to propagate their access token to their JupyterHub server, where I use it to authenticate to the GKE master (since GKE lets you manage RBAC with Google IAM).

You could borrow from a bunch of other tools (like dask_kubernetes) and use ~/.kube/config profiles to authenticate (with an env var naming the profile).
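One possible shape for the selection logic, with a hypothetical GP_KUBE_CONTEXT variable name: decide between a named kubeconfig context and in-cluster auth, then hand the result to kubernetes.config.load_kube_config(context=...) or load_incluster_config() respectively. Only the decision step is shown here; the actual wiring to the kubernetes client is left out.

```python
def select_kube_auth(environ: dict):
    """Return ("kubeconfig", <context>) when GP_KUBE_CONTEXT names a
    ~/.kube/config context, else ("in-cluster", None) for the current
    service-account-based behavior."""
    context = environ.get("GP_KUBE_CONTEXT")
    if context:
        return ("kubeconfig", context)
    return ("in-cluster", None)
```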

I may work on this if I get some time, and you're interested.

Cleanup Users Guide

Per #43 this issue tracks the work necessary to clean up the Users Guide. Per this comment, both the Users index page and its content require changes.

Custom Kernel JSON Configuration Variables

Hi, I'm trying to create a custom kernel spec JSON file, and I would like the configuration variables to be available inside a custom gateway_provisioner. I'm looking at this line of documentation under System Architecture:

Each kernel.json’s kernel_provisioner stanza can specify an optional config stanza that is converted into a dictionary of name/value pairs and passed as an argument to each kernel provisioner’s constructor relative to the provisioner identified by the provisioner_name entry.

but when I put variables under the config stanza and inspect the arguments passed into the kernel provisioner's constructor, I don't see any of my configuration variables. Do you know where I might be able to find the configuration variables?

kernel.json

{
  "argv": [
  ],
  "env": {},
  "display_name": "Python Custom",
  "language": "python",
  "interrupt_mode": "signal",
  "metadata": {
    "debugger": true,
    "kernel_provisioner": {
      "provisioner_name": "custom-provisioner",
      "config": {
        "launch_timeout": 30,
        "test": "test",
        "test2": "test2"
      }
    }
  }
}
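Under the assumption (based on jupyter_client's provisioner factory) that the config stanza is passed as keyword arguments to the provisioner's traitlets-based constructor, the values typically surface only through declared traits, not in a subclass's **kwargs, which may be why inspecting the constructor arguments shows nothing. The stand-in below, with hypothetical class names and no traitlets dependency, illustrates that consumption pattern.

```python
class ProvisionerBase:
    # emulates the set of declared traits consumed from the config stanza
    declared = {"launch_timeout"}

    def __init__(self, **kwargs):
        for name in list(kwargs):
            if name in self.declared:
                # a declared name becomes an attribute, like a trait would
                setattr(self, name, kwargs.pop(name))
        self.unconsumed = kwargs  # keys no declared "trait" picked up


class CustomProvisioner(ProvisionerBase):
    # declaring the custom config keys makes them visible as attributes
    declared = ProvisionerBase.declared | {"test", "test2"}
```

If this holds for the real classes, declaring traits named test and test2 on the custom provisioner should make the kernel.json config values available as attributes.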
