jupyter-server / gateway_provisioners
Provides remote kernel provisioners for various resource-managed clusters.
Home Page: https://gateway-provisioners.readthedocs.io
License: Other
It would be great to have kernelspec examples that work out-of-the-box on a local machine.
I am especially interested in the K8s deployment.
TimeoutError is not handled properly and displays "Invalid response: 504".
Normally, I would expect error messages to be displayed in the Jupyter UI when an error occurs, but the 504 status code is not handled properly, and only the message "invalid response 504" is shown. When I changed the raised exception to RuntimeError, the error message was displayed correctly.
I haven't checked other places, so it might be happening elsewhere as well.
To reproduce the issue, make the launch of the kernel's k8s pod fail so that handle_launch_timeout in container.py is invoked.
Expected behavior: the appropriate error message is displayed.
I've searched existing issues and haven't found a similar one.
I'm using nbconvert to execute an .ipynb file to a notebook. The kernel is k8s_python. When I run the command jupyter nbconvert --to notebook --execute Untitled.ipynb, it throws this error:
[NbConvertApp] Instantiating kernel 'Kubernetes Python' with kernel provisioner: kubernetes-provisioner
[NbConvertApp] ResponseManager is bound to port 8877 for remote kernel connection information.
[NbConvertApp] ERROR | 'env'
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/jupyter_client/manager.py", line 87, in wrapper
out = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/jupyter_client/manager.py", line 436, in _async_start_kernel
kernel_cmd, kw = await self._async_pre_start_kernel(**kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/jupyter_client/manager.py", line 401, in _async_pre_start_kernel
kw = await self.provisioner.pre_launch(**kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/gateway_provisioners/k8s.py", line 88, in pre_launch
kwargs["env"][key] = os.environ[key]
~~~~~~^^^^^^^
KeyError: 'env'
jupyter nbconvert --to notebook --execute Untitled.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "cb001297-d64e-474a-8665-02c3f8388a6c",
"metadata": {},
"outputs": [],
"source": [
"print(1)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Kubernetes Python",
"language": "python",
"name": "k8s_python"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
I checked this method and found that one line was removed in commit 4c82a80, which may be what causes this error; I'm not sure why it was deleted.
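For what it's worth, a defensive guard like the following would avoid the KeyError regardless of whether the caller supplied an env entry. The helper name is hypothetical, not the actual pre_launch code:

```python
import os


def merge_inherited_env(kwargs: dict, keys: list) -> dict:
    # Ensure the "env" entry exists before copying selected process
    # environment variables into it; setdefault avoids the
    # KeyError: 'env' seen when nbconvert passes no env.
    env = kwargs.setdefault("env", {})
    for key in keys:
        if key in os.environ:
            env[key] = os.environ[key]
    return kwargs
```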
Currently, the various kernel spec installers (jupyter-xxxx-spec install) unconditionally overwrite existing kernel specifications of the same name. This is how KernelSpecManager.install_kernel_spec() behaves. We should make the default behavior NOT overwrite existing kernel specs and, prior to calling install_kernel_spec(), use KernelSpecManager.get_kernel_spec() to see if the kernel spec already exists.
Note that KernelSpecManager.get_kernel_spec() will log a warning message if the spec doesn't exist prior to raising NoSuchKernel, which we'd have no control over. So it might be best to gather the existing kernelspecs and look for those with the same name and resource directory, although this could produce warnings for other, unrelated kernelspecs (e.g., if their provisioner is not found in the current environment). Perhaps we should just bypass KernelSpecManager altogether and check whether a kernel.json file exists in the candidate directory.
Since Gateway Provisioners is not a stand-alone application, it would be helpful to create a set of Helm chart files (those defined in EG would be sufficient) in which the application information is also a variable. These would give users the ability to build an "application image" (using Kernel Gateway or Jupyter Server; note that Enterprise Gateway is not compatible at this time) that can be deployed into a Kubernetes cluster with the appropriate service accounts and relationships configured. These charts should be nearly identical to, but strict subsets of, those in EG.
The current set of tests is from the kernel-provider POC (and those have since been removed), so this becomes an "Add tests" item.
This item will be comprised of a number of pull requests and items to resolve. At some point, we can decide to close this issue and open new issues for each item, otherwise, the pull requests should be named relative to the item which they address.
It looks like the docs builds have been breaking due to an issue in the sequence diagram located in docs/source/contributors/sequence-diagrams.md. Here's a capture of the build failure, followed by the error text:
dot code 'seqdiag {\n edge_length = 180;\n span_height = 15;\n WebApplication [label = "Web Application"];\n HostApplication [label = "Host Application"];\n KernelManager [label = "Kernel Manager"];\n Provisioner;\n Kernel;\n ResourceManager [label = "Resource Manager"];\n\n === Kernel Launch ===\n\n WebApplication -> HostApplication [label = "https POST api/kernels "];\n HostApplication -> KernelManager [label = "start_kernel() "];\n KernelManager -> Provisioner [label = "launch_process() "];\n\n Provisioner -> Kernel [label = "launch kernel"];\n Provisioner -> ResourceManager [label = "confirm startup"];\n Kernel --> Provisioner [label = "connection info"];\n ResourceManager --> Provisioner [label = "state & host info"];\n Provisioner --> KernelManager [label = "complete connection info"];\n KernelManager -> Kernel [label = "TCP socket requests"];\n Kernel --> KernelManager [label = "TCP socket handshakes"];\n KernelManager --> HostApplication [label = "kernel-id"];\n HostApplication --> WebApplication [label = "api/kernels response"];\n\n === Websocket Negotiation ===\n\n WebApplication -> HostApplication [label = "ws GET api/kernels"];\n HostApplication -> Kernel [label = "kernel_info_request message"];\n Kernel --> HostApplication [label = "kernel_info_reply message"];\n HostApplication --> WebApplication [label = "websocket upgrade response"];\n}': 'ImageDraw' object has no attribute 'textsize'
We should probably consider moving to the built-in Mermaid support.
This is just my opinion based on my experience. Thank you for the wonderful product :)
The current code monitoring the containers only checks the container's status phase, resulting in error messages that are not user-friendly.
I believe that by having get_container_status return information other than pod_status, we can display more appropriate errors to users.
This is quite simplified, but here's the idea. I'm overriding the method to return stringified JSON, but there might be a better way.
k8s.py

import datetime
import json
from typing import Optional

from kubernetes import client
from overrides import overrides


# json.dumps cannot encode datetime values, so define a custom encoder
class DatetimeJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        return super().default(obj)


@overrides
def get_container_status(self, iteration: Optional[str]) -> str:
    # Locates the kernel pod using the kernel_id selector. Note that we also
    # include 'component=kernel' in the selector so that executor pods (when
    # Spark is in use) are not considered.
    # If the phase indicates Running, the pod's IP is used for the assigned_ip.
    kernel_label_selector = f"kernel_id={self.kernel_id},component=kernel"
    ret = client.CoreV1Api().list_namespaced_pod(
        namespace=self.kernel_namespace, label_selector=kernel_label_selector
    )
    if ret and ret.items:
        # ret.items is not empty: return the stringified JSON of the pod data
        pod_dict = ret.items[0].to_dict()
        return json.dumps(pod_dict, cls=DatetimeJSONEncoder)

    self.log.warning(f"kernel server pod not found in namespace '{self.kernel_namespace}'")
    return ""
This might be specific to my environment, but by making it wait while the k8s pod is in the ContainerCreating state (or no error has occurred and ContainersReady is false), it started to work properly even without a kernel image puller.
This is quite simplified example code:
@overrides
async def confirm_remote_startup(self):
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    pod_info = self.get_container_status(str(i))
    # if pod_info is an empty string or None, the container was not found
    if pod_info:
        pod_info_json = json.loads(pod_info)
        status = pod_info_json["status"]
        pod_phase = status["phase"].lower()
        if pod_phase == "running":
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        else:
            if "conditions" in status:
                for condition in status["conditions"]:
                    if "containerStatuses" in status:
                        # check for the ContainerCreating state
                        if (
                            status["containerStatuses"][0]["state"]["waiting"]["reason"]
                            == "ContainerCreating"
                        ):
                            self.log.info("Container is creating ...")
                            continue
                    if (
                        condition["type"] == "ContainersReady"
                        and condition["status"] != "True"
                    ):
                        self.log.warning("Containers are not ready; waiting 1 second.")
                        await asyncio.sleep(1)
                        continue
Per #43 this issue tracks the work necessary to clean up the Other section of the docs. Per this comment, both the troubleshooting guide contains stale stuff we don't need to document and the resources section still has some EG stuff.
Per #43 this issue tracks the work necessary to clean up the Contributors Guide. Per this comment, most of the guide requires changes.
As some of you know, I'm not happy with the name remote_provisioners for the following reason: it's way too generic. Other folks may want to create their own sets of provisioners that happen to work remotely, and I feel this repo would be the target of such confusion. I'd rather not "steal the name", nor do I think any repo should be named remote_provisioners, for the same reason.
I believe the name gateway_provisioners conveys both history and relevance. While it's true that you do not need a gateway server to host remote provisioners, it's very likely that you will. This is because these provisioners essentially require the invoker to be on the same network as where the kernel will run. As a result, many folks will need to "hop through" a gateway server that has been provisioned relative to the target network. Yes, applications like JupyterHub that spawn notebook servers in Kubernetes will not require a gateway server, but many others will want to deploy a gateway server in the cloud, in which case Gateway Provisioners would be handy.
By changing to Gateway Provisioners, we also imply a well-known family of provisioners, all of which happen to be remote (for now), but perhaps not always.
The Scala-based kernels use Apache Toree as their kernel, and we have a toree-launcher, analogous to the other Python and R launchers, that services the Toree kernel. Because the launcher is essentially a jar file, we need to provide the ability to distribute it and reference/copy it from/to the kernel.json and/or bin/run.sh files when building Scala-based kernelspecs.
Per #43 this issue tracks the work necessary to clean up the Developers Guide. Per this comment, most of the guide needs work.
We should utilize the Jupyter Releaser tooling to produce releases, if possible.
Looks like the YARN tests are timing out because the Application ID is not getting conveyed. Since this is a mocked environment, something has probably had side effects, and it should be relatively straightforward to locate the issue.
Here's a build failure log: https://github.com/jupyter-server/gateway_provisioners/actions/runs/6187008029/job/16795884972, followed by some relevant text:
------------------------------ Captured log call -------------------------------
INFO traitlets:yarn.py:293 YarnProvisioner: kernel launched. YARN RM: my-yarn-cluster.acme.com:7777, pid: 42, Kernel ID: 64bea58e-8b2d-4bd7-a916-17d52f02a35f, cmd: '['--public-key:MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCu87zbdvcyLCV7Ownj5nMbwDYpIZtxOoYONKIO7A3ulZosjRmRqzDPUivKBnK7fp5DKOEpBZlnyxc89Xglg/Zbneu4QY8AQ5oZntFTegUFbF1zk/KzQUSDFRHiPqxKs1C7WlJU4wOSbnpvZOmUuOXJZKULsFGrzLgXMH09RNS+3QIDAQAB', '--response-address:10.213.5.248:8880', '--port-range:0..0', '--kernel-id:64bea58e-8b2d-4bd7-a916-17d52f02a35f']'
ERROR traitlets:remote_provisioner.py:339 KernelID: '64bea58e-8b2d-4bd7-a916-17d52f02a35f' launch timeout due to: Application ID is None. Failed to submit a new application to YARN within 30 seconds. Check server log for more information.
Looks like the link referenced in the docs for GNU make has moved or been removed.
The check-links CI job produces this message:
FAILED docs/source/contributors/devinstall.md::/home/runner/work/gateway_provisioners/gateway_provisioners/docs/source/contributors/devinstall.md <a href=https://www.gnu.org/software/make/>
1 failed, 4 deselected, 2 warnings in 0.42s
The doc build should succeed, and the link to GNU make should be valid.
One of the differences introduced in this repo from EG is that the code that was duplicated between the Python and R launchers is now shared. As a result, the assembly of kernel specifications needs to take this into account within the tooling and currently does not.
Why is this named gateway provisioners and not simply kernel provisioners? The intent is that it can be used in any jupyter-client solution.
Similarly, reading the docs, I see mentions of Host, Server... which confuse me at some points. A nomenclature introduction would be great; I can try to open a PR if there's interest.
cc/ @kevin-bates
Per #43 this issue tracks the work necessary to clean up the Operators Guide. Per this comment, a majority of the effort will be in the Deploying Gateway Provisioners, but the Configuring Gateway Provisioners topic could also use work.
We currently have our docs building via hatch. We should look at getting envs created for tests, linting, and builds.
This will likely consist of bringing over applicable sections of the Enterprise Gateway docs. We should be able to leverage much of those docs and apply general substitutions like ProcessProxy -> Provisioner, EG_ -> GP_, etc. At a glance, it looks like most of the Developers and Contributors sections would apply.
We are creating a custom kernel provisioner, and we have an API for finding a host based on certain parameters, but this API takes up to 20 s to return the host. This 20 s wait blocks the server and makes Lab unusable during that time. We tried making the _determine_next_host function async ourselves by running the host-fetching API in asyncio's run_in_executor and awaiting the _determine_next_host call in the launch_kernel function.
A simple reproducer for this is adding an asyncio.sleep(10) in the DistributedProvisioner._determine_next_host function. The result is that the kernel never reaches an alive state.
We would like a way to asynchronously determine the next host.
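The workaround we tried can be reduced to the following sketch; fetch_host_from_api is a stand-in for the slow host-selection API, not part of gateway_provisioners:

```python
import asyncio
import time


def fetch_host_from_api() -> str:
    # Stand-in for the blocking host-selection API (~20 s in our case).
    time.sleep(0.1)
    return "node-1.example.com"


async def determine_next_host_async() -> str:
    # Run the blocking call in the default thread-pool executor so the
    # event loop (and thus the Jupyter server) stays responsive.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, fetch_host_from_api)
```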
I'm using gateway_provisioners v0.2.0.
Enterprise Gateway's CustomResourceProcessProxy (and its SparkOperatorProcessProxy subclass) need to be ported to this repository as kernel provisioners.
Since process proxies and kernel provisioners are extremely similar, this is primarily an exercise in massaging the names and configuration references to be in line with those used in this repository. Also, since these will be subclasses of KubernetesProvisioner, we should be able to extend the CLI tooling to enable kernelspec creation, etc.
As of #8, I just noticed that a __pycache__ directory exists in the launcher-script locations of generated kernel spec directories; it should be removed following the copy.
We should utilize the same kind of pre-commit tooling used in the jupyter-server org.
Hey, thanks for the great package.
It would be nice if there were more flexible options for authenticating to the k8s master for the KubernetesProvisioner.
For example, in my deployment I have developers log in to JupyterHub with OAuth, and then I use KubeSpawner.auth_state_hook to propagate their access token to their JupyterHub server, where I use it to authenticate to the GKE master (since GKE lets you manage RBAC with Google IAM).
You could borrow from a bunch of other tools (like dask_kubernetes) and use ~/.kube/config profiles to authenticate (with an env var naming the profile).
I may work on this if I get some time and you're interested.
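A minimal sketch of the ~/.kube/config idea, assuming a hypothetical GP_KUBE_CONTEXT environment variable (not an existing Gateway Provisioners option):

```python
import os


def resolve_kube_context(default: str = "") -> str:
    # Hypothetical: let operators name a ~/.kube/config context via an
    # environment variable. The result would then be handed to
    # kubernetes.config.load_kube_config(context=...) instead of always
    # relying on in-cluster service-account auth.
    return os.environ.get("GP_KUBE_CONTEXT", default) or default
```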
The docs are in place and building, but still contain information that is only relevant to Enterprise Gateway and really need a series of passes (per #43 (comment) below). As a result, I've decided to turn this issue into an umbrella issue so that this work can be more easily distributed (as necessary).
Per #43 this issue tracks the work necessary to clean up the Users Guide. Per this comment, both the Users index page and its content require changes.
Hi, I'm trying to create a custom kernel spec JSON file, and I would like the configuration variables to be available inside a custom gateway provisioner. I'm looking at this line of documentation under System Architecture:
Each kernel.json’s kernel_provisioner stanza can specify an optional config stanza that is converted into a dictionary of name/value pairs and passed as an argument to each kernel provisioner’s constructor relative to the provisioner identified by the provisioner_name entry.
but when I put variables under the config stanza and inspect the arguments passed into the kernel provisioner's constructor, I don't see any of my configuration variables. Do you know where I might be able to find them?
kernel.json
{
"argv": [
],
"env": {},
"display_name": "Python Custom",
"language": "python",
"interrupt_mode": "signal",
"metadata": {
"debugger": true,
"kernel_provisioner": {
"provisioner_name": "custom-provisioner",
"config": {
"launch_timeout": 30,
"test": "test",
"test2": "test2"
}
}
}
}
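If it helps clarify the intent: my understanding (an assumption, not verified against jupyter_client's source) is that the config stanza is expanded into keyword arguments when the provisioner class is instantiated, so the keys only survive if the class declares matching traits or accepts them in its constructor. A stand-in class illustrating that expansion:

```python
class CustomProvisioner:
    # Stand-in for a kernel provisioner; the real class would subclass
    # KernelProvisionerBase and declare these values as traitlets.
    def __init__(self, *, launch_timeout: int = 60, **kwargs):
        self.launch_timeout = launch_timeout
        self.extra = kwargs  # remaining config-stanza entries


# The "config" stanza from the kernel.json above, expanded as kwargs:
config_stanza = {"launch_timeout": 30, "test": "test", "test2": "test2"}
provisioner = CustomProvisioner(**config_stanza)
```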