
jupyterlab-integration's Introduction

==========================================================

DEPRECATED

  • SSH access is no longer allowed in most workspaces
  • Jupyter features like ipywidgets are now supported in Databricks Notebooks

==========================================================

Local JupyterLab connecting to Databricks via SSH

This package allows you to connect a locally running JupyterLab to a remote Databricks cluster.

>>> 1 New minor release V2.2.1 (June 2021) <<<

  • Support for Databricks Runtimes 6.4 (ESR), 7.3, 7.6, 8.0, 8.1, 8.2 and 8.3 (both standard and ML)
  • Upgrade to ssh_ipykernel 1.2.3 (security fixes for the JavaScript JupyterLab extension of ssh_ipykernel)
  • Security fixes for the JavaScript JupyterLab extension of databrickslabs-jupyterlab

2 Overview

(Introduction demo animation)

3 Prerequisites

  1. Operating System

    JupyterLab Integration will run on the following operating systems:

    • macOS
    • Linux
    • Windows 10 (with OpenSSH)
  2. Anaconda

    JupyterLab Integration is based on Anaconda and supports:

    • A recent version of Anaconda with Python >= 3.8
    • The conda tool must be newer than 4.7.5; tests were executed with 4.9.2.

    Since JupyterLab Integration will create a separate conda environment, Miniconda is sufficient to start with.

  3. Python

    JupyterLab Integration only works with Python 3 and supports Python 3.7 and Python 3.8 both on the remote cluster and locally.

  4. Databricks CLI

    JupyterLab Integration requires a recent version of the Databricks CLI. To install the Databricks CLI and to configure profiles for your clusters, please refer to AWS / Azure. A short setup sketch covering the CLI profile and an SSH connectivity check follows after this list.

    Note:

    • JupyterLab Integration does not support Databricks CLI profiles with username/password authentication; only personal access tokens are supported.
    • Whenever $PROFILE is used in this documentation, it refers to a valid Databricks CLI profile name, stored in a shell environment variable.
  5. SSH access to Databricks clusters

    Configure your Databricks clusters to allow SSH access; see Configure SSH access.

    Note:

    • Only clusters with valid ssh configuration are visible to databrickslabs_jupyterlab.
  6. Databricks Runtime

    JupyterLab Integration has been tested with the following Databricks runtimes on AWS and Azure:

    • '6.4 (ESR)'
    • '7.3' and '7.3 ML'
    • '7.6' and '7.6 ML'
    • '8.0' and '8.0 ML'
    • '8.1' and '8.1 ML'
    • '8.2' and '8.2 ML'
    • '8.3' and '8.3 ML'

    Newer runtimes might work, but you will have to test them yourself.
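
To summarize the Databricks CLI and SSH prerequisites (items 4 and 5 above), a minimal setup sketch for one profile might look as follows. It assumes your own workspace URL and personal access token, and that the cluster exposes SSH for user ubuntu on port 2200 as described in Configure SSH access:

# configure a token-based Databricks CLI profile (prompts for host and token)
databricks configure --token --profile $PROFILE

# verify that the profile works
databricks clusters list --profile $PROFILE

# optional: verify SSH access to the cluster driver
ssh -p 2200 -i ~/.ssh/id_$PROFILE ubuntu@<driver-public-dns-or-ip>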

4 Running with docker

A docker image ready for working with JupyterLab Integration is available from Docker Hub. It is recommended to prepare your environment by pulling the image: docker pull bwalter42/databrickslabs_jupyterlab:2.2.1

There are two scripts in the folder docker:

  • for Windows: dk-dj.bat and dk-jupyter.bat
  • for macOS/Linux: dk-dj and dk-jupyter

Alternatively, under macOS and Linux one can use the following bash functions:

  • databrickslabs-jupyterlab for docker:

    This is the Jupyterlab Integration configuration utility using the docker image:

    function dk-dj {
        docker run -it --rm -p 8888:8888 \
            -v $(pwd)/kernels:/home/dbuser/.local/share/jupyter/kernels/ \
            -v $HOME/.ssh/:/home/dbuser/.ssh  \
            -v $HOME/.databrickscfg:/home/dbuser/.databrickscfg \
            -v $(pwd):/home/dbuser/notebooks \
            bwalter42/databrickslabs_jupyterlab:2.2.1 /opt/conda/bin/databrickslabs-jupyterlab $@
    }
  • jupyter for docker:

    Allows running jupyter commands using the docker image:

    function dk-jupyter {
        docker run -it --rm -p 8888:8888 \
            -v $(pwd)/kernels:/home/dbuser/.local/share/jupyter/kernels/ \
            -v $HOME/.ssh/:/home/dbuser/.ssh  \
            -v $HOME/.databrickscfg:/home/dbuser/.databrickscfg \
            -v $(pwd):/home/dbuser/notebooks \
            bwalter42/databrickslabs_jupyterlab:2.2.1 /opt/conda/bin/jupyter $@
    }

The two scripts assume that notebooks will be in the current folder and kernels will be in the kernels subfolder of the current folder:

$PWD  <= Start jupyterLab from here
 |_ kernels
 |  |_ <Jupyterlab Integration kernel spec>
 |  |_ ... 
 |_ project
 |  |_ notebook.ipynb
 |_ notebook.ipynb
 |_ ...
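
With this layout, a typical docker-based session from the project root could look like the sketch below; it assumes the scripts or bash functions above are on your PATH and that $PROFILE is a valid Databricks CLI profile name:

cd /path/to/project-root        # the folder containing ./kernels and your notebooks
dk-dj $PROFILE -k               # create or refresh the remote kernel spec in ./kernels
dk-jupyter lab                  # start JupyterLab in the container, serving the current folder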

Note: the scripts dk-dj / dk-dj.bat will modify your ~/.ssh/config and ~/.ssh/known_hosts! If you do not want this to happen, you can for example extend the folder structure to

$PWD  <= Start jupyterLab from here
|_ .ssh                      <= new
|  |_ config                 <= new
|  |_ id_$PROFILE            <= new
|  |_ id_$PROFILE.pub        <= new
|_ kernels
|  |_ <Jupyterlab Integration kernel spec>
|  |_ ... 
|_ project
|  |_ notebook.ipynb
|_ notebook.ipynb
|_ ...

and create the necessary public/private key pair in $(pwd)/.ssh and change the parameter -v $HOME/.ssh/:/home/dbuser/.ssh to -v $(pwd)/.ssh/:/home/dbuser/.ssh in both commands (a sketch follows below).
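
A minimal sketch of this change, assuming the key naming convention id_$PROFILE used elsewhere in this document:

# create the key pair in a project-local .ssh folder instead of ~/.ssh
mkdir -p ./.ssh
ssh-keygen -t rsa -b 4096 -N "" -f ./.ssh/id_$PROFILE

# remember to add ./.ssh/id_$PROFILE.pub to the cluster's SSH public keys
# (see Configure SSH access), then mount this folder in dk-dj and dk-jupyter:
#   -v $(pwd)/.ssh/:/home/dbuser/.ssh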

5 Local installation

  1. Install Jupyterlab Integration

    Create a new conda environment and install databrickslabs_jupyterlab with the following commands:

    (base)$ conda create -n dj python=3.8  # you might need to add "pywin32" if you are on Windows
    (base)$ conda activate dj
    (dj)$   pip install --upgrade databrickslabs-jupyterlab[cli]==2.2.1

    The prefix (db-jlab)$ for all command examples in this document assumes that the conda environment created above is activated.

  2. The tool databrickslabs-jupyterlab / dj

    On Windows it comes with the batch file dj.bat. On macOS and Linux both dj and databrickslabs-jupyterlab are available.
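
    Before creating a kernel specification (next section), you can check which Databricks CLI profiles are configured and whether an SSH key exists for them. A small sketch, using the -p flag that also appears in the examples further below:

    (db-jlab)$ dj -p                         # list profiles, hosts and SSH key status
    (db-jlab)$ databrickslabs-jupyterlab -p  # equivalent long form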

6 Getting started with local installation or docker

Ensure that SSH access is correctly configured; see Configure SSH access.

6.1 Starting JupyterLab

  1. Create a kernel specification

    In the terminal, create a jupyter kernel specification for a Databricks CLI profile $PROFILE with the following command:

    • Local installation

      (db-jlab)$ dj $PROFILE -k
    • With docker

      (db-jlab)$ dk-dj $PROFILE -k

    A new kernel is available in the kernel change menu (see here for an explanation of the kernel name structure)

  2. Start JupyterLab

    • Local installation

      (db-jlab)$ dj $PROFILE -l      # or 'jupyter lab'
    • With docker

      (db-jlab)$ dk-dj $PROFILE -l   # or 'dk-jupyter lab'

    The command with -l is a safe replacement for the standard command to start JupyterLab (jupyter lab); it additionally ensures that the kernel specification is up to date.

6.2 Using Spark in the Notebook

  1. Check whether the notebook is properly connected

    When the notebook has successfully connected to the cluster, the status bar at the bottom of JupyterLab shows a connection indicator (with Spark status if you use a kernel with Spark, without it otherwise; screenshots omitted).

    If this is not the case, see Troubleshooting

  2. Test the Spark access

    To check the remote Spark connection, enter the following lines into a notebook cell:

    import socket

    from databrickslabs_jupyterlab import is_remote

    # 'sc' (the SparkContext) is provided by the kernel once the Spark
    # connection from step 1 above is established
    result = sc.range(10000).repartition(100).map(lambda x: x).sum()
    print(socket.gethostname(), is_remote())
    print(result)

    It prints the hostname of the driver and confirms that the kernel is actually running remotely; the Spark job in between quickly smoke tests the cluster connection.

Success: Your local JupyterLab is now connected to the remote Databricks cluster.

7 Advanced topics

7.1 Switching kernels and restart after cluster auto-termination

7.2 Creating a mirror of a remote Databricks cluster

7.3 Detailed databrickslabs_jupyterlab command overview

7.4 How it works

7.5 Troubleshooting

8 Project Support

Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

9 Test notebooks

To work with the test notebooks in ./examples, the remote cluster needs to have the following libraries installed (a CLI install sketch follows after this list):

  • mlflow==1.x
  • spark-sklearn
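
One way to install them on a running cluster (a sketch using the Databricks CLI; the cluster id and profile are your own):

databricks libraries install --cluster-id $CLUSTER_ID --pypi-package "mlflow<2" --profile $PROFILE
databricks libraries install --cluster-id $CLUSTER_ID --pypi-package "spark-sklearn" --profile $PROFILE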

jupyterlab-integration's People

Contributors

aditya-chaturvedi, apulich-exos, bernhard-42, geeksheikh, pohlposition


jupyterlab-integration's Issues

Jupyterlab==3.*.* support

Hi!
Do you have plans to support the new JupyterLab 3? If so, are there any approximate dates, and can I help somehow?

Comparing to OS solutions

Hi all, thanks for this nice effort and great work! However, I miss the ability to switch the connectivity around (e.g. connecting from my k8s cluster to a Databricks cluster).

So, how is this different to jupyter gateway or jupyter enterprise gateway?

Upgrade notebook to latest version 6.2.0

Is it possible to upgrade notebook to latest version 6.2.0?

The version of jupyter/notebook currently used (notebook==6.0.3) has a problem with "Ensure that cell ids persist after save" (discussed in a PR in jupyter/notebook), which has been fixed in the latest version 6.2.0. Without this fix, every save of a notebook creates a new ID for each cell, which makes reviewing notebooks difficult.

I was checking whether there is any possibility to upgrade notebook to the latest version 6.2.0; your help would be appreciated!

Read local files from notebooks connecting to Databricks via SSH

Hello,
I was wondering if there is any way that notebooks connecting to Databricks via SSH could read files on local machine.

I have a yaml file and a notebook on my local side. I opened the notebook from JupyterLab connecting to Databricks via SSH, and the notebook tried to read the yaml file, but it did not work because the yaml file did not exist on Databricks. The only solution I could figure out is uploading the yaml file to DBFS so that the notebook can read it from there. Is there a better way to do that?

Thanks.
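
One possible workaround along these lines (a sketch; the DBFS path is only an example) is to push the file to DBFS with the Databricks CLI, so that the remote driver can read it from its /dbfs mount:

databricks fs cp ./config.yaml dbfs:/FileStore/config.yaml --profile $PROFILE
# the remote kernel can then read it as a local file on the driver:
#   /dbfs/FileStore/config.yaml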

local.py line 125 IndexError: list index out of range

Hello all: trying to set up following the instructions on Azure Databricks, I am getting this error:

databrickslabs-jupyterlab eastus -k -i 0426-155413-ring996
Traceback (most recent call last):
  File "/data/anaconda/envs/db-jlab/bin/databrickslabs-jupyterlab", line 171, in <module>
    version = conda_version()
  File "/data/anaconda/envs/db-jlab/lib/python3.6/site-packages/databrickslabs_jupyterlab/local.py", line 125, in conda_version
    return result["stdout"].strip().split(" ")[1]
IndexError: list index out of range

I am able to SSH into the master without problems.
-p also lists the profile:

databrickslabs-jupyterlab -p

PROFILE              HOST                                                         SSH KEY
eastus               https://eastus.azuredatabricks.net/?o=3573392022285404       OK

The databricks CLI connects to the cluster successfully:

databricks clusters list --profile eastus
0426-155413-ring996   gpu  RUNNING
0404-233454-navel281  std  TERMINATED

Your help is much appreciated.

running in jupyterhub inside docker image

I can see this would be useful running from JupyterHub, which is a multi-tenant Jupyter service. For example, a user runs a local notebook and sends a job to a remote cluster. I would like the Databricks cluster to start when the job is sent rather than running continuously, and then to shut down after the cluster's inactivity timeout setting.

In our case we use jupyterhub running in AKS and Azure Databricks in own vnet.

I did try to create a docker image as an extension of one of the default jupyter images.

FROM jupyter/datascience-notebook:latest
ENV BASH_ENV ~/.bashrc
RUN conda create -n db-jlab
RUN echo "source activate db-jlab" > ~/.bashrc
ENV PATH /opt/conda/envs/env/bin:$PATH
RUN pip install --upgrade databrickslabs-jupyterlab
RUN databrickslabs-jupyterlab -b
RUN pip install databricks-cli
USER $NB_USER

I am still testing.

A bug with incorrect jupyter_notebook_config.py overwriting with multiline values (e.g. dicts)

If there are some multiline dicts in jupyter_notebook_config.py like

c.NotebookApp.tornado_settings={
  'headers': {
    'SOME_HEADER': 'SOME_VALUE'
  }
}

then databrickslabs_jupyterlab.local.write_config splits the first line by "=" which results in

['c.NotebookApp.tornado_settings', '{']

Next lines are just simply ignored as there is no "=" symbol there. As a result, the correct c.NotebookApp.tornado_settings settings are overwritten by c.NotebookApp.tornado_settings={ which simply breaks the config as now we have only an opening brace.

A workaround is to flatten the config value into one line, but that can make the config unreadable if there are many lines. So maybe it makes sense to change the write_config function so it can handle such cases.

I guess I can fix it and create a pull-request

databrickslabs-jupyterlab configuration breaks for conda 4.8.0

The conda version is the latest, i.e. greater than 4.7.5; still, databrickslabs-jupyterlab complains about it being too old.

(dbconnect) ~ λ databrickslabs-jupyterlab $PROFILE -s -i $CLUSTER_ID
Too old conda version:
Please update conda to at least 4.7.5
(dbconnect) ~ λ conda --version
conda 4.8.0

dbutils.library support

Hi,
Are there any plans to add support for the dbutils.library module? Right now a simple dbutils.library.help("install") produces an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-28186010f9f4> in <module>
----> 1 dbutils.library.help("install")

AttributeError: 'DbjlUtils' object has no attribute 'library'

In the ML runtime there is also a great magic %pip - https://docs.databricks.com/notebooks/notebooks-python-libraries.html#enable-pip-and-conda-magic-commands
It installs libraries on both the driver and the executor nodes.
In contrast, running %pip install inside a JupyterLab notebook connected to a Databricks cluster installs libraries only on the driver node, which makes it unusable for UDFs, because the executors need the same libraries as well.
Could you suggest any workaround? Or are there plans to bring such support to jupyterlab-integration?

Is there any way to install notebook-scoped libraries interactively without init scripts?

Thanks in advance

Any guidance around whether jupyterlab-integration or jupyterlab with databricks-connect?

I found that databricks-connect supports Jupyter, and I made sure that JupyterLab works with databricks-connect using this link:
https://docs.databricks.com/dev-tools/databricks-connect.html#jupyter
Which one should I use for JupyterLab with Databricks, this library or databricks-connect? Is development on this repo continuing? I'm wondering where the Databricks team will be putting its effort to integrate JupyterLab.

Connecting via private IP

We have a setup using VPN to connect to our databricks clusters.
I want to be able to connect to the cluster using its private IP.

I've tried just changing the IP in the ssh config, but that gets overridden.

Any ideas?

Matplotlib and plotly plots don't render

Dataframes render fine in local Jupyter notebook, but matplotlib and plotly plots do not.

For matplotlib, the plot simply does not render (screenshot omitted).

For plotly, using the default plotly renderer raises an error (screenshot omitted).

When setting the default renderer to "notebook", nothing is printed out, similar to matplotlib. I can make plotly work with a very hacky workaround, by saving the plot as HTML and then displaying the HTML.

Support any virtual env manager

The package relies on conda, but conda is very heavyweight and not everyone is a huge conda fan.

I believe this is a lower priority issue, but it would be great if we could use any virtual env manager we want, like say Poetry.

Custom environment variables for kernels

Hi!
Firstly, Thank you for your work!

I'm wondering if I can somehow specify custom environment variables for my ssh kernel.
I have some custom libraries for which I need to specify PYTHONPATH, LD_LIBRARY_PATH and some other env vars. When I just use plain Databricks notebooks, I have my own docker image and init scripts where I set up these vars and add them to .bashrc. But if I run JupyterLab with the Databricks integration, I don't see those variables. The only workaround I've found so far is to edit local.py from this library and set up my env there.
Obviously, this is a rather dirty hack.

Cluster not reachable exception

First off, I just want to say I think this tool is great and generally works flawlessly. I'm running into an issue where I'll sporadically receive a "Cluster Unreachable" exception, prompting me to restart the cluster. For long-running jobs this can be annoying, since it forces me to restart the cluster and then re-kick off the job. Any ideas why this is happening? It happens even in the middle of interactive work where my local machine is active and (in theory) the SSH tunnel is stable (although I haven't tested network disruptions etc.).

Here's the pop-up that surfaces (screenshot omitted).

Any help is much appreciated.

Token invalid error for Azure Databricks workspaces

In Azure databricks environments we

(db-jlab) C02Y77B9JG5H:~ gobinath$ databrickslabs-jupyterlab $PROFILE -k -o 4116859307136712 -i 0520-162211-ilk548
Valid version of conda detected: 4.7.12

* Getting host and token from .databrickscfg

* Select remote cluster

Token for profile 'jupyterssh' is invalid

=> Exiting
(db-jlab) C02Y77B9JG5H:~ gobinath$

(db-jlab) C02Y77B9JG5H:~ gobinath$ databrickslabs-jupyterlab $PROFILE -s -i 0520-162211-ilk548
Valid version of conda detected: 4.7.12

* Getting host and token from .databrickscfg

   => ssh key '/Users/gobinath/.ssh/id_jupyterssh' does not exist
   => Shall it be created (y/n)? (default = n): y
   => Creating ssh key /Users/gobinath/.ssh/id_jupyterssh
   => OK
Token for profile 'jupyterssh' is invalid

=> Exiting
(db-jlab) C02Y77B9JG5H:~ gobinath$

I know the token is good because I validated it over and over again with the direct CLI command:

(db-jlab) C02Y77B9JG5H:~ gobinath$ databricks clusters list --profile jupyterssh
0520-162211-ilk548  test_jupyter  RUNNING
(db-jlab) C02Y77B9JG5H:~ gobinath$ 

Scala Question

Can you please clarify how the notebook experience would work if I used scala?
I've read the following, and had follow-up questions...

https://github.com/databrickslabs/jupyterlab-integration/blob/master/docs/v2/how-it-works.md

Based on my understanding of that article, the Scala kernel, a JVM, would never run locally on my workstation. Is that correct? It sounds like everything I'm doing in each cell is being proxied to the remote cluster, including any logic that would otherwise be executed on the Spark driver.

I am pretty excited by your demo, that I saw here:
https://github.com/databrickslabs/jupyterlab-integration/blob/master/docs/v2/news/scala-magic.md

I guess the concern I have is that if the Scala kernel is never running on the local machine, then it will be difficult to achieve a rich Scala development experience within Jupyter. I think you highlighted some of the limitations already. As of now I've been using almond-sh ( https://almond.sh/ ) as my Scala kernel in Jupyter, and it sounds like this jupyterlab-integration experience would be very different.

Please let me know. I'm very eager to develop scala notebooks in Jupyterlab that will interact with a remote databricks cluster (via db-connect). It seems like a good combination to use jupyterlab for development, along with a remote cluster that I don't need to manage myself.

Missing arguments error on getting Spark context

I'm getting a missing argument error for 'pinned_mode' after a successful connect, when I try to get the Spark context.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-376bcb4d28bb> in <module>
      1 from databrickslabs_jupyterlab.connect import dbcontext, is_remote
----> 2 dbcontext()

/databricks/python/lib/python3.7/site-packages/databrickslabs_jupyterlab/connect.py in dbcontext(progressbar)
    179     # ... and connect to this gateway
    180     #
--> 181     gateway = get_existing_gateway(port, True, auth_token)
    182     print(". connected")
    183     # print("Python interpreter: %s" % interpreter)

TypeError: get_existing_gateway() missing 1 required positional argument: 'pinned_mode'

Remove Anaconda dependency

Since Anaconda has licensing that is incompatible with some corporate policies, it is important to be able to use the open source pip package manager instead.

I will investigate and see what can be done. There is a conda version of the databricks runtime docker images, but they are ostensibly less up-to-date than the pip versions.
