
jupyterlab-integration's Introduction

==========================================================

DEPRECATED

  • SSH access is no longer allowed in most workspaces
  • Jupyter features like ipywidgets are now supported in Databricks Notebooks

==========================================================

Local JupyterLab connecting to Databricks via SSH

This package allows you to connect a locally running JupyterLab to a remote Databricks cluster.

>>> 1 New minor release V2.2.1 (June 2021) <<<

  • Support for Databricks Runtimes 6.4 (ESR), 7.3, 7.6, 8.0, 8.1, 8.2 and 8.3 (both standard and ML)
  • Upgrade to ssh_ipykernel 1.2.3 (security fixes for the JavaScript JupyterLab extension of ssh_ipykernel)
  • Security fixes for the JavaScript JupyterLab extension of databrickslabs-jupyterlab

2 Overview

(Introduction demo animation)

3 Prerequisites

  1. Operating System

    JupyterLab Integration will run on the following operating systems:

    • macOS
    • Linux
    • Windows 10 (with OpenSSH)
  2. Anaconda

    JupyterLab Integration is based on Anaconda and supports:

    • A recent version of Anaconda with Python >= 3.8
    • The conda tool must be newer than 4.7.5; tests were executed with 4.9.2.

    Since JupyterLab Integration will create a separate conda environment, Miniconda is sufficient to start with.

  3. Python

    JupyterLab Integration only works with Python 3 and supports Python 3.7 and Python 3.8 both on the remote cluster and locally.

  4. Databricks CLI

    JupyterLab Integration requires a recent version of the Databricks CLI. To install the Databricks CLI and to configure profiles for your clusters, please refer to AWS / Azure. A short setup sketch covering the CLI profile and an SSH connectivity check follows after this list.

    Note:

    • JupyterLab Integration does not support Databricks CLI profiles with username/password authentication; only personal access tokens are supported.
    • Whenever $PROFILE is used in this documentation, it refers to a valid Databricks CLI profile name, stored in a shell environment variable.
  5. SSH access to Databricks clusters

    Configure your Databricks clusters to allow SSH access; see Configure SSH access.

    Note:

    • Only clusters with valid ssh configuration are visible to databrickslabs_jupyterlab.
  6. Databricks Runtime

    JupyterLab Integration has been tested with the following Databricks runtimes on AWS and Azure:

    • '6.4 (ESR)'
    • '7.3' and '7.3 ML'
    • '7.6' and '7.6 ML'
    • '8.0' and '8.0 ML'
    • '8.1' and '8.1 ML'
    • '8.2' and '8.2 ML'
    • '8.3' and '8.3 ML'

    Newer runtimes might work, but you will have to test them yourself.
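
To summarize the Databricks CLI and SSH prerequisites (items 4 and 5 above), a minimal setup sketch for one profile might look as follows. It assumes your own workspace URL and personal access token, and that the cluster exposes SSH for user ubuntu on port 2200 as described in Configure SSH access:

# configure a token-based Databricks CLI profile (prompts for host and token)
databricks configure --token --profile $PROFILE

# verify that the profile works
databricks clusters list --profile $PROFILE

# optional: verify SSH access to the cluster driver
ssh -p 2200 -i ~/.ssh/id_$PROFILE ubuntu@<driver-public-dns-or-ip>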

4 Running with docker

A docker image ready for working with JupyterLab Integration is available from Docker Hub. It is recommended to prepare your environment by pulling the image: docker pull bwalter42/databrickslabs_jupyterlab:2.2.1

There are two scripts in the folder docker:

  • for Windows: dk-dj.bat and dk-jupyter.bat
  • for macOS/Linux: dk-dj and dk-jupyter

Alternatively, under macOS and Linux one can use the following bash functions:

  • databrickslabs-jupyterlab for docker:

    This is the Jupyterlab Integration configuration utility using the docker image:

    function dk-dj {
        docker run -it --rm -p 8888:8888 \
            -v $(pwd)/kernels:/home/dbuser/.local/share/jupyter/kernels/ \
            -v $HOME/.ssh/:/home/dbuser/.ssh  \
            -v $HOME/.databrickscfg:/home/dbuser/.databrickscfg \
            -v $(pwd):/home/dbuser/notebooks \
            bwalter42/databrickslabs_jupyterlab:2.2.1 /opt/conda/bin/databrickslabs-jupyterlab $@
    }
  • jupyter for docker:

    Allows running jupyter commands using the docker image:

    function dk-jupyter {
        docker run -it --rm -p 8888:8888 \
            -v $(pwd)/kernels:/home/dbuser/.local/share/jupyter/kernels/ \
            -v $HOME/.ssh/:/home/dbuser/.ssh  \
            -v $HOME/.databrickscfg:/home/dbuser/.databrickscfg \
            -v $(pwd):/home/dbuser/notebooks \
            bwalter42/databrickslabs_jupyterlab:2.2.1 /opt/conda/bin/jupyter $@
    }

The two scripts assume that notebooks will be in the current folder and kernels will be in the kernels subfolder of the current folder:

$PWD  <= Start jupyterLab from here
 |_ kernels
 |  |_ <Jupyterlab Integration kernel spec>
 |  |_ ... 
 |_ project
 |  |_ notebook.ipynb
 |_ notebook.ipynb
 |_ ...
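
With this layout, a typical docker-based session from the project root could look like the sketch below; it assumes the scripts or bash functions above are on your PATH and that $PROFILE is a valid Databricks CLI profile name:

cd /path/to/project-root        # the folder containing ./kernels and your notebooks
dk-dj $PROFILE -k               # create or refresh the remote kernel spec in ./kernels
dk-jupyter lab                  # start JupyterLab in the container, serving the current folder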

Note: the scripts dk-dj / dk-dj.bat will modify your ~/.ssh/config and ~/.ssh/known_hosts! If you do not want this to happen, you can for example extend the folder structure to

$PWD  <= Start jupyterLab from here
|_ .ssh                      <= new
|  |_ config                 <= new
|  |_ id_$PROFILE            <= new
|  |_ id_$PROFILE.pub        <= new
|_ kernels
|  |_ <Jupyterlab Integration kernel spec>
|  |_ ... 
|_ project
|  |_ notebook.ipynb
|_ notebook.ipynb
|_ ...

and create the necessary public/private key pair in $(pwd)/.ssh and change the parameter -v $HOME/.ssh/:/home/dbuser/.ssh to -v $(pwd)/.ssh/:/home/dbuser/.ssh in both commands (a sketch follows below).
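
A minimal sketch of this change, assuming the key naming convention id_$PROFILE used elsewhere in this document:

# create the key pair in a project-local .ssh folder instead of ~/.ssh
mkdir -p ./.ssh
ssh-keygen -t rsa -b 4096 -N "" -f ./.ssh/id_$PROFILE

# remember to add ./.ssh/id_$PROFILE.pub to the cluster's SSH public keys
# (see Configure SSH access), then mount this folder in dk-dj and dk-jupyter:
#   -v $(pwd)/.ssh/:/home/dbuser/.ssh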

5 Local installation

  1. Install Jupyterlab Integration

    Create a new conda environment and install databrickslabs_jupyterlab with the following commands:

    (base)$ conda create -n dj python=3.8  # you might need to add "pywin32" if you are on Windows
    (base)$ conda activate dj
    (dj)$   pip install --upgrade databrickslabs-jupyterlab[cli]==2.2.1

    The prefix (db-jlab)$ for all command examples in this document assumes that the conda environment created above is activated.

  2. The tool databrickslabs-jupyterlab / dj

    On Windows it comes with the batch file dj.bat. On macOS and Linux both dj and databrickslabs-jupyterlab are available.
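
    Before creating a kernel specification (next section), you can check which Databricks CLI profiles are configured and whether an SSH key exists for them. A small sketch, using the -p flag that also appears in the examples further below:

    (db-jlab)$ dj -p                         # list profiles, hosts and SSH key status
    (db-jlab)$ databrickslabs-jupyterlab -p  # equivalent long form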

6 Getting started with local installation or docker

Ensure that SSH access is correctly configured; see Configure SSH access.

6.1 Starting JupyterLab

  1. Create a kernel specification

    In the terminal, create a jupyter kernel specification for a Databricks CLI profile $PROFILE with the following command:

    • Local installation

      (db-jlab)$ dj $PROFILE -k
    • With docker

      (db-jlab)$ dk-dj $PROFILE -k

    A new kernel is available in the kernel change menu (see here for an explanation of the kernel name structure)

  2. Start JupyterLab

    • Local installation

      (db-jlab)$ dj $PROFILE -l      # or 'jupyter lab'
    • With docker

      (db-jlab)$ dk-dj $PROFILE -l   # or 'dk-jupyter lab'

    The command with -l is a safe replacement for the standard command to start JupyterLab (jupyter lab); it additionally ensures that the kernel specification is up to date.

6.2 Using Spark in the Notebook

  1. Check whether the notebook is properly connected

    When the notebook has successfully connected to the cluster, the status bar at the bottom of JupyterLab shows a connection indicator (with Spark status if you use a kernel with Spark, without it otherwise; screenshots omitted).

    If this is not the case, see Troubleshooting

  2. Test the Spark access

    To check the remote Spark connection, enter the following lines into a notebook cell:

    import socket

    from databrickslabs_jupyterlab import is_remote

    # 'sc' (the SparkContext) is provided by the kernel once the Spark
    # connection from step 1 above is established
    result = sc.range(10000).repartition(100).map(lambda x: x).sum()
    print(socket.gethostname(), is_remote())
    print(result)

    It prints the hostname of the driver and confirms that the kernel is actually running remotely; the Spark job in between quickly smoke tests the cluster connection.

Success: Your local JupyterLab is now connected to the remote Databricks cluster.

7 Advanced topics

7.1 Switching kernels and restart after cluster auto-termination

7.2 Creating a mirror of a remote Databricks cluster

7.3 Detailed databrickslabs_jupyterlab command overview

7.4 How it works

7.5 Troubleshooting

8 Project Support

Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

9 Test notebooks

To work with the test notebooks in ./examples, the remote cluster needs to have the following libraries installed (a CLI install sketch follows after this list):

  • mlflow==1.x
  • spark-sklearn
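
One way to install them on a running cluster (a sketch using the Databricks CLI; the cluster id and profile are your own):

databricks libraries install --cluster-id $CLUSTER_ID --pypi-package "mlflow<2" --profile $PROFILE
databricks libraries install --cluster-id $CLUSTER_ID --pypi-package "spark-sklearn" --profile $PROFILE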

jupyterlab-integration's People

Contributors

aditya-chaturvedi, apulich-exos, bernhard-42, geeksheikh, pohlposition


jupyterlab-integration's Issues

Jupyterlab==3.*.* support

Hi!
Do you have plans to support the new JupyterLab 3? If so, are there any approximate dates, and can I help somehow?

Comparing to OS solutions

Hi all, thanks for this nice effort and great work! However, I miss the ability to switch the connectivity around (e.g. connecting from my k8s cluster to a Databricks cluster).

So, how is this different to jupyter gateway or jupyter enterprise gateway?

Upgrade notebook to latest version 6.2.0

Is it possible to upgrade notebook to latest version 6.2.0?

The version of jupyter/notebook currently used (notebook==6.0.3) has a problem with "Ensure that cell ids persist after save" (discussed in a PR in jupyter/notebook), which has been fixed in the latest version 6.2.0. Without this fix, every save of a notebook creates a new ID for each cell, which makes reviewing notebooks difficult.

I was checking whether there is any possibility to upgrade notebook to the latest version 6.2.0; your help would be appreciated!

Read local files from notebooks connecting to Databricks via SSH

Hello,
I was wondering if there is any way that notebooks connecting to Databricks via SSH could read files on local machine.

I have a yaml file and a notebook on my local side. I opened the notebook from JupyterLab connecting to Databricks via SSH, and the notebook tried to read the yaml file, but it did not work because the yaml file did not exist on Databricks. The only solution I could figure out is uploading the yaml file to DBFS so that the notebook can read it from there. Is there a better way to do that?

Thanks.
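
One possible workaround along these lines (a sketch; the DBFS path is only an example) is to push the file to DBFS with the Databricks CLI, so that the remote driver can read it from its /dbfs mount:

databricks fs cp ./config.yaml dbfs:/FileStore/config.yaml --profile $PROFILE
# the remote kernel can then read it as a local file on the driver:
#   /dbfs/FileStore/config.yaml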

local.py line 125 IndexError: list index out of range

Hello all: trying to set up following the instructions on Azure Databricks, I am getting this error:

databrickslabs-jupyterlab eastus -k -i 0426-155413-ring996
Traceback (most recent call last):
  File "/data/anaconda/envs/db-jlab/bin/databrickslabs-jupyterlab", line 171, in <module>
    version = conda_version()
  File "/data/anaconda/envs/db-jlab/lib/python3.6/site-packages/databrickslabs_jupyterlab/local.py", line 125, in conda_version
    return result["stdout"].strip().split(" ")[1]
IndexError: list index out of range

I am able to SSH into the master without problems.
-p also lists the profile:

databrickslabs-jupyterlab -p

PROFILE              HOST                                                         SSH KEY
eastus               https://eastus.azuredatabricks.net/?o=3573392022285404       OK

The databricks CLI connects to the cluster successfully:

databricks clusters list --profile eastus
0426-155413-ring996   gpu  RUNNING
0404-233454-navel281  std  TERMINATED

Your help is much appreciated.

running in jupyterhub inside docker image

I can see this would be useful running from JupyterHub, which is a multi-tenant Jupyter service. For example, a user runs a local notebook and sends a job to a remote cluster. I would like the Databricks cluster to start when the job is sent rather than running continuously, and then to shut down after the cluster's inactivity timeout setting.

In our case we use jupyterhub running in AKS and Azure Databricks in own vnet.

I did try to create a docker image as an extension of one of the default jupyter images.

FROM jupyter/datascience-notebook:latest
ENV BASH_ENV ~/.bashrc
RUN conda create -n db-jlab
RUN echo "source activate db-jlab" > ~/.bashrc
ENV PATH /opt/conda/envs/env/bin:$PATH
RUN pip install --upgrade databrickslabs-jupyterlab
RUN databrickslabs-jupyterlab -b
RUN pip install databricks-cli
USER $NB_USER

I am still testing.

A bug with incorrect jupyter_notebook_config.py overwriting with multiline values (e.g. dicts)

If there are some multiline dicts in jupyter_notebook_config.py like

c.NotebookApp.tornado_settings={
  'headers': {
    'SOME_HEADER': 'SOME_VALUE'
  }
}

then databrickslabs_jupyterlab.local.write_config splits the first line by "=" which results in

['c.NotebookApp.tornado_settings', '{']

Next lines are just simply ignored as there is no "=" symbol there. As a result, the correct c.NotebookApp.tornado_settings settings are overwritten by c.NotebookApp.tornado_settings={ which simply breaks the config as now we have only an opening brace.

A workaround is to flatten the config value into one line, but that can make the config unreadable if there are many lines. So maybe it makes sense to change the write_config function so it can handle such cases.

I guess I can fix it and create a pull-request

databrickslabs-jupyterlab configuration breaks for conda 4.8.0

The conda version is the latest, i.e. greater than 4.7.5; still, databrickslabs-jupyterlab complains about it being too old.

(dbconnect) ~ λ databrickslabs-jupyterlab $PROFILE -s -i $CLUSTER_ID
Too old conda version:
Please update conda to at least 4.7.5
(dbconnect) ~ λ conda --version
conda 4.8.0

dbutils.library support

Hi,
Are there any plans to add support for the dbutils.library module? Right now a simple dbutils.library.help("install") produces an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-28186010f9f4> in <module>
----> 1 dbutils.library.help("install")

AttributeError: 'DbjlUtils' object has no attribute 'library'

In the ML runtime there is also a great magic %pip - https://docs.databricks.com/notebooks/notebooks-python-libraries.html#enable-pip-and-conda-magic-commands
It installs libraries on both the driver and the executor nodes.
In contrast, running %pip install inside a JupyterLab notebook connected to a Databricks cluster installs libraries only on the driver node, which makes it unusable for UDFs, because the executors need the same libraries as well.
Could you suggest any workaround? Or are there plans to bring such support to jupyterlab-integration?

Is there any way to install notebook-scoped libraries interactively without init scripts?

Thanks in advance

Any guidance around whether jupyterlab-integration or jupyterlab with databricks-connect?

I found that databricks-connect supports Jupyter, and I made sure that JupyterLab works with databricks-connect using this link:
https://docs.databricks.com/dev-tools/databricks-connect.html#jupyter
Which one should I use for JupyterLab with Databricks, this library or databricks-connect? Is development on this repo continuing? I'm wondering where the Databricks team will be putting its effort to integrate JupyterLab.

Connecting via private IP

We have a setup using VPN to connect to our databricks clusters.
I want to be able to connect to the cluster using its private IP.

I've tried just changing the IP in the ssh config, but that gets overridden.

Any ideas?

Matplotlib and plotly plots don't render

Dataframes render fine in local Jupyter notebook, but matplotlib and plotly plots do not.

For matplotlib, the plot simply does not render (screenshot omitted).

For plotly, using the default plotly renderer raises an error (screenshot omitted).

When setting the default renderer to "notebook", nothing is printed out, similar to matplotlib. I can make plotly work with a very hacky workaround, by saving the plot as HTML and then displaying the HTML.

Support any virtual env manager

The package relies on conda, but conda is very heavyweight and not everyone is a huge conda fan.

I believe this is a lower priority issue, but it would be great if we could use any virtual env manager we want, like say Poetry.

Custom environment variables for kernels

Hi!
Firstly, Thank you for your work!

I'm wondering if I can somehow specify custom environment variables for my ssh kernel.
I have some custom libraries for which I need to specify PYTHONPATH, LD_LIBRARY_PATH and some other env vars. When I just use plain Databricks notebooks, I have my own docker image and init scripts where I set up these vars and add them to .bashrc. But if I run JupyterLab with the Databricks integration, I don't see those variables. The only workaround I've found so far is to edit local.py from this library and set up my env there.
Obviously, this is a rather dirty hack.

Cluster not reachable exception

First off, I just want to say I think this tool is great and generally works flawlessly. I'm running into an issue where I'll sporadically receive a "Cluster Unreachable" exception, prompting me to restart the cluster. For long-running jobs this can be annoying, since it forces me to restart the cluster and then re-kick off the job. Any ideas why this is happening? It happens even in the middle of interactive work where my local machine is active and (in theory) the SSH tunnel is stable (although I haven't tested network disruptions etc.).

Here's the pop-up that surfaces (screenshot omitted).

Any help is much appreciated.

Token invalid error for Azure Databricks workspaces

In Azure databricks environments we

(db-jlab) C02Y77B9JG5H:~ gobinath$ databrickslabs-jupyterlab $PROFILE -k -o 4116859307136712 -i 0520-162211-ilk548
Valid version of conda detected: 4.7.12

* Getting host and token from .databrickscfg

* Select remote cluster

Token for profile 'jupyterssh' is invalid

=> Exiting
(db-jlab) C02Y77B9JG5H:~ gobinath$

(db-jlab) C02Y77B9JG5H:~ gobinath$ databrickslabs-jupyterlab $PROFILE -s -i 0520-162211-ilk548
Valid version of conda detected: 4.7.12

* Getting host and token from .databrickscfg

   => ssh key '/Users/gobinath/.ssh/id_jupyterssh' does not exist
   => Shall it be created (y/n)? (default = n): y
   => Creating ssh key /Users/gobinath/.ssh/id_jupyterssh
   => OK
Token for profile 'jupyterssh' is invalid

=> Exiting
(db-jlab) C02Y77B9JG5H:~ gobinath$

I know the token is good because I validated it over and over again with the direct CLI command:

(db-jlab) C02Y77B9JG5H:~ gobinath$ databricks clusters list --profile jupyterssh
0520-162211-ilk548  test_jupyter  RUNNING
(db-jlab) C02Y77B9JG5H:~ gobinath$ 

Scala Question

Can you please clarify how the notebook experience would work if I used scala?
I've read the following, and had follow-up questions...

https://github.com/databrickslabs/jupyterlab-integration/blob/master/docs/v2/how-it-works.md

Based on my understanding of that article, the Scala kernel, a JVM, would never run locally on my workstation. Is that correct? It sounds like everything I'm doing in each cell is being proxied to the remote cluster, including any logic that would otherwise be executed on the Spark driver.

I am pretty excited by your demo, that I saw here:
https://github.com/databrickslabs/jupyterlab-integration/blob/master/docs/v2/news/scala-magic.md

I guess the concern I have is that if the Scala kernel is never running on the local machine, then it will be difficult to achieve a rich Scala development experience within Jupyter. I think you highlighted some of the limitations already. As of now I've been using almond-sh ( https://almond.sh/ ) as my Scala kernel in Jupyter, and it sounds like this jupyterlab-integration experience would be very different.

Please let me know. I'm very eager to develop scala notebooks in Jupyterlab that will interact with a remote databricks cluster (via db-connect). It seems like a good combination to use jupyterlab for development, along with a remote cluster that I don't need to manage myself.

Missing arguments error on getting Spark context

I'm getting a missing argument error for 'pinned_mode' after a successful connect, when I try to get the Spark context.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-376bcb4d28bb> in <module>
      1 from databrickslabs_jupyterlab.connect import dbcontext, is_remote
----> 2 dbcontext()

/databricks/python/lib/python3.7/site-packages/databrickslabs_jupyterlab/connect.py in dbcontext(progressbar)
    179     # ... and connect to this gateway
    180     #
--> 181     gateway = get_existing_gateway(port, True, auth_token)
    182     print(". connected")
    183     # print("Python interpreter: %s" % interpreter)

TypeError: get_existing_gateway() missing 1 required positional argument: 'pinned_mode'

Remove Anaconda dependency

Since Anaconda has licensing that is incompatible with some corporate policies, it is important to be able to use the open source pip package manager instead.

I will investigate and see what can be done. There is a conda version of the databricks runtime docker images, but they are ostensibly less up-to-date than the pip versions.
