Artifacts to assist with establishing R/RStudio workloads on Databricks.
In `/init-scripts` there is currently one notebook which configures & installs:

- Simba Spark ODBC driver (`2.6.19`)
- `{mlflow}` and `{odbc}` (MRAN snapshot `2022-02-24`)
- ODBC data sources:
  - `databricks-self`: Existing cluster (self)
  - `databricks`: Any Databricks endpoint/cluster
- RStudio Connection Snippets
RStudio is not installed as part of the init script, as it comes pre-installed with the ML variant of the Databricks Runtime (DBR). It is recommended that you use the ML runtime (preferably LTS) in order to reduce cluster start times.
To ensure ODBC connections work seamlessly, it's recommended to update the init script. The start of the script includes the following:

```shell
# SET VARIABLES
WORKSPACE_ID=<Workspace ID>
WORKSPACE_URL=<Workspace URL>
MRAN_SNAPSHOT=<MRAN Snapshot Date>
```
This should look something like:

```shell
# SET VARIABLES
WORKSPACE_ID=123123123123123
WORKSPACE_URL=XXXXXXXXXX.cloud.databricks.com
MRAN_SNAPSHOT=2022-02-24
```
`WORKSPACE_ID` can be derived from the workspace URL (the value after `?o=`) or by asking your Databricks account admin. `MRAN_SNAPSHOT` is found via the DBR release notes, see below.
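As an illustration, the value after `?o=` can be pulled out of a full workspace URL with shell parameter expansion (the URL below is a made-up example):

```shell
# Hypothetical example URL; the numeric value after '?o=' is the workspace ID
url="https://XXXXXXXXXX.cloud.databricks.com/?o=123123123123123"
workspace_id="${url##*\?o=}"       # drop everything up to and including '?o='
workspace_id="${workspace_id%%&*}" # drop any further query parameters
echo "$workspace_id"               # 123123123123123
```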
In `/cluster-policies` there are:

- `rstudio-generic.json`:
  - DBR 10.4 ML LTS (`10.4.x-cpu-ml-scala2.12`) (forced)
  - Auto-termination disabled (forced)
  - Sets `purpose` tag to `rstudio` (forced)
  - Sets `init_scripts` to include the init script `dbfs:/databricks/init/r-env-init-aws.sh` (forced)
  - Policy only works for `all-purpose` clusters; it will not work for job clusters
- `rstudio-single-node.json`:
  - Extends `rstudio-generic.json` as a baseline
  - Sets cluster to `SingleNode` mode
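For context, a generic policy enforcing the rules above might look roughly like the sketch below, assuming the standard Databricks cluster-policy JSON syntax (`fixed` policy type); the exact keys in this repo's policy files may differ:

```json
{
  "spark_version": {
    "type": "fixed",
    "value": "10.4.x-cpu-ml-scala2.12"
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 0
  },
  "custom_tags.purpose": {
    "type": "fixed",
    "value": "rstudio"
  },
  "init_scripts.0.dbfs.destination": {
    "type": "fixed",
    "value": "dbfs:/databricks/init/r-env-init-aws.sh"
  },
  "cluster_type": {
    "type": "fixed",
    "value": "all-purpose"
  }
}
```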
For further information on configuring cluster policies see the docs.
The init script will configure two ODBC data sources:

- `databricks-self`: Existing cluster (self)
- `databricks`: Any Databricks endpoint/cluster
These will be available within the RStudio connections pane with preconfigured code snippets.
`PWD` is expected to be a Databricks Personal Access Token. `HTTPPath` is provided in the cluster/endpoint UI under ODBC settings (docs).
```r
# connecting via ODBC to a SQL Endpoint
library(DBI)
conn <- dbConnect(
  odbc::odbc(),
  dsn = "databricks",
  HTTPPath = "/sql/1.0/endpoints/XXXXXXXXXX",
  PWD = "dapiXXXXXXXXXXXXX"
)
```
```r
# connecting via ODBC to the same cluster that RStudio is running on
library(DBI)
conn <- dbConnect(
  odbc::odbc(),
  dsn = "databricks-self",
  PWD = "dapiXXXXXXXXXXXXX"
)
```
It's recommended not to store tokens or passwords in plain text. Databricks recommends the use of secret scopes, which can be set and accessed through Spark configs on the Databricks cluster (docs).
This would enable the following:
```r
library(DBI)
# set `spark.<property-name> {{secrets/<scope-name>/<secret-name>}}` on cluster
conn <- dbConnect(
  odbc::odbc(),
  dsn = "databricks-self",
  PWD = sparkR.conf("<property-name>")
)
```
Download URL and instructions for using Simba drivers:

To get the URL you will need to 'Copy Link Address' on the download button; this can then replace line 18 in the init script. It's possible that the way the driver structures its contents may change with newer/older versions, which would impact the ODBC configuration. Therefore, in the snippet below, the `Driver` path may require updating.
```ini
[databricks-self]
Driver = /opt/simba/spark/lib/64/libsparkodbc_sb64.so
```
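For orientation, a full Simba Spark DSN entry typically carries more keys than the driver path alone; the sketch below is illustrative (the host, auth, and transport values are assumptions, not the exact configuration written by the init script):

```ini
[databricks-self]
Driver          = /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Host            = <Workspace URL>
Port            = 443
SSL             = 1
ThriftTransport = 2
AuthMech        = 3
UID             = token
```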
It's recommended to use the same MRAN snapshot as the Databricks Runtime being used. This is disclosed in the DBR release notes (example).
It's also possible to use the MRAN time machine to choose a desired snapshot.
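As a sketch, assuming the MRAN time-machine URL scheme, the snapshot date from the init-script variable can be turned into a pinned R repository option (the repo URL and the generated `options()` line are illustrative, not taken verbatim from the init script):

```shell
# Illustrative: build an MRAN snapshot repo URL from the init-script variable
MRAN_SNAPSHOT=2022-02-24
REPO_URL="https://mran.microsoft.com/snapshot/${MRAN_SNAPSHOT}"
# This line could be appended to a site-wide Rprofile so that
# install.packages() resolves CRAN packages against the pinned snapshot
echo "options(repos = c(CRAN = \"${REPO_URL}\"))"
```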
Add these variables to `/etc/R/Renviron.site`:

```shell
MLFLOW_PYTHON_BIN="/databricks/python/bin/python3"
MLFLOW_BIN="/databricks/python3/bin/mlflow"
```
- `/etc/R/Renviron.site` needs to be configured with the `RETICULATE_PYTHON` variable.
  - This can be changed as necessary; it is set to `/databricks/python3/bin/python3`.
- `/usr/lib/R/etc/Renviron.site` is adjusted to update `PATH` with the following: `PATH=${PATH}:/databricks/conda/bin`
Despite the ML runtime including RStudio, there may be cases where a different version is required, or Server Pro/Workbench is preferred. Documentation for these processes is found here.