databrickslabs / databricks-sdk-r
Databricks SDK for R (Experimental)
Home Page: https://databrickslabs.github.io/databricks-sdk-r/
License: Apache License 2.0
I'm pretty certain it's not necessary.
I'm happy to do a PR for this if you show me how to run the code generator.
It looks like the code generator is converting headings to:
#' **Release status**
Instead of
## Release status
Additionally, in statement_execution.R, it's generating this:
#' ----
#'
#' ### **Warning: We recommend you protect the URLs in the EXTERNAL_LINKS.**
#'
#' When using the EXTERNAL_LINKS disposition, a short-lived pre-signed URL is
#' generated, which the client can use to download the result chunk directly
#' from cloud storage. As the short-lived credential is embedded in a pre-signed
#' URL, this URL should be protected.
#'
#' Since pre-signed URLs are generated with embedded temporary credentials, you
#' need to remove the authorization header from the fetch requests.
#'
#' ----
#'
and the sequences of dashes are causing this warning when documenting:
Warning: [statement_execution.R:12] @details markdown translation failed
✖ Internal error: unknown xml node thematic_break
ℹ Please file an issue at https://github.com/r-lib/roxygen2/issues
If you can tell me what you're trying to achieve here, I can suggest how you might best express that in R's documentation.
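One way to express that warning block in roxygen without a thematic break is roxygen2's `@section` tag, which renders as a proper Rd section instead of relying on `----` separators. This is a sketch; only the wording is taken from the generated file:

```r
#' @section Warning:
#' We recommend you protect the URLs in the EXTERNAL_LINKS.
#'
#' When using the EXTERNAL_LINKS disposition, a short-lived pre-signed URL is
#' generated, which the client can use to download the result chunk directly
#' from cloud storage. As the short-lived credential is embedded in a
#' pre-signed URL, this URL should be protected.
#'
#' Since pre-signed URLs are generated with embedded temporary credentials,
#' you need to remove the authorization header from the fetch requests.
```

Top-level headings like `**Release status**` could similarly become `@section Release status:` rather than bold text.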
This is really easy to do, so I think it's worth it. You can get a basic website up and running by calling usethis::use_pkgdown_github_pages().
Hi Team,
I need to build a Shiny app where the user is authenticated against Databricks before fetching data. I read that authentication is not yet supported by this repo. Are you planning to add it, or is there an alternative way to implement this in R? Any feedback is appreciated.
There is a common syntax error across multiple functions:
try(list(, 1))
#> Error in list(, 1) : argument 1 is empty
try(list(, a = 1))
#> Error in list(, a = 1) : argument 1 is empty
Created on 2024-03-22 with reprex v2.1.0
A few examples:
Lines 17 to 20 in 5e37300
Lines 65 to 69 in 5e37300
Currently, these can be found in this search:
https://github.com/search?q=repo%3Adatabrickslabs%2Fdatabricks-sdk-r%20%22list(%2C%20%22&type=code
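The fix is mechanical: an empty leading argument in a `list()` call is rejected at evaluation time, so the generator just needs to stop emitting the dangling comma. A minimal demonstration:

```r
# The generated code contains calls of the form list(, a = 1).
# R parses this, but evaluation fails with "argument 1 is empty".
bad <- try(list(, a = 1), silent = TRUE)
inherits(bad, "try-error")
#> [1] TRUE

# Corrected: simply omit the empty slot.
fixed <- list(a = 1)
fixed$a
#> [1] 1
```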
If I want to get the host URL for a workspace, I have to do some error-prone string manipulation, for example (see the lines between START HERE and END HERE):
require(databricks)
client <- DatabricksClient()
response <- clustersCreate(
client = client,
cluster_name = "my-cluster",
spark_version = "12.2.x-scala2.12",
node_type_id = "i3.xlarge",
autotermination_minutes = 15,
num_workers = 1
)
# ##########
# START HERE
# ##########
# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]
# Make sure the workspace URL ends with a forward slash.
if (!endsWith(host, "/")) {
  host <- paste0(host, "/")
}
# ########
# END HERE
# ########
print(paste(
"View the cluster at ",
host,
"#setting/clusters/",
response$cluster_id,
"/configuration",
sep = "")
)
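The nested `strsplit()` calls above could at least be collapsed into a single regex. This is a hypothetical helper, and the `host=` key/value format of the debug string is an assumption taken from the snippet above:

```r
# Hypothetical helper: extract the workspace host from the client debug
# string with one regex instead of nested strsplit() calls.
host_from_debug <- function(debug_string) {
  # Capture everything between "host=" and the next comma (assumed format).
  host <- sub(".*host=([^,]+).*", "\\1", debug_string)
  # Make sure the workspace URL ends with a forward slash.
  if (!endsWith(host, "/")) host <- paste0(host, "/")
  host
}

host_from_debug("auth=pat, host=https://dbc-12345.cloud.databricks.com")
#> [1] "https://dbc-12345.cloud.databricks.com/"
```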
Ideally, I'd like to see something more along the lines of this (see the line that ends with # <-- DO THIS INSTEAD):
require(databricks)
client <- DatabricksClient()
response <- clustersCreate(
client = client,
cluster_name = "my-cluster",
spark_version = "12.2.x-scala2.12",
node_type_id = "i3.xlarge",
autotermination_minutes = 15,
num_workers = 1
)
print(paste(
"View the cluster at ",
client$host, # <-- DO THIS INSTEAD
"#setting/clusters/",
response$cluster_id,
"/configuration",
sep = "")
)
Also, similar to the other Databricks SDKs, I'd like to be able to get other client settings, such as:
client$account_id
client$auth_type
client$azure_* (multiple settings)
client$client_* (multiple settings)
client$config_file
client$debug_headers
client$debug_truncate_bytes
client$google_* (multiple settings)
client$host
client$http_timeout_seconds
client$password
client$profile
client$rate_limit
client$retry_timeout_seconds
client$token
client$username
There might be a few more that I missed.
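One way such accessors could work is to store the resolved configuration in the client object and dispatch `$` on it. This is a hypothetical sketch, not the actual databricks-sdk-r implementation; the constructor and field names here are assumptions modeled on the other SDKs:

```r
# Hypothetical client: resolved config kept in a named list, exposed
# read-only through a $ method on the class.
new_client <- function(host, token = NULL, auth_type = "pat") {
  cfg <- list(host = host, token = token, auth_type = auth_type)
  structure(list(config = cfg), class = "DatabricksClient")
}

`$.DatabricksClient` <- function(x, name) {
  # unclass() avoids re-dispatching this method recursively.
  unclass(x)$config[[name]]
}

cl <- new_client(host = "https://dbc-12345.cloud.databricks.com/")
cl$host
#> [1] "https://dbc-12345.cloud.databricks.com/"
```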
The jobsRunNow function throws the following error, even though the job run completes successfully:
jobsRunNow(job_id=1234, client=client)
Error : C stack usage 15941956 is too close to the limit
Error: C stack usage 15941956 is too close to the limit
Error: C stack usage 15941956 is too close to the limit
dplyr is a user-facing package, so it is fairly dependency-heavy in order to provide the whole toolkit to users. That makes it a poor dependency for an SDK, which wants to stay low-level. Fortunately, it should be pretty easy to swap dplyr for vctrs, its low-level equivalent, replacing dplyr::bind_rows() with vctrs::vec_rbind().
httr is mostly in maintenance mode, so I'd recommend using httr2 instead. It includes nice features for automatically retrying on failure, e.g. https://httr2.r-lib.org/reference/req_retry.html.
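A sketch of what a request could look like with httr2's retry support; the workspace URL and endpoint path here are illustrative, not taken from the SDK:

```r
library(httr2)

# Build a Databricks REST request that retries transient failures.
# req_retry() handles backoff and respects Retry-After headers.
req <- request("https://dbc-12345.cloud.databricks.com") |>
  req_url_path("/api/2.0/clusters/list") |>
  req_auth_bearer_token(Sys.getenv("DATABRICKS_TOKEN")) |>
  req_retry(max_tries = 3, backoff = ~ 2^.x)

# resp <- req_perform(req)  # performs the call with retries
```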
Looking at the following example from the readme, I'm guessing your native language has nested namespaces 😄
library(dplyr)
running <- databricks::clusters$list() %>% filter(state == 'RUNNING')
context <- databricks::command_execution$create(cluster_id=running$cluster_id, language='python')
res <- databricks::command_execution$execute(cluster_id=running$cluster_id, context_id=context$id, language='sql', command='show tables')
You can make it a bit more idiomatic just by attaching databricks:
library(dplyr)
library(databricks)
running <- clusters$list() %>% filter(state == 'RUNNING')
context <- command_execution$create(cluster_id=running$cluster_id, language='python')
res <- command_execution$execute(cluster_id=running$cluster_id, context_id=context$id, language='sql', command='show tables')
But I think it would be even more R-like if you eliminated the intermediate lists in favour of a naming convention (R is a bit more like C in this sense):
library(dplyr)
library(databricks)
running <- clusters_list() %>% filter(state == 'RUNNING')
context <- command_execution_create(cluster_id=running$cluster_id, language='python')
res <- command_execution_execute(cluster_id=running$cluster_id, context_id=context$id, language='sql', command='show tables')
Fortunately it looks like this will be pretty easy to do since you're already generating these functions then wrapping them in a list 😄
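The flattening step could be a small loop over the generated lists. In this sketch, `command_execution` is a stand-in for one of the generated lists of closures, and `flatten_api` is a hypothetical helper, not part of the SDK:

```r
# Stand-in for a generated API list of closures.
command_execution <- list(
  create = function(cluster_id, language) list(id = "ctx-1"),
  execute = function(cluster_id, context_id, language, command) "ok"
)

# Bind each element as a flat prefix_name function in the caller's
# environment, e.g. command_execution$create -> command_execution_create.
flatten_api <- function(prefix, api, envir = parent.frame()) {
  for (nm in names(api)) {
    assign(paste0(prefix, "_", nm), api[[nm]], envir = envir)
  }
}

flatten_api("command_execution", command_execution)
command_execution_create(cluster_id = "c1", language = "python")$id
#> [1] "ctx-1"
```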
The default behavior of jobsRunNow is to block the console from running any commands. This is inconvenient if a user wants to kick off a remote job and then continue to use the console for other work.
In addition, the cli_reporter seems to be broken, as nothing is printed to the console during job runs.
Suggested change: have jobsRunNow not tie up the R console and simply return the active run ID.
Does the R SDK support using a DATABRICKS_HOST without https://?
When I try to connect to a Databricks instance without the https://, I get an error from curl::curl_fetch_memory(), but when I add the https:// I am able to connect and list clusters. Here are some example commands:
> Sys.setenv(DATABRICKS_HOST="dbc-12345.cloud.databricks.com")
> client <- DatabricksClient()
> clustersList(client)[, "cluster_name"]
Error in curl::curl_fetch_memory(url, handle = handle) :
Bad URL, colon is first character
> Sys.setenv(DATABRICKS_HOST="https://dbc-12345.cloud.databricks.com")
> client <- DatabricksClient()
> clustersList(client)[, "cluster_name"]
[1] "Cluster A"
If this is not currently supported, can support for it be added? A number of the other Databricks libraries support setting DATABRICKS_HOST both with and without https://.
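The normalisation the other SDKs do can be sketched in a few lines of base R; `normalize_host` is a hypothetical helper name:

```r
# Accept DATABRICKS_HOST with or without the scheme: prepend https://
# when missing and strip any trailing slashes.
normalize_host <- function(host) {
  if (!grepl("^https?://", host)) host <- paste0("https://", host)
  sub("/+$", "", host)
}

normalize_host("dbc-12345.cloud.databricks.com")
#> [1] "https://dbc-12345.cloud.databricks.com"
```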