dyfanjones / RAthena
Connect R to Athena using Boto3 SDK (DBI Interface)
Home Page: https://dyfanjones.github.io/RAthena/
License: Other
Hey all, just trying to use RAthena for the first time and keep running into issues around Error in py_call_impl(callable, dots$args, dots$keywords): UnknownServiceError: Unknown service: 'glue'.
I'll dig in and do a bit of debugging but I wanted to go ahead and register this issue.
Reprex below:
library(DBI)
library(RAthena)
library(reticulate)
use_condaenv("RAthena")
con <- dbConnect(RAthena::athena(), s3_staging_dir = 's3://my_test_bucket_98b/athena-temp/')
#> Warning: Python '/Users/jal/Library/r-miniconda/envs/RAthena/bin/python' was
#> requested but '/Users/jal/Library/r-miniconda/envs/r-reticulate/bin/python' was
#> loaded instead (see reticulate::py_config() for more information)
#> Error in py_call_impl(callable, dots$args, dots$keywords): UnknownServiceError: Unknown service: 'glue'. Valid service names are: acm, apigateway, application-autoscaling, appstream, athena, autoscaling, batch, budgets, clouddirectory, cloudformation, cloudfront, cloudhsm, cloudsearch, cloudsearchdomain, cloudtrail, cloudwatch, codebuild, codecommit, codedeploy, codepipeline, codestar, cognito-identity, cognito-idp, cognito-sync, config, cur, datapipeline, dax, devicefarm, directconnect, discovery, dms, ds, dynamodb, dynamodbstreams, ec2, ecr, ecs, efs, elasticache, elasticbeanstalk, elastictranscoder, elb, elbv2, emr, es, events, firehose, gamelift, glacier, greengrass, health, iam, importexport, inspector, iot, iot-data, kinesis, kinesisanalytics, kms, lambda, lex-models, lex-runtime, lightsail, logs, machinelearning, marketplace-entitlement, marketplacecommerceanalytics, meteringmarketplace, mturk, opsworks, opsworkscm, organizations, pinpoint, polly, rds, redshift, rekognition, resourcegroupstaggingapi, route53, route53domains, s3, sdb, servicecatalog, ses, shield, sms, snowball, sns, sqs, ssm, stepfunctions, storagegateway, sts, support, swf, waf, waf-regional, workdocs, workspaces, xray
#>
#> Detailed traceback:
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/boto3/session.py", line 263, in client
#> aws_session_token=aws_session_token, config=config)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/session.py", line 836, in create_client
#> client_config=config, api_version=api_version)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/client.py", line 64, in create_client
#> service_model = self._load_service_model(service_name, api_version)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/client.py", line 96, in _load_service_model
#> api_version=api_version)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/loaders.py", line 132, in _wrapper
#> data = func(self, *args, **kwargs)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/loaders.py", line 378, in load_service_model
#> known_service_names=', '.join(sorted(known_services)))
Created on 2021-11-19 by the reprex package (v0.3.0)
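A plausible root cause, hinted at by the warning in the reprex: reticulate loaded the 'r-reticulate' environment, whose older boto3/botocore predates the Glue service, instead of the requested 'RAthena' environment. A hedged workaround sketch (the interpreter path is taken from the warning above):
# Hedged sketch: pin the Python interpreter in a fresh R session *before*
# reticulate initialises, so the RAthena env (with a boto3 recent enough to
# know the 'glue' service) is the one that gets loaded.
Sys.setenv(RETICULATE_PYTHON = "/Users/jal/Library/r-miniconda/envs/RAthena/bin/python")
library(reticulate)
py_config()  # verify the RAthena interpreter was picked up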
dbConnect() returns:
Error: Could not connect to the endpoint URL: "https://sts.c.amazonaws.com/"
nslookup and ping on this URL fail as well.
R runs on an instance in AWS. Everything worked two to five days ago.
nslookup and ping on https://sts.amazonaws.com do work, though, meaning my instance is connected.
Sorry, it's not possible to provide you with a reproducible example.
Example:
library(RAthena)
library(DBI)
con <- RAthena::dbConnect(RAthena::athena(),s3_staging_dir = 's3://bucketnamepath',role_arn = 'role:arn')
Error: Could not connect to the endpoint URL: "https://sts.c.amazonaws.com/"
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────
setting value
version R version 3.6.3 (2020-02-29)
os Ubuntu 19.10
system x86_64, linux-gnu
ui RStudio
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz Etc/UTC
date 2020-04-29
─ Packages ──────────────────────────────────────────────────────
! package * version date lib source
P assertthat 0.2.1 2019-03-21 [?] CRAN (R 3.6.3)
P backports 1.1.6 2020-04-05 [?] CRAN (R 3.6.3)
P callr 3.4.3 2020-03-28 [?] CRAN (R 3.6.3)
P cli 2.0.2 2020-02-28 [?] CRAN (R 3.6.3)
P crayon 1.3.4 2017-09-16 [?] CRAN (R 3.6.3)
P data.table 1.12.8 2019-12-09 [?] CRAN (R 3.6.3)
P DBI * 1.1.0 2019-12-15 [?] CRAN (R 3.6.3)
P desc 1.2.0 2018-05-01 [?] CRAN (R 3.6.3)
P devtools * 2.3.0 2020-04-10 [?] CRAN (R 3.6.3)
P digest 0.6.25 2020-02-23 [?] CRAN (R 3.6.3)
P ellipsis 0.3.0 2019-09-20 [?] CRAN (R 3.6.3)
P fansi 0.4.1 2020-01-08 [?] CRAN (R 3.6.3)
P fs 1.4.1 2020-04-04 [?] CRAN (R 3.6.3)
P glue 1.4.0 2020-04-03 [?] CRAN (R 3.6.3)
P jsonlite 1.6.1 2020-02-02 [?] CRAN (R 3.6.3)
P lattice 0.20-41 2020-04-02 [?] CRAN (R 3.6.3)
P magick 2.3 2020-01-24 [?] CRAN (R 3.6.3)
P magrittr 1.5 2014-11-22 [?] CRAN (R 3.6.3)
P Matrix 1.2-18 2019-11-27 [?] CRAN (R 3.6.1)
P memoise 1.1.0 2017-04-21 [?] CRAN (R 3.6.3)
P pkgbuild 1.0.7 2020-04-25 [?] CRAN (R 3.6.3)
P pkgload 1.0.2 2018-10-29 [?] CRAN (R 3.6.3)
P prettyunits 1.1.1 2020-01-24 [?] CRAN (R 3.6.3)
P processx 3.4.2 2020-02-09 [?] CRAN (R 3.6.3)
P ps 1.3.2 2020-02-13 [?] CRAN (R 3.6.3)
P R6 2.4.1 2019-11-12 [?] CRAN (R 3.6.3)
P RAthena * 1.8.0 2020-03-17 [?] CRAN (R 3.6.3)
P Rcpp 1.0.4.6 2020-04-09 [?] CRAN (R 3.6.3)
P remotes 2.1.1 2020-02-15 [?] CRAN (R 3.6.3)
renv 0.9.3 2020-02-10 [1] CRAN (R 3.6.3)
P reticulate 1.15 2020-04-02 [?] CRAN (R 3.6.3)
P rlang 0.4.5 2020-03-01 [?] CRAN (R 3.6.3)
P rprojroot 1.3-2 2018-01-03 [?] CRAN (R 3.6.3)
P rstudioapi 0.11 2020-02-07 [?] CRAN (R 3.6.3)
P sessioninfo 1.1.1 2018-11-05 [?] CRAN (R 3.6.3)
P testthat 2.3.2 2020-03-02 [?] CRAN (R 3.6.3)
P usethis * 1.6.1 2020-04-29 [?] CRAN (R 3.6.3)
P uuid 0.1-4 2020-02-26 [?] CRAN (R 3.6.3)
P withr 2.2.0 2020-04-20 [?] CRAN (R 3.6.3)
[1] /home/ubuntu/efs/project/usagebox-potential-tdm-buyers/renv/library/R-3.6/x86_64-pc-linux-gnu
[2] /tmp/RtmpmJZSnq/renv-system-library
P ── Loaded and on-disk path mismatch.
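For anyone debugging the endpoint error above: "sts.c.amazonaws.com" looks like a malformed region was interpolated into the STS endpoint. A hedged sketch of a workaround, passing the region explicitly ('eu-west-1' is an assumption; use your instance's actual region):
# Hedged sketch: build the STS/Athena endpoints from an explicit region
# rather than whatever mangled value is being picked up from the environment.
con <- RAthena::dbConnect(RAthena::athena(),
                          s3_staging_dir = 's3://bucketnamepath',
                          role_arn = 'role:arn',
                          region_name = 'eu-west-1')  # assumption: your real region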
Rhub check failed with the following error message:
checking package dependencies ... ERROR
Package required but not available: 'data.table'
Package suggested but not available: 'dplyr'
The suggested packages are required for a complete check.
Checking can be attempted without them by setting the environment
variable _R_CHECK_FORCE_SUGGESTS_ to a false value.
See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.
I'm sorry for bombarding you with all the issues. In any case, this one is more of a feature request.
The thing is, unlike most other databases, Athena does not interpret ISO date strings as dates, so a query such as:
select * from table where date_col <= '2019-12-10'
fails.
To make it work, one needs to explicitly cast the date-string to a date. That is, this works:
select * from table where date_col <= date'2019-12-10'
To make things work on the R side when using dplyr, something like this needs to be used (with 'today' an R variable of class Date):
... %>% filter(date_col <= as.Date(today))
So, one needs to cast the Date as a Date for the generated SQL to be correct.
This is something one can live with, so no rush with this one, but it would be a big convenience if RAthena detected when a Date variable is used and added the date prefix automatically, so that code such as
... %>% filter(date_col <= today)
would work similar to most other databases.
Unfortunately I can't figure out how to do this, so I'm just throwing it here in case some solution comes to mind. In principle, I think, overloading the method escape.Date from dbplyr could help here, but I'm drawing a blank on how to accomplish this.
I want to emphasize that this is more of an issue/feature of Athena/Presto itself, but this kind of functionality in RAthena would hide things from the user with no downside I can immediately think of.
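A minimal sketch of the overloading idea, assuming dbplyr dispatches literal escaping through its escape() generic (the method registration details are an assumption, not RAthena's confirmed API):
# Hedged sketch: render R Date values as Athena date literals
# (date 'YYYY-MM-DD') when dbplyr escapes them into SQL.
library(dbplyr)
escape.Date <- function(x, parens = FALSE, collapse = ", ", con = NULL) {
  literals <- paste0("date ", sql_quote(as.character(x), "'"))
  sql_vector(literals, parens, collapse, con = con)
}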
It would be nice to have this package included in the Anaconda distribution: https://docs.conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-r-pkgs.html
Hi, when running the following in Python, I get the credentials:
import boto3
session = boto3.Session(profile_name = 'default')
credentials = session.get_credentials().get_frozen_credentials()
credentials
But when trying to run the following query in R
library(DBI)
library(RAthena)
con <- dbConnect(RAthena::athena(),
region_name = 'us-east-1',
profile_name = 'default',
s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
I get
Error: The config profile (default) could not be found
Traceback:
- dbConnect(RAthena::athena(), region_name = "us-east-1", profile_name = "default",
.   s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
- dbConnect(RAthena::athena(), region_name = "us-east-1", profile_name = "default",
.   s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
- .local(drv, ...)
- AthenaConnection(aws_access_key_id = aws_access_key_id, aws_secret_access_key = aws_secret_access_key,
.   aws_session_token = aws_session_token, schema_name = schema_name,
.   work_group = work_group, poll_interval = poll_interval, encryption_option = encryption_option,
.   kms_key = kms_key, s3_staging_dir = s3_staging_dir, region_name = region_name,
.   botocore_session = botocore_session, profile_name = profile_name,
.   aws_expiration = aws_expiration, ...)
- tryCatch(ptr <- boto$Session(aws_access_key_id = aws_access_key_id,
.   aws_secret_access_key = aws_secret_access_key, aws_session_token = aws_session_token,
.   region_name = region_name, botocore_session = botocore_session,
.   profile_name = profile_name, ...), error = function(e) py_error(e))
- tryCatchList(expr, classes, parentenv, handlers)
- tryCatchOne(expr, names, parentenv, handlers[[1L]])
- value[3L]
- py_error(e)
- stop(py_err$value, call. = F)
If I don't specify the profile, I get
Error: Unable to locate credentials
Traceback:
- dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
- dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
- AthenaResult(conn = conn, statement = statement, s3_staging_dir = s3_staging_dir)
- tryCatch(response <- list(QueryExecutionId = do.call(Athena$start_query_execution,
.   Request, quote = T)$QueryExecutionId), error = function(e) py_error(e))
- tryCatchList(expr, classes, parentenv, handlers)
- tryCatchOne(expr, names, parentenv, handlers[[1L]])
- value[3L]
- py_error(e)
- stop(py_err$value, call. = F)
Just a note: I'm able to run this with AWR.Athena::Athena() successfully (it finds the credentials), e.g.
con <- dbConnect(AWR.Athena::Athena(),
region = 'us-east-1',
S3OutputLocation = paste(s3_bucket, s3_output, sep = "/"),
Schema = 'default')
dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
returns
<JDBCResult>
Thanks
aws configure list output:
Session Info:
Session info ---------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz Asia/Jerusalem
date 2020-03-26
Packages -------------------------------------------------------------------
! package * version date lib source
arrow * 0.16.0.2 2020-02-14 [1] CRAN (R 3.5.3)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.3)
AWR.Athena * 2.0.7-0 2019-07-08 [1] CRAN (R 3.5.3)
aws.s3 * 0.3.20 2020-03-11 [1] CRAN (R 3.5.3)
aws.signature 0.5.2 2019-08-08 [1] CRAN (R 3.5.3)
backports 1.1.5 2019-10-02 [1] CRAN (R 3.5.3)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.5.2)
bit * 1.1-15.2 2020-02-10 [1] CRAN (R 3.5.3)
bit64 * 0.9-7 2017-05-08 [1] CRAN (R 3.5.3)
callr 3.4.1 2020-01-24 [1] CRAN (R 3.5.3)
cli 2.0.0 2019-12-09 [1] CRAN (R 3.5.3)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.3)
curl 4.3 2019-12-02 [1] CRAN (R 3.5.3)
data.table * 1.12.8 2019-12-09 [1] CRAN (R 3.5.3)
DBI * 1.1.0.9000 2020-03-04 [1] Github (r-dbi/DBI@500edd3)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.3)
devtools 2.2.2 2020-02-17 [1] CRAN (R 3.5.3)
digest 0.6.23 2019-11-23 [1] CRAN (R 3.5.3)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.5.3)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.5.3)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.3)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.5.3)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.3)
htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.5.3)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.5.3)
IRdisplay 0.7.0 2018-11-29 [1] CRAN (R 3.5.3)
IRkernel 1.1 2019-12-06 [1] CRAN (R 3.5.3)
jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.5.3)
magrittr * 1.5 2014-11-22 [1] CRAN (R 3.5.3)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.3)
pbdZMQ 0.3-3 2018-05-05 [1] CRAN (R 3.5.2)
pillar 1.4.3 2019-12-20 [1] CRAN (R 3.5.3)
pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.5.3)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.3)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.5.3)
processx 3.4.2 2020-02-09 [1] CRAN (R 3.5.3)
ps 1.3.1 2020-02-12 [1] CRAN (R 3.5.1)
purrr 0.3.3 2019-10-18 [1] CRAN (R 3.5.3)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.5.3)
Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.5.3)
remotes 2.1.1 2020-02-15 [1] CRAN (R 3.5.3)
repr 1.0.2 2019-12-16 [1] CRAN (R 3.5.3)
D rJava 0.9-11 2019-03-29 [1] CRAN (R 3.5.3)
RJDBC 0.2-8 2020-03-10 [1] CRAN (R 3.5.3)
rlang 0.4.5 2020-03-01 [1] CRAN (R 3.5.3)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.3)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.3)
stringi * 1.4.6 2020-02-17 [1] CRAN (R 3.5.3)
testthat 2.3.1 2019-12-01 [1] CRAN (R 3.5.3)
tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.5.3)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.5.3)
uuid 0.1-2 2015-07-28 [1] CRAN (R 3.5.2)
vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.5.3)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.3)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.5.3)
Hi there,
Attempting to set up a connection from R to Athena, but am getting the error Error: [Errno 22] Invalid argument when trying to connect. I'm invoking the function as follows:
at_conn <- DBI::dbConnect(RAthena::athena()
, aws_access_key_id = aws_access_key_id
, aws_secret_access_key = aws_secret_access_key
, s3_staging_dir = dir
, region_name = region)
And have installed boto3 via pip, installed miniconda and run the install_boto() function.
My "dir" argument is of the form "s3://location-information-39492/r/". Region is simply "us-eas-1". The access keys work using SQL workbench. Not really sure what is causing this issue, but from searches, it seems to have something to do with reserved keywords for Windows potentially?
Appreciate any help here.
Since data.table::fwrite tries to handle special characters in its own way, that is, escaping field separators and quote characters etc., and quoting strings when necessary, things get weird when Athena tries to deal with such source files.
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
withquote <-
data.frame(col1 = c("abcd,", "15 \"", "\\n"), col2 = c(1,2,3), col3 = Sys.time(), col4 = Sys.Date())
dbWriteTable(con, "temp.quote_test", withquote)
tbl(con, in_schema("temp", "quote_test"))
(The datetimes and dates are there for later.)
Dealing with a comma in the data can be done using the tsv file type. However, a tab would then cause problems.
The quote character will be problematic in either case. The default behaviour of fwrite is to '"double" (default, same as write.csv), in which case the double quote is doubled with another one.' and then the whole entry is enclosed in another set of quotes, which Athena has no idea how to deal with. The conditional quoting also takes place when a field has a comma in csv or a tab in tsv.
My (wholly inelegant) solution has been to use
, quote=F, row.names=F, col.names=F, sep="|", na=""
when writing the file, and telling Athena to use "|" as the separator. Additionally, I've simply removed all "|"s from the data. The point is that the pipe character is not that common in my data.
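For reference, a minimal sketch of that workaround (file and column names are purely illustrative):
library(data.table)
dt <- data.table(col1 = c("abcd,", "15 \"", "\\n"), col2 = 1:3)
dt[, col1 := gsub("|", "", col1, fixed = TRUE)]  # strip the separator from the data itself
fwrite(dt, "upload.txt", quote = FALSE, row.names = FALSE, col.names = FALSE,
       sep = "|", na = "")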
This is obviously not at all optimal. I think the nicest thing would be to have the file with escaped special characters, but without the enclosing quotes. However, I don't think fwrite can accomplish this.
The other solution would be to enclose everything in double quotes, using quote=TRUE, qmethod='escape', col.names=F, but then one needs to use another SerDe in Athena. Now, parsing dates and datetimes gets complicated and I have not been able to get those to function.
The way around this is to
create external table temp.col_test (col1 string, col2 string, col3 string, col4 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
LOCATION 's3://bucket/path_to_data/'
create table col_test2 as (SELECT
try_cast(col1 as varchar) as col1,
try_cast(col2 as integer) as col2,
try_cast(col3 as timestamp) as col3,
try_cast(col4 as date) as col4
FROM "temp"."col_test")
or if one wishes to append to an existing table:
insert into temp.col_test2 (SELECT
try_cast(col1 as varchar) as col1,
try_cast(col2 as integer) as col2,
try_cast(col3 as timestamp) as col3,
try_cast(col4 as date) as col4
FROM "temp"."col_test")
I know we are back at using Athena to create the final tables, but this is the only way so far I've been able to get everything to work at the same time. Well, an unescaped new line will still mess things up...
I know this is a bit of a horror story, but on the other hand these are situations that at least I have had to deal with in 'real data'.
The new SQL translations seem to have an error with integers. Using as.integer results in
Error: SYNTAX_ERROR: line 1:146: Unknown type: int
Example:
tbl(con, "some_table") %>%
mutate(ci = '2019') %>%
mutate(ci = as.integer(ci))
This is fixed by replacing "INT" with "INTEGER" at
Line 52 in 45fc753
Also, I think the line
Line 6 in 45fc753
Parquet snappy compression:
https://arrow.apache.org/docs/r/reference/write_parquet.html
To reduce cost in AWS Athena queries, compression is required. From my current understanding, snappy compression for parquet is the standard. While gzip compression might give smaller files (better for cost), snappy compression is better for performance. This should give the best of both worlds: compressed files that can still be queried effectively.
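For reference, writing snappy-compressed parquet with arrow looks like this (snappy is arrow's default codec when the compression libraries are available):
library(arrow)
# explicit about the codec, even though snappy is the default
write_parquet(mtcars, "mtcars.snappy.parquet", compression = "snappy")
# gzip trades read/write speed for smaller files
write_parquet(mtcars, "mtcars.gz.parquet", compression = "gzip")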
Hello!
So I accidentally typed (n) while running RAthena::install_boto(), e.g.:
RAthena::install_boto()
No non-system installation of Python could be found.
Would you like to download and install Miniconda?
Miniconda is an open source environment management system for Python.
See https://docs.conda.io/en/latest/miniconda.html for more details.
Would you like to install Miniconda? [Y/n]: n
Now, I don't see that prompt and I cannot install boto:
RAthena::install_boto()
Using virtual environment 'RAthena' ...
/home/volodymyr/.virtualenvs/RAthena/bin/python: No module named pip
Error in strsplit(output, "\s+")[[1]] : subscript out of bounds
In addition: Warning message:
In system2(python, c("-m", "pip", "--version"), stdout = TRUE) :
running command ''/home/volodymyr/.virtualenvs/RAthena/bin/python' -m pip --version' had status 1
What would be my next steps in order to install boto and use RAthena?
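A hedged sketch of a likely recovery path: declining the Miniconda prompt left a half-built 'RAthena' virtualenv without pip, so removing it and re-running the installer should let it be rebuilt cleanly:
# Hedged sketch: delete the broken env, then reinstall.
reticulate::virtualenv_remove("RAthena")  # removes ~/.virtualenvs/RAthena (asks for confirmation)
RAthena::install_boto()                   # recreates the env and installs boto3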
I think the behaviour of adding + 1 to the function in Athena could raise some warnings; even though it matches the R output, it is quite strange to debug locally.
Thanks for the amazing pkg
Executing "drop table" in Athena leaves the data in place.
However, sometimes the data in S3 is useless without the table definition (such as when generating temp-tables with ... %>% compute()), and it would be convenient to delete data with the same call.
The following definition of dbRemoveTable would accomplish this without interfering with 'normal' usage. There are some overridable precautions built in to avoid accidental data loss (for example, if a table definition points to the root of a bucket, everything there would get deleted).
What do you think?
#' @rdname dbRemoveTable
#' @export
setMethod(
"dbRemoveTable", c("AthenaConnection", "character"),
function(conn, name, delete_data = FALSE, no_confirm = FALSE, ...) {
if (!dbIsValid(conn)) {stop("Connection already closed.", call. = FALSE)}
if(!grepl("\\.", name)) name <- paste(conn@info$dbms.name, name, sep = ".")
if(delete_data)
{
nobjects <- 1000 # We get only 1000 objects at a time
while(nobjects >= 1000) {
if (grepl("\\.", name)) {
dbms.name <- gsub("\\..*", "" , name)
Table <- gsub(".*\\.", "" , name)
} else {
dbms.name <- conn@info$dbms.name
Table <- name}
glue <- conn@ptr$client("glue")
tryCatch(
s3_path <- split_s3_uri(glue$get_table(DatabaseName = dbms.name,
Name = Table)$Table$StorageDescriptor$Location),
error = function(e) py_error(e))
s3 <- conn@ptr$client("s3")
tryCatch(
objects <- s3$list_objects(Bucket=s3_path$bucket, Prefix=s3_path$key),
error = function(e) py_error(e))
nobjects <- length(objects$Contents)
all_keys <- sapply(objects$Contents, function(x) x$Key)
message(paste0("Info: The following S3 objects will be deleted:\n", paste0(paste0("s3://", s3_path$bucket, all_keys), collapse="\n")))
if(!no_confirm) {
confirm <- readline(prompt = "Delete files (y/n)?: ")
if(confirm != "y") {
message("Info: Table deletion aborted.")
return(NULL)
}
}
# This could be done with s3$delete_objects in one command, but cannot figure out the syntax right now...
for(c_object in all_keys){
tryCatch(
s3$delete_object(Bucket=s3_path$bucket, Key=c_object),
error = function(e) py_error(e))
}
}
}
res <- dbExecute(conn, paste("DROP TABLE ", name, ";"))
dbClearResult(res)
if(!delete_data) message("Info: Only Athena table has been removed.")
invisible(TRUE)
})
Prepare for release:
Submit to CRAN:
Wait for CRAN...
RAthena and noctua have both gone through some significant updates since the last publication.
Datatypes are correctly passed to Athena through the AthenaDataType function:
AthenaDataType <-
function(fields, ...) {
switch(
class(fields)[1],
logical = "BOOLEAN",
integer = "INT",
integer64 = "BIGINT",
numeric = "DOUBLE",
double = "DOUBLE",
factor = "STRING",
character = "STRING",
list = "STRING",
Date = "DATE",
POSIXct = "TIMESTAMP",
stop("Unknown class ", paste(class(fields), collapse = "/"), call. = FALSE)
)
}
This shouldn't be a problem if the user has data.table already installed on their machine.
data.table integer64 documentation:
integer64: "integer64" (default) reads columns detected as containing integers larger than 2^31 as type bit64::integer64. Alternatively, "double"|"numeric" reads as base::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".
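A quick illustration of that documented behaviour (assumes bit64 is installed):
library(data.table)
# a value above 2^31 is read as bit64::integer64 rather than overflowing
dt <- fread("x\n9007199254740993\n", integer64 = "integer64")
class(dt$x)
#> [1] "integer64"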
Doing something like the following:
tbl(con, "table1") %>%
compute(name = "test.table")
will result in an error, such as:
Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.test.table does not exist
The command does produce the table 'table' in database 'test', but compute tries to reference a table named 'test.table' in the 'default' database (or whichever you have set when connecting to Athena).
pyathena has a really good method to reduce cost by caching query ids. It would be good to have this feature in RAthena and noctua.
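A minimal sketch of the idea (all names here are hypothetical, not RAthena's API): keep a local map from a query's SQL text to its QueryExecutionId, and reuse the cached execution instead of re-scanning:
# Hypothetical cache sketch: SQL text -> QueryExecutionId.
.query_cache <- new.env(parent = emptyenv())

cache_query_id <- function(sql, id) {
  assign(digest::digest(sql), id, envir = .query_cache)
}

get_cached_query_id <- function(sql) {
  key <- digest::digest(sql)
  if (exists(key, envir = .query_cache)) get(key, envir = .query_cache) else NULL
}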
Prepare for release:
Submit to CRAN:
Wait for CRAN...
Due to recent changes in R 4.0.0, \donttest examples are now run. This causes a problem with most examples, as they require credentials, and results in the R CMD checks failing on Rhub servers. I have reached out to CRAN ([email protected]) asking for advice on how the examples should be set up going forward.
To prevent spamming CRAN with multiple package submission attempts, this release will be slightly delayed depending on the response from CRAN submissions.
Make sure RAthena takes full advantage of the new changes to reticulate.
For extra information please refer to noctua: #25
If a single datetime column is selected, an error is thrown. I didn't have time to dig deeper, but here it is FYI.
Example:
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
df <- data.frame(time_col = Sys.time(), other_col = 1:5)
dbWriteTable(con, name="temp.test6", df)
tbl(con, in_schema("temp", "test6")) %>% select(time_col)
Error in setnames(ans, col.names) :
Can't assign 1 names to a 2 column data.table
In addition: Warning messages:
1: In data.table::fread(File, col.names = names(Type2), colClasses = unname(Type2), :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In data.table::fread(File, col.names = names(Type2), colClasses = unname(Type2), :
The next release is planned for 14/05/2020. This is to prevent the package being released too frequently to CRAN.
If the new features are required before the planned release, please let me know and the release date can be changed.
Hi!
I am experiencing an issue with the library when trying to assume a role during the connection process, but for the same user with pure boto3 and the same credentials everything works fine.
0. Working example with boto3:
session = boto3.Session(profile_name='my-profile')
credentials = session.get_credentials()
current_credentials = credentials.get_frozen_credentials()
client = boto3.client('athena',
aws_access_key_id = current_credentials.access_key,
aws_secret_access_key = current_credentials.secret_key,
aws_session_token = current_credentials.token)
And three non-working examples:
1. Passing the access parameters within connection
> con <- dbConnect(RAthena::athena(),
+ aws_access_key_id = credentials$AccessKeyId,
+ aws_secret_access_key = credentials$SecretAccessKey,
+ s3_staging_dir='s3://my-bucket/tmp/',
+ region_name='eu-west-1')
Error: An error occurred (UnrecognizedClientException) when calling the GetDatabases operation: The security token included in the request is invalid.
2. Over the profile file:
> con <- dbConnect(RAthena::athena(),
+ profile_name = "default",
+ role_arn = "my-role-arn",
+ s3_staging_dir='s3:/my-bucket/tmp/',
+ region_name='eu-west-1')
Error in names(y$TableType) <- y$Name :
attempt to set an attribute on NULL
3. Without an assume role:
> con <- dbConnect(RAthena::athena(),
+ s3_staging_dir='s3:/my-bucket/tmp/',
+ region_name='eu-west-1')
Error in names(y$TableType) <- y$Name :
attempt to set an attribute on NULL
Could you please help? Am I doing something wrong?
Should RAthena and noctua support AWS Athena ML capabilities? ML capabilities are created by SageMaker; Athena can be used to create an SQL wrapper to call SageMaker models.
https://docs.aws.amazon.com/athena/latest/ug/querying-mlmodel.html
jsonlite supports outputting JSON lines format (http://jsonlines.org/). RAthena and noctua should be able to support uploading JSON format to AWS Athena.
JSON DDL format: https://docs.aws.amazon.com/athena/latest/ug/parsing-JSON.html
Similar to the earlier issue with as.integer #42, as.character also produces an error, as 'string' is an unknown type. Here we should use 'varchar'.
Example:
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
tbl(con, "iris")) %>%
mutate(sepal_length = as.character(sepal_length))
Error: SYNTAX_ERROR: line 1:8: Unknown type: string
Surely related again to the differences between Hive DDL and Presto SQL syntax.
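A hedged sketch of the fix, assuming the translations live in a dbplyr sql_translator() (consistent with the fragments referenced in #42):
library(dbplyr)
# Presto/Athena query types differ from Hive DDL: casts need VARCHAR and INTEGER.
athena_scalar <- sql_translator(
  .parent = base_scalar,
  as.character = sql_cast("VARCHAR"),  # not STRING
  as.integer   = sql_cast("INTEGER")   # not INT (see #42)
)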
A second user is having issues with install_boto: https://community.rstudio.com/t/emr-write-data-to-s3/55408/4
Might need to add better documentation to help improve users' experience with the package. Will investigate how to improve the install_boto function.
The package vroom (https://github.com/r-lib/vroom) claims to be:
The fastest delimited reader for R, 1.27 GB/sec.
Benchmark: https://vroom.r-lib.org/articles/benchmarks.html
When passing data to and from AWS Athena, vroom has the potential to increase performance with large files, plus it offers a method to write gzip files in parallel:
vroom_write(mtcars, pipe("pigz > mtcars.tsv.gz"))
Downside: data is generally returned using data.table, and a lot of data wrangling techniques use data.table for speed.
RStudio's environment pane calls object.size() when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so it is recommended you use at least that version.
It would be good to get RAthena to display tables from Athena in RStudio's connections tab: odbc:View.R
- paste/paste0 defaulted to concat_ws (#65)
- data.table dependency needs to be altered to reflect fwrite supporting compression in (>=1.12.4)
- dbWriteTable utilises the existing s3 location when appending
- db_compute takes into account schema and name from a string (#74)
- dbStatistics returns stats from Athena queries (#67)
- RAthena_options checks if the file parser exists before adapting it
- RAthena_options restricts vroom support to (>=1.2.0) due to api changes
- roxygen2 updated from 6.1.1 to 7.0.2
- pkgdown::site documentation rebuilt with new roxygen2
- AWS_ATHENA_WORK_GROUP environment variable support
- default file upload changed from csv to tsv. This is to enable array and json to be pushed to Athena from R
- dbRemoveTable able to delete Athena table s3 files
- sql_translate_env: integer misclassified
- NA handling when uploading to Athena
- pkgdown site documentation
Doing something like
tbl(con, "table) %>% mutate(new_col = paste(col1, col2))
fails with an error about a missing function (concat_ws).
I think we need to add paste (and paste0) to sql_translate_env's Athena-specific translations.
Adding lines:
paste = sql_prefix("CONCAT"),
paste0 = sql_prefix("CONCAT"),
to sql_translator under sql_translate_env.AthenaConnection should do the trick.
Here's another one:
When creating new columns with, e.g., dplyr's mutate, the column names are forced to lower case. This happens silently and deviates from the 'default' behaviour of dplyr (at least with the other DBI-backends I've been using).
With the previously implemented quoting of column names, forcing lower case is not necessary, and could be removed.
Example:
library(dplyr)
a <-
tbl(con, "some_table) %>%
mutate(AaBb = 1) %>%
select(AaBb) %>%
collect
b <- tibble(AaBb = 1:5)
a %>% inner_join(b)
This leads to the error:
Error: `by` required, because the data sources have no common variables
I think this is a bit unexpected.
If I'm not missing something, the 'tolower' part at Line 21 in 45fc753 could be removed.
What do you think?
p.s. One would still get surprises when using copy_to with upper-case column names. Athena seems to accept quoted upper case letters in the DDL, but replaces them with lower case automatically, so this is not something that can be helped in RAthena. Still, I think this is a separate issue.
If, for whatever reason, Apache Arrow can't be installed on the local machine, should AWS Glue be used to convert csv files to parquet format to help with AWS Athena performance?
To be able to upload a flat file saved on the user's computer and register it in AWS Athena without the requirement of first loading it into R.
Support GZIP compression:
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
Use the GZIP compression in Athena for querying Amazon Kinesis Data Firehose logs. Athena and Amazon Kinesis Data Firehose each support different versions of SNAPPY, so GZIP is the only compatible format.
Currently RAthena has a function dbGetPartition which returns a data.table in the default AWS Athena format:
library(DBI)
library(RAthena)
library(data.table)
con <- dbConnect(athena())
test_df2_partitions = dbGetPartition(con, "test_df2")
# partition
# 1: year=2020/month=11/day=17
This format isn't too bad, as it just returns the format from Athena. Would it be useful to reformat this into the following?
get_partitions = function(dt){
dt = dt[, tstrsplit(partition, split = "/")]
partitions = sapply(names(dt), function(x) strsplit(dt[[x]][1], split = "=")[[1]][1])
for (col in names(dt)) set(dt, j=col, value=tstrsplit(dt[[col]], split = "=")[2])
setnames(dt, old = names(dt), new = partitions)
return(dt)
}
get_partitions(test_df2_partitions)
# year month day
# 1: 2020 11 17
The problem is that dbGetPartition has been in the package for some time, and changing it now would break solutions users may have developed.
A query such as:
tbl(con, "table) %>%
filter(date_col == "2020-01-31 51:59:25")
will produce an error, as Athena does not understand a literal of the form date '2020-01-31 51:59:25', but R can cast it to a date even with 'as.Date(x, tryFormats = "%Y-%m-%d")' used for testing in sql_escape_string.AthenaConnection.
This can be fixed with:
# Athena specific S3 method for converting date variables and ISO formatted date strings to date literals
#' @rdname sql_translate_env
#' @export
sql_escape_string.AthenaConnection <- function(con, x) {
all_dates <- all(try(as.Date(x, tryFormats = "%Y-%m-%d"), silent=T) == x) & all(nchar(x) == 10)
if(all_dates & !is.na(all_dates)) {
paste0('DATE ', DBI::dbQuoteString(con, x))
} else {
DBI::dbQuoteString(con, x)
}
}
However, it wouldn't be a bad idea to have similar handling of date times and format those to, e.g., "datetime'2020-01-31 51:59:25'".
I will look into it, but it would be good to at least have the fix above included before pushing to CRAN.
Prepare for release:
Submit to CRAN:
Wait for CRAN...
Hello!
Thanks for your awesome package! I managed to set up dozens of automated scripts which use Athena quite intensively. So, thank you so much!
It's a feature request and/or an idea for future work rather than an issue.
Adding a show_statistics extra param to dbGetQuery(), or an additional function dbGetQueryStatistics(), that uses the QueryExecution['Statistics'] part of the boto3 get_query_execution function's response.
Why might it be helpful? Athena pricing is currently $5 per 1 TB of scanned data, and Athena itself is designed to query huge amounts of data. Query statistics might be helpful for managing operations costs in complex environments.
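A hedged sketch of what that could return, using the boto3 client already held on the connection (the con@ptr$client(...) pattern appears elsewhere in this package; query_id stands in for the id of a previously executed query):
# Hedged sketch: pull the Statistics block for a finished query.
athena <- con@ptr$client("athena")
resp <- athena$get_query_execution(QueryExecutionId = query_id)
resp$QueryExecution$Statistics
# e.g. $DataScannedInBytes and $EngineExecutionTimeInMillis feed cost estimates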
Setting up a connection to an Athena 'table' with just one column with tidyverse's tbl-function fails with the error:
Error in .valueClassTest(ans, "data.frame", "dbFetch") :
invalid value from generic function ‘dbFetch’, class “factor”, expected “data.frame”
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
tttt <- data.frame(a = sample(LETTERS, 5))
copy_to(con, tttt, name = "test")
tbl(con, "test")
The tbl-command gives an error:
Error in .valueClassTest(ans, "data.frame", "dbFetch") :
invalid value from generic function ‘dbFetch’, class “factor”, expected “data.frame”
The error seems to be triggered at line 149 in Result.R:
Line 149 in 460b80e
So maybe an issue with boto3?
In any case, to get things working, I commented out the whole block dealing with the case n >= 0 etc.
Surely this one is here for a reason, so a better solution would be needed, but nothing to offer for that at the moment.
When uploading data to the S3 URI "s3://mybucket/test_df4/" with file name "test_df.csv", the s3_uri is changed to "s3://mybucket/4/".
I wasn't thinking when I merged tidyverse/dbplyr#391 into dbplyr; the methods really should live in the individual backends. I'm going to remove these methods for the next dplyr release so can you please include in your package:
#' @export
sql_escape_date.AthenaConnection <- function(con, x) {
paste0('date ', dbQuoteString(con, as.character(x)))
}
#' @export
sql_escape_datetime.AthenaConnection <- function(con, x) {
x <- strftime(x, "%Y-%m-%d %H:%M:%OS %Z")
paste0('timestamp ', dbQuoteString(con, x))
}
The code comes from OssiLehtinen, so please credit him.
Currently RAthena and noctua support gzip compression when uploading data to S3 and Athena. Is there a better compression algorithm for flat files? From Top 10 Performance Tuning Tips for Amazon Athena:
Algorithm | Splittable? | Compression ratio | Compress + Decompress speed
---|---|---|---
Gzip (DEFLATE) | No | High | Medium
bzip2 | Yes | Very high | Slow
LZO | No | Low | Fast
Snappy | No | Low | Very fast
For Athena, we recommend using either Apache Parquet or Apache ORC, which compress data by default and are splittable. When they are not an option, then try BZip2 or Gzip with an optimal file size.
From this it looks like BZIP2/GZIP are currently recommended. Might need to benchmark the speed of BZip2 and GZIP files when reading from Athena.
If column names in a data.frame contain special characters, such as 'ä' and 'ö', copy_to fails with the error
Error: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:8: no viable alternative at input 'CREATE EXTERNAL'
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
oddcolnames <- data.frame(tämänimi = 1:5)
a_t <- copy_to(con, oddcolnames, "test")
This can be fixed by replacing
Line 360 in 48c01bd
with
paste0("`", field_names, "`", " ", field.types)
Please see DyfanJones/noctua#120 for further details
Possible method for connecting to AWS Athena.
https://developer.amazon.com/docs/login-with-amazon/web-docs.html
Appending
When appending to an already existing table, dbWriteTable will just utilise the s3.location parameter for the S3 location. However, an issue will occur if the backend data in S3 is situated in a different location, i.e.:
Backend s3 location: s3://path/to/bucket1/tbl_data/
s3.location parameter: s3://path/to/bucket2/tbl_data/
To overcome this, the existing S3 location should overwrite the s3.location parameter when the append parameter is set to TRUE.
Backend S3 structure
Currently data is sent to the S3 path:
s3://path/to/bucket/{tbl}/
This is OK when the user only has 1 database/schema for Athena. When a user has 2 databases, i.e.:
dbWriteTable(con, "schema1.tbl", tbl)
dbWriteTable(con, "schema2.tbl", tbl)
the backend will return s3://path/to/bucket/{tbl}/ for both cases. Ideally it should align to the following schema, s3://path/to/bucket/{schema}/{tbl}/, and return:
s3://path/to/bucket/schema1/tbl/
s3://path/to/bucket/schema2/tbl/
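A minimal sketch of the proposed path construction (the helper name is hypothetical):
build_s3_location <- function(bucket, schema, table) {
  # s3://path/to/bucket/{schema}/{tbl}/
  sprintf("%s/%s/%s/", sub("/+$", "", bucket), schema, table)
}
build_s3_location("s3://path/to/bucket", "schema1", "tbl")
#> [1] "s3://path/to/bucket/schema1/tbl/"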