dyfanjones / RAthena
Connect R to Athena using Boto3 SDK (DBI Interface)
Home Page: https://dyfanjones.github.io/RAthena/
License: Other
Hey all, just trying to use RAthena for the first time and keep running into issues around Error in py_call_impl(callable, dots$args, dots$keywords): UnknownServiceError: Unknown service: 'glue'.
I'll dig in and do a bit of debugging but I wanted to go ahead and register this issue.
Reprex below:
library(DBI)
library(RAthena)
library(reticulate)
use_condaenv("RAthena")
con <- dbConnect(RAthena::athena(), s3_staging_dir = 's3://my_test_bucket_98b/athena-temp/')
#> Warning: Python '/Users/jal/Library/r-miniconda/envs/RAthena/bin/python' was
#> requested but '/Users/jal/Library/r-miniconda/envs/r-reticulate/bin/python' was
#> loaded instead (see reticulate::py_config() for more information)
#> Error in py_call_impl(callable, dots$args, dots$keywords): UnknownServiceError: Unknown service: 'glue'. Valid service names are: acm, apigateway, application-autoscaling, appstream, athena, autoscaling, batch, budgets, clouddirectory, cloudformation, cloudfront, cloudhsm, cloudsearch, cloudsearchdomain, cloudtrail, cloudwatch, codebuild, codecommit, codedeploy, codepipeline, codestar, cognito-identity, cognito-idp, cognito-sync, config, cur, datapipeline, dax, devicefarm, directconnect, discovery, dms, ds, dynamodb, dynamodbstreams, ec2, ecr, ecs, efs, elasticache, elasticbeanstalk, elastictranscoder, elb, elbv2, emr, es, events, firehose, gamelift, glacier, greengrass, health, iam, importexport, inspector, iot, iot-data, kinesis, kinesisanalytics, kms, lambda, lex-models, lex-runtime, lightsail, logs, machinelearning, marketplace-entitlement, marketplacecommerceanalytics, meteringmarketplace, mturk, opsworks, opsworkscm, organizations, pinpoint, polly, rds, redshift, rekognition, resourcegroupstaggingapi, route53, route53domains, s3, sdb, servicecatalog, ses, shield, sms, snowball, sns, sqs, ssm, stepfunctions, storagegateway, sts, support, swf, waf, waf-regional, workdocs, workspaces, xray
#>
#> Detailed traceback:
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/boto3/session.py", line 263, in client
#> aws_session_token=aws_session_token, config=config)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/session.py", line 836, in create_client
#> client_config=config, api_version=api_version)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/client.py", line 64, in create_client
#> service_model = self._load_service_model(service_name, api_version)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/client.py", line 96, in _load_service_model
#> api_version=api_version)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/loaders.py", line 132, in _wrapper
#> data = func(self, *args, **kwargs)
#> File "/Users/jal/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/botocore/loaders.py", line 378, in load_service_model
#> known_service_names=', '.join(sorted(known_services)))
Created on 2021-11-19 by the reprex package (v0.3.0)
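A plausible root cause, hinted at by the warning in the reprex: reticulate loaded the 'r-reticulate' environment, whose older boto3/botocore predates the Glue service, instead of the requested 'RAthena' environment. A hedged workaround sketch (the interpreter path is taken from the warning above):
# Hedged sketch: pin the Python interpreter in a fresh R session *before*
# reticulate initialises, so the RAthena env (with a boto3 recent enough to
# know the 'glue' service) is the one that gets loaded.
Sys.setenv(RETICULATE_PYTHON = "/Users/jal/Library/r-miniconda/envs/RAthena/bin/python")
library(reticulate)
py_config()  # verify the RAthena interpreter was picked up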
dbConnect() returns:
Error: Could not connect to the endpoint URL: "https://sts.c.amazonaws.com/"
nslookup and ping on this URL fail as well.
R runs on an instance in AWS. Everything worked two to five days ago.
nslookup and ping on https://sts.amazonaws.com do work, though, meaning my instance is connected.
Sorry, it's not possible to provide you with a reproducible example.
Example:
library(RAthena)
library(DBI)
con <- RAthena::dbConnect(RAthena::athena(),s3_staging_dir = 's3://bucketnamepath',role_arn = 'role:arn')
Error: Could not connect to the endpoint URL: "https://sts.c.amazonaws.com/"
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────
setting value
version R version 3.6.3 (2020-02-29)
os Ubuntu 19.10
system x86_64, linux-gnu
ui RStudio
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz Etc/UTC
date 2020-04-29
─ Packages ──────────────────────────────────────────────────────
! package * version date lib source
P assertthat 0.2.1 2019-03-21 [?] CRAN (R 3.6.3)
P backports 1.1.6 2020-04-05 [?] CRAN (R 3.6.3)
P callr 3.4.3 2020-03-28 [?] CRAN (R 3.6.3)
P cli 2.0.2 2020-02-28 [?] CRAN (R 3.6.3)
P crayon 1.3.4 2017-09-16 [?] CRAN (R 3.6.3)
P data.table 1.12.8 2019-12-09 [?] CRAN (R 3.6.3)
P DBI * 1.1.0 2019-12-15 [?] CRAN (R 3.6.3)
P desc 1.2.0 2018-05-01 [?] CRAN (R 3.6.3)
P devtools * 2.3.0 2020-04-10 [?] CRAN (R 3.6.3)
P digest 0.6.25 2020-02-23 [?] CRAN (R 3.6.3)
P ellipsis 0.3.0 2019-09-20 [?] CRAN (R 3.6.3)
P fansi 0.4.1 2020-01-08 [?] CRAN (R 3.6.3)
P fs 1.4.1 2020-04-04 [?] CRAN (R 3.6.3)
P glue 1.4.0 2020-04-03 [?] CRAN (R 3.6.3)
P jsonlite 1.6.1 2020-02-02 [?] CRAN (R 3.6.3)
P lattice 0.20-41 2020-04-02 [?] CRAN (R 3.6.3)
P magick 2.3 2020-01-24 [?] CRAN (R 3.6.3)
P magrittr 1.5 2014-11-22 [?] CRAN (R 3.6.3)
P Matrix 1.2-18 2019-11-27 [?] CRAN (R 3.6.1)
P memoise 1.1.0 2017-04-21 [?] CRAN (R 3.6.3)
P pkgbuild 1.0.7 2020-04-25 [?] CRAN (R 3.6.3)
P pkgload 1.0.2 2018-10-29 [?] CRAN (R 3.6.3)
P prettyunits 1.1.1 2020-01-24 [?] CRAN (R 3.6.3)
P processx 3.4.2 2020-02-09 [?] CRAN (R 3.6.3)
P ps 1.3.2 2020-02-13 [?] CRAN (R 3.6.3)
P R6 2.4.1 2019-11-12 [?] CRAN (R 3.6.3)
P RAthena * 1.8.0 2020-03-17 [?] CRAN (R 3.6.3)
P Rcpp 1.0.4.6 2020-04-09 [?] CRAN (R 3.6.3)
P remotes 2.1.1 2020-02-15 [?] CRAN (R 3.6.3)
renv 0.9.3 2020-02-10 [1] CRAN (R 3.6.3)
P reticulate 1.15 2020-04-02 [?] CRAN (R 3.6.3)
P rlang 0.4.5 2020-03-01 [?] CRAN (R 3.6.3)
P rprojroot 1.3-2 2018-01-03 [?] CRAN (R 3.6.3)
P rstudioapi 0.11 2020-02-07 [?] CRAN (R 3.6.3)
P sessioninfo 1.1.1 2018-11-05 [?] CRAN (R 3.6.3)
P testthat 2.3.2 2020-03-02 [?] CRAN (R 3.6.3)
P usethis * 1.6.1 2020-04-29 [?] CRAN (R 3.6.3)
P uuid 0.1-4 2020-02-26 [?] CRAN (R 3.6.3)
P withr 2.2.0 2020-04-20 [?] CRAN (R 3.6.3)
[1] /home/ubuntu/efs/project/usagebox-potential-tdm-buyers/renv/library/R-3.6/x86_64-pc-linux-gnu
[2] /tmp/RtmpmJZSnq/renv-system-library
P ── Loaded and on-disk path mismatch.
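For anyone debugging the endpoint error above: "sts.c.amazonaws.com" looks like a malformed region was interpolated into the STS endpoint. A hedged sketch of a workaround, passing the region explicitly ('eu-west-1' is an assumption; use your instance's actual region):
# Hedged sketch: build the STS/Athena endpoints from an explicit region
# rather than whatever mangled value is being picked up from the environment.
con <- RAthena::dbConnect(RAthena::athena(),
                          s3_staging_dir = 's3://bucketnamepath',
                          role_arn = 'role:arn',
                          region_name = 'eu-west-1')  # assumption: your real region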
Rhub check failed with the following error message:
checking package dependencies ... ERROR
Package required but not available: 'data.table'
Package suggested but not available: 'dplyr'
The suggested packages are required for a complete check.
Checking can be attempted without them by setting the environment
variable _R_CHECK_FORCE_SUGGESTS_ to a false value.
See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.
I'm sorry for bombarding you with all the issues. In any case, this one is more of a feature request.
The thing is, unlike most other databases, Athena does not interpret ISO date strings as dates, so a query such as:
select * from table where date_col <= '2019-12-10'
fails.
To make it work, one needs to explicitly cast the date-string to a date. That is, this works:
select * from table where date_col <= date'2019-12-10'
To make things work on the R side when using dplyr, something like this needs to be used (with 'today' an R variable of class Date):
... %>% filter(date_col <= as.Date(today))
So, one needs to cast the Date as a Date for the generated SQL to be correct.
This is something one can live with, so no rush with this one, but it would be a big convenience if RAthena detected when a Date variable is used and added the date prefix automatically, so that code such as
... %>% filter(date_col <= today)
would work similar to most other databases.
Unfortunately I can't figure out how to do this, so I'm just throwing it here in case some solution comes to mind. In principle, I think, overloading the method escape.Date from dbplyr could help here, but I'm drawing a blank on how to accomplish this.
I want to emphasize that this is more of an issue/feature of Athena/Presto itself, but this kind of functionality in RAthena would hide things from the user with no downside I can immediately think of.
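A minimal sketch of the overloading idea, assuming dbplyr dispatches literal escaping through its escape() generic (the method registration details are an assumption, not RAthena's confirmed API):
# Hedged sketch: render R Date values as Athena date literals
# (date 'YYYY-MM-DD') when dbplyr escapes them into SQL.
library(dbplyr)
escape.Date <- function(x, parens = FALSE, collapse = ", ", con = NULL) {
  literals <- paste0("date ", sql_quote(as.character(x), "'"))
  sql_vector(literals, parens, collapse, con = con)
}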
It would be nice to have this package included in the Anaconda distribution: https://docs.conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-r-pkgs.html
Hi, when running the following in Python, I get the credentials:
import boto3
session = boto3.Session(profile_name = 'default')
credentials = session.get_credentials().get_frozen_credentials()
credentials
But when trying to run the following query in R
library(DBI)
library(RAthena)
con <- dbConnect(RAthena::athena(),
region_name = 'us-east-1',
profile_name = 'default',
s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
I get
Error: The config profile (default) could not be found
Traceback:
- dbConnect(RAthena::athena(), region_name = "us-east-1", profile_name = "default",
.   s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
- dbConnect(RAthena::athena(), region_name = "us-east-1", profile_name = "default",
.   s3_staging_dir = paste(s3_bucket, s3_output, sep = "/"))
- .local(drv, ...)
- AthenaConnection(aws_access_key_id = aws_access_key_id, aws_secret_access_key = aws_secret_access_key,
.   aws_session_token = aws_session_token, schema_name = schema_name,
.   work_group = work_group, poll_interval = poll_interval, encryption_option = encryption_option,
.   kms_key = kms_key, s3_staging_dir = s3_staging_dir, region_name = region_name,
.   botocore_session = botocore_session, profile_name = profile_name,
.   aws_expiration = aws_expiration, ...)
- tryCatch(ptr <- boto$Session(aws_access_key_id = aws_access_key_id,
.   aws_secret_access_key = aws_secret_access_key, aws_session_token = aws_session_token,
.   region_name = region_name, botocore_session = botocore_session,
.   profile_name = profile_name, ...), error = function(e) py_error(e))
- tryCatchList(expr, classes, parentenv, handlers)
- tryCatchOne(expr, names, parentenv, handlers[[1L]])
- value[3L]
- py_error(e)
- stop(py_err$value, call. = F)
If I don't specify the profile, I get
Error: Unable to locate credentials
Traceback:
- dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
- dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
- AthenaResult(conn = conn, statement = statement, s3_staging_dir = s3_staging_dir)
- tryCatch(response <- list(QueryExecutionId = do.call(Athena$start_query_execution,
.   Request, quote = T)$QueryExecutionId), error = function(e) py_error(e))
- tryCatchList(expr, classes, parentenv, handlers)
- tryCatchOne(expr, names, parentenv, handlers[[1L]])
- value[3L]
- py_error(e)
- stop(py_err$value, call. = F)
Just a note: I'm able to run this with AWR.Athena::Athena() successfully (it finds the credentials), e.g.
con <- dbConnect(AWR.Athena::Athena(),
region = 'us-east-1',
S3OutputLocation = paste(s3_bucket, s3_output, sep = "/"),
Schema = 'default')
dbSendQuery(con, "SELECT * FROM ppc_pressqa.ppc_current_ib limit 10")
returns
<JDBCResult>
Thanks
aws configure list output:
Session Info:
Session info ---------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz Asia/Jerusalem
date 2020-03-26
Packages -------------------------------------------------------------------
! package * version date lib source
arrow * 0.16.0.2 2020-02-14 [1] CRAN (R 3.5.3)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.3)
AWR.Athena * 2.0.7-0 2019-07-08 [1] CRAN (R 3.5.3)
aws.s3 * 0.3.20 2020-03-11 [1] CRAN (R 3.5.3)
aws.signature 0.5.2 2019-08-08 [1] CRAN (R 3.5.3)
backports 1.1.5 2019-10-02 [1] CRAN (R 3.5.3)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.5.2)
bit * 1.1-15.2 2020-02-10 [1] CRAN (R 3.5.3)
bit64 * 0.9-7 2017-05-08 [1] CRAN (R 3.5.3)
callr 3.4.1 2020-01-24 [1] CRAN (R 3.5.3)
cli 2.0.0 2019-12-09 [1] CRAN (R 3.5.3)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.3)
curl 4.3 2019-12-02 [1] CRAN (R 3.5.3)
data.table * 1.12.8 2019-12-09 [1] CRAN (R 3.5.3)
DBI * 1.1.0.9000 2020-03-04 [1] Github (r-dbi/DBI@500edd3)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.3)
devtools 2.2.2 2020-02-17 [1] CRAN (R 3.5.3)
digest 0.6.23 2019-11-23 [1] CRAN (R 3.5.3)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.5.3)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.5.3)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.3)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.5.3)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.3)
htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.5.3)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.5.3)
IRdisplay 0.7.0 2018-11-29 [1] CRAN (R 3.5.3)
IRkernel 1.1 2019-12-06 [1] CRAN (R 3.5.3)
jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.5.3)
magrittr * 1.5 2014-11-22 [1] CRAN (R 3.5.3)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.3)
pbdZMQ 0.3-3 2018-05-05 [1] CRAN (R 3.5.2)
pillar 1.4.3 2019-12-20 [1] CRAN (R 3.5.3)
pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.5.3)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.3)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.5.3)
processx 3.4.2 2020-02-09 [1] CRAN (R 3.5.3)
ps 1.3.1 2020-02-12 [1] CRAN (R 3.5.1)
purrr 0.3.3 2019-10-18 [1] CRAN (R 3.5.3)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.5.3)
Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.5.3)
remotes 2.1.1 2020-02-15 [1] CRAN (R 3.5.3)
repr 1.0.2 2019-12-16 [1] CRAN (R 3.5.3)
D rJava 0.9-11 2019-03-29 [1] CRAN (R 3.5.3)
RJDBC 0.2-8 2020-03-10 [1] CRAN (R 3.5.3)
rlang 0.4.5 2020-03-01 [1] CRAN (R 3.5.3)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.3)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.3)
stringi * 1.4.6 2020-02-17 [1] CRAN (R 3.5.3)
testthat 2.3.1 2019-12-01 [1] CRAN (R 3.5.3)
tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.5.3)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.5.3)
uuid 0.1-2 2015-07-28 [1] CRAN (R 3.5.2)
vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.5.3)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.3)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.5.3)
Hi there,
Attempting to set up a connection from R to Athena, but am getting the error Error: [Errno 22] Invalid argument when trying to connect. I'm invoking the function as follows:
at_conn <- DBI::dbConnect(RAthena::athena()
, aws_access_key_id = aws_access_key_id
, aws_secret_access_key = aws_secret_access_key
, s3_staging_dir = dir
, region_name = region)
And have installed boto3 via pip, installed miniconda and run the install_boto() function.
My "dir" argument is of the form "s3://location-information-39492/r/". Region is simply "us-eas-1". The access keys work using SQL workbench. Not really sure what is causing this issue, but from searches, it seems to have something to do with reserved keywords for Windows potentially?
Appreciate any help here.
Since data.table::fwrite tries to handle special characters in its own way, that is, escaping field separators and quote characters etc., and quoting strings when necessary, things get weird when Athena tries to deal with such source files.
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
withquote <-
data.frame(col1 = c("abcd,", "15 \"", "\\n"), col2 = c(1,2,3), col3 = Sys.time(), col4 = Sys.Date())
dbWriteTable(con, "temp.quote_test", withquote)
tbl(con, in_schema("temp", "quote_test"))
(The datetimes and dates are there for later.)
Dealing with a comma in the data can be done using the tsv file type. However, a tab would then cause problems.
The quote character will be problematic in either case. The default behaviour of fwrite is to '"double" (default, same as write.csv), in which case the double quote is doubled with another one.' and then the whole entry is enclosed in another set of quotes, which Athena has no idea how to deal with. The conditional quoting also takes place when a field has a comma in csv or a tab in tsv.
My (wholly inelegant) solution has been to use
, quote=F, row.names=F, col.names=F, sep="|", na=""
when writing the file, and telling Athena to use "|" as the separator. Additionally, I've simply removed all "|"s from the data. The point is that the pipe character is not that common in my data.
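For reference, a minimal sketch of that workaround (file and column names are purely illustrative):
library(data.table)
dt <- data.table(col1 = c("abcd,", "15 \"", "\\n"), col2 = 1:3)
dt[, col1 := gsub("|", "", col1, fixed = TRUE)]  # strip the separator from the data itself
fwrite(dt, "upload.txt", quote = FALSE, row.names = FALSE, col.names = FALSE,
       sep = "|", na = "")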
This is obviously not at all optimal. I think the nicest thing would be to have the file with escaped special characters, but without the enclosing quotes. However, I don't think fwrite can accomplish this.
The other solution would be to enclose everything in double quotes, using quote=TRUE, qmethod='escape', col.names=F, but then one needs to use another SerDe in Athena. Now, parsing dates and datetimes gets complicated and I have not been able to get those to function.
The way around this is to
create external table temp.col_test (col1 string, col2 string, col3 string, col4 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
LOCATION 's3://bucket/path_to_data/'
create table col_test2 as (SELECT
try_cast(col1 as varchar) as col1,
try_cast(col2 as integer) as col2,
try_cast(col3 as timestamp) as col3,
try_cast(col4 as date) as col4
FROM "temp"."col_test")
or if one wishes to append to an existing table:
insert into temp.col_test2 (SELECT
try_cast(col1 as varchar) as col1,
try_cast(col2 as integer) as col2,
try_cast(col3 as timestamp) as col3,
try_cast(col4 as date) as col4
FROM "temp"."col_test")
I know we are back at using Athena to create the final tables, but this is the only way so far I've been able to get everything to work at the same time. Well, an unescaped new line will still mess things up...
I know this is a bit of a horror story, but on the other hand these are situations that at least I have had to deal with in 'real data'.
The new SQL translations seem to have an error with integers. Using as.integer results in
Error: SYNTAX_ERROR: line 1:146: Unknown type: int
Example:
tbl(con, "some_table") %>%
mutate(ci = '2019') %>%
mutate(ci = as.integer(ci))
This is fixed by replacing "INT" with "INTEGER" at
Line 52 in 45fc753
Also, I think the line
Line 6 in 45fc753
Parquet snappy compression:
https://arrow.apache.org/docs/r/reference/write_parquet.html
To reduce cost in AWS Athena queries, compression is required. From my current understanding, snappy compression for parquet is the standard. While gzip compression might give smaller files (better for cost), snappy compression is better for performance. This should give the best of both worlds: compressed files that can still be queried effectively.
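For reference, writing snappy-compressed parquet with arrow looks like this (snappy is arrow's default codec when the compression libraries are available):
library(arrow)
# explicit about the codec, even though snappy is the default
write_parquet(mtcars, "mtcars.snappy.parquet", compression = "snappy")
# gzip trades read/write speed for smaller files
write_parquet(mtcars, "mtcars.gz.parquet", compression = "gzip")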
Hello!
So I accidentally typed (n) while running RAthena::install_boto(), e.g.:
RAthena::install_boto()
No non-system installation of Python could be found.
Would you like to download and install Miniconda?
Miniconda is an open source environment management system for Python.
See https://docs.conda.io/en/latest/miniconda.html for more details.
Would you like to install Miniconda? [Y/n]: n
Now, I don't see that prompt and I cannot install boto:
RAthena::install_boto()
Using virtual environment 'RAthena' ...
/home/volodymyr/.virtualenvs/RAthena/bin/python: No module named pip
Error in strsplit(output, "\s+")[[1]] : subscript out of bounds
In addition: Warning message:
In system2(python, c("-m", "pip", "--version"), stdout = TRUE) :
running command ''/home/volodymyr/.virtualenvs/RAthena/bin/python' -m pip --version' had status 1
What would be my next steps in order to install boto and use RAthena?
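A hedged sketch of a likely recovery path: declining the Miniconda prompt left a half-built 'RAthena' virtualenv without pip, so removing it and re-running the installer should let it be rebuilt cleanly:
# Hedged sketch: delete the broken env, then reinstall.
reticulate::virtualenv_remove("RAthena")  # removes ~/.virtualenvs/RAthena (asks for confirmation)
RAthena::install_boto()                   # recreates the env and installs boto3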
I think the behaviour of adding + 1 to the function in Athena could raise some warnings; even though it matches the R output, it is quite strange to debug locally.
Thanks for the amazing pkg
Executing "drop table" in Athena leaves the data in place.
However, sometimes the data in S3 is useless without the table definition (such as when generating temp-tables with ... %>% compute()), and it would be convenient to delete data with the same call.
The following definition of dbRemoveTable would accomplish this without interfering with 'normal' usage. There are some overridable precautions built in to avoid accidental data loss (for example, if a table definition points to the root of a bucket, everything there would get deleted).
What do you think?
#' @rdname dbRemoveTable
#' @export
setMethod(
"dbRemoveTable", c("AthenaConnection", "character"),
function(conn, name, delete_data = FALSE, no_confirm = FALSE, ...) {
if (!dbIsValid(conn)) {stop("Connection already closed.", call. = FALSE)}
if(!grepl("\\.", name)) name <- paste(conn@info$dbms.name, name, sep = ".")
if(delete_data)
{
nobjects <- 1000 # We get only 1000 objects at a time
while(nobjects >= 1000) {
if (grepl("\\.", name)) {
dbms.name <- gsub("\\..*", "" , name)
Table <- gsub(".*\\.", "" , name)
} else {
dbms.name <- conn@info$dbms.name
Table <- name}
glue <- conn@ptr$client("glue")
tryCatch(
s3_path <- split_s3_uri(glue$get_table(DatabaseName = dbms.name,
Name = Table)$Table$StorageDescriptor$Location),
error = function(e) py_error(e))
s3 <- conn@ptr$client("s3")
tryCatch(
objects <- s3$list_objects(Bucket=s3_path$bucket, Prefix=s3_path$key),
error = function(e) py_error(e))
nobjects <- length(objects$Contents)
all_keys <- sapply(objects$Contents, function(x) x$Key)
message(paste0("Info: The following S3 objects will be deleted:\n", paste0(paste0("s3://", s3_path$bucket, all_keys), collapse="\n")))
if(!no_confirm) {
confirm <- readline(prompt = "Delete files (y/n)?: ")
if(confirm != "y") {
message("Info: Table deletion aborted.")
return(NULL)
}
}
# This could be done with s3$delete_objects in one command, but cannot figure out the syntax right now...
for(c_object in all_keys){
tryCatch(
s3$delete_object(Bucket=s3_path$bucket, Key=c_object),
error = function(e) py_error(e))
}
}
}
res <- dbExecute(conn, paste("DROP TABLE ", name, ";"))
dbClearResult(res)
if(!delete_data) message("Info: Only Athena table has been removed.")
invisible(TRUE)
})
Prepare for release:
Submit to CRAN:
Wait for CRAN...
RAthena and noctua have both gone through some significant updates since the last publication.
Datatypes are correctly passed to Athena through the AthenaDataType function:
AthenaDataType <-
function(fields, ...) {
switch(
class(fields)[1],
logical = "BOOLEAN",
integer = "INT",
integer64 = "BIGINT",
numeric = "DOUBLE",
double = "DOUBLE",
factor = "STRING",
character = "STRING",
list = "STRING",
Date = "DATE",
POSIXct = "TIMESTAMP",
stop("Unknown class ", paste(class(fields), collapse = "/"), call. = FALSE)
)
}
This shouldn't be a problem if the user has data.table already installed on their machine.
data.table integer64 documentation:
integer64: "integer64" (default) reads columns detected as containing integers larger than 2^31 as type bit64::integer64. Alternatively, "double"|"numeric" reads as base::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".
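A quick illustration of that documented behaviour (assumes bit64 is installed):
library(data.table)
# a value above 2^31 is read as bit64::integer64 rather than overflowing
dt <- fread("x\n9007199254740993\n", integer64 = "integer64")
class(dt$x)
#> [1] "integer64"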
Doing something like the following:
tbl(con, "table1") %>%
compute(name = "test.table")
will result in an error, such as:
Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.test.table does not exist
The command does produce the table 'table' in database 'test', but compute tries to reference a table named 'test.table' in the 'default' database (or whichever you have set when connecting to Athena).
pyathena has a really good method to reduce cost by caching query ids. It would be good to have this feature in RAthena and noctua.
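A minimal sketch of the idea (all names here are hypothetical, not RAthena's API): keep a local map from a query's SQL text to its QueryExecutionId, and reuse the cached execution instead of re-scanning:
# Hypothetical cache sketch: SQL text -> QueryExecutionId.
.query_cache <- new.env(parent = emptyenv())

cache_query_id <- function(sql, id) {
  assign(digest::digest(sql), id, envir = .query_cache)
}

get_cached_query_id <- function(sql) {
  key <- digest::digest(sql)
  if (exists(key, envir = .query_cache)) get(key, envir = .query_cache) else NULL
}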
Prepare for release:
Submit to CRAN:
Wait for CRAN...
Due to recent changes in R 4.0.0, \donttest examples are now run. This causes a problem with most examples, as they require credentials, and results in the R CMD checks failing on Rhub servers. I have reached out to CRAN ([email protected]) asking for advice on how the examples should be set up going forward.
To prevent spamming CRAN with multiple package submission attempts, this release will be slightly delayed depending on the response from CRAN submissions.
Make sure RAthena takes full advantage of the new changes to reticulate.
For extra information please refer to noctua: #25
If a single datetime column is selected, an error is thrown. I didn't have time to dig deeper, but here it is FYI.
Example:
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
df <- data.frame(time_col = Sys.time(), other_col = 1:5)
dbWriteTable(con, name="temp.test6", df)
tbl(con, in_schema("temp", "test6")) %>% select(time_col)
Error in setnames(ans, col.names) :
Can't assign 1 names to a 2 column data.table
In addition: Warning messages:
1: In data.table::fread(File, col.names = names(Type2), colClasses = unname(Type2), :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
2: In data.table::fread(File, col.names = names(Type2), colClasses = unname(Type2), :
The next release is planned for 14/05/2020. This is to prevent the package being released too frequently to CRAN.
If the new features are required before the planned release, please let me know and the release date can be changed.
Hi!
I am experiencing an issue with the library when trying to assume a role during the connection process, but for the same user with pure boto3 and the same credentials everything works fine.
0. Working example with boto3:
session = boto3.Session(profile_name='my-profile')
credentials = session.get_credentials()
current_credentials = credentials.get_frozen_credentials()
client = boto3.client('athena',
aws_access_key_id = current_credentials.access_key,
aws_secret_access_key = current_credentials.secret_key,
aws_session_token = current_credentials.token)
And three non-working examples:
1. Passing the access parameters within connection
> con <- dbConnect(RAthena::athena(),
+ aws_access_key_id = credentials$AccessKeyId,
+ aws_secret_access_key = credentials$SecretAccessKey,
+ s3_staging_dir='s3://my-bucket/tmp/',
+ region_name='eu-west-1')
Error: An error occurred (UnrecognizedClientException) when calling the GetDatabases operation: The security token included in the request is invalid.
2. Over the profile file:
> con <- dbConnect(RAthena::athena(),
+ profile_name = "default",
+ role_arn = "my-role-arn",
+ s3_staging_dir='s3:/my-bucket/tmp/',
+ region_name='eu-west-1')
Error in names(y$TableType) <- y$Name :
attempt to set an attribute on NULL
3. Without an assume role:
> con <- dbConnect(RAthena::athena(),
+ s3_staging_dir='s3:/my-bucket/tmp/',
+ region_name='eu-west-1')
Error in names(y$TableType) <- y$Name :
attempt to set an attribute on NULL
Could you please help? Am I doing something wrong?
Should RAthena and noctua support AWS Athena ML capabilities? ML capabilities are created by SageMaker; Athena can be used to create an SQL wrapper to call SageMaker models.
https://docs.aws.amazon.com/athena/latest/ug/querying-mlmodel.html
jsonlite supports outputting JSON lines format (http://jsonlines.org/). RAthena and noctua should be able to support uploading JSON format to AWS Athena.
JSON DDL format: https://docs.aws.amazon.com/athena/latest/ug/parsing-JSON.html
Similar to the earlier issue with as.integer #42, as.character also produces an error, as 'string' is an unknown type. Here we should use 'varchar'.
Example:
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
tbl(con, "iris")) %>%
mutate(sepal_length = as.character(sepal_length))
Error: SYNTAX_ERROR: line 1:8: Unknown type: string
Surely related again to the differences between Hive DDL and Presto SQL syntax.
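A hedged sketch of the fix, assuming the translations live in a dbplyr sql_translator() (consistent with the fragments referenced in #42):
library(dbplyr)
# Presto/Athena query types differ from Hive DDL: casts need VARCHAR and INTEGER.
athena_scalar <- sql_translator(
  .parent = base_scalar,
  as.character = sql_cast("VARCHAR"),  # not STRING
  as.integer   = sql_cast("INTEGER")   # not INT (see #42)
)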
A second user is having issues with install_boto: https://community.rstudio.com/t/emr-write-data-to-s3/55408/4
Might need to add better documentation to help improve users' experience with the package. Will investigate how to improve the install_boto function.
The package vroom (https://github.com/r-lib/vroom) claims to be:
The fastest delimited reader for R, 1.27 GB/sec.
Benchmark: https://vroom.r-lib.org/articles/benchmarks.html
When passing data to and from AWS Athena, vroom has the potential to increase performance with large files, plus it offers a method to write gzip files in parallel:
vroom_write(mtcars, pipe("pigz > mtcars.tsv.gz"))
Downside: data is generally returned using data.table, and a lot of data wrangling techniques use data.table for speed.
RStudio's environment pane calls object.size() when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so it is recommended you use at least that version.
It would be good to get RAthena to display tables from Athena in RStudio's connections tab: odbc:View.R
- paste/paste0 defaulted to concat_ws (#65)
- data.table dependency needs to be altered to reflect fwrite supporting compression in (>=1.12.4)
- dbWriteTable utilises the existing s3 location when appending
- db_compute takes into account schema and name from a string (#74)
- dbStatistics returns stats from Athena queries (#67)
- RAthena_options checks if the file parser exists before adapting it
- RAthena_options restricts vroom support to (>=1.2.0) due to api changes
- roxygen2 updated from 6.1.1 to 7.0.2
- pkgdown::site documentation rebuilt with new roxygen2
- AWS_ATHENA_WORK_GROUP environment variable support
- default file upload changed from csv to tsv. This is to enable array and json to be pushed to Athena from R
- dbRemoveTable able to delete Athena table s3 files
- sql_translate_env: integer misclassified
- NA handling when uploading to Athena
- pkgdown site documentation
Doing something like
tbl(con, "table) %>% mutate(new_col = paste(col1, col2))
fails with an error about a missing function (concat_ws).
I think we need to add paste (and paste0) to sql_translate_env's Athena-specific translations.
Adding lines:
paste = sql_prefix("CONCAT"),
paste0 = sql_prefix("CONCAT"),
to sql_translator under sql_translate_env.AthenaConnection should do the trick.
Here's another one:
When creating new columns with, e.g., dplyr's mutate, the column names are forced to lower case. This happens silently and deviates from the 'default' behaviour of dplyr (at least with the other DBI-backends I've been using).
With the previously implemented quoting of column names, forcing lower case is not necessary, and could be removed.
Example:
library(dplyr)
a <-
tbl(con, "some_table) %>%
mutate(AaBb = 1) %>%
select(AaBb) %>%
collect
b <- tibble(AaBb = 1:5)
a %>% inner_join(b)
This leads to the error:
Error: `by` required, because the data sources have no common variables
I think this is a bit unexpected.
If I'm not missing something, the 'tolower' part at Line 21 in 45fc753 could be removed.
What do you think?
p.s. One would still get surprises when using copy_to with upper-case column names. Athena seems to accept quoted upper case letters in the DDL, but replaces them with lower case automatically, so this is not something that can be helped in RAthena. Still, I think this is a separate issue.
If, for whatever reason, Apache Arrow can't be installed on the local machine, should AWS Glue be used to convert csv files to parquet format to help with AWS Athena performance?
To be able to upload a flat file saved on the user's computer and register it in AWS Athena without the requirement of first loading it into R.
Support GZIP compression:
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
Use the GZIP compression in Athena for querying Amazon Kinesis Data Firehose logs. Athena and Amazon Kinesis Data Firehose each support different versions of SNAPPY, so GZIP is the only compatible format.
Currently RAthena has a function dbGetPartition which returns a data.table in the default AWS Athena format:
library(DBI)
library(RAthena)
library(data.table)
con <- dbConnect(athena())
test_df2_partitions = dbGetPartition(con, "test_df2")
# partition
# 1: year=2020/month=11/day=17
This format isn't too bad, as it just returns the format from Athena. Would it be useful to reformat this into the following?
get_partitions = function(dt){
dt = dt[, tstrsplit(partition, split = "/")]
partitions = sapply(names(dt), function(x) strsplit(dt[[x]][1], split = "=")[[1]][1])
for (col in names(dt)) set(dt, j=col, value=tstrsplit(dt[[col]], split = "=")[2])
setnames(dt, old = names(dt), new = partitions)
return(dt)
}
get_partitions(test_df2_partitions)
# year month day
# 1: 2020 11 17
The problem is that dbGetPartition has been in the package for some time, and changing it now would break solutions users may have developed.
A query such as:
tbl(con, "table) %>%
filter(date_col == "2020-01-31 51:59:25")
will produce an error, as Athena does not understand a literal of the form date '2020-01-31 51:59:25', but R can cast it to a date even with 'as.Date(x, tryFormats = "%Y-%m-%d")' used for testing in sql_escape_string.AthenaConnection.
This can be fixed with:
# Athena specific S3 method for converting date variables and ISO formatted date strings to date literals
#' @rdname sql_translate_env
#' @export
sql_escape_string.AthenaConnection <- function(con, x) {
all_dates <- all(try(as.Date(x, tryFormats = "%Y-%m-%d"), silent=T) == x) & all(nchar(x) == 10)
if(all_dates & !is.na(all_dates)) {
paste0('DATE ', DBI::dbQuoteString(con, x))
} else {
DBI::dbQuoteString(con, x)
}
}
However, it wouldn't be a bad idea to have similar handling of date times and format those to, e.g., "datetime'2020-01-31 51:59:25'".
I will look into it, but it would be good to at least have the fix above included before pushing to CRAN.
Prepare for release:
Submit to CRAN:
Wait for CRAN...
Hello!
Thanks for your awesome package! I managed to set up dozens of automated scripts which use Athena quite intensively. So, thank you so much!
It's a feature request and/or an idea for future work rather than an issue.
Adding a show_statistics extra param to dbGetQuery(), or an additional function dbGetQueryStatistics(), that uses the QueryExecution['Statistics'] part of the boto3 get_query_execution function's response.
Why might it be helpful? Athena pricing is currently $5 per 1 TB of scanned data, and Athena itself is designed to query huge amounts of data. Query statistics might be helpful for managing operations costs in complex environments.
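A hedged sketch of what that could return, using the boto3 client already held on the connection (the con@ptr$client(...) pattern appears elsewhere in this package; query_id stands in for the id of a previously executed query):
# Hedged sketch: pull the Statistics block for a finished query.
athena <- con@ptr$client("athena")
resp <- athena$get_query_execution(QueryExecutionId = query_id)
resp$QueryExecution$Statistics
# e.g. $DataScannedInBytes and $EngineExecutionTimeInMillis feed cost estimates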
Setting up a connection to an Athena 'table' with just one column with tidyverse's tbl-function fails with the error:
Error in .valueClassTest(ans, "data.frame", "dbFetch") :
invalid value from generic function ‘dbFetch’, class “factor”, expected “data.frame”
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
tttt <- data.frame(a = sample(LETTERS, 5))
copy_to(con, tttt, name = "test")
tbl(con, "test")
The tbl-command gives an error:
Error in .valueClassTest(ans, "data.frame", "dbFetch") :
invalid value from generic function ‘dbFetch’, class “factor”, expected “data.frame”
The error seems to be triggered at line 149 in Result.R:
Line 149 in 460b80e
So maybe an issue with boto3?
In any case, to get things working, I commented out the whole block dealing with the case n >= 0 etc.
Surely this one is here for a reason, so a better solution would be needed, but nothing to offer for that at the moment.
When uploading data to the S3 URI "s3://mybucket/test_df4/" with file name "test_df.csv", the s3_uri is changed to "s3://mybucket/4/".
I wasn't thinking when I merged tidyverse/dbplyr#391 into dbplyr; the methods really should live in the individual backends. I'm going to remove these methods for the next dplyr release so can you please include in your package:
#' @export
sql_escape_date.AthenaConnection <- function(con, x) {
paste0('date ', dbQuoteString(con, as.character(x)))
}
#' @export
sql_escape_datetime.AthenaConnection <- function(con, x) {
x <- strftime(x, "%Y-%m-%d %H:%M:%OS %Z")
paste0('timestamp ', dbQuoteString(con, x))
}
The code comes from OssiLehtinen, so please credit him.
Currently RAthena and noctua support gzip compression when uploading data to S3 and Athena. Is there a better compression algorithm for flat files? From Top 10 Performance Tuning Tips for Amazon Athena:
Algorithm | Splittable? | Compression ratio | Compress + Decompress speed
---|---|---|---
Gzip (DEFLATE) | No | High | Medium
bzip2 | Yes | Very high | Slow
LZO | No | Low | Fast
Snappy | No | Low | Very fast
For Athena, we recommend using either Apache Parquet or Apache ORC, which compress data by default and are splittable. When they are not an option, then try BZip2 or Gzip with an optimal file size.
From this it looks like BZIP2/GZIP are currently recommended. Might need to benchmark the speed of BZip2 and GZIP files when reading from Athena.
If column names in a data.frame contain special characters, such as 'ä' and 'ö', copy_to fails with the error
Error: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:8: no viable alternative at input 'CREATE EXTERNAL'
library(RAthena)
library(DBI)
con <- dbConnect(RAthena::athena(),
profile_name = "rathena")
oddcolnames <- data.frame(tämänimi = 1:5)
a_t <- copy_to(con, oddcolnames, "test")
This can be fixed by replacing
Line 360 in 48c01bd
with
paste0("`", field_names, "`", " ", field.types)
Please see DyfanJones/noctua#120 for further details
Possible method for connecting to AWS Athena.
https://developer.amazon.com/docs/login-with-amazon/web-docs.html
Appending
When appending to an already existing table, dbWriteTable will just utilise the s3.location parameter for the S3 location. However, an issue will occur if the backend data in S3 is situated in a different location, i.e.:
Backend s3 location: s3://path/to/bucket1/tbl_data/
s3.location parameter: s3://path/to/bucket2/tbl_data/
To overcome this, the existing S3 location should overwrite the s3.location parameter when the append parameter is set to TRUE.
Backend S3 structure
Currently data is sent to the S3 path:
s3://path/to/bucket/{tbl}/
This is OK when the user only has 1 database/schema for Athena. When a user has 2 databases, i.e.:
dbWriteTable(con, "schema1.tbl", tbl)
dbWriteTable(con, "schema2.tbl", tbl)
the backend will return s3://path/to/bucket/{tbl}/ for both cases. Ideally it should align to the following schema, s3://path/to/bucket/{schema}/{tbl}/, and return:
s3://path/to/bucket/schema1/tbl/
s3://path/to/bucket/schema2/tbl/
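A minimal sketch of the proposed path construction (the helper name is hypothetical):
build_s3_location <- function(bucket, schema, table) {
  # s3://path/to/bucket/{schema}/{tbl}/
  sprintf("%s/%s/%s/", sub("/+$", "", bucket), schema, table)
}
build_s3_location("s3://path/to/bucket", "schema1", "tbl")
#> [1] "s3://path/to/bucket/schema1/tbl/"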