
redshifttools's Introduction

redshiftTools

This is an R package meant to ease common operations with Amazon Redshift. The original motivation was bulk uploads, where the usual procedure consists of generating various CSV files, uploading them to an S3 bucket and then calling a COPY command on the server; this package encapsulates all of those steps in a handful of functions.
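
At a high level, the package wraps the manual workflow sketched below (a rough sketch only; the bucket, table, role and connection names are placeholders, not part of the package API):

    library(DBI)
    library(aws.s3)

    # 1. Dump the data frame to a CSV file
    write.csv(mydata, "chunk0001.csv", row.names = FALSE)

    # 2. Push the file to an S3 bucket
    put_object(file = "chunk0001.csv", object = "chunk0001.csv", bucket = "mybucket")

    # 3. Load it into Redshift with a COPY command
    dbExecute(con, "
      COPY myschema.mytable
      FROM 's3://mybucket/chunk0001.csv'
      IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
      CSV IGNOREHEADER 1;")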

WARNING

This package is no longer maintained; a maintained fork is available here instead: https://github.com/RedOakStrategic/redshiftTools

Installation

To install the latest CRAN version, you’ll need to execute:

    install.packages('redshiftTools')

If you instead want the latest GitHub master version:

    devtools::install_github("sicarul/redshiftTools")

Drivers

This library officially supports two ways of connecting to Amazon Redshift (others may work, but are untested):

RPostgres

This Postgres library works well, even with Amazon Redshift servers that have SSL enabled. It previously didn't support transactions, but it is now the recommended way to work with redshiftTools.

To use it, please configure like this:

    devtools::install_github("r-dbi/RPostgres")
    library(RPostgres)
    
    con <- dbConnect(RPostgres::Postgres(), dbname = "dbname",
                     host = 'my-redshift-url.amazon.com', port = '5439',
                     user = 'myuser', password = 'mypassword', sslmode = 'require')
    test <- dbGetQuery(con, 'select 1')

RJDBC

If you download the official Redshift driver .jar, you can use it with this R library. It's not ideal, since it doesn't implement all the standard DBI interfaces (so you can't use it with dplyr, for example), but it works fine for uploading data.

To use it, please configure like this:

    install.packages('RJDBC')
    library(RJDBC)
    
    # Save the driver into a directory
    dir.create('~/.redshiftTools')
    # - Check your AWS Dashboard to get the latest URL instead of this version -
    download.file('http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar','~/.redshiftTools/redshift-driver.jar')
    
    # Use Redshift driver
    driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "~/.redshiftTools/redshift-driver.jar", identifier.quote="`")

    # Create the connection. In production you may want to move these variables to a .env file (e.g. with the dotenv package) or similar.
    dbname="dbname"
    host='my-redshift-url.amazon.com'
    port='5439'
    user='myuser'
    password='mypassword'
    ssl='true'
    url <- sprintf("jdbc:redshift://%s:%s/%s?tcpKeepAlive=true&ssl=%s&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory", host, port, dbname, ssl)
    conn <- dbConnect(driver, url, user, password)

Usage

Creating tables

For creating tables, there is a support function, rs_create_statement, which receives a data.frame and returns the query for creating the same table in Amazon Redshift.

    n = 1000
    testdf = data.frame(
      a = rep('a', n),
      b = c(1:n),
      c = rep(as.Date('2017-01-01'), n),
      d = rep(as.POSIXct('2017-01-01 20:01:32'), n),
      e = rep(as.POSIXlt('2017-01-01 20:01:32'), n),
      f = rep(paste0(rep('a', 4000), collapse=''), n))

    cat(rs_create_statement(testdf, table_name = 'dm_great_table'))

This returns:

    CREATE TABLE dm_great_table (
      a VARCHAR(8),
      b int,
      c date,
      d timestamp,
      e timestamp,
      f VARCHAR(4096)
    );

The cat is only there so the statement displays properly in the console; it's not done inside the function itself, in case you need to pass the string to another function (like a query runner).
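
For example, the generated statement can be passed straight to a query runner such as DBI (assuming con is an open connection, as in the examples below):

    stmt <- rs_create_statement(testdf, table_name = 'dm_great_table')
    dbExecute(con, stmt)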

Uploading data

For uploading data, you’ll have available now 2 functions: rs_replace_table and rs_upsert_table, both of these functions are called with almost the same parameters, except on upsert you can specify with which keys to search for matching rows.

For example, suppose we have a table with 2 integer columns to load; we could use the following code:

    library("aws.s3")
    library(RPostgres)
    library(redshiftTools)

    a=data.frame(a=seq(1,10000), b=seq(10000,1))
    n=head(a,n=10)
    n$b=n$a
    nx=rbind(n, data.frame(a=seq(5:10), b=seq(10:5)))

    con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
    host='my-redshift-url.amazon.com', port='5439',
    user='myuser', password='mypassword',sslmode='require')

    b=rs_replace_table(a, dbcon=con, table_name='mytable', bucket="mybucket", split_files=4)
    c=rs_upsert_table(nx, dbcon=con, table_name = 'mytable', split_files=4, bucket="mybucket", keys=c('a'))

Creating tables with data

rs_create_table combines rs_create_statement and rs_replace_table: you can create a table from scratch from R and upload the contents of a data frame without needing to write any SQL code at all.

    library("aws.s3")
    library(RPostgres)
    library(redshiftTools)

    a=data.frame(a=seq(1,10000), b=seq(10000,1))
    
    con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
    host='my-redshift-url.amazon.com', port='5439',
    user='myuser', password='mypassword',sslmode='require')

    b=rs_create_table(a, dbcon=con, table_name='mytable', bucket="mybucket", split_files=4)
    

redshifttools's People

Contributors

emelieh21, gitter-badger, igoldin2u, ilyaminati, kwent, mfarkhann, niklasvm, rtjohn, sicarul


redshifttools's Issues

Recent Error

Hi,
Lately I am getting this error when trying to upload a data.frame to Redshift.

[1] "Getting number of slices from Redshift"
[1] "4 slices detected, will split into 16 files"
Error in if (key == "") { : argument is of length zero

Do you know why it happens? It seems to be connected with the uploadToS3 function.
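
One thing that may be worth ruling out (just a guess; the values below are placeholders) is missing AWS credentials, since the upload functions fall back to environment variables when access_key / secret_key are not passed explicitly:

Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "myKey",
  "AWS_SECRET_ACCESS_KEY" = "mySecret",
  "AWS_DEFAULT_REGION"    = "us-east-1"
)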

rs_upsert_table() connection timeout

Hi there,
I've been using the redshiftTools package in R to write data back to Redshift for months. Recently we made some changes to the Redshift connections (outside of my control) and now I get a timeout error while running the rs_upsert_table() function. The reason is that by the time the function is ready to write the data back to Redshift (only large data has this issue), after copying the files to the S3 bucket, the connection has already timed out and I get a connection error. I've tried various combinations of splitting the data into files but I still have the same issue, because I can't split the data into smaller chunks based on the key (otherwise earlier chunks would be overwritten by the latest chunk with the same key, if that makes sense).

My question is: is there any way the function could keep the connection alive while writing files to the S3 bucket, so that by the time it is ready to load the data into Redshift from the bucket the connection hasn't timed out?

Many thanks.

No IAM role error

Hi there,

Recently discovered this package and am trying to get it to work. I installed as listed and am using a connection provided (that works with dbWriteTable from DBI) through RPostgreSQL. I'm getting the following error when attempting to load a data frame of two columns and 100,000 rows (as a test case):

[1] "Getting number of slices from Redshift"
[1] "2 slices detected, will split into 8 files"
Error in get_ec2_role(verbose = verbose) : 
  No IAM role profile available in instance metadata

Note: previously I got an error saying that it needed a package called aws.ec2metadata installed, which does not seem to be installed during the setup process. I installed it by running install.packages("aws.ec2metadata"), which seemed to work, but I am still getting the above error.

Any ideas? Would love to use this package, seems fantastic.

unable to find an inherited method for function ‘dbExecute’ for signature ‘"character", "character"’

Hi there,
I've been able to use the rs_upsert_table() function to date without any issues. However, recently I'm facing a problem that I haven't been able to solve. I'm trying to write a data.table, test, into Redshift. I have many tables like this and I haven't had any issues writing them to Redshift. However, this particular table is giving me trouble. I get this message:

"unable to find an inherited method for function ‘dbExecute’ for signature ‘"character", "character"’"
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘dbExecute’ for signature ‘"character", "character"’

I understand the nature of this error message but I have no idea where it would come from.
I tried solving it by making a very small set that resembles the structure of my data and tried it on a test table in Redshift, and I still get the same error.

here is the test set:

library(data.table)  # needed for the reprex below

test <- data.table(col1 = c('2018-01', '2018-01', '2018-02')
                   , col2 = c('blahblah', 'aslkjaslkj', '2w3nln3fll')
                   , col3 = c('DFGSE$G#$G', 'GSDGSDFGEG#GERG', '354524')
                   , col4 = c(1L, 2L, 3L)
                   , col5 = c(10L, 20L, 30L)
                   , col6 = c(11L, 21L, 31L)
                   , col7 = c(12L, 22L, 23L)
                   , col8 = c(31L, 32L, 33L)
                   , col9 = c(Sys.Date(), Sys.Date(), Sys.Date()))

str(test)
Classes ‘data.table’ and 'data.frame': 3 obs. of 8 variables:
$ col1 : chr "2018-01" "2018-01" "2018-02"
$ col2 : chr "blahblah" "aslkjaslkj" "2w3nln3fll"
$ col3 : chr "DFGSE$G#$G" "GSDGSDFGEG#GERG" "354524"
$ col4 : int 1 2 3
$ col5: int 10 20 30
$ col6 : int 11 21 31
$ col7 : int 12 22 23
$ col8 : int 31 32 33
$ col9: Date, format: "2018-10-25" "2018-10-25" "2018-10-25"

  • attr(*, ".internal.selfref")=

here is the test table schema in redshift:
public test col1 character(7) lzo false 0 false
public test col2 character varying(256) lzo false 0 false
public test col3 character varying(256) lzo false 0 false
public test col4 integer lzo false 0 false
public test col5 integer lzo false 0 false
public test col6 integer lzo false 0 false
public test col7 integer lzo false 0 false
public test col8 integer lzo false 0 false
public test col9 timestamp without time zone lzo false 0 false


When I try to write the test data.table into the test table in Redshift, this is the error I receive:

  success <- rs_upsert_table(   test 
                              , dbcon=write_conn
                              , table_name = 'test'
                              , keys=c('col1')
                              , split_files = 4
                              # , bucket= 'XXX'  -> not used as set in Sys.setenv
                              # , region = 'XXX'
                              # , access_key = 'XXX'
                              # , secret_key = 'XXX'
  ) 

[1] "The provided data.frame has 3 rows"
[1] "Uploading he-r-to-redshift/bzuvwsncqygbygmjlblvfkvycaceikczsbavwjhgldudkluxtl.0001"
[1] "Uploading he-r-to-redshift/bzuvwsncqygbygmjlblvfkvycaceikczsbavwjhgldudkluxtl.0002"
[1] "Uploading he-r-to-redshift/bzuvwsncqygbygmjlblvfkvycaceikczsbavwjhgldudkluxtl.0003"
[1] "unable to find an inherited method for function ‘dbExecute’ for signature ‘"character", "character"’"
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘dbExecute’ for signature ‘"character", "character"’

[1] "Deleting temporary files from S3 bucket"
[1] "Deleting bzuvwsncqygbygmjlblvfkvycaceikczsbavwjhgldudkluxtl.0001"
[1] "Deleting bzuvwsncqygbygmjlblvfkvycaceikczsbavwjhgldudkluxtl.0002"
[1] "Deleting bzuvwsncqygbygmjlblvfkvycaceikczsbavwjhgldudkluxtl.0003"

I have no idea what's going on. I have had so many tables like this and have been able to write them back to Redshift.

AWS session token not found when using rs_create_table

This error now shows up when uploading data:

Warning message:
In value[3L] :
Failed to fetch row: ERROR: AWS session token not found
DETAIL:

error: AWS session token not found
code: 8001
context:
query: 3867846
location: aws_credentials_parser.cpp:150
process: padbmaster [pid=8216]

Issue installing aws.ec2metadata

Hi,
The package is not working for me, so I reinstalled it; please take a look at the errors and warnings.

install.packages("aws.ec2metadata", repos = c(cloudyr = "http://cloudyr.github.io/drat", getOption("repos"))) I get the following output:

Installing package into ‘C:/Users/muguet.RSINT/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
Warning in install.packages :
cannot open URL 'http://cloudyr.github.io/drat/bin/windows/contrib/3.5/PACKAGES.rds': HTTP status was '404 Not Found'
Warning in install.packages :
cannot open URL 'http://cloudyr.github.io/drat/bin/windows/contrib/3.5/PACKAGES.gz': HTTP status was '404 Not Found'
Warning in install.packages :
cannot open URL 'http://cloudyr.github.io/drat/bin/windows/contrib/3.5/PACKAGES': HTTP status was '404 Not Found'
Warning in install.packages :
unable to access index for repository http://cloudyr.github.io/drat/bin/windows/contrib/3.5:
cannot open URL 'http://cloudyr.github.io/drat/bin/windows/contrib/3.5/PACKAGES'

There is a binary version available but the source version is later:
                binary source needs_compilation
aws.ec2metadata  0.1.4  0.1.5             FALSE

installing the source package ‘aws.ec2metadata’

trying URL 'http://cloudyr.github.io/drat/src/contrib/aws.ec2metadata_0.1.5.tar.gz'
Content type 'application/gzip' length 4445 bytes
downloaded 4445 bytes

  • installing source package 'aws.ec2metadata' ...
    ** R
    ** inst
    ** byte-compile and prepare package for lazy loading
    ** help
    *** installing help indices
    converting help for package 'aws.ec2metadata'
    finding HTML links ... done
    ec2metadata html
    ** building package indices
    ** testing if installed package can be loaded
    *** arch - i386
    *** arch - x64
  • DONE (aws.ec2metadata)
    In R CMD INSTALL

Thank you

Download Function

This is a feature request more than an issue.

I'm interested in a function that can be used to download the contents of a table or results of a query using the reverse of the pattern used for rs_append_table.
i.e. Redshift --> S3 --> Local

My main motivation is that a normal download via a SELECT query can take very long for large results.
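
For reference, a rough manual sketch of the requested pattern (not an existing redshiftTools function; the table, bucket, role and file names below are placeholders, and the UNLOAD output suffix is assumed):

library(DBI)
library(aws.s3)

# Redshift -> S3: UNLOAD the query result to a single file in the bucket
dbExecute(con, "
  UNLOAD ('SELECT * FROM myschema.bigtable')
  TO 's3://mybucket/exports/bigtable_'
  IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
  DELIMITER ',' ALLOWOVERWRITE PARALLEL OFF;")

# S3 -> local: download the unloaded file (PARALLEL OFF is assumed to write a single '000' slice)
save_object(object = "exports/bigtable_000", bucket = "mybucket", file = "bigtable.csv")
local_df <- read.csv("bigtable.csv", header = FALSE)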

Error message

Hi, I am very excited about this tool.
But I got this message when using rs_upsert_table; I have everything specified and set up, including bucket, split_files, region, access_key, secret_key and iam_role_arn. Thank you for your time and help!

Error in curl_fetch_memory(uri) : Timeout was reached
Error in curl_fetch_memory(uri) :
URL using bad/illegal format or missing URL
Error in role[["AWS_ACCESS_KEY_ID"]] : subscript out of bounds

rs_upsert_table() Error $code "AccessDenied"

Hi there, I'm trying to upsert a data frame using the rs_upsert_table() function.
I'm having an AccessDenied issue, although I have tested the access key ID and secret access key.
I think the problem arises from the region, which for me is "ap-southeast-2".

When I first used the following, I noticed that the region wasn't getting changed and was still set to the default us-east-1.

rs_upsert_table( test
, dbcon=write_conn
, table_name = 'my_tablet'
, keys=c('key1')
, bucket="my-bucket"
, region = "ap-southeast-2" ## this wouldn't change the region
, access_key = "myKey"
, secret_key = "mySecret"
, split_files = 4
)

List of 4
$ Code : chr "AccessDenied"
$ Message : chr "Access Denied"
.
.
.

then i tried this:

Sys.setenv("AWS_ACCESS_KEY_ID" = "myKey",
"AWS_SECRET_ACCESS_KEY" = "mySecret",
"AWS_DEFAULT_REGION" = "ap-southeast-2",
"AWS_BUCKET_NAME" = "my-bucket"
)

and when I ran rs_upsert_table, I saw that the region was finally set to "ap-southeast-2", but I would still get the AccessDenied error.

So I thought maybe there was something wrong with my credentials and write access, so I tried the following:

The following was successful after I set check_region to FALSE; otherwise I would get an access error:

put_object('C:/test.txt', 'my-bucket', check_region = FALSE)
[1] TRUE

The same with the following; it was successful after I set check_region to FALSE:

get_bucket('he-r-to-redshift', check_region = FALSE)
Bucket: my-bucket

$Contents
Key: my-bucket
LastModified: 2018-08-21T03:30:03.000Z
ETag: "blahblah"
Size (B): 20
Owner:
Storage class: STANDARD

So, as it turns out, the access key and secret work just fine, but I still can't get rs_upsert_table() to work; it keeps throwing the AccessDenied error code.


sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_0.7.6 data.table_1.11.4 redshiftTools_0.3.900 RPostgres_1.1.1 aws.s3_0.3.12 RPostgreSQL_0.6-2 DBI_1.0.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 rstudioapi_0.7 magrittr_1.5 bindr_0.1.1 xml2_1.2.0 hms_0.4.2 tidyselect_0.2.4
[8] bit_1.1-14 aws.signature_0.4.4 R6_2.2.2 rlang_0.2.1 blob_1.1.1 httr_1.3.1 tools_3.5.1
[15] bit64_0.9-7 digest_0.6.15 assertthat_0.2.0 tibble_1.4.2 crayon_1.3.4 bindrcpp_0.2.2 purrr_0.2.5
[22] base64enc_0.1-3 curl_3.2 mime_0.5 glue_1.3.0 pillar_1.3.0 compiler_3.5.1 pkgconfig_2.0.1

Error with curl?

Hey Pablo,

I've recently discovered this package and I have been using it with great success. So thank you!

I've encountered an error message that does not have an existing issue, so I wanted to report it, but I'm not quite sure what the problem is, so I can't give a reproducible example. As you'll see I am trying to load a very large data frame. So maybe that is the issue.

The provided data.frame has 15516052 rows and 546 columns
Getting number of slices from Redshift
1.97626258336499e-323 slices detected, will split into 4816 files
Uploading 4816 files with prefix bmwgevwtjusrfqoeajpdvqbklsmtlhfevocadmkthdtotijroy to bucket my-bucket-name
Uploading file 575/4816 [=============>--------------------------------------------------------------------------------------------------------]
Error in curl::curl_fetch_memory(url, handle = handle) : 
  OpenSSL SSL_write: SSL_ERROR_ZERO_RETURN, errno 32

Also, sometimes I get,

Error in curl::curl_fetch_memory(x$url$url, handle = x$url$handle) :
Error in the HTTP2 framing layer

Do you understand either of these error messages?

Function hangs in 3.5.1?

Hi there, fantastic package, I use this regularly. I just recently updated to R 3.5.1, however, and calling the function rs_upsert_table seems to hang. It detects the number of splits, but never begins uploading to S3. I've tried the same process on a 3.4.2 installation and it works as normal.

Wondering if you're experiencing this as well or have tried it.

Appreciate it!

Add error handling?

I sometimes get internal server error (500) codes when redshiftTools (RST) is interacting with AWS's API. It doesn't matter if RST is uploading data files to S3 or deleting them. On a long running job, an error like this occasionally means I have to start it over. Have you encountered this?

I know we cannot eliminate those errors, but I wanted to engage you in discussing adding exception handling to RST. I might be able to submit a PR if we can decide how best to add this feature.

Error in parse_aws_s3_response(r, Sig, verbose = verbose) : Forbidden (HTTP 403).

Hello,
I'm using the below code to create a table:

redshiftTools::rs_create_table(buyers, dbcon=zzzz_ananda,
                               table_name='pricing_uniquebuyers',
                               bucket="ananda-fastload-raw",
                               region="eu-west-1",
                               access_key=credentials$AccessKeyId,
                               secret_key=credentials$SecretAccessKey,
                               session_token=credentials$Token)

But it hits an error when it tries to delete the temporary files. I've checked with the administrator and I have full rights to perform this operation. How may I bypass this issue?

Initiating Redshift table creation for table pricing_uniquebuyers
Initiating Redshift table replacement for table pricing_uniquebuyers
The provided data.frame has 98668 rows
Getting number of slices from Redshift
0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000197626258336499 slices detected, will split into 16 files
Uploading 16 files with prefix cqxhhuwwxdxftxhtbwbjpakjyvfuntcjukigrucdmrzszzcvep to bucket ananda-fastload-raw
Upload to S3 complete!                                                                                                                                                               
Deleting target table for replacement
Insert new rows
Drop staging table
Committing changes
Deleting temporary files from S3 bucket
Deleting 16 files with prefix cqxhhuwwxdxftxhtbwbjpakjyvfuntcjukigrucdmrzszzcvep from bucket ananda-fastload-raw
List of 4
 $ Code     : chr "AccessDenied"
 $ Message  : chr "Access Denied"
 $ RequestId: chr "83D5BC09581EB13B"
 $ HostId   : chr "KlXGVrH5Fkxni8i++At/hYi23MF/2X3w6PpDWgrZE31WnTWDhHp05KkHr/gWasQAtGCo1N+5J7c="
 - attr(*, "headers")=List of 6
  ..$ x-amz-request-id : chr "83D5BC09581EB13B"
  ..$ x-amz-id-2       : chr "KlXGVrH5Fkxni8i++At/hYi23MF/2X3w6PpDWgrZE31WnTWDhHp05KkHr/gWasQAtGCo1N+5J7c="
  ..$ content-type     : chr "application/xml"
  ..$ transfer-encoding: chr "chunked"
  ..$ date             : chr "Thu, 10 Oct 2019 09:00:44 GMT"
  ..$ server           : chr "AmazonS3"
  ..- attr(*, "class")= chr [1:2] "insensitive" "list"
 - attr(*, "class")= chr "aws_error"
NULL
Error in parse_aws_s3_response(r, Sig, verbose = verbose) : 
  Forbidden (HTTP 403).

rs_create_table only uploads half of records in dataframe

Hi,
Really grateful for this package! I'm running into some trouble with rs_create_table. No matter what I do, the function seems to upload exactly half of all the records in the data frame, rather than all of them.

For example, if I create a data frame with 4 records and write to Redshift with rs_create_table like this:

df <- data.frame("val1" = 1:4, "val2" = 5:8)

conn <- dbConnect(RPostgres::Postgres(),
                  host = host,
                  port = port,
                  user = user,
                  password = password,
                  dbname = dbname,
                  sslmode='require')


table_name = "mts.vals"
rs_create_table(
  df, 
  dbcon=conn, 
  table_name=table_name,  
  bucket=s3_bucket,
  access_key = aws_access_key_id,
  secret_key = aws_secret_access_key,
  region = aws_default_region
)

I see this output in the console:


[1] "The provided data.frame has 4 rows"
[1] "Getting number of slices from Redshift"
[1] "3.16202013338398e-322 slices detected, will split into 3.16202013338398e-322 files"
[1] "Uploading ngreen/fepuobieukyyjelvgovpzrgvreappxwzxdoqmqjfgtvgvfwurq.0001"
[1] "Uploading ngreen/fepuobieukyyjelvgovpzrgvreappxwzxdoqmqjfgtvgvfwurq.0000"
[1] "Copying data from S3 into Redshift"
INFO:  Load into table 'ozlubiwvstkdcpjr' completed, 2 record(s) loaded successfully.
[1] "Deleting target table for replacement"
[1] "Insert new rows"
[1] "Drop staging table"
[1] "Committing changes"
[1] "Deleting temporary files from S3 bucket"
[1] "Deleting fepuobieukyyjelvgovpzrgvreappxwzxdoqmqjfgtvgvfwurq.0001"
[1] "Deleting fepuobieukyyjelvgovpzrgvreappxwzxdoqmqjfgtvgvfwurq.0000"
[1] TRUE

When I log into Redshift and run this:

SELECT COUNT(1) FROM mts.vals

I get an output of 2 rather than 4.

Any help appreciated - thanks!
-Noah

ERROR: Column length exceeds maximum allowed (maximum column length is 65535)

Hi,
Thanks for this great package.
I've using it for several months, and it's been great for my team.

Yesterday we were trying to copy a table from Postgres to Redshift; it contains a JSON field consisting of long text.
max(nchar(col)) is between 32768 and 65536.

When we try to upload it to Redshift, it produces this error:

Error: Failed to fetch row: ERROR: Column length exceeds maximum allowed (maximum column length is 65535)

After digging around, it turns out this is because calculateCharSize sets the maximum size to 65536,
while according to the Redshift documentation the maximum is 65535 bytes (64K - 1).

Here is a minimal reproducible example.
I'm sorry, we don't have a public Redshift and S3 to share, so we are using environment variables for the connection details in this example.

library(redshiftTools)
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(),
                 host=Sys.getenv('redshit_host'),
                 dbname=Sys.getenv('redshit_databaseName'),
                 user = Sys.getenv('redshit_user'),
                 password= Sys.getenv('redshit_password'),
                 port = Sys.getenv('redshit_port'),
                 sslmode='require')

text <- paste(rep("A",2^16-1),collapse = "")
nchar(text)
#> [1] 65535

df_test <- data.frame(text = text)

cat(rs_create_statement(df_test, table_name='test_table'))
#> CREATE TABLE test_table (
#> text VARCHAR(65536) encode zstd
#> );

b <- rs_create_table(df_test, 
                dbcon=con,
                table_name='test_table', 
                bucket=Sys.getenv('S3_bucket'))
#> Error: Failed to fetch row: ERROR:  Column length exceeds maximum allowed (maximum column length is 65535)

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 9 (stretch)
#> 
#> Matrix products: default
#> BLAS: /usr/lib/openblas-base/libblas.so.3
#> LAPACK: /usr/lib/libopenblasp-r0.2.19.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] RPostgres_1.1.1.9002  redshiftTools_0.3.900
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.0            knitr_1.20            xml2_1.2.0           
#>  [4] magrittr_1.5          hms_0.4.2             aws.s3_0.3.12        
#>  [7] bit_1.1-14            aws.signature_0.4.4   R6_2.3.0             
#> [10] rlang_0.3.1           blob_1.1.1            stringr_1.3.1        
#> [13] httr_1.3.1            tools_3.5.1           DBI_1.0.0.9001       
#> [16] htmltools_0.3.6       yaml_2.2.0            bit64_0.9-7          
#> [19] rprojroot_1.3-2       digest_0.6.18         base64enc_0.1-3      
#> [22] curl_3.2              evaluate_0.11         rmarkdown_1.10       
#> [25] stringi_1.2.4         aws.ec2metadata_0.1.5 compiler_3.5.1       
#> [28] backports_1.1.2       jsonlite_1.6          pkgconfig_2.0.2

Created on 2019-02-04 by the reprex package (v0.2.0).

load into table successfully, but reported error when deleting data in s3

Hi Pablo,

I use "rs_create_table" to upload data into redshift. The uploading process is successful but there is error when deleting those temp files in s3:

INFO: Load into table 'iytqadwbvpzmgsfj' completed, 1 record(s) loaded successfully.

Error in parse_aws_s3_response(r, Sig, verbose = verbose) :
Forbidden (HTTP 403).

Missing connection parameter on README

A nit, but on your README page, under Drivers, RPostgres
test=dbGetQuery('select 1')

should be
test=dbGetQuery(con,'select 1')

Or at least on my windows RGui(64-bit) app.

Installation error: ‘redshiftTools’ is not available (for R version 4.0.0)

I've just upgraded R to version 4.0.0.
When I try to install the package with the following command

install.packages("redshiftTools",repos = c("http://rstudio.org/_packages","http://cran.rstudio.com"))

I am getting the following error:

Warning in install.packages :
  package ‘redshiftTools’ is not available (for R version 4.0.0)
Warning in install.packages :
  unable to access index for repository http://rstudio.org/_packages/bin/windows/contrib/4.0:
  cannot open URL 'http://rstudio.org/_packages/bin/windows/contrib/4.0/PACKAGES'

Is it possible to still install it and get my scripts to work?

COPY Parameters

Hello! Awesome package! Is there a way to pass parameters for the COPY statement (column mapping, data format parameters, data conversion parameters, data load operations)? I'm getting error messages that would be easily remedied with the parameter timeformat = 'auto', but I don't see a way to pass that along. Sorry if this is an inappropriate place to post this question; I am new to GitHub! And thanks in advance for your help!
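
For what it's worth, the rs_replace_table signature quoted further down this page includes an additional_params argument; if your installed version has it, it may (untested assumption on my part) accept raw COPY options, roughly like:

# Untested sketch: additional_params is assumed to be appended to the COPY statement.
rs_replace_table(df, dbcon = con, table_name = 'mytable', bucket = "mybucket",
                 additional_params = "TIMEFORMAT 'auto'")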

rs_create_table/rs_upsert_table error

Error: Expecting a single string value: [type=character; extent=2].


14. stop(structure(list(message = "Expecting a single string value: [type=character; extent=2].", call = NULL, cppstack = NULL), class = c("Rcpp::not_compatible", "C++Error", "error", "condition")))
13. result_create(conn@ptr, statement)
12. initialize(value, ...)
11. initialize(value, ...)
10. new("PqResult", conn = conn, ptr = result_create(conn@ptr, statement), sql = statement, bigint = conn@bigint)
9. .local(conn, statement, ...)
8. dbSendQuery(conn, statement, ...)
7. dbSendQuery(conn, statement, ...)
6. dbSendStatement(conn, statement, ...)
5. dbSendStatement(conn, statement, ...)
4. dbExecute(dbcon, query)
3. dbExecute(dbcon, query)
2. queryStmt(dbcon, tableSchema)
1. rs_create_table(iris, dbcon = con, table_name = in_schema("xxxx", table_name), bucket = config$redshift$redshift_bucket, split_files = 1)

S3 Bucket not found

Hi @sicarul,

I have been working on a solution to push tables from R to Redshift. However, I have encountered an issue while using the rs_upsert_table command, where I get the following error. I have all the credentials and AWS bucket access, but the code still errors out.

ERROR**
Client error: (403) Forbidden
Error in uploadToS3(df, bucket, split_files, access_key, secret_key, region) :
Bucket does not exist
Calls: rs_upsert_table -> uploadToS3
Execution halted
***ERROR END

Any inputs would be great here. Thanks!

Best,
Aayush Sahni

adding 'headers =' to 'put_object'

I have been getting an Access Denied error when trying to use the rs_replace_table command because I am unable to specify a list of headers for amz server side encryption. Can this get added to both commands?

rs_replace_table(data = temp, dbcon = con, tablename = 'table.name',
    bucket = 'bucket_location',
    headers = list('x-amz-server-side-encryption' = 'xx'))

When calling aws.s3::put_object by itself, this works with our connection but I am unable to figure out how to add this into the call.

rs_create_table throwing an error $ Code : chr "AccessDenied"

Hi, I am trying to write an R data frame to a Redshift table using the function below from an EC2 instance.

rs_create_table(seg_mem_model, dbcon=conn, table_name='segmenattion_model', bucket=s3_bucket,
region = s3_region, iam_role_arn = aws_iam_role, split_files=4)

I got the error below.

[1] "The provided data.frame has 1805422 rows"
List of 4
$ Code : chr "AccessDenied"
$ Message : chr "Access Denied"
$ RequestId: chr "A096AC9FC2F163A7"
$ HostId : chr "CkxoZsK8sE2wW/jMb5vXSOp2Cb1SLvzVZ2xlaiA+h9dfliJBSkoW+lsmot/TWbYXcTaqM2mAhDA="

 - attr(*, "headers")=List of 6
  ..$ x-amz-request-id : chr "A096AC9FC2F163A7"
  ..$ x-amz-id-2 : chr "CkxoZsK8sE2wW/jMb5vXSOp2Cb1SLvzVZ2xlaiA+h9dfliJBSkoW+lsmot/TWbYXcTaqM2mAhDA="
  ..$ content-type : chr "application/xml"
  ..$ transfer-encoding: chr "chunked"
  ..$ date : chr "Mon, 22 Jan 2018 22:09:52 GMT"
  ..$ server : chr "AmazonS3"
  ..- attr(*, "class")= chr [1:2] "insensitive" "list"
 - attr(*, "class")= chr "aws_error"
    NULL
    Error in parse_aws_s3_response(r, Sig, verbose = verbose) :
    Forbidden (HTTP 403).

Please find below the sessionInfo() output.

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] redshiftTools_0.3.900 RPostgreSQL_0.6-2 DBI_0.7-13 RPostgres_1.0-4 aws.s3_0.3.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.15 digest_0.6.14 aws.signature_0.3.5 R6_2.2.2 pillar_1.1.0 httr_1.3.1 rlang_0.1.6 curl_3.1
[9] blob_1.1.0 xml2_1.1.1 tools_3.4.3 bit64_0.9-7 bit_1.1-12 hms_0.4.0 compiler_3.4.3 pkgconfig_2.0.1
[17] base64enc_0.1-3 tibble_1.4.1

rs_create_table needs session_token argument on EC2

I'm developing inside a docker container on an EC2 with a defined IAM role.
Here is the Dockerfile used to create the image:

FROM rocker/verse:3.5.0

WORKDIR /home/rstudio/work

#Settings
COPY startup.sh /etc/cont-init.d/theme

# System package installation

# Package installations
# using `install.packages()` will download packages from the MRAN snapshot

RUN Rscript -e "devtools::install_github('lgjohnson/aws.signature', ref = 'fix-cred-bug')"
RUN Rscript -e "devtools::install_github('cloudyr/aws.s3', ref = 'ff2966749a361ff143ef99053325163460dc15d0')"
RUN Rscript -e "install.packages(c('psych', 'pscl'))"

# ENTRYPOINT inherited from rocker image (init)

Here is the basic issue:

> library(aws.s3)
> bucket_exists("mybucket")
[1] TRUE
attr(,"x-amz-id-2")
[1] "stuff-redacted"
attr(,"x-amz-request-id")
[1] "B1D6161A1CE1004D"
attr(,"date")
[1] "Thu, 11 Apr 2019 00:13:11 GMT"
attr(,"x-amz-bucket-region")
[1] "us-west-2"
attr(,"content-type")
[1] "application/xml"
attr(,"transfer-encoding")
[1] "chunked"
attr(,"server")
[1] "AmazonS3"

> creds <- aws.signature::locate_credentials()
> rs_create_table(mydata, dbcon=redshiftcon, table_name='mytable', 
+                bucket="mybucket", 
+                region = creds$region,
+               access_key = creds$key,
+                secret_key = creds$secret,
+                split_files=4)
[1] "The provided data.frame has 36 rows"
Client error: (403) Forbidden
Error in uploadToS3(df, bucket, split_files, access_key, secret_key, region) : 
  Bucket does not exist

So bucket_exists() confirms the bucket is there, but rs_create_table() thinks the bucket is not. If you look at the first 16 lines of code for s3HTTP(), which is what is called by get_bucket():

function (verb = "GET", bucket = "", path = "", query = NULL, 
	headers = list(), request_body = "", write_disk = NULL, 
	accelerate = FALSE, dualstack = FALSE, parse_response = TRUE, 
	check_region = FALSE, url_style = c("path", "virtual"), 
	base_url = Sys.getenv("AWS_S3_ENDPOINT", "s3.amazonaws.com"), 
	verbose = getOption("verbose", FALSE), show_progress = getOption("verbose", 
		FALSE), region = NULL, key = NULL, secret = NULL, session_token = NULL, 
	use_https = TRUE, ...) 
{
	credentials <- aws.signature::locate_credentials(key = key, 
		secret = secret, session_token = session_token, region = region, 
		verbose = verbose)
	key <- credentials[["key"]]
	secret <- credentials[["secret"]]
	session_token <- credentials[["session_token"]]
	region <- credentials[["region"]]

You can see that s3HTTP() is prepared to handle session_token as an input. The solution should be relatively simple. In the uploadToS3 function, if you add an argument to line 4 such as:

if (!bucket_exists(bucket, key = key, secret = secret, region = region, session = session)) 

and a similar change to rs_replace_table

function (df, dbcon, table_name, split_files, bucket = Sys.getenv("AWS_BUCKET_NAME"), 
	region = Sys.getenv("AWS_DEFAULT_REGION"), access_key = Sys.getenv("AWS_ACCESS_KEY_ID"), 
	secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"), iam_role_arn = Sys.getenv("AWS_IAM_ROLE_ARN"), session = Sys.getenv("AWS_SESSION_TOKEN"),
	wlm_slots = 1, additional_params = "") 

There are a few other lines that would need to be changed as well. I'm happy to submit a PR for this if you'd like.

rs_replace_table error: Forbidden (HTTP 403)

Hi there

Great package, thank you!

I am receiving an error when using:

rs_replace_table(
  df = df,
  dbcon = rs,
  table_name = 'my_schema.rs_test',
  bucket = 's3://mybucket/',
  split_files = 1
)

Note: I have anonymised the S3 bucket and Redshift schema.

Output:

Initiating Redshift table replacement for table my_schema.rs_test
The provided data.frame has 36 rows
Uploading 1 files with prefix cxcuowiijlrghzkaydkwwfkuttqzrtfotsdhzoiqoonjtuiqka to bucket s3://mybucket/
Error in { : task 1 failed - "Forbidden (HTTP 403)."

I think I have found the issue, which is that my S3 bucket requires server-side encryption to be set. I have used the aws.s3 package before and usually just pass

headers = c(`x-amz-server-side-encryption` = "AES256")

to put_object

Any thoughts? I'm happy to open a PR with a potential fix.

UTF-8 character insert errors

Multibyte character types, such as UTF-8 Japanese characters, actually take up more space in Redshift varchar storage, so the size calculation needs to accommodate them. Can you update the nchar call in tabledefinitions.R line 4 to read:

maxChar = max(nchar(col, type='bytes'), na.rm=T)

Sample code to create the issue:

dat1 <- data.frame(A="あいう", stringsAsFactors = F)
b <- rs_create_table(dat1, dbcon=con, table_name='mytesttable', split_files=1)

The above results in:
ERROR: Load into table 'uvroyetkqxjdgcbn' failed. Check 'stl_load_errors' system table for details.

In stl_load_errors it says:
String length exceeds DDL length

receiving "ERROR: Query cancelled on user's request\n" in version 0.3.900

I am trying to use the rs_upsert_table command after upgrading to version 0.3.900 and receive the error "ERROR: Query cancelled on user's request\n"

However, the exact same commands in version 0.1.1 do not produce this error. Any idea what has changed to cause this? Should I simply downgrade my version? Any help installing an older version would be appreciated.

For reference, my commands look like:
dbSendQuery(con2,"create table newTABLE(
x VARCHAR(100),
y VARCHAR(100),
z VARCHAR(100),
i integer,
j integer);")
rs_upsert_table(dF,dbcon=con2,table_name="newTABLE",bucket="myBucket")

Multibyte characters create STL load errors

First just wanted to say thanks for creating such a great package. It's been incredibly valuable for me in my work.

The issue I've recently run into has to do with having multibyte characters in my dataset. I work at a digital marketing agency, and occasionally will have trademark symbols and other unicode characters in my datasets. R will interpret this as a single character, but AWS redshift decodes it into multiple characters.

The result is that the calculateCharSize function leads to an incorrect table definition (as interpreted by the database) and I get an STL load error as a result.

I've been able to get around this issue by creating a dummy row of data with a longer string, but it may be helpful to modify the calculateCharSize function so that it rounds up and sets the character size to the next level. I realize some efficiency might be lost in this approach, but it may be worth it so that others can avoid this same problem.
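
To illustrate the suggestion (this is only a sketch, not the package's actual calculateCharSize), measuring in bytes and rounding up to the next power of two would look roughly like:

# Hypothetical helper, for illustration only: size columns by bytes rather than
# characters, then round up to the next power of two so multibyte symbols still fit.
calculate_char_size <- function(col) {
  max_bytes <- max(nchar(as.character(col), type = "bytes"), na.rm = TRUE)
  2 ^ ceiling(log2(max_bytes + 1))
}

calculate_char_size(c("plain text", "trademark \u2122"))  # the \u2122 symbol is 3 bytes in UTF-8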

Warning message causes rs_replace_table() function not to upload data

First of all: thanks for the very useful package. I am a frequent user of the rs_replace_table function. However, today I noticed the upload to Redshift from S3 does not continue when a warning occurs.

I got this warning:

<simpleWarning in result_fetch(res@ptr, n = n): Don't need to call dbFetch() for statements, only for queries>

And after that the upload to Redshift was skipped and the tables were deleted again from the S3 bucket.

After removing the line warning = function(w) { print(w) } from the rs_replace_table function in my R session (see replace.R line 97-99), it worked fine again. Not sure if you intended the upload to stop when a warning occurs, but I just wanted to inform you about this issue.

pb_tick

Hi, looks like I have a bug with pb_tick (progress bar function I think). Any idea how to resolve it? I am new at R so I am not even sure which library this function comes from.

Here is my error message and traceback. Any help is greatly appreciated!

Error in pb_tick(self, private, len, tokens) : !self$finished is not TRUE

traceback()
8: stop(simpleError(msg, call = if (p <- sys.parent(1L)) sys.call(p)))
7: stopifnot(!self$finished)
6: pb_tick(self, private, len, tokens)
5: pb$tick()
4: .f(.x[[i]], .y[[i]], ...)
3: map2(splitted, 1:split_files, upload_part)
2: uploadToS3(df, bucket, split_files, access_key, secret_key, session_token,
region)
1: rs_replace_table(test, dbcon = con_dw, table_name = "sometablename",
bucket = "mybucket", access_key = "myaccesskey",
secret_key = "mysecretkey")

R: Error when I try to upload a data-frame to a new table in Redshift database

I'm trying to upload the contents of a data frame into a Redshift database using the following command:
redshiftTools::rs_upsert_table(MyDataFrame, dbcon=con, table_name='NewTableInDB', bucket="mybucket")
but it throws the error below:
Error in parse_aws_s3_response(r, Sig, verbose = verbose) : Forbidden (HTTP 403).
How can I resolve it?

Error in if (nrow(data) == 0) { : missing value where TRUE/FALSE needed

with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY specified via Sys.setenv

redshiftTools 0.11
aws.s3 0.3.3

encountered when trying to rs_replace_table

--------Platform info ----
version
_
platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness

Not working for tables with an identity column

(edited after spending more time working out the issue)

I have a table which has an incremental identity column.

I run:

rs_replace_table(df = table, dbcon = con, table_name = 'table', bucket = "bucket")

And get the following error after successful upload to S3

[1] "Copying data from S3 into Redshift"
[1] "Failed to fetch row: ERROR:  Load into table 'ngphviyuofradklt' failed.  Check 'stl_load_errors' system table for details.\n"

The error happens because the temporary table, which is created with LIKE, doesn't inherit the identity property of the id column; it just becomes an ordinary BIGINT column. COPY then naturally fails because one column is missing from the data frame.

I solved it for my purpose by just directly COPYing into the original table, but it isn't the ideal solution.
