Code Monkey home page Code Monkey logo

redshifttools's Introduction

redshiftTools

This is an R Package meant to easen common operations with Amazon Redshift. The first motivation for this package was making it easier for bulk uploads, where the procedure for uploading data consists in generating various CSV files, uploading them to an S3 bucket and then calling a copy command on the server, this package helps with all those tasks in encapsulated functions.

Installation

To install the latest CRAN version, you’ll need to execute:

    install.packages('redshiftTools')

If instead you want to install the latest github master version:

    devtools::install_github("sicarul/redshiftTools")

Drivers

This library supports two official ways of connecting to Amazon Redshift (Others may work, but untested):

RPostgres

This Postgres library is great, and it works even with Amazon Redshift servers with SSL enabled. It previously didn’t support transactions, but is now the recommended way to work with redshiftTools.

To use it, please configure like this:

    devtools::install_github("r-dbi/RPostgres")
    library(RPostgres)
    
    con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
    host='my-redshift-url.amazon.com', port='5439',
    user='myuser', password='mypassword',sslmode='require')
    test=dbGetQuery(con, 'select 1')

RJDBC

If you download the official redshift driver .jar, you can use it with this R library, it’s not great in the sense that you can’t use it with dplyr for example, since it doesn’t implement all the standard DBI interfaces, but it works fine for uploading data.

To use it, please configure like this:

    install.packages('RJDBC')
    library(RJDBC)
    
    # Save the driver into a directory
    dir.create('~/.redshiftTools')
    # - Check your AWS Dashboard to get the latest URL instead of this version -
    download.file('http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar','~/.redshiftTools/redshift-driver.jar')
    
    # Use Redshift driver
    driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "~/.redshiftTools/redshift-driver.jar", identifier.quote="`")

    # Create connection, in production, you may want to move these variables to a .env file with library dotenv, or other methods.
    dbname="dbname"
    host='my-redshift-url.amazon.com'
    port='5439'
    user='myuser'
    password='mypassword'
    ssl='true'
    url <- sprintf("jdbc:redshift://%s:%s/%s?tcpKeepAlive=true&ssl=%s&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory", host, port, dbname, ssl)
    conn <- dbConnect(driver, url, user, password)

Usage

Creating tables

For creating tables, there is a support function, rs_create_statement, which receives a data.frame and returns the query for creating the same table in Amazon Redshift.

n=1000
testdf = data.frame(
a=rep('a', n),
b=c(1:n),
c=rep(as.Date('2017-01-01'), n),
d=rep(as.POSIXct('2017-01-01 20:01:32'), n),
e=rep(as.POSIXlt('2017-01-01 20:01:32'), n),
f=rep(paste0(rep('a', 4000), collapse=''), n) )

cat(rs_create_statement(testdf, table_name='dm_great_table'))

This returns:

CREATE TABLE dm_great_table (
a VARCHAR(8),
b int,
c date,
d timestamp,
e timestamp,
f VARCHAR(4096)
);

The cat is only done to view properly in console, it’s not done directly in the function in case you need to pass the string to another function (Like a query runner)

Uploading data

For uploading data, you’ll have available now 2 functions: rs_replace_table and rs_upsert_table, both of these functions are called with almost the same parameters, except on upsert you can specify with which keys to search for matching rows.

For example, suppose we have a table to load with 2 integer columns, we could use the following code:

    library("aws.s3")
    library(RPostgres)
    library(redshiftTools)

    a=data.frame(a=seq(1,10000), b=seq(10000,1))
    n=head(a,n=10)
    n$b=n$a
    nx=rbind(n, data.frame(a=seq(5:10), b=seq(10:5)))

    con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
    host='my-redshift-url.amazon.com', port='5439',
    user='myuser', password='mypassword',sslmode='require')

    b=rs_replace_table(a, dbcon=con, table_name='mytable', bucket="mybucket", split_files=4)
    c=rs_upsert_table(nx, dbcon=con, table_name = 'mytable', split_files=4, bucket="mybucket", keys=c('a'))

Creating tables with data

A conjunction of rs_create_statement and rs_replace_table can be found in rs_create_table. You can create a table from scratch from R and upload the contents of the data frame, without needing to write SQL code at all.

    library("aws.s3")
    library(RPostgres)
    library(redshiftTools)

    a=data.frame(a=seq(1,10000), b=seq(10000,1))
    
    con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
    host='my-redshift-url.amazon.com', port='5439',
    user='myuser', password='mypassword',sslmode='require')

    b=rs_create_table(a, dbcon=con, table_name='mytable', bucket="mybucket", split_files=4)
    

redshifttools's People

Contributors

sicarul avatar ilyaminati avatar tsydorenko avatar emelieh21 avatar igoldin2u avatar niklasvm avatar rtjohn avatar gitter-badger avatar kwent avatar mfarkhann avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.