
s3tools's Introduction

The s3tools package allows you to read and write files from the Analytical Platform's data store in Amazon S3. It allows you to find the files you have access to, read them into R, and write files back out to Amazon S3.

Which buckets do I have access to?

You can find which buckets you have access to using the following code. It will return a character vector of buckets.

s3tools::accessible_buckets()

What files do I have access to?

## List all the files in the alpha-everyone bucket
s3tools::list_files_in_buckets('alpha-everyone')

## You can list files in more than one bucket:
s3tools::list_files_in_buckets(c('alpha-everyone', 'alpha-dash'))

## You can filter by prefix, to return only files in a folder
s3tools::list_files_in_buckets('alpha-everyone', prefix='s3tools_tests')

## 'prefix' matches any path that begins with it; path_only = TRUE returns just the matching paths rather than full file metadata
s3tools::list_files_in_buckets('alpha-everyone', prefix='s3tools_tests', path_only = TRUE)

## For more complex filters, you can always filter down the dataframe using standard R code:
library(dplyr)

## All files containing the string 'iris'
s3tools::list_files_in_buckets('alpha-everyone') %>% 
  dplyr::filter(grepl("iris",path)) # Use a regular expression

## All Excel files containing 'iris'
s3tools::list_files_in_buckets('alpha-everyone') %>% 
  dplyr::filter(grepl("iris.*\\.xls", path)) 

Reading files

Once you know the full path that you'd like to access, you can read the file as follows.

csv files

For csv files, s3tools uses the default read.csv reader:

df <-s3tools::s3_path_to_full_df("alpha-everyone/s3tools_tests/folder1/iris_folder1_1.csv")
print(head(df))

For large csv files, if you want to preview the first few rows without downloading the whole file, you can do this:

df <- s3tools::s3_path_to_preview_df("alpha-moj-analytics-scratch/my_folder/10mb_random.csv")
print(df)

Other file types

For xls, xlsx, sav (spss), dta (stata), and sas7bdat (sas) file types, s3tools will attempt to read these files if the relevant reader package is installed:

df <-s3tools::s3_path_to_full_df("alpha-everyone/s3tools_tests/iris_base.xlsx")  # Uses readxl if installed, otherwise errors

df <-s3tools::s3_path_to_full_df("alpha-everyone/s3tools_tests/iris_base.sav")  # Uses haven if installed, otherwise errors
df <-s3tools::s3_path_to_full_df("alpha-everyone/s3tools_tests/iris_base.dta")  # Uses haven if installed, otherwise errors
df <-s3tools::s3_path_to_full_df("alpha-everyone/s3tools_tests/iris_base.sas7bdat")  # Uses haven if installed, otherwise errors

If you have a different file type, or you're having a problem with the automatic readers, you can specify a file read function:

s3tools::read_using(FUN=readr::read_csv, path = "alpha-everyone/s3tools_tests/iris_base.csv")
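
If the reader function accepts a local file path, the same approach works for file types that s3tools doesn't handle automatically. A minimal sketch, assuming the jsonlite package is installed (the path below is hypothetical, for illustration only):

# Read a JSON file by passing jsonlite's reader through read_using
# (hypothetical path; assumes jsonlite is installed)
dat <- s3tools::read_using(
  FUN  = jsonlite::fromJSON,
  path = "alpha-everyone/s3tools_tests/example.json"
)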

If you're interested in adding support for additional file types, feel free to add some code to this file and raise a pull request against the s3tools repo.

Downloading files

s3tools::download_file_from_s3("alpha-everyone/s3tools_tests/iris_base.csv", "my_downloaded_file.csv")

# By default, if the file already exists you will receive an error.  To override:
s3tools::download_file_from_s3("alpha-everyone/s3tools_tests/iris_base.csv", "my_downloaded_file.csv", overwrite = TRUE)

Writing data to S3

Writing files to S3

s3tools::write_file_to_s3("my_downloaded_file.csv", "alpha-everyone/delete/my_downloaded_file.csv")

# By default, if the file already exists you will receive an error.  To override:
s3tools::write_file_to_s3("my_downloaded_file.csv", "alpha-everyone/delete/my_downloaded_file.csv", overwrite =TRUE)

Writing a dataframe to S3 in csv format

s3tools::write_df_to_csv_in_s3(iris, "alpha-everyone/delete/iris.csv")

# By default, if the file already exists you will receive an error.  To override:
s3tools::write_df_to_csv_in_s3(iris, "alpha-everyone/delete/iris.csv", overwrite =TRUE)

s3tools's People

Contributors

andyhd, calumabarnett, isichei, jamiefraser77, joeprinold, mandarinduck, r4vi, robinl, thomashepworth, willbowditch


s3tools's Issues

Reverse dependency issues with cloudyr

If you run the following in a new packrat environment:

remotes::install_github("moj-analytical-services/s3tools")

then some functionality (e.g. accessing files) doesn't work.

s3tools::list_files_in_buckets("alpha-everyone") results in:

Error in s3tools::list_files_in_buckets("alpha-everyone") : 
  You asked to list some buckets you don't have access to:  alpha-everyone

Add optional arguments for writing csvs out.

Add options for adding additional arguments to write.csv() in write_df_to_csv_in_s3() function. Needed for appending to csvs and also preventing row names being written to file.
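
A rough sketch of what the change might look like, assuming the function writes the dataframe to a temporary file before uploading it with write_file_to_s3 (the internals here are a guess, not the actual source):

# Hypothetical sketch: forward extra arguments through to write.csv()
write_df_to_csv_in_s3 <- function(df, s3_path, overwrite = FALSE, ...) {
  tmp <- tempfile(fileext = ".csv")
  utils::write.csv(df, tmp, ...)  # e.g. row.names = FALSE
  s3tools::write_file_to_s3(tmp, s3_path, overwrite = overwrite)
}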

Fix dash web

Just filter to include only buckets with 'alpha' in the name.

Sunset s3tools in favour of botor

botor is an R wrapper to boto3, with some nice helper functions for s3, specifically:

  • s3_read - equivalent to s3tools::read_using()
  • s3_write - a counterpart to s3_read, with no exact s3tools analogue (a gap that's definitely worth filling)
  • s3_list_buckets.

That covers everything s3tools is needed for, and it works out of the box because it uses the standard AWS credentials setup. I'd therefore suggest that s3tools is deprecated altogether and the guidance is updated with a conversion guide to the equivalent botor functions (see the sketch below). An alternative would be to make s3tools use botor under the hood, but that would add an unnecessary maintenance burden for very little gain.
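
A minimal sketch of the conversion, assuming botor's documented s3_read/s3_write/s3_list_buckets signatures (note that paths become full s3:// URIs rather than "bucket/key" strings):

library(botor)

# s3tools::s3_path_to_full_df / read_using  ->  s3_read
df <- s3_read("s3://alpha-everyone/s3tools_tests/iris_base.csv", readr::read_csv)

# s3tools::write_df_to_csv_in_s3  ->  s3_write
s3_write(iris, readr::write_csv, "s3://alpha-everyone/delete/iris.csv")

# s3tools::accessible_buckets  ->  s3_list_buckets
s3_list_buckets()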

Would it be better if the default CSV reader was fread?

This isn't a major issue as fread can be used within read_using, but fread is faster than read.csv and copes well with untidy CSVs (e.g. where the first row has fewer columns than the rest, which unfortunately occurs in downloads of SOP reports).

Both fread and read_csv are fast and apply similar default options to read_excel (e.g. trimming white space), but read_csv doesn't cope so well with untidy CSVs.

I realise either way it creates a dependency on another package, which may be better avoided.
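
In the meantime, fread can be used explicitly. A short sketch, assuming the data.table package is installed (note that fread returns a data.table rather than a base data.frame):

# Use data.table's fread as the reader via read_using
df <- s3tools::read_using(FUN = data.table::fread,
                          path = "alpha-everyone/s3tools_tests/iris_base.csv")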

Problems with aws.ec2metadata dependency and creation of .aws/credentials

In a new packrat environment I did this:

install.packages("devtools")
devtools::install_github("moj-analytical-services/s3tools")
s3tools::accessible_buckets()

# > s3tools::accessible_buckets()
# Error in aws.signature::read_credentials(filename) : 
#   File '/home/robinl/.aws/credentials' does not exist.
# Error in loadNamespace(name) : 
#   there is no package called ‘aws.ec2metadata’


install.packages("aws.ec2metadata")
s3tools::accessible_buckets()

# > s3tools::accessible_buckets()
# Error in aws.signature::read_credentials(filename) : 
#   File '/home/robinl/.aws/credentials' does not exist.
# [1] "advanced-analytics-web"           "alpha-dag-crest-data-engineering" "alpha-dash"                      
# [4] "alpha-everyone"                   "alpha-legal-aid-analysis"         "dash.mojanalytics.xyz"   

s3tools::accessible_buckets()
# > s3tools::accessible_buckets()
# [1] "advanced-analytics-web"           "alpha-dag-crest-data-engineering" "alpha-dash"                      
# [4] "alpha-everyone"                   "alpha-legal-aid-analysis"         "dash.mojanalytics.xyz"    

s3_path_to_df

s3_path_to_df is still in the documentation, but is not an exported object from 'namespace:s3tools'

Only works in R on the platform

I don't think it's mentioned explicitly anywhere in the documentation that this package only works in R on the Analytical Platform. It would be helpful to mention this, I think.

Inconsistencies in s3 check_region may cause problems

I was having problems writing files to S3 with overwrite = FALSE:

s3tools::write_df_to_csv_in_s3(df, "alpha-everyone/delete/iris4.csv", overwrite = FALSE)

This is because it uses s3tools:::s3_file_exists("alpha-everyone/delete/iris4.csv"), and check_region wasn't enabled in that function.

aws.s3 removed from CRAN

It seems the aws.s3 package has been removed from CRAN, which means that installing s3tools using remotes::install_github won't work out of the box. It's still on Anaconda, so in principle that would be the preferred installation method. There are no installation recommendations in the README though (presumably because the package is pre-installed on the Analytical Platform), which can add to the confusion if things aren't working immediately.
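
One possible installation route while aws.s3 is off CRAN, assuming the cloudyr drat repository is still maintained (the repository URL below is an assumption to verify; conda's r-aws.s3 package is an alternative):

# Install aws.s3 from the cloudyr drat repository, then s3tools from GitHub
install.packages("aws.s3",
                 repos = c(cloudyr = "https://cloudyr.github.io/drat",
                           getOption("repos")))
remotes::install_github("moj-analytical-services/s3tools")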

Problem reading xls file with maximum number of rows

Attempting to read a sheet from an xls file with the maximum 65,536 rows causes the following error in RStudio:

ERROR session hadabend; LOGGED FROM: rstudio::core::Error {anonymous}::rInit(const rstudio::r::session::RInitInfo&) /home/ubuntu/rstudio/src/cpp/session/SessionMain.cpp:563

Reading another sheet from the same file with around 40,000 rows causes no problems. I'm not sure at how many rows the issue starts to occur.

The same issue happens using read_using with FUN=read_excel. But reading the file locally using read_excel is fine with the maximum number of rows.

Credentials error - SignatureDoesNotMatch

I'm getting an error when writing a dataframe to S3:

write_df_to_csv_in_s3(employ_data, s3_path = "alpha-mra-consultation/redis_data_backup.csv", overwrite = TRUE )
List of 10
 $ Code                 : chr "SignatureDoesNotMatch"
 $ Message              : chr "The request signature we calculated does not match the signature you provided. Check your key and signing method."

I've tried s3tools::get_credentials() first, but it didn't fix the problem.
