
spark.sas7bdat's Introduction

spark.sas7bdat

The spark.sas7bdat package allows R users working with Apache Spark to read SAS datasets in .sas7bdat format into Spark, using the spark-sas7bdat Spark package. This allows R users to

  • load a SAS dataset in parallel into a Spark table for further processing with the sparklyr package
  • process the full SAS dataset in parallel with dplyr statements, instead of importing it entirely into RAM (using the foreign/haven packages), thereby avoiding memory problems with large imports

Example

The following example reads a file called iris.sas7bdat into a table called sas_example in Spark. Do try this with bigger data on your cluster, and look at the help of the sparklyr package on how to connect to your Spark cluster.

library(sparklyr)
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")

sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")
x

The resulting pointer to a Spark table can be further used in dplyr statements

library(dplyr)
x %>% group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width))
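
If you want the aggregated result back in R as an ordinary data frame, adding collect() at the end of the pipeline does that; a minimal sketch continuing the example above:

library(dplyr)
# Run the aggregation inside Spark and bring only the small result back into R
result <- x %>%
  group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width)) %>%
  collect()
result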

Installation

Install the package from CRAN.

install.packages('spark.sas7bdat')

Or install the development version from GitHub.

devtools::install_github("bnosac/spark.sas7bdat", build_vignettes = TRUE)
vignette("spark_sas7bdat_examples", package = "spark.sas7bdat")

The package has been tested with Spark 2.0.1 and Hadoop 2.7.

library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
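
To double-check which Spark builds are available locally, sparklyr provides spark_installed_versions(); a quick sanity check (output differs per machine):

library(sparklyr)
# List the Spark/Hadoop builds that sparklyr has installed locally
spark_installed_versions()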

Speed comparison

To compare the functionality with the read_sas function from the haven package, below we show a comparison on a small SAS dataset of 5,234,557 rows by 2 columns with only numeric data. Processing is done on 8 cores. With the haven package you need to import the data into RAM; with the spark.sas7bdat package, you can immediately execute dplyr statements on top of the SAS dataset.

mysasfile <- "/home/bnosac/Desktop/testdata.sas7bdat"
system.time(x <- spark_read_sas(sc, path = mysasfile, table = "testdata"))
   user  system elapsed 
  0.008   0.000   0.051 
system.time(x <- haven::read_sas(mysasfile))
   user  system elapsed 
  1.172   0.032   1.200 
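
To illustrate the point that dplyr statements run directly on the SAS-backed Spark table without importing the data into R, a minimal sketch (the table name is illustrative; only the one-row summary is collected into R):

library(dplyr)
# Register the SAS file as a Spark table and aggregate it inside Spark
sastab <- spark_read_sas(sc, path = mysasfile, table = "testdata_spark")
sastab %>% summarise(n = n()) %>% collect()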

Support in big data and Spark analysis

Need support in big data and Spark analysis? Contact BNOSAC: http://www.bnosac.be


spark.sas7bdat's Issues

Short description clarification

It would be good to modify the short description of the repository from:

Read in SAS data in parallel into Apache Spark

To:

R library to read SAS data in parallel into Apache Spark
or
R library to read SAS data in parallel into Apache Spark (using spark-sas7bdat)

or similar, as this package is specifically for R. It caused a bit of confusion for me: I thought it was a replacement for spark-sas7bdat for a while, until I properly read the README and found that it actually uses spark-sas7bdat.

Failed to find data source: com.github.saurfang.sas.spark.

I'm getting a similar error to another (now closed) issue, on two different computers connected to different internet connections.

I've used `spark_read_sas` with no issues (Spark 2.0.0) for the past two days, but now I'm consistently getting this error, both on your example and on my own data.

Looks like an issue, but I could be wrong.

java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at http://spark-packages.org
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:310)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at sparklyr.Invoke.invoke(invoke.scala:139)
	at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
	at sparklyr.StreamHandler.read(stream.scala:66)
	at sparklyr.BackendHandler.channelRead0(handler.scala:51)
	at sparklyr.BackendHandler.channelRead0(handler.scala:4)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.github.saurfang.sas.spark.DefaultSource
	at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:130)
	... 32 more
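
A note on this error (not part of the original thread): the missing class ships with the saurfang:spark-sas7bdat Spark package, which spark.sas7bdat registers as a sparklyr extension, and the extension is only picked up when the package is attached before spark_connect() is called. A minimal sketch of a connection sequence that should put the data source on the classpath, reusing the README example:

library(sparklyr)
library(spark.sas7bdat)  # attach before spark_connect so the sparklyr extension
                         # registers the saurfang:spark-sas7bdat dependency
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")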

dplyr function errors

When you upload a file using spark_read_sas and then create an R object using part of the data, it often throws an error.

Create a connection using sparklyr in RStudio:

sc <- spark_connect(master = "local", config = config)

Read a file using spark_read_sas:

mydata <- spark_read_sas(sc, path = "/your path/ .sas7bdat", table = "Alldata")

Create an R object using dplyr:

mydata_1 <- mydata %>% select(A, B, C, D)

Now apply a function:

mydata_1 %>% summarise_all(function(x) sum(is.na(x)))

It throws the following error:

Error in as.character(x[[1]]) :
cannot coerce type 'closure' to vector of type 'character'

Note: if you run the same code using the haven package's read_sas, it works fine.
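
A possible workaround, untested and not part of the original report: dplyr verbs on a Spark table are translated to Spark SQL by dbplyr, which cannot translate an arbitrary anonymous function(x) closure; rewriting the summary with formula notation and functions that do have SQL translations usually avoids this error. A sketch:

library(dplyr)
# Count missing values per column; is.na(), as.integer() and sum() all have
# SQL translations, so the whole expression can run inside Spark
mydata_1 %>% summarise_all(~ sum(as.integer(is.na(.)), na.rm = TRUE))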

Can't connect to spark with spark-sas7bdat package

I was trying to follow the example to connect, but it keeps failing. Here's the code I ran on my machine.

library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
sc <- spark_connect(master = "local")

* Using Spark: 2.0.1
Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId,  : 
  Gateway in localhost:8880 did not respond.


Try running `options(sparklyr.log.console = TRUE)` followed by `sc <- spark_connect(...)` for more debugging info.

Following the error message, here's debugging info.

options(sparklyr.log.console = TRUE)
sc <- spark_connect(master = "local")


* Using Spark: 2.0.1
Ivy Default Cache set to: /Users/matthewson/.ivy2/cache
The jars for the packages stored in: /Users/matthewson/.ivy2/jars
:: loading settings :: url = jar:file:/Users/matthewson/spark/spark-2.0.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
saurfang#spark-sas7bdat added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
:: resolution report :: resolve 774ms :: artifacts dl 0ms
	:: modules in use:
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
	---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
		module not found: saurfang#spark-sas7bdat;2.0.0-s_2.11

	==== local-m2-cache: tried

	  file:/Users/matthewson/.m2/repository/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  file:/Users/matthewson/.m2/repository/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

	==== local-ivy-cache: tried

	  /Users/matthewson/.ivy2/local/saurfang/spark-sas7bdat/2.0.0-s_2.11/ivys/ivy.xml

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  /Users/matthewson/.ivy2/local/saurfang/spark-sas7bdat/2.0.0-s_2.11/jars/spark-sas7bdat.jar

	==== central: tried

	  https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

	==== spark-packages: tried

	  http://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  http://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

		::::::::::::::::::::::::::::::::::::::::::::::

		::          UNRESOLVED DEPENDENCIES         ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: saurfang#spark-sas7bdat;2.0.0-s_2.11: not found

		::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: saurfang#spark-sas7bdat;2.0.0-s_2.11: not found]
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1076)
	at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:294)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Here's my sessionInfo:

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] spark.sas7bdat_1.4 sparklyr_1.7.1    

loaded via a namespace (and not attached):
 [1] pillar_1.6.2      compiler_4.1.1    dbplyr_2.1.1      r2d3_0.2.5        base64enc_0.1-3   tools_4.1.1       digest_0.6.27    
 [8] jsonlite_1.7.2    lifecycle_1.0.0   tibble_3.1.4      pkgconfig_2.0.3   rlang_0.4.11      DBI_1.1.1         rstudioapi_0.13  
[15] curl_4.3.2        yaml_2.2.1        parallel_4.1.1    fastmap_1.1.0     withr_2.4.2       dplyr_1.0.7       httr_1.4.2       
[22] generics_0.1.0    vctrs_0.3.8       htmlwidgets_1.5.3 askpass_1.1       rappdirs_0.3.3    rprojroot_2.0.2   tidyselect_1.1.1 
[29] glue_1.4.2        forge_0.2.0       R6_2.5.1          fansi_0.5.0       purrr_0.3.4       tidyr_1.1.3       magrittr_2.0.1   
[36] ellipsis_0.3.2    htmltools_0.5.2   assertthat_0.2.1  config_0.3.1      utf8_1.2.2        openssl_1.4.4     crayon_1.4.1   
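
A note outside the original report: in the log above the resolver only reaches http://dl.bintray.com/spark-packages/maven, the Bintray-hosted spark-packages repository that was retired in 2021; the repository now lives at https://repos.spark-packages.org. A possible workaround is sketched below, assuming the Spark build being launched honours the spark.jars.repositories setting and that the saurfang:spark-sas7bdat coordinate is published in the new repository:

library(sparklyr)
library(spark.sas7bdat)
# Point dependency resolution at the current spark-packages repository instead
# of the retired Bintray mirror (assumption: the coordinate resolves there)
config <- spark_config()
config$spark.jars.repositories <- "https://repos.spark-packages.org"
sc <- spark_connect(master = "local", config = config)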

Failed to find data source

Trying your example,

myfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
myfile
x <- spark_read_sas(sc, path = myfile, table = "sas_example")
x

I got an error message

Error: java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
...

I used Spark standalone cluster mode (Spark 2.0.2; 3 nodes).
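
A diagnostic sketch, not from the original report: listing the jars registered with the running SparkContext shows whether the spark-sas7bdat jar actually reached the session before spark_read_sas was called. This uses sparklyr's low-level invoke interface; whether the jar shows up depends on how it was shipped to the cluster:

library(sparklyr)
# List the jars attached to the running SparkContext; the spark-sas7bdat jar
# should appear here if the data source is available to the session
invoke(spark_context(sc), "listJars")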
