
spark.sas7bdat's Introduction

spark.sas7bdat

The spark.sas7bdat package allows R users working with Apache Spark to read SAS datasets in .sas7bdat format into Spark, using the spark-sas7bdat Spark package. This allows R users to

  • load a SAS dataset in parallel into a Spark table for further processing with the sparklyr package
  • process the full SAS dataset in parallel with dplyr statements, instead of importing it entirely into RAM (using the foreign/haven packages), thereby avoiding memory problems with large imports

Example

The following example reads a file called iris.sas7bdat into a table called sas_example in Spark. Do try this with bigger data on your cluster, and look at the help of the sparklyr package on how to connect to your Spark cluster.

library(sparklyr)
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")

sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")
x

The resulting pointer to a Spark table can be further used in dplyr statements

library(dplyr)
x %>% group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width))
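
If you want the aggregated result back in R as an ordinary data frame, adding collect() at the end of the pipeline does that; a minimal sketch continuing the example above:

library(dplyr)
# Run the aggregation inside Spark and bring only the small result back into R
result <- x %>%
  group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width)) %>%
  collect()
result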

Installation

Install the package from CRAN.

install.packages('spark.sas7bdat')

Or install the development version from GitHub.

devtools::install_github("bnosac/spark.sas7bdat", build_vignettes = TRUE)
vignette("spark_sas7bdat_examples", package = "spark.sas7bdat")

The package has been tested with Spark 2.0.1 and Hadoop 2.7.

library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
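
To double-check which Spark builds are available locally, sparklyr provides spark_installed_versions(); a quick sanity check (output differs per machine):

library(sparklyr)
# List the Spark/Hadoop builds that sparklyr has installed locally
spark_installed_versions()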

Speed comparison

To compare the functionality with the read_sas function from the haven package, below we show a comparison on a small SAS dataset of 5,234,557 rows by 2 columns with only numeric data. Processing is done on 8 cores. With the haven package you need to import the data into RAM; with the spark.sas7bdat package, you can immediately execute dplyr statements on top of the SAS dataset.

mysasfile <- "/home/bnosac/Desktop/testdata.sas7bdat"
system.time(x <- spark_read_sas(sc, path = mysasfile, table = "testdata"))
   user  system elapsed 
  0.008   0.000   0.051 
system.time(x <- haven::read_sas(mysasfile))
   user  system elapsed 
  1.172   0.032   1.200 
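
To illustrate the point that dplyr statements run directly on the SAS-backed Spark table without importing the data into R, a minimal sketch (the table name is illustrative; only the one-row summary is collected into R):

library(dplyr)
# Register the SAS file as a Spark table and aggregate it inside Spark
sastab <- spark_read_sas(sc, path = mysasfile, table = "testdata_spark")
sastab %>% summarise(n = n()) %>% collect()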

Support in big data and Spark analysis

Need support in big data and Spark analysis? Contact BNOSAC: http://www.bnosac.be


spark.sas7bdat's Issues

Short description clarification

It would be good to modify the short description of the repository from:

Read in SAS data in parallel into Apache Spark

To:

R library to read SAS data in parallel into Apache Spark
or
R library to read SAS data in parallel into Apache Spark (using spark-sas7bdat)

or similar, as this package is specifically for R. It caused a bit of confusion for me: I thought it was a replacement for spark-sas7bdat for a while, until I properly read the README and found that it actually uses spark-sas7bdat.

Failed to find data source: com.github.saurfang.sas.spark.

I'm getting a similar error to another (now closed) issue, on two different computers connected to different internet connections.

I've used `spark_read_sas` with no issues (Spark 2.0.0) for the past two days, but now I'm consistently getting this error, both on your example and on my own data.

Looks like an issue, but I could be wrong.

java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at http://spark-packages.org
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:310)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at sparklyr.Invoke.invoke(invoke.scala:139)
	at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
	at sparklyr.StreamHandler.read(stream.scala:66)
	at sparklyr.BackendHandler.channelRead0(handler.scala:51)
	at sparklyr.BackendHandler.channelRead0(handler.scala:4)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.github.saurfang.sas.spark.DefaultSource
	at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:130)
	... 32 more
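
A note on this error (not part of the original thread): the missing class ships with the saurfang:spark-sas7bdat Spark package, which spark.sas7bdat registers as a sparklyr extension, and the extension is only picked up when the package is attached before spark_connect() is called. A minimal sketch of a connection sequence that should put the data source on the classpath, reusing the README example:

library(sparklyr)
library(spark.sas7bdat)  # attach before spark_connect so the sparklyr extension
                         # registers the saurfang:spark-sas7bdat dependency
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")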

dplyr function errors

When you upload a file using spark_read_sas and then create an R object using part of the data, it often throws an error.

Create a connection using sparklyr in RStudio:

sc <- spark_connect(master = "local", config = config)

Read a file using spark_read_sas:

mydata <- spark_read_sas(sc, path = "/your path/ .sas7bdat", table = "Alldata")

Create an R object using dplyr:

mydata_1 <- mydata %>% select(A, B, C, D)

Now apply a function:

mydata_1 %>% summarise_all(function(x) sum(is.na(x)))

It throws the following error:

Error in as.character(x[[1]]) :
cannot coerce type 'closure' to vector of type 'character'

Note: if you run the same code using the haven package's read_sas, it works fine.
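
A possible workaround, untested and not part of the original report: dplyr verbs on a Spark table are translated to Spark SQL by dbplyr, which cannot translate an arbitrary anonymous function(x) closure; rewriting the summary with formula notation and functions that do have SQL translations usually avoids this error. A sketch:

library(dplyr)
# Count missing values per column; is.na(), as.integer() and sum() all have
# SQL translations, so the whole expression can run inside Spark
mydata_1 %>% summarise_all(~ sum(as.integer(is.na(.)), na.rm = TRUE))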

Can't connect to spark with spark-sas7bdat package

I was trying to follow the example to connect, but it keeps failing. Here's the code I ran on my machine.

library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
sc <- spark_connect(master = "local")

* Using Spark: 2.0.1
Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId,  : 
  Gateway in localhost:8880 did not respond.


Try running `options(sparklyr.log.console = TRUE)` followed by `sc <- spark_connect(...)` for more debugging info.

Following the error message, here's debugging info.

options(sparklyr.log.console = TRUE)
sc <- spark_connect(master = "local")


* Using Spark: 2.0.1
Ivy Default Cache set to: /Users/matthewson/.ivy2/cache
The jars for the packages stored in: /Users/matthewson/.ivy2/jars
:: loading settings :: url = jar:file:/Users/matthewson/spark/spark-2.0.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
saurfang#spark-sas7bdat added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
:: resolution report :: resolve 774ms :: artifacts dl 0ms
	:: modules in use:
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
	---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
		module not found: saurfang#spark-sas7bdat;2.0.0-s_2.11

	==== local-m2-cache: tried

	  file:/Users/matthewson/.m2/repository/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  file:/Users/matthewson/.m2/repository/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

	==== local-ivy-cache: tried

	  /Users/matthewson/.ivy2/local/saurfang/spark-sas7bdat/2.0.0-s_2.11/ivys/ivy.xml

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  /Users/matthewson/.ivy2/local/saurfang/spark-sas7bdat/2.0.0-s_2.11/jars/spark-sas7bdat.jar

	==== central: tried

	  https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

	==== spark-packages: tried

	  http://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  http://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

		::::::::::::::::::::::::::::::::::::::::::::::

		::          UNRESOLVED DEPENDENCIES         ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: saurfang#spark-sas7bdat;2.0.0-s_2.11: not found

		::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: saurfang#spark-sas7bdat;2.0.0-s_2.11: not found]
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1076)
	at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:294)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Here's my sessionInfo:

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] spark.sas7bdat_1.4 sparklyr_1.7.1    

loaded via a namespace (and not attached):
 [1] pillar_1.6.2      compiler_4.1.1    dbplyr_2.1.1      r2d3_0.2.5        base64enc_0.1-3   tools_4.1.1       digest_0.6.27    
 [8] jsonlite_1.7.2    lifecycle_1.0.0   tibble_3.1.4      pkgconfig_2.0.3   rlang_0.4.11      DBI_1.1.1         rstudioapi_0.13  
[15] curl_4.3.2        yaml_2.2.1        parallel_4.1.1    fastmap_1.1.0     withr_2.4.2       dplyr_1.0.7       httr_1.4.2       
[22] generics_0.1.0    vctrs_0.3.8       htmlwidgets_1.5.3 askpass_1.1       rappdirs_0.3.3    rprojroot_2.0.2   tidyselect_1.1.1 
[29] glue_1.4.2        forge_0.2.0       R6_2.5.1          fansi_0.5.0       purrr_0.3.4       tidyr_1.1.3       magrittr_2.0.1   
[36] ellipsis_0.3.2    htmltools_0.5.2   assertthat_0.2.1  config_0.3.1      utf8_1.2.2        openssl_1.4.4     crayon_1.4.1   
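
A note outside the original report: in the log above the resolver only reaches http://dl.bintray.com/spark-packages/maven, the Bintray-hosted spark-packages repository that was retired in 2021; the repository now lives at https://repos.spark-packages.org. A possible workaround is sketched below, assuming the Spark build being launched honours the spark.jars.repositories setting and that the saurfang:spark-sas7bdat coordinate is published in the new repository:

library(sparklyr)
library(spark.sas7bdat)
# Point dependency resolution at the current spark-packages repository instead
# of the retired Bintray mirror (assumption: the coordinate resolves there)
config <- spark_config()
config$spark.jars.repositories <- "https://repos.spark-packages.org"
sc <- spark_connect(master = "local", config = config)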

Failed to find data source

Trying your example,

myfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
myfile
x <- spark_read_sas(sc, path = myfile, table = "sas_example")
x

I got an error message

Error: java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
...

I used Spark standalone cluster mode (Spark 2.0.2; 3 nodes).
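
A diagnostic sketch, not from the original report: listing the jars registered with the running SparkContext shows whether the spark-sas7bdat jar actually reached the session before spark_read_sas was called. This uses sparklyr's low-level invoke interface; whether the jar shows up depends on how it was shipped to the cluster:

library(sparklyr)
# List the jars attached to the running SparkContext; the spark-sas7bdat jar
# should appear here if the data source is available to the session
invoke(spark_context(sc), "listJars")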
