bnosac / spark.sas7bdat Goto Github PK
View Code? Open in Web Editor NEWRead in SAS data in parallel into Apache Spark
Read in SAS data in parallel into Apache Spark
When you upload a file using spark_read_sas and then create an R object using part of the data then it often throws and error.
Create a connection using sparklyr in R studio.
sc<-spark_connect(master = "local", config = config)
read a file using spark_read_sas
mydata<-spark_read_sas(sc, "/your path/ .sas7bdat", Alldata)
create an R object using dplyr
mydata_1 <- mydata %>% select(A,B,C,D)
Now apply a function
mydata_1 %>% summarise_all(function(x) sum(is.na(x)))
It throws the following error
Error in as.character(x[[1]]) :
cannot coerce type 'closure' to vector of type 'character'
Note: If you run the same code using haven library's read_sas it works fine.
Trying your example,
myfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
myfile
x <- spark_read_sas(sc, path = myfile, table = "sas_example")
x
I got an error message
Error: java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
...
I used Spark standalone cluster mode (Spark 2.0.2; 3 nodes).
I was trying to follow the example to connect, but it keeps failing. Here's code I ran on my machine.
library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
sc <- spark_connect(master = "local")
* Using Spark: 2.0.1
Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, :
Gateway in localhost:8880 did not respond.
Try running `options(sparklyr.log.console = TRUE)` followed by `sc <- spark_connect(...)` for more debugging info.
Following the error message, here's debugging info.
options(sparklyr.log.console = TRUE)
sc <- spark_connect(master = "local")
* Using Spark: 2.0.1
Ivy Default Cache set to: /Users/matthewson/.ivy2/cache
The jars for the packages stored in: /Users/matthewson/.ivy2/jars
:: loading settings :: url = jar:file:/Users/matthewson/spark/spark-2.0.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
saurfang#spark-sas7bdat added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 774ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: saurfang#spark-sas7bdat;2.0.0-s_2.11
==== local-m2-cache: tried
file:/Users/matthewson/.m2/repository/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom
-- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:
file:/Users/matthewson/.m2/repository/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar
==== local-ivy-cache: tried
/Users/matthewson/.ivy2/local/saurfang/spark-sas7bdat/2.0.0-s_2.11/ivys/ivy.xml
-- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:
/Users/matthewson/.ivy2/local/saurfang/spark-sas7bdat/2.0.0-s_2.11/jars/spark-sas7bdat.jar
==== central: tried
https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom
-- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:
https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar
==== spark-packages: tried
http://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom
-- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:
http://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: saurfang#spark-sas7bdat;2.0.0-s_2.11: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: saurfang#spark-sas7bdat;2.0.0-s_2.11: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1076)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:294)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Here's my sessionInfo:
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5.2
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] spark.sas7bdat_1.4 sparklyr_1.7.1
loaded via a namespace (and not attached):
[1] pillar_1.6.2 compiler_4.1.1 dbplyr_2.1.1 r2d3_0.2.5 base64enc_0.1-3 tools_4.1.1 digest_0.6.27
[8] jsonlite_1.7.2 lifecycle_1.0.0 tibble_3.1.4 pkgconfig_2.0.3 rlang_0.4.11 DBI_1.1.1 rstudioapi_0.13
[15] curl_4.3.2 yaml_2.2.1 parallel_4.1.1 fastmap_1.1.0 withr_2.4.2 dplyr_1.0.7 httr_1.4.2
[22] generics_0.1.0 vctrs_0.3.8 htmlwidgets_1.5.3 askpass_1.1 rappdirs_0.3.3 rprojroot_2.0.2 tidyselect_1.1.1
[29] glue_1.4.2 forge_0.2.0 R6_2.5.1 fansi_0.5.0 purrr_0.3.4 tidyr_1.1.3 magrittr_2.0.1
[36] ellipsis_0.3.2 htmltools_0.5.2 assertthat_0.2.1 config_0.3.1 utf8_1.2.2 openssl_1.4.4 crayon_1.4.1
It would be good to modify the short description of the repository from:
Read in SAS data in parallel into Apache Spark
To:
R library to read SAS data in parallel into Apache Spark
or
R library to read SAS data in parallel into Apache Spark (using spark-sas7bdat)
or similar, as this is specifically for R and caused a bit of confusion for me as I thought it was a replacement for spark-sas7bdat for a while (Until I properly read the readme and found it actually uses spark-sas7bdat)
I'm getting a similar error to another (now closed) issue, on two different computers, connected to different internet connections.
I've used ´spark_read_sas´ with no issues (Spark 2.0.0) the past two days, but now I'm consequently getting this error. On your example and on my own data.
Looks like an issue, but I can be wrong.
java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at sparklyr.Invoke.invoke(invoke.scala:139)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
at sparklyr.StreamHandler.read(stream.scala:66)
at sparklyr.BackendHandler.channelRead0(handler.scala:51)
at sparklyr.BackendHandler.channelRead0(handler.scala:4)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.github.saurfang.sas.spark.DefaultSource
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:130)
... 32 more
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.