bnosac / spark.sas7bdat

Read in SAS data in parallel into Apache Spark

R 100.00%

spark sas7bdat r sparklyr

spark.sas7bdat's Issues

dplyr function errors

When you read a file using spark_read_sas and then create an R object from part of the data, it often throws an error.

Create a connection using sparklyr in RStudio:
sc <- spark_connect(master = "local", config = config)

Read a file using spark_read_sas:

mydata <- spark_read_sas(sc, "/your path/ .sas7bdat", "Alldata")

Create an R object using dplyr:

mydata_1 <- mydata %>% select(A, B, C, D)

Now apply a function:

mydata_1 %>% summarise_all(function(x) sum(is.na(x)))

It throws the following error:

Error in as.character(x[[1]]) :
cannot coerce type 'closure' to vector of type 'character'

Note: If you run the same code using haven library's read_sas it works fine.
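
A possible explanation, offered as a sketch rather than a confirmed diagnosis: spark_read_sas returns a remote Spark table, so dplyr verbs are translated to Spark SQL instead of being run in R, and an anonymous function(x) generally cannot be translated. haven's read_sas returns a local data frame, where any R function works. Writing the summary as a formula built only from translatable calls may avoid the error. The example below uses a small local data frame with made-up values so that it runs without Spark; the commented last line is the assumed equivalent for the Spark table in this report.

```r
library(dplyr)

# Small local stand-in for the SAS data (illustrative values only).
local_df <- data.frame(
  A = c(1, NA, 3),
  B = c(NA, NA, "x"),
  stringsAsFactors = FALSE
)

# Formula (lambda) syntax instead of function(x) ...; as.integer() turns the
# logical result of is.na() into something summable, which Spark SQL is
# assumed to require.
na_counts <- local_df %>%
  summarise_all(~ sum(as.integer(is.na(.))))

print(na_counts)

# Assumed equivalent on the Spark table from the report (not run here):
# mydata_1 %>% summarise_all(~ sum(as.integer(is.na(.))))
```

If the formula version still fails on the Spark table, collecting the selected columns into R first (for small data only) is another option.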

Failed to find data source

Trying your example,

myfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
myfile
x <- spark_read_sas(sc, path = myfile, table = "sas_example")
x

I got this error message:

Error: java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
...

I used Spark standalone cluster mode (Spark 2.0.2; 3 nodes).

Can't connect to spark with spark-sas7bdat package

I was trying to follow the example to connect, but it keeps failing. Here's the code I ran on my machine.

library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
sc <- spark_connect(master = "local")

* Using Spark: 2.0.1
Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId,  : 
  Gateway in localhost:8880 did not respond.


Try running `options(sparklyr.log.console = TRUE)` followed by `sc <- spark_connect(...)` for more debugging info.

Following the error message, here's the debugging info.

options(sparklyr.log.console = TRUE)
sc <- spark_connect(master = "local")


* Using Spark: 2.0.1
Ivy Default Cache set to: /Users/matthewson/.ivy2/cache
The jars for the packages stored in: /Users/matthewson/.ivy2/jars
:: loading settings :: url = jar:file:/Users/matthewson/spark/spark-2.0.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
saurfang#spark-sas7bdat added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
:: resolution report :: resolve 774ms :: artifacts dl 0ms
	:: modules in use:
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
	---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
		module not found: saurfang#spark-sas7bdat;2.0.0-s_2.11

	==== local-m2-cache: tried

	  file:/Users/matthewson/.m2/repository/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  file:/Users/matthewson/.m2/repository/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

	==== local-ivy-cache: tried

	  /Users/matthewson/.ivy2/local/saurfang/spark-sas7bdat/2.0.0-s_2.11/ivys/ivy.xml

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  /Users/matthewson/.ivy2/local/saurfang/spark-sas7bdat/2.0.0-s_2.11/jars/spark-sas7bdat.jar

	==== central: tried

	  https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

	==== spark-packages: tried

	  http://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.pom

	  -- artifact saurfang#spark-sas7bdat;2.0.0-s_2.11!spark-sas7bdat.jar:

	  http://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/2.0.0-s_2.11/spark-sas7bdat-2.0.0-s_2.11.jar

		::::::::::::::::::::::::::::::::::::::::::::::

		::          UNRESOLVED DEPENDENCIES         ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: saurfang#spark-sas7bdat;2.0.0-s_2.11: not found

		::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: saurfang#spark-sas7bdat;2.0.0-s_2.11: not found]
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1076)
	at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:294)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Here's my sessionInfo:

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] spark.sas7bdat_1.4 sparklyr_1.7.1    

loaded via a namespace (and not attached):
 [1] pillar_1.6.2      compiler_4.1.1    dbplyr_2.1.1      r2d3_0.2.5        base64enc_0.1-3   tools_4.1.1       digest_0.6.27    
 [8] jsonlite_1.7.2    lifecycle_1.0.0   tibble_3.1.4      pkgconfig_2.0.3   rlang_0.4.11      DBI_1.1.1         rstudioapi_0.13  
[15] curl_4.3.2        yaml_2.2.1        parallel_4.1.1    fastmap_1.1.0     withr_2.4.2       dplyr_1.0.7       httr_1.4.2       
[22] generics_0.1.0    vctrs_0.3.8       htmlwidgets_1.5.3 askpass_1.1       rappdirs_0.3.3    rprojroot_2.0.2   tidyselect_1.1.1 
[29] glue_1.4.2        forge_0.2.0       R6_2.5.1          fansi_0.5.0       purrr_0.3.4       tidyr_1.1.3       magrittr_2.0.1   
[36] ellipsis_0.3.2    htmltools_0.5.2   assertthat_0.2.1  config_0.3.1      utf8_1.2.2        openssl_1.4.4     crayon_1.4.1
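
The log in this report shows that Ivy could not resolve saurfang#spark-sas7bdat;2.0.0-s_2.11 from any of the repositories it tried, so spark-submit exits before the gateway starts. One way to check whether a copy of the jar is already available locally is to look in the Ivy jar directory named in the log. A minimal sketch: find_cached_jars is a hypothetical helper, and its default directory is assumed from the ~/.ivy2/jars path in the log.

```r
# Hypothetical helper: list jars in `jar_dir` whose file name matches `pattern`.
# The default directory is an assumption based on the Ivy paths in the log.
find_cached_jars <- function(jar_dir = file.path(path.expand("~"), ".ivy2", "jars"),
                             pattern = "spark-sas7bdat") {
  if (!dir.exists(jar_dir)) {
    return(character(0))
  }
  list.files(jar_dir, pattern = pattern, full.names = TRUE)
}

jars <- find_cached_jars()
if (length(jars) == 0) {
  message("No spark-sas7bdat jar found in the local Ivy jar directory.")
} else {
  print(jars)
}
```

An empty result only confirms that the dependency has to be downloaded; it does not say which repository, if any, still hosts it.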

Short description clarification

It would be good to modify the short description of the repository from:

Read in SAS data in parallel into Apache Spark

To:

R library to read SAS data in parallel into Apache Spark
or
R library to read SAS data in parallel into Apache Spark (using spark-sas7bdat)

or similar, since this package is specifically for R. The current description confused me: for a while I thought it was a replacement for spark-sas7bdat, until I read the README properly and found that it actually uses spark-sas7bdat.

Failed to find data source: com.github.saurfang.sas.spark.

I'm getting a similar error to another (now closed) issue, on two different computers, connected to different internet connections.

I've used `spark_read_sas` with no issues (Spark 2.0.0) for the past two days, but now I'm consistently getting this error, both on your example and on my own data.

It looks like a bug, but I could be wrong.

java.lang.ClassNotFoundException: Failed to find data source: com.github.saurfang.sas.spark. Please find packages at http://spark-packages.org
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:310)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at sparklyr.Invoke.invoke(invoke.scala:139)
	at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
	at sparklyr.StreamHandler.read(stream.scala:66)
	at sparklyr.BackendHandler.channelRead0(handler.scala:51)
	at sparklyr.BackendHandler.channelRead0(handler.scala:4)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.github.saurfang.sas.spark.DefaultSource
	at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:130)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:130)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:130)
	... 32 more
