zuinnote / spark-hadoopoffice-ds
A Spark datasource for the HadoopOffice library
License: Apache License 2.0
Since Spark 2.3.0 (https://spark.apache.org/releases/spark-release-2-3-0.html) a new, experimental datasource API v2 is available.
We should add support for it in addition to the v1 API.
In org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser v1.3.3 the following lines give me a NullPointerException:
WorkbookDocument wd = WorkbookDocument.Factory.parse(workbookDataXML);
this.isDate1904 = wd.getWorkbook().getWorkbookPr().getDate1904();
because wd.getWorkbook().getWorkbookPr() can return null.
For fix see example https://github.com/alibaba/easyexcel/blob/master/src/main/java/com/alibaba/excel/analysis/v07/XlsxSaxAnalyser.java#L40
Unfortunately, I am not allowed to distribute any of the spreadsheets that fail.
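The fix can be sketched as follows. This is a minimal, hypothetical Scala model of the structures involved (not the actual XMLBeans API): workbookPr is an optional element of the OOXML workbook, so a missing element should fall back to the default 1900 date system instead of throwing.

```scala
// Hypothetical stand-ins for the XMLBeans types; in the real parser
// these are WorkbookDocument / CTWorkbook / CTWorkbookPr.
case class WorkbookPr(date1904: Boolean)
case class Workbook(workbookPr: WorkbookPr) // workbookPr may be null

// Null-safe lookup: a missing workbookPr element means "not date1904".
def isDate1904(workbook: Workbook): Boolean =
  Option(workbook.workbookPr).exists(_.date1904)
```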
Stacktrace:
12:57:44,174 ERROR [rker for task 0] org.apache.spark.executor.Executor.logError: Exception in task 0.0 in stage 0.0 (TID 0) line: org.apache.spark.internal.Logging$class.logError(Logging.scala:91)
java.lang.NullPointerException
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:349)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.<init>(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3.apply(DefaultSource.scala:317)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3.apply(DefaultSource.scala:312)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Hi Jörn,
I am really enthusiastic about your library and we want to use it ☺︎. I can also see that you update it regularly and recently.
May I ask your opinion about the production readiness or maturity of your library? Would you recommend using it in production?
Thanks, Mark
Using version 1.0.2
Currently, when I try to use the hadoopoffice-ds package to write an .xlsx file on my Spark 2.1.0 cluster (for example by using your own example "hadoopoffice\examples\scala-spark2-excel-out-ds"), the df.write call fails with an error stating that a method of the abstract OutputWriterFactory class cannot be found on the OutputWriterFactory it is trying to use.
Stacktrace:
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
... 3 more
My own example code that's failing with Spark 2.1.0:
import java.io.{PrintWriter, StringWriter}
import org.apache.spark.sql.{SQLContext, SaveMode}
import spark.implicits._

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val sRdd = sc.parallelize(Seq("1", "2", "3", "4", "5")).repartition(1)
val df = sRdd.toDF()
try {
  df.write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output/")
} catch {
  case e: Exception =>
    val sw = new StringWriter
    e.printStackTrace(new PrintWriter(sw))
    println(sw.toString)
}
It works fine when I use Spark 2.0.2 however.
I did some research:
It's correct that Spark 2.1.0 adds a method to the abstract OutputWriterFactory:
org.apache.spark.sql.execution.datasources.OutputWriterFactory#getFileExtension(context: TaskAttemptContext): String
I saw a similar issue when the databricks-avro dependency was used: version 3.1.0 had the same problem, and version 3.2.0 fixed it.
Link describing the issue and fix: databricks/spark-redshift#315
That is not the dependency you are using, however (you use Apache's own version), so I suspect one of your dependencies is affected in the same way.
All that needs to be done is to add an override for the method returning a file extension.
I can see what else I can find in the meantime, but I'm having issues setting up debugging from IntelliJ with the Spark cluster (HDInsight), so have a hard time figuring out where it's going wrong exactly. Maybe you can check it out too?
Appreciate your work by the way!
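To illustrate the suggested fix: Spark 2.1.0 added a getFileExtension method to the abstract OutputWriterFactory, and a datasource compiled against the older API does not provide it, so Spark's call fails with AbstractMethodError at runtime. The sketch below uses simplified stand-in types (not the real Spark classes; the real method takes a TaskAttemptContext) to show the shape of the required override:

```scala
// Stand-in for Spark's abstract factory; the real class lives in
// org.apache.spark.sql.execution.datasources.
abstract class OutputWriterFactoryStub {
  // Method added in the newer API version.
  def getFileExtension(context: String): String
}

// Datasource side: overriding the new method avoids the AbstractMethodError.
class ExcelOutputWriterFactoryStub extends OutputWriterFactoryStub {
  override def getFileExtension(context: String): String = ".xlsx"
}
```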
I have the Python code below on Databricks and I am getting the error "java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V".
Please help.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.zuinnote.spark.office.excel").option("read.locale.bcp47", "us").load('/home/user/office/input')
All configuration options that contain a capital letter are not applied correctly. This is due to an undocumented Spark behaviour of converting all option keys to lower case. This means, for example, that options such as hadoopoffice.read.lowFootprint have never been applied when using the Spark-HadoopOffice data source (there are no issues on the other platforms).
Most likely we will implement a workaround in the HadoopOffice library without breaking compatibility.
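One possible workaround (an assumed approach, not necessarily what will actually be implemented) is to resolve configuration keys case-insensitively, so that keys Spark has already lower-cased still match their camel-case definitions:

```scala
// Look a key up first verbatim, then case-insensitively, so that
// "hadoopoffice.read.lowfootprint" (as lower-cased by Spark) still
// resolves the option defined as "hadoopoffice.read.lowFootprint".
def resolveOption(options: Map[String, String], key: String): Option[String] =
  options.get(key).orElse {
    val wanted = key.toLowerCase
    options.collectFirst { case (k, v) if k.toLowerCase == wanted => v }
  }
```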
Scala 2.10 needs to be dropped, because we will drop support for JDK7:
ZuInnoTe/hadoopoffice#20
Furthermore, Spark 2.1.0 supports only Scala 2.11
When importing xlsx files, some decimal values are skipped. Superficially it looks like it depends on the number of decimal places.
CVE-2021-44228 describes a highly critical security issue in the log4j2 library. While in most installations where HadoopOffice is used it is less likely to be exploitable, we include log4j2 > 2.15 to mitigate this issue. Independent of this, as defense in depth, we additionally recommend upgrading to the latest JDK8 (at least update 191, better the latest) or to JDK11/JDK17, and setting the Java system properties (e.g. via -D) "log4j2.formatMsgNoLookups" to "true" and "com.sun.jndi.rmi.object.trustURLCodebase" and "com.sun.jndi.cosnaming.object.trustURLCodebase" to "false".
The full list of mitigations can be found in the link above.
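For reference, the system properties named above can be passed to the JVM like this (adjust to your launcher; this is only the -D form of the recommendations, not an additional mitigation):

```
-Dlog4j2.formatMsgNoLookups=true
-Dcom.sun.jndi.rmi.object.trustURLCodebase=false
-Dcom.sun.jndi.cosnaming.object.trustURLCodebase=false
```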
Partitioning is useful for splitting the output into folders and files while writing (and reading).
In the Excel case it can be useful for separating data by day, customer or other keys, or for creating smaller spreadsheets that load faster.
Currently, when the DataFrameWriter.partitionBy() method is used, this datasource generates a single file and overwrites it for each partition.
For example, when a dataframe is partitioned by year, month and day, Spark writes two different layouts when using 'csv' (or parquet, avro, etc.) versus the Excel output from this datasource.
Using 'csv':
csv_output_dir/_SUCCESS
csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
Using 'org.zuinnote.spark.office.excel':
xls_output_dir/part-m-00000.xls
xls_output_dir/_SUCCESS
By default the datasource tries to infer the simple schema (read.spark.simpleMode) from all rows in an Excel file. The idea is to specify a new option read.spark.inferSchemaRows to limit inference to a certain number of rows (e.g. 100). Default: all rows.
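The idea can be sketched as follows (illustrative only; the real implementation has to handle all supported primitive types, not just a decimal/string split):

```scala
import scala.util.Try

// Infer a crude per-column type from at most `inferSchemaRows` rows:
// a column is "decimal" only if every sampled value parses as a number,
// otherwise it falls back to "string".
def inferSimpleSchema(rows: Seq[Seq[String]], inferSchemaRows: Int): Seq[String] = {
  val sample = rows.take(inferSchemaRows)
  val numCols = sample.map(_.length).foldLeft(0)(math.max)
  (0 until numCols).map { col =>
    val values = sample.flatMap(_.lift(col))
    if (values.nonEmpty && values.forall(v => Try(BigDecimal(v)).isSuccess)) "decimal"
    else "string"
  }
}
```

Limiting the sample bounds memory and time on large spreadsheets at the risk of mis-typing a column whose non-numeric values only appear after the sampled rows.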
The Spark datasource does not pick up options for writing correctly. That means some options for writing are not properly set.
Currently the datasource supports reading spreadsheets, such as MSExcel, into a dataframe consisting of Array of SpreadSheetCellDAO. Although this is a very flexible mechanism, supporting conversion in any of the Spark SQL supported datatypes, handling of erroneous data etc., some users want to have an automated simple conversion into a DataFrame consisting of Spark SQL primitive datatypes, such as String, Decimal, Byte, Short, Integer or Long.
The idea is that with an option read.spark.simpleMode this can be supported.
Add unit/integration tests.
All source files and packages shall contain SPDX metadata information, especially about licenses, copyrights, creator, package download information.
See here for more information about SPDX: https://spdx.org/about
Hi,
I am working on the Databricks platform with runtime version 5.2 (Spark 2.4.0, Scala 2.11)
and have the following version of the library installed from maven central:
com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.3.0
The worker and driver nodes are Standard DS3 (14 GB memory).
So, I am trying to read an .xlsx file with the following line of code:
sqlContext.read.format("org.zuinnote.spark.office.excel").option("read.sheet.skiplines.num", 1).option("read.locale.bcp47", "us").load(path)
I am able to read a smaller file of around 100 records with no issue, but each time I try to read a file of around 700,000 records I get the following stack trace:
Py4JJavaError: An error occurred while calling o622.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 278.0 failed 4 times, most recent failure: Lost task 0.3 in stage 278.0 (TID 10532, 10.139.64.9, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2100)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2088)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2087)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2087)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1076)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1076)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1076)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2319)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2267)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2255)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2252)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:259)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:269)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:69)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:75)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:497)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:48)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectResult(Dataset.scala:2827)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3439)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2556)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2556)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:3423)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:99)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:228)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:85)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:158)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3422)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2556)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2770)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:265)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:302)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Cleaning up unclosed ZipFile for archive /local_disk0/tmp/tmp-6233698570479588979.xlsx
Kindly let me know if this is an issue with the version I have installed, and whether there are any memory constraints for the library to function properly.
Any help is appreciated
Thanks
According to the above code, when writing Excel files, all sheets except the default sheet won't have a header.
Hello,
Currently your library supports reading excel files and writing them, which is excellent!
What I would really like to do is:
Create 1 page with calculations, referring to other sheets (250+) with just raw data (480 cells per sheet)
What I have:
250+ dfs of raw data (integers, strings), 1 df for each sheet to be output
According to the documentation data can be written in two ways:
- You can have a dataframe with columns of simple datatypes (no map, no list, no struct) that should be written in rows of an Excel sheet. You can define the sheetname by using the option "write.spark.defaultsheetname" (default is "Sheet1"). In this way, you can only write values, but no formulas, comments etc.
- You can have a dataframe with arrays where each element corresponds to the schema defined above. In this case you have full control where the data ends, you can use formulas, comments etc.
How I envisioned this could work with your current library, to avoid creating 480 cells for each of the 250+ source dfs: read the data in and write it out again.
This did not work unfortunately, because reading an xlsx returns a df that cannot be directly written as an xlsx with this library.
My question:
Is there an easy way to do this currently? (I'm not that familiar with Spark yet, so there could be something I'm missing!) If not, do you think the library should be able to do this?
To me it makes sense to be able to import data, make some adjustments, and export it again.
What do you think?
Keep up the good work!
Kind regards,
Joris
Scala 2.13 is already out, but Spark does not support it at the moment. We should look into cross-compiling for Scala 2.13 and 2.12, and drop 2.11 once Spark 3.0 is out.
Start spark2-shell with the library:
spark2-shell --packages com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.2.4
Run the following code and get the result beneath the code:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0.cloudera1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql._
import org.apache.spark.sql._
scala> val q = "select * from branch_grouped_by_month"
q: String = select * from branch_grouped_by_month
scala> val df = spark.sql(q)
df: org.apache.spark.sql.DataFrame = [branch_id: string, dist_accts: bigint ... 1 more field]
scala> df.show()
+---------+----------+-----------------+
|branch_id|dist_accts| _c2|
+---------+----------+-----------------+
| 001659| 1| 4536.45|
| 880700| 1| 646.94|
| 008754| 2| 15.5|
| 008820| 2| 4329.82|
| 880565| 1| 0.0|
| 001746| 1| 454.13|
| 800415| 1| 0.0|
| 800202| 1| 0.0|
| 101029| 1| 0.0|
| 880694| 1| 681.63|
| 060022| 1| 909.34|
| 800284| 1| 1862.48|
| 001241| 1| 9171.86|
| 880573| 2| 1729.8|
| 001780| 1| -0.79|
| 060700| 1| 866.01|
| 800421| 2|8154.120000000001|
| 001137| 1| 0.0|
| 060043| 1| 0.0|
| 060001| 2| 3499.45|
+---------+----------+-----------------+
only showing top 20 rows
scala> df.write.format("org.zuinnote.spark.office.excel").option("spark.write.useHeader",true).option("write.locale.bcp47", "us").save("gaps.xlsx");
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
at org.zuinnote.spark.office.excel.DefaultSource.<init>(DefaultSource.scala:115)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:470)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
... 50 elided
scala>
What am I missing here? Can anyone please help me to sort it out? Thank you very much in advance.
Once Spark 3.0 is out we should investigate supporting it.
When writing to XLSX all int fields are output as 0.
This code works:
myDF.select("value1", "value2", "year","month","day")
.write().format("org.zuinnote.spark.office.excel")
.option("hadoopoffice.write.mimeType", "application/vnd.ms-excel")
.option("hadoopoffice.write.header.write", "true")
.save("hdfs://user/spark/warehouse/output");
And outputs:
+-----+---+--------+------+
| Name|Age|Children|Salary|
+-----+---+--------+------+
| John| 33| 1| 10000|
| Mary| 43| 9| 1000|
|Peter| 71| 3| 5000|
|James| 22| 1| 2000|
|Homer| 55| 3| 2222|
+-----+---+--------+------+
This doesn't work:
myDF.select("value1", "value2", "year","month","day")
.write().format("org.zuinnote.spark.office.excel")
.option("hadoopoffice.write.mimeType", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
.option("hadoopoffice.write.header.write", "true")
.save("hdfs://user/spark/warehouse/output");
Nor does this:
myDF.select("value1", "value2", "year","month","day")
.write().format("org.zuinnote.spark.office.excel")
.save("hdfs://user/spark/warehouse/output");
And outputs:
+-----+---+--------+------+
| Name|Age|Children|Salary|
+-----+---+--------+------+
| John| 0 | 0| 0|
| Mary| 0 | 0| 0|
|Peter| 0 | 0| 0|
|James| 0 | 0| 0|
|Homer| 0 | 0| 0|
+-----+---+--------+------+
Attached is a test case in Java.
Can you please provide an example on how to specify a custom schema when reading an Excel file? Thanks
I created a simple xlsx file with two values (1, A).
However, when opening the xlsx file, the value of the integer appears as =1 instead of 1.
scala> val sRdd = sc.parallelize(Seq((1,"A")))
sRdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at :24
scala> val df = spark.createDataFrame(sRdd)
df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
scala> df.repartition(1).write.format("org.zuinnote.spark.office.excel").option("write.locale.bcp47", "us").option("mapreduce.output.fileoutputformat.compress", "false").option("mapreduce.map.output.compress", "false").save("/tmp/somedir/1")
When I downloaded part-m-00000.xlsx and opened the file in Excel, cell A1 had the value =1 and cell B1 had the value A.
I am not sure why a column with integer type is written with the value =1.
I am unable to find an option in the wiki that explains this behavior.
I'm getting an intermittent issue when trying to read and operate on multiple Excel files at once; the exception is below. If I re-run the application, it runs fine without issue. I did some googling and it seems related to the fact that SimpleDateFormat is not thread-safe (see https://stackoverflow.com/questions/6840803/why-is-javas-simpledateformat-not-thread-safe).
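The usual fix for this class of problem (a sketch, assuming the converter currently shares one SimpleDateFormat instance across tasks) is to give each thread its own instance, e.g. via ThreadLocal:

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// One SimpleDateFormat per thread: the instances are mutable and not
// thread-safe, so sharing a single one across executor tasks can corrupt
// its internal state and produce garbage like "818.E818E2".
object SafeDateParser {
  private val fmt: ThreadLocal[SimpleDateFormat] =
    ThreadLocal.withInitial { () =>
      val f = new SimpleDateFormat("yyyy-MM-dd")
      f.setTimeZone(TimeZone.getTimeZone("UTC")) // fixed zone for reproducibility
      f
    }

  def parseMillis(s: String): Long = fmt.get().parse(s).getTime
}
```

Each task would then call SafeDateParser.parseMillis instead of sharing a single formatter; the date pattern and UTC zone here are placeholders for whatever the library actually configures.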
18/11/14 18:42:20 ERROR Executor: Exception in task 0.0 in stage 45.0 (TID 5582)
java.lang.NumberFormatException: For input string: "818.E818E2"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.text.DigitList.getDouble(DigitList.java:169)
at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/11/14 18:42:20 ERROR Executor: Exception in task 1.0 in stage 45.0 (TID 5583)
java.lang.NumberFormatException: For input string: "818.E818E22"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.text.DigitList.getDouble(DigitList.java:169)
at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1869)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/11/14 18:42:20 ERROR TaskSetManager: Task 0 in stage 45.0 failed 1 times; aborting job
18/11/14 18:42:21 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 45.0 failed 1 times, most recent failure: Lost task 0.0 in stage 45.0 (TID 5582, localhost, executor driver): java.lang.NumberFormatException: For input string: "818.E818E2"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.text.DigitList.getDouble(DigitList.java:169)
at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "818.E818E2"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.text.DigitList.getDouble(DigitList.java:169)
at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
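The `Caused by` above shows the converter handing the literal string "818.E818E2" to Java's number/date parsing, which rejects it. A minimal Python illustration of why such a literal fails, together with a hedged guard (the helper name is made up for illustration, not part of the library):

```python
def try_parse_double(s):
    """Return the parsed float, or None when the string is not a valid
    floating-point literal (e.g. '818.E818E2' has two exponent markers)."""
    try:
        return float(s)
    except ValueError:
        return None

try_parse_double("818.5E2")     # 81850.0
try_parse_double("818.E818E2")  # None: malformed, as in the trace above
```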
I am trying to write a DataFrame to an Excel sheet in Spark 2.1.0, including the header row.
val sampleDF: DataFrame = spark.createDataFrame(Seq(
(2, "abc"),
(2, "def"),
(4, "123")
)).toDF("id", "text")
sampleDF.repartition(1).write
.format("org.zuinnote.spark.office.excel")
.option("write.locale.bcp47", "de")
.option("write.spark.useHeader","true")
.mode(SaveMode.Overwrite)
.save("/home/vghantasala/header.xlsx")
The above code does not insert the header row (a row with the "id" and "text" column names) into the Excel sheet; it only writes the contents of the DataFrame.
Am I missing something here?
Thanks
I am trying to read data with "read.spark.simpleMode", but I get a java.lang.NumberFormatException.
the schema is
root
|-- c0: string (nullable = true)
|-- c1: string (nullable = true)
|-- c2: long (nullable = true)
|-- c3: string (nullable = true)
|-- c4: string (nullable = true)
|-- c5: string (nullable = true)
|-- c6: long (nullable = true)
|-- c7: string (nullable = true)
|-- c8: string (nullable = true)
|-- c9: long (nullable = true)
|-- c10: string (nullable = true)
|-- c11: string (nullable = true)
|-- c12: string (nullable = true)
|-- c13: decimal(6,3) (nullable = true)
|-- c14: string (nullable = true)
|-- c15: string (nullable = true)
|-- c16: string (nullable = true)
|-- c17: string (nullable = true)
I want to get a schema where every column is a String. Thanks.
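Until an all-String schema can be requested from the datasource directly, one workaround is to coerce every value to a string after loading. The coercion logic in isolation (plain Python; the helper is hypothetical, not part of the library):

```python
def coerce_row_to_strings(row):
    """Mimic an all-String schema: convert every cell value to str,
    preserving None for empty cells."""
    return [None if v is None else str(v) for v in row]

coerce_row_to_strings([72, "ab", None, 3.5])  # ['72', 'ab', None, '3.5']
```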
If the data source is used with Spark and the Kryo serializer activated, a NullPointerException is thrown.
There was a change in the OutputWriter internal API:
databricks/spark-avro#240
This can be fixed by adding a method write(row: InternalRow) that calls write(row: Row). This means an InternalRow needs to be converted to a Row.
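In Python terms, the pattern is an adapter method that converts the internal representation and then delegates to the existing write path (illustration only; the real fix is Scala code against Spark's OutputWriter API, and the conversion step here is a stand-in):

```python
class ExcelOutputWriterSketch:
    """Sketch of the adapter pattern described above."""
    def __init__(self):
        self.rows = []

    def write_row(self, row):                 # stands in for write(row: Row)
        self.rows.append(row)

    def write_internal(self, internal_row):   # stands in for write(row: InternalRow)
        # convert the internal representation to the public one, then delegate
        row = tuple(internal_row)
        self.write_row(row)

w = ExcelOutputWriterSketch()
w.write_internal([1, "abc"])
w.rows  # [(1, 'abc')]
```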
inferSchema does not produce the correct schema when reading an Excel file with just one line, or a header plus one line.
Note: This does NOT affect reading/processing the full data. It only affects the scenario of a single Excel file with one line, where the schema cannot be derived.
I tried creating a DataFrame from a 60 MB Excel file using spark-hadoopoffice-ds_2.11,
but it throws java.lang.OutOfMemoryError: GC overhead limit exceeded.
spark.executor.instances was set to 3.
What is the right solution for this?
I am trying to use the spark-hadoopoffice-ds_2.11 (v1.2.4) library in one of my projects and I am stuck with the following error.
java.lang.NoSuchMethodError: org.apache.poi.xssf.usermodel.XSSFWorkbook.getProperties()Lorg/apache/poi/ooxml/POIXMLProperties;
at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.prepareXSSFMetaData(MSExcelWriter.java:729)
at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.prepareMetaData(MSExcelWriter.java:636)
at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.close(MSExcelWriter.java:340)
at org.zuinnote.hadoop.office.format.common.OfficeWriter.close(OfficeWriter.java:131)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordWriter.close(AbstractSpreadSheetDocumentRecordWriter.java:216)
at org.zuinnote.spark.office.excel.ExcelOutputWriter.close(ExcelOutputWriter.scala:137)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)`
I got this error while running the following Scala/Spark code on a Cloudera Hadoop cluster:
sparkSession.sql(sqlText)
.repartition(1)
.write
.format("org.zuinnote.spark.office.excel")
.option("read.locale.bcp47", "us")
.option("write.spark.defaultsheetname", "Sheet1")
.save(extractFile)
Here is what the dependencies look like in my Maven POM file:
<scala.version>2.11.8</scala.version>
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>spark-hadoopoffice-ds_2.11</artifactId>
<version>1.2.4</version>
<exclusions>
<exclusion>
<groupId>com.github.zuinnote</groupId>
<artifactId>hadoopoffice-fileformat</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
</exclusion>
</exclusions>
</dependency>
I have already tried going through the suggestions provided here - https://stackoverflow.com/questions/18231134/how-to-avoid-java-lang-nosuchmethoderror-org-apache-poi-util-ioutils-copyljava and made sure that there are no other versions of POI in use.
Unfortunately it did not resolve the error. Could you have a look at this issue and let me know if I am doing anything wrong?
Thanks in advance.
Excel has limitations for the number of rows in a sheet (1,048,576 rows):
https://support.office.com/en-us/article/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
Currently, in simple mode (note: this does not affect DAO mode), one can only write into a single Excel sheet.
This enhancement will introduce a configuration option, hadoopoffice.write.simple.spillOverSheet (default: true), that allows that any rows beyond the limit of a sheet are spilled over to further sheet(s).
The additional sheets have the default sheet name and an increasing number added to them.
Note: Whether such sheets can be processed generally depends on the Excel version and the available memory.
Related:
ZuInnoTe/hadoopoffice#64
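A minimal sketch of the spill-over bookkeeping described above, assuming the naming scheme is the base sheet name plus an increasing number for each additional sheet (the exact suffix scheme is an assumption):

```python
EXCEL_MAX_ROWS = 1_048_576  # hard per-sheet row limit in .xlsx

def assign_sheets(n_rows, base_name="Sheet"):
    """Plan how n_rows spill over into additional sheets.
    Returns (sheet_name, row_count) pairs."""
    sheets = []
    idx = 0
    remaining = n_rows
    while remaining > 0:
        name = base_name if idx == 0 else f"{base_name}{idx + 1}"
        count = min(remaining, EXCEL_MAX_ROWS)
        sheets.append((name, count))
        remaining -= count
        idx += 1
    return sheets

assign_sheets(EXCEL_MAX_ROWS + 1)
# [('Sheet', 1048576), ('Sheet2', 1)]
```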
Scala 2.12 is source-compatible with 2.11, but not binary-compatible. The idea of this issue is to also compile the datasource for Scala 2.12.
Hi,
I'm getting an IndexOutOfBoundsException when trying to read an xlsx file. The testing.xlsx that comes as part of this library works fine, but not the one I'm using. I'm not sure if this is Spark's issue. I'm using PySpark.
System Config:
Code Snippet:
df_RPM = spark.read.format('org.zuinnote.spark.office.excel') \
    .option("read.locale.bcp47", "us") \
    .option("read.spark.simpleMode", "true") \
    .option("hadoopoffice.read.header.read", "true") \
    .option("hadoopoffice.read.lowFootprint", "true") \
    .load(r'D:\IDD\data\iDD\Sample.xlsx')
Excel File
Sample.xlsx
Exception:
Py4JJavaError Traceback (most recent call last)
in
4 .option("hadoopoffice.read.header.read", "true")
5 .option("hadoopoffice.read.lowFootprint", "true")
----> 6 .load(r'D:\IDD\data\iDD\Sample.xlsx')
D:\Spark\spark-2.3.3-bin-hadoop2.7\python\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
164 self.options(**options)
165 if isinstance(path, basestring):
--> 166 return self._df(self._jreader.load(path))
167 elif path is not None:
168 if type(path) != list:
D:\Spark\spark-2.3.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
D:\Spark\spark-2.3.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
D:\Spark\spark-2.3.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o232.load.
: java.lang.IndexOutOfBoundsException: 15
at scala.collection.mutable.ListBuffer.update(ListBuffer.scala:139)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$4.apply(DefaultSource.scala:211)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$4.apply(DefaultSource.scala:188)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:188)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:203)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:203)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:202)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
With spark-hadoopoffice-ds v1.3.5, the lowFootprint parser returns the content of numeric columns as nulls.
With spark-hadoopoffice-ds v1.3.5 and an xlsx file in SpreadsheetML 2006 format, numeric columns lead to:
XSSFPullParser.getNext: Cannot read celltype line: org.zuinnote.hadoop.office.format.common.parser.msexcel.internal.XSSFPullParser.getNext(XSSFPullParser.java:192)
which results in a null instead of the numeric value.
The cause is missing t attributes on the c elements in the contents of /xl/worksheets/sheet1.xml. Take for example a look at cells H1 through H5 and Q1 through Q5. The values are:
Cell Value
H1 empty
H2 a string
H3 1
H4 25
H5 25
Q1 empty
Q2 a string
Q3 1
Q4 8
Q5 2
The numeric values above correspond to the value of the worksheet-sheetData-row-c-v elements (see
spreadsheetml2006mainpartial.xml.txt)
This issue is also reproducible with a http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac spreadsheet (see MapIssue31.xlsx).
The numeric value of the cell is available as a String in the XMLEvent (CharactersEventImpl.mContent) returned by xer.nextEvent().
It would be nice if the value of the v elements were returned instead of the current nulls.
Updated with an example to reproduce.
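The t attribute on a SpreadsheetML c element is optional; per ECMA-376, a missing t means the cell type defaults to numeric ("n"). A minimal illustration of the defaulting a parser would need, using only stdlib XML:

```python
import xml.etree.ElementTree as ET

def cell_type(c_element):
    # ECMA-376: when the t attribute is absent, the cell type is "n" (number)
    return c_element.get("t", "n")

cell = ET.fromstring('<c r="H4"><v>25</v></c>')  # no t attribute, as in the report
cell_type(cell)          # 'n' -> the v content should be read as numeric
cell.findtext("v")       # '25'
```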
It's not happening with every Excel file, but for some files I am getting a NegativeArraySizeException while reading a specific worksheet.
java.lang.NegativeArraySizeException
at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:433)
at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
For instance, if I set useHeader to true and pass one or more files with multiple sheets, where all sheets and files have exactly the same number of columns with the same names and data types, the final dataset contains an additional row at the start of each new sheet's data.
E.g. with 3 files of 2 sheets each (6 sheets total), the final dataset will have 5 additional rows, since each subsequent sheet's header row is read in as a "data row".
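A minimal sketch of the de-duplication one would expect here, assuming rows are compared against the already-known header (the function is hypothetical, not library code):

```python
def drop_repeated_headers(rows, header):
    """Drop rows that exactly match the header; such rows appear when each
    sheet/file contributes its own header row after the first one."""
    return [row for row in rows if row != header]

header = ["id", "text"]
rows = [["1", "a"], ["id", "text"], ["2", "b"]]  # second sheet's header leaked in
drop_repeated_headers(rows, header)  # [['1', 'a'], ['2', 'b']]
```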
When an Excel cell contains a date value, for example 17-05-2018 19:10, the column is converted to double type, so the resulting value will be something like 17.0000 and Spark considers it a double column.
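Excel stores dates as serial numbers (days since 1899-12-30 in the default 1900 date system, with the time of day as the fractional part), which is why a date cell can surface as a double. A sketch of converting such a serial back (note some workbooks use the 1904 date system, where the epoch differs):

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial):
    """Convert an Excel serial (1900 date system) to a datetime.
    The 1899-12-30 epoch compensates for Excel's historical
    treatment of 1900 as a leap year."""
    return datetime(1899, 12, 30) + timedelta(days=serial)

excel_serial_to_datetime(43237.5)  # 2018-05-17 12:00:00
```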