
spark-hadoopoffice-ds's Introduction

spark-hadoopoffice-ds


A Spark datasource for the HadoopOffice library.

This Spark datasource requires at least Spark 3.x and Scala 2.12. Scala 2.13 is supported as well. Note: Spark 2.x probably still works with this datasource, but it is no longer tested.

However, the HadoopOffice library can also be used directly from Spark 1.x and/or Scala 2.10 (see the how-to section). Currently this datasource supports the following formats of the HadoopOffice library:

  • Excel
    • Datasource format: org.zuinnote.spark.office.Excel
    • Loading and Saving of old Excel (.xls) and new Excel (.xlsx)

This datasource is available on Spark-packages.org and on Maven Central.

Find the status of the continuous integration service here: https://travis-ci.org/ZuInnoTe/spark-hadoopoffice-ds/

Release Notes

Find the latest release information here

Options

All options from the HadoopOffice library, such as metadata, encryption/decryption or low footprint mode, are supported.

Additionally the following options exist:

  • "read.spark.simplemode" (Before 1.5: "read.spark.simpleMode") infers the schema of the DataFrame from the data in the Excel or use a custom schema. This schema consists of primitive DataTypes of Spark SQL (String, Byte, Short, Integer, Long, Decimal, Date, Boolean). If the schema is inferred it is done only based on one file in the directory. Additionally, the conversion of Decimals is based on the locale that you define (see hadoopoffice options from above). True if schema should be inferred, False if not. Default: False
  • "read.spark.simpleMode.maxinferrows" (as of 1.1.0) (Before 1.5: "read.spark.simpleMode.maxInferRows"). This defines the maximum rows to read for inferring the schema. This is useful if you know already that the schema can be determined from a given number of rows. Alternatively, if you want to provide a custom schema set this to 0. Default: all rows ("-1")
  • There are also other options related to headers, locales etc. (see options from HadoopOffice library)
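For example, schema inference could be enabled and limited to a fixed number of rows as in the following sketch (the path is a placeholder; option values can also be passed as strings):

// Minimal sketch: infer a simple-typed schema from at most the first 100 rows
val df = sparkSession.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "en")
  .option("read.spark.simplemode", true)               // infer primitive Spark SQL types
  .option("read.spark.simpleMode.maxinferrows", "100") // limit the rows used for inference
  .load("/home/user/office/input")                     // placeholder path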

There are the following options related to Spark in case you need to write rows containing primitive types. In this case a default sheetname needs to be set:

  • "write.spark.defaultsheetname", any valid sheetname, e.g. Sheet1
  • There are also other options related to headers, locales etc. (see options from HadoopOffice library)

Additionally, the following options of the standard Hadoop API are supported:

  • "mapreduce.output.fileoutputformat.compress", true if output should be compressed, false if not. Note that many office formats have already a build-in compression so an additional compression may not make sense.
  • "mapreduce.output.fileoutputformat.compress.codec", codec class, e.g. org.apache.hadoop.io.compress.GzipCodec

Dependency

A lot of options changed in version 1.2.0 to harmonize behavior with other Big Data platforms. Read the documentation carefully and test your application.

Since version 1.7.0 you need to use Spark 3.x and at least Scala 2.12

Scala 2.12

groupId: com.github.zuinnote

artifactId: spark-hadoopoffice-ds_2.12

version: 1.7.0

Scala 2.13

groupId: com.github.zuinnote

artifactId: spark-hadoopoffice-ds_2.13

version: 1.7.0

The Scala 2.13 version requires at least Spark 3.x
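For sbt builds, the dependency can be declared with the %% operator so that the matching Scala suffix (_2.12 or _2.13) is picked automatically; a minimal sketch:

// build.sbt: resolves to spark-hadoopoffice-ds_2.12 or _2.13 depending on scalaVersion
libraryDependencies += "com.github.zuinnote" %% "spark-hadoopoffice-ds" % "1.7.0"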

Older Scala versions

Note: If you require Scala 2.10 then you cannot use this data source, but you can use the Hadoop FileFormat if you want to use the latest HadoopOffice version, cf. an example for reading and writing.

Alternatively you can use the older version of this data source (not recommended): 1.1.1 (see documentation). However, in this case you will miss features and bug fixes.

If you need Scala 2.11 then you can use an older version of this data source (not recommended): 1.6.4 (see documentation). However, in this case you will miss features and bug fixes.

Spark versions

Note: if you require Spark 2.x then you cannot use this data source (minimal version: 3.x). You can use an older version of this data source (not recommended): 1.6.4 (see documentation). However, in this case you will miss features and bug fixes.

Schema

There are two different schemas that you can configure:

  • Excel Cell Schema - here more information about the Excel cell is exposed (e.g. formattedValue, formula, address etc.)
  • Simple Schema - here the data is exposed using Spark datatypes (e.g. int, long, decimal, string, date etc.)

The Excel cell schema is very useful in case you want to have more information about the cell and the simple schema is useful in case you want to work only with the data (e.g. doing calculations, filtering by date etc.).

Excel Cell

An Excel file loaded into a DataFrame has the following schema. Basically, each row contains an array with all Excel cells in this row. For each cell the following information is available (see the snippet after the schema for how to access it):

  • formattedValue: This is what you see when you open Excel
  • comment: A comment for this cell
  • formula: A formula for this cell (Note: without the =, e.g. "A1+A2")
  • address: The address of the cell in A1 format (e.g. "B2")
  • sheetName: The name of the sheet of this cell
root
|-- rows: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- formattedValue: string (nullable = true)
|    |    |-- comment: string (nullable = true)
|    |    |-- formula: string (nullable = true)
|    |    |-- address: string (nullable = true)
|    |    |-- sheetName: string (nullable = true)
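
A minimal sketch of working with this schema (the input path is a placeholder): the rows array is flattened with explode so that each record corresponds to one cell, and individual cell fields are then selected.

import org.apache.spark.sql.functions.{col, explode}

// Sketch: load Excel cells and flatten the rows array into one record per cell
val df = sparkSession.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "en")
  .load("/home/user/office/input")   // placeholder path

val cells = df
  .select(explode(col("rows")).as("cell"))
  .select(col("cell.address"), col("cell.formattedValue"),
    col("cell.formula"), col("cell.sheetName"))
cells.show()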

Simple

If you use the option "read.spark.simpleMode" then the schema consists of primitive Spark SQL DataTypes; alternatively, a custom schema can be supplied (see the sketch below the example schema). For example, for this Excel file the following schema is automatically inferred (note also that the option "hadoopoffice.read.header.read" is applied):

root
|-- decimalsc1: decimal(2,1) (nullable = true)
|-- booleancolumn: boolean (nullable = true)
|-- datecolumn: date (nullable = true)
|-- stringcolumn: string (nullable = true)
|-- decimalp8sc3: decimal(8,3) (nullable = true)
|-- bytecolumn: byte (nullable = true)
|-- shortcolumn: short (nullable = true)
|-- intcolumn: integer (nullable = true)
|-- longcolumn: long (nullable = true)
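
If inference is not desired, a custom schema can be supplied instead; a minimal sketch, assuming the data source accepts a user-defined schema via DataFrameReader.schema together with "read.spark.simpleMode.maxinferrows" set to 0 (the column names below are examples only):

import org.apache.spark.sql.types._

// Sketch: skip inference and supply an explicit schema (example column names)
val customSchema = StructType(Seq(
  StructField("stringcolumn", StringType, nullable = true),
  StructField("intcolumn", IntegerType, nullable = true),
  StructField("datecolumn", DateType, nullable = true)
))

val df = sparkSession.read
  .format("org.zuinnote.spark.office.excel")
  .schema(customSchema)
  .option("read.locale.bcp47", "en")
  .option("read.spark.simpleMode", true)
  .option("read.spark.simpleMode.maxinferrows", "0") // 0 = use the given schema, do not infer
  .load("/home/user/office/input")                   // placeholder path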


Develop

Reading

As you can see in the schema, the datasource reads each Excel row in an array. Each element of the array is a structure describing an Excel cell. This structure describes the formatted value (based on the locale), the comment, the formula, the address of the cell in A1 format and the name of the sheet to which the cell belongs. In Scala you can easily read Excel files using the following snippet (assuming US locale for the Excel file):

val sqlContext = sparkSession.sqlContext
val df = sqlContext.read
   .format("org.zuinnote.spark.office.excel")
   .option("read.locale.bcp47", "en")
   .load(args(0))

Find a full example here.

Another option is to infer the schema of primitive Spark SQL DataTypes automatically:

val sqlContext = sparkSession.sqlContext
val df = sqlContext.read
   .format("org.zuinnote.spark.office.excel")
   .option("read.locale.bcp47", "en")
   .option("read.spark.simpleMode", true)
   .load(args(0))

This option can be combined with hadoopoffice.read.header.read to interpret the first row in the Excel as column names of the DataFrame.
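A minimal sketch of this combination (the path is a placeholder):

// Sketch: infer a simple schema and use the first Excel row as column names
val df = sparkSession.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "en")
  .option("read.spark.simpleMode", true)
  .option("hadoopoffice.read.header.read", true)
  .load("/home/user/office/input")   // placeholder path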

Writing

There are two options for writing data to Excel files:

  • You can have a dataframe with columns of simple datatypes (no map, no list, no struct) that should be written into rows of an Excel sheet. You can define the sheetname by using the option "write.spark.defaultsheetname" (default is "Sheet1"). In this way, you can only write values, but no formulas, comments etc. Additionally, you can set the option "hadoopoffice.write.header.write" to write the column names of the DataFrame as the first row of the Excel file (see the sketch after this list).
  • You can have a dataframe with arrays where each element corresponds to the schema defined above. In this case you have full control over where the data ends up, and you can use formulas, comments etc.
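The first option could look like the following sketch (the DataFrame content, sheet name and output path are placeholders; sparkSession is assumed to be a val holding the SparkSession):

import sparkSession.implicits._

// Sketch: write a DataFrame of simple datatypes as rows of a single sheet
val simpleDf = Seq((1, "alice", 4536.45), (2, "bob", 646.94))
  .toDF("id", "name", "amount")

simpleDf.write
  .format("org.zuinnote.spark.office.excel")
  .option("write.locale.bcp47", "en")
  .option("write.spark.defaultsheetname", "Data")
  .option("hadoopoffice.write.header.write", true) // write column names as the first row
  .save("/home/user/office/output-simple")         // placeholder path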

The second option is illustrated in this snippet (assuming US locale for the Excel file). It creates a simple Excel document with 4 cells, stored in sheet "Sheet1". The following cells exist: (A1 with value 1), (A2 with value 2 and a comment), (A3 with value 3), (B1 with formula A2+A3). The resulting Excel file is stored in the directory /home/user/office/output.

import sparkSession.implicits._  // needed for toDF()

val sRdd = sparkSession.sparkContext.parallelize(Seq(
  Seq("", "", "1", "A1", "Sheet1"),
  Seq("", "This is a comment", "2", "A2", "Sheet1"),
  Seq("", "", "3", "A3", "Sheet1"),
  Seq("", "", "A2+A3", "B1", "Sheet1"))).repartition(1)
val df = sRdd.toDF()
df.write
  .format("org.zuinnote.spark.office.excel")
  .option("write.locale.bcp47", "en")
  .save("/home/user/office/output")

Find a full example here.

You can write with partitions as follows (as of v 1.3.2). Let us assume you have an Excel with Name, Year, Month, Day columns and you want to create partitions by Year, Month, Day. Then you need to use the following code:

df.toDF.write.partitionBy("year", "month", "day")
  .format("org.zuinnote.spark.office.excel")
  .option("write.locale.bcp47", "en")
  .save("/home/user/office/output")

This will create the following structure on HDFS (or the filesystem that is supported by Spark):

output/_SUCCESS
output/year=2018/month=1/day=1/part-00000.xlsx
output/year=2019/month=12/day=31/part-00000.xlsx

Language bindings

Scala

This example loads Excel documents from the folder "/home/user/office/input" using the Excel representation (format) and shows the total number of rows, the schema and the first 20 rows. The locale for formatting cell values is set to "en". Find a full example here.

val sqlContext = sparkSession.sqlContext
val df = sqlContext.read
   .format("org.zuinnote.spark.office.excel")
   .option("read.locale.bcp47", "en")  // example: set the locale to English
   .load("/home/user/office/input")
val totalCount = df.count
// print to screen
println("Total number of rows in Excel: " + totalCount)
df.printSchema
// print formattedValues
df.show

Java

This example loads Excel documents from the folder "/home/user/office/input" using the Excel representation (format) and shows the total number of rows, the schema and the first 20 rows. The locale for formatting cell values is set to "en".

SQLContext sqlContext = sparkSession.sqlContext();
Dataset<Row> df = sqlContext.read()
   .format("org.zuinnote.spark.office.excel")
   .option("read.locale.bcp47", "en")  // example: set the locale to English
   .load("/home/user/office/input");
long totalCount = df.count();
// print to screen
System.out.println("Total number of rows in Excel: " + totalCount);
df.printSchema();
// print formattedValues
df.show();

R

This example loads Excel documents from the folder "/home/user/office/input" using the Excel representation (format). The locale for formatting cell values is set to "en".

library(SparkR)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.7.0" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "/home/user/office/input", source = "org.zuinnote.spark.office.excel", "read.locale.bcp47" = "en")

Python

This example loads Excel documents from the folder "/home/user/office/input" using the Excel representation (format). The locale for formatting cell values is set to "en".

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('org.zuinnote.spark.office.excel').option('read.locale.bcp47', 'en').load('/home/user/office/input')

SQL

The following statement creates a table that contains Excel data in the folder /home/user/office/input. The locale for formatting cell values is set to "en".

CREATE TABLE ExcelData
USING  org.zuinnote.spark.office.excel
OPTIONS (path "/home/user/office/input", read.locale.bcp47 "en")
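Once the table is created, it can be queried like any other table; a sketch using the Scala API and the Excel cell schema described above:

// Sketch: formattedValue of the first cell of every row of the ExcelData table
val firstCells = sparkSession.sql(
  "SELECT rows[0].formattedValue AS firstCell FROM ExcelData")
firstCells.show()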

spark-hadoopoffice-ds's People

Contributors

jornfranke

spark-hadoopoffice-ds's Issues

How do I get a schema that's all String with "read.spark.simpleMode"

I try to read data with "read.spark.simpleMode", but I get a "java.lang.NumberFormatException".
The schema is:
root
|-- c0: string (nullable = true)
|-- c1: string (nullable = true)
|-- c2: long (nullable = true)
|-- c3: string (nullable = true)
|-- c4: string (nullable = true)
|-- c5: string (nullable = true)
|-- c6: long (nullable = true)
|-- c7: string (nullable = true)
|-- c8: string (nullable = true)
|-- c9: long (nullable = true)
|-- c10: string (nullable = true)
|-- c11: string (nullable = true)
|-- c12: string (nullable = true)
|-- c13: decimal(6,3) (nullable = true)
|-- c14: string (nullable = true)
|-- c15: string (nullable = true)
|-- c16: string (nullable = true)
|-- c17: string (nullable = true)

I want to get a schema that's all String. Thanks.

write.spark.useHeader is not working

I am trying to write a DataFrame to an Excel sheet in Spark 2.1.0 with the header row.

val sampleDF: DataFrame = spark.createDataFrame(Seq(
      (2, "abc"),
      (2, "def"),
      (4, "123")

    )).toDF("id", "text")

    sampleDF.repartition(1).write
      .format("org.zuinnote.spark.office.excel")
      .option("write.locale.bcp47", "de")
      .option("write.spark.useHeader","true")
      .mode(SaveMode.Overwrite)
      .save("/home/vghantasala/header.xlsx")

The above code does not insert the header row (a row with "id" and "text" columns) in the Excel sheet. It just inserts the contents of the DataFrame into the Excel sheet.
Am I missing something here?

Thanks

error on databricks

I have the Python code below on Databricks and I am getting the error "java.lang.NoSuchMethodError: org.apache.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V"

Please help

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.zuinnote.spark.office.excel").options("read.locale.bcp47","us").load('/home/user/office/input')

Skipped imported decimal values

When importing xlsx some decimal values are skipped. Superficially it looks like it depends on number of decimal places.

  • Test values which cause first row to be ignored:
    16884.924
    1725.5523
  • Test values which are imported correctly:
    16884.9245
    1725.5523
  • Test values which are also imported correctly:
    16884.924
    1725.552

Add support for partitioning by folders in Hive/Spark Style

Partitioning is useful for splitting the output into folders and files while writing (and reading).
In the Excel case this can be useful for separating output by day, customer or other keys, or for creating smaller spreadsheets that will load faster.

The current behavior of this datasource when the DataFrameWriter.partitionBy() method is used is to generate a single file and overwrite it for each partition.

For example, when a dataframe is partitioned by year, month, and day, Spark writes in two different forms when using 'csv' (or parquet, avro, etc.) versus the Excel output from this datasource.

Using 'csv':

csv_output_dir/_SUCCESS
csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv

Using 'org.zuinnote.spark.office.excel':

xls_output_dir/part-m-00000.xls
xls_output_dir/_SUCCESS

workbookDocument.getWorkbook().getWorkbookPr() can return null

In org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.class v1.3.3 the following lines give me a NullPointerException:
WorkbookDocument wd = WorkbookDocument.Factory.parse(workbookDataXML); this.isDate1904 = wd.getWorkbook().getWorkbookPr().getDate1904();

because workbookDocument.getWorkbook().getWorkbookPr() can return null.
For fix see example https://github.com/alibaba/easyexcel/blob/master/src/main/java/com/alibaba/excel/analysis/v07/XlsxSaxAnalyser.java#L40

Unfortunately, I am not allowed to distribute any of the spreadsheets that fail.

Stacktrace:
12:57:44,174 ERROR [rker for task 0] org.apache.spark.executor.Executor.logError: Exception in task 0.0 in stage 0.0 (TID 0) line: org.apache.spark.internal.Logging$class.logError(Logging.scala:91)
java.lang.NullPointerException
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:349)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3.apply(DefaultSource.scala:317)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3.apply(DefaultSource.scala:312)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Support for Spark 2.1.0

Using version 1.0.2

Currently, when I try to use the hadoopoffice-ds package to write an .xlsx file, for example by using your own example "hadoopoffice\examples\scala-spark2-excel-out-ds" on my Spark 2.1.0 cluster, I get an error stating that a method of the abstract OutputWriterFactory class cannot be found on the OutputWriterFactory it is trying to use when the df.write method is called.

Stacktrace:


Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	... 3 more

My own example code that's failing with Spark 2.1.0

import java.io.{PrintWriter, StringWriter}
import org.apache.spark.sql.{SQLContext, SaveMode}
import spark.implicits._

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val sRdd = sc.parallelize(Seq("1","2","3","4","5")).repartition(1)
val df= sRdd.toDF()
try {
    df.write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output/")
} catch {
    case e: Exception => {
         val sw = new StringWriter
       e.printStackTrace(new PrintWriter(sw))
        println(sw.toString)
    } 
}

It works fine when I use Spark 2.0.2 however.

I did some research:

It's correct that Spark 2.1.0 adds a method to the abstract OutputWriterFactory:
org.apache.spark.sql.execution.datasources.OutputWriterFactory#def getFileExtension(context: TaskAttemptContext): String
I saw a similar issue with the databricks-avro dependency: version 3.1.0 had the same problem, and 3.2.0 fixed it.
Link describing the issue and fix: databricks/spark-redshift#315

This is not the dependency you were using however (apache's own version), so I think it's a dependency that's using it?

All that needs to be done is to add an override for the method returning a file extension.

I will see what else I can find in the meantime, but I'm having issues setting up debugging from IntelliJ with the Spark cluster (HDInsight), so I have a hard time figuring out where exactly it's going wrong. Maybe you can check it out too?

Appreciate your work by the way!

Maturity / Production Readiness

Hi Jörn,

I am really enthusiastic about your library and we want to use it ☺︎. I can also see that you update your library regularly and recently.

May I ask your opinion about the production readiness or maturity of your library? Would you recommend using it in production?

Danke, Mark

Job abortion on databricks

Hi,
I am working on the Databricks platform with runtime version 5.2 (Spark 2.4.0, Scala 2.11)
and have the following version of the library installed from maven central:
com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.3.0

The worker and driver nodes are Standard DS3 (14 GB memory).

So, I am trying to read an .xlsx file with the following line of code:

sqlContext.read.format("org.zuinnote.spark.office.excel").option('read.sheet.skiplines.num',1).option("read.locale.bcp47", "us").load(path)

I am able to read a smaller file of around 100 records with no issue, but each time I try to read a file of around 700,000 records I get the following stack trace:

Py4JJavaError: An error occurred while calling o622.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 278.0 failed 4 times, most recent failure: Lost task 0.3 in stage 278.0 (TID 10532, 10.139.64.9, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2100)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2088)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2087)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2087)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1076)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1076)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1076)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2319)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2267)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2255)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2252)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:259)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:269)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:69)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:75)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:497)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:48)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectResult(Dataset.scala:2827)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3439)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2556)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2556)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:3423)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:99)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:228)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:85)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:158)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3422)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2556)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2770)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:265)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:302)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Cleaning up unclosed ZipFile for archive /local_disk0/tmp/tmp-6233698570479588979.xlsx

Kindly let me know if this is an issue with the version I have installed and whether there are any memory constraints for the library to function properly.

Any help is appreciated

Thanks

Include support for scala 2.12

Scala 2.12 is source code compatible, but not binary compatible. The idea of this issue is to compile the data source also for Scala 2.12.

Spill over to next sheet if number of rows exceeding Excel limitations

Excel has limitations for the number of rows in a sheet (1,048,576 rows):
https://support.office.com/en-us/article/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

Currently, in simple mode (note: this does not affect DAO mode), one can only write into a single Excel sheet.

This enhancement will introduce a configuration option, hadoopoffice.write.simple.spillOverSheet (default: true), that allows any rows beyond the limit of a sheet to spill over to further sheet(s).
The additional sheets have the default sheet name and an increasing number added to them.

Note: Generally, whether such sheets can be processed depends on the Excel version and the available memory.

Related:
ZuInnoTe/hadoopoffice#64

Reading Spreadsheets: Automatically determine schema with simple data types

Currently the datasource supports reading spreadsheets, such as MSExcel, into a dataframe consisting of Array of SpreadSheetCellDAO. Although this is a very flexible mechanism, supporting conversion in any of the Spark SQL supported datatypes, handling of erroneous data etc., some users want to have an automated simple conversion into a DataFrame consisting of Spark SQL primitive datatypes, such as String, Decimal, Byte, Short, Integer or Long.

The idea is that with an option read.spark.simpleMode this can be supported.

Add unit/integration tests.

Not all HadoopOffice configuration is applied correctly

All configuration options that contain a capital letter are not applied correctly. This is due to undocumented Spark behaviour that converts all option keys to lower case. This means, for example, that options such as hadoopoffice.read.lowFootprint have never been applied when using the Spark-HadoopOffice data source (there are no issues for the other platforms).

Most likely we will implement a workaround in the HadoopOffice library without breaking compatibility.

Suggestion: Writing, Reading, combining, saving excel files

Hello,

Currently your library supports reading excel files and writing them, which is excellent!

What I would really like to do is:

Create 1 page with calculations, referring other sheets (250+), with just raw data (480 cells per sheet)

What I have:

250+ dfs of raw data (integers, strings), 1 df for each sheet to be output

According to the documentation data can be written in two ways:

  1. You can have a dataframe with columns of simple datatypes (no map, no list, no struct) that should be written in rows of an Excel sheet. You can define the sheetname by using the option "write.spark.defaultsheetname" (default is "Sheet1"). In this way, you can only write values, but no formulas, comments etc.
  2. You can have a dataframe with arrays where each element corresponds to the schema defined above. In this case you have full control where the data ends, you can use formulas, comments etc.

How I envisioned this could work with your current library, to avoid creating 480 cells each for 250+ dfs of sources:

  • Use method 1 to write each of the 250+ dfs as a separate xlsx, with a unique name for each sheet using write.spark.defaultsheetname
  • Write the front page with formulas using method 2. to a separate xlsx
  • Read these files as dfs, and use union to create 1 df (with multiple sheets)
  • Write them as a single xlsx

This did not work unfortunately, because reading an xlsx will return a df that cannot be directly written as an xlsx with this library.

My question:

Is there an easy way to do this currently? (I'm not that familiar with Spark yet, so there could be something I'm missing!) If not, do you think it should be able to do this?

To me it makes sense to be able to import data, make some adjustments, and export it again.

What do you think?

Keep up the good work!

Kind regards,

Joris

inferSchema for Excelfiles with one row is not working correctly

inferSchema does not produce the correct schema when reading an Excel file with just one line, or a header plus one line.

Note: This does NOT affect reading/processing the full data. It only affects the scenario where there is just one Excel file with a single line, in which case the schema cannot be derived.

Spark 2.2 support

There was a change in the OutputWriter internal API:
databricks/spark-avro#240

This can be fixed by adding a method write(row: InternalRow) that calls write(row: Row). This means an InternalRow needs to be converted to a Row.

IndexOutOfBoundsException: When reading xlsx file.

Hi,

I'm getting an IndexOutOfBoundsException when trying to read an xlsx file. The testing.xlsx which comes as part of this library works fine, but not the one which I'm using. Not sure if this is Spark's issue. I'm using pyspark for the coding.

System Config:

  1. Spark : 2.3.3
  2. spark-hadoopoffice-ds: V1.3.0
  3. Scala: 2.11.8

Code Snippet:

df_RPM = spark.read.format('org.zuinnote.spark.office.excel') \
    .option("read.locale.bcp47", "us") \
    .option("read.spark.simpleMode", "true") \
    .option("hadoopoffice.read.header.read", "true") \
    .option("hadoopoffice.read.lowFootprint", "true") \
    .load(r'D:\IDD\data\iDD\Sample.xlsx')

Excel File
Sample.xlsx

Exception:

Py4JJavaError Traceback (most recent call last)
in
4 .option("hadoopoffice.read.header.read", "true")
5 .option("hadoopoffice.read.lowFootprint", "true")
----> 6 .load(r'D:\IDD\data\iDD\Sample.xlsx')

D:\Spark\spark-2.3.3-bin-hadoop2.7\python\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
164 self.options(**options)
165 if isinstance(path, basestring):
--> 166 return self._df(self._jreader.load(path))
167 elif path is not None:
168 if type(path) != list:

D:\Spark\spark-2.3.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

D:\Spark\spark-2.3.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

D:\Spark\spark-2.3.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o232.load.
: java.lang.IndexOutOfBoundsException: 15
at scala.collection.mutable.ListBuffer.update(ListBuffer.scala:139)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$4.apply(DefaultSource.scala:211)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$4.apply(DefaultSource.scala:188)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:188)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:203)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:203)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:202)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

java.lang.NoSuchMethodError while using spark-hadoopoffice-ds_2.11 (v1.2.4)

I am trying to use the spark-hadoopoffice-ds_2.11 (v1.2.4) library in one of my projects and I am stuck with the following error.

java.lang.NoSuchMethodError: org.apache.poi.xssf.usermodel.XSSFWorkbook.getProperties()Lorg/apache/poi/ooxml/POIXMLProperties;
        at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.prepareXSSFMetaData(MSExcelWriter.java:729)
        at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.prepareMetaData(MSExcelWriter.java:636)
        at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.close(MSExcelWriter.java:340)
        at org.zuinnote.hadoop.office.format.common.OfficeWriter.close(OfficeWriter.java:131)
        at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordWriter.close(AbstractSpreadSheetDocumentRecordWriter.java:216)
        at org.zuinnote.spark.office.excel.ExcelOutputWriter.close(ExcelOutputWriter.scala:137)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)

I got this error while running the following Scala/Spark code in a Cloudera Hadoop cluster.

sparkSession.sql(sqlText)
        .repartition(1)
        .write
        .format("org.zuinnote.spark.office.excel")
        .option("read.locale.bcp47", "us")
        .option("write.spark.defaultsheetname", "Sheet1")
        .save(extractFile)

Here is how the dependencies look in my Maven pom file:

<scala.version>2.11.8</scala.version>
 
<dependency>
 <groupId>com.github.zuinnote</groupId>
 <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
 <version>1.2.4</version>
 <exclusions>
     <exclusion>
         <groupId>com.github.zuinnote</groupId>
         <artifactId>hadoopoffice-fileformat</artifactId>
     </exclusion>
     <exclusion>
         <groupId>org.apache.poi</groupId>
         <artifactId>poi</artifactId>
     </exclusion>
     <exclusion>
         <groupId>org.apache.poi</groupId>
         <artifactId>poi-ooxml</artifactId>
     </exclusion>
 </exclusions>
</dependency>

I have already tried going through the suggestions provided here - https://stackoverflow.com/questions/18231134/how-to-avoid-java-lang-nosuchmethoderror-org-apache-poi-util-ioutils-copyljava and made sure that there are no other versions of POI in use.

Unfortunately it did not resolve the error. Could you have a look at this issue and let me know if I am doing anything wrong?

Thanks in advance.

lowFootprint: numeric column values result in nulls

With spark-hadoopoffice-ds v1.3.5 the lowFootprint parser returns numeric column content as nulls.

With spark-hadoopoffice-ds v1.3.5 and a spreadsheetml 2006 format xlsx file, numeric columns lead to
"XSSFPullParser.getNext: Cannot read celltype" (org.zuinnote.hadoop.office.format.common.parser.msexcel.internal.XSSFPullParser.getNext(XSSFPullParser.java:192)), which results in a null instead of the numeric value.

The cause of this is the missing t-attributes on the c-elements in the contents of /xl/worksheets/sheet1.xml. Take for example a look at cells H1 through H5 and Q1 through Q5. The values are:
Cell Value
H1 empty
H2 a string
H3 1
H4 25
H5 25
Q1 empty
Q2 a string
Q3 1
Q4 8
Q5 2

The numeric values above correspond to the value of the worksheet-sheetData-row-c-v elements (see
spreadsheetml2006mainpartial.xml.txt)

This issue is reproducible with a http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac spreadsheet (see MapIssue31.xlsx).

The numeric value of the cell is located as String at the XMLEvent CharactersEventImpl.mContent from xer.nextEvent().

It would be nice if the value of the v-elements were returned instead of the current nulls.

Updated with an example to reproduce.

Thread Safety When Reading Multiple Excel Files

I'm getting an intermittent issue when trying to read and operate on multiple excel files at once. The exception is below. If I re-run the application another time, it runs fine without issue. I did some googling and it seems related to the fact that SimpleDateFormat is not thread safe (see: https://stackoverflow.com/questions/6840803/why-is-javas-simpledateformat-not-thread-safe)

18/11/14 18:42:20 ERROR Executor: Exception in task 0.0 in stage 45.0 (TID 5582)
java.lang.NumberFormatException: For input string: "818.E818E2"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.text.DigitList.getDouble(DigitList.java:169)
        at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
        at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
        at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
        at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
18/11/14 18:42:20 ERROR Executor: Exception in task 1.0 in stage 45.0 (TID 5583)
java.lang.NumberFormatException: For input string: "818.E818E22"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.text.DigitList.getDouble(DigitList.java:169)
        at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
        at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1869)
        at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
        at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
18/11/14 18:42:20 ERROR TaskSetManager: Task 0 in stage 45.0 failed 1 times; aborting job
18/11/14 18:42:21 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 45.0 failed 1 times, most recent failure: Lost task 0.0 in stage 45.0 (TID 5582, localhost, executor driver): java.lang.NumberFormatException: For input string: "818.E818E2"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.text.DigitList.getDouble(DigitList.java:169)
        at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
        at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
        at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
        at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "818.E818E2"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.text.DigitList.getDouble(DigitList.java:169)
        at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
        at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
        at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
        at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more

Doesn't generate excel file as expected but no obvious error

Start spark2-shell with the library:
spark2-shell --packages com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.2.4

Run the following code and get the result shown beneath it:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> val q = "select * from branch_grouped_by_month"
q: String = select * from branch_grouped_by_month

scala> val df = spark.sql(q)
df: org.apache.spark.sql.DataFrame = [branch_id: string, dist_accts: bigint ... 1 more field]

scala> df.show()
+---------+----------+-----------------+
|branch_id|dist_accts|              _c2|
+---------+----------+-----------------+
|   001659|         1|          4536.45|
|   880700|         1|           646.94|
|   008754|         2|             15.5|
|   008820|         2|          4329.82|
|   880565|         1|              0.0|
|   001746|         1|           454.13|
|   800415|         1|              0.0|
|   800202|         1|              0.0|
|   101029|         1|              0.0|
|   880694|         1|           681.63|
|   060022|         1|           909.34|
|   800284|         1|          1862.48|
|   001241|         1|          9171.86|
|   880573|         2|           1729.8|
|   001780|         1|            -0.79|
|   060700|         1|           866.01|
|   800421|         2|8154.120000000001|
|   001137|         1|              0.0|
|   060043|         1|              0.0|
|   060001|         2|          3499.45|
+---------+----------+-----------------+
only showing top 20 rows


scala> df.write.format("org.zuinnote.spark.office.excel").option("spark.write.useHeader",true).option("write.locale.bcp47", "us").save("gaps.xlsx");
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
  at org.zuinnote.spark.office.excel.DefaultSource.<init>(DefaultSource.scala:115)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at java.lang.Class.newInstance(Class.java:442)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:470)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 50 elided

scala>
```
What am I missing here? Can anyone please help me to sort it out? Thank you very much in advance.
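
A likely cause is a Scala binary-version mismatch: the session runs Spark 2.2.0 with Scala 2.11.8, while spark-hadoopoffice-ds_2.12 is built for Scala 2.12, and that combination typically fails with exactly this kind of NoSuchMethodError. A sketch of a matching invocation, assuming a Scala 2.11 build of the same version is published on Maven Central:

spark2-shell --packages com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.2.4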

Writing to XLSX only outputs String fields but XLS works fine

When writing to XLSX all int fields are output as 0.

This code works:

myDF.select("value1", "value2", "year","month","day")
	.write().format("org.zuinnote.spark.office.excel")
	.option("hadoopoffice.write.mimeType", "application/vnd.ms-excel")
	.option("hadoopoffice.write.header.write", "true")
	.save("hdfs://user/spark/warehouse/output");

And outputs:

+-----+---+--------+------+
| Name|Age|Children|Salary|
+-----+---+--------+------+
| John| 33|       1| 10000|
| Mary| 43|       9|  1000|
|Peter| 71|       3|  5000|
|James| 22|       1|  2000|
|Homer| 55|       3|  2222|
+-----+---+--------+------+

This doesn't work:

myDF.select("value1", "value2", "year","month","day")
	.write().format("org.zuinnote.spark.office.excel")
	.option("hadoopoffice.write.mimeType", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
	.option("hadoopoffice.write.header.write", "true")
	.save("hdfs://user/spark/warehouse/output");

Neither does this work:

myDF.select("value1", "value2", "year","month","day")
	.write().format("org.zuinnote.spark.office.excel")
	.save("hdfs://user/spark/warehouse/output");

And it outputs:

+-----+---+--------+------+
| Name|Age|Children|Salary|
+-----+---+--------+------+
| John| 0 |       0|     0|
| Mary| 0 |       0|     0|
|Peter| 0 |       0|     0|
|James| 0 |       0|     0|
|Homer| 0 |       0|     0|
+-----+---+--------+------+

Attached a test case in Java.

CVE-2021-44228: Mitigate Log4Shell

CVE-2021-44228 describes a highly critical security issue in the log4j2 library. While it is less likely to be exploitable in most installations where HadoopOffice is used, we include log4j2 > 2.15 to mitigate this issue. Independent of this, as defense in depth, we additionally recommend upgrading to the latest JDK8 (at least update 191, better the latest) or to JDK11/JDK17, and setting the Java system properties (e.g. via -D) "log4j2.formatMsgNoLookups" to "true" as well as "com.sun.jndi.rmi.object.trustURLCodebase" and "com.sun.jndi.cosnaming.object.trustURLCodebase" to "false".

The full list of mitigations can be found in the link above.
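
For illustration only, a sketch of passing these system properties to both driver and executors via the standard Spark settings spark.driver.extraJavaOptions and spark.executor.extraJavaOptions (the property names are the ones from the paragraph above):

spark-shell --conf "spark.driver.extraJavaOptions=-Dlog4j2.formatMsgNoLookups=true -Dcom.sun.jndi.rmi.object.trustURLCodebase=false -Dcom.sun.jndi.cosnaming.object.trustURLCodebase=false" --conf "spark.executor.extraJavaOptions=-Dlog4j2.formatMsgNoLookups=true -Dcom.sun.jndi.rmi.object.trustURLCodebase=false -Dcom.sun.jndi.cosnaming.object.trustURLCodebase=false"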

Cell with Integer value has an = sign

I created a simple .xlsx file with two values (1, A).
However, when opening the .xlsx file, the integer value appears as =1 instead of 1.

```
scala> val sRdd = sc.parallelize(Seq((1,"A")))
sRdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at :24

scala> val df = spark.createDataFrame(sRdd)
df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> df.repartition(1).write.format("org.zuinnote.spark.office.excel").option("write.locale.bcp47", "us").option("mapreduce.output.fileoutputformat.compress", "false").option("mapreduce.map.output.compress", "false").save("/tmp/somedir/1")
```

When I downloaded part-m-00000.xlsx and opened it in Excel, cell A1 has the value =1 and cell B1 has the value A.

I am not sure why a column of integer type is written with the value =1.

I am unable to find an option in the wiki that explains this behavior.

Custom Schema Example

Can you please provide an example of how to specify a custom schema when reading an Excel file? Thanks
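
A minimal sketch, based on the options documented above (read.spark.simplemode for primitive Spark SQL types, and read.spark.simpleMode.maxinferrows set to 0 to skip inference and use the supplied schema). The column names and types below are purely illustrative:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("excel-custom-schema").getOrCreate()

// Illustrative schema only; adjust names and types to match your sheet
val customSchema = StructType(Seq(
  StructField("branch_id", StringType, nullable = true),
  StructField("dist_accts", LongType, nullable = true),
  StructField("amount", DecimalType(10, 2), nullable = true)
))

val df = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.spark.simplemode", "true")            // return primitive Spark SQL types
  .option("read.spark.simpleMode.maxinferrows", "0")  // 0 = use the provided schema instead of inferring
  .schema(customSchema)
  .load("/path/to/file.xlsx")

df.printSchema()
```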

Reading: a new header row is added every time new sheet data is appended

[screenshot from 2018-03-05 15-40-43]
For instance, if I set useHeader to true and pass one or more files with multiple sheets, where all sheets and files have exactly the same number of columns with the same names and data types, the final dataset contains an additional row at the start of each new sheet's data.

E.g. 3 files with 2 sheets each, i.e. 6 sheets in total: the final dataset will have 5 additional rows, because each subsequent sheet's first (header) row is read as a data row.
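
A possible interim workaround, sketched here and not part of the library: since the stray rows are just the header repeated, they can be filtered out after reading, assuming `df` is the DataFrame read as described and the affected columns are strings.

```
import org.apache.spark.sql.functions.col

// Sketch: drop rows in which every column value equals its own column name,
// i.e. header rows that slipped in as data (assumes string columns).
val isHeaderRow = df.columns.map(c => col(c) === c).reduce(_ && _)
val cleaned = df.filter(!isHeaderRow)
```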

Getting NegativeArraySizeException while reading xls/xlsx format - not in all files

It's not happening with every Excel file, but for some files I am getting a NegativeArraySizeException while reading a specific worksheet.

java.lang.NegativeArraySizeException
	at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:433)
	at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
	at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
	at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:183)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
	at scala.Option.orElse(Option.scala:289)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)

Date column parsing issue

When an Excel cell contains a date value, for example 17-05-2018 19:10, the column is converted to a double type. The resulting value is 17.0000 and Spark treats the column as a double data type column.

Memory issue

I tried creating a DataFrame from a 60 MB Excel file using spark-hadoopoffice-ds_2.11,
but it throws java.lang.OutOfMemoryError: GC overhead limit exceeded.
spark.executor.instances was set to 3.
What is the right solution for this?
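
A sketch of one possible mitigation, assuming the HadoopOffice low footprint mode mentioned in the options above is exposed to this datasource as the read.lowFootprint option (the exact option name is an assumption; check the HadoopOffice options page):

```
// Sketch only: enable the HadoopOffice low footprint read mode (option name assumed)
val df = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.lowFootprint", "true")
  .load("/path/to/large-file.xlsx")
```

Increasing spark.executor.memory (and driver memory, if the schema is inferred on the driver) is the other usual lever, since a 60 MB .xlsx is compressed XML and can expand to far more than 60 MB once parsed in memory.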
