
spark-hadoopoffice-ds's Issues

workbookDocument .getWorkbook().getWorkbookPr() can return null

In org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser (v1.3.3) the following lines give me a NullPointerException:

WorkbookDocument wd = WorkbookDocument.Factory.parse(workbookDataXML);
this.isDate1904 = wd.getWorkbook().getWorkbookPr().getDate1904();

because workbookDocument.getWorkbook().getWorkbookPr() can return null.
For a fix, see the example at https://github.com/alibaba/easyexcel/blob/master/src/main/java/com/alibaba/excel/analysis/v07/XlsxSaxAnalyser.java#L40
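
A minimal sketch of the null-safe pattern, assuming the usual OOXML schema API (shown in Scala for illustration; the actual fix would live in the Java parser). workbookDataXML stands for the workbook XML stream used in the snippet above.

import org.openxmlformats.schemas.spreadsheetml.x2006.main.WorkbookDocument

// workbookDataXML: the workbook XML input stream, as in the snippet above
val wd = WorkbookDocument.Factory.parse(workbookDataXML)
val workbookPr = wd.getWorkbook.getWorkbookPr
// workbookPr is optional in the OOXML schema, so guard it before reading date1904
val isDate1904 = workbookPr != null && workbookPr.getDate1904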

Unfortunately, I am not allowed to distribute any of the spreadsheets that fail.

Stacktrace:
12:57:44,174 ERROR [rker for task 0] org.apache.spark.executor.Executor.logError: Exception in task 0.0 in stage 0.0 (TID 0) line: org.apache.spark.internal.Logging$class.logError(Logging.scala:91)
java.lang.NullPointerException
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:349)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.<init>(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3.apply(DefaultSource.scala:317)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3.apply(DefaultSource.scala:312)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Maturity / Production Readiness

Hi Jörn,

I am really enthusiastic about your library and we want to use it ☺︎. I can also see that you update your library regularly and recently.

May I ask your opinion about the production readiness or maturity of your library? Would you recommend using it in production?

Danke, Mark

Support for Spark 2.1.0

Using version 1.0.2

Currently, when I try to use the hadoopoffice-ds package to write an .xlsx file on my Spark 2.1.0 cluster, for example by running your own example "hadoopoffice\examples\scala-spark2-excel-out-ds", I get an error when df.write is called, stating that a method of the abstract OutputWriterFactory class cannot be found on the OutputWriterFactory implementation it is trying to use.

Stacktrace:


Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	... 3 more

My own example code that's failing with Spark 2.1.0

import java.io.{PrintWriter, StringWriter}
import org.apache.spark.sql.{SQLContext, SaveMode}

import spark.implicits._
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val sRdd = sc.parallelize(Seq("1", "2", "3", "4", "5")).repartition(1)
val df = sRdd.toDF()
try {
  df.write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output/")
} catch {
  case e: Exception =>
    val sw = new StringWriter
    e.printStackTrace(new PrintWriter(sw))
    println(sw.toString)
}

It works fine when I use Spark 2.0.2 however.

I did some research:

It's correct that Spark 2.1.0 adds a method to the abstract OutputWriterFactory:
org.apache.spark.sql.execution.datasources.OutputWriterFactory: def getFileExtension(context: TaskAttemptContext): String
I saw a similar issue with the databricks-avro dependency: version 3.1.0 had the same problem, and 3.2.0 fixed it.
Link describing the issue and fix: databricks/spark-redshift#315

This is not the dependency you are using, however (you use Apache's own version), so I think it is one of your dependencies that runs into the same problem?

All that needs to be done is to add an override for the method that returns a file extension; a sketch follows below.
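
A hedged sketch of what such an override could look like. The class name below is illustrative, not the actual factory in this data source, and newInstance is left out because it would stay unchanged:

import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.sql.execution.datasources.OutputWriterFactory

// Illustrative only: supply the file extension that Spark 2.1.0 asks for.
// The real factory used by this data source would add the same method.
abstract class ExcelOutputWriterFactorySketch extends OutputWriterFactory {
  override def getFileExtension(context: TaskAttemptContext): String = ".xlsx"
}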

I will see what else I can find in the meantime, but I'm having issues setting up debugging from IntelliJ against the Spark cluster (HDInsight), so I have a hard time figuring out exactly where it goes wrong. Maybe you can check it out too?

Appreciate your work by the way!

error on databricks

I have the Python code below on Databricks and I am getting the error "java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V".

Please help

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.zuinnote.spark.office.excel").option("read.locale.bcp47", "us").load('/home/user/office/input')

Not all HadoopOffice configuration is applied correctly

All configuration options that contain a capital letter are not applied correctly. This is due to an undocumented Spark behaviour that converts all option keys to lower case. This means that, for example, options such as hadoopoffice.read.lowFootprint have never been applied when using the Spark-HadoopOffice data source (the other platforms are not affected).

Most likely we will implement a workaround in the HadoopOffice library without breaking compatibility.
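
An illustration of the behaviour described above (the path is a placeholder); the mixed-case key compiles fine but never reaches the data source as written:

val df = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("hadoopoffice.read.lowFootprint", "true") // Spark passes this on as "hadoopoffice.read.lowfootprint"
  .option("read.locale.bcp47", "us")
  .load("/path/to/input.xlsx")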

Skipped imported decimal values

When importing xlsx files, some decimal values are skipped. Superficially it looks like it depends on the number of decimal places.

  • Test values which cause first row to be ignored:
    16884.924
    1725.5523
  • Test values which are imported correctly:
    16884.9245
    1725.5523
  • Test values which are also imported correctly:
    16884.924
    1725.552

CVE-2021-44228: Mitigate Log4shell

CVE-2021-44228 describes a highly critical security issue in the library log4j2. While it is less likely to be exploitable in most installations where hadoopoffice is used, we include log4j2 > 2.15 to mitigate this issue. Independent of this, as defense in depth, we additionally recommend upgrading to the latest JDK8 (at least > 191, better the latest) or JDK11/JDK17, setting the Java system property "log4j2.formatMsgNoLookups" (e.g. via -D) to "true", and setting "com.sun.jndi.rmi.object.trustURLCodebase" and "com.sun.jndi.cosnaming.object.trustURLCodebase" to "false".
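
For example, the system properties mentioned above could be passed to the JVM like this (how they are supplied depends on your deployment, e.g. spark-submit or application server settings):

-Dlog4j2.formatMsgNoLookups=true
-Dcom.sun.jndi.rmi.object.trustURLCodebase=false
-Dcom.sun.jndi.cosnaming.object.trustURLCodebase=false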

The full list of mitigations can be found in the link above.

Add support for partitioning by folders in Hive/Spark Style

Partitioning is useful for splitting the output into folders and files while writing (and reading).
In the Excel case it can be useful for separating the output by day, customer or other keys, or for creating smaller spreadsheets that load faster.

The current behavior of this datasource when the DataFrameWriter.partitionBy() method is used is to generate a single file and overwrite it for each partition.

For example, when a dataframe is partitioned by year, month, and day, Spark writes two different structures when using 'csv' (or parquet, avro, etc.) versus the Excel output from this datasource.

Using 'csv':

csv_output_dir/_SUCCESS
csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv

Using 'org.zuinnote.spark.office.excel':

xls_output_dir/part-m-00000.xls
xls_output_dir/_SUCCESS
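
A sketch of the partitioned write behind the listings above; df is assumed to be a DataFrame with year, month and day columns, and the directory name is illustrative. With the current behaviour, the Excel data source produces the single, repeatedly overwritten file shown above instead of one folder per partition:

df.write
  .format("org.zuinnote.spark.office.excel")
  .option("write.locale.bcp47", "us")
  .partitionBy("year", "month", "day")
  .save("xlsx_output_dir")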

Reading Spreadsheets: Automatically determine schema with simple data types

Currently the datasource supports reading spreadsheets, such as MS Excel, into a dataframe consisting of arrays of SpreadSheetCellDAO. Although this is a very flexible mechanism, supporting conversion into any of the Spark SQL supported datatypes, handling of erroneous data, etc., some users want an automated, simple conversion into a DataFrame consisting of Spark SQL primitive datatypes, such as String, Decimal, Byte, Short, Integer or Long.

The idea is that this can be supported with an option read.spark.simpleMode.
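
A hedged usage sketch of the proposed option (option name taken from the text above; the path is a placeholder). With it, the resulting DataFrame would carry primitive-typed columns instead of arrays of SpreadSheetCellDAO:

val df = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "us")
  .option("read.spark.simpleMode", "true")
  .load("/path/to/input.xlsx")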

Add unit/integration tests.

Job abortion on databricks

Hi,
I am working on the Databricks platform with runtime version 5.2 (Spark 2.4.0, Scala 2.11)
and have the following version of the library installed from Maven Central:
com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.3.0

The worker and driver nodes are Standard DS3 (14 GB memory).

So, I am trying to read an .xlsx file with the following line of code:

sqlContext.read.format("org.zuinnote.spark.office.excel").option('read.sheet.skiplines.num', 1).option("read.locale.bcp47", "us").load(path)

I am able to read a smaller file of around 100 records with no issue, but each time I try to read a file of around 700,000 records I get the following stack trace:

Py4JJavaError: An error occurred while calling o622.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 278.0 failed 4 times, most recent failure: Lost task 0.3 in stage 278.0 (TID 10532, 10.139.64.9, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2100)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2088)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2087)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2087)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1076)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1076)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1076)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2319)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2267)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2255)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2252)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:259)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:269)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:69)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:75)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:497)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:48)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectResult(Dataset.scala:2827)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3439)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2556)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2556)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:3423)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:99)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:228)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:85)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:158)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3422)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2556)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2770)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:265)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:302)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Cleaning up unclosed ZipFile for archive /local_disk0/tmp/tmp-6233698570479588979.xlsx

Kindly let me know if this is an issue with the version I have installed, and whether there are any memory constraints for the library to function properly.

Any help is appreciated

Thanks

Suggestion: Writing, Reading, combining, saving excel files

Hello,

Currently your library supports reading excel files and writing them, which is excellent!

What I would really like to do is:

Create 1 page with calculations, referring to other sheets (250+) that contain just raw data (480 cells per sheet)

What I have:

250+ dfs of raw data (integers, strings), 1 df for each sheet to be output

According to the documentation data can be written in two ways:

  1. You can have a dataframe with columns of simple datatypes (no map, no list, no struct) that should be written in rows of an Excel sheet. You can define the sheetname by using the option "write.spark.defaultsheetname" (default is "Sheet1"). In this way, you can only write values, but no formulas, comments etc.
  2. You can have a dataframe with arrays where each element corresponds to the schema defined above. In this case you have full control where the data ends, you can use formulas, comments etc.

How I envisioned this could work with your current library, to avoid creating 480 cells each for 250+ dfs of sources:

  • Use method 1 to write each of the 250+ dfs as a separate xlsx, with a unique name for each sheet using write.spark.defaultsheetname
  • Write the front page with formulas using method 2. to a separate xlsx
  • Read these files as dfs, and use union to create 1 df (with multiple sheets)
  • Write them as a single xlsx

This did not work unfortunately, because reading an xlsx will return a df that cannot be directly written as an xlsx with this library.

My question:

Is there an easy way to do this currently? (I'm not that familiar with Spark yet, so there could be something I'm missing!) If not, do you think it should be able to do this?

To me it makes sense to be able to import data, make some adjustments, and export it again.

What do you think?

Keep up the good work!

Kind regards,

Joris

Doesn't generate excel file as expected but no obvious error

Start spark2-shell with the library:
spark2-shell --packages com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.2.4

Run the following code and get the result beneath the code:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> val q = "select * from branch_grouped_by_month"
q: String = select * from branch_grouped_by_month

scala> val df = spark.sql(q)
df: org.apache.spark.sql.DataFrame = [branch_id: string, dist_accts: bigint ... 1 more field]

scala> df.show()
+---------+----------+-----------------+
|branch_id|dist_accts|              _c2|
+---------+----------+-----------------+
|   001659|         1|          4536.45|
|   880700|         1|           646.94|
|   008754|         2|             15.5|
|   008820|         2|          4329.82|
|   880565|         1|              0.0|
|   001746|         1|           454.13|
|   800415|         1|              0.0|
|   800202|         1|              0.0|
|   101029|         1|              0.0|
|   880694|         1|           681.63|
|   060022|         1|           909.34|
|   800284|         1|          1862.48|
|   001241|         1|          9171.86|
|   880573|         2|           1729.8|
|   001780|         1|            -0.79|
|   060700|         1|           866.01|
|   800421|         2|8154.120000000001|
|   001137|         1|              0.0|
|   060043|         1|              0.0|
|   060001|         2|          3499.45|
+---------+----------+-----------------+
only showing top 20 rows


scala> df.write.format("org.zuinnote.spark.office.excel").option("spark.write.useHeader",true).option("write.locale.bcp47", "us").save("gaps.xlsx");
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
  at org.zuinnote.spark.office.excel.DefaultSource.<init>(DefaultSource.scala:115)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at java.lang.Class.newInstance(Class.java:442)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:470)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 50 elided

scala>
What am I missing here? Can anyone please help me to sort it out? Thank you very much in advance.

Writing to XLSX only outputs String fields but XLS works fine

When writing to XLSX all int fields are output as 0.

This code works:

myDF.select("value1", "value2", "year","month","day")
	.write().format("org.zuinnote.spark.office.excel")
	.option("hadoopoffice.write.mimeType", "application/vnd.ms-excel")
	.option("hadoopoffice.write.header.write", "true")
	.save("hdfs://user/spark/warehouse/output");

And outputs:

+-----+---+--------+------+
| Name|Age|Children|Salary|
+-----+---+--------+------+
| John| 33|       1| 10000|
| Mary| 43|       9|  1000|
|Peter| 71|       3|  5000|
|James| 22|       1|  2000|
|Homer| 55|       3|  2222|
+-----+---+--------+------+

This doesn't work:

myDF.select("value1", "value2", "year","month","day")
	.write().format("org.zuinnote.spark.office.excel")
	.option("hadoopoffice.write.mimeType", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
	.option("hadoopoffice.write.header.write", "true")
	.save("hdfs://user/spark/warehouse/output");

Neither does this work:

myDF.select("value1", "value2", "year","month","day")
	.write().format("org.zuinnote.spark.office.excel")
	.save("hdfs://user/spark/warehouse/output");

And the output is:

+-----+---+--------+------+
| Name|Age|Children|Salary|
+-----+---+--------+------+
| John| 0 |       0|     0|
| Mary| 0 |       0|     0|
|Peter| 0 |       0|     0|
|James| 0 |       0|     0|
|Homer| 0 |       0|     0|
+-----+---+--------+------+

Attached a test case in Java.

Custom Schema Example

Can you please provide an example of how to specify a custom schema when reading an Excel file? Thanks
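
A minimal sketch, assuming the data source accepts a user-supplied schema via .schema() (not confirmed here); the column names and the path are illustrative:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// define the columns you expect in the spreadsheet
val customSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("comment", StringType, nullable = true)
))

val df = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "us")
  .schema(customSchema)
  .load("/path/to/input.xlsx")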

Cell with Integer value has an = sign

I created a simple xlsx file with two values (1, A).
However, when opening the xlsx file, the integer value appears as =1 instead of 1.

scala> val sRdd = sc.parallelize(Seq((1,"A")))
sRdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at :24

scala> val df = spark.createDataFrame(sRdd)
df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> df.repartition(1).write.format("org.zuinnote.spark.office.excel").option("write.locale.bcp47", "us").option("mapreduce.output.fileoutputformat.compress", "false").option("mapreduce.map.output.compress", "false").save("/tmp/somedir/1")

When I downloaded part-m-00000.xlsx and opened the file in Excel, cell A1 has the value =1 and cell B1 has the value A.

I am not sure why a column with integer type is written with value =1

I am unable to find an option in the wiki that explains this behavior.

Thread Safety When Reading Multiple Excel Files

I'm getting an intermittent issue when trying to read and operate on multiple excel files at once. The exception is below. If I re-run the application another time, it runs fine without issue. I did some googling and it seems related to the fact that SimpleDateFormat is not thread safe (see: https://stackoverflow.com/questions/6840803/why-is-javas-simpledateformat-not-thread-safe)
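
One common fix pattern for this class of problem is to give each thread its own SimpleDateFormat instance, e.g. via ThreadLocal. The sketch below is illustrative (the date pattern is an assumption) and is not the library's actual code:

import java.text.SimpleDateFormat
import java.util.Locale

val threadLocalFormat = new ThreadLocal[SimpleDateFormat] {
  // each thread gets its own instance, so concurrent parse() calls no longer interfere
  override def initialValue(): SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
}

def parseDate(value: String): java.util.Date = threadLocalFormat.get().parse(value)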

18/11/14 18:42:20 ERROR Executor: Exception in task 0.0 in stage 45.0 (TID 5582)
java.lang.NumberFormatException: For input string: "818.E818E2"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.text.DigitList.getDouble(DigitList.java:169)
        at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
        at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
        at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
        at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
18/11/14 18:42:20 ERROR Executor: Exception in task 1.0 in stage 45.0 (TID 5583)
java.lang.NumberFormatException: For input string: "818.E818E22"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.text.DigitList.getDouble(DigitList.java:169)
        at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
        at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1869)
        at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
        at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
18/11/14 18:42:20 ERROR TaskSetManager: Task 0 in stage 45.0 failed 1 times; aborting job
18/11/14 18:42:21 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 45.0 failed 1 times, most recent failure: Lost task 0.0 in stage 45.0 (TID 5582, localhost, executor driver): java.lang.NumberFormatException: For input string: "818.E818E2"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.text.DigitList.getDouble(DigitList.java:169)
        at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
        at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
        at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
        at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "818.E818E2"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.text.DigitList.getDouble(DigitList.java:169)
        at java.text.DecimalFormat.parse(DecimalFormat.java:2089)
        at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
        at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
        at org.zuinnote.hadoop.office.format.common.converter.ExcelConverterSimpleSpreadSheetCellDAO.getDataAccordingToSchema(ExcelConverterSimpleSpreadSheetCellDAO.java:434)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:353)
        at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$3$$anonfun$apply$6.apply(DefaultSource.scala:316)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:108)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more

write.spark.useHeader is not working

I am trying to write a DF to an Excel sheet in Spark 2.1.0 with the header row.

val sampleDF: DataFrame = spark.createDataFrame(Seq(
  (2, "abc"),
  (2, "def"),
  (4, "123")
)).toDF("id", "text")

sampleDF.repartition(1).write
  .format("org.zuinnote.spark.office.excel")
  .option("write.locale.bcp47", "de")
  .option("write.spark.useHeader", "true")
  .mode(SaveMode.Overwrite)
  .save("/home/vghantasala/header.xlsx")

The above code does not insert the header row (a row with "id" and "text" columns) into the Excel sheet; it just inserts the contents of the DF.
Am I missing something here?

Thanks

How do i get a schema that's all String with "read.spark.simpleMode"

I try to read data with "read.spark.simpleMode", but I get a "java.lang.NumberFormatException".
The inferred schema is:
root
|-- c0: string (nullable = true)
|-- c1: string (nullable = true)
|-- c2: long (nullable = true)
|-- c3: string (nullable = true)
|-- c4: string (nullable = true)
|-- c5: string (nullable = true)
|-- c6: long (nullable = true)
|-- c7: string (nullable = true)
|-- c8: string (nullable = true)
|-- c9: long (nullable = true)
|-- c10: string (nullable = true)
|-- c11: string (nullable = true)
|-- c12: string (nullable = true)
|-- c13: decimal(6,3) (nullable = true)
|-- c14: string (nullable = true)
|-- c15: string (nullable = true)
|-- c16: string (nullable = true)
|-- c17: string (nullable = true)

I want to get a schema that is all String. Thanks.

Spark 2.2 support

There was a change in the OutputWriter internal API:
databricks/spark-avro#240

This can be fixed by adding a method write(row: InternalRow) that calls write(row: Row). This means an InternalRow needs to be converted to a Row.
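
A hedged sketch of that bridge. The helper below is self-contained for illustration only; writeRow and dataSchema stand in for the data source's own OutputWriter members, and CatalystTypeConverters is one possible way to do the conversion, not necessarily what was implemented:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
import org.apache.spark.sql.types.StructType

// Convert an InternalRow back to an external Row and delegate to the existing
// Row-based write path.
class InternalRowBridge(dataSchema: StructType, writeRow: Row => Unit) {
  private lazy val toExternalRow = CatalystTypeConverters.createToScalaConverter(dataSchema)

  def write(row: InternalRow): Unit = writeRow(toExternalRow(row).asInstanceOf[Row])
}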

inferSchema for Excelfiles with one row is not working correctly

inferSchema does not return the correct schema when reading an Excel file with just one line, or a header plus one line.

Note: This does NOT affect reading/processing the full data. It only concerns scenarios where there is a single Excel file with one line, for which the schema cannot be derived.

Memory issue

I tried creating a dataframe out of a 60 MB Excel file using spark-hadoopoffice-ds_2.11.
But it throws java.lang.OutOfMemoryError: GC overhead limit exceeded.
spark.executor.instances was set to 3.
What is the right solution for this?
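
A hedged suggestion based on options that appear elsewhere in these issues: the low-footprint (streaming) read mode may reduce heap pressure for large files, though whether it is enough for a 60 MB workbook depends on the data (the path is a placeholder):

val df = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "us")
  .option("hadoopoffice.read.lowFootprint", "true")
  .load("/path/to/large-file.xlsx")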

java.lang.NoSuchMethodError while using spark-hadoopoffice-ds_2.11 (v1.2.4)

I am trying to use the spark-hadoopoffice-ds_2.11 (v1.2.4) library in one of my projects and I am stuck with the following error.

java.lang.NoSuchMethodError: org.apache.poi.xssf.usermodel.XSSFWorkbook.getProperties()Lorg/apache/poi/ooxml/POIXMLProperties;
        at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.prepareXSSFMetaData(MSExcelWriter.java:729)
        at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.prepareMetaData(MSExcelWriter.java:636)
        at org.zuinnote.hadoop.office.format.common.writer.msexcel.MSExcelWriter.close(MSExcelWriter.java:340)
        at org.zuinnote.hadoop.office.format.common.OfficeWriter.close(OfficeWriter.java:131)
        at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordWriter.close(AbstractSpreadSheetDocumentRecordWriter.java:216)
        at org.zuinnote.spark.office.excel.ExcelOutputWriter.close(ExcelOutputWriter.scala:137)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)

I got this error while running the following Scala/Spark code on a Cloudera Hadoop cluster.

sparkSession.sql(sqlText)
        .repartition(1)
        .write
        .format("org.zuinnote.spark.office.excel")
        .option("read.locale.bcp47", "us")
        .option("write.spark.defaultsheetname", "Sheet1")
        .save(extractFile)

Here is how the dependencies look in my Maven pom file:

<scala.version>2.11.8</scala.version>
 
<dependency>
 <groupId>com.github.zuinnote</groupId>
 <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
 <version>1.2.4</version>
 <exclusions>
     <exclusion>
         <groupId>com.github.zuinnote</groupId>
         <artifactId>hadoopoffice-fileformat</artifactId>
     </exclusion>
     <exclusion>
         <groupId>org.apache.poi</groupId>
         <artifactId>poi</artifactId>
     </exclusion>
     <exclusion>
         <groupId>org.apache.poi</groupId>
         <artifactId>poi-ooxml</artifactId>
     </exclusion>
 </exclusions>
</dependency>

I have already tried going through the suggestions provided here - https://stackoverflow.com/questions/18231134/how-to-avoid-java-lang-nosuchmethoderror-org-apache-poi-util-ioutils-copyljava and made sure that there are no other versions of POI in use.

Unfortunately it did not resolve the error. Could you have a look at this issue and let me know if I am doing anything wrong?

Thanks in advance.

Spill over to next sheet if number of rows exceeding Excel limitations

Excel has limitations for the number of rows in a sheet (1,048,576 rows):
https://support.office.com/en-us/article/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

Currently, in simple mode (note: this does not affect DAO mode), one can only write into a single Excel sheet.

This enhancement will introduce a configuration option, hadoopoffice.write.simple.spillOverSheet (default: true), that allows any rows beyond the limit of a sheet to spill over to further sheet(s).
The additional sheets use the default sheet name with an increasing number appended.

Note: Generally, whether Excel can process such sheets depends on the Excel version and the available memory.
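
A sketch of a write using the proposed option (option name taken from the description above; since the default would be true, it is shown here only for clarity, and df and the path are placeholders):

df.write
  .format("org.zuinnote.spark.office.excel")
  .option("write.locale.bcp47", "us")
  .option("hadoopoffice.write.simple.spillOverSheet", "true")
  .save("/path/to/output")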

Related:
ZuInnoTe/hadoopoffice#64

Include support for scala 2.12

Scala 2.12 is source compatible, but not binary compatible. The idea of this issue is to also compile the data source for Scala 2.12.

IndexOutOfBoundsException: When reading xlsx file.

Hi,

I'm getting an IndexOutOfBoundsException when trying to read an xlsx file. The testing.xlsx which comes as part of this library works fine, but not the one I'm using. Not sure if this is Spark's issue. I'm using PySpark for the coding.

System Config:

  1. Spark : 2.3.3
  2. spark-hadoopoffice-ds: V1.3.0
  3. Scala: 2.11.8

Code Snippet:

df_RPM = spark.read.format('org.zuinnote.spark.office.excel') \
    .option("read.locale.bcp47", "us") \
    .option("read.spark.simpleMode", "true") \
    .option("hadoopoffice.read.header.read", "true") \
    .option("hadoopoffice.read.lowFootprint", "true") \
    .load(r'D:\IDD\data\iDD\Sample.xlsx')

Excel File
Sample.xlsx

Exception:

Py4JJavaError Traceback (most recent call last)
in
4 .option("hadoopoffice.read.header.read", "true")
5 .option("hadoopoffice.read.lowFootprint", "true")
----> 6 .load(r'D:\IDD\data\iDD\Sample.xlsx')

D:\Spark\spark-2.3.3-bin-hadoop2.7\python\pyspark\sql\readwriter.py in load(self, path, format, schema, **options)
164 self.options(**options)
165 if isinstance(path, basestring):
--> 166 return self._df(self._jreader.load(path))
167 elif path is not None:
168 if type(path) != list:

D:\Spark\spark-2.3.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

D:\Spark\spark-2.3.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

D:\Spark\spark-2.3.3-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o232.load.
: java.lang.IndexOutOfBoundsException: 15
at scala.collection.mutable.ListBuffer.update(ListBuffer.scala:139)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$4.apply(DefaultSource.scala:211)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$4.apply(DefaultSource.scala:188)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:188)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:203)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:203)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:202)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

lowFootprint: numeric column values result in nulls

With spark-hadoopoffice-ds v1.3.5 the lowFootprint parser returns the content of numeric columns as nulls.

With spark-hadoopoffice-ds v1.3.5 and a spreadsheetml 2006 format xlsx file, numeric columns lead to
"XSSFPullParser.getNext: Cannot read celltype" (at org.zuinnote.hadoop.office.format.common.parser.msexcel.internal.XSSFPullParser.getNext, XSSFPullParser.java:192), which results in a null instead of the numeric value.

The cause of this is missing t-attributes on the c-elements in the contents of /xl/worksheets/sheet1.xml. Take, for example, a look at cells H1 through H5 and Q1 through Q5. The values are:
Cell  Value
H1    empty
H2    a string
H3    1
H4    25
H5    25
Q1    empty
Q2    a string
Q3    1
Q4    8
Q5    2

The numeric values above correspond to the value of the worksheet-sheetData-row-c-v elements (see spreadsheetml2006mainpartial.xml.txt).

This issue is also reproducible with a http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac spreadsheet (see MapIssue31.xlsx).

The numeric value of the cell is available as a String in the XMLEvent (CharactersEventImpl.mContent) returned by xer.nextEvent().

It would be nice if the value of the v-elements were returned instead of the current nulls.

Updated with an example to reproduce.

Getting NegativeArraySizeException while reading xls/xlsx format - not in all files

It's not happening with every Excel file, but for some files I get a NegativeArraySizeException while reading a specific worksheet.

java.lang.NegativeArraySizeException
	at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:433)
	at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
	at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
	at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:183)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
	at scala.Option.orElse(Option.scala:289)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)

Reading : New header row is added every time while new sheet data is appended

[screenshot from 2018-03-05 15-40-43]
For instance, if I set useHeader to true and pass a file or multiple files with multiple sheets, where all sheets and files have exactly the same number of columns with the same names and datatypes, then the final dataset contains an additional row at the start of each new sheet's data.

E.g. 3 files with 2 sheets each, 6 sheets in total, so the final dataset will have 5 additional rows, because each subsequent sheet's first (header) row is read as a "data row".

Date column parsing issue

When an Excel cell has a date value, for example 17-05-2018 19:10, that column is converted into a double type. So the resulting value will be 17.0000 and Spark considers it a column of double datatype.
