
Comments (32)

jornfranke commented on June 3, 2024

from spark-hadoopoffice-ds.

jaypanchal commented on June 3, 2024

@jornfranke I need to email you the file and code.
Can you please confirm that [email protected] is your email address?

jornfranke commented on June 3, 2024

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

@jornfranke No, the file itself contains only one sheet, named rptOpportunityDetailsWithStageH, with a total of 462 records and 36 columns (A to AJ).

  1. Rows may or may not have data in all columns, i.e. a couple of cells may be blank in a particular row.
  2. About ten columns contain timestamp data, two columns contain integer and float values, and all the others are string columns (which may contain long descriptions with special characters such as whitespace and punctuation marks).
  3. The header row is fixed at index 1.
  4. I created and tried both Microsoft Office XLS and XLSX formats; both cause the issue.
    The files were generated manually.

jaypanchal commented on June 3, 2024

```java
dataFrameReader = dataFrameReader.load().sparkSession().read()
        .format("org.zuinnote.spark.office.excel")
        .option("read.spark.simpleMode", "true")
        .option("read.spark.useHeader", "true")
        .option("read.locale.bcp47", "US")
        .option("read.lowFootprint", "true")
        // ":" separated
        .option("read.sheets", "" + sheetsToBeRead.toString());
```
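The `read.sheets` option above takes a ":"-separated list of sheet names (per the inline comment). As a minimal, hypothetical sketch of how a `sheetsToBeRead` value could be built (the sheet names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class SheetsOptionDemo {
    public static void main(String[] args) {
        // Hypothetical sheet names; "read.sheets" expects them joined by ":".
        List<String> sheets = Arrays.asList("Sheet1", "rptOpportunityDetailsWithStageH");
        String sheetsToBeRead = String.join(":", sheets);
        System.out.println(sheetsToBeRead); // Sheet1:rptOpportunityDetailsWithStageH
    }
}
```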


jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Opportunity Details_test_xlsx.xlsx
This is the sample file you can try with.

jornfranke commented on June 3, 2024

As a temporary fix you can activate low footprint mode, which should not have this issue:
hadoopoffice.read.lowFootprint = true

For the current issue, I expect a real fix this week.

jornfranke commented on June 3, 2024

I just saw that you have low footprint mode activated. This is strange, because the exception indicates that you are not using low footprint mode.

jornfranke commented on June 3, 2024

Can you please try version 1.1.1 - all the issues should be fixed there, thank you.

jaypanchal commented on June 3, 2024

Sure, let me check it today!

jaypanchal commented on June 3, 2024

@jornfranke
The build is failing. Can you please assist me in solving this issue?

```
$ sbt assembly

[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.spark#spark-core_2.12;2.0.1: not found
[warn] :: org.apache.spark#spark-sql_2.12;2.0.1: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.apache.spark:spark-core_2.12:2.0.1 (/home/jay/Desktop/Excel_lib/New/spark-hadoopoffice-ds-master/build.sbt#L36-37)
[warn] +- com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.1.1
[warn] org.apache.spark:spark-sql_2.12:2.0.1 (/home/jay/Desktop/Excel_lib/New/spark-hadoopoffice-ds-master/build.sbt#L38-39)
[warn] +- com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.1.1
[error] sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.spark#spark-core_2.12;2.0.1: not found
[error] unresolved dependency: org.apache.spark#spark-sql_2.12;2.0.1: not found
```

jornfranke commented on June 3, 2024

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Alright, so I need to switch my Scala version to 2.10 or 2.11, right?

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

@jornfranke I downloaded hadoopoffice-fileformat 1.1.1 and spark-hadoopoffice-ds 2.11
and experimented with the code below:

```java
dataFrameReader = dataFrameReader.load().sparkSession().read()
        .format("org.zuinnote.spark.office.excel")
        .option("read.spark.simpleMode", "true")
        .option("read.spark.useHeader", String.valueOf(csvInfo.isFirstRowAsColumn()))
        //.option("read.spark.useHeader.skipHeaderInEachSheet", "true")
        .option("read.locale.bcp47", "US")
        .option("read.lowFootprint", "true")
        .option("read.sheets", "" + getBaseExcelWorksheetVo(csvInfo).getWorksheetName());
dataset = dataFrameReader.load(AppContextUtil.getAppPath() + "excelFiles" + "/" + somePath);
```

with the attached file
116.xlsx

and got the exception below:

```
java.lang.ArrayIndexOutOfBoundsException: 1
    at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.nextSpecificSheets(MSExcelParser.java:496)
    at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:422)
    at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
    at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
    at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:187)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
```

In fact, this exception now occurs for every file, regardless of the previous issue.
Can you please verify the issue at your end?

jornfranke commented on June 3, 2024

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Sure, let me check that too!
rptOpportunityDetailsWithStageH is the sheet name. Though I pass it dynamically, for the test case I have also checked by passing it statically.

jaypanchal commented on June 3, 2024

I've tried version 1.1.1 with two different files, without specifying sheet names.

  1. A file with 3 different worksheets works fine:
    Sample - Superstore.xlsx
  2. A file with a single worksheet throws the exception:
    Opportunity Details_test_xlsx.xlsx

jaypanchal commented on June 3, 2024

So, one root cause I found is that the file which gives an error contains empty columns with only a header value. I tried reading that file after adding a space to a single cell of each empty column, and it didn't throw any error.

Also, when I passed the "read.sheets" parameter with the sheet name rptOpportunityDetailsWithStageH, it threw the error below:

```
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.nextSpecificSheets(MSExcelParser.java:496)
    at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:422)
    at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
    at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
    at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:187)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:166)
```

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

You are welcome!

Yes, you are absolutely correct! We don't have control over how and what kind of data users will add. Since this is such a broad area, we can't even enumerate all the scenarios and cases.

I'll work on this and try to figure out which of the more general error-prone scenarios can be covered easily.

jaypanchal commented on June 3, 2024

Another thing I found, while reading the file 125.xlsx:

I don't know whether this comes from Spark, or whether we need to correct something on our file side or the reader side. It would be good if we can figure this out as well.

```
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
    at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:46)
    at org.apache.spark.sql.types.DecimalType$.apply(DecimalType.scala:43)
    at org.apache.spark.sql.types.DataTypes.createDecimalType(DataTypes.java:123)
    at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3$$anonfun$apply$2.apply(DefaultSource.scala:293)
    at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3$$anonfun$apply$2.apply(DefaultSource.scala:160)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3.apply(DefaultSource.scala:160)
    at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3.apply(DefaultSource.scala:151)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:45)
    at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:151)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
```
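For context on the error above (this is standard `java.math.BigDecimal` behavior, not something stated in the thread): a cell value such as 0.05 has fewer significant digits (precision) than digits after the decimal point (scale). Spark's `DecimalType(precision, scale)` requires precision >= scale, so schema inference that derives both numbers directly from such a value can produce an invalid type like `DecimalType(1, 2)`:

```java
import java.math.BigDecimal;

public class DecimalScaleDemo {
    public static void main(String[] args) {
        // "0.05" has one significant digit (5) but two digits after the point.
        BigDecimal v = new BigDecimal("0.05");
        System.out.println(v.precision()); // 1
        System.out.println(v.scale());     // 2
        // DecimalType(1, 2) is rejected by Spark: scale > precision.
    }
}
```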


jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Hey @jornfranke, are there any hurdles to resolving the current issue?

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Yes, the original issue is resolved, but now we are not able to pass a worksheet name; it throws an exception if we pass one.

jornfranke commented on June 3, 2024

jornfranke commented on June 3, 2024

Please check 1.2.0 and file a new issue if it still persists.
