
Comments (32)

jornfranke commented on June 3, 2024

from spark-hadoopoffice-ds.

jaypanchal commented on June 3, 2024

@jornfranke I need to email you the file and code.
Can you please confirm that [email protected] is your email address?

jornfranke commented on June 3, 2024

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

@jornfranke No, the file itself contains only one sheet, named rptOpportunityDetailsWithStageH, with a total of 462 records and 36 columns (A to AJ).

  1. Rows may or may not have data in all columns, i.e. a couple of cells may be blank in a particular row.
  2. About ten columns contain timestamp data, two columns contain integer and float values, and all the others are string columns (which may contain long descriptions with special characters such as whitespace and punctuation marks).
  3. The header row is fixed at index 1.
  4. I created and tried both Microsoft Office XLS and XLSX formats; both cause the issue.
    The files were generated manually.

jaypanchal commented on June 3, 2024

```java
dataFrameReader = dataFrameReader.load().sparkSession().read()
        .format("org.zuinnote.spark.office.excel")
        .option("read.spark.simpleMode", "true")
        .option("read.spark.useHeader", "true")
        .option("read.locale.bcp47", "US")
        .option("read.lowFootprint", "true")
        // ":" separated
        .option("read.sheets", "" + sheetsToBeRead.toString());
```
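The `read.sheets` option above takes a ":"-separated list of sheet names (per the inline comment). As a minimal, hypothetical sketch of how a `sheetsToBeRead` value could be built (the sheet names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class SheetsOptionDemo {
    public static void main(String[] args) {
        // Hypothetical sheet names; "read.sheets" expects them joined by ":".
        List<String> sheets = Arrays.asList("Sheet1", "rptOpportunityDetailsWithStageH");
        String sheetsToBeRead = String.join(":", sheets);
        System.out.println(sheetsToBeRead); // Sheet1:rptOpportunityDetailsWithStageH
    }
}
```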


jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Opportunity Details_test_xlsx.xlsx
This is the sample file you can try with.

jornfranke commented on June 3, 2024

As a temporary fix you can activate low footprint mode, which should not have this issue:
hadoopoffice.read.lowFootprint = true

For the current issue, I expect a real fix this week.

jornfranke commented on June 3, 2024

I just saw that you have low footprint mode activated. This is strange, because the exception indicates that you are not using low footprint mode.

jornfranke commented on June 3, 2024

Can you please try version 1.1.1 - all the issues should be fixed there, thank you.

jaypanchal commented on June 3, 2024

Sure, let me check it today!

jaypanchal commented on June 3, 2024

@jornfranke
The build is failing. Can you please assist me in solving this issue?

```
$ sbt assembly

[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.spark#spark-core_2.12;2.0.1: not found
[warn] :: org.apache.spark#spark-sql_2.12;2.0.1: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.apache.spark:spark-core_2.12:2.0.1 (/home/jay/Desktop/Excel_lib/New/spark-hadoopoffice-ds-master/build.sbt#L36-37)
[warn] +- com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.1.1
[warn] org.apache.spark:spark-sql_2.12:2.0.1 (/home/jay/Desktop/Excel_lib/New/spark-hadoopoffice-ds-master/build.sbt#L38-39)
[warn] +- com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.1.1
[error] sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.spark#spark-core_2.12;2.0.1: not found
[error] unresolved dependency: org.apache.spark#spark-sql_2.12;2.0.1: not found
```

jornfranke commented on June 3, 2024

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Alright, so I need to switch my Scala version to 2.10 or 2.11, right?

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

@jornfranke I downloaded hadoopoffice-fileformat 1.1.1 and spark-hadoopoffice-ds 2.11
and experimented with the code below:

```java
dataFrameReader = dataFrameReader.load().sparkSession().read()
        .format("org.zuinnote.spark.office.excel")
        .option("read.spark.simpleMode", "true")
        .option("read.spark.useHeader", String.valueOf(csvInfo.isFirstRowAsColumn()))
        //.option("read.spark.useHeader.skipHeaderInEachSheet", "true")
        .option("read.locale.bcp47", "US")
        .option("read.lowFootprint", "true")
        .option("read.sheets", "" + getBaseExcelWorksheetVo(csvInfo).getWorksheetName());
dataset = dataFrameReader.load(AppContextUtil.getAppPath() + "excelFiles" + "/" + somePath);
```

with the attached file
116.xlsx

and got the exception below:

```
java.lang.ArrayIndexOutOfBoundsException: 1
    at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.nextSpecificSheets(MSExcelParser.java:496)
    at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:422)
    at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
    at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
    at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:187)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
```

In fact, this exception now occurs for every file, regardless of the previous issue.
Can you please verify the issue at your end?

jornfranke commented on June 3, 2024

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Sure, let me check that too!
rptOpportunityDetailsWithStageH is the sheet name. Though I pass it dynamically, for the test case I have also checked by passing it statically.

jaypanchal commented on June 3, 2024

I've tried version 1.1.1 with two different files, without specifying sheet names.

  1. A file with 3 different worksheets works fine:
    Sample - Superstore.xlsx
  2. A file with a single worksheet throws the exception:
    Opportunity Details_test_xlsx.xlsx

jaypanchal commented on June 3, 2024

So, one root cause I found is that the file which gives an error contains empty columns with only a header value. I tried reading that file after adding a space to a single cell of each empty column, and it didn't throw any error.

Also, when I passed the "read.sheets" parameter with the sheet name rptOpportunityDetailsWithStageH, it threw the error below:

```
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.nextSpecificSheets(MSExcelParser.java:496)
    at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:422)
    at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
    at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
    at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:187)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:166)
```

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

You are welcome!

Yes, you are absolutely correct! We don't have control over how and what kind of data users will add. Since this is such a broad area, we can't even enumerate all the scenarios and cases.

I'll work on this and try to figure out which of the more general error-prone scenarios can be covered easily.

jaypanchal commented on June 3, 2024

Another thing I found, while reading the file 125.xlsx:

I don't know whether this comes from Spark, or whether we need to correct something on our file side or the reader side. It would be good if we can figure this out as well.

```
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
    at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:46)
    at org.apache.spark.sql.types.DecimalType$.apply(DecimalType.scala:43)
    at org.apache.spark.sql.types.DataTypes.createDecimalType(DataTypes.java:123)
    at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3$$anonfun$apply$2.apply(DefaultSource.scala:293)
    at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3$$anonfun$apply$2.apply(DefaultSource.scala:160)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3.apply(DefaultSource.scala:160)
    at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3.apply(DefaultSource.scala:151)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:45)
    at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:151)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
```
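For context on the error above (this is standard `java.math.BigDecimal` behavior, not something stated in the thread): a cell value such as 0.05 has fewer significant digits (precision) than digits after the decimal point (scale). Spark's `DecimalType(precision, scale)` requires precision >= scale, so schema inference that derives both numbers directly from such a value can produce an invalid type like `DecimalType(1, 2)`:

```java
import java.math.BigDecimal;

public class DecimalScaleDemo {
    public static void main(String[] args) {
        // "0.05" has one significant digit (5) but two digits after the point.
        BigDecimal v = new BigDecimal("0.05");
        System.out.println(v.precision()); // 1
        System.out.println(v.scale());     // 2
        // DecimalType(1, 2) is rejected by Spark: scale > precision.
    }
}
```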


jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Hey @jornfranke, are there any hurdles to resolving the current issue?

jornfranke commented on June 3, 2024

jaypanchal commented on June 3, 2024

Yes, the original issue is resolved, but now we are not able to pass a worksheet name; it throws an exception if we pass one.

jornfranke commented on June 3, 2024

jornfranke commented on June 3, 2024

Please check 1.2.0 and file a new issue if it still persists.
