Comments (32)
from spark-hadoopoffice-ds.
@jornfranke I need to email you the file and code...
Can you please confirm that [email protected] is your email address?
@jornfranke No, the file contains only one sheet, named rptOpportunityDetailsWithStageH, with a total of 462 records in 36 columns (A to AJ).
- Rows may or may not have data in all columns, i.e. a couple of cells may be blank in a particular row.
- A couple (about 10) of the columns hold timestamp data, two columns hold integer and float values, and all the others are string columns (which may contain long descriptions with special characters such as white space and punctuation marks).
- The fixed header row is set at index 1.
- I created and tried both Microsoft Office XLS and XLSX formats; both cause the issue.
The files were generated manually.
dataFrameReader = sparkSession.read()
    .format("org.zuinnote.spark.office.excel")
    .option("read.spark.simpleMode", "true")
    .option("read.spark.useHeader", "true")
    .option("read.locale.bcp47", "US")
    .option("read.lowFootprint", "true")
    // ":"-separated list of sheet names
    .option("read.sheets", sheetsToBeRead.toString());
Opportunity Details_test_xlsx.xlsx
This is a sample file you can try it with.
As a temporary fix you can activate low-footprint mode, which should not have this issue:
hadoopoffice.read.lowFootprint = true
For the current issue, I expect a real fix within this week.
I just saw that you have low-footprint mode activated. This is strange, because the exception indicates that you are not using low-footprint mode.
Can you please try version 1.1.1 - all the issues should be fixed there, thank you.
Sure... let me check it today!
@jornfranke
The build is failing... Can you please assist me in solving this issue?
```
$ sbt assembly
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] ::          UNRESOLVED DEPENDENCIES         ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.spark#spark-core_2.12;2.0.1: not found
[warn] :: org.apache.spark#spark-sql_2.12;2.0.1: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn]   org.apache.spark:spark-core_2.12:2.0.1 (/home/jay/Desktop/Excel_lib/New/spark-hadoopoffice-ds-master/build.sbt#L36-37)
[warn]     +- com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.1.1
[warn]   org.apache.spark:spark-sql_2.12:2.0.1 (/home/jay/Desktop/Excel_lib/New/spark-hadoopoffice-ds-master/build.sbt#L38-39)
[warn]     +- com.github.zuinnote:spark-hadoopoffice-ds_2.12:1.1.1
[error] sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.spark#spark-core_2.12;2.0.1: not found
[error] unresolved dependency: org.apache.spark#spark-sql_2.12;2.0.1: not found
```
Alright, so I need to switch my Scala version to 2.10 or 2.11, right?
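For context, the resolution error arises because Spark 2.0.x was only ever published for Scala 2.10/2.11; `_2.12` artifacts of `spark-core` and `spark-sql` only appeared with later Spark releases. A minimal `build.sbt` sketch of a Scala 2.11 setup (the exact version numbers here are assumptions, not taken from this project's build):

```scala
// Sketch: stay on Scala 2.11, for which Spark 2.x artifacts exist on Maven Central.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // "provided": a Spark runtime supplies these at submit time
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided"
)
```

The `%%` operator appends the Scala binary version suffix (`_2.11`) automatically, so the resolved artifact matches what was actually published.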
@jornfranke I downloaded hadoopOffice-fileformat 1.1.1 and spark-hadoopoffice-ds-2.11
I experimented with the code below:
dataFrameReader = dataFrameReader.load().sparkSession().read()
    .format("org.zuinnote.spark.office.excel")
    .option("read.spark.simpleMode", "true")
    .option("read.spark.useHeader", String.valueOf(csvInfo.isFirstRowAsColumn()))
    //.option("read.spark.useHeader.skipHeaderInEachSheet", "true")
    .option("read.locale.bcp47", "US")
    .option("read.lowFootprint", "true")
    .option("read.sheets", "" + getBaseExcelWorksheetVo(csvInfo).getWorksheetName());
dataset = dataFrameReader.load(AppContextUtil.getAppPath() + "excelFiles" + "/" + somePath);
with the attached file
116.xlsx
and got the exception below:
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.nextSpecificSheets(MSExcelParser.java:496)
	at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:422)
	at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
	at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
	at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:187)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
	at scala.Option.orElse(Option.scala:289)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
In fact, this exception now occurs with every file, regardless of the previous issue.
Can you please verify the issue on your end?
Sure, let me check that too!
rptOpportunityDetailsWithStageH is the sheet name. I normally pass it dynamically, but for the test case I have also checked it by passing it statically.
I've given version 1.1.1 a try with two different files, without specifying sheet names:
- Sample - Superstore.xlsx - has 3 different worksheets - works fine.
- Opportunity Details_test_xlsx.xlsx - has a single worksheet - the exception occurs.
So, one root cause I found: the file that produces the error contains empty columns with only a header value. When I tried reading the same file after adding just a space to a single cell of each empty column, it didn't throw any error.
Also, I passed the "read.sheets" parameter with the sheet name rptOpportunityDetailsWithStageH,
and it threw the error below:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
	at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.nextSpecificSheets(MSExcelParser.java:496)
	at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.getNext(MSExcelParser.java:422)
	at org.zuinnote.hadoop.office.format.common.OfficeReader.getNext(OfficeReader.java:127)
	at org.zuinnote.hadoop.office.format.mapreduce.ExcelRecordReader.nextKeyValue(ExcelRecordReader.java:89)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.hasNext(HadoopFileExcelReader.scala:61)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:44)
	at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:187)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
	at scala.Option.orElse(Option.scala:289)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:166)
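The empty-column observation is consistent with an unguarded array index: if a row's cell array is shorter than the header row, indexing by column position throws `ArrayIndexOutOfBoundsException`. A minimal sketch of the defensive pattern (hypothetical helper names, not the library's actual code):

```java
import java.util.Arrays;

public class SafeRowAccess {
    // Pad (or truncate) a row's cells to the header width so downstream
    // indexing by column position never goes out of bounds.
    // Hypothetical helper, not part of hadoopoffice itself.
    static String[] normalizeRow(String[] cells, int headerWidth) {
        if (cells.length == headerWidth) {
            return cells;
        }
        // Arrays.copyOf pads missing trailing cells with null
        return Arrays.copyOf(cells, headerWidth);
    }

    public static void main(String[] args) {
        String[] header = {"Name", "Stage", "Amount"};
        String[] shortRow = {"Acme"};            // trailing columns are empty
        String[] fixed = normalizeRow(shortRow, header.length);
        System.out.println(fixed.length);        // 3
        System.out.println(fixed[2]);            // null
    }
}
```

With this normalization, a row that physically stores fewer cells than the header still yields a value (possibly null) for every declared column.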
You are welcome!
Yes, you are absolutely correct! We don't have control over how and what kind of data users will add. As this is such a broad area, we can't even enumerate all the cases.
I'll work on this and try to figure out which of the more general error-prone scenarios can be covered easily.
Another thing I found while reading the file 125.xlsx:
I don't know whether this comes from Spark, or whether we need to correct something on our file side or the reader side... It would be good if we could figure this out as well.
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:46)
	at org.apache.spark.sql.types.DecimalType$.apply(DecimalType.scala:43)
	at org.apache.spark.sql.types.DataTypes.createDecimalType(DataTypes.java:123)
	at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3$$anonfun$apply$2.apply(DefaultSource.scala:293)
	at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3$$anonfun$apply$2.apply(DefaultSource.scala:160)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3.apply(DefaultSource.scala:160)
	at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$inferSchema$3.apply(DefaultSource.scala:151)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.zuinnote.spark.office.excel.HadoopFileExcelReader.foreach(HadoopFileExcelReader.scala:45)
	at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:151)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
	at scala.Option.orElse(Option.scala:289)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
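For context, Spark's `DecimalType(precision, scale)` requires precision >= scale, but a value such as `0.05` has only one significant digit and two fraction digits, so naive inference yields precision 1 and scale 2, exactly the combination the exception reports. A small sketch of clamping the inferred precision (a hypothetical helper, not the data source's actual inference code):

```java
public class DecimalInference {
    // Infer (precision, scale) from a plain decimal literal and clamp
    // precision so it is never smaller than scale, mirroring Spark's
    // DecimalType requirement. Hypothetical helper for illustration.
    static int[] inferPrecisionScale(String value) {
        java.math.BigDecimal d = new java.math.BigDecimal(value);
        int scale = Math.max(d.scale(), 0);
        // BigDecimal("0.05").precision() == 1, but Spark needs precision >= scale
        int precision = Math.max(d.precision(), scale);
        return new int[]{precision, scale};
    }

    public static void main(String[] args) {
        int[] ps = inferPrecisionScale("0.05");
        System.out.println(ps[0] + "," + ps[1]); // 2,2
    }
}
```

With the clamp in place, `0.05` maps to a valid `DecimalType(2, 2)` instead of the illegal `(1, 2)`.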
Hey @jornfranke, are there any hurdles to resolving the current issue?
Yes, the original issue is resolved, but now we are not able to pass a worksheet name; it throws an exception whenever we do.
Please check 1.2.0, and file a new issue if the problem still persists.