Memory issue,about zuinnote/spark-hadoopoffice-ds

jornfranke commented on June 3, 2024

How much memory do you provide? Keep in mind that an excel file (new excel format ) is compressed so 60 gb are easily several hundred of mb or Gb. Then internally Apache poi represents it as Java objects do you need to add a couple of more mb. You can try to use low footprint mode to reduce the memory amount.

…

On 22. Feb 2018, at 08:32, alki123 ***@***.***> wrote: I tried creating dataframe out of a 60Mb excel file using spark-hadoopoffice-ds-2.11. But it throws java.lang.OutOfMemoryError: GC overhead limit exceeded. spark.executor.instances was set to 3. What is the right solution for this? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

alki123 commented on June 3, 2024

I tried setting hadoopoffice.read.lowFootprint as true. I am only reading .xlsx file. Still the same error occurs. I am reading only a small excel file of 60 mb. Java maximum heap memory size is 1024 mb.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

As I said, it is compressed 60 gb That means uncompressed in memory it is much larger. You will need to provide more memory if you decide to keep the full excel content in-memory. This is even the case for low footprint. How did you set the option for low footprint? How many rows and columns do you have? Do you have a lot of strings or numeric data (formatted as numeric data)?

…

On 22. Feb 2018, at 10:25, alki123 ***@***.***> wrote: I tried setting hadoopoffice.read.lowFootprint as true. I am only reading .xlsx file. Still the same error occurs. I am reading only a small excel file of 60 mb. Java maximum heap memory size is 1024 mb. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

Can you please also provide a little bit more source code? It could also come from a Spark operation not related to the data source. Please tell me if the error is from the executor log or driver log?

…

On 22. Feb 2018, at 10:25, alki123 ***@***.***> wrote: I tried setting hadoopoffice.read.lowFootprint as true. I am only reading .xlsx file. Still the same error occurs. I am reading only a small excel file of 60 mb. Java maximum heap memory size is 1024 mb. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

alki123 commented on June 3, 2024

dsAsRow = sparkSQL.read().option("read.spark.useHeader", useHeader)
.option("hadoopoffice.read.lowFootprint","true")
.option("read.spark.simpleMode", "true").option("read.locale.bcp47", "us")
.option("hadoopoffice.read.sheets", SheetName)
.option("hadoopoffice.read.mimeType",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
.format("org.zuinnote.spark.office.excel")

This is my code for creating data frame.. Here header is true. There are totally around million rows and 12 columns. It contains both string and numeric data. I am reading from the container log in azure cluster

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

The option should be read.lowFootprint (sorry for the confusion we will fix the naming in the next version to allow both names) same for the other options - there is no Hadoopoffice prefix for the Spark datasource. However even then 1024 mb could be a little bit tight, but try first if setting lowFootprint as above solves it.

…

On 22. Feb 2018, at 10:58, alki123 ***@***.***> wrote: dsAsRow = sparkSQL.read().option("read.spark.useHeader", useHeader) .option("hadoopoffice.read.lowFootprint","true") .option("read.spark.simpleMode", "true").option("read.locale.bcp47", "us") .option("hadoopoffice.read.sheets", SheetName) .option("hadoopoffice.read.mimeType", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet") .format("org.zuinnote.spark.office.excel") This is my code for creating data frame.. Here header is true. There are totally around million rows and 12 columns. It contains both string and numeric data. I am reading from the container log in azure cluster — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

alki123 commented on June 3, 2024

As you said I changed hadoopoffice.read.lowFootprint to read.lowFootprint. Now there is no error. But dataframe is not yet created. These are the last four lines of the log.
18/02/22 10:16:24 INFO OfficeReader: Using standard API to parse Excel file
18/02/22 10:16:25 INFO ContextCleaner: Cleaned accumulator 76
18/02/22 10:16:26 INFO ContextCleaner: Cleaned accumulator 75
18/02/22 10:16:26 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.0.5:33457 in memory (size: 7.6 KB, free: 1458.6 MB)
18/02/22 10:16:26 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.0.7:34979 in memory (size: 7.6 KB, free: 912.3 MB)
Logger info after creating dataframe is not printed in log.
Does it take more than 10 mins to create dataframe for a 60 mb file?
I had been using crealytics in the past. But there were some issues with it. Thats why trying out hadoop office.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

We run regularly integration tests to make sure it works. However, as I said Excel is a very slow format due to its design. 60 mb is compressed, but uncompressed it is much larger. Due to the format you can also only have one thread processing it... It might be also that the file is corrupt. I would propose that you try first with a small subset of the Excel and see where the issue is. It could be also the memory setting.

…

On 22. Feb 2018, at 11:34, alki123 ***@***.***> wrote: As you said I changed hadoopoffice.read.lowFootprint to read.lowFootprint. Now there is no error. But dataframe is not yet created. These are the last four lines of the log. 18/02/22 10:16:24 INFO OfficeReader: Using standard API to parse Excel file 18/02/22 10:16:25 INFO ContextCleaner: Cleaned accumulator 76 18/02/22 10:16:26 INFO ContextCleaner: Cleaned accumulator 75 18/02/22 10:16:26 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.0.5:33457 in memory (size: 7.6 KB, free: 1458.6 MB) 18/02/22 10:16:26 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.0.7:34979 in memory (size: 7.6 KB, free: 912.3 MB) Logger info after creating dataframe is not printed in log. Does it take more than 10 mins to create dataframe for a 60 mb file? I had been using crealytics in the past. But there were some issues with it. Thats why trying out hadoop office. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

alki123 commented on June 3, 2024

It works well for small files. The problem is only with big files. I have even changed the heap size to 2048mb. Still out of memory error occurs.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.xerces.xni.XMLString.toString(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
at org.openxmlformats.schemas.spreadsheetml.x2006.main.WorksheetDocument$Factory.parse(Unknown Source)
at org.apache.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:183)
at org.apache.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:175)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.parseSheet(XSSFWorkbook.java:438)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:403)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.(XSSFWorkbook.java:266)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at org.zuinnote.hadoop.office.format.common.parser.MSExcelParser.parse(MSExcelParser.java:146)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:90)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:130)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.(HadoopFileExcelReader.scala:56)
at org.zuinnote.spark.office.excel.DefaultSource.inferSchema(DefaultSource.scala:147)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)

from spark-hadoopoffice-ds.

alki123 commented on June 3, 2024

File is also not corrupted.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

Well then in low footprint mode. (Does it work now?) Unfortunately this is how the Excel format is structured. You need to provide a lot of memory. Can you open the Excel file in Excel properly?

…

On 22. Feb 2018, at 13:45, alki123 ***@***.***> wrote: File is also not corrupted. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

alki123 commented on June 3, 2024

How much memory is exactly needed?
It does not work in low footprint also.
Yes I can open the excel file properly.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

How much memory you need depends on the Excel file. For instance if you have a lot of unique strings (and sometimes Excel stores number of strings) then you need to load them fully in memory. Java/Scala are very memory hungry for their objects in memory. It highly depends on the data and what you do with it and the previously described before. This only you know. You could start with a high value and then reduce it until you receive an oom error. This will be your lower bound. In general Excel is very memory hungry and alternative formats such as ORC or parquet are more memory optimized. This is due to the format and unfortunately you cannot do much about it except add more memory.

…

On 22. Feb 2018, at 14:22, alki123 ***@***.***> wrote: How much memory is exactly needed? It does not work in low footprint also. Yes I can open the excel file properly. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

alki123 commented on June 3, 2024

Thanks jornfranke. Can I know what api you use to read excel file like apache poi or monitorjbl ?

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

Apache POI

…

On 23. Feb 2018, at 07:29, alki123 ***@***.***> wrote: Thanks jornfranke. Can I know what library you use to read excel file like apache poi or monitorjbl ? — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

Feel free to share the Excel file, but my comments above already indicate potential issues.

…

On 23. Feb 2018, at 07:29, alki123 ***@***.***> wrote: Thanks jornfranke. Can I know what library you use to read excel file like apache poi or monitorjbl ? — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

For your memory calculations keep also in mind that spark needs to keep all the data in-memory. Depending on what you do in Spark maybe even more.

…

On 23. Feb 2018, at 07:29, alki123 ***@***.***> wrote: Thanks jornfranke. Can I know what library you use to read excel file like apache poi or monitorjbl ? — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

Monitorjbl is also based on poi, low footprint mode uses already the streaming api...

…

On 23. Feb 2018, at 07:29, alki123 ***@***.***> wrote: Thanks jornfranke. Can I know what library you use to read excel file like apache poi or monitorjbl ? — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

from spark-hadoopoffice-ds.

jornfranke commented on June 3, 2024

I am closing this issue. I recommend trying with the newest HadoopOffice library and/or adding more memory.

from spark-hadoopoffice-ds.

Memory issue about spark-hadoopoffice-ds HOT 19 CLOSED

Comments (19)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent