Comments (19)
from spark-hadoopoffice-ds.
May I ask which mode you use it in? Simple mode with simple data types, or CellMode?
from spark-hadoopoffice-ds.
Thanks for the quick response.
Until now we haven't exported directly to Excel, because we were missing a component for the 'last mile'.
The current workflow consists of exporting to CSV files and post-processing them with a custom script that converts to Excel and applies formatting. But this has become very complex and fragile, and is a burden to create and maintain.
In the cases where we do export, we simply write using the field names, types, and order from the Spark dataframe.
Just exporting partitioned files with headers and values would cut a lot of effort from workflow development.
As the Excel data is for human consumption, we foresee that users will request formatting and some data handling, and the template functionality in this datasource looks very interesting for tackling that need as well. But that is a future problem; first we need to get the basics working well.
from spark-hadoopoffice-ds.
Sorry for the misunderstanding.
Just simple data types: string, int, bigint, datetime (java).
Thanks, again.
from spark-hadoopoffice-ds.
No problem. Just out of curiosity, do you use
df.toDF.write.partitionBy("year", "month", "day").format("org.zuinnote.spark.office.excel")
.option("write.locale.bcp47", "us")
.save("/home/user/office/output")
?
I want to adapt the integration tests to the scenario that you describe. Since we basically already use the internal Spark APIs, few or no modifications should be needed.
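As a side note for readers of this thread: the locale option takes a BCP 47 language tag, and the chosen locale determines how numeric and date values are rendered in cells. A minimal standalone Java sketch (no Spark or HadoopOffice needed; the class name is made up for illustration) of how BCP 47 tags resolve via java.util.Locale:

```java
import java.text.NumberFormat;
import java.util.Locale;

// Illustrative only: shows why the BCP 47 locale tag matters for cell rendering.
public class Bcp47Demo {
    public static void main(String[] args) {
        // BCP 47 tags resolve to Java locales via Locale.forLanguageTag
        Locale us = Locale.forLanguageTag("en-US");
        Locale de = Locale.forLanguageTag("de-DE");
        // The same number is rendered differently depending on the locale
        System.out.println(NumberFormat.getInstance(us).format(1234.5)); // 1,234.5
        System.out.println(NumberFormat.getInstance(de).format(1234.5)); // 1.234,5
    }
}
```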
from spark-hadoopoffice-ds.
Sorry for the delay.
Yes, you are right! I'm basically using it this way:
Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
myDF.select("value1", "value2", "year", "month", "day")
    .write().format("org.zuinnote.spark.office.excel")
    .option("hadoopoffice.write.mimeType", "application/vnd.ms-excel")
    .option("hadoopoffice.write.locale.bcp47", "en")
    .option("hadoopoffice.write.header.write", "true")
    .partitionBy("year", "month", "day")
    .save("hdfs://user/spark/warehouse/output");
Just some additional details that might be useful:
- When using "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", the values for non-string columns were empty or 0. I couldn't figure out what happened.
- I used the "hadoopoffice.write." prefix because plain "write." was not working. But I need to investigate further.
- I'm missing an option for omitting the columns "year", "month", "day" from the final output. Currently all Spark datasources require the partition columns to be present in the dataframe for writing, and there is no way to keep them out of the output.
Thanks for the support!
from spark-hadoopoffice-ds.
Great!
from spark-hadoopoffice-ds.
Some quick updates:
- Even though it already works, I will add partitioning to the documentation and as an integration test case.
- About "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": I will try to reproduce it tomorrow. It is odd, because the integration and unit test cases do not expose this behaviour.
- About "hadoopoffice.write": I will investigate, but there was a plan to deprecate the options without the hadoopoffice prefix anyway, to make them consistent across all platforms.
- About the missing option for omitting columns: this will be difficult, because that logic is handled at the Spark level. Spark always requires the partitions to be present as folders AND in the data itself. One could of course define a template where those columns are simply made invisible, or something like that.
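For background on why those columns have to stay in the dataframe: Spark derives the partition directory layout from the column values at write time (Hive-style key=value path segments), so it needs the values per row. A small standalone Java sketch (a hypothetical helper, not part of Spark or this datasource) of that layout:

```java
import java.util.LinkedHashMap;
import java.util.stream.Collectors;

// Illustrative only: mimics Spark's Hive-style partition path, e.g. year=2019/month=07
public class PartitionPath {
    static String partitionPath(LinkedHashMap<String, String> cols) {
        // Each partition column becomes one key=value path segment, in column order
        return cols.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> p = new LinkedHashMap<>();
        p.put("year", "2019");
        p.put("month", "07");
        p.put("day", "23");
        System.out.println(partitionPath(p)); // year=2019/month=07/day=23
    }
}
```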
from spark-hadoopoffice-ds.
Thanks for the update.
- Great to see your support in developing this feature.
- I'll check again whether it works with xls but not xlsx.
- I see no problem in using options prefixed with hadoopoffice.*. Perhaps a small update to the README would help users of the library.
- Yes, that's a pity. I agree that an option for omitting columns would be better addressed in Spark itself. (I wonder if there is a backlog where I could register a feature request for Spark.)
If you need help testing, please contact me.
Thanks again for your helpful support.
from spark-hadoopoffice-ds.
I created an improvement issue in the Spark project for adding an opt-in option for omitting columns when saving with Spark:
https://issues.apache.org/jira/browse/SPARK-28505
from spark-hadoopoffice-ds.
Could you check whether you still have issues with xlsx? I cannot reproduce this; it is part of the integration tests and they seem to work fine: https://github.com/ZuInnoTe/spark-hadoopoffice-ds/blob/master/src/it/scala/org/zuinnote/spark/office/excel/SparkScalaExcelDSSparkMasterIntegrationSpec.scala
from spark-hadoopoffice-ds.
I just pushed support for partitioning, including an integration test verifying that it works.
I also noticed that the partition columns are not written by Spark into the file itself; they only exist at the folder level (as you requested above). I am not sure why you observed this with the CSV file.
I can publish it on Maven Central as version 1.3.2 next week. You can test then, or if you do not want to wait for the official Maven Central version, just clone this Git repository and run publishToMavenLocal, then include version 1.3.2 in your application.
For the other issues that you mention, could you please create dedicated issues, so we don't mix them all into this one?
Thanks a lot.
from spark-hadoopoffice-ds.
I just published 1.3.2. Could you please test whether it meets your needs with respect to partitioning? If it is successful, I propose closing this issue and creating new issues for the remaining points that you mention.
from spark-hadoopoffice-ds.
Sorry for being late.
I will test tomorrow and report any issues.
Good to know that the partition columns are not included. I think that happened only with the previous xls datasource library; CSV does not show this behaviour either.
I will also try to isolate a test case for the xlsx problem, if I can reproduce it.
Thanks for all the effort.
from spark-hadoopoffice-ds.
I tested version 1.3.2 and partitioning worked perfectly in a couple of scenarios: xls, xlsx, and compression.
Congratulations on the hard work!
I also created a test case reproducing the 'hadoopoffice.write.mimeType' XLSX problem. I will open a new issue and attach the test case.
from spark-hadoopoffice-ds.