
jornfranke commented on June 13, 2024

from spark-hadoopoffice-ds.

jornfranke commented on June 13, 2024

May I ask in which mode do you use it? Simple mode with simple data types or in CellMode?


juarezr commented on June 13, 2024

Thanks for the quick response.
Until today we haven't exported directly to Excel, because we lacked a component for the 'last mile'.

The current workflow consists of exporting to CSV files and post-processing them with a script that converts to Excel and applies formatting. But this has become complex and fragile, and a burden to create and maintain.

In the cases where we do export, we simply write using the field names, types, and order from the Spark dataframe.

Just exporting partitioned files with the headers and values will cut a lot of effort from workflow development.

As the Excel data is for human consumption, we foresee that users will request formatting and some data handling, and the template functionality in this datasource looks very interesting for tackling that need as well. But that is a future problem; first we need to get the basics working well.


juarezr commented on June 13, 2024

Sorry for the misunderstanding.
Just simple data types: string, int, bigint, datetime (Java).
Thanks again.


jornfranke commented on June 13, 2024

No problem. Just out of curiosity, do you use df.toDF.write.partitionBy("year","month","day").format("org.zuinnote.spark.office.excel")
.option("write.locale.bcp47", "us")
.save("/home/user/office/output")
?

I want to adapt the integration tests to the scenario you describe. Since we already essentially use the internal Spark APIs, few or no modifications should be needed.
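For context, Spark's partitionBy writes one subdirectory per distinct combination of the partition columns, using Hive-style key=value path segments. A minimal pure-Python sketch of that path layout (illustrative only, independent of Spark; the column names mirror the snippet above and the values are made up):

```python
# Illustrative sketch: the directory layout Spark's partitionBy produces.
# One subfolder per distinct combination of partition values, encoded as
# Hive-style key=value path segments.

def partition_path(base: str, **partition_values) -> str:
    """Build the Hive-style output path for one partition."""
    segments = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([base] + segments)

path = partition_path("/home/user/office/output", year=2019, month=7, day=24)
# -> /home/user/office/output/year=2019/month=7/day=24
```

Each such directory then contains the data files for that partition; the actual file naming is up to the datasource.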


juarezr commented on June 13, 2024

Sorry for the delay.
Yes, you are right! I'm basically using it this way:

Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);

myDF.select("value1", "value2", "year", "month", "day")
	.write().format("org.zuinnote.spark.office.excel")
	.option("hadoopoffice.write.mimeType", "application/vnd.ms-excel")
	.option("hadoopoffice.write.locale.bcp47", "en")
	.option("hadoopoffice.write.header.write", "true")
	.partitionBy("year", "month", "day")
	.save("hdfs://user/spark/warehouse/output");

Just some additional details that perhaps could be useful:

  • When using "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", the values for non-string columns were empty or 0. I couldn't figure out what happened.
  • I used the "hadoopoffice.write." prefix because "write." alone was not working. But I need to investigate further.
  • I'm missing an option for omitting the columns "year", "month", "day" from the final output. Currently all Spark datasources require the partition columns to be present in the dataframe for writing, and there is no way to keep them out of the output.
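For reference, the two mimeType values in play select the two Excel container formats; the mapping itself is the standard one for these media types. A small lookup sketch:

```python
# The two MIME types discussed above and the Excel formats they select.
EXCEL_MIME_TYPES = {
    # Legacy binary Excel workbook
    "application/vnd.ms-excel": ".xls",
    # Office Open XML spreadsheet
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
}

def excel_extension(mime_type: str) -> str:
    """Return the file extension for a supported Excel MIME type."""
    return EXCEL_MIME_TYPES[mime_type]
```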

Thanks for the support!


juarezr commented on June 13, 2024

Great! 🥇


jornfranke commented on June 13, 2024

A quick update:

  • Even if it already works, I will add partitioning to the documentation and as an integration test case.
  • About "when using application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": I will try to reproduce it tomorrow. It is odd, because the integration and unit test cases do not expose this behaviour.
  • About "hadoopoffice.write": I will investigate, but there was a plan to deprecate the options without the hadoopoffice prefix anyway, to make them consistent across all platforms.
  • About "missing a new option for omitting cols": this will be difficult, because that logic is handled at the Spark level. Spark always requires the partitions to exist as folders AND in the data itself. One could of course define a template where those columns are simply made invisible, or something like that.


juarezr commented on June 13, 2024

Thanks for the update.

  1. Great to see your support in developing this feature.
  2. I'll check again whether it works with xls but not xlsx.
  3. I see no problem in using options prefixed with hadoopoffice.*. Perhaps a small update to the README would help users of the library.
  4. Yes, that's a pity. I agree that an option for omitting columns would be better addressed in Spark itself. (I wonder if there is a backlog where one can file a feature request for Spark.)

If you need some help testing, please contact me.

Thanks again for your helpful support.


juarezr commented on June 13, 2024

I created an improvement issue in the Spark project to add an opt-in option for omitting columns when saving with Spark:
https://issues.apache.org/jira/browse/SPARK-28505


jornfranke commented on June 13, 2024

Could you check if you still have issues with xlsx? I cannot reproduce this; it is part of the integration tests, and they seem to work fine: https://github.com/ZuInnoTe/spark-hadoopoffice-ds/blob/master/src/it/scala/org/zuinnote/spark/office/excel/SparkScalaExcelDSSparkMasterIntegrationSpec.scala


jornfranke commented on June 13, 2024

I just pushed support for partitioning, including an integration test verifying that it works.
I also noted that the partitioned columns are not written by Spark into the file itself but only exist at the folder level (as you requested above). I don't know why you observed this for the CSV file.

I can publish it on Maven Central as version 1.3.2 next week. You can test it then, or if you do not want to wait for the official Maven Central version, just clone this Git repository and run publishToMavenLocal, then include version 1.3.2 in your application.

For the other issues that you mention, could you please create dedicated issues for them, so we don't mix them all in this one?

Thanks a lot.


jornfranke commented on June 13, 2024

I just published 1.3.2. Can you please test whether it meets your needs with regard to partitioning? If it is successful, I propose to close this issue and create new issues for the remaining points you mention.


juarezr commented on June 13, 2024

Sorry for being late.
I will test tomorrow and report any issues.
Good to know that the partitioned columns are not included. I think that happened only with the previous xls datasource library; CSV does not show this behaviour either.
I will also try to isolate a test case for the xlsx problem, if I can reproduce it.
Thanks for all the effort.


juarezr commented on June 13, 2024

I tested version 1.3.2 and partitioning worked perfectly in a couple of scenarios: xls, xlsx, and compression.
Congratulations on the hard work!

I also created a test case reproducing the 'hadoopoffice.write.mimeType' XLSX problem. I will open a new issue and attach the test case.

