best-practice-and-impact / ons-spark

License: MIT License
The notebook converter needs a Python cell to immediately precede an R cell, whereas Python cells can stand alone. This should be changed so that R cells can be independent of Python cells if needed.
The book should have a section on common SQL commands, views, etc.
You can use this page about views as a starting point.
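For example, a minimal PySpark sketch of registering a view and querying it with SQL (the file path and column name are hypothetical, not from the book):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rescue = spark.read.parquet("/path/to/rescue.parquet")  # hypothetical path

    # Register the DataFrame as a temporary view so it can be queried with SQL
    rescue.createOrReplaceTempView("rescue")

    # Common SQL commands can then be run with spark.sql()
    spark.sql("""
        SELECT animal_group, COUNT(*) AS n
        FROM rescue
        GROUP BY animal_group
        ORDER BY n DESC
    """).show()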
Pandas outputs generated in some notebooks (e.g. Introduction to PySpark) run off the end of the page.
In the section 'Write to a CSV: spark_write_csv() and sdf_coalesce()' it is not clear that the first box is the incorrect answer and that you need to move on to the second. I think this needs to be more obvious, as if you are just scanning for answers you can easily miss this distinction.
The Spark Sessions article references CDSW etc. We should make this more general, especially as it is the first article listed.
Sometimes, although the principle is the same, the output for R and Python cells will be different, e.g. printSchema() vs glimpse(). Currently it will only display the Python output.
There is some information contained within the Introduction to PySpark on reading and writing data, but there are also separate sections in the book for Reading and Writing data. The Introduction to PySpark is a nice walkthrough-type chapter, but it would be good if the Reading and Writing sections contained example code for different data types (parquet, CSV, Hive, etc.). That way, if someone wanted to find the information fast, they could.
This notebook discusses how you can use isin() or between() to replace chained AND/OR conditions, and should be worked into the Introduction to PySpark section.
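As a minimal sketch of the idea (the data and column names here are made up, not taken from the notebook):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Cat", 3), ("Dog", 7), ("Hamster", 1)],
        ["animal_group", "age"],
    )

    # (age >= 2) AND (age <= 5) can be replaced with between()
    df.filter(F.col("age").between(2, 5)).show()

    # (x == "Cat") OR (x == "Dog") can be replaced with isin()
    df.filter(F.col("animal_group").isin("Cat", "Dog")).show()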
All Python output is captured in this file, but only direct R outputs (e.g. print()) are captured; other output for R cells will not be displayed.
Add the interpolation notebook from the troubleshooting repo to the book.
This could live in the Analysis in Spark section (mentioned in #85)
This chapter should include methods for working with strings, for example, any of the functions mentioned here
Notebooks from the troubleshooting repos, such as: apply_string_methods, change_string_case.
Note: apply_string_methods may belong elsewhere, such as in a section about RDDs - if this is the case, please add it as a new issue so we can be sure it will be added in.
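As an illustration, a few of the string functions such a chapter could cover (a minimal sketch with made-up data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(" Ada ", "Lovelace")], ["forename", "surname"])

    df.select(
        F.upper("surname").alias("surname_upper"),      # change case
        F.trim("forename").alias("forename_trimmed"),   # strip whitespace
        F.substring("surname", 1, 1).alias("initial"),  # take a substring
        F.concat_ws(" ", F.trim("forename"), "surname").alias("full_name"),
    ).show()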
The Optimising Joins chapter uses a CSV as source data rather than parquet (this was done during migration to the book, to preserve the older screenshots). It should be rewritten to use parquet, with new Spark UI screenshots.
The versioning is present in the repo but doesn't seem to show up at the bottom of the menu bar on the left of the published book. The versioning has been taken from the Duck Book, and the Duck Book displays the version as required.
I have built the book locally and the versioning appears as it should, so something in the deployment might be the source of the issue.
Add functionality to get metrics, e.g. how many clicks the book gets.
It would be good to have a section on arrays. This can include the functions mentioned here.
It would also be good to include the information in the gladis_bands notebook in the troubleshooting section
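A minimal sketch of the kind of array functions the section could cover (the band/genre data is made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Band A", "rock,pop"), ("Band B", "jazz")],
        ["band", "genre_string"],
    )

    # split() turns a delimited string into an array column
    df = df.withColumn("genres", F.split("genre_string", ","))

    # array_contains() tests membership of an array
    df.filter(F.array_contains("genres", "rock")).show()

    # explode() produces one row per array element
    df.select("band", F.explode("genres").alias("genre")).show()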
Apache Spark have moved their docs, e.g.:
The old link: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.persist.html
The new link: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.persist.html
Need to change all links in the book to reflect this change.
Currently, the filepaths in the config only work when you're running the code within DevTest. If you try to run a chapter that uses the config locally, it will error as the file cannot be found. It would be good to see if we could come up with some sort of solution for this.
The Pytest for PySpark and testthat for sparklyr repos should be in another directory, along with any other future supporting material.
The hyperlink at the bottom of the Checkpoint and staging tables page points to a raw notebook and not the correctly formatted page from the book.
Change the authors of the book to Analysis Standards and Pipelines in MQD, ONS.
Use a virtual environment so users can run the code from the book directly in their browsers.
E.g. Google Colab, Binder.
Make the contents on the top right of the article easier to follow by grouping operations.
I think it would be beneficial to have a chapter on regular expressions in the book.
This could include things like rlike, regexp_replace, regexp_extract, etc.
Would be good to have a section on the computational complexity of regex vs other methods/how to optimise regex (as this came up a lot in the GLADIS work).
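A minimal sketch of the three functions in action (the postcode data and patterns are made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("SW1A 1AA",), ("E20 2ST",)], ["postcode"])

    # rlike(): boolean test against a regex
    df.filter(F.col("postcode").rlike("^SW")).show()

    # regexp_replace(): substitute every match of a pattern
    df.select(F.regexp_replace("postcode", "[0-9]", "").alias("letters_only")).show()

    # regexp_extract(): pull out a capture group (here, the outward code)
    df.select(
        F.regexp_extract("postcode", "^([A-Z]+[0-9][A-Z0-9]?)", 1).alias("outward")
    ).show()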
We have been given some code which can be used to return an exact number of samples per stratum by modifying the existing "return an exact sample" section.
It might take some time to modify the code and generalise the method to any number of strata.
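One possible approach is to rank rows randomly within each stratum and keep the first n (this sketch is an assumption, not the supplied code; the data and sample size are made up):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("A", i) for i in range(10)] + [("B", i) for i in range(10)],
        ["stratum", "value"],
    )

    n_per_stratum = 3  # hypothetical sample size

    # Rank each stratum's rows in a random order, then keep the first n;
    # this returns exactly n rows per stratum (or all rows for smaller strata)
    w = Window.partitionBy("stratum").orderBy(F.rand(seed=42))
    sample = (
        df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") <= n_per_stratum)
          .drop("rn")
    )
    sample.show()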
I think it's worthwhile to create a chapter around how null/nan/None values are handled in Spark, the differences between them and any oddities to be aware of.
This notebook, from the troubleshooting repo, could be a good idea to include.
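A minimal sketch of one of the oddities such a chapter would cover: Python's None becomes a SQL null in Spark, while NaN is a distinct floating-point value, and isnull()/isnan() treat them differently:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["value"])

    df.select(
        "value",
        F.isnull("value").alias("is_null"),  # True only for the null (None) row
        F.isnan("value").alias("is_nan"),    # True only for the NaN row
    ).show()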
Ted presented some work on using Spark ML for inference, as is commonly done in ONS. We want to add that work and create PySpark examples to go with his sparklyr examples.
Comments from DF on optimisation ideas.
Consider using ORC for space efficiency/long-term storage?
Also consider storing Hive data as:
- partitioned, if the partition key is used in subsequent transformation expressions (e.g. filtering on the partition key)
- bucketed, if the column is used in subsequent joins (see the sketch after this list)
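A minimal sketch of both options (table names, columns and the bucket count are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.createDataFrame([(1, 2021), (2, 2022)], ["id", "year"])

    # Partitioned: one directory per value of the partition key, so filters
    # on that key only read the relevant partitions
    (df.write.mode("overwrite")
       .format("orc")
       .partitionBy("year")
       .saveAsTable("db.example_partitioned"))

    # Bucketed: rows are hashed into a fixed number of buckets on the join key,
    # which can avoid a shuffle in later joins; bucketBy() requires saveAsTable()
    (df.write.mode("overwrite")
       .format("orc")
       .bucketBy(20, "id")
       .sortBy("id")
       .saveAsTable("db.example_bucketed"))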
This notebook talks about unioning datasets with many partitions.
It should be added into the book, under the "Managing Partitions" section.
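The underlying behaviour, as a minimal sketch (the partition counts are made up): union() does not rebalance, so the result carries the partitions of both inputs and may need coalescing afterwards.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.range(100).repartition(200)
    df2 = spark.range(100).repartition(200)

    combined = df1.union(df2)
    print(combined.rdd.getNumPartitions())  # 400: the sum of both inputs

    combined = combined.coalesce(200)  # reduce the count if appropriate
    print(combined.rdd.getNumPartitions())  # 200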
What do we need to put into the requirements.txt file to make the PySpark and sparklyr code run? e.g. pyspark>=2.4.0, sparklyr>=1.7.4
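A possible starting point (the pins are assumptions to be confirmed):

    # requirements.txt
    pyspark>=2.4.0

Note that sparklyr is an R package, so sparklyr>=1.7.4 would need to be pinned via R tooling (e.g. renv or a DESCRIPTION file) rather than requirements.txt.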
Put this into the section on HDFS in the book.
Should we show a local session here and explain that you can use them on your laptop or for development? They're useful because they use minimal resource.
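For illustration, a minimal local session (the app name is arbitrary):

    from pyspark.sql import SparkSession

    # A local session runs Spark in-process on your own machine:
    # no cluster needed, minimal resource used
    spark = (
        SparkSession.builder
        .master("local[2]")       # use 2 local cores
        .appName("local-session")
        .getOrCreate()
    )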
Add a version number to the book, similar to the Duck Book.
Short one: should the Persisting to disk section go before the Checkpointing heading here? Perhaps the last paragraph of that section could be the first paragraph of the Checkpoint section.
Currently it tries to take you to this link https://github.com/robertswh/Spark%20at%20the%20ONS rather than the bpi version.
This notebook currently lives in the troubleshooting repo, but should be added to the joins section in the book.
The logistic regression page makes use of the cleaned rescue dataset rather than the usual rescue dataset specified in the config file. This dataset should be uploaded to the file system and the path added to config.yaml so that the data can be read in using the same syntax as the rest of the book.
We should move the section "Partitions when writing data" into a separate article on writing data.
This notebook talks about when it's appropriate to use loops and how to replace them using groupBy() or window functions.
It's a good opportunity to talk about why loops may be inefficient at times and alternatives to using them.
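As a minimal sketch of the contrast (data and column names are made up, not taken from the notebook):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Cat", 1), ("Cat", 2), ("Dog", 3)],
        ["animal_group", "count"],
    )

    # Instead of looping over groups and filtering each one in turn,
    # a single groupBy() aggregates every group in one pass...
    df.groupBy("animal_group").agg(F.sum("count").alias("total")).show()

    # ...and a window function attaches the aggregate to every row
    w = Window.partitionBy("animal_group")
    df.withColumn("group_total", F.sum("count").over(w)).show()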
The age_diff_flag notebook is located in the troubleshooting repo here.
I think this would belong in the Analysis in Spark section (which may or may not exist yet - there's an issue to create this section here), but if you can think of anywhere else that's appropriate, feel free to place it there.
May need to think about what other content we can include alongside it if necessary.
This notebook talks about date interval functions in PySpark and should be added into the book.
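For example, a few of the interval functions it could demonstrate (the dates here are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2023-01-01", "2023-03-15")], ["start", "end"])
    df = df.select(F.to_date("start").alias("start"), F.to_date("end").alias("end"))

    df.select(
        F.datediff("end", "start").alias("days_between"),
        F.months_between("end", "start").alias("months_between"),
        F.date_add("start", 7).alias("one_week_later"),
    ).show()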
Mirror the front page of the Duck Book more closely.
Chris and I spent some time trying to develop some sparklyr code to read/write Avro files; however, we weren't able to get it to work.
I believe it's because of version differences - it was asking us to specify the package and version in the Spark config, as in the following code:
spark_connect(..., packages = c("org.apache.spark:spark-avro_2.11:2.4.0"))
However, we were getting errors saying it could not load sparkavro dependencies - I think this is because the version of sparkavro we were using was incompatible with the sparklyr version we were using in DevTest.
It MAY work outside of DevTest with a later Spark version, but then we wouldn't be able to build the book, since some of the chapter content uses Hive tables, which we'd need to run in DevTest.
The Cramer's V notebook is located in the troubleshooting repo here.
I think this would belong in the Analysis in Spark section (which may or may not exist yet - there's an issue to create this section here) but feel free to place it elsewhere if you can think of somewhere else that's more appropriate.
Need to add R code for checkpoints and add a staging tables example (plus R code).
The bin_continuous_variables notebook currently lives in the troubleshooting repo, but could be moved into the book in an "Analysis in Spark" section (mentioned in #85)