best-practice-and-impact / ons-spark

License: MIT License
The notebook converter needs a Python cell to immediately precede an R cell, whereas Python cells can stand alone. This should be changed so that R cells can be independent of Python cells if needed.
The book should have a section on common SQL commands, views, etc.
You can use this page about views as a starting point.
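For example, a minimal PySpark sketch of registering a view and querying it with SQL (the file path and column name are hypothetical, not from the book):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rescue = spark.read.parquet("/path/to/rescue.parquet")  # hypothetical path

    # Register the DataFrame as a temporary view so it can be queried with SQL
    rescue.createOrReplaceTempView("rescue")

    # Common SQL commands can then be run with spark.sql()
    spark.sql("""
        SELECT animal_group, COUNT(*) AS n
        FROM rescue
        GROUP BY animal_group
        ORDER BY n DESC
    """).show()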
Pandas outputs generated in some notebooks (e.g. Introduction to PySpark) run off the end of the page.
In the section 'Write to a CSV: spark_write_csv() and sdf_coalesce()' it is not clear that the first box is the incorrect answer and that you need to move on to the second. I think this needs to be more obvious, as if you are just scanning for answers you can easily miss this distinction.
The Spark Sessions article references CDSW etc. We should make this more general, especially as it is the first article listed.
Sometimes, although the principle is the same, the output for R and Python cells will be different, e.g. printSchema() vs glimpse(). Currently it will only display the Python output.
There is some information contained within the Introduction to PySpark on reading and writing data, but there are also separate sections in the book for Reading and Writing data. The Introduction to PySpark is a nice walkthrough-type chapter, but it would be good if the Reading and Writing sections contained example code for different data types (parquet, CSV, Hive, etc.). That way, if someone wanted to find the information fast, they could.
This notebook discusses how you can use isin() or between() to replace chained AND/OR conditions, and should be worked into the Introduction to PySpark section.
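As a minimal sketch of the idea (the data and column names here are made up, not taken from the notebook):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Cat", 3), ("Dog", 7), ("Hamster", 1)],
        ["animal_group", "age"],
    )

    # (age >= 2) AND (age <= 5) can be replaced with between()
    df.filter(F.col("age").between(2, 5)).show()

    # (x == "Cat") OR (x == "Dog") can be replaced with isin()
    df.filter(F.col("animal_group").isin("Cat", "Dog")).show()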
All Python output is captured in this file, but only direct R outputs (e.g. print()) are captured; other output for R cells will not be displayed.
Add the interpolation notebook from the troubleshooting repo to the book.
This could live in the Analysis in Spark section (mentioned in #85)
This chapter should include methods for working with strings, for example, any of the functions mentioned here
Notebooks from the troubleshooting repos, such as: apply_string_methods, change_string_case.
Note: apply_string_methods may belong elsewhere, such as in a section about RDDs - if this is the case, please add it as a new issue so we can be sure it will be added in.
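As an illustration, a few of the string functions such a chapter could cover (a minimal sketch with made-up data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(" Ada ", "Lovelace")], ["forename", "surname"])

    df.select(
        F.upper("surname").alias("surname_upper"),      # change case
        F.trim("forename").alias("forename_trimmed"),   # strip whitespace
        F.substring("surname", 1, 1).alias("initial"),  # take a substring
        F.concat_ws(" ", F.trim("forename"), "surname").alias("full_name"),
    ).show()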
The Optimising Joins chapter uses a CSV as source data rather than parquet (this was done during migration to the book, to preserve the older screenshots). It should be rewritten to use parquet, with new Spark UI screenshots.
The versioning is present in the repo but doesn't seem to show up at the bottom of the menu bar on the left of the published book. The versioning has been taken from the Duck Book, and the Duck Book displays the version as required.
I have built the book locally and the versioning appears as it should, so something in the deployment might be the source of the issue.
Add functionality to get metrics, e.g. how many clicks the book gets.
It would be good to have a section on arrays. This can include the functions mentioned here.
It would also be good to include the information in the gladis_bands notebook in the troubleshooting section
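A minimal sketch of the kind of array functions the section could cover (the band/genre data is made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Band A", "rock,pop"), ("Band B", "jazz")],
        ["band", "genre_string"],
    )

    # split() turns a delimited string into an array column
    df = df.withColumn("genres", F.split("genre_string", ","))

    # array_contains() tests membership of an array
    df.filter(F.array_contains("genres", "rock")).show()

    # explode() produces one row per array element
    df.select("band", F.explode("genres").alias("genre")).show()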
Apache Spark have moved their docs, e.g.:
The old link: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.persist.html
The new link: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.persist.html
Need to change all links in the book to reflect this change.
Currently, the filepaths in the config only work when you're running the code within DevTest. If you try to run a chapter that uses the config locally, it will error as the file cannot be found. It would be good to see if we could come up with some sort of solution for this.
The Pytest for PySpark and testthat for sparklyr repos should be in another directory, along with any other future supporting material.
The hyperlink at the bottom of the Checkpoint and staging tables page points to a raw notebook and not the correctly formatted page from the book.
Change the authors of the book to Analysis Standards and Pipelines in MQD, ONS.
Use a virtual environment so users can run the code from the book directly in their browsers.
E.g. Google Colab, Binder.
Make the contents on the top right of the article easier to follow by grouping operations.
I think it would be beneficial to have a chapter on regular expressions in the book.
This could include things like rlike, regexp_replace, regexp_extract, etc.
Would be good to have a section on the computational complexity of regex vs other methods/how to optimise regex (as this came up a lot in the GLADIS work).
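A minimal sketch of the three functions in action (the postcode data and patterns are made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("SW1A 1AA",), ("E20 2ST",)], ["postcode"])

    # rlike(): boolean test against a regex
    df.filter(F.col("postcode").rlike("^SW")).show()

    # regexp_replace(): substitute every match of a pattern
    df.select(F.regexp_replace("postcode", "[0-9]", "").alias("letters_only")).show()

    # regexp_extract(): pull out a capture group (here, the outward code)
    df.select(
        F.regexp_extract("postcode", "^([A-Z]+[0-9][A-Z0-9]?)", 1).alias("outward")
    ).show()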
We have been given some code which can be used to return an exact number of samples per stratum by modifying the existing "return an exact sample" section.
It might take some time to modify the code and generalise the method to any number of strata.
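One possible approach is to rank rows randomly within each stratum and keep the first n (this sketch is an assumption, not the supplied code; the data and sample size are made up):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("A", i) for i in range(10)] + [("B", i) for i in range(10)],
        ["stratum", "value"],
    )

    n_per_stratum = 3  # hypothetical sample size

    # Rank each stratum's rows in a random order, then keep the first n;
    # this returns exactly n rows per stratum (or all rows for smaller strata)
    w = Window.partitionBy("stratum").orderBy(F.rand(seed=42))
    sample = (
        df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") <= n_per_stratum)
          .drop("rn")
    )
    sample.show()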
I think it's worthwhile to create a chapter around how null/nan/None values are handled in Spark, the differences between them and any oddities to be aware of.
This notebook, from the troubleshooting repo, could be a good idea to include.
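A minimal sketch of one of the oddities such a chapter would cover: Python's None becomes a SQL null in Spark, while NaN is a distinct floating-point value, and isnull()/isnan() treat them differently:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["value"])

    df.select(
        "value",
        F.isnull("value").alias("is_null"),  # True only for the null (None) row
        F.isnan("value").alias("is_nan"),    # True only for the NaN row
    ).show()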
Ted presented some work on using Spark ML for inference, as is commonly done in ONS. We want to add that work and create PySpark examples to go with his sparklyr examples.
Comments from DF on optimisation ideas.
Consider using ORC for space efficiency/long-term storage?
Also consider storing Hive data as:
- partitioned, if the partition key is used in subsequent transformation expressions (e.g. filtering on the partition key)
- bucketed, if the column is used in subsequent joins (see the sketch after this list)
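A minimal sketch of both options (table names, columns and the bucket count are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.createDataFrame([(1, 2021), (2, 2022)], ["id", "year"])

    # Partitioned: one directory per value of the partition key, so filters
    # on that key only read the relevant partitions
    (df.write.mode("overwrite")
       .format("orc")
       .partitionBy("year")
       .saveAsTable("db.example_partitioned"))

    # Bucketed: rows are hashed into a fixed number of buckets on the join key,
    # which can avoid a shuffle in later joins; bucketBy() requires saveAsTable()
    (df.write.mode("overwrite")
       .format("orc")
       .bucketBy(20, "id")
       .sortBy("id")
       .saveAsTable("db.example_bucketed"))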
This notebook talks about unioning datasets with many partitions.
It should be added into the book, under the "Managing Partitions" section.
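The underlying behaviour, as a minimal sketch (the partition counts are made up): union() does not rebalance, so the result carries the partitions of both inputs and may need coalescing afterwards.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.range(100).repartition(200)
    df2 = spark.range(100).repartition(200)

    combined = df1.union(df2)
    print(combined.rdd.getNumPartitions())  # 400: the sum of both inputs

    combined = combined.coalesce(200)  # reduce the count if appropriate
    print(combined.rdd.getNumPartitions())  # 200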
What do we need to put into the requirements.txt file to make the PySpark and sparklyr code run? e.g. pyspark>=2.4.0, sparklyr>=1.7.4
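A possible starting point (the pins are assumptions to be confirmed):

    # requirements.txt
    pyspark>=2.4.0

Note that sparklyr is an R package, so sparklyr>=1.7.4 would need to be pinned via R tooling (e.g. renv or a DESCRIPTION file) rather than requirements.txt.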
Put this into the section on HDFS in the book.
Should we show a local session here and explain that you can use them on your laptop or for development? They're useful because they use minimal resource.
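For illustration, a minimal local session (the app name is arbitrary):

    from pyspark.sql import SparkSession

    # A local session runs Spark in-process on your own machine:
    # no cluster needed, minimal resource used
    spark = (
        SparkSession.builder
        .master("local[2]")       # use 2 local cores
        .appName("local-session")
        .getOrCreate()
    )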
Add a version number to the book, similar to the Duck Book.
Short one: should the Persisting to disk section go before the Checkpointing heading here? Perhaps the last paragraph of that section could be the first paragraph of the Checkpoint section.
Currently it tries to take you to this link https://github.com/robertswh/Spark%20at%20the%20ONS rather than the bpi version.
This notebook currently lives in the troubleshooting repo, but should be added to the joins section in the book.
The logistic regression page makes use of the cleaned rescue dataset rather than the usual rescue dataset specified in the config file. This dataset should be uploaded to the file system and the path added to config.yaml so that the data can be read in using the same syntax as the rest of the book.
We should move the section "Partitions when writing data" into a separate article on writing data.
This notebook talks about when it's appropriate to use loops and how to replace them using groupBy() or window functions.
It's a good opportunity to talk about why loops may be inefficient at times and alternatives to using them.
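As a minimal sketch of the contrast (data and column names are made up, not taken from the notebook):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Cat", 1), ("Cat", 2), ("Dog", 3)],
        ["animal_group", "count"],
    )

    # Instead of looping over groups and filtering each one in turn,
    # a single groupBy() aggregates every group in one pass...
    df.groupBy("animal_group").agg(F.sum("count").alias("total")).show()

    # ...and a window function attaches the aggregate to every row
    w = Window.partitionBy("animal_group")
    df.withColumn("group_total", F.sum("count").over(w)).show()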
The age_diff_flag notebook is located in the troubleshooting repo here.
I think this would belong in the Analysis in Spark section (which may or may not exist yet - there's an issue to create this section here), but if you can think of anywhere else that's appropriate, feel free to place it there.
May need to think about what other content we can include alongside it if necessary.
This notebook talks about date interval functions in PySpark and should be added into the book.
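For example, a few of the interval functions it could demonstrate (the dates here are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2023-01-01", "2023-03-15")], ["start", "end"])
    df = df.select(F.to_date("start").alias("start"), F.to_date("end").alias("end"))

    df.select(
        F.datediff("end", "start").alias("days_between"),
        F.months_between("end", "start").alias("months_between"),
        F.date_add("start", 7).alias("one_week_later"),
    ).show()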
Mirror the front page of the Duck Book more closely.
Chris and I spent some time trying to develop some sparklyr code to read/write Avro files; however, we weren't able to get it to work.
I believe it's because of version differences - it was asking us to specify the package and version in the Spark config, as in the following code:
spark_connect(..., packages = c("org.apache.spark:spark-avro_2.11:2.4.0"))
However, we were getting errors saying it could not load sparkavro dependencies - I think this is because the version of sparkavro we were using was incompatible with the sparklyr version we were using in DevTest.
It MAY work outside of DevTest with a later Spark version, but then we wouldn't be able to build the book, since some of the chapter content uses Hive tables, which we'd need to run in DevTest.
The Cramer's V notebook is located in the troubleshooting repo here.
I think this would belong in the Analysis in Spark section (which may or may not exist yet - there's an issue to create this section here) but feel free to place it elsewhere if you can think of somewhere else that's more appropriate.
Need to add R code for checkpoints and add a staging tables example (plus R code).
The bin_continuous_variables notebook currently lives in the troubleshooting repo, but could be moved into the book in an "Analysis in Spark" section (mentioned in #85)