
ons-spark's People

Contributors

ad-prince, alexsnowdon, antonzogk, chrissoderberg-ons, emercado4, jday7879, jkhall06, nathankelly-ons, robertswh, sam1mitchell1-hub


ons-spark's Issues

R cells need Python cells in notebook converter

The notebook converter needs a Python cell to immediately precede an R cell, whereas Python cells can stand alone. This should be changed so that R cells can be independent of Python cells if needed.

Tables running off pages

Pandas outputs generated in some notebooks (e.g. Introduction to PySpark) run off the end of the page.
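One possible mitigation, sketched below under the assumption that the overflow comes from pandas' default display settings: cap the display width and column count before the notebooks are rendered, so wide outputs wrap or truncate instead of overflowing the page.

```python
import pandas as pd

# Limit how wide a printed DataFrame can be, so output wraps/truncates
# rather than running off the page when the notebook is rendered
pd.set_option("display.max_columns", 8)
pd.set_option("display.width", 80)

# A deliberately wide frame: with the options above, columns beyond the
# limit are elided with "..." in the printed output
wide = pd.DataFrame({f"col_{i}": [1, 2] for i in range(12)})
print(wide)
```

These options would need to be set in each affected notebook (or a shared setup cell) before any output is generated.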

Issue on page /sparklyr-intro/sparklyr-intro.html

In the section 'Write to a CSV: spark_write_csv() and sdf_coalesce()', it is not clear that the first box is the incorrect answer and that you need to move on to the second. I think this needs to be more obvious, as someone just scanning for answers can easily miss the distinction.

Make Spark Sessions general

The Spark Sessions article references CDSW etc. We should make this more general, especially as it is the first article listed.

Reading and Writing data

There is some information on reading and writing data within the Introduction to PySpark, but there are also separate sections in the book for reading and writing data. The Intro to PySpark section is a nice walkthrough-type chapter, but it would be good if the Reading and Writing sections contained example code for different data formats (parquet, CSV, Hive, etc.). That way, someone who wanted to find the information fast could do so.

Show all output in outputs.csv

All Python output is captured in this file, but only direct R outputs are; for example, print() output from R cells will not be displayed.

CSV used rather than parquet in Optimising Joins

The Optimising Joins article uses a CSV as source data rather than parquet (this was done during migration to the book, to preserve the older screenshots). It should be rewritten to use parquet, with new Spark UI screenshots.

Versioning not appearing in menu bar

The versioning is present in the repo but doesn't seem to show up at the bottom of the menu bar on the left of the published book. The versioning was taken from the duck book, and the duck book displays the version as required.

I have built the book locally and the versioning appears as it should. So something in the deployment might be the source of the issue.

Metrics

Add functionality to gather usage metrics, e.g. how many clicks each page gets.

Filepaths in config only work when running code in DevTest

Currently, the filepaths in the config only work when you're running the code within DevTest. If you try to run a chapter that uses the config locally, it will error because the file cannot be found. It would be good to come up with some sort of solution for this.
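One possible approach, sketched with hypothetical config keys: store a path set per environment in the config and select whichever set actually exists at runtime, so the same chapter code runs both in DevTest and locally.

```python
import os

def resolve_paths(config):
    """Return the path set for the first environment whose files exist.

    `config` is assumed to hold one dict of paths per environment, e.g.
    {"devtest": {...}, "local": {...}} - these key names are hypothetical.
    """
    for env in ("devtest", "local"):
        paths = config.get(env, {})
        if paths and all(os.path.exists(p) for p in paths.values()):
            return paths
    raise FileNotFoundError("No environment in the config has readable paths")
```

Chapters would then call `resolve_paths()` on the loaded config instead of hard-coding the DevTest paths.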

Create regex chapter for the book

I think it would be beneficial to have a chapter on regular expressions in the book.

This could include things like rlike, regexp_replace, regexp_extract, etc.

It would be good to have a section on the computational complexity of regex versus other methods, and on how to optimise regex (as this came up a lot in the GLADIS work).

Add in "Returning an exact sample using stratified sampling"

We have been given some code which can be used to return an exact number of samples per strata by modifying the existing "return an exact sample" section.
This might take some time, as the code needs modifying and the method generalising to any number of strata.

Add in chapter on null/nan/None in Spark

I think it's worthwhile to create a chapter around how null/nan/None values are handled in Spark, the differences between them and any oddities to be aware of.

This notebook, from the troubleshooting repo, could be a good one to include.

Optimisation of storage

Comments from DF on optimisation ideas.

Consider using ORC for space efficiency and long-term storage?
Also consider storing Hive data as:

  • partitioned, if the partition is used for subsequent transformation expressions (e.g. filtering on the partition key)
  • bucketed, if the bucketing column is used for subsequent joins

Spark dependencies

What do we need to put into the requirements.txt file to make the PySpark and sparklyr code run? e.g. pyspark>=2.4.0, sparklyr>=1.7.4.
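A minimal sketch for the Python side, using the versions mentioned above. Note that sparklyr is an R package, so its version pin would live in the R environment setup (e.g. a DESCRIPTION or renv lockfile) rather than requirements.txt:

```
# Python dependencies for the PySpark examples (version pin from this issue)
pyspark>=2.4.0
```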

Views

Put this into the section on HDFS in the book.

Intro section in checkpoint/staging table

Short one: should the Persisting to disk section go before the Checkpointing heading here? Perhaps the last paragraph of that section could become the first paragraph of the Checkpoint section.

Add cleaned rescue data to config.yaml

The logistic regression page makes use of the cleaned rescue dataset rather than the usual rescue dataset specified in the config file. This dataset should be uploaded to the file system and the path added to config.yaml so that the data can be read in using the same syntax as the rest of the book.
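A hedged sketch of what the new entry might look like; the key name and filesystem path below are hypothetical and should follow the pattern of the existing rescue entries in config.yaml:

```yaml
# Hypothetical entry - mirror the naming and location of the existing rescue key
rescue_clean_path: "/training_area/animal_rescue_clean.parquet"
```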

Make partitions shorter

We should move the section "Partitions when writing data" into a separate article on writing data.

Add in age_diff_flag_notebook to the book

The age_diff_flag notebook is located in the troubleshooting repo here.

I think this would belong in the Analysis in Spark section (which may or may not exist yet - there's an issue to create this section here), but if you can think of anywhere else that's appropriate, feel free to place it there.

May need to think about what other content we can include alongside it if necessary.

Front page

Mirror front page of duck book more closely

Add in sparklyr code for Avro files in the Reading and Writing Data in Spark chapter

Chris and I spent some time trying to develop some sparklyr code to read/write Avro files; however, we weren't able to get it to work.

I believe it's because of version differences - it was asking us to specify the package and version in the Spark config, as in the following code:

spark_connect(..., packages = c("org.apache.spark:spark-avro_2.11:2.4.0"))

However, we were getting errors saying it could not load the sparkavro dependencies - I think this is because the version of sparkavro we were using was incompatible with the sparklyr version we were using in DevTest.

It may work outside of DevTest with a later Spark version, but then we wouldn't be able to build the book, since some of the chapter content uses Hive tables, which we need to run in DevTest.

Add in Cramer's V notebook to the book

The Cramer's V notebook is located in the troubleshooting repo here.

I think this would belong in the Analysis in Spark section (which may or may not exist yet - there's an issue to create this section here) but feel free to place it elsewhere if you can think of somewhere else that's more appropriate.
