
learningsparkv2's Introduction

Learning Spark 2nd Edition

Welcome to the GitHub repo for Learning Spark 2nd Edition.

Chapters 2, 3, 6, and 7 contain stand-alone Spark applications. You can build the JAR files for all of these chapters by running the Python script: python build_jars.py. Or you can cd to a chapter directory and build its jars as specified in its README. Also, include $SPARK_HOME/bin in $PATH so that you don't have to prefix spark-submit with $SPARK_HOME/bin/ when running these stand-alone applications.

For all the other chapters, we have provided notebooks in the notebooks folder. We have also included notebook equivalents for a few of the stand-alone Spark applications in the aforementioned chapters.

Have Fun, Cheers!

learningsparkv2's People

Contributors

brookewenig, dennyglee, dmatrix

learningsparkv2's Issues

missing source file

I am trying to follow the Chapter 6 Scala notebook (6-2 Dataset API), and I couldn't find the file at this path (mnt/training/dataframes/people-with-header-10m.txt) anywhere in the GitHub repo to use in my environment.

Other datasets are available under databricks-datasets/learning-spark-v2; however, this one isn't there.

Example 3_7 - Scala Issue

Unable to get the Scala code to read blogs.json using the schema definition provided in the book.

[Code]

package main.scala.chapter3

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, expr}

object Example3_7 {
  def main(args: Array[String]) {

    val spark = SparkSession
      .builder
      .appName("Example-3_7")
      .getOrCreate()

    if (args.length <= 0) {
      println("usage Example3_7 <file path to blogs.json>")
      System.exit(1)
    }
    // get the path to the JSON file
    val jsonFile = args(0)
    // define our schema as before
    val schema = StructType(Array(
      StructField("Campaigns", ArrayType(StringType), true),
      StructField("First", StringType, true),
      StructField("Hits", LongType, true),
      StructField("Id", LongType, true),
      StructField("Last", StringType, true),
      StructField("Published", StringType, true),
      StructField("Url", StringType, true)
    ))

    // create a DataFrame by reading from the JSON file with the predefined schema
    val blogsDF = spark.read.schema(schema).json(jsonFile)
    // show the DataFrame contents
    blogsDF.show(false)
    // print the schema
    blogsDF.printSchema
    println(blogsDF.schema)
  }
}

[End Code]

I tried both LongType and IntegerType for Hits and Id, but both generated the following error on my machine:

scala> kmontano18@DESKTOP-PRKRT1A:~$ spark-shell -i scalaSchema.scala blogs.json
blogs.json:1: error: identifier expected but integer literal found.
{"Id":1, "First": "Jules", "Last":"Damji", "Url":"https://tinyurl.1", "Published":"1/4/2016", "Hits": 4535, "Campaigns": ["twitter", "LinkedIn"]}

However, when reading the JSON via Spark's schema inference (without an explicit schema), it generated the expected schema, albeit with the fields in a different order:

scala> val df = spark.read.json("blogs.json")
df: org.apache.spark.sql.DataFrame = [Campaigns: array<string>, First: string ... 5 more fields]

scala> df.printSchema()
root
|-- Campaigns: array (nullable = true)
| |-- element: string (containsNull = true)
|-- First: string (nullable = true)
|-- Hits: long (nullable = true)
|-- Id: long (nullable = true)
|-- Last: string (nullable = true)
|-- Published: string (nullable = true)
|-- Url: string (nullable = true)
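
For reference, the pasted error suggests spark-shell ended up compiling blogs.json itself as Scala source. One simpler way to exercise the explicit schema is to paste a snippet like the following straight into spark-shell; this is only a sketch, and the path to blogs.json is illustrative:

import org.apache.spark.sql.types._

// schema matching the blogs.json records shown above
val schema = StructType(Array(
  StructField("Id", LongType, true),
  StructField("First", StringType, true),
  StructField("Last", StringType, true),
  StructField("Url", StringType, true),
  StructField("Published", StringType, true),
  StructField("Hits", LongType, true),
  StructField("Campaigns", ArrayType(StringType), true)))

// read with the explicit schema and inspect the result
val blogsDF = spark.read.schema(schema).json("blogs.json")
blogsDF.show(false)
blogsDF.printSchema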


Fig 10-4 use of shapes?

The cluster example in Fig 10-4 is a bit difficult to follow in black-and-white print. Could different shapes be used to plot the clusters instead of colors?


Table 10-2 error?

Am I right in thinking that 'Deprecated' and 'N/A' should switch places in Table 10-2?


cloning the repo fails

Hello, I am learning Spark through your book, but when I try to clone the repo I get this error. Any help?

Cloning into 'D:\LearningSparkV2'...
remote: Enumerating objects: 1712, done.
remote: Counting objects: 100% (135/135), done.
remote: Compressing objects: 100% (92/92), done.
remote: Total 1712 (delta 38), reused 85 (delta 23), pack-reused 1577
Receiving objects: 100% (1712/1712), 76.76 MiB | 2.72 MiB/s, done.
Resolving deltas: 100% (525/525), done.
error: invalid path 'databricks-datasets/learning-spark-v2/flights/summary-data/avro/*/_SUCCESS'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

Chapter 5 - Joins - Incorrect Python code

On page 148, the Python code explaining joins is incorrect. It reads:

# In Python
# Join departure delays data (foo) with airport info
foo.join(
  airports,
  airports.IATA == foo.origin
).select("City", "State", "date", "delay", "distance", "destination").show()

However, the airports DataFrame does not exist. It should be changed to airportsna, as follows:

# In Python
# Join departure delays data (foo) with airport info
foo.join(
  airportsna,
  airportsna.IATA == foo.origin
).select("City", "State", "date", "delay", "distance", "destination").show()

Then the code works.

Cloning the repository fails with an error, and all files are marked as DELETED.

Doing a git clone for this repository fails with this error:

Cloning into 'LearningSparkV2'...
remote: Enumerating objects: 126, done.
remote: Counting objects: 100% (126/126), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 1703 (delta 35), reused 76 (delta 19), pack-reused 1577
Receiving objects: 100% (1703/1703), 76.00 MiB | 16.12 MiB/s, done.
Resolving deltas: 100% (522/522), done.
error: invalid path 'databricks-datasets/learning-spark-v2/flights/summary-data/avro/*/_SUCCESS'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

Operating System: Windows 10
Git version : git version 2.26.2.windows.1

Failed to load main class

I am getting an error and didn't find a solution.
I use IntelliJ, sbt 1.4.7, Scala 2.12.10, and Spark 3.0.

I couldn't submit any job locally. An example class I've worked on:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object Aggregate extends App {

  val spark = SparkSession
    .builder()
    .appName("AuthorAges")
    .getOrCreate()

  val dataDF = spark.createDataFrame(Seq(("Brooke", 20), ("Brooke", 25),
    ("Denny", 31), ("Jules", 30), ("TD", 35))).toDF("name", "age")

  val avgDF = dataDF.groupBy("name").agg(avg("age"))
  avgDF.show()
}

I use the command: $SPARK_HOME/bin/spark-submit --class main.scala.Aggregate /home/ubuntu/IdeaProjects/Deneme/target/Deneme-1.0-SNAPSHOT.jar

I have tried creating the project with both sbt and Maven, but I get the same error in both cases.

My build.sbt file:

//name of the package
name := "main/scala"
//version of our package
version := "1.0"
//version of Scala
scalaVersion := "2.12.10"
// spark library dependencies
// change this to 3.0.0 when released
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0",
  "org.apache.spark" %% "spark-sql"  % "3.0.0"
)

The Error
21/02/05 15:48:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: Failed to load class main.scala.Aggregate.
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Can anyone help me?
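
For what it's worth, a common cause of this message is a mismatch between the --class argument and the fully qualified name of the object. The sketch below is one hedged guess at a fix: it assumes the object should live in a main.scala package so that --class main.scala.Aggregate can resolve it (alternatively, drop the package declaration and submit with --class Aggregate).

// A minimal sketch, assuming the error comes from a missing package declaration:
// spark-submit is asked for main.scala.Aggregate, so the object must actually
// be declared inside package main.scala.
package main.scala

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object Aggregate extends App {

  val spark = SparkSession
    .builder()
    .appName("AuthorAges")
    .getOrCreate()

  // sample (name, age) rows
  val dataDF = spark.createDataFrame(Seq(("Brooke", 20), ("Brooke", 25),
    ("Denny", 31), ("Jules", 30), ("TD", 35))).toDF("name", "age")

  // average age per name
  dataDF.groupBy("name").agg(avg("age")).show()

  spark.stop()
}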

Misleading/Wrong usage of count

Hello,

in the "Counting M&M's example in chapter 2, the count aggregate function is used to "count" the M&M's per State and Color. It is unclear what exactly is being counted here. The input CSV has the following format:

State,Color,Count
TX,Red,20
NV,Blue,66
CO,Blue,79
OR,Blue,71
WA,Yellow,93

When the count function is used, the values in the "Count" column are completely ignored, and count simply counts the rows in each group. Given the format of the CSV file (which already has a "Count" column), the intention seems to be to get the total of the values in that column per State and Color. To get the "Total" of "Count", you would have to use the sum aggregate function, for example as in the sketch below.
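
A minimal sketch of that change, in spark-shell style, assuming the CSV layout shown above (the file path is illustrative):

import org.apache.spark.sql.functions.{desc, sum}

// read the CSV shown above, inferring column types
val mnmDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/mnm_dataset.csv")

// sum the Count column per State and Color instead of counting rows
val totalsDF = mnmDF
  .groupBy("State", "Color")
  .agg(sum("Count").alias("Total"))
  .orderBy(desc("Total"))

totalsDF.show()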

Chapter 5 Spark SQL Shell + accessing person table through spark session

Hi, I'm new to Spark and really appreciate and like this book so far.

I'm currently doing Chapter 5, and I've created the "person" table through the spark-sql shell on the command line.

I've killed my spark-sql shell, and I can see that the "person" table is persisted across spark-sql shell sessions: when I start a new spark-sql shell, the person table is still there.

I restarted my Spark master and spark-sql shell session as well, and the person table still exists.

One thing I think this chapter (and the book) is lacking: it doesn't mention how I can interact with and insert more rows into this table outside of the spark-sql shell, for example from a Scala spark-submit process.

I tried the code below, but the "person" table is not found.

I would appreciate it if someone could advise; I don't see the link/connection between the person table in Hive and how I can use it from other Spark sessions.


import java.io.File
import java.time.Instant

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object PeopleQuery {

  def main(args: Array[String]) {
    val start = Instant.now

    Logger.getLogger("org").setLevel(Level.INFO)

    val warehouseLocation = new File("spark-warehouse").getAbsolutePath

    // Create a SparkSession against the standalone master, with Hive support enabled
    val spark = SparkSession
      .builder
      .master("spark://localhost:7077")
      .appName("PeopleQuery")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    val sqlDF = spark.sql("select * from people;")
    sqlDF.show()
  }
}
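
For context, a minimal sketch of the connection being asked about, under the assumption that the spark-sql shell was using the default embedded Derby metastore (the metastore_db and spark-warehouse directories created where the shell was started): a separate application generally sees the same table only if it enables Hive support and points at that same metastore/warehouse location. The table name and inserted values below are illustrative.

import org.apache.spark.sql.SparkSession

object InsertPeople {
  def main(args: Array[String]): Unit = {
    // Run from (or configure towards) the same metastore/warehouse location
    // that the spark-sql shell used, so both share one catalog.
    val spark = SparkSession
      .builder
      .appName("InsertPeople")
      .enableHiveSupport()
      .getOrCreate()

    // insert an extra row into the table created in the spark-sql shell
    spark.sql("INSERT INTO people VALUES ('Ella', 42)")

    // read it back through the same catalog
    spark.table("people").show()

    spark.stop()
  }
}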
