
learningsparkv2's Introduction

Learning Spark 2nd Edition

Welcome to the GitHub repo for Learning Spark 2nd Edition.

Chapters 2, 3, 6, and 7 contain stand-alone Spark applications. You can build the JAR files for all of these chapters by running the Python script: python build_jars.py. Or you can cd to a chapter directory and build its jars as specified in its README. Also, include $SPARK_HOME/bin in $PATH so that you don't have to prefix spark-submit with $SPARK_HOME/bin/ when running these stand-alone applications.

For all the other chapters, we have provided notebooks in the notebooks folder. We have also included notebook equivalents for a few of the stand-alone Spark applications in the aforementioned chapters.

Have Fun, Cheers!

learningsparkv2's People

Contributors

brookewenig, dennyglee, dmatrix

learningsparkv2's Issues

missing source file

I am trying to follow the Chapter 6 Scala notebook (6-2 Dataset API), and I couldn't find the file at this path (mnt/training/dataframes/people-with-header-10m.txt) anywhere in the GitHub repo to use in my environment.

Other datasets are available under databricks-datasets/learning-spark-v2; however, this one isn't there.

Example 3_7 - Scala Issue

Unable to get the Scala code to read blogs.json using the schema definition provided in the book.

[Code]

package main.scala.chapter3

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, expr}

object Example3_7 {
  def main(args: Array[String]) {

    val spark = SparkSession
      .builder
      .appName("Example-3_7")
      .getOrCreate()

    if (args.length <= 0) {
      println("usage Example3_7 <file path to blogs.json>")
      System.exit(1)
    }
    // get the path to the JSON file
    val jsonFile = args(0)
    // define our schema as before
    val schema = StructType(Array(
      StructField("Campaigns", ArrayType(StringType), true),
      StructField("First", StringType, true),
      StructField("Hits", LongType, true),
      StructField("Id", LongType, true),
      StructField("Last", StringType, true),
      StructField("Published", StringType, true),
      StructField("Url", StringType, true)
    ))

    // create a DataFrame by reading from the JSON file with the predefined schema
    val blogsDF = spark.read.schema(schema).json(jsonFile)
    // show the DataFrame contents
    blogsDF.show(false)
    // print the schema
    blogsDF.printSchema
    println(blogsDF.schema)
  }
}

[End Code]

I tried both LongType and IntegerType for Hits and Id, but both generated the following error on my machine:

scala> kmontano18@DESKTOP-PRKRT1A:~$ spark-shell -i scalaSchema.scala blogs.json
blogs.json:1: error: identifier expected but integer literal found.
{"Id":1, "First": "Jules", "Last":"Damji", "Url":"https://tinyurl.1", "Published":"1/4/2016", "Hits": 4535, "Campaigns": ["twitter", "LinkedIn"]}

However, when reading the JSON via Spark's schema inference (without an explicit schema), it generated the expected schema, albeit with the fields in a different order:

scala> val df = spark.read.json("blogs.json")
df: org.apache.spark.sql.DataFrame = [Campaigns: array<string>, First: string ... 5 more fields]

scala> df.printSchema()
root
|-- Campaigns: array (nullable = true)
| |-- element: string (containsNull = true)
|-- First: string (nullable = true)
|-- Hits: long (nullable = true)
|-- Id: long (nullable = true)
|-- Last: string (nullable = true)
|-- Published: string (nullable = true)
|-- Url: string (nullable = true)
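
For reference, the pasted error suggests spark-shell ended up compiling blogs.json itself as Scala source. One simpler way to exercise the explicit schema is to paste a snippet like the following straight into spark-shell; this is only a sketch, and the path to blogs.json is illustrative:

import org.apache.spark.sql.types._

// schema matching the blogs.json records shown above
val schema = StructType(Array(
  StructField("Id", LongType, true),
  StructField("First", StringType, true),
  StructField("Last", StringType, true),
  StructField("Url", StringType, true),
  StructField("Published", StringType, true),
  StructField("Hits", LongType, true),
  StructField("Campaigns", ArrayType(StringType), true)))

// read with the explicit schema and inspect the result
val blogsDF = spark.read.schema(schema).json("blogs.json")
blogsDF.show(false)
blogsDF.printSchema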


Fig 10-4 use of shapes?

The cluster example in Fig 10-4 is a bit difficult to follow in black-and-white print. Could different shapes be used to plot the clusters instead of colors?


Table 10-2 error?

Am I right in thinking that 'Deprecated' and 'N/A' should switch places in Table 10-2?


cloning the repo fails

Hello, I am learning Spark through your book, but when I try to clone the repo I get this error. Any help?

Cloning into 'D:\LearningSparkV2'...
remote: Enumerating objects: 1712, done.
remote: Counting objects: 100% (135/135), done.
remote: Compressing objects: 100% (92/92), done.
remote: Total 1712 (delta 38), reused 85 (delta 23), pack-reused 1577
Receiving objects: 100% (1712/1712), 76.76 MiB | 2.72 MiB/s, done.
Resolving deltas: 100% (525/525), done.
error: invalid path 'databricks-datasets/learning-spark-v2/flights/summary-data/avro/*/_SUCCESS'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

Chapter 5 - Joins - Incorrect Python code

On page 148, the Python code explaining joins is incorrect. It reads:

# In Python
# Join departure delays data (foo) with airport info
foo.join(
  airports,
  airports.IATA == foo.origin
).select("City", "State", "date", "delay", "distance", "destination").show()

However, the airports DataFrame does not exist. It should be changed to airportsna, as follows:

# In Python
# Join departure delays data (foo) with airport info
foo.join(
  airportsna,
  airportsna.IATA == foo.origin
).select("City", "State", "date", "delay", "distance", "destination").show()

Then the code works.

Cloning the repository fails with an error, and all files are marked as DELETED.

Doing a git clone for this repository fails with this error:

Cloning into 'LearningSparkV2'...
remote: Enumerating objects: 126, done.
remote: Counting objects: 100% (126/126), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 1703 (delta 35), reused 76 (delta 19), pack-reused 1577
Receiving objects: 100% (1703/1703), 76.00 MiB | 16.12 MiB/s, done.
Resolving deltas: 100% (522/522), done.
error: invalid path 'databricks-datasets/learning-spark-v2/flights/summary-data/avro/*/_SUCCESS'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

Operating System: Windows 10
Git version : git version 2.26.2.windows.1

Failed to load main class

I am getting an error and didn't find a solution.
I use IntelliJ, sbt 1.4.7, Scala 2.12.10, and Spark 3.0.

I couldn't submit any job locally. An example class I've worked on:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object Aggregate extends App {

  val spark = SparkSession
    .builder()
    .appName("AuthorAges")
    .getOrCreate()

  val dataDF = spark.createDataFrame(Seq(("Brooke", 20), ("Brooke", 25),
    ("Denny", 31), ("Jules", 30), ("TD", 35))).toDF("name", "age")

  val avgDF = dataDF.groupBy("name").agg(avg("age"))
  avgDF.show()
}

I use the command: $SPARK_HOME/bin/spark-submit --class main.scala.Aggregate /home/ubuntu/IdeaProjects/Deneme/target/Deneme-1.0-SNAPSHOT.jar

I have tried creating the project with both sbt and Maven, but I get the same error in both cases.

My build.sbt file:

//name of the package
name := "main/scala"
//version of our package
version := "1.0"
//version of Scala
scalaVersion := "2.12.10"
// spark library dependencies
// change this to 3.0.0 when released
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0",
  "org.apache.spark" %% "spark-sql"  % "3.0.0"
)

The Error
21/02/05 15:48:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: Failed to load class main.scala.Aggregate.
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Can anyone help me?
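
For what it's worth, a common cause of this message is a mismatch between the --class argument and the fully qualified name of the object. The sketch below is one hedged guess at a fix: it assumes the object should live in a main.scala package so that --class main.scala.Aggregate can resolve it (alternatively, drop the package declaration and submit with --class Aggregate).

// A minimal sketch, assuming the error comes from a missing package declaration:
// spark-submit is asked for main.scala.Aggregate, so the object must actually
// be declared inside package main.scala.
package main.scala

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object Aggregate extends App {

  val spark = SparkSession
    .builder()
    .appName("AuthorAges")
    .getOrCreate()

  // sample (name, age) rows
  val dataDF = spark.createDataFrame(Seq(("Brooke", 20), ("Brooke", 25),
    ("Denny", 31), ("Jules", 30), ("TD", 35))).toDF("name", "age")

  // average age per name
  dataDF.groupBy("name").agg(avg("age")).show()

  spark.stop()
}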

Misleading/Wrong usage of count

Hello,

in the "Counting M&M's example in chapter 2, the count aggregate function is used to "count" the M&M's per State and Color. It is unclear what exactly is being counted here. The input CSV has the following format:

State,Color,Count
TX,Red,20
NV,Blue,66
CO,Blue,79
OR,Blue,71
WA,Yellow,93

When the count function is used, the values in the "Count" column are completely ignored, and count simply counts the rows in each group. Given the format of the CSV file (which already has a "Count" column), the intention seems to be to get the total of the values in that column per State and Color. To get the "Total" of "Count", you would have to use the sum aggregate function, for example as in the sketch below.
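
A minimal sketch of that change, in spark-shell style, assuming the CSV layout shown above (the file path is illustrative):

import org.apache.spark.sql.functions.{desc, sum}

// read the CSV shown above, inferring column types
val mnmDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/mnm_dataset.csv")

// sum the Count column per State and Color instead of counting rows
val totalsDF = mnmDF
  .groupBy("State", "Color")
  .agg(sum("Count").alias("Total"))
  .orderBy(desc("Total"))

totalsDF.show()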

Chapter 5 Spark SQL Shell + accessing person table through spark session

Hi, I'm new to Spark and really appreciate and like this book so far.

I'm currently doing Chapter 5, and I've created the "person" table through the spark-sql shell on the command line.

I've killed my spark-sql shell, and I can see that the "person" table is persisted across spark-sql shell sessions: when I start a new spark-sql shell, the person table is still there.

I restarted my Spark master and spark-sql shell session as well, and the person table still exists.

One thing I think this chapter (and the book) is lacking: it doesn't mention how I can interact with and insert more rows into this table outside of the spark-sql shell, for example from a Scala spark-submit process.

I tried the code below, but the "person" table is not found.

I would appreciate it if someone could advise; I don't see the link/connection between the person table in Hive and how I can use it from other Spark sessions.


import java.io.File
import java.time.Instant

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object PeopleQuery {

  def main(args: Array[String]) {
    val start = Instant.now

    Logger.getLogger("org").setLevel(Level.INFO)

    val warehouseLocation = new File("spark-warehouse").getAbsolutePath

    // Create a SparkSession against the standalone master, with Hive support enabled
    val spark = SparkSession
      .builder
      .master("spark://localhost:7077")
      .appName("PeopleQuery")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    val sqlDF = spark.sql("select * from people;")
    sqlDF.show()
  }
}
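
For context, a minimal sketch of the connection being asked about, under the assumption that the spark-sql shell was using the default embedded Derby metastore (the metastore_db and spark-warehouse directories created where the shell was started): a separate application generally sees the same table only if it enables Hive support and points at that same metastore/warehouse location. The table name and inserted values below are illustrative.

import org.apache.spark.sql.SparkSession

object InsertPeople {
  def main(args: Array[String]): Unit = {
    // Run from (or configure towards) the same metastore/warehouse location
    // that the spark-sql shell used, so both share one catalog.
    val spark = SparkSession
      .builder
      .appName("InsertPeople")
      .enableHiveSupport()
      .getOrCreate()

    // insert an extra row into the table created in the spark-sql shell
    spark.sql("INSERT INTO people VALUES ('Ella', 42)")

    // read it back through the same catalog
    spark.table("people").show()

    spark.stop()
  }
}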
