

Spark using Java


Spark Architecture

  • Uses a master-slave architecture
  • In a typical Spark cluster, there is one master node (the driver) and a number of worker nodes
  • Each executor runs in its own Java process
  • The driver program is the process where the main method runs
  • The driver program converts the user program into tasks and schedules those tasks on worker nodes
  • Each execution of the driver program is called a job
  • So, the driver program is responsible for receiving a job and converting that job into tasks
  • Executors are initialized when the application is launched
  • When the worker nodes run their tasks, they send the results back to the driver program
  • The driver program connects to the Spark cluster using a SparkContext and applies RDD operations on the cluster
  • The driver program controls all the operations on RDDs in a Spark job
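A minimal sketch of the driver's role (the class name, input data, and the local[*] master URL are illustrative assumptions, not part of the original notes): the driver creates a SparkContext, expresses RDD operations, and the action at the end triggers tasks on the executors.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import java.util.Arrays;

    public class DriverExample {
        public static void main(String[] args) {
            // local[*] runs driver and executors in one JVM for testing;
            // on a real cluster the master URL points to the cluster manager
            SparkConf conf = new SparkConf().setAppName("DriverExample").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // The driver converts these RDD operations into tasks and schedules them on executors
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            long evenCount = numbers.filter(n -> n % 2 == 0).count();
            System.out.println("Even numbers: " + evenCount);

            sc.close();
        }
    }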

Spark Components

  • The Spark project contains multiple integrated components; let's look at some of them

  • Spark SQL

  • A Spark package designed for working with structured data, built on top of Spark Core

  • Provides an SQL-like interface for working with structured data (a short sketch follows after this list)

  • Internally, a single query may trigger several RDD operations on Spark

  • Spark Streaming

  • Running on top of Spark, Spark Streaming provides an API for manipulating data streams that closely matches Spark Core's RDD API

  • Enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark's ease of use and fault-tolerance characteristics

  • Spark Streaming can integrate with a wide variety of data sources, including Twitter, Flume, Kafka, HDFS, etc.

  • MLlib

  • Usable in Java, Scala, and Python as part of Spark applications

  • Built on top of Spark, MLlib is a scalable machine learning library that delivers both high-quality algorithms and blazing speed

  • Consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, etc.

  • Who uses Apache Spark, and how?

  • Data scientists

  • Data processing application engineers
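The Spark SQL sketch referenced above: a minimal, assumed example (the people.json input file and local mode are illustrative) showing the SQL-like interface; the query is translated internally into operations on Spark's execution engine.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSqlExample")
                    .master("local[*]")               // assumption: local mode for testing
                    .getOrCreate();

            // Load structured data and query it with an SQL-like interface
            Dataset<Row> people = spark.read().json("people.json");   // hypothetical input file
            people.createOrReplaceTempView("people");
            Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
            adults.show();

            spark.stop();
        }
    }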

Introduction to Pair RDDs

Pair RDD

  • Many of the datasets we see in real-life examples are key-value pairs.
  • Examples:

A dataset that contains passport IDs and the names of the passport holders; a dataset that contains course names and the list of students enrolled in each course

  • The typical pattern is a key mapping to one value or to multiple values
  • Spark provides a data structure called a pair RDD, instead of regular RDDs, which makes working with this kind of data simpler and more efficient
  • A pair RDD is a particular type of RDD that stores key-value pairs
  • Pair RDDs are useful building blocks in many Spark programs

How to create Pair RDDs

  • Two popular ways to create pair RDDs:
    1. Return a pair RDD from a list of key-value data structures called tuples
    2. Turn a regular RDD into a pair RDD
  • Java doesn't have a built-in tuple type, so Spark's Java API allows users to create tuples using the scala.Tuple2 class
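A minimal sketch of both approaches (names and data are illustrative), using scala.Tuple2 with parallelizePairs and mapToPair:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    import java.util.Arrays;
    import java.util.List;

    public class PairRddCreation {
        public static void createPairRdds(JavaSparkContext sc) {
            // 1) Build a pair RDD directly from a list of tuples
            List<Tuple2<String, Integer>> tuples = Arrays.asList(
                    new Tuple2<>("Lily", 23),
                    new Tuple2<>("Jack", 29));
            JavaPairRDD<String, Integer> pairRdd = sc.parallelizePairs(tuples);

            // 2) Turn a regular RDD into a pair RDD, keying each line by its first word
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("Lily 23", "Jack 29"));
            JavaPairRDD<String, String> keyedByFirstWord =
                    lines.mapToPair(line -> new Tuple2<>(line.split(" ")[0], line));
        }
    }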

Transformations on Pair RDDs

  • Pair RDDs are allowed to use all the transformations available to regular RDDs, and thus support the same functions as regular RDDs
  • Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements
  • When our dataset is described in the format of key-value pairs, it is quite common that we would like to aggregate statistics across all elements with the same key
  • We have looked at the reduce action on regular RDDs; there is a similar operation for pair RDDs, called reduceByKey (a short sketch follows after this list)
  • reduceByKey runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key
  • Considering that input datasets could have a huge number of keys, the reduceByKey operation is not implemented as an action that returns a value to the driver program. Instead, it returns a new RDD consisting of each key and the reduced value for that key
  • groupByKey A common use case for pair RDDs is grouping our data by key, for example, viewing all of an account's transactions together
  • If our data is already keyed in the way we want, groupByKey will group our data using the key in our pair RDD
  • Let's say the input pair RDD has keys of type K and values of type V; if we call groupByKey on the RDD, we get back a pair RDD with keys of type K and values of type Iterable<V>
  • public JavaPairRDD<K, Iterable<V>> groupByKey()
  • reduceByKey is preferred over groupByKey when the data is large
  • sortByKey Transformation
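A minimal word-count-style sketch of these transformations (the input data is made up): reduceByKey combines values per key and returns a new pair RDD, groupByKey collects all values for a key into an Iterable, and sortByKey orders the pairs by key.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    import java.util.Arrays;

    public class PairRddTransformations {
        public static void run(JavaSparkContext sc) {
            JavaPairRDD<String, Integer> wordCounts = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("spark", 1), new Tuple2<>("java", 1), new Tuple2<>("spark", 1)));

            // reduceByKey: sums the counts for each word and returns a new pair RDD
            JavaPairRDD<String, Integer> totals = wordCounts.reduceByKey((a, b) -> a + b);

            // groupByKey: collects all values for each word into an Iterable<Integer>
            JavaPairRDD<String, Iterable<Integer>> grouped = wordCounts.groupByKey();

            // sortByKey: returns a pair RDD sorted by key (ascending by default)
            JavaPairRDD<String, Integer> sorted = totals.sortByKey();
        }
    }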

Data Partitioning

  • With reduceByKey, each worker node processes its own partitions and shuffles only the partially reduced data for the final result, whereas with groupByKey all worker nodes first shuffle the data and the final nodes then compute the results. So using groupByKey can cause a lot of unnecessary data shuffling and can hurt performance on large data
  • Is there any way to reduce the amount of shuffling in groupByKey? Yes, there is.
  • We can use a HashPartitioner to keep data with the same key on the same worker, by partitioning on the key of the pair RDD
  • The partitionBy method on an RDD returns a copy of the pair RDD partitioned using the specified partitioner
  • JavaPairRDD<String, Integer> partitionedWordPairRDD = wordPairRdd.partitionBy(new HashPartitioner(4));
  • No matter how many times you call groupByKey, data with the same key will appear on the same node
  • It is very important to persist the partitioned RDD; without this, every method call will re-evaluate the partitions again
  • partitionedWordPairRDD.persist(StorageLevel.DISK_ONLY());
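A minimal sketch combining the two snippets above (the 4-partition count and method name are illustrative): hash-partition the pair RDD and persist it, so that repeated groupByKey calls reuse the same partitioning instead of recomputing it.

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.storage.StorageLevel;

    public class PartitioningExample {
        public static JavaPairRDD<String, Integer> partitionAndPersist(
                JavaPairRDD<String, Integer> wordPairRdd) {
            // Hash-partition into 4 partitions: all pairs with the same key land in the same partition
            JavaPairRDD<String, Integer> partitioned = wordPairRdd.partitionBy(new HashPartitioner(4));

            // Persist the partitioned RDD; without this, the partitioning is re-computed on every use
            partitioned.persist(StorageLevel.DISK_ONLY());
            return partitioned;
        }
    }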

Which Operations would benefit from Partitioning

  • Many Spark operations involve shuffling data across the network
  • All of the following benefit from partitioning (this is not a complete list of operations): join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, lookup
  • Running reduceByKey on a pre-partitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master

Operations which would be affected by Partitioning

  • Operations like map could cause the new RDD to forget the parent's partitioning information, as such operations could, in theory, change the key of each element in the RDD
  • The general guidance is to prefer mapValues over the map operation
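A minimal sketch of the difference (data and names are illustrative): mapValues cannot change the key, so Spark keeps the parent's partitioner, while a key-changing operation such as mapToPair (the Java pair-RDD analogue of map) may drop it.

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class MapValuesExample {
        public static void run(JavaPairRDD<String, Integer> partitionedCounts) {
            // Keeps the existing partitioner: only the values change
            JavaPairRDD<String, Integer> doubled = partitionedCounts.mapValues(v -> v * 2);

            // May lose the partitioner: the function could return a tuple with a different key
            JavaPairRDD<String, Integer> alsoDoubled =
                    partitionedCounts.mapToPair(t -> new Tuple2<>(t._1, t._2 * 2));
        }
    }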

Join Operations

  • The join operation allows us to join two RDDs together, which is probably one of the most common operations on a pair RDD
  • Join types: join (inner join), leftOuterJoin, rightOuterJoin, fullOuterJoin, etc.
  • If both RDDs have duplicate keys, a join operation can dramatically expand the size of the data. It is recommended to perform a distinct or combineByKey operation to reduce the key space if possible.
  • A join operation may require large network transfers or even create datasets beyond our capability to handle
  • Joins, in general, are expensive since they require that corresponding keys from each RDD are located at the same partition so that they can be combined locally. If the RDDs do not have a known partitioner, they will need to be shuffled so that both RDDs share a partitioner and data with the same key lives in the same partitions
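A minimal sketch of an inner join and a left outer join between two pair RDDs keyed by user id (the data is made up):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.Optional;
    import scala.Tuple2;

    import java.util.Arrays;

    public class JoinExample {
        public static void run(JavaSparkContext sc) {
            JavaPairRDD<Integer, String> names = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, "Lily"), new Tuple2<>(2, "Jack")));
            JavaPairRDD<Integer, String> cities = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, "Pune"), new Tuple2<>(3, "Delhi")));

            // Inner join: keeps only keys present in both RDDs -> (1, ("Lily", "Pune"))
            JavaPairRDD<Integer, Tuple2<String, String>> joined = names.join(cities);

            // Left outer join: keeps all keys from the left RDD; missing right values are empty Optionals
            JavaPairRDD<Integer, Tuple2<String, Optional<String>>> leftJoined = names.leftOuterJoin(cities);
        }
    }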

Accumulators

  • Accumulators are variables that are used for aggregating information across the executors. For example, we can count how many records are corrupted, or count events that occur during job execution for debugging purposes
  • Tasks on worker nodes cannot access an accumulator's value
  • Accumulators are write-only variables from the tasks' point of view
  • This allows accumulators to be implemented efficiently, without having to communicate every update
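A minimal sketch of counting bad records with a LongAccumulator (the input RDD and names are illustrative): workers only add to the accumulator, and its value is read back on the driver after an action has run.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.util.LongAccumulator;

    public class AccumulatorExample {
        public static void run(JavaSparkContext sc, JavaRDD<String> lines) {
            LongAccumulator badRecords = sc.sc().longAccumulator("badRecords");

            JavaRDD<String> valid = lines.filter(line -> {
                if (line.isEmpty()) {
                    badRecords.add(1);   // write-only from the task's point of view
                    return false;
                }
                return true;
            });

            valid.count();   // an action must run before the accumulator value is meaningful
            System.out.println("Bad records: " + badRecords.value());   // read on the driver
        }
    }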

Alternative to Accumulators

  • Using an accumulator is not the only way to solve this problem
  • It is possible to aggregate values from an entire RDD back to the driver program using actions like reduce or reduceByKey

Different Types of accumulator

Broadcast Variables

  • Broadcast variables allow the programmer to keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks
  • They can be used, for example, to give every node a copy of a large input dataset in an efficient manner
  • All broadcast variables are kept at all the worker nodes for use in one or more Spark operations
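A minimal sketch of a broadcast variable (the lookup table and names are illustrative): a small read-only map is shipped once to every worker node and reused by all tasks, instead of being serialized with each task.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import scala.Tuple2;

    import java.util.HashMap;
    import java.util.Map;

    public class BroadcastExample {
        public static void run(JavaSparkContext sc, JavaRDD<String> countryCodes) {
            Map<String, String> codeToName = new HashMap<>();
            codeToName.put("IN", "India");
            codeToName.put("US", "United States");

            // Ship the lookup table to every worker once, as a read-only variable
            Broadcast<Map<String, String>> lookup = sc.broadcast(codeToName);

            JavaPairRDD<String, String> resolved = countryCodes.mapToPair(
                    code -> new Tuple2<>(code, lookup.value().getOrDefault(code, "Unknown")));
        }
    }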

Introduction to Running Spark in a Cluster

  • The machine where the Spark application is running is called the driver machine
  • The other machine, where the cluster manager is running, is called the master node
  • Worker nodes report their available resources to the master node
  • The cluster manager is a pluggable component
  • Spark is packaged with a built-in cluster manager called the Standalone Cluster Manager
  • There are other types of Spark cluster managers, such as:
  • Hadoop YARN: a resource management and scheduling tool for a Hadoop MapReduce cluster
  • Apache Mesos: a centralized, fault-tolerant cluster manager and global resource manager for your entire data center
  • Spark provides a single script you can use to submit your application to the cluster, called spark-submit

Run Spark Application on Amazon EMR (Elastic MapReduce)

  • AWS is not free
  • AWS charges by how much time and how many machines are running, the machine type, any storage space used, etc.
