Code Monkey home page Code Monkey logo

sparkbigdatatutorials's Introduction

Spark in Scala

This repo contains basic examples of how to work with Spark RDD and Spark DataFrames in Scala

  • How to create RDD using arrays and some basic api usages

     val arr = Array(1,2,3,4,5)
     val newRDD = sc.parallelize(arr) --> creates RDD    
     newRDD.first() -> first object in rdd
     newRDD.take(n).foreach(println) -> print first n objects in rdd
     newRDD.collect() -> get all elements in rdd
     newRDD.collect().foreach(println) -> print each element in new line
     newRDD.partitions.size -> gives the number of partitions newRDD is split into
    

    each partition gets executed in each core in a machine. So more the number of cores better parallelization can be achieved

  • How to create RDD using files

     val fileRDD = sc.textFile('path')
    
  • RDD transformations

     Notes:
         Spark uses lazy evaluation
         Every transformation creates a new RDD from the exisitng RDD after applying the specified transformation
         
         ## filtering each line in fileRDD and creating a new RDD using filter operation.
         ## Condition is length of each line should be greater than 20. 
         val filterRDD = fileRDD.filter(line => line.length > 20)
         
         # takes each line in fileRDD, splits it by delimiter and returns an array. Final output is array of arrays (first array is number of lines in fileRDD, inner array is created based on split operation)
         # So map takes an array and creates array of arrays
         # Map takes each line and applies a given function to it
         val mapRDD = fileRDD.map(line => line.split(",")) 
         
         # similar to map, but flattens the array of arrays into a single array
         val flatMapRDD = fileRDD.flatMap(line => line.split(","))
         
         # to get distinct elements from an RDD
         val distinctRDD = newRDD.distinct()
         
         # Filter lines
         val filterRDD = fileRDD.filter(line=>line!="some_value")
         val filterRDD = fileRDD.filter(_!="some_value")
    

Happy learning!!

sparkbigdatatutorials's People

Contributors

shreeshan avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.