Code Monkey home page Code Monkey logo

sparkoptics's Introduction

Build Status Maven Central Gitter Binder

Spark-optics

Modify your complex structures in spark-sql dataframes with optics.

Getting Started

Need to set an inner element in a complex structure?

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
val df: DataFrame = ???

import org.hablapps.sparkOptics._
df.select(Lens("field.subfield")(df.schema).set(lit(13)):_*)

Want to try it right now, click on the binder icon to lunch a interactive notebook.

Installing

Compiled for Scala 2.11 with Spark 2.3 and Scala 2.12 with Spark 2.4. Tested with spark 2.3, 2.4 in scala 2.11 and for 3.0.1 in scala 2.12.

libraryDependencies += "org.hablapps" %% "spark-optics" % "1.0.0"

Spark lens doesn't have any dependencies beyond Spark itself.

Implemented optics

Lens

Optic to focus in a column of a provided schema.

Protolens

Optic equivalent to a lens, but without an specific schema, so that it can be applied to any dataframe that contains the specified column.

Motivation and larger example

Working with complex structures in Spark Sql's DataFrames can be hard. A common case if you are working with complex structures is to modify the inner elements of the structure. For instance:

case class Street(number: Int, name: String)
case class Address(city: String, street: Street)
case class Company(name: String, address: Address)
case class Employee(name: String, company: Company)

val employee = Employee("john", Company("awesome inc", Address("london", Street(23, "high street"))))
val df = List(employee).toDS.toDF
root
 |-- name: string (nullable = true)
 |-- company: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- street: struct (nullable = true)
 |    |    |    |-- number: integer (nullable = false)
 |    |    |    |-- name: string (nullable = true)

In order to modify an inner element, like changing the name of the street, we need to do something like this:

val mDF = df.select(df("name"),struct(
   df("company.name").as("name"),
   struct(
     df("company.address.city").as("city"),
     struct(
       df("company.address.street.number").as("number"),
       upper(df("company.address.street.name")).as("name")
     ).as("street")
   ).as("address")
 ).as("company"))
mDF.printSchema
val longCodeEmployee = mDF.as[Employee].head
root
  |-- name: string (nullable = true)
  |-- company: struct (nullable = false)
  |    |-- name: string (nullable = true)
  |    |-- address: struct (nullable = false)
  |    |    |-- city: string (nullable = true)
  |    |    |-- street: struct (nullable = false)
  |    |    |    |-- number: integer (nullable = true)
  |    |    |    |-- name: string (nullable = true)
 
mDF: DataFrame = [name: string, company: struct<name: string, address: struct<city: string, street: struct<number: int, name: string>>>]
longCodeEmployee: Employee = Employee(
"john",
Company("awesome inc", Address("london", Street(23, "HIGH STREET"))))

This can be simplified by using spark-optics. It allows you to focus in the element that you want to modify, and the optics will recreate the structure for you.

import org.hablapps.sparkOptics._
import org.apache.spark.sql.functions._

val df: DataFrame = List(employee).toDF

val streetNameLens = Lens("company.address.street.name")(df.schema)
val modifiedDF = df.select(streetNameLens.modify(upper):_*)
modifiedDF.printSchema
modifiedDF.as[Employee].head
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
|    |-- name: string (nullable = true)
|    |-- address: struct (nullable = false)
|    |    |-- city: string (nullable = true)
|    |    |-- street: struct (nullable = false)
|    |    |    |-- number: integer (nullable = true)
|    |    |    |-- name: string (nullable = true)

streetNameLens: Lens = Lens(company.address.street.name)
modifiedDF: DataFrame = [name: string, company: struct<name: string, address: struct<city: string, street: struct<number: int, name: string>>>]
res19_3: Employee = Employee(
"john",
Company("awesome inc", Address("london", Street(23, "HIGH STREET")))
)

sparkoptics's People

Contributors

alfonsorr avatar jserranohidalgo avatar nightscape avatar rberenguel avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

sparkoptics's Issues

Execution slows down after 3 modifications of complex structure.

I am trying to modify multiple values in a complex DF.
I have copied the relevant parts below.

df.printSchema
root
..|--eventData
........|--stringVals
............|--idNumber
............|--firstName
............|--secondName
............|--email
............|--phone
............|--age
............|--dob
..|--idNumber
..|--firstName
..|--secondName
..|--email
..|--phone

var tempDF = df
val listOfFields = List("idNumber","firstName","secondName,","email","phone")

listOfFields.forEach( eachField => {
val lens = Lens("eventData.stringVals." + eachField)(tempDF.schema)
val tempFunc = lens.setDF(col(eachField)) //df contains column with values to replace at root level
tempDF = tempFunc(tempDF)
}

with 3 fileds in the listOfFields code executes in fast. But when I add 5 fields it slows down. I am trying to replace around 25 values in a complex DF which contians 100s of columns at multilelvels.

Please review and suggest a better option.

Thank you,
Che

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.