
bebe's People

Contributors

alfonsorr, mrpowers, nchammas, nvander1


bebe's Issues

suggestion to split the project

The idea is to have this project split into three different subprojects:

  • Functions
  • Typed columns
  • Typed functions

The reason to isolate the functions is to have a dependency only for people who want the columns not yet implemented in Spark, following the basic Spark API, and to make it easier to create the Python interface of #11. It would also be a great addition if we can mark which Spark version each piece of functionality comes from. For example, RegExpExtractAll is new in Spark 3.1.x; we can exclude it in previous versions, or try to backport it so that Spark 3.0.x or 2.4 can use it if a copy-paste is the only thing required.

The typed column project will have the core functionality to use the typed columns. This will allow us to cross-build this project for Spark 2.4 (Scala 2.11) and test against 3.0.x.

And last, the typed functions project will merge the two previous projects to present the previously provided functions, but typed.

This will mostly require sbt rework, and we can see in the future whether it would be better to split it into different repositories 🤷.
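
A minimal build.sbt sketch of the layout, assuming hypothetical subproject names and directory structure (the real names, settings, and Spark version matrix would be decided in the sbt rework):

// build.sbt - hypothetical three-subproject split
val sparkVersion = "3.1.1"

lazy val functions = (project in file("functions"))
  .settings(
    name := "bebe-functions",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
  )

lazy val typedColumns = (project in file("typed-columns"))
  .settings(
    name := "bebe-typed-columns",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
  )

lazy val typedFunctions = (project in file("typed-functions"))
  .dependsOn(functions, typedColumns)
  .settings(name := "bebe-typed-functions")

The typed functions subproject only depends on the other two, matching the merge described above; functions and typed columns are the ones that could be cross-built for older Spark versions.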

Create PySpark interface

Will want to expose the "missing API functions" (e.g. regexp_extract_all) for the PySpark folks too.

Think we'll be able to follow this example and expose these relatively easily.

Something like this:

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def regexp_extract_all(col, regexp, group_index):
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.regexp_extract_all(_to_java_column(col), _to_java_column(regexp), _to_java_column(group_index)))

Not sure how to build the wheel file or publish to PyPI from an sbt repo. My other PySpark projects, like quinn, are built with Poetry, which has built-in packaging / publishing commands.

Functions that are in Spark SQL and not in Scala API to be implemented

We can use this issue to create a list of all the functions that are in Spark SQL, but not in the Scala API for whatever reason.

Here's the list that @nvander1 sent me so we can get started. He already implemented approx_percentile, so we're on our way! (A sketch of how one of these SQL-only functions could be exposed in the Scala API follows the list.)

  • approx_percentile
  • cardinality #24
  • character_length #25 - in the Scala / Python API it's called length
  • char_length - not sure if we need this if we have character_length - in the Scala / Python API it's called length
  • chr #26
  • cot #28
  • count_if #29
  • count_min_sketch
  • cube - in the Scala API it's an alternative to the groupBy method of Dataset
  • current_database - can be obtained from sparkSession.catalog.currentDatabase
  • date - called to_date
  • date_part - dayofweek, dayofyear, second, timestamp_seconds, etc
  • day - dayofmonth
  • decimal - cast(DecimalType)
  • div - /
  • double - cast(DoubleType)
  • e
  • elt - not needed, can use regular Array indexing to fetch items
  • every - not needed, can use forall
  • extract - it's a router to dayofweek, dayofyear, second... and the element to extract can't be an expression.
  • find_in_set - not needed, can use Scala functions
  • first_value - alias of first
  • float - cast(FloatType)
  • if - not needed, use when
  • ifnull
  • in - can use array_contains
  • inline
  • inline_outer
  • input_file_block_length
  • input_file_block_start
  • int - cast(IntegerType)
  • isnotnull - method of column in scala API
  • java_method
  • last_value - alias of last
  • lcase - lower
  • left
  • like - method of column
  • ln - natural log, log in the Scala API
  • make_date -
  • make_interval - this one was added to Spark 😎
  • make_timestamp
  • max_by
  • min_by
  • mod - alias of %
  • named_struct
  • negative - unary minus (-col)
  • now - current_timestamp
  • nullif
  • nvl - coalesce
  • nvl2
  • octet_length
  • or - column methods | and or
  • parse_url
  • percentile
  • pi
  • position - locate but maybe create an alternate locate that accepts all parameters as columns
  • positive
  • power - alias of pow
  • printf - format_string
  • random - alias of rand
  • reflect - not implemented in scala API but is an alias of SQL function java_method
  • replace -
  • right -
  • rlike - method of column
  • rollup - in the Scala API it's an alternative to the groupBy method of Dataset
  • sentences -
  • sha - alias of sha1
  • shiftleft - shiftLeft
  • shiftright - shiftRight
  • shiftrightunsigned - shiftRightUnsigned
  • sign - doesn't seem like a useful function
  • smallint - can use cast()
  • some - array_exists
  • space - returns a string of n spaces, can be done with scala / python
  • stack - PR: #21
  • std - is this the same as the stddev_pop function?
  • string - cast(StringType)
  • str_to_map - in scala / python is much better to do a literal from a Map object
  • substr - substring
  • timestamp - cast(TimestampType)
  • tinyint - can use cast
  • to_unix_timestamp - to_timestamp
  • typeof - scala / python can check the type from the schema
  • ucase - upper
  • uuid
  • version - returns the spark version, can be obtained from the spark session
  • weekday
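
For reference, a minimal sketch of how one of these SQL-only functions could be exposed in the Scala API by wrapping the existing Catalyst expression. The RegExpExtractAll constructor shape below is assumed from Spark 3.1 and should be checked against the actual API:

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.RegExpExtractAll

object MissingFunctions {
  // expose the SQL-only regexp_extract_all by wrapping its Catalyst expression in a Column
  // (constructor shape assumed from Spark 3.1; adjust if the actual signature differs)
  def regexp_extract_all(str: Column, regexp: Column, groupIndex: Column): Column =
    new Column(RegExpExtractAll(str.expr, regexp.expr, groupIndex.expr))
}

Most entries in the list above would follow the same pattern: find the Catalyst expression that backs the SQL function and wrap it, or simply alias an existing Scala function when one already exists.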

Additional functions to add

These are functions that are not implemented in Spark but are commonly requested. They should be implemented as Catalyst expressions so they're performant for the community (a quick sketch built on the existing functions follows the list):

Datetime

  • beginningOfDay
  • beginningOfMonth
  • beginningOfQuarter
  • beginningOfWeek
  • beginningOfYear
  • endOfDay
  • endOfMonth
  • endOfQuarter
  • endOfWeek
  • endOfYear
  • today
  • tomorrow
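
A minimal sketch of a few of these, assuming the simplest possible implementation on top of Spark's built-in date_trunc / trunc / last_day functions rather than the native Catalyst expressions this issue ultimately asks for:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

object DatetimeFunctions {
  // start of the day for a timestamp column
  def beginningOfDay(col: Column): Column = date_trunc("day", col)

  // start of the month for a date or timestamp column
  def beginningOfMonth(col: Column): Column = trunc(col, "month")

  // end of the month: last_day is already built into Spark
  def endOfMonth(col: Column): Column = last_day(col)
}

Catalyst-expression versions would avoid composing multiple function calls, but the intended behavior is the same.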

Possibly make a Spark 2 release & talk about maintenance

I think we should bump to Spark 2.4.5 & see what features can get added to Spark 2. This way there'll be some JAR files for the Spark 2 users.

Then think we should bump to Spark 3.0.1 and see what additional features can be added. This is the version for the latest Databricks runtime and would serve current users.

Think the version bumps should roughly keep pace with the Databricks version bumps.

Don't think we should cross compile. We'll be relying on new features that are added every release and don't want to make this a complicated maintenance thing. We can just make it clear what versions are supported for each release so users can easily pick the JAR file that'll work for them.

Thoughts?

Create df.typedCol method

DataFrames have schemas that contain the name and type for each column.

The built-in column constructors don't use the type information and build generic Column objects. df["some_string"] returns an untyped Column object.

We can add a typedCol method that'll return IntegerColumn, StringColumn, DateColumn, etc. objects based on the schema of the underlying DataFrame.

Suppose the some_string column in the DataFrame is a StringType column. df.typedCol("some_string") should return a StringColumn. Under the hood, it can infer the column type with df.dtypes.
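
A minimal sketch of the idea, with hypothetical typed wrapper classes standing in for the real typed columns (the schema lookup is the part this issue is really about):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.types._

object TypedColSyntax {
  // hypothetical typed wrappers; the real typed-column classes would replace these
  sealed trait TypedCol { def col: Column }
  case class StringColumn(col: Column) extends TypedCol
  case class IntegerColumn(col: Column) extends TypedCol
  case class DateColumn(col: Column) extends TypedCol
  case class UntypedColumn(col: Column) extends TypedCol

  implicit class TypedColOps(df: DataFrame) {
    // look up the column's type in the DataFrame schema and wrap it accordingly
    def typedCol(name: String): TypedCol = df.schema(name).dataType match {
      case StringType  => StringColumn(df(name))
      case IntegerType => IntegerColumn(df(name))
      case DateType    => DateColumn(df(name))
      case _           => UntypedColumn(df(name))
    }
  }
}

With this in scope, df.typedCol("some_string") comes back as a StringColumn; df.dtypes exposes the same schema information as strings, so either works for the lookup.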

@alfonsorr - can you try to add this if you have a sec? I'm guessing this'll just take you a few mins!
