
bebe's People

Contributors

alfonsorr, mrpowers, nchammas, nvander1


bebe's Issues

suggestion to split the project

The idea is to have this project split into three different subprojects:

  • Functions
  • Typed columns
  • Typed functions

The reason to isolate the functions is to have a dependency only for people who want the columns not yet implemented in Spark, following the basic Spark API, and to make it easier to create the Python interface of #11. It would also be a great addition if we can mark which Spark version each piece of functionality comes from. For example, RegExpExtractAll is new in Spark 3.1.x; we can exclude it in previous versions, or try to backport it so that Spark 3.0.x or 2.4 can use it if a copy-paste is the only thing required.

The typed column project will have the core functionality to use the typed columns. This will allow us to cross-build this project for Spark 2.4 (Scala 2.11) and test against 3.0.x.

And last, the typed functions project will merge the two previous projects to present the previously provided functions, but typed.

This will mostly require sbt rework, and we can see in the future whether it would be better to split it into different repositories 🤷.
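
A minimal build.sbt sketch of the layout, assuming hypothetical subproject names and directory structure (the real names, settings, and Spark version matrix would be decided in the sbt rework):

// build.sbt - hypothetical three-subproject split
val sparkVersion = "3.1.1"

lazy val functions = (project in file("functions"))
  .settings(
    name := "bebe-functions",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
  )

lazy val typedColumns = (project in file("typed-columns"))
  .settings(
    name := "bebe-typed-columns",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
  )

lazy val typedFunctions = (project in file("typed-functions"))
  .dependsOn(functions, typedColumns)
  .settings(name := "bebe-typed-functions")

The typed functions subproject only depends on the other two, matching the merge described above; functions and typed columns are the ones that could be cross-built for older Spark versions.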

Create PySpark interface

Will want to expose the "missing API functions" (e.g. regexp_extract_all) for the PySpark folks too.

Think we'll be able to follow this example and expose these relatively easily.

Something like this:

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def regexp_extract_all(col, regexp, group_index):
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.regexp_extract_all(_to_java_column(col), _to_java_column(regexp), _to_java_column(group_index)))

Not sure how to build the wheel file or publish to PyPI from an sbt repo. My other PySpark projects, like quinn, are built with Poetry, which has built-in packaging / publishing commands.

Functions that are in Spark SQL and not in Scala API to be implemented

We can use this issue to create a list of all the functions that are in Spark SQL, but not in the Scala API for whatever reason.

Here's the list that @nvander1 sent me so we can get started. He already implemented approx_percentile, so we're on our way! (A sketch of how one of these SQL-only functions could be exposed in the Scala API follows the list.)

  • approx_percentile
  • cardinality #24
  • character_length #25 - in the Scala / Python API it's called length
  • char_length - not sure if we need this if we have character_length - in the Scala / Python API it's called length
  • chr #26
  • cot #28
  • count_if #29
  • count_min_sketch
  • cube - in the Scala API it's an alternative to the groupBy method of Dataset
  • current_database - can be obtained from sparkSession.catalog.currentDatabase
  • date - called to_date
  • date_part - dayofweek, dayofyear, second, timestamp_seconds, etc
  • day - dayofmonth
  • decimal - cast(DecimalType)
  • div - /
  • double - cast(DoubleType)
  • e
  • elt - not needed, can use regular Array indexing to fetch items
  • every - not needed, can use forall
  • extract - it's a router to dayofweek, dayofyear, second... and the element to extract can't be an expression.
  • find_in_set - not needed, can use Scala functions
  • first_value - alias of first
  • float - cast(FloatType)
  • if - not needed, use when
  • ifnull
  • in - can use array_contains
  • inline
  • inline_outer
  • input_file_block_length
  • input_file_block_start
  • int - cast(IntegerType)
  • isnotnull - method of column in scala API
  • java_method
  • last_value - alias of last
  • lcase - lower
  • left
  • like - method of column
  • ln - natural log, log in the Scala API
  • make_date -
  • make_interval - this one was added to Spark 😎
  • make_timestamp
  • max_by
  • min_by
  • mod - alias of %
  • named_struct
  • negative - unary minus (-col)
  • now - current_timestamp
  • nullif
  • nvl - coalesce
  • nvl2
  • octet_length
  • or - column methods | and or
  • parse_url
  • percentile
  • pi
  • position - locate but maybe create an alternate locate that accepts all parameters as columns
  • positive
  • power - alias of pow
  • printf - format_string
  • random - alias of rand
  • reflect - not implemented in scala API but is an alias of SQL function java_method
  • replace -
  • right -
  • rlike - method of column
  • rollup - in the Scala API it's an alternative to the groupBy method of Dataset
  • sentences -
  • sha - alias of sha1
  • shiftleft - shiftLeft
  • shiftright - shiftRight
  • shiftrightunsigned - shiftRightUnsigned
  • sign - doesn't seem like a useful function
  • smallint - can use cast()
  • some - array_exists
  • space - returns a string of n spaces, can be done with scala / python
  • stack - PR: #21
  • std - is this the same as the stddev_pop function?
  • string - cast(StringType)
  • str_to_map - in scala / python is much better to do a literal from a Map object
  • substr - substring
  • timestamp - cast(TimestampType)
  • tinyint - can use cast
  • to_unix_timestamp - to_timestamp
  • typeof - scala / python can check the type from the schema
  • ucase - upper
  • uuid
  • version - returns the spark version, can be obtained from the spark session
  • weekday
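
For reference, a minimal sketch of how one of these SQL-only functions could be exposed in the Scala API by wrapping the existing Catalyst expression. The RegExpExtractAll constructor shape below is assumed from Spark 3.1 and should be checked against the actual API:

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.RegExpExtractAll

object MissingFunctions {
  // expose the SQL-only regexp_extract_all by wrapping its Catalyst expression in a Column
  // (constructor shape assumed from Spark 3.1; adjust if the actual signature differs)
  def regexp_extract_all(str: Column, regexp: Column, groupIndex: Column): Column =
    new Column(RegExpExtractAll(str.expr, regexp.expr, groupIndex.expr))
}

Most entries in the list above would follow the same pattern: find the Catalyst expression that backs the SQL function and wrap it, or simply alias an existing Scala function when one already exists.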

Additional functions to add

These are functions that are not implemented in Spark but are commonly requested. They should be implemented as Catalyst expressions so they're performant for the community (a quick sketch built on the existing functions follows the list):

Datetime

  • beginningOfDay
  • beginningOfMonth
  • beginningOfQuarter
  • beginningOfWeek
  • beginningOfYear
  • endOfDay
  • endOfMonth
  • endOfQuarter
  • endOfWeek
  • endOfYear
  • today
  • tomorrow
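
A minimal sketch of a few of these, assuming the simplest possible implementation on top of Spark's built-in date_trunc / trunc / last_day functions rather than the native Catalyst expressions this issue ultimately asks for:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

object DatetimeFunctions {
  // start of the day for a timestamp column
  def beginningOfDay(col: Column): Column = date_trunc("day", col)

  // start of the month for a date or timestamp column
  def beginningOfMonth(col: Column): Column = trunc(col, "month")

  // end of the month: last_day is already built into Spark
  def endOfMonth(col: Column): Column = last_day(col)
}

Catalyst-expression versions would avoid composing multiple function calls, but the intended behavior is the same.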

Possibly make a Spark 2 release & talk about maintenance

I think we should bump to Spark 2.4.5 & see what features can get added to Spark 2. This way there'll be some JAR files for the Spark 2 users.

Then think we should bump to Spark 3.0.1 and see what additional features can be added. This is the version for the latest Databricks runtime and would serve current users.

Think the version bumps should roughly keep pace with the Databricks version bumps.

Don't think we should cross compile. We'll be relying on new features that are added every release and don't want to make this a complicated maintenance thing. We can just make it clear what versions are supported for each release so users can easily pick the JAR file that'll work for them.

Thoughts?

Create df.typedCol method

DataFrames have schemas that contain the name and type for each column.

The built-in column constructors don't use the type information and build generic Column objects. df["some_string"] returns an untyped Column object.

We can add a typedCol method that'll return IntegerColumn, StringColumn, DateColumn, etc. objects based on the schema of the underlying DataFrame.

Suppose the some_string column in the DataFrame is a StringType column. df.typedCol("some_string") should return a StringColumn. Under the hood, it can infer the column type with df.dtypes.
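
A minimal sketch of the idea, with hypothetical typed wrapper classes standing in for the real typed columns (the schema lookup is the part this issue is really about):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.types._

object TypedColSyntax {
  // hypothetical typed wrappers; the real typed-column classes would replace these
  sealed trait TypedCol { def col: Column }
  case class StringColumn(col: Column) extends TypedCol
  case class IntegerColumn(col: Column) extends TypedCol
  case class DateColumn(col: Column) extends TypedCol
  case class UntypedColumn(col: Column) extends TypedCol

  implicit class TypedColOps(df: DataFrame) {
    // look up the column's type in the DataFrame schema and wrap it accordingly
    def typedCol(name: String): TypedCol = df.schema(name).dataType match {
      case StringType  => StringColumn(df(name))
      case IntegerType => IntegerColumn(df(name))
      case DateType    => DateColumn(df(name))
      case _           => UntypedColumn(df(name))
    }
  }
}

With this in scope, df.typedCol("some_string") comes back as a StringColumn; df.dtypes exposes the same schema information as strings, so either works for the lookup.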

@alfonsorr - can you try to add this if you have a sec? I'm guessing this'll just take you a few mins!
