mrpowers / bebe
Filling in the Spark function gaps across APIs
@alfonsorr - I can't publish this lib anymore, see the issue report here.
Any idea how to fix this? Are you able to publish libs? Thanks!
The idea is to have this project split into three different subprojects.
The reason to isolate the functions is to have a dependency only for people who want the unimplemented columns in Spark, following the basic Spark API, and to make it easier to create the Python interface of #11. It would be a great addition if we can track which Spark version each piece of functionality comes from. For example, RegExpExtractAll
is new in Spark 3.1.x, so we can exclude it in previous versions, or try to backport it so Spark 3.0.x or 2.4 can use it, if a copy-paste is the only thing required.
The typed column subproject will have the core functionality to use the typed columns. This will allow us to cross-build this project for Spark 2.4 (Scala 2.11) and test against 3.0.x.
And last, the typed functions subproject will merge the two previous projects to present the previously provided functions, but typed.
This will mostly require sbt rework, and we can see in the future whether it would be better to split it into different repositories 🤷.
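The per-version gating mentioned above (RegExpExtractAll being new in Spark 3.1.x) could be sketched roughly like this. This is a hypothetical helper, not part of bebe; only the regexp_extract_all entry in the version map reflects the discussion above, and everything else is illustrative:

```python
# Hypothetical helper: gate a function on the minimum Spark version
# that ships it. regexp_extract_all landed in Spark 3.1 (per the
# discussion above); functions absent from the map are assumed to be
# available everywhere.
MIN_VERSIONS = {"regexp_extract_all": (3, 1)}

def is_supported(func_name, spark_version):
    """Return True if func_name is available in the given Spark version."""
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    required = MIN_VERSIONS.get(func_name, (0, 0))
    return (major, minor) >= required
```

A check like this could decide at build or test time which functions to include for a given cross-build target.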
@alfonsorr - can you please send me your recommended scalafmt workflow? Should I run a command before committing?
Will want to expose the "missing API functions" (e.g. regexp_extract_all) for the PySpark folks too.
Think we'll be able to follow this example and expose these relatively easily.
Something like this:
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def regexp_extract_all(col, regexp, group_index):
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.regexp_extract_all(_to_java_column(col), _to_java_column(regexp), _to_java_column(group_index)))
Not sure how to build the wheel file or publish to PyPI from an SBT repo. My other PySpark projects, like quinn, are built with Poetry, which has built-in packaging / publishing commands.
As suggested here.
Should be similar to what's published for spark-daria.
We can use this issue to create a list of all the functions that are in Spark SQL, but not in the Scala API for whatever reason.
Here's the list that @nvander1 sent me so we can get started. He already implemented approx_percentile, so we're on our way!
- sparkSession.catalog.currentDatabase
- to_date
- dayofweek, dayofyear, second, timestamp_seconds, etc.
- cast(DecimalType) / cast(DoubleType)
- first (the element to extract can't be an expression)
- cast(FloatType)
- array_contains
- cast(IntegerType)
- last
- lower
- isin
- col
- current_timestamp
- coalesce
- |
- and, or
- locate (but maybe create an alternate locate that accepts all parameters as columns)
- pow
- format_string
- rand
- sha1
- shiftLeft
- shiftRight
- shiftRightUnsigned
- stddev_pop
- cast(StringType)
- substring
- to_timestamp
- upper
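Many of the functions listed above exist in Spark SQL even where the Scala or Python APIs lack a wrapper, so one stopgap is to build the SQL expression string and hand it to F.expr. A minimal sketch; the sql_call helper is made up for illustration, and only the string building is shown so it runs without Spark:

```python
def sql_call(func_name, *args):
    # Build a Spark SQL expression string suitable for F.expr(...).
    # Arguments are assumed to already be valid SQL fragments
    # (column names or quoted literals); no escaping is performed here.
    return f"{func_name}({', '.join(args)})"
```

For example, sql_call("regexp_extract_all", "msg", "'(\\d+)'", "1") builds the expression string that F.expr could evaluate, even on a Spark version whose DataFrame API doesn't expose the function directly.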
These are functions that are not implemented in Spark, but commonly requested. They should be implemented as Catalyst Expressions so they're performant for the community:
Datetime
I think we should bump to Spark 2.4.5 and see what features can get added for Spark 2. This way there'll be some JAR files for the Spark 2 users.
Then I think we should bump to Spark 3.0.1 and see what additional features can be added. This is the version for the latest Databricks runtime and would serve current users.
Think the version bumps should roughly keep pace with the Databricks version bumps.
Don't think we should cross compile. We'll be relying on new features that are added every release, and I don't want to make this a complicated maintenance burden. We can just make it clear what versions are supported for each release, so users can easily pick the JAR file that'll work for them.
Thoughts?
DataFrames have schemas that contain the name and type for each column.
The built-in column constructors don't use the type information and build generic Column objects: df["some_string"] returns an untyped Column object.
We can add a typedCol method that'll return IntegerColumn, StringColumn, DateColumn, etc. objects based on the schema of the underlying DataFrame.
Suppose the some_string column in the DataFrame is a StringType column. df.typedCol("some_string") should return a StringColumn. Under the hood, it can infer the column type with df.dtypes.
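The dtype-based dispatch described above could look roughly like this, written in plain Python so it runs without Spark. The class names mirror the ones mentioned, but the typed_col helper and the dtype strings are hypothetical stand-ins for whatever bebe ends up doing in Scala:

```python
# Minimal typed-column dispatch sketch. In Spark, df.dtypes returns a
# list of (column_name, dtype_string) pairs; we dispatch on the dtype
# string to pick a typed wrapper class.
class Column:
    def __init__(self, name):
        self.name = name

class StringColumn(Column): pass
class IntegerColumn(Column): pass
class DateColumn(Column): pass

# Illustrative mapping; real Spark dtype strings include e.g. "bigint".
_DTYPE_TO_CLASS = {"string": StringColumn, "int": IntegerColumn, "date": DateColumn}

def typed_col(dtypes, name):
    """Return a typed column wrapper based on the (name, dtype) pairs."""
    dtype = dict(dtypes)[name]
    return _DTYPE_TO_CLASS.get(dtype, Column)(name)
```

The same idea in Scala would likely pattern-match on the DataType from the schema instead of on strings, which gives compile-time exhaustiveness checking.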
@alfonsorr - can you try to add this if you have a sec? I'm guessing this'll just take you a few mins!