koalas's People

Contributors

90jam, abdealiloko, awdavidson, beobest2, charlesdong1991, deepyaman, dumbmachine, dvgodoy, floscha, fwani, garawalid, gatorsmile, gioa, gliptak, guyao, harupy, hjoo, hyukjinkwon, icexelloss, itholic, lucasg0, nitlev, rainfung, rxin, shril, thoo, thunterdb, tomspur, ueshin, xinrong-meng

koalas's Issues

Adding a `strict` setting to force pandas compatibility at the cost of efficiency

I think that there are a bunch of design considerations that will come up.

For example, when doing pd.DataFrame.sample(), pandas guarantees an exact number of records for n and an exact fraction when using frac.
When implementing these, will Koala decide to:

  1. Be exactly the same as pandas and inefficient
  2. Be inexact (not exactly same behavior as pandas) and be efficient

Option 1 could be implemented as:

  • Need to find num-rows
  • Do an inexact sampling in spark
  • If the number of rows sampled is lower than required, do another inexact sampling on what is left over
    OR:
  • Sort the data randomly and take the first N records
    Both are inefficient and do more computation than a pure-spark solution would, but they give exactly the same number of records as pandas.

Option 2 could be implemented as:

  • For frac use spark's behavior of fraction
  • For n do a count and use spark's sample() with fraction=n/df.count()
    Neither would give the same number of rows as pandas consistently, but it requires at most one extra pass over the data (see the sketch below).
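
A minimal sketch of Option 2 against a plain PySpark DataFrame (this is not the project's API; the wrapper would hide these calls):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame({'x': range(100)}))

# frac: delegate directly to Spark's approximate, fraction-based sampling (no extra pass).
sampled_frac = sdf.sample(False, 0.1, seed=42)

# n: one counting pass to turn n into a fraction, then the same approximate sampling.
n = 10
fraction = min(1.0, n / sdf.count())
sampled_n = sdf.sample(False, fraction, seed=42)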

Broken example in README

spark.from_pandas in the following example should be koalas.from_pandas, and the import (from databricks import koalas) is also missing. A corrected version follows the snippet.

import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})

df = spark.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x
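
A corrected version along the lines suggested above (assuming the package exposes from_pandas at the top level as described):

import pandas as pd
from databricks import koalas

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})

df = koalas.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x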

Move read_csv and read_parquet to SparkSession

Monkey patch read_csv and read_parquet onto SparkSession instead of the pyspark package.

Currently we always use default_session in namespace.py, but that session might be different from the one some users expect.
Also make pyspark.read_csv redirect to default_session().read_csv() for most users. A minimal sketch of the SparkSession patch is below.
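
A minimal sketch of what the SparkSession monkey patch could look like (the method name and option handling here are assumptions, not the project's implementation):

from pyspark.sql import SparkSession

def _read_csv(self, path, **options):
    # Use this session's reader rather than a module-level default session.
    return self.read.options(**options).csv(path)

SparkSession.read_csv = _read_csv

# Usage with whichever session the user actually holds:
# spark.read_csv('/path/to/file.csv', header=True)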

pip s3 location needs to change

It still says "pandorable_sparky":

pip install https://s3-us-west-2.amazonaws.com/databricks-tjhunter/pandorable_sparky/pandorable_sparky-0.0.5-py3-none-any.whl

Script to package and release to pypi

We have enough content to make a release onto PyPI. The goal of this ticket is to:

  • document the process for future releases, in docs/RELEASE.md for example
  • make a first release with version 0.1.0, because this is already pretty usable.

UDF using DataFrame.apply

These are the patterns that I have seen with the apply() function (a sketch of how Option 1 could map to a Spark pandas UDF follows the examples):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7], 'b': [7, 6, 5, 4, 3, 2, 1]})

# Option 1: Use a series and apply on it to run a UDF (Technically a Series.apply)
df['c'] = df['a'].apply(lambda x: x*2)

# Option 2: Use the entire row from a dataframe
df['d'] = df.apply(lambda x: x.a*x.b, axis=1)

# Option 3: Use some columns in a row
df['e'] = df[['a', 'b']].apply(lambda x: x.a*x.b, axis=1)

# Option 4: Apply aggregates on every column to create a new dataframe
new_df = df.apply(lambda x: x.sum())

# Option 5: Apply aggregates but return multiple values - here it pads the shorter columns with NaNs
new_df = df.apply(lambda x: x[:x[0]])
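
As a point of comparison, a hedged sketch (not the project's API) of how Option 1 could be backed by a Spark scalar pandas UDF:

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1]}))

# Scalar pandas UDF: receives a pandas Series per batch and returns a Series.
@F.pandas_udf('long')
def times_two(s):
    return s * 2

sdf = sdf.withColumn('c', times_two(F.col('a')))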

Column select with [[...]] broken

This functionality for selecting columns works in both Spark and Pandas, but it is broken in Koala.

In [1]: from datetime import datetime
   ...:
   ...: import pandas as pd
   ...: import numpy as np
   ...:
   ...: import findspark; findspark.init(); import pyspark
   ...: from pyspark.sql import types as T, functions as F
   ...:

In [2]: spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [3]: # Get data
   ...: data = {'a{}'.format(i): [1,2,3,4] for i in range(2)}
   ...: data.update({'b{}'.format(i): ['a', 'b ', ' c', ' d '] for i in range(2)})
   ...: data.update({'c{}'.format(i): [[1,2], None, [3,4], [4,5,6]] for i in range(2)})
   ...: df = pd.DataFrame(data)
   ...: sdf = spark.createDataFrame(df)
   ...:

In [4]: sdf[['a0', 'a1']].show()
+---+---+
| a0| a1|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+


In [5]: import databricks.koala

In [6]: sdf[['a0', 'a1']].show()
---------------------------------------------------------------------------
SparkPandasNotImplementedError            Traceback (most recent call last)
<ipython-input-6-c2947ceaf814> in <module>()
----> 1 sdf[['a0', 'a1']].show()

~/Documents/oss/databricks/spark-pandas/databricks/koala/structures.py in __getitem__(self, key)
    619
    620     def __getitem__(self, key):
--> 621         return anchor_wrap(self, self._pd_getitem(key))
    622
    623     def __setitem__(self, key, value):

~/Documents/oss/databricks/spark-pandas/databricks/koala/structures.py in _pd_getitem(self, key)
    605             raise NotImplementedError(key)
    606         if isinstance(key, list):
--> 607             return self.loc[:, key]
    608         if isinstance(key, DataFrame):
    609             # TODO Should not implement alignment, too dangerous?

~/Documents/oss/databricks/spark-pandas/databricks/koala/selection.py in __getitem__(self, key)
    132                     df = df._spark_where(reduce(lambda x, y: x & y, cond))
    133             else:
--> 134                 raiseNotImplemented("Cannot use slice for MultiIndex with Spark.")
    135         elif isinstance(rows_sel, string_types):
    136             raiseNotImplemented("Cannot use a scalar value for row selection with Spark.")

~/Documents/oss/databricks/spark-pandas/databricks/koala/selection.py in raiseNotImplemented(description)
    102                 description=description,
    103                 pandas_function=".loc[..., ...]",
--> 104                 spark_target_function="select, where")
    105
    106         rows_sel, cols_sel = _unfold(key, self.col)

SparkPandasNotImplementedError: Cannot use slice for MultiIndex with Spark. You are trying to use pandas function .loc[..., ...], use spark function select, where

In [7]: df[['a0', 'a1']]
Out[7]:
   a0  a1
0   1   1
1   2   2
2   3   3
3   4   4

Port dask's tests for basic dataframe operations

We should adapt the test suites of Dask to make sure that we are not breaking anything.

The license allows us to copy the code if we put the proper attribution in the README.

dask uses a .compute() method, which we should add as an alias to .toPandas() to simplify the porting of tests (a minimal alias sketch is below).

A good test suite to begin with is probably
https://github.com/dask/dask/blob/master/dask/dataframe/tests/test_dataframe.py
It is very long, so we can start with the first few tests to see how many changes we need to make. In general, given the size of the testing code base, we should aim at limiting the changes we bring to it.
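
A minimal sketch of the alias, using the class and import path that appear elsewhere in these issues (they may differ in the actual code base):

from databricks.koala.frame import PandasLikeDataFrame

def compute(self):
    # dask-style alias: materialize the frame as a local pandas DataFrame.
    return self.toPandas()

PandasLikeDataFrame.compute = compute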

Initial port of 10 minutes to Pandas

We should have a short tutorial that quickly explains:

  • to spark users, how they can leverage pandas features thanks to koala
  • to pandas users, what changes they should be aware of when they want to use spark+koala

The aim is to produce a simple tutorial based on the sections of the 10 minutes to pandas tutorial that are already implemented, and mark the remainder as not implemented yet.

In particular, the tutorial should answer the following questions:

  • will the index be tracked
  • are the objects (context, dataframe, column) separate or are they the regular spark objects
  • how filtering nulls and NaNs is affected (dropna, value_counts)

Work has been started in this pull request, and more contributions are welcome:
#34

Easier Pandas UDFs

TODO: move this to a google doc for proper design.

All these are mostly implemented in https://github.com/databricks/spark-pandas/blob/master/pandorable_sparky/typing.py#L86

A couple of changes can make the current Pandas UDFs more friendly for non-spark users:

Support native python types. Users should not have to know the Spark datatype objects.

Use python 3 typing for expressing types. Type hints are required by Spark, and now they can be expressed naturally in python 3 too. You can quickly declare a python UDF with the following pythonic syntax:

@spark_function
def my_udf(x: Col[int]) -> Col[double]:
  pass

The input type is optional, but it can then be used to do either casting or type checking, just like in Scala.
An alternative syntax for expressing the same thing in python 2 could be:

@spark_udf(col_x=int, col_return=double)
def my_udf(x): pass

Similarly for reducers. The following function can be automatically inferred to be a UDAF because it returns an integer (a sketch of hint-based inference follows the example).

@spark_function
def my_reducer(x: Col[int]) -> int:
  pass
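
A hedged sketch of how a decorator like this could read the Python type hints and build a scalar pandas UDF (the reducer/UDAF case is omitted); spark_function and Col are hypothetical names, not an existing API, and Python's float stands in for Spark's double:

import typing

from pyspark.sql import functions as F, types as T

class Col:
    # Marker so hints can be written as Col[int]; indexing simply returns the element type.
    def __class_getitem__(cls, item):
        return item

_PY_TO_SPARK = {int: T.LongType(), float: T.DoubleType(), str: T.StringType()}

def spark_function(f):
    hints = typing.get_type_hints(f)
    return_type = _PY_TO_SPARK[hints.pop('return')]

    def wrapped(*series):
        return f(*series)

    # Wrap as a scalar pandas UDF with the Spark return type inferred from the hint.
    return F.pandas_udf(wrapped, returnType=return_type)

@spark_function
def my_udf(x: Col[int]) -> Col[float]:
    return x * 2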

Broadcasting. For example, for UDFs that take multiple columns as arguments, a scalar should be accepted and treated as a SQL lit().

Useful and relevant code:
https://github.com/databricks/spark-pandas/blob/master/pandorable_sparky/typing.py#L86
https://github.com/databricks/Koala/blob/master/potassium/utils/spark_utils.py
https://github.com/dvgodoy/handyspark/blob/master/handyspark/sql/schema.py

Use encapsulation instead of monkey patching / inheritance

After spending some time thinking about the issue, I think it would be better for the project to create its own entry point and its own DataFrame / Series class, rather than changing the existing Spark DataFrame API through monkey patching or inheritance.

The main reason I am leaning this way now is because Spark and Pandas DataFrames already have quite a few functions with the same name but different semantics, e.g. sample, head. Keeping existing Spark behavior does not accomplish the goal of the project, while changing existing Spark behavior based on some import statement is an anti-pattern of Python code. Doing it through encapsulation also avoids the issue of Koalas polluting the internal state of Spark's DataFrame through monkey patching.

So here's an alternative design:

koalas: parallel to pandas namespace. Includes I/O functions (read_csv, read_parquet) as well as general functions (e.g. concat, get_dummies). Functions here would only work if there is an active SparkSession.

koalas.DataFrame: A completely new interface based on Pandas' DataFrame API. KoalaDataFrame wraps around a Spark DataFrame along with additional internal state.

koalas.Series: Based on Pandas' Series API.

The only thing I'd monkey patch into Spark's DataFrame is a toKoalas method, which turns an existing Spark DataFrame into a Koalas DataFrame.
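
A minimal sketch, not the eventual implementation, of the encapsulation approach plus the single toKoalas monkey patch:

import pyspark.sql

class KoalasDataFrame:
    """Wraps a Spark DataFrame and exposes pandas-like semantics on top of it."""

    def __init__(self, sdf):
        self._sdf = sdf

    def head(self, n=5):
        # pandas semantics: default to 5 rows, return another wrapped frame.
        return KoalasDataFrame(self._sdf.limit(n))

    def to_pandas(self):
        return self._sdf.toPandas()

def _to_koalas(self):
    return KoalasDataFrame(self)

# The only change to Spark's own API: an explicit conversion entry point.
pyspark.sql.DataFrame.toKoalas = _to_koalas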

While this change will bring more code, we will get the following benefits:

  • Very clear when users will get the Pandas behavior vs Spark behavior.
  • Documentation also becomes clear, as we just provide documentation on Koalas.

The main tradeoff is:

If we go with encapsulation, there is a separate class hierarchy in addition to the existing Spark ones, and the new hierarchy follows strictly Pandas semantics. The downside is that users cannot directly use a Spark DataFrame with Pandas code, and they need to call "spark_ds.toKoalas".

If we go with the current monkey patching approach, it's the opposite of the above. Users can directly use a Spark DataFrame with Pandas code, but existing Spark code's behavior might change as soon as they import the koalas package. For example, "head" now returns 5 rows, rather than 1 row. "columns" returns a Pandas Index, rather than a Python list. "sample" in the future might also change.

Getting an indexed column as a Series - doesn't work in pandas


In [5]: # Get data
   ...: data = {'a{}'.format(i): [1,2,3,4] for i in range(2)}
   ...: data.update({'b{}'.format(i): ['a', 'b ', ' c', ' d '] for i in range(2)})
   ...: df = pd.DataFrame(data)
   ...: sdf = spark.createDataFrame(df)
   ...:

In [6]: import databricks.koala

In [7]: df = df.set_index('a0')
   ...: sdf = sdf.set_index('a0')
   ...:

In [8]: sdf['a0']

a0
1    1
2    2
3    3
4    4
Name: a0, dtype: int64

In [9]: df['a0']

Getting the index as a column seems to work in spark (with koala), but fails in pandas.

I guess I'm again coming back to the question: what should the expected behavior in koala be?
Is an index considered a column? (I know internally it is a column, but I meant from a usability perspective.) Pandas does not think so.

Documentation setup

I think it would be awesome if some sort of readthedocs like documentation could be started.

I have begun exploring some things, and rather than storing it all in my head, I could write it out so that it becomes a reference for others and me.

Reverse get_dummies

In Pandas, there is no good way to reverse get_dummies. This has been a long-standing issue (since 2014) with Pandas.

Sklearn and SparkML both have ways to reverse OHE. It would be amazing if Koalas supported this! (A plain-pandas workaround is sketched below.)
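
For reference, a hedged workaround sketch in plain pandas using idxmax over the dummy columns (column names here are illustrative):

import pandas as pd

df = pd.DataFrame({'y': ['a', 'b', 'b', 'c']})
dummies = pd.get_dummies(df['y'], prefix='y')        # columns y_a, y_b, y_c

# Reverse: pick the column holding the 1 in each row, then strip the prefix.
recovered = dummies.idxmax(axis=1).str[len('y_'):]
assert (recovered == df['y']).all()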

Make patch_spark() revertible - also add a context manager for it

I notice that the patch_spark() documentation says it cannot be reverted.
That can be a bit scary, especially because I have a bunch of code where I'd love to use koala for the parts I am writing, but I also need to call third-party python+spark libraries I cannot control, and they may break.

Ideally - I'd love to be able to do something like:

# Some sparky stuff - Create DF1, DF2
with init_koala():
    # Koala stuff with DF1
# More sparky stuff with DF2

Implementation Ideas:
I think a very simple implementation of this could keep a dict() of all the items being patched, along with their old versions, and revert using that dict. It just needs a bit of book-keeping (see the sketch below).
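
A minimal sketch of that book-keeping idea as a context manager; patch_spark here patches a single attribute to keep the example short, and the names are hypothetical:

from contextlib import contextmanager

import pyspark.sql

_saved = {}

def patch_spark():
    # Remember the attributes about to be overwritten so they can be restored.
    _saved['head'] = pyspark.sql.DataFrame.head
    pyspark.sql.DataFrame.head = lambda self, n=5: self.limit(n).toPandas()

def unpatch_spark():
    for name, original in _saved.items():
        setattr(pyspark.sql.DataFrame, name, original)
    _saved.clear()

@contextmanager
def init_koala():
    patch_spark()
    try:
        yield
    finally:
        unpatch_spark()

# with init_koala():
#     ...  # koala-style behavior on existing DataFrames
# # original Spark behavior restored here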

Script to deploy new version onto databricks

Let's have a small script that builds the project and pushes a new release onto databricks using the Databricks CLI. Pushing a new release onto DBFS is enough for users to attach to a cluster.

Design of lazy evaluation for some properties

The idea behind this issue is to discuss the design of lazy evaluation for properties.

I think we need this behavior especially with a property like values. We will load all the data in values when we create a new DataFrame and this becomes heavy when the data is huge.

I found a good trick here and I think it can be useful. We just need to override __get__().

Any idea related to the topic and we can use it in Koalas?

For instance, we can solve values by replacing it with to_numpy(), as the pandas documentation suggests:

Warning: We recommend using DataFrame.to_numpy() instead.

A sketch of the descriptor trick is below.
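
A minimal sketch of the descriptor trick mentioned above: a property-like object that defers the heavy computation until first access and caches the result on the instance (the class names are illustrative):

class lazy_property:
    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = self.func(obj)             # computed only on first access
        obj.__dict__[self.name] = value    # cached; later reads bypass __get__
        return value

class LazyFrame:
    def __init__(self, sdf):
        self._sdf = sdf

    @lazy_property
    def values(self):
        # Heavy: materializes every row; runs only if .values is actually used.
        return self._sdf.toPandas().values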

Additions on "what is supported"

Potentially a few things to add to the "what is supported" spreadsheet (taken from my list of common operations in pandas):
https://docs.google.com/spreadsheets/d/1GwBvGsqZAFFAD5PPc_lffDEXith353E1Y7UV6ZAAHWA/edit

Pandas:

  • pd.concat()
  • pd.cut / Series.cut
  • pd.get_dummies()

Series:

  • value_counts()
  • Common functions: min / max / argmin / abs / corr / cov / cummin / cumprod / cumsum
  • all / any
  • clip
  • fillna
  • Operations: eq / isin / lt / le / gt / ge / isna / isnull

pd.options:

Options should be considered at some point.
For example, I use with pd.option_context('mode.use_inf_as_null', True): together with data = data[[colname]].dropna() to drop both INFs and NAs from a dataframe.
These should be documented as not implemented right now, as they do change results.

Indexes:

And I had a question, as I see set_index and reset_index in the spreadsheet right now.
Is the plan that this library will implement the indexing logic on top of Spark (as Spark DataFrames don't have indexing)?
I.e., will the index be tracked in koala as a normal column in the Spark DataFrame, with all index functionality working on top of it?

Add most of the missing pandas methods as stubs

Since the documentation step in an external spreadsheet is proving too tedious to keep up with, let's simply add most of the missing methods as stubs. These methods should:

  • raise a NotImplementedError (with a message that encourages contributions and includes the name of the missing function)
  • be documented as not implemented yet.

It should be pretty easy to implement a method that always throws for these functions (see the sketch below).
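
A minimal sketch, with illustrative names, of a factory that produces such always-raising stubs:

def unsupported_function(class_name, method_name):
    def stub(*args, **kwargs):
        raise NotImplementedError(
            "{0}.{1}() is not implemented yet; contributions are welcome."
            .format(class_name, method_name))
    stub.__name__ = method_name
    stub.__doc__ = ("A stub for {0}.{1}(), which is not implemented yet."
                    .format(class_name, method_name))
    return stub

class PandasLikeDataFrame:
    add = unsupported_function('pd.DataFrame', 'add')
    agg = unsupported_function('pd.DataFrame', 'agg')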

Publish the package documentation

We should publish documentation to explain the changes that this package brings:

  • list of the status of all the methods: it can be a markdown file, for example, that exports the content of the spreadsheet
  • full python doc of the modified pyspark.DataFrame and pyspark.Column to provide a reference to what is implemented
  • (if possible) embed a quick tutorial once it is written

DataFrame.dtypes

Raising this to discuss what the handling of the .dtypes should be.
Current implementations:

  • Pandas: Returns a pandas Series with index of fieldnames and values as dtypes
  • Spark: Returns a list of tuples with (fieldname, dtype)
  • Koala: Same as Spark

Other Differences:

  • Whether to show indexes in .dtypes:
    • Pandas: Shows only columns and not indexes
    • Spark: Shows all columns, no index concept
    • Koala: Shows columns and indexes
  • Type values:
    • Pandas: Shows strings, arrays, map, etc as object
    • Spark: Shows strings, arrays, map, etc in appropriate simpleString notation
    • Koala: Same as spark
  • Dtype notation:
    • Pandas: Shows numpy type for the dtype
    • Spark: Shows simpleString notation
    • Koala: Same as spark
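
A short illustration of the differing representations described above (plain pandas vs. plain Spark):

import pandas as pd
from pyspark.sql import SparkSession

pdf = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
print(pdf.dtypes)
# x     int64
# y    object
# dtype: object

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
print(sdf.dtypes)
# [('x', 'bigint'), ('y', 'string')]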

Should we simplify stub code?

Right now they look like:

    def add(self, other, axis='columns', level=None, fill_value=None):
        """A stub for the equivalent method to `pd.DataFrame.add()`.
        The method `pd.DataFrame.add()` is not implemented yet.
        """
        raise PandasNotImplementedError(class_name='pd.DataFrame', method_name='add')

Should we simplify it to something like

def add(**kwargs):
  unsupported_function()

or

add = unsupported_function()

or even

unsupported_function('add')

Reason I think we should simplify:

  1. I don't see a strong need to show all the arguments, in addition to the function name. (Correct me if I'm wrong).
  2. Perception-wise, it'd be better to have a shorter missing-functions file than a very long one.

Extract Koala cell value results in error

Scenario:

import numpy as np
import pandas as pd

# Create Small Sample Dataset
dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
kdf = spark.from_pandas(pdf)

When running this via Pandas, this works as expected:

pdf['D'][0]

Out[107]: 0.3420981969020352

But with Koalas, I get the following error (a workaround sketch follows the output):

kdf['D'][0]

Out[106]: <repr(<pyspark.sql.column.Column at 0x7fcdeb84e710>) failed: pyspark.sql.utils.AnalysisException: "Can't extract value from D#6026: need struct type but got double;">
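
A hedged workaround sketch: collect to pandas first and index locally, which sidesteps the distributed column rather than fixing the behavior:

kdf.toPandas()['D'].iloc[0]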

DataFrame.sample

This function is going to require more thought than most others, because Spark and Pandas use the same function name (sample) to provide slightly different semantics:

Documentation:

Before starting on designing what the expectations should be, here are some constraints:

  • the existing spark code must still behave similarly
  • the pandas code may have to call arguments by name to make it compatible. This is usually the standard practice anyway.

Some questions which the design doc should explore:

  • when calling for a number of items to return, should it return a pandas or a spark dataframe? I expect a spark dataframe
  • should the number of elements returned be exact? I would expect so, since that is the whole point of specifying the number of elements
  • should the elements always be the same? This is very hard to do with the current implementation of sample() in Spark, so it would have to be changed a bit

First release

First release (private alpha)

  • more complete readme:

    • how to contribute
    • what is implemented
    • license
  • clean the print statements

  • make a script to pull the list of implemented functions in markdown

  • rename the package

Merging Schema Spark -> Pandas DF

I wrote out a single CSV file using Spark, and am trying to just read in that file directly (not the metadata) using Pandas.

However, I'm getting this "TypeError: field Transport DtTm: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>".

Please see the attached screenshot (Screen Shot 2019-04-19 at 10.46.34 PM).

More aggressive bound checks on packages

Some environments like Databricks ship with outdated versions of python and pyarrow, which cause some deserialization errors. Let's put the exact supported bounds in the requirements. Here is what works from experimentation:

  • pyspark >= 2.4: pandas UDF
  • pandas >= 0.23: support for parquet reading
  • pyarrow >= 0.11.1: support for dates

Dataframe.loc

Dask test suite:
https://github.com/dask/dask/blob/master/dask/dataframe/tests/test_indexing.py

Dask implementation:
https://github.com/dask/dask/blob/master/dask/dataframe/indexing.py#L69

HandySpark:
https://github.com/databricks/Koala/blob/master/potassium/selection.py#L38

The dask implementation is very comprehensive; I believe we do not need to implement every single access pattern unless we can easily copy/paste from dask.
The basic use cases are described here:
https://pandas.pydata.org/pandas-docs/stable/10min.html#getting
Let's not focus on the cases that require an index for now; that requires more thought about how we want to handle indexing. (The basic label-based patterns are sketched below.)
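
For reference, the basic label-based patterns from that pandas section, which would be the natural first targets:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

df.loc['x']               # a single row by label
df.loc[:, ['A', 'B']]     # all rows, a subset of columns
df.loc['x':'y', ['A']]    # label-based row slice plus a column list
df.loc['x', 'A']          # scalar access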

Create CONTRIBUTING.md

We should create a CONTRIBUTING.md file that documents how to get started with the project, including build instructions.

The idea is that developers who are familiar with Python and Apache Spark should be able to go from not having anything set up to being able to build, modify, and run tests just by reading the CONTRIBUTING.md file. We should update the "How to contribute" section in README.md to simply link to CONTRIBUTING.md.

Create a design principles doc

As we start to accept contributions, it's important to set the principles in place, so the project will carry forward in a coherent way.

How to call dev/_make_missing_functions.py?

rxin @ C02XT0W6JGH5 : ~/workspace/spark-pandas (master) 
> dev/_make_missing_functions.py 
Traceback (most recent call last):
  File "dev/_make_missing_functions.py", line 22, in <module>
    from databricks.koala.frame import PandasLikeDataFrame
ImportError: No module named databricks.koala.frame

Do I need to install koala first? We should add documentation to CONTRIBUTING.md. It'd also be best if this runs against the existing code base, rather than a system-wide installed Koala.

Document the functions that are different between Pandas and Spark

We should carefully document the functions that are different between Pandas and Spark/Koala, including with a special decorator. Suggestions are welcome here. Here are a couple:

  • count
  • maybe dtypes? #52 has a longer discussion

For these functions, the documentation should clearly state:

  • that the Spark behaviour is preserved
  • which alternative function provides the desired pandas behaviour

In addition, we should mark these functions with a python decorator so that we can easily find the conflicts (a minimal sketch is below).

Regarding providing an alternative, I would be in favour of providing a function like pandas_count or pd_count so that users have an easy way to switch.
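
A minimal sketch, with hypothetical names, of such a marker decorator:

_SPARK_BEHAVIOR_KEPT = set()

def keeps_spark_behavior(pandas_alternative=None):
    """Marks a method whose Spark semantics are preserved despite a pandas name clash."""
    def decorator(f):
        f._keeps_spark_behavior = True
        f._pandas_alternative = pandas_alternative
        _SPARK_BEHAVIOR_KEPT.add(f.__name__)
        return f
    return decorator

class PandasLikeDataFrame:
    @keeps_spark_behavior(pandas_alternative='pandas_count')
    def count(self):
        """Preserves Spark semantics: returns the number of rows as an int."""
        ...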
