hi-primus / optimus

:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Home Page: https://hi-optimus.com

License: Apache License 2.0

Python 75.36% Shell 0.01% HTML 1.37% Jupyter Notebook 23.10% CSS 0.05% JavaScript 0.05% Dockerfile 0.05%
spark pyspark data-wrangling bigdata big-data-cleaning data-science data-cleansing data-cleaner data-transformation machine-learning

optimus's Introduction

Optimus

Logo Optimus


Overview

Optimus is an opinionated Python library to easily load, process, plot and create ML models that run over pandas, Dask, cuDF, Dask-cuDF, Vaex or Spark.

Some amazing things Optimus can do for you:

  • Process data using a simple API that is easy for newcomers to pick up.
  • More than 100 functions to handle strings, dates, URLs and emails.
  • Easily plot data of any size.
  • Out-of-the-box functions to explore and fix data quality.
  • Use the same code to process your data on your laptop or on a remote GPU cluster.

See Documentation

Try Optimus

To launch a live notebook server and try Optimus using Binder or Colab, click one of the following badges:

Binder Colab

Installation (pip):

In your terminal just type:

pip install pyoptimus

By default, Optimus installs pandas as its engine. To install other engines, use the following commands:

Engine Command
Dask pip install pyoptimus[dask]
cuDF pip install pyoptimus[cudf]
Dask-cuDF pip install pyoptimus[dask-cudf]
Vaex pip install pyoptimus[vaex]
Spark pip install pyoptimus[spark]

To install from the repo:

pip install git+https://github.com/hi-primus/optimus.git@develop-23.5

To install other engines:

pip install git+https://github.com/hi-primus/optimus.git@develop-23.5#egg=pyoptimus[dask]

Requirements

  • Python 3.7 or 3.8

Examples

You can go to 10 minutes to Optimus, where you can find the basics to start working in a notebook.

You can also go to the Examples section and find specific notebooks about data cleaning, data munging, profiling, data enrichment and how to create ML and DL models.

Here's a handy Cheat Sheet with the most common Optimus operations.

Start Optimus

Start Optimus using "pandas", "dask", "cudf", "dask_cudf", "vaex" or "spark".

from optimus import Optimus
op = Optimus("pandas")
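
To run the same pipeline on another engine, pass its name when starting Optimus. A minimal sketch, assuming the matching engine extra from the table above is installed:

# Any of the supported engine names can be passed here:
# "pandas", "dask", "cudf", "dask_cudf", "vaex" or "spark".
op = Optimus("dask")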

Loading data

Optimus can load data in CSV, JSON, Parquet, Avro and Excel formats, from a local file or from a URL.

# csv
df = op.load.csv("../examples/data/foo.csv")

# json
df = op.load.json("../examples/data/foo.json")

# using a url
df = op.load.json("https://raw.githubusercontent.com/hi-primus/optimus/develop-23.5/examples/data/foo.json")

# parquet
df = op.load.parquet("../examples/data/foo.parquet")

# ...or anything else
df = op.load.file("../examples/data/titanic3.xls")

Also, you can load data from Oracle, Redshift, MySQL and Postgres databases.

Saving Data

# csv
df.save.csv("data/foo.csv")

# json
df.save.json("data/foo.json")

# parquet
df.save.parquet("data/foo.parquet")

You can also save data to Oracle, Redshift, MySQL and Postgres.

Create dataframes

You can also create a dataframe from scratch:

df = op.create.dataframe({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 3, 5, 7],
    'C': [2, 4, 6, None],
    'D': ['1980/04/10', '1980/04/10', '1980/04/10', '1980/04/10']
})

Using display, you get a nicely formatted view of your data with extra information such as the column number, column data type and marked white spaces.

display(df)

Cleaning and Processing

Optimus was created to make data cleaning a breeze. The API was designed to be very easy for newcomers and familiar to people coming from pandas. Optimus expands the standard DataFrame functionality by adding .rows and .cols accessors.

For example, you can load data from a URL, transform it and apply some predefined cleaning functions:

new_df = df\
    .rows.sort("rank", "desc")\
    .cols.lower(["names", "function"])\
    .cols.date_format("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("date arrival", "dd-MM-YYYY", output_cols="from arrival")\
    .cols.normalize_chars("names")\
    .cols.remove_special_chars("names")\
    .rows.drop(df["rank"]>8)\
    .cols.rename("*", str.lower)\
    .cols.trim("*")\
    .cols.unnest("japanese name", output_cols="other names")\
    .cols.unnest("last position seen", separator=",", output_cols="pos")\
    .cols.drop(["last position seen", "japanese name", "date arrival", "cybertronian", "nulltype"])

Need help? 🛠️

Feedback

Feedback is what drives Optimus' future, so please take a couple of minutes to help shape the Optimus roadmap: http://bit.ly/optimus_survey

If you have a suggestion or feature request, please use https://github.com/hi-primus/optimus/issues

Troubleshooting

If you have issues, see our Troubleshooting Guide

Contributing to Optimus 💡

Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contribution, including:

  • Documentation updates, enhancements, designs, or bug fixes.
  • Spelling or grammar fixes.
  • README.md corrections or redesigns.
  • Adding unit or functional tests.
  • Triaging GitHub issues -- especially determining whether an issue still persists or is reproducible.
  • Blogging, speaking about, or creating tutorials about Optimus and its many features.
  • Helping others on our official chats.

Backers and Sponsors

Become a backer or a sponsor and get your image on our README on GitHub with a link to your site.


optimus's People

Contributors

argenisleon, arpit1997, atwoodjw, aviolante, codacy-badger, cool-pot, deepakjangid123, dependabot[bot], eschizoid, faviovazquez, jameslamb, jarrioja, joseangelhernao, lakhotiaharshit, lironco11, luis11011, luisboitas, mrpowers, niteshnicholas, pyup-bot, sergey48k, timgates42


optimus's Issues

Explore options for a different DataFrameTransformer interface

I'm not sure how much we'll want to explore this option. Just want to introduce a design pattern that works well with the Scala API of Spark.

The Spark Scala API has a nifty transform method that lets users chain user-defined transformations and methods defined in the Dataset class. See this blog post for more information.

I like the DataFrameTransformer class, but it doesn't let users easily access the native PySpark DataFrame methods.

We might want to take these methods out of the DataFrameTransformer class, so the user can mix and match the Optimus API and the PySpark API.

source_df\
    .transform(lambda df: lower_case(df, "*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(lambda df: trim_col(df, "address"))

The transform method is defined in quinn. I'd love to make an interface like this, but I'm not sure how to implement it in Python (see the sketch after the example below).

source_df\
    .transform(lower_case("*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(trim_col("address"))
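
One way to get that interface in Python, sketched below: write each helper as a curried function that returns a one-argument function of a DataFrame, so it can be passed straight to transform (PySpark 3.0+ ships a native DataFrame.transform; quinn provides one for older versions). The bodies of lower_case and trim_col here are illustrative stand-ins, not the current Optimus implementations.

from pyspark.sql.functions import col, lower, trim

def lower_case(columns):
    # Returns a function that lower-cases the selected columns of a DataFrame.
    def apply(df):
        targets = df.columns if columns == "*" else [columns]
        for c in targets:
            df = df.withColumn(c, lower(col(c)))
        return df
    return apply

def trim_col(column):
    # Returns a function that trims whitespace from a single column.
    def apply(df):
        return df.withColumn(column, trim(col(column)))
    return apply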

Let me know what you think!

Moving the read and write functions

I think it should not be necessary to instantiate the Utilities class to use the read and write operations.

For example:

# Import optimus
import optimus as op
# Import module for system tools 
import os

# Instance of Utilities class
tools = op.Utilities()
# Reading dataframe. os.getcwd() returns the current directory of the notebook 
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"

df = tools.read_dataset_csv(path=filePath,
 delimiter_mark=',')

I think the way pandas handles it is easier and more elegant.

import pandas as pd

# Load the dataset
df = pd.read_csv('mock_bank_data_original.csv')
df.to_csv('mock_bank_data_original_PART1.csv')
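
A minimal sketch of what a module-level helper could look like, wrapping the existing Utilities call so the user never instantiates it directly; read_csv here is a proposed name, not an existing Optimus function:

import optimus as op

def read_csv(path, sep=','):
    # Hide the op.Utilities() instantiation behind a pandas-style helper.
    tools = op.Utilities()
    return tools.read_dataset_csv(path=path, delimiter_mark=sep)

# Usage would then mirror pandas:
# df = read_csv('mock_bank_data_original.csv')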

Simplify plot_hist

Right now, to plot a histogram the user must:

priceDf = analyzer.get_data_frame.select("price") #or df.select("price")
hist_dictPri = analyzer.get_numerical_hist(df_one_col=priceDf, num_bars=10)
analyzer.plot_hist(df_one_col=priceDf, hist_dict= hist_dictPri, type_hist='categorical')

I think that we must simplify the function to:
analyzer.plot_hist(df, column, bins, type_hist='categorical')
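
A minimal sketch of the proposed signature, written as a thin wrapper over the three calls shown above (the wrapper name and defaults are suggestions, not the current API):

def plot_hist_simple(analyzer, column, bins=10, type_hist='categorical'):
    # Select the single column, build the histogram dict, then plot it.
    one_col_df = analyzer.get_data_frame.select(column)
    hist_dict = analyzer.get_numerical_hist(df_one_col=one_col_df, num_bars=bins)
    return analyzer.plot_hist(df_one_col=one_col_df, hist_dict=hist_dict, type_hist=type_hist)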

Setup Airbrake for your Python application

Installation

Using pip

pip install -U airbrake

Setup

The easiest way to get set up is with a few environment variables (You can find your project ID and API KEY with your project's settings):

export AIRBRAKE_API_KEY=<Your project API KEY>
export AIRBRAKE_PROJECT_ID=<Your project ID>
export AIRBRAKE_ENVIRONMENT=production

and you're done!

Otherwise, you can instantiate your AirbrakeHandler by passing these values as arguments to the getLogger() helper:

import airbrake


logger = airbrake.getLogger(api_key="<Your project API KEY>", project_id=<Your project ID>)


try:
    1/0
except Exception:
    logger.exception("Bad math.")

For more information please visit our official GitHub repo.


Improve usability

We should make the framework easy to use for people coming from pandas, Spark, R, etc.

I'm not saying let's copy the names, but I think some of them are too long or maybe confusing.

Timestamp error

Hi,
I get this error on the latest version while doing:

analyzer = op.DataFrameAnalyzer(df=df)
analyzer.column_analyze("*", plots=False, values_bar=False)

Error:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-282aba391ae8> in <module>()
      1 analyzer = op.DataFrameAnalyzer(df=df)
----> 2 analyzer.column_analyze("*", plots=False, values_bar=False)

~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in column_analyze(self, column_list, plots, values_bar, print_type, num_bars, print_all)
    596                 values_bar=values_bar,
    597                 num_bars=num_bars,
--> 598                 types_dict=types)
    599 
    600             # Save the invalid col if exists

~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in _analyze(self, df_col_analyzer, column, row_number, plots, print_type, values_bar, num_bars, types_dict)
    388         summary_dict = self._create_dict(
    389             ["name", "type", "total", "valid_values", "missing_values"],
--> 390             [column, types_dict[type_col], row_number, valid_values, missing_values
    391              ])
    392 

KeyError: 'timestamp'

Data Enrichment

It would be helpful for a user to have a function to enrich data using a REST API. For example, connect to the Google Maps API to geocode an address, or to the Fullcontact API to add additional info using the user's email.

The function must let the user configure the request rate limit and the URL.

We must explore how to add additional params, like API keys or any other param the API needs.
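
A minimal, hypothetical sketch of such a helper using requests; the function name, its parameters and the way extra params are merged are illustrative only, not an existing Optimus API:

import time
import requests

def enrich(df, url, params_func, rate_limit_per_sec=1.0, extra_params=None):
    # Call the REST endpoint once per row, honoring the configured rate limit.
    results = []
    delay = 1.0 / rate_limit_per_sec
    for row in df.collect():
        params = dict(params_func(row), **(extra_params or {}))  # e.g. API keys
        results.append(requests.get(url, params=params).json())
        time.sleep(delay)
    return results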


Refactor tests

We want to add DataFrame comparisons to the test suite to make it more robust, and to make the test suite handle the SparkSession efficiently. Proposed improvements are here: #150
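
A minimal sketch of both ideas, assuming pytest: a session-scoped SparkSession fixture created once for the whole suite, plus a simple DataFrame comparison helper. The names are illustrative, not part of #150:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One SparkSession shared across the whole test suite.
    session = SparkSession.builder.master("local[2]").appName("optimus-tests").getOrCreate()
    yield session
    session.stop()

def assert_df_equality(expected, actual):
    # Compare both schema and data.
    assert expected.schema == actual.schema
    assert expected.collect() == actual.collect()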

DataFrameTransformer.explode_table example is outdated

Following the explode_table example in the docs returns a "missing 1 required positional argument: 'list_to_assign'" error.

# Instantiation of DataFrameTransformer class:
transformer = op.DataFrameTransformer(df)

# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show()

# Transformation:
transformer.explode_table('bill id', 'foods', 'Beer')

# Printing new dataFrame:
print('New dataFrame:')
transformer.show()

Using PEP-8 style guide for naming conventions

I was going through the code and found that various methods and variables are named using camel case. Also, a few methods start with "__" (double underscores); we should use a single "_" underscore to differentiate internal methods from public ones. I think it would be a good idea to start using PEP-8 naming conventions for modules, functions, methods and variables. Please let me know your thoughts on this.
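
An illustrative before/after (the method names are made up for the example):

class BeforeStyle:
    def __readDatasetCsv(self, filePath):   # camelCase name, double leading underscore
        ...

class AfterStyle:
    def _read_dataset_csv(self, file_path):  # PEP-8 snake_case, single leading underscore
        ...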

Code not complete

On this documentation we have the following code:

# Import optimus
import optimus as op
#Import os module for system tools
import os

# Reading dataframe. os.getcwd() returns the current directory of the notebook
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"

df = tools.read_csv(path=filePath,
                            sep=',')

# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()

There is a tools = op.Utilities() line missing.
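
With that line added, the snippet presumably should read:

# Import optimus
import optimus as op
# Import os module for system tools
import os

# Instance of Utilities class (the missing line)
tools = op.Utilities()

filePath = "file:///" + os.getcwd() + "/foo.csv"
df = tools.read_csv(path=filePath, sep=',')

# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()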

Sampling and processing the whole data set

@FavioVazquez
I think there should be an easy way to take a data sample, apply all the operations to it, and then apply them to the whole dataset. That way, if we have a data set of 5 TB we do not need to process all the data, which could be time-consuming.

For example:

Sample Dataset

transformer.sample().trim_col("*") 
           .remove_special_chars("*") 
           .clear_accents("*") 
           .lower_case("*") 

Whole dataset

transformer.trim_col("*") 
           .remove_special_chars("*") 
           .clear_accents("*") 
           .lower_case("*") 

This approach should be the easiest to implement, but there are some problems here. The user has to copy-paste the whole chain and apply it to the whole dataset. Another approach could be something like this (not Python):

operations = [trim_col("*"), remove_special_chars("*"), clear_accents("*"), lower_case("*")]
# Transformation on the sample dataset
transformer.sample().apply(operations)
# Transformation on the whole dataset
transformer.apply(operations)
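
A minimal Python sketch of that idea: keep the operations in an ordered list of callables and fold them over the transformer with functools.reduce (the lambdas simply wrap the chained calls shown earlier; an apply method like the one in the pseudo-code is not assumed to exist):

from functools import reduce

operations = [
    lambda t: t.trim_col("*"),
    lambda t: t.remove_special_chars("*"),
    lambda t: t.clear_accents("*"),
    lambda t: t.lower_case("*"),
]

def apply_operations(transformer, ops):
    # Apply each operation in order, feeding the result into the next one.
    return reduce(lambda t, f: f(t), ops, transformer)

# Transformation on the sample dataset
sample_result = apply_operations(transformer.sample(), operations)
# Same transformation on the whole dataset
full_result = apply_operations(transformer, operations)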

The problem with a random sample

We talked about the possibility of using Spark's random sample function. Although this could be a first approach, I think there are two problems we have to tackle.

Empty data

The trainers do not accept empty data. If we cannot detect empty data in the sample and we do not remove it from the whole data set, the user could have problems at training time.

Outliers

If we cannot detect an outlier in the sample and we do not remove it from the whole data set, it could generate a flawed model.

We must be sure that our sample data set meets these requirements.

What can we do

I think we must find the fastest way to detect empty data and outliers in the sample function, and be sure that the user gets the most accurate representation of the whole data.
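
A minimal sketch of both checks on a pandas sample (illustrative only): count missing values per column and flag numeric outliers with the 1.5 * IQR rule.

import pandas as pd

def quick_quality_report(sample: pd.DataFrame):
    # Missing values per column.
    report = {"missing_per_column": sample.isna().sum().to_dict(), "outliers": {}}
    # Outlier counts per numeric column using the 1.5 * IQR rule.
    for column in sample.select_dtypes("number").columns:
        q1, q3 = sample[column].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (sample[column] < q1 - 1.5 * iqr) | (sample[column] > q3 + 1.5 * iqr)
        report["outliers"][column] = int(mask.sum())
    return report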
