hi-primus / optimus

:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

Home Page: https://hi-optimus.com

License: Apache License 2.0

Python 75.36% Shell 0.01% HTML 1.37% Jupyter Notebook 23.10% CSS 0.05% JavaScript 0.05% Dockerfile 0.05%
spark pyspark data-wrangling bigdata big-data-cleaning data-science data-cleansing data-cleaner data-transformation machine-learning

optimus's Introduction

Optimus

Logo Optimus


Overview

Optimus is an opinionated Python library to easily load, process, plot and create ML models that run over pandas, Dask, cuDF, Dask-cuDF, Vaex or Spark.

Some amazing things Optimus can do for you:

  • Process data using a simple API that is easy for newcomers to pick up.
  • More than 100 functions to handle strings, dates, URLs and emails.
  • Easily plot data of any size.
  • Out-of-the-box functions to explore and fix data quality.
  • Use the same code to process your data on your laptop or on a remote GPU cluster.

See Documentation

Try Optimus

To launch a live notebook server and try Optimus using Binder or Colab, click one of the following badges:

Binder Colab

Installation (pip):

In your terminal just type:

pip install pyoptimus

By default, Optimus installs pandas as its engine. To install other engines, use the following commands:

Engine Command
Dask pip install pyoptimus[dask]
cuDF pip install pyoptimus[cudf]
Dask-cuDF pip install pyoptimus[dask-cudf]
Vaex pip install pyoptimus[vaex]
Spark pip install pyoptimus[spark]

To install from the repo:

pip install git+https://github.com/hi-primus/optimus.git@develop-23.5

To install other engines:

pip install git+https://github.com/hi-primus/optimus.git@develop-23.5#egg=pyoptimus[dask]

Requirements

  • Python 3.7 or 3.8

Examples

You can go to 10 minutes to Optimus, where you can find the basics to start working in a notebook.

You can also go to the Examples section and find specific notebooks about data cleaning, data munging, profiling, data enrichment and how to create ML and DL models.

Here's a handy Cheat Sheet with the most common Optimus operations.

Start Optimus

Start Optimus using "pandas", "dask", "cudf", "dask_cudf", "vaex" or "spark".

from optimus import Optimus
op = Optimus("pandas")
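
To run the same pipeline on another engine, pass its name when starting Optimus. A minimal sketch, assuming the matching engine extra from the table above is installed:

# Any of the supported engine names can be passed here:
# "pandas", "dask", "cudf", "dask_cudf", "vaex" or "spark".
op = Optimus("dask")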

Loading data

Optimus can load data in CSV, JSON, Parquet, Avro and Excel formats, from a local file or from a URL.

# csv
df = op.load.csv("../examples/data/foo.csv")

# json
df = op.load.json("../examples/data/foo.json")

# using a url
df = op.load.json("https://raw.githubusercontent.com/hi-primus/optimus/develop-23.5/examples/data/foo.json")

# parquet
df = op.load.parquet("../examples/data/foo.parquet")

# ...or anything else
df = op.load.file("../examples/data/titanic3.xls")

Also, you can load data from Oracle, Redshift, MySQL and Postgres databases.

Saving Data

# csv
df.save.csv("data/foo.csv")

# json
df.save.json("data/foo.json")

# parquet
df.save.parquet("data/foo.parquet")

You can also save data to Oracle, Redshift, MySQL and Postgres.

Create dataframes

You can also create a dataframe from scratch:

df = op.create.dataframe({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 3, 5, 7],
    'C': [2, 4, 6, None],
    'D': ['1980/04/10', '1980/04/10', '1980/04/10', '1980/04/10']
})

Using display, you get a nicely formatted view of your data with extra information such as the column number, column data type and marked white spaces.

display(df)

Cleaning and Processing

Optimus was created to make data cleaning a breeze. The API was designed to be very easy for newcomers and familiar to people coming from pandas. Optimus expands the standard DataFrame functionality by adding .rows and .cols accessors.

For example, you can load data from a URL, transform it and apply some predefined cleaning functions:

new_df = df\
    .rows.sort("rank", "desc")\
    .cols.lower(["names", "function"])\
    .cols.date_format("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("date arrival", "dd-MM-YYYY", output_cols="from arrival")\
    .cols.normalize_chars("names")\
    .cols.remove_special_chars("names")\
    .rows.drop(df["rank"]>8)\
    .cols.rename("*", str.lower)\
    .cols.trim("*")\
    .cols.unnest("japanese name", output_cols="other names")\
    .cols.unnest("last position seen", separator=",", output_cols="pos")\
    .cols.drop(["last position seen", "japanese name", "date arrival", "cybertronian", "nulltype"])

Need help? 🛠️

Feedback

Feedback is what drives Optimus' future, so please take a couple of minutes to help shape the Optimus roadmap: http://bit.ly/optimus_survey

If you have a suggestion or feature request, please use https://github.com/hi-primus/optimus/issues

Troubleshooting

If you have issues, see our Troubleshooting Guide

Contributing to Optimus 💡

Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contribution, including:

  • Documentation updates, enhancements, designs, or bug fixes.
  • Spelling or grammar fixes.
  • README.md corrections or redesigns.
  • Adding unit or functional tests.
  • Triaging GitHub issues -- especially determining whether an issue still persists or is reproducible.
  • Blogging, speaking about, or creating tutorials about Optimus and its many features.
  • Helping others on our official chats.

Backers and Sponsors

Become a backer or a sponsor and get your image on our README on GitHub with a link to your site.


optimus's People

Contributors

argenisleon, arpit1997, atwoodjw, aviolante, codacy-badger, cool-pot, deepakjangid123, dependabot[bot], eschizoid, faviovazquez, jameslamb, jarrioja, joseangelhernao, lakhotiaharshit, lironco11, luis11011, luisboitas, mrpowers, niteshnicholas, pyup-bot, sergey48k, timgates42


optimus's Issues

Explore options for a different DataFrameTransformer interface

I'm not sure how much we'll want to explore this option. Just want to introduce a design pattern that works well with the Scala API of Spark.

The Spark Scala API has a nifty transform method that lets users chain user-defined transformations and methods defined in the Dataset class. See this blog post for more information.

I like the DataFrameTransformer class, but it doesn't let users easily access the native PySpark DataFrame methods.

We might want to take these methods out of the DataFrameTransformer class, so the user can mix and match the Optimus API and the PySpark API.

source_df\
    .transform(lambda df: lower_case(df, "*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(lambda df: trim_col(df, "address"))

The transform method is defined in quinn. I'd love to make an interface like this, but I'm not sure how to implement it in Python (see the sketch after the example below).

source_df\
    .transform(lower_case("*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(trim_col("address"))
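
One way to get that interface in Python, sketched below: write each helper as a curried function that returns a one-argument function of a DataFrame, so it can be passed straight to transform (PySpark 3.0+ ships a native DataFrame.transform; quinn provides one for older versions). The bodies of lower_case and trim_col here are illustrative stand-ins, not the current Optimus implementations.

from pyspark.sql.functions import col, lower, trim

def lower_case(columns):
    # Returns a function that lower-cases the selected columns of a DataFrame.
    def apply(df):
        targets = df.columns if columns == "*" else [columns]
        for c in targets:
            df = df.withColumn(c, lower(col(c)))
        return df
    return apply

def trim_col(column):
    # Returns a function that trims whitespace from a single column.
    def apply(df):
        return df.withColumn(column, trim(col(column)))
    return apply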

Let me know what you think!

Moving the read and write functions

I think it should not be necessary to instantiate the Utilities class to use the read and write operations.

For example:

# Import optimus
import optimus as op
# Import module for system tools 
import os

# Instance of Utilities class
tools = op.Utilities()
# Reading dataframe. os.getcwd() returns the current directory of the notebook 
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"

df = tools.read_dataset_csv(path=filePath,
 delimiter_mark=',')

I think the way pandas handles it is easier and more elegant.

import pandas as pd

# Load the dataset
df = pd.read_csv('mock_bank_data_original.csv')
df.to_csv('mock_bank_data_original_PART1.csv')
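
A minimal sketch of what a module-level helper could look like, wrapping the existing Utilities call so the user never instantiates it directly; read_csv here is a proposed name, not an existing Optimus function:

import optimus as op

def read_csv(path, sep=','):
    # Hide the op.Utilities() instantiation behind a pandas-style helper.
    tools = op.Utilities()
    return tools.read_dataset_csv(path=path, delimiter_mark=sep)

# Usage would then mirror pandas:
# df = read_csv('mock_bank_data_original.csv')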

Simplify plot_hist

Right now, to plot a histogram the user must:

priceDf = analyzer.get_data_frame.select("price") #or df.select("price")
hist_dictPri = analyzer.get_numerical_hist(df_one_col=priceDf, num_bars=10)
analyzer.plot_hist(df_one_col=priceDf, hist_dict= hist_dictPri, type_hist='categorical')

I think that we must simplify the function to:
analyzer.plot_hist(df, column, bins, type_hist='categorical')
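
A minimal sketch of the proposed signature, written as a thin wrapper over the three calls shown above (the wrapper name and defaults are suggestions, not the current API):

def plot_hist_simple(analyzer, column, bins=10, type_hist='categorical'):
    # Select the single column, build the histogram dict, then plot it.
    one_col_df = analyzer.get_data_frame.select(column)
    hist_dict = analyzer.get_numerical_hist(df_one_col=one_col_df, num_bars=bins)
    return analyzer.plot_hist(df_one_col=one_col_df, hist_dict=hist_dict, type_hist=type_hist)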

Setup Airbrake for your Python application

Installation

Using pip

pip install -U airbrake

Setup

The easiest way to get set up is with a few environment variables (You can find your project ID and API KEY with your project's settings):

export AIRBRAKE_API_KEY=<Your project API KEY>
export AIRBRAKE_PROJECT_ID=<Your project ID>
export AIRBRAKE_ENVIRONMENT=production

and you're done!

Otherwise, you can instantiate your AirbrakeHandler by passing these values as arguments to the getLogger() helper:

import airbrake


logger = airbrake.getLogger(api_key="<Your project API KEY>", project_id=<Your project ID>)


try:
    1/0
except Exception:
    logger.exception("Bad math.")

For more information please visit our official GitHub repo.


Improve usability

We should make the framework easy to use for people coming from pandas, Spark, R, etc.

I'm not saying let's copy the names, but I think some of them are too long or maybe confusing.

Timestamp error

Hi,
I get this error on the latest version while doing:

analyzer = op.DataFrameAnalyzer(df=df)
analyzer.column_analyze("*", plots=False, values_bar=False)

Error:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-282aba391ae8> in <module>()
      1 analyzer = op.DataFrameAnalyzer(df=df)
----> 2 analyzer.column_analyze("*", plots=False, values_bar=False)

~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in column_analyze(self, column_list, plots, values_bar, print_type, num_bars, print_all)
    596                 values_bar=values_bar,
    597                 num_bars=num_bars,
--> 598                 types_dict=types)
    599 
    600             # Save the invalid col if exists

~/Programs/anaconda/envs/dl/lib/python3.6/site-packages/optimus/df_analyzer.py in _analyze(self, df_col_analyzer, column, row_number, plots, print_type, values_bar, num_bars, types_dict)
    388         summary_dict = self._create_dict(
    389             ["name", "type", "total", "valid_values", "missing_values"],
--> 390             [column, types_dict[type_col], row_number, valid_values, missing_values
    391              ])
    392 

KeyError: 'timestamp'

Data Enrichment

It would be helpful for a user to have a function to enrich data using a REST API. For example, connect to the Google Maps API to geocode an address, or to the Fullcontact API to add additional info using the user's email.

The function must let the user configure the request rate limit and the URL.

We must explore how to add additional params, like API keys or any other param the API needs.
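
A minimal, hypothetical sketch of such a helper using requests; the function name, its parameters and the way extra params are merged are illustrative only, not an existing Optimus API:

import time
import requests

def enrich(df, url, params_func, rate_limit_per_sec=1.0, extra_params=None):
    # Call the REST endpoint once per row, honoring the configured rate limit.
    results = []
    delay = 1.0 / rate_limit_per_sec
    for row in df.collect():
        params = dict(params_func(row), **(extra_params or {}))  # e.g. API keys
        results.append(requests.get(url, params=params).json())
        time.sleep(delay)
    return results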


Refactor tests

We want to add DataFrame comparisons to the test suite to make it more robust, and to make the test suite handle the SparkSession efficiently. Proposed improvements are here: #150
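
A minimal sketch of both ideas, assuming pytest: a session-scoped SparkSession fixture created once for the whole suite, plus a simple DataFrame comparison helper. The names are illustrative, not part of #150:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One SparkSession shared across the whole test suite.
    session = SparkSession.builder.master("local[2]").appName("optimus-tests").getOrCreate()
    yield session
    session.stop()

def assert_df_equality(expected, actual):
    # Compare both schema and data.
    assert expected.schema == actual.schema
    assert expected.collect() == actual.collect()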

DataFrameTransformer.explode_table example is outdated

Following the explode_table example in the docs returns a "missing 1 required positional argument: 'list_to_assign'" error.

# Instantiation of DataFrameTransformer class:
transformer = op.DataFrameTransformer(df)

# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show()

# Transformation:
transformer.explode_table('bill id', 'foods', 'Beer')

# Printing new dataFrame:
print('New dataFrame:')
transformer.show()

Using PEP-8 style guide for naming conventions

I was going through the code and found that various methods and variables are named using camel case. Also, a few methods start with "__" (double underscores); we should use a single "_" underscore to differentiate internal methods from public ones. I think it would be a good idea to start using PEP-8 naming conventions for modules, functions, methods and variables. Please let me know your thoughts on this.
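
An illustrative before/after (the method names are made up for the example):

class BeforeStyle:
    def __readDatasetCsv(self, filePath):   # camelCase name, double leading underscore
        ...

class AfterStyle:
    def _read_dataset_csv(self, file_path):  # PEP-8 snake_case, single leading underscore
        ...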

Code not complete

On this documentation we have the following code:

# Import optimus
import optimus as op
#Import os module for system tools
import os

# Reading dataframe. os.getcwd() returns the current directory of the notebook
# 'file:///' is a prefix that specifies the type of file system used, in this
# case, local file system (hard drive of the pc) is used.
filePath = "file:///" + os.getcwd() + "/foo.csv"

df = tools.read_csv(path=filePath,
                            sep=',')

# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()

There is a tools = op.Utilities() line missing.
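
With that line added, the snippet presumably should read:

# Import optimus
import optimus as op
# Import os module for system tools
import os

# Instance of Utilities class (the missing line)
tools = op.Utilities()

filePath = "file:///" + os.getcwd() + "/foo.csv"
df = tools.read_csv(path=filePath, sep=',')

# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()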

Sampling and processing the whole data set

@FavioVazquez
I think there should be an easy way to take a data sample, apply all the operations to it, and then apply them to the whole dataset. That way, if we have a data set of 5 TB we do not need to process all the data, which could be time-consuming.

For example:

Sample Dataset

transformer.sample().trim_col("*") 
           .remove_special_chars("*") 
           .clear_accents("*") 
           .lower_case("*") 

Whole dataset

transformer.trim_col("*") 
           .remove_special_chars("*") 
           .clear_accents("*") 
           .lower_case("*") 

This approach should be the easiest to implement, but there are some problems here. The user has to copy-paste the whole chain and apply it to the whole dataset. Another approach could be something like this (not Python):

operations = [trim_col("*"), remove_special_chars("*"), clear_accents("*"), lower_case("*")]
# Transformation on the sample dataset
transformer.sample().apply(operations)
# Transformation on the whole dataset
transformer.apply(operations)
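
A minimal Python sketch of that idea: keep the operations in an ordered list of callables and fold them over the transformer with functools.reduce (the lambdas simply wrap the chained calls shown earlier; an apply method like the one in the pseudo-code is not assumed to exist):

from functools import reduce

operations = [
    lambda t: t.trim_col("*"),
    lambda t: t.remove_special_chars("*"),
    lambda t: t.clear_accents("*"),
    lambda t: t.lower_case("*"),
]

def apply_operations(transformer, ops):
    # Apply each operation in order, feeding the result into the next one.
    return reduce(lambda t, f: f(t), ops, transformer)

# Transformation on the sample dataset
sample_result = apply_operations(transformer.sample(), operations)
# Same transformation on the whole dataset
full_result = apply_operations(transformer, operations)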

The problem with a random sample

We talked about the possibility of using Spark's random sample function. Although this could be a first approach, I think there are two problems we have to tackle.

Empty data

The trainers do not accept empty data. If we cannot detect empty data in the sample and we do not remove it from the whole data set, the user could have problems at training time.

Outliers

If we cannot detect an outlier in the sample and we do not remove it from the whole data set, it could generate a flawed model.

We must be sure that our sample data set meets these requirements.

What can we do

I think we must find the fastest way to detect empty data and outliers in the sample function, and be sure that the user gets the most accurate representation of the whole data.
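
A minimal sketch of both checks on a pandas sample (illustrative only): count missing values per column and flag numeric outliers with the 1.5 * IQR rule.

import pandas as pd

def quick_quality_report(sample: pd.DataFrame):
    # Missing values per column.
    report = {"missing_per_column": sample.isna().sum().to_dict(), "outliers": {}}
    # Outlier counts per numeric column using the 1.5 * IQR rule.
    for column in sample.select_dtypes("number").columns:
        q1, q3 = sample[column].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (sample[column] < q1 - 1.5 * iqr) | (sample[column] > q3 + 1.5 * iqr)
        report["outliers"][column] = int(mask.sum())
    return report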
