
data-describe / awesome-data-science-models

A few end-to-end examples that use data-describe

License: Apache License 2.0

Jupyter Notebook 93.59% Python 5.40% HTML 0.89% Dockerfile 0.11%

awesome-data-science-models's Introduction

data-describe

data-describe is a Python toolkit for Exploratory Data Analysis (EDA). It aims to accelerate data exploration and analysis by providing automated and polished analysis widgets.

For more examples of data-describe in action, see the Quick Start Tutorial.

Main Features

data-describe implements the following basic features:

Feature	Description
Data Summary	Curated data summary
Data Heatmap	Data variation and missingness heatmap
Correlation Matrix	Correlation heatmaps with categorical support
Distribution Plots	Generate histograms, violin plots, bar charts
Scatterplots	Generate scatterplots and evaluate with scatterplot diagnostics
Cluster Analysis	Automated clustering and plotting
Feature Ranking	Evaluate feature importance using tree models

Extended Features

data-describe is always looking to elevate the standard for Exploratory Data Analysis. Here are a few of the extended features that are already implemented:

Dimensionality Reduction Methods
Sensitive Data (PII) Redaction
Text Pre-processing / Topic Modeling
Big Data Support

Installation

data-describe can be installed using pip:

pip install data-describe

Getting Started

import data_describe as dd
help(dd)

See the User Guide for more information.

Project Status

data-describe is currently in beta status.

Contributing

data-describe welcomes contributions from the community.


awesome-data-science-models's Issues

Compile command in pitch predictor notebook has the wrong path

Change !python baseball-pipeline-single.py to !python pipelines/baseball-pipeline-single.py

AI Platform prediction hosting fails

(project name has been redacted)

Using endpoint [https://us-central1-ml.googleapis.com/]
Listed 0 items.
Using endpoint [https://ml.googleapis.com/]
ERROR: (gcloud.beta.ai-platform.models.create) Resource in projects [...] is the subject of a conflict: Field: model.name Error: A model with the same name already exists.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: A model with the same name already exists.
    field: model.name
Traceback (most recent call last):
  File "host_xgboost.py", line 46, in <module>
    run()
  File "host_xgboost.py", line 35, in run
    subprocess.check_call([shutil.which('gcloud'),'beta', 'ai-platform','models','create',MODEL_NAME,'--regions','us-central1', "--enable-logging", "--enable-console-logging"])
  File "/usr/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcloud', 'beta', 'ai-platform', 'models', 'create', 'xgboost_FT', '--regions', 'us-central1', '--enable-logging', '--enable-console-logging']' returned non-zero exit status 1
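
One way to make the hosting script tolerant of reruns is to treat the "already exists" conflict as a non-fatal outcome. The sketch below reuses the gcloud command shown in the traceback. The function name create_model_if_missing and the injectable run parameter are illustrative assumptions and are not part of host_xgboost.py.

```python
import shutil
import subprocess


def create_model_if_missing(model_name, region="us-central1", run=subprocess.run):
    """Create an AI Platform model, tolerating an 'already exists' conflict.

    Returns True if the model was created and False if it already existed.
    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without a real gcloud installation.
    """
    gcloud = shutil.which("gcloud") or "gcloud"
    # Same command as in the traceback above.
    cmd = [
        gcloud, "beta", "ai-platform", "models", "create", model_name,
        "--regions", region, "--enable-logging", "--enable-console-logging",
    ]
    result = run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return True
    # gcloud reports the conflict on stderr; treat it as "nothing to do".
    if "already exists" in (result.stderr or ""):
        return False
    # Any other failure is still an error.
    raise subprocess.CalledProcessError(
        result.returncode, cmd, output=result.stdout, stderr=result.stderr
    )
```

With this approach a second run of the script can continue to the version-creation step instead of stopping at the model-creation step.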

move all data to the Maven Wave GCS Buckets

All the data should be pulled from the MW public GCP buckets, but the examples don't seem to pull from these.

Note for MW folks: this is https://console.cloud.google.com/storage/browser/amazing-public-data;tab=objects?forceOnBucketsSortingFiltering=false&project=data-describe&prefix=&forceOnObjectsSortingFiltering=false

For example:

Census-income uses https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data and *.text
black-friday uses https://github.com/data-describe/awesome-data-science-models/tree/master/black-friday/EDA/Data
cellular image uses bipin_workshop_bucket; it should use the public bucket
Chicago taxi looks fine as it uses a public GCP data set, but we need to give users instructions on how to authenticate their own GCP project for access
lending club should be moved from gs://amazing-public-data/lending_club/lending_club_data.tsv
pitch predictor uses a Collect-Games method (https://github.com/data-describe/awesome-data-science-models/blob/master/pitch-predictor/answers/components/collectStats/ParDoFns/collectGames.py#L13); we should probably freeze that data and put it in our public bucket

remove any client references

Some paths contain client names; these need to be removed before the repository is made public.

pitch predictor missing requirements file

Add mention on original source for examples

Please add in README.md where each example originated.

create IOT example with NASA Bearing Sensor dataset

create IOT example with NASA Bearing Sensor Dataset

EDA could/should use the distribution() feature of data describe

Are there any usability shortcomings that prevent using dd.distribution() in the EDA notebooks? These plots are always created manually.

Update/extend incomplete model variants and deployments for pitch predictor

Pitch predictor contains partially implemented components for a random forest model (training and deployment) as well as a Seldon-deployed model.

Prep for workshop

Prep the following.

Give overview for each, create readme.md with an overview of the problem, description of the data, and the problem statement. Where possible make references to the GCP tool set. Possibly make a sub directory of each with GCP as the name to create GCP specific training.

Census
Beatles
Lending Club

Make sure each one has filled out the following:

Document AI Demo should use real-world training data

The Document AI demo should incorporate real-world training data. Right now, it uses a very small set of dummy data in order to train the classification algorithm.

NASA IoT demo should parameterize gcp project names in dockerfiles

The GCP project name is currently hardcoded to mwpmltr. While this is a straightforward find-and-replace, it would be nicer if it were templated out or dynamically filled based on appropriate environment variables.
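
A minimal sketch of the templating idea, assuming the Dockerfiles are kept as templates with a ${GCP_PROJECT} placeholder and rendered before the build. The variable name GCP_PROJECT, the function render_dockerfile, and the image path in the usage example are hypothetical and do not come from the repository.

```python
import os
from string import Template


def render_dockerfile(template_text, env=None):
    """Fill the ${GCP_PROJECT} placeholder from an environment mapping.

    GCP_PROJECT is an assumed variable name. Other $-style references
    (for example $PATH inside an ENV instruction) are left untouched.
    """
    env = os.environ if env is None else env
    project = env.get("GCP_PROJECT")
    if not project:
        raise ValueError("GCP_PROJECT is not set")
    # safe_substitute replaces only the names it is given.
    return Template(template_text).safe_substitute(GCP_PROJECT=project)


if __name__ == "__main__":
    # Hypothetical template content, for illustration only.
    template = "FROM gcr.io/${GCP_PROJECT}/base:latest\nENV PATH=$PATH:/app\n"
    print(render_dockerfile(template, env={"GCP_PROJECT": "my-project"}))
```

An alternative that avoids a rendering step is to declare a build argument with ARG in the Dockerfile and pass the project name with docker build --build-arg.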

DocAI

include the demo

Use versioned pipelines in pitch predictor

create unit test and github actions to run on GCP/AWS/Azure

Create unit tests for each and a corresponding github action to run examples.

Check/verify "enhanced" feature pipeline in pitch predictor

Pipeline has not been validated after update

IoT Condition Monitoring - Paderborn Bearing Dataset

The aim of this issue is to add more IoT-related use cases to the awesome-data-science-models repository. The use case selected for this activity is experimental bearing data sets for condition monitoring based on vibration and motor current signals.

The goal is to move from condition monitoring to predictive maintenance. Modeling approaches to consider include predicting the Remaining Useful Life (RUL) of a bearing so that corrective maintenance measures can be taken in time. This will help avoid unplanned machine downtime.

Many plants face unplanned downtime as their assets fail over time, so plants should be able to predict the remaining life of those assets. With the help of data science, the first step is to build an ML tool to analyze the condition or health of machine assets.

Our dataset of interest (the Paderborn Bearing Dataset) has the following details:

The dataset contains motor currents and vibration signals, with additional measurements of torque, speed, load and temperature.
There are 26 damaged and 6 healthy bearing states in the data.
All the data is collected under 4 different operating conditions.

Dataset - https://mb.uni-paderborn.de/en/kat/main-research/datacenter/bearing-datacenter/data-sets-and-download

With the help of data science techniques, the aim is to analyze the condition of assets, identify features that contribute to failure, and build effective ML models for RUL prediction.
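
As an illustration of the kind of features such an analysis might start from, the sketch below computes a few common time-domain statistics for one window of a vibration signal. The feature set (RMS, peak, crest factor, kurtosis) and the function name are assumptions chosen for illustration; the issue does not prescribe specific features.

```python
import math


def vibration_features(signal):
    """Return simple time-domain features for one window of samples."""
    n = len(signal)
    if n == 0:
        raise ValueError("signal must not be empty")
    mean = sum(signal) / n
    # Root mean square: overall vibration energy.
    rms = math.sqrt(sum(x * x for x in signal) / n)
    # Largest absolute amplitude in the window.
    peak = max(abs(x) for x in signal)
    # Population variance and (non-excess) kurtosis.
    var = sum((x - mean) ** 2 for x in signal) / n
    fourth = sum((x - mean) ** 4 for x in signal) / n
    kurtosis = fourth / var ** 2 if var > 0 else float("nan")
    # Crest factor: how spiky the signal is relative to its energy.
    crest = peak / rms if rms > 0 else float("nan")
    return {"rms": rms, "peak": peak, "crest_factor": crest, "kurtosis": kurtosis}
```

Features like these could be computed per window and per operating condition, then passed to a feature-ranking step or a model.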

Requirement for Taxi Cab

The requirements.txt states TF1, but we are using TF2.

https://github.com/data-describe/awesome-data-science-models/tree/master/chicago-taxi
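
A mismatch like this can be caught with a small check on the pinned version. The sketch below is a simplified parser that only handles exact pins of the form package==X.Y.Z. The function name and the sample requirements text are hypothetical.

```python
import re


def pinned_major_version(requirements_text, package):
    """Return the major version of an exact pin (package==X...), or None."""
    pattern = re.compile(
        r"^\s*" + re.escape(package) + r"\s*==\s*(\d+)", re.IGNORECASE
    )
    for line in requirements_text.splitlines():
        # Drop trailing comments before matching.
        line = line.split("#", 1)[0]
        match = pattern.match(line)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    sample = "numpy==1.19.5\ntensorflow==1.15.0\n"
    major = pinned_major_version(sample, "tensorflow")
    if major is not None and major < 2:
        print("requirements.txt pins TensorFlow 1.x but the code expects 2.x")
```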

Google AI Platform Divergence

Ending with commit 7d748cf, additional work on these demos will begin to be targeted towards Google Vertex AI vs the traditional Google AI Platform. AI Platform is still available but expected to be deprecated sometime in the future.

Document AI Demo should have EDA using data-describe

All of the examples in this repository use the data-describe tool to perform exploratory data analysis on the training data. This step needs to be added to the Document AI example.

chicago-taxi

Rework the current chicago-taxi solution. There are dependency conflicts with the current implementation on AI Platform. May need to upgrade the runtime, keras version, and deployment strategy.
