Awesome Data Science with Python

A curated list of awesome resources for practicing data science using Python, including not only libraries, but also links to tutorials, code snippets, blog posts and talks.

Core

pandas - Data structures built on top of numpy.
scikit-learn - Core ML library.
matplotlib - Plotting library.
animatplot - Animated plots built on matplotlib.
seaborn - Python data visualization library based on matplotlib.
pandas_summary - Basic statistics using DataFrameSummary(df).summary().
pandas_profiling - Descriptive statistics using ProfileReport.
sklearn_pandas - Helpful DataFrameMapper class.
janitor - Clean messy column names.
missingno - Missing data visualization.
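
As a rough baseline for what the summary and missing-data helpers above report, the sketch below uses only plain pandas and numpy. The toy DataFrame and its column names are made up for illustration, and this is not the API of pandas_summary, pandas_profiling, or missingno.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two numeric columns with some missing values.
df = pd.DataFrame(
    {
        "age": [23, 35, np.nan, 41, 29],
        "income": [48000.0, np.nan, np.nan, 72000.0, 51000.0],
    }
)

# Descriptive statistics for numeric columns (count, mean, std, quartiles).
summary = df.describe()

# Fraction of missing values per column, a plain-pandas stand-in for a
# missing-data overview.
missing_fraction = df.isna().mean()

print(summary)
print(missing_fraction)
```

The dedicated packages add richer reports and plots on top of this kind of information.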

Pandas and Jupyter

General tricks: link
cookiecutter-data-science - Project template for data science projects.
nteract - Open Jupyter Notebooks with a double-click.
modin - Parallelization library for faster pandas DataFrame.
swifter - Apply any function to a pandas dataframe faster.
xarray - Extends pandas to n-dimensional arrays.
blackcellmagic - Code formatting for jupyter notebooks.
pivottablejs - Drag n drop Pivot Tables and Charts for jupyter notebooks.
qgrid - Pandas DataFrame sorting.
nbdime - Diff two notebook files, Alternative GitHub App: ReviewNB.

Extraction

textract - Extract text from any document.
camelot - Extract text from PDF.

Big Data

spark - DataFrame for big data, cheatsheet, tutorial.
sparkit-learn - PySpark + Scikit-learn.
dask, dask-ml - Pandas DataFrame for big data and machine learning library, resources, talk1, talk2, notebooks, videos.
turicreate - Helpful SFrame class for out-of-memory dataframes.
h2o - Helpful H2OFrame class for out-of-memory dataframes.
datatable - Data Table for big data support.
cuDF - GPU DataFrame Library.
ray - Flexible, high-performance distributed execution framework.
mars - Tensor-based unified framework for large-scale data computation.
bottleneck - Fast NumPy array functions written in C.
bcolz - A columnar data container that can be compressed.
cupy - NumPy-like API accelerated with CUDA.

Command line tools

ni - Command line tool for big data.
xsv - Command line tool for indexing, slicing, analyzing, splitting and joining CSV files.
csvkit - Another command line tool for CSV files.
csvsort - Sort large csv files.

Statistics

Visualizations - Null Hypothesis Significance Testing (NHST), Correlation, Cohen's d, Confidence Interval, Equivalence, non-inferiority and superiority testing, Bayesian two-sample t test, Distribution of p-values when comparing two groups, Understanding the t-distribution and its normal approximation
Common statistical tests explained
Bland-Altman Plot - Plot for agreement between two methods of measurement.
scikit-posthocs - Statistical post-hoc tests for pairwise multiple comparisons.

Exploration and Cleaning

impyute - Imputations.
fancyimpute - Matrix completion and imputation algorithms.
imbalanced-learn - Resampling for imbalanced datasets.
tspreprocess - Time series preprocessing: Denoising, Compression, Resampling.
Kaggler - Utility functions (OneHotEncoder(min_obs=100))
pyupset - Visualizing intersecting sets.
pyemd - Earth Mover's Distance, similarity between histograms.

Feature Engineering

sklearn - Pipeline, examples.
pdpipe - Pipelines for DataFrames.
few - Feature engineering wrapper for sklearn.
skoot - Pipeline helper functions.
categorical-encoding - Categorical encoding of variables, vtreat (R package).
dirty_cat - Encoding dirty categorical variables.
patsy - R-like syntax for statistical models.
mlxtend - LDA.
featuretools - Automated feature engineering, example.
tsfresh - Time series feature engineering.
pypeln - Concurrent data pipelines.
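
The sklearn Pipeline entry above can be illustrated with a minimal sketch. The synthetic dataset, the step names "scale" and "model", and the choice of estimator are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 200 samples, 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain preprocessing and the model so that scaling is fit on the
# training split only and reapplied unchanged at prediction time.
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which is the main reason to prefer a Pipeline over separate fit and transform calls.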

Feature Selection

Tutorial, Talk
scikit-feature - Feature selection algorithms.
stability-selection - Stability selection.
scikit-rebate - Relief-based feature selection algorithms.
scikit-genetic - Genetic feature selection.
boruta_py - Feature selection, explanation, example.
linselect - Feature selection package.

Dimensionality Reduction

prince - Dimensionality reduction, factor analysis (PCA, MCA, CA, FAMD).
sklearn - Multidimensional scaling (MDS).
sklearn - t-distributed Stochastic Neighbor Embedding (t-SNE), intro. Faster implementations: lvdmaaten, MulticoreTSNE.
sklearn - Truncated SVD (aka LSA).
mdr - Dimensionality reduction, multifactor dimensionality reduction (MDR).
umap - Uniform Manifold Approximation and Projection.
FIt-SNE - Fast Fourier Transform-accelerated Interpolation-based t-SNE.

Visualization

All charts, Austrian monuments.
cufflinks - Dynamic visualization library, wrapper for plotly, medium, example.
physt - Better histograms, talk.
matplotlib_venn - Venn diagrams.
joypy - Draw stacked density plots.
mosaic plots - Categorical variable visualization, example.
yellowbrick - Wrapper for matplotlib for diagnostic ML plots.
bokeh - Interactive visualization library, Examples, Examples.
plotnine - ggplot for Python.
altair - Declarative statistical visualization library.
bqplot - Plotting library for IPython/Jupyter Notebooks.
holoviews - Visualization library.
dtreeviz - Decision tree visualization and model interpretation.
chartify - Generate charts.
VivaGraphJS - Graph visualization (JS package).
pm - Navigable 3D graph visualization (JS package), example.
python-ternary - Triangle plots.
falcon - Interactive visualizations for big data.

Dashboards

dash - Dashboarding solution by plot.ly. Tutorial: 1, 2, 3, 4, 5
bokeh - Dashboarding solution.
visdom - Dashboarding library by facebook.
bowtie - Dashboarding solution.
panel - Dashboarding solution.
altair example - Video

Geographical Tools

folium - Plot geographical maps using the Leaflet.js library.
stadiamaps - Plot geographical maps.
datashader - Draw millions of points on a map.
sklearn - BallTree, Example.
pynndescent - Nearest neighbor descent for approximate nearest neighbors.
geocoder - Geocoding of addresses, IP addresses.
Conversion of different geo formats: talk, repo
geopandas - Tools for geographic data
Low Level Geospatial Tools (GEOS, GDAL/OGR, PROJ.4)
Vector Data (Shapely, Fiona, Pyproj)
Raster Data (Rasterio)
Plotting (Descartes, Cartopy)
Predict economic indicators from Open Street Map ipynb.

Recommender Systems

Examples: 1, 2, 2-ipynb, 3.
surprise - Recommender, talk.
turicreate - Recommender.
implicit - Fast Python Collaborative Filtering for Implicit Feedback Datasets.
spotlight - Deep recommender models using PyTorch.
lightfm - Recommendation algorithms for both implicit and explicit feedback.
funk-svd - Fast SVD.
pywFM - Factorization.

Decision Tree Models

lightgbm - Gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, doc.
xgboost - Gradient boosting (GBDT, GBRT or GBM) library, doc, Methods for CIs: link1, link2.
catboost - Gradient boosting.
thundergbm - GBDTs and Random Forest.
h2o - Gradient boosting.
forestci - Confidence intervals for random forests.
scikit-garden - Quantile Regression.
grf - Generalized random forest.
dtreeviz - Decision tree visualization and model interpretation.
rfpimp - Feature importance for random forests using permutation importance.
Why the default feature importance for random forests is wrong: link
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
bartpy - Bayesian Additive Regression Trees.
infiniteboost - Combination of RFs and GBDTs.
merf - Mixed Effects Random Forest for Clustering, video

Natural Language Processing (NLP) / Text Processing

talk-nb, nb2, talk.
Text classification Intro, Preprocessing blog post.
gensim - NLP, doc2vec, word2vec, text processing, topic modelling (LSA, LDA), Example, Coherence Model for evaluation.
Embeddings - GloVe ([1], [2]), StarSpace, wikipedia2vec.
pyldavis - Visualization for topic modelling.
spaCy - NLP.
NLTK - NLP, helpful KMeansClusterer with cosine_distance.
pytext - NLP from Facebook.
fastText - Efficient text classification and representation learning.
annoy - Approximate nearest neighbor search.
faiss - Approximate nearest neighbor search.
pysparnn - Approximate nearest neighbor search.
infomap - Cluster (word-)vectors to find topics, example.
datasketch - Probabilistic data structures for large data (MinHash, HyperLogLog).
flair - NLP Framework by Zalando.
stanfordnlp - NLP Library.

Papers

Search Engine Correlation

Image Processing

cv2 - OpenCV, classical algorithms: Gaussian Filter, Morphological Transformations.
scikit-image - Image processing.
mahotas - Image processing (Bioinformatics), example.

Neural Networks

Reading

Convolutional Neural Networks for Visual Recognition
Cell Segmentation Talk
Cell Segmentation Blog Post 2
Deep Learning Book
Tutorial
Feature Visualization: Blog, PPT
Talk: Extracting knowledge of big NNs to smaller NNs
Visualization of optimization algorithms

Image Related

keras preprocessing - Preprocess images.
imgaug - More sophisticated image preprocessing.
imgaug_extension - Extension for imgaug.
albumentations - Wrapper around imgaug and other libraries.
Augmentor - Image augmentation library.
tcav - Interpretability method.
cutouts-explorer - Image Viewer.

Text Related

ktext - Utilities for pre-processing text for deep learning in Keras.
textgenrnn - Ready-to-use LSTM for text generation.

Libs

keras - Neural Networks on top of tensorflow.
keras-contrib - Keras community contributions.
hyperas - Keras + Hyperopt: Convenient hyperparameter optimization wrapper.
elephas - Distributed Deep learning with Keras & Spark.
tflearn - Neural Networks on top of tensorflow.
tensorlayer - Neural Networks on top of tensorflow, tricks.
tensorforce - Tensorflow for applied reinforcement learning.
fastai - Neural Networks in pytorch.
ignite - High-level library for pytorch.
skorch - Scikit-learn compatible neural network library that wraps pytorch.
Detectron - Object Detection by Facebook.
autokeras - AutoML for deep learning.
simpledet - Object Detection and Instance Recognition.
PlotNeuralNet - Plot neural networks.
lucid - Neural network interpretability, Activation Maps.
AdaBound - Optimizer that trains as fast as Adam and as good as SGD.
caffe - Deep learning framework, pretrained models.
foolbox - Adversarial examples that fool neural networks.
hiddenlayer - Training metrics.
imgclsmob - Pretrained models.

Snippets

Simple Keras models
Entity Embeddings of Categorical Variables, code, kaggle

GPU

cuML - Run traditional tabular ML tasks on GPUs.
thundergbm - GBDTs and Random Forest.
thundersvm - Support Vector Machines.

Regression

Understanding SVM Regression: slides, forum, paper

pyearth - Multivariate Adaptive Regression Splines (MARS), tutorial.
pygam - Generalized Additive Models (GAMs), Explanation.
GLRM - Generalized Low Rank Models.

Classification

All classification metrics
DESlib - Dynamic classifier and ensemble selection

Clustering

pyclustering - All sorts of clustering algorithms.
somoclu - Self-organizing map.
hdbscan - Clustering algorithm.
nmslib - Similarity search library and toolkit for evaluation of k-NN methods.
buckshotpp - Outlier-resistant and scalable clustering algorithm.
merf - Mixed Effects Random Forest for Clustering, video

Interpretable Classifiers and Regressors

skope-rules - Interpretable classifier, IF-THEN rules.
sklearn-expertsys - Interpretable classifiers, Bayesian Rule List classifier.

Multi-label classification

scikit-multilearn - Multi-label classification, talk.

Time Series

Signal Processing Book
Filter Design: Article, Interactive Tool, Filter examples
Talk
statsmodels - Time series analysis, seasonal decompose example, SARIMA, granger causality.
pyramid, pmdarima - Wrapper for (Auto-) ARIMA.
pyflux - Time series prediction algorithms (ARIMA, GARCH, GAS, Bayesian).
prophet - Time series prediction library.
htsprophet - Hierarchical Time Series Forecasting using Prophet.
tensorflow - LSTM and others, examples: link, link, link, Explain LSTM, seq2seq: 1, 2, 3, 4
tspreprocess - Preprocessing: Denoising, Compression, Resampling.
tsfresh - Time series feature engineering.
thunder - Data structures and algorithms for loading, processing, and analyzing time series data.
gatspy - General tools for Astronomical Time Series, talk.
gendis - shapelets, example.
tslearn - Time series clustering and classification, TimeSeriesKMeans.
pastas - Simulation of time series.
fastdtw - Dynamic Time Warp Distance.
fable - Time Series Forecasting (R package).
CausalImpact - Causal Impact Analysis (R package).
pydlm - Bayesian time series modeling (R package, Blog post)
PyAF - Automatic Time Series Forecasting.
luminol - Anomaly Detection and Correlation library from Linkedin.
matrixprofile-ts - Detecting patterns and anomalies, website, ppt.
obspy - Seismology package. Useful classic_sta_lta function.
RobustSTL - Robust Seasonal-Trend Decomposition.
seglearn - Time Series library.

Financial Data

pyfolio - Portfolio and risk analytics.
zipline - Algorithmic trading.
alphalens - Performance analysis of predictive stock factors.

Survival Analysis

Time-dependent Cox Model in R.
lifelines - Survival analysis, Cox PH Regression, talk, talk2.
scikit-survival - Survival analysis.
xgboost - "objective": "survival:cox" NHANES example
survivalstan - Survival analysis, intro.
convoys - Analyze time lagged conversions.
RandomSurvivalForests (R packages: randomForestSRC, ggRandomForests).

Outlier Detection & Anomaly Detection

sklearn - Isolation Forest and others.
pyod - Outlier Detection / Anomaly Detection.
eif - Extended Isolation Forest.
AnomalyDetection - Anomaly detection (R package).
luminol - Anomaly Detection and Correlation library from Linkedin.

Ranking

lightning - Large-scale linear classification, regression and ranking.

Scoring

SLIM - Scoring systems for classification, Supersparse linear integer models.

Probabilistic Modeling and Bayes

Intro, Guide
PyMC3 - Bayesian modelling, intro
pomegranate - Probabilistic modelling, talk.
pmlearn - Probabilistic machine learning.
arviz - Exploratory analysis of Bayesian models.
zhusuan - Bayesian deep learning, generative models.
dowhy - Estimate causal effects.
edward - Probabilistic modeling, inference, and criticism, Mixture Density Networks (MDNs), MDN Explanation.
Pyro - Deep Universal Probabilistic Programming

Stacking Models and Ensembles

Model Stacking Blog Post
mlxtend - EnsembleVoteClassifier, StackingRegressor, StackingCVRegressor for model stacking.
vecstack - Stacking ML models.
StackNet - Stacking ML models.
mlens - Ensemble learning.

Model Evaluation

pycm - Multi-class confusion matrix.
pandas_ml - Confusion matrix.
Plotting learning curve: link.
yellowbrick - Learning curve.

Model Explanation, Interpretability, Feature Importance

Book, Examples
shap - Explain predictions of machine learning models, talk.
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
lime - Explaining the predictions of any machine learning classifier, talk, Warning (Myth 7).
lime_xgboost - Create LIMEs for XGBoost.
eli5 - Inspecting machine learning classifiers and explaining their predictions.
lofo-importance - Leave One Feature Out Importance, talk.
pybreakdown - Generate feature contribution plots.
FairML - Model explanation, feature importance.
pycebox - Individual Conditional Expectation Plot Toolbox.
pdpbox - Partial dependence plot toolbox, example.
partial_dependence - Visualize and cluster partial dependence.
skater - Unified framework to enable model interpretation.
anchor - High-Precision Model-Agnostic Explanations for classifiers.
l2x - Instancewise feature selection as methodology for model interpretation.
contrastive_explanation - Contrastive explanations.
DrWhy - Collection of tools for explainable AI.
lucid - Neural network interpretability.
xai - An eXplainability toolbox for machine learning.

Automated Machine Learning

AdaNet - Automated machine learning based on tensorflow.
tpot - Automated machine learning tool, optimizes machine learning pipelines.
auto_ml - Automated machine learning for analytics & production.
autokeras - AutoML for deep learning.
nni - Toolkit for neural architecture search and hyper-parameter tuning by Microsoft.
automl-gs - Automated machine learning.

Evolutionary Algorithms & Optimization

deap - Evolutionary computation framework (Genetic Algorithm, Evolution strategies).
evol - DSL for composable evolutionary algorithms, talk.
platypus - Multiobjective optimization.
nevergrad - Derivative-free optimization.
gplearn - Sklearn-like interface for genetic programming.
blackbox - Optimization of expensive black-box functions.
Optometrist algorithm - paper.

Hyperparameter Tuning

sklearn - GridSearchCV, RandomizedSearchCV.
hyperopt - Hyperparameter optimization.
hyperopt-sklearn - Hyperopt + sklearn.
skopt - BayesSearchCV for Hyperparameter search.
tune - Hyperparameter search with a focus on deep learning and deep reinforcement learning.
optuna - Hyperparameter optimization.
hypergraph - Global optimization methods and hyperparameter optimization.
bbopt - Black box hyperparameter optimization.
dragonfly - Scalable Bayesian optimisation.

Incremental Learning, Online Learning

sklearn - PassiveAggressiveClassifier, PassiveAggressiveRegressor.
creme-ml - Incremental learning framework.
Kaggler - Online Learning algorithms.

Active Learning

Talk
modAL - Active learning framework.

Reinforcement Learning

YouTube, YouTube
Intro to Monte Carlo Tree Search (MCTS) - 1, 2, 3
AlphaZero methodology - 1, 2, 3, Cheat Sheet
RLLib - Library for reinforcement learning.
Horizon - Facebook RL framework.

Frameworks

h2o - Scalable machine learning.
turicreate - Apple Machine Learning Toolkit.
astroml - ML for astronomical data.

Deployment and Lifecycle Management

m2cgen - Transpile trained ML models into other languages.
sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
mlflow - Manage the machine learning lifecycle, including experimentation, reproducibility and deployment.
modelchimp - Experiment Tracking.
skll - Command-line utilities to make it easier to run machine learning experiments.

Other

dvc - Versioning for ML projects.
daft - Render probabilistic graphical models using matplotlib.
unyt - Working with units.
scrapy - Web scraping library.
VowpalWabbit - ML Toolkit from Microsoft.
metric-learn - Metric learning.

General Python Programming

funcy - Fancy and practical functional tools.
more_itertools - Extension of itertools.
dill - Serialization, alternative to pickle.
attrs - Python classes without boilerplate.
dateparser - A better date parser.
jellyfish - Approximate string matching.

Blogs

PocketCluster - Blog.
Distill.pub - Blog.

Awesome Lists

Awesome Adversarial Machine Learning
Awesome AI Booksmarks
Awesome AI on Kubernetes
Awesome Business Machine Learning
Awesome Data Science with Ruby
Awesome Deep Learning
Awesome Financial Machine Learning
Awesome Machine Learning
Awesome Machine Learning Interpretability
Awesome Machine Learning Operations
Awesome Network Embedding
Awesome Python
Awesome Python Data Science
Awesome Python Data Science
Awesome Recommender Systems
Awesome Semantic Segmentation
Awesome Sentence Embedding
Awesome Time Series
Awesome Time Series Anomaly Detection
Recommender Systems (Microsoft)

Things I google a lot

Frequency codes for time series
Date parsing codes
Feature Calculators tsfresh

Contributing

Do you know a package that should be on this list? Did you spot a package that is no longer maintained and should be removed from this list? Then feel free to read the contribution guidelines and submit your pull request or create a new issue.

License

CC0
