

Data Science Cheatsheet 2.0

A helpful 5-page data science cheatsheet to assist with exam reviews, interview prep, and anything in-between. It covers over a semester of introductory machine learning, and is based on MIT's Machine Learning courses 6.867 and 15.072. The reader should have at least a basic understanding of statistics and linear algebra, though beginners may find this resource helpful as well.

Inspired by Maverick's Data Science Cheatsheet (hence the 2.0 in the name), located here.

Topics covered:

  • Linear and Logistic Regression
  • Decision Trees and Random Forest
  • SVM
  • K-Nearest Neighbors
  • Clustering
  • Boosting
  • Dimension Reduction (PCA, LDA, Factor Analysis)
  • Natural Language Processing
  • Neural Networks
  • Recommender Systems
  • Reinforcement Learning
  • Anomaly Detection
  • Time Series
  • A/B Testing

This cheatsheet will be occasionally updated with new/improved info, so consider a follow or star to stay up to date.

Future additions (ideas welcome):

  • Time Series (added!)
  • Statistics and Probability (added!)
  • Data Imputation
  • Generative Adversarial Networks
  • Graph Neural Networks

Links

Screenshots

Here are screenshots of a couple pages - the link to the full cheatsheet is above!

Why is Python/SQL not covered in this cheatsheet?

I planned for this resource to cover mainly algorithms, models, and concepts, as these rarely change and are common throughout industries. Technical languages and data structures often vary by job function, and refreshing these skills may make more sense on keyboard than on paper.

License

Feel free to share this resource in classes, review sessions, or to anyone who might find it helpful :)

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License


Images are either used for educational purposes, created by me, or borrowed from my colleagues here.

Contact

Feel free to reach out with comments, suggested updates, and potential improvements!

Author - Aaron Wang

If you'd like to support this cheatsheet, you can buy me a coffee here. I also do resume, application, and tech consulting - send me a message if interested.


data-science-cheatsheet's Issues

Additional topics

Here is a list of topics that would be great to add:

  1. Graph Neural Networks
  2. Bayesian statistics, probabilistic programming, and applications to A/B testing
  3. DBSCAN in the Clustering section
  4. SVD, Factorization Machines, and neural-network approaches in the Recommender Systems section
  5. Deep Reinforcement Learning

Please consider improvements for page 1

  1. For the Central Limit Theorem, consider adding the requirement that the variance is finite and the mathematical expectation exists. The Cauchy distribution, which satisfies neither condition, is sometimes encountered in engineering, for example in the laser industry (see the sketch after this list).

  2. Bias - the error coming from the choice of the set of functions you search over: the objective function is far from its projection onto the set of functions you are considering.

  3. Variance - the error associated with the instability of the solutions obtained when different samples are drawn from the population.
    For (2) and (3), the Statistical Learning book is a better reference; the definitions in the cheatsheet are too vague, at least for me.

  4. Non-parametric models - these are models with no trainable parameters, for example the nearest-neighbor method (or more sophisticated meta-learning models that use non-parametric blocks in one of their flavors).

  5. Cross-Validation - it does not actually perform validation; rather, it is an assessment. If you want validation in the sense of guarantees, this principle is not suitable.

  6. In Subset Selection you use the term "0-norm"; that is poor jargon, since it is not a norm: it is not homogeneous (multiplying a vector by 10 will not multiply the value by 10). Use the word cardinality instead.
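
A quick numerical illustration of point 1, as a minimal sketch in Python (it assumes only NumPy and is not part of the cheatsheet): the running sample mean of Cauchy draws never stabilizes, because the expectation does not exist.

```python
# Minimal sketch (assumes NumPy): the standard Cauchy distribution has no
# finite mean or variance, so its sample mean never settles down, unlike
# the normal distribution where the law of large numbers applies.
import numpy as np

rng = np.random.default_rng(0)

for n in (1_000, 100_000, 10_000_000):
    cauchy_mean = rng.standard_cauchy(n).mean()
    normal_mean = rng.standard_normal(n).mean()
    print(f"n={n:>10,}  Cauchy mean={cauchy_mean:10.3f}  Normal mean={normal_mean:9.5f}")

# The Cauchy column keeps jumping around as n grows; the Normal column
# shrinks toward 0, which the CLT/LLN guarantee only when the expectation
# and variance are finite.
```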

Page 1: cp is not, in fact, the cost of a split

In the CART model, splitting is considered over the entire tree from the very beginning, so cp has a more complex meaning.

Specifically, in the CART model it is a penalty term in the objective function of the form cp·|T|, where |T| is the number of tree nodes. Technically, cp still enters as a split score, but there is a subtlety: the splitting of the tree (for classification as well as for regression) is carried out over the entire depth.

Further, in classification mode CART uses the Gini index, while C4.5 uses entropy.

I was a student of J. Friedman and he told me about it.
If you want to learn more about CART, I have a couple of notes
https://sites.google.com/site/burlachenkok/articles/decision-trees-parti-decision-trees-for-regression,
https://sites.google.com/site/burlachenkok/articles/some-problems-with-decision-tree

and two talks at MIPT (but they are in Russian)
https://www.youtube.com/watch?v=r4ZTy90233w
https://www.youtube.com/watch?v=evkzN6AZTnc&t=53s
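
To make the size-penalty description above concrete, here is a minimal sketch using scikit-learn's cost-complexity pruning, which exposes the same R(T) + alpha·|leaves(T)| idea through the ccp_alpha parameter (the dataset and code are my own illustration, not taken from the cheatsheet or this issue):

```python
# Minimal sketch (assumes scikit-learn): cost-complexity pruning minimizes
# R(T) + alpha * |leaves(T)|, the same style of tree-size penalty described
# above for CART's cp parameter. Larger alpha (like larger cp) prunes harder.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas of the nested sequence of pruned subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas[::10]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}")
```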

"Variable Importance - ranks variables by their ability to minimize error when split upon, averaged across all trees"

First page. "Variable Importance - ranks variables by their ability to
minimize error when split upon, averaged across all trees".

Estimating the importance of a variable is quite an abstract and broad notion.

You can do this in a large number of ways if you require theory behind it (and in infinitely many ways if you do not).

The approaches I know for evaluating variables are collected here: https://sites.google.com/site/burlachenkok/some-ways-to-intepretate-black-box-models

The measure quoted from this cheatsheet, in the context of decision trees, was suggested by J. Friedman's friend L. Breiman back in 1983; J. Friedman has a theorem from 1999 showing that it is, in some sense, a good measure.

I suggest adding some extra text, because thinking that variable importance can be computed in only one way is deeply wrong.
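
As a small illustration of that point, here is a minimal sketch (assuming scikit-learn; the dataset and model choices are illustrative, not from the cheatsheet) comparing two common importance measures, whose rankings often disagree:

```python
# Minimal sketch (assumes scikit-learn): two of the many ways to score
# variable importance for a random forest. The cheatsheet's wording matches
# the impurity-based (MDI) measure; permutation importance is a
# model-agnostic alternative, and the two rankings can differ.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# 1) Mean decrease in impurity, averaged across all trees.
mdi_rank = forest.feature_importances_.argsort()[::-1]

# 2) Permutation importance on held-out data.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm_rank = perm.importances_mean.argsort()[::-1]

print("Top 5 features by MDI:        ", mdi_rank[:5])
print("Top 5 features by permutation:", perm_rank[:5])
```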

And thanks for your work
