This course is designed for graduate and advanced undergraduate students who wish to learn the fundamentals of data science and machine learning in the context of real-world applications. It emphasizes problems faced by companies such as Amazon, Booking.com, and Netflix, with a slight emphasis on problems arising at The New York Times, where I was a data scientist. Despite the focus on applications, the course will be mathematically rigorous; the goal is to motivate each theorem and problem with a concrete problem arising in industry. The course will follow an online IPython notebook where students can try out various algorithms in real time as we go through the course.
There will be no midterms or exams; instead, assignments will be handed in periodically throughout the term. The final project will be yours to choose, but will ideally be a production-ready tool, delivered as a web app, that uses some of the methods taught in this class (or others) to solve a concrete problem.
### Prerequisites
Exposure to undergraduate-level probability, statistics, graph theory, algorithms, and linear algebra is strongly encouraged, but these topics will be covered as we encounter them.
- Problems that arise in industry involving data.
- Introduction to regression, classification, clustering. Model training and evaluation.
- Predicting Virality of Content (Regression. Linear Regression, Random Forest)
- User Churn, Acquisition and Conversion. (Classification. Exponential Family.)
- Model selection and feature selection. Regularization. Real world performance evaluation.
- Clustering users (Clustering and Support Vector Machines)
- Correlation of features. Principal Component Analysis.
- A/B experiments. Causal inference introduction.
- Uplift Modeling. How do we target who should have received treatment?
- MapReduce. SQL. Bash.
- Diffusion on Graphs and NYT Article Recommendations.
- Topic Modeling.
- Introduction to Bayesian statistics. Bayesian vs. Frequentist approach.
- Multi-armed Bandits. Thompson Sampling. LinUCB.
- Cold Starts. Continuous Cold Starts. Warm Starts.

Time Series Analysis and

- The paper distribution problem at The New York Times.
- Review of Random Variables and Distributions.
- Time Series Models. Autoregressive. Poisson Regression. Negative Binomial Regression.
- The Newsvendor Problem and profit optimization.
These are references to deepen your understanding of material presented in lecture. The list is by no means exhaustive.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning, Springer, 2013.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2009.
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
Cameron Davidson-Pilon, Bayesian Methods for Hackers, https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers