- Description
- Learning Outcomes
- Logistics
- Marking Scheme
- Policies
- Folder Structure
- Acknowledgements and Contributions
We wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and most recently, the Mississaugas of the Credit River. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.
The course was created by the University of Toronto's Data Science Institute. The course provides an overview of the Design of Machine Learning Systems which are embedded within data-intensive products and applications. It covers the fundamental components of the infrastructure, systems, and methods necessary to implement and maintain Machine Learning (ML) models in production. In short, we will learn methods to build a factory of ML models.
The course has two components:
- A discussion of the main issues and challenges faced in production, together with some approaches to address them.
- A live lab with demonstrations of implementation techniques.
The course covers the following areas:
- Data engineering.
- Feature engineering.
- Hyperparameter tuning.
- Model deployment.
- Model explainability.
- Logging, experiment tracking, and monitoring.
We will discuss the tools and techniques required to do the above in good order and at scale. However, we will not discuss the inner workings of models, their relative advantages, and so on. Likewise, we will not discuss the theoretical aspects of feature engineering or hyperparameter tuning. We will focus on tools and reproducibility.
By the end of this course, a student will be able to:
- Describe the main components of a machine learning system.
- Explain the infrastructure required to train and test models in production.
- Implement an experiment tracking system and logging.
- Contrast and evaluate different approaches to storing and manipulating data.
- Design data flows and processes to automate the construction of ML models.
- Instructor: Jesús Calderón (he/him)
- Email: dsi.production.course [at] gmail.com
  - This email is exclusively for the course.
  - I will monitor this email and respond within 24 hours.
- TA: TBD
- The workshop will be held over three weeks on the dates outlined below.
- Most days, we will review slides for about one hour, take a short break and continue with the technical discussion.
- There are Jupyter notebooks in the repo to follow along in the coding sessions.
- We encourage you to participate and ask questions.
- A standard PC with Python installed; ideally, an account with admin rights on this PC.
- The examples are not computationally intensive and can be further reduced if performance is an issue.
- The course is implemented with a Docker backend that will run a PostgreSQL server. This is intended to mimic a production-like environment. Use SQLite if Docker is not an option.
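If Docker is not available, the standard library's `sqlite3` module is a workable substitute. The sketch below is illustrative (the table and column names are made up, not the course's schema); much the same SQL, with minor dialect differences, would run against the Dockerized PostgreSQL server.

```python
import sqlite3

# Minimal SQLite fallback using only the standard library.
conn = sqlite3.connect(":memory:")  # swap in a file path for persistence
cur = conn.cursor()

# Create a small table and insert a row, as we would against PostgreSQL.
cur.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO runs (name) VALUES (?)", ("baseline",))
conn.commit()

rows = cur.execute("SELECT name FROM runs").fetchall()
print(rows)  # [('baseline',)]
```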
- Cameras are optional, although highly encouraged. We understand that not everyone may have a space at home suitable for keeping the camera on.
- When to Use ML
- ML in Production
- ML vs Traditional Software
- Business and ML Objectives
- Requirements of Data-Driven Products
- Iterative Process
- Framing ML Problems
- Git, authorization, and production pipelines.
- VS Code and Git.
- Python virtual environments.
- Repo File Structure.
- Branching Strategies.
- Commit Messages.
- Data Sources
- Data Formats
- Data Models
- Data Storage and Processing
- Modes of Data Flow
- Jupyter notebooks and source code.
- Logging and using a standard logger.
- Environment variables.
- Getting the data.
- Schemas and indexes in Dask.
- Reading and writing parquet files.
- Dask vs pandas: a small example of big vs small data.
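Two of the lab items above, a standard logger and environment variables, can be sketched with the standard library alone. The logger name and the `DATA_DIR` variable below are illustrative, not the course's actual values.

```python
import logging
import os

# Configure the root logger once, near the program's entry point.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("training_pipeline")

# Read a setting from an environment variable, with a sensible default,
# so the same code runs unchanged across machines.
data_dir = os.environ.get("DATA_DIR", "./data")
logger.info("Using data directory: %s", data_dir)
```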
- Sampling
- Labeling
- Class Imbalance
- Data Augmentation
- Sampling in Python.
- An initial training pipeline.
- Modularizing the training pipeline.
- Decoupling settings, parameters, data, code, and results.
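The idea of a modular pipeline with settings decoupled from code can be sketched as follows. Everything here is a toy stand-in (the "model" is just a slope estimate on synthetic data); the point is the shape: settings in one place, each stage a small function, and a single entry point that wires them together.

```python
import random

# Settings live apart from the code that uses them.
SETTINGS = {"test_fraction": 0.25, "seed": 42}

def get_data(n=20):
    """Stand-in for a data-loading stage."""
    rng = random.Random(SETTINGS["seed"])
    return [(x, 2 * x + rng.random()) for x in range(n)]

def split(data, test_fraction):
    """Hold out the tail of the data for evaluation."""
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

def train(train_data):
    """Toy 'model': average y/x ratio as a slope estimate."""
    pairs = [(x, y) for x, y in train_data if x]
    return sum(y / x for x, y in pairs) / len(pairs)

def evaluate(slope, test_data):
    """Mean absolute error of the fitted slope on held-out data."""
    return sum(abs(y - slope * x) for x, y in test_data) / len(test_data)

def run_pipeline():
    data = get_data()
    train_data, test_data = split(data, SETTINGS["test_fraction"])
    return evaluate(train(train_data), test_data)

mae = run_pipeline()
print(f"test MAE: {mae:.3f}")
```

Swapping any stage (for example, a real data loader) leaves the others untouched, which is what makes the pipeline testable piece by piece.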
- Common Operations
- Data Leakage
- Feature Importance
- Feature Generalization
- Transformation Pipelines
- Encapsulation and parametrization
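A library-free sketch of what "encapsulation and parametrization" of transformations means (the lab uses real tooling; these two classes are toy stand-ins). The key discipline, which also guards against data leakage, is that statistics are learned in `fit` on training data only and reapplied in `transform`.

```python
class Scaler:
    """Standardize values using statistics learned from the fit data only."""
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        self.std = (sum((x - self.mean) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

class Pipeline:
    """Apply a sequence of transformation steps in order."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, xs):
        for step in self.steps:
            xs = step.fit(xs).transform(xs)
        return xs

pipe = Pipeline([Scaler()])
scaled = pipe.fit_transform([1.0, 2.0, 3.0, 4.0])
print(scaled)
```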
- Model Development and Training
- Ensembles
- Experiment Tracking and Versioning
- Model Offline Evaluation
- Experiment tracking
- Hyperparameter Tuning
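Experiment tracking and hyperparameter tuning can be illustrated in a few lines of standard-library Python: record the parameters and score of every run, then select the best. Real labs use dedicated tools for this; the objective function and parameter names below are entirely made up.

```python
import itertools
import json

def objective(lr, depth):
    """Toy stand-in for validation error as a function of hyperparameters."""
    return (lr - 0.1) ** 2 + (depth - 4) ** 2 * 0.01

# Grid search: track every run's parameters and score, not just the winner.
runs = []
for lr, depth in itertools.product([0.01, 0.1, 0.5], [2, 4, 8]):
    runs.append({"params": {"lr": lr, "depth": depth}, "score": objective(lr, depth)})

best = min(runs, key=lambda r: r["score"])
print(json.dumps(best))
```

Keeping the full `runs` list (rather than only `best`) is the essence of tracking: every experiment stays reproducible and comparable after the fact.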
- ML Deployment Myths
- Batch Prediction vs Online Prediction
- Partial Dependence Plots
- Permutation Importance
- SHAP Values
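Of the explainability tools listed, permutation importance is simple enough to sketch without any library: shuffle one feature and measure how much the model's error grows. The "model" and data below are toy stand-ins chosen so the answer is obvious, since the model ignores the second feature entirely.

```python
import random

random.seed(0)
n = 200
x1 = [random.random() for _ in range(n)]  # informative feature
x2 = [random.random() for _ in range(n)]  # pure noise feature
y = [3 * a for a in x1]

def predict(a, b):
    return 3 * a  # the "model" only ever uses the first feature

def mae(x1s, x2s):
    return sum(abs(predict(a, b) - t) for a, b, t in zip(x1s, x2s, y)) / n

baseline = mae(x1, x2)

def importance(col):
    """Error increase when one feature column is shuffled."""
    shuffled = col[:]
    random.shuffle(shuffled)
    if col is x1:
        return mae(shuffled, x2) - baseline
    return mae(x1, shuffled) - baseline

imp1, imp2 = importance(x1), importance(x2)
print(f"x1: {imp1:.3f}, x2: {imp2:.3f}")
```

Shuffling the informative feature degrades the error noticeably, while shuffling the noise feature changes nothing, which is exactly the signal permutation importance reports.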
- ML System Failures
- Data Distribution Shifts
- Monitoring and Observability
- Python implementation
- Infrastructure
- ML Ops
- Roles and skills
- Organization
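A very small monitoring check for data distribution shift, in the spirit of the topics above, is to compare a live feature's statistics against a training-time reference. The threshold and synthetic data below are illustrative; production systems use more robust tests.

```python
import random
import statistics

random.seed(1)
reference = [random.gauss(0.0, 1.0) for _ in range(1000)]  # training-time data
live = [random.gauss(0.8, 1.0) for _ in range(1000)]       # shifted live stream

ref_mean = statistics.mean(reference)
ref_std = statistics.stdev(reference)

# Flag drift when the live mean moves more than 3 standard errors
# away from the reference mean.
stderr = ref_std / len(live) ** 0.5
drifted = abs(statistics.mean(live) - ref_mean) > 3 * stderr
print("drift detected:", drifted)
```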
Week 15
- Monday, Mar 11, 6 pm - 8:30 pm
- Tuesday, Mar 12, 6 pm - 8:30 pm
- Wednesday, Mar 13, 6 pm - 8:30 pm
- Thursday, Mar 14, 6 pm - 8:30 pm
Week 16
- Tuesday, Mar 19, 6 pm - 8:30 pm
- Wednesday, Mar 20, 6 pm - 8:30 pm
- Thursday, Mar 21, 6 pm - 8:30 pm
- Saturday, Mar 23, 9 am - 11:30 am
- Assignment 1 due on March 15.
- Assignment 2 due on March 19.
- Assignment 3 due on March 23.
- Evaluation.
  - Quizzes will follow every session. They include multiple-choice, multiple-selection, and true/false questions related to the day's topics. The quizzes are not only an assessment but an integral part of learning. I recommend that you do not leave them all to the last minute.
  - There will be three coding assignments.
- Reading.
- Attendance.
  - The course has a live-coding component.
  - Students are expected to follow along with the coding, creating files and folders to navigate and manipulate. Students should be active participants while coding and are encouraged to ask questions throughout.
Below are the folders contained in this repo with a description of what they contain and information on how to use them.
assignments
: assignment files.

config
: configuration files for experiments.

docs
: all notes and quizzes.

notebooks
: Jupyter notebooks.

src
: code.
- We welcome issues, enhancement requests, and other contributions. To submit an issue, use GitHub issues.