
Data Science Standards
What are the Data Science Standards? The Data Science Standards are a proven process, at both the university and the bootcamp level, for students to build production-grade machine learning projects for their portfolios and to excel in job interviews. This process has been stress-tested with over 2,000 students and offers you the following:

  • A framework that builds confidence for client success and career interviews
  • A portfolio to share as proofs of concept with clients and for career opportunities
  • A standard mental model and business framework for solving production-grade machine learning problems
  • An organized, centralized repository of state-of-the-art resources for production-grade data science
  • Available for any technology stack

Foundational Literature:

  1. How to ask Data Science questions
  2. Asking insightful questions

Data Science Project Deliverables:

  1. Part 1: Project Proposal Criteria - Prepare an Abstract as both a Document and a PowerPoint (Start with 3 to 6 project ideas)
  2. Part 2: Perform Exploratory Data Analysis, Visualizations, and Feature Engineering
  3. Part 3: Perform Machine Learning, Performance Metrics, and Deployment for your project
  4. Part 4: Present your project as a Presentation to your business stakeholders

Part 1: Project Proposal Criteria:

Please prepare your project proposal as a shareable document and as a PowerPoint/Google Slides presentation.

  1. Project Title
  • What is your Project Theme?
  • What is your Abstract - a one-paragraph Executive Summary of your Solution?
  2. Problem Statement & Business Case
  • What is the technical problem you are solving?
  • What is the applied business case for this problem?
    • Business perspective (e.g., likelihood, sentiment, demand, price, market strategy, groups, automation)
  3. Data Science Workflow
  • What Null/Alternative Hypothesis are you testing against?
    • Does X predict Y? (e.g., distinct groups, key components, outliers)
  • What is the response that is important for you to measure?
  • What assumptions are important for you to assess and to benchmark?
  • What solutions would you like to deliver against?
  • What benchmarks are you looking to automate?
  • What alternative questions would you like to explore and provide solutions for?
  • What analytics and insights would you like to discover from your data? What types of graphics or machine learning outputs would you like to produce?
  • What is the business case for your project?
  • How will your solution help generate revenue, reduce costs, or impact another Key Performance Indicator or Objective Key Result?
  • Who will be impacted (Executive Stakeholders/Sponsors) by your solution? Who is your ideal client/customer?
  4. Data Collection
  • What raw datasets will you extract for machine learning?
  • Is the data open-source, paid crowdsourcing, or internal?
  • What are the structure, file types, and quality of the data?
  • How will you collect the data?
  • Of your known data, what data dictionaries currently exist, or can you further describe? (You can create these data dictionaries as a spreadsheet, a markdown table, or a list.)
  5. Data Processing, Preparation, & Feature Engineering
  • What techniques will you use to improve your data quality?
  • How will you handle missing data and outliers?
  • What calculations/formulas would you like to create that may not yet exist?
  6. Machine Learning: Model Selection
  • Which model architecture(s) will you use to solve your problem?
  • How will you validate the model performance?
  7. Model Persistence: Deployment, Training, & Data Pipelines
  • How would your results operate LIVE in a production environment?
  • What technology stack and integrations would you use, and which engineers would you cooperate with?
  • Where will you share your results with internal or external stakeholders through Marketing, Implementation, and Deployments?
  • How will you validate your machine learning models on a timeline from development to production? How will you generate more data for training?

Part 2: Exploratory Data Analysis Guidelines

Exploratory Data Analysis is a significant progression from defining a data science problem to determining the specific characteristics needed to solve it. Data wrangling, data munging, pre-processing, pipelines, data visualization, and data analytics are all essential for effective Exploratory Data Analysis.

0. Compute and Storage Considerations: Projects that scale require more compute, faster processors, and more storage. Many solutions from many providers exist in the market. If you need cloud compute and storage, consider the following options:

  • Paperspace - For under $10 per month, basic cloud compute and storage is available, with automation, Docker containers, and pre-installed Python packages in a Jupyter notebook.
  • Google Colab - Cloud notebooks with the potential to accelerate with GPUs and TPUs. Data can be accessed and stored from Google Drive.
  • Microsoft Notebooks - Cloud notebooks and data on Azure.
  • Custom environments - Amazon Web Services with EMR, Microsoft Azure, Google Cloud Platform, and IBM Watson Data Studio.

Note: Today there are dozens of other platforms that can help in the cloud, including Domino Data Lab, Anaconda Cloud, Crestle, Spell.ai, and Comet.ml, among others.

1. Developer Environment

Pick a consistent framework (Python or R) that can be used for your end-to-end project workflow. Consider a consistent environment for your project development (Jupyter, PyCharm, or Visual Studio Code), all of which support code, Markdown text, and LaTeX.

2. Data Loading

Import your data in-memory from SQL databases, APIs, or files with Pandas IO, and from PDFs with Camelot.
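A minimal loading sketch, assuming a hypothetical local CSV file and a SQLite database; the file paths, connection string, table name, and column name are placeholders, not part of the standards:

```python
import pandas as pd
from sqlalchemy import create_engine

# Load a flat file into memory with Pandas IO (path and date column are placeholders).
df_csv = pd.read_csv("data/raw/transactions.csv", parse_dates=["order_date"])

# Load a table from a SQL database (connection string and table are placeholders).
engine = create_engine("sqlite:///data/project.db")
df_sql = pd.read_sql("SELECT * FROM customers", con=engine)

print(df_csv.shape, df_sql.shape)
```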

3. Data Exploration

Examine your data, columns, and rows, and rename and adjust indexing and encoding as appropriate. This Pandas cheatsheet could be a useful resource for you. Python also has excellent built-in functions.
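A minimal inspection sketch, assuming a hypothetical DataFrame like the one loaded above; the column names are illustrative only:

```python
# Inspect structure, data types, summary statistics, and missing values.
df = df_csv.copy()
print(df.head())
print(df.info())
print(df.describe(include="all"))

# Rename columns and set a meaningful index as appropriate (names are placeholders).
df = df.rename(columns={"cust_id": "customer_id"})
df = df.set_index("order_date").sort_index()
```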

  1. Clean null and blank values, and consider dropping rows, manipulating data, and adjusting data types as appropriate, including dates and times, or setting appropriate indices. Adjust specific values and replace strings and characters as part of the data wrangling process (see the cleaning sketch after this list).
  2. Explore your analysis with graphing and visualizations. Many types of charts and visualization packages are available; start with matplotlib and seaborn, or alternative visualization packages (Plotly and Dash, Bokeh, Altair, Vincent, mpld3, Folium, and pygal). It is important to create reproducible graphs; scikit-plot may help. Additional Seaborn resources may be helpful (cat plots, Seaborn color palettes, Matplotlib color maps, and more Seaborn examples). You can also explore advanced Matplotlib capabilities, legends with Matplotlib, and Matplotlib styles. Adobe Color also offers fantastic color selections, and Lyft Colorbox provides accessible color options. Numerous magic methods exist to allow graphs to display inline and to offer customized magic functions.
  3. Perform additional analysis by creating new columns for calculations, including aggregation functions, counts, and groupbys. SciPy could be helpful for statistical calculations as well. Consider what distributions you might be working with and all the possibilities. Consider GIS in Python for geospatial data.
  4. Encode categorical variables with a variety of techniques: logical conditions, mapping, applying, where clauses, dummy variables, and one-hot encoding. One common method is encoding categorical variables directly in Pandas (see the encoding sketch after this list). When displaying results, consider formatting them as well, including as floats.
  5. Re-run calculations, including crosstabs or pivots, and new graphs to see results
  6. Create correlation matrices, pairplots, scatterplot matrices, and heatmaps to determine which attributes should be features for your models and which should not (see the heatmap sketch after this list). Design your visualizations with themes such as palettes.
  7. Identify the response variable(s) that you would want to predict/classify/interpret with data science.
  8. Perform additional feature engineering as necessary, including Min/Max, Normalization, Scaling, and additional Pipeline changes that may be beneficial when you run machine learning. If you have trouble installing packages, this environment variable resource may be helpful.
  9. Merge or concatenate datasets with Pandas merging or SQL methods (e.g., Learning SQL, SQL Joins, Joins #2, Joins #3, SQL Tutorial, and Saving Queries), if you have not already, based on common keys or unique items for more in-depth analysis. Additional SQL resources include the SQL Cookbook and Seven Databases.
  10. Add commenting and markdown throughout the Jupyter notebook to explain the interpretation of your results, to comment on code that may not be human readable, and to help you recall what you are referencing. (Markdown references: LaTeX Cheatsheet, Markdown for Jupyter Notebooks, LaTeX in Notebooks, Markdown Intro, and CommonMark.)
  11. Create a markdown .md milestone report that shows and explains the results of what you have accomplished to date in this part of your course project. Consider also creating a .pdf or .pptx to display initial results, aha moments, or findings that would be novel or fascinating for your final presentation.
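Referenced from step 1 above, a minimal cleaning sketch; the DataFrame `df` and its columns are hypothetical placeholders:

```python
import pandas as pd

# Drop fully blank rows, then coerce and fill remaining nulls column by column.
df = df.dropna(how="all")
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["price"] = df["price"].fillna(df["price"].median())

# Adjust data types, including dates and times.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Replace stray strings and characters as part of data wrangling.
df["state"] = df["state"].str.strip().str.upper().replace({"N/A": pd.NA})
```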
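Referenced from step 4 above, one possible way to encode categorical variables in Pandas; the column names and category values are placeholders:

```python
import pandas as pd

# Map an ordinal category to integers with an explicit dictionary.
df["size_code"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# One-hot encode a nominal category into dummy columns.
df = pd.get_dummies(df, columns=["region"], prefix="region", drop_first=True)

# Format floats consistently when displaying results.
pd.set_option("display.float_format", "{:.2f}".format)
```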
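Referenced from step 6 above, a sketch of a correlation matrix and seaborn heatmap, assuming a DataFrame `df` with numeric columns:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns only.
corr = df.select_dtypes("number").corr()

# Heatmap with a diverging palette to spot candidate features.
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0, ax=ax)
ax.set_title("Feature correlations")
plt.tight_layout()
plt.show()
```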

Part 3: Machine Learning Guidelines

  1. Create a brand new Jupyter notebook, where you load the latest DataFrame or .csv file(s) that you previously saved from your exploratory data analysis notebook.
  2. After you have completed the exploratory data analysis section of your project, revisit your hypothesis(es) on ideas that you would like to either predict (regression) or classify (classification). Have you identified a specific column or multiple columns that could be treated as response or target variables to predict/classify?
  3. If not, consider performing additional exploratory analysis that helps you pinpoint a potential working hypothesis to test results against. You could also consider clustering techniques, including t-SNE, as an addition to exploratory data analysis and as preparation for machine learning.
  4. Consider which parts of your feature engineering have been completed for your machine learning, and which still need to be completed through pre-processing or Pipeline operations such as Normalize, Scaler, Min/Max, etc.
  5. As a result of correlation matrices, heatmaps, and visualizations, consider which features may be relevant to support the model that you are building.
  6. Consider which machine learning models from scikit-learn (and its GitHub repo) or StatsModels could be effective for your newly discovered hypothesis testing: linear regression (e.g., LOWESS regression), logistic regression and multi-class models, K-Nearest Neighbors, clustering, decision trees (and how to export graphviz), the Bagging Regressor or Bagging Classifier with feature selection for ensembles, Random Forest (including tuning), Naive Bayes, Natural Language Processing (Word2Vec, spaCy and spaCy models, and topic modeling), Time Series Analysis, Neural Networks, Support Vector Machines, Stochastic Gradient Descent, dimensionality reduction with PCA, and ensembles such as the Gradient Boosting Classifier and Regressor. Once you have determined which models to consider, be sure to import their packages into Python.
  7. Consider what tuning parameters you may want to optimize for your model, including regularization (Lasso, ridge, ElasticNet), and additional parameters relevant to each model. Github Code Search could help you as you are adjusting your models.
  8. Be sure to include a train_test_split, and then consider KFold or cross-validation to offer stratified results that limit the influence of outliers on your dataset (see the cross-validation sketch after this list). If you have imbalanced classes, consider techniques to adjust them.
  9. If you still have many outliers, consider how to remove them or optimize for them with categories. How could you adjust your categories or thresholds to improve performance for what you are testing in your hypothesis? Depending on how your model error performs, you may want to change or adjust other features in your model. You may want to add or remove features and measure feature importance when running models.
  10. Consider a Grid Search, Grid Search with Cross-Validation, or Random Search to better optimize your models (see the grid search sketch after this list).
  11. Share metrics on each model that is run, such as error and accuracy, and confusion matrices, which are based on truth tables and logical conditions. They can be displayed through ROC/AUC curves as well as visually (see the scoring sketch after this list). Scoring your models is important for both regression and classification techniques. Other models have additional metrics that you can consider sharing. You can set up metrics and model runs in defined functions for further automation of your project.
  12. Compare your metrics against the base case or null case for accuracy, which ideally is compared to your majority class, or a median/mean representation for your target/response variable. How well does your model perform?
  13. Provide markdown explaining the interpretation relevant to your business case after running models. Also, add comments to explain what you are doing, for interpretation and reproducibility of your code.
  14. If you are running Time Series Analysis, you will want to consider additional model capabilities such as rolling and moving averages with the datetime package and pandas.
  15. If you are working on Natural Language Processing, you will want to consider Python packages such as spaCy, topic modeling, NLTK, TextBlob, and word2vec.
  16. If you are scraping additional data, consider python packages such as Selenium and BeautifulSoup4.
  17. For your project, your presentation will showcase the best 3-5 models. However, it is fine to have inefficient models that do not perform well, for practice; keep these in your main modeling Jupyter notebook as a reference.
  18. If you choose to work with .py scripts, here is a method to rename these files.
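Referenced from step 8 above, a minimal split and cross-validation sketch, assuming a hypothetical feature matrix `X` and binary target `y`; logistic regression is just an illustrative model choice:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Hold out a test set; stratify to preserve class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified K-fold cross-validation on the training set only.
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```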
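Referenced from step 10 above, a grid search sketch over regularization strength, reusing the hypothetical `X_train`/`y_train` from the previous sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Search a small grid of regularization strengths with cross-validation.
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```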
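Referenced from step 11 above, a scoring sketch for a fitted binary classifier, reusing the hypothetical grid search and held-out test split from the sketches above:

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Score the held-out test set with the best estimator from the grid search.
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```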

Part 4: Presentation Design

Please prepare your presentation with this Presentation Skeleton for Data Science, Solution Engineering & Customer Experience.

  • Title Page:
    • Project Title
    • Name
    • Job Title, Organizational Title
  • Agenda Page:
    • Sections to be covered, and time for each section
  • Introductions:
    • Introduction to your Stakeholders
    • Introduction to you
  • Problem Statement:
    • Describe in Depth the Problem
    • Solution(s) technical/non-technical to the problem
  • Data Analysis Slide(s):
    • Techniques, Software stack, platforms used
    • Data Dictionary, Feature Engineering
    • Benchmarked or baseline metrics to discover
    • Visualizations of analytics with business context (Maximum 2 visualizations per slide)
  • Machine Learning Slide(s):
    • Metrics and Scoring with analytics with best scoring models and business context
    • Describe how metrics are scored to baseline (Model Persistence)
  • Deployment:
    • How Machine Learning solution will be Deployed in Production
  • Conclusion Slide:
    • Recommendations and Results with business context
    • Future Research and Analysis
  • Next Steps slide:
    • Contact, Github/Gitlab URL, Presentation Link, and Call to Action
  • Appendix:
    • Works Cited and Media Resources

Design and Product Requirements

  1. Github Organization: Create one parent directory for your project, with separate Jupyter Notebooks for each section, a data folder, and an assets folder for images.
  2. Final presentation to be shared as a Google Slides presentation or Microsoft PowerPoint or React Native Slides
  3. Presentation to focus on business analysis, insights, and business impact with graphs, and machine learning output. Minimal, if any, code should be shown in presentation.
  4. Presentation should use maximum of 3 fonts.
  5. Maximum of 20 slides.
  6. The presentation should be interpretable if sent as a cold e-mail, without you presenting it.
  7. Appendix Slide for Works Cited, Bibliography, and Links must be included.
  8. Presentation delivery to not exceed 7 minutes
  9. Presentation delivery to be for non-technical stakeholder (Also known as "Teach me like I am 5")
  10. Presentation Delivery In-Person or Zoom or Skype
  11. Be prepared for a Q&A session of 3 minutes

Additional Notes

  1. Consider saving all your plots to a single, overall PDF (see the sketch below).
  2. Consider a final pass over your Jupyter notebooks to customize them with docstrings and Markdown, polishing your presentation for code review by stakeholders.
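A minimal sketch of saving every plot to one overall PDF with Matplotlib's PdfPages; the DataFrame `df` and output path are placeholders:

```python
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Write each figure as one page of an overall PDF report.
with PdfPages("reports/eda_plots.pdf") as pdf:
    for column in df.select_dtypes("number").columns:
        fig, ax = plt.subplots()
        df[column].hist(ax=ax, bins=30)
        ax.set_title(f"Distribution of {column}")
        pdf.savefig(fig)
        plt.close(fig)
```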

License

CC-4.0-by-nc-nd

To the extent possible under law, David Yakobovitch has licensed this work under Creative Commons 4.0 BY-NC-ND. This license is the most restrictive of Creative Commons' six main licenses, only allowing others to download your works and share them with others as long as they credit the author, but they can't change them in any way or use them commercially.
