In the previous sections you examined how files are organized in a repository. Now you will shift to what is inside the files themselves. This section focuses not on the statistical validity of your analysis, but on how its content is organized and presented.
You will be able to:
- Review the minimum expectations of a written data science project
- List what enhances a project from the minimum to the exceptional
- Compare the impact of rewritten sections to their original content
In The Elements of Data Analytic Style, Jeff Leek writes:
A written analysis should always include:
- A title
- An introduction or motivation
- A description of the statistics or machine learning models you used
- Results including measures of uncertainty
- Conclusions including potential problems
- References
Data science projects at Flatiron School include two forms of documentation: the Jupyter Notebook and the README. The Jupyter Notebook contains the long-form documentation of the analysis, while the README contains a short summary and provides guidance to navigate the repository structure.
In the Jupyter Notebook, you capture the requirements Jeff Leek describes in these sections:
- Title
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
Each section has a set of requirements and questions it should be able to address.
As with the name of your files and repository, the title of the notebook should also be descriptive. "Project Notebook" is not as informative as "EDA, Modeling, and Evaluation". An even more descriptive title might be "House Price Prediction: Data Exploration, Modeling and Results". There should be no question in a viewer's mind as to what is contained in your notebook.
Beyond the title, you should also give an overview of your notebook's contents right at the top of the notebook. This allows a reader to easily skim for the content they're looking for, as well as to know exactly what is contained within the notebook.
The business understanding section clearly explains the real-world value the project has for a specific stakeholder, and how this analysis will address a problem that stakeholder faces.
Example questions to be answered:
- How much time will this solution save?
- Who will this solution help?
- What need does this analysis address?
- How well does the metric or target variable directly relate to the real-world problem?
The data understanding section relates your data source and the properties of its variables to the real-world problem of interest. Jumping straight into the modeling without demonstrating a thorough understanding of the data is amateur hour. A robust data understanding section will describe the source and properties of all the variables used in the data preparation and modeling sections.
Example questions to be answered:
- Where does the data come from?
- What do the variables mean in actual language?
- What is the target variable?
- What is the range, scale, or distribution of each variable?
- Who is in the sample or how was the data collected?
- What elements of the data will or will not address the business question?
- Are there any issues related to data permissions, copyright, ethical issues, confidential information, etc.?
- Are there any interesting aspects or anomalies in the data such as outliers or missing data?
- What additional data would be really helpful to your analysis?
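Several of the questions above (range, scale, missing data, outliers) can be answered with a short summary table near the top of the data understanding section. The following is a minimal sketch using pandas. The toy housing data, the `summarize` helper, and the 1.5 × IQR outlier rule are illustrative assumptions, not part of any particular project.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame(
    {
        "sqft": [850, 1200, 1500, None, 2100],
        "bedrooms": [2, 3, 3, 4, 4],
        "price": [150_000, 210_000, 255_000, 320_000, 1_500_000],
    }
)


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and range."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_missing": frame.isna().sum(),
            "min": frame.min(),
            "median": frame.median(),
            "max": frame.max(),
        }
    )


summary = summarize(df)
print(summary)

# Flag potential outliers in the target with the common 1.5 * IQR rule.
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(outliers)
```

A table like this does not replace a written description of what each variable means, but it gives the reader the range and missingness of every column at a glance.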
An employer should be able to replicate your data cleaning and preparation, from the raw data to what is used in the analysis, using your data preparation code. A quality data preparation section fully documents and justifies decisions to merge, drop, or transform variables.
Example questions to be answered:
- Can someone else replicate your entire data preparation process?
- If you created the data through scraping or an API, can someone repeat that process?
- In what form is the data stored?
- Can someone else easily run the code to take the raw data and get it ready for analysis?
- Is the code in pipeline form?
- Is all the preprocessing code in the notebook, or is it in separate .py files?
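One way to make data preparation replicable is to write each step as a small, documented function and chain the steps from the raw data to the analysis-ready table. The sketch below assumes pandas and uses `DataFrame.pipe` for the chaining. The inline CSV, the column names, and the helper functions are all hypothetical stand-ins for a real project's raw file and decisions.

```python
import io

import pandas as pd

# Hypothetical raw data; a real project would read the raw file kept in the repository.
RAW_CSV = """id,sqft,bedrooms,price
1,850,2,150000
2,1200,3,210000
2,1200,3,210000
3,,3,255000
4,2100,4,320000
"""


def drop_duplicate_listings(df: pd.DataFrame) -> pd.DataFrame:
    # Assumed justification: a repeated id is the same listing recorded twice.
    return df.drop_duplicates(subset="id")


def fill_missing_sqft(df: pd.DataFrame) -> pd.DataFrame:
    # Assumed justification: the median is less sensitive to extreme values than the mean.
    return df.assign(sqft=df["sqft"].fillna(df["sqft"].median()))


def add_price_per_sqft(df: pd.DataFrame) -> pd.DataFrame:
    # Derived feature used later in the analysis.
    return df.assign(price_per_sqft=df["price"] / df["sqft"])


def prepare(source) -> pd.DataFrame:
    """Run every preparation step, in order, from raw input to clean table."""
    return (
        pd.read_csv(source)
        .pipe(drop_duplicate_listings)
        .pipe(fill_missing_sqft)
        .pipe(add_price_per_sqft)
        .reset_index(drop=True)
    )


clean = prepare(io.StringIO(RAW_CSV))
print(clean)
```

Because every decision lives in a named function with its justification in a comment, the same functions can be moved into a separate .py file and imported into the notebook without changing the result.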
While model development is an iterative process, not every analysis you explored belongs in your final project notebook. Models should be correct and fully documented, with valid justification for each decision. The notebook should show models proceeding from a simple baseline to more complex models.
Example questions to be answered:
- Is the information you are including absolutely relevant?
- Is your final model specified in an equation or pseudocode, and not just specified in code?
- When you describe the parameters or coefficients, do you describe them in real terms?
- Have you examined any problems with the data that might be impacting the quality of your analysis or model?
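As an illustration of moving from a simple baseline to a more complex model, and of stating the final model as an equation with coefficients in real terms, here is a minimal sketch with scikit-learn. The synthetic data (price generated from square footage plus noise) is an assumption made so the example is self-contained. A real notebook would use the prepared project data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, hypothetical data: price depends linearly on square footage plus noise.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, size=200)
price = 50_000 + 100 * sqft + rng.normal(0, 10_000, size=200)
X = sqft.reshape(-1, 1)

# Step 1: baseline that always predicts the mean price.
baseline = DummyRegressor(strategy="mean").fit(X, price)
baseline_r2 = baseline.score(X, price)

# Step 2: a simple linear model to compare against the baseline.
linear = LinearRegression().fit(X, price)
linear_r2 = linear.score(X, price)

print(f"Baseline R-squared: {baseline_r2:.3f}")
print(f"Linear R-squared:   {linear_r2:.3f}")

# State the fitted model as an equation, then interpret it in real terms.
equation = f"price = {linear.intercept_:,.0f} + {linear.coef_[0]:.1f} * sqft"
print(equation)
print(
    f"Each additional square foot is associated with about "
    f"${linear.coef_[0]:.0f} more in predicted price."
)
```

Reporting the baseline alongside the final model shows the reader how much the added complexity actually buys.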
Evaluation is not just about accuracy or R-squared score. While those metrics are important, the evaluation section also needs to address how well (or not) the model solves the original business problem. The limitations are just as important as the successes.
Example questions about the model:
- What evaluation metrics did you use?
- Were there special considerations you made when choosing that evaluation metric?
- How does your model's metric compare to industry standards or what is already out there?
- Was cross-validation included in your process, and what concerns did it address?
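To make the questions about metrics and cross-validation concrete, the sketch below scores a baseline and a linear model with 5-fold cross-validation in scikit-learn, and also reports mean absolute error so the result can be read in dollars. The synthetic data and the choice of five folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, hypothetical data: price depends linearly on square footage plus noise.
rng = np.random.default_rng(42)
sqft = rng.uniform(500, 3000, size=200)
price = 50_000 + 100 * sqft + rng.normal(0, 10_000, size=200)
X = sqft.reshape(-1, 1)

# Shuffled folds with a fixed seed so the evaluation can be reproduced.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

baseline_scores = cross_val_score(
    DummyRegressor(strategy="mean"), X, price, cv=folds, scoring="r2"
)
linear_scores = cross_val_score(
    LinearRegression(), X, price, cv=folds, scoring="r2"
)

# scikit-learn reports errors as negative scores, so flip the sign.
linear_mae = -cross_val_score(
    LinearRegression(), X, price, cv=folds, scoring="neg_mean_absolute_error"
).mean()

print(f"Baseline R-squared per fold: {np.round(baseline_scores, 3)}")
print(f"Linear R-squared per fold:   {np.round(linear_scores, 3)}")
print(f"Linear mean absolute error:  ${linear_mae:,.0f}")
```

Scoring on held-out folds addresses the concern that a model only looks good on the data it was trained on, and an error expressed in dollars is easier to relate to the business question than R-squared alone.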
Example questions about the application:
- What are the limitations of interpreting your analysis?
- What next steps would you take in this analysis? What new data would you want to incorporate?
- How well does your analysis answer the actual business question?
- What sort of impact would your results actually have?
The README is at once an abstract, a road map, and a how-to manual. While perhaps not labeled explicitly, a quality README includes:
- Detailed description of your business question
- A summary of your data science process, findings, and ideas for future improvement
- At least one interesting visualization from your analysis
- Repository navigation
- Links to the presentation slides, notebook, and other relevant documentation
- Links to sources, such as the data, papers referenced, or other important materials
- Reproduction instructions
- Contact information
In Randal Olson's sample analysis on the Iris dataset, he uses the data analysis checklist from The Elements of Data Analytic Style to ensure his analysis is not mediocre. You don't want a mediocre analysis either!
Here is another good example analysis.
And here is yet another good example.
Great! Now that you know more details about how to structure Jupyter Notebook and README content, let's move on to some exercises.