Fix security vulnerability for this repo in requirements.txt file

Fix security warning

Polish story card request template to guide data processing / analyses

get feedback
incorporate feedback from team housing
socialize the template
update the GitHub version of the story card request template to match the Google Doc
incorporate feedback from other team leads

understand how to make it easier to work with Storybook APIs and Python / R code

come up with example in R

Come up with code example of how to use Metacat with a postgres Docker image

work on code example
put together relevant tutorial material

Check if RStudio Cloud free edition is a sustainable cloud computing solution for R users

https://rstudio.cloud/projects

Figure out how to track data lineage metadata in Postgres DB / RDS

candidate software solution

read up on Metacat

Decisions and follow-up

come up with recommendation for a software solution
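
While the Metacat evaluation is pending, one way to make the lineage requirement concrete is a plain table that records, for each derived dataset, its source, the transformation applied, the code version, and a checksum. The sketch below is not Metacat and is not a recommendation. The table name `data_lineage`, its columns, and the helper names are hypothetical, and Python's built-in `sqlite3` stands in for Postgres / RDS so the example runs without a database server.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema; column names are assumptions for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS data_lineage (
    id            INTEGER PRIMARY KEY,
    output_name   TEXT NOT NULL,  -- dataset or table that was produced
    source_name   TEXT NOT NULL,  -- dataset or table it was derived from
    transform     TEXT NOT NULL,  -- short description of the processing step
    code_version  TEXT NOT NULL,  -- e.g. a git commit hash
    output_sha256 TEXT NOT NULL,  -- checksum of the produced data
    recorded_at   TEXT NOT NULL   -- UTC timestamp in ISO 8601 format
)
"""


def record_lineage(conn, output_name, source_name, transform, code_version, output_bytes):
    """Insert one lineage row and return the checksum of the output data."""
    checksum = hashlib.sha256(output_bytes).hexdigest()
    conn.execute(
        "INSERT INTO data_lineage "
        "(output_name, source_name, transform, code_version, output_sha256, recorded_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (output_name, source_name, transform, code_version, checksum,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return checksum


def upstream_of(conn, output_name):
    """Return every dataset the given output depends on, directly or indirectly."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(name) AS (
            SELECT source_name FROM data_lineage WHERE output_name = ?
            UNION
            SELECT l.source_name
            FROM data_lineage AS l
            JOIN chain AS c ON l.output_name = c.name
        )
        SELECT name FROM chain ORDER BY name
        """,
        (output_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

A processing script would call `record_lineage` once per output it writes. `upstream_of` then answers the question "which raw inputs does this table depend on?". Postgres also supports `WITH RECURSIVE`, so the query should carry over with little change.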

Polish Python docker image

add AWS CLI to docker image & add instructions for setting up proper AWS credentials
add SQLAlchemy to read from the PostgreSQL db
make sure there is a commented out step to incorporate dependencies from an arbitrary requirements.txt file
add guidelines for customizing the docker file
add guidelines for re-building the docker image and pushing it to Docker Hub

First check-up on team transportation

https://github.com/hackoregon/transportation-congestion-analysis/

Story Card Request

Project:
Card Title:
Card Document: (link here)

Milestones

Setup

Create Card Document from this template and link above

Type of data processing / analysis this story card uses

(a card can belong to multiple categories)

descriptive - simple data (re)-representation, including summary statistics; we do this for all data sets
explanatory - testing hypotheses and / or comparing data points
predictive - any regression, model fitting, classification or clustering tasks
prescriptive - when you want to recommend any action to be taken (we do this rarely, if at all)

Data documentation and proposed analysis

Document metadata
Request platform resources like database instances by copying the request template https://docs.google.com/document/d/1SlnEmRneRIP5Aco1vK2KBIA2l2lH__m9OTp5a9ymtgk, filling it out and sending it to the infra team
Decide whether to load to database or S3 with proper metadata documentation
Use platform-request form to request resources for database and / or AWS account to write data files to S3
Review metadata and proposed data analysis

Set up data processing development environment

Clone repo from template
Set up access to GitHub repo for all team members
Set up a container from a suitable version of the Dockerfile template
Prototyping and testing analysis proposals
Review additional proposed data analysis identified through prototyping
Write code for reproducible data processing steps with proper version control & data lineage
Data science results produced and documented
Data science peer reviewed

Build APIs

Data visualization:

Concept clearly articulated through card title, visualization title/subtitle, card question(s)/action(s), and card context
Titles and context use consistent language (e.g., census tract v. neighborhood) and match grain of data used in the visualization
Visualization and component choices in line with data visualization best practices
All components needed available in Storybook
Components available in Storybook demonstrate all needed features
Follows data visualization and interface guidelines available in Storybook

Design

TBD Wireframes?
TBD Design review?

Written content / additional links

Write content
Review content

Host tutorial section explaining how to sync up Google Colab and local machine for HackO data science work

complete #1
create slide deck
pick a good date and set up calendar event

help team transportation to put together a platform request for monitoring their db storage quota

See conversation with Ed

review story card data science code

refine data & API flowchart

[Completed] with https://docs.google.com/document/d/1RSHnpI5ICgO2WTHljFExEmuirRecx91ZMfEjKpELDVw/edit#heading=h.chp21trjbrqd

Figure out a recommended tracking system for data science workflow

read up on DVC
read up on MLflow
read up on Pachyderm
recommend the best solution

Sync up with backend team to come up with recommendations for storing & retrieving data from RDS

see what else we need to add to postgres tutorial from a data science perspective

Add guidelines to list S3 public bucket folder / file structure

There has been a request to find ways to list the directory structure of S3 files.

Right now we are unable to list the contents of our public hacko-data-archive S3 bucket with the AWS CLI, most likely due to permission issues
Reference from Stack Overflow here
@DingoEatingFuzz Is it possible for us to change the permission of the files & folders in that public S3 bucket? If permission is not the problem, how can we enable this?

[resolution] Karen will write a platform request to ask for a REST API to check what files are available in the S3 bucket
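
For reference, if the bucket policy were changed to allow anonymous listing (which is the open question above), S3's ListObjectsV2 REST call returns an XML `ListBucketResult` document that can be read without credentials. The AWS CLI equivalent is `aws s3 ls s3://hacko-data-archive --recursive --no-sign-request`. The sketch below parses such a response into a folder tree using only the Python standard library. The helper names are hypothetical, `fetch_listing` has not been tested against hacko-data-archive, and it reads only the first page of results (up to 1000 keys), with no pagination.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def fetch_listing(bucket, prefix=""):
    """Fetch the first page of a bucket listing anonymously (needs public list permission)."""
    query = urllib.parse.urlencode({"list-type": "2", "prefix": prefix})
    url = f"https://{bucket}.s3.amazonaws.com/?{query}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def parse_listing(xml_data):
    """Return the object keys found in a ListBucketResult XML document."""
    root = ET.fromstring(xml_data)
    return [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]


def build_tree(keys):
    """Nest keys such as 'a/b/c.csv' into dicts; files map to None."""
    tree = {}
    for key in keys:
        parts = [p for p in key.split("/") if p]
        if not parts:
            continue
        node = tree
        for folder in parts[:-1]:
            child = node.get(folder)
            if not isinstance(child, dict):
                # If a file and a folder share a name, the folder wins in this sketch.
                child = node[folder] = {}
            node = child
        if key.endswith("/"):
            # Zero-byte "folder placeholder" objects end with a slash.
            if not isinstance(node.get(parts[-1]), dict):
                node[parts[-1]] = {}
        else:
            node.setdefault(parts[-1], None)
    return tree


def format_tree(tree, depth=0):
    """Render the nested dict as indented lines, folders marked with a trailing slash."""
    lines = []
    for name in sorted(tree):
        child = tree[name]
        lines.append("  " * depth + name + ("/" if child is not None else ""))
        if child is not None:
            lines.extend(format_tree(child, depth + 1))
    return lines
```

Usage would look like `print("\n".join(format_tree(build_tree(parse_listing(fetch_listing("hacko-data-archive"))))))`. With the current permissions that call is expected to fail with an HTTP 403 error.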

explore housing tax lot data processing options with Stephen

discussion to be had

@stephenosserman has been exploring and wrangling our taxlot data in our housing-staging postgis db and been making some progress. So far I've appended census-block to each lot for each year (1997-2017); standardized the taxlot-id format; and added an index here and there to make querying faster. I've also written the queries to identify almost all changes from year to year -- including lot divisions, mergers of multiple lots, changes in lot-id without lot geography changes, and (most) other boundary changes. I'm hoping all of this will make longitudinal analysis of taxlot data a lot easier and more accurate.
The brainstorm prompt: What analyses do you think we and/or future folks might want to run using this dataset? Context is that there are a bunch of possible table-structures I could imagine using for storing the complex taxlot changes dataset I'm pulling together. I'd like actual analyses that people might run in the future to inform which table structure(s) I go with. Specifically hoping to work backwards from potential analyses to pseudo-sql to see which approaches might be the best combination of comprehensive, streamlined, and flexible. We have a few concrete uses for existing housing-team work this year which I'm thinking about, and I'll seed a thread with a few other ideas, but would love more ideas. Thx in advance!

some possible types of analyses to kick off brainstorming:

amount of division over time in different parts of town;
changes in property values for taxlots that haven't changed;
descriptive or predictive analyses regarding changes in sale prices, land values, and divisions;
impacts of changes in zoning and other factors on all of the above;
kinds of activity in gentrifying neighborhoods vs other neighborhoods; (or relatedly, evaluation of different measures of gentrification from census variables and elsewhere to see which most successfully predicts different sorts of gentrification as witnessed through changes in taxlot data)
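
To seed the "work backwards from analyses to pseudo-sql" exercise, here is a sketch of one candidate structure: an edge table with one row per (old lot, new lot) pair per year, tagged with a change type. It works backwards from the first analysis above, the amount of division over time in different parts of town. The table name, column names, and data are all invented for illustration. This is not the schema in housing-staging, and `sqlite3` stands in for PostGIS so the example runs anywhere.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical taxlot_changes edge table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE taxlot_changes (
            year         INTEGER NOT NULL,  -- year the change first appears
            old_lot_id   TEXT NOT NULL,     -- lot id before the change
            new_lot_id   TEXT NOT NULL,     -- lot id after the change
            change_type  TEXT NOT NULL,     -- 'division', 'merger', 'id_change', ...
            census_block TEXT NOT NULL      -- block of the old lot
        );
        INSERT INTO taxlot_changes VALUES
            (2001, 'R100', 'R100A', 'division',  'B1'),
            (2001, 'R100', 'R100B', 'division',  'B1'),
            (2001, 'R300', 'R300A', 'division',  'B2'),
            (2001, 'R300', 'R300B', 'division',  'B2'),
            (2002, 'R400', 'R400A', 'division',  'B1'),
            (2002, 'R400', 'R400B', 'division',  'B1'),
            (2002, 'R410', 'R410A', 'division',  'B1'),
            (2002, 'R410', 'R410B', 'division',  'B1'),
            (2002, 'R500', 'R600',  'merger',    'B1'),
            (2002, 'R501', 'R600',  'merger',    'B1'),
            (2002, 'R700', 'R701',  'id_change', 'B2');
        """
    )
    return conn


def divisions_by_block_and_year(conn):
    """Count how many distinct lots were divided, per census block and year."""
    return conn.execute(
        """
        SELECT census_block, year, COUNT(DISTINCT old_lot_id) AS lots_divided
        FROM taxlot_changes
        WHERE change_type = 'division'
        GROUP BY census_block, year
        ORDER BY census_block, year
        """
    ).fetchall()
```

With this shape, a division is several rows that share an `old_lot_id`, and a merger is several rows that share a `new_lot_id`, so both can be counted with a plain `GROUP BY`. Analyses of lots that have not changed would likely need a separate per-year lot table, which this sketch does not cover.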

Build Docker image and dockerfile for R users

finalize the R libraries to put in docker image

hackoregon / 2019hackordatasciencetemplate