
2019hackordatasciencetemplate's People

Contributors

karenng-civicsoftware, karenyyng

2019hackordatasciencetemplate's Issues

Polish Python docker image

  • Add the AWS CLI to the Docker image and add instructions for setting up AWS credentials
  • Add SQLAlchemy so that data can be read from a PostgreSQL database
  • Make sure there is a commented-out step for installing dependencies from an arbitrary requirements.txt file
  • Add guidelines for customizing the Dockerfile
  • Add guidelines for rebuilding the Docker image and pushing it to Docker Hub
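
As a rough illustration of the SQLAlchemy item above, the sketch below builds a PostgreSQL connection URL from environment variables and runs a read-only query. The variable names (POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_HOST, POSTGRES_PORT, POSTGRES_DB) and both helper functions are assumptions made for this example, not something the template defines, and the psycopg2 driver is assumed to be installed in the image.

```python
import os
from urllib.parse import quote


def postgres_url_from_env(env=None):
    """Build a SQLAlchemy-style PostgreSQL URL from environment variables.

    The variable names are hypothetical placeholders for this sketch.
    """
    env = os.environ if env is None else env
    # Percent-encode credentials so characters such as "@" or "/" stay valid.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    name = env["POSTGRES_DB"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


def read_query(sql, env=None):
    """Run a read-only query and return the rows as a list of tuples."""
    # Imported lazily so the URL helper works without SQLAlchemy installed.
    from sqlalchemy import create_engine, text

    engine = create_engine(postgres_url_from_env(env))
    with engine.connect() as conn:
        return [tuple(row) for row in conn.execute(text(sql))]
```

Keeping credentials in environment variables means they can be passed to the container at run time instead of being baked into the image.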

Example story card 1

Story Card Request

Project:
Card Title:
Card Document: (link here)

Milestones

Setup

Type of data processing / analysis this story card uses

(a card can belong to multiple categories)

  • descriptive - simple data (re-)representation; summary statistics belong to this category, and we do this for all data sets
  • explanatory - testing hypotheses and / or comparing data points
  • predictive - any regression, model fitting, classification or clustering tasks
  • prescriptive - when you want to recommend any action to be taken (we do this rarely, if at all)
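
Because a card can belong to several of these categories at once, one possible way to record them in code is a flag enumeration. This is only a sketch: the AnalysisType class and the categories helper are hypothetical names, not part of the template.

```python
from enum import Flag, auto


class AnalysisType(Flag):
    """The four analysis categories from the list above."""
    DESCRIPTIVE = auto()
    EXPLANATORY = auto()
    PREDICTIVE = auto()
    PRESCRIPTIVE = auto()


def categories(card_type):
    """Return the lowercase names of the categories set on a card."""
    return [t.name.lower() for t in AnalysisType if t in card_type]


# A card that summarizes the data and also fits a model.
example_card = AnalysisType.DESCRIPTIVE | AnalysisType.PREDICTIVE
```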

Data documentation and proposed analysis

  • Document metadata
  • Request platform resources like database instances by copying the request template https://docs.google.com/document/d/1SlnEmRneRIP5Aco1vK2KBIA2l2lH__m9OTp5a9ymtgk, filling it out and sending it to the infra team
  • Decide whether to load to database or S3 with proper metadata documentation
  • Use platform-request form to request resources for database and / or AWS account to write data files to S3
  • Review metadata and proposed data analysis

Set up data processing development environment

  • Clone repo from template
  • Set up access to GitHub repo for all team members
  • Set up a container from a suitable version of the Dockerfile template
  • Prototyping and testing analysis proposals
  • Review additional proposed data analysis identified through prototyping
  • Write code for reproducible data processing steps with proper version control & data lineage
  • Data science results produced and documented
  • Data science peer reviewed

Build APIs

  • Database deployed to cloud
  • Initial API repo created via cookiecutter, using templatized names
  • API developer confers with Data Visualization/Frontend teams regarding story card MVP
  • API developer confers with Data Scientists regarding all needed calculations, filters and queries, validation
  • Consider writing an OpenAPI specification first as a contract; it can help clarify the needs and requirements - https://swagger.io/docs/specification/about/
  • See also https://apievangelist.com/2018/04/03/openapi-is-the-contract-for-your-microservice/
  • Basic API in container
  • Tests are created to validate API and prevent regressions
  • API developer provides documentation on all endpoints, calculations, filters and queries
  • API developer creates metadata endpoints as further defined
  • Basic API deployed to cloud
  • API endpoint with all needed calculations, filters and queries is available
  • Final validation with Data Scientists regarding endpoint?
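
To make the contract-first idea in the list above concrete, here is a minimal sketch of an OpenAPI 3 document written as a Python dictionary, plus a helper that lists the operations it declares. The /taxlots path, its year parameter, and the titles are placeholders invented for this example and do not describe any existing API.

```python
import json

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

# Hypothetical contract for a single story-card endpoint.
SPEC = {
    "openapi": "3.0.3",
    "info": {"title": "Example story card API", "version": "0.1.0"},
    "paths": {
        "/taxlots": {
            "get": {
                "summary": "List tax lots, optionally filtered by year",
                "parameters": [
                    {
                        "name": "year",
                        "in": "query",
                        "required": False,
                        "schema": {"type": "integer"},
                    }
                ],
                "responses": {"200": {"description": "Matching tax lots"}},
            }
        }
    },
}


def list_operations(spec):
    """Return (METHOD, path, summary) for every operation in the contract."""
    ops = []
    for path, item in sorted(spec["paths"].items()):
        for method, op in sorted(item.items()):
            # Path items may also hold non-operation keys such as "parameters".
            if method in HTTP_METHODS:
                ops.append((method.upper(), path, op.get("summary", "")))
    return ops


if __name__ == "__main__":
    # Print the contract as JSON so it can be shared with other teams.
    print(json.dumps(SPEC, indent=2))
```

A listing like this gives the data science and frontend teams a short checklist of endpoints to review before any implementation work starts.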

Data visualization:

  • Concept clearly articulated through card title, visualization title/subtitle, card question(s)/action(s), and card context
  • Titles and context use consistent language (e.g., census tract v. neighborhood) and match grain of data used in the visualization
  • Visualization and component choices in line with data visualization best practices
  • All components needed available in Storybook
  • Components available in Storybook demonstrate all needed features
  • Follows data visualization and interface guidelines available in Storybook

Design

  • TBD Wireframes?
  • TBD Design review?

Written content / additional links

  • Write content
  • Review content

Add guidelines to list S3 public bucket folder / file structure

There has been a request to find ways to list the directory structure of S3 files.

Right now we are unable to list the contents of our public hacko-data-archive S3 bucket with the AWS CLI, most likely due to permission issues.
Reference from Stack Overflow here
@DingoEatingFuzz Is it possible for us to change the permissions of the files and folders in that public S3 bucket? If permissions are not the problem, how can we enable this?

[resolution] Karen will write a platform request to ask for a REST API to check what files are available in the S3 bucket
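
For reference, if anonymous listing were enabled on the bucket, the standard S3 REST listing request (list-type=2) returns an XML document that can be read with the Python standard library alone. The sketch below assumes that permission has been granted, which is exactly what is in question above. The helper names are hypothetical, and pagination is not handled (S3 returns at most 1000 keys per response).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# XML namespace used by S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket, prefix=""):
    """Build the URL that lists one "folder" level of a bucket."""
    query = urlencode({"list-type": "2", "delimiter": "/", "prefix": prefix})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (folders, files) found in a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    folders = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    files = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return folders, files


def list_bucket(bucket, prefix=""):
    """Fetch and parse one listing page; fails if listing is not permitted."""
    with urlopen(listing_url(bucket, prefix)) as resp:
        return parse_listing(resp.read())
```

Calling list_bucket("hacko-data-archive") would raise an HTTP error while the bucket denies anonymous listing, so it also serves as a quick check of whether permissions are the problem.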

explore housing tax lot data processing options with Stephen

discussion to be had

@stephenosserman has been exploring and wrangling our taxlot data in our housing-staging postgis db and has been making some progress. So far I've appended census-block to each lot for each year (1997-2017); standardized the taxlot-id format; and added an index here and there to make querying faster. I've also written the queries to identify almost all changes from year to year, including lot divisions, mergers of multiple lots, changes in lot-id without lot geography changes, and (most) other boundary changes. I'm hoping all of this will make longitudinal analysis of taxlot data a lot easier and more accurate.
The brainstorm prompt: What analyses do you think we and/or future folks might want to run using this dataset? Context is that there are a bunch of possible table structures I could imagine using for storing the complex taxlot changes dataset I'm pulling together. I'd like actual analyses that people might run in the future to inform which table structure(s) I go with. Specifically, I'm hoping to work backwards from potential analyses to pseudo-sql to see which approaches might be the best combination of comprehensive, streamlined, and flexible. We have a few concrete uses for existing housing-team work this year which I'm thinking about, and I'll seed a thread with a few other ideas, but would love more ideas. Thx in advance!

some possible types of analyses to kick off brainstorming:

  • amount of division over time in different parts of town;
  • changes in property values for taxlots that haven't changed;
  • descriptive or predictive analyses regarding changes in sale prices, land values, and divisions;
  • impacts of changes in zoning and other factors on all of the above;
  • kinds of activity in gentrifying neighborhoods vs other neighborhoods; (or relatedly, evaluation of different measures of gentrification from census variables and elsewhere to see which most successfully predicts different sorts of gentrification as witnessed through changes in taxlot data)
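
To illustrate working backwards from an analysis to a table structure, here is one possible sketch for the first idea above (amount of division over time in different parts of town). It stores one row per parent-to-child lot relationship and counts divided lots per census block and year. The taxlot_change table, its columns, and the change_type values are hypothetical, and SQLite is used only so the example is self-contained; the real data lives in PostGIS.

```python
import sqlite3

# Hypothetical "one row per parent/child pair" layout for lot changes.
SCHEMA = """
CREATE TABLE taxlot_change (
    year INTEGER NOT NULL,        -- year in which the change first appears
    parent_lot_id TEXT NOT NULL,  -- lot id in the previous year
    child_lot_id TEXT NOT NULL,   -- lot id in this year
    census_block TEXT NOT NULL,
    change_type TEXT NOT NULL     -- e.g. 'division', 'merger', 'renumber'
);
"""

DIVISIONS_BY_BLOCK = """
SELECT census_block, year, COUNT(DISTINCT parent_lot_id) AS lots_divided
FROM taxlot_change
WHERE change_type = 'division'
GROUP BY census_block, year
ORDER BY census_block, year;
"""


def divisions_by_block(rows):
    """Load change rows into an in-memory table and count divided lots."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.executemany("INSERT INTO taxlot_change VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(DIVISIONS_BY_BLOCK).fetchall()
    finally:
        conn.close()
```

With this layout a division appears as several rows sharing a parent and a merger as several rows sharing a child, so both kinds of query stay simple. Whether it also suits the boundary-change cases would need to be checked against the real data.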
