The onereview_data_collection from porterehunley

Configure Application for Deployment via WSGI and Gunicorn

Prepare the application for deployment with WSGI and Gunicorn
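
A minimal sketch of the entry point Gunicorn would load. The module name `wsgi.py`, the callable name `application`, and the bind address are assumptions, not taken from the repository. The stand-in below is a plain WSGI callable; the real project would expose its framework's app object instead.

```python
# wsgi.py (hypothetical module name): the object Gunicorn imports.
# Assumed launch command: gunicorn --bind 127.0.0.1:8000 wsgi:application


def application(environ, start_response):
    """Stand-in WSGI app; the real project would expose its own app object here."""
    body = b"data collection service is up\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```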

Toggle Years for Collection

Add in a years toggle for the collection at the front end.

Sequel Check

When a movie has multiple parts, the collector sometimes files a review under the second movie when it belongs to the first.

Add in Authentication to Start Collection

Right now the startservercontroller is open, and anyone can start the data collection by hitting that API endpoint. The application has a token system. I need to first validate that the user is logged in, then get their token, then call the protected endpoint.
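
One way the protected endpoint could be guarded, sketched in plain Python so it does not depend on the web framework. `VALID_TOKENS`, `Unauthorized`, `require_token`, and `start_collection` are all hypothetical names; the real token lookup would go through the application's existing token system.

```python
import functools
import hmac

# Hypothetical token store; the real application would query its database.
VALID_TOKENS = {"example-token"}


class Unauthorized(Exception):
    """Raised when a call carries no valid token."""


def require_token(handler):
    """Reject calls whose token is missing or unknown before running the handler."""
    @functools.wraps(handler)
    def wrapper(token, *args, **kwargs):
        supplied = (token or "").encode()
        # compare_digest avoids leaking token contents through timing differences.
        known = any(hmac.compare_digest(supplied, t.encode()) for t in VALID_TOKENS)
        if not known:
            raise Unauthorized("a valid token is required to start collection")
        return handler(*args, **kwargs)
    return wrapper


@require_token
def start_collection():
    # Placeholder for the real start-collection controller.
    return "collection started"
```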

Go Through Cloud Storage Documentation

We need to store our data somewhere. Google Cloud is a pretty good option since we are using YouTube's API.

Documentation below
https://cloud.google.com/storage/docs/

Add API Documentation

Add a page to the front end that documents the API usage and Auth system

What video data do we need to Query?

We do not have infinite queries, and some are bigger than others. We should not go through the closed captions (CC) of a video unless we have to.

Should we look at comments, CCs, other things? What data do we need to get from the videos to train our algorithms? This is going to require training some crude ML models with clean data.

Go through ML book chapter 2

Do an in-depth reading of chapter 2. This chapter has an example of an end-to-end ML project. It is all pretty useful and helps us frame our project.

https://github.com/devakar/deep-learning-books/blob/master/Hands%20on%20Machine%20Learning%20with%20Scikit%20Learn%20and%20TensorFlow.pdf

Collection Concurrency Issue

Right now, multiple authenticated users can each start their own collection process. Add a check for whether a controller is already working at that moment.
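
A minimal sketch of such a check, assuming the controller runs inside a single process. `try_start_collection` and `_collection_lock` are hypothetical names.

```python
import threading

# Process-local lock (assumption: one worker process runs the controller).
_collection_lock = threading.Lock()


def try_start_collection(run):
    """Call run() only if no other collection is active; return True if it ran."""
    if not _collection_lock.acquire(blocking=False):
        return False  # another controller is already working
    try:
        run()
        return True
    finally:
        _collection_lock.release()
```

With several Gunicorn worker processes this lock would not be shared between them, so a flag in the database or a lock file would be needed instead.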

Add Easy Score Videos

This is a large issue with a couple of parts. First, we need to be able to click on a movie on the frontend and have it display the YouTube videos associated with that movie. Then we need to be able to enter a score for those videos if there is not one already.

Remove Internal API Calls

Remove the internal API calls inside the application to increase speed and reduce complexity. This also makes the application more configurable.

Add Channel into Video

Add a channel both to the SQL database and to Firestore.

Set Up CircleCI for Packaging/Deployment

See the title.

Add Data Migration

Write a script, integrated with the application, that transfers data from the database into Firestore.

Stabilize Data Collector

The data collector will sometimes skip over parts of the data pipeline. I do not know why.

Clear Current Entry

Let the user select an entry, then clear it.

Add Captions Monitoring

Have the frontend color the entries that don't have captions.

Entry Monitors

Let the user know on the front end whether or not a media item contains all the data entries needed.
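
A rough sketch of the completeness check the front end could rely on. The field names in `REQUIRED_FIELDS` are assumptions; the real list depends on the data pipeline's schema.

```python
# Hypothetical field names; adjust to the real media item schema.
REQUIRED_FIELDS = ("title", "year", "videos", "captions")


def missing_fields(item):
    """Return the required fields that are absent or empty in a media item dict."""
    return [name for name in REQUIRED_FIELDS if not item.get(name)]


def is_complete(item):
    """True when every required data entry is present and non-empty."""
    return not missing_fields(item)
```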

Recollect Damaged Data

Add a button that recollects incomplete data. Complete the preceding issue first; it allows detection of media items with an incomplete data pipeline.

Create IMDB Webscraper

Create a scraper that gets the movie titles from IMDB.

Make Max_Videos Configurable

Make the number of videos that the controller collects configurable from the front end.

Better Logging System

Create logs for specified directories.

Email Approve Users

Have an approval system over email: if an applicant is approved from the email, register them in the database with a new token.

Make app_name configurable

Add a configuration file that makes the app name configurable for better routing.
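
A minimal sketch using the standard library's `configparser`. The `[app]` section, the `app_name` key, the default value, and the function name are all assumptions.

```python
import configparser


def load_app_name(path, default="onereview"):
    """Read app_name from an INI file, falling back to a default (hypothetical)."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is skipped without raising
    return parser.get("app", "app_name", fallback=default)
```

The file itself would then hold something like an `[app]` section with a single `app_name = ...` line.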

Provision Server for Deployment

Provision and set up a server to hold an Nginx deployment.

Clean YouTube Data

Gather some example data from a couple YouTube videos relating to different products and see what it all looks like and how we should clean it.

Mark it up using Python and go ahead and commit it.

Hitting Quota Limits

Keep the application from crashing entirely when it hits a quota limit.
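
One possible shape for this, assuming the API wrapper raises a dedicated exception when the quota is spent. `QuotaExceeded`, `collect_all`, and the `fetch` callable are hypothetical; the real YouTube client reports quota errors in its own way.

```python
class QuotaExceeded(Exception):
    """Hypothetical error raised by the API wrapper when the quota is spent."""


def collect_all(titles, fetch):
    """Collect until done or out of quota; return (collected, still_pending)."""
    collected = {}
    pending = list(titles)
    while pending:
        title = pending[0]
        try:
            collected[title] = fetch(title)
        except QuotaExceeded:
            break  # stop cleanly and keep the remaining titles for a later run
        pending.pop(0)
    return collected, pending
```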

Setup data transformation pipeline

Set up the data transformation pipeline for YouTube data. Dirty data in, clean data out.
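
A small sketch of the "dirty data in, clean data out" idea as a chain of text-cleaning steps. The specific steps are illustrative assumptions; the real ones depend on what the example data looks like.

```python
import html
import re


def unescape_entities(text):
    """Turn HTML entities such as &amp; back into characters."""
    return html.unescape(text)


def strip_urls(text):
    """Remove http(s) links."""
    return re.sub(r"https?://\S+", "", text)


def collapse_whitespace(text):
    """Squeeze runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Order matters: each step receives the output of the previous one.
CLEANING_STEPS = (unescape_entities, strip_urls, collapse_whitespace)


def clean(text, steps=CLEANING_STEPS):
    for step in steps:
        text = step(text)
    return text
```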

Add Current Entry

Let the user select a current entry, then add it.

Add User Registration to front End

Add a user registration page that submits to an email registered with TrueReview.

Zap Test

Setup YouTube API Token

Set up an authorized account that has access to YouTube's API.

Prevent crashing when data collector hits existing media title

Make sure the data collector does not crash when it tries to collect a media item that is already in the database. Let the user know, then go to the next entry.
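
A sketch of the skip-and-continue behavior. The function name, the `existing` set standing in for the database lookup, and the `collect` and `notify` callables are hypothetical.

```python
def collect_new_titles(titles, existing, collect, notify=print):
    """Collect each title not already stored; report and skip duplicates."""
    added = []
    for title in titles:
        if title in existing:
            notify(f"skipping {title!r}: already in the database")
            continue  # move on to the next entry instead of crashing
        collect(title)
        existing.add(title)
        added.append(title)
    return added
```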

Stop Collection Button

Add a button that will stop the data collection.
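
On the server side, the button could set a flag that the collection loop checks between items. This is a sketch under the assumption that collection runs in a thread of the same process; `stop_requested` and `run_collection` are hypothetical names.

```python
import threading

# Set by the (hypothetical) stop endpoint; checked by the collection loop.
stop_requested = threading.Event()


def run_collection(titles, collect):
    """Collect titles one by one, stopping early once a stop was requested."""
    done = []
    for title in titles:
        if stop_requested.is_set():
            break
        collect(title)
        done.append(title)
    return done
```

Calling `stop_requested.clear()` before the next run would re-arm the loop.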

Write Server Provisioning Scripts

Write a script, in Ansible or a similar tool, that installs and provisions the server and sets up the database. It should SSH into the Ubuntu host: [email protected] and set everything up.

Add Movie Year to server status API

Add the current year of the movie titles to the server status API so the frontend can update correctly.

Setup development environment

Install Anaconda and set up an environment in Sublime Text.
