
deeplearningproject's Introduction


An end-to-end tutorial of a machine learning pipeline

This tutorial tries to do what most machine learning tutorials available online do not. It is not a 30-minute lesson that teaches you how to "Train your own neural network" or promises "Deep learning in under 30 minutes". It walks through the full pipeline you would need to build if you actually work with machine learning, introducing all the parts and all the implementation decisions and details that have to be made along the way. The dataset is not one of the standard sets like MNIST or CIFAR; you will make your very own dataset. Then you will work through a couple of conventional machine learning algorithms before finally getting to deep learning!

In the fall of 2016, I was a Teaching Fellow (Harvard's version of a TA) for the graduate class "Advanced Topics in Data Science (CS209/109)" at Harvard University. I was in charge of designing the class project given to the students, and this tutorial has been built on top of the project I designed for the class.

UPDATE 24th October 2018

The tutorial has now been rewritten in PyTorch, thanks to Anshul Basia (https://github.com/AnshulBasia)

You can access the HTML version here: https://spandan-madan.github.io/DeepLearningProject/PyTorch_version/Deep_Learning_Project-Pytorch.html and the IPython Notebook with the PyTorch code here: https://github.com/Spandan-Madan/DeepLearningProject/blob/master/PyTorch_version/Deep_Learning_Project-Pytorch.ipynb

Citing this work

If you would like to use this work, please cite it using the repository's DOI badge.

Reading/Viewing the Tutorial

To view the project as an HTML file, visit - https://spandan-madan.github.io/DeepLearningProject/

The Code

If you would like to access the code, please go through the IPython notebook Deep_Learning_Project.ipynb

SETUP

Python

  • We will be using Python 2.7. The primary reason is that TensorFlow is not compatible with Python > 3.5, and some other libraries used here are not compatible with Python 3.

To make setup easy, we are going to use conda.

  • Please install conda from https://www.continuum.io/downloads
  • The repository has a conda config file which makes setting up super easy. It's the file deeplearningproject_environment.yml
  • Then create a new conda environment using the command: conda env create -f deeplearningproject_environment.yml
  • Now, you can activate the environment with: source activate deeplearningproject
  • Run jupyter notebook. If all the installations went through, you are good to go! If not, here is the list of packages that need to be installed: requests, IMDbPY, wget, tmdbsimple, seaborn, sklearn, Pillow, keras, tensorflow, h5py, gensim, nltk, stop_words

Please install IMDbPY using pip install imdbpy==6.6, since earlier versions are broken.
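Putting the steps above together (including the IMDbPY pin), the whole setup boils down to:

```bash
conda env create -f deeplearningproject_environment.yml
source activate deeplearningproject
pip install imdbpy==6.6
jupyter notebook
```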

Setting up conda environment in jupyter notebook

To be able to run the environment you just created in a Jupyter notebook, first check that you have the Python package ipykernel installed. If you don't, simply install it using

pip install ipykernel

Now, add this environment as a kernel to your Jupyter notebook using the command:

python -m ipykernel install --user --name deeplearningproject --display-name "deeplearningproject"

Needless to say, remove all single quotes before running commands.

Go to the repository directory, run jupyter notebook, and open the respective notebook in your browser. To install the TMDB wrapper: pip install tmdbsimple, and import it with import tmdbsimple as tmdb.
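For example, a minimal sketch of using the wrapper (the API key below is a placeholder; you need to register your own key at themoviedb.org):

```python
import tmdbsimple as tmdb

tmdb.API_KEY = 'YOUR_TMDB_API_KEY'  # placeholder, not a real key

# Query TMDB for a movie and print the first matching title.
search = tmdb.Search()
response = search.movie(query='The Matrix')
print(response['results'][0]['title'])
```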

Setting up a docker container with docker-compose

Prerequisites

Docker and docker-compose installed on your system.

Run docker-compose

To work in an isolated environment that can run on many systems without trouble, you can use this docker-compose command:

docker-compose up

It will build the deeplearningproject image according to the Dockerfile, and then run the container via docker-compose. See the Docker and docker-compose docs for more information.

Then access notebooks through your web browser at http://localhost:8888

You should notice that the notebooks have been copied from the root into the notebooks folder so they can be mounted into the container via a bind volume. Any changes you make will be saved on the host (in the notebooks dir).
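For reference, a minimal sketch of what such a docker-compose.yml might look like; the service name and paths here are assumptions, and the file shipped in the repository is authoritative:

```yaml
version: '3'
services:
  notebook:
    build: .                    # builds the deeplearningproject image from the Dockerfile
    image: deeplearningproject
    ports:
      - "8888:8888"             # Jupyter is reachable at http://localhost:8888
    volumes:
      - ./notebooks:/notebooks  # bind volume so changes persist on the host
```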

Add packages

You can add conda or pip packages to the image (and thus, the container) by updating the deeplearningproject_environment.yml file and then running

docker-compose build

It will build a new deeplearningproject image with the new conda/pip packages installed. Stop your running container (CTRL-C) and then run docker-compose up again to start a fresh container.

Known common bugs

I will keep updating this as issues pop up on this repository.

  • One known bug arises because Keras 2.0 is not compatible with some Keras 1.2 functionality. You may run into errors when importing VGG16. If so, just update Keras using the following command:
sudo pip install git+git://github.com/fchollet/keras.git --upgrade

  • OS Error: Too many open files. Refer to https://stackoverflow.com/questions/16526783/python-subprocess-too-many-open-files, or shut down the notebook and execute the following in the same terminal:

```bash
ulimit -Sn 10000
```


And restart the jupyter notebook.
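Alternatively, on Unix systems the limit can be raised from inside Python before training starts, using the standard-library resource module (a sketch; 10000 mirrors the ulimit value above):

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Raise the soft limit; without root you cannot exceed the hard limit.
target = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```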

Hope this repo helps introduce you to a full machine learning pipeline! If you spot an error, please create an issue to help out others using this resource!

To prevent problems with installation and setup, this repository comes with a conda environment profile. The only thing you need is to install the newest version of conda and use this profile to create a new environment; it will come set up with all the libraries you need for the tutorial.

deeplearningproject's People

Contributors

anshulbasia, biogeek, bobbleoxs, brandly, mel-jecker, mkilavuz, spandan-madan, tomraulet, vargas


deeplearningproject's Issues

Example binarized vector representation is syntactically incorrect

Now let's store the genres for these movies in a list that we will later transform into a binarized vector.

Binarized vector representation is a very common and important way data is stored/represented in ML. Essentially, it's a way to encode a categorical variable with n possible values as n binary indicator variables. What does that mean? For example, let [(1,3),(4)] be a list saying that sample A has the two labels 1 and 3, and sample B has the one label 4. For every sample and every possible label, the representation is simply 1 if the sample has that label and 0 if it doesn't. So the binarized version of the above list will be -

> [(1,0,1,0]),
> (0,0,0,1])]

The output in this section contains a bracket-mismatch syntax error. Not a huge problem, but probably a bit confusing for a beginner. Otherwise, great tutorial!
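The intended output is [(1,0,1,0), (0,0,0,1)]. For what it's worth, a hedged sketch of producing this representation with scikit-learn (sklearn is already in the environment; the explicit classes argument pins the column order):

```python
from sklearn.preprocessing import MultiLabelBinarizer

samples = [(1, 3), (4,)]  # sample A has labels 1 and 3; sample B has label 4
mlb = MultiLabelBinarizer(classes=[1, 2, 3, 4])
print(mlb.fit_transform(samples))
# [[1 0 1 0]
#  [0 0 0 1]]
```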

Dependency Issue on Windows

Tried to set up with the .yml file, but it aborted.
Manual installation of the requested packages led to an error: TensorFlow on Windows is only supported with 64-bit Python 3.5. Updating Python raises dependency errors for functools32 and subprocess32, which only run on Python 2.7.
So based on my limited knowledge: there is no way of setting up the environment on Windows. Or am I missing something?

Got IOError half way through learning

When I got to the last session on the textual model, the model went through 5 epochs and then threw this error:

IOError: [Errno 24] Too many open files

I went ahead trying to change $ulimit -n but realized the easiest way is to just change num_workers to 4. It's arbitrary, but someone suggested that 4 × the number of GPUs is a good approximation for num_workers.

It's not a specific issue per se, but I think it may be beneficial for people to know that this is one of the nuances in building an ML pipeline which is not necessarily apparent.
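If you hit this in the PyTorch version of the tutorial, the setting lives on the DataLoader; a hedged sketch with a placeholder dataset and batch size:

```python
from torch.utils.data import DataLoader

# Fewer worker processes keep fewer file handles open at once;
# 4 * (number of GPUs) is the rule of thumb mentioned above.
train_loader = DataLoader(train_dataset, batch_size=32, num_workers=4)
```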

Finding words that are most predictive of a genre

Hi, this was an extremely useful document, and I learnt a lot from the tutorial. An interesting extension to the problem would be to identify the words in the synopses that most distinguish a genre from other genres in the model - I have an analogous task in my project.

Is there a way to find the words that are most predictive of a genre? For example, is there a way to identify that the words 'battle', 'challenge' and 'fight' (say) are the most predictive of a movie falling into the 'Action' category, based on the model we trained? That is, which words in a synopsis most prominently indicate that it would fall under a particular genre, using the model we have fit?
This basically translates to decoding the algorithm to find out how it works "under the hood": what features (words) it uses to classify a synopsis into a genre.

A solution I found online is in the code snippet below, using the classifier coefficients from clf.coef_ (clf is the name of the fitted model) and picking the top 10 words the model uses to distinguish/identify a genre in a given text.

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints the features with the highest coefficient values, per class."""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        # Indices of the 10 largest coefficients for this class
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
                          " ".join(feature_names[j] for j in top10)))

Please let me know if this is appropriate and if there is a better way of doing this.
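For reference, a hedged usage sketch; vectorizer, clf, and the genre names below are placeholders for whatever was fitted earlier in the notebook (a bag-of-words vectorizer and a linear classifier whose coef_ has one row per class):

```python
# Hypothetical usage; only meaningful for linear models such as
# LogisticRegression or LinearSVC, which expose per-class coefficients.
genre_labels = ['Action', 'Comedy', 'Drama']  # placeholder labels, in clf's class order
print_top10(vectorizer, clf, genre_labels)
```

Note that clf.coef_ only exists for linear models; for non-linear models a different interpretability technique would be needed.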

Would be nice to support Python 3.

Just to be clear - consider this as just a minor suggestion rather than a complaint.

Thank you for this tutorial! It's rare that people actually spend a lot of time to make a great free learning resource.

Possible wrong syntax

In [26]:

# Create a tmdb genre object!
genres = tmdb.Genres()
# The list() method of the Genres() class returns a listing of all genres in the form of a dictionary.
list_of_genres = genres.list()['genres']

The above segment throws an error when executed.

I apologize if this is a trivial issue. I'm new to Python. It'll be great if someone can help me resolve this.

Cut out warnings from imports due to numpy ufunc and dtype sizes

Nice job with the notebooks! On block 2, if you'd like to get rid of the RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88 warnings, you can just add this at the bottom of the block:

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

"That" vs "Which" grammatical error.

Kudos on a very well done write-up. I have a simple grammatical correction: in many cases, you have used 'which' in place of 'that'.

See http://www.writersdigest.com/online-editor/which-vs-that

If 'which' is used to describe something, and is not preceded by a comma, it is a likely candidate for the confusion.

For example,
'use the available data to learn a function which can' ==> 'use the available data to learn a function that can'

A few things to look at in *Deep Learning to extract visual features from posters* in Section 7

  1. You declared the VGG model function and stored it in the variable 'model', but used the variable 'model_viz' for training, which means you did not use VGG at all. You can check your model by typing 'print(model_viz.layers)'. If you struggle to fix this issue, I can help with this section if you add me as an author.
  2. It is important to show how well your model is trained. I would recommend plotting curves of loss and accuracy with the History instance returned from the 'model.fit()' function, or a confusion matrix from the predictions to show false positives and vice versa. A sketch follows below.
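As a sketch of the second suggestion (a hedged example; model_viz, X_train, and Y_train stand in for the model and data built in Section 7, and the keyword is nb_epoch rather than epochs on Keras 1.x):

```python
import matplotlib.pyplot as plt

# Train while keeping the History object that Keras returns from fit().
history = model_viz.fit(X_train, Y_train, validation_split=0.1, epochs=10)

# Plot training vs. validation loss to see how well the model is learning.
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()
```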

help

In cell [41], when I execute it I get the following error:

HTTPError Traceback (most recent call last)
in ()
17 url += '&with_genres=' + str(g_id) + '&page=' + str(page)
18
---> 19 data = urllib2.urlopen(url).read()
20
21 dataDict = json.loads(data)

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
433 for processor in self.process_response.get(protocol, []):
434 meth = getattr(processor, meth_name)
--> 435 response = meth(req, response)
436
437 return response

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in http_response(self, request, response)
546 if not (200 <= code < 300):
547 response = self.parent.error(
--> 548 'http', request, response, code, msg, hdrs)
549
550 return response

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in error(self, proto, *args)
471 if http_err:
472 args = (dict, 'default', 'http_error_default') + orig_args
--> 473 return self._call_chain(*args)
474
475 # XXX probably also want an abstract factory that knows when it makes

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
405 func = getattr(handler, meth_name)
406
--> 407 result = func(*args)
408 if result is not None:
409 return result

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
554 class HTTPDefaultErrorHandler(BaseHandler):
555 def http_error_default(self, req, fp, code, msg, hdrs):
--> 556 raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
557
558 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 401: Unauthorized


Help appreciated. Thanks!
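An HTTP 401 from TMDB almost always means the API key is missing or invalid. A quick hedged check using the requests library from the environment (the endpoint follows TMDB's v3 API; the key is a placeholder):

```python
import requests

api_key = 'YOUR_TMDB_API_KEY'  # placeholder; register at themoviedb.org to get one
resp = requests.get('https://api.themoviedb.org/3/genre/movie/list',
                    params={'api_key': api_key})
print(resp.status_code)  # 200 means the key works; 401 means it does not
```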

Possible points of confusion and typos

Points of confusion

  • This section uses f as the generalized function and g as the exact function, whereas before f was exact and g was generalized. This has the potential to confuse readers.
  • On In [51] and In [52], id is assigned a value but does not seem to be used
  • On the section after Out [62] it says that the shape of Y is 1666,20 but the output of print Y.shape is (1595, 20). Where does the 1666 come from?

Typos

  • In the last sentence of the first paragraph of the same section, "listen to" should be changed to "watch"
  • in the last paragraph before In [68] (this section) "vocabular" should be "vocabulary"
  • In the first paragraph of this section, "can only integer values" should be "can only be integer values"
  • In the second item of the first list in this section, "difference models" should be "different models"

Small correction in cell no. 18

It should be:

list_of_genres = genres.movie_list()['genres']

since the list() method of the Genres() class (which returns a listing of all genres in the form of a dictionary) has been renamed to movie_list().

Thanks for the amazing tutorial.

Help!

Hi, I am new to ML. Can I start with this tutorial, or where and how should I start?
Thanks in advance.

Variables undefined when running the scripts

In Section 7, when extracting VGG features for the scraped images:
In the for loop where the try/except block is located, the variable 'imname' is never declared; it should probably be changed as follows:

for mov in poster_movies:
    i += 1
    mov_name = mov['original_title']
    mov_name1 = mov_name.replace(':', '/')
    poster_name = mov_name.replace(' ', '_') + '.jpg'
    if poster_name in imnames:
        img_path = poster_folder + poster_name
        try:
            img = image.load_img(img_path, target_size=(224, 224))
            succesful_files.append(poster_name)  # was the undefined 'imname'
            x = image.img_to_array(img)
            x = np.expand_dims(x, axis=0)
            x = preprocess_input(x)
            features = model.predict(x)
            file_order.append(img_path)
            feature_list.append(features)
            genre_list.append(mov['genre_ids'])
            if np.max(np.asarray(feature_list)) == 0.0:
                print('problematic', i)
            if i % 250 == 0 or i == 1:
                print "Working on Image : ", i
        except Exception, e:
            print Exception, ":", e  # for debugging
            failed_files.append(poster_name)  # was the undefined 'imname'
            continue
    else:
        continue

TMDB API key

Great walkthrough!

One recommendation is to replace your actual TMDB API key with a placeholder. That way no one can abuse your account via your API key.

P.S. Super nitpicky, but in that same block, I think the Jupyter step should read In [5]:

TMDB Genre list() changed to movie_list()

In Section 3, when looking at returning genres from TMDB, the instructions state to use the .list() method of the Genres object returned by tmdb.Genres().

There has been an update to the API and list() no longer exists. There are separate lists for movies, tv, etc. Currently the function we're looking for is movie_list(), which returns the list of movie genres.
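With a current tmdbsimple, the cell would therefore look something like this (assuming tmdb.API_KEY has already been set earlier in the notebook):

```python
import tmdbsimple as tmdb

genres = tmdb.Genres()
# list() was split into medium-specific methods; movie_list() returns movie genres.
list_of_genres = genres.movie_list()['genres']
```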

Tutorial is broken? Low recall/precision in TF results

Hey Spandan,

Looks like something has changed in the data or the model; the TF precision and recall in the final runs are very low (0.2 or so). You might need to future-proof this a bit more against changes to the TMDB or IMDb APIs.
