
deeplearningproject's Introduction


An end-to-end tutorial of a machine learning pipeline

This tutorial tries to do what most machine learning tutorials available online do not. It is not a 30-minute lesson that teaches you how to "Train your own neural network" or promises "Deep learning in under 30 minutes". It walks through the full pipeline you would need to build if you actually work with machine learning, introducing all the parts and all the implementation decisions and details that have to be made along the way. The dataset is not one of the standard sets like MNIST or CIFAR; you will make your very own dataset. Then you will work through a couple of conventional machine learning algorithms before finally getting to deep learning!

In the fall of 2016, I was a Teaching Fellow (Harvard's version of a TA) for the graduate class "Advanced Topics in Data Science (CS209/109)" at Harvard University. I was in charge of designing the class project given to the students, and this tutorial has been built on top of the project I designed for the class.

UPDATE 24th October 2018

The tutorial has now been rewritten in PyTorch, thanks to Anshul Basia (https://github.com/AnshulBasia)

You can access the HTML version here: https://spandan-madan.github.io/DeepLearningProject/PyTorch_version/Deep_Learning_Project-Pytorch.html and the IPython Notebook with the PyTorch code here: https://github.com/Spandan-Madan/DeepLearningProject/blob/master/PyTorch_version/Deep_Learning_Project-Pytorch.ipynb

Citing this work

If you would like to use this work, please cite it using the repository's DOI badge.

Reading/Viewing the Tutorial

To view the project as an HTML file, visit - https://spandan-madan.github.io/DeepLearningProject/

The Code

If you would like to access the code, please go through the IPython notebook Deep_Learning_Project.ipynb

SETUP

Python

  • We will be using Python 2.7. The primary reason is that TensorFlow is not compatible with Python > 3.5, and some other libraries used here are not compatible with Python 3.

To make setup easy, we are going to use conda.

  • Please install conda from https://www.continuum.io/downloads
  • The repository has a conda config file which makes setting up super easy. It's the file deeplearningproject_environment.yml
  • Then create a new conda environment using the command: conda env create -f deeplearningproject_environment.yml
  • Now, you can activate the environment with: source activate deeplearningproject
  • Run jupyter notebook. If all the installations went through, you are good to go! If not, here is the list of packages that need to be installed: requests, IMDbPY, wget, tmdbsimple, seaborn, sklearn, Pillow, keras, tensorflow, h5py, gensim, nltk, stop_words

Please install IMDbPY using pip install imdbpy==6.6, since earlier versions are broken.
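Putting the steps above together (including the IMDbPY pin), the whole setup boils down to:

```bash
conda env create -f deeplearningproject_environment.yml
source activate deeplearningproject
pip install imdbpy==6.6
jupyter notebook
```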

Setting up conda environment in jupyter notebook

To be able to run the environment you just created in a Jupyter notebook, first check that you have the Python package ipykernel installed. If you don't, simply install it using

pip install ipykernel

Now, add this environment as a kernel to your Jupyter notebook using the command:

python -m ipykernel install --user --name deeplearningproject --display-name "deeplearningproject"

Needless to say, remove all single quotes before running commands.

Go to the repository directory, run jupyter notebook, and open the respective notebook in your browser. To install the TMDB wrapper: pip install tmdbsimple, and import it with import tmdbsimple as tmdb.
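For example, a minimal sketch of using the wrapper (the API key below is a placeholder; you need to register your own key at themoviedb.org):

```python
import tmdbsimple as tmdb

tmdb.API_KEY = 'YOUR_TMDB_API_KEY'  # placeholder, not a real key

# Query TMDB for a movie and print the first matching title.
search = tmdb.Search()
response = search.movie(query='The Matrix')
print(response['results'][0]['title'])
```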

Setting up a docker container with docker-compose

Prerequisites

Docker and docker-compose installed on your system.

Run docker-compose

To work in an isolated environment that can run on many systems without trouble, you can use this docker-compose command:

docker-compose up

It will build the deeplearningproject image according to the Dockerfile, and then run the container via docker-compose. See the Docker and docker-compose docs for more information.

Then access notebooks through your web browser at http://localhost:8888

You should notice that the notebooks have been copied from the root into the notebooks folder so they can be mounted into the container via a bind volume. Any changes you make will be saved on the host (in the notebooks dir).
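For reference, a minimal sketch of what such a docker-compose.yml might look like; the service name and paths here are assumptions, and the file shipped in the repository is authoritative:

```yaml
version: '3'
services:
  notebook:
    build: .                    # builds the deeplearningproject image from the Dockerfile
    image: deeplearningproject
    ports:
      - "8888:8888"             # Jupyter is reachable at http://localhost:8888
    volumes:
      - ./notebooks:/notebooks  # bind volume so changes persist on the host
```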

Add packages

You can add conda or pip packages to the image (and thus, the container) by updating the deeplearningproject_environment.yml file and then running

docker-compose build

It will build a new deeplearningproject image with the new conda/pip packages installed. Stop your running container (CTRL-C) and then run docker-compose up again to start a fresh container.

Known common bugs

I will keep updating this as issues pop up on this repository.

  • One known bug arises because Keras 2.0 is not compatible with some Keras 1.2 functionality. You may run into errors when importing VGG16. If so, just update Keras using the following command:
sudo pip install git+git://github.com/fchollet/keras.git --upgrade

  • OS Error: Too many open files. Refer to https://stackoverflow.com/questions/16526783/python-subprocess-too-many-open-files, or shut down the notebook and execute the following in the same terminal:

```bash
ulimit -Sn 10000
```


And restart the jupyter notebook.
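Alternatively, on Unix systems the limit can be raised from inside Python before training starts, using the standard-library resource module (a sketch; 10000 mirrors the ulimit value above):

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Raise the soft limit; without root you cannot exceed the hard limit.
target = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```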

Hope this repo helps introduce you to a full machine learning pipeline! If you spot an error, please create an issue to help out others using this resource!

To prevent problems with installation and setup, this repository comes with a conda environment profile. The only thing you need is to install the newest version of conda and use this profile to create a new environment; it will come set up with all the libraries you need for the tutorial.

deeplearningproject's People

Contributors

anshulbasia, biogeek, bobbleoxs, brandly, mel-jecker, mkilavuz, spandan-madan, tomraulet, vargas


deeplearningproject's Issues

Example binarized vector representation is syntactically incorrect

Now let's store the genres for these movies in a list that we will later transform into a binarized vector.

Binarized vector representation is a very common and important way data is stored/represented in ML. Essentially, it's a way to encode a categorical variable with n possible values as n binary indicator variables. What does that mean? For example, let [(1,3),(4)] be a list saying that sample A has the two labels 1 and 3, and sample B has the one label 4. For every sample and every possible label, the representation is simply 1 if the sample has that label and 0 if it doesn't. So the binarized version of the above list will be -

> [(1,0,1,0]),
> (0,0,0,1])]

The output in this section contains a bracket-mismatch syntax error. Not a huge problem, but probably a bit confusing for a beginner. Otherwise, great tutorial!
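The intended output is [(1,0,1,0), (0,0,0,1)]. For what it's worth, a hedged sketch of producing this representation with scikit-learn (sklearn is already in the environment; the explicit classes argument pins the column order):

```python
from sklearn.preprocessing import MultiLabelBinarizer

samples = [(1, 3), (4,)]  # sample A has labels 1 and 3; sample B has label 4
mlb = MultiLabelBinarizer(classes=[1, 2, 3, 4])
print(mlb.fit_transform(samples))
# [[1 0 1 0]
#  [0 0 0 1]]
```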

Dependency Issue on Windows

Tried to set up with the .yml file, but it aborted.
Manual installation of the requested packages led to an error: TensorFlow on Windows is only supported with 64-bit Python 3.5. Updating Python raises dependency errors for functools32 and subprocess32, which only run on Python 2.7.
So based on my limited knowledge: there is no way of setting up the environment on Windows. Or am I missing something?

Got IOError half way through learning

When I got to the last session on the textual model, the model went through 5 epochs and then threw this error:

IOError: [Errno 24] Too many open files

I went ahead trying to change $ulimit -n but realized the easiest way is to just change num_workers to 4. It's arbitrary, but someone suggested that 4 × the number of GPUs is a good approximation for num_workers.

It's not a specific issue per se, but I think it may be beneficial for people to know that this is one of the nuances in building an ML pipeline which is not necessarily apparent.
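If you hit this in the PyTorch version of the tutorial, the setting lives on the DataLoader; a hedged sketch with a placeholder dataset and batch size:

```python
from torch.utils.data import DataLoader

# Fewer worker processes keep fewer file handles open at once;
# 4 * (number of GPUs) is the rule of thumb mentioned above.
train_loader = DataLoader(train_dataset, batch_size=32, num_workers=4)
```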

Finding words that are most predictive of a genre

Hi, this was an extremely useful document, and I learnt a lot from the tutorial. An interesting extension to the problem would be to identify the words in the synopses that most distinguish a genre from other genres in the model - I have an analogous task in my project.

Is there a way to find the words that are most predictive of a genre? For example, is there a way to identify that the words 'battle', 'challenge' and 'fight' (say) are the most predictive of a movie falling into the 'Action' category, based on the model we trained? That is, which words in a synopsis most prominently indicate that it would fall under a particular genre, using the model we have fit?
This basically translates to decoding the algorithm to find out how it works "under the hood": what features (words) it uses to classify a synopsis into a genre.

A solution I found online is in the code snippet below, using the classifier coefficients from clf.coef_ (clf is the name of the fitted model) and picking the top 10 words the model uses to distinguish/identify a genre in a given text.

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints the features with the highest coefficient values, per class."""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        # Indices of the 10 largest coefficients for this class
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
                          " ".join(feature_names[j] for j in top10)))

Please let me know if this is appropriate and if there is a better way of doing this.
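For reference, a hedged usage sketch; vectorizer, clf, and the genre names below are placeholders for whatever was fitted earlier in the notebook (a bag-of-words vectorizer and a linear classifier whose coef_ has one row per class):

```python
# Hypothetical usage; only meaningful for linear models such as
# LogisticRegression or LinearSVC, which expose per-class coefficients.
genre_labels = ['Action', 'Comedy', 'Drama']  # placeholder labels, in clf's class order
print_top10(vectorizer, clf, genre_labels)
```

Note that clf.coef_ only exists for linear models; for non-linear models a different interpretability technique would be needed.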

Would be nice to support Python 3.

Just to be clear - consider this as just a minor suggestion rather than a complaint.

Thank you for this tutorial! It's rare that people actually spend a lot of time to make a great free learning resource.

Possible wrong syntax

In [26]:

# Create a tmdb genre object!
genres = tmdb.Genres()
# The list() method of the Genres() class returns a listing of all genres in the form of a dictionary.
list_of_genres = genres.list()['genres']

The above segment throws an error when executed.

I apologize if this is a trivial issue. I'm new to Python. It'll be great if someone can help me resolve this.

Cut out warnings from imports due to numpy ufunc and dtype sizes

Nice job with the notebooks! On block 2, if you'd like to get rid of the RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88 warnings, you can just add this at the bottom of the block:

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

"That" vs "Which" grammatical error.

Kudos on a very well done write-up. I have a simple grammatical correction: in many cases, you have used 'which' in place of 'that'.

See http://www.writersdigest.com/online-editor/which-vs-that

If 'which' is used to describe something, and is not preceded by a comma, it is a likely candidate for the confusion.

For example,
'use the available data to learn a function which can' ==> 'use the available data to learn a function that can'

A few things to look at in *Deep Learning to extract visual features from posters* in Section 7

  1. You declared the VGG model function and stored it in the variable 'model', but used the variable 'model_viz' for training, which means you did not use VGG at all. You can check your model by typing 'print(model_viz.layers)'. If you struggle to fix this issue, I can help with this section if you add me as an author.
  2. It is important to show how well your model is trained. I would recommend plotting curves of loss and accuracy with the History instance returned from the 'model.fit()' function, or a confusion matrix from the predictions to show false positives and vice versa. A sketch follows below.
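As a sketch of the second suggestion (a hedged example; model_viz, X_train, and Y_train stand in for the model and data built in Section 7, and the keyword is nb_epoch rather than epochs on Keras 1.x):

```python
import matplotlib.pyplot as plt

# Train while keeping the History object that Keras returns from fit().
history = model_viz.fit(X_train, Y_train, validation_split=0.1, epochs=10)

# Plot training vs. validation loss to see how well the model is learning.
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()
```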

help

In cell [41], when I execute it I get the following error:

HTTPError Traceback (most recent call last)
in ()
17 url += '&with_genres=' + str(g_id) + '&page=' + str(page)
18
---> 19 data = urllib2.urlopen(url).read()
20
21 dataDict = json.loads(data)

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
433 for processor in self.process_response.get(protocol, []):
434 meth = getattr(processor, meth_name)
--> 435 response = meth(req, response)
436
437 return response

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in http_response(self, request, response)
546 if not (200 <= code < 300):
547 response = self.parent.error(
--> 548 'http', request, response, code, msg, hdrs)
549
550 return response

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in error(self, proto, *args)
471 if http_err:
472 args = (dict, 'default', 'http_error_default') + orig_args
--> 473 return self._call_chain(*args)
474
475 # XXX probably also want an abstract factory that knows when it makes

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
405 func = getattr(handler, meth_name)
406
--> 407 result = func(*args)
408 if result is not None:
409 return result

/home/shouvik/anaconda3/envs/deeplearningproject/lib/python2.7/urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
554 class HTTPDefaultErrorHandler(BaseHandler):
555 def http_error_default(self, req, fp, code, msg, hdrs):
--> 556 raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
557
558 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 401: Unauthorized


Help appreciated. Thanks!
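An HTTP 401 from TMDB almost always means the API key is missing or invalid. A quick hedged check using the requests library from the environment (the endpoint follows TMDB's v3 API; the key is a placeholder):

```python
import requests

api_key = 'YOUR_TMDB_API_KEY'  # placeholder; register at themoviedb.org to get one
resp = requests.get('https://api.themoviedb.org/3/genre/movie/list',
                    params={'api_key': api_key})
print(resp.status_code)  # 200 means the key works; 401 means it does not
```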

Possible points of confusion and typos

Points of confusion

  • This section uses f as the generalized function and g as the exact function, whereas before f was exact and g was generalized. This has the potential to confuse readers.
  • On In [51] and In [52], id is assigned a value but does not seem to be used
  • On the section after Out [62] it says that the shape of Y is 1666,20 but the output of print Y.shape is (1595, 20). Where does the 1666 come from?

Typos

  • In the last sentence of the first paragraph of the same section, "listen to" should be changed to "watch"
  • in the last paragraph before In [68] (this section) "vocabular" should be "vocabulary"
  • In the first paragraph of this section, "can only integer values" should be "can only be integer values"
  • In the second item of the first list in this section, "difference models" should be "different models"

Small correction in cell no. 18

It should be:

list_of_genres = genres.movie_list()['genres']

since the list() method of the Genres() class (which returns a listing of all genres in the form of a dictionary) has been renamed to movie_list().

Thanks for the amazing tutorial.

Help!

Hi, I am new to ML. Can I start with this tutorial, or where and how should I start?
Thanks in advance.

Variables undefined when running the scripts

In Section 7, when extracting VGG features for the scraped images:
In the for loop where the try/except block is located, the variable 'imname' is never declared; it should probably be changed as follows:

for mov in poster_movies:
    i += 1
    mov_name = mov['original_title']
    mov_name1 = mov_name.replace(':', '/')
    poster_name = mov_name.replace(' ', '_') + '.jpg'
    if poster_name in imnames:
        img_path = poster_folder + poster_name
        try:
            img = image.load_img(img_path, target_size=(224, 224))
            succesful_files.append(poster_name)  # was the undefined 'imname'
            x = image.img_to_array(img)
            x = np.expand_dims(x, axis=0)
            x = preprocess_input(x)
            features = model.predict(x)
            file_order.append(img_path)
            feature_list.append(features)
            genre_list.append(mov['genre_ids'])
            if np.max(np.asarray(feature_list)) == 0.0:
                print('problematic', i)
            if i % 250 == 0 or i == 1:
                print "Working on Image : ", i
        except Exception, e:
            print Exception, ":", e  # for debugging
            failed_files.append(poster_name)  # was the undefined 'imname'
            continue
    else:
        continue

TMDB API key

Great walkthrough!

One recommendation is to replace your actual TMDB API key with a placeholder. That way no one can abuse your account via your API key.

P.S. Super nitpicky, but in that same block, I think the Jupyter step should read In [5]:

TMDB Genre list() changed to movie_list()

In Section 3, when looking at returning genres from TMDB, the instructions state to use the .list() method of the Genres object returned by tmdb.Genres().

There has been an update to the API and list() no longer exists. There are separate lists for movies, tv, etc. Currently the function we're looking for is movie_list(), which returns the list of movie genres.
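With a current tmdbsimple, the cell would therefore look something like this (assuming tmdb.API_KEY has already been set earlier in the notebook):

```python
import tmdbsimple as tmdb

genres = tmdb.Genres()
# list() was split into medium-specific methods; movie_list() returns movie genres.
list_of_genres = genres.movie_list()['genres']
```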

Tutorial is broken? Low recall/precision in TF results

Hey Spandan,

Looks like something has changed in the data or the model; the TF precision and recall in the final runs are very low (0.2 or so). You might need to future-proof this a bit more against changes to the TMDB or IMDb APIs.
