hightechu / hub-project-scifi-or-fantasy

Using the scikit-learn library, this project explores various machine learning classifiers to find the one that best classifies movie descriptions as 'Sci-Fi' or 'Fantasy'. After finding the best classifier, a user interface is created in which a user can enter text for the machine to classify. This project provides experience with Python, data-mining concepts, GitHub, and version control.

Python 100.00%
project-hub template project-template

hub-project-scifi-or-fantasy's People

Contributors

cairosanders

hub-project-scifi-or-fantasy's Issues

What is Machine Learning?

Have a question? Tag @cairosanders in a comment.

Check out this fun video about machine learning:

[Image: video screenshot]

Is this Movie Sci-Fi or Fantasy?

[Image: sci-fi or fantasy]
For this project we are going to be using data from IMDb on movie genres. The set you're using was specifically prepared for this activity and contains movies of only 2 genres: Sci-Fi and Fantasy. The original data set contained 81,000+ movies in multiple languages and genres, but it has been narrowed down to approx. 2,600 movies. More data can help create a more accurate model, but at the cost of slower computation. There are tons of free data sets. If you want to explore available data sets to use in future projects, check out Kaggle datasets.

Setting Up Your Development Environment

Setting up your development environment

You're almost ready to start coding, but first you have to set up your computer so you can test your program while you code.

Adding the Project to your Local Machine

You need to add this repository to your computer to make changes and edit it in the development environment. Go ahead and clone this repository on your computer. Make sure you're in a directory that you'll remember and that is easy to access. If you're not sure, your Desktop is a good go-to.
[Image: cloning-repo]

Open your terminal
In command line:

>cd desktop
>git clone <copied link here>

Installing Requirements

If you don't already have it, make sure you install python3.

Now navigate into your project folder on command line and install the required packages:

>cd <project-repository-name>
>pip install -r requirements.txt

You should be good to move on to the next issue and start coding! If you have any questions, comment on this issue and your mentor will be notified.
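
If you want to double-check the installation before moving on, a small script like the one below can help. This is only a sketch: the package list (pandas, sklearn) is an assumption based on the imports used later in this project, and check_packages is a hypothetical helper that is not part of the repository.

```python
import importlib.util
import sys


def check_packages(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    # Assumed package list; adjust it to match requirements.txt.
    for name, found in check_packages(["pandas", "sklearn"]).items():
        print(name, "OK" if found else "MISSING")
```

If anything prints MISSING, re-run the pip install command from your project folder.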

Choosing Your Best Classifier

Your Best Classifier

At this point, you've tested out 3 (or more) machine learning algorithms and achieved the best accuracy you could find with each one! Now, choose the one that had the highest overall accuracy and you will use that one to make your app in app.py.

  • Start by reading in the data from the csv files just as in test_classifiers
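import pandas as pd  # needed at the top of app.py if it isn't imported already

movie_data = pd.read_csv('../movie_descriptions.csv', header=0)
movie_label = pd.read_csv('../movie_genres.csv', header=0)
movie_label = movie_label.values.ravel()  # flatten the label column into a 1-D array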

This time, you don't have to split your data into training and testing sets because you already know the performance of your classifier.

  • Copy and paste your best classifier from test_classifiers.py (note that my example below is not necessarily the best)
clf = MultinomialNB(alpha=0.5, fit_prior=True)
  • Train your machine using the fit() method from sklearn
clf.fit(movie_data, movie_label)
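
As a rough illustration of training on the full data set, the sketch below fits a classifier on a tiny made-up DTM and asks it about one new row. The words, counts, and labels are invented for the example; only MultinomialNB, fit(), and predict() come from sklearn.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Made-up DTM: one row per description, one column per word (toy data).
toy_data = pd.DataFrame(
    {"alien": [2, 1, 0, 0], "spaceship": [1, 2, 0, 0],
     "dragon": [0, 0, 2, 1], "wizard": [0, 0, 1, 2]}
)
toy_label = ["Sci-Fi", "Sci-Fi", "Fantasy", "Fantasy"]

clf = MultinomialNB(alpha=0.5, fit_prior=True)
clf.fit(toy_data, toy_label)  # train on every row; no train/test split here

# A new description must use the same columns, in the same order.
new_row = pd.DataFrame([[0, 0, 1, 1]], columns=toy_data.columns)
prediction = clf.predict(new_row)
print(prediction[0])
```

The new row only contains "dragon" and "wizard", words that appear only in the Fantasy rows, so the classifier leans toward Fantasy.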

User Interface

Right now there is a super simple user interface with just a prompt, an input box, and a button. You need to take the input from the box and let the classifier decide if it is sci-fi or fantasy. Since the data that you used to train your classifier was in the form of a Document Term Matrix (DTM), it expects another sample with the same columns to classify. For this reason, you will first pass your sentence to the DTM() function from the DTMdescriptions class in the same folder. You don't need to work in this file, but you're welcome to take a look at it and read the comments to see roughly how it works. The DTM() function will return a DTM version of your sentence with the same words as the original training set as the headings. Reminder of what a DTM looks like:
[Image: document term matrix example]
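
To make the matching-columns requirement concrete, here is a sketch of how a sentence could be turned into a one-row DTM. It is not the repository's DTMdescriptions class: sentence_to_dtm and the tiny vocabulary are hypothetical, and the real class may split words differently.

```python
import re
from collections import Counter


def sentence_to_dtm(sentence, vocabulary):
    """Count how often each vocabulary word appears in the sentence.

    Words that are not in the vocabulary are ignored, so the result always
    has exactly one count per training-set word, in the same order.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    counts = Counter(words)
    # One row (a list inside a list), so the result is 2-D like the training data.
    return [[counts[word] for word in vocabulary]]


# Hypothetical vocabulary; in app.py it would be movie_data.columns.values.
vocabulary = ["alien", "dragon", "spaceship", "wizard"]
row = sentence_to_dtm("A wizard and a dragon fight another dragon", vocabulary)
print(row)
```

Words like "fight" are dropped because the training set never had a column for them, which is why the output always lines up with the original headings.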

  • In the click() function in app.py, set description to the DTM version of the input
...
def click():
    description = DTMdescriptions(input.get(), movie_data.columns.values).DTM() # your code
    # this line will be next
    genre = Label(root, text = "I think \"" + input.get() + "\" is " + genre_pred[0])
    genre.pack()
...
  • Next, give your new description to the trained classifier to predict whether it is sci-fi or fantasy using (no surprise here) sklearn's predict() function
...
def click():
    description = DTMdescriptions(input.get(), movie_data.columns.values).DTM()
    genre_pred = clf.predict(description) # your code
    genre = Label(root, text = "I think \"" + input.get() + "\" is " + genre_pred[0])
    genre.pack()
...
  • Run app.py and try out your awesome new AI app!
> python3 app.py
  • Once you're satisfied with your work, create a pull request

Finding the Best Classification Strategy part 1

Classifiers

Classifiers are different formulas that you can use to classify data. In this project, the data is the movie description and the classification is either sci-fi or fantasy. Luckily, sklearn has some existing functions that do the calculations for you; however, you are going to be providing the values that the functions will use to perform the calculation. Changing the values changes the classifier's accuracy (how often it guesses the correct genre for a given description), and your goal is to find the highest accuracy possible!

Naive Bayes

Naive Bayes is a mathematical formula that calculates probability. Check out this quick review to get a basic understanding. In the video, the classification is either "normal" mail or "spam" mail; you can think of this as the equivalent of the sci-fi or fantasy labels. Feel free to watch the whole video, but this section gives a surface-level explanation of how the formula works, and it's only 3 minutes long!
[Image: naivebayes]

Time to start coding!

Open up your project in a text editor and navigate to the "code" folder. In there you'll find a Python file called "test_classifiers.py". That's where the action happens. You can see the first section already has some code.

...
def main():
    movie_data = pd.read_csv('../movie_descriptions.csv', header=0)
    movie_label = pd.read_csv('../movie_genres.csv', header=0)
    movie_label = movie_label.values.ravel()
...

This reads the csv files (the files that contain the data) and turns them into a structure that Python will be able to interpret. If you want, you can open the csv files in Excel or Google Sheets and take a look at how they looked originally. You'll notice that movie_descriptions.csv is a very large file with words at the top as the headers and numbers filling in the rows and columns. This is because the file is a record of how many times each word occurs in a given description. This format is called a document term matrix and looks like this:
[Image: document term matrix example]
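
If the picture is hard to imagine, the sketch below builds a tiny document term matrix from two made-up descriptions. It is only an illustration of the format: build_dtm is a hypothetical helper, and the real csv file was prepared separately and may have been cleaned differently.

```python
import re
from collections import Counter


def build_dtm(descriptions):
    """Return (vocabulary, rows) where rows[i][j] counts word j in description i."""
    tokenized = [re.findall(r"[a-z']+", text.lower()) for text in descriptions]
    # The column headers: every distinct word, in alphabetical order.
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


descriptions = [
    "The robot fixes the spaceship",
    "The wizard rides the dragon",
]
vocabulary, rows = build_dtm(descriptions)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one description and each column is one word, so most entries are 0 once the vocabulary gets large.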

Training and Testing

To start, you'll split your data into a training and test set. You do this because you need to "train" your machine to understand patterns, just like you would study key terms to memorize them for a test. The test set is used to see how accurate your trained machine is on new data that it hasn't seen before. This would be like taking a test at school and getting your grade back.

You're lucky again because sklearn has a function that splits the data for you: train_test_split

from sklearn.model_selection import train_test_split  # at the top of the file, if it isn't there already
# X_train, X_test, y_train, y_test = train_test_split(<data>, <label>, test_size=<fraction of data used for testing>, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(movie_data, movie_label, test_size=0.2, random_state=1)
  • Add a line of code that splits your data into a training and a testing set

You can change test_size (the fraction of the data used for testing), but in general the training set should be larger than the test set.

  • Run the program in command line to check for errors. Make sure you're in the code folder of your project directory and run:
> python3 test_classifiers.py
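
To see what the split does without loading the big csv files, here is a small sketch on made-up data. The ten samples and their labels are invented; the train_test_split call is the same one used above.

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples with one feature each, plus alternating labels (toy data).
data = [[i] for i in range(10)]
labels = ["Sci-Fi" if i % 2 == 0 else "Fantasy" for i in range(10)]

X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=1
)

print(len(X_train), "training samples")
print(len(X_test), "test samples")
```

With test_size=0.2, two of the ten samples are held back for testing and the other eight are used for training. Keeping random_state fixed means you get the same split every time you run it.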

Your First Classifier!

This is the exciting part: you're going to start seeing how accurate your classifier is! There is an if/elif/else statement in main() for your benefit. When you run your program, you can choose one of the 3 classifiers you're going to test. If you let them all run, it could take quite some time because the dataset has around 2000 examples and they have to calculate with all that information. If you run your program and nothing happens for a bit, be patient; it's probably just calculating!

    if clf == '1':
        print("Naive Bayes")
        clf = MultinomialNB(alpha=0.5, fit_prior=True)  # your code
        classifcation_acc(clf, X_train, X_test, y_train, y_test)
  • Run the program in command line to check for errors and see your first accuracy! Make sure you're in the code folder of your project directory and run:
> python3 test_classifiers.py

The alpha value can be any non-negative number; values between 0 and 1 are a good range to start with

  • Try different alpha values and see which gives the highest accuracy

The fit_prior can be True or False

  • Try changing the fit_prior and see which gives the highest accuracy

  • When you think you have the highest accuracy you can get, create a pull request
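
Instead of changing one value at a time by hand, you could also loop over a few settings. The sketch below does that on randomly generated toy counts, not the movie data. The project's classifcation_acc helper is not shown in this issue, so this sketch measures accuracy directly with sklearn's accuracy_score; the alpha values tried are arbitrary examples.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Toy word counts: one class uses the first 5 "words" more, the other the last 5.
X = np.vstack([
    rng.poisson(lam=[3] * 5 + [1] * 5, size=(100, 10)),
    rng.poisson(lam=[1] * 5 + [3] * 5, size=(100, 10)),
])
y = np.array(["Sci-Fi"] * 100 + ["Fantasy"] * 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results = {}
for alpha in (0.1, 0.5, 1.0):
    for fit_prior in (True, False):
        clf = MultinomialNB(alpha=alpha, fit_prior=fit_prior)
        clf.fit(X_train, y_train)
        # Accuracy = fraction of test descriptions labelled correctly.
        results[(alpha, fit_prior)] = accuracy_score(y_test, clf.predict(X_test))

best = max(results, key=results.get)
print("best settings:", best, "accuracy:", results[best])
```

On the real movie data the best settings may be different, so treat the loop as a way to search, not as an answer.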

Finding the Best Classification Strategy: Part 2

Classifier Number 2: Support Vector Machines

The purpose of support vector machines is to find a line that best separates the data based on their classification, in this case Sci-Fi and Fantasy. Check out another fun StatQuest "Clearly Explained" video on Support Vector Machines to learn a little bit more.
[Image: SorFSVM]

Building Your Support Vector Machine

It's that time again - time to create another awesome classifier!

...
    elif clf == '2':
        print("SVM")
        clf = SVC(C=1.5, kernel='rbf', degree=4, gamma='scale')  # your code
        classifcation_acc(clf, X_train, X_test, y_train, y_test)
...

As you can see, this classifier has many possible parameters, and if you go to the sklearn documentation you can add even more.
But to start, try different values for these parameters and try to achieve the highest accuracy you can.

  • Change the C value - it can be any positive number

  • Try different kernels - it can be one of 'linear', 'poly', 'rbf', or 'sigmoid' ('precomputed' also exists, but it expects a precomputed kernel matrix instead of the DTM, so skip it here)

  • Experiment with degree values - they can be any non-negative integer (degree is only used by the 'poly' kernel)

  • Change the gamma parameter - it can be 'scale', 'auto', or a positive number (gamma is used by the 'rbf', 'poly', and 'sigmoid' kernels)

  • Once you think you have found the highest accuracy you can for this classifier, create a pull request
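
One way to compare kernels side by side is sketched below. It runs on randomly generated toy counts, not the movie data, and uses SVC's built-in score() method (mean accuracy) in place of the project's classifcation_acc helper, which is not shown here. The C and degree values are arbitrary starting points.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy word counts: one class uses the first 5 "words" more, the other the last 5.
X = np.vstack([
    rng.poisson(lam=[3] * 5 + [1] * 5, size=(100, 10)),
    rng.poisson(lam=[1] * 5 + [3] * 5, size=(100, 10)),
])
y = np.array(["Sci-Fi"] * 100 + ["Fantasy"] * 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # degree only matters for 'poly'; the other kernels ignore it.
    clf = SVC(C=1.5, kernel=kernel, degree=3, gamma="scale")
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)  # fraction of correct predictions

for kernel, score in scores.items():
    print(kernel, round(score, 3))
```

The same loop idea works for C or gamma: keep the other parameters fixed and change one at a time so you can tell which change made the difference.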

Set Up

Setting up your development environment

You're almost ready to start coding, but first you have to set up your computer so you can test your program while you code.

Download git and connect to your GitHub account

  • If you don't already have git on your computer you can download it here
  • Connect to GitHub on

Adding the Project to your Local Machine

You need to add this repository to your computer to make changes and edit it in the development environment. Go ahead and clone this repository on your computer. Make sure you're in a directory that you'll remember and that is easy to access. If you're not sure, your Desktop is a good go-to.
[Image: cloning-repo]

Open your Command Line (probably called Terminal or Command Prompt)

In command line:

>cd desktop
>git clone <copied link here>

example:

cd desktop
git clone https://github.com/hightechu/project-template.git

Installing Requirements

If you don't already have it, make sure you install python3.

Now navigate into your project folder on command line and install the required packages:

>cd <project-repository-name>
>pip3 install -r requirements.txt

Download Large Data Set

  • Click on movie_descriptions.csv

  • Click download and it will take you to a page with a lot of text

  • Right click in the white space and select "save as"

  • Save as movie_descriptions.csv into your project folder

Finding the Best Classification Strategy part 3 (Final Part)

Classifier Number 3: Neural Networks

Of all machine learning techniques, you may have heard of this one. When people think of artificial intelligence, they often think of computers trying to emulate human intelligence. Human behaviour is a result of neurons transmitting information throughout the nervous system. Neural networks are named after neurons because their structure is inspired by the human nervous system. Check out this video to learn more.
[Image: nn]

Creating Your Neural Network

It may be no surprise to you that sklearn has a neural network class.

  • Add a neural network after the else: in code/test_classifiers.py
...
    elif clf == '3':
        print("Neural Network")
        clf = MLPClassifier(hidden_layer_sizes=(80,), activation='tanh',
                            alpha=0.0005, learning_rate_init=0.005)
        classifcation_acc(clf, X_train, X_test, y_train, y_test)
...
  • Run your code to see the accuracy and test for errors:
> python3 test_classifiers.py

This classifier also has a lot of parameters. Start with the ones in the code above and try out others from the sklearn docs if you want!

  • hidden_layer_sizes - a tuple with one positive integer per hidden layer, e.g. (100,) for one layer of 100 neurons or (80, 40) for two layers

  • activation - can be 'identity', 'logistic', 'tanh', or 'relu'

  • alpha - positive number (keep it small)

  • learning_rate_init - positive number (keep it small)

  • Once you think you have found the highest accuracy you can for this classifier, create a pull request
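
To get a feel for hidden_layer_sizes, the sketch below trains two small networks on randomly generated toy counts, not the movie data. The layer sizes are arbitrary, and max_iter and random_state are extra MLPClassifier parameters added here only so the example finishes quickly and gives the same result each run. Accuracy comes from the built-in score() method, not the project's classifcation_acc helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy word counts: one class uses the first 5 "words" more, the other the last 5.
X = np.vstack([
    rng.poisson(lam=[3] * 5 + [1] * 5, size=(100, 10)),
    rng.poisson(lam=[1] * 5 + [3] * 5, size=(100, 10)),
])
y = np.array(["Sci-Fi"] * 100 + ["Fantasy"] * 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

scores = {}
for sizes in ((20,), (20, 10)):  # one hidden layer, then two hidden layers
    clf = MLPClassifier(hidden_layer_sizes=sizes, activation="tanh",
                        alpha=0.0005, learning_rate_init=0.005,
                        max_iter=500, random_state=1)
    clf.fit(X_train, y_train)
    scores[sizes] = clf.score(X_test, y_test)  # fraction of correct predictions

for sizes, score in scores.items():
    print(sizes, round(score, 3))
```

Bigger networks take longer to train and are not automatically more accurate, so it is worth comparing a few sizes before settling on one.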

Finding the Best Classification Strategy part 3 (Final Part)

Have a question? Tag @cairosanders in a comment.

Classifier Number 3: Neural Networks

Of all machine learning techniques, you may have heard of this one. When many people think of Artificial intelligence, they often think of computers trying to emulate human intelligence. Human behaviour is a result of neurons transmitting information throughout the nervous system. Neural networks are named after them because their structure is inspired by the human nervous system. Check out this video to learn more.
nn

Creating Your Neural Network

It may be no surprise to you that sklearn has a neural network class.

  • Add a neural network after the else: in code/test_classifiers.py
...
    elif clf == '3':
        print("Neural Network")
        clf = MLPClassifier(hidden_layer_sizes=(80,), activation = 'tanh',
            alpha = 0.0005, learning_rate_init = 0.005)
        classifcation_acc(clf,X_train, X_test, y_train, y_test)
...
  • Run your code to see the accuracy and test for errors:
> python3 test_classifiers.py

This classifier also has a lot of parameters. Start with the one's in the code above and try out others from the sklearn docs if you want!

  • hidden_layer_sizes - can be any positive integer in the format (100,)

  • activation - can be 'identity', 'logistic', 'tanh', 'relu'

  • alpha - positive number (keep it small)

  • learning_rate_init - positive number (keep it small)

  • Once you think you have found the highest accuracy you can for this classifier, create a pull request

Finding the Best Classification Strategy: part 1

Classifiers

Classifiers are different formulas that you can use to classify data. In the case of this project the data is the movie description and the classification is either sci-fi or fantasy. Luckily Sklearn has some existing function that do the calculations for you; However, you are going to be providing the values that the function will use to perform the calculation. By changing the values, the classifier will have a different accuracy (how often it guesses the correct genre for a given description), and your goal is to find the highest accuracy possible!

Naive Bayes

Naive Bayes is a mathematical formula that calculates probability. Check out this quick review to get a basic understanding. In the video, the classification is either "normal" mail or "spam" mail; you can think of this as the equivalent of the sci-fi or fantasy labels. Feel free to watch the whole video, but this section gives a surface-level explanation of how the formula works, and it's only 3 minutes long!
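
If you would like to see the idea in code before using sklearn, here is a small from-scratch sketch of a word-count Naive Bayes classifier. It is only an illustration under simplifying assumptions: the function names train_nb and predict_nb and the four toy "descriptions" are invented for this example, alpha must be greater than 0, and sklearn's MultinomialNB is the class you will actually use in the project.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=0.5, fit_prior=True):
    """Learn log-probabilities from word lists. Assumes alpha > 0."""
    classes = sorted(set(labels))
    vocab = sorted({word for doc in docs for word in doc})

    # Count how often each word appears in each genre
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {}
    for c in classes:
        if fit_prior:
            # Prior learned from the data: how common is this genre?
            log_prior[c] = math.log(labels.count(c) / len(labels))
        else:
            # Uniform prior: every genre starts out equally likely
            log_prior[c] = math.log(1 / len(classes))

    log_likelihood = {}
    for c in classes:
        # alpha is added to every count so unseen words never get probability 0
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, doc):
    """Return the genre with the highest score; unknown words are ignored."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy training data, invented for this sketch
docs = [
    ["spaceship", "alien", "planet"],
    ["robot", "alien", "future"],
    ["wizard", "dragon", "magic"],
    ["dragon", "kingdom", "magic"],
]
labels = ["Sci-Fi", "Sci-Fi", "Fantasy", "Fantasy"]

model = train_nb(docs, labels, alpha=0.5, fit_prior=True)
print(predict_nb(model, ["alien", "robot", "planet"]))
print(predict_nb(model, ["magic", "dragon"]))
```

The alpha and fit_prior arguments play the same roles as the MultinomialNB parameters with those names: alpha controls the smoothing and fit_prior decides whether the genre frequencies in the training data are taken into account.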

Time to start coding!

Open up your project in a text editor and navigate to the code folder. In there you'll find a Python file called test_classifiers.py. That's where the action happens. You can see the first section already has some code.

...
def main():
    movie_data = pd.read_csv('../movie_descriptions.csv', header=0)
    movie_label = pd.read_csv('../movie_genres.csv', header=0)
    movie_label=movie_label.values.ravel()
...

This reads the csv files (the files that contain the data) and turns them into a structure that Python will be able to interpret. If you want, you can open the csv files in Excel or Google Sheets and take a look at how they looked originally. You'll notice that movie_descriptions.csv is a very large file with words at the top as the headers and numbers filling in the rows and columns. This is because the file is a record of how many times each word occurs in a given description. This format is called a document-term matrix and looks like this:
[Screenshot: example document-term matrix]
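
To make the document-term matrix idea concrete, here is a minimal sketch that builds one from two made-up descriptions. The function name build_dtm and the simple "split on spaces" word splitting are assumptions for this example; the csv file in the project was prepared ahead of time and may have been built differently.

```python
from collections import Counter


def build_dtm(descriptions):
    """Return (column headers, rows of word counts) for a list of sentences."""
    tokenized = [d.lower().split() for d in descriptions]
    # Every distinct word becomes one column, in alphabetical order
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        # One number per column: how often that word occurs in this description
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Made-up descriptions for illustration
columns, matrix = build_dtm([
    "a robot fights a robot",
    "a wizard fights a dragon",
])
print(columns)
for row in matrix:
    print(row)
```

Each row is one description and each column is one word, which is the same layout you see when you open movie_descriptions.csv.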

Training and Testing

To start, you'll split your data into a training and test set. You do this because you need to "train" your machine to understand patterns, just like you would study key terms to memorize them for a test. The test set is used to see how accurate your trained machine is on new data that it hasn't seen before. This would be like taking a test at school and getting your grade back.

You're lucky again because Sklearn has a method that splits the data for you: train_test_split

  • Add a line of code that splits your data into a training and a testing set
# X_train, X_test, y_train, y_test = train_test_split(<data>, <label>, test_size=<fraction of data used for testing>, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(movie_data, movie_label, test_size=0.2, random_state=1)

You can change the fraction of data used for testing (test_size), but in general the training set should be larger than the test set.

  • Run the program in command line to check for errors. Make sure you're in the code folder of your project directory and run:
> python3 test_classifiers.py
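
If you are curious what a train/test split does under the hood, the sketch below shuffles the rows with a fixed seed and sets a fraction aside for testing. It is a simplified stand-in written for this explanation (the name split_train_test is made up), and it will not pick the same rows as sklearn's train_test_split even with the same seed.

```python
import random


def split_train_test(data, labels, test_size=0.2, seed=1):
    """Shuffle row positions, then reserve a fraction of them for testing."""
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same length")
    indices = list(range(len(data)))
    # A fixed seed gives the same shuffle every run, like random_state=1
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(data) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Data and labels are picked with the same positions so they stay paired
    X_train = [data[i] for i in train_idx]
    X_test = [data[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Ten made-up movies: 8 end up in training, 2 in testing
movies = [f"movie {i}" for i in range(10)]
genres = ["Sci-Fi" if i % 2 == 0 else "Fantasy" for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(movies, genres, test_size=0.2)
print(len(X_train), len(X_test))
```

Keeping the seed fixed matters when you compare classifiers: every classifier is then graded on the same test rows.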

Your First Classifier!

This is the exciting part: you're going to start seeing how accurate your classifier is! There is an if/elif/else statement in main() for your benefit. When you run your program you can choose one of the 3 classifiers you're going to test. If you let them all run at once, it could take quite some time, because the dataset has around 2000 examples and every classifier has to work through all of that information. If you run your program and nothing happens for a bit, be patient, it's probably just calculating!

    if clf == '1':
        print("Naive Bayes")
        clf = MultinomialNB(alpha=0.5, fit_prior=True)  # your code
        classifcation_acc(clf, X_train, X_test, y_train, y_test)
  • Run the program from the command line to check for errors and see your first accuracy! Make sure you're in the code folder of your project directory and run:
> python3 test_classifiers.py

The alpha value can be set to anything between 0 and 1

  • Try different alpha values and see which gives the highest accuracy

The fit_prior parameter can be True or False

  • Try changing the fit_prior and see which gives the highest accuracy

  • When you think you have the highest accuracy you can get, create a pull request

Finding the Best Classification Strategy part 2

Have a question? Tag @cairosanders in a comment.

Classifier Number 2: Support Vector Machines

The purpose of support vector machines is to find a line (or, with many features, a hyperplane) that best separates the data based on their classification, in this case Sci-Fi and Fantasy. Check out another fun StatQuest "Clearly Explained" video on support vector machines to learn a little bit more.

Building Your Support Vector Machine

It's that time again - time to create another awesome classifier!

...
    elif clf == '2':
        print("SVM")
        clf = SVC(C=1.5, kernel='rbf', degree=4, gamma='scale')  # your code
        classifcation_acc(clf, X_train, X_test, y_train, y_test)
...
  • Run the program from the command line to check for errors and see the accuracy! Make sure you're in the code folder of your project directory and run:
> python3 test_classifiers.py

As you can see, this classifier has many possible parameters, and if you go to the sklearn documentation you can add even more.
To start, try different values for these parameters and aim for the highest accuracy you can.

  • Change the C value - it can be any positive number

  • Try different kernels - it can be one of 'linear', 'poly', 'rbf', 'sigmoid', or 'precomputed' (note that 'precomputed' expects a ready-made kernel matrix instead of the raw data, so it will not work with this code as written)

  • Experiment with degree values - they can be any non-negative integer, and only the 'poly' kernel uses them

  • Change the gamma parameter - it can be 'scale', 'auto', or a positive number

  • Once you think you have found the highest accuracy you can for this classifier, create a pull request
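
For the curious, here is a small plain-Python sketch of what gamma means for the 'rbf' kernel. It follows the formulas given in the sklearn documentation ('scale' is 1 / (number of features × variance of the data) and 'auto' is 1 / number of features), but the function names and the tiny data set are made up for this illustration.

```python
import math


def gamma_scale(X):
    """gamma='scale': 1 / (n_features * variance of all values in X)."""
    n_features = len(X[0])
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (n_features * variance)


def gamma_auto(X):
    """gamma='auto': 1 / n_features."""
    return 1.0 / len(X[0])


def rbf_kernel(x, y, gamma):
    """Similarity between two rows: 1.0 when identical, near 0 when far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two made-up rows of word counts
X = [[0, 4], [4, 0]]
print(gamma_scale(X), gamma_auto(X))
print(rbf_kernel(X[0], X[1], gamma_scale(X)))
```

A larger gamma makes the similarity drop off faster with distance, so each training example influences a smaller neighbourhood.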

Choosing Your Best Classifier

Your Best Classifier

At this point, you've tested out 3 (or more) machine learning algorithms and achieved the best accuracy you could find with each one! Now choose the one with the highest overall accuracy; you will use it to build your app in app.py.

  • Start by reading in the data from the csv files, just as in test_classifiers.py
movie_data = pd.read_csv('../movie_descriptions.csv', header=0)
movie_label = pd.read_csv('../movie_genres.csv', header=0)
movie_label=movie_label.values.ravel()

This time, you don't have to split your data into testing and training sets because you already know the performance of your classifier.

  • Copy and paste your best classifier from test_classifiers.py (note that my example below is not necessarily the best)
clf = MultinomialNB(alpha = 0.5, fit_prior = True)
  • Train your machine using the fit() method from sklearn
clf.fit(movie_data, movie_label)

User Interface

Right now there is a super simple user interface with just a prompt, an input box, and a button. You need to take the input from the box and let the classifier decide if it is sci-fi or fantasy. Since the data that you used to train your classifier was in the form of a document-term matrix (DTM), it expects another sample of the same shape to classify. For this reason, you will first pass your sentence to the DTM() function from the DTMdescriptions class, which lives in the same folder. You don't need to work in that file, but you're welcome to take a look at it and read the comments to see roughly how it works. The DTM() function will return a DTM version of your sentence with the same words as the original training set as the headings. As a reminder of what a DTM looks like:
[Screenshot: example document-term matrix]
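
The idea of giving a new sentence "the same shape" as the training data can be sketched as follows. This is not the DTMdescriptions class from the project (which may split words differently); sentence_to_row and the four example columns are hypothetical names used only to show the principle.

```python
from collections import Counter


def sentence_to_row(sentence, columns):
    """Count the words of one sentence using the training set's columns.

    Words that are not among the columns are dropped, and the result is a
    list containing a single row, because classifiers expect a table of samples.
    """
    counts = Counter(sentence.lower().split())
    return [[counts[word] for word in columns]]


# Pretend these four words are the headers of the training data
training_columns = ["alien", "dragon", "robot", "wizard"]
row = sentence_to_row("A robot meets another ROBOT and an alien", training_columns)
print(row)
```

Because the columns come from the training data, the row always has the same length as a training row, no matter how long the input sentence is.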

  • In the click() function in app.py, set description to the DTM version of the input
...
def click():
    description = DTMdescriptions(input.get(), movie_data.columns.values).DTM() # your code
    # this line will be next
    genre = Label(root, text = "I think \"" + input.get() + "\" is " + genre_pred[0])
    genre.pack()
...
  • Next, give your new description to the trained classifier to predict whether it is sci-fi or fantasy using (no surprise here) sklearn's predict() function
...
def click():
    description = DTMdescriptions(input.get(), movie_data.columns.values).DTM()
    genre_pred = clf.predict(description) # your code
    genre = Label(root, text = "I think \"" + input.get() + "\" is " + genre_pred[0])
    genre.pack()
...
  • Run app.py and try out your awesome new AI app!
> python3 app.py
  • Once you're satisfied with your work, create a pull request

Setting Up Your Development Environment

Have a question? Tag @cairosanders in a comment.

Setting up your development environment

You're almost ready to start coding, but first you have to set up your computer so you can test your program while you code.

Download git and connect to your GitHub account

  • If you don't already have git on your computer you can download it here
  • Connect to GitHub on

Adding the Project to your Local Machine

You need to add this repository to your computer to make changes and edit it in the development environment. Go ahead and clone this repository on your computer. Make sure you're in a directory that you'll remember and that is easy to access. If you're not sure, your Desktop is a good go-to.

Open your Command Line (probably called Terminal or Command Prompt)

In command line:

>cd desktop
>git clone <copied link here>

example:

cd desktop
git clone https://github.com/hightechu/project-template.git

Installing Requirements

If you don't already have it, make sure you install python3

Now navigate into your project folder on command line and install the required packages:

>cd <project-repository-name>
>pip3 install -r requirements.txt

Download Large Data Set

  • Click on movie_descriptions.csv

  • Click download and it will take you to a page with a lot of text

  • Right-click in the white space and select "save as"

  • Save the file as movie_descriptions.csv in your project folder

Sharing your App with the World 🌎

Have a question? Tag @cairosanders in a comment.

A README is a file that helps explain the usage and purpose of software and it shows on the main page of a GitHub repository. Right now this README exists to help you understand the process of a HighTechU Project Hub Project, but now that your App is looking good and working well, you can change the README.md to reflect the purpose of YOUR app.

Edit your README.md

  • Explain what the purpose of the app is
  • Explain how the app works
  • Take a screenshot (or 3 or 4) or a .gif and include that in the README
  • Include any other information you want to share with others

Add an App Preview

Your App Preview is what people will see when you share your app on social media or when they look at your profile on the Project Hub

  • Navigate to settings
  • Click edit under Social Preview and upload one of the screenshots you took for the README


Verify Your Completion

You're done with your app! Congratulations. Now we just need to verify your completion with your mentor. To do so:

  • Navigate to the hightechu.yml file in this repository (you can do this on GitHub, you do not need to do it locally)
  • Add completed: True to the top of the file
  • Create a pull request

When your mentor reviews and approves your pull request, your project will display a verified completion badge on your portfolio on the project hub.

And that's it! Awesome work :)
