The hub-project-scifi-or-fantasy from hightechu

(Bonus) Spicing Up the User Interface

Have a question? Tag @cairosanders in a comment.
If you'd like, you can add more to the user interface (the way the app looks). The current simple UI is made with tkinter, and you can check out more tkinter options here: https://coderslegacy.com/python/python-gui/

When you're done, create a pull request.

What is Machine Learning?

Check out this fun video about machine learning:

Is this Movie Sci-Fi or Fantasy?

For this project we are going to use IMDB data on movie genres. The set you're using was prepared specifically for this activity and contains movies of only two genres: Sci-Fi and Fantasy. The original data set contained 81,000+ movies in multiple languages and genres, but has been narrowed down to approximately 2,600 movies. More data can help create a more accurate model, but at the cost of slower computation. There are many free data sets available. If you want to explore data sets to use in future projects, check out Kaggle datasets.

Setting Up Your Development Environment

Setting up your development environment

You're almost ready to start coding, but first you have to set up your computer so you can test your program while you code.

Adding the Project to your Local Machine

You need to add this repository to your computer to make changes and edit it in the development environment. Go ahead and clone this repository on your computer. Make sure you're in a directory that you'll remember and that is easy to access. If you're not sure, your Desktop is a good go-to.

Open your terminal.
In the command line:

>cd desktop
>git clone <copied link here>

Installing Requirements

If you don't already have Python 3 on your computer, make sure you install it.

Now navigate into your project folder in the command line and install the required packages:

>cd <project-repository-name>
>pip install -r requirements.txt
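
If you want to double-check that the install worked, a small script like the one below can help. It is only a sketch: the package names passed in at the bottom (pandas, sklearn, tkinter) are assumptions based on what the later steps import, so check requirements.txt for the real list. The helper name missing_packages is made up for this example.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version:", sys.version.split()[0])
# Assumed import names; check requirements.txt for the real list.
missing = missing_packages(["pandas", "sklearn", "tkinter"])
print("Missing:", missing if missing else "none")
```

If anything shows up as missing, re-run the pip command above from inside the project folder.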

You should be good to move onto the next issue and start coding! If you have any questions, comment on this issue and your mentor will be notified.

Choosing Your Best Classifier

Your Best Classifier

At this point, you've tested out 3 (or more) machine learning algorithms and achieved the best accuracy you could find with each one! Now choose the one with the highest overall accuracy; you will use it to build your app in app.py.

Start by reading in the data from the csv files, just as in test_classifiers.py:

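movie_data = pd.read_csv('../movie_descriptions.csv', header=0)  # word counts, one row per movie
movie_label = pd.read_csv('../movie_genres.csv', header=0)  # genre label for each movie
movie_label = movie_label.values.ravel()  # flatten the labels into a 1-D array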

This time, you don't have to split your data into training and testing sets because you already know the performance of your classifier.

Copy and paste your best classifier from test_classifiers.py (note that my example below is not necessarily the best):

clf = MultinomialNB(alpha = 0.5, fit_prior = True)

Train your machine using the fit() method from sklearn

clf.fit(movie_data, movie_label)

User Interface

Right now there is a super simple user interface with just a prompt, an input box, and a button. You need to take the input from the box and let the classifier decide whether it is sci-fi or fantasy. Since the data you used to train your classifier was in the form of a Document Term Matrix (DTM), the classifier expects another sample of the same shape to classify. For this reason, you will first call the DTM() function from the DTMdescriptions class (in a file in the same folder) on your sentence. You don't need to work in that file, but you're welcome to take a look at it and read the comments to see roughly how it works. The DTM() function returns a DTM version of your sentence with the same words as the original training set as the headings. Reminder of what a DTM looks like:

In the click() function in app.py, set description to the DTM version of the input:

...
def click():
    description = DTMdescriptions(input.get(), movie_data.columns.values).DTM() # your code
    # this line will be next
    genre = Label(root, text = "I think \"" + input.get() + "\" is " + genre_pred[0])
    genre.pack()
...
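
To make the "same shape" requirement concrete, here is a rough sketch of the idea: count the words of a new sentence against a fixed list of column headings. This is not the DTMdescriptions implementation (look in that file for the real thing); the function name sentence_to_row and the short columns list are made up for illustration.

```python
from collections import Counter


def sentence_to_row(sentence, columns):
    """Count how often each training column word appears in the sentence."""
    counts = Counter(sentence.lower().split())
    # Words outside `columns` are dropped; column words not in the sentence get 0.
    return [[counts[word] for word in columns]]  # one row, so the result is 2-D


columns = ["dragon", "robot", "ship", "the", "wizard"]  # stand-in headings
row = sentence_to_row("The wizard rides the dragon", columns)
print(row)
```

Notice that "rides" is simply ignored because it is not one of the headings, and the row always has exactly one number per heading, no matter how long the sentence is.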

Next, give your new description to the trained classifier to predict whether it is sci-fi or fantasy using (no surprise here) sklearn's predict() function

...
def click():
    description = DTMdescriptions(input.get(), movie_data.columns.values).DTM()
    genre_pred = clf.predict(description) # your code
    genre = Label(root, text = "I think \"" + input.get() + "\" is " + genre_pred[0])
    genre.pack()
...

Run app.py and try out your awesome new AI app!

> python3 app.py

Once you're satisfied with your work, create a pull request

Finding the Best Classification Strategy part 1

Classifiers

Classifiers are different formulas that you can use to classify data. In this project, the data is the movie description and the classification is either sci-fi or fantasy. Luckily, sklearn has existing functions that do the calculations for you; however, you will provide the values (parameters) that each function uses in its calculation. Changing these values changes the classifier's accuracy (how often it guesses the correct genre for a given description), and your goal is to find the highest accuracy possible!
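
Accuracy is simple enough to compute by hand. The sketch below shows one way to do it; the function name accuracy and the four example predictions are made up for illustration and are not part of the project code.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one label to compute accuracy")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical predictions for four movie descriptions.
predicted = ["sci-fi", "fantasy", "fantasy", "sci-fi"]
actual = ["sci-fi", "fantasy", "sci-fi", "sci-fi"]
print(accuracy(predicted, actual))  # 3 of the 4 guesses are correct
```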

Naive Bayes

Naive Bayes is a mathematical formula that calculates probability. Check out this quick review to get a basic understanding. In the video, the classification is either "normal" mail or "spam" mail - you can think of this as the equivalent of the sci-fi or fantasy labels. Feel free to watch the whole video, but this section gives a surface-level explanation of how the formula works, and it's only 3 minutes long!
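
If you'd like to see the idea in code, here is a tiny from-scratch sketch of a Naive Bayes text classifier. It is a simplified illustration, not what sklearn runs internally line for line, and the four training sentences and the function names are made up for this example.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels, alpha=1.0):
    """Count words per class and return everything needed to score new text."""
    word_counts = {}  # label -> Counter of word occurrences
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        words = doc.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab, alpha


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the prior: how common this class is overall.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Smoothed estimate of P(word | class).
            p = (counts[word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    "a spaceship crew explores a distant planet",
    "robots and aliens fight in space",
    "a wizard casts a spell on a dragon",
    "an elf and a dwarf seek a magic ring",
]
labels = ["sci-fi", "sci-fi", "fantasy", "fantasy"]
model = train_naive_bayes(docs, labels)
print(predict(model, "aliens attack a spaceship"))
print(predict(model, "the dragon guards a magic spell"))
```

The classifier multiplies together how likely each word is under each genre (adding logarithms is the same as multiplying, but avoids very tiny numbers) and picks the genre with the larger result.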

Time to start coding!

Open up your project in a text editor and navigate to the "code" folder. In there you'll find a Python file called "test_classifiers.py". That's where the action happens. You can see the first section already has some code.

...
def main():
    movie_data = pd.read_csv('../movie_descriptions.csv', header=0)
    movie_label = pd.read_csv('../movie_genres.csv', header=0)
    movie_label = movie_label.values.ravel()
...

This reads the csv files (the files that contain the data) and turns them into a structure that Python can interpret. If you want, you can open the csv files in Excel or Google Sheets and take a look at how they looked originally. You'll notice that movie_descriptions.csv is a very large file with words at the top as the headers and numbers filling in the rows and columns. This is because each row records how many times each word occurs in one movie description. This format is called a document term matrix and looks like this:
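
As a rough illustration, the sketch below builds a tiny document term matrix from two made-up descriptions. The function name build_dtm is hypothetical, and the real movie_descriptions.csv was prepared separately, possibly with extra cleaning steps that are not shown here.

```python
from collections import Counter


def build_dtm(descriptions):
    """Return (vocabulary, rows) where rows[i][j] counts word j in description i."""
    tokenized = [d.lower().split() for d in descriptions]
    # One column per distinct word, sorted so the column order is stable.
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


descriptions = [
    "the robot saves the ship",
    "the wizard saves the kingdom",
]
vocabulary, rows = build_dtm(descriptions)
print(vocabulary)
for row in rows:
    print(row)
```

Each printed row lines up with the vocabulary list: the first description contains "the" twice, so its "the" column holds a 2.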

Training and Testing

To start, you'll split your data into a training and test set. You do this because you need to "train" your machine to understand patterns, just like you would study key terms to memorize them for a test. The test set is used to see how accurate your trained machine is on new data that it hasn't seen before. This would be like taking a test at school and getting your grade back.

You're lucky again because sklearn has a function that splits the data for you: train_test_split

# X_train, X_test, y_train, y_test = train_test_split(<data>, <label>, <percentage of data used for testing>, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(movie_data, movie_label, test_size=0.2, random_state=1)

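Add a line of code that splits your data into a training and a testing set.

You can change the <percentage of data used for testing> (the test_size argument), but in general the training set should be larger than the test set. Setting random_state=1 fixes the shuffle, so you get the same split every time you run the program.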
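
To see what a split like this does, here is a simplified pure-Python sketch. It is not sklearn's implementation, and the function name simple_train_test_split and the ten made-up rows are only for illustration.

```python
import random


def simple_train_test_split(data, labels, test_size=0.2, seed=1):
    """Shuffle row indices, then hold out a fraction of them for testing."""
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same length")
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    n_test = max(1, round(len(data) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [data[i] for i in train_idx]
    X_test = [data[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


data = [[i, i + 1] for i in range(10)]  # ten made-up rows
labels = ["sci-fi" if i % 2 == 0 else "fantasy" for i in range(10)]
X_train, X_test, y_train, y_test = simple_train_test_split(data, labels)
print(len(X_train), len(X_test))
```

The important details are that each row stays paired with its label, no row lands in both sets, and the same seed always produces the same split.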

Run the program in command line to check for errors. Make sure you're in the code folder of your project directory and run:

> python3 test_classifiers.py

Your First Classifier!

This is the exciting part: you're going to start seeing how accurate your classifier is! There is an if/elif/else statement in main() for your benefit. When you run your program, you can choose which one of the 3 classifiers to test. If you let them all run at once, it could take quite some time, because the dataset has a couple thousand examples and each classifier has to calculate with all that information. If you run your program and nothing happens for a bit, be patient; it's probably just calculating!

Add a Multinomial Naive Bayes classifier from sklearn under the first if statement:

    if clf == '1':
        print("Naive Bayes")
        clf = MultinomialNB(alpha = 0.5, fit_prior = True) # your code
        classifcation_acc(clf,X_train, X_test, y_train, y_test)

Run the program in the command line to check for errors and see your first accuracy! Make sure you're in the code folder of your project directory and run:

> python3 test_classifiers.py

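The alpha value controls smoothing. sklearn accepts any non-negative number, and values between 0 and 1 are a good range to explore.

Try different alpha values and see which gives the highest accuracy.

The fit_prior value can be True or False. When it is True, the classifier learns how common each genre is from the training data; when it is False, it treats both genres as equally likely before looking at the words.

Try changing fit_prior and see which gives the highest accuracy.
When you think you have the highest accuracy you can get, create a pull request.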
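
To get a feel for what alpha changes, the sketch below applies the additive smoothing formula used by Multinomial Naive Bayes, (count + alpha) / (total + alpha * vocabulary size), to some made-up numbers. The function name smoothed_word_probability is hypothetical and the counts are invented for illustration.

```python
def smoothed_word_probability(word_count, total_words, vocab_size, alpha):
    """Estimate P(word | genre) with additive smoothing."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)


# Made-up numbers: "dragon" never appears among 10 sci-fi words,
# and the vocabulary has 5 distinct words.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, smoothed_word_probability(0, 10, 5, alpha))
```

With alpha at 0, a word the genre has never seen gets probability 0, which wipes out the whole product for that genre. A larger alpha gives unseen words a small non-zero probability instead, which is why changing it can change your accuracy.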

Finding the Best Classification Strategy: Part 2

Classifier Number 2: Support Vector Machines

The purpose of a support vector machine is to find the line that best separates the data by classification, in this case Sci-Fi and Fantasy. Check out another fun StatQuest "Clearly Explained" video on Support Vector Machines to learn a little bit more.
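
Once a separating line is known, classifying a point just means checking which side of the line it falls on. The sketch below shows only that last step with hand-picked numbers; a real support vector machine learns the weights and bias from the training data, and the two features used here (counts of "space" and "magic") are made up for illustration.

```python
def side_of_line(point, weights, bias):
    """Classify a point by the sign of w . x + b."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "sci-fi" if score >= 0 else "fantasy"


# Hypothetical 2-D features: (count of "space", count of "magic").
weights, bias = (1.0, -1.0), 0.0  # the line where both counts are equal
print(side_of_line((3, 0), weights, bias))
print(side_of_line((0, 2), weights, bias))
```

Kernels such as 'rbf' let the classifier draw a curved boundary instead of a straight line, which is one reason the kernel choice can affect accuracy.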

Building Your Support Vector Machine

It's that time again - time to create another awesome classifier!

Use the SVC (Support Vector Classification) from sklearn below the elif in your code/test_classifiers.py file:

...
    elif clf == '2':
        print("SVM")
        clf = SVC(C = 1.5, kernel = 'rbf', degree = 4, gamma = 'scale')# your code
        classifcation_acc(clf,X_train, X_test, y_train, y_test)
...

As you can see, this classifier has many possible parameters, and if you go to the sklearn documentation you can add even more.
To start, try different values for these parameters and aim for the highest accuracy you can.

Change the C value - it can be any positive number
Try different kernels - it can be one of 'linear', 'poly', 'rbf', 'sigmoid', or 'precomputed' ('precomputed' expects a kernel matrix you computed yourself instead of the raw data, so skip it here)
Experiment with degree values - they can be any non-negative integer, and only the 'poly' kernel uses them
Change the gamma parameter - it can be 'scale', 'auto', or a positive number
Once you think you have found the highest accuracy you can for this classifier, create a pull request.
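
Trying combinations by hand gets tedious, so here is a sketch of how you could loop over them. The helper best_combination is hypothetical, and fake_score is a stand-in with made-up numbers; in the project you would replace it with a function that builds the classifier from the given parameters and returns its measured accuracy.

```python
from itertools import product


def best_combination(grid, score_fn):
    """Try every combination in the grid and return the best (score, params)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def fake_score(C, kernel, gamma):
    """Stand-in for a real accuracy measurement; the numbers are made up."""
    bonus = {"linear": 0.02, "rbf": 0.05}[kernel]
    return 0.7 + bonus - abs(C - 1.5) * 0.01 + (0.01 if gamma == "scale" else 0.0)


grid = {"C": [0.5, 1.5, 3.0], "kernel": ["linear", "rbf"], "gamma": ["scale", "auto"]}
score, params = best_combination(grid, fake_score)
print(params, round(score, 3))
```

Keep the grid small: every extra value multiplies the number of classifiers that have to be trained, and each one can take a while on this dataset.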

Set Up

Setting up your development environment

You're almost ready to start coding, but first you have to set up your computer so you can test your program while you code.

Download git and connect to your GitHub account

If you don't already have git on your computer you can download it here
Connect to GitHub on
- Windows
- Mac
- Linux

Adding the Project to your Local Machine

You need to add this repository to your computer to make changes and edit it in the development environment. Go ahead and clone this repository on your computer. Make sure you're in a directory that you'll remember and that is easy to access. If you're not sure, your Desktop is a good go-to.

Open your Command Line (probably called Terminal or Command Prompt)

In command line:

>cd desktop
>git clone <copied link here>

example:

cd desktop
git clone https://github.com/hightechu/project-template.git

Installing Requirements

If you don't already have Python 3 on your computer, make sure you install it.

Now navigate into your project folder in the command line and install the required packages:

>cd <project-repository-name>
>pip3 install -r requirements.txt

Download Large Data Set

Go to https://github.com/hightechu/sci-fi-or-fantasy-data
Click on movie_description.csv

Click download and it will take you to a page with a lot of text

Right click in white space and select "save as"

Save as movie_descriptions.csv into your project folder

Create a pull request

Finding the Best Classification Strategy part 3 (Final Part)

Classifier Number 3: Neural Networks

Of all the machine learning techniques, this is the one you are most likely to have heard of. When people think of artificial intelligence, they often think of computers trying to emulate human intelligence. Human behaviour is a result of neurons transmitting information throughout the nervous system. Neural networks are named after neurons because their structure is inspired by the human nervous system. Check out this video to learn more.

Creating Your Neural Network

It may be no surprise to you that sklearn has a neural network class.

Add a neural network under the last branch of the if/elif/else statement in code/test_classifiers.py:

...
    elif clf == '3':
        print("Neural Network")
        clf = MLPClassifier(hidden_layer_sizes=(80,), activation = 'tanh',
            alpha = 0.0005, learning_rate_init = 0.005)
        classifcation_acc(clf,X_train, X_test, y_train, y_test)
...

Run your code to see the accuracy and test for errors:

> python3 test_classifiers.py

This classifier also has a lot of parameters. Start with the ones in the code above and try out others from the sklearn docs if you want!

hidden_layer_sizes - a tuple with one positive integer per hidden layer, e.g. (100,) for a single layer of 100 neurons
activation - can be 'identity', 'logistic', 'tanh', 'relu'
alpha - positive number (keep it small)
learning_rate_init - positive number (keep it small)
Once you think you have found the highest accuracy you can for this classifier, create a pull request.
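
To picture what hidden_layer_sizes and activation control, the sketch below pushes one input through a single hidden layer by hand. The weights, biases, and inputs are made up; a real network learns its weights during training, and this shows only the forward calculation, not the learning.

```python
import math


def hidden_layer(inputs, weights, biases):
    """Compute tanh(w . x + b) for each neuron in one hidden layer."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(math.tanh(total))
    return outputs


# Made-up example: 3 word counts feeding a hidden layer of 2 neurons.
inputs = [1, 0, 2]
weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]]
biases = [0.0, 0.1]
print(hidden_layer(inputs, weights, biases))
```

Here the layer has 2 neurons, which would correspond to hidden_layer_sizes=(2,), and the activation is 'tanh', which squeezes every output into the range between -1 and 1.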

Finding the Best Classification Strategy part 3 (Final Part)

Have a question? Tag @cairosanders in a comment.

Classifier Number 3: Neural Networks

Of all machine learning techniques, you may have heard of this one. When many people think of Artificial intelligence, they often think of computers trying to emulate human intelligence. Human behaviour is a result of neurons transmitting information throughout the nervous system. Neural networks are named after them because their structure is inspired by the human nervous system. Check out this video to learn more.

Creating Your Neural Network

It may be no surprise to you that sklearn has a neural network class.

Add a neural network after the else: in code/test_classifiers.py

...
    elif clf == '3':
        print("Neural Network")
        clf = MLPClassifier(hidden_layer_sizes=(80,), activation = 'tanh',
            alpha = 0.0005, learning_rate_init = 0.005)
        classifcation_acc(clf,X_train, X_test, y_train, y_test)
...

Run your code to see the accuracy and test for errors:

> python3 test_classifiers.py

This classifier also has a lot of parameters. Start with the one's in the code above and try out others from the sklearn docs if you want!

hidden_layer_sizes - can be any positive integer in the format (100,)
activation - can be 'identity', 'logistic', 'tanh', 'relu'
alpha - positive number (keep it small)
learning_rate_init - positive number (keep it small)
Once you think you have found the highest accuracy you can for this classifier, create a pull request

Finding the Best Classification Strategy: part 1

Classifiers

Classifiers are different formulas that you can use to classify data. In the case of this project the data is the movie description and the classification is either sci-fi or fantasy. Luckily Sklearn has some existing function that do the calculations for you; However, you are going to be providing the values that the function will use to perform the calculation. By changing the values, the classifier will have a different accuracy (how often it guesses the correct genre for a given description), and your goal is to find the highest accuracy possible!

Naive Bayes

Naive Bayes is a mathematical formula that calculates probability. Check out this quick review to get a basic understanding. In the video the classication is either "normal" mail or "spam" mail - you can think of this as the equivalent of the sci-fi or fantasy labels. Feel free to watch the whole video, but this section gives a surface level explanation to how this formula works and it's only 3 minutes long!

Time to start coding!

Open up your project in a text-editor and navigate to the "code folder". In there you'll find a Python file called "test_classifiers.py". That's where the action happens. You can see the first section already has some code.

...
def main():
    movie_data = pd.read_csv('../movie_descriptions.csv', header=0)
    movie_label = pd.read_csv('../movie_genres.csv', header=0)
    movie_label=movie_label.values.ravel()
...

This reads the CSV files (the files that contain the data) and turns them into a structure that Python can work with. If you want, you can open the CSV files in Excel or Google Sheets and take a look at how they looked originally. You'll notice that movie_descriptions.csv is a very large file with words at the top as the headers and numbers filling in the rows and columns. This is because the file is a record of how many times each word occurs in a given description. This format is called a document term matrix and looks like this:
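
Here is a small sketch of how a document term matrix can be built from two made-up descriptions, using only the standard library. The function name build_dtm and the sample sentences are assumptions for illustration; the project's real CSV was prepared ahead of time and may have been cleaned differently.

```python
from collections import Counter

def build_dtm(descriptions):
    """Return (vocabulary, rows): one row of word counts per description."""
    tokenized = [d.lower().split() for d in descriptions]
    # The column headers are every distinct word, in alphabetical order.
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Two made-up movie descriptions.
descriptions = ["a wizard finds a dragon", "a robot finds a spaceship"]
vocabulary, rows = build_dtm(descriptions)

print(vocabulary)  # ['a', 'dragon', 'finds', 'robot', 'spaceship', 'wizard']
for row in rows:
    print(row)
# [2, 1, 1, 0, 0, 1]
# [2, 0, 1, 1, 1, 0]
```

Each row is one description and each column is one word, so most cells are 0 in a large file like movie_descriptions.csv.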

Training and Testing

To start, you'll split your data into a training and test set. You do this because you need to "train" your machine to understand patterns, just like you would study key terms to memorize them for a test. The test set is used to see how accurate your trained machine is on new data that it hasn't seen before. This would be like taking a test at school and getting your grade back.

You're lucky again because Sklearn has a method that splits the data for you: train_test_split

# X_train, X_test, y_train, y_test = train_test_split(<data>, <label>, <percentage of data used for testing>, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(movie_data, movie_label, test_size=0.2, random_state=1)

Add a line of code that splits your data into a training and a testing set

You can change the <percentage of data used for testing>, but in general the training set should be larger than the test set.
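
If you are curious what a split like this does under the hood, the sketch below shuffles the row positions and slices them into two groups. It is a simplified stand-in written for this explanation (the name simple_split is made up) and it will not pick the same rows as sklearn's train_test_split.

```python
import random

def simple_split(data, labels, test_size=0.2, seed=1):
    """Shuffle the row positions, then slice off test_size of them for testing."""
    indices = list(range(len(data)))
    # A fixed seed gives the same shuffle every run, like random_state=1.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(data) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [data[i] for i in train_idx]
    X_test = [data[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Ten made-up samples; even numbers are labelled sci-fi, odd ones fantasy.
data = list(range(10))
labels = ["sci-fi" if n % 2 == 0 else "fantasy" for n in data]
X_train, X_test, y_train, y_test = simple_split(data, labels, test_size=0.2)
print(len(X_train), len(X_test))  # 8 2
```

The important part is that each sample keeps its own label and that no sample ends up in both sets.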

Run the program in command line to check for errors. Make sure you're in the code folder of your project directory and run:

> python3 test_classifiers.py

Your First Classifier!

This is the exciting part: you're going to start seeing how accurate your classifier is! There is an if/elif/else statement in main() for your benefit. When you run your program, you can choose one of the 3 classifiers you're going to test. If you let them all run, it could take quite some time because the dataset has around 2000 examples and each classifier has to calculate with all that information. If you run your program and nothing happens for a bit, be patient, it's probably just calculating!

Add a Multinomial Naive Bayes classifier from sklearn under the first if statement:

    if clf == '1':
        print("Naive Bayes")
        clf = MultinomialNB(alpha = 0.5, fit_prior = True) # your code
        classifcation_acc(clf,X_train, X_test, y_train, y_test)

Run the program in command line to check for errors and see your first accuracy! Make sure you're in the code folder of your project directory and run:

> python3 test_classifiers.py

The alpha value can be any non-negative number; values between 0 and 1 are a good range to start with

Try different alpha values and see which gives the highest accuracy

The fit_prior can be True or False

Try changing the fit_prior and see which gives the highest accuracy
When you think you have the highest accuracy you can get, create a pull request
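
To see why alpha matters, here is a sketch of the smoothing step that Multinomial Naive Bayes applies to word counts: alpha is added to every count so that a word never seen in a genre does not get a probability of exactly 0. The word counts are made up and the function name is an assumption for this example. The fit_prior setting is separate: it decides whether the classifier also learns how common each genre is in the training data or treats both genres as equally likely.

```python
def smoothed_word_probs(word_counts, alpha):
    """P(word | genre) = (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in word_counts.items()
    }

# Made-up counts of three words across all fantasy descriptions.
fantasy_counts = {"dragon": 3, "wizard": 1, "robot": 0}

for alpha in (0.01, 0.5, 1.0):
    probs = smoothed_word_probs(fantasy_counts, alpha)
    print(alpha, {word: round(p, 3) for word, p in probs.items()})
```

A larger alpha pulls all the word probabilities closer together, while a tiny alpha trusts the raw counts almost completely.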

Introduction to GitHub

Since you're going to be using GitHub for your project, it's a good idea to become comfortable with it before diving in.
Check out the Introduction to GitHub Course by the GitHub Training Team.

Finding the Best Classification Strategy part 3 (Final Part)

Classifier Number 3: Neural Networks

Of all machine learning techniques, you may have heard of this one. When many people think of Artificial intelligence, they often think of computers trying to emulate human intelligence. Human behaviour is a result of neurons transmitting information throughout the nervous system. Neural networks are named after them because their structure is inspired by the human nervous system. Check out this video to learn more.

Creating Your Neural Network

It may be no surprise to you that sklearn has a neural network class.

Add a neural network after the else: in code/test_classifiers.py

...
    elif clf == '3':
        print("Neural Network")
        clf = MLPClassifier(hidden_layer_sizes=(80,), activation = 'tanh',
            alpha = 0.0005, learning_rate_init = 0.005)
        classifcation_acc(clf,X_train, X_test, y_train, y_test)
...

Run your code to see the accuracy and test for errors:

> python3 test_classifiers.py

This classifier also has a lot of parameters. Start with the ones in the code above and try out others from the sklearn docs if you want!

hidden_layer_sizes - a tuple with one positive integer per hidden layer, e.g. (100,) for a single layer of 100 neurons
activation - can be 'identity', 'logistic', 'tanh', 'relu'
alpha - positive number (keep it small)
learning_rate_init - positive number (keep it small)
Once you think you have found the highest accuracy you can for this classifier, create a pull request

Finding the Best Classification Strategy part 2

Classifier Number 2: Support Vector Machines

The purpose of support vector machines is to find a line (a boundary) that best separates the data based on their classification, in this case Sci-Fi and Fantasy. Check out another fun StatQuest "clearly explained" video on Support Vector Machines to learn a little bit more.

Building Your Support Vector Machine

It's that time again - time to create another awesome classifier!

Use the SVC (Support Vector Classification) from sklearn below the elif in your code/test_classifiers.py file:

...
    elif clf == '2':
        print("SVM")
        clf = SVC(C = 1.5, kernel = 'rbf', degree = 4, gamma = 'scale')# your code
        classifcation_acc(clf,X_train, X_test, y_train, y_test)
...

Run the program in command line to check for errors and see your accuracy. Make sure you're in the code folder of your project directory and run:

> python3 test_classifiers.py

As you can see, this classifier has many possible parameters, and if you go to the sklearn documentation you can add even more.
To start, try different values for these parameters and aim for the highest accuracy you can.

Change the C value - it can be any positive number
Try different kernels - it can be one of 'linear', 'poly', 'rbf', 'sigmoid', or 'precomputed' (note that 'precomputed' expects a kernel matrix you computed yourself instead of the raw data, so it will not work with this data as-is)
Experiment with degree values - they can be any non-negative integer, and only the 'poly' kernel uses them
Change the gamma parameter - it can be 'scale', 'auto', or a positive number
Once you think you have found the highest accuracy you can for this classifier, create a pull request
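
To make the kernel, degree, and gamma options a little less mysterious, here is a sketch of the four built-in kernel formulas written with only the standard library. A kernel is a way of scoring how similar two samples are. The two vectors are made up, and coef0 is an extra SVC parameter (0 by default) that is not used in the code above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def poly_kernel(a, b, degree=3, gamma=1.0, coef0=0.0):
    # degree only appears in this kernel; the others ignore it.
    return (gamma * dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(a, b, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(a, b) + coef0)

# Two made-up rows of word counts.
a = [1, 0, 2]
b = [0, 1, 2]

print("linear ", linear_kernel(a, b))
print("poly   ", poly_kernel(a, b, degree=2))
print("rbf    ", rbf_kernel(a, b, gamma=0.5))
print("sigmoid", sigmoid_kernel(a, b, gamma=0.1))
```

This is why changing degree has no effect unless kernel is 'poly', and why gamma has no effect with the 'linear' kernel.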

Choosing Your Best Classifier

Your Best Classifier

At this point, you've tested out 3 (or more) machine learning algorithms and achieved the best accuracy you could find with each one! Now, choose the one that had the highest overall accuracy and you will use that one to make your app in app.py.

Start by reading in the data from the csv files just as in test_classifiers

movie_data = pd.read_csv('../movie_descriptions.csv', header=0)
movie_label = pd.read_csv('../movie_genres.csv', header=0)
movie_label=movie_label.values.ravel()

This time, you don't have to split your data into testing and training set because you already know the performance of your classifier.

Copy and paste your best classifier from test_classifiers.py (note that my example below is not necessarily the best)

clf = MultinomialNB(alpha = 0.5, fit_prior = True)

Train your machine using the fit() method from sklearn

clf.fit(movie_data, movie_label)

User Interface

Right now there is a super simple user interface with just a prompt, an input box, and a button. You need to take the input from the box and let the classifier decide if it is sci-fi or fantasy. Since the data that you used to train your classifier was in the form of a Document Term Matrix (DTM), it expects another sample of the same shape to classify. For this reason, you will first pass your sentence to the DTM() function from the DTMdescriptions class in the same folder. You don't need to work in that file, but you're welcome to take a look at it and read the comments to get an idea of how it works. The DTM() function will return a DTM version of your sentence with the same words as the original training set as the headings. Reminder of what a DTM looks like:

In the click() function in app.py, set description to the DTM version of the input
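
As a rough idea of what turning one sentence into a DTM row involves, here is a standard-library sketch. The function sentence_to_row and the four column words are made up for this example; the project's DTMdescriptions class may split and clean words differently.

```python
from collections import Counter

def sentence_to_row(sentence, columns):
    """Count how often each training-set word appears in the sentence.

    Returns a list containing one row, because classifiers expect a
    table of samples even when there is only one sample.
    """
    counts = Counter(sentence.lower().split())
    # Words that are not in the training columns are simply ignored.
    return [[counts[word] for word in columns]]

# Made-up column headers standing in for movie_data.columns.values.
columns = ["dragon", "robot", "space", "wizard"]
row = sentence_to_row("A wizard rides a dragon past another dragon", columns)
print(row)  # [[2, 0, 0, 1]]
```

The row has exactly one number per training column, in the same order, which is the "same shape" the classifier needs.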

...
def click():
    description = DTMdescriptions(input.get(), movie_data.columns.values).DTM() # your code
    # this line will be next
    genre = Label(root, text = "I think \"" + input.get() + "\" is " + genre_pred[0])
    genre.pack()
...

Next, give your new description to the trained classifier to predict whether it is sci-fi or fantasy using (no surprise here) sklearn's predict() function

...
def click():
    description = DTMdescriptions(input.get(), movie_data.columns.values).DTM()
    genre_pred = clf.predict(description) # your code
    genre = Label(root, text = "I think \"" + input.get() + "\" is " + genre_pred[0])
    genre.pack()
...

Run app.py and try out your awesome new AI app!

> python3 app.py

Once you're satisfied with your work, create a pull request

Setting Up Your Development Environment

Setting up your development environment

You're almost ready to start coding, but first you have to set up your computer so you can test your program while you code.

Download git and connect to your GitHub account

If you don't already have git on your computer you can download it here
Connect to GitHub on
- Windows
- Mac
- Linux

Adding the Project to your Local Machine

You need to add this repository to your computer to make changes and edit it in the development environment. Go ahead and clone this repository on your computer. Make sure you're in a directory that you'll remember and that is easy to access. If you're not sure, your Desktop is a good go-to.

Open your Command Line (probably called Terminal or Command Prompt)

In command line:

>cd desktop
>git clone <copied link here>

example:

cd desktop
git clone https://github.com/hightechu/project-template.git

Installing Requirements

If you don't already have it, make sure you install python3

Now navigate into your project folder on command line and install the required packages:

>cd <project-repository-name>
>pip3 install -r requirements.txt

Download Large Data Set

Go to https://github.com/hightechu/sci-fi-or-fantasy-data
Click on movie_description.csv

Click download and it will take you to a page with a lot of text

Right click in white space and select "save as"

Save as movie_descriptions.csv into your project folder

Create a pull request

Sharing your App with the World 🌎

A README is a file that helps explain the usage and purpose of software and it shows on the main page of a GitHub repository. Right now this README exists to help you understand the process of a HighTechU Project Hub Project, but now that your App is looking good and working well, you can change the README.md to reflect the purpose of YOUR app.

Edit your README.md

Explain what the purpose of the app is
Explain how the app works
Take a screenshot (or 3 or 4) or a .gif and include that in the README
Include any other information you want to share with others

Add an App Preview

Your App Preview is what people will see when you share your app on social media or when they look at your profile on the Project Hub

Navigate to settings
Click edit under Social Preview and upload one of the screenshots you took for the README

Verify Your Completion

You're done with your app! Congratulations. Now we just need to verify your completion with your mentor. To do so:

Navigate to the hightechu.yml file in this repository (you can do this on GitHub, you do not need to do it locally)
Add completed: True to the top of the file
Create a pull request

When your mentor reviews and approves your pull request, your project will display a verified completion badge on your portfolio on the project hub.

And that's it! Awesome work :)
