SimilarCourseVisual

Description

Built for WING's task one. The project crawls course codes and descriptions from NUSMods, encodes the descriptions into thought or document vectors via word embeddings, and visualizes how the courses relate to each other.

File Structure

SimilarCourseVisual/
|-- bin/  (Executable files for the project)
|  |-- _init_
|  |-- data_prepare.py  (Crawl and store data to local file)
|  |-- start.py  (Starting program)
|-- core/  (Store all source code for the project (core code))
|  |-- tests/  (Store unit test code)
|  |  |-- init.py
|  |  |-- test.main.py
|  |-- init.py
|  |-- test_main.py  (Store core logic)
|-- conf/  (Configuration file)
|  |-- init.py
|  |-- setting.py  (Write the relevant configuration)
|-- db/  (Database files)
|  |-- db.json  (Store database files)
|-- docs/  (Store some documents)
|-- lib/  (Library files, put custom modules and packages)
|  |-- init.py
|  |-- common.py  (Write commonly used functions)
|-- log/  (Log files)
|  |-- access.log  (logs)
|-- README  (Project description document)

Data Structure

table Modsinfo (
ModsID int,
ModsCode text,
ModsDetails text)
The table mainly covers: ModsID, the unique identifier in the database; ModsCode, the NUS course code; ModsDetails, the description of each module.

How to use

Prepare data

Run data_prepare.py, which calls the functions in core to crawl the course information and store it in db.sqlite3.

python bin/data_prepare.py

Run

  • Then run start.py to execute the main function, which draws a word cloud of the most frequent words into docs/cloudWord.html and shows their similarity with t-SNE.
python bin/start.py drawWords
  • You can also run start.py to draw the similarity of the sampled modules by passing a different argv.
python bin/start.py drawWords
  • You can also run start.py to recommend the course most similar to a given one.
python bin/start.py -r CS5424

# return:
# The most similar course:
# CS4224
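
One way such a recommendation could work is to compare course vectors by cosine similarity. This is a minimal sketch under that assumption, not necessarily what start.py does; the vectors and the most_similar helper are made up for illustration:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(code, course_vectors):
    """Return the other course whose vector is closest to the given course."""
    target = course_vectors[code]
    others = (c for c in course_vectors if c != code)
    return max(others, key=lambda c: cosine_similarity(target, course_vectors[c]))


# Toy two-dimensional vectors, invented for illustration only.
toy_vectors = {
    "CS5424": [0.9, 0.1],
    "CS4224": [0.8, 0.2],
    "PH1101": [0.1, 0.9],
}
print(most_similar("CS5424", toy_vectors))
```

In the project, the course vectors would come from averaging the keyword vectors of each description rather than from hand-written numbers.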

Others

You can modify the global variables directly in lib:
STOPWORD_LOCATION = "../docs/Foxstoplist.txt"
DB_LOCATION = "../db/db.sqlite3"
MODEL_LOCATION = "../db/word_embedding_corpus"
MODEL2_LOCATION = "../db/keyword_embedding_corpus"
CONSTANT_DB_PATH = "../db/db.sqlite3"
MAIN_FUNCTION_PATH = "../core/main.py"

CONSTANT_DOMAIN = "http://api.nusmods.com/"

# Random sampling number
RANDOM_SAMPLING_NUM = 100
# The number of points shown in the figure
SHOW_WORD_NUM = 100

Interfaces to the DB: class ModsDB
def store_to_sqlite(self, id, code, details)
def read_all_from_sqlite(self)
def read_details_bycode(self, code)
def if_module_exist(self, code)
def create_database(self)
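
A minimal sketch of how a class with these methods might look, assuming the standard sqlite3 module and the Modsinfo table from the Data Structure section. The method bodies are illustrative guesses, not the code in /core/mods_db.py:

```python
import sqlite3


class ModsDB(object):
    """Illustrative wrapper around a Modsinfo table in SQLite."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)

    def create_database(self):
        # Same three columns as the Modsinfo table described above.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS Modsinfo ("
            "ModsID int, ModsCode text, ModsDetails text)"
        )
        self.conn.commit()

    def store_to_sqlite(self, id, code, details):
        # Parameterized query, so quotes in a description cannot break the SQL.
        self.conn.execute("INSERT INTO Modsinfo VALUES (?, ?, ?)", (id, code, details))
        self.conn.commit()

    def read_all_from_sqlite(self):
        return self.conn.execute("SELECT * FROM Modsinfo").fetchall()

    def read_details_bycode(self, code):
        row = self.conn.execute(
            "SELECT ModsDetails FROM Modsinfo WHERE ModsCode = ?", (code,)
        ).fetchone()
        return row[0] if row else None

    def if_module_exist(self, code):
        return self.read_details_bycode(code) is not None
```

Passing ":memory:" as the path gives a throwaway database that is convenient for experiments and unit tests.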

Unit tests are stored in /core/test/.
See test.ipynb for details.

Process

Grab information from the server

  • Use save_code_description(year1, year2, semester) to crawl course information online.
import logging
import urllib2  # Python 2 standard library


def open_url(url):
    '''
    Get the package from the given url, sending a browser-like User-Agent
    so that the request resembles a manual visit.
    :param url: http://...
    :return: response body
    '''
    logging.info('Holden: Try to get the package from ' + url)
    req = urllib2.Request(url)

    # Browser-like User-Agent header
    req.add_header('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36 QQBrowser/4.1.4132.400')
    response = urllib2.urlopen(req)

    logging.info('Holden: Successfully got the package from url')
    return response.read()

The add_header call makes the request look like it comes from a browser. The returned data is kept in memory as JSON.
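
The urllib2 module exists only in Python 2. A minimal sketch of the same request on Python 3, assuming the standard urllib.request module; build_request and fetch_json are hypothetical helper names that are not part of the project, and the response is assumed to be JSON:

```python
import json
import urllib.request

# Browser-like User-Agent string; the value is illustrative, any realistic one would do.
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)"


def build_request(url):
    """Create a request that carries a browser-like User-Agent header."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", USER_AGENT)
    return req


def fetch_json(url, timeout=10):
    """Download the URL and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the request construction separate from the download makes the header logic easy to check without touching the network.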

  • Use ModsDB.store_to_sqlite(self, id, code, details) to store each module's ID, code and description one by one.

Create a handle to the database

mods_db = ModsDB(DB_LOCATION)
ModsDB is defined by the class in /core/mods_db.py.

Take all the data from the database for training

all_classcode_list, all_keywords_list = keywords_list_from_db(mods_db)
Extract the keywords from each course description and return them together with the course codes, using the Rapid Automatic Keyword Extraction (RAKE) algorithm.

r = Rake()
r.extract_keywords_from_text(modules_db[row][2])
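
To show what RAKE does under the hood, here is a simplified, dependency-free sketch of the algorithm: split the text into candidate phrases at stopwords and punctuation, score each word by degree divided by frequency, and score each phrase by the sum of its word scores. It is not the Rake class used above, and the tiny stop list is an assumption for illustration (the project configures a Fox stop list file):

```python
import re
from collections import defaultdict

# Tiny illustrative stop list, not the one the project uses.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "this", "is", "are", "on", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_keywords(text, stopwords=STOPWORDS):
    """Return (score, phrase) pairs, highest score first."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, itself included
    scored = {}
    for phrase in set(phrases):
        scored[" ".join(phrase)] = sum(degree[w] / float(freq[w]) for w in phrase)
    return sorted(((score, phrase) for phrase, score in scored.items()), reverse=True)
```

Longer phrases made of rarely repeated words score highest, which is why multi-word terms tend to surface as keywords.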

Take random modules to show

For ease of display, 100 courses are randomly selected and drawn on the chart. Sampling randomly from the database lets the plot still reflect the clustering.
sample_classcode_list, sample_keywords_list = random_keywords_list_from_db(mods_db)
The sampling uses a random shuffle:

random_arr = np.arange(len(classcode_list))
np.random.shuffle(random_arr)
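
The important detail when sampling is that course codes and keyword lists stay aligned. A minimal sketch of that idea, using the standard random module instead of NumPy to stay dependency-free; sample_parallel is a hypothetical helper, not the project's random_keywords_list_from_db:

```python
import random


def sample_parallel(classcode_list, keywords_list, sample_num=100, seed=None):
    """Pick up to sample_num courses at random, keeping codes and keywords aligned."""
    indices = list(range(len(classcode_list)))
    random.Random(seed).shuffle(indices)  # shuffle positions, not the data itself
    chosen = indices[:sample_num]
    sampled_codes = [classcode_list[i] for i in chosen]
    sampled_keywords = [keywords_list[i] for i in chosen]
    return sampled_codes, sampled_keywords
```

Shuffling an index array and slicing it, as the NumPy lines above do, avoids shuffling the two lists independently and losing the pairing.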

Word embedding or Load model

Convert all word lists into word vectors using word2vec and save the model to a local path. If a trained model already exists in the workspace, load it directly instead.

    if os.path.exists(MODEL2_LOCATION):
        model = Word2Vec.load(MODEL2_LOCATION)  # Load trained model
    else:
        model = build_model(all_keywords_list)  # Training data from scratch, embedding words

Draw frequency of words in corpus

draw_top_frequency(all_keywords_list)
Count the frequencies of all keywords in the corpus and show the 1000 most frequent words using WordCloud and pyecharts. The words are also plotted on a bitmap to show how they relate to one another.

corpus = build_corpus(keywords_list)  # A word counter including all keywords
draw_word_count(corpus)
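
A word counter like the one build_corpus returns could be built with collections.Counter. This is a minimal sketch under that assumption; build_corpus_sketch and top_words are hypothetical names, not the project's functions:

```python
from collections import Counter


def build_corpus_sketch(keywords_list):
    """Count how often each keyword appears across all courses."""
    counter = Counter()
    for keywords in keywords_list:
        counter.update(keywords)
    return counter


def top_words(counter, n=100):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in counter.most_common(n)]
```
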

Build a corpus containing the frequency of occurrence of each word. Store the 100 words with the highest frequency and their vectors.

words_list, words_vectors = word_and_vec_list(corpus)
plot_tsne2d(words_vectors, words_list)

We can see that words such as [ 'modules' , 'courses' , 'topics' , 'major' ], [ 'work' , 'study' , 'skills' , 'project' ] and [ 'applications' , 'practice' , 'problems' ] are grouped together, meaning that they occupy very similar positions in the descriptions.

Draw modules' relationship

The idea is to average the vectors of the keywords in each course's description. This gives the approximate position of the course in the vector space, which is then mapped to two-dimensional space and plotted.

draw_modules_relation(sample_keywords_list, sample_classcode_list)
Because there are too many courses to read clearly in one chart, a random sample of courses is plotted instead.

def draw_modules_relation(keywords_list,classcode_list):
    '''
    Plot the modules' relationships based on the entire corpus
    :param keywords_list: [["word",],]
    :param classcode_list: ["CS5242",]
    :return: Draw plot
    '''
    classvec_by_class_list = word2vec_by_class(keywords_list)
    classvec_list = mean_class_vec_list(classvec_by_class_list)

    plot_tsne2d(classvec_list, classcode_list)
    logging.info('Holden: Finish drawing class relationship')
    return

The function word2vec_by_class converts the description of each module into the corresponding vectors and saves them in classvec_by_class_list.
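
The averaging step can be illustrated without a trained model. A minimal sketch, assuming each course is represented by a non-empty list of equal-length keyword vectors; mean_vector and mean_vectors_by_class are hypothetical stand-ins for what mean_class_vec_list does, not the project's implementation:

```python
def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    dim = len(vectors[0])
    count = float(len(vectors))
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]


def mean_vectors_by_class(vectors_by_class):
    """Map each course's keyword vectors to a single course vector."""
    return [mean_vector(vectors) for vectors in vectors_by_class]
```

Each resulting course vector has the same dimension as the word vectors, so the same t-SNE plotting function can be reused for courses and for words.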

From the plot, it can be found that:

  • EE (Electrical and Computer Engineering), CS (Computer Science) and IS (Information Systems and Analytics), which are closely related to computers, are gathered together.
  • CN (Chinese Studies), SW (Social Work) and PH (Philosophy), which are liberal arts majors, are gathered together.
  • ST (Statistics and Applied Probability), EC (Economics) and AC (Accounting), which are closely related to statistics, are gathered together.

Reference

[1] Fox, Christopher. A stop list for general text[J]. ACM SIGIR Forum, 1989, 24(1-2):19-21.
[2] Rose, Stuart J., Engel, David W., et al. Automatic Keyword Extraction from Individual Documents[M]// Text Mining: Applications and Theory. John Wiley & Sons, Ltd, 2010.
