For WING's task one: crawl course codes and descriptions from NUSMods, encode the descriptions into document vectors via word embeddings, and visualize how the courses relate to each other.
SimilarCourseVisual/
|-- bin/                  (executable files for the project)
|   |-- __init__.py
|   |-- data_prepare.py   (crawl and store data to a local file)
|   |-- start.py          (entry point)
|-- core/                 (core source code for the project)
|   |-- tests/            (unit tests)
|   |   |-- __init__.py
|   |   |-- test_main.py
|   |-- __init__.py
|   |-- main.py           (core logic)
|-- conf/                 (configuration)
|   |-- __init__.py
|   |-- setting.py        (project settings)
|-- db/                   (database files)
|   |-- db.sqlite3        (SQLite database)
|-- docs/                 (documents)
|-- lib/                  (custom modules and packages)
|   |-- __init__.py
|   |-- common.py         (commonly used functions)
|-- log/                  (log files)
|   |-- access.log
|-- README                (project description)
table Modsinfo (
    ModsID int,
    ModsCode text,
    ModsDetails text)
Columns: ModsID is the unique identifier in the database; ModsCode is the NUS course code; ModsDetails is the description of each module.
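The schema above can be created with the standard-library sqlite3 module. A minimal sketch (the table and column names follow the schema above; the sample row and use of an in-memory database are illustrative):

```python
import sqlite3

# Open a database; ":memory:" is used here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create the Modsinfo table from the schema above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS Modsinfo (
        ModsID INTEGER PRIMARY KEY,
        ModsCode TEXT,
        ModsDetails TEXT
    )
""")

# Insert one sample row (made-up description).
cur.execute(
    "INSERT INTO Modsinfo (ModsID, ModsCode, ModsDetails) VALUES (?, ?, ?)",
    (1, "CS5424", "Distributed databases."),
)
conn.commit()

print(cur.execute("SELECT ModsCode FROM Modsinfo").fetchall())  # → [('CS5424',)]
```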
- Run data_prepare.py, which calls functions in core/ to crawl module information and store it in db.sqlite3:
python bin/data_prepare.py
- Then run start.py to execute the main function, which draws a cloud of the top-frequency words into docs/cloudWord.html and shows their similarity via t-SNE:
python bin/start.py drawWords
- You can also run start.py with another argv to draw the similarity of the sampled modules:
python bin/start.py drawWords
- And you can run start.py to recommend the most similar course for you:
python bin/start.py -r CS5424
# returns:
# The most similar course:
# CS4224
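The recommendation step can be understood as a nearest-neighbour search over averaged course vectors. A minimal sketch of that idea (the function names and toy vectors are illustrative, not from the project):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(code, course_vecs):
    """Return the course whose mean keyword vector is closest to `code`'s."""
    query = course_vecs[code]
    others = ((c, cosine(query, v)) for c, v in course_vecs.items() if c != code)
    return max(others, key=lambda cv: cv[1])[0]

# Toy vectors standing in for averaged keyword embeddings.
course_vecs = {
    "CS5424": [1.0, 0.0, 0.5],
    "CS4224": [0.9, 0.1, 0.4],   # close to CS5424
    "PH1101": [0.0, 1.0, 0.0],   # far away
}
print(recommend("CS5424", course_vecs))  # → CS4224
```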
Global variables can be modified directly in the lib:

STOPWORD_LOCATION = "../docs/Foxstoplist.txt"
DB_LOCATION = "../db/db.sqlite3"
MODEL_LOCATION = "../db/word_embedding_corpus"
MODEL2_LOCATION = "../db/keyword_embedding_corpus"
CONSTANT_DB_PATH = "../db/db.sqlite3"
MAIN_FUNCTION_PATH = "../core/main.py"
CONSTANT_DOMAIN = "http://api.nusmods.com/"
RANDOM_SAMPLING_NUM = 100   # random sampling number
SHOW_WORD_NUM = 100         # number of points shown in the figure
Interfaces to the DB are provided by the class ModsDB:

def store_to_sqlite(self, id, code, details)
def read_all_from_sqlite(self)
def read_details_bycode(self, code)
def if_module_exist(self, code)
def create_database(self)

Unit tests are stored at /core/tests/; see test.ipynb for details.
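A minimal sketch of what such a class could look like on top of sqlite3 (the method names follow the interface list above and the table layout follows the Modsinfo schema; everything else, including the in-memory database, is an assumption for illustration):

```python
import sqlite3

class ModsDB:
    """Thin wrapper around the SQLite database (illustrative sketch)."""

    def __init__(self, db_location):
        self.conn = sqlite3.connect(db_location)
        self.create_database()

    def create_database(self):
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS Modsinfo "
            "(ModsID INTEGER, ModsCode TEXT, ModsDetails TEXT)"
        )
        self.conn.commit()

    def store_to_sqlite(self, id, code, details):
        self.conn.execute(
            "INSERT INTO Modsinfo VALUES (?, ?, ?)", (id, code, details)
        )
        self.conn.commit()

    def read_all_from_sqlite(self):
        return self.conn.execute("SELECT * FROM Modsinfo").fetchall()

    def read_details_bycode(self, code):
        row = self.conn.execute(
            "SELECT ModsDetails FROM Modsinfo WHERE ModsCode = ?", (code,)
        ).fetchone()
        return row[0] if row else None

    def if_module_exist(self, code):
        return self.read_details_bycode(code) is not None

# Usage: an in-memory DB keeps the sketch self-contained.
db = ModsDB(":memory:")
db.store_to_sqlite(1, "CS5424", "Distributed databases.")
print(db.if_module_exist("CS5424"))  # → True
```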
- Use
save_code_description(year1, year2, semester)
to crawl course information online.
import logging
import urllib2  # Python 2; on Python 3 use urllib.request

def open_url(url):
    '''
    Get the page from the given url, disguising the request as human browsing behavior.
    :param url: http://...
    :return: response body
    '''
    logging.info('Holden: Try to get the package from ' + url)
    req = urllib2.Request(url)
    # A browser User-Agent makes the request look like human browsing.
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36 QQBrowser/4.1.4132.400')
    response = urllib2.urlopen(req)
    logging.info('Holden: Successfully got the package from url')
    return response.read()
The request is disguised as human browsing behavior via add_header. The response data is kept in memory as JSON.
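Parsing the fetched JSON into (code, description) pairs can be sketched with the standard-library json module; the field names ModuleCode and ModuleDescription are an assumption about the NUSMods API response shape, and the sample payload is made up:

```python
import json

# Made-up sample in the assumed shape of the NUSMods API response.
payload = '''[
    {"ModuleCode": "CS5424", "ModuleDescription": "Distributed databases."},
    {"ModuleCode": "CS4224", "ModuleDescription": "Distributed data management."}
]'''

modules = json.loads(payload)
pairs = [(m["ModuleCode"], m["ModuleDescription"]) for m in modules]
print(pairs[0])  # → ('CS5424', 'Distributed databases.')
```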
- Use
ModsDB.store_to_sqlite(self, id, code, details)
to store the ID, module code, and module description one by one.
mods_db = ModsDB(DB_LOCATION)
The class is defined in /core/mods_db.py.
all_classcode_list, all_keywords_list = keywords_list_from_db(mods_db)
This extracts the keywords from each course introduction and returns them together with the course codes, using the Rapid Automatic Keyword Extraction (RAKE) algorithm:
r = Rake()
r.extract_keywords_from_text(modules_db[row][2])
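The project calls RAKE through a Rake object; as a self-contained illustration of the underlying idea (not the actual library), here is a minimal RAKE-style scorer that splits text at stopwords and ranks candidate phrases by summed word degree/frequency; the tiny stopword set is made up for the example:

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set; the project uses Fox's stop list.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "this", "course"}

def rake_style_keywords(text):
    """Rank candidate phrases by summed word degree/frequency (RAKE-style)."""
    words = re.findall(r"[a-z']+", text.lower())
    # Split the word stream into candidate phrases at stopwords.
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # degree(w) = total length of phrases containing w; freq(w) = occurrences.
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)
    score = lambda p: sum(degree[w] / freq[w] for w in p)
    return sorted((" ".join(p) for p in phrases),
                  key=lambda s: score(s.split()), reverse=True)

print(rake_style_keywords(
    "This course covers distributed database systems and distributed transactions."))
```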
For convenience of display, 100 samples are randomly selected to draw in the figure; sampling at random from the database also reflects the clustering effect.
sample_classcode_list, sample_keywords_list = random_keywords_list_from_db(mods_db)
The shuffling uses numpy:
random_arr = np.arange(len(classcode_list))
np.random.shuffle(random_arr)
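The shuffle-then-slice sampling above can be sketched end to end as follows (RANDOM_SAMPLING_NUM mirrors the setting shown earlier; the toy course lists are illustrative):

```python
import numpy as np

RANDOM_SAMPLING_NUM = 3  # the project setting uses 100

classcode_list = ["CS5424", "CS4224", "PH1101", "EC1301", "ST2131"]
keywords_list = [["distributed"], ["data"], ["ethics"], ["markets"], ["probability"]]

# Shuffle index positions, then keep the first RANDOM_SAMPLING_NUM of them,
# so codes and keyword lists stay aligned.
random_arr = np.arange(len(classcode_list))
np.random.shuffle(random_arr)
picked = random_arr[:RANDOM_SAMPLING_NUM]

sample_classcode_list = [classcode_list[i] for i in picked]
sample_keywords_list = [keywords_list[i] for i in picked]
print(sample_classcode_list)
```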
All word lists are converted into word vectors with word2vec, and the model is saved to a local path. If a trained model already exists in the workspace, it is loaded directly instead:

if os.path.exists(MODEL2_LOCATION):
    model = Word2Vec.load(MODEL2_LOCATION)  # load the trained model
else:
    model = build_model(all_keywords_list)  # train from scratch, embedding the words
draw_top_frequency(all_keywords_list)
All keywords in the corpus are counted by frequency. The 1000 most frequent words are shown with WordCloud and pyecharts, and plots are used to show the connections between words:

corpus = build_corpus(keywords_list)  # a word counter including all keywords
draw_word_count(corpus)

The corpus records each word's frequency of occurrence. The 100 highest-frequency words and their vectors are kept:

words_list, words_vectors = word_and_vec_list(corpus)
plot_tsne2d(words_vectors, words_list)
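The frequency-counting step can be sketched with the standard-library Counter; build_corpus and SHOW_WORD_NUM mirror names used above, while the keyword lists are toy data:

```python
from collections import Counter

SHOW_WORD_NUM = 3  # the project setting uses 100

def build_corpus(keywords_list):
    """Flatten the per-course keyword lists into one frequency counter."""
    counter = Counter()
    for keywords in keywords_list:
        counter.update(keywords)
    return counter

keywords_list = [
    ["database", "systems", "database"],
    ["systems", "design"],
    ["database"],
]
corpus = build_corpus(keywords_list)
# Keep only the SHOW_WORD_NUM most frequent words for plotting.
top_words = [w for w, _ in corpus.most_common(SHOW_WORD_NUM)]
print(top_words)  # → ['database', 'systems', 'design']
```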
We can see that words such as ['modules', 'courses', 'topics', 'major'], ['work', 'study', 'skills', 'project'], and ['applications', 'practice', 'problems'] are grouped together, meaning they occupy very similar positions in the descriptions.
The idea is to average the vectors of the keywords in each course's introduction text to get an approximate position for the course in the vector space, then map it into two-dimensional space and plot it.
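The averaging step can be sketched with numpy; the small embedding dictionary here stands in for the trained word2vec model and is made up:

```python
import numpy as np

# Toy word vectors standing in for a trained word2vec model.
word_vecs = {
    "distributed": np.array([1.0, 0.0]),
    "database":    np.array([0.8, 0.2]),
    "ethics":      np.array([0.0, 1.0]),
}

def course_vector(keywords, word_vecs):
    """Average the vectors of a course's keywords to position the course."""
    vecs = [word_vecs[w] for w in keywords if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

vec = course_vector(["distributed", "database"], word_vecs)
print(vec)  # → [0.9 0.1]
```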
draw_modules_relation(sample_keywords_list, sample_classcode_list)
Because the total number of courses is too large to observe clearly in one chart, the courses drawn are randomly sampled.
def draw_modules_relation(keywords_list, classcode_list):
    '''
    Draw the modules' relationships based on the entire corpus, and plot them
    :param keywords_list: [["word",],]
    :param classcode_list: ["CS5242",]
    :return: Draw plot
    '''
    classvec_by_class_list = word2vec_by_class(keywords_list)
    classvec_list = mean_class_vec_list(classvec_by_class_list)
    plot_tsne2d(classvec_list, classcode_list)
    logging.info('Holden: Finish drawing class relationship')
    return
The function word2vec_by_class converts the descriptions of each module into the corresponding vectors and saves them in classvec_by_class_list.
From the plot it can be found that:
- EE (Electrical and Computer Engineering), CS (Computer Science), and IS (Information Systems and Analytics), which are closely related to computers, are gathered together.
- CN (Chinese Studies), SW (Social Work), and PH (Philosophy), all liberal arts majors, are gathered together.
- ST (Statistics and Applied Probability), EC (Economics), and AC (Accounting), which are closely related to statistics, are gathered together.
[1] Fox, Christopher. "A Stop List for General Text." ACM SIGIR Forum, 1989, 24(1-2): 19-21.
[2] Rose, Stuart; Engel, Dave; Cramer, Nick; Cowley, Wendy. "Automatic Keyword Extraction from Individual Documents." In Text Mining: Applications and Theory. John Wiley & Sons, Ltd, 2010.