-
Team Name | Lunar-Tsai
- Manuel Suarez Lunar
- Wei-Lun (Will) Tsai (team captain)
-
University of Illinois Urbana-Champaign | Fall 2023
-
CS 410: Text Information Systems | ChengXiang Zhai
-
Code Repo | github.com/willtsai/quote-finder
-
Demo Video | drive.google.com/file/d/1bw_fsboU2rEsjHIbS7dBuFE5-w_5QDTT/view?usp=sharing
-
PPT slides | github.com/willtsai/quote-finder/blob/main/course-deliverables/project-final-presentation-slides.pptx
Quote Finder is a unique search engine / text retrieval tool specifically tailored for literature and quote enthusiasts. Users input a particular sentiment or emotion and then Quote Finder searches the catalog of famous quotes available on goodreads.com/quotes to return a ranked list of quotes and respective authors resonating with the user's specified sentiment.
We completed four of our initially proposed goals by implementing a working web crawler, preprocessor, searcher, and web application. These tasks took slightly more than the 20 hours of coding per team member that we had originally estimated, because of the time spent integrating the searcher with the web app and debugging the result. As a result, we were not able to complete our two stretch goals: implementing a feedback component and hosting the application on Azure. However, we completed the core functionality of the application with a satisfactory set of initial results. In addition to applying the methods learned in this course, we learned how to build a web crawler, use a sentiment library, and build a web application. Aside from the two stretch goals, the only other enhancements we would make given more time and resources are extending sentiment analysis to the quotes themselves and tuning the ranker to further improve the quality of the results.
The architecture consists of five main components: three backend components, (1) web crawler, (2) preprocessor, and (3) searcher, plus (4) a web application and (5) a user interface. The web crawler collects raw quotes data from the Goodreads quotes website, which is then processed by the preprocessor to extract, normalize, and tokenize the quotes along with their corresponding metadata. The searcher parses the user's sentiment input, builds an inverted index, and searches the index to return a set of quotes ranked by relevance to the query. A web application (API) is built on top of these components and is fronted by a user interface through which users interact with the application.
Figure 1: Overall architecture diagram for Quote Finder
Between the two team members, the coding and documentation work was distributed as follows:
-
Manuel Suarez Lunar
- Web crawler
- Web application
- User interface
- Powerpoint presentation
-
Wei-Lun (Will) Tsai
- Preprocessor
- Searcher
- Code documentation
The web crawler is implemented using the Scrapy Python library in web_crawler.py. It crawls the Goodreads quotes website and extracts quotes, authors, and tags for each quote, and then follows the "next" link to crawl subsequent pages. The quotes data is output to a JSON file that is picked up by the preprocessor.
The preprocessor is implemented in Python, and the code can be found in the preprocessor.py file. It takes a raw quotes data file (in JSON format) as input, and outputs two data files: (1) a quotes.dat file containing the processed quotes data, and (2) a quotes_map.json metadata file containing the metadata for each quote. The quotes.dat file is used by the searcher to build the inverted index, while the quotes_map.json file is used to display the metadata for each quote in the user interface.
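As an illustration of the two outputs described above, here is a minimal sketch of a preprocessing step. The input field names (`quote`, `author`, `tags`), the normalization rules, and the one-quote-per-line layout of quotes.dat are assumptions made for this example; preprocessor.py may use a different schema.

```python
import json
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    return " ".join(text.split())


def preprocess(raw_quotes):
    """Return (lines for quotes.dat, metadata map for quotes_map.json)."""
    lines = []
    quotes_map = {}
    for record in raw_quotes:
        cleaned = normalize(record.get("quote", ""))
        if not cleaned:
            continue  # skip records with no usable text
        doc_id = len(lines)  # line number in quotes.dat doubles as the ID
        lines.append(cleaned)
        quotes_map[str(doc_id)] = {
            "quote": record["quote"].strip(),
            "author": record.get("author", ""),
            "tags": record.get("tags", []),
        }
    return lines, quotes_map


def write_outputs(lines, quotes_map, dat_path="quotes.dat",
                  map_path="quotes_map.json"):
    """Write one normalized quote per line, plus the metadata map as JSON."""
    with open(dat_path, "w", encoding="utf-8") as dat_file:
        dat_file.write("\n".join(lines) + "\n")
    with open(map_path, "w", encoding="utf-8") as map_file:
        json.dump(quotes_map, map_file, ensure_ascii=False, indent=2)
```

Keeping the line number in quotes.dat equal to the key in quotes_map.json is one simple way to let the user interface look up the original quote and author for each search hit.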
The searcher is implemented in Python, and the code can be found in the searcher.py file. This is the core component of the Quote Finder application with functions to perform the key processes of the application:
- sentiment() function to parse the user input into a broader set of opinions leveraging the nltk.corpus.sentiwordnet APIs and dataset from the NLTK Python library.
- build_index() to build the inverted index, load_ranker() to load a ranker, load_query() to load a query, and run_query() to search the index with a given query which returns a ranked list of quotes as the result.
- results_to_string() function to convert the results into a string format that can be displayed in the user interface.
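The search flow listed above (expand the sentiment, build an inverted index, rank the quotes) can be sketched in plain Python. Several things here are assumptions made for illustration: the tiny hand-written `TOY_LEXICON` stands in for SentiWordNet lookups, the ranking function is BM25 (the ranker actually used by searcher.py is not stated here), and the function signatures are hypothetical even where the names mirror those in the list.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon standing in for SentiWordNet-based expansion.
TOY_LEXICON = {
    "grateful": ["thankful", "gratitude"],
    "frustrated": ["angry", "annoyed"],
}


def expand_query(query):
    """Add related sentiment terms to the user's query terms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in TOY_LEXICON.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def build_index(docs):
    """Map each term to {doc_id: term frequency}; also return doc lengths."""
    index = defaultdict(dict)
    doc_lens = []
    for doc_id, text in enumerate(docs):
        tokens = text.split()
        doc_lens.append(len(tokens))
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_lens


def run_query(index, doc_lens, query_terms, k1=1.2, b=0.75, top_k=5):
    """Score documents with BM25 and return the top (doc_id, score) pairs."""
    n_docs = len(doc_lens)
    if n_docs == 0:
        return []
    avg_len = sum(doc_lens) / float(n_docs)
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            length_norm = k1 * (1 - b + b * doc_lens[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + length_norm)
    # Highest score first; ties broken by document ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]
```

With expansion, a query such as "grateful" can match a quote that only contains "thankful" or "gratitude", which is the reason for broadening the user's input before searching the index.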
The web application is implemented using the Flask Python library in webapp.py. It exposes a set of API endpoints that are called by the user interface to invoke the searcher, and returns the results to the user interface.
The user interface is implemented using HTML templates in the templates directory. It consists of an index.html home page with a search bar for the user to input a sentiment, and an output.html page that displays the ranked list of quotes returned by the searcher. These UI pages are rendered by the webapp.py Flask web application.
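The interaction between the web application and the two UI pages can be sketched with a small Flask app. This is an illustrative sketch, not the contents of webapp.py: the route paths, the `sentiment` query parameter, the inline template strings (stand-ins for index.html and output.html), and the `search_quotes()` placeholder are all assumptions.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline stand-ins for templates/index.html and templates/output.html.
INDEX_HTML = """<form action="/search" method="get">
  <input name="sentiment" placeholder="Enter a sentiment">
  <button type="submit">Search</button>
</form>"""

OUTPUT_HTML = """<ol>
{% for item in results %}
  <li>{{ item.quote }} ({{ item.author }})</li>
{% endfor %}
</ol>"""


def search_quotes(sentiment):
    """Hypothetical placeholder for the call into the searcher."""
    return [{"quote": "Placeholder quote about " + sentiment,
             "author": "Unknown"}]


@app.route("/")
def index():
    # Home page with the search bar.
    return render_template_string(INDEX_HTML)


@app.route("/search")
def search():
    # Read the sentiment from the query string and render the ranked results.
    sentiment = request.args.get("sentiment", "").strip()
    if not sentiment:
        return render_template_string(INDEX_HTML), 400
    results = search_quotes(sentiment)
    return render_template_string(OUTPUT_HTML, results=results)
```

In a real project the template strings would live in the templates directory and be rendered with Flask's `render_template()`; they are inlined here only to keep the example self-contained. Calling `app.run()` would start the development server on port 5000 by default.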
-
Clone the code repository and navigate to the project directory:
git clone https://github.com/willtsai/quote-finder.git
cd quote-finder
-
Install Conda and Git if you don't already have them installed.
Note: if you are running this on an Apple Silicon Mac (e.g. M1, M2 chips), adjust your Conda config after installation:
conda config --env --set subdir osx-64
Note: if you are running this on a Linux machine, adjust your Conda environment variables after installation:
export CONDA_DIR="$(conda info --base)"
source "$CONDA_DIR/etc/profile.d/conda.sh"
-
Create and activate a Python 3.5 Conda environment:
conda create -n py35 python=3.5
conda activate py35
-
Create and activate a Flask environment for the web application:
python -m venv .venv
. .venv/bin/activate
-
Install the project dependencies:
pip install -r requirements.txt
-
Run the web crawler and preprocessor to prepare the quotes data:
python web_crawler.py
python preprocessor.py
-
Run the web app:
python webapp.py
-
Navigate to http://127.0.0.1:5000/ in a web browser to interact with the Quote Finder web application.
Here are some example queries to try out in the Quote Finder web application:
Input: "inspirational and grateful"
Expected Output:
Input: "frustrated and disappointed"
Expected Output: