Web Scraping, Sentiment Analysis, Latent Dirichlet Allocation (LDA) topic modelling
- Installation
- Project Overview
- Problem Statement
- Methodologies
- Metrics
- File Description
- Results
- Future Work
The code should run with the Anaconda distribution of Python; no additional libraries should be necessary.
In this project, I scraped reviews of “Hotel Beresford”, located in San Francisco, CA, from Booking.com. I then explored the data, generated WordClouds, performed sentiment analysis, and built an LDA topic model.
The goal of the project is to use text analytics and Natural Language Processing (NLP) to extract actionable insights from the reviews and help the hotel improve guest satisfaction.
- Web Scraping: The hotel reviews were scraped from Booking.com using requests with BeautifulSoup.
- Exploratory Data Analysis (EDA): This part contains a pie chart, a histogram, and a seaborn violin plot to give a better understanding of the overall reviews and ratings.
- WordClouds: To generate more meaningful WordClouds, I added custom stop words and used lemmatization to collapse closely related word forms that would otherwise appear as redundant entries.
- Sentiment Analysis: The sentiment analysis measures the polarity and subjectivity of the reviews and determines whether the opinion they express is mostly positive, negative, or neutral.
- LDA Topic Model: In natural language processing, latent Dirichlet allocation is a generative statistical model that explains sets of observations through unobserved groups, which account for why some parts of the data are similar. I used GridSearch to find the best topic model. The two tuning parameters are (1) n_components, the number of topics, and (2) learning_decay, which controls the learning rate.
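The grid search over the two LDA parameters can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn's `GridSearchCV`, `CountVectorizer`, and `LatentDirichletAllocation`, and the toy reviews, candidate parameter values, and `cv=2` setting are all made up for the example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the scraped reviews (illustrative only, not the project data).
reviews = [
    "great location close to shops",
    "friendly staff and clean room",
    "room was small and the street was noisy",
    "breakfast was good and the staff were helpful",
    "noisy room but great location",
    "clean room friendly staff good breakfast",
    "small bathroom and thin walls",
    "location is perfect for walking downtown",
]

# Bag-of-words term counts, which is the input LDA works on.
vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(reviews)

# Candidate values are assumptions picked for the toy data.
param_grid = {
    "n_components": [2, 3, 4],          # number of topics
    "learning_decay": [0.6, 0.7, 0.9],  # only used by the online learning method
}

lda = LatentDirichletAllocation(learning_method="online", max_iter=10, random_state=0)

# With no explicit scoring, GridSearchCV ranks candidates by the estimator's
# own score(), which for LDA is an approximate log-likelihood.
search = GridSearchCV(lda, param_grid=param_grid, cv=2)
search.fit(doc_term_matrix)

best_lda = search.best_estimator_
print("Best parameters:", search.best_params_)

# Show the highest-weighted terms of each topic in the best model.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for topic_idx, weights in enumerate(best_lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```

On a real review corpus the grid would likely cover more topic counts, and the top terms per topic are what one would read to judge whether the topics make sense.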
I used the log-likelihood score to evaluate model performance. A model with a higher log-likelihood and a lower perplexity is generally considered better. However, perplexity might not be the best measure for evaluating topic models because it doesn’t consider the context and semantic associations between words.
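To make the two metrics concrete, here is a small sketch of how they can be read off a fitted model, assuming scikit-learn's `LatentDirichletAllocation`. The documents and the choice of two topics are made up for illustration. As far as I understand scikit-learn's implementation, its perplexity is the exponential of the negative per-word log-likelihood, which is why the two numbers move in opposite directions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (illustrative only).
docs = [
    "great location friendly staff",
    "clean room good breakfast",
    "noisy street small room",
    "helpful staff great breakfast",
    "small bathroom thin walls",
    "perfect location for walking",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Approximate log-likelihood of the data under the model: higher is better.
log_likelihood = lda.score(counts)

# Perplexity: lower is better.
perplexity = lda.perplexity(counts)

# Relationship between the two: exp(-log-likelihood / total word count).
total_words = counts.sum()
perplexity_from_score = np.exp(-log_likelihood / total_words)

print(f"log-likelihood: {log_likelihood:.2f}")
print(f"perplexity:     {perplexity:.2f}")
print(f"from score:     {perplexity_from_score:.2f}")
```

Because both numbers come from the same likelihood bound, neither says anything about whether the topics are interpretable, which is the limitation noted above.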
- The "Scraped Data" folder contains the data scraped from the hotel review page: https://www.booking.com/reviews/us/hotel/beresford.html?
  - The scraped data are stored in three dataframes:
    - reviewer_info: Basic information about the reviewer and the review:
      - Rating Score
      - Reviewer Name
      - Reviewer's Nationality
      - Overall Review (contains both positive & negative reviews)
      - Reviewer Reviewed Times
      - Review Date
      - Review Tags (Trip type, such as business trip, leisure trip, etc.)
    - pos_reviews: Positive reviews
    - neg_reviews: Negative reviews
- "Capstone_Project.ipynb" contains the code, visualizations, and analyses of the hotel reviews.
The main findings can be found in the post available here.
Many of the analyses are limited by the small size of the scraped dataset. Non-English reviews were not scraped. Scraping reviews in other languages and translating them, or scraping the page after translation, might help increase the data volume.
To provide more useful suggestions to Hotel Beresford, we could also analyze its competitors to gain insights into guest preferences, as well as valuable information that Hotel Beresford may not get from its own reviews.
Credit goes to my college friend, Lanyu Yu, who contributed part of the script for the web scraping section.