Sentiment-Analysis-Topic-Modeling-for-Hotel-Reviews

Web Scraping, Sentiment Analysis, Latent Dirichlet Allocation (LDA) topic modelling

Table of Contents

  1. Installation
  2. Project Overview
  3. Problem Statement
  4. Methodologies
  5. Metrics
  6. File Description
  7. Results
  8. Future Work

Installation

No libraries beyond the Anaconda distribution of Python should be needed to run the code.

Project Overview

In this project, I scraped reviews of Hotel Beresford, located in San Francisco, CA, from booking.com. I then explored the data, generated WordClouds, performed sentiment analysis, and built an LDA topic model.

Problem Statement

The goal of the project is to use text analytics and Natural Language Processing (NLP) to extract actionable insights from the reviews and help the hotel improve guest satisfaction.

Methodologies

  1. Web Scraping: The hotel reviews were scraped from booking.com using requests with BeautifulSoup.

  2. Exploratory Data Analysis (EDA): This part contains a pie chart, a histogram, and a seaborn violin plot to give a better understanding of the overall reviews and ratings.

  3. WordClouds: To generate more meaningful WordClouds, I added custom stop words and used lemmatization to merge near-duplicate word forms.

  4. Sentiment Analysis: Sentiment analysis scores the polarity and subjectivity of the overall reviews and determines whether the opinion they express is mostly positive, negative, or neutral.

  5. LDA Topic Model: In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that explains sets of observations through unobserved groups, which account for why some parts of the data are similar. I used a grid search to find the best topic model, tuning two parameters: (1) n_components, the number of topics, and (2) learning_decay, which controls the learning rate.
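A minimal sketch of how such a grid search could look, assuming scikit-learn, whose LatentDirichletAllocation exposes both n_components and learning_decay. The sample reviews, the grid values, the helper name find_best_lda, and the choice of learning_method="online" (learning_decay only takes effect with online updates) are illustrative assumptions, not the settings used in the notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Made-up placeholder reviews, NOT taken from the scraped data.
reviews = [
    "great location close to union square",
    "friendly staff and helpful front desk",
    "room was small and the street was noisy",
    "breakfast was good and the staff were friendly",
    "noisy room with thin walls and old carpet",
    "perfect location for walking and shopping",
    "clean room comfortable bed good value",
    "the bathroom was old and the shower was weak",
    "staff helped us with luggage and directions",
    "good value for the location and free breakfast",
    "small elevator and noisy street at night",
    "comfortable bed clean bathroom friendly staff",
]


def find_best_lda(texts, param_grid, cv=3, random_state=0):
    """Hypothetical helper: grid-search LDA over a bag-of-words matrix."""
    vectorizer = CountVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(texts)
    # learning_decay is only used by the online variational Bayes updates.
    lda = LatentDirichletAllocation(
        learning_method="online", random_state=random_state
    )
    # With no explicit scoring, GridSearchCV calls lda.score(), which
    # returns an approximate log-likelihood on the held-out fold.
    search = GridSearchCV(lda, param_grid=param_grid, cv=cv)
    search.fit(doc_term)
    return search, vectorizer


# Assumed grid values for illustration only.
param_grid = {"n_components": [2, 3, 4], "learning_decay": [0.5, 0.7, 0.9]}
search, vectorizer = find_best_lda(reviews, param_grid)
print("Best parameters:", search.best_params_)
print("Best mean held-out log-likelihood:", search.best_score_)
```

The fitted model is available as search.best_estimator_, and its components_ attribute holds one row of word weights per topic.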

Metrics

I used the log-likelihood score to evaluate model performance. A model with a higher log-likelihood and a lower perplexity is generally considered better. However, perplexity may not be the best measure for topic models because it does not account for context or semantic associations between words.
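The link between the two quantities can be sketched as follows, assuming the usual per-token definition of perplexity as the exponential of the negative average log-likelihood. The function name and the toy numbers are illustrative; scikit-learn's LatentDirichletAllocation reports the two values through its score and perplexity methods.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Convert a total log-likelihood (natural log) into per-token perplexity."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)


# Sanity check with a toy model that assigns equal probability to each of
# 50 vocabulary words: its perplexity should be close to the vocabulary size.
vocab_size = 50
n_tokens = 1000
uniform_ll = n_tokens * math.log(1 / vocab_size)
print(perplexity_from_log_likelihood(uniform_ll, n_tokens))

# On the same token count, a higher log-likelihood gives a lower perplexity.
better = perplexity_from_log_likelihood(-3000.0, n_tokens)
worse = perplexity_from_log_likelihood(-4000.0, n_tokens)
print(better < worse)
```

Because perplexity is a monotone transform of the log-likelihood on a fixed corpus, ranking models by either quantity gives the same order.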

File Description

  • The "Scraped Data" folder contains the data scraped from the hotel review page: https://www.booking.com/reviews/us/hotel/beresford.html?
    • The scraped data are stored in three dataframes:
      1. reviewer_info: Basic information of the reviewer and reviews:
        • Rating Score
        • Reviewer Name
        • Reviewer's Nationality
        • Overall Review (contains both positive & negative reviews)
        • Reviewer Reviewed Times
        • Review Date
        • Review Tags (Trip type, such as business trip, leisure trip, etc.)
      2. pos_reviews: Positive reviews
      3. neg_reviews: Negative reviews
  • "Capstone_Project.ipynb" contains the code, visualizations, and analyses of the hotel reviews.

Results

The main findings of the code can be found in the post available here.

Future Work

Many of the analyses are limited by the size of the scraped data. Non-English reviews were not scraped; scraping reviews in other languages and translating them, or scraping them after translation, could increase the data volume.

To provide more useful suggestions to Hotel Beresford, we could also analyze its competitors to gain insight into guest preferences and to surface valuable information that Hotel Beresford may not get from its own reviews.

Acknowledgements

Credit goes to my college friend, Lanyu Yu, who contributed part of the script for the web scraping section.

Contributors

jiamei-wang