This repository contains a Python script designed for sentiment analysis on hotel reviews from TripAdvisor. The project focuses on natural language processing (NLP) techniques, utilizing the nltk library, to preprocess and analyze text data. It employs both traditional machine learning models and a deep learning model to predict sentiment labels.
- Overview
- Key Features
- Usage
- Dependencies
- Installation
- Dataset
- Code Structure
- Exploratory Data Analysis (EDA)
- Data Preprocessing
- Word Cloud Visualization
- Machine Learning Models
- Deep Learning Model
- Model Deployment
- Functionality
- Results and Visualizations
- Contributing
- License
This project analyzes sentiment in hotel reviews using a combination of traditional machine learning and deep learning models. The script processes the dataset, explores the data through visualizations, and trains multiple models to predict sentiment labels. The models include Decision Trees, Random Forest, Support Vector Machines, Logistic Regression, K-Nearest Neighbors, Naive Bayes, and a Bidirectional LSTM deep learning model.
- Data Preprocessing: Extensive preprocessing steps, including text cleaning, lemmatization, and stopword removal, are performed to prepare the text data for model training.
- Exploratory Data Analysis (EDA): Utilizes Seaborn for visualizing the distribution of ratings and review lengths, providing insights into the dataset.
- Word Cloud Visualization: Generates a Word Cloud to visualize the most commonly used words in cleaned reviews, offering a quick glimpse into prevalent sentiments.
- Machine Learning Models: Trains traditional machine learning models using cross-validation, including Decision Trees, Random Forest, Support Vector Machines, Logistic Regression, K-Nearest Neighbors, and Naive Bayes.
- Deep Learning Model: Implements a Bidirectional LSTM model using TensorFlow and Keras, trained on tokenized and padded text data for sentiment predictions.
- Model Deployment: Saves the Logistic Regression model for future use and provides a function to make predictions using both traditional machine learning and deep learning models.
- Clone the Repository:
    git clone https://github.com/your-username/tripadvisor-sentiment-analysis.git
    cd tripadvisor-sentiment-analysis
- Install Dependencies:
    pip install -r requirements.txt
- Run the Script:
    python sentiment_analysis.py
- Explore the Results: The script will output visualizations and performance metrics for each model. Additionally, use the provided functions for making predictions on new text data.
- Python 3.x
- Required Python packages (install using pip install -r requirements.txt)
To set up the project environment, follow these steps:
- Install Python: download and install Python 3.x.
- Clone the repository:
    git clone https://github.com/your-username/tripadvisor-sentiment-analysis.git
    cd tripadvisor-sentiment-analysis
- Install dependencies:
    pip install -r requirements.txt
The dataset (tripadvisor_hotel_reviews.csv) used in this project is not included in this repository. You can obtain a similar dataset from TripAdvisor or any other source of your choice.
The main script, sentiment_analysis.py, contains the entire workflow, from data loading to model training and evaluation. The code is organized into sections for clarity.
EDA involves visualizing the dataset's characteristics, including rating distributions and review lengths, using Seaborn.
Text data is preprocessed through cleaning, lemmatization, and stopword removal to enhance the quality of input for model training.
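A minimal sketch of the cleaning and stopword-removal steps is shown below. To stay dependency-free it uses a small hard-coded stopword set and omits lemmatization, which the script performs with nltk; the function name `clean_review` and the exact cleaning rules are hypothetical, not taken from the script.

```python
import re

# Deliberately tiny stopword set for illustration; nltk's English list is much larger.
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "in", "at", "of", "to", "we"}

def clean_review(text: str) -> str:
    """Lowercase, strip non-letters, and drop stopwords (lemmatization omitted)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(clean_review("The rooms were CLEAN, and the staff was friendly!!"))
# -> rooms clean staff friendly
```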
A Word Cloud is generated to visually represent frequently occurring words in the cleaned reviews.
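A word cloud sizes each word by how often it occurs, so the data behind it is a word-frequency table. The sketch below computes such a table with the standard library only; the sample reviews are made up, and the rendering step itself (for example with a word cloud library) is left out because the source does not specify how it is done.

```python
from collections import Counter

# Made-up cleaned reviews standing in for the preprocessed dataset.
cleaned_reviews = [
    "rooms clean staff friendly",
    "staff rude rooms dirty",
    "great location staff",
]

# Count every word across all reviews.
word_counts = Counter(word for review in cleaned_reviews for word in review.split())

print(word_counts.most_common(2))
# -> [('staff', 3), ('rooms', 2)]
```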
Traditional machine learning models are trained using cross-validation, and their accuracy is evaluated.
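The cross-validated comparison could look roughly like the sketch below, which scores two of the listed models with scikit-learn. The TF-IDF features, the binary labels, the fold count, and the toy reviews are all assumptions for illustration; the script's actual feature extraction and label scheme are not described here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data: 1 = positive, 0 = negative (assumed label scheme).
texts = [
    "great hotel lovely staff", "wonderful stay clean room",
    "excellent location friendly service", "lovely pool great breakfast",
    "terrible room awful smell", "dirty bathroom rude staff",
    "awful service terrible food", "noisy room broken shower",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Vectorizing inside the pipeline keeps each fold's vocabulary fit on training data only.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    fold_scores = cross_val_score(pipeline, texts, labels, cv=2, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.2f}")
```

The other classifiers named in this README can be added to the `models` dictionary in the same way.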
A Bidirectional LSTM model is implemented using TensorFlow and Keras for deep learning-based sentiment analysis.
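For orientation, a Bidirectional LSTM classifier in Keras can be sketched as follows. The vocabulary size, sequence length, layer widths, and the single sigmoid output (binary sentiment) are assumptions chosen to keep the example small; the script's real architecture and label encoding may differ.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 1000  # assumed vocabulary size
MAX_LEN = 20       # assumed padded sequence length

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=16),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8)),  # reads each sequence in both directions
    tf.keras.layers.Dense(1, activation="sigmoid"),          # assumed binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random integer sequences stand in for tokenized, padded reviews.
fake_batch = np.random.randint(0, VOCAB_SIZE, size=(4, MAX_LEN))
probabilities = model.predict(fake_batch, verbose=0)
print(probabilities.shape)  # (4, 1)
```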
The Logistic Regression model is saved for future use, and functions are provided for making predictions using both traditional machine learning and deep learning models.
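One way to persist and reuse a fitted model is shown in the sketch below, which pickles a small TF-IDF plus Logistic Regression pipeline and reloads it for prediction. The use of `pickle`, the file name `logreg_sentiment.pkl`, the pipeline layout, and the toy data are assumptions; the script may save the model in another format.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = positive, 0 = negative (assumed label scheme).
texts = [
    "great hotel lovely staff", "terrible room awful smell",
    "lovely stay great pool", "awful service terrible food",
]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

# Save vectorizer and classifier together so new text gets the same features.
path = os.path.join(tempfile.mkdtemp(), "logreg_sentiment.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(pipeline, f)

# Later, for example in another process: reload and predict on unseen text.
with open(path, "rb") as f:
    restored = pickle.load(f)

new_review = ["great staff lovely pool"]
same = (restored.predict(new_review) == pipeline.predict(new_review)).all()
print(same)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.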
- Sentiment Prediction:
  - The script provides functions (ml_predict and dl_predict) for making predictions on new text data using both Logistic Regression and Bidirectional LSTM models.
The script outputs visualizations and performance metrics for each model, providing insights into their effectiveness.
Feel free to contribute to this project! Open issues or submit pull requests to enhance the functionality or fix any bugs.
Feel free to explore, modify, and adapt the code for your own sentiment analysis tasks. If you find this project helpful, don't forget to give it a star!