Code Monkey home page Code Monkey logo

yenlp's Introduction

##Comparison of open source NLP tools for sentiment analysis on Yelp Dataset Challenge

For the results check the blogpost here.

Scope

This is a project for the Data Mining course taught in the undergraduate programme of the Computer Science & Engineering department of the Univerisity of Ioannina during the fall semester 2014/2015. For details checkout the handout.

###Goal

The objective of this project is to apply various sentiment analysis techniques(NLP) on the restaurant reviews and assess whether they can correctly identify the reviews as positive or negative.

Yelp Dataset Challenge

Yelp has released an anonymized part of their stored data to the public. This was accompanied by [a challenge] (http://www.yelp.com/dataset_challenge) with various awards in order to incentivize research and generate insights for the use of the data.

Here follows a brief explanation of the dataset, from their website.

The Challenge Dataset includes data from Phoenix, Las Vegas, Madison, Waterloo and Edinburgh:

* 42,153 businesses
* 320,002 business attributes
* 31,617 check-in sets
* 252,898 users
* 955,999 edge social graph
* 403,210 tips
* 1,125,458 reviews

This project is not a participation in the challenge.

Open source tools

Three different open source tools have been used and assessed:

  • Training and testing using the dataset on a Naive Bayes classifier
  • Training with generic lexicons such as WordNet and SentiWordNet
  • Stanford's CoreNLP

Usage

The code can be found in github. A brief explanation on how to run the code.

Dependencies

Make sure you have the following libraries installed before running the code.

Also you must have installed the stopword corpora of NLTK. To bring up the NLTK downloader, run the following in a python console.

import nltk
nltk.download()

Extracting reviews

This must be done before runing any of the classifiers below.

You need to provide the category of the businesses and the quantity of samples for each review class (pos/neg). The script creates two json files one for each class.

python extract_reviews.py 'Restaurants' 1000

Naive Bayes

You need to provide the category, the number of samples for each class and the number of folds for the k-fold cross validation. It trains one classifier for each feature extraction filter (single words, stopwords removal, bigrams, bigrams & stopwords removal) and prints the overall accuracy.

python run_bayes 'Restaurants' 1000 5

SentiwordNet

You only need to provided the category and the number of samples.

python run_sentiwordnet 'Restaurants' 1000

CoreNLP

You need to add to java's classpath the path to the simple-json and corenlp jar. First compile the corenlp.java file to create the class.

javac -cp ".:PATH_TO_SIMPLE_JSON/*:PATH_TO_CORENLP/*" -d . corenlp.java 

Then run the compiled class providing the category and number of samples.

java -cp ".:PATH_TO_SIMPLE_JSON/*:PATH_TO_CORENLP/*" corenlp restaurants 1000

yenlp's People

Contributors

stathius avatar

Watchers

James Cloos avatar Vikas Raunak avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.