This repository contains a collection of scripts that parse four NLP datasets (three aspect-based sentiment analysis datasets and one sentiment analysis dataset). Each script writes the parsed data to a SQLite database, making minimal changes to the individual texts in the datasets.
Each of the four scripts corresponds to one dataset. For citations and more information on each dataset, see the comments at the top of each script file. The scripts do not currently extract every part of the supported datasets, because they were written specifically for my Master's thesis.
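As a rough illustration of the parse-then-store pattern described above, here is a minimal sketch using Python's standard `sqlite3` module. The function name `store_texts`, the table name `texts`, and its columns are hypothetical and chosen only for this example; the actual scripts may use a different schema and a different language.

```python
import sqlite3


def store_texts(db_path, dataset_name, texts):
    """Insert texts into a SQLite table without altering them.

    Hypothetical helper: the table layout here is an assumption,
    not the schema used by the scripts in this repository.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS texts ("
                "id INTEGER PRIMARY KEY, "
                "dataset TEXT NOT NULL, "
                "body TEXT NOT NULL)"
            )
            # Parameterized inserts store each text exactly as given.
            conn.executemany(
                "INSERT INTO texts (dataset, body) VALUES (?, ?)",
                [(dataset_name, text) for text in texts],
            )
        # Report how many rows this dataset now has in the table.
        row = conn.execute(
            "SELECT COUNT(*) FROM texts WHERE dataset = ?", (dataset_name,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()
```

Calling `store_texts("example.db", "sst", ["A fine film.", "Not great."])` would create `example.db` if it does not exist and return the number of rows stored for that dataset.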
Supported Datasets:
- The restaurant corpus from SemEval 2016's Aspect-Based Sentiment Analysis Task
- Restaurant Reviews dataset
- SOCC (SFU Opinion and Comments) corpus, which contains opinion articles from the Globe and Mail
- Stanford Sentiment Treebank
The datasets are not included in this repository. See parameters.json for where the scripts expect the data to be placed by default.
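Before running a script, it can help to check that the data is where parameters.json says it should be. The sketch below assumes, purely for illustration, that parameters.json is a flat JSON object whose string values are file or directory paths; the real file may be structured differently, and the function names `load_data_paths` and `missing_paths` are hypothetical.

```python
import json
from pathlib import Path


def load_data_paths(params_file="parameters.json"):
    """Read a parameters file and return its string values as paths.

    Assumption: the file is a flat JSON object and every string value
    is a path. Non-string values are ignored.
    """
    with open(params_file, encoding="utf-8") as f:
        params = json.load(f)
    return {key: Path(value) for key, value in params.items()
            if isinstance(value, str)}


def missing_paths(paths):
    """Return the sorted keys whose paths do not exist on disk."""
    return sorted(key for key, path in paths.items() if not path.exists())
```

For example, `missing_paths(load_data_paths())` returns the names of any entries whose data has not been put in place yet.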