Project source can be downloaded from https://github.com/khuan013/CS172-Crawler.git
- Kenneth Huang
- Tien Tran
This application uses the Twitter Streaming API to collect geolocated tweets and stores them in 10 MB text files.
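The crawler keeps only tweets that carry location data. The project's actual filtering logic in twitterGeo.py is not shown here, but a minimal check of the kind a streaming listener might apply could look like this (hypothetical helper, written in portable Python; `coordinates` and `place` are the standard Twitter API v1.1 tweet fields):

```python
def is_geolocated(tweet):
    """Return True if a tweet (as a decoded JSON dict) carries usable
    location data: exact GPS coordinates or an attached Twitter place.
    This is an illustrative sketch, not the project's actual code."""
    return bool(tweet.get("coordinates") or tweet.get("place"))


# Example: a tweet with a GeoJSON point near Riverside, CA passes;
# a tweet with no location fields is dropped.
geo_tweet = {"text": "hi", "coordinates": {"type": "Point",
                                           "coordinates": [-117.39, 33.95]}}
plain_tweet = {"text": "hi"}
```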
To run the crawler, you must have the following installed:
- Python 2.7
- Tweepy (Twitter API library)
- lxml
Download the repository from https://github.com/khuan013/CS172-TwitterSearch.git
On Unix/Linux, run the crawler.sh shell script, passing the number of tweets you want to collect (if the number is 0, the crawler runs until it reaches 5 GB of data) and the output directory name; the script then executes the Python program.
By default, the files are placed in /data and the number of tweets is not limited.
Examples:
- ./crawler.sh [num-tweets] [output-dir]
- ./crawler.sh [num-tweets]
- ./crawler.sh
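The 10 MB file rotation described above could be sketched as follows (a hypothetical RotatingWriter in portable Python, not the project's actual twitterGeo.py implementation):

```python
import os


class RotatingWriter:
    """Writes tweet lines to numbered text files in out_dir, starting a
    new file once the current one reaches max_bytes (10 MB by default,
    matching the crawler's description). Illustrative sketch only."""

    def __init__(self, out_dir, max_bytes=10 * 1024 * 1024):
        self.out_dir = out_dir
        self.max_bytes = max_bytes
        self.index = 0
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        self._open_next()

    def _open_next(self):
        # Files are named tweets_000.txt, tweets_001.txt, ...
        path = os.path.join(self.out_dir, "tweets_%03d.txt" % self.index)
        self.f = open(path, "w")
        self.written = 0

    def write(self, line):
        self.f.write(line + "\n")
        self.written += len(line) + 1
        if self.written >= self.max_bytes:
            self.f.close()
            self.index += 1
            self._open_next()

    def close(self):
        self.f.close()
```

A cap on total output (the 5 GB limit mentioned above) could be enforced the same way by tracking bytes written across all files.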
To run the search application (Part B), you must have the following installed:
- Eclipse for Java EE
- Apache Tomcat version 7.0
- Lucene version 3.7.2
- Download the repository from https://github.com/khuan013/CS172-TwitterSearch.git
- Put MyLucene.java and MySearch.jsp into your Eclipse project directory.
- If you already have Twitter data, run MyLucene.java to create an index. Otherwise, first run the Python program twitterGeo.py; refer to the Part A documentation above for usage.
- Once MyLucene.java finishes, it will create a folder called testIndex. Place this folder in your Desktop directory.
- Run MySearch.jsp on the Tomcat server from Eclipse. This should bring up a web page with a search bar.
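If MySearch.jsp parses the search-bar input with Lucene's standard QueryParser (an assumption; the source does not say which parser it uses), queries of the following forms would be accepted:

```
earthquake AND california
"world cup"
concert OR festival
```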