Collects news articles for trending topics
Dependencies:
- Twitter4j 4.0.3
- Selenium Java Driver 2.44.0
- Selenium HtmlUnit Driver 2.44.0
- Selenium Server Standalone 2.44.0
- Selenium Remote Driver 2.44.0
- Mongo Java Driver 3.0.0
To run the project, the following files should be executed in order:
- GetLatestTweets.java
- Scraper.java
- NaiveNLPComparator.java
- TestArticles.java
Running TestArticles creates a file "output.txt" containing the trend names together with the matched article links and descriptions.
On Linux, the project can be compiled and run from the project folder with the following commands (CP holds the shared classpath):

CP=".:dependencies/lib/mongo-java-driver-3.0.0.jar:dependencies/lib/twitter4j-core-4.0.3.jar:dependencies/lib/selenium-java-2.44.0.jar:dependencies/lib/selenium-remote-driver-2.44.0.jar:dependencies/lib/selenium-api-2.44.0.jar:dependencies/lib/selenium-server-standalone-2.44.0.jar:dependencies/lib/selenium-firefox-driver-2.44.0.jar:dependencies/lib/selenium-htmlunit-driver-2.44.0.jar"
javac -cp "$CP" src/NameHelper.java src/GetLatestTweets.java src/Scraper.java src/NaiveNLPComparator.java src/TestArticles.java
java -cp "$CP:src/.class:./src" GetLatestTweets
java -cp "$CP:src/.class:./src" Scraper
java -cp "$CP:src/.class:./src" NaiveNLPComparator
java -cp "$CP:src/.class:./src" TestArticles
Algorithm:
- Get the 10 latest trending topics for NYC from Twitter.
- For each trend, fetch the 4 most popular tweets and store them in the database.
- Scrape the news section of huffingtonpost.com to collect every article link along with its short description; only the news sub-section is scraped.
- Store these articles in the database.
- NaiveNLPComparator retrieves the trends and the tweets from the database.
- Every tweet is processed by removing stop words, and a word-frequency table is built. Words with a high frequency ultimately contribute a high weight to any article that contains them.
- For each article in the database, the description is matched against the frequency table to compute a weight.
- If this weight reaches 50% of the cumulative frequencies for the trend, the article is added to a success table.
- Otherwise it is stored in a failure table along with the trend_id, news_article, news_link and the weight it received.
- TestArticles retrieves articles from the success table in descending order of weight; if there are no results, the failure table is queried instead.
- All news articles and links with a non-zero weight are retrieved in descending order of weight.
- These are written to output.txt.
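The stop-word removal, frequency-table construction, and 50% weighting steps above can be sketched in plain Java. This is an illustrative, self-contained approximation, not the project's actual NaiveNLPComparator: the class name, stop-word list, and sample tweets are invented for the example.

```java
import java.util.*;

// Hypothetical sketch of the weighting step: strip stop words from the
// tweets of one trend, build a word-frequency table, then score an
// article description against it.
public class WeightSketch {
    // Illustrative stop-word list; the real project's list is not shown here.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "is", "at", "of", "on", "in", "to"));

    // Build a frequency table from the tweets, skipping stop words.
    static Map<String, Integer> frequencyTable(List<String> tweets) {
        Map<String, Integer> freq = new HashMap<>();
        for (String tweet : tweets) {
            for (String word : tweet.toLowerCase().split("\\W+")) {
                if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                    freq.merge(word, 1, Integer::sum);
                }
            }
        }
        return freq;
    }

    // Sum the frequencies of the table words that occur in the description.
    static int weight(String description, Map<String, Integer> freq) {
        int weight = 0;
        for (String word : description.toLowerCase().split("\\W+")) {
            weight += freq.getOrDefault(word, 0);
        }
        return weight;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "Big storm hits NYC tonight",
                "NYC storm closes schools");
        Map<String, Integer> freq = frequencyTable(tweets);
        int cumulative = freq.values().stream().mapToInt(Integer::intValue).sum();

        String article = "Storm forces school closures across NYC";
        int w = weight(article, freq);
        // The 50%-of-cumulative-frequencies threshold described above.
        boolean success = w >= cumulative / 2;
        System.out.println("weight=" + w + " cumulative=" + cumulative
                + " success=" + success);
    }
}
```

Here "storm" and "nyc" each appear twice across the tweets, so an article mentioning both scores 4 against a cumulative frequency of 9, which meets the 50% cutoff.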
Possible Improvements:
- Could query for more than 4 tweets per trend, so that there are more keywords.
- Could fetch the entire article instead of just the brief description; however, this would increase the scraping time considerably.
- Get articles from all sections, including Entertainment, Life & Style, Tech & Science, etc.