Newspaper Scrapers

A quick precursor:

We used these scripts to collect data for two projects on The DataFace's website: 34 Percent of Articles about Trump Now Mention His Twitter Activity and Trump and the Media: A Text Analysis.

NewspaperScraper.py provides support for scraping the websites of 14 major media outlets. They are listed below:

New York Times
Washington Post
Wall Street Journal
USA Today
CNN
Fox News
Politico
Slate
CNBC
Bloomberg
TIME
The Weekly Standard
LA Times
Chicago Tribune

You can extend the library to support other websites by creating new classes in NewspaperScraper.py. Just make sure your class inherits from NewspaperScraper, then write your own version of get_pages() specific to each new site!

Dependencies:

This project is indebted to the great work of Lucas Ou-Yang and his Newspaper library.

Here are the rest of the project's dependencies. Be sure to install these before proceeding:

Here's how to use this thing...

A NewspaperScraper object expects four inputs (at a minimum). The scraper's name, a search term, a start date, and an end date. After initializing a scraper, the intended workflow is as follows:

First, run get_pages() to find the URLs of all articles matching the search term within the relevant time period.
Then, run newspaper_parser() to grab metadata about each article returned by get_pages()
Finally, store the data using either write_to_mongo() or write_to_csv()

If you have mongoDB installed, you can get started quickly by referencing RunScrapers.py. You'll simply write the four inputs on the command line.

Note 1: NYT and WSJ require the credentials of a subscribed user to work. Those can be input as command line arguments as well (see RunScrapers.py).

Note 2: Some scrapers work better than others. We had some glitches gathering data from NYT and CNN in particular (oops), so feel free to fork + improve!

What you'll end up with:

A database (or file) that contains the following pieces of metadata about each article:

title
date_published
news_outlet
authors
feature_img
article_link
keywords
movies
summary
text
html

wanageeska / newspaper-scrapers Goto Github PK