Code Monkey home page Code Monkey logo

newspaper-scrapers's Introduction

Newspaper Scrapers

A quick precursor:

We used these scripts to collect data for two projects on The DataFace's website: 34 Percent of Articles about Trump Now Mention His Twitter Activity and Trump and the Media: A Text Analysis.

NewspaperScraper.py provides support for scraping the websites of 14 major media outlets. They are listed below:

  • New York Times
  • Washington Post
  • Wall Street Journal
  • USA Today
  • CNN
  • Fox News
  • Politico
  • Slate
  • CNBC
  • Bloomberg
  • TIME
  • The Weekly Standard
  • LA Times
  • Chicago Tribune

You can extend the library to support other websites by creating new classes in NewspaperScraper.py. Just make sure your class inherits from NewspaperScraper, then write your own version of get_pages() specific to each new site!

Dependencies:

This project is indebted to the great work of Lucas Ou-Yang and his Newspaper library.

Here are the rest of the project's dependencies. Be sure to install these before proceeding:

Here's how to use this thing...

A NewspaperScraper object expects four inputs (at a minimum). The scraper's name, a search term, a start date, and an end date. After initializing a scraper, the intended workflow is as follows:

  • First, run get_pages() to find the URLs of all articles matching the search term within the relevant time period.
  • Then, run newspaper_parser() to grab metadata about each article returned by get_pages()
  • Finally, store the data using either write_to_mongo() or write_to_csv()

If you have mongoDB installed, you can get started quickly by referencing RunScrapers.py. You'll simply write the four inputs on the command line.

Note 1: NYT and WSJ require the credentials of a subscribed user to work. Those can be input as command line arguments as well (see RunScrapers.py).

Note 2: Some scrapers work better than others. We had some glitches gathering data from NYT and CNN in particular (oops), so feel free to fork + improve!

What you'll end up with:

A database (or file) that contains the following pieces of metadata about each article:

  • title
  • date_published
  • news_outlet
  • authors
  • feature_img
  • article_link
  • keywords
  • movies
  • summary
  • text
  • html

newspaper-scrapers's People

Contributors

jackbeckwith avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.