
media-crawler's People

Contributors

josephpd3


media-crawler's Issues

Control for Depth in Media Crawl

As of now, the spider will crawl for a very long time: it performs a depth-first search across all linked media, and sites like The Washington Post love to link to their own articles. That means that as long as we are parsing Washington Post articles and they keep linking to more of their own articles, we will keep crawling.

Ideally we will be able to control for depth (in the tree-search sense) on a crawl by passing an argument to the spider, much like media_url. Once that depth is reached, the spider should either yield no further requests or treat every reference in the parsed response as a leaf node, regardless of whether we can parse it or not.

Here is the code for said spider.
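
A minimal sketch of how a max_depth argument could work (the attribute names and the get_refs helper are illustrative assumptions, not the actual spider code):

import scrapy

class MediaSpider(scrapy.Spider):
    name = 'media_spider'

    def __init__(self, media_url=None, max_depth=3, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [media_url] if media_url else []
        self.max_depth = int(max_depth)

    def parse(self, response):
        depth = response.meta.get('depth', 0)
        refs = self.get_refs(response)  # hypothetical per-site reference parsing
        yield {'url': response.url, 'references': refs}
        # Once max_depth is reached, every reference becomes a leaf node:
        # we simply stop yielding follow-up requests.
        if depth < self.max_depth:
            for ref in refs:
                yield scrapy.Request(response.urljoin(ref['href']),
                                     callback=self.parse,
                                     meta={'depth': depth + 1})

Scrapy's built-in DEPTH_LIMIT setting (via DepthMiddleware) could also achieve the same cutoff without a custom argument.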

Parse Fox News Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)
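
A minimal sketch of what such a parser module might look like (the article-body XPath is a placeholder and would need to be adjusted to Fox News's real markup):

def get_fox_refs(response):
    """Return a list of reference dicts for a Fox News article response."""
    refs = []
    # Hypothetical selector for the article body's paragraphs
    for paragraph in response.xpath('//div[contains(@class, "article-body")]//p'):
        context = ' '.join(paragraph.xpath('.//text()').extract()).strip()
        for anchor in paragraph.xpath('.//a'):
            refs.append({
                'href': anchor.xpath('@href').extract_first(),
                'text': anchor.xpath('string(.)').extract_first(),
                'context': context,
            })
    return refs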

Parse Business Insider Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Baseline Command Error? Scrapy version?

scrapy crawl media_spider -a media_url="https://www.washingtonpost.com/news/post-politics/wp/2017/09/07/did-facebook-ads-traced-to-a-russian-company-violate-u-s-election-law/?tid=a_inl&utm_term=.e24142917aa8" max_depth=5 -o media.json

Is this command still working? Which version of Scrapy is this project using?

I get the following error:
crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
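
For what it's worth, a likely cause (an assumption, not confirmed here) is that max_depth=5 is passed without the -a flag, so Scrapy treats it as a second spider name. Passing it as a spider argument should avoid that error:

scrapy crawl media_spider -a media_url="<article url>" -a max_depth=5 -o media.json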

Write Item Class for MediaItems

The item pipeline for this project will be handling MediaItems. These can be articles, tweets, videos, or even posts on reddit. The item class which encompasses these can be very generic, with not all fields required.

Metadata like tags is often common between articles and videos, and the concept has some overlap with hashtags on Twitter.

The one large commonality between all of these is the references they make. Articles link to other media items in their body, as do tweets and posts on reddit and other message boards. Videos often have links in their descriptions, and videos embedded from YouTube can be resolved to an ID, which can then be used to grab further data through YouTube, such as those descriptions.
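
A minimal sketch of what such a generic item class could look like (these field names are illustrative assumptions, not a final schema):

import scrapy

class MediaItem(scrapy.Item):
    url = scrapy.Field()          # where the item was found
    media_type = scrapy.Field()   # article, tweet, video, reddit post, ...
    title = scrapy.Field()
    tags = scrapy.Field()         # tags / hashtags, where the source provides them
    references = scrapy.Field()   # list of reference dicts ('href', 'text', 'context')

Since Scrapy items do not enforce required fields, spiders can populate only the fields that make sense for a given media type.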

Parse The Hill Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Parse Breitbart Articles

Note: If you prefer not to work with this source, please leave it to other contributors. As far as we are concerned, all media is relevant from a research perspective.

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Parse NPR Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Create Generic Spider for Scraping References from Media Items

This spider will parse with a function that, ideally, takes in the response and the URL it was drawn from. The URL will then be matched via regex to determine the parsing function for the respective site (the per-site parsers are not written in this issue, just dispatched to). That function will then parse out all the reference items to pass back into Scrapy.

Example:

import re

def get_wapo_refs(response):
    # Parse reference dicts out of a Washington Post response
    refs = []
    # Do stuff
    return refs

def get_forbes_refs(response):
    # Parse reference dicts out of a Forbes response
    refs = []
    # Do stuff
    return refs

parser_map = {
    r'.+\.washingtonpost\.com.+': get_wapo_refs,
    r'.+\.forbes\.com.+': get_forbes_refs,
    # ...
}

# Precompile the patterns once so lookups stay cheap
precompiled_map = [(re.compile(pattern), parser)
                   for pattern, parser in parser_map.items()]

def get_parser(url):
    # The caveat here is that we must be sure the first match is what we want to match
    for pattern, parser in precompiled_map:
        if pattern.match(url):
            return parser
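
Usage inside the spider would then look roughly like this (hypothetical call sites):

parser = get_parser(response.url)
refs = parser(response) if parser else []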

When parsing Washington Post articles, handle old format as well as new format

While testing the crawler with the Washington Post parser, I noticed that some errors were thrown because the parser could not find the article body in articles from the Washington Post domain. This means a different article format was present and the parser couldn't grab the article body with the given XPath. (See the Scrapy docs here for info on XPath selectors.)

This article about Herman Cain, for instance, is still served in an older, WordPress-based format.

The way to go about handling this will be to extend the existing parser function to try various XPath or CSS selectors (see the doc link above) to determine what format the article is in. In this instance, the article body now lives in a <div ...> rather than an <article itemprop="articleBody">:

<div class="wp-column ten margin-right main-content">
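
A minimal sketch of the fallback approach (the old-format selector is a guess based on the class above and would need verification against real pages):

ARTICLE_BODY_SELECTORS = [
    '//article[@itemprop="articleBody"]',        # current format
    '//div[contains(@class, "main-content")]',   # older WordPress-based format
]

def get_article_body(response):
    """Try each known Washington Post layout in turn and return the first match."""
    for selector in ARTICLE_BODY_SELECTORS:
        body = response.xpath(selector)
        if body:
            return body
    return None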

Parse Forbes Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Handle Pay-to-View Articles

This is, unfortunately, going to have to be a case-by-case issue across each site. As this is an open-source project, we aren't currently in the business of paying for a ton of various subscriptions.

These will have to be detected differently upon response handling for each media entity we scrape.

The handling, however, can be pretty simple. I figure we can just export these as leaf-nodes in the trees we scrape, labeling their media type as Pay-gated Article.
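
A minimal sketch of that handling (the dict shape is an assumption; only the Pay-gated Article label comes from this issue):

def make_paywalled_leaf(response):
    """Emit a leaf node for an article we cannot read past the paywall."""
    return {
        'url': response.url,
        'media_type': 'Pay-gated Article',
        'references': [],  # leaf node: no outgoing references to follow
    }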

Parse Associated Press Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Handle Tweets in Articles

Hackathon Note: This is a bit more challenging than the other parsers, and can be considered a stretch goal.

So tweets come in a few shapes and sizes when referenced. Sometimes articles link to tweets, other times they embed them. This issue is for tracking both kinds of tweet reference:

  • Links to Tweets
  • Embedded Tweets

No matter which kind you are tackling, I recommend utilizing the Twitter API in this.

Links to Tweets

This can be handled very similarly to all other link references. A parser for Twitter URLs will just have to be defined and imported into the parser_map under a suitable regex pattern to match against. From there, the Twitter API can likely be your best friend.
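
A minimal sketch of the link-based case (assuming tweet URLs of the form https://twitter.com/<user>/status/<id>; the field names mirror the article reference dicts):

import re

TWEET_URL_PATTERN = re.compile(r'https?://(?:www\.)?twitter\.com/[^/]+/status/(\d+)')

def get_tweet_ref(href, text, context):
    """Build a reference dict for a linked tweet, keeping its ID for later Twitter API lookups."""
    match = TWEET_URL_PATTERN.match(href)
    if match is None:
        return None
    return {
        'href': href,
        'text': text,
        'context': context,
        'tweet_id': match.group(1),
    }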

Embedded Tweets

This will likely be the first kind of embedded reference we will handle aside from videos, so we will actually discuss that in another issue.

Handle Inline Documents as Media Nodes

I encountered this intelligence community report while working through the Washington Post example data.

This is a very specific type of media which will likely always end up as a leaf node in the trees generated (or a node with no outward path in the full, Directed Acyclic Graph built later).

The two facets of this issue are:

  • How to treat these nodes with respect to classification
  • What information can/should be extracted

Parse Reuters Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)
