
media-crawler's People

Contributors

josephpd3


media-crawler's Issues

Control for Depth in Media Crawl

As of now, the spider will crawl for a very long time: it performs a depth-first search across all linked media, and sites like The Washington Post love to link to their own articles. That means that as long as we are parsing Washington Post articles and they keep linking to more of their own articles, we will keep crawling.

Ideally we will be able to control for depth (in the tree-search sense) on a crawl by passing an argument to the spider, much like media_url. Once that depth is reached, the spider should either yield no further requests or treat every reference in the parsed response as a leaf node, regardless of whether we can parse it or not.

Here is the code for said spider.
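
A minimal sketch of how a max_depth argument could work (the attribute names and the get_refs helper are illustrative assumptions, not the actual spider code):

import scrapy

class MediaSpider(scrapy.Spider):
    name = 'media_spider'

    def __init__(self, media_url=None, max_depth=3, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [media_url] if media_url else []
        self.max_depth = int(max_depth)

    def parse(self, response):
        depth = response.meta.get('depth', 0)
        refs = self.get_refs(response)  # hypothetical per-site reference parsing
        yield {'url': response.url, 'references': refs}
        # Once max_depth is reached, every reference becomes a leaf node:
        # we simply stop yielding follow-up requests.
        if depth < self.max_depth:
            for ref in refs:
                yield scrapy.Request(response.urljoin(ref['href']),
                                     callback=self.parse,
                                     meta={'depth': depth + 1})

Scrapy's built-in DEPTH_LIMIT setting (via DepthMiddleware) could also achieve the same cutoff without a custom argument.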

Parse Fox News Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)
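
A minimal sketch of what such a parser module might look like (the article-body XPath is a placeholder and would need to be adjusted to Fox News's real markup):

def get_fox_refs(response):
    """Return a list of reference dicts for a Fox News article response."""
    refs = []
    # Hypothetical selector for the article body's paragraphs
    for paragraph in response.xpath('//div[contains(@class, "article-body")]//p'):
        context = ' '.join(paragraph.xpath('.//text()').extract()).strip()
        for anchor in paragraph.xpath('.//a'):
            refs.append({
                'href': anchor.xpath('@href').extract_first(),
                'text': anchor.xpath('string(.)').extract_first(),
                'context': context,
            })
    return refs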

Parse Business Insider Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Baseline Command Error? Scrapy version?

scrapy crawl media_spider -a media_url="https://www.washingtonpost.com/news/post-politics/wp/2017/09/07/did-facebook-ads-traced-to-a-russian-company-violate-u-s-election-law/?tid=a_inl&utm_term=.e24142917aa8" max_depth=5 -o media.json

Is this command still working? Which version of Scrapy is this project using?

I get the following error:
crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
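
For what it's worth, a likely cause (an assumption, not confirmed here) is that max_depth=5 is passed without the -a flag, so Scrapy treats it as a second spider name. Passing it as a spider argument should avoid that error:

scrapy crawl media_spider -a media_url="<article url>" -a max_depth=5 -o media.json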

Write Item Class for MediaItems

The item pipeline for this project will be handling MediaItems. These can be articles, tweets, videos, or even posts on reddit. The item class which encompasses these can be very generic, with not all fields required.

Metadata like tags is often common between articles and videos, and the concept has some overlap with hashtags on Twitter.

The one large commonality between all of these is the references they make. Articles link to other media items in their body, as do tweets and posts on reddit and other message boards. Videos often have links in their descriptions, and videos embedded from YouTube can be resolved to an ID, which can then be used to grab further data through YouTube, such as those descriptions.
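
A minimal sketch of what such a generic item class could look like (these field names are illustrative assumptions, not a final schema):

import scrapy

class MediaItem(scrapy.Item):
    url = scrapy.Field()          # where the item was found
    media_type = scrapy.Field()   # article, tweet, video, reddit post, ...
    title = scrapy.Field()
    tags = scrapy.Field()         # tags / hashtags, where the source provides them
    references = scrapy.Field()   # list of reference dicts ('href', 'text', 'context')

Since Scrapy items do not enforce required fields, spiders can populate only the fields that make sense for a given media type.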

Parse The Hill Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Parse Breitbart Articles

Note: If you prefer not to work with this source, please leave it to other contributors. As far as we are concerned, all media is relevant from a research perspective.

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Parse NPR Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Create Generic Spider for Scraping References from Media Items

This spider will parse with a function that, ideally, takes in the response and the URL it was drawn from. The URL will then be matched via regex to determine the parsing function for the respective site (the per-site parsers are not written in this issue, just dispatched to). That function will then parse out all the reference items to pass back into Scrapy.

Example:

import re

def get_wapo_refs(response):
    # Parse reference dicts out of a Washington Post response
    refs = []
    # Do stuff
    return refs

def get_forbes_refs(response):
    # Parse reference dicts out of a Forbes response
    refs = []
    # Do stuff
    return refs

parser_map = {
    r'.+\.washingtonpost\.com.+': get_wapo_refs,
    r'.+\.forbes\.com.+': get_forbes_refs,
    # ...
}

# Precompile the patterns once so lookups stay cheap
precompiled_map = [(re.compile(pattern), parser)
                   for pattern, parser in parser_map.items()]

def get_parser(url):
    # The caveat here is that we must be sure the first match is what we want to match
    for pattern, parser in precompiled_map:
        if pattern.match(url):
            return parser
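
Usage inside the spider would then look roughly like this (hypothetical call sites):

parser = get_parser(response.url)
refs = parser(response) if parser else []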

When parsing Washington Post articles, handle old format as well as new format

While testing the crawler with the Washington Post parser, I noticed that some errors were thrown because the parser could not find the article body in articles from the Washington Post domain. This means a different article format was present and the parser couldn't grab the article body with the given XPath. (See the Scrapy docs here for info on XPath selectors.)

This article about Herman Cain, for instance, is still served in an older, WordPress-based format.

The way to go about handling this will be to extend the existing parser function to try various XPath or CSS selectors (see the doc link above) to determine what format the article is in. In this instance, the article body now lives in a <div ...> rather than an <article itemprop="articleBody">:

<div class="wp-column ten margin-right main-content">
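
A minimal sketch of the fallback approach (the old-format selector is a guess based on the class above and would need verification against real pages):

ARTICLE_BODY_SELECTORS = [
    '//article[@itemprop="articleBody"]',        # current format
    '//div[contains(@class, "main-content")]',   # older WordPress-based format
]

def get_article_body(response):
    """Try each known Washington Post layout in turn and return the first match."""
    for selector in ARTICLE_BODY_SELECTORS:
        body = response.xpath(selector)
        if body:
            return body
    return None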

Parse Forbes Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Handle Pay-to-View Articles

This is, unfortunately, going to have to be a case-by-case issue across each site. As this is an open-source project, we aren't currently in the business of paying for a ton of various subscriptions.

These will have to be detected differently upon response handling for each media entity we scrape.

The handling, however, can be pretty simple. I figure we can just export these as leaf-nodes in the trees we scrape, labeling their media type as Pay-gated Article.
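
A minimal sketch of that handling (the dict shape is an assumption; only the Pay-gated Article label comes from this issue):

def make_paywalled_leaf(response):
    """Emit a leaf node for an article we cannot read past the paywall."""
    return {
        'url': response.url,
        'media_type': 'Pay-gated Article',
        'references': [],  # leaf node: no outgoing references to follow
    }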

Parse Associated Press Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

Handle Tweets in Articles

Hackathon Note: This is a bit more challenging than the other parsers, and can be considered a stretch goal.

So tweets come in a few shapes and sizes when referenced. Sometimes articles link to tweets, other times they embed them. This issue is for tracking both kinds of tweet reference:

  • Links to Tweets
  • Embedded Tweets

No matter which kind you are tackling, I recommend utilizing the Twitter API in this.

Links to Tweets

This can be handled very similarly to all other link references. A parser for Twitter URLs will just have to be defined and imported into the parser_map under a suitable regex pattern to match against. From there, the Twitter API can likely be your best friend.
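
A minimal sketch of the link-based case (assuming tweet URLs of the form https://twitter.com/<user>/status/<id>; the field names mirror the article reference dicts):

import re

TWEET_URL_PATTERN = re.compile(r'https?://(?:www\.)?twitter\.com/[^/]+/status/(\d+)')

def get_tweet_ref(href, text, context):
    """Build a reference dict for a linked tweet, keeping its ID for later Twitter API lookups."""
    match = TWEET_URL_PATTERN.match(href)
    if match is None:
        return None
    return {
        'href': href,
        'text': text,
        'context': context,
        'tweet_id': match.group(1),
    }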

Embedded Tweets

This will likely be the first kind of embedded reference we will handle aside from videos, so we will actually discuss that in another issue.

Handle Inline Documents as Media Nodes

I encountered this intelligence community report while working through the Washington Post example data.

This is a very specific type of media which will likely always end up as a leaf node in the trees generated (or a node with no outward path in the full, Directed Acyclic Graph built later).

The two facets of this issue are:

  • How to treat these nodes with respect to classification
  • What information can/should be extracted

Parse Reuters Articles

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • These parser objects must have the following:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag.
  • Some sites may have various formats depending on article category or article age (see this issue). These will all have to be handled in the parser. It is fine if you do not catch everything at first. Sometimes older articles will only be referenced by other older articles, and that is one crazy rabbit hole to try to go down in the initial stages.

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)
