data4democracy / media-crawler
Web scraper for generating a graph of media connections via articles, twitter, reddit, and more
As of now, the spider will crawl for a very long time: it performs a depth-first search across all linked media, and sites like The Washington Post love to link to their own articles. That means that as long as we are parsing Washington Post articles and they keep linking to their own articles within other articles, we will keep crawling.
Ideally we will be able to control for depth (in the context of a tree search) on a crawl by passing an argument to the spider, much like media_url. At the depth limit, the spider will either yield no further requests or treat each reference in the parsed response as a leaf node, regardless of whether we can parse it or not.
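The depth check described above could look something like this (a sketch only; the names should_follow, depth, and max_depth are hypothetical — in a Scrapy spider, the current depth would travel in request.meta and max_depth would come from the -a argument):

```python
# Hypothetical depth control for the spider. When the limit is reached,
# references in the parsed response become leaf nodes instead of new requests.

def should_follow(depth, max_depth):
    """Return True if references found at this depth should be crawled
    further, False if they should just become leaf nodes."""
    return max_depth is None or depth < max_depth
```

With no max_depth passed, the spider keeps its current unbounded behaviour.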
Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.
This will involve a few things:
- adding a parser module under crawler/crawler/parsers
- parsing out each reference, along with the cleaned text of the <p> tag enclosing the given anchor tag
When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)
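As a rough illustration of what such a parser needs to produce, here is a stdlib-only sketch (the real parsers use Scrapy selectors; RefExtractor and get_refs are hypothetical names) that collects each anchor's href together with the cleaned text of its enclosing <p> tag:

```python
# Sketch: extract (href, enclosing-paragraph-text) pairs from article HTML.
from html.parser import HTMLParser

class RefExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.refs = []      # (href, cleaned <p> text) pairs
        self.in_p = False
        self.p_text = []
        self.p_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.p_text, self.p_hrefs = [], []
        elif tag == "a" and self.in_p:
            href = dict(attrs).get("href")
            if href:
                self.p_hrefs.append(href)

    def handle_data(self, data):
        if self.in_p:
            self.p_text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            # "Cleaned text": collapse whitespace across the paragraph.
            text = " ".join("".join(self.p_text).split())
            self.refs.extend((href, text) for href in self.p_hrefs)
            self.in_p = False

def get_refs(html):
    parser = RefExtractor()
    parser.feed(html)
    return parser.refs
```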
scrapy crawl media_spider -a media_url="https://www.washingtonpost.com/news/post-politics/wp/2017/09/07/did-facebook-ads-traced-to-a-russian-company-violate-u-s-election-law/?tid=a_inl&utm_term=.e24142917aa8" -a max_depth=5 -o media.json
Is this command still working? Which version of Scrapy is this project using?
I get the following error:
crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
(Scrapy raises this when it sees more than one positional argument; a max_depth=5 token without a preceding -a is read as a second spider name.)
The item pipeline for this project will be handling MediaItems. These can be articles, tweets, videos, or even posts on reddit. The item which encompasses these can be very generic, and not all fields need to be required.
Metadata like tags is often common between articles and videos, though the concept has some overlap with #tags on twitter.
The one large commonality between all of these is the references they make. Articles link to other media items in their body, as do tweets and posts on reddit and other message boards. Videos often have links in their descriptions, and videos embedded from YouTube can be resolved to an ID, which can then be used to grab further data, such as those descriptions, through YouTube.
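A generic item along these lines might look like the following sketch (in the project this would be a scrapy.Item; the field names beyond those mentioned above are illustrative, and a dataclass is used here just to show the shape — mostly optional fields, with references common to every media type):

```python
# Hypothetical shape of a generic MediaItem.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MediaItem:
    url: str
    media_type: str                  # "article", "tweet", "video", "reddit_post", ...
    title: Optional[str] = None      # not all media types have one
    tags: list = field(default_factory=list)        # articles/videos; #tags on twitter
    references: list = field(default_factory=list)  # URLs this item links to

tweet = MediaItem(url="https://twitter.com/someone/status/1",
                  media_type="tweet",
                  references=["https://example.com/article"])
```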
Note: If you prefer not to work with this source, please leave it to other contributors. As far as we are concerned, all media is relevant from a research perspective.
This spider will parse with a function that, ideally, takes in the response and the URL it was drawn from. The URL will then be matched via regex to determine a parsing function for the respective site (not specified in this issue, just handled). That function will then parse out all the reference items to pass back into Scrapy.
Example:
import re

def get_wapo_refs(response):
    refs = []
    # Do stuff
    return refs

def get_forbes_refs(response):
    refs = []
    # Do stuff
    return refs

parser_map = {
    r'.+\.washingtonpost\.com.+': get_wapo_refs,
    r'.+\.forbes\.com.+': get_forbes_refs,
    # ...
}

precompiled_map = [(re.compile(pattern), parser)
                   for pattern, parser in parser_map.items()]

def get_parser(url):
    # The caveat here is that we must be sure the first match is what we want
    for pattern, parser in precompiled_map:
        if pattern.match(url):
            return parser
    return None
While testing the crawler with the Washington Post parser, I noticed that some of the errors thrown were for not being able to find the article body in articles from the Washington Post domain. This means a different article format was present and the parser couldn't grab the article body with the given xpath. (See the Scrapy docs for info on xpath selectors.)
This article about Herman Cain, for instance, is still present in an older, WordPress-based format.
The way to handle this will be to extend the existing parser function to try various xpath or css selectors (see the doc link above) to determine what format the article is in. In this instance, the article body is in a <div ...> rather than an <article itemprop="articleBody">:
<div class="wp-column ten margin-right main-content">
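The fallback could be sketched as follows: try each known article-body selector in order and use the first that matches. The selector strings are taken from the two formats mentioned above; find_article_body is a hypothetical helper, and response_select stands in for a call like response.xpath(...).get() in the real parser:

```python
# Hypothetical selector fallback for the two known WaPo article formats.
BODY_SELECTORS = [
    '//article[@itemprop="articleBody"]',         # current format
    '//div[contains(@class, "main-content")]',    # older WordPress-based format
]

def find_article_body(response_select):
    """response_select: callable mapping a selector string to a match or None."""
    for selector in BODY_SELECTORS:
        body = response_select(selector)
        if body is not None:
            return body
    return None
```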
This is, unfortunately, going to be a case-by-case issue across sites. As this is an open-source project, we aren't currently in the business of paying for a ton of various subscriptions.
Paywalls will have to be detected differently in the response handling for each media entity we scrape. The handling, however, can be pretty simple: we can just export these as leaf nodes in the trees we scrape, labeling their media type as Pay-gated Article.
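A sketch of the proposed handling (detection itself is per-site, so the is_paywalled predicate here is hypothetical; only the leaf-node export is shown):

```python
# When a paywall is detected, emit the reference as a leaf node labelled
# "Pay-gated Article" instead of yielding further requests for it.
def handle_reference(url, is_paywalled):
    if is_paywalled(url):
        return {"url": url, "media_type": "Pay-gated Article", "references": []}
    return None  # fall through to normal parsing/crawling
```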
Hackathon Note: This is a bit more challenging than the other parsers, and can be considered a stretch goal
So tweets come in a few shapes and sizes when referenced. Sometimes articles link to tweets; other times they embed them. This issue is for tracking both kinds of tweet reference.
No matter which kind you are tackling, I recommend utilizing the Twitter API
in this.
This can be handled very similarly to all other link references. A parser for twitter URLs will just have to be defined and imported into the parser_map under a suitable regex pattern to match against. From there, the Twitter API can likely be your best friend.
This will likely be the first kind of embedded reference we will handle aside from videos, so we will actually discuss that in another issue.
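For the linked-tweet case, a pattern along these lines could sit in the parser_map: match tweet links and pull out the status ID, which is what a Twitter API lookup needs (the API call itself is not shown; TWEET_URL and get_tweet_id are illustrative names):

```python
# Hypothetical regex for tweet references in the parser_map.
import re

TWEET_URL = re.compile(
    r'https?://(?:www\.|mobile\.)?twitter\.com/\w+/status(?:es)?/(\d+)')

def get_tweet_id(url):
    match = TWEET_URL.search(url)
    return match.group(1) if match else None
```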
I encountered this intelligence community report while working through the Washington Post example data.
This is a very specific type of media which will likely always end up as a leaf node in the trees generated (or a node with no outward path in the full directed acyclic graph built later).
The crawler will therefore need to detect these documents and export them as leaf nodes.
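Detecting such document references could be as simple as the following sketch (is_pdf_reference is a hypothetical helper; in practice the response's Content-Type header is the more reliable signal):

```python
# A URL ending in .pdf, or a response served as application/pdf, is
# exported as a leaf node rather than parsed for further references.
from urllib.parse import urlparse

def is_pdf_reference(url, content_type=None):
    if content_type and "application/pdf" in content_type:
        return True
    return urlparse(url).path.lower().endswith(".pdf")
```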