
media-crawler's Introduction

Media Crawler and Reference Tree Generator

Project Purpose

The goal of this project is to take any article (or item in general) on a media site such as the Washington Post, The Hill, CNN, Forbes, AP, or even Breitbart, collect the references made in it (links as <a> tags), and connect the article to the articles it references. Those articles are then connected to the articles, tweets, and videos they reference in turn. The result is a tree of references with the original article as the root, and these trees can be combined into a large-scale media reference graph.
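As a rough illustration of the first step (pulling references out of <a> tags), here is a minimal sketch. It uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra dependencies; the names LinkCollector and extract_references are hypothetical and are not taken from the project's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links so every reference is an absolute URL.
            self.links.append(urljoin(self.base_url, href))


def extract_references(html, base_url):
    """Return the absolute URLs referenced by <a> tags in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = (
    '<p>See <a href="/politics/story">this</a> and '
    '<a href="https://example.org/x">that</a>.</p>'
)
print(extract_references(sample, "https://example.com/news/article"))
```

Each returned URL would become a child node of the article it was found in; repeating the extraction on those children yields the reference tree.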

Example small tree with Washington Post article as root: small-dag

Once enough of these smaller trees are amassed, connections can be made to draw out a much larger media ecosystem as a single graph (perhaps with isolated subgraphs, denoting closed networks).
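One way this merging could work is sketched below. It assumes each tree is a nested dict with a "url" key and an optional "references" list; that shape is an assumption made for illustration and is not necessarily what the crawler writes out. Isolated subgraphs are found as weakly connected components, meaning the direction of a reference is ignored when grouping nodes.

```python
def add_tree(graph, tree):
    """Add every article -> reference edge of one tree to a shared graph."""
    stack = [tree]
    while stack:
        node = stack.pop()
        # setdefault registers leaf nodes too, with an empty edge set.
        edges = graph.setdefault(node["url"], set())
        for child in node.get("references", []):
            edges.add(child["url"])
            stack.append(child)
    return graph


def isolated_subgraphs(graph):
    """Group nodes into weakly connected components."""
    # Build an undirected view so a reference links both endpoints.
    neighbours = {url: set(targets) for url, targets in graph.items()}
    for url, targets in graph.items():
        for target in targets:
            neighbours.setdefault(target, set()).add(url)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            url = stack.pop()
            if url in component:
                continue
            component.add(url)
            stack.extend(neighbours[url] - component)
        seen |= component
        components.append(component)
    return components


# Placeholder trees: the first two share a reference, the third stands alone.
tree_a = {"url": "https://a.example/1", "references": [
    {"url": "https://b.example/2"}, {"url": "https://c.example/3"}]}
tree_b = {"url": "https://d.example/4", "references": [
    {"url": "https://c.example/3"}]}
tree_c = {"url": "https://e.example/5", "references": [
    {"url": "https://e.example/6"}]}

graph = {}
for tree in (tree_a, tree_b, tree_c):
    add_tree(graph, tree)
components = isolated_subgraphs(graph)
print(len(components))
```

Because nodes are keyed by URL, two trees that reference the same article are joined automatically when both are added to the same graph.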

This will, ideally, give us a foundation for performing analysis on media. What sort of connections exist between various media entities? Which entities commonly cite themselves as a reference? Can we use the articles, their references, and some natural-language tools to understand a more specific problem?

Hopefully, this can be a tool that serves all of those purposes.
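As an example of the kind of question the graph could answer, the sketch below estimates how often each outlet references itself. It assumes the graph is available as a list of (source URL, target URL) pairs and compares hostnames exactly, so "www.example.com" and "example.com" would count as different entities. The function name self_reference_rates is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def self_reference_rates(edges):
    """Map each source domain to the fraction of its links that stay on-domain."""
    total, internal = Counter(), Counter()
    for source, target in edges:
        source_domain = urlparse(source).netloc
        total[source_domain] += 1
        if urlparse(target).netloc == source_domain:
            internal[source_domain] += 1
    return {domain: internal[domain] / total[domain] for domain in total}


# Placeholder edges, not real crawl output.
edges = [
    ("https://a.example/1", "https://a.example/2"),
    ("https://a.example/1", "https://b.example/9"),
    ("https://b.example/9", "https://a.example/2"),
]
rates = self_reference_rates(edges)
print(rates)
```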

Installing Dependencies

At a minimum, you will need Scrapy and BeautifulSoup for the baseline functionality described below. Install them in your virtual environment of choice, making sure that environment uses Python 3.6.x.

Starting the MediaSpider Crawl

To start the crawling process, you will need to pass a URL to the MediaSpider via the Scrapy CLI.

First, cd into media-crawler/crawler (where the scrapy.cfg file is). Then run this command, passing the spider the URL you want to start with.

scrapy crawl media_spider -o media.json -a media_url="https://www.washingtonpost.com/news/post-politics/wp/2017/09/07/did-facebook-ads-traced-to-a-russian-company-violate-u-s-election-law/?tid=a_inl&utm_term=.e24142917aa8"

Here, we use a Washington Post article as the starting point. Feel free to use this as a baseline test.

Note: This defaults to crawling 3 nodes deep in any reference tree (i.e., three references down from the starting media item). You can change this via DEPTH_LIMIT in crawler/crawler/settings.py.
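For instance, raising the limit would be a one-line change in the settings file. The value 5 below is only an illustration and not a recommendation from the project. DEPTH_LIMIT is a standard Scrapy setting, and in Scrapy a value of 0 disables the limit entirely.

```python
# crawler/crawler/settings.py (excerpt)

# Maximum number of link hops Scrapy follows from the starting URL.
# Illustrative value; 0 would mean "no limit".
DEPTH_LIMIT = 5
```

Scrapy also accepts per-run overrides through its -s option (for example, adding -s DEPTH_LIMIT=5 to the crawl command), which avoids editing the file for a one-off experiment.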

Contributing

For contributing, the typical open-source git-flow applies.

Fork the project, then make a branch off of master with a clear and concise name describing your intentions with the branch, something like handle-old-wapo-format.

Then, make your changes to the code in that branch in your fork on your machine (following the instructions above to ensure it works and doesn't throw any unexpected errors).

When documenting your code, please refer to the Google Python Docstring Standard.
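For reference, a Google-style docstring looks roughly like the sketch below. The helper strip_fragment is a hypothetical function written only to show the layout of the Args, Returns, and Raises sections; it is not part of the project.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Remove the fragment portion of a URL.

    Args:
        url (str): The URL to clean, for example
            "https://example.com/story#comments".

    Returns:
        str: The URL without its "#fragment" suffix.

    Raises:
        TypeError: If url is not a string.
    """
    if not isinstance(url, str):
        raise TypeError("url must be a string")
    return urldefrag(url).url


print(strip_fragment("https://example.com/story#comments"))
```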

After that, commit the changes to your branch with a clear heading and specific details as to what you have changed. Then, push the branch up to your forked copy of the repository.

At this point, go to Pull Requests at the top of the GitHub page and select New pull request. On this next screen, you will see an option to Compare Across Forks. Click this, and you will be able to compare the branch on your fork with the master branch of the main repository in a pull request.

If you have any further questions about using git or if some of this doesn't make sense, please check out the #github-help channel on the D4D Slack.

For any other issues, please ping @josephpd3 in the #p-media-crawler channel.
