XPath2RSS

A simple web scraper for querying HTML documents with XPath and turning the results into an RSS feed.

It's written in PHP because PHP is good glue for anything web-related, and it uses XPath expressions because they're awesome to work with.

What's it for

It's for keeping up with updates to those annoying sites that don't provide an RSS feed themselves. There are some example cases below.

Installing

On a Debian-like system, get the dependencies with:

$ apt-get install php5-cli php5-curl

Then get yourself a copy of xpath2rss.php (might be handy to drop it in your PATH somewhere, like under /usr/bin). Feel free to rename it to xpath2rss while you're at it if you don't like the extension (the interpreter is specified in the file).

To see that it's a-OK, try running:

$ xpath2rss

You should see a usage message. PHP 5.3+ is recommended, but the script should run with anything 5.1+.

Usage

The command expects a path to a configuration file as its only argument. The configuration file is a traditional INI file that specifies what to fetch, which XPath expressions to use, and so on. You can test out a configuration file by running:

$ xpath2rss --test myconfig.ini

This prints some useful info, such as the values your [vars] were populated with.

The script is likely most useful when run periodically from a cron-like facility.

Configuration

A configuration file contains the following properties (all required unless marked optional):

  • feed - Name of the feed. This will appear as the <title> of the RSS feed.
  • url - URL from which to load the HTML that will be scraped.
  • file - Path to an XML file that will host the RSS feed (likely under your webroot somewhere so an RSS reader can access it).
  • title - Template for the contents of the <title> for a single item in the RSS feed. If this template contains any %variables%, they are replaced with the corresponding XPath matches from [vars].
  • description - Same as above, but for the <description> tag.
  • context - An (optional) XPath expression to select a context node for any following expressions under [vars] below. Use this to avoid repetition of the same search prefix in multiple variables. See Examples.
  • [vars] - Any number of XPath expressions that will be used to scrape content from the page at url. If the name of the var is foo, then it will be usable in the title and description fields as %foo%. The only mandatory var is guid.
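
The %variable% substitution described above can be sketched roughly as follows. This is an illustration in Python, not the script's actual PHP code: the function name expand_template and the sample values are made up, and leaving unknown placeholders untouched is an assumption about the behavior rather than something the script is known to do.

```python
import re


def expand_template(template, variables):
    """Replace each %name% placeholder with variables[name].

    Assumption: a placeholder with no matching variable is left as-is.
    """
    def substitute(match):
        name = match.group(1)
        return variables.get(name, match.group(0))

    return re.sub(r"%(\w+)%", substitute, template)


# Hypothetical values standing in for XPath matches from [vars]
scraped = {"guid": "Example Comic", "image": "http://example.com/comic.png"}

title = expand_template("%guid%", scraped)
description = expand_template("<img src='%image%' /> <p>%text%</p>", scraped)

print(title)
print(description)
```

Since no text var is defined in this sample, the %text% placeholder stays in the output unchanged.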

Notes

Each RSS item has a GUID. Once an item has been added to the feed, an item with the same GUID won't be added again.

The GUID, along with any other optional variables, is specified under the [vars] heading of the configuration file. The content of each variable is determined by its XPath. Any %var%s found in the title and description templates of an RSS item are expanded to their values.
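
The GUID rule can be pictured with a small sketch like the one below. It's Python rather than the script's own PHP, the add_item helper and the item dictionaries are hypothetical, and putting new items at the front of the feed is an assumption made for the sake of the example.

```python
def add_item(feed_items, new_item):
    """Add new_item unless an item with the same GUID is already in the feed.

    Returns True if the item was added, False if it was skipped.
    Assumption: new items go to the front of the list (newest first).
    """
    seen_guids = {item["guid"] for item in feed_items}
    if new_item["guid"] in seen_guids:
        return False
    feed_items.insert(0, new_item)
    return True


feed = []
first = add_item(feed, {"guid": "Episode 1", "title": "Episode 1"})
repeat = add_item(feed, {"guid": "Episode 1", "title": "Episode 1 (again)"})
second = add_item(feed, {"guid": "Episode 2", "title": "Episode 2"})

print([item["guid"] for item in feed])
```

The practical upshot is that guid should be something that changes whenever the page has new content worth an RSS item, like a title.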

Examples

A webcomic

To get a feed from one popular webcomic (yes, they already have one), set up an xkcd.ini along these lines:

feed = "xkcd"
url = "http://xkcd.com/"
file = "/path/to/webroot/xkcd.xml"
title = "%guid%"
description = "<img src='%image%' /> <p>%text%</p>"

[vars]

guid = "//div[@id='middleContent']//img/@alt"
image = "//div[@id='middleContent']//img/@src"
text = "//div[@id='middleContent']//img/@title"

And run:

$ xpath2rss --test xkcd.ini

You should see the name of the latest comic as the guid and the other vars populated as well. The <p>%text%</p> has the added benefit of making the image's title text readable on devices without a cursor (say, a phone).

Episodic YouTube content

Some good stuff on YouTube doesn't have its own channel (from which you could get a feed directly). To scrape a feed from the search page, you could do something like:

feed = "When Cheese Fails"
url = "http://www.youtube.com/results?search_type=videos&search_query=when+cheese+fails&search_sort=video_date_uploaded"
file = "/path/to/webroot/whencheesefails.xml"
title = "%guid%"
description = "<a href='http://www.youtube.com%link%'>View on YouTube</a>"
context = "//div[@id='search-results']//a[ contains(@title, 'Season') and contains(@title, 'Episode') ]"

[vars]

guid = "@title"
link = "@href"

This works because the search results are ordered newest first, and an XPath expression will always use the first match if multiple are found. Also, since the XPath query is a bit long-winded, we use the optional context option to first select the matching context node. After that, any [vars] we declare will use that node as their context. Note that the same could have been done with the webcomic example: with context set to "//div[@id='middleContent']//img", its vars would shrink to "@alt", "@src" and "@title".
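
To make the "first match, then relative vars" idea concrete, here is a rough Python sketch using the standard library's xml.etree.ElementTree. It is not how the script itself is implemented. ElementTree only supports a subset of XPath (no contains(), and no selecting attributes with @name), so the sketch uses a simpler context path and reads attributes with get(). The page snippet, the scrape function and its parameters are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a search results page (newest first)
PAGE = """
<html><body>
  <div id="search-results">
    <a title="Season 2 Episode 3" href="/watch?v=new">newest</a>
    <a title="Season 2 Episode 2" href="/watch?v=old">older</a>
  </div>
</body></html>
""".strip()


def scrape(page, context_path, attribute_vars):
    """Pick the first node matching context_path, then read vars from it.

    attribute_vars maps a var name to an attribute name on the context node.
    Returns None when nothing matches the context path.
    """
    root = ET.fromstring(page)
    context = root.find(context_path)  # find() returns the first match only
    if context is None:
        return None
    return {name: context.get(attr) for name, attr in attribute_vars.items()}


result = scrape(
    PAGE,
    ".//div[@id='search-results']//a",
    {"guid": "title", "link": "href"},
)
print(result)
```

Because only the first matching link is used as the context, the older episode further down the page is ignored on this run.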

See also

  1. http://www.w3.org/TR/xpath/ - XPath syntax

Contributors

jareware
