archivebot's Introduction

1. ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
                archive.org. Boom.
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
                simple.
    <SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers, which
do all the Web crawling.  ArchiveBot users communicate with ArchiveBot
by issuing commands in an IRC channel.

User's guide: http://archivebot.readthedocs.org/en/latest/
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline

3. Local use

ArchiveBot was originally written as a set of separate programs for
deployment on a server.  This means it has a poor distribution story.
However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
dashboard, ignores, and control system and created a package intended for
personal use.  You can find it at https://github.com/ArchiveTeam/grab-site.

4. License

Copyright 2013 David Yip; made available under the MIT license.  See
LICENSE for details.

5. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to
GNU Wget.  Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

6. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.
Don't look down, never look away; ArchiveBot's like the wind.

 vim:ts=2:sw=2:tw=72:et

archivebot's People

Contributors

12as, anarcat, arkiver2, asparagirl, c4k3, chfoo, doomtay, dopefishjustin, ethus3h, falconkirtaran, flashfire42, frogging101, gabldotink, garyrh, ghostofapacket, hannahwhy, hook321, ivan, jesseweinstein, justanotherarchivist, mback2k, nsapa, peterbortas, pokechu22, pressstartandselect, promyloph, revi, riking, sanqui

archivebot's Issues

Log all archive jobs to prevent duplicates

ArchiveBot currently logs which jobs have been started, but it doesn't log which jobs have finished. (The idea was to remove jobs once they finished.)

We should log jobs that have finished, as well as how they finished: failure, partial success, or success.

This has a couple of applications:

  1. We can disallow jobs from being run once they've successfully completed.
  2. An alternative of (1): we can require confirmation for successfully completed jobs. (An op may want to archive a site at multiple points in time.)

If the job queue is full, don't tell users that we're archiving things just yet

[03:55:38] <joepie91> yipdw: suggestion, perhaps make it say "Queued x" instead of "Archiving x", and post a message into the channel (-without- a nickname prefix) as soon as the job actually starts?

This issue is a refinement of that request.

If a job can start immediately (where "immediately" means "within 5 seconds"), then we should still print Archiving SITE. However, if the job queue is full at the time of job submission, we should print something like Queued SITE for archival or Queued SITE for archival without recursion, and perhaps also the queue length.

Once the job starts, we should print a message into the control channel. Something like

<ArchiveBot> Job IDENT for SITE has started.
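
A minimal sketch of the reply logic, assuming the job queue is a Redis list named pending (the key name and reply wording are illustrative, and the five-second heuristic is left out):

require 'redis'

# Sketch: choose the submission reply based on queue depth.  The "pending"
# list name is a guess, not ArchiveBot's actual schema.
QUEUE_KEY = 'pending'

def submission_reply(redis, site, recursive)
  depth = redis.llen(QUEUE_KEY)

  if depth.zero?
    "Archiving #{site}."
  else
    mode = recursive ? 'archival' : 'archival without recursion'
    "Queued #{site} for #{mode} (#{depth} job(s) ahead of it)."
  end
end

# When the pipeline later picks the job up, the bot would announce it in the
# control channel, e.g. "Job IDENT for SITE has started."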

Set expiration for archive jobs

[11:24:16] <SketchCow> Archivebot will also not regrab a site if it's been grabbed recently, recently being 2 days.

Easy solution: Set archive records to expire 48 hours after successful completion.
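
A sketch of that easy solution with redis-rb, assuming completed jobs leave a record under a key like archive:IDENT (the key name is a guess):

require 'redis'

EXPIRY = 48 * 3600   # 48 hours, in seconds

# Called once a job completes successfully.  The archive:IDENT key name is
# hypothetical; use whatever key the completion record actually lives under.
def expire_archive_record(redis, ident)
  redis.expire("archive:#{ident}", EXPIRY)
end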

Require requests come from ops

Cinch supposedly offers oper detection as follows:

on :message, "bleh" do |m|
  if m.user.oper?
    # stuff
  end
end

This didn't work for me on SynIRC, though, which is why I haven't implemented it yet. I haven't tried it on EFNet or other networks.

See what's going on and make it so that ArchiveBot will only start jobs requested by channel ops.
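
One hedged alternative, if network-level oper detection stays unreliable: require channel operator status instead, which Cinch tracks per channel. A minimal sketch, untested on SynIRC or EFnet:

require 'cinch'

bot = Cinch::Bot.new do
  configure do |c|
    c.server   = 'irc.example.net'   # placeholder network
    c.channels = ['#archivebot']
  end

  on :message, /\A!archive (\S+)\z/ do |m, url|
    # Assumes Channel#opped? reflects ops correctly on the target network.
    if m.channel && m.channel.opped?(m.user)
      m.reply "Would start a job for #{url} here."
    else
      m.reply "Sorry, only channel operators can start jobs."
    end
  end
end

bot.start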

!archiveonly many URLs

[22:02:11] <ivan`> feature request: submit a few hundred !archiveonly URLs via a URL to a text file listing said URLs

Maybe something like this?

!ao < http://www.example.com/urls.txt
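
A rough sketch of what the bot could do with that form of the command, assuming the listing is plain text with one URL per line (the helper name is made up, and real code would want size limits and error handling):

require 'net/http'
require 'uri'

# Fetch a plain-text listing (one URL per line) and return the URLs in it.
def urls_from_listing(listing_url)
  body = Net::HTTP.get(URI(listing_url))
  body.each_line.map(&:strip).reject(&:empty?).select { |u| u.start_with?('http') }
end

urls_from_listing('http://www.example.com/urls.txt').each do |url|
  # enqueue a non-recursive (!ao-style) job for each URL here
end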

ArchiveBot gets tangled up in faux loops

Here is an example from a job done on coilhouse.net:

https://s3.amazonaws.com/nw-depot/coilhouse_fetch.log.bz2 (expands to ~58 MB)

There are two conditions that seem to be necessary for this to happen:

  1. The website has undetectable cycles. (In this case, http://coilhouse.net/2011/09/turntable.fm/Coilhouse actually brings us back to the same page as http://coilhouse.net/, but there is currently no way for ArchiveBot's wget to detect this, because the former returns a 2xx, not a 3xx. And, really, they technically are different resources, since they have different URLs.)
  2. Some sort of malformed link on the target site.

Figure out how we can detect this condition.

The dashboard should not assume URLs are UTF-8 character sequences

The dashboard uses JSON for exchanging log updates. The json gem (in ArchiveBot's environment) represents JSON in UTF-8.

However, URL data in each log update is not guaranteed to be valid UTF-8. There are a few reasons for this, but:

  1. I haven't investigated the causes in depth
  2. The causes are really irrelevant

Said causes are irrelevant because the dashboard is being presented with URLs that are just a string of bits and we're interpreting them as UTF-8. That works in a lot of cases, but not all. Explicit transcoding is required.
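
A sketch of the explicit handling needed before handing URLs to the json gem: treat the bytes as UTF-8 and scrub whatever isn't valid (String#scrub needs Ruby 2.1 or later):

# Make an arbitrary byte string safe to serialize as UTF-8 JSON.  Invalid
# byte sequences are replaced instead of being passed through.
def utf8_safe(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub('?')
end

utf8_safe("http://example.com/\xC3(".b)   # => "http://example.com/?("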

ArchiveBot should know where its archives end up

At present, ArchiveBot's pipeline hardcodes a Web-accessible archive URL into a work item.

This isn't going to work now that we're uploading to a staging area with eventual injection into the Internet Archive. However, the data model is flexible enough to support updates to job data, which means that we can add an archive URL when we know what that URL is going to be. (And we can update said URL as needed.)

Write a companion program that does this.
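
A sketch of what that companion program might do, assuming job records are Redis hashes keyed by ident (the key and field names are guesses, not ArchiveBot's real schema):

require 'redis'

# Record (or later correct) the final archive location for a finished job.
def set_archive_url(redis, ident, archive_url)
  redis.hset(ident, 'archive_url', archive_url)
end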

Detect and upload failed jobs

On ArchiveBotDrone, I sometimes find the leftovers of failed jobs (.warc.gz & .json).
I upload them to Fos manually (I set aborted to true in the JSON and rename the files to include -failed before uploading), but the bot should do this automatically.

Success/abort notifications

Currently, the !status IDENT mechanism is the only way to get status information from ArchiveBot.

ArchiveBot should be able to notify people when jobs complete or finish aborting.

Simultaneous upload to multiple locations

ArchiveBot should be able to upload generated WARCs to multiple simultaneous locations.

The use case:

There are (at least) two needs that this project is trying to address. The first is generating WARCs for injection into the Internet Archive. The second is a quick WARC generation tool.

Multiple simultaneous upload targets make it possible for us to satisfy both needs.
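
A minimal sketch of fanning the upload out, assuming the targets are plain rsync destinations supplied as pipeline configuration (the destinations shown are placeholders):

# Upload a finished WARC to every configured rsync destination.  Failures are
# collected rather than aborting the whole batch.
TARGETS = [
  'rsync://staging.example.org/archivebot/',
  'rsync://mirror.example.net/warcs/'
]

def upload_everywhere(warc_path)
  TARGETS.reject do |target|
    system('rsync', '-av', '--partial', warc_path, target)
  end
end

failed = upload_everywhere('www.example.com-inf-20130331-120000.warc.gz')
warn "Upload failed for: #{failed.join(', ')}" unless failed.empty?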

Abort command

It's possible that archive jobs may get out of control; the site may be bigger than anticipated.

We need a way to abort currently running jobs. !abort IDENT could work.

Any trusted ArchiveBot user can issue this command.
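
A sketch of the bookkeeping behind !abort IDENT, assuming the pipeline watches a Redis list for abort requests (the aborts list name is made up):

require 'redis'

# Record an abort request.  The pipeline would poll the "aborts" list, shut
# the matching job down, and discard its WARC-in-progress.
def request_abort(redis, ident, requested_by)
  redis.lpush('aborts', ident)
  "Initiated abort for job #{ident} (requested by #{requested_by})."
end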

Download embedded videos, audio, etc. in page

Many blogs embed audio, video, and other such media. In some cases, wget can detect this (say, if it's a direct link to a file), but custom Flash players can cause problems.

Suggested solution: use an external tool to either

  1. generate a downloadable URL to feed back into wget's fetch queue, or
  2. download the video file and store it in a separate WARC,

and integrate that into ArchiveBot's fetch process.

Show upload status in the dashboard

When a WARC is uploading, there's no indication of this in the dashboard. However, rsync generates useful diagnostic output.

We should show that in the job's log, so that it doesn't look like the job has inexplicably stalled.
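
A sketch of how the uploader could forward rsync's diagnostics into the job's log, assuming log lines are published on a per-job Redis channel (the updates:IDENT channel name is a guess):

require 'redis'

# Run rsync and publish each line of its output as a log entry for the job.
def upload_with_progress(redis, ident, warc_path, target)
  IO.popen(['rsync', '-av', '--progress', warc_path, target],
           err: [:child, :out]) do |io|
    io.each_line do |line|
      redis.publish("updates:#{ident}", line.chomp)
    end
  end
  $?.success?
end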

Increase chance of WARC filenames being unique

Previously, ArchiveBot's generated filenames looked like this:

f1ae3mhya6r5ujb8kp4v224e6-[date]-[time].warc.gz

That wasn't very useful for sorting or review purposes.

They now look like this:

www.example.com-inf-20130331-120000.warc.gz

which is much better for manual review, but has a collision problem.

Specifically, if two jobs are started from different parts in www.example.com's hierarchy in the same second, you're going to end up with an overwrite whilst rsyncing the data.

This happens a lot on !ao runs, which is also where you don't want this sort of thing to happen. (It's particularly common to encounter this while saving a set of tweets.)

A compromise like this would be useful:

www.example.com-224e6-inf-20130331-120000.warc.gz
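
A sketch of how that compromise filename could be built, assuming the job ident is available when the WARC is named (the helper is illustrative):

# Build a WARC filename that stays sortable but borrows a fragment of the
# job ident to avoid same-second collisions.
def warc_filename(host, ident, depth, started_at = Time.now.utc)
  "#{host}-#{ident[-5..-1]}-#{depth}-#{started_at.strftime('%Y%m%d-%H%M%S')}.warc.gz"
end

warc_filename('www.example.com', 'f1ae3mhya6r5ujb8kp4v224e6', 'inf',
              Time.utc(2013, 3, 31, 12, 0, 0))
# => "www.example.com-224e6-inf-20130331-120000.warc.gz"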

Split ArchiveBot's seesaw pipeline up

ArchiveBot's Seesaw pipeline is quite large and is becoming painful to navigate and change.

Fortunately, it is composed of a few major components:

  1. Custom tasks
  2. Redis scripts used by said tasks
  3. System monitoring
  4. Some Seesaw extensions
  5. The pipeline itself

These would be a good place to start drawing package boundaries.

However the pipeline is packaged, it should be done with an eye towards clarifying inter-module dependencies, so that one part of the pipeline can be changed without having to reason about all of the others.

wget cannot handle large site downloads gracefully

wget, while versatile, has two major limitations when it comes to downloading large sites:

  1. The URL queue is kept in memory and cannot be stored on disk. This becomes a problem for large sites (hundreds of thousands of URLs), as it can grow wget's memory image to hundreds of megabytes.
  2. If a wget process fails (segfault, power failure, whatever), there is no way to resume the grab where it failed. You have to start from square one.

Luckily, wget-lua provides us with the get_urls and httploop_result hooks, which provide us with the necessary extension points to implement custom URL scraping and queuing logic. Investigate this possibility, keeping the following goals in mind:

  1. Efficient use of memory. We should be able to handle URL queues whose size vastly exceeds available RAM.
  2. Resume capability: we should be able to pick up where a job failed. It is understood that this can lead to consistency problems, especially for sites that are still active; there's nothing that can be done about that.

Track bandwidth usage across all jobs

Just because I think it'd be funny.

The easiest way to do this (though not strictly accurate) would be to sum up bytes_downloaded for all jobs. That misses upload traffic, but labelling the figure "total bytes downloaded" or similar would keep it accurate.
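
A rough sketch of the summing approach, assuming each job is stored as a Redis hash with a bytes_downloaded field (the key layout is a guess):

require 'redis'

# Sum bytes_downloaded over every job record.  Not exact -- it ignores upload
# traffic and protocol overhead -- so present it as "total bytes downloaded".
def total_bytes_downloaded(redis)
  total = 0
  redis.scan_each(match: '*') do |key|
    next unless redis.type(key) == 'hash'
    total += redis.hget(key, 'bytes_downloaded').to_i
  end
  total
end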

Ignore URLs of the form http://www.example.com/%22http://www.example.com...

I've seen ArchiveBot wget instances try to grab URLs of the form

http://www.example.com/%22http://www.example.com...

on some sites. If you apply URL decoding, this becomes

http://www.example.com/"http://www.example.com...

Many sites do not have content at such URLs, so even if wget generates such a URL, the fetch will terminate with a 4xx (or a 5xx, if the site has problems with " characters in its URLs). However, some sites will send back an HTTP 200 with links going deeper into the link graph. Ugh.

Assuming that URLs of the above form are very rare and are almost never legitimate, ArchiveBot should detect such patterns and reject them. (The other possibility: figure out why these URLs show up. If it's a wget bug, fix it.)
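
Detection itself is cheap; a hedged sketch of the kind of pattern check a rejection filter could apply before enqueueing a URL:

# Reject URLs whose path contains an encoded quote immediately followed by
# another absolute URL -- the faux-nested form described above.
FAUX_NESTED_URL = %r{%22https?://}i

def faux_nested?(url)
  !!(url =~ FAUX_NESTED_URL)
end

faux_nested?('http://www.example.com/%22http://www.example.com/')  # => true
faux_nested?('http://www.example.com/page?quote=%22hello%22')      # => false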

The dashboard's log view slows down when hit with large numbers of updates

Floods of log entries (e.g. many ignores in a row) can overwhelm a browser displaying the dashboard. Ember.js does some work to batch DOM updates together, but sometimes too much is just too much.

I have observed that the dashboard's performance significantly improves when the offending log's output is paused.

Determine whether implementing a "you're updating too fast" timer will offer significant performance improvements in the above extreme case whilst not imposing unacceptable overhead in the common case. The timer will be used to implement an update backoff: instead of updating every entry, we'll back off to every n entries.

Profile image in RSS/Atom feed broken

The profile image in the feed is hotlinked from Twitter, but that link has since expired (404). There used to be a URL in the API designed for hotlinking, but it no longer works. (Twitter is slowly shutting down the API to keep out third-party apps.) It might be best to hotlink a nice square image from the wiki instead.

Write a utility to deploy ignore patterns

Ignore pattern sets are (or can be) updated frequently and need to be deployed independent of the rest of the ArchiveBot program set.

At present, ignore patterns are updated by me performing the updates manually in ArchiveBot's database. This has obvious scalability and reproducibility problems.

This sort of thing should be wrapped into a utility that can be run by anyone with sufficient access, where "sufficient" is something that I still need to define.

Serve the Twitter/RSS feed from the dashboard

From #29:

For an RSS/Atom feed, I'm thinking that the dashboard app could serve it. Sounds good?

Something like http://dashboard.example.com/index.rss should work.

It'd be really slick if we had TwitterTweeter or a related object generate that file on-disk.

ArchiveBot's Lua script should be able to handle connection failures

At present, the Lua script establishes a Redis connection once, and has no way to recover from failure.

There should be a way for the script to refresh its connection and retry commands. (All commands issued by the script can be executed multiple times without harm, so I don't think we need complicated "did I already do this" logic.)

wget's domain restriction seems to be too strict re: page requisites

A lot of websites use external asset servers for CSS, Javascript, and other media needed to properly display a page.

wget's --domains filter seems to apply to both retrieval and page requisites, which is understandable but unfortunate. We can't possibly predict all acceptable asset servers in advance.

We might, however, be able to implement custom filtering using a Lua script.

Add standard ignore patterns to every job

The same kind of junk appears to be grabbed for many blogs, and ArchiveBot should ignore all of it by default.

?replytocom=
?share=email
?share=linkedin
?share=twitter
?share=stumbleupon
?share=google-plus-1
?share=digg
https://plus.google.com/share?url=
http://www.facebook.com/login.php
https://ssl.reddit.com/login?dest=
http://www.reddit.com/login?dest=
http://www.reddit.com/submit?url=
http://digg.com/submit?url=
http://www.facebook.com/sharer.php?
http://www.facebook.com/sharer/sharer.php?
http://pixel.quantserve.com/pixel/

as well as these that appear for reasons unknown to me:

'%20+%20liker.profile_URL%20+%20'
'%20+%20liker.avatar_URL%20+%20'
%22%20+%20$wrapper.data(

(single quotes are actually in the first two URLs)

Don't sleep after ignored URLs

At present, ArchiveBot uses wget's --random-wait option as a simple form of rate limiting.

This (unexpectedly) also seems to apply to ignored URLs. It shouldn't. Make it so.

Implement a special !ao queue

From #27:

!ao jobs now always take precedence over !a jobs. However, all jobs still live in one queue. We can go further and build a separate !ao queue that (say) a separate pipeline could process. I'll delegate that to a separate issue.

There are (at least) two important considerations:

  1. Job queuing currently uses Redis' RPOPLPUSH primitive, and therefore benefits from that primitive's atomicity guarantees. Any replacement should maintain those guarantees or conclusively demonstrate that they are not necessary.
  2. Pipelines should be able to work on !a, !ao, or both. The pipeline administrator should be able to choose the most appropriate workload for their machines.

wpull: errors echoed to dashboard console contain \u2018 and \u2019

wpull errors echoed to the dashboard look like this:

ERROR Fetching \u2018http://www.dogster.com/local/CO/Englewood/Pet_Stores_General/Precious_cat_catlitter-24478\u2019 encountered an error: Read timed out. 

The escaped Unicode characters appear to be left and right quotes.

This confuses some ArchiveBot users by making them think that the \u2018 and \u2019 are part of the fetched URL.

Write pipeline ID into the job's info file and job data

<ivan`> yipdw: did [SITE] land on your server? will it be up for a week or so?
<yipdw> ivan`: [SITE] is not on my pipeline
<ivan`> guess it's on joepie91
<yipdw> or nico_32
<yipdw> ArchiveBot is now truly Cloud
<yipdw> because we can't tell where the fuck a job is

Fix this.

Upload identifying information with WARCs

[15:00:22] <SketchCow> yipdw: I wish the archivebot would include a .txt file next to the warc files so we knew what they were.
[15:00:51] <SketchCow> Otherwise we're going to have these massive crawl items that sit there being obscure

A couple of possibilities (not mutually exclusive):

  1. the .NFO route (see the sketch below)
  2. provide a better filename (maybe a slugged hostname and UNIX timestamp, e.g. www-example-com-1234567890.warc.gz)
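
A sketch of the .NFO route from (1), writing a small JSON sidecar next to the WARC before upload (the field names are illustrative, not ArchiveBot's actual job schema):

require 'json'

# Write a small sidecar next to the WARC so the uploaded item explains itself.
def write_sidecar(warc_path, job)
  info = {
    'url'        => job[:url],
    'ident'      => job[:ident],
    'started_at' => job[:started_at],
    'fetched_by' => 'ArchiveBot'
  }
  File.write(warc_path.sub(/\.warc\.gz\z/, '.json'), JSON.pretty_generate(info))
end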

!ignore, !unignore: add or remove ignore patterns to a running job

This is related to #15.

The Web is full of crap that ignores the semantics of HTTP. Figuring out whether something has gone wrong in a fetch loop often requires human analysis and intuition.

Why not make that tunable over time?

ArchiveBot's dashboard lets humans see what the spider's doing. We should also have a way to act on that.

Proposed interface:

<me> !ignore 957u6gyj7c536x4fsaam6mqsm turntable.fm/Coilhouse
<ArchiveBot> Pattern turntable.fm/Coilhouse added to job 957u6gyj7c536x4fsaam6mqsm.

This means "for all future retrievals, check if any of the proposed URLs match turntable.fm/Coilhouse. Do not fetch those that do."

!unignore would remove patterns:

<me> !unignore 957u6gyj7c536x4fsaam6mqsm turntable.fm/Coilhouse
<ArchiveBot> Pattern turntable.fm/Coilhouse removed from job 957u6gyj7c536x4fsaam6mqsm.

Because the fetch determination is controlled by a Lua script, any Lua pattern accepted by string.match will be accepted.
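
A sketch of the bookkeeping half, assuming each job's patterns live in a Redis set that the Lua fetch filter consults (the key name is made up):

require 'redis'

# Add or remove an ignore pattern for a running job.  The Lua side would read
# the same set when deciding whether to fetch a URL.
def ignore_pattern(redis, ident, pattern)
  redis.sadd("ignore_patterns:#{ident}", pattern)
end

def unignore_pattern(redis, ident, pattern)
  redis.srem("ignore_patterns:#{ident}", pattern)
end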

"Stop downloading and upload" command

Currently, the only way to stop a download is !abort. This halts downloading and deletes the WARC-in-progress.

It might make sense to have a command which halts downloading, but keeps the WARC (partial WARCs are okay, and can be fixed up pretty easily) and uploads what's there. Sort of like a !goodenough.

This wouldn't see wide use. The motivating case for this is our current grab of the Silk Road forums, which is currently 500-ing out on queued URLs. We did, however, grab about 10,000 URLs from said forums. It would be nice to be able to say "meh, good enough" and just upload what we have.

Record maximum wget memory usage in job

Wget's memory usage tends to grow with the number of URLs examined, but we don't really have any data indicating what the growth pattern is like.

This would be useful to have primarily for satisfying curiosity, but it would also be good to have as a before-and-after dataset if/when we work on #9.
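
On Linux, the kernel already tracks a process's peak resident set size (VmHWM in /proc/PID/status), so a sketch like this could sample it when the fetch finishes and store it on the job record (the Redis field is hypothetical):

# Read wget's peak resident set size (in kB) from /proc on Linux.
def peak_rss_kb(pid)
  File.foreach("/proc/#{pid}/status") do |line|
    return line.split[1].to_i if line.start_with?('VmHWM:')
  end
  nil
end

# e.g. record it when the fetch finishes:
#   redis.hset(ident, 'peak_rss_kb', peak_rss_kb(wget_pid))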

The dashboard's log view suffers from memory leaks

The dashboard's log view is designed to keep only a small number of log entries around in scrollback.

However, heap analysis in Chrome and Firefox indicates that there's a lot of stuff hanging around even after the views have been removed from the DOM. Chrome's heap analysis implicates Ember's views.

This might not actually be a memory leak (maybe it just needs that much memory and eventually stabilizes), but we've seen memory usage go above a gigabyte with no sign of slowing down. Additionally, disabling logs gives us much lower and more stable memory usage.

Look for and implement more efficient ways to render the log. For example, bypassing Ember's view system and dropping down to direct DOM manipulation -- while more painful and fraught with corner cases in an Ember application -- might be enough to get by.
