warrick's Introduction

Warrick

The website reconstructor

Dependencies

  • Perl 5 or later
  • cURL
  • Python
  • Perl libraries: HTML::TagParser, LinkExtractor, Cookies, Status, and Date, and the URI library

Installation

Install Warrick's dependencies on the command line by running:

./INSTALL

Test the installation by running:

./TEST

This will recover a web page and compare it to a master copy.

For further options and information on using Warrick, run:

perl warrick.pl --help

This version of Warrick has been redesigned to reconstruct lost websites from the Web Infrastructure using Memento.

Recovery Process Details

This program creates several files that provide information or log data about the recovery.

For a given recovery RECO_NAME, Warrick creates three files for every recovery job: RECO_NAME_recoveryLog.out, PID_SERVERNAME.save, and logfile.o. RECO_NAME_recoveryLog.out is created in the home warrick directory. It reports every URI recovered, the location of the recovered archived copy (the memento), and the location the file was saved to on the local machine, in the following format:

  • ORIGINAL URI => MEMENTO URI => LOCAL FILE

Lines prefixed with "FAILED" indicate a failed recovery of ORIGINAL URI.
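The log format above can be read back with a few lines of code. The following is a minimal Python sketch, not part of Warrick itself. The names `LogEntry` and `parse_log_line` are hypothetical, and the exact layout of a "FAILED" line is an assumption (the marker followed by whatever fields were available).

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    failed: bool
    original_uri: str
    memento_uri: Optional[str]
    local_file: Optional[str]


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Parse one 'ORIGINAL URI => MEMENTO URI => LOCAL FILE' line.

    Returns None for blank lines. Fields missing from the line are None.
    """
    text = line.strip()
    if not text:
        return None
    failed = text.startswith("FAILED")
    if failed:
        # Assumption: the marker is followed by spaces or a colon.
        text = text[len("FAILED"):].lstrip(" :\t")
    parts = [part.strip() for part in text.split(" => ")]
    # Pad so that short (e.g. failed) lines still yield three fields.
    parts += [None] * (3 - len(parts))
    return LogEntry(failed, parts[0], parts[1], parts[2])


if __name__ == "__main__":
    sample = "http://example.com/ => http://web.archive.org/web/2012/http://example.com/ => /tmp/index.html"
    print(parse_log_line(sample))
```

A script like this could, for example, count the failed lines in RECO_NAME_recoveryLog.out to decide whether a recovery needs to be retried.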

PID_SERVERNAME.save is the saved status file. It is stored in the recovery directory and contains the information needed to resume a suspended recovery job, as well as the stats for the recovery, such as the number of resources that failed to be recovered, the number recovered from each archive, etc.

logfile.o is a temporary file that can be regarded as junk. It contains the headers for the last recovered resource.

History

  • Modified by Justin F. Brunelle (@jbrunelle) at Old Dominion University - 2011
  • Created by Frank McCown (@fmccown) at Old Dominion University - 2006

Contact

We want to know if you have used Warrick to reconstruct your lost website. If you have successfully recovered your site, or would like to assist in the further development and improvement of Warrick, please open a GitHub issue and/or contact [email protected].

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

The GNU General Public License can be seen here: http://www.gnu.org/copyleft/gpl.html


warrick's People

Contributors

ibnesayeed, jbrunelle, machawk1, rzr

warrick's Issues

Zero length content "No Content in ..."

From https://code.google.com/archive/p/warrick/issues/29

What steps will reproduce the problem? 1. ./warrick.pl -dr 2013-08-05 -d -a ia -D ../ftp/ http://www.atlantischild.hu/

What is the expected output? What do you see instead?

http://wayback.archive.org/web/20111031230326/http://www.atlantischild.hu/index.php?option=com_content&task=view&id=21&Itemid=9 has non-zero length, but I get zero-length files: "index.php?option=com_content&task=view&id=21&Itemid=9"

What version of the product are you using? On what operating system? warrickv2-2-5

Please provide any additional information below.

I've got a non-zero length file which has GET parameters in its name, but all files containing & (ampersand) in their names are empty.

As the log below shows, there is nothing after "?" in the "To stats ... Location:" line:

At Frontier location 79 of 769

My frontier at 79: http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28

My memento to get: |http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28|

targetpath: index.php

appending query string option=com_content&task=blogcategory&id=21&Itemid=28

mcurling: /home/davidprog/dev/design-check/atlantis/warrick//mcurl.pl -D "/home/davidprog/dev/design-check/atlantis/warrick/../ftp//logfile.o" -dt "Sun, 04 Aug 2013 22:00:00 GMT" -tg "http://web.archive.org/web" -L -o "/home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28" "http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28"

Reading logfile: /home/davidprog/dev/design-check/atlantis/warrick/../ftp//logfile.o

To stats http://atlantischild.hu:80/index.php?option=com_content&task=blogcategory&id=21&Itemid=28 => Location: http://web.archive.org/web/20120903050228/http://www.atlantischild.hu/index.php? => /home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28 --> stat IA

returning /home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28

Search HTML resource /home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28 for links to other missing resources...

No Content in /home/davidprog/dev/design-check/atlantis/warrick/../ftp//index.php?option=com_content&task=blogcategory&id=21&Itemid=28!!

This is caused by a simple escaping bug in mcurl.pl and MementoThread.pm that can be fixed with a patch as follows:

~/t2/warrick2$ diff -u ../../warrick2/mcurl.pl mcurl.pl
--- ../../warrick2/mcurl.pl 2014-02-05 16:35:37.362518862 -0800
+++ mcurl.pl 2012-03-27 13:02:41.000000000 -0700
@@ -95,10 +95,7 @@

 for (my $i = 0; $i <= $#ARGV; ++$i) #
 {
-    if ( ( index($ARGV[$i] , ' ') > -1 )
-      or ( index($ARGV[$i] , '?') > -1 )
-      or ( index($ARGV[$i] , '*') > -1 )
-    ) {
+    if ( index($ARGV[$i] , ' ') > -1 ){
         $ARGV[$i] = '"' .$ARGV[$i] . '"';
     }
 }

~/t2/warrick2$ diff -u ../../warrick2/MementoThread.pm MementoThread.pm
--- ../../warrick2/MementoThread.pm 2014-02-05 16:38:19.914518843 -0800
+++ MementoThread.pm 2012-03-27 13:02:42.000000000 -0700
@@ -97,7 +97,7 @@
     $acceptDateTimeHeader = " -H \"Accept-Datetime: ".$self->{DateTime}." \" ";
 }

-my $command = "curl -I $acceptDateTimeHeader \"$self->{URI}\" ";
+my $command = "curl -I $acceptDateTimeHeader $self->{URI} ";
 if($self->{Debug} == 1){
     print "DEBUG: " .$command ."\n";
 }
@@ -351,7 +351,7 @@

 } else {
-    $command = "curl @params $acceptDateTimeHeader \"". $self->{TimeGate} ."/" . $self->{URI} . "\"";
+    $command = "curl @params $acceptDateTimeHeader ". $self->{TimeGate} ."/" . $self->{URI};
 }

@@ -390,7 +390,7 @@

         $command = "curl -I -L $acceptDateTimeHeader ". $self->{Info}->{TimeGate} ;
     } else {
-        $command = "curl -I -L $acceptDateTimeHeader \"". $self->{TimeGate} ."/" . $self->{URI} . "\"";
+        $command = "curl -I -L $acceptDateTimeHeader ". $self->{TimeGate} ."/" . $self->{URI};
     }

@@ -667,4 +667,4 @@
 return $result;
 }
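The underlying problem is general: a URI containing "?" or "&" that is pasted unquoted into a shell command line can be split or altered by the shell before curl ever sees it. The Python sketch below illustrates the quoting idea only; Warrick is written in Perl, and the helper name `build_curl_command` is hypothetical. It uses the standard library's `shlex.quote` to protect every argument.

```python
import shlex


def build_curl_command(timegate, uri, extra_args=()):
    """Return a shell-safe curl command line for TIMEGATE/URI.

    Every argument is passed through shlex.quote, so characters such
    as '?', '&' and '*' reach curl unchanged.
    """
    target = f"{timegate}/{uri}"
    argv = ["curl", *extra_args, target]
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    uri = ("http://atlantischild.hu:80/index.php"
           "?option=com_content&task=blogcategory&id=21&Itemid=28")
    print(build_curl_command("http://web.archive.org/web", uri, ["-I", "-L"]))
```

Where the calling code allows it, passing the arguments as a list to the process-spawning function, without going through a shell at all, avoids the quoting question entirely.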

Incomplete Dump

From https://code.google.com/archive/p/warrick/issues/21

I want to recover an older version of a website that contains "better" information than the previous one, and make a local copy for my personal use. I can't get a full website dump; it's always incomplete. I'm using OSX.

Is it me or the software? :)

sudo perl ./warrick.pl -o /Users/cesare/Documents/drclark.net/warrick.log -dr 2012-07-23 -D /Users/cesare/Documents/drclark.net - -ic -nc -nv -a ia -nB "http://www.drclark.net/"

drclark.net_recoveryLog.out.txt
warrick.log.txt

./TEST fails complaining that -nr is an invalid option

From https://code.google.com/archive/p/warrick/issues/39

What steps will reproduce the problem? 1. ./INSTALL 2. ./TEST 3. Error presents

What is the expected output?

Successful test

What do you see instead?

net4-dev# ./TEST Starting test...

Welcome to the Warrick Program!

Warrick is a website recovery tool out of Old Dominion University

Please provide feedback to Justin F. Brunelle at [email protected]

Arguments: -D MAKEFILE -o MAKEFILE_LOGFILE.log -xc -nr -dr 2007-08-02 -T -nv http://www.cs.odu.edu/

Unknown option: nr

TESTING DOWNLOAD COMPLETE...

Downloaded 28 resources, which is greater than the 27 from the testfile. We've found a new memento. Test success!

cat: MAKEFILE/TESTHEADERS.out: No such file or directory

0 vs 0

What version of the product are you using? On what operating system?

Downloaded version warrickv2-2-5.tar.gz except version string in source says 2.2.3 + I'll open a separate ticket on this problem.

Please provide any additional information below.

No Clobber sleeping

From https://code.google.com/archive/p/warrick/issues/12

When the "no clobber" option is picked, and a file is detected to already exist, Warrick should not do any sleeping while moving on to the next file in the frontier. This will greatly enhance the usability of Warrick, and reduce the need for session management with the *.save files, and would be far more flexible. The most difficult part of the documentation to understand is the *.save feature used for saving sessions.
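The requested behavior can be sketched as a loop that skips both the download and the politeness delay when the local file already exists. This is a hypothetical Python illustration, not Warrick's code: `recover_frontier`, its parameters, and the default delay are all assumptions.

```python
import os
import time


def recover_frontier(frontier, fetch, no_clobber=True,
                     delay_seconds=5.0, sleep=time.sleep):
    """Fetch each (uri, local_path) pair in the frontier.

    With no_clobber set, files that already exist are skipped
    without any network request and without sleeping.
    Returns the list of URIs that were actually fetched.
    """
    fetched = []
    for uri, local_path in frontier:
        if no_clobber and os.path.exists(local_path):
            continue  # nothing was requested, so no delay is needed
        fetch(uri, local_path)
        fetched.append(uri)
        sleep(delay_seconds)  # delay only after a real request
    return fetched
```

With this structure, re-running an interrupted recovery over the same directory moves quickly past everything already on disk, which is the usability gain the request describes.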

Distribution archive looks sloppy

From https://code.google.com/archive/p/warrick/issues/30

What steps will reproduce the problem? 1. Download warrickv2-2-5.tar.gz from project's "Downloads" 2. Look inside it with mc for example (with Midnight Commander)

What is the expected output? What do you see instead?

README file says version is 2.0 Expected: 2.5

Almost all files are executable, even .o text files. Expected: only the files you have to run should be executable.

The .o extension is confusing for text files with 2 URLs inside (it usually denotes compiled object files). Expected: another extension, and maybe put these 20 files into a subdir.

I do not expect to see 'curl.exe' here. Expected: either Windows support is officially claimed on the project page or the file is removed.

piklog.log is 1 MB in size. Is it really needed in the distribution archive? Expected: if the file is part of the test suite, it could be placed in the TEST_FILES subdir.

What version of the product are you using? On what operating system? warrick-2.5 Ubuntu 13.04 x32

Please provide any additional information below.

The usual approach is to use a 'dist' makefile target or a 'makedist.sh' script. It does some clean-up ('make clean' or an 'rm' command for a specified file list) and puts only the files that are really needed into the distribution archive.

Issues with sites having non-English encoding

From https://code.google.com/archive/p/warrick/issues/32

What steps will reproduce the problem? 1. Try to recover sites with non-English accents and characters

What is the expected output? What do you see instead?

instead of ó i see ó instead of á i see á

etc...
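One common cause of symptoms like this, though not confirmed as the cause in this report, is UTF-8 text being decoded with a single-byte encoding such as Latin-1. The Python sketch below reproduces that effect and reverses it; both function names are hypothetical.

```python
def misdecode_utf8_as_latin1(text):
    """Simulate the bug: UTF-8 bytes read back as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up where possible.

    Text that cannot be round-tripped is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    for ch in ("\u00f3", "\u00e1"):  # o-acute, a-acute
        garbled = misdecode_utf8_as_latin1(ch)
        print(ch, "->", garbled, "->", repair_mojibake(garbled))
```

If the recovered files themselves are correct UTF-8 and only the display is wrong, no repair of the bytes is needed; the viewer just has to be told the right encoding.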

What version of the product are you using? On what operating system?

Last V on ubuntu 12.04LTS

Please provide any additional information below.

I am not sure whether we are talking about the same thing, but would it be possible to add something like to all pages via an option in the warrick script?

Unknown option nr in TEST

When running the test after initially installing Warrick, the script reports:

Arguments: -D MAKEFILE -o MAKEFILE_LOGFILE.log -xc -nr -dr 2007-08-02 -T -nv http://www.cs.odu.edu/

Unknown option: nr

Perhaps this argument is being captured by the perl environment instead of the perl script, i.e., an option to consider when executing the script.

Issues derived from the command:

perl warrick.pl -D MAKEFILE -o MAKEFILE_LOGFILE.log -xc -nr -dr 2007-08-02 -T -nv http://www.cs.odu.edu/

This may prevent the output files from being generated, which may explain why the TEST script fails on the first check.

Perl v5.18.2
macOS 10.13.4

URI Rewriting

From https://code.google.com/archive/p/warrick/issues/33

URI format changes by the archives have rendered the -k and URI-rewriting features inoperable. We need to develop a way, without hardcoding, to change the URI strings that we need to detect and make relative, either automatically or through easy manual configuration.

Archive watermarks, status bars, and other archive-added features should be treated similarly.
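One way to avoid hardcoding is to keep the archive URI formats in a list of patterns that can be edited without touching the rewriting logic. The Python sketch below illustrates that idea under stated assumptions: the pattern list, the names `ARCHIVE_PREFIX_PATTERNS` and `strip_archive_prefix`, and the optional letters after the timestamp are all hypothetical, and only the first step (recovering the original URI from a memento URI) is shown.

```python
import re

# Hypothetical configuration: one regular expression per archive URI
# format. Each must capture the original URI in a group "original".
ARCHIVE_PREFIX_PATTERNS = [
    r"https?://web\.archive\.org/web/\d+[a-z_]*/(?P<original>.+)",
    r"https?://wayback\.archive\.org/web/\d+[a-z_]*/(?P<original>.+)",
]


def strip_archive_prefix(uri, patterns=ARCHIVE_PREFIX_PATTERNS):
    """Return the original URI embedded in a memento URI.

    URIs that match none of the configured patterns are returned
    unchanged.
    """
    for pattern in patterns:
        match = re.fullmatch(pattern, uri)
        if match:
            return match.group("original")
    return uri


if __name__ == "__main__":
    memento = ("http://web.archive.org/web/20120903050228/"
               "http://www.atlantischild.hu/index.php?")
    print(strip_archive_prefix(memento))
```

When an archive changes its URI layout, only the pattern list would need updating. The same list-driven approach could plausibly be extended to the markup that archives inject, such as status bars.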

Stream Editing

From https://code.google.com/archive/p/warrick/issues/15

Please describe your feature requests here.

Optimization: Use perl's internal 'stream editing' instead of calling out to sed by fork/exec. Not only do you save system call & fork & exec overhead, but perl has higher performance I/O according to one source.
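The same idea, doing the substitution inside the running process instead of spawning sed, can be illustrated in Python. This is an analogy only, since Warrick is Perl; `substitute_in_file` is a hypothetical helper and the example pattern in the comment is an assumption.

```python
import re


def substitute_in_file(path, pattern, replacement):
    """Apply a regex substitution to a file without calling sed.

    Rewrites the file only if something matched. Returns the number
    of substitutions made.
    """
    regex = re.compile(pattern)
    # newline="" keeps the file's original line endings intact.
    with open(path, "r", encoding="utf-8",
              errors="surrogateescape", newline="") as handle:
        original = handle.read()
    updated, count = regex.subn(replacement, original)
    if count:
        with open(path, "w", encoding="utf-8",
                  errors="surrogateescape", newline="") as handle:
            handle.write(updated)
    return count


# Example call (pattern is illustrative):
# substitute_in_file("index.html", r"http://web\.archive\.org/web/\d+/", "")
```

Because no child process is created, the cost per file is just the read, the substitution, and the write, which matters when a recovery touches thousands of files.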
