Web Archives Collection System
On Ubuntu (versions newer than 16.04):
# Get source
git clone https://github.com/tigercosmos/web-archives.git
cd web-archives
git submodule init
git submodule update
# Dependencies
sudo apt-get install libxml2-dev libxslt-dev proxychains
# Do not change this path; it is hard-coded.
virtualenv warcm_env/virt1 --no-site-packages
source warcm_env/virt1/bin/activate
pip install -r WarcMiddleware/pip_requirements.txt
pip install git+https://github.com/ikreymer/pywb.git
All sites are listed in assets/alexa-*.csv.
# Fetch and save archives for all sites
./scripts/getArchiveAll.py
# Extract all saved archives
./scripts/extractArchiveAll.py
# Fetch a single site and save it as a WARC file
./scripts/getArchive.sh [Name] [URL]
# Extract a single site's WARC file to individual files
./scripts/extracArchive.sh [Name]
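
The single-site command above can also be generated from a site list. The following is only a minimal sketch, not part of this repository: it assumes each CSV row has the form `rank,domain` (the layout of the assets/alexa-*.csv files is not confirmed here), it uses the domain itself as the archive name (a hypothetical naming scheme), and it assumes Python 3. The helper names `read_sites` and `build_commands` are made up for illustration. It prints the command lines instead of running them.

```python
import csv
import io
import shlex


def read_sites(csv_text, limit=None):
    """Return a list of (name, url) pairs parsed from Alexa-style CSV text.

    Assumption: each row is `rank,domain`, e.g. `1,google.com`.
    """
    sites = []
    for row in csv.reader(io.StringIO(csv_text)):
        if limit is not None and len(sites) >= limit:
            break
        if len(row) < 2 or not row[1].strip():
            continue  # skip blank or malformed rows
        domain = row[1].strip()
        # Hypothetical naming scheme: the domain doubles as the archive name.
        sites.append((domain, "http://" + domain))
    return sites


def build_commands(sites):
    """Build one `getArchive.sh [Name] [URL]` command line per site."""
    return [
        "./scripts/getArchive.sh {} {}".format(shlex.quote(name), shlex.quote(url))
        for name, url in sites
    ]


if __name__ == "__main__":
    sample = "1,google.com\n2,youtube.com\n"  # stand-in for a real CSV file
    for line in build_commands(read_sites(sample)):
        print(line)
```

To try it on a real list, read one of the CSV files into a string and pass it to `read_sites`, optionally with `limit` to cap the number of sites.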
Alexa Top 1,000,000 as of 2018/3/20
ISC