An HTTP proxy that archives all intercepted traffic.
The Live Archiving Proxy (LAP) project is an HTTP proxy that captures the traffic flowing through it. The LAP delegates the handling of the captured data to one or more writers using a simple network protocol. Writers exist for the DAFF, WARC and ARC formats. Using an HTTP proxy for Web archiving enables the use of any HTTP client for crawling (Heritrix, PhantomJS, HTTrack, Scrapy, etc.) while keeping a unified and simple storage backend. The LAP is designed to be highly performant, easy to use and archive-format agnostic. It will run on any 64-bit Linux system.
Ina has used the LAP in production since 2012 for 50% of its crawls and plans to use it for 100% of its crawls by 2014.
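Because the LAP is a regular HTTP proxy, any HTTP client can in principle be pointed at it. The following is a minimal sketch using only the Python standard library. The proxy address `http://localhost:8080` and the helper names `build_proxy_opener` and `fetch` are assumptions made for illustration; they do not come from the LAP distribution, so check the user manual for the actual host and port of your instance.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends plain-HTTP requests through proxy_url.

    Only the "http" scheme is routed through the proxy here; HTTPS
    handling is a separate topic covered by the LAP user manual.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(opener, url, timeout=30):
    """Fetch url through the given opener and return the response body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical address: replace with where your LAP instance listens.
    lap = build_proxy_opener("http://localhost:8080")
    body = fetch(lap, "http://example.com/")
    print(len(body), "bytes fetched through the proxy")
```

The client only needs to know the proxy address. What gets stored, and in which archive format, is decided by the writers attached to the LAP, so the same crawl code works regardless of the output format.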
- User manual
- LAP distribution including binary and user manual
- WARC writer
- WARC writer project
- Generic writer project
- A perl PAR version of the LAP is included within the LAP distribution
Note: This changelog lists only major visible changes.
1.2.1 2014-08-26
- better HTTP/1.1 handling
- vortex log fix
- bloom filter handshake timeout
- bloom filter TCP tunneling removed
- hostname fix
- DNS caching for IPv6 fix
- deflate fix
- proxy setting revamped
- PAR version fix
1.2.0 2014-05-06
- pseudo HTTPS mode (see user manual)
- compression-factor info for compressibility hint (LZ4)
- bypass mode (lap-bypass header in request)
- PUT web service
- discard-when-no-writer option
- allow-range-requests option
- revamped screen log
- various bug fixes
X.X.X 2013-07-10
- initial public release