Topic: webarchiving

Something interesting about webarchiving

👇 Here are 42 public repositories matching this topic...

akamhy / waybackpy

webarchiving,Wayback Machine API interface & a command-line tool

Home Page: https://pypi.org/project/waybackpy/

internet-archive wayback-machine internet-archiving archive-webpage archive-webpages wayback-machine-api cdx-api wayback-machine-python savepagenow web-archiving

archiveteam / webarchiver

webarchiving,Decentralized web archiving

Organization: archiveteam

web archiving decentralized python crawler warc webarchiving archiver

archiveteam / wget-lua

webarchiving,Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

Organization: archiveteam

Home Page: https://www.archiveteam.org/

webarchiving warc wget lua archiving crawler crawl crawling spider archiveteam

archivingtoolsforwbm / advancedinternetarchiving

webarchiving,Makes saving pages in bulk to the wayback machine much easier

User: archivingtoolsforwbm

webarchiving web-archiving

arquivo / dspace-link-extractor

webarchiving,Extracts links from DSpace repositories

Organization: arquivo

webarchiving java tika sitemaps

athenekilta / arkisto

webarchiving,Digital archive of web pages related to the Guild of Information Networks

Organization: athenekilta

Home Page: https://athene.fi/arkisto/

webarchiving html php archive

atomotic / pywb-recorder-tor

webarchiving,pywb recorder over Tor; anonymously records the web (Docker image)

User: atomotic

Home Page: https://pywb.readthedocs.io/en/develop/manual/recorder.html

webarchiving webrecorder tor

atomotic / webrecorder-chrome-extension

webarchiving,Record the currently active tab on webrecorder.io

User: atomotic

webarchiving webrecorder chrome-extension

basenana / nanafs

webarchiving,🗄 File-Based Reference Filing System.

Organization: basenana

workflow-engine fuse-filesystem webdav storage gtd-workflow webarchiving

cipher387 / quickcacheandarchivesearch

webarchiving,Quick Cache and Archive search buttons

User: cipher387

Home Page: https://cybdetective.com

webarchiving webarchive google-cache yandex-cache baidu-cache

commoncrawl / cc-notebooks

webarchiving,Various Jupyter notebooks about Common Crawl data

Organization: commoncrawl

jupyter-notebook common-crawl aws-athena webarchiving commoncrawl webgraph-framework

datacoon / metawarc

webarchiving,metawarc: a command-line tool for metadata extraction from files from WARC (Web ARChive)

Organization: datacoon

warc warc-files webarchiving metadata osint osint-python

exponential-decay / moonshine

webarchiving, Given four bytes, download a random file from web archives implementing the UKWA Shine interface

Organization: exponential-decay

code4lib digipres webarchiving ukwa warclight archives glam file-formats

gitdev-bash / webarchiver

webarchiving,An archiving utility with an interface for web servers.

User: gitdev-bash

archiving webserver webarchiving webarchive

harvard-lil / warc-gpt

webarchiving,WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

Organization: harvard-lil

Home Page: https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/

ai rag warc webarchiving

httpreserve / httpreserve

webarchiving,Digital Preservation of HTTP in documentary heritage.

Organization: httpreserve

code4lib archives webarchiving digitalpreservation internetarchive documentary-heritage digital-repositories digipres wayback waybackmachine

httpreserve / linkscanner

webarchiving,A helper package to tokenize textual content and retrieve hyperlinks

Organization: httpreserve

code4lib webarchiving documentary-heritage httpreserve digitalpreservation archives

httpreserve / phantomjsscreenshot

webarchiving,A wrapper for phantom.js commands for headless screenshots.

Organization: httpreserve

webarchiving websnapshot httpreserve code4lib digitalpreservation

httpreserve / tikalinkextract

webarchiving,Tika-based link (URL) extractor for httpreserve

Organization: httpreserve

webarchiving tika tika-wrapper httpreserve code4lib digitalpreservation archives iipc url-extractor

httpreserve / wayback

webarchiving,A restricted API in Golang for the (semi)-exposed functions of the Internet Archive.

Organization: httpreserve

code4lib webarchiving digitalpreservation archives internetarchive

httpreserve / workbench

webarchiving,Client app for httpreserve pkg that generates CSV, JSON, HTTP, and BoltDB

Organization: httpreserve

code4lib archives webarchiving digitalpreservation internetarchive boltdb digital-repositories

ibnesayeed / archival-tests

webarchiving,A set of web archival replay test cases

User: ibnesayeed

Home Page: https://ibnesayeed.github.io/archival-tests/

webarchiving webarchive memento testing replay-tests archival-replay

iipc / awesome-web-archiving

webarchiving,An Awesome List for getting started with web archiving

Organization: iipc

webarchiving awesome-list awesome

machawk1 / awesome-memento

webarchiving,A list of things related to software, literature, and other content for 🕣 Memento

User: machawk1

webarchiving memento memento-rfc awesome-list awesome

mijho / crawl-log2xml

webarchiving,Parse a Heritrix crawl.log into an XML sitemap

User: mijho

crawl deno heritrix sitemap sitemap-generator sitemap-xml webarchive webarchiving heritrix3

mozillacz / phpbbcrawler

webarchiving,Link crawler for a phpBB forum

Organization: mozillacz

phpbb crawler webarchiving wayback-archiver tool

n0tan3rd / node-cdxj

webarchiving,Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with node.js

User: n0tan3rd

webarchive webarchiving web-archives cdxj

n0tan3rd / node-warc

webarchiving,Parse And Create Web ARChive (WARC) files with node.js

User: n0tan3rd

webarchive webarchiving web-archives warc-files warc web-archiving pupeteer chrome-remote-interface

n0tan3rd / squidwarc

webarchiving,Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

User: n0tan3rd

Home Page: https://n0tan3rd.github.io/Squidwarc/

webarchiving webarchives crawler high-fidelity-preservation chrome-headless chrome puppeteer headless-chrome crawling browser-automation

n0tan3rd / whatdowebpagesdo

webarchiving,

User: n0tan3rd

webarchiving webarchives high-fidelity-preservation web-page-info web-archive-analysis

natliblux / warc-safe

webarchiving,A tool for detecting viruses and NSFW material in WARC files

Organization: natliblux

antivirus nsfw-classifier warc warc-safe webarchiving

news-archiver / news-archiver

webarchiving,News Archiver, Data Aggregation for CNN and Fox News

Organization: news-archiver

Home Page: https://newsarchiverdiff.com/

mysql javascript scraping-websites cnn foxnews webarchiving

oduwsdl / tmvis

webarchiving,An archival thumbnail visualization server

Organization: oduwsdl

tmvis timemap visualization webpage-changes webarchiving archive memento nodejs

peterk / munin-indexer

webarchiving,A social media open post web archiving tool

User: peterk

archiving preservation webarchiving high-fidelity-preservation

peterk / warcworker

webarchiving,A dockerized, queued high fidelity web archiver based on Squidwarc

User: peterk

preservation archiving webarchiving webarchives high-fidelity-preservation

pierlauro / mdbubing

webarchiving,From WARC records to MongoDB documents

User: pierlauro

warc webarchiving webarchive crawler crawling bubing warc-files warc-format warc-record

ruarxive / awesome-digital-preservation

webarchiving,Awesome list dedicated to digital and data preservation tools, sources, services and so on.

Organization: ruarxive

archival awesome awesome-list crawler digital-preservation list warc webarchiving

shawnmjones / government-sites-archive-projects

webarchiving,This repository contains work done to determine how much of www.guideline.gov and qualitymeasures.ahrq.gov was archived.

User: shawnmjones

Home Page: https://ws-dl.blogspot.com/2018/07/2018-07-15-how-well-are-national.html

archiving-datasets webarchiving webarchive-discovery

simonkocurek / trebis

webarchiving,Offline storage of website data on Android

User: simonkocurek

archived webarchiving offline android webview jetpack room kotlin-android storage

toimik / warcprotocol

webarchiving,Parser for WARC (aka WebArchive) files

Organization: toimik

warc warc-files webarchive webarchiving webarchives warc-record warc-format warc-reader

ubuntucz / archiver

webarchiving,Tool for archiving web pages to the Wayback Machine

Organization: ubuntucz

crawler webarchiving wayback-archiver

webarchivcz / seeder

webarchiving,Seeder - Czech webarchive curating tool and public site

Organization: webarchivcz

django archive government tools czech czech-republic webarchive webarchiving webarchives
