Code Monkey home page Code Monkey logo

warcsearcher's Introduction

WarcSearcher

A Python tool that performs regex keyword searches iteratively over the contents of local warc.gz files. Uses Python 3.12.2. (Currently in active development)

WarcSearcher is designed to facilitate the parsing of large web archives (WARC files) with a variety of search criteria. It also aims to package the results for sharing in collaborative online settings.

WarcSearcher uses the FastWARC library for faster warc.gz processing compared to similar libraries such as warcio.

Features

  • Regex searching of all response records in a WARC
  • Support for any number of regex "definitions" to search with - results output is segregated by definition
  • Recursive searching of nested .zip, .7z, .rar, and .gz archives within a WARC
  • Results output to .txt file per-definition
  • Optional per-definition .zip output of all files from the WARC that yielded a match
  • Multiprocessed warc.gz processing
  • File name and contents searched (binary file data is skipped)

Setup

[Work in progress]

  • Ensure Python is installed and you are able to install packages via pip install.
  • In order to process .rar files encountered while searching, WinRar must be installed and the path to the executable must be added to your System Path environment variable (i.e. C:\Program Files\WinRAR)
  • Acquire WarcSearcher.py, config.ini, and requirements.txt, and place them in a local directory.
  • Using a terminal, navigate to the directory you placed the files in and install all required libraries: pip install -r requirements.txt
  • Edit config.ini to provide the program with the following required information:
    • archives_directory - (Required) The directory containing your warc.gz files. Subfolders will be searched recursively during program execution.
    • definitions_directory - (Required) The directory containing your regular expressions saved as .txt files. See the Definitions section for more details.
  • Optionally, you may also want to modify these fields:
    • findings_output_path - (Optional) The directory for the results to be output to. Defaults to the current working directory if empty or invalid.
    • zip_files_with_matches - (Optional) Boolean indicating whether to add all files that yielded a match to a per-definition .zip file in the output directory. It is highly recommended to be used in order to better ascertain the context of matches - however, it can potentially consume large amounts of disk space depending on the regexes used to search. Use with discretion. Defaults to False.
    • max_concurrent_archive_read_threads - (Optional) The number of concurrent threads on the process responsible for reading in the records from the WARC files. Each .gz archive is assigned its own thread up to the maximum specified. Use with discretion. Defaults to 4.
    • max_concurrent_search_processes - (Optional) The number of concurrent processes to perform the regex searches with. Defaults to None (which will assign one process per each available logical processor on your machine - set it lower if you want more headroom for multitasking).
    • target_process_memory_bytes - (Optional) The target process memory such that the PC's RAM isn't fully consumed when reading in multiple large WARC files. The program will wait for the search processes to clear out items from the queue until the process memory falls below the specified target. Defaults to 1000000000 (1GB).

Command Line Arguments

  • zip - (Optional) Adds all files that generated a match from the WARC to a per-definition zip file. This argument will override the value of zip_files_with_matches in config.ini if used. It is highly recommended to be used in order to better ascertain the context of matches - however, it can potentially consume large amounts of disk space depending on the regexes used to search. Use with discretion.

Definitions

WarcSearcher operates by reading in any number of regex "definition" files. A definition is simply a .txt file in the user-defined definitions_directory that contains a regular expression. For each definition, it performs the regex keyword search against all warc.gz files found in archives_directory, then segregates the output by each definition.

Here's a simple example of the contents of a fruits.txt file that yields a match for any case-insensitive ocurrence of "Banana", "Apple", or "Orange":

(?i)(banana|apple|orange)

When executed, the program will output a fruits_findings.txt file in the output directory, containing all matches for the keywords prescribed by the regex found within the records of the WARC(s). Optionally, a fruits_findings.zip will also be output if zip_files_with_matches = True or the zip parameter was used, containing a copy of all files that yielded a match extracted from the WARC(s).

More information about regexes

Making an Executable (Optional)

In local testing, I've found that creating an executable speeds up the searching of large WARCs by about 25%. A standalone executable is planned for inclusion in the repo eventually, but you can create your own by navigating to the directory containing WarcSearcher.py and using PyInstaller:

  • pip install pyinstaller
  • pyinstaller --onefile WarcSearcher.py

The executable should be created in a dist folder - copy it to the directory containing your config.ini, then it should be ready for use. Be aware that it may be falsely flagged by antivirus software.

Future Development

  • Output by mimetype
  • Performance optimizations
  • Standalone executable
  • GUI?
  • HTML output?

Contributing

Anyone is welcome to contribute to the repo! This is essentially a hobby project, but anyone is welcome to set up a pull request.

Credits

  • CrazyTom
  • InfiniteBlueGX

warcsearcher's People

Contributors

infinitebluegx avatar

Stargazers

Brayconn avatar

Watchers

 avatar

Forkers

suneecat

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.