Code Monkey home page Code Monkey logo

warc2corpus's Introduction

warc2corpus

warc2corpus extracts text corpora from WARCs or HTML pages, according to a user-defined specification, or spec. The spec consists of CSS paths and (optional) transformations on the text extracted.

warc2corpus is best used with Archives Unleashed Toolkit (AUT).

Installation

The easiest way to use warc2corpus is to build and use a Docker image.

Docker Images

AUT's docker-aut image, version 1.x.x, serves as the base image for warc2corpus.

Note: Since pre-built Docker images for Archives Unleashed Toolkit (AUT) are no longer available, start by building docker-aut locally, see instructions.

git clone https://github.com/archivesunleashed/docker-aut.git
cd docker-aut
git checkout 1.x.x     # checkout the 1.x.x branch
docker build -t aut .  # build AUT Docker image

Next, build the warc2corpus image locally.

git clone [email protected]:sepastian/warc2corpus.git
cd warc2corpus
make build

You are now ready to use warc2corpus.

Usage

There are two ways of using warc2corpus: standalone or within AUT. In standalone mode, scripts are run using Python3. When using warc2corpus withing AUT, scripts are run using spark-submit.

Standalone warc2corpus

Given a single HTML page containing a news article and a spec, a Python3 script makes use of warc2corpus to extract the date of release, as shown in the following example.

import warc2corpus as w2c
import dateparser as dp
import json

# Warc2corpus spec definig extractors.
spec= {
  'extracts': [
    {
      'name': 'released_at',
      'css_path': 'header time',
      'f': lambda m: dp.parse(m[0]['datetime'].strip(), date_formats=['%d. %B %Y']).isoformat()
    }
  ]
}
with open('file.html','r',encoding='utf-8') as f:
    html= f.read()
    result= w2c.apply(html, spec)
    print(json.dumps(result))

# Output:
# [
#   {
#     "css_path": "header time",
#     "name": "released_at",
#     "value": "2022-02-14T17:04:21"
#   }
# ]

A complete example can be found in sample/w2c_standalone.py. The sample can be run using a make target or by invoking docker.

cd warc2corpus/

# Run sample using make target.
make sample-standalone

# Alternatively, run sample using docker run.
docker run --rm -it \
  -v $(CURDIR)/data:/w2c/data \
  -e PYTHONPATH=/aut/aut-1.1.0.zip:/w2c/lib \
  --entrypoint=/usr/bin/python3 \
  warc2corpus \
    sample/w2c_standalone.py

warc2corpus within AUT

For processing potentially large WARC files containing many HTML sites, it is recommended to run warc2corpus within AUT. Instead of reinventing the wheel, let AUT handle WARC files. Under the hat, Apache Spark is used for heavy data lifting.

In this example, a script processes a WARC file containing several HTML pages. The script is run using spark-submit. Not all HTML pages inside the WARC may be of interest. Warc2corpus allows specifying a regular expression, to select HTML pages to process by URL.

import json
import dateparser as dp
from pathlib import Path
from warc2corpus import run

# Warc2corpus will only process HTML pages within 
# the WARC with a URL matching this regex.
url_regex = ".+/pressemeldungen/meldung/detail/[a-z]+"

# Warc2corpus spec defining extractors.
spec= [
  {
    'extracts': [
      {
        'name': 'released_at',
        'css_path': '[itemprop~="datePublished"]',
        'f': lambda m: dp.parse(m[0]['datetime'].strip(), date_formats=['%d. %B %Y']).isoformat()
      }
    ]
  }
]

# Select HTML pages from WARC by regex, apply spec only 
# on pages with a url matching the regex supplied.
df = run(warc_file, config, url_regex=url_regex)

# Write results to console, in JSON format.
print(df.toJSON().collect())

A complete example can be found in sample/w2c_aut.py. The sample can be run using a make target or by invoking docker.

cd warc2corpus/

# Run sample using make target.
make sample-aut

# Alternatively, run sample using docker run.
docker run --rm -it \
  -v $(CURDIR)/data:/w2c/data \
  --entrypoint=/spark/bin/spark-submit \
  warc2corpus \
    --py-files /aut/aut-1.1.0.zip \
    --jars=/aut/aut-1.1.0-fatjar.jar \
    sample/w2c_aut.py

Spec Format

Warc2corpus requires a spec, to be able to extract structured information from HTML pages. The spec is a Python dictionary containing two keys, meta (optional) and extractors.

The meta entry (optional)

When running warc2corpus, the contents of the optional meta entry will be copied as-is into the result. This allows adding meta information to the result, such as the source of the information extracted, or the date of processing.

spec= {
  'meta': {
    'name': 'Uni Passau press releases',
    'location': 'WARCnet London 2022',
    'created_at': dt.now().isoformat()
  },
  'extracts': [
    # one or more extractors
  ]
}

The extracts entry

The extracts entry contains a list of so-called extractors. Each extractor is itself a dictionary, containing the following keys: name, css_path, f (optional).

spec= {
  'extracts': [
    {
      'name': 'released_at',
      'css_path': '[itemprop~="datePublished"]',
      'f': lambda m: dp.parse(m[0]['datetime'].strip(), date_formats=['%d. %B %Y']).isoformat()
    }
  ]
}

When running warc2corpus, each extractor will be applied on the HTML page processed. The css_path is used to select one or several elements from the HTML page. The information extracted will be stored under the key released_at, as defined by name. Instead of storing the date of release as-is, the lamba f is applied first, to transform the string "14. Februar 1992" into a datetime object.

Results of Extraction

The result of applying the extractor above is the following JSON object. The css_path and the name have been copied into the result. The date of release extracted from the HTML has been passed through the lambda defined in f, the result is stored under value.

[
  {
    "css_path": "header time",
    "name": "released_at",
    "value": "2022-02-14T17:04:21"
  }
]

warc2corpus's People

Contributors

sepastian avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.