PG2TEI

(C) 2012 by Damir Cavar and Malgosia Cavar

Project Gutenberg books to TEI XML conversion

This repository contains code for the automatic conversion of Project Gutenberg books to the TEI XML format. We focus on generating valid TEI Lite P5 XML from HTML sources. The code is written in Java. It may use some Java 7-specific constructs, but it should be easy to adapt to Java 6 or earlier versions.

You might need the following components to convert the Project Gutenberg files yourself:

The TEI Subversion repository

For the conversion of ODT-documents to TEI XML we make use of XSLT-scripts that are part of the TEI@Sourceforge package. You can check out a local copy of the trunk using the following command:

svn co https://tei.svn.sourceforge.net/svnroot/tei/trunk ./TEI

For further details, see the links to the repository here:

The necessary components are in the Stylesheets subfolder. The odttotei script is particularly relevant. You might have to change certain paths in the script, or provide appropriate command line parameters when invoking it.

In addition to the XSLT-scripts in the TEI folder you will most likely need Saxon. If you use oXygen, you should provide the path to the oXygen-lib-folder via command line to odttotei or directly in your adjusted script (e.g. a version of odttotei).
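As a rough illustration of what applying one XSLT stylesheet looks like from Java, here is a minimal sketch using the standard JAXP API. It is not a replacement for odttotei and is not taken from the PG2TEI code; the class name `XsltStep` is hypothetical. We assume that, with Saxon on the classpath, `TransformerFactory.newInstance()` will typically pick up Saxon's factory; without it, the JDK's built-in processor is used.

```java
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;

// Hypothetical helper, for illustration only.
final class XsltStep {

    /** Applies a stylesheet to an XML source and returns the result as a string. */
    static String transform(Source xml, Source stylesheet) throws TransformerException {
        // Assumption: the factory found here is Saxon if its jar is on the
        // classpath, otherwise the JDK's built-in XSLT processor.
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(stylesheet);
        StringWriter out = new StringWriter();
        transformer.transform(xml, new StreamResult(out));
        return out.toString();
    }
}
```

Both arguments can be built from files with `new StreamSource(new File(...))`; which stylesheet and which input file to use depends on your local TEI checkout.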

Document conversion tools

We make use of the textutil tool that is distributed with the recent versions of Mac OS X. textutil makes batch conversion of different document types easy. We use textutil to convert the Project Gutenberg HTML-files to ODT.
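The HTML-to-ODT step could be driven from Java roughly as follows. This is a sketch under assumptions, not the actual PG2TEI code: the class and method names are hypothetical, and the target file is simply placed next to the source. The `-convert` and `-output` options are standard textutil options.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class TextutilCommand {

    /** Target path: same directory and base name as the source, with an .odt extension. */
    static Path odtTargetFor(Path html) {
        String name = html.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return html.resolveSibling(base + ".odt");
    }

    /** Builds: textutil -convert odt -output <target> <source> */
    static List<String> buildCommand(Path html) {
        return Arrays.asList(
                "textutil", "-convert", "odt",
                "-output", odtTargetFor(html).toString(),
                html.toString());
    }

    /** Runs textutil (available on Mac OS X only) and returns its exit code. */
    static int convert(Path html) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(html)).inheritIO().start();
        return process.waitFor();
    }
}
```

Keeping the command construction separate from the process launch makes it possible to inspect or log the command on systems where textutil is not installed.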

You might want to try alternative conversion strategies, for example using:

  • pandoc, a universal document converter that is available for all major platforms.
  • OpenOffice or LibreOffice via command line for batch processing. You will find a lot of descriptions of the command line usage online, see for example here...
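For the two alternatives listed above, the command lines might be assembled as in the following sketch. The class and method names are hypothetical, and we have not verified these invocations against the Project Gutenberg files. Depending on the LibreOffice version, HTML input may need an explicit filter name or the `--writer` option in addition to what is shown.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class AlternativeConverters {

    /** Builds: pandoc -f html -t odt -o <odt> <html> */
    static List<String> pandoc(String html, String odt) {
        return Arrays.asList("pandoc", "-f", "html", "-t", "odt", "-o", odt, html);
    }

    /** Builds: soffice --headless --convert-to odt --outdir <dir> <html> */
    static List<String> libreOffice(String html, String outDir) {
        return Arrays.asList(
                "soffice", "--headless", "--convert-to", "odt", "--outdir", outDir, html);
    }
}
```

Either list can be passed to `new ProcessBuilder(...)` to run the tool, provided it is installed and on the PATH.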

Configuration of the Java code

The PG2TEI code is a quick-and-dirty implementation of the conversion pipeline. We coded very defensively and left out refinements that might improve stability but would have cost coding time, so some parts of the code need adaptation to your environment. You might experience crashes and error messages for individual files; we cannot avoid that. Overall, the conversion runs quite stably, and restarting the converter skips target files that already exist.
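The restart behavior described above can be sketched as follows. This is an illustration of the idea, not the actual PG2TEI implementation: `ResumableBatch` and its `Converter` interface are hypothetical names, and the real code may handle errors differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ResumableBatch {

    /** Stands in for whatever converts a single file. */
    interface Converter {
        void convert(Path source, Path target) throws Exception;
    }

    /**
     * Converts each source unless its target already exists.
     * A failure is recorded and does not stop the batch.
     * Returns the sources that failed.
     */
    static List<Path> run(List<Path> sources, Path targetDir, String ext, Converter converter) {
        List<Path> failed = new ArrayList<Path>();
        for (Path source : sources) {
            Path target = targetDir.resolve(baseName(source) + ext);
            if (Files.exists(target)) {
                continue; // already produced by an earlier run
            }
            try {
                converter.convert(source, target);
            } catch (Exception e) {
                failed.add(source);
                System.err.println("Failed: " + source + " (" + e + ")");
            }
        }
        return failed;
    }

    /** File name without its last extension. */
    static String baseName(Path path) {
        String name = path.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return (dot > 0) ? name.substring(0, dot) : name;
    }
}
```

Because existing targets are skipped, rerunning the batch after a crash only retries the files that have no output yet, including those that failed before.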

Follow the instructions in these documents to set up the conversion process for your specific environment:

If you discover serious bugs or problems with the code, please send us a message. Thanks!

License

The code is made available under the Apache 2.0 License as is. See LICENSE.md.
