
pub2tei's Introduction

This project provides a set of style sheets for converting XML documents encoded in various scientific publisher formats into a common TEI format. Often called document ingestion, converting heterogeneous publisher formats into a common working format is a typical, painful and time-consuming sub-task when building scientific digital library applications.

These style sheets were first developed in the context of the European project PEER and have since been extended, in particular in the context of the ISTEX project. Depending on the publisher (see below), the encoding of bibliographical information, abstracts, citations and full texts is supported.

Note: the test XML documents present in the sub-directory Samples are dummy documents with realistic publisher structures but random content.

Requirement

XSLT 2.0 processor.

Usage

The starting point of the transformation process is the style sheet Stylesheets/Publishers.xsl.

The resulting TEI documents follow a TEI customisation documented under the sub-directory Schemas. This TEI format is very close to the one used by GROBID, a complementary tool that converts PDF documents into TEI.

Example with saxon9

Here is a usage example with the open source Saxon 9 Home Edition (Java). You can download the latest saxon9he.jar here (for convenience, one is included in the Samples/ directory):

java -jar Samples/saxon9he.jar -s:Samples/TestPubInput/BMJ/bmj_sample.xml -xsl:Stylesheets/Publishers.xsl -o:out.tei.xml -dtd:off -a:off -expand:off -t

The command applies the Pub2TEI style sheets to an NLM file and produces the TEI file out.tei.xml. Remove the -t option to suppress the trace information.

You can also give a directory as input and as output in order to process a large number of files while compiling the XSLT only once. The normal throughput is then around one hundred files per second. If it is closer to one file per hundred seconds, see below...
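As a minimal sketch of the directory mode described above, the single-file command can be wrapped in a small shell function. The function name pub2tei_batch, the environment variables SAXON_JAR and PUB2TEI_XSL, and the output directory name out_tei are hypothetical and only used for illustration; the Saxon options are the same as in the single-file example, minus -t.

```shell
#!/bin/sh
# Sketch only: batch conversion with Saxon's directory mode.
# pub2tei_batch, SAXON_JAR and PUB2TEI_XSL are illustrative names,
# not part of Pub2TEI itself.

pub2tei_batch() {
  input_dir=$1    # directory containing the publisher XML files
  output_dir=$2   # directory that will receive the TEI files

  # Make sure the output directory exists before Saxon writes to it.
  mkdir -p "$output_dir"

  # When -s: is a directory, -o: has to be a directory as well.
  # The style sheet is compiled once and applied to every input file.
  java -jar "${SAXON_JAR:-Samples/saxon9he.jar}" \
    -s:"$input_dir" \
    -xsl:"${PUB2TEI_XSL:-Stylesheets/Publishers.xsl}" \
    -o:"$output_dir" \
    -dtd:off -a:off -expand:off
}
```

For example, pub2tei_batch Samples/TestPubInput/BMJ out_tei would convert the BMJ samples into the directory out_tei.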

Usual troubleshooting

Remember that XML comes from the W3C, so anything simple is complicated by default. In particular, the DTD declared in the source XML file should point to a local file, and you should make sure that your XSLT processor does not try to fetch the DTD from the internet (this has a disastrous impact on performance). For Saxon, the option -dtd:off only applies to the XSLT part (the Saxon part) and not to the parsing, which will always try to fetch these damn DTDs.

Alternatively, you can add empty local DTD files (empty files, yes!) with the same name (see also here). Saxon will then intercept the stupid (but conformant) online fetching of the DTD with these local versions, which neutralizes validation.
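The empty-DTD trick can be automated with a short script. The following is a sketch under stated assumptions: the DOCTYPE declaration sits on a single line, the system identifier is double-quoted, ends in .dtd and is a relative path, and the stub is created relative to the directory holding the XML files. The function name make_dtd_stubs is hypothetical.

```shell
#!/bin/sh
# Sketch only: create empty stand-in DTD files next to the XML sources.
# make_dtd_stubs is an illustrative name, not part of Pub2TEI.

make_dtd_stubs() {
  input_dir=$1
  for xml in "$input_dir"/*.xml; do
    [ -f "$xml" ] || continue

    # Take the last double-quoted string ending in .dtd from the DOCTYPE
    # line, i.e. the system identifier (assumes a one-line DOCTYPE).
    dtd=$(sed -n 's/.*<!DOCTYPE[^>]*"\([^"]*\.dtd\)".*/\1/p' "$xml" | head -n 1)
    [ -n "$dtd" ] || continue

    # Remote (scheme://...) and absolute identifiers are left alone:
    # an empty file beside the XML would not be picked up for them.
    case $dtd in
      *://*|/*) continue ;;
    esac

    stub="$input_dir/$dtd"
    mkdir -p "$(dirname "$stub")"

    # Never overwrite a real DTD that is already present.
    [ -e "$stub" ] || : > "$stub"
  done
}
```

Running make_dtd_stubs on an input directory before the transformation leaves existing DTD files untouched and only adds the missing ones as empty files.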

Alternatively, you can use a non-validating XML parser like piccolo, see also here.

Coverage

The following publisher formats should be properly processed:

  • BMJ: metadata, header, bibliography, body
  • Elsevier (journals and conferences): metadata, header, bibliography, body
  • IOP: metadata, header, bibliography
  • NPG (Nature): metadata, header, bibliography, body
  • NLM/JATS: metadata, header, bibliography, body
  • OUP: metadata, header, bibliography, body
  • PNAS: metadata, header, bibliography, body
  • RSC: metadata, header, bibliography, body
  • Sage: metadata, header
  • ScholarOne: metadata, header
  • Springer: metadata, header, bibliography, body
  • Wiley: metadata, header, bibliography, body

Coverage of NLM and JATS should be comprehensive (all versions), including in particular PMC, PLOS and bioRxiv XML.

License

Pub2TEI is distributed under the BSD 2-clause license.

Contributors: stefgreg, kermitt2, rloth, laurentromary, rmeja, inistcnrs
