ScienceBeam

License: MIT

A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools.

The aim of this project is to bring multiple tools together to generate a full XML document.

You might also be interested in ScienceBeam Gym, the model training ground (the model is not yet integrated into the conversion pipeline).

Status

This project is at a very early stage and may change significantly.

Docker

Note: If you just want to use the API, you can use the Docker image.

Pre-requisites

Pipeline

The conversion pipeline could for example look as follows:

[Figure: Example Conversion Pipeline]

See below for current example implementations.

Simple Pipeline

A simple pipeline definition that is not specific to Apache Beam exists and can be configured using app.cfg (defaults in app-defaults.cfg).
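To illustrate how an override file layered on top of a defaults file typically behaves, here is a minimal sketch using Python's standard configparser. Whether ScienceBeam loads its configuration exactly this way is an assumption, and the section name `pipelines` and key `default` are hypothetical placeholders rather than documented settings.

```python
import configparser


def load_layered_config(defaults_text, overrides_text=''):
    """Parse defaults first, then overrides; later values win."""
    parser = configparser.ConfigParser()
    # Values read later replace values read earlier for the same section/key.
    parser.read_string(defaults_text)
    parser.read_string(overrides_text)
    return parser


# Hypothetical contents standing in for app-defaults.cfg and app.cfg.
DEFAULTS = "[pipelines]\ndefault = grobid\n"
OVERRIDES = "[pipelines]\ndefault = my_pipeline\n"

config = load_layered_config(DEFAULTS, OVERRIDES)
selected_pipeline = config.get('pipelines', 'default')
```

With files on disk, `parser.read(['app-defaults.cfg', 'app.cfg'])` gives the same layering and silently skips a file that does not exist, so app.cfg stays optional.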

The pipeline can be executed directly (e.g. as part of the API, see below) or translated and run as an Apache Beam pipeline.

To run the pipeline using Apache Beam:

python -m sciencebeam.pipeline_runners.beam_pipeline_runner \
  --data-path=/home/deuser/_git_/elife/pdf-xml/data/other/00666 --source-path=*.pdf \
  --grobid-url=http://localhost:8070/api
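If you want to start the same run from a Python script, the command above could be wrapped roughly as follows. This is only a sketch: the helper names `build_beam_runner_command` and `run_beam_runner` are hypothetical, and only the three flags shown above are used.

```python
import subprocess
import sys

RUNNER_MODULE = 'sciencebeam.pipeline_runners.beam_pipeline_runner'


def build_beam_runner_command(data_path, source_path, grobid_url):
    """Assemble the argument list for the Beam pipeline runner."""
    return [
        sys.executable, '-m', RUNNER_MODULE,
        '--data-path=' + data_path,
        '--source-path=' + source_path,
        '--grobid-url=' + grobid_url,
    ]


def run_beam_runner(data_path, source_path, grobid_url):
    """Run the pipeline in a child process and wait for it to finish."""
    command = build_beam_runner_command(data_path, source_path, grobid_url)
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(command, check=True)
```

Passing the arguments as a list bypasses the shell, so a pattern such as `*.pdf` reaches the runner literally instead of being expanded by the shell first.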

To get a list of all of the available parameters:

python -m sciencebeam.pipeline_runners.beam_pipeline_runner --help

Note: the list of parameters may change depending on the configured pipeline.

Current pipelines:

API Server

The API server is currently available in combination with GROBID.

To start GROBID, run:

docker run -p 8070:8070 lfoppiano/grobid:0.5.1

To start the ScienceBeam server, run:

./server.sh --host=0.0.0.0 --port=8075 --grobid-url http://localhost:8070/api

The ScienceBeam API will be available on port 8075.
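A client call could then look roughly like the sketch below. The endpoint path `/api/convert`, the `filename` query parameter, and the `application/pdf` content type are assumptions that are not stated in this README; check them against the server's actual routes before relying on them. The helper names are hypothetical.

```python
import urllib.parse
import urllib.request


def build_convert_request(pdf_bytes, base_url='http://localhost:8075',
                          filename='document.pdf'):
    """Build (but do not send) a POST request carrying the PDF content."""
    # ASSUMPTION: '/api/convert' and the 'filename' parameter are illustrative.
    query = urllib.parse.urlencode({'filename': filename})
    url = base_url.rstrip('/') + '/api/convert?' + query
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={'Content-Type': 'application/pdf'},
        method='POST',
    )


def convert_pdf(pdf_path, base_url='http://localhost:8075'):
    """Send a local PDF to the server and return the response body as bytes."""
    with open(pdf_path, 'rb') as pdf_file:
        request = build_convert_request(pdf_file.read(), base_url=base_url)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers without a running server.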

The pipeline used by the API currently uses the simple pipeline format described above. It can be configured via app.cfg (defaults in app-defaults.cfg). The default pipeline uses GROBID.

Extending the Pipeline

You can use grobid_pipeline.py as a template and add your own pipelines with other steps. Please see Simple Pipeline for configuration details.

The recommended way of extending the pipeline is to use a separate API server exposed via another docker container (as is the case for all of the currently integrated tools). If that is impractical for your use case you could also run locally installed programs (similar to the grobid_pipeline.py).
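The idea of delegating each step to a separate API server can be sketched as below. This does not reproduce the interface used by grobid_pipeline.py; the names `make_api_step` and `run_steps`, and the bytes-in/bytes-out step shape, are hypothetical and only meant to show the pattern.

```python
import urllib.request


def http_post(url, data, content_type):
    """POST raw bytes to a service and return the response body."""
    request = urllib.request.Request(
        url, data=data, headers={'Content-Type': content_type}, method='POST'
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def make_api_step(url, content_type, post=http_post):
    """Create a step that hands its input to an external service."""
    def step(content):
        return post(url, content, content_type)
    return step


def run_steps(steps, content):
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        content = step(content)
    return content
```

Because the `post` function is injectable, a step can be exercised in unit tests with a stub instead of a live container.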

If the simple pipeline is too restrictive, you could consider the deprecated pipeline examples.

Tests

Unit tests are written using pytest. To run them, use for example pytest or pytest-watch.

Contributing

See CONTRIBUTING
