contentmine / cm-ucl

A repository to openly track progress on table extraction.

License: Apache License 2.0

HTML 99.97% Shell 0.01% Batchfile 0.01% CSS 0.01% JavaScript 0.01%

cm-ucl's Issues

Metrics

This issue will track our metrics. Please contribute your thoughts by replying to this issue and keep the theme restricted to Metrics.

Our intention is to be able to assess the achievement of this project using blinded testing, in which the final evaluation keeps the methods and corpus secret from the developers.

There are several metrics which can be used. For some we can use the standard "recall + precision", but others may use "accuracy" and yet others a "Likert-like" scale (L).

Identification of tables in articles. This is formally out of scope - the software will be presented with the tables.
Classification of table type. We may develop methods for detecting table type, but may also require the tool to be told it.
Identification of sections (title, header, body, footer; each optional and in any order). This will only be relevant to tables which humans agree have this structure.
title. L?
Header Structure. Identification of column names, and column trees
Header content. L? will include wrapping, bleeding.
Body structure. May include subtables, possibly guessed or possibly template-driven. Metrics on number of cells missed, or with corrupt content.
Footer content. L?
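
As an illustration of the "recall + precision" option for body cells, here is a minimal sketch. It assumes (this is not an agreed project format) that both the extracted table and the human reference are represented as sets of (row, column, text) triples; the function name and the sample data are hypothetical.

```python
def precision_recall(extracted, reference):
    """Compare two collections of (row, column, text) cells.

    A cell counts as correct only if its position and its text both match,
    so a cell with corrupt content lowers precision and recall together.
    """
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up example: one of four cells has corrupt content ("P" instead of "F").
reference = {(0, 0, "Age"), (0, 1, "42"), (1, 0, "Sex"), (1, 1, "F")}
extracted = {(0, 0, "Age"), (0, 1, "42"), (1, 0, "Sex"), (1, 1, "P")}
precision, recall = precision_recall(extracted, reference)
```

A missed cell lowers only recall, while a corrupt cell lowers both values, which matches the "cells missed, or with corrupt content" distinction above.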

Human Classification of Table Types

This issue will track our development of Table Types. Please contribute your thoughts by replying to this issue and keep the theme restricted to Development of Table Types.

For each PubStyle we will assign a TableType. This will be iterative as we discover (a) the totality of types and (b) consistency within a PubStyle.

The actual PubStyles should be included in a subdirectory of PubStyles and this Issue used to discuss problems in assigning PubStyles.

Example
(This is NOT final!)

pubstyle: BMJ
style: APA-like
  separator: full hruler
title:
  title-regex: "Title \\d+\\s"
  title-weight: bold
  title-wrap: yes
  title-wrap-indent: none
title-caption:
  caption-weight: bold
header:
  header-weight: bold
  header-separator: spanning-hruler
  header-wrap: yes
  header-wrap-indent: 8 px
body:
  weight: normal
  row-separator: none
  body-row:
    cell-separator: whitespace
    cell-cell-distance-min: 20 px
  subtable:
    indent: yes

and so on. Suggest that we can find some CSS terms for this.

Header not created properly for nested column headers

The final table only includes the bottom row of the header. Develop a fuller header with 2 or more rows as appropriate. (Already analyzed in annot.svg)
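
One possible way to carry the upper header rows into the output is to flatten the column tree into composite column names. The sketch below is an assumption about how this could be done, not a description of the existing code; the input format (rows of (text, column-span) pairs), the function name and the separator are all hypothetical.

```python
def flatten_header(header_rows, separator=" / "):
    """Turn nested header rows into one composite name per body column.

    header_rows: list of rows, each a list of (text, span) pairs, where span
    is the number of body columns the header cell covers.
    """
    expanded = []
    for row in header_rows:
        labels = []
        for text, span in row:
            labels.extend([text] * span)  # repeat a spanning label per column
        expanded.append(labels)
    if len({len(labels) for labels in expanded}) != 1:
        raise ValueError("header rows cover different numbers of columns")
    # Join the labels top-down for each column, skipping empty cells.
    return [separator.join(part for part in column if part)
            for column in zip(*expanded)]


# Made-up two-row header with two spanning parent columns.
header_rows = [
    [("", 1), ("Treatment", 2), ("Control", 2)],
    [("Variable", 1), ("Mean", 1), ("SD", 1), ("Mean", 1), ("SD", 1)],
]
columns = flatten_header(header_rows)
```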

Wrong characters

This issue will track our analyses of wrong characters. Please contribute your thoughts by replying to this issue and keep the theme restricted to Wrong Characters.

Many typesetters use the wrong code point for the character, such as using "em-dash" for "minus". This can lead to severe corruption of numeric data. The only approach is to survey the likely misuses and create heuristics to "correct" them.
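
A minimal sketch of one such heuristic follows. The rule used here (a dash directly before a digit, and not directly after a digit or letter, is treated as a minus) is an assumption chosen for illustration and would need checking against the survey; the function name is hypothetical.

```python
import re

# Dash-like code points often found where a minus sign was intended:
# figure dash, en-dash, em-dash, and the true MINUS SIGN.
DASHES = "\u2012\u2013\u2014\u2212"

# Match a dash that directly precedes a digit (or a decimal point and a digit)
# and does not follow a digit or letter, so that numeric ranges written with
# an en-dash between two numbers are left alone.
_MINUS_LIKE = re.compile(r"(?<![0-9A-Za-z])[" + DASHES + r"](?=\.?\d)")


def normalize_minus(text):
    """Replace dash characters used as a minus sign with ASCII hyphen-minus."""
    return _MINUS_LIKE.sub("-", text)
```

After this step a cell such as an en-dash followed by "0.5" becomes "-0.5", which `float()` can parse. Cases such as a dash used to mean "no data" are deliberately not handled here.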

AMI-pdf2svg and extraction of tables

SVG should be created using AMI-pdf2svg (see https://bitbucket.org/petermr/pdf2svg/wiki/Home ). This ensures styles, weights, and legacy character conversion.

PMR has run this (2017-01-29) and created all necessary SVG (one per page) in CProject oa-corpus-pmr. The svg are in CTree subdirectory svg/ . The non-empty images have also been extracted into png/

I have started manual extraction of tables into svg/table%d.svg. Others can also assist in this and commit the results.

PDF2SVG conversion

There are several "PDF2SVG" converters running on different platforms (Java, Python, C(++)). Although the format is SVG there are many ways that it could be structured. We have used 2 which "run on all platforms":

PDF2SVG (AMI) Java https://bitbucket.org/petermr/pdf2svg/wiki/Home . This was based on PDFBox 1.8 (https://pdfbox.apache.org/) which has a very thorough toolchain for extracting PDF.
This is the default which will be used for this project. It runs from the command line but is not yet packaged as an uber-jar.
We plan to move to PDFBox 2.0.4 but not during the CM-UCL project.
PDF2SVG (http://www.cityinthesky.co.uk/opensource/pdf2svg/). This wraps some existing libraries. It is (somewhat) easier to install than AMI-PDF2SVG and has a more compact output. However, it has not been tested for producing SVG2XML input and will not be used for production.

PDF2SVG only needs to be run once (and has been). The tables have been extracted by hand from both corpora.

Executable jar files

We need executable jars for

pdf2svg (create the raw SVG output)
norma (check old ones will work with new args)

Character Normalization

This issue will track our analyses of character normalization. Please contribute your thoughts by replying to this issue and keep the theme restricted to Character Normalization.

Continuation tables

Catalogue the semantics of split tables ("continuation") and devise a structure which accommodates "most" of them.

Character Stream

This issue will track our analyses of character streams. Please contribute your thoughts by replying to this issue and keep the theme restricted to Character Streams.

Tables consist of characters (letters, digits, punctuation, symbols, etc.) and graphics (lines, rectangles, etc.). Ideally the character stream should consist of Unicode characters, but many PubStyles have legacy fonts (with no open documentation) which do not indicate the code point. This is a large source of information loss and corruption. In many cases we can guess the code points for legacy fonts with high reliability and ContentMine has many conversion tables. The most common problems are legacy symbol fonts (e.g. used in LaTeX and Word).

Known issues include:

Normalization. Aring (A with super-ring) and Angstrom have identical glyphs. Diacritics (accents) can be added as separate characters or using single accented characters. See character normalization.
Ligatures. Many typesetting systems combine certain sequences of characters (e.g. "ff") into single code points. These can be normalized back to the explicit individual characters.
Legacy codepoints.
Inappropriate codepoints. The commonest of these are "small caps" or "symbols" codepoints used instead of the standard codepoints with styles applied.
Wrong characters. The commonest (and most serious) problem is replacing minus signs with an em-dash or en-dash; another is using the German eszett in place of the Greek beta.
Font weights. Ideally weights should be normal or bold, but some systems will "overprint" (repeat character) for bold, use a color (e.g. "black" and "gray" for bold/normal) or use a black font (thicker stems). All these are very error prone.
Font styles. These should be normal or italic, but sometimes this is fudged with a different (Oblique, slanted) font or by a shear affine transformation on the character. All these are very error prone.
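
The normalization and ligature items can be illustrated with the Unicode normalization forms in Python's standard `unicodedata` module. This is only a sketch of the standard behaviour, not the ContentMine conversion tables; the variable names are illustrative.

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_ring = "\u00c5"         # LATIN CAPITAL LETTER A WITH RING ABOVE
a_plus_ring = "A\u030a"   # "A" followed by COMBINING RING ABOVE
ff_ligature = "\ufb00"    # LATIN SMALL LIGATURE FF

# NFC (canonical composition) maps all three spellings of the A-ring glyph
# to the single precomposed code point U+00C5.
assert unicodedata.normalize("NFC", angstrom_sign) == a_ring
assert unicodedata.normalize("NFC", a_plus_ring) == a_ring

# Ligatures are compatibility characters: NFC leaves them unchanged,
# while NFKC expands them to the individual letters.
assert unicodedata.normalize("NFC", ff_ligature) == ff_ligature
assert unicodedata.normalize("NFKC", ff_ligature) == "ff"
```

Note that NFKC also folds other compatibility characters (for example superscript digits), so whether it is acceptable for table content is a separate decision.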

Please add your observations of character stream issues here.

Ligatures

This issue will track our analyses of ligatures. Please contribute your thoughts by replying to this issue and keep the theme restricted to Ligatures.

csv,conf,v3 talk

It would be nice to present the results of this project at csvconf in Portland this May. The conference is all about data; "For those who love data" says the site. @blahah spoke at csvconf2016, so he might be able to share some thoughts. Proposals can be submitted here until Feb 15, 2017. Talks last ~25 minutes, so a 20 minute talk/demo could fit nicely here.

If we want to submit, these are some of the things off the top of my head:

Who would be available? (May 2-3 2017, Portland OR, USA)
Who would like to go? :-)
Would (partial) travel funds be available from ContentMine or elsewhere? (The organizers offer financial aid to some speakers.)
What would the content be, exactly? (I think the specific ami-table process would already be great, making people aware and increasing usage/contribs, maybe in demo form showing the entire process for one table)

Cheers

Align header columns with body columns

Headers are left-aligned when they should be aligned with body columns

Font Weights

This issue will track our analyses of font weights. Please contribute your thoughts by replying to this issue and keep the theme restricted to Font Weights.

The only weights we can process are "normal" and "bold". Please indicate where PubStyles use other approaches such as:

overprinting
colors (black and gray)
"black fonts"
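
As a hedged illustration of the overprinting case, the sketch below collapses a character that is repeated at (almost) the same position and marks it as bold. It assumes that glyphs arrive as (x, y, char) tuples in stream order and that the overprinted copies are adjacent in the stream; the tolerance value and the function name are hypothetical.

```python
def collapse_overprints(glyphs, tolerance=0.5):
    """Merge consecutive identical characters drawn at nearly the same place.

    glyphs: list of (x, y, char) in stream order.
    Returns a list of (x, y, char, weight) with weight "normal" or "bold".
    """
    result = []
    for x, y, char in glyphs:
        if result:
            last_x, last_y, last_char, _ = result[-1]
            same_place = (abs(x - last_x) <= tolerance
                          and abs(y - last_y) <= tolerance)
            if char == last_char and same_place:
                # Treat the repeat as an overprint: keep one glyph, mark it bold.
                result[-1] = (last_x, last_y, last_char, "bold")
                continue
        result.append((x, y, char, "normal"))
    return result


# Made-up stream: "P" is printed twice with a 0.2 unit offset.
glyphs = [(10.0, 5.0, "P"), (10.2, 5.0, "P"), (16.0, 5.0, "<")]
collapsed = collapse_overprints(glyphs)
```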

Legacy Fonts

This issue will track our analyses of legacy fonts. Please contribute your thoughts by replying to this issue and keep the theme restricted to Legacy Fonts.

Record issues where the original document used (some) characters with code points which were not Unicode. The commonest are Publisher-specific fonts (e.g. Elsevier) or LaTeX such as CM.

Development Strategy

This issue will track our development strategy. Please contribute your thoughts by replying to this issue and keep the theme restricted to Development Strategy.

The goal of the software development in this 2-month project is:

create a toolkit which can process completely or partially "most" tables in the literature.
measure the effectiveness of the toolkit.

The project has the phases:

informal analysis ("eyeball") of the literature, primarily biomedical.
identification of the most important PubStyles, and the TableTypes in them.
develop prototype software that could analyse the most important PubStyles (prioritized by project members).
create a fixed corpus of articles from these PubStyles.
"develop" the software iteratively against this corpus for a fixed period/resources. "training"
create blindedly a validation suite of articles which the developer/s have not seen.
"measure" the software against the corpus.

This is similar to the train-test-validate cycle for machine-learning but differs since the "training" and "testing" are condensed into developer-driven enhancements. The final software is limited by developer time and the scope of the corpus. Validation therefore measures (a) the comprehensiveness of the corpus and (b) the effort and skill of the developer/s.

The intention is that on the last day of the project we can report:
"AMI table retrieved structure from xx% tables, content yy% with zz% character corruption". If the validation corpus is split it may be possible to estimate some error/variance.
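
A minimal sketch of the split-based estimate, assuming each table in the validation corpus is scored as a simple pass/fail for "structure retrieved". The splits, the outcomes and the function names are made up for illustration.

```python
from statistics import mean, stdev


def percent_retrieved(outcomes):
    """outcomes: list of booleans, one per table (True = structure retrieved)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def summarize_splits(splits):
    """Return (mean, sample standard deviation) of the per-split percentages."""
    scores = [percent_retrieved(outcomes) for outcomes in splits]
    return mean(scores), stdev(scores)


# Made-up outcomes for three hypothetical splits of the validation corpus.
splits = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
average, spread = summarize_splits(splits)
```

With only a handful of splits the spread is a rough indication rather than a confidence interval.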

Output CSV

output the LAST row of the header and bodies of tables as CSV
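
A sketch of this output step using Python's standard `csv` module. It assumes the header and body are already available as lists of rows of strings, which is an assumption about the intermediate data and not the AMI data model; the function name is hypothetical.

```python
import csv
import io


def table_to_csv(header_rows, body_rows):
    """Write the last header row followed by the body rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header_rows[-1])  # only the bottom header row is output
    writer.writerows(body_rows)
    return buffer.getvalue()


# Made-up table: the cell containing a comma is quoted by the csv module.
csv_text = table_to_csv(
    [["", "Treatment"], ["Variable", "Mean"]],
    [["Age, years", "42"]],
)
```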

Unicode

This issue will track our analyses of Unicode. Please contribute your thoughts by replying to this issue and keep the theme restricted to Unicode.

We attempt to convert to Unicode and normalize as soon as possible and downstream tools will assume all codepoints are normalized Unicode.

Table Types

This issue will track our analyses of table types (format rather than content). Please contribute your thoughts by replying to this issue and keep the theme restricted to Table Types.

Every new table type requires either bespoke software or adding generality to the existing software. See "issues" directory for examples.

APA-like
The American Psychological Association has created and mandated a style for tables which has been adopted by "many/most" PubStyles in the social and medical sciences.

It consists of up to 4 sections (our terminology: title, header, body, footer).
The sections are normally separated by horizontal rulers.
The title and footer are free-form.
The header contains (possibly nested) column names. When nested the hierarchy is indicated by smaller horizontal rulers as overbars for 2 or more subcolumns.
The body has rows, possibly separated by rulers, but often implicit. The content may be in gridded cells but is often separated only by whitespace. The body may have subtables of various sorts, indicated by indents in the row name, or by interleaved subtable names, or implicit.
Text that wraps has continuation rows, which may or may not be indicated by indents in the cells.
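
For body rows whose cells are separated only by whitespace, one simple approach is to split a row wherever the horizontal gap between neighbouring text chunks reaches a minimum distance, in the spirit of the cell-cell-distance-min value in the PubStyle example. The sketch below is an assumption about how that could work; the chunk format, the 20-unit default and the function name are illustrative.

```python
def split_row_into_cells(chunks, min_gap=20.0):
    """Group the text chunks of one row into cells by horizontal gaps.

    chunks: list of (x_min, x_max, text) for a single row, in any order.
    A gap smaller than min_gap is treated as a word space inside a cell.
    """
    cells = []
    previous_x_max = None
    for x_min, x_max, text in sorted(chunks):
        if previous_x_max is not None and x_min - previous_x_max < min_gap:
            cells[-1] = cells[-1] + " " + text  # same cell, next word
        else:
            cells.append(text)  # gap is wide enough: start a new cell
        previous_x_max = x_max
    return cells


# Made-up row: "Body" and "mass" are 4 units apart, the numbers much further.
row = [(10, 40, "Body"), (44, 70, "mass"), (120, 140, "23.1"), (200, 215, "0.4")]
cells = split_row_into_cells(row)
```

This works row by row and does not handle empty cells; aligning the gaps across all rows of the body would be needed for that.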

Gridded
This contains explicit vertical rulers to indicate cells. The header and footer may have different formatting. This is probably the default output from LaTeX or some Word tools.
Free Form
Some tables rely on whitespace and analysis of the content to indicate row, column and cell boundaries. They give lower metrics, and often humans cannot tell absolutely what the semantics are.

Include title and footer in table

The title and footer are created but not included in final table.

Development Corpus

ChrisH and UCL have agreed a "development corpus" (DevCorp) of 54 articles with over 20 PubStyles. PMR has deliberately had no part in selecting them but will develop AMI-Table against them. This corpus will not be used for formal metrics (precision/recall/accuracy) but can be used to measure overall project coverage and "success".

A "validation corpus" (ValCorp) from new articles conforming to the known PubStyles will be assembled by UCL/Chris; PMR is blinded to the contents. ValCorp will be used to measure the performance of AMI-Table in the final phase of the project. The basis of the metrics will be developed and agreed by all partners in the current phase of the project. AMI-Table will be packaged with instructions so the evaluation could be run by a third party.

Pipeline/stack build logbook

These are the various pieces of software needed in the stack. I am trying to independently build and run these (running should be no difficulty if jar file is provided). It seems like there are problems in the github POMs; pdf2svg and svg2xml do not build at all on my machine.

Build

euclid
a. No success building from Github clone; see errors here
b. Success when cloning from wwmm/euclid on Bitbucket
svg
a. No success building from Github clone; see errors here
b. Success when building from bitbucket clone from wwmm/svg on Bitbucket
html
a. No success building from Github clone; see errors here
b. Success when building from bitbucket clone from wwmm/html on Bitbucket
imageanalysis
a. No success building from Github clone; see errors here
b. Success when building from bitbucket clone from petermr/imageanalysis on Bitbucket
pdf2svg
a. No success building from Github clone; see errors here
b. Failure building from bitbucket clone from petermr/pdf2svg on Bitbucket; see errors here
svg2xml
a. No success building from Github clone; see errors here
b. Failure when building from bitbucket clone from petermr/svg2xml on Bitbucket; see errors here

JAR files download

This needs updating. @petermr: what jars do we need to run the pipeline? Not all, I figure, so please remove (and add) as you see fit.

euclid
svg
html
imageanalysis
pdf2svg
svg2xml

Font styles

This issue will track our analyses of font styles. Please contribute your thoughts by replying to this issue and keep the theme restricted to Font Styles.

The only styles we will process are "normal" and "italic". Please indicate:

synonyms such as "oblique"
shearing using affine transformations on characters to make them oblique
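
A sheared font can in principle be detected from the character's transformation matrix. The sketch below assumes the six-number SVG form matrix(a b c d e f), where x' = a*x + c*y + e and y' = b*x + d*y + f; the 5 degree threshold and the function name are hypothetical, and the sign of the angle is not interpreted because it depends on the direction of the y axis.

```python
import math


def looks_sheared(matrix, min_degrees=5.0, tolerance=1e-6):
    """Return True if an SVG-style matrix (a, b, c, d, e, f) shears along x.

    A pure horizontal shear (possibly with scaling) has b == 0 and c != 0;
    the slant angle is atan(c / d).
    """
    a, b, c, d, e, f = matrix
    if abs(b) > tolerance or abs(d) <= tolerance:
        return False  # rotated or degenerate matrix, not a simple shear
    return abs(math.degrees(math.atan(c / d))) >= min_degrees


# A 12 degree skew, as skewX(12) would produce.
oblique = (1.0, 0.0, math.tan(math.radians(12.0)), 1.0, 0.0, 0.0)
upright = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
```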

Rotated tables

enhance code to recognize rotated tables, rotate them, and analyze in normal fashion

Inappropriate codepoints

This issue will track our analyses of inappropriate codepoints. Please contribute your thoughts by replying to this issue and keep the theme restricted to Inappropriate Codepoints.

A common example is the use of "small caps". These will not by default be treated as equivalent to ASCII characters and will need normalizing.
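
Unicode normalization does not help with small caps code points, since they have no decomposition to ASCII letters, so an explicit mapping is needed. The sketch below derives one from the Unicode character names; the function name is hypothetical, and the choice to fold to upper case rather than lower case is an assumption.

```python
import unicodedata

PREFIX = "LATIN LETTER SMALL CAPITAL "


def fold_small_caps(text):
    """Replace small-capital letter code points with plain ASCII capitals.

    Only characters named "LATIN LETTER SMALL CAPITAL X", with X a single
    letter, are folded; everything else is passed through unchanged.
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        suffix = name[len(PREFIX):] if name.startswith(PREFIX) else ""
        out.append(suffix if len(suffix) == 1 else ch)
    return "".join(out)


# U+1D05, U+0274, U+1D00 are the small capitals D, N and A.
folded = fold_small_caps("\u1d05\u0274\u1d00")
```

Call `.lower()` on the result if the small caps stand for lower-case letters in a given PubStyle.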