contentmine / cm-ucl
A repository to openly track progress on table extraction.
License: Apache License 2.0
This issue will track our metrics. Please contribute your thoughts by replying to this issue and keep the theme restricted to Metrics.
Our intention is to assess the achievement of this project using blinded testing, in which the methods and corpus of the final evaluation are kept secret from the developers.
There are several metrics which can be used. For some we can use the standard "recall + precision", others may use "accuracy", and yet others a "Likert-like" scale (L).
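As a starting point for discussion, here is a minimal sketch of how "recall + precision" could be computed for one table. It assumes (this is not agreed anywhere above) that a table is compared as a set of (row, column, text) cells; the function name and the sample cells are hypothetical.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of table cells."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    # Guard against empty inputs so that an empty table scores 0.0
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical cells: one extracted cell has a corrupted character ("P" for "F")
expected = {(0, 0, "Age"), (0, 1, "54"), (1, 0, "Sex"), (1, 1, "F")}
extracted = {(0, 0, "Age"), (0, 1, "54"), (1, 0, "Sex"), (1, 1, "P")}
precision, recall = precision_recall(extracted, expected)
```

Comparing whole cells is strict: a single wrong character counts as both a false positive and a false negative, which may or may not be what we want for the character corruption metric.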
This issue will track our development of Table Types. Please contribute your thoughts by replying to this issue and keep the theme restricted to Development of Table Types.
For each PubStyle we will assign a TableType. This will be iterative as we discover (a) the totality of types and (b) consistency within a PubStyle.
The actual PubStyles should be included in a subdirectory of PubStyles, and this issue should be used to discuss problems in assigning PubStyles.
Example
(This is NOT final!)
pubstyle: BMJ
style: APA-like
separator: full hruler
title:
  title-regex: "Title \\d+\\s"
  title-weight: bold
  title-wrap: yes
  title-wrap-indent: none
title-caption:
  caption-weight: bold
header:
  header-weight: bold
  header-separator: spanning-hruler
  header-wrap: yes
  header-wrap-indent: 8 px
body:
  weight: normal
  row-separator: none
body-row:
  cell-separator: whitespace
  cell-cell-distance-min: 20 px
subtable:
  indent: yes
and so on. Suggest that we can find some CSS terms for this.
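To make the discussion concrete, this is a rough sketch of how software might consume a few fields of the (non-final) example above. The dictionary layout and the `is_title` helper are hypothetical, not part of AMI; the regex is copied from `title-regex` with the doubled backslashes unescaped.

```python
import re

# Hypothetical in-memory form of part of the BMJ example (not a real AMI structure)
pubstyle = {
    "pubstyle": "BMJ",
    "title": {"title-regex": r"Title \d+\s", "title-weight": "bold"},
    "body-row": {"cell-cell-distance-min": 20},  # px
}


def is_title(text, weight, style=pubstyle):
    """True if a text chunk has the weight and the leading pattern of a table title."""
    title = style["title"]
    has_weight = weight == title["title-weight"]
    has_pattern = re.match(title["title-regex"], text) is not None
    return has_weight and has_pattern
```

For example, `is_title("Title 1 Baseline characteristics", "bold")` is true, while the same text in normal weight is rejected.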
The final table only includes the bottom row of the header. Develop a fuller header with 2 or more rows as appropriate. (Already analyzed in annot.svg.)
This issue will track our analyses of wrong characters. Please contribute your thoughts by replying to this issue and keep the theme restricted to Wrong Characters.
Many typesetters use the wrong code point for the character, such as using "em-dash" for "minus". This can lead to severe corruption of numeric data. The only approach is to survey the likely misuses and create heuristics to "correct" them.
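One possible heuristic, sketched here under my own assumptions rather than taken from AMI: treat any dash-like character that directly precedes a number, and does not follow a word character, as a minus sign. The pattern, the choice of ASCII hyphen-minus as the target, and the function name are all illustrative.

```python
import re

# Dash-like code points U+2010..U+2015 (hyphen, figure dash, en dash, em dash, ...)
# plus U+2212 (the true minus sign), matched only directly before a number and
# not after a word character, "." or ",". Ranges such as "10-20" are left alone.
DASH_BEFORE_NUMBER = re.compile(r"(?<![\w.,])[\u2010-\u2015\u2212](?=\.?\d)")


def fix_minus(text):
    """Replace dash-like characters used as a minus sign by ASCII hyphen-minus."""
    return DASH_BEFORE_NUMBER.sub("-", text)


# An em dash typeset in place of a minus sign
value = float(fix_minus("\u20140.35"))
```

ASCII hyphen-minus is chosen so that the result can be parsed directly as a number. The heuristic will still misfire in some cases (for example a dash used as a bullet before a number), so a survey of real misuses is needed before fixing the rules.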
SVG should be created using AMI-pdf2svg (see https://bitbucket.org/petermr/pdf2svg/wiki/Home ). This ensures styles, weights, and legacy character conversion.
PMR has run this (2017-01-29) and created all necessary SVG (one per page) in CProject oa-corpus-pmr. The SVG files are in the CTree subdirectory svg/. The non-empty images have also been extracted into png/
I have started manual extraction of tables into svg/table%d.svg. Others can also assist in this and commit the results.
There are several "PDF2SVG" converters running on different platforms (Java, Python, C(++)). Although the format is SVG there are many ways that it could be structured. We have used 2 which "run on all platforms":
PDF2SVG (AMI) Java https://bitbucket.org/petermr/pdf2svg/wiki/Home . This was based on PDFBox 1.8 (https://pdfbox.apache.org/) which has a very thorough toolchain for extracting PDF.
This is the default which will be used for this project. It runs from the command line but is not yet packaged as an uber-jar.
We plan to move to PDFBox 2.0.4 but not during the CM-UCL project.
PDF2SVG (http://www.cityinthesky.co.uk/opensource/pdf2svg/). This wraps some existing libraries. It is (somewhat) easier to install than AMI-PDF2SVG and has a more compact output. However, it has not been tested for producing SVG2XML input and will not be used for production.
PDF2SVG only needs to be run once (and has been). The tables have been extracted by hand from both corpora.
We need executable jars for
This issue will track our analyses of character normalization. Please contribute your thoughts by replying to this issue and keep the theme restricted to Character Normalization.
Catalogue the semantics of split tables ("continuation") and devise a structure which accommodates "most" of them.
This issue will track our analyses of character streams. Please contribute your thoughts by replying to this issue and keep the theme restricted to Character Streams.
Tables consist of characters (letters, digits, punctuation, symbols, etc.) and graphics (lines, rectangles, etc.). Ideally the character stream should consist of Unicode characters, but many PubStyles have legacy fonts (with no open documentation) which do not indicate the code point. This is a large source of information loss and corruption. In many cases we can guess the code points for legacy fonts with high reliability and ContentMine has many conversion tables. The most common problems are legacy symbol fonts (e.g. used in LaTeX and Word).
Known issues include:
Please add your observations of character stream issues here.
This issue will track our analyses of ligatures. Please contribute your thoughts by replying to this issue and keep the theme restricted to Ligatures.
It would be nice to present the results of this project at csvconf in Portland this May. The conference is all about data; "For those who love data" says the site. @blahah spoke at csvconf2016, so he might be able to share some thoughts. Proposals can be submitted here until Feb 15, 2017. Talks last ~25 minutes, so a 20 minute talk/demo could fit nicely here.
If we want to submit, these are some of the things off the top of my head:
ami-table process would already be great, making people aware and increasing usage/contribs, maybe in demo form showing the entire process for one table)
Cheers
Headers are left-aligned when they should be aligned with body columns
This issue will track our analyses of font weights. Please contribute your thoughts by replying to this issue and keep the theme restricted to Font Weights.
The only weights we can process are "normal" and "bold". Please indicate where PubStyles use other approaches such as:
This issue will track our analyses of legacy fonts. Please contribute your thoughts by replying to this issue and keep the theme restricted to Legacy Fonts.
Record issues where the original document used (some) characters with code points which were not Unicode. The commonest are Publisher-specific fonts (e.g. Elsevier) or LaTeX such as CM.
This issue will track our development strategy. Please contribute your thoughts by replying to this issue and keep the theme restricted to Development Strategy.
The goal of the software development in this 2-month project is:
The project has the phases:
This is similar to the train-test-validate cycle for machine learning but differs since the "training" and "testing" are condensed into developer-driven enhancements. The final software is limited by developer time and the scope of the corpus. Validation therefore measures (a) the comprehensiveness of the corpus and (b) the effort and skill of the developer/s.
The intention is that on the last day of the project we can report:
"AMI table retrieved structure from xx% tables, content yy% with zz% character corruption". If the validation corpus is split it may be possible to estimate some error/variance.
output the LAST row of the header and bodies of tables as CSV
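A minimal sketch of what this output step could look like, assuming the header and body are already available as lists of rows of cell strings. The function name and sample data are hypothetical; this is not the AMI implementation.

```python
import csv
import io


def table_to_csv(header_rows, body_rows):
    """Write only the last header row, followed by the body rows, as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header_rows[-1])  # upper header rows are dropped for now
    writer.writerows(body_rows)
    return buffer.getvalue()


# Hypothetical two-row header; only the bottom row reaches the CSV
header_rows = [["", "Treatment", "Treatment"], ["Group", "n", "Mean, SD"]]
body_rows = [["A", "12", "3.4, 0.2"], ["B", "15", "2.9, 0.3"]]
csv_text = table_to_csv(header_rows, body_rows)
```

Using the standard `csv` writer means cells that contain commas are quoted automatically, which matters for cells such as "3.4, 0.2".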
This issue will track our analyses of Unicode. Please contribute your thoughts by replying to this issue and keep the theme restricted to Unicode.
We attempt to convert to Unicode and normalize as soon as possible and downstream tools will assume all codepoints are normalized Unicode.
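Which normalization form is meant is not stated here. As an illustration only, the sketch below assumes NFKC (compatibility composition), using Python's standard `unicodedata` module; the function name is hypothetical.

```python
import unicodedata


def normalize_text(text):
    """Normalize to NFKC: composes accents and folds compatibility characters."""
    return unicodedata.normalize("NFKC", text)


# U+FB01 (the "fi" ligature) is folded to the two letters "f" and "i"
word = normalize_text("\ufb01gure")
# "e" followed by a combining acute accent is composed into U+00E9
accented = normalize_text("e\u0301")
# U+00B5 (micro sign) is folded to U+03BC (Greek small letter mu)
unit = normalize_text("\u00b5g")
```

NFKC is lossy: it also folds superscript digits into plain digits, so "m" followed by a superscript two becomes "m2". If that is unacceptable for units, NFC plus a project-specific table may be a better fit.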
This issue will track our analyses of table types (format rather than content). Please contribute your thoughts by replying to this issue and keep the theme restricted to Table Types.
Every new table type requires either bespoke software or adding generality to the existing software. See "issues" directory for examples.
Gridded
This contains explicit vertical rulers to indicate cells. The header and footer may have different formatting. This is probably the default output from LaTeX or some Word tools.
Free Form
Some tables rely on whitespace and analysis of the content to indicate row, column and cell boundaries. They give lower metrics, and often humans cannot tell absolutely what the semantics are.
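A rough sketch of the whitespace approach for a single row, assuming each text chunk comes with its horizontal extent in pixels. The 20 px threshold is borrowed from `cell-cell-distance-min` in the PubStyle example; the function name, tuple layout, and sample row are hypothetical.

```python
def split_into_cells(chunks, min_gap=20.0):
    """Group (x_start, x_end, text) chunks of one row into cells.

    A new cell starts when the horizontal gap to the previous chunk is at
    least min_gap; otherwise the chunk is appended to the current cell.
    """
    cells = []
    prev_end = None
    for x_start, x_end, text in sorted(chunks):
        if prev_end is None or x_start - prev_end >= min_gap:
            cells.append(text)
        else:
            cells[-1] += " " + text
        prev_end = x_end
    return cells


# Hypothetical row: "Mean" and "age" are 4 px apart, the numbers are far apart
row = [(10, 40, "Mean"), (44, 60, "age"), (95, 110, "54.2"), (150, 165, "12.1")]
cells = split_into_cells(row)
```

A per-row threshold like this cannot detect empty cells or align columns across rows, which is one reason free-form tables give lower metrics.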
The title and footer are created but not included in final table.
ChrisH and UCL have agreed a "development corpus" (DevCorp) of 54 articles with over 20 PubStyles. PMR has deliberately had no part in selecting them but will develop AMI-Table against them. This corpus will not be used for formal metrics (precision/recall/accuracy) but can be used to measure overall project coverage and "success".
A "validation corpus" (ValCorp) from new articles conforming to the known PubStyles will be assembled by UCL/Chris; PMR is blinded to the contents. ValCorp will be used to measure the performance of AMI-Table in the final phase of the project. The basis of the metrics will be developed and agreed by all partners in the current phase of the project. AMI-Table will be packaged with instructions so the evaluation could be run by a third party.
These are the various pieces of software needed in the stack. I am trying to independently build and run these (running should be no difficulty if a jar file is provided). It seems like there are problems in the GitHub POMs; pdf2svg and svg2xml do not build at all on my machine.
euclid
a. No success building from GitHub clone; see errors here
b. Success when cloning from wwmm/euclid on Bitbucket
svg
a. No success building from GitHub clone; see errors here
b. Success when building from Bitbucket clone from wwmm/svg on Bitbucket
html
a. No success building from GitHub clone; see errors here
b. Success when building from Bitbucket clone from wwmm/html on Bitbucket
imageanalysis
a. No success building from GitHub clone; see errors here
b. Success when building from Bitbucket clone from petermr/imageanalysis on Bitbucket
pdf2svg
a. No success building from GitHub clone; see errors here
b. Failure building from Bitbucket clone from petermr/pdf2svg on Bitbucket; see errors here
svg2xml
a. No success building from GitHub clone; see errors here
b. Failure when building from Bitbucket clone from petermr/svg2xml on Bitbucket; see errors here
This needs updating. @petermr: what jars do we need to run the pipeline? Not all, I figure, so please remove (and add) as you see fit.
euclid
svg
html
imageanalysis
pdf2svg
svg2xml
This issue will track our analyses of font styles. Please contribute your thoughts by replying to this issue and keep the theme restricted to Font Styles.
The only styles we will process are "normal" and "italic". Please indicate:
enhance code to recognize rotated tables, rotate them, and analyze in normal fashion
This issue will track our analyses of inappropriate codepoints. Please contribute your thoughts by replying to this issue and keep the theme restricted to Inappropriate Codepoints.
A common example is the use of "small caps". These will not by default be treated as equivalent to ASCII characters and will need normalizing.
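One way such a normalization table could be built, sketched here as an assumption rather than a description of ContentMine's conversion tables: derive the mapping from the Unicode character names, so that every "LATIN LETTER SMALL CAPITAL X" code point maps to the ASCII letter X. Mapping to upper case (rather than lower case) is my own choice and is open to discussion.

```python
import re
import unicodedata

# Matches names such as "LATIN LETTER SMALL CAPITAL A" (single letters only,
# so "SMALL CAPITAL AE" or "SMALL CAPITAL INVERTED R" are skipped)
_SMALL_CAP_NAME = re.compile(r"LATIN LETTER SMALL CAPITAL ([A-Z])$")


def build_small_caps_table():
    """Map small-capital code points to ASCII capitals using Unicode names."""
    table = {}
    # The range covers the IPA, Phonetic Extensions and Latin Extended-D blocks
    for code_point in range(0x0250, 0xA800):
        match = _SMALL_CAP_NAME.match(unicodedata.name(chr(code_point), ""))
        if match:
            table[code_point] = match.group(1)
    return table


SMALL_CAPS = build_small_caps_table()


def normalize_small_caps(text):
    """Replace small-capital letters by their ASCII capital equivalents."""
    return text.translate(SMALL_CAPS)


# "TABLE 1" typeset with small-capital code points
label = normalize_small_caps("\u1d1b\u1d00\u0299\u029f\u1d07 1")
```

Deriving the table from character names avoids hand-typing code points, but it only covers small capitals that exist as separate Unicode characters. Small caps produced by font switching would need the font information instead.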