

normami's Issues

implement --table

This is currently in forestPlot - is this the best place?

In any case, it needs implementing.

unite cephis and normami with mvn modules

cephis and normami are two separate repos and code bases.

They should be united as modules under a single Maven build.
This will also give an opportunity to clean out old material that is bloating the system.

project issues 20190624

Issues for attention:

  • technical updates (PMR and MD)
    • PMR HTML display of results, segmentation
    • MD Docker
  • missing components:
    • non-simple plots (Stata and SPSS)
    • extraction of tables
  • extraction of numbers
  • test corpus (clean room approach)
  • regression issues (gocr) and how to identify/report
  • building and running ami under mvn (CG)
  • identification of extraction errors (error-detection)
  • missing fields (esp Tesseract)
  • character garbles (esp gocr)
  • documentation of current system (user level)
  • metrics (including automatic creation)
  • future developments (PMR)
    • extraction of content as words and phrases
    • segmentation (SPSS)
    • creation of tables
    • automatic error detection, T/G discrepancies (see the sketch after this list)
    • error correction (gocr characters)
  • future (MD)
  • future (CG)
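
On the "automatic error detection, T/G discrepancies" item: T/G is read here as Tesseract versus GOCR, since both transcripts are produced for each subimage. Below is a minimal, hypothetical sketch (plain JDK, standalone, not part of the current codebase) of flagging lines on which the two OCR outputs disagree.

// Hypothetical Tesseract-vs-GOCR comparison; not part of the current codebase.
import java.util.Arrays;
import java.util.List;

public class OcrDiscrepancyDemo {

    /** Normalise whitespace so that trivial spacing differences are not reported. */
    static String normalise(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    /** Print every line index at which the two transcripts disagree. */
    static void reportDiscrepancies(List<String> tesseract, List<String> gocr) {
        int n = Math.min(tesseract.size(), gocr.size());
        for (int i = 0; i < n; i++) {
            String t = normalise(tesseract.get(i));
            String g = normalise(gocr.get(i));
            if (!t.equals(g)) {
                System.out.println("line " + i + ": tesseract='" + t + "' gocr='" + g + "'");
            }
        }
        if (tesseract.size() != gocr.size()) {
            System.out.println("line counts differ: " + tesseract.size() + " vs " + gocr.size());
        }
    }

    public static void main(String[] args) {
        reportDiscrepancies(
            Arrays.asList("Smith 2004", "1.04 (0.92, 1.17)"),
            Arrays.asList("Smith 2004", "1.O4 (0.92, 1.17)")); // gocr reads digit 0 as letter O
    }
}

Anything flagged in this way could feed the "identification of extraction errors" item above.
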

finalising `ami` output for forest plots

ami now carries out most of the required tasks, and it is my intention to prototype and test the full functionality in the next few days.
The result of running normami will be a large CTree and a set of HTML and CSV files that can be re-used. The missing functionality includes:

  • develop TableExtractor to identify table structure
  • TableExtractor should unify hocr and gocr output to a canonical table format.
  • TableExtractor will attempt to unify the cell content, according to a schema.
  • TableExtractor will apply simple heuristics to detect errors and add @class-based annotation (a sketch of one such heuristic follows after this list)
  • TableExtractor will emit CSV or HTML files for the various components of a plot (i.e. possibly several files)
  • Develop GraphExtractor to extract SVGLines from body.graphs
  • Develop ScaleExtractor to extract numeric scales
  • apply the results of GraphExtractor and ScaleExtractor to convert to a CSV with user coordinates.
  • synchronise tables and graphs to determine consistency of horizontal content lines
  • provide an aggregate view of gocr, hocr and graph values.
  • extract and parse summary data in tables (e.g. Overall P values).
  • allow parameterisation of hocr and gocr, as far as I understand it (e.g. to prepare argument lists with whitelists). However, both programs are very poorly documented and fragile, so I shall not research this; I may open Issues showing the possible tasks.
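
As an illustration of the "simple heuristics" item above, here is a hypothetical, standalone sketch (plain JDK, not the planned TableExtractor API): a cell that should hold an effect size with a confidence interval is checked against a pattern, and any failure is flagged for @class-based annotation.

// Hypothetical cell-level heuristic; the real TableExtractor may differ.
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CellHeuristicDemo {

    // e.g. "1.04 (0.92, 1.17)": an estimate followed by a (low, high) interval
    private static final Pattern ESTIMATE_WITH_CI =
        Pattern.compile("-?\\d+(\\.\\d+)?\\s*\\(\\s*-?\\d+(\\.\\d+)?\\s*,\\s*-?\\d+(\\.\\d+)?\\s*\\)");

    /** Return a class-like annotation for the cell: "ok" or "suspect". */
    static String annotate(String cell) {
        return ESTIMATE_WITH_CI.matcher(cell.trim()).matches() ? "ok" : "suspect";
    }

    public static void main(String[] args) {
        // "1.O4" shows a typical OCR garble (letter O for digit 0) that the check catches
        List<String> cells = Arrays.asList("1.04 (0.92, 1.17)", "1.O4 (0.92, 1.17)");
        for (String cell : cells) {
            System.out.println(cell + " -> " + annotate(cell));
        }
    }
}
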

This data should then be sufficient for repurposing for clients.

PMR output will be CSV and HTML that try to replicate what is visible on the screen, with some indications of reliability.

== What PMR will not currently do ==

  • domain-specific analysis of results.
  • customisation of use
  • client-facing documentation
  • refinement of image analysis parameters
  • creation of corpora
  • develop JS, containers, servers for this project
  • implement software on client site.
  • respond to alternative corpora.
  • write a clean facility for normami (there is a lot of potential output from a run, especially when different parameters are being used).

== What PMR will do ==

  • attempt to fix runtime bugs
  • mentor CG and MD on how to run programs

GOCR not working?

With the latest code (as of 23/6/2019) I can't get GOCR to work, on either macOS or a Debian Linux-based Maven container image.

43  [main] ERROR org.contentmine.ami.tools.AMIOCRTool  - Cannot run GOCR
java.lang.NullPointerException
	at org.contentmine.norma.image.ocr.GOCRConverter.runGOCR(GOCRConverter.java:425)
	at org.contentmine.ami.tools.AMIOCRTool.runOCR(AMIOCRTool.java:246)
	at org.contentmine.ami.tools.AMIOCRTool.processImageDir(AMIOCRTool.java:421)
	at org.contentmine.ami.tools.ImageDirProcessor.processImageDir(ImageDirProcessor.java:91)
	at org.contentmine.ami.tools.ImageDirProcessor.processImageDirs(ImageDirProcessor.java:60)
	at org.contentmine.ami.tools.AMIOCRTool.processTree(AMIOCRTool.java:218)
	at org.contentmine.ami.tools.AbstractAMITool.processTrees(AbstractAMITool.java:579)
	at org.contentmine.ami.tools.AMIOCRTool.runSpecifics(AMIOCRTool.java:210)
	at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:258)
	at org.contentmine.ami.tools.AMIOCRTool.main(AMIOCRTool.java:188)
>image.6.1.96_553/raw>
>skip gocr>raw.png

This was working earlier in the week for me, so I assume it's related to a recent change @petermr ?
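
The trace dies inside GOCRConverter.runGOCR, which suggests that something resolved at that point (most likely the gocr executable path, or an expected output file) is null. As a diagnostic aid only, here is a hypothetical pre-flight check, plain JDK and not the actual GOCRConverter code, that turns the silent null into a readable message.

// Hypothetical pre-flight check before invoking gocr; not the actual GOCRConverter code.
import java.io.File;

public class GocrPreflight {

    /** Fail early with a clear message instead of a downstream NullPointerException. */
    static File requireExecutable(String path) {
        if (path == null) {
            throw new IllegalStateException("gocr path is null (was --gocr supplied?)");
        }
        File exe = new File(path);
        if (!exe.isFile() || !exe.canExecute()) {
            throw new IllegalStateException("gocr not found or not executable: " + path);
        }
        return exe;
    }

    public static void main(String[] args) {
        // e.g. java GocrPreflight /usr/local/bin/gocr
        File gocr = requireExecutable(args.length > 0 ? args[0] : null);
        System.out.println("gocr looks runnable: " + gocr);
    }
}
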

Test ami-stack script

The current ami-stack test looks like:

(please comment on possible enhancements - these might include passing parameters)

#! /bin/sh

# your path should include the /bin directory of the appassembler distrib, e.g.
# ami-forestplot => /Users/pm286/workspace/cmdev/normami/target/appassembler/bin/ami-forestplot

# edit this to your own directory
STATA_TOTAL="/Users/pm286/projects/forestplots/stataforestplots"

# directories
STATA_TOTAL_EDITED_DIR=$STATA_TOTAL/stataTotalEdited
CPROJECT=$STATA_TOTAL_EDITED_DIR
CTREE=$CPROJECT/PMC5882397

# choose the first SOURCE to run a single CTree, the second to run a CProject (long). 
# Comment in the one you want
SOURCE="-t $CTREE"
# SOURCE="-p $CPROJECT"
echo $CTREE
ls $CTREE

# images 
RAW=raw
RAW230DS=raw_thr_230_ds
RAWS4230DS=raw_s4_thr_230_ds
#subimages

# regions of image
HEADER=header
BODY=body
LTABLE=ltable
RTABLE=rtable
SCALE=scale

HEADERS120D=${HEADER}"_s4_thr_120_ds"
LTABLES120D=${LTABLE}"_s4_thr_120_ds"
RTABLES120D=${RTABLE}"_s4_thr_120_ds"

SLEEP1=1
SLEEP5=5

# make project from a directory (CPROJECT) containing PDFs. 
# a no-op here as EuPMC has already done this

ami-makeproject -p $CPROJECT --rawfiletypes pdf

# convert PDFs to CTrees

ami-pdf $SOURCE

# image processing at 3 threshold levels (later will try to make this an AMI loop)

ami-image $SOURCE --sharpen sharpen4 --threshold 150 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 230 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 240 --despeckle true

echo "===============Finished AmiImage============="
sleep $SLEEP1

# run OCR both types

ami-ocr $SOURCE --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --tesseract /usr/local/bin/tesseract --extractlines hocr --html false

echo "===============Finished AmiOcr============="
sleep $SLEEP1

# extract the pixels and project onto axes to get subimage regions
# further project the scale subimage (y(2)) to get the tick values 
# in this case do it for the threshold 230 version only
# the stylesheet location (xsl) is hard coded into the distrib but it could be 
# more general.
# This *generates* raw_thr_230_ds/template.xml. Its variables (e.g. $RAW.$HEADER) are specified 
# in the stylesheet and values computed from applying ami-pixel to the images

ami-pixel $SOURCE --projections --yprojection 0.8 --xprojection 0.5 \
                --minheight -1 --rings -1 --islands 0 \
			    --inputname $RAW230DS \
			    --subimage statascale y 2 delta 10 projection x \
			    --templateinput $RAW230DS/projections.xml \
			    --templateoutput template.xml \
			    --templatexsl /org/contentmine/ami/tools/stataTemplate.xsl

echo "===============Finished AmiPixel============="
sleep $SLEEP5

# use the generated template.xml in each CTree/*/image*/raw_thr_230_ds/ directory to segment the image
# this will create subimages $RAW.$HEADER, $RAW.$BODY.$LTABLE, raw.body.graph, $RAW.$BODY.$RTABLE and raw.scale
# these subimages will be written to *.png in the CTree/*/image* directory
			    
ami-forestplot $SOURCE --template $RAW230DS/template.xml

echo "===============Finished AmiForest============="
sleep $SLEEP5

#now re-run ami-image to enhance each subimage separately

ami-image $SOURCE --inputname $RAW.$HEADER --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$LTABLE --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$RTABLE --sharpen sharpen4 --threshold 120 --despeckle true

echo "===============Finished Sharpen Threshold============="
sleep $SLEEP5

# and rerun tesseract on each subimage (suspect Tesseract gets confused by the whole
# image, including the graph and lines).

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr

echo "===============Finished Tesseract ============="
sleep $SLEEP5

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr

echo "===============Finished GOCR ============="
sleep $SLEEP5

projections.xml and sspsTemplate.xsl do not match

In sspsTemplate.xsl we are looking for XML of the form:

 <horizontallines>
  <g xmlns="http://www.w3.org/2000/svg">
   <line x1="0.0" y1="43.0" x2="1277.0" y2="43.0" style="stroke:black;stroke-width:1.0;"/>
  </g>
  <g xmlns="http://www.w3.org/2000/svg">
   <line x1="782.0" y1="413.0" x2="1262.0" y2="413.0" style="stroke:black;stroke-width:1.0;"/>
  </g>
 </horizontallines>

But in the generated projections.xml we have:

 <g class="horizontallines" xmlns="http://www.w3.org/2000/svg">
  <line x1="0.0" y1="33.0" x2="1266.0" y2="33.0" style="stroke:red;stroke-width:2.0;"/>
  <line x1="758.0" y1="310.0" x2="1252.0" y2="310.0" style="stroke:red;stroke-width:2.0;"/>
 </g>

As a result the ami-forestplot --segment stage fails, as there are missing border values.
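
To make the mismatch concrete, here is a standalone sketch (plain JDK; the XPath strings are illustrative, not copied from the stylesheet) that evaluates both element paths against the generated fragment: the nesting the stylesheet expects matches nothing, while a path keyed on g[@class='horizontallines'] finds both lines.

// Illustrative comparison of the expected and generated horizontallines structures.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HorizontalLinesMismatch {
    public static void main(String[] args) throws Exception {
        // the fragment that ami-pixel currently writes to projections.xml
        String generated =
              "<g class='horizontallines' xmlns='http://www.w3.org/2000/svg'>"
            + "<line x1='0.0' y1='33.0' x2='1266.0' y2='33.0'/>"
            + "<line x1='758.0' y1='310.0' x2='1252.0' y2='310.0'/>"
            + "</g>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(generated)));

        XPathFactory xpf = XPathFactory.newInstance();
        // nesting implied by the stylesheet: horizontallines/g/line -- matches nothing here
        NodeList expected = (NodeList) xpf.newXPath().evaluate(
            "//*[local-name()='horizontallines']/*[local-name()='g']/*[local-name()='line']",
            doc, XPathConstants.NODESET);
        // structure actually emitted: g[@class='horizontallines']/line -- matches both lines
        NodeList actual = (NodeList) xpf.newXPath().evaluate(
            "//*[local-name()='g'][@class='horizontallines']/*[local-name()='line']",
            doc, XPathConstants.NODESET);
        System.out.println("matches with expected nesting:  " + expected.getLength()); // 0
        System.out.println("matches with generated nesting: " + actual.getLength());   // 2
    }
}

Presumably either the stylesheet or the projections.xml writer needs to change so that the two agree.
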

AMI search includes trailing punctuation

AMI search (search with dictionaries) includes trailing punctuation in its hits. Thus for the target text

The transport, which was rapid

indexing against the dictionary term transport would return

transport,

Remove all trailing punctuation.
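
A minimal sketch of the requested normalisation (plain JDK; stripTrailingPunctuation is a hypothetical helper, not the actual AMI search code):

// Hypothetical normalisation of dictionary hits; not the actual AMI search code.
public class TrailingPunctuationDemo {

    /** Remove trailing punctuation (commas, full stops, semicolons, ...) from a hit. */
    static String stripTrailingPunctuation(String hit) {
        return hit.replaceAll("\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        String target = "The transport, which was rapid";
        String term = "transport";
        for (String token : target.split("\\s+")) {
            if (stripTrailingPunctuation(token).equalsIgnoreCase(term)) {
                System.out.println(stripTrailingPunctuation(token)); // prints "transport", not "transport,"
            }
        }
    }
}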
