

normami's Issues

implement --table

This is currently in forestPlot - is this the best place?

In any case, it needs implementing.

unite cephis and normami with mvn modules

cephis and normami are two separate repos and code bases.

They should be united as modules under a single Maven build.
This will also give an opportunity to clean out old material that is bloating the system.

project issues 20190624

Issues for attention:

  • technical updates (PMR and MD)
    • PMR HTML display of results, segmentation
    • MD Docker
  • missing components:
    • non-simple plots (Stata and SPSS)
    • extraction of tables
  • extraction of numbers
  • test corpus (clean room approach)
  • regression issues (gocr) and how to identify/report
  • building and running ami under mvn (CG)
  • identification of extraction errors (error-detection)
  • missing fields (esp Tesseract)
  • character garbles (esp gocr)
  • documentation of current system (user level)
  • metrics (including automatic creation)
  • future developments (PMR)
    • extraction of content as words and phrases
    • segmentation (SPSS)
    • creation of tables
    • automatic error detection, T/G discrepancies (see the sketch after this list)
    • error correction (gocr characters)
  • future (MD)
  • future (CG)
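
On the "automatic error detection, T/G discrepancies" item: T/G is read here as Tesseract versus GOCR, since both transcripts are produced for each subimage. Below is a minimal, hypothetical sketch (plain JDK, standalone, not part of the current codebase) of flagging lines on which the two OCR outputs disagree.

// Hypothetical Tesseract-vs-GOCR comparison; not part of the current codebase.
import java.util.Arrays;
import java.util.List;

public class OcrDiscrepancyDemo {

    /** Normalise whitespace so that trivial spacing differences are not reported. */
    static String normalise(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    /** Print every line index at which the two transcripts disagree. */
    static void reportDiscrepancies(List<String> tesseract, List<String> gocr) {
        int n = Math.min(tesseract.size(), gocr.size());
        for (int i = 0; i < n; i++) {
            String t = normalise(tesseract.get(i));
            String g = normalise(gocr.get(i));
            if (!t.equals(g)) {
                System.out.println("line " + i + ": tesseract='" + t + "' gocr='" + g + "'");
            }
        }
        if (tesseract.size() != gocr.size()) {
            System.out.println("line counts differ: " + tesseract.size() + " vs " + gocr.size());
        }
    }

    public static void main(String[] args) {
        reportDiscrepancies(
            Arrays.asList("Smith 2004", "1.04 (0.92, 1.17)"),
            Arrays.asList("Smith 2004", "1.O4 (0.92, 1.17)")); // gocr reads digit 0 as letter O
    }
}

Anything flagged in this way could feed the "identification of extraction errors" item above.
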

finalising `ami` output for forest plots

ami now carries out most of the required tasks, and it is my intention to prototype and test the full functionality in the next few days.
The result of running normami will be a large CTree and a set of HTML and CSV files that can be re-used. The missing functionality includes:

  • develop TableExtractor to identify table structure
  • TableExtractor should unify hocr and gocr output to a canonical table format.
  • TableExtractor will attempt to unify the cell content, according to a schema.
  • TableExtractor will apply simple heuristics to detect errors and add @class-based annotation (a sketch of one such heuristic follows after this list)
  • TableExtractor will emit CSV or HTML files for the various components of a plot (i.e. possibly several files)
  • Develop GraphExtractor to extract SVGLines from body.graphs
  • Develop ScaleExtractor to extract numeric scales
  • apply the results of GraphExtractor and ScaleExtractor to convert to a CSV with user coordinates.
  • synchronise tables and graphs to determine consistency of horizontal content lines
  • provide an aggregate view of gocr, hocr and graph values.
  • extract and parse summary data in tables (e.g. Overall P values).
  • allow parameterisation of hocr and gocr, as far as I understand it (e.g. to prepare argument lists with whitelists). However, both programs are very poorly documented and fragile, so I shall not research this; I may open Issues showing the possible tasks.
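
As an illustration of the "simple heuristics" item above, here is a hypothetical, standalone sketch (plain JDK, not the planned TableExtractor API): a cell that should hold an effect size with a confidence interval is checked against a pattern, and any failure is flagged for @class-based annotation.

// Hypothetical cell-level heuristic; the real TableExtractor may differ.
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CellHeuristicDemo {

    // e.g. "1.04 (0.92, 1.17)": an estimate followed by a (low, high) interval
    private static final Pattern ESTIMATE_WITH_CI =
        Pattern.compile("-?\\d+(\\.\\d+)?\\s*\\(\\s*-?\\d+(\\.\\d+)?\\s*,\\s*-?\\d+(\\.\\d+)?\\s*\\)");

    /** Return a class-like annotation for the cell: "ok" or "suspect". */
    static String annotate(String cell) {
        return ESTIMATE_WITH_CI.matcher(cell.trim()).matches() ? "ok" : "suspect";
    }

    public static void main(String[] args) {
        // "1.O4" shows a typical OCR garble (letter O for digit 0) that the check catches
        List<String> cells = Arrays.asList("1.04 (0.92, 1.17)", "1.O4 (0.92, 1.17)");
        for (String cell : cells) {
            System.out.println(cell + " -> " + annotate(cell));
        }
    }
}
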

This data should then be sufficient for repurposing for clients.

PMR output will be CSV and HTML that try to replicate what is visible on the screen, with some indications of reliability.

== What PMR will not currently do ==

  • domain-specific analysis of results.
  • customisation of use
  • client-facing documentation
  • refinement of image analysis parameters
  • creation of corpora
  • develop JS, containers, servers for this project
  • implement software on client site.
  • respond to alternative corpora.
  • write a clean facility for normami (there is a lot of potential output from a run, especially when different parameters are being used).

== What PMR will do ==

  • attempt to fix runtime bugs
  • mentor CG and MD on how to run programs

GOCR not working?

With the latest code (as of 23/6/2019) I can't get GOCR to work, on either macOS or a Debian Linux-based Maven container image.

43  [main] ERROR org.contentmine.ami.tools.AMIOCRTool  - Cannot run GOCR
java.lang.NullPointerException
	at org.contentmine.norma.image.ocr.GOCRConverter.runGOCR(GOCRConverter.java:425)
	at org.contentmine.ami.tools.AMIOCRTool.runOCR(AMIOCRTool.java:246)
	at org.contentmine.ami.tools.AMIOCRTool.processImageDir(AMIOCRTool.java:421)
	at org.contentmine.ami.tools.ImageDirProcessor.processImageDir(ImageDirProcessor.java:91)
	at org.contentmine.ami.tools.ImageDirProcessor.processImageDirs(ImageDirProcessor.java:60)
	at org.contentmine.ami.tools.AMIOCRTool.processTree(AMIOCRTool.java:218)
	at org.contentmine.ami.tools.AbstractAMITool.processTrees(AbstractAMITool.java:579)
	at org.contentmine.ami.tools.AMIOCRTool.runSpecifics(AMIOCRTool.java:210)
	at org.contentmine.ami.tools.AbstractAMITool.runCommands(AbstractAMITool.java:258)
	at org.contentmine.ami.tools.AMIOCRTool.main(AMIOCRTool.java:188)
>image.6.1.96_553/raw>
>skip gocr>raw.png

This was working earlier in the week for me, so I assume it's related to a recent change @petermr ?
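
The trace dies inside GOCRConverter.runGOCR, which suggests that something resolved at that point (most likely the gocr executable path, or an expected output file) is null. As a diagnostic aid only, here is a hypothetical pre-flight check, plain JDK and not the actual GOCRConverter code, that turns the silent null into a readable message.

// Hypothetical pre-flight check before invoking gocr; not the actual GOCRConverter code.
import java.io.File;

public class GocrPreflight {

    /** Fail early with a clear message instead of a downstream NullPointerException. */
    static File requireExecutable(String path) {
        if (path == null) {
            throw new IllegalStateException("gocr path is null (was --gocr supplied?)");
        }
        File exe = new File(path);
        if (!exe.isFile() || !exe.canExecute()) {
            throw new IllegalStateException("gocr not found or not executable: " + path);
        }
        return exe;
    }

    public static void main(String[] args) {
        // e.g. java GocrPreflight /usr/local/bin/gocr
        File gocr = requireExecutable(args.length > 0 ? args[0] : null);
        System.out.println("gocr looks runnable: " + gocr);
    }
}
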

Test ami-stack script

The current ami-stack test looks like:

(please comment on possible enhancements - these might include passing parameters)

#! /bin/sh

# your path should include the /bin directory of the appassembler distrib, e.g.
# ami-forestplot => /Users/pm286/workspace/cmdev/normami/target/appassembler/bin/ami-forestplot

# edit this to your own directory
STATA_TOTAL="/Users/pm286/projects/forestplots/stataforestplots"

# directories
STATA_TOTAL_EDITED_DIR=$STATA_TOTAL/stataTotalEdited
CPROJECT=$STATA_TOTAL_EDITED_DIR
CTREE=$CPROJECT/PMC5882397

# choose the first SOURCE to run a single CTree, the second to run a CProject (long). 
# Comment in the one you want
SOURCE="-t $CTREE"
# SOURCE="-p $CPROJECT"
echo $CTREE
ls $CTREE

# images 
RAW=raw
RAW230DS=raw_thr_230_ds
RAWS4230DS=raw_s4_thr_230_ds
#subimages

# regions of image
HEADER=header
BODY=body
LTABLE=ltable
RTABLE=rtable
SCALE=scale

HEADERS120D=${HEADER}"_s4_thr_120_ds"
LTABLES120D=${LTABLE}"_s4_thr_120_ds"
RTABLES120D=${RTABLE}"_s4_thr_120_ds"

SLEEP1=1
SLEEP5=5

# make project from a directory (CPROJECT) containing PDFs. 
# a no-op here as EuPMC has already done this

ami-makeproject -p $CPROJECT --rawfiletypes pdf

# convert PDFs to CTrees

ami-pdf $SOURCE

# image processing at 3 threshold levels (later will try to make this an AMI loop)

ami-image $SOURCE --sharpen sharpen4 --threshold 150 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 230 --despeckle true
ami-image $SOURCE --sharpen sharpen4 --threshold 240 --despeckle true

echo "===============Finished AmiImage============="
sleep $SLEEP1

# run OCR both types

ami-ocr $SOURCE --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --tesseract /usr/local/bin/tesseract --extractlines hocr --html false

echo "===============Finished AmiOcr============="
sleep $SLEEP1

# extract the pixels and project onto axes to get subimage regions
# further project the scale subimage (y(2)) to get the tick values 
# in this case do it for the threshold 230 version only
# the stylesheet location (xsl) is hard coded into the distrib but it could be 
# more general.
# This *generates* raw_thr_230_ds/template.xml. Its variables (e.g. $RAW.$HEADER) are specified 
# in the stylesheet and values computed from applying ami-pixel to the images

ami-pixel $SOURCE --projections --yprojection 0.8 --xprojection 0.5 \
                --minheight -1 --rings -1 --islands 0 \
			    --inputname $RAW230DS \
			    --subimage statascale y 2 delta 10 projection x \
			    --templateinput $RAW230DS/projections.xml \
			    --templateoutput template.xml \
			    --templatexsl /org/contentmine/ami/tools/stataTemplate.xsl

echo "===============Finished AmiPixel============="
sleep $SLEEP5

# use the generated template.xml in each CTree/*/image*/raw_thr_230_ds/ directory to segment the image
# this will create subimages $RAW.$HEADER, $RAW.$BODY.$LTABLE, raw.body.graph, $RAW.$BODY.$RTABLE and raw.scale
# these subimages will be written to *.png in the CTree/*/image* directory
			    
ami-forestplot $SOURCE --template $RAW230DS/template.xml

echo "===============Finished AmiForest============="
sleep $SLEEP5

#now re-run ami-image to enhance each subimage separately

ami-image $SOURCE --inputname $RAW.$HEADER --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$LTABLE --sharpen sharpen4 --threshold 120 --despeckle true
ami-image $SOURCE --inputname $RAW.$BODY.$RTABLE --sharpen sharpen4 --threshold 120 --despeckle true

echo "===============Finished Sharpen Threshold============="
sleep $SLEEP5

# and rerun tesseract on each subimage (suspect Tesseract gets confused by the whole
# image, including the graph and lines).

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --tesseract /usr/local/bin/tesseract --extractlines hocr

echo "===============Finished Tesseract ============="
sleep $SLEEP5

ami-ocr $SOURCE --inputname $RAW.$HEADERS120D      --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$LTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr
ami-ocr $SOURCE --inputname $RAW.$BODY.$RTABLES120D --gocr /usr/local/bin/gocr --extractlines gocr

echo "===============Finished GOCR ============="
sleep $SLEEP5

projections.xml and sspsTemplate.xsl do not match

In sspsTemplate.xsl we are looking for XML of the form:

 <horizontallines>
  <g xmlns="http://www.w3.org/2000/svg">
   <line x1="0.0" y1="43.0" x2="1277.0" y2="43.0" style="stroke:black;stroke-width:1.0;"/>
  </g>
  <g xmlns="http://www.w3.org/2000/svg">
   <line x1="782.0" y1="413.0" x2="1262.0" y2="413.0" style="stroke:black;stroke-width:1.0;"/>
  </g>
 </horizontallines>

But in the generated projections.xml we have:

 <g class="horizontallines" xmlns="http://www.w3.org/2000/svg">
  <line x1="0.0" y1="33.0" x2="1266.0" y2="33.0" style="stroke:red;stroke-width:2.0;"/>
  <line x1="758.0" y1="310.0" x2="1252.0" y2="310.0" style="stroke:red;stroke-width:2.0;"/>
 </g>

As a result the ami-forestplot --segment stage fails, as there are missing border values.
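
To make the mismatch concrete, here is a standalone sketch (plain JDK; the XPath strings are illustrative, not copied from the stylesheet) that evaluates both element paths against the generated fragment: the nesting the stylesheet expects matches nothing, while a path keyed on g[@class='horizontallines'] finds both lines.

// Illustrative comparison of the expected and generated horizontallines structures.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HorizontalLinesMismatch {
    public static void main(String[] args) throws Exception {
        // the fragment that ami-pixel currently writes to projections.xml
        String generated =
              "<g class='horizontallines' xmlns='http://www.w3.org/2000/svg'>"
            + "<line x1='0.0' y1='33.0' x2='1266.0' y2='33.0'/>"
            + "<line x1='758.0' y1='310.0' x2='1252.0' y2='310.0'/>"
            + "</g>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(generated)));

        XPathFactory xpf = XPathFactory.newInstance();
        // nesting implied by the stylesheet: horizontallines/g/line -- matches nothing here
        NodeList expected = (NodeList) xpf.newXPath().evaluate(
            "//*[local-name()='horizontallines']/*[local-name()='g']/*[local-name()='line']",
            doc, XPathConstants.NODESET);
        // structure actually emitted: g[@class='horizontallines']/line -- matches both lines
        NodeList actual = (NodeList) xpf.newXPath().evaluate(
            "//*[local-name()='g'][@class='horizontallines']/*[local-name()='line']",
            doc, XPathConstants.NODESET);
        System.out.println("matches with expected nesting:  " + expected.getLength()); // 0
        System.out.println("matches with generated nesting: " + actual.getLength());   // 2
    }
}

Presumably either the stylesheet or the projections.xml writer needs to change so that the two agree.
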

AMI search includes trailing punctuation

AMI search (search with dictionaries) includes trailing punctuation in its hits. Thus for the target text

The transport, which was rapid

indexing against the dictionary term transport would return

transport,

Remove all trailing punctuation.
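
A minimal sketch of the requested normalisation (plain JDK; stripTrailingPunctuation is a hypothetical helper, not the actual AMI search code):

// Hypothetical normalisation of dictionary hits; not the actual AMI search code.
public class TrailingPunctuationDemo {

    /** Remove trailing punctuation (commas, full stops, semicolons, ...) from a hit. */
    static String stripTrailingPunctuation(String hit) {
        return hit.replaceAll("\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        String target = "The transport, which was rapid";
        String term = "transport";
        for (String token : target.split("\\s+")) {
            if (stripTrailingPunctuation(token).equalsIgnoreCase(term)) {
                System.out.println(stripTrailingPunctuation(token)); // prints "transport", not "transport,"
            }
        }
    }
}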
