Core libraries by the PRImA Research Lab
License: Apache License 2.0
I sometimes have trouble debugging PAGE-XML documents that just won't open in PageViewer, even though they validate against the schema and contain no obvious mistake. The problem is that PageViewer does not tell you what went wrong (except when it crashes outright, in which case you at least get a stack trace).
I dug into /PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageReader.java and found that XmlPageReader.read() does have all the information in a PageErrorHandler instance called lastErrors. But this gets thrown away.
Why is this not piggy-backed on an exception that PageViewer's event listener can then react to?
For example, it would help seeing (at least on the console):
There is no ID/IDREF binding for IDREF 'region0015'
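One way the request above could look is sketched here: an exception type that carries the collected messages, so a caller can print or display them. All class and method names in this sketch are hypothetical and are not part of prima-core-libs; how XmlPageReader and PageErrorHandler would actually be wired together is left open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical exception type; not part of prima-core-libs.
class PageReadException extends Exception {
    private final List<String> errors;

    PageReadException(String message, List<String> errors) {
        super(message + ": " + String.join("; ", errors));
        this.errors = Collections.unmodifiableList(new ArrayList<>(errors));
    }

    List<String> getErrors() {
        return errors;
    }
}

class ErrorReportingSketch {
    // Stand-in for a reader that has collected messages in its error handler.
    static void read(List<String> collectedErrors) throws PageReadException {
        if (!collectedErrors.isEmpty()) {
            throw new PageReadException("Could not read PAGE XML", collectedErrors);
        }
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        errors.add("There is no ID/IDREF binding for IDREF 'region0015'");
        try {
            read(errors);
        } catch (PageReadException e) {
            // A viewer's listener could show this in a dialog; here it goes to the console.
            System.err.println(e.getMessage());
        }
    }
}
```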
Hi there!
I was using UB-Mannheim's ocr-fileformat to convert hOCR to PAGE XML. Internally it uses PRImA.PageConverter to do the job, which itself depends on this Java library. It successfully converted the file, except that it put an extra quotation mark next to the image file name in the PAGE XML output, causing a mismatch between the XML file and the image:
(Note the extra quotation mark below)
I delved into the problem, and it turned out that when there is no separator in the file name, the hOCR handler extracts it incorrectly.
Line 319 above should become
image = part.substring(part.indexOf(" \"")+2);
to fix the issue, because part.indexOf(" \"") returns the index of the space character, not of the " character.
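The offset argument can be checked in isolation. The sketch below uses a made-up input string and a hypothetical helper; it only illustrates how indexOf behaves with a two-character pattern and does not reproduce the actual hOCR handler.

```java
class QuoteOffsetSketch {
    // Hypothetical helper: returns the text after the first space-plus-quote pair.
    static String afterOpeningQuote(String part) {
        int idx = part.indexOf(" \""); // index of the space, not of the quote
        if (idx < 0) {
            return part; // pattern not found: leave the input unchanged
        }
        return part.substring(idx + 2); // skip both the space and the quote
    }

    public static void main(String[] args) {
        // Made-up input; any closing quote is assumed to be handled elsewhere.
        String part = "image \"scan.png";
        int idx = part.indexOf(" \"");
        System.out.println(part.substring(idx + 1)); // still starts with the quote
        System.out.println(afterOpeningQuote(part)); // starts with the file name
    }
}
```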
Is it possible to retrieve the value of the index attribute for a given TextEquiv (textContent) element with the current version of prima-core-libs?
I'm using schema version 2017-07-15 and upwards, and (if I'm not mistaken) the attribute should be allowed for TextEquiv elements.
I can't find the attribute in the VariableMap returned by the .getAttributes() method (unlike dataType or comments, for example), even if the attribute is set and has a valid value.
As far as I can see, there is no explicit getter method for the (sort) index of textContent objects either.
When using (for example) the following PAGE XML snippet (from 0030.zip)
<TextEquiv dataType="xsd:boolean" comments="comment" conf="0.5" index="1">
<Unicode>մարդիկ ։ Սակայն իբրեւ տեսին ըզ_</Unicode>
</TextEquiv>
I can access the dataType, conf or comments via their associated getter-method or using
textContent.getAttributes().get("dataType").getValue()
textContent.getAttributes().get("conf").getValue()
textContent.getAttributes().get("comments").getValue()
But
textContent.getAttributes().get("index")
will always return null.
Is this a misunderstanding or incorrect use of the library on my part, or is it simply not possible at the moment, or not intended to work that way?
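Independently of what the library exposes, the attribute can be read straight from the XML with the JDK's own DOM parser. The following is only a workaround sketch under stated assumptions: the class and method names are hypothetical, it does not use prima-core-libs at all, and it assumes the element is written without a namespace prefix (as in the snippet above).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class TextEquivIndexSketch {
    // Returns the index attribute of the first TextEquiv element, or null if absent.
    static String firstTextEquivIndex(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // Matches the unprefixed tag name only.
        NodeList nodes = doc.getElementsByTagName("TextEquiv");
        if (nodes.getLength() == 0) {
            return null;
        }
        Element textEquiv = (Element) nodes.item(0);
        return textEquiv.hasAttribute("index") ? textEquiv.getAttribute("index") : null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<TextEquiv dataType=\"xsd:boolean\" comments=\"comment\" conf=\"0.5\" index=\"1\">"
                + "<Unicode>text</Unicode></TextEquiv>";
        System.out.println(firstTextEquivIndex(xml));
    }
}
```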
AFAICS, the existing implementations for all versions of PAGE-XML ignore (OrderedGroup|OrderedGroupIndexed)/@index when parsing the XML.
This is how it looks:
References for ATTR_index are nowhere to be found.
The model class of the group, in turn, does nothing to check incoming indices; it simply appends them:
This means that applications like PageViewer or PageConverter will use the XML order instead of the actual order laid out by the schema semantics. That, in turn, creates a problem for applications like OCR-D: what is the correct representation, the one shown by PageViewer or my strict implementation?
Here's an example of the difference this can make:
Contrary to what one might suspect at first glance, here it is PageViewer that gets the order wrong, along with the producing tool eynollah (which follows PageViewer's model of just looking at the XML order), hence a compensatory error.
If my interpretation is wrong, please get back to me soonish for confirmation. (I don't care about the fix so much as clarity on the correct meaning of the standard for implementation in software and adoption in derived specifications like OCR-D.)
If the better place is the PAGE-XML repo, please transfer.
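To make the two interpretations concrete, here is a minimal sketch of the strict reading, in which group members are ordered by their index value rather than by their position in the XML. The IndexedRef type and the sample values are made up for illustration; this is not the library's model class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {
    // Hypothetical stand-in for an indexed group member: a region reference plus its index.
    static final class IndexedRef {
        final String regionRef;
        final int index;

        IndexedRef(String regionRef, int index) {
            this.regionRef = regionRef;
            this.index = index;
        }
    }

    // Strict reading: order members by index, ignoring their position in the XML.
    static List<String> orderByIndex(List<IndexedRef> xmlOrder) {
        List<IndexedRef> sorted = new ArrayList<>(xmlOrder);
        sorted.sort(Comparator.comparingInt((IndexedRef r) -> r.index));
        List<String> result = new ArrayList<>();
        for (IndexedRef ref : sorted) {
            result.add(ref.regionRef);
        }
        return result;
    }

    public static void main(String[] args) {
        // Members as they appear in the file; the indices disagree with that order.
        List<IndexedRef> xmlOrder = Arrays.asList(
                new IndexedRef("r1", 2), new IndexedRef("r2", 0), new IndexedRef("r3", 1));
        System.out.println(orderByIndex(xmlOrder));
    }
}
```

A reader that simply appends members as they are parsed would keep the XML order (r1, r2, r3) for the same input.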
I stumbled upon the following code for calculating the height and width of a rectangle:
Is there any specific reason for adding the constant 1 to the calculated height/width?
When converting (for example) PAGE XML to ALTO, the calculated width/height of TextBlocks is always one pixel too big because of this.
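A "+ 1" of this kind usually comes down to whether the right and bottom coordinates are counted as part of the box. The sketch below contrasts the two conventions with hypothetical helper methods; it assumes the code in question computes right - left + 1 and makes no claim about which convention the library intends.

```java
class RectSizeSketch {
    // Inclusive convention: left and right are both pixel columns inside the box.
    static int widthInclusive(int left, int right) {
        return right - left + 1;
    }

    // Exclusive convention: right is the first pixel column outside the box.
    static int widthExclusive(int left, int right) {
        return right - left;
    }

    public static void main(String[] args) {
        // The same coordinate pair yields widths that differ by one pixel.
        System.out.println(widthInclusive(10, 19));
        System.out.println(widthExclusive(10, 19));
    }
}
```

If one tool writes coordinates under the exclusive convention and another reads them under the inclusive one, every box grows by one pixel in each dimension, which would match the symptom described above.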
Do I understand correctly that one needs to install Eclipse and import each of the directories under java? (If so, please make this explicit and prominent in the README. It may sound trivial, but other PRImA components depend on this and have no pre-built binaries, so non-Java / non-Eclipse developers will already stumble here.)
However, I get the following build errors I am unable to get around:
Project 'PrimaDla' is missing required Java project: 'json-simple-tag_release_1_1_1'
I even cloned https://github.com/fangyidong/json-simple, imported it into Eclipse, checked out release 1.1.1 and had Eclipse rebuild (after deactivating the tests, which would not compile under OpenJDK 11). But it seems the exported name is different, and I have no idea how to set it to 'json-simple-tag_release_1_1_1' as required by PrimaDla.