prima-core-libs's Introduction

prima-core-libs

Core libraries by the PRImA Research Lab

prima-core-libs's People

Contributors

bertsky, chris1010010, maxnth, stweil

prima-core-libs's Issues

reader/validation: throw informative exception

I sometimes have trouble debugging PAGE-XML documents that just won't open in PageViewer, even though they validate under the schema and contain no obvious mistake. The problem is that PageViewer does not say what is wrong; only when it crashes outright do you at least get a stack trace.

I dug into /PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageReader.java and found that XmlPageReader.read() does have all the information in a PageErrorHandler instance called lastErrors. But this gets thrown away.

Why is this not piggy-backed on an exception that PageViewer's event listener can then react to?

For example, it would help to see (at least on the console):

There is no ID/IDREF binding for IDREF 'region0015'
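To illustrate the request, here is a minimal sketch of the pattern using only the JDK's SAX API. It does not use the real XmlPageReader or PageErrorHandler classes: `PageReadException`, `CollectingErrorHandler` and `InformativeReaderSketch` are hypothetical names, and the example triggers a well-formedness error rather than an ID/IDREF validation error because it does not load the PAGE schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical exception type (not part of prima-core-libs) carrying all parser diagnostics. */
class PageReadException extends Exception {
    private final List<String> diagnostics;

    PageReadException(List<String> diagnostics) {
        super(String.join(System.lineSeparator(), diagnostics));
        this.diagnostics = new ArrayList<>(diagnostics);
    }

    List<String> getDiagnostics() {
        return diagnostics;
    }
}

/** Records errors instead of discarding them; a stand-in for the handler described above. */
class CollectingErrorHandler implements ErrorHandler {
    final List<String> diagnostics = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        // Warnings are ignored in this sketch.
    }

    @Override
    public void error(SAXParseException e) {
        diagnostics.add("line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        diagnostics.add("line " + e.getLineNumber() + ": " + e.getMessage());
        throw e; // stop parsing; the caller reports what was collected
    }
}

final class InformativeReaderSketch {
    /** Parses the XML and throws PageReadException if the parser reported any error. */
    static void read(String xml) throws PageReadException {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Make sure unexpected failures also end up in the report.
            if (handler.diagnostics.isEmpty()) {
                handler.diagnostics.add(String.valueOf(e.getMessage()));
            }
        }
        if (!handler.diagnostics.isEmpty()) {
            throw new PageReadException(handler.diagnostics);
        }
    }

    public static void main(String[] args) {
        try {
            read("<a><b></a>"); // malformed on purpose
        } catch (PageReadException e) {
            System.err.println("Could not read document:");
            System.err.println(e.getMessage());
        }
    }
}
```

A caller such as a viewer's event listener could then catch the exception and show or log the collected messages instead of failing silently.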

Possible small bug in `SaxPageHandler_Hocr.java`

Hi there!
I was using UB-Mannheim's ocr-fileformat to convert hOCR to PAGE XML. Internally it uses PRImA.PageConverter to do the job, which itself depends on this Java library. It successfully converted the file, except that it put an extra quotation mark in front of the image file name in the PAGE XML output, causing a mismatch between the XML file and the image:
image
(Note the extra quotation mark below)
image

I delved into the problem, and it turned out that when there is no path separator in the file name, the hOCR handler extracts it incorrectly.

if (part.startsWith("image")) {
    String image = null;
    // Filename
    // Path
    if (part.contains(File.separator))
        image = part.substring(part.lastIndexOf(File.separator)+1);
    // No path
    else if (part.contains(" \""))
        image = part.substring(part.indexOf(" \"")+1);
    if (image != null) {
        // Remove quotation mark
        if (image.endsWith("\""))
            image = image.substring(0, image.length()-1);
        page.setImageFilename(image);

Line 319 of the file (the substring call in the "No path" branch above) should become

            image = part.substring(part.indexOf(" \"")+2); 

This fixes the issue, because part.indexOf(" \"") returns the index of the space character, not of the quotation mark.
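The off-by-one can be reproduced in isolation. The following sketch copies only the "No path" branch and the quote-stripping step from the excerpt; the class name, the `offset` parameter and the sample input `image "0001.png"` are assumptions made for illustration, not taken from the library.

```java
final class HocrImageNameSketch {
    /**
     * Mirrors the "No path" branch of the excerpt.
     * offset = 1 is the current behavior, offset = 2 the proposed fix.
     */
    static String extract(String part, int offset) {
        String image = null;
        int spaceIndex = part.indexOf(" \""); // index of the space, not of the quote
        if (spaceIndex >= 0) {
            image = part.substring(spaceIndex + offset);
        }
        // Remove the trailing quotation mark, as in the original code.
        if (image != null && image.endsWith("\"")) {
            image = image.substring(0, image.length() - 1);
        }
        return image;
    }

    public static void main(String[] args) {
        String part = "image \"0001.png\""; // assumed shape of the hOCR property
        System.out.println(extract(part, 1)); // prints "0001.png (leading quote kept)
        System.out.println(extract(part, 2)); // prints 0001.png
    }
}
```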

@stweil @bertsky

[Question] Get TextContent (TextEquiv) index

Is it possible to retrieve the value of the index attribute for a given TextEquiv (textContent)-element with the current version of prima-core-libs?
I'm using schema version 2017-07-15 and upwards and (if I'm not mistaken) the attribute should be allowed for TextEquiv-elements.

I can't find the attribute in the VariableMap returned by the .getAttributes() method (unlike dataType or comments, for example), even if the attribute is set and has a valid value.
As far as I can see there's no explicit getter-method for the (sort) index for textContent objects either.

When using (for example) the following PAGE XML snippet (from 0030.zip)

<TextEquiv dataType="xsd:boolean" comments="comment" conf="0.5" index="1">
    <Unicode>մարդիկ ։ Սակայն իբրեւ տեսին ըզ_</Unicode>
</TextEquiv>

I can access the dataType, conf or comments via their associated getter-method or using

textContent.getAttributes().get("dataType").getValue()
textContent.getAttributes().get("conf").getValue()
textContent.getAttributes().get("comments").getValue()

But

textContent.getAttributes().get("index")

will always return null.

Is this a misunderstanding or misuse of the library on my side, or is it not possible at the moment (or not intended to work that way)?
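Until that is clarified, one possible workaround is to read the attribute from the raw XML with the JDK's DOM parser, bypassing the library's object model. This is only a sketch: `TextEquivIndexSketch` and `firstTextEquivIndex` are hypothetical names, and it looks at the first TextEquiv element in the document only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TextEquivIndexSketch {
    /** Returns the index attribute of the first TextEquiv element, or null if it is absent. */
    static Integer firstTextEquivIndex(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // "*" matches any namespace, so this works across PAGE schema versions.
        NodeList nodes = doc.getElementsByTagNameNS("*", "TextEquiv");
        if (nodes.getLength() == 0) {
            return null;
        }
        Element textEquiv = (Element) nodes.item(0);
        return textEquiv.hasAttribute("index")
                ? Integer.valueOf(textEquiv.getAttribute("index"))
                : null;
    }
}
```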

reader ignores index in ordered groups

AFAICS, the existing implementations for all versions of PAGE-XML ignore (OrderedGroup|OrderedGroupIndexed)/@index when parsing the XML.

This is how it looks:

else if ( DefaultXmlNames.ELEMENT_RegionRef.equals(localName)
        || DefaultXmlNames.ELEMENT_RegionRefIndexed.equals(localName)) {
    if (currentLogicalGroup != null) {
        if ((i = atts.getIndex(DefaultXmlNames.ATTR_regionRef)) >= 0) {
            currentLogicalGroup.addRegionRef(atts.getValue(i));
        }
    }

References to ATTR_index are nowhere to be found.

The model class of the group, in turn, does nothing to check incoming indices; it simply appends the references:

public void addRegionRef(String id) {
    try {
        members.add(new RegionRef(this, contentFactory.getIdRegister().getId(id)));
    } catch (InvalidIdException e) {
        e.printStackTrace();
    }
}

This means that applications like PageViewer or PageConverter will use the XML order instead of the actual order laid out by the schema semantics. Which in turn creates a problem for applications like OCR-D: What is the correct representation, the one shown by PageViewer or my strict implementation?
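Under the interpretation described in this issue (that @index, not XML order, defines the sequence), a reader would have to sort the collected members by index. A minimal sketch of that idea follows; `IndexedRef` and `readingOrder` are hypothetical names and do not correspond to classes in prima-core-libs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class IndexedOrderSketch {
    /** Hypothetical stand-in for a RegionRefIndexed element: its @index and @regionRef. */
    static final class IndexedRef {
        final int index;
        final String regionRef;

        IndexedRef(int index, String regionRef) {
            this.index = index;
            this.regionRef = regionRef;
        }
    }

    /** Returns the region IDs ordered by @index rather than by position in the XML. */
    static List<String> readingOrder(List<IndexedRef> xmlOrder) {
        List<IndexedRef> sorted = new ArrayList<>(xmlOrder);
        // List.sort is stable, so entries with equal indices keep their XML order.
        sorted.sort(Comparator.comparingInt(ref -> ref.index));
        List<String> ids = new ArrayList<>();
        for (IndexedRef ref : sorted) {
            ids.add(ref.regionRef);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<IndexedRef> xmlOrder = new ArrayList<>();
        xmlOrder.add(new IndexedRef(2, "region0003"));
        xmlOrder.add(new IndexedRef(0, "region0001"));
        xmlOrder.add(new IndexedRef(1, "region0002"));
        System.out.println(readingOrder(xmlOrder)); // prints [region0001, region0002, region0003]
    }
}
```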

Here's an example of the difference this can make:

  • PAGE-XML and original image: debug-readingorder.zip
  • rendered by PageViewer: FILE_0002_ORIGINAL_pageviewer-all-order
  • rendered by ocrd-segment-extract-pages: FILE_0002_EXTRACT-LINES-EYNOLLAH pseg

Contrary to what one might suspect at first glance, here it is PageViewer that gets the order wrong, along with the producing tool eynollah (which follows its model of just looking at the XML order), so the two errors compensate each other.

If my interpretation is wrong, please get back to me soonish for confirmation. (I don't care about the fix so much as clarity on the correct meaning of the standard for implementation in software and adoption in derived specifications like OCR-D.)

If the better place is the PAGE-XML repo, please transfer.

Calculation of a rectangle's height and width

I stumbled upon the following code regarding the calculation of height and width for a rectangle:

public int getWidth() {
    return right-left+1;
}
public int getHeight() {
    return bottom-top+1;
}

Is there any specific reason for adding the constant 1 to the calculated height/width?
When converting (for example) PAGEXML to ALTO the calculated width/height of TextBlocks is always one pixel too big because of this.
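One possible explanation, stated here as an assumption rather than as the library's documented intent, is that right and bottom are treated as inclusive pixel coordinates: a rectangle spanning columns 10 to 12 covers three pixels, so the pixel count is right-left+1, whereas the geometric distance between the two coordinates is right-left. The sketch below contrasts the two conventions; `InclusiveRectSketch` and its method names are hypothetical.

```java
final class InclusiveRectSketch {
    final int left;
    final int top;
    final int right;
    final int bottom;

    InclusiveRectSketch(int left, int top, int right, int bottom) {
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
    }

    /** Number of pixel columns covered if both left and right are inside the rectangle. */
    int inclusiveWidth() {
        return right - left + 1;
    }

    /** Distance between the two x coordinates, i.e. right treated as an outer boundary. */
    int exclusiveWidth() {
        return right - left;
    }

    public static void main(String[] args) {
        InclusiveRectSketch rect = new InclusiveRectSketch(10, 20, 12, 24);
        System.out.println(rect.inclusiveWidth()); // prints 3 (columns 10, 11, 12)
        System.out.println(rect.exclusiveWidth()); // prints 2
    }
}
```

Which convention a converter should use depends on how the target format defines width and height, so the one-pixel difference may need to be handled at the conversion step.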

document requirements and build procedure

I understand one needs to install Eclipse and import each of the directories under java? (If so, please make this explicit and prominent in the README. It may sound trivial, but other PRImA components depend on this and have no pre-builts, so non-Java / non-Eclipse developers will already stumble here.)

However, I get the following build errors I am unable to get around:

Project 'PrimaDla' is missing required Java project: 'json-simple-tag_release_1_1_1'

I even cloned https://github.com/fangyidong/json-simple and imported it into Eclipse, checked out release 1.1.1 and had Eclipse rebuild (after deactivating the tests, which would not compile under OpenJDK 11). But it seems the exported name is different, and I have no idea how to set it to 'json-simple-tag_release_1_1_1' as required by PrimaDla.
