prima-research-lab / prima-core-libs

Core libraries by the PRImA Research Lab

License: Apache License 2.0

CSS 0.59% Java 14.43% HTML 84.94% JavaScript 0.04%

prima-core-libs's Issues

reader/validation: throw informative exception

I sometimes have trouble debugging PAGE-XML documents that just won't open in PageViewer, even though they validate against the schema and contain no obvious mistake. The problem is that PageViewer won't tell you what is wrong (unless it crashes outright, in which case you at least get a stack trace).

I dug into /PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageReader.java and found that XmlPageReader.read() does have all the information in a PageErrorHandler instance called lastErrors. But this gets thrown away.

Why is this not piggy-backed on an exception that PageViewer's event listener can then react to?

For example, it would help to see (at least on the console):

There is no ID/IDREF binding for IDREF 'region0015'
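To illustrate the request, here is a minimal sketch of an exception that carries the collected error messages to the caller. The names PageReadException and ReaderSketch are hypothetical and not part of prima-core-libs; how XmlPageReader and PageErrorHandler would actually expose their errors is an assumption left open here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical exception type (not part of prima-core-libs):
// it keeps every collected parser/validation message instead of dropping them.
final class PageReadException extends Exception {
    private final List<String> errors;

    PageReadException(String message, List<String> errors) {
        super(message + ": " + String.join("; ", errors));
        this.errors = Collections.unmodifiableList(new ArrayList<>(errors));
    }

    List<String> getErrors() {
        return errors;
    }
}

// Hypothetical stand-in for the point in a reader where errors have been collected.
final class ReaderSketch {
    static void failIfErrors(List<String> collected) throws PageReadException {
        if (!collected.isEmpty()) {
            throw new PageReadException("Could not read PAGE XML", collected);
        }
    }
}
```

A caller such as a viewer could then catch the exception and print getErrors() to the console or show them in a dialog.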

Possible small bug in `SaxPageHandler_Hocr.java`

Hi there!
I was using UB-Mannheim's ocr-fileformat to convert hOCR to PAGE XML. Internally it uses PRImA.PageConverter to do the job, which itself depends on this Java library. It successfully converted the file, except that it put an extra quotation mark next to the image file name in the PAGE XML output, causing a mismatch between the XML file and the image:

(Note the extra quotation mark below)

I delved into the problem and it turned out that when there is no separator in the file name, the hOCR handler extracts it incorrectly.

prima-core-libs/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Hocr.java

Lines 311 to 325 in 1bdcc57

    
    if (part.startsWith("image")) {
        String image = null;
        //Filename
        // Path
        if (part.contains(File.separator))
            image = part.substring(part.lastIndexOf(File.separator)+1);
        // No path
        else if (part.contains(" \""))
            image = part.substring(part.indexOf(" \"")+1);
        if (image != null) {
            //Remove quotation mark
            if (image.endsWith("\""))
                image = image.substring(0, image.length()-1);
            page.setImageFilename(image);

Line 319 above should become

            image = part.substring(part.indexOf(" \"")+2);

to fix the issue, because part.indexOf(" \"") returns the index of the space character, not of the quotation mark.
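The off-by-one can be reproduced in isolation. The sketch below mirrors only the "no path" branch of the excerpt, with the offset passed as a parameter; the class name HocrImageName and the sample input `image "page1.png"` are made up for illustration.

```java
final class HocrImageName {
    // Mirrors the "no path" branch: cut after the first occurrence of space + quote,
    // then strip a trailing quotation mark if present.
    static String extract(String part, int offset) {
        String image = part.substring(part.indexOf(" \"") + offset);
        if (image.endsWith("\"")) {
            image = image.substring(0, image.length() - 1);
        }
        return image;
    }

    public static void main(String[] args) {
        String part = "image \"page1.png\"";           // hypothetical hOCR title part
        System.out.println(extract(part, 1));           // keeps the leading quotation mark
        System.out.println(extract(part, 2));           // skips both the space and the quote
    }
}
```

With an offset of 1 the substring starts at the quotation mark itself, which is where the stray character in the output comes from.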

@stweil @bertsky

[Question] Get TextContent (TextEquiv) index

Is it possible to retrieve the value of the index attribute for a given TextEquiv (textContent)-element with the current version of prima-core-libs?
I'm using schema version 2017-07-15 and upwards and (if I'm not mistaken) the attribute should be allowed for TextEquiv-elements.

I can't find the attribute in the VariableMap returned by the .getAttributes() method (unlike dataType or comments, for example), even if the attribute is set and has a valid value.
As far as I can see there's no explicit getter-method for the (sort) index for textContent objects either.

When using (for example) the following PAGE XML snippet (from 0030.zip)

<TextEquiv dataType="xsd:boolean" comments="comment" conf="0.5" index="1">
    <Unicode>մարդիկ ։ Սակայն իբրեւ տեսին ըզ_</Unicode>
</TextEquiv>

I can access the dataType, conf or comments via their associated getter-method or using

textContent.getAttributes().get("dataType").getValue()
textContent.getAttributes().get("conf").getValue()
textContent.getAttributes().get("comments").getValue()

But

textContent.getAttributes().get("index")

will always return null.

Is this a misunderstanding or incorrect usage of the library on my side, or is it actually not possible at the moment, or not intended to work that way?
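Until the question is answered, one possible workaround is to read the attribute directly from the XML with the JDK's DOM parser, bypassing prima-core-libs entirely. This is only a sketch: the class name TextEquivIndexReader is hypothetical, and it looks at the first TextEquiv element only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class TextEquivIndexReader {
    // Returns the index attribute of the first TextEquiv element,
    // or null if there is no such element or no index attribute.
    static Integer firstTextEquivIndex(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // "*" matches the element regardless of the PAGE schema namespace version.
        NodeList nodes = doc.getElementsByTagNameNS("*", "TextEquiv");
        if (nodes.getLength() == 0) {
            return null;
        }
        Element textEquiv = (Element) nodes.item(0);
        if (!textEquiv.hasAttribute("index")) {
            return null;
        }
        return Integer.valueOf(textEquiv.getAttribute("index"));
    }
}
```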

reader ignores index in ordered groups

AFAICS, the existing implementations for all versions of PAGE-XML ignore (OrderedGroup|OrderedGroupIndexed)/@index when parsing the XML.

This is how it looks:

prima-core-libs/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2019_07_15.java

Lines 335 to 342 in 1f087a4

    
    else if (DefaultXmlNames.ELEMENT_RegionRef.equals(localName)
            || DefaultXmlNames.ELEMENT_RegionRefIndexed.equals(localName)) {
        if (currentLogicalGroup != null) {
            if ((i = atts.getIndex(DefaultXmlNames.ATTR_regionRef)) >= 0) {
                currentLogicalGroup.addRegionRef(atts.getValue(i));
            }
        }

References for ATTR_index are nowhere to be found.

The model class of the group, for its part, does nothing to check incoming indices; it simply appends them:

prima-core-libs/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Group.java

Lines 193 to 199 in 1f087a4

    
    public void addRegionRef(String id) {
        try {
            members.add(new RegionRef(this, contentFactory.getIdRegister().getId(id)));
        } catch (InvalidIdException e) {
            e.printStackTrace();
        }
    }

This means that applications like PageViewer or PageConverter will use the XML order instead of the order defined by the schema semantics. That in turn creates a problem for applications like OCR-D: which is the correct representation, the one shown by PageViewer or my strict implementation?
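The difference between the two readings can be shown with a small sketch. It assumes the strict interpretation described in this issue, namely that @index rather than document order defines the sequence. IndexedRef and ReadingOrderSketch are hypothetical names, not classes from prima-core-libs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value type: one RegionRefIndexed entry as it appears in the XML.
final class IndexedRef {
    final int index;
    final String regionRef;

    IndexedRef(int index, String regionRef) {
        this.index = index;
        this.regionRef = regionRef;
    }
}

final class ReadingOrderSketch {
    // Returns region IDs ordered by @index instead of by position in the XML.
    // List.sort is stable, so entries with equal indices keep their XML order.
    static List<String> orderByIndex(List<IndexedRef> refsInXmlOrder) {
        List<IndexedRef> sorted = new ArrayList<>(refsInXmlOrder);
        sorted.sort(Comparator.comparingInt(r -> r.index));
        List<String> ids = new ArrayList<>();
        for (IndexedRef r : sorted) {
            ids.add(r.regionRef);
        }
        return ids;
    }
}
```

A reader that only appends references as they are encountered yields the input order unchanged, which differs from the sorted result whenever the XML order and the index values disagree.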

Here's an example of the difference this can make:

PAGE-XML and original image: debug-readingorder.zip
rendered by PageViewer:
rendered by ocrd-segment-extract-pages:

In sharp contrast to what one might suspect superficially, here it's PageViewer who gets the order wrong – along with the producing tool eynollah (which follows its model of just looking at the XML order), hence a compensatory error.

If my interpretation is wrong, please get back to me soonish for confirmation. (I don't care about the fix so much as clarity on the correct meaning of the standard for implementation in software and adoption in derived specifications like OCR-D.)

If the better place is the PAGE-XML repo, please transfer.

Calculation of a rectangle's height and width

I stumbled upon the following code for calculating the height and width of a rectangle:

prima-core-libs/java/PrimaMaths/src/org/primaresearch/maths/geometry/Rect.java

Lines 49 to 54 in d95dc0b

    
    public int getWidth() {
        return right-left+1;
    }
    public int getHeight() {
        return bottom-top+1;

Is there any specific reason for adding the constant 1 to the calculated height/width?
When converting (for example) PAGE XML to ALTO, the calculated width/height of TextBlocks is always one pixel too big because of this.
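One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that Rect treats left/right and top/bottom as inclusive pixel positions, so the pixel count is the difference plus one. The sketch below contrasts the two conventions; RectSketch is a hypothetical class, not the library's Rect.

```java
final class RectSketch {
    // Inclusive convention: left and right are both pixel columns inside the rectangle,
    // so columns 10..19 make a rectangle that is 10 pixels wide.
    static int inclusiveWidth(int left, int right) {
        return right - left + 1;
    }

    // Exclusive convention: the width is the plain coordinate difference,
    // so the same bounds 10 and 19 give a width of 9.
    static int exclusiveWidth(int left, int right) {
        return right - left;
    }

    public static void main(String[] args) {
        System.out.println(inclusiveWidth(10, 19));
        System.out.println(exclusiveWidth(10, 19));
    }
}
```

If the target format expects the plain coordinate difference, a converter using the inclusive value would report sizes that are one unit larger, which matches the symptom described above.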

document requirements and build procedure

I understand one needs to install Eclipse and import each of the directories under java? (If so, please make this explicit and prominent in the README. It may sound trivial, but other PRImA components depend on this and have no pre-builts, so non-Java / non-Eclipse developers will already stumble here.)

However, I get the following build errors I am unable to get around:

Project 'PrimaDla' is missing required Java project: 'json-simple-tag_release_1_1_1'

I even cloned https://github.com/fangyidong/json-simple, imported it into Eclipse, checked out release 1.1.1 and had Eclipse rebuild (after deactivating the tests, which would not compile under OpenJDK 11). But it seems the exported name is different, and I have no idea how to set it to 'json-simple-tag_release_1_1_1' as required by PrimaDla.
