while using alto_text(), i see duplicated content at hyphenations: the word at the end

Hyphenated content gets duplicated (alto-tools issue, closed)

iiegn commented on June 29, 2024

from alto-tools.

Comments (6)

iiegn commented on June 29, 2024 1

just realized, instead of pass it's necessary to set text = ""...

imlabormitlea-code commented on June 29, 2024 1

To get hyphenated content as in the original file

      # Default to '' so strings without SUBS_TYPE do not raise a TypeError
      subs_type = line.attrib.get('SUBS_TYPE', '')
      if 'HypPart1' in subs_type:
          text = line.attrib.get('CONTENT') + ' '
      elif 'HypPart2' in subs_type:
          text = line.attrib.get('CONTENT') + ' '

should avoid duplicates.

iiegn commented on June 29, 2024 1

i guess, there is no right answer to this. however, once converted to text it's not obvious what is what: (from a small sample and more intuition than fact observation) i've seen cases where the end-of-line hyphenation was not recognized by the OCR engine but the text representation still contained a hyphen at the end of the line. these lines would be indistinguishable from #16 (comment) option 2). otoh, when the OCR engine does recognize an end-of-line hyphenation, it seems to be often correct. for me, as i'm interested in the text for further NLP processing, having some words back to 'normal' (that is, known to NLP tools) form is a plus; and knowing that they are often correct is also a plus.

so, i have a preference for option 1)... but as long as there is a run-time option to select one or the other, i don't really care. this begs the question (then for me), should this become a run-time option?

cneud commented on June 29, 2024 1

Many thanks for sharing your thoughts on this @iiegn!

I am leaning more towards option 2. My thinking is that alto-tools should itself aim not to alter/normalize the OCR text in any way. The HypPart is indeed a part of the OCR output (I assume it was added in ALTO for the reason you describe, ease of downstream NLP processing with normalized tokens), but I now see it as more problematic that substituting hyphens may also alter line breaks. This can cause trouble for applications that use some form of textline-image alignment, like evaluation or training. And I suppose it could be a greater source of confusion and require more documentation than keeping the hyphenation from the source document. The fact that the capitalization (averAge vs aver-age) is different in the example ALTO I used is also still confusing me. I don't expect an OCR engine would do that, it must be some post-processing...

TL;DR - a runtime option would satisfy more downstream use cases at the cost of increased documentation.
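
A runtime option along these lines could look roughly like the sketch below. This is only an illustration: the `--substitute-hyphens` flag and the `word_text` helper are hypothetical names and not part of alto-tools, and the default (no substitution) simply mirrors option 2.

```python
import argparse


def word_text(attrib, substitute_hyphens=False):
    """Hypothetical helper: text for one ALTO String, given its attribute dict."""
    subs_type = attrib.get("SUBS_TYPE", "")  # "" when the attribute is absent
    if substitute_hyphens and subs_type == "HypPart1":
        return attrib.get("SUBS_CONTENT", "")  # full word, emitted once
    if substitute_hyphens and subs_type == "HypPart2":
        return ""  # already covered by the HypPart1 string
    return attrib.get("CONTENT", "")  # option 2: keep the OCR text as-is


def build_parser():
    # Hypothetical CLI flag; off by default, so output stays close to the source.
    parser = argparse.ArgumentParser(description="hyphenation handling sketch")
    parser.add_argument("--substitute-hyphens", action="store_true",
                        help="replace hyphenated parts with SUBS_CONTENT")
    return parser


part1 = {"CONTENT": "aver", "SUBS_TYPE": "HypPart1", "SUBS_CONTENT": "averAge"}
part2 = {"CONTENT": "age", "SUBS_TYPE": "HypPart2", "SUBS_CONTENT": "averAge"}

args = build_parser().parse_args(["--substitute-hyphens"])
print(word_text(part1, args.substitute_hyphens))  # averAge
print(word_text(part1))                           # aver
```

Keeping the decision in one small function would let both behaviours share the same traversal code, so the flag only changes which attribute is read.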

I guess I will implement option 2 first as the default. But I can also see the value in option 1 and would like to implement it, just not very soon (too much other stuff going on). PRs are obviously welcome :) Also example test files with hyphenation edge cases would be helpful if they can be shared freely.

cneud commented on June 29, 2024

Thank you @imlabormitlea-code! I will test this also next week and if all checks out, make a PR and credit you for the fix (i.e. unless you want to make a PR for this).

cneud commented on June 29, 2024

Basically, there seem to be two ways to deal with the hyphenation info from the ALTO XML.

Looking at a live example from the ALTO LOC website:

<TextLine ...>
  <String STYLEREFS="ID7" HEIGHT="72.0" WIDTH="276.0" HPOS="6150.0" VPOS="5388.0" CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge" WC="1.0"/>
  <HYP WIDTH="1.0" HPOS="6427.0" VPOS="5388.0" CONTENT="-"/>
</TextLine>
<TextLine ...>
  <String STYLEREFS="ID7" HEIGHT="90.0" WIDTH="213.0" HPOS="4053.0" VPOS="5556.0" CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge" WC="1.0"/>
</TextLine>

1) substitute hyphens

Where ALTO contains a string with SUBS_TYPE="HypPart1", use the SUBS_CONTENT to substitute the hyphenated word in the text output.

SUBS_CONTENT must be used only once; the second part of the hyphenated content (the HypPart2 string) should be ignored, e.g. as suggested by #16 (comment).

Code:

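# Default to '' so strings without SUBS_TYPE do not raise a TypeError
subs_type = line.attrib.get('SUBS_TYPE', '')
if 'HypPart1' in subs_type:
   text = line.attrib.get('SUBS_CONTENT') + ' '  # full word, emitted once
elif 'HypPart2' in subs_type:
   text = ''  # second half is already part of SUBS_CONTENT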

Output:

averAge
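
As a self-contained illustration of option 1, here is a minimal sketch using `xml.etree.ElementTree`. It assumes a simplified ALTO fragment (no XML namespace, most attributes dropped, two extra plain words added for context), and `substituted_lines` is a hypothetical helper name, not an alto-tools function.

```python
import xml.etree.ElementTree as ET

# Simplified fragment for illustration only; real ALTO files are namespaced.
ALTO = """
<TextBlock>
  <TextLine>
    <String CONTENT="the"/>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
    <String CONTENT="value"/>
  </TextLine>
</TextBlock>
"""


def substituted_lines(block):
    """Hypothetical helper: one string per TextLine, hyphenated words substituted."""
    lines = []
    for textline in block.iter("TextLine"):
        words = []
        for string in textline.iter("String"):
            subs_type = string.attrib.get("SUBS_TYPE", "")
            if subs_type == "HypPart1":
                # Emit the full word once, on the line where it starts.
                words.append(string.attrib.get("SUBS_CONTENT", ""))
            elif subs_type == "HypPart2":
                continue  # skip: the word was already emitted with HypPart1
            else:
                words.append(string.attrib.get("CONTENT", ""))
        lines.append(" ".join(words))
    return lines


print(substituted_lines(ET.fromstring(ALTO)))  # ['the averAge', 'value']
```

Note that the second line loses its first word in this variant, which is the kind of line-content shift that matters for textline-image alignment.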

2) don't substitute hyphens

Output closer to the original, as e.g. suggested by #16 (comment): use the CONTENT of each string, with the HYP rendered as - followed by a text line break.

Code:

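# Default to '' so strings without SUBS_TYPE do not raise a TypeError
subs_type = line.attrib.get('SUBS_TYPE', '')
if 'HypPart1' in subs_type:
   text = line.attrib.get('CONTENT') + '-'  # keep the hyphen from the source
elif 'HypPart2' in subs_type:
   text = line.attrib.get('CONTENT') + ' '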

Output:

aver-
age

Note the correct capitalisation compared to using the substitution content (this is from the ALTO source).
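
For comparison, option 2 could be sketched as below. The same caveats apply: the fragment is simplified (no namespace, extra plain words added), `literal_lines` is a hypothetical helper name, and the hyphen character is read from the HYP element rather than hard-coded.

```python
import xml.etree.ElementTree as ET

# Simplified fragment for illustration only; real ALTO files are namespaced.
ALTO = """
<TextBlock>
  <TextLine>
    <String CONTENT="the"/>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
    <String CONTENT="value"/>
  </TextLine>
</TextBlock>
"""


def literal_lines(block):
    """Hypothetical helper: one string per TextLine, hyphenation kept as in the source."""
    lines = []
    for textline in block.iter("TextLine"):
        parts = []
        for child in textline:
            if child.tag == "String":
                parts.append(child.attrib.get("CONTENT", ""))
            elif child.tag == "HYP" and parts:
                # Attach the hyphen directly to the preceding word.
                parts[-1] += child.attrib.get("CONTENT", "-")
        lines.append(" ".join(parts))
    return lines


print("\n".join(literal_lines(ET.fromstring(ALTO))))
# the aver-
# age value
```

Here SUBS_TYPE and SUBS_CONTENT are never read, so every output line contains exactly the words of its source TextLine.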

I am not sure I have a preference for how this should be handled by default. Thoughts?
