Comments (6)

iiegn commented on June 29, 2024

just realized, instead of pass it's necessary to set text = ""...

from alto-tools.

imlabormitlea-code commented on June 29, 2024

To get hyphenated content as in the original file

      # Default to '' so strings without SUBS_TYPE do not raise a TypeError.
      subs_type = line.attrib.get('SUBS_TYPE', '')
      if 'HypPart1' in subs_type:
          text = line.attrib.get('CONTENT') + ' '
      if 'HypPart2' in subs_type:
          text = line.attrib.get('CONTENT') + ' '

should avoid duplicates.

iiegn commented on June 29, 2024

i guess, there is no right answer to this. however, once converted to text it's not obvious what is what: (from a small sample and more intuition than fact observation) i've seen cases where the end-of-line hyphenation was not recognized by the OCR engine but the text representation still contained a hyphen at the end of the line. these lines would be indistinguishable from #16 (comment) option 2). otoh, when the OCR engine does recognize an end-of-line hyphenation, it seems to be often correct. for me, as i'm interested in the text for further NLP processing, having some words back to 'normal' (that is, known to NLP tools) form is a plus; and knowing that they are often correct is also a plus.
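A minimal sketch of the ambiguity described above, assuming each ALTO String is modelled as a plain dict of its attributes. The `render_keep` helper and both sample lines are hypothetical and only illustrate option 2; they are not alto-tools code.

```python
def render_keep(strings):
    """Render one text line under option 2: CONTENT, plus '-' after a HypPart1."""
    words = []
    for attrs in strings:
        text = attrs["CONTENT"]
        if attrs.get("SUBS_TYPE") == "HypPart1":
            text += "-"
        words.append(text)
    return " ".join(words)


# Hyphenation recognized by the OCR engine (attributes as in the ALTO sample).
recognized = [{"CONTENT": "aver", "SUBS_TYPE": "HypPart1", "SUBS_CONTENT": "averAge"}]
# Hypothetical line where the engine missed it and kept the hyphen in CONTENT.
missed = [{"CONTENT": "aver-"}]

# Both lines render to the same text, so the distinction is lost.
print(render_keep(recognized) == render_keep(missed))
```

Once the ALTO attributes are gone, nothing in the plain text tells the two cases apart.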

so, i have a preference for option 1)... but as long as there is a run-time option to select one or the other, i don't really care. this begs the question (then for me), should this become a run-time option?

cneud commented on June 29, 2024

Many thanks for sharing your thoughts on this @iiegn!

I am leaning more towards option 2. My thinking is that alto-tools itself should aim not to alter or normalize the OCR text in any way. While the HypPart is indeed part of the OCR output (I assume it was added to ALTO for the reason you describe, ease of downstream NLP processing with normalized tokens), I now see it as more problematic that substituting hyphens may also alter line breaks. This can cause trouble for applications that rely on some form of text line-image alignment, such as evaluation or training. I also suppose it could be a greater source of confusion and require more documentation than keeping the hyphenation from the source document. The different capitalization (averAge vs aver-age) in the example ALTO I used is also still confusing me. I don't expect an OCR engine would do that; it must be some post-processing...

TL;DR - a runtime option would satisfy more downstream use cases at the cost of increased documentation.

I guess I will implement option 2 first as the default. But I can also see the value in option 1 and would like to implement it, just not very soon (too much other stuff going on). PRs are obviously welcome :) Example test files with hyphenation edge cases would also be helpful if they can be shared freely.
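As a rough sketch of what such a runtime option could look like, assuming a hypothetical `--hyphenation` flag and a hypothetical `word_text` helper (neither is an existing alto-tools interface), with option 2 as the default:

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical --hyphenation switch."""
    parser = argparse.ArgumentParser(description="ALTO text extraction (sketch)")
    parser.add_argument(
        "--hyphenation",
        choices=["keep", "substitute"],
        default="keep",  # option 2: leave the source hyphenation untouched
        help="keep hyphenated parts as-is, or substitute SUBS_CONTENT",
    )
    return parser


def word_text(content, subs_type, subs_content, mode):
    """Return the text to emit for one String element under the chosen mode."""
    if mode == "substitute":
        if subs_type == "HypPart1":
            return subs_content  # emit the full word once
        if subs_type == "HypPart2":
            return ""  # second part is already covered by HypPart1
    return content


args = build_parser().parse_args(["--hyphenation", "substitute"])
print(word_text("aver", "HypPart1", "averAge", args.hyphenation))
```

In "keep" mode the helper returns only CONTENT; rendering the HYP element and the line break would stay with the caller.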

cneud commented on June 29, 2024

Thank you @imlabormitlea-code! I will also test this next week and, if all checks out, make a PR and credit you for the fix (unless you want to make a PR for this yourself).

cneud commented on June 29, 2024

Basically, there seem to be two ways to deal with the hyphenation info from the ALTO xml.

Looking at a live example from the ALTO LOC website:

<TextLine ...>
  <String STYLEREFS="ID7" HEIGHT="72.0" WIDTH="276.0" HPOS="6150.0" VPOS="5388.0" CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge" WC="1.0"/>
  <HYP WIDTH="1.0" HPOS="6427.0" VPOS="5388.0" CONTENT="-"/>
</TextLine>
<TextLine ...>
  <String STYLEREFS="ID7" HEIGHT="90.0" WIDTH="213.0" HPOS="4053.0" VPOS="5556.0" CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge" WC="1.0"/>
</TextLine>
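To make the attributes involved concrete, here is a small sketch that reads them with Python's standard `xml.etree.ElementTree`. It assumes a trimmed, well-formed and namespace-free copy of the sample above, wrapped in a `TextBlock` root; real ALTO files declare an XML namespace and carry more attributes. The `hyphenation_parts` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample: only the attributes relevant to hyphenation.
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
  </TextLine>
</TextBlock>
"""


def hyphenation_parts(xml_text):
    """Return (CONTENT, SUBS_TYPE, SUBS_CONTENT) for every String element."""
    root = ET.fromstring(xml_text)
    return [
        (s.get("CONTENT"), s.get("SUBS_TYPE"), s.get("SUBS_CONTENT"))
        for s in root.iter("String")
    ]


for part in hyphenation_parts(SAMPLE):
    print(part)
```

Both parts carry the same SUBS_CONTENT, which is why a naive substitution would emit the full word twice.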

1) substitute hyphens

Where ALTO contains a string with SUBS_TYPE="HypPart1", use the SUBS_CONTENT to substitute the hyphenated word in the text output.

SUBS_CONTENT must be used only once, the second part of the hyphenated content should be ignored, e.g. as suggested by #16 (comment).

Code:

# Default to '' so strings without SUBS_TYPE do not raise a TypeError.
subs_type = line.attrib.get('SUBS_TYPE', '')
if 'HypPart1' in subs_type:
    text = line.attrib.get('SUBS_CONTENT') + ' '
if 'HypPart2' in subs_type:
    text = ''

Output:

averAge

2) don't substitute hyphens

Keep the output closer to the original, as suggested e.g. by #16 (comment): use the CONTENT of each string, render the HYP as - and keep the text line break.

Code:

# Default to '' so strings without SUBS_TYPE do not raise a TypeError.
subs_type = line.attrib.get('SUBS_TYPE', '')
if 'HypPart1' in subs_type:
    text = line.attrib.get('CONTENT') + '-'
if 'HypPart2' in subs_type:
    text = line.attrib.get('CONTENT') + ' '

Output:

aver-
age

Note the correct capitalisation compared to using the substitution content (this is from the ALTO source).
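The two options can be compared side by side with a self-contained sketch. It assumes a trimmed, namespace-free version of the sample wrapped in a `TextBlock` root, and a hypothetical `render_lines` helper; this is an illustration of the two behaviours, not the alto-tools implementation.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the sample (assumption: no XML namespace).
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
  </TextLine>
</TextBlock>
"""


def render_lines(xml_text, substitute):
    """Return one text string per TextLine under option 1 or option 2."""
    root = ET.fromstring(xml_text)
    lines = []
    for text_line in root.iter("TextLine"):
        words = []
        for el in text_line:
            if el.tag == "HYP":
                # Option 2 keeps the hyphen attached to the preceding word.
                if not substitute and words:
                    words[-1] += el.get("CONTENT", "-")
                continue
            if el.tag != "String":
                continue  # ignore other children such as SP
            subs_type = el.get("SUBS_TYPE", "")
            if substitute and subs_type == "HypPart1":
                words.append(el.get("SUBS_CONTENT", el.get("CONTENT", "")))
            elif substitute and subs_type == "HypPart2":
                continue  # already emitted with HypPart1
            else:
                words.append(el.get("CONTENT", ""))
        lines.append(" ".join(words))
    return lines


print(render_lines(SAMPLE, substitute=True))   # option 1
print(render_lines(SAMPLE, substitute=False))  # option 2
```

In this sketch, option 1 leaves the second text line empty because the whole word moves to the first line, so the per-line text no longer matches the source layout. Option 2 keeps one fragment per line.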

I am not sure I have a preference for how this should be handled by default. Thoughts?
