Comments (6)
just realized: instead of pass, it's necessary to set text = ""
...
from alto-tools.
To get hyphenated content as in the original file:
    subs_type = line.attrib.get('SUBS_TYPE', '')  # SUBS_TYPE is optional, avoid None
    if 'HypPart1' in subs_type:
        text = line.attrib.get('CONTENT') + ' '
    if 'HypPart2' in subs_type:
        text = line.attrib.get('CONTENT') + ' '
This should avoid duplicates.
i guess there is no right answer to this. however, once converted to text it's not obvious what is what: (based on a small sample, and more on intuition than on observed fact) i've seen cases where the end-of-line hyphenation was not recognized by the OCR engine but the text representation still contained a hyphen at the end of the line. these lines would be indistinguishable from #16 (comment) option 2). otoh, when the OCR engine does recognize an end-of-line hyphenation, it seems to be correct most of the time. for me, as i'm interested in the text for further NLP processing, having some words back in their 'normal' form (that is, a form known to NLP tools) is a plus; and knowing that they are often correct is also a plus.
so, i have a preference for option 1)... but as long as there is a run-time option to select one or the other, i don't really care. this raises the question (for me, then): should this become a run-time option?
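A run-time switch along these lines could look roughly like the sketch below. The flag name --hyphenation, its choices keep/substitute, and the helper join_hyphenated are hypothetical names made up for illustration; they are not existing alto-tools options or functions.

```python
import argparse


def build_parser():
    # "--hyphenation" and its choices are hypothetical names for this sketch,
    # not an existing alto-tools option.
    parser = argparse.ArgumentParser(description="hyphenation switch sketch")
    parser.add_argument(
        "--hyphenation",
        choices=["keep", "substitute"],
        default="keep",
        help="keep: CONTENT plus hyphen; substitute: SUBS_CONTENT once",
    )
    return parser


def join_hyphenated(part1, part2, subs_content, mode):
    """Render one word that is hyphenated across two lines in the chosen mode."""
    if mode == "substitute":
        # Option 1: emit the substitution content once, drop the second part.
        return subs_content
    # Option 2: keep both parts, with the hyphen and the line break.
    return part1 + "-\n" + part2


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(join_hyphenated("aver", "age", "averAge", args.hyphenation))
```

Which mode is the default is a separate decision; "keep" is used here only as a placeholder.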
Many thanks for sharing your thoughts on this @iiegn!
I am leaning more towards option 2. My thinking is that alto-tools should itself aim not to alter/normalize the OCR text in any way. While the HypPart is indeed a part of the OCR output (I assume it was added to ALTO for the reason you describe, ease of downstream NLP processing with normalized tokens), I now see it as more problematic that substituting hyphens may also alter line breaks. This can cause trouble for applications that rely on some form of text line to image alignment, like evaluation or training. And I suppose it could be a greater source of confusion and require more documentation than keeping the hyphenation from the source document. The fact that the capitalization (averAge vs aver-age) is different in the example ALTO I used is also still confusing me. I don't expect an OCR engine would do that, it must be some post-processing...
TL;DR - a runtime option would satisfy more downstream use cases at the cost of increased documentation.
I guess I will implement option 2 first as the default. But I can also see the value in option 1 and would like to implement it, just not very soon (too much other stuff going on). PRs are obviously welcome :) Also, example test files with hyphenation edge cases would be helpful if they can be shared freely.
Thank you @imlabormitlea-code! I will test this also next week and if all checks out, make a PR and credit you for the fix (i.e. unless you want to make a PR for this).
Basically, there seem to be two ways to deal with the hyphenation info from the ALTO XML.
Looking at a live example from the ALTO LOC website:
<TextLine ...>
    <String STYLEREFS="ID7" HEIGHT="72.0" WIDTH="276.0" HPOS="6150.0" VPOS="5388.0" CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge" WC="1.0"/>
    <HYP WIDTH="1.0" HPOS="6427.0" VPOS="5388.0" CONTENT="-"/>
</TextLine>
<TextLine ...>
    <String STYLEREFS="ID7" HEIGHT="90.0" WIDTH="213.0" HPOS="4053.0" VPOS="5556.0" CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge" WC="1.0"/>
</TextLine>
1) substitute hyphens
Where ALTO contains a string with SUBS_TYPE="HypPart1", use the SUBS_CONTENT to substitute the hyphenated word in the text output. SUBS_CONTENT must be used only once; the second part of the hyphenated content should be ignored, e.g. as suggested by #16 (comment).
Code:
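    subs_type = line.attrib.get('SUBS_TYPE', '')  # SUBS_TYPE is optional, avoid None
    if 'HypPart1' in subs_type:
        # emit the full word once, on the first part
        text = line.attrib.get('SUBS_CONTENT') + ' '
    if 'HypPart2' in subs_type:
        # second part is already covered by SUBS_CONTENT
        text = ''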
Output:
averAge
2) don't substitute hyphens
Output closer to the original, e.g. as suggested by #16 (comment): use the CONTENT of each string, with the HYP rendered as - and followed by a text line break.
Code:
    subs_type = line.attrib.get('SUBS_TYPE', '')  # SUBS_TYPE is optional, avoid None
    if 'HypPart1' in subs_type:
        # keep the first part and append the hyphen
        text = line.attrib.get('CONTENT') + '-'
    if 'HypPart2' in subs_type:
        text = line.attrib.get('CONTENT') + ' '
Output:
aver-
age
Note the correct capitalisation compared to using the substitution content (this is from the ALTO source).
I am not sure I have a preference for how this should be handled by default. Thoughts?
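For comparison, here is a minimal, self-contained sketch of both options applied to a trimmed copy of the two text lines above. It assumes no ALTO namespace and drops the layout attributes; SAMPLE, line_text, extract_text and the substitute parameter are hypothetical names for this illustration, not alto-tools code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two text lines quoted above: no ALTO namespace and no
# layout attributes, wrapped in a single root so it parses on its own.
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
  </TextLine>
</TextBlock>
"""


def line_text(text_line, substitute):
    """Return the text of one TextLine element (hypothetical helper)."""
    words = []
    for string in text_line.iter("String"):
        subs_type = string.attrib.get("SUBS_TYPE", "")  # attribute is optional
        content = string.attrib.get("CONTENT", "")
        if substitute and subs_type == "HypPart1":
            # Option 1: emit the full word once, on the first part.
            words.append(string.attrib.get("SUBS_CONTENT", content))
        elif substitute and subs_type == "HypPart2":
            continue  # already emitted with HypPart1, avoid the duplicate
        elif subs_type == "HypPart1":
            # Option 2: keep the first part and append the hyphen.
            words.append(content + "-")
        else:
            words.append(content)
    return " ".join(words)


def extract_text(xml, substitute=False):
    """Join the text of all TextLine elements, one output line per TextLine."""
    root = ET.fromstring(xml)
    return "\n".join(line_text(tl, substitute) for tl in root.iter("TextLine"))


if __name__ == "__main__":
    print(extract_text(SAMPLE, substitute=False))
    print(extract_text(SAMPLE, substitute=True))
```

Note that in this sketch option 1 leaves the second output line empty for the sample, because its only string is the HypPart2 that gets skipped; the number of output lines still matches the number of TextLine elements.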