cneud / alto-tools
Python tools for performing various operations on ALTO XML files
License: Apache License 2.0
This kind of document does not use the same namespace: https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k6579172z&E=ALTO&Deb=4.
But the warning cannot be silenced.
The available ALTO namespaces are not sufficient for ALTO documents from the National Library of Scotland Digital Foundry.
Needs the addition of 'alto-v3-alt': 'http://www.loc.gov/standards/alto/v3/alto.xsd' to the namespaces.
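One way to sidestep a fixed namespace list is to read the namespace from the root element of each file. The following is a minimal sketch, not the alto-tools implementation: the names `KNOWN_NAMESPACES`, `detect_namespace` and `string_contents` are hypothetical, and only the 'alto-v3-alt' entry comes from the report above.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table; only 'alto-v3-alt' is taken from the report above.
KNOWN_NAMESPACES = {
    "alto-v3": "http://www.loc.gov/standards/alto/ns-v3#",
    "alto-v3-alt": "http://www.loc.gov/standards/alto/v3/alto.xsd",
}


def detect_namespace(root):
    """Return the namespace URI of the root element, or '' if it has none."""
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


def string_contents(xml_text):
    """Collect the CONTENT attribute of every String element,
    using whatever namespace the document itself declares."""
    root = ET.fromstring(xml_text)
    uri = detect_namespace(root)
    tag = f"{{{uri}}}String" if uri else "String"
    return [el.get("CONTENT", "") for el in root.iter(tag)]


sample = (
    '<alto xmlns="http://www.loc.gov/standards/alto/v3/alto.xsd">'
    '<Layout><Page><TextLine>'
    '<String CONTENT="Hello"/><String CONTENT="world"/>'
    '</TextLine></Page></Layout></alto>'
)
print(string_contents(sample))
```

With this approach the table is only needed for reporting which schema version a file uses, so an unlisted namespace no longer prevents text extraction.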
add v4
Use click instead of argparse
The script confusingly takes only directories as INPUT, not files.
works:
python3 alto_tools.py ~/devel/experiments/tesseract-fraktur/ --text
does NOT work:
python3 alto_tools.py ~/devel/experiments/tesseract-fraktur/test-frak.xml --text
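Accepting both kinds of INPUT could look roughly like the sketch below. It is an illustration under assumptions, not the script's actual code: the helper name `collect_alto_files`, the `.xml` suffix filter and the recursive directory walk are all hypothetical choices.

```python
from pathlib import Path


def collect_alto_files(input_path, suffix=".xml"):
    """Return a list of files to process for a file or directory argument."""
    path = Path(input_path).expanduser()
    if path.is_file():
        # A single file is processed as-is, regardless of its suffix.
        return [path]
    if path.is_dir():
        # A directory is searched recursively for files with the given suffix.
        return sorted(p for p in path.rglob(f"*{suffix}") if p.is_file())
    raise FileNotFoundError(f"No such file or directory: {path}")
```

Both invocations shown above would then resolve to a list of paths that the rest of the script can iterate over.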
I guess it would be nice to have a pyproject.toml installable package and eventually releases on PyPI.
setup.py
With travis-ci.org long gone, move CI to CircleCI or GitHub Actions.
Neat tools, nice to have them available!
I noticed that hyphenated words, which are common in Finnish and also in Swedish, are now handled differently. I don't know whether this is intended, but they now appear in the output in parts rather than combined.
<String ID="P1_ST01183" HPOS="2372" VPOS="3307" WIDTH="183" HEIGHT="39" CONTENT="Täydellisem" SUBS_TYPE="HypPart1" SUBS_CONTENT="Täydellisempää" WC="0.99" CC="44545880035"/>
<HYP HPOS="2555" VPOS="3307" WIDTH="16" CONTENT="-"/>
</TextLine>
<TextLine ID="P1_TL00207" HPOS="1774" VPOS="3346" WIDTH="797" HEIGHT="43">
<String ID="P1_ST01184" HPOS="1774" VPOS="3348" WIDTH="60" HEIGHT="39" CONTENT="pää" SUBS_TYPE="HypPart2" SUBS_CONTENT="Täydellisempää" WC="0.99" CC="100"/>
<SP ID="P1_SP00976" HPOS="1834" VPOS="3387" WIDTH="24"/>
At the moment, e.g., the alto_text function takes the CONTENT attribute, so hyphenated words come out as separate tokens.
For example, comparing the original version with a version where they are combined, the difference is visible: 'Täydellisempää' vs. 'Täydellisem' 'pää'. Which is preferable depends on whether the output should follow the text line boundaries or be more readable, and there are arguments for either solution.
207,208c207,208
< roilja paremmin kestää hallaaki. Täydellisempää
< sala-ojitusta talonpoikain pelloilla ei saata
---
> roilja paremmin kestää hallaaki. Täydellisem
> pää sala-ojitusta talonpoikain pelloilla ei saata
211,216c211,216
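The two behaviours discussed here can be put behind a switch. The sketch below is not the alto_text implementation; `line_texts` and its `join_hyphenated` parameter are hypothetical names. It assumes that SUBS_CONTENT should be emitted once, at the HypPart1 position, and that the HypPart2 fragment is then skipped so the full word is not duplicated on the next line.

```python
import xml.etree.ElementTree as ET


def line_texts(xml_text, join_hyphenated=True):
    """Return one string per TextLine, optionally joining hyphenated words."""
    root = ET.fromstring(xml_text)
    lines = []
    for el in root.iter():  # document order: a TextLine precedes its children
        tag = el.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
        if tag == "TextLine":
            lines.append([])
        elif tag == "String" and lines:
            subs_type = el.get("SUBS_TYPE")
            if join_hyphenated and subs_type == "HypPart1":
                # Emit the full word once, on the line where it starts.
                word = el.get("SUBS_CONTENT", el.get("CONTENT", ""))
            elif join_hyphenated and subs_type == "HypPart2":
                # Skip the second fragment to avoid duplicating the word.
                continue
            else:
                word = el.get("CONTENT", "")
            lines[-1].append(word)
    return [" ".join(words) for words in lines]


sample = (
    "<alto><TextLine>"
    '<String CONTENT="hallaaki."/>'
    '<String CONTENT="Täydellisem" SUBS_TYPE="HypPart1" SUBS_CONTENT="Täydellisempää"/>'
    '<HYP CONTENT="-"/>'
    "</TextLine><TextLine>"
    '<String CONTENT="pää" SUBS_TYPE="HypPart2" SUBS_CONTENT="Täydellisempää"/>'
    '<String CONTENT="sala-ojitusta"/>'
    "</TextLine></alto>"
)
print(line_texts(sample))
print(line_texts(sample, join_hyphenated=False))
```

With `join_hyphenated=False` the output follows the text line boundaries; with the default it favours readable running text.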
While using alto_text(), I see duplicated content at hyphenations: the word at the end of one line and the word at the beginning of the next are identical (un-hyphenated) words. It seems SUBS_CONTENT gets used from both HypPart1 and HypPart2.
Looking into this, I'd assume the indentation of the block:
Lines 66 to 67 in e942f86