Comments (12)
We're currently completely ignoring the structMap, so this needs new code ;)
from mods4pandas.
@labusch told me there's interest in per-page information ("This is a title page" or whatever types exist). IMHO this should not go into the mods_info, but into a separate file/table - and I would consider this after reviewing first (I am simply not familiar enough with mets:structMap yet.)
from mods4pandas.
AFAICT, what we would need is this: for a given list of PPNs, we want to know which specific image files per PPN correspond to each (if any) of the structure types (Strukturdaten) listed here (scroll down to "Strukturtypen - vollständige Liste", i.e. "Structure types - complete list"): https://digital.staatsbibliothek-berlin.de/features-und-hilfe/suchen-und-stoebern.
from mods4pandas.
Yeah that's per-page information :) Need to look at it for specifics.
I somewhat think it might still be an idea to read the METS but now that I understand the use case I might be able to come up with something that is easier to digest than 1. to deal with all that XML and 2. understand METS structure.
Just to restate it, because the grammar above reverses the direction, what you want is this:
Given ppn=PPN12345 and type=illustration, give me the matching pages (and the images).
That should be possible, as far as I can see now. There will be some difficulties (omnibus volumes (Sammelbände) etc.), but it could work.
from mods4pandas.
Given ppn=PPN12345 and type=illustration, give me the matching pages (and the images).
Exactly. The use case is to automatically ingest the structural tags (e.g. for title pages) as tags into the image search db, so that users can easily select all images from all PPNs that are title pages and annotate the corresponding regions on those images.
from mods4pandas.
That's the other way around ;) But I think I understand now and will implement it.
from mods4pandas.
The details are a bit tricky, but this seems a good way to do it:
Per PPN and page (as in structMap TYPE=PHYSICAL → TYPE=page):
- Have the filenames for this page in all fileGrp
- Have flags for every type if they exist (TYPE=LOGICAL div TYPE=)
- These are hierarchical, e.g. we could have (1) an illustration in the (2) bookend of the (3) binding (just making it up). I don't see an elegant way to keep the hierarchy while still exposing all the types. I think flags (booleans) for the types are sufficient, and people wanting more need to read the METS.
This way there's an immediate link between a file name and a logical type. (Hardcoding a fileGrp or guessing e.g. the filename from the ID in structMap[@TYPE=LOGICAL] would be only slightly easier, and would probably fail here and there → prefer the correct version and resolve filenames using all fileGrps.)
- Implement it
- Write some examples
- Read up to confirm that I understood the structMap[@TYPE=LOGICAL] correctly
- Look at some illustrations, just to make sure
- New export
- Sammelbände (I avoided those, and should probably know these better)
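The plan above could be sketched roughly as follows. This is only an illustration using the standard library's ElementTree on a tiny made-up METS document: the function name `page_rows`, the sample IDs and the example.org URLs are hypothetical, and the column names simply follow the `fileGrp_<USE>_file_FLocat_href` / `structmap_LOGICAL_TYPE_<type>` pattern that appears elsewhere in this thread. It is not the code in mods4pandas.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"

# Tiny made-up METS document, for illustration only.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="FILE_0001"><mets:FLocat xlink:href="https://example.org/0001.jpg"/></mets:file>
      <mets:file ID="FILE_0002"><mets:FLocat xlink:href="https://example.org/0002.jpg"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="illustration"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page"><mets:fptr FILEID="FILE_0001"/></mets:div>
      <mets:div ID="PHYS_0002" TYPE="page"><mets:fptr FILEID="FILE_0002"/></mets:div>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0002"/>
  </mets:structLink>
</mets:mets>"""


def page_rows(root):
    """Return one dict per physical page: file hrefs plus logical TYPE flags."""
    # file ID -> (fileGrp USE, href); assumes every file has one FLocat
    files = {}
    for grp in root.iter(f"{METS}fileGrp"):
        for f in grp.iter(f"{METS}file"):
            href = f.find(f"{METS}FLocat").get(f"{XLINK}href")
            files[f.get("ID")] = (grp.get("USE"), href)

    logical_types = {}   # logical div ID -> TYPE
    pages = []           # physical divs with TYPE="page", in document order
    for sm in root.iter(f"{METS}structMap"):
        for div in sm.iter(f"{METS}div"):
            if sm.get("TYPE") == "LOGICAL":
                logical_types[div.get("ID")] = div.get("TYPE")
            elif sm.get("TYPE") == "PHYSICAL" and div.get("TYPE") == "page":
                pages.append(div)

    # physical page ID -> set of logical TYPEs linked via smLink
    types_for_page = defaultdict(set)
    for link in root.iter(f"{METS}smLink"):
        log_type = logical_types.get(link.get(f"{XLINK}from"))
        if log_type is not None:
            types_for_page[link.get(f"{XLINK}to")].add(log_type)

    rows = []
    for page in pages:
        row = {"ID": page.get("ID")}
        for fptr in page.iter(f"{METS}fptr"):
            use, href = files[fptr.get("FILEID")]
            row[f"fileGrp_{use}_file_FLocat_href"] = href
        for t in sorted(types_for_page[page.get("ID")]):
            row[f"structmap_LOGICAL_TYPE_{t}"] = 1
        rows.append(row)
    return rows


rows = page_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The sketch deliberately resolves file names through the fileGrp entries rather than guessing them from IDs, and it ignores the Sammelbände problem entirely.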
from mods4pandas.
I've been held up by the joy of Sammelbände...
Implementing it as above is also a bit trickier than I thought (e.g. I need to read the fileGrp to know which fileGrp a file's FILEID belongs to, and associating structMap PHYSICAL with structMap LOGICAL is full of ID pointers, too), but I started it.
from mods4pandas.
Alright, this looks like it's going somewhere:
{'ID': 'PHYS_0594',
'fileGrp_DEFAULT_file_FLocat_href': 'https://content.staatsbibliothek-berlin.de/dc/PPN821507109-0000>
'fileGrp_PRESENTATION_file_FLocat_href': 'file:///goobi/tiff001/sbb/PPN821507109/00000594.tif',
'fileGrp_THUMBS_file_FLocat_href': 'https://content.staatsbibliothek-berlin.de/dc/PPN821507109-00000>
'ppn': 'TODO',
'structmap_LOGICAL_TYPE_illustration': 1,
'structmap_LOGICAL_TYPE_monograph': 1,
'structmap_LOGICAL_TYPE_section': 1}
This dict is going to be a line in a DataFrame and gives both the filenames (as they are in METS) and the associated TYPEs in the mets:structMap[@TYPE="LOGICAL"]. What the three indicator variables (the 1s) mean:
- This page is part of a monograph
- This page is part of a section
- This page is part of an illustration ("part of" ... this is as fine as it gets)
The way our METS is set up, you get these TYPEs explicitly, but I also made it transitive, e.g. if an illustration is part of a section, you would get section in any case.
With this info you can get the structMap TYPEs for a given page and also go backwards, i.e. get the pages that have illustrations on them.
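Both directions of lookup can be illustrated with plain Python over a few made-up rows of the shape shown above. The helper names `types_of_page` and `pages_with_type` and the page IDs are hypothetical; in a DataFrame the same questions would presumably become simple column filters.

```python
PREFIX = "structmap_LOGICAL_TYPE_"

# Made-up rows in the shape of the dict above (file columns omitted).
rows = [
    {"ID": "PHYS_0001", PREFIX + "monograph": 1},
    {"ID": "PHYS_0002", PREFIX + "monograph": 1, PREFIX + "section": 1},
    {"ID": "PHYS_0003", PREFIX + "monograph": 1, PREFIX + "illustration": 1},
]


def types_of_page(rows, page_id):
    """Forward lookup: all logical TYPEs flagged for one page."""
    for row in rows:
        if row["ID"] == page_id:
            return sorted(
                key[len(PREFIX):]
                for key, value in row.items()
                if key.startswith(PREFIX) and value == 1
            )
    return []


def pages_with_type(rows, type_name):
    """Backward lookup: IDs of all pages flagged with a given TYPE."""
    return [row["ID"] for row in rows if row.get(PREFIX + type_name) == 1]


print(types_of_page(rows, "PHYS_0003"))
print(pages_with_type(rows, "illustration"))
```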
from mods4pandas.
(Currently working in branch feat/page_info).
from mods4pandas.
Implementation is too slow: For sbb-mets-PPN821507109.xml (~1300 pages), it takes 80s to process... For now, I'll ignore this and improve later. It's probably all the XPath in here.
- Write a test
- Test how to improve this (use XPath class instead of .xpath()? Avoid XPath? Resolve links the other way around)
from mods4pandas.
Structure types were given transitively in my test files, e.g. a page that had type cover_front and was part of a binding had smLinks to both the cover_front and the binding logical elements. This was not the case for all documents in @labusch's selection.
I had code that would sanitize this, i.e. walk up the tree and add the types. Coincidentally, this code was buggy and failed for a few hundred documents, which is how I noticed the (possible) inconsistency.
- Investigate files
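The "walk up the tree and add the types" step could look something like the sketch below. It is not the sanitizing code mentioned above, only an illustration with the standard library's ElementTree: the function name `types_with_ancestors` and the sample IDs are hypothetical, and because ElementTree elements have no parent pointer, a child-to-parent map is built first.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"

# Made-up logical structMap: a cover_front inside a binding inside a monograph.
SAMPLE = """<mets:structMap xmlns:mets="http://www.loc.gov/METS/" TYPE="LOGICAL">
  <mets:div ID="LOG_0000" TYPE="monograph">
    <mets:div ID="LOG_0001" TYPE="binding">
      <mets:div ID="LOG_0002" TYPE="cover_front"/>
    </mets:div>
  </mets:div>
</mets:structMap>"""


def types_with_ancestors(struct_map):
    """Map each logical div ID to its own TYPE followed by its ancestors' TYPEs."""
    # ElementTree has no getparent(), so record each element's parent once.
    parent = {child: p for p in struct_map.iter() for child in p}
    result = {}
    for div in struct_map.iter(f"{METS}div"):
        types = []
        node = div
        # Climb until we leave the mets:div hierarchy (i.e. reach the structMap).
        while node is not None and node.tag == f"{METS}div":
            types.append(node.get("TYPE"))
            node = parent.get(node)
        result[div.get("ID")] = types
    return result


closure = types_with_ancestors(ET.fromstring(SAMPLE))
print(closure["LOG_0002"])
```

A page linked only to LOG_0002 would then receive the flags for cover_front, binding and monograph, regardless of whether the document itself lists the ancestor links explicitly.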
from mods4pandas.