qurator-spk / mods4pandas
Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames for data analysis
License: Apache License 2.0
Using "qurator" requires coordination between all projects. → Avoid this.
mods:name elements may now have more than one role:
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
<mods:name type="personal" valueURI="http://d-nb.info/gnd/117357669">
<mods:displayForm>Wurm, Mary</mods:displayForm>
<mods:namePart type="given">Mary</mods:namePart>
<mods:nameIdentifier type="gbv-ppn">078789583</mods:nameIdentifier>
<mods:namePart type="family">Wurm</mods:namePart>
<mods:role>
<mods:roleTerm authority="marcrelator" type="code">cmp</mods:roleTerm>
</mods:role>
<mods:role>
<mods:roleTerm authority="marcrelator" type="code">aut</mods:roleTerm>
</mods:role>
</mods:name>
</mods:mods>
This should be merged into one column, e.g. d['name0_role_roleTerm'] == {'cmp', 'aut'}
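A minimal sketch of how the roles of one mods:name could be merged into a single set-valued column, using only the standard library. The helper name role_terms is made up for illustration and this is not necessarily how mods4pandas handles it; the sample XML is a shortened copy of the example above.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

SAMPLE = """
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:name type="personal" valueURI="http://d-nb.info/gnd/117357669">
    <mods:displayForm>Wurm, Mary</mods:displayForm>
    <mods:role>
      <mods:roleTerm authority="marcrelator" type="code">cmp</mods:roleTerm>
    </mods:role>
    <mods:role>
      <mods:roleTerm authority="marcrelator" type="code">aut</mods:roleTerm>
    </mods:role>
  </mods:name>
</mods:mods>
"""


def role_terms(name_element):
    """Collect the role codes of one mods:name element into a set (hypothetical helper)."""
    return {
        term.text.strip()
        for term in name_element.findall("mods:role/mods:roleTerm", MODS_NS)
        if term.text
    }


mods = ET.fromstring(SAMPLE)
d = {}
for i, name in enumerate(mods.findall("mods:name", MODS_NS)):
    # One column per name, holding all of its roles
    d[f"name{i}_role_roleTerm"] = role_terms(name)

print(d)
```

A set keeps the column count independent of the number of roles, at the cost of a non-scalar cell value.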
From the feedback by Maria Federbusch:
PPN813655765
In addition, the subject headings are still missing in this example (MONO), here from the ARMA project:
<mods:subject authority="getty">
<mods:topic valueURI="http://vocab.getty.edu/aat/300411614">reading culture</mods:topic>
<mods:topic valueURI="http://vocab.getty.edu/aat/300020756">Medieval (European)</mods:topic>
</mods:subject>
<mods:subject authority="wikidata">
<mods:topic valueURI="https://www.wikidata.org/wiki/Q107274053">Reading culture (medieval)</mods:topic>
</mods:subject>
<mods:genre valueURI="https://www.wikidata.org/wiki/Q1261026" type="class" authority="wikidata">
<mods:genre>printed matter</mods:genre>
</mods:genre>
mods:genre and b. more examples of mods:genre: here the structure is clearly different from the mods:subject examples. Would it make more sense if the nested genre had the valueURI?

Currently, the README does not show any results (i.e. an excerpt of the resulting table).
@labusch had questions regarding structure information (from METS metadata) and @joergleh already had suggestions regarding missing information (#23, #24).
While there is certainly information that I find out of scope for this tool (like the location of a title page → should use the original METS for this) there is certainly information we should include (like the count/presence of a title page).
(Edit: Moved the missing field documentation to #27.)
structMap[@TYPE="LOGICAL"]: should count the divs, grouped by their type. They are nested, so this needs to be accounted for.
structMap[@TYPE="PHYSICAL"]: Count?
tests/data to read them more easily
Dissect this comment by @joergleh:
Otherwise: since only the top-level dmdSec was used, fields such as the title of the parent work, alternative titles of individual works, the information that this work is part of a multi-volume publication, structural information (e.g. about the extent of the volume or about where the table of contents can be found, if there is one), information about the owner of the physical book, the digitization project, or some specific subject indexing may not be contained in the table.
How could I get the total number of pages/images/canvases from the METS files of all objects returned in a query like https://digital.staatsbibliothek-berlin.de/suche?queryString=type%3Aannotation%20date_issued%3A%3E1455%20date_issued%3A%3C1800&category=Naturwissenschaften%20%2F%20Mathematik
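A minimal sketch of how the structMap div counts discussed above could be computed with the standard library, and how a page total could be derived from them. The function name count_divs_by_type and the sample document are made up for illustration; the sketch assumes that pages are encoded as divs with TYPE="page" in the physical structMap.

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS_URI = "http://www.loc.gov/METS/"
METS_NS = {"mets": METS_URI}

# Made-up sample, not taken from a real METS file
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="monograph">
      <mets:div TYPE="title_page"/>
      <mets:div TYPE="chapter"><mets:div TYPE="illustration"/></mets:div>
      <mets:div TYPE="chapter"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page"/><mets:div TYPE="page"/><mets:div TYPE="page"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def count_divs_by_type(mets_root, struct_map_type):
    """Count all divs of one structMap, grouped by their TYPE attribute."""
    counts = Counter()
    for struct_map in mets_root.findall("mets:structMap", METS_NS):
        if struct_map.get("TYPE") != struct_map_type:
            continue
        # iter() walks the whole subtree, so nested divs are counted too
        for div in struct_map.iter(f"{{{METS_URI}}}div"):
            counts[div.get("TYPE")] += 1
    return counts


root = ET.fromstring(SAMPLE)
logical = count_divs_by_type(root, "LOGICAL")
physical = count_divs_by_type(root, "PHYSICAL")
print(dict(logical))
print(physical["page"])
```

Summing physical["page"] over the METS files of all objects in a result set would then give a total page count, under the assumption stated above.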
Since the ALTO functionality was added, the imports are a mess. This should be reviewed.
Language attributes are LANG and the deprecated language:
TextBlock/@language
tags are used
ERROR:mods4pandas:Exception in /srv/digisam_mets/PPN1830497871.xml: One or more element has unexpected attributes: <mods:recordIdentifier xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" source="dnb-ppn">1236513355</mods:recordIdentifier>
Edit after feedback from a co-worker:
mods:recordIdentifier: we need to check that it's a GBV PPN ("our" PPN)
mods:relatedItem//mods:recordIdentifier: we need to distinguish GBV PPNs from DNB PPNs (or others). This happens when we have digitized works from other libraries (in this case DNB)
The recent downgrade to Pandas 1.0.5 broke Python 3.10
(Moved from #26)
There's also the issue that I usually forget what we already included in our current mods4pandas/current mods_info export. We should have a complete description of the fields. This should ideally be generated automatically, or at least checked for completeness. This would also enable checking what changed between versions of the export. (This is related to #26 because I simply could not tell @labusch what is currently possible with the info we already export.)
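One possible starting point for such an automatically generated field description is a simple inventory of the exported columns, which can also be compared between two export versions. This is only a sketch: field_inventory and changed_fields are hypothetical helpers, the rows stand in for exported records, and the column names are illustrative.

```python
from collections import Counter


def field_inventory(rows):
    """Return {column name: number of rows with a non-empty value}, sorted by name."""
    inventory = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                continue
            inventory[key] += 1
    return dict(sorted(inventory.items()))


def changed_fields(old_inventory, new_inventory):
    """List the columns that were added or removed between two exports."""
    old, new = set(old_inventory), set(new_inventory)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Illustrative rows; the column names are assumptions, not the real export schema
old_rows = [
    {"titleInfo_title": "A", "name0_displayForm": "Wurm, Mary"},
    {"titleInfo_title": "B", "name0_displayForm": None},
]
new_rows = [
    {"titleInfo_title": "A", "name0_namePart.family": "Wurm"},
]

old_inventory = field_inventory(old_rows)
new_inventory = field_inventory(new_rows)
print(old_inventory)
print(changed_fields(old_inventory, new_inventory))
```

An inventory like this only lists what is present in the data; a description of what each field means would still have to be written by hand.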
Our pandas dependency is broken on Python 3.12.
Should this be in here? Or in "codename altotool"?
What info would be relevant? What would be metadata, what would be data (count words?)
Include metadata from the Description section
Include descriptive statistics for the Layout section etc.
When that's done review the comments below for things we may have missed
Test using all available versions of ALTO
NER annotated ALTO should at least be identifiable
Include ALTO version/namespace
<LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>
Any language info?
Update README that we now support ALTO
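For the ALTO version/namespace item in the list above, a minimal sketch of how the namespace could be read from the root element with the standard library. The helper names are made up for illustration, and the sample file is a stub that only reuses the LayoutTag line from the list.

```python
import xml.etree.ElementTree as ET

# Stub document for illustration only
SAMPLE = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>
  </Tags>
  <Layout/>
</alto>
"""


def alto_namespace(root):
    """Return the namespace URI of the root element, or None if it has none."""
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


def layout_tag_labels(root):
    """Return the LABEL attributes of all LayoutTag elements in the root's namespace."""
    ns = alto_namespace(root)
    tag = f"{{{ns}}}LayoutTag" if ns else "LayoutTag"
    return [el.get("LABEL") for el in root.iter(tag)]


root = ET.fromstring(SAMPLE)
namespace = alto_namespace(root)
labels = layout_tag_labels(root)
print(namespace)
print(labels)
```

Storing the namespace URI as a column would make it possible to filter or group files by ALTO version later.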
/home/mike/devel/modstool-github/qurator/modstool/modstool.py:504: UserWarning: Exception in /srv/digisam_mets/PPN1683730747.xml:
More than one instance: <mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-1</mods:shelfLocator>
<mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-2</mods:shelfLocator>
warnings.warn('Exception in {}:\n{}'.format(mets_file, e))
Traceback (most recent call last):
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 488, in process
d = flatten(mods_to_dict(mods, raise_errors=True))
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 221, in mods_to_dict
value['location'] = TagGroup(tag, group) \
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 69, in descend
return _to_dict(self.is_singleton().group[0], raise_errors)
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 201, in _to_dict
return mods_to_dict(root, raise_errors)
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 234, in mods_to_dict
value['shelfLocator'] = TagGroup(tag, group) \
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 36, in is_singleton
raise ValueError('More than one instance: {}'.format(self))
ValueError: More than one instance: <mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-1</mods:shelfLocator>
Following up on #12, we need a better name for altotool.
/home/mike/devel/modstool-github/qurator/modstool/modstool.py:504: UserWarning: Exception in test-data/PPN1727545451.xml:
Unknown tag "{http://www.loc.gov/mods/v3}partName"
warnings.warn('Exception in {}:\n{}'.format(mets_file, e))
Traceback (most recent call last):
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 488, in process
d = flatten(mods_to_dict(mods, raise_errors=True))
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 293, in mods_to_dict
.is_singleton().has_no_attributes().descend(raise_errors)
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 69, in descend
return _to_dict(self.is_singleton().group[0], raise_errors)
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 201, in _to_dict
return mods_to_dict(root, raise_errors)
File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 377, in mods_to_dict
raise ValueError('Unknown tag "{}"'.format(tag))
ValueError: Unknown tag "{http://www.loc.gov/mods/v3}partName"
So far, only the content of <mods:identifier type="PPNanalog"> seems to be included in the output, but not the equivalent information from
<mods:relatedItem type="original">
<mods:recordInfo>
<mods:recordIdentifier source="gbv-ppn">PPNxyz</mods:recordIdentifier>
</mods:recordInfo>
</mods:relatedItem>
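A minimal sketch of how the record identifiers of related items could be read and filtered by their source attribute, so that GBV PPNs can be told apart from others. The function name related_record_identifiers is hypothetical, and the second relatedItem in the sample (with a dnb-ppn source) is made up to show the filtering.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

SAMPLE = """
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:relatedItem type="original">
    <mods:recordInfo>
      <mods:recordIdentifier source="gbv-ppn">PPNxyz</mods:recordIdentifier>
    </mods:recordInfo>
  </mods:relatedItem>
  <mods:relatedItem type="original">
    <mods:recordInfo>
      <mods:recordIdentifier source="dnb-ppn">1236513355</mods:recordIdentifier>
    </mods:recordInfo>
  </mods:relatedItem>
</mods:mods>
"""


def related_record_identifiers(mods_root, source="gbv-ppn"):
    """Return the recordIdentifier values below mods:relatedItem with the given source."""
    path = "./mods:relatedItem//mods:recordIdentifier"
    return [
        el.text
        for el in mods_root.findall(path, MODS_NS)
        if el.get("source") == source
    ]


mods = ET.fromstring(SAMPLE)
gbv = related_record_identifiers(mods)
dnb = related_record_identifiers(mods, source="dnb-ppn")
print(gbv, dnb)
```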
Newer input files have two <mods:languageTerm>s in one <mods:language>:
<mods:language>
<mods:languageTerm authority="iso639-2b" type="code">ger</mods:languageTerm>
<mods:languageTerm authority="iso639-2b" type="code">eng</mods:languageTerm>
</mods:language>
modstool should not throw an exception here.
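One way to avoid the exception would be to collect all language terms into a list instead of requiring exactly one. The sketch below uses only the standard library; language_terms is a hypothetical helper, not the actual modstool code, and the sample wraps the snippet above in a mods:mods root.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

SAMPLE = """
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:language>
    <mods:languageTerm authority="iso639-2b" type="code">ger</mods:languageTerm>
    <mods:languageTerm authority="iso639-2b" type="code">eng</mods:languageTerm>
  </mods:language>
</mods:mods>
"""


def language_terms(mods_root):
    """Return all language codes in document order, however many there are."""
    path = "mods:language/mods:languageTerm"
    return [el.text for el in mods_root.findall(path, MODS_NS) if el.text]


languages = language_terms(ET.fromstring(SAMPLE))
print(languages)
```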
Aim: Reduce number of columns for better manageability of data frame
Proposal: Group names given in the MODS-file according to given roles
Explanation: Each "name" entry in the mods-file consists of at least four parts:
"nameXX_namePart.family"
"nameXX_namePart.given"
"nameXX_displayForm"
"nameXX_role_roleTerm"
The number of columns could be reduced significantly if the names were first grouped according to their roles and then concatenated into fewer columns.
Examples:
PPN735425078 contains 76 names with the role "asn" (= associated name); this amounts to 304 columns, but could be reduced to three columns ("nameASN_namePart.family", "nameASN_namePart.given", "nameASN_displayForm"), each containing 76 names in nested form (Mauschwitz; Baudis; Hoberg; ...)
PPN858144891 contains 50 names with the role "oth" (= other); this amounts to 200 columns, but could be reduced to three columns ("nameOTH_namePart.family", "nameOTH_namePart.given", "nameOTH_displayForm")
PPN1774254956 contains 42 names with the role "ctb" (= contributor); this amounts to 168 columns, but could be reduced to three columns ("nameCTB_namePart.family", "nameCTB_namePart.given", "nameCTB_displayForm")
The most frequently used roles are asn (associated name), oth (other), ctb (contributor), dte (dedicatee), fnd (funder), aut (author), isb (issuing body), egr (engraver), hnr (honoree), ill (illustrator), prt (printer).
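The proposal above could be sketched roughly as follows, operating on one flattened record. This is an illustration under assumptions: group_names_by_role is a hypothetical helper, each name is assumed to have exactly one role stored as a string, and the sample row is made up apart from the three family names quoted above.

```python
import re
from collections import defaultdict

NAME_KEY = re.compile(r"^name(\d+)_(.+)$")
ROLE_FIELD = "role_roleTerm"


def group_names_by_role(row, sep="; "):
    """Replace nameXX_* columns by one concatenated column per role and field."""
    names = defaultdict(dict)  # name index -> {field: value}
    result = {}
    for key, value in row.items():
        match = NAME_KEY.match(key)
        if match:
            names[int(match.group(1))][match.group(2)] = value
        else:
            result[key] = value  # non-name columns are kept unchanged

    grouped = defaultdict(lambda: defaultdict(list))  # role -> field -> values
    for index in sorted(names):
        fields = names[index]
        role = fields.get(ROLE_FIELD) or "unknown"
        for field, value in fields.items():
            if field != ROLE_FIELD and value:
                grouped[role][field].append(value)

    for role, fields in grouped.items():
        for field, values in fields.items():
            result[f"name{role.upper()}_{field}"] = sep.join(values)
    return result


row = {
    "titleInfo_title": "Example",  # illustrative column
    "name0_namePart.family": "Mauschwitz", "name0_role_roleTerm": "asn",
    "name1_namePart.family": "Baudis", "name1_role_roleTerm": "asn",
    "name2_namePart.family": "Hoberg", "name2_role_roleTerm": "asn",
}
grouped_row = group_names_by_role(row)
print(grouped_row)
```

Concatenating into a delimited string keeps the cells scalar; the link between the given name, family name and display form of one person is then only preserved through the position within each string.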
The mods:name now has a mods:nameIdentifier:
/home/mike/devel/qurator-mono-repo/modstool/qurator/modstool/modstool.py:428: UserWarning: Exception in /srv/digisam_mets/PPN1678618276.xml:
Unknown tag "{http://www.loc.gov/mods/v3}nameIdentifier"
warnings.warn('Exception in {}:\n{}'.format(mets_file, e))
Also, the mods:name/mods:displayForm (optional according to DFG MODS-Anwendungsprofil) got dropped in favor of the mandatory name part fields.
The obvious thing that could be included:
In addition to that:
modstool.py