mods4pandas's Introduction

Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames.

mods4pandas converts the MODS metadata from METS files into a pandas DataFrame.

Column names are derived from the corresponding MODS elements. Some domain knowledge is used to convert elements to a useful column, e.g. produce sets instead of ordered lists for topics, etc. Parts of the tool are specific to our environment/needs at the State Library Berlin and may need to be changed for your library.

alto4pandas converts the metadata from ALTO files into a pandas DataFrame.

Column names are derived from the corresponding ALTO elements. Some columns contain descriptive statistics (e.g. counts or mean) of the corresponding ALTO elements or attributes.
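
As a rough illustration of such descriptive statistics, the sketch below counts the String elements of a tiny, made-up ALTO v3 document and averages their WC (word confidence) attribute. The sample XML, the function name string_stats and the column names are assumptions for illustration, not the names alto4pandas actually produces.

```python
import statistics
import xml.etree.ElementTree as ET

# Hypothetical, minimal ALTO v3 snippet; real files carry far more structure.
ALTO_SAMPLE = """\
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Hello" WC="0.9"/>
            <String CONTENT="world" WC="0.7"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}


def string_stats(alto_xml: str) -> dict:
    """Return the count and mean word confidence of all String elements."""
    root = ET.fromstring(alto_xml)
    strings = root.findall(".//alto:String", NS)
    # WC is optional in ALTO, so skip elements that do not carry it.
    confidences = [float(s.get("WC")) for s in strings if s.get("WC") is not None]
    return {
        "String-count": len(strings),  # assumed column name
        "String-WC-mean": statistics.mean(confidences) if confidences else None,
    }


stats = string_stats(ALTO_SAMPLE)
print(stats)
```

The namespace is hard-coded to ALTO v3 here; files in other ALTO versions use a different namespace URI and would need a corresponding lookup.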

Usage

mods4pandas /path/to/a/directory/containing/mets_files
alto4pandas /path/to/a/directory/full/of/alto_files

Example

In this example we convert the MODS metadata contained in the METS files in /srv/data/digisam_mets-sample-300 to a pandas DataFrame, which is written to mods_info_df.pkl in the current directory. This file can then be read by your data scientist using pd.read_pickle().

% mods4pandas /srv/data/digisam_mets-sample-300
INFO:root:Scanning directory /srv/data/digisam_mets-sample-300
301it [00:00, 19579.19it/s]
INFO:root:Processing METS files
100%|████████████████████████████████████████| 301/301 [00:01<00:00, 162.59it/s]
INFO:root:Writing DataFrame to mods_info_df.pkl
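
Reading the result back is a single pd.read_pickle() call. The sketch below stays self-contained by first writing a small stand-in DataFrame to a temporary directory. The column names and values are invented for illustration and do not reflect the real export.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the real output: a tiny frame with hypothetical columns.
sample = pd.DataFrame(
    {
        "recordInfo_recordIdentifier": ["PPN1", "PPN2"],
        "titleInfo_title": ["A", "B"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "mods_info_df.pkl"
    sample.to_pickle(pkl_path)

    # This is the step a consumer of mods_info_df.pkl would run.
    mods_info_df = pd.read_pickle(pkl_path)

print(mods_info_df.shape)
```

As with any pickle, only load files from a source you trust.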

In the next example we convert the metadata from the ALTO files in the test data directory:

% alto4pandas qurator/mods4pandas/tests/data/alto
Scanning directory qurator/mods4pandas/tests/data/alto
Scanning directory qurator/mods4pandas/tests/data/alto/PPN636777308
Scanning directory qurator/mods4pandas/tests/data/alto/734008031
Scanning directory qurator/mods4pandas/tests/data/alto/PPN895016346
Scanning directory qurator/mods4pandas/tests/data/alto/PPN640992293
Scanning directory qurator/mods4pandas/tests/data/alto/alto-ner
Scanning directory qurator/mods4pandas/tests/data/alto/PPN767883624
Scanning directory qurator/mods4pandas/tests/data/alto/PPN715049151
Scanning directory qurator/mods4pandas/tests/data/alto/749782137
Scanning directory qurator/mods4pandas/tests/data/alto/weird-ns
INFO:alto4pandas:Processing ALTO files
INFO:alto4pandas:Writing DataFrame to alto_info_df.pkl

mods4pandas's People

Contributors

mikegerber

mods4pandas's Issues

MODS "name" changes

The mods:name now has a mods:nameIdentifier:

/home/mike/devel/qurator-mono-repo/modstool/qurator/modstool/modstool.py:428: UserWarning: Exception in /srv/digisam_mets/PPN1678618276.xml:
Unknown tag "{http://www.loc.gov/mods/v3}nameIdentifier"
  warnings.warn('Exception in {}:\n{}'.format(mets_file, e))

Also, the mods:name/mods:displayForm (optional according to DFG MODS-Anwendungsprofil) got dropped in favor of the mandatory name part fields.

Better name for altotool

Following up on #12, we need a better name for altotool.

  • Rename altotool to alto4pandas
  • Rename modstool to mods4pandas
  • Rename qurator.modstool
  • Rename project to mods4pandas

Missing subject/topic, genre

From the feedback by Maria Federbusch:

PPN813655765
Additionally, the subject headings are still missing in this example (MONO), here from the ARMA project:

<mods:subject authority="getty">
<mods:topic valueURI="http://vocab.getty.edu/aat/300411614">reading culture</mods:topic>
<mods:topic valueURI="http://vocab.getty.edu/aat/300020756">Medieval (European)</mods:topic>
</mods:subject>
<mods:subject authority="wikidata">
<mods:topic valueURI="https://www.wikidata.org/wiki/Q107274053">Reading culture (medieval)</mods:topic>
</mods:subject>
<mods:genre valueURI="https://www.wikidata.org/wiki/Q1261026" type="class" authority="wikidata">
<mods:genre>printed matter</mods:genre>
</mods:genre>
  • Review a. DFG MODS Anwendungsprofil for mods:genre and b. more examples of mods:genre - here the structure is clearly different from the mods:subject examples. Would make more sense if the nested genre had the valueURI?
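
One possible way to pick up the mods:subject/mods:topic values from the example above is sketched below with the standard library's ElementTree. It collects the topics per authority into sets. The function name topics_by_authority and the grouping by authority are assumptions, not the tool's current behaviour, and mods:genre is left out because its structure is still under review.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# The subject elements from the report, wrapped in a mods:mods root.
MODS_SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:subject authority="getty">
    <mods:topic valueURI="http://vocab.getty.edu/aat/300411614">reading culture</mods:topic>
    <mods:topic valueURI="http://vocab.getty.edu/aat/300020756">Medieval (European)</mods:topic>
  </mods:subject>
  <mods:subject authority="wikidata">
    <mods:topic valueURI="https://www.wikidata.org/wiki/Q107274053">Reading culture (medieval)</mods:topic>
  </mods:subject>
</mods:mods>
"""


def topics_by_authority(mods_xml: str) -> dict:
    """Map each subject authority to the set of its topic labels."""
    root = ET.fromstring(mods_xml)
    result = {}
    for subject in root.findall("mods:subject", NS):
        authority = subject.get("authority", "unknown")
        for topic in subject.findall("mods:topic", NS):
            # A set, because the order of topics carries no meaning.
            result.setdefault(authority, set()).add(topic.text)
    return result


topics = topics_by_authority(MODS_SAMPLE)
print(topics)
```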

Add missing information for "original" PPNs

So far, only the content of <mods:identifier type="PPNanalog"> seems to be included in the output, but not the equivalent information from

<mods:relatedItem type="original">
	<mods:recordInfo>
		<mods:recordIdentifier source="gbv-ppn">PPNxyz</mods:recordIdentifier>
	</mods:recordInfo>
</mods:relatedItem>
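
A minimal sketch of how that identifier could be read with ElementTree follows. The wrapping mods:mods root and the function name original_ppns are assumptions, and PPNxyz is the placeholder from the snippet above.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

MODS_SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:relatedItem type="original">
    <mods:recordInfo>
      <mods:recordIdentifier source="gbv-ppn">PPNxyz</mods:recordIdentifier>
    </mods:recordInfo>
  </mods:relatedItem>
</mods:mods>
"""


def original_ppns(mods_xml: str) -> list:
    """Return the GBV PPNs of all relatedItem elements of type 'original'."""
    root = ET.fromstring(mods_xml)
    path = (
        "mods:relatedItem[@type='original']"
        "/mods:recordInfo"
        "/mods:recordIdentifier[@source='gbv-ppn']"
    )
    return [element.text for element in root.findall(path, NS)]


ppns = original_ppns(MODS_SAMPLE)
print(ppns)
```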

More than one instance: <mods:shelfLocator>

/home/mike/devel/modstool-github/qurator/modstool/modstool.py:504: UserWarning: Exception in /srv/digisam_mets/PPN1683730747.xml:
More than one instance: <mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-1</mods:shelfLocator>
<mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-2</mods:shelfLocator>
  warnings.warn('Exception in {}:\n{}'.format(mets_file, e))
Traceback (most recent call last):
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 488, in process
    d = flatten(mods_to_dict(mods, raise_errors=True))
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 221, in mods_to_dict
    value['location'] = TagGroup(tag, group) \
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 69, in descend
    return _to_dict(self.is_singleton().group[0], raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 201, in _to_dict
    return mods_to_dict(root, raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 234, in mods_to_dict
    value['shelfLocator'] = TagGroup(tag, group) \
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 36, in is_singleton
    raise ValueError('More than one instance: {}'.format(self))
ValueError: More than one instance: <mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-1</mods:shelfLocator>

Integration of ALTO metadata

  • Should this be in here? Or in "codename altotool"?

  • What info would be relevant? What would be metadata, what would be data (count words?)

  • Include metadata from the Description section

  • Include descriptive statistics for the Layout section etc.

    • E.g. word count
    • E.g. mean STRING WC (confidence)
  • When that's done review the comments below for things we may have missed

  • Test using all available versions of ALTO

  • NER annotated ALTO should at least be identifiable

  • Include ALTO version/namespace

  • <LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>

  • Any language infos?

  • Update README that we now support ALTO

Group name columns by role

  • Test grouping name columns by role
  • Consider using a list in the column (e.g. multiple authors, multiple publishers)
  • Review #1

ValueError: Unknown tag "{http://www.loc.gov/mods/v3}partName"

/home/mike/devel/modstool-github/qurator/modstool/modstool.py:504: UserWarning: Exception in test-data/PPN1727545451.xml:
Unknown tag "{http://www.loc.gov/mods/v3}partName"
  warnings.warn('Exception in {}:\n{}'.format(mets_file, e))
Traceback (most recent call last):
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 488, in process
    d = flatten(mods_to_dict(mods, raise_errors=True))
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 293, in mods_to_dict
    .is_singleton().has_no_attributes().descend(raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 69, in descend
    return _to_dict(self.is_singleton().group[0], raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 201, in _to_dict
    return mods_to_dict(root, raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 377, in mods_to_dict
    raise ValueError('Unknown tag "{}"'.format(tag))
ValueError: Unknown tag "{http://www.loc.gov/mods/v3}partName"

Update docs

  • Mention CSV + Excel
  • Show example data in the README

Missing information from the original METS/MODS

Dissect this comment by @joergleh:

Otherwise: since only the topmost dmdSec was used, fields such as the title of the parent work, alternative titles of individual works, information that this work is part of a multi-volume publication, structural information (e.g. about the extent of the volume or about where the table of contents can be found, if there is one), about the owner of the physical book, the digitization project, or some specific subject indexing may not be included in the table.

  • Since only the topmost dmdSec was used,
  • title of the parent work
  • alternative titles of individual works
  • information that this work is part of a multi-volume publication,
  • structural information (e.g.
  • about the extent of the volume or
  • about where the table of contents can be found, if there is one),
  • about the owner of the physical book,
  • the digitization project or
  • some specific subject indexing may not be included in the table.

Structure information

@labusch had questions regarding structure information (from METS metadata) and @joergleh already had suggestions regarding missing information (#23, #24).

While some information is out of scope for this tool (like the location of a title page → should use the original METS for this), there is other information we should include (like the count/presence of a title page).

(Edit: Moved the missing field documentation to #27.)

  • structMap[@TYPE="LOGICAL"]: should count the divs, grouped by their type. They are nested, so this needs to be accounted for.
  • structMap[@TYPE="PHYSICAL"]: Count?
  • Should reformat the files in tests/data to read them more easily
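
Counting the nested divs of the logical structMap by type could look roughly like the sketch below. The sample METS, its TYPE values and the function name count_logical_divs are made up for illustration. Element.iter() walks the whole subtree, which accounts for the nesting.

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS_NS = {"mets": "http://www.loc.gov/METS/"}
METS_DIV = "{http://www.loc.gov/METS/}div"

# Hypothetical structure; real files have more attributes and levels.
METS_SAMPLE = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="title_page"/>
      <mets:div ID="LOG_0002" TYPE="chapter">
        <mets:div ID="LOG_0003" TYPE="illustration"/>
      </mets:div>
      <mets:div ID="LOG_0004" TYPE="chapter"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def count_logical_divs(mets_xml: str) -> Counter:
    """Count divs in the LOGICAL structMap, grouped by their TYPE."""
    root = ET.fromstring(mets_xml)
    counts = Counter()
    for struct_map in root.findall("mets:structMap[@TYPE='LOGICAL']", METS_NS):
        # iter() visits descendants at every depth, so nested divs count too.
        for div in struct_map.iter(METS_DIV):
            counts[div.get("TYPE", "unknown")] += 1
    return counts


div_counts = count_logical_divs(METS_SAMPLE)
print(div_counts)
```

A count such as div_counts["title_page"] would then answer the "count/presence of a title page" question without exposing the page's location.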

Handle multiple mods:role/mods:roleTerm

mods:names now may have more than one role:

    <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">                                                                  
    <mods:name type="personal" valueURI="http://d-nb.info/gnd/117357669">                                                
      <mods:displayForm>Wurm, Mary</mods:displayForm>                                                                    
      <mods:namePart type="given">Mary</mods:namePart>                                                                   
      <mods:nameIdentifier type="gbv-ppn">078789583</mods:nameIdentifier>                                                
      <mods:namePart type="family">Wurm</mods:namePart>                                                                  
      <mods:role>                                                                                                        
        <mods:roleTerm authority="marcrelator" type="code">cmp</mods:roleTerm>                                           
      </mods:role>                                                                                                       
      <mods:role>                                                                                                        
        <mods:roleTerm authority="marcrelator" type="code">aut</mods:roleTerm>                                           
      </mods:role>                                                                                                       
    </mods:name>                                                                                                                                             
    </mods:mods>           

This should be merged into one column, e.g. d['name0_role_roleTerm'] == {'cmp', 'aut'}
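
A minimal sketch of that merge, using ElementTree on a trimmed copy of the example above: the function name name_roles is an assumption, and only the role column is built here.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

MODS_SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:name type="personal" valueURI="http://d-nb.info/gnd/117357669">
    <mods:displayForm>Wurm, Mary</mods:displayForm>
    <mods:namePart type="given">Mary</mods:namePart>
    <mods:namePart type="family">Wurm</mods:namePart>
    <mods:role>
      <mods:roleTerm authority="marcrelator" type="code">cmp</mods:roleTerm>
    </mods:role>
    <mods:role>
      <mods:roleTerm authority="marcrelator" type="code">aut</mods:roleTerm>
    </mods:role>
  </mods:name>
</mods:mods>
"""


def name_roles(mods_xml: str) -> dict:
    """Build one role column per name, holding all role codes as a set."""
    root = ET.fromstring(mods_xml)
    d = {}
    for i, name in enumerate(root.findall("mods:name", NS)):
        # Collect the roleTerm of every mods:role below this name.
        role_terms = name.findall("mods:role/mods:roleTerm", NS)
        d[f"name{i}_role_roleTerm"] = {term.text for term in role_terms}
    return d


d = name_roles(MODS_SAMPLE)
print(d)
```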

Review imports

Since the ALTO functionality was added, the imports are a mess. They should be reviewed.

Include METS metadata

The obvious thing that could be included:

  • Include page count per filegroup

In addition to that:

  • Investigate tags which were ignored for now
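
For the page count per filegroup, one simple approximation is to count the mets:file entries in each mets:fileGrp, as sketched below. Treating the file count as the page count is an assumption, and the sample METS and the function name files_per_filegroup are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical file section with two file groups.
METS_SAMPLE = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="FILE_0001_DEFAULT"/>
      <mets:file ID="FILE_0002_DEFAULT"/>
    </mets:fileGrp>
    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="FILE_0001_FULLTEXT"/>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def files_per_filegroup(mets_xml: str) -> dict:
    """Map each fileGrp's USE attribute to its number of file entries."""
    root = ET.fromstring(mets_xml)
    counts = {}
    for group in root.findall("mets:fileSec/mets:fileGrp", NS):
        use = group.get("USE", "unknown")
        counts[use] = len(group.findall("mets:file", NS))
    return counts


file_counts = files_per_filegroup(METS_SAMPLE)
print(file_counts)
```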

Multiple language tags vs multiple languageTerm tags

Newer input files have two <mods:languageTerm>s in one <mods:language>:

<mods:language>
  <mods:languageTerm authority="iso639-2b" type="code">ger</mods:languageTerm>
  <mods:languageTerm authority="iso639-2b" type="code">eng</mods:languageTerm>
</mods:language>

modstool should not throw an exception here.
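
One way to treat both forms alike is to collect every languageTerm below any mods:language into a set, as in the sketch below. The function name language_codes and the second sample (two separate mods:language elements) are assumptions added for comparison.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Newer form: two languageTerm elements inside one mods:language.
ONE_LANGUAGE_TWO_TERMS = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:language>
    <mods:languageTerm authority="iso639-2b" type="code">ger</mods:languageTerm>
    <mods:languageTerm authority="iso639-2b" type="code">eng</mods:languageTerm>
  </mods:language>
</mods:mods>
"""

# Assumed alternative form: one languageTerm per mods:language.
TWO_LANGUAGES_ONE_TERM = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:language>
    <mods:languageTerm authority="iso639-2b" type="code">ger</mods:languageTerm>
  </mods:language>
  <mods:language>
    <mods:languageTerm authority="iso639-2b" type="code">eng</mods:languageTerm>
  </mods:language>
</mods:mods>
"""


def language_codes(mods_xml: str) -> set:
    """Collect all language codes, regardless of how they are nested."""
    root = ET.fromstring(mods_xml)
    terms = root.findall("mods:language/mods:languageTerm", NS)
    return {term.text for term in terms}


codes_a = language_codes(ONE_LANGUAGE_TWO_TERMS)
codes_b = language_codes(TWO_LANGUAGES_ONE_TERM)
print(codes_a, codes_b)
```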

Group names given in the MODS-file according to given roles to reduce number of columns

Aim: Reduce number of columns for better manageability of data frame
Proposal: Group names given in the MODS-file according to given roles
Explanation: Each "name" entry in the mods-file consists of at least four parts:
"nameXX_namePart.family"
"nameXX_namePart.given"
"nameXX_displayForm"
"nameXX_role_roleTerm"
The number of columns could be reduced significantly if the names were first grouped by role and then concatenated into fewer columns.
Examples:
PPN735425078 contains 76 names with the role "asn" (= associated name); this amounts to 304 columns, but could be reduced to three columns ("nameASN_namePart.family", "nameASN_namePart.given", "nameASN_displayForm"), each containing 76 names in nested form (Mauschwitz; Baudis; Hoberg; ...)
PPN858144891 contains 50 names with the role "oth" (= other); this amounts to 200 columns, but could be reduced to three columns ("nameOTH_namePart.family", "nameOTH_namePart.given", "nameOTH_displayForm")
PPN1774254956 contains 42 names with the role "ctb" (= contributor); this amounts to 168 columns, but could be reduced to three columns ("nameCTB_namePart.family", "nameCTB_namePart.given", "nameCTB_displayForm")
The most frequently used roles are asn (associated name), oth (other), ctb (contributor), dte (dedicatee), fnd (funder), aut (author), isb (issuing body), egr (engraver), hnr (honoree), ill (illustrator), prt (printer).
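
The proposed grouping can be sketched in plain Python as below. The input records are illustrative only (two family names from the first example plus one made-up multi-role entry), and the function name group_names_by_role and the list-valued columns are assumptions about how such a grouping might be represented.

```python
from collections import defaultdict

# Illustrative records; real ones would come from the mods:name elements.
names = [
    {"namePart.family": "Mauschwitz", "role_roleTerm": {"asn"}},
    {"namePart.family": "Baudis", "role_roleTerm": {"asn"}},
    {
        "namePart.family": "Wurm",
        "namePart.given": "Mary",
        "displayForm": "Wurm, Mary",
        "role_roleTerm": {"cmp", "aut"},
    },
]


def group_names_by_role(names: list) -> dict:
    """Turn per-name records into one list-valued column per role and field."""
    columns = defaultdict(list)
    for name in names:
        # A name with several roles is listed under each of them.
        for role in sorted(name.get("role_roleTerm", {"unknown"})):
            for key, value in name.items():
                if key == "role_roleTerm":
                    continue
                columns[f"name{role.upper()}_{key}"].append(value)
    return dict(columns)


grouped = group_names_by_role(names)
print(grouped["nameASN_namePart.family"])
```

One caveat of this layout: if a field is missing for some names, the lists of one role no longer line up index by index, so a real implementation would probably need to pad missing values.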

One or more element has unexpected attributes: mods:recordIdentifier source="dnb-ppn"

ERROR:mods4pandas:Exception in /srv/digisam_mets/PPN1830497871.xml: One or more element has unexpected attributes: <mods:recordIdentifier xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" source="dnb-ppn">1236513355</mods:recordIdentifier>

Edit after feedback from a co-worker:

  • For top-level mods:recordIdentifier we need to check that it's a GBV PPN ("our" PPN)
  • For mods:relatedItem//mods:recordIdentifier we need to distinguish GBV PPNs from DNB PPN (or others). This happens when we have digitized works from other libraries (in this case DNB)
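
The two checks could be sketched as follows. The sample document, the placeholder value PPN0000000001, the function name record_identifiers and the resulting column names are assumptions. Only the dnb-ppn value is taken from the error message above, and its position inside a mods:relatedItem is inferred from the discussion.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}
RECORD_IDENTIFIER = "{http://www.loc.gov/mods/v3}recordIdentifier"

MODS_SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:recordInfo>
    <mods:recordIdentifier source="gbv-ppn">PPN0000000001</mods:recordIdentifier>
  </mods:recordInfo>
  <mods:relatedItem type="original">
    <mods:recordInfo>
      <mods:recordIdentifier source="dnb-ppn">1236513355</mods:recordIdentifier>
    </mods:recordInfo>
  </mods:relatedItem>
</mods:mods>
"""


def record_identifiers(mods_xml: str) -> dict:
    """Check the top-level identifier and split related ones by source."""
    root = ET.fromstring(mods_xml)

    # Top level: only "our" GBV PPN is expected here.
    top = root.findall("mods:recordInfo/mods:recordIdentifier", NS)
    for identifier in top:
        if identifier.get("source") != "gbv-ppn":
            raise ValueError(f"Unexpected source: {identifier.get('source')}")
    d = {"recordInfo_recordIdentifier": [i.text for i in top]}

    # Related items: keep identifiers apart by their source attribute.
    for related in root.findall("mods:relatedItem", NS):
        for identifier in related.iter(RECORD_IDENTIFIER):
            source = identifier.get("source", "unknown")
            key = f"relatedItem_recordIdentifier_{source}"
            d.setdefault(key, []).append(identifier.text)
    return d


identifiers = record_identifiers(MODS_SAMPLE)
print(identifiers)
```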

Documentation of the fields exported

(Moved from #26)

There's also the issue that I usually forget what we already included in our current mods4pandas/current mods_info export. We should have a complete description of the fields. This should ideally be generated automatically, or at least checked for completeness. This would also enable checking what changed between versions of the export. (This is related to #26 because I simply could not tell @labusch what is currently possible with the info we already export.)
