
mods4pandas's Issues

Handle multiple mods:role/mods:roleTerm

mods:names now may have more than one role:

    <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
      <mods:name type="personal" valueURI="http://d-nb.info/gnd/117357669">
        <mods:displayForm>Wurm, Mary</mods:displayForm>
        <mods:namePart type="given">Mary</mods:namePart>
        <mods:nameIdentifier type="gbv-ppn">078789583</mods:nameIdentifier>
        <mods:namePart type="family">Wurm</mods:namePart>
        <mods:role>
          <mods:roleTerm authority="marcrelator" type="code">cmp</mods:roleTerm>
        </mods:role>
        <mods:role>
          <mods:roleTerm authority="marcrelator" type="code">aut</mods:roleTerm>
        </mods:role>
      </mods:name>
    </mods:mods>

This should be merged into one column, e.g. d['name0_role_roleTerm'] == {'cmp', 'aut'}
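
A minimal sketch of the merge, assuming plain xml.etree.ElementTree rather than the tool's own parsing code. The function name `collect_role_terms` is hypothetical, and the sample is a shortened copy of the XML above:

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Shortened copy of the example above (two mods:role elements in one mods:name).
SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:name type="personal">
    <mods:displayForm>Wurm, Mary</mods:displayForm>
    <mods:role>
      <mods:roleTerm authority="marcrelator" type="code">cmp</mods:roleTerm>
    </mods:role>
    <mods:role>
      <mods:roleTerm authority="marcrelator" type="code">aut</mods:roleTerm>
    </mods:role>
  </mods:name>
</mods:mods>
"""


def collect_role_terms(mods_root):
    """Return one set of role codes per mods:name, keyed by column name.

    Hypothetical helper, not part of mods4pandas.
    """
    d = {}
    for i, name in enumerate(mods_root.findall("mods:name", NS)):
        terms = {
            (term.text or "").strip()
            for term in name.findall("mods:role/mods:roleTerm", NS)
        }
        terms.discard("")  # ignore empty roleTerm elements
        d[f"name{i}_role_roleTerm"] = terms
    return d


d = collect_role_terms(ET.fromstring(SAMPLE))
print(d["name0_role_roleTerm"] == {"cmp", "aut"})
```

A set drops duplicate codes and does not keep their order. If the order of roles matters, a list may be the better column type.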

Missing subject/topic, genre

From the feedback by Maria Federbusch:

PPN813655765
In addition, the subject headings are still missing in this example (MONO), here from the ARMA project:

<mods:subject authority="getty">
<mods:topic valueURI="http://vocab.getty.edu/aat/300411614">reading culture</mods:topic>
<mods:topic valueURI="http://vocab.getty.edu/aat/300020756">Medieval (European)</mods:topic>
</mods:subject>
<mods:subject authority="wikidata">
<mods:topic valueURI="https://www.wikidata.org/wiki/Q107274053">Reading culture (medieval)</mods:topic>
</mods:subject>
<mods:genre valueURI="https://www.wikidata.org/wiki/Q1261026" type="class" authority="wikidata">
<mods:genre>printed matter</mods:genre>
</mods:genre>
  • Review (a) the DFG MODS Anwendungsprofil for mods:genre and (b) more examples of mods:genre. Here the structure clearly differs from the mods:subject examples. Would it make more sense if the nested genre had the valueURI?

Structure information

@labusch had questions regarding structure information (from METS metadata) and @joergleh already had suggestions regarding missing information (#23, #24).

Some information is out of scope for this tool in my view (like the location of a title page → should use the original METS for this), but there is other information we should include (like the count/presence of a title page).

(Edit: Moved the missing field documentation to #27.)

  • structMap[@TYPE="LOGICAL"]: should count the divs, grouped by their type. They are nested, so this needs to be accounted for.
  • structMap[@TYPE="PHYSICAL"]: Count?
  • Should reformat the files in tests/data to read them more easily
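
The counting of nested divs could be sketched as follows, assuming plain xml.etree.ElementTree. The function name `count_divs_by_type` and the sample document are made up for illustration; real METS files carry many more attributes:

```python
from collections import Counter
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Made-up miniature METS document, for illustration only.
SAMPLE = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="monograph">
      <mets:div TYPE="title_page"/>
      <mets:div TYPE="chapter">
        <mets:div TYPE="illustration"/>
      </mets:div>
      <mets:div TYPE="chapter"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page"/>
      <mets:div TYPE="page"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def count_divs_by_type(mets_root, structmap_type):
    """Count all mets:div elements of one structMap, grouped by their TYPE.

    Element.iter() walks all descendants, so nested divs are included.
    """
    counts = Counter()
    path = f"mets:structMap[@TYPE='{structmap_type}']"
    for struct_map in mets_root.findall(path, NS):
        for div in struct_map.iter("{http://www.loc.gov/METS/}div"):
            counts[div.get("TYPE", "")] += 1
    return counts


root = ET.fromstring(SAMPLE)
logical = count_divs_by_type(root, "LOGICAL")
physical = count_divs_by_type(root, "PHYSICAL")
print(logical["chapter"], logical["title_page"], physical["page"])
```

The presence of a title page then reduces to `logical["title_page"] > 0`. The same function gives a simple count for the PHYSICAL structMap.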

Missing information from the original METS/MODS

Dissect this comment by @joergleh:

Otherwise: Since only the topmost dmdSec was used, fields such as the title of the parent work, alternative titles of individual works, information that this work is part of a multi-volume publication, structural information (e.g. about the extent of the volume or about where the table of contents can be found, if there is one), about the owner of the physical book, the digitization project, or some specific subject indexing may not be included in the table.

  • Since only the topmost dmdSec was used,
  • title of the parent work
  • alternative titles of individual works
  • information that this work is part of a multi-volume publication,
  • structural information (e.g.
  • about the extent of the volume or
  • about where the table of contents can be found, if there is one),
  • about the owner of the physical book,
  • the digitization project, or
  • some specific subject indexing may not be included in the table.

Review imports

Since the ALTO functionality was added, the imports are a mess. They should be reviewed.

Group name columns by role

  • Test grouping name columns by role
  • Consider using a list in the column (e.g. multiple authors, multiple publishers)
  • Review #1

One or more element has unexpected attributes: mods:recordIdentifier source="dnb-ppn"

ERROR:mods4pandas:Exception in /srv/digisam_mets/PPN1830497871.xml: One or more element has unexpected attributes: <mods:recordIdentifier xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" source="dnb-ppn">1236513355</mods:recordIdentifier>

Edit after feedback from a co-worker:

  • For top-level mods:recordIdentifier we need to check that it's a GBV PPN ("our" PPN)
  • For mods:relatedItem//mods:recordIdentifier we need to distinguish GBV PPNs from DNB PPN (or others). This happens when we have digitized works from other libraries (in this case DNB)
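
The two checks above could look roughly like this. This is a sketch with plain xml.etree.ElementTree. The function name, the result keys and the sample document are assumptions for illustration; only the source values "gbv-ppn" and "dnb-ppn" come from the reports in this tracker:

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}
OUR_SOURCE = "gbv-ppn"  # "our" PPN

# Made-up sample; "PPNxyz" is a placeholder value.
SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:recordInfo>
    <mods:recordIdentifier source="gbv-ppn">PPNxyz</mods:recordIdentifier>
  </mods:recordInfo>
  <mods:relatedItem type="original">
    <mods:recordInfo>
      <mods:recordIdentifier source="dnb-ppn">1236513355</mods:recordIdentifier>
    </mods:recordInfo>
  </mods:relatedItem>
</mods:mods>
"""


def split_record_identifiers(mods_root):
    """Check the top-level identifier and group related ones by source.

    Hypothetical helper; the result keys are illustrative, not the
    column names mods4pandas actually uses.
    """
    result = {}
    # Top level: must be a GBV PPN.
    for rid in mods_root.findall("mods:recordInfo/mods:recordIdentifier", NS):
        source = rid.get("source")
        if source != OUR_SOURCE:
            raise ValueError(f"Unexpected top-level source: {source!r}")
        result["recordIdentifier"] = rid.text
    # Related items: keep the source in the key, so GBV and DNB PPNs stay apart.
    for rid in mods_root.findall("mods:relatedItem//mods:recordIdentifier", NS):
        key = f"relatedItem_recordIdentifier-{rid.get('source', 'unknown')}"
        result.setdefault(key, []).append(rid.text)
    return result


ids = split_record_identifiers(ET.fromstring(SAMPLE))
print(ids)
```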

Documentation of the fields exported

(Moved from #26)

There's also the issue that I usually forget what we already included in our current mods4pandas/current mods_info export. We should have a complete description of the fields. This should ideally be generated automatically, or at least checked for completeness. This would also enable checking what changed between versions of the export. (This is related to #26 because I simply could not tell @labusch what is currently possible with the info we already export.)

Update docs

  • Mention CSV + Excel
  • Show example data in the README

Integration of ALTO metadata

  • Should this be in here? Or in "codename altotool"?

  • What info would be relevant? What would be metadata, what would be data (count words?)

  • Include metadata from the Description section

  • Include descriptive statistics for the Layout section etc.

    • E.g. word count
    • E.g. mean STRING WC (confidence)
  • When that's done review the comments below for things we may have missed

  • Test using all available versions of ALTO

  • NER annotated ALTO should at least be identifiable

  • Include ALTO version/namespace

  • <LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>

  • Any language information?

  • Update README that we now support ALTO
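
A rough sketch of the descriptive statistics mentioned above (word count, mean String WC) plus the ALTO namespace, assuming plain xml.etree.ElementTree. Elements are matched by local name so the sketch does not depend on one ALTO version. The function name, result keys and sample are made up for illustration:

```python
import statistics
import xml.etree.ElementTree as ET

# Made-up miniature ALTO document, for illustration only.
SAMPLE = """\
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Hello" WC="0.5"/>
            <SP/>
            <String CONTENT="world" WC="1.0"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def alto_stats(root):
    """Return namespace, word count and mean word confidence.

    Hypothetical helper; the keys are not actual alto4pandas column names.
    """
    namespace = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    strings = [
        el for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == "String"
    ]
    confidences = [float(el.get("WC")) for el in strings if el.get("WC") is not None]
    return {
        "namespace": namespace,
        "word_count": len(strings),
        "mean_wc": statistics.mean(confidences) if confidences else None,
    }


stats = alto_stats(ET.fromstring(SAMPLE))
print(stats)
```

Counting String elements is only an approximation of a word count, since hyphenated words may be split across two String elements.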

More than one instance: <mods:shelfLocator>

/home/mike/devel/modstool-github/qurator/modstool/modstool.py:504: UserWarning: Exception in /srv/digisam_mets/PPN1683730747.xml:
More than one instance: <mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-1</mods:shelfLocator>
<mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-2</mods:shelfLocator>
  warnings.warn('Exception in {}:\n{}'.format(mets_file, e))
Traceback (most recent call last):
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 488, in process
    d = flatten(mods_to_dict(mods, raise_errors=True))
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 221, in mods_to_dict
    value['location'] = TagGroup(tag, group) \
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 69, in descend
    return _to_dict(self.is_singleton().group[0], raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 201, in _to_dict
    return mods_to_dict(root, raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 234, in mods_to_dict
    value['shelfLocator'] = TagGroup(tag, group) \
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 36, in is_singleton
    raise ValueError('More than one instance: {}'.format(self))
ValueError: More than one instance: <mods:shelfLocator xmlns:mods="http://www.loc.gov/mods/v3" xmlns:mets="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Wb 1920-1</mods:shelfLocator>

Better name for altotool

Following up on #12, we need a better name for altotool.

  • Rename altotool to alto4pandas
  • Rename modstool to mods4pandas
  • Rename qurator.modstool
  • Rename project to mods4pandas

ValueError: Unknown tag "{http://www.loc.gov/mods/v3}partName"

/home/mike/devel/modstool-github/qurator/modstool/modstool.py:504: UserWarning: Exception in test-data/PPN1727545451.xml:
Unknown tag "{http://www.loc.gov/mods/v3}partName"
  warnings.warn('Exception in {}:\n{}'.format(mets_file, e))
Traceback (most recent call last):
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 488, in process
    d = flatten(mods_to_dict(mods, raise_errors=True))
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 293, in mods_to_dict
    .is_singleton().has_no_attributes().descend(raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 69, in descend
    return _to_dict(self.is_singleton().group[0], raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 201, in _to_dict
    return mods_to_dict(root, raise_errors)
  File "/home/mike/devel/modstool-github/qurator/modstool/modstool.py", line 377, in mods_to_dict
    raise ValueError('Unknown tag "{}"'.format(tag))
ValueError: Unknown tag "{http://www.loc.gov/mods/v3}partName"

Add missing information for "original" PPNs

So far, only the content of <mods:identifier type="PPNanalog"> seems to be included in the output, but not the equivalent information from

<mods:relatedItem type="original">
	<mods:recordInfo>
		<mods:recordIdentifier source="gbv-ppn">PPNxyz</mods:recordIdentifier>
	</mods:recordInfo>
</mods:relatedItem>
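
A small sketch of reading both places side by side, assuming plain xml.etree.ElementTree. The function name and result keys are hypothetical, and "PPNxyz" is the placeholder from above, not a real identifier:

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Made-up sample; "PPNxyz" is a placeholder value.
SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:identifier type="PPNanalog">PPNxyz</mods:identifier>
  <mods:relatedItem type="original">
    <mods:recordInfo>
      <mods:recordIdentifier source="gbv-ppn">PPNxyz</mods:recordIdentifier>
    </mods:recordInfo>
  </mods:relatedItem>
</mods:mods>
"""


def original_ppns(mods_root):
    """Collect the "original" PPN from both locations (hypothetical helper)."""
    analog = [
        e.text for e in mods_root.findall("mods:identifier[@type='PPNanalog']", NS)
    ]
    related = [
        e.text
        for e in mods_root.findall(
            "mods:relatedItem[@type='original']/mods:recordInfo"
            "/mods:recordIdentifier[@source='gbv-ppn']",
            NS,
        )
    ]
    return {"PPNanalog": analog, "relatedItem_original": related}


ppns = original_ppns(ET.fromstring(SAMPLE))
print(ppns)
```

Exporting both values would also make it possible to check whether the two sources agree for a given record.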

Multiple language tags vs multiple languageTerm tags

Newer input files have two <mods:languageTerm>s in one <mods:language>:

<mods:language>
  <mods:languageTerm authority="iso639-2b" type="code">ger</mods:languageTerm>
  <mods:languageTerm authority="iso639-2b" type="code">eng</mods:languageTerm>
</mods:language>

modstool should not throw an exception here.
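
One way to handle both variants (several mods:language elements, or several mods:languageTerm elements in one mods:language) is to collect all terms into a set. A minimal sketch with plain xml.etree.ElementTree; the function name `language_terms` is hypothetical:

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# The example above, wrapped in a mods:mods root element.
SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:language>
    <mods:languageTerm authority="iso639-2b" type="code">ger</mods:languageTerm>
    <mods:languageTerm authority="iso639-2b" type="code">eng</mods:languageTerm>
  </mods:language>
</mods:mods>
"""


def language_terms(mods_root):
    """Return all language codes, regardless of how they are nested."""
    terms = set()
    for term in mods_root.findall("mods:language/mods:languageTerm", NS):
        if term.text and term.text.strip():
            terms.add(term.text.strip())
    return terms


languages = language_terms(ET.fromstring(SAMPLE))
print(languages == {"ger", "eng"})
```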

Group names given in the MODS-file according to given roles to reduce number of columns

Aim: Reduce number of columns for better manageability of data frame
Proposal: Group names given in the MODS-file according to given roles
Explanation: Each "name" entry in the mods-file consists of at least four parts:
"nameXX_namePart.family"
"nameXX_namePart.given"
"nameXX_displayForm"
"nameXX_role_roleTerm"
The number of columns could be reduced significantly if the names were first grouped by role and then concatenated into fewer columns.
Examples:
PPN735425078 contains 76 names with the role "asn" (= associated name); this amounts to 304 columns, but could be reduced to three columns ("nameASN_namePart.family", "nameASN_namePart.given", "nameASN_displayForm"), each containing 76 names in nested form (Mauschwitz; Baudis; Hoberg; ...)
PPN858144891 contains 50 names with the role "oth" (= other); this amounts to 200 columns, but could be reduced to three columns ("nameOTH_namePart.family", "nameOTH_namePart.given", "nameOTH_displayForm")
PPN1774254956 contains 42 names with the role "ctb" (= contributor); this amounts to 168 columns, but could be reduced to three columns ("nameCTB_namePart.family", "nameCTB_namePart.given", "nameCTB_displayForm")
The most frequently used roles are asn (associated name), oth (other), ctb (contributor), dte (dedicatee), fnd (funder), aut (author), isb (issuing body), egr (engraver), hnr (honoree), ill (illustrator), prt (printer).
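
The proposed grouping could be sketched like this, assuming the names have already been parsed into plain dicts. The input format and the function name are assumptions for illustration. "Wurm, Mary" with the roles cmp and aut is taken from another issue in this tracker; "Doe, Jane" is a made-up second name:

```python
from collections import defaultdict


def group_names_by_role(names):
    """Turn a list of name dicts into three list-valued columns per role.

    Hypothetical helper. A name with several roles appears in every
    matching group.
    """
    grouped = defaultdict(list)
    for name in names:
        for role in name["roles"]:
            prefix = f"name{role.upper()}"
            grouped[f"{prefix}_namePart.family"].append(name.get("family"))
            grouped[f"{prefix}_namePart.given"].append(name.get("given"))
            grouped[f"{prefix}_displayForm"].append(name.get("displayForm"))
    return dict(grouped)


names = [
    {"family": "Wurm", "given": "Mary", "displayForm": "Wurm, Mary",
     "roles": ["cmp", "aut"]},
    # Made-up entry for illustration
    {"family": "Doe", "given": "Jane", "displayForm": "Doe, Jane",
     "roles": ["aut"]},
]

columns = group_names_by_role(names)
print(columns["nameAUT_displayForm"])
```

With this layout the number of columns depends on the number of distinct roles, not on the number of names per record.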

MODS "name" changes

The mods:name now has a mods:nameIdentifier:

/home/mike/devel/qurator-mono-repo/modstool/qurator/modstool/modstool.py:428: UserWarning: Exception in /srv/digisam_mets/PPN1678618276.xml:
Unknown tag "{http://www.loc.gov/mods/v3}nameIdentifier"
  warnings.warn('Exception in {}:\n{}'.format(mets_file, e))

Also, the mods:name/mods:displayForm (optional according to DFG MODS-Anwendungsprofil) got dropped in favor of the mandatory name part fields.

Include METS metadata

The obvious thing that could be included:

  • Include page count per filegroup

In addition to that:

  • Investigate tags which were ignored for now
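
The page count per filegroup could be approximated by counting the mets:file elements in each mets:fileGrp. This is a sketch with plain xml.etree.ElementTree, under the assumption that each file group holds one file per page. The function name and the sample document are made up:

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Made-up miniature METS document, for illustration only.
SAMPLE = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="FILE_0001_DEFAULT"/>
      <mets:file ID="FILE_0002_DEFAULT"/>
    </mets:fileGrp>
    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="FILE_0001_FULLTEXT"/>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def file_count_per_filegrp(mets_root):
    """Count mets:file elements per mets:fileGrp, keyed by the USE attribute.

    Assumes one file per page, which may not hold for every file group.
    """
    counts = {}
    for grp in mets_root.findall("mets:fileSec/mets:fileGrp", NS):
        use = grp.get("USE", "")
        counts[use] = counts.get(use, 0) + len(grp.findall("mets:file", NS))
    return counts


file_counts = file_count_per_filegrp(ET.fromstring(SAMPLE))
print(file_counts)
```

A mismatch between the groups, as in the sample, would be a hint that full text is missing for some pages.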
