umd-coding-workshop / taming-the-beast

ArchivesSpace Ingest Scripts

Python 100.00%

taming-the-beast's Introduction

taming-the-beast

This project aims to develop scripts that prepare EAD XML generated by the University of Maryland's Java-based EAD Converter tool for ingest into ArchivesSpace. Scripts may also be needed to clean accession data, exported in CSV format from the University of Maryland's Microsoft Access-based archival management database ("The Beast"), for ingest into ArchivesSpace.

ArchivesSpace (http://www.archivesspace.org/) is an open-source archives information management system. The UMD Libraries will be using this system beginning in spring 2014. There are currently close to 1000 archival finding aids, encoded using Encoded Archival Description (EAD), that will be ingested into ArchivesSpace. In addition, several thousand accession records from Special Collections and University Archives and the Special Collections in Performing Arts will be ingested, providing a more robust networked system for the UMD Libraries.

Although EAD is a standard and the EAD files currently in use by the UMD Libraries (see http://digital.lib.umd.edu/oclc) are valid EAD, they are formatted differently from how they would appear in ArchivesSpace. Special Collections and University Archives will be evaluating what changes need to be made to standard UMD EAD for smooth import into ArchivesSpace.

taming-the-beast's People

Contributors

Watchers

Forkers

spurioso dmwright

taming-the-beast's Issues

Sample - work in progress

For everyone watching this project, I've gotten an import into ArchivesSpace to work... sort of.

http://sandbox.archivesspace.org:8081/repositories/2/resources/28 (note that the ArchivesSpace sandbox gets reset periodically, so this link might be short-lived)

Compare to: http://hdl.handle.net/1903.1/2994

It imported once the <extent> tags were added inside the <physdesc> tags. But there are still lots of kinks to be ironed out. Short list of problems:

Multiple abstracts that are all identical - because abstracts are put in multiple times for ArchivesUM subject guides. Need to remove extras, put subject categorizations elsewhere.
Headings repeated - for example "Biography" is the heading of the section as well as the first line of the text for that section. Need to remove first line of each section in the collection description.
Duplication and Copyright Information - can the contact url be an actual link? Look into this.
Arrangement - series are listed in a poorly-formatted text and then in a nice list - only need to be listed once.
Related material - refers to subject guides, which obviously aren't here. So take that out.
Components - on the series level, each series is listed twice. The second listing is the one we want (containing the folders and items for each series), so we need to eliminate the first listing.
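
As one illustration of the cleanup listed above, here is a minimal sketch for the first problem (identical abstracts repeated several times). It assumes the EAD files use no XML namespace and that the duplicates are sibling <abstract> elements with the same text; the function name and the sample XML are made up for the example and are not taken from the project's scripts.

```python
import xml.etree.ElementTree as ET


def drop_duplicate_abstracts(root):
    """Remove <abstract> elements that repeat an earlier sibling's text.

    Returns the number of elements removed.
    """
    removed = 0
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        seen = set()
        for child in list(parent):
            if child.tag != "abstract":
                continue
            # Compare on whitespace-normalized text so line wrapping is ignored.
            text = " ".join("".join(child.itertext()).split())
            if text in seen:
                parent.remove(child)
                removed += 1
            else:
                seen.add(text)
    return removed


# Hypothetical sample, not a real UMD finding aid.
SAMPLE = """<ead><archdesc><did>
<abstract>Papers of a labor leader.</abstract>
<abstract>Papers of a labor leader.</abstract>
<abstract>Papers of a  labor
leader.</abstract>
</did></archdesc></ead>"""

root = ET.fromstring(SAMPLE)
removed = drop_duplicate_abstracts(root)
print(removed, "duplicate abstract(s) removed")
```

This only drops the extras; where the subject categorizations should go instead is still an open question.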

EAD import map

http://archivesspace.org/sites/default/files/EAD-Import-Export-Mapping-20130831.xlsx

This document shows how EAD tags are represented in the ArchivesSpace back end.

Interestingly, there are already "if" and "else" conditions written into the map. For example, the same EAD tag corresponds to the "resource.title" element when it is nested within one set of parent tags, but to "archival_object.title" when it is nested within another.
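
The kind of context-dependent rule described above could be sketched roughly as follows. Only the two field names come from the map as quoted here; the tag names (unittitle, archdesc, c01) are assumptions chosen for illustration, and the rule table is hypothetical rather than copied from the spreadsheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: (tag, nearest enclosing context tag) -> field name.
# The tag names are assumed; only the field names appear in the discussion.
RULES = {
    ("unittitle", "archdesc"): "resource.title",
    ("unittitle", "c01"): "archival_object.title",
}
CONTEXT_TAGS = {context for (_tag, context) in RULES}


def map_fields(root):
    """Return (field, text) pairs, picking the field by the nearest context tag."""
    results = []

    def walk(elem, context):
        if elem.tag in CONTEXT_TAGS:
            context = elem.tag  # entering a new context, e.g. a component
        field = RULES.get((elem.tag, context))
        if field is not None:
            results.append((field, (elem.text or "").strip()))
        for child in elem:
            walk(child, context)

    walk(root, None)
    return results


# Hypothetical sample document.
SAMPLE = (
    "<ead><archdesc><did><unittitle>Collection</unittitle></did>"
    "<dsc><c01><did><unittitle>Series 1</unittitle></did></c01></dsc>"
    "</archdesc></ead>"
)
pairs = map_fields(ET.fromstring(SAMPLE))
for field, text in pairs:
    print(field, "=", text)
```

The real conditions should be read from the mapping spreadsheet before relying on anything like this.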

Error message when importing into ArchivesSpace

When attempting to import into ArchivesSpace, we get the error "Resource/extent: must have at least 1 item" or something like that. Need to try adding an <extent> inside the <physdesc> element around the value. Hopefully that will fix it.

http://www.loc.gov/ead/tglib1998/tlin068.html
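
A rough sketch of that fix, assuming the extent value currently sits as bare text directly inside <physdesc> and that the files use no XML namespace. The function name and the sample value are invented for the example, and whether the importer is satisfied with the result has not been checked here.

```python
import xml.etree.ElementTree as ET


def wrap_physdesc_in_extent(root):
    """Wrap bare text inside each <physdesc> in an <extent> child.

    Elements that already contain an <extent>, or that have no leading
    text, are left alone. Returns the number of elements changed.
    """
    changed = 0
    for physdesc in list(root.iter("physdesc")):
        if physdesc.find("extent") is not None:
            continue
        text = (physdesc.text or "").strip()
        if not text:
            continue
        extent = ET.Element("extent")
        extent.text = text
        physdesc.text = None
        physdesc.insert(0, extent)
        changed += 1
    return changed


# Hypothetical sample value.
SAMPLE = "<archdesc><did><physdesc>12.5 linear feet</physdesc></did></archdesc>"
root = ET.fromstring(SAMPLE)
changed = wrap_physdesc_in_extent(root)
fixed_xml = ET.tostring(root, encoding="unicode")
print(fixed_xml)
```

Running the function a second time changes nothing, so it should be safe to apply to files that have already been fixed.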

demonstrating use of pyodbc

See commit dae609c for some sample pyodbc code to connect to the beast.

I couldn't find the most recent version of the db so I just grabbed the first I could find and made a local copy.

Here is the output I get when running the program:

table: archdescdid, rowcount=5735
table: archdescphysloc, rowcount=7412
table: BoxList, rowcount=128482
table: corpname, rowcount=4
table: eadheadercreationOLD, rowcount=281
table: eadheaderpublisher, rowcount=570
table: eadheaderpublisherOLD, rowcount=434
table: eadheaderrevision, rowcount=761
table: ItemList, rowcount=32327
table: resourceguide, rowcount=2437
table: rgnames, rowcount=40
table: seriesdesclist, rowcount=1559
table: seriestest, rowcount=102
table: source, rowcount=4668
table: sourcecopy, rowcount=2488
table: subjects, rowcount=2564
table: subseriesdesclist, rowcount=352
table: cas_coll_umd, rowcount=7630
table: qryURL, rowcount=5735
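
For readers without access to the commit, a listing like the one above could be produced along these lines. This is a sketch and not the code from commit dae609c: it assumes pyodbc and the Microsoft Access ODBC driver are installed, and the function names and file path are placeholders.

```python
def open_beast(mdb_path):
    """Open the Access file over ODBC (needs pyodbc and the Access driver)."""
    import pyodbc  # imported lazily so the helpers below work without it

    conn_str = (
        r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
        f"DBQ={mdb_path};"
    )
    return pyodbc.connect(conn_str)


def table_names(cursor):
    """List table names using pyodbc's catalog call."""
    return [row.table_name for row in cursor.tables(tableType="TABLE")]


def count_rows(cursor, names):
    """Return (table name, row count) pairs for the given tables."""
    counts = []
    for name in names:
        # Square brackets quote the identifier in Access SQL.
        cursor.execute(f"SELECT COUNT(*) FROM [{name}]")
        counts.append((name, cursor.fetchone()[0]))
    return counts


def format_line(name, rowcount):
    """Format one line in the same style as the listing above."""
    return f"table: {name}, rowcount={rowcount}"


# Example use (placeholder path, requires Windows with the Access driver):
# conn = open_beast(r"C:\path\to\beast.mdb")
# cur = conn.cursor()
# for name, n in count_rows(cur, table_names(cur)):
#     print(format_line(name, n))
```

The counting helper only needs a DB-API cursor, so it can be tried against any local database before pointing it at the beast.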

move contents of pseudocode.md to Issues

@alifabeta, I recommend moving the contents of pseudocode.md to GitHub issues, one issue for each problem listed. This will allow each problem to be tracked more easily:

document each problem
document implementation decision
record history of discussion in comments
match the issue to the commit(s) which resolve it
assign issues to different people
close when the issue is complete
assign due dates

it will all be easier to track what has been done, what still needs to be done, and by whom

what do you think?
