
taming-the-beast's Introduction

taming-the-beast

This project aims to develop scripts that prepare EAD XML generated by the University of Maryland's Java-based EAD Converter tool for ingest into ArchivesSpace. Scripts may also be needed to clean accession data exported in CSV format from the University of Maryland's Microsoft Access-based archival management database ("The Beast") for ingest into ArchivesSpace.

ArchivesSpace (http://www.archivesspace.org/) is an open-source archives information management system. The UMD Libraries will be using this system beginning in spring 2014. Close to 1000 archival finding aids, encoded using Encoded Archival Description (EAD), will be ingested into ArchivesSpace. In addition, several thousand accession records from Special Collections and University Archives and from Special Collections in Performing Arts will be ingested, providing a more robust networked system for the UMD Libraries.

Although EAD is a standard and the EAD files currently in use by the UMD Libraries (see http://digital.lib.umd.edu/oclc) are valid EAD, they are formatted differently from how they would appear in ArchivesSpace. Special Collections and University Archives will be evaluating what changes need to be made to standard UMD EAD for smooth import into ArchivesSpace.

taming-the-beast's People

Contributors

jennielevineknies, spurioso, wallberg-umd

Forkers

spurioso, dmwright

taming-the-beast's Issues

Sample - work in progress

For everyone watching this project, I've gotten an import into ArchivesSpace to work... sort of.

http://sandbox.archivesspace.org:8081/repositories/2/resources/28 (note that the ArchivesSpace sandbox gets reset periodically, so this link might be short-lived)

Compare to: http://hdl.handle.net/1903.1/2994

It imported once the <extent> tags were added inside the <physdesc> tags. But there are still lots of kinks to be ironed out. Short list of problems:

  • Multiple abstracts that are all identical - because abstracts are put in multiple times for ArchivesUM subject guides. Need to remove extras, put subject categorizations elsewhere.
  • Headings repeated - for example "Biography" is the heading of the section as well as the first line of the text for that section. Need to remove first line of each section in the collection description.
  • Duplication and Copyright Information - can the contact url be an actual link? Look into this.
  • Arrangement - series are listed once as poorly formatted text and then again in a nice list - they only need to be listed once.
  • Related material - refers to subject guides, which obviously aren't here. So take that out.
  • Components - on the series level, each series is listed twice. The second listing is the one we want (containing the folders and items for each series), so we need to eliminate the first listing.
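
Two of the fixes above (the missing <extent> and the repeated abstracts) can be sketched with Python's standard xml.etree.ElementTree. This is a rough sketch, not the project's actual script. It assumes un-namespaced EAD elements, that a <physdesc> holding bare text should have that text wrapped in a single <extent>, and that duplicate abstracts are exact text matches. The function names and the sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET


def wrap_physdesc_in_extent(root):
    """Wrap bare <physdesc> text in an <extent> child (assumed form of the fix)."""
    for physdesc in list(root.iter("physdesc")):
        has_text = (physdesc.text or "").strip()
        if physdesc.find("extent") is None and has_text:
            extent = ET.Element("extent")
            extent.text = physdesc.text.strip()
            physdesc.text = None
            physdesc.insert(0, extent)


def drop_duplicate_abstracts(root):
    """Keep the first <abstract> under each parent and drop identical repeats."""
    for parent in list(root.iter()):
        seen = set()
        for child in list(parent):
            if child.tag != "abstract":
                continue
            # Compare on whitespace-normalized text so line wrapping does not matter.
            text = " ".join("".join(child.itertext()).split())
            if text in seen:
                parent.remove(child)
            else:
                seen.add(text)


# Hypothetical record, not taken from a real UMD finding aid.
SAMPLE = """<ead><archdesc><did>
<physdesc>12.5 linear feet</physdesc>
<abstract>Papers of a hypothetical donor.</abstract>
<abstract>Papers of a hypothetical donor.</abstract>
</did></archdesc></ead>"""

root = ET.fromstring(SAMPLE)
wrap_physdesc_in_extent(root)
drop_duplicate_abstracts(root)
print(ET.tostring(root, encoding="unicode"))
```

The subject categorizations that the extra abstracts carried would still need a new home. This sketch only removes the repeats.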

demonstrating use of pyodbc

See commit dae609c for some sample pyodbc code to connect to the beast.

I couldn't find the most recent version of the db so I just grabbed the first I could find and made a local copy.

Here is the output I get when running the program:

table: archdescdid, rowcount=5735
table: archdescphysloc, rowcount=7412
table: BoxList, rowcount=128482
table: corpname, rowcount=4
table: eadheadercreationOLD, rowcount=281
table: eadheaderpublisher, rowcount=570
table: eadheaderpublisherOLD, rowcount=434
table: eadheaderrevision, rowcount=761
table: ItemList, rowcount=32327
table: resourceguide, rowcount=2437
table: rgnames, rowcount=40
table: seriesdesclist, rowcount=1559
table: seriestest, rowcount=102
table: source, rowcount=4668
table: sourcecopy, rowcount=2488
table: subjects, rowcount=2564
table: subseriesdesclist, rowcount=352
table: cas_coll_umd, rowcount=7630
table: qryURL, rowcount=5735
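
For readers without access to commit dae609c, a listing like the one above could be produced roughly as follows. This is a sketch, not the code from that commit: the connection string, the driver name, and all helper names are assumptions. Row counting is kept independent of pyodbc so it works with any DB-API cursor, and the sqlite3 stand-in at the bottom exists only so the counting logic can be run without Access.

```python
import sqlite3


def open_beast(path):
    """Open a local copy of the Access database.

    Requires pyodbc and the Microsoft Access ODBC driver (both assumed).
    """
    import pyodbc  # imported lazily so the rest of the module runs without it

    conn_str = "DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=" + path
    return pyodbc.connect(conn_str)


def list_table_names(cursor):
    """Return table and view names via pyodbc's Cursor.tables() catalog call."""
    return [row.table_name for row in cursor.tables()
            if row.table_type in ("TABLE", "VIEW")]


def count_rows(cursor, table_names):
    """Return {table_name: row_count} using any DB-API 2.0 cursor."""
    counts = {}
    for name in table_names:
        # Names come from the database catalog, not from user input.
        # Square brackets quote identifiers in Access and are accepted by SQLite.
        cursor.execute(f"SELECT COUNT(*) FROM [{name}]")
        counts[name] = cursor.fetchone()[0]
    return counts


def print_report(counts):
    """Print one line per table in the same shape as the listing above."""
    for name, rowcount in counts.items():
        print(f"table: {name}, rowcount={rowcount}")


# Stand-in demo with an in-memory SQLite database instead of the beast.
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("CREATE TABLE corpname (id INTEGER)")
demo_cursor.executemany("INSERT INTO corpname VALUES (?)", [(1,), (2,), (3,), (4,)])
demo_cursor.execute("CREATE TABLE rgnames (id INTEGER)")
demo_counts = count_rows(demo_cursor, ["corpname", "rgnames"])
print_report(demo_counts)
```

Against the real database, the call would be along the lines of opening a cursor from open_beast() with the path to the local copy, then passing list_table_names(cursor) to count_rows(cursor, ...).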

move contents of pseudocode.md to Issues

@alifabeta, I recommend moving the contents of pseudocode.md to GitHub issues, one issue for each problem listed. This will allow each problem to be tracked more easily:

  • document each problem
  • document implementation decision
  • record history of discussion in comments
  • match the issue to the commit(s) which resolve it
  • assign issues to different people
  • close when the issue is complete
  • assign due dates

It will then be easier to track what has been done, what still needs to be done, and by whom.

What do you think?
