Code Monkey home page Code Monkey logo

high_level_browse's Introduction

HighLevelBrowse

Given an LC Call Number, try to get a set of academic disciplines associated with it

Usage

use 'high_level_browse'

# Pull a new version of the raw data from the UM website,
# transform it into something that can be quickly searched,
# and serialize it to `hlb.json.gz` in the specified directory
hlb = HighLevelBrowse.fetch_and_save(dir: '/tmp')

# ...or just grab an already fetch_and_saved copy
hlb = HighLevelBrowse.load(dir: '/tmp')

# What HLB categories is an LC Call Number in?
hlb.topics 'hc 9112.2'
# => [["Social Sciences", "Economics"],
#     ["Social Sciences", "Social Sciences (General)"]]

# ... or use the #[] shortcut syntax

hlb['NC1766 .U52 D733 2014']
# => [["Arts", "Art History"],
#    ["Arts", "Art and Design"],
#    ["Arts", "Film and Video Studies"]]

# You can also send more than one call number at a time

hlb.topics('E 99 .S2 Y67 1993', 'PS 3565 .R5734 F67 2015')
# => [["Humanities", "American Culture"],
#     ["Humanities", "United States History"],
#     ["Social Sciences", "Native American Studies"],
#     ["Social Sciences", "Archaeology"],
#     ["Humanities", "English Language and Literature"]]

Overview

While we in the library world sometimes use LC Call Numbers (or at least the initial letters) as a proxy for subject matter, the mapping is iffy in many cases and is, in any case, one-dimensional. Many works simply cover multiple subjects or are relevant to sometimes quite different types of academics.

Take, for example, the chemistry of the brain as it applies to mental illness. We have a book, Endorphins : new waves in brain chemistry cataloged as QP552.E53 D381 1984. The QP's map to "Phsiology", which is correct but not complete.

The University of Michigan Library has for years maintained the High Level Browse (HLB), a mapping of call-number ranges to academic subjects. The entire data set is available as 1.8MB XML file for download.

In the HLB, the call number for Endorphins : new waves in brain chemistry maps to the following categories:

  • Science | Physiology
  • Health Sciences | Physiology
  • Health Sciences | Public Health (General)
  • Science | Chemical Engineering
  • Engineering | Chemical Engineering
  • Health Sciences | Biological Chemistry
  • Science | Chemistry | Biological Chemistry

This opens up potentially more accurate categorization of works for, say, faceting in a library catalog.

This gem gives a relatively time-efficient way to get the set of disciplines associated with the given callnumber or callnumbers as part of indexing MARC records into Solr. This mapping is used in many places in the University Library at the University of Michigan, including the Mirlyn Catalog (exposed as "Academic Discipline" in the facets) and ejournals/databases (and even Librarians!) via the Browse page.

This categorization may be useful for clustering/faceting in similar applications at other institutions. Note that the actual creation and maintenance of the call number ranges is done by subject specialist librarians and is out of scope for this gem.

Command line utilities: fetch_new_hlb and hlb

There are also a couple command line applications for managing and querying the data.

  • fetch_new_hlb tries to grab a new copy of the data from the umich website and serialize it to a ~500k file called hlb.json.gz in the given directory. Useful for putting in a cron job to periodically update with fresh data
$> fetch_new_hlb

fetch_new_hlb -- get a new copy of the HLB ready for use by high_level_browse
and stick it in the given directory

   Usage: fetch_new_hlb <dir>
  • hlb takes one or more callnumbers and returns a text display of the categories associated with them. It will stash a copy of the database in Dir.tmpdirif there isn't one there already, and use it on subsequent calls so things aren't so desperately slow. (To find your tmpdir, in your shell run ruby -e 'require "tmpdir"; puts Dir.tmpdir')
$> hlb

hlb -- get high level browse data for an LC call number

Example:
   hlb "qa 11.33 .C4 .H3"
    or do several at once
   hlb "PN 33.4" "AC 1122.3 .C22" ...
 
# Let's try it
$> hlb "qa 11.33 .C4"
   
   Science | Mathematics
   Social Sciences | Education   
   

A warning about (lack of) coverage

Note that not every possible valid callnumber will be necessarily be contained in any dicipline at all. Many books aren't academic in nature, and even then coverage is known to have some holes. Some of the ranges cover essentially a single book in the umich collection. And, of course, not every record is going to have a LC Call Number, so there's that.

This is all to say: this may or may not be useful at your insitution. You'll have to experiment.

To help with this, there's a little script in the bin/ directory called test_marc_file_for_hlb which will, when given a MARC-XML file (ending in .xml) or a MARC-binary file (ending in anything else), output some statistics on what kind of coverage you would get. It might be useful to send a test file through there to see what comes up. It looks in the 050 and the 852[h] to see if anything pops, but you can make it looks elsewhere pretty easily.

It produces something like this:

050 fields
     9790 total
      209 not recognized as LC call numbers
     9337 with at least one HLB category
      244 with NO category

Of 17642 records,
  9677 (54.85%) had a field that often contains an LC Call Number
  9262 (95.71%) of *those* had at least one HLB category

Performance

On my laptop under normal load (e.g., not very scientific at all) I get the following running in a single thread

  ruby 2.3  this gem       ~8500 lookups/second
  ruby 2.4  this gem       ~9100 lookups/second
  jruby 9   this gem     ~20,000 lookups/second
  jruby 9,  old HLB.jar    ~6500 lookups/second
  jruby 1.7 this gem             error, can't do named arguments since it's 1.9 mode
  jruby 1.7 old HLB.jar    ~6700 lookups/second

The old HLB.jar refers to a pure java version that I call from within Jruby as part of my catalog indexing process now. Ithas a different (worse) algorithm, but is of interest because it's what I'm writing this to replace.

Installation

    gem 'high_level_browse'

Contributing

  1. Fork it ( https://github.com/[my-github-username]/high_level_browse/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

high_level_browse's People

Contributors

billdueber avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.