Code Monkey home page Code Monkey logo

library_identifier_solr_filters's Introduction

library_identifer_solr_filters

Overview

This is a series of simple solr analysis-chain filters useful to those dealing with library identifiers (currently only LC Callnumbers, but more to come).

Getting/generating the .jar file

You can just nab a .jar file from the github releases page. They're labeled with the version of the library and the version of solr they're created against.

You can also use maven. You should be able to build with just

mvn package # .jar file will appear in `target/`

To use different versions of solr and/or the icu4j library, you can define them on the command line (defaults are in the pom.xml file)

mvn package -Dsolr.version=8.6.1 -Dicu.version=66.1

Placing the .jar file

The jar file needs to put somewhere solr's going to pick it up, which is defined in the solrconfig.xml file.

I like to have a lib directory "next to" my conf directory with the solr configuration

mycore
  |- conf
      |- schema.xml
      |- solrconfig.xml
      |- ...
  |- lib
      |- library_identifier_solr_filters-0.1-solr8.8.2.jar

... and then have conf/solrconfig.xml include the line:

<lib dir="${solr.core.config}/lib" regex=".*\.jar"/>

LC Callnumbers

This is a simple/simplistic attempt to take LC callnumbers and turn them into something sortable/searchable. It does the bare minimum to massage the callnumbers before indexing.

In addition to the underlying code to do the conversion, there are two ways to use it in fieldTypes.

LCCallNumberSimpleFilterFactory for prefix queries

An analysis filter that will take a token and perform the callnumber normalization (or at least do its best), suitable for use with the edge n-gram filter for providing It requires that callnumbers be treated as a single token, so should only be used with a keyword tokenizer.

Note that this filter will never be called for range searches; if you want to use ranges see the CallnumberSortableFieldType.

A good fieldType definition for prefix searches is as follows:

<fieldType name="callnumber_prefix_search"  class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="edu.umich.library.lucene.analysis.LCCallNumberSimpleFilterFactory" passThroughOnError="true"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="40" minGramSize="2"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="edu.umich.library.lucene.analysis.LCCallNumberSimpleFilterFactory" passThroughOnError="true"/>
  </analyzer>
</fieldType>

The CallnumberSortableFieldType

CallnumberSortableFieldType is a derivative of solr.String which does the callnumber conversion on the way in (for both stored and indexed values). This not only gives you a sortable value (which the filter does as well), but allows the type to be used correctly with ranges (since Solr doesn't run the analysis chain for range queries).

Because it's implemented as a FieldType, all the normalization works as you'd expect (e.g., callnumber_search:[qa20 to *] will pick up the callnumber "QA 20.2" and not "QA 3.11 .D4").

The FieldType can/should be used for both exact queries and range queries, as well as (pretty) accurate sorting, but can't be used for prefix search since it's not a part of the analysis chain (being based on String and not TextField), hence the filter, above.

<fieldType name="callnumber_sortable" class="edu.umich.library.
library_identifier.schema.CallnumberSortableFieldType" />


<field name="callnumber_search" type="callnumber_sortable"
       multiValued="true"/>

<field name="callnumber_sort" type="callnumber_sortable"
       multiValued="false"/>

The normalization algorithm

Given a callnumber:

QA 123.456 .C5 D6 1990 v.3
 1  2   3  <--    4    -->

We label it as follows:

  1. The initial letters
  2. The digits
  3. An (optional) decimal
  4. Everything else

In particular, there's no attempt to separate out the cutters, enumchron, year, etc. since at my institution there just wasn't any appetite for what little functionality it added compared to the ambiguities/bugs it produced.

The transformation process is, essentially:

  • Lowercase everything, trim/collapse whitespace
  • Remove any space between the initial letters and digits
  • Prepend the digits with its string length (e.g., 44 -> 244, 1234 -> 41234). This makes the number correctly sort "alphabetically" and we don't have to mess around with zero-padding or anything
  • Remove punctuation other than dots that create a decimal (e.g., "v.1" will become "v1", but "no. 123.45" will become "no 123.45").

Invalid callnumbers

Anything that doesn't start with some letters followed by some digits is declared invalid. These values can be either kept or ignored depending on the argument allowInvalid in the solr fieldType (see below).

The invalid callnumber passed through isn't exactly the same as what was passed in -- we still do lowercasing, space collapse/trim, and remove non-decimal-place-looking punctuation.

library_identifier_solr_filters's People

Contributors

billdueber avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.