
dublin-core-text-parser's Introduction

Purpose

Short: Convert dublin core metadata stored in text files into other machine-readable formats for use by other software.

Long: Assist in cataloguing batches of similar or series-based items from a collection by:

  • decreasing the complexity of logging each individual item/issue
  • minimizing repetitive typing and template editing
  • combining the information that is shared across items in a collection in one place

Usage

Process Instructions

  1. Edit Config File: If desired, edit the configuration file to customize the format of the header.
  2. Edit Shared File: Edit the shared file to include any and all metadata applicable to all of the items in the given batch (e.g. language, publisher, etc.).
  3. Create Text File(s): Create a text file of the basic metadata for each item in the collection.
  4. Run Script: Run the script to create, in that directory, the desired output(s) encoded with the dublin-core metadata you logged in the text files.
  5. Check and Utilize Output: Ensure that everything has been placed in the appropriate field by checking a few individual item representations.
  6. Clean up or Reference Text Files: After completion of the above tasks, the text files can be discarded as irrelevant, or kept as a quick reference to the metadata alongside wherever you're storing the files themselves.

Output Types

Flag  .ext   Description
C     .csv   output originally intended for use with DSpace-Labs/SAFBuilder
X,x   .xml   [one or many] commonly used in SOAP APIs
J,j   .json  [one or many] commonly used by REST APIs
M     .mrk   MARC format, which will likely need to be compiled into .mrc
...   ...    feel free to fork and create more output types or suggest different uses

Raw Help Output

Use -h at any time to get (something like) the following:

usage: dublin-core-text-parser

A cataloguing tool for converting specially formatted text files
containing dublin core metadata into various formats

 -c,--config <arg>   Reference to a file containing alternative header
                     arrangements
 -C,--csv            Create a single .csv  file containing metadata of
                     each item
 -h,--help           Display the help information
 -J,--json           Create a single .json file containing metadata of
                     each item
 -M,--mrk            Create a single .mrk  file containing metadata of
                     each item
 -o,--output <arg>   Name the output file
 -s,--shared <arg>   file location of the shared.csv file containing the
                     shared fields
 -X,--xml            Create a single .xml  file containing metadata of
                     each item

dublin-core-text-parser's People

Contributors

atla5


dublin-core-text-parser's Issues

Convert Drive documentation to MD and add to wiki

Once finished with the documentation for the internal use of the tool, convert the paragraphs to markdown and add it to the wiki.

  • creating text files
  • previous methods/background
  • benefits
  • drawbacks
  • configuration
  • purpose/rationale
  • export

Expand options with dublin-core official categories

Add in some more of the dublin-core elements with their associated refinements/qualifiers to allow for more specific and expansive cataloguing control without having to go into the code.

Maintain the hierarchy for advanced readability, and to allow the user to forego specificity and sidestep the learning curve if desired.

Rationale:
  • the config/shared files are the part of using the program that was always meant/allowed to be more technical; expanding them doesn't compromise usability or learnability, especially given the hierarchy
  • it will help to extend the quality of metadata attainable by the program

Add DOS support instead of just bash scripts

The scripts I want to write for running the command with all of its tags, I'm going to write in bash first.

Making an equivalent in DOS would be nice for Windows users.

Maybe someone else will do it when the time comes.

NOTE: the bash scripts do not yet exist at time of writing.

deliver: bugfix and prepare for delivery/usage

Units of work:

  • mvn: reconfigure src, test, res, output directories
  • exec: create an executable/artifact build process
  • cmd: enable the running of the program out of the box from the command line
  • brew: enable brew install dublin-core-text-parser by creating a Formula for this in the lib-re/homebrews repo

Refactor Export to reduce weight and duplication

The way it's currently built, each ConcreteExporter will run its own copy of the AbstractExporter.

This is inefficient and should be swapped out for a variant of the Strategy pattern in which a single Exporter calls the multiple sub-functions independently.

JSON export

Add per-item and combined JSON export triggered by -j and -J tags.

Ensure it works with non-text types

Basic execution should not be tied to textual resources. It should be decoupled (made agnostic, with any accidental shortcuts or assumptions removed) and the dc support expanded to larger vocabularies and options.

Add Name Recognition/Matching

A drawback of using this system as opposed to others, as it stands now, is that there is no existing name recognition or system for flagging typos introduced by the OCR or by those who originally input the names.

  • In what will be the 0.1.0 release of the program, the same person with the same exact name may be logged multiple times depending on the parsing preferences set for specificity. This is buggy and should be fixed for other releases. (check for previous occurrences on the item.addContributor() call)
  • Adding a system whereby each new name is added to some sort of hash table, and a fuzzy match algorithm suggests what might be a typo in the file before doing the actual export, might help prevent a lot of errors.
  • A deeper implementation could attempt to create a unique author identifier resistant to shortened or nicknames (Tom -> Thomas), middle names (George Bush + George W. Bush), etc., to help sync up otherwise disparate entries.
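The fuzzy matching step above could be as simple as an edit-distance check run before export. A minimal sketch, assuming a plain Levenshtein distance and a threshold of 2; the class name, method names, and threshold are all assumptions, not the project's code:

```java
// Hypothetical fuzzy matcher for flagging near-duplicate contributor names.
class NameMatcher {
    // Classic dynamic-programming Levenshtein edit distance.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Flag likely typos: distinct names within a small edit distance.
    public static boolean probableTypo(String a, String b) {
        return !a.equals(b) && distance(a, b) <= 2;
    }
}
```

A hash table of already-seen names, checked through probableTypo on each item.addContributor() call, would cover the "suggest before export" behavior described above.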

Add flag for checking data without export

Want to be able to run the program without creating any export to:

  • find any duplicate names (Douglas R. Moody <=> Doug Moody <=> Douglas Moody)
  • find any spelling errors (Doug Moody <=> Dog Moody)
  • preempt any parsing errors (3D1T0R => EDITOR)
  • display classifications (EDITOR => dc.contributor.ManagingEditor)

This mode would be run before exporting to:

  • tweak or test any unique preferences/classifications/configuration
  • check the health/quality of the files
  • repair any ambiguities and aid professionalism, in order to increase the quality and specificity of the final exported product

Unsure how best to display all of those; probably an output .txt file.

Consider Amazon Ion Support

  • add type protection to dates, ints, and other relevant fields as per fitting the standard
  • add flag or warning and print statement for those items that cannot meet the strongly-typed standards given what it's gotten from the parsed input
  • add Amazon Ion per-item and combined support triggered by -a and -A tags. ('i' might be taken by 'interactive' somewhere down the line)
  • validate that this meets the Amazon Ion standards

Allow support for non-names as contributors

Items beginning with (or containing?) an asterisk (e.g. "*The Library Consortium Unlimited") should not be parsed as a name; instead, the whole line should be sent in as a contributor.
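A minimal sketch of that rule; the class and method names are assumptions for illustration, not the project's actual API:

```java
// Hypothetical handling of asterisk-marked lines: treat them as literal
// corporate contributors rather than parseable personal names.
class ContributorParser {
    // True when the raw line should bypass name parsing entirely.
    public static boolean isCorporate(String line) {
        return line.trim().startsWith("*");
    }

    // The contributor value to store: the whole line minus the marker.
    public static String contributorValue(String line) {
        String trimmed = line.trim();
        return isCorporate(trimmed) ? trimmed.substring(1).trim() : trimmed;
    }
}
```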

XML export support

  • Add per-item and combined export support for XML triggered by the -x and -X tags
  • Validate this xml according to discovered standards

refactor export

Decouple the export functions within the item class and move them into separate export objects extending a common interface. This should declutter an important class and make export functionality much easier to extend.
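The proposed decoupling might look like this. The interface and class names are assumptions, and the serialization shown is deliberately naive (no escaping), just to show the shape:

```java
import java.util.List;

// A common interface each format implements, so Item no longer owns export logic.
interface Exporter {
    String export(List<String> itemFields); // produce the serialized output
}

class CsvExporter implements Exporter {
    public String export(List<String> itemFields) {
        return String.join(",", itemFields); // naive: no quoting/escaping
    }
}

class JsonExporter implements Exporter {
    public String export(List<String> itemFields) {
        return "[\"" + String.join("\",\"", itemFields) + "\"]"; // naive: no escaping
    }
}
```

New formats (MARC, Ion, BibJSON) would then just be additional implementations registered against their flags.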

Basic bash commands

should be able to do the following and get information back on validity and interpretation:

  • set config.txt
  • set shared.csv
  • help
  • dc-readout (?)
  • run the program

You should be able to do the following with the Java command-line interface.

  • set list of files (e.g. name_of_collection-*.txt)
  • set output using the traditional > and >>

note: right now I don't care about DOS. Windows is getting bash soon anyway. ...I'll add an issue to 1.0.0 to include DOS support...

Find some way to Auto-Populate Articles

Article titles are made to be easy to spot with the eye, which makes their text larger and bolder in newspaper-esque publications.

Given a quality OCR, it may be possible to visually scan a PDF looking for article titles.

Refactor Element creation (factory and helper)

all Element subclasses currently contain nearly identical factory functions...

public static [Element] create[Element](String qualifierText, String value) { 
  ...
}

and different implementations of the same helper

public static String determineQualifier(String str){
  if(...){ return ...; }
  else if(...){ return ...; }
  else{ return ""; }
}

I had originally planned to put both of these into the Element superclass, make the former generic and inherited, and stub out the latter as an abstract function. I can't do this because they're static, but they need to be in order to be used to create the objects.

Finding no direct solution to this problem, having a bit of a time sensitivity, and given that it's not a functional issue at this point (code base is small, it's easy to infer, and unlikely to need revision), it stands as an ugly part of a codebase I'm otherwise pretty proud of.

I'd love some help with a solution to this while I work on other things, else I'll just come back to it.
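One possible direction, offered as a hedged sketch rather than the project's code: since the factories must stay static, the per-subclass qualifier logic could be registered as functions with a single generic factory in the superclass. The registry, type keys, and example rules below are all assumptions for illustration:

```java
import java.util.Map;
import java.util.function.Function;

class Element {
    final String qualifier, value;

    Element(String qualifier, String value) {
        this.qualifier = qualifier;
        this.value = value;
    }

    // Per-subclass qualifier rules registered once, instead of a duplicated
    // static determineQualifier() in every subclass. Rules here are made up.
    static final Map<String, Function<String, String>> QUALIFIER_RULES = Map.of(
        "contributor", s -> s.contains("EDITOR") ? "editor" : "",
        "subject",     s -> s.contains("LCSH")   ? "lcsh"   : ""
    );

    // Single shared factory replacing the near-identical create[Element] statics.
    static Element create(String type, String qualifierText, String value) {
        String q = QUALIFIER_RULES.getOrDefault(type, s -> "").apply(qualifierText);
        return new Element(q, value);
    }
}
```

This trades subclass polymorphism for a lookup table, which sidesteps the "abstract static" limitation at the cost of one central registry to maintain.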

Add Provenance Support

Add support for dc.provenance in "Submitted by ____".

likely just add it as a 'suggested' field in shared.csv

consider adding BibJSON export option

BibJSON looks to be an existing linked data standard that might be worthwhile to port to.

Differentiating DC standard Json and other forms of json might be an issue. For now -b and -B are available.

Links to come

Utilize 'lib-name-parser'

implement the lib-name-parser library and leverage it to provide

  • combination/progressive revelation (edit past records)
  • authority record
  • spell checker.

Add Scripts for common commands

As the number of tags grows and in order to make the use of the tool more approachable to beginners, having some simpler aliases may be helpful.

In particular, this was originally conceived as a midpoint between the text files to work with DSpaceLabs/SAFBuilder. Adding a command to get from A (.txt files) -> C (simple archive format) may really help novice users/student workers.

Systematic Logger

Want to add a quality logging system that outputs to text.

Should differentiate between relevant cataloging information (usable for librarianship) versus system status and health used for debugging and understanding what's going on under the hood.

Implemented as a singleton, with methods and prefixes along these lines:

method  prefix   description
.debug  [DEBUG]  granular description of function
.info   [INFO]   completion of high-level functions (both)
.meta   [META]   metadata and cataloging info
.error  [ERROR]  problems in system execution
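A minimal sketch of the singleton using the prefixes from the table above; buffering to a string is an assumption here, where a real version would flush to a .txt file:

```java
// Hypothetical singleton logger; method names follow the proposed table.
class Log {
    private static final Log INSTANCE = new Log();
    private final StringBuilder buffer = new StringBuilder();

    private Log() {}
    public static Log get() { return INSTANCE; }

    private void write(String prefix, String msg) {
        buffer.append(prefix).append(" ").append(msg).append("\n");
    }

    public void debug(String msg) { write("[DEBUG]", msg); } // granular function detail
    public void info(String msg)  { write("[INFO]",  msg); } // high-level completion
    public void meta(String msg)  { write("[META]",  msg); } // cataloging information
    public void error(String msg) { write("[ERROR]", msg); } // execution problems

    public String dump() { return buffer.toString(); } // later: write out to text
}
```

Splitting .meta from .debug/.info/.error is what keeps librarianship-relevant output separable from system-health output.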

Add Collection Notes - Column information, Acronyms

In addition to creating an authority record (mentioned in #24 and #25), it would be possible and helpful to add:

  • recurring 'columns' from within a collection
  • a key for the commonly occurring acronyms or abbreviations

Compute by looking at the dc.description.tableofcontents fields for each and finding:

  • duplicates or consistencies in article titles that keep coming up issue to issue
  • multiple occurrences of the same acronyms (F.B.I.) or abbreviations (Inc.)
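The duplicate-title pass could be sketched like this; the class and method names are assumptions, and a real version would pull its input from the parsed dc.description.tableofcontents values:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pass over table-of-contents entries: count repeated article
// titles across issues to surface recurring 'columns'.
class CollectionNotes {
    public static Map<String, Integer> recurringTitles(List<String> titles) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : titles) counts.merge(t.trim(), 1, Integer::sum);
        counts.values().removeIf(n -> n < 2); // keep only titles seen in 2+ issues
        return counts;
    }
}
```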

Utilize third party library for spell-checker/autocorrect.

A plan for pre-parsing text files for inconsistencies (#7, #16) and name-matching (#24) is already in the works.

In addition, run each text file through a spell-checker or dictionary to try to spot misspellings and errors.

Set a flag for warnings on particular files and list them at completion (ideally with the list of words/suggestions).

Perhaps allow an option for automating changing them (autocorrect)

configuration support

Should be able to edit config.txt to meaningfully alter the header metadata, given that the text matches the options in 'options.txt'.

Add properties file to allow for settings

name                    type     description
exportFilename          String   name given to all output files (e.g. collection_name.csv, collection_name.json)
createAuthorityRecord   boolean  create a record containing the authority control information
createCollectionRecord  boolean  create a record for the collection metadata
includeTextFile         boolean  include the metadata text files as filenames
...                     ...      ...
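Reading these settings could lean on java.util.Properties. A sketch where the key names mirror the table above and loading from an in-memory string stands in for reading a real properties file:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical settings loader; a real version would read the file from disk.
class Settings {
    public static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) { // cannot occur for an in-memory reader
            throw new RuntimeException(e);
        }
        return p;
    }
}
```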

Add body parsing for subject keywords

TableOfContents and Contributors are examples of elements that tend to be added en masse, and that is the reason I devoted so much of the body to them.

Subject is another one of these that could benefit from bulk processing, and the change may be a large added value/feature, especially in the absence of a quality OCR scan (or bPress database) :p

Questions would be whether or not...

  • to remove it from the "options.txt" table then (prevent adding them inline for consistency).
  • I'd have to change the strategy for determining qualifiers (EDITOR vs ARTIST, KEYWORD vs. LCSH).*
  • I'd have to create a bunch of synonyms for "subject", "keyword", etc. to make it sensible.

*LCSH codes don't have to include lowercase letters and would often match the current criteria for a qualifier switch

ex:

-SUBJECTS-
KEYWORDS/TAGS
dogs
cats
fish
LCSH
QA76.
DDC
LCC
MESH
UDC

Expand vocabulary's flexibility

'TABLE OF CONTENTS' should be considered an acceptable synonym for ARTICLES.

Consider a more robust list for each classification (check against a list instead of checking directly; inherit this functionality where possible).
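A minimal sketch of the list-based check; the class name and the extra 'CONTENTS' synonym are assumptions beyond what the issue specifies:

```java
import java.util.Set;

// Hypothetical synonym check: compare headers against a set rather than one literal.
class Vocabulary {
    private static final Set<String> ARTICLES_SYNONYMS =
        Set.of("ARTICLES", "TABLE OF CONTENTS", "CONTENTS");

    public static boolean isArticlesHeader(String header) {
        return ARTICLES_SYNONYMS.contains(header.trim().toUpperCase());
    }
}
```

Giving each classification its own set like this is what would let the functionality be inherited or shared rather than re-implemented per check.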

Consider hosting as Web Service

It would seem that a version of this software could be hosted online and used as an endpoint, without requiring users to download and run the source themselves.

Set it up so that you can send an HTTP GET request with the contents of one of the text files in the body and have it return the JSON object.

Write a script to automate sending each individual one so that end users don't have to touch the code at all.

JSON, XML would be rather simple to use with Rest or SOAP, and we could make the MARC and CSV files available for download as well.

Add interactive functionality

Planned additions to the functionality and extensibility of the tool, coupled with its use for large collections, make the discovery of small errors difficult and the act of redoing each parse costly and annoying.

Adding an interactive mode that will allow the user to tweak what is placed where and to handle warnings, errors, and name overlap/suggestions before exporting might improve usability and cut down on what is growing into a very large number of tags.

MARC Export

  • Add per-item and combined MARC export support triggered by -m and -M tags
  • make sure it validates and can be compiled in MarcEdit.

realtime: Test, improve, and polish for command line use in the field

  • usability: ensure that everything is looking in the same input directory (the location of which is configurable)

    // dc-text-parser -C [PATH TO INPUT FOLDER] -o [OUTPUT FILENAME]
    $ dc-text-parser -C ~/digitization-projects/EXAMPLE_PROJECT/YYYY/ -o EX_PROJ_YYYY
    ...
    ~/digitization-projects/EX_PROJ_YYYY.csv created successfully!

  • flag records that can't be read effectively
  • document concretely how to run the program in the README

Design Documentation

Existing documentation covers how to use the application, what it helps with, how to customize its execution, etc.

We need some documentation on how the system itself is constructed, beyond the sections necessary to tweak how it runs.

Add some tests to assist with match-checking

Want to make sure that various permutations are accepted across multiple changes and to allow for an easy way to test given strings without looking directly at the 'actual code'.

note: 'actual' meaning operating logic, not the tests.

Add Warning/Flag Support

  • add responsive protections so that items which cannot be marked correctly are still catalogued to some degree (e.g. an error parsing a contributor classification adds the associated names under the blank dc.contributor tag), flag the item, and present a warning with an explanation of what occurred.
  • Items which have some sort of error or reason for questioning their validity in a greater capacity (error parsing at all) should be tagged as problematic, potentially staged for review, and the user notified.

Clean tab characters from input fields

\t characters can skip cells and cause overwritten or incorrectly tagged metadata.

This happens when using the .csv export option (or when copying the regular logging/standard output from the program), where a tab character shifts subsequent values ahead by a column.
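A minimal cleaning step that could run on every field before export; the class and method names, and the choice to replace tabs with a single space, are assumptions:

```java
// Hypothetical field cleaner: strip tabs so a stray \t cannot shift the
// value into the next CSV column.
class FieldCleaner {
    public static String clean(String field) {
        return field.replace('\t', ' ').trim();
    }
}
```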

csv: Add CSV Export

Success Criteria:

  • Support export to CSV ($ dctp -C) as the first step towards use in the SAFBuilder.

Add wiki page for tweaking matching rules

I purposely made it so that the matching rules can be easily changed by getting into the source code.

I should do a quick write-up on how to do this for future reference so no one has to re-engineer it.
