ksclarke / freelib-marc4j

This project forked from marc4j/marc4j

A fork of the MARC4J project

Home Page: http://projects.freelibrary.info/freelib-marc4j/

License: GNU Lesser General Public License v3.0

Java 91.37% HTML 0.26% XSLT 8.37%

freelib-marc4j's People

Contributors

Stargazers

Watchers

Forkers

santteegt cthdev brinxmat zman0900 neilstevens2005 adjam haschart dgermano17

freelib-marc4j's Issues

Add decorators for MarcReader to read files as Java 8 Streams

To enable using MarcReader objects as Iterators and streams, it might be handy to have a utility class along the lines of

https://gist.github.com/adjam/25f615af87ea6e5bfdbcead32c7e313d

This would assume building with and for Java 8+. An alternative would be to retrofit the MarcReader class to implement the Iterator interface, but that's more invasive.

I can generate a pull request from this if you can suggest a package (org.marc4j.util perhaps?)
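
The decorator idea can be sketched roughly as follows. This is not the content of the linked gist, only a minimal illustration of the approach. `RecordSource` is a hypothetical stand-in for a reader that exposes `hasNext()` and `next()` the way MarcReader does, and `ReaderStreams` and `demoSource` are names invented for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical stand-in for a reader exposing hasNext()/next(), as MarcReader does.
interface RecordSource<T> {
    boolean hasNext();
    T next();
}

final class ReaderStreams {
    private ReaderStreams() {}

    // Adapts the reader to java.util.Iterator without modifying the reader itself.
    static <T> Iterator<T> iterator(final RecordSource<T> source) {
        return new Iterator<T>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public T next() { return source.next(); }
        };
    }

    // Wraps the iterator in a lazy, sequential stream of unknown size.
    static <T> Stream<T> stream(RecordSource<T> source) {
        Spliterator<T> split = Spliterators.spliteratorUnknownSize(
                iterator(source), Spliterator.ORDERED | Spliterator.NONNULL);
        return StreamSupport.stream(split, false);
    }

    // Demo source backed by a list, standing in for a file of records.
    static RecordSource<String> demoSource(List<String> items) {
        final Iterator<String> it = items.iterator();
        return new RecordSource<String>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public String next() { return it.next(); }
        };
    }

    public static void main(String[] args) {
        List<String> kept = stream(demoSource(Arrays.asList("rec1", "rec2", "rec3")))
                .filter(r -> !r.equals("rec2"))
                .collect(Collectors.toList());
        System.out.println(kept); // prints [rec1, rec3]
    }
}
```

Keeping the adapter in a separate utility class leaves the reader interface untouched, which is the less invasive of the two options described above.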

MarcStreamReader missing encoding handler other than "utf8","iso8859-1","marc8"

Reported on marc4j/marc4j -- noting here as well.

// Decodes the raw field bytes using the encoding configured on the reader.
private String getDataAsString(byte[] bytes)
{
    String dataElement = null;
    if (encoding.equals("UTF-8") || encoding.equals("UTF8"))
    {
        try {
            dataElement = new String(bytes, "UTF8");
        }
        catch (UnsupportedEncodingException e) {
            throw new MarcException("unsupported encoding", e);
        }
    }
    else if (encoding.equals("MARC-8") || encoding.equals("MARC8"))
    {
        // MARC-8 is not a Java charset, so it goes through the ANSEL converter.
        if (converterAnsel == null) converterAnsel = new AnselToUnicode();
        dataElement = converterAnsel.convert(bytes);
    }
    else if (encoding.equals("ISO-8859-1") || encoding.equals("ISO8859_1") || encoding.equals("ISO_8859_1"))
    {
        try {
            dataElement = new String(bytes, "ISO-8859-1");
        }
        catch (UnsupportedEncodingException e) {
            throw new MarcException("unsupported encoding", e);
        }
    }
    else
    {
        // Any other encoding: let the JVM resolve the charset by name.
        try {
            dataElement = new String(bytes, encoding);
        }
        catch (UnsupportedEncodingException e) {
            throw new MarcException("unsupported encoding", e);
        }
    }
    return dataElement;
}

Bug: MARC21 binary file that contains UTF8 chars (tested on Hebrew chars) parsing ends with error.

From upstream marc4j:

When trying to parse a MARC21 binary file that contains UTF-8 characters (Hebrew characters were used in testing), parsing ends with an error (using MarcStreamReader):

org.marc4j.MarcException: expected field terminator at end of field
at org.marc4j.MarcStreamReader.parseRecord(MarcStreamReader.java:223)
at org.marc4j.MarcStreamReader.next(MarcStreamReader.java:135)
at org.marcRDF.MarcToRDF.main(MarcToRDF.java:64)
expected field terminator at end of field

When I replace the Hebrew characters with English ones, parsing works as expected.
In my opinion, the problem is that UTF-8 characters can be 1 to 4 bytes long, but the parsing module currently assumes that each character is exactly 1 byte long.
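
The byte-versus-character mismatch behind this hypothesis can be illustrated with a short sketch. It does not reproduce the reader's parsing code; `Utf8LengthDemo` is a name invented for the example, and whether this mismatch is the actual cause of the exception above is the reporter's assumption.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {
    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String latin = "abcd";
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // four Hebrew letters
        // ASCII letters take one byte each in UTF-8.
        System.out.println(latin.length() + " chars, " + utf8ByteLength(latin) + " bytes");
        // Hebrew letters take two bytes each in UTF-8.
        System.out.println(hebrew.length() + " chars, " + utf8ByteLength(hebrew) + " bytes");
    }
}
```

Any code that treats a length counted in characters as a length counted in bytes would land on the wrong position for the second string, which would be consistent with a field terminator not being found where it is expected.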

type attribute of record is not parsed

The type attribute of the record tag is not parsed. It is used by the "Deutsche Bibliothek".

Fix new site feed and update with new release information

http://projects.freelibrary.info/freelib-marc4j/feed.xml

Questions:

How to acknowledge people who submit PRs?
Just document release changes in the feed or elsewhere too?

2.6.12-SNAPSHOT: Build Error

I got a build error when executing "mvn install". I had to pass http.agent to the mvn command:

mvn -Dhttp.agent=Maven install

or add the following to the surefire plugin configuration:

<configuration>
   <systemPropertyVariables>
     <http.agent>Maven</http.agent>
   </systemPropertyVariables>
</configuration>

Update MarcXmlParserThread and TransformerHandler to use final input

This would involve removing the setInputSource method, which is an API-breaking change.

Cf. #16 for more details (there was an alternate fix applied there but this one would be better).

Clean up what gets packaged in jar

Code table XML files and generator classes don't need to go into the final jar. Likewise, all the sample code could be excluded as well.

Bug: parsing error by unordered directory entries

Hi,

According to the MARC21 record structure specs, directory entries for data fields do not have to be in the same order as the fields stored in the record. The parseRecord() method in org.marc4j.MarcStreamReader.java assumes that the directory entries are ordered. How can I submit a patch?

Thanks,
Thien
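
One way to avoid depending on directory order is to read each field by seeking to the offset given in its directory entry instead of reading the data area sequentially. The sketch below is not the patch referred to above. `DirectoryDemo` and `readFields` are invented names, and the 12-byte entry layout (3-byte tag, 4-byte length, 5-byte offset) is the standard MARC21 directory layout.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class DirectoryDemo {
    static final int ENTRY_LENGTH = 12; // 3-byte tag + 4-byte length + 5-byte offset

    // Reads each field by seeking to its directory offset, so the physical order
    // of the fields in the data area does not matter.
    // Simplification: a repeated tag overwrites the earlier value in the map.
    static Map<String, String> readFields(String directory, byte[] data) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i + ENTRY_LENGTH <= directory.length(); i += ENTRY_LENGTH) {
            String tag = directory.substring(i, i + 3);
            int length = Integer.parseInt(directory.substring(i + 3, i + 7));
            int offset = Integer.parseInt(directory.substring(i + 7, i + 12));
            // The length includes the one-byte field terminator; drop it.
            fields.put(tag, new String(data, offset, length - 1, StandardCharsets.UTF_8));
        }
        return fields;
    }

    public static void main(String[] args) {
        char ft = '\u001E'; // field terminator
        // The data area stores field 245 first, although the directory lists 001 first.
        byte[] data = ("Title" + ft + "id1" + ft).getBytes(StandardCharsets.UTF_8);
        String directory = "001" + "0004" + "00006" + "245" + "0006" + "00000";
        System.out.println(readFields(directory, data)); // prints {001=id1, 245=Title}
    }
}
```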

Remove SolrMARC classes, they exist elsewhere.

I removed some of the SolrMARC stuff already. The rest can also be removed. They exist in other places.

Can fork move under the marc4j umbrella?

I was just registering an account on Sonatype when I checked and saw that there was a new artifact on maven central (someone filed a bug report, which I fixed- then decided I might as well add a pom.xml

Seems reflow/site:site has broken... time to look into jbake generation for the site docs

As a workaround, I've made the project-info.html the index.html page (because otherwise the index page is just an empty document).

Document handling of bad characters

What does marc4j do when it finds a bad character (there are several readers and each might do something different)? Does it swap in a placeholder, raise an exception, or just put in nothing and keep going?

Find out then document it. Question arising out of a discussion of what ruby-marc does (or will do).

Incorporate ErrorHandler updates from upstream(?)

From upstream: The biggest difference in this version is that the ErrorHandler object is deprecated and no longer used within the project. Instead the MarcRecord maintains the list of detected errors with the record itself. The class has new methods named addError, addErrors, hasErrors and getErrors.

marc4j@506f380
marc4j@6afa8d3
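
As a rough picture of what keeping errors on the record could look like, here is a minimal sketch. `ErrorTrackingRecord` is a hypothetical class. The four method names come from the upstream note, but their signatures and the use of plain strings for errors are assumptions made for this example, not the upstream API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical record type; method names follow the upstream note,
// signatures and the String error type are assumed.
final class ErrorTrackingRecord {
    private final List<String> errors = new ArrayList<>();

    // Records a single problem found while building the record.
    void addError(String message) { errors.add(message); }

    // Records several problems at once.
    void addErrors(List<String> messages) { errors.addAll(messages); }

    boolean hasErrors() { return !errors.isEmpty(); }

    // Read-only view so callers cannot alter the recorded errors.
    List<String> getErrors() { return Collections.unmodifiableList(errors); }

    public static void main(String[] args) {
        ErrorTrackingRecord record = new ErrorTrackingRecord();
        record.addError("missing indicator in field 245");
        if (record.hasErrors()) {
            record.getErrors().forEach(System.out::println);
        }
    }
}
```

With this shape, a reader can keep processing after a recoverable problem and leave the decision about what to do with a flawed record to the caller.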

XML error-catching code causes failures for records from Koha

I cherry-picked some recent changes to MarcXmlHandler.java from this branch where illegal elements, missing tags, missing indicators, or missing subfield codes cause a MarcException to be thrown. These changes cause some records exported from Koha to fail. The following commit to the marc4j branch shows a different way such errors could be handled: instead of throwing an exception, this code creates a MarcError object, adds it to the Record being built, and then continues processing the record.

marc4j@0d718cb

The code currently omits the DataField or ControlField in which the error is found. It might be better to fill in a reasonable "default" value, include the DataField or ControlField, and flag the error.

Bug in class MarcXmlParser

I have been getting the following stack trace when parsing MARC XML data in a web server using Freelib Marc4J (built from the current master branch):

org.marc4j.MarcException: Unable to parse input
        at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:89)
        at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:60)
        at org.marc4j.MarcXmlParserThread.run(MarcXmlParserThread.java:111)
Caused by: java.io.IOException: Stream closed
        at org.apache.catalina.connector.InputBuffer.readByte(InputBuffer.java:315)
        at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:105)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager$RewindableInputStream.read(XMLEntityManager.java:2899)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:674)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:189)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
        at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:87)
        ... 2 more

This only occurs sometimes (maybe 1 in every 5 runs when running in Java 8 on Linux). The problem does not occur when using earlier versions of Java or when using Windows. I have investigated this and discovered it is caused by a bug in the class MarcXmlParserThread (in the freelib-marc4J library): the member variable "InputSource input" is not declared "volatile". This variable needs to be declared volatile because it is always set in a different thread (regardless of whether it is set during construction or by the setter) from the thread that uses it (the thread defined by MarcXmlParserThread).

The simple fix to solve this problem is to add the "volatile" keyword to the "input" member variable. (I have rebuilt the library with this change and can confirm it fixes my problem)

An alternative nicer way to fix this would be to add the "final" keyword to the "input" member variable and remove the "setInputSource" method, forcing the input source to be set in the constructor. However this would be a change to the interface of the library, which is something I assume you are trying to avoid?

The exact same problem exists with the "TransformerHandler th" member which should also be declared final or volatile.

I would be happy to change these member variables to volatile if you can temporarily give me write access to the repository.
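
The "final" variant can be sketched as below. `ParserWorker` is a hypothetical class written for this illustration, not the library's MarcXmlParserThread, and it only counts bytes instead of parsing XML. It shows the input being fixed at construction, with no setter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical worker, not the library's MarcXmlParserThread.
final class ParserWorker extends Thread {
    // final: assigned once in the constructor, and there is no setter to race with run().
    private final InputStream input;
    // volatile: written by this thread, so a reader that does not join() still sees the write.
    private volatile int bytesRead = -1;

    ParserWorker(InputStream input) {
        this.input = input;
    }

    @Override
    public void run() {
        int count = 0;
        try {
            while (input.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new IllegalStateException("Unable to read input", e);
        }
        bytesRead = count;
    }

    int getBytesRead() {
        return bytesRead;
    }

    // Starts a worker on the given bytes, waits for it, and returns the byte count.
    static int consume(byte[] data) {
        ParserWorker worker = new ParserWorker(new ByteArrayInputStream(data));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
        return worker.getBytesRead();
    }

    public static void main(String[] args) {
        System.out.println(consume("<record/>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Removing the setter trades API compatibility for a field that cannot change after the thread has been handed its input, which is the trade-off raised in the report.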

MarcXMLWriter - run ALL Marc Fields through the converter

MarcXMLWriter assumes that all characters are valid in leaders, indicators, subfield names. However, if there's anything invalid (e.g. a 0x14), then the XML output contains strings like "" which breaks XML parsing.

There is a workaround for the issue: https://github.com/bsandiford/marc4j/commit/0982cbc20fa6269de68d2f1ca75bbc50e0d8b7eb

Its author says:

I changed MarcXmlWriter so that ALL characters are run through the converter, so that any that are invalid get converted to "" format.

The data is still invalid from a Marc point of view, but at least the XML parsers won't choke.
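
A minimal sketch of the general technique follows: check every character against the XML 1.0 character ranges and substitute a visible placeholder for anything outside them. This is not the code from the linked commit. `XmlCharSanitizer` is an invented name and the `[U+XXXX]` placeholder format is an assumption chosen for the example.

```java
final class XmlCharSanitizer {
    // True for code points allowed by the XML 1.0 Char production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces disallowed code points with a visible placeholder such as [U+0014].
    // The placeholder format is an assumption made for this sketch.
    static String sanitize(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("[U+%04X]", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("ab\u0014c")); // prints ab[U+0014]c
    }
}
```

Applying the same function to leaders, indicators, and subfield codes as well as to field data would cover the cases listed above. Escaping of markup characters such as `&` and `<` is a separate step that this sketch does not perform.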

New upstream code (Oct. 21- )

There is new upstream code that deals with handling large record lengths that needs to be grokked and merged in. From latest release:

The previous release fixed the handling of a new class of oversized, invalid,
binary MARC records, unfortunately it broke handling of other oversized
records that worked previously. This release restores that correct handling
of those invalid records, adds tests and data coverage, and also fixes a 
related issue in the RawRecord classes handling of oversized binary MARC
records.

As with the previous release, if you never encounter oversized records, this
release is irrelevant, if you do (or even if you might) this release will handle a
few more classes of malformed records as well as can be expected.

Notice further that only the MarcPermissiveStreamReader class handles
these oversized records, the normal MarcStreamReader class will probably
just throw an exception and terminate.

and related previous one:

The primary substantive difference in this version is changes to how
MarcPermissiveStreamReader class handles malformed binary MARC
directories. The directory consists of a 3-byte tag, a 4-byte ASCII encoded
field length, and a 5-byte ASCII encoded field offset. Some system decided
that when it encountered an illegal, oversize record (>99999 bytes long) a
reasonable solution would be to write the field offset as a 6-byte ASCII
encoded number. It also now will deal with records where the fields are
stored in an order different than the order of the directory entries.

If you aren't dealing with such benighted records, this update adds little value
for you.

Is this fork needed now that upstream is publishing to Maven Central?

I created this fork mostly because I wanted a version of MARC4J in Maven Central. I made a few other XML updates along the way. MARC4J upstream has just published an artifact to Maven Central and has merged in some/all(?) of my XML handling updates. Does this fork still need to exist, given all this? The upstream isn't going whole hog on Maven (i.e., they're still using Ant) but does that really matter to me if the end product is a Maven Central artifact?

TODO: Test some of my code against the latest version of MARC4J that's now in Maven Central.

Apply PR on upstream related to subfield retrieval

This weekend, add marc4j#46

Do have a question about getSubfieldsAsString(String sfSpec) though... it's not adding any separator between the concatenated strings? Wouldn't it make more sense to do:

getSubfieldsAsString(String sfSpec, String padding) or something?
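
To make the suggestion concrete, here is a small sketch of a separator-aware variant. `SubfieldJoinDemo` and `Sub` are hypothetical stand-ins and do not use the MARC4J classes. Treating `sfSpec` as a string of subfield codes to include is an assumption about the method's semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SubfieldJoinDemo {
    // Minimal stand-in for a subfield: a one-character code plus its data.
    static final class Sub {
        final char code;
        final String data;

        Sub(char code, String data) {
            this.code = code;
            this.data = data;
        }
    }

    // Joins the data of every subfield whose code appears in sfSpec,
    // placing the separator between consecutive values.
    static String getSubfieldsAsString(List<Sub> subfields, String sfSpec, String separator) {
        List<String> parts = new ArrayList<>();
        for (Sub sf : subfields) {
            if (sfSpec.indexOf(sf.code) >= 0) {
                parts.add(sf.data);
            }
        }
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        List<Sub> field = Arrays.asList(
                new Sub('a', "Title"), new Sub('b', "subtitle"), new Sub('c', "author"));
        System.out.println(getSubfieldsAsString(field, "ab", " ")); // prints Title subtitle
    }
}
```

Passing an empty separator gives plain concatenation, so a two-argument overload could coexist with the existing one-argument method.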