
freelib-marc4j's People

Contributors

atz, billdueber, cthdev, haschart, ksclarke, neilstevens2005, sesuncedu, zman0900

freelib-marc4j's Issues

Add decorators for MarcReader to read files as Java 8 Streams

To enable using MarcReader objects as Iterators and streams, it might be handy to have a utility class along the lines of

https://gist.github.com/adjam/25f615af87ea6e5bfdbcead32c7e313d

This would assume building with and for Java 8+. An alternative would be to retrofit the MarcReader class to implement the Iterator interface, but that's more invasive.

I can generate a pull request from this if you can suggest a package (org.marc4j.util perhaps?)
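
As a rough illustration of the decorator idea (not the code in the gist), the sketch below wraps a reader in an Iterator and a Stream. `SimpleReader` is a hypothetical stand-in that only mirrors the `hasNext()`/`next()` shape of MarcReader, and `ReaderStreams` is an assumed name, not an existing marc4j class.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical stand-in for a reader with the hasNext()/next() shape.
interface SimpleReader<T> {
    boolean hasNext();
    T next();
}

final class ReaderStreams {
    private ReaderStreams() {}

    // Adapt the reader to java.util.Iterator without touching the reader itself.
    static <T> Iterator<T> iterator(final SimpleReader<T> reader) {
        return new Iterator<T>() {
            @Override public boolean hasNext() { return reader.hasNext(); }
            @Override public T next() {
                if (!reader.hasNext()) throw new NoSuchElementException();
                return reader.next();
            }
        };
    }

    // Expose the reader as a sequential, ordered stream of unknown size.
    static <T> Stream<T> stream(SimpleReader<T> reader) {
        Spliterator<T> split = Spliterators.spliteratorUnknownSize(
                iterator(reader), Spliterator.ORDERED | Spliterator.NONNULL);
        return StreamSupport.stream(split, false);
    }

    // Demo helper: a reader backed by a fixed list.
    static <T> SimpleReader<T> readerOf(List<T> items) {
        final Iterator<T> it = items.iterator();
        return new SimpleReader<T>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public T next() { return it.next(); }
        };
    }
}
```

With a real reader, the same adapter would let callers write something like `ReaderStreams.stream(reader).filter(...).forEach(...)`, and the stream stays lazy because records are pulled one at a time.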

MarcStreamReader missing encoding handler other than "utf8","iso8859-1","marc8"

Reported on marc4j/marc4j -- noting here as well.

private String getDataAsString(byte[] bytes) 
{
    String dataElement = null;
    if (encoding.equals("UTF-8") || encoding.equals("UTF8"))
    {
        try {
            dataElement = new String(bytes, "UTF8");
        } 
        catch (UnsupportedEncodingException e) {
            throw new MarcException("unsupported encoding", e);
        }
    }
    else if (encoding.equals("MARC-8") || encoding.equals("MARC8"))
    {
        if (converterAnsel == null) converterAnsel = new AnselToUnicode();
        dataElement = converterAnsel.convert(bytes);
    }
    else if (encoding.equals("ISO-8859-1") || encoding.equals("ISO8859_1") || encoding.equals("ISO_8859_1"))
    {
        try {
            dataElement = new String(bytes, "ISO-8859-1");
        } 
        catch (UnsupportedEncodingException e) {
            throw new MarcException("unsupported encoding", e);
        }
    }
    else
    {   // other encoding types
        try {
            dataElement = new String(bytes, encoding);
        } 
        catch (UnsupportedEncodingException e) {
            throw new MarcException("unsupported encoding", e);
        }           
    }
    return dataElement;
}
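
An alternative way to express the fallback branch is to resolve the name through `java.nio.charset.Charset`, which also accepts JDK aliases such as "UTF8". This is only a sketch: `CharsetFallback` and `decode` are hypothetical names, MARC-8 is not a JDK charset and would still need the AnselToUnicode branch, and the sketch throws IllegalArgumentException where the method above throws MarcException.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

final class CharsetFallback {
    private CharsetFallback() {}

    // Decode bytes with any charset the running JVM knows by name or alias.
    static String decode(byte[] bytes, String encoding) {
        final Charset charset;
        try {
            charset = Charset.forName(encoding);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Stand-in for the MarcException thrown in the original method.
            throw new IllegalArgumentException("unsupported encoding: " + encoding, e);
        }
        return new String(bytes, charset);
    }
}
```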

Bug: MARC21 binary file that contains UTF8 chars (tested on Hebrew chars) parsing ends with error.

From upstream marc4j:

When trying to parse a MARC21 binary file that contains UTF8 chars (Hebrew chars were used in testing), parsing ends with an error (using MarcStreamReader):

org.marc4j.MarcException: expected field terminator at end of field
at org.marc4j.MarcStreamReader.parseRecord(MarcStreamReader.java:223)
at org.marc4j.MarcStreamReader.next(MarcStreamReader.java:135)
at org.marcRDF.MarcToRDF.main(MarcToRDF.java:64)
expected field terminator at end of field

When I replace the Hebrew chars with English ones, parsing works as expected.
In my opinion the problem is that UTF8 chars can be 1 to 4 bytes long, but the parsing module currently assumes that each char is exactly 1 byte long.
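
If that diagnosis is right, the mismatch can be reproduced in isolation. MARC21 directory lengths count bytes, so a field has to be cut out of the record by byte length and decoded afterwards. The sketch below is not the MarcStreamReader code; `FieldSlice` and `readField` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class FieldSlice {
    static final byte FIELD_TERMINATOR = 0x1E;

    private FieldSlice() {}

    // Slice by byte offset and byte length (terminator included), then decode.
    static String readField(byte[] record, int offset, int length) {
        if (record[offset + length - 1] != FIELD_TERMINATOR) {
            throw new IllegalStateException("expected field terminator at end of field");
        }
        return new String(record, offset, length - 1, StandardCharsets.UTF_8);
    }
}
```

A four-letter Hebrew word occupies eight bytes in UTF-8, so a caller that passes the character count plus one instead of the byte count plus one lands in the middle of the data and fails the terminator check, which matches the error message reported above.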

2.6.12-SNAPSHOT: Build Error

I got a build error when executing "mvn install". I have to set http.agent on the mvn command line:

mvn -Dhttp.agent=Maven install

or add the following to surefire plugin:

<configuration>
   <systemPropertyVariables>
     <http.agent>Maven</http.agent>
   </systemPropertyVariables>
</configuration>

Clean up what gets packaged in jar

Code table XML files and generator classes don't need to go into the final jar. Likewise, all the sample code could be excluded as well.

Bug: parsing error by unordered directory entries

Hi,

According to the MARC21 record structure specs, directory entries for data fields do not have to be stored in the same order as the fields in the record. The parseRecord() method in org.marc4j.MarcStreamReader.java assumes that directory entries are ordered. How can I submit a patch?

Thanks,
Thien
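
One order-independent approach is to seek to each field through the offset stored in its directory entry instead of reading fields sequentially. The sketch below assumes the standard layout (24-byte leader with the base address at positions 12 to 16, 12-byte directory entries, 0x1E terminators); `DirectoryWalk` is a hypothetical name and this is not the patch proposed for parseRecord().

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class DirectoryWalk {
    static final int LEADER_LEN = 24;
    static final int ENTRY_LEN = 12;       // 3-byte tag + 4-byte length + 5-byte offset
    static final byte FIELD_TERMINATOR = 0x1E;

    private DirectoryWalk() {}

    // Parse an ASCII-encoded decimal number stored in the record.
    static int number(byte[] rec, int from, int len) {
        return Integer.parseInt(new String(rec, from, len, StandardCharsets.US_ASCII));
    }

    // Return "tag:data" strings in directory order, wherever the data is stored.
    static List<String> fields(byte[] rec) {
        int base = number(rec, 12, 5);     // base address of data, from the leader
        List<String> out = new ArrayList<>();
        for (int p = LEADER_LEN; rec[p] != FIELD_TERMINATOR; p += ENTRY_LEN) {
            String tag = new String(rec, p, 3, StandardCharsets.US_ASCII);
            int length = number(rec, p + 3, 4);
            int start = number(rec, p + 7, 5);
            // Use the entry's own offset; do not assume this field follows the previous one.
            String data = new String(rec, base + start, length - 1, StandardCharsets.UTF_8);
            out.add(tag + ":" + data);
        }
        return out;
    }
}
```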

Can fork move under the marc4j umbrella?

I was just registering an account on Sonatype when I checked and saw that there was a new artifact on Maven Central (someone filed a bug report, which I fixed, then decided I might as well add a pom.xml

Document handling of bad characters

What does marc4j do when it finds a bad character (there are several readers and each might do something different)? Does it swap in a placeholder, raise an exception, or just put in nothing and keep going?

Find out then document it. Question arising out of a discussion of what ruby-marc does (or will do).

XML error-catching code causes failures for records from Koha

I cherry-picked some recent changes to MarcXmlHandler.java from this branch, in which illegal elements, missing tags, missing indicators, or missing subfield codes cause a MarcException to be thrown. These changes cause some records exported from Koha to fail. The following commit to the marc4j branch shows a different way such errors could be handled: instead of throwing an exception, this code creates a MarcError object, adds it to the Record being built, and then continues processing the record.

marc4j@0d718cb

The code currently omits the DataField or ControlField in which the error is found. It might be better to fill in a reasonable "default" value, include the DataField or ControlField, and flag the error.
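
The "default value plus flag" idea could look roughly like the sketch below. It does not use the real Record or MarcError classes; `LenientRecord`, its string-based fields, and the "???" placeholder tag are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class LenientRecord {
    final List<String> fields = new ArrayList<>();
    final List<String> errors = new ArrayList<>();

    // Keep the field even when its tag is unusable, and record what went wrong.
    void addDataField(String tag, String data) {
        if (tag == null || tag.length() != 3) {
            errors.add("missing or malformed tag '" + tag + "', substituted ???");
            tag = "???";                   // hypothetical default value
        }
        fields.add(tag + " " + data);
    }

    boolean hasErrors() {
        return !errors.isEmpty();
    }
}
```

A caller can then decide per record whether flagged data is acceptable, rather than losing the whole record to an exception.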

Bug in class MarcXmlParser

I have been getting the following stack trace when parsing MARC XML data in a web server using Freelib Marc4J (built from the current master branch):

org.marc4j.MarcException: Unable to parse input
        at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:89)
        at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:60)
        at org.marc4j.MarcXmlParserThread.run(MarcXmlParserThread.java:111)
Caused by: java.io.IOException: Stream closed
        at org.apache.catalina.connector.InputBuffer.readByte(InputBuffer.java:315)
        at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:105)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager$RewindableInputStream.read(XMLEntityManager.java:2899)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:674)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:189)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
        at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:87)
        ... 2 more

This only occurs sometimes (maybe 1 in every 5 runs when running in Java 8 on Linux). The problem does not occur when using earlier versions of Java or when using Windows. I have investigated this and discovered it is being caused by a bug in the class MarcXmlParserThread (in the freelib-marc4J library). The problem is caused by the member variable "InputSource input" not being declared "volatile". This variable needs to be declared volatile because it is always set in a different thread (regardless of whether it is set during construction or by the setter) from the thread using it (the thread defined by MarcXmlParserThread).

The simple fix to solve this problem is to add the "volatile" keyword to the "input" member variable. (I have rebuilt the library with this change and can confirm it fixes my problem)

A nicer alternative would be to add the "final" keyword to the "input" member variable and remove the "setInputSource" method, forcing the input source to be set in the constructor. However, this would change the interface of the library, which I assume you are trying to avoid.

The exact same problem exists with the "TransformerHandler th" member which should also be declared final or volatile.

I would be happy to change these member variables to volatile if you can temporarily give me write access to the repository.
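
A minimal sketch of the "final" alternative is shown below. `ParserWorker` is a hypothetical class, not MarcXmlParserThread, and a String stands in for the InputSource. Note that under the Java memory model a call to Thread.start() happens-before every action in the started thread, so the main gain of a final field is that it cannot be reassigned after the thread has started.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ParserWorker extends Thread {
    // final: set once in the constructor, so there is no setter to race with run().
    private final String input;
    private final AtomicReference<String> result = new AtomicReference<>();

    ParserWorker(String input) {
        this.input = input;
    }

    @Override
    public void run() {
        // Stand-in for the real parsing work.
        result.set("parsed:" + input);
    }

    // Start a worker, wait for it to finish, and return what it produced.
    static String runAndWait(String input) {
        ParserWorker worker = new ParserWorker(input);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return worker.result.get();
    }
}
```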

MarcXMLWriter - run ALL Marc Fields through the converter

MarcXmlWriter assumes that all characters are valid in leaders, indicators, and subfield names. However, if there's anything invalid (e.g. a 0x14), then the XML output contains strings like "" which break XML parsing.

There is a workaround for the issue: https://github.com/bsandiford/marc4j/commit/0982cbc20fa6269de68d2f1ca75bbc50e0d8b7eb

Its author says:

I changed MarcXmlWriter so that ALL characters are run through the converter, so that any that are invalid get converted to "" format.

The data is still invalid from a Marc point of view, but at least the XML parsers won't choke.
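
The general technique can be sketched as a filter that replaces every character outside the XML 1.0 character range with a visible placeholder. This is not the code from the linked commit: `XmlCharFilter` is a hypothetical name and the "[U+XXXX]" placeholder format is an assumption chosen for the example.

```java
final class XmlCharFilter {
    private XmlCharFilter() {}

    // Character range allowed by the XML 1.0 specification.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replace each invalid code point with a placeholder such as [U+0014].
    static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXml10(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("[U+%04X]", cp));
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Applying such a filter to leaders, indicators, and subfield codes as well as to field data would keep the output well-formed even when the MARC input is not valid.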

New upstream code (Oct. 21- )

There is new upstream code that deals with handling large record lengths that needs to be grokked and merged in. From latest release:

The previous release fixed the handling of a new class of oversized, invalid,
binary MARC records, unfortunately it broke handling of other oversized
records that worked previously. This release restores that correct handling
of those invalid records, adds tests and data coverage, and also fixes a 
related issue in the RawRecord classes handling of oversized binary MARC
records.

As with the previous release, if you never encounter oversized records, this
release is irrelevant, if you do (or even if you might) this release will handle a
few more classes of malformed records as well as can be expected.

Notice further that only the MarcPermissiveStreamReader class handles
these oversized records, the normal MarcStreamReader class will probably
just throw an exception and terminate.

and related previous one:

The primary substantive difference in this version is changes to how
MarcPermissiveStreamReader class handles malformed binary MARC
directories. The directory consists of a 3-byte tag, a 4-byte ASCII encoded
field length, and a 5-byte ASCII encoded field offset. Some system decided
that when it encountered an illegal, oversize record (>99999 bytes long) a
reasonable solution would be to write the field offset as a 6-byte ASCII
encoded number. It also now will deal with records where the fields are
stored in an order different than the order of the directory entries.

If you aren't dealing with such benighted records, this update adds little value
for you.

Is this fork needed now that upstream is publishing to Maven Central?

I created this fork mostly because I wanted a version of MARC4J in Maven Central. I made a few other XML updates along the way. MARC4J upstream has just published an artifact to Maven Central and has merged in some/all(?) of my XML handling updates. Does this fork still need to exist, given all this? The upstream isn't going whole hog on Maven (i.e., they're still using Ant) but does that really matter to me if the end product is a Maven Central artifact?

TODO: Test some of my code against the latest version of MARC4J that's now in Maven Central.

Apply PR on upstream related to subfield retrieval

This weekend, add marc4j#46

Do have a question about getSubfieldsAsString(String sfSpec) though... it's not adding any separator between the concatenated strings? Wouldn't it make more sense to do:

getSubfieldsAsString(String sfSpec, String padding) or something?
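
A rough sketch of the separator variant follows. It is not the upstream method: `SubfieldJoin` is a hypothetical helper, subfields are modelled as {code, data} string pairs, and sfSpec is assumed to be a plain list of subfield codes such as "ab".

```java
import java.util.ArrayList;
import java.util.List;

final class SubfieldJoin {
    private SubfieldJoin() {}

    // Join the data of every subfield whose code appears in sfSpec.
    static String join(List<String[]> subfields, String sfSpec, String separator) {
        List<String> picked = new ArrayList<>();
        for (String[] sf : subfields) {
            if (sfSpec.indexOf(sf[0].charAt(0)) >= 0) {
                picked.add(sf[1]);
            }
        }
        return String.join(separator, picked);
    }
}
```

Passing an empty separator reproduces the current concatenating behaviour, so a one-argument overload could simply delegate to the two-argument version.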
