
Comments (5)

rlskoeser commented on August 10, 2024

Hey @kaizimmer1 - what you're describing sounds pretty concerning to me, although it seems like it must be something at the Fedora level and not specific to eulfedora.

The eulfedora checksum script uses the fedora api method compareDatastreamChecksum which asks Fedora to recalculate the checksum for the content and compare it with the stored value. So if I'm understanding you correctly, it sounds like Fedora is calculating a different checksum at ingest/creation than it is calculating later on for validation. I'm not sure what could cause that - although I will say we've had occasional, very sporadic instances where Fedora reports a bad checksum for content that is fine when we check it again later. So maybe there's a bug in Fedora...

Something else you could try is finding where the datastream is stored on disk within the Fedora storage, calculating its checksum locally (without downloading via wget), and comparing that with the stored checksum. I do think this is really venturing more into a Fedora issue than an eulfedora one, although based on your initial question it would be a good idea to make some updates to the script documentation.
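For the local comparison, a minimal sketch along these lines might help. It assumes you have already located the datastream file on disk and copied the stored MD5 value from Fedora; the function names and the example path are hypothetical, not part of eulfedora or Fedora.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks
    so that large datastreams do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, stored_checksum):
    """Compare the locally calculated MD5 with the value Fedora has stored.
    `stored_checksum` is whatever hex string you copied from Fedora."""
    return md5_of_file(path) == stored_checksum.strip().lower()


# Hypothetical usage; the path below is a placeholder, not a real Fedora layout:
# checksum_matches("/path/to/datastream/file", "<stored md5 from Fedora>")
```

If this agrees with the stored value but a downloaded copy does not, the difference is probably in how the content is delivered rather than in what is stored.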

from eulfedora.

rlskoeser commented on August 10, 2024

Thanks for following up. I've added a few more notes to the script documentation; hopefully that will help anyone else who comes upon the script and runs into similar issues.


rlskoeser commented on August 10, 2024

@kaizimmer1 thanks so much for opening an issue! I apologize that nobody responded to your email yet, but opening an issue is probably better anyway, since then it will be publicly documented for everyone. Thanks also for the nice comment about eulfedora, always good to hear. :-)

So, if the script is running and not causing any errors with an older version of eulfedora, then I think you are probably ok as far as that goes. You might still want to look at upgrading to a newer version, but I recommend you look at the change log first - we made some changes at some point for Fedora 3.8 compatibility, and I'm not certain that they're backwards compatible. (As a side note, I'm hoping that we'll be able to set up a travis-ci build, which should make it possible to check against different versions and see which ones actually work with current eulfedora, but we're not there yet.)

The output you're getting from the script sounds a little surprising, but there are a couple of things I can think of that could be going on:

  • it's possible some of those bad checksums are historical versions of datastreams (I know we had lots of those); those are not possible to fix via the API, and eulfedora can't do anything about them. As it is, the way the script "repairs" checksums is actually by saving a new version, which obviously isn't great in a lot of ways.
  • Looking at the code and the script documentation I see that it's not obvious, but from what I remember the repair mode works by triggering a save and forcing Fedora to regenerate the checksum. I believe that requires you to have Fedora auto-checksumming turned on (see the autoChecksum value in your fedora.fcfg)

If you want to investigate more, you might try running the script on a list of individual pids, and maybe turn on debug logging so you can see exactly what api calls eulfedora is making.
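If you end up calling eulfedora from your own Python code rather than through the script, debug logging could be turned on roughly like this. This is only a sketch: the logger name "eulfedora" is an assumption based on the usual convention of naming loggers after the package, so check it against your installed version.

```python
import logging


def enable_debug_logging(logger_name="eulfedora"):
    """Send log records to stderr and set the named logger to DEBUG.

    `logger_name` is an assumption: it presumes the library names its
    loggers after its modules (e.g. "eulfedora.api"), which inherit the
    level set here.
    """
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    return logger


logger = enable_debug_logging()
logger.debug("debug logging enabled")
```

Only the named logger and its children are set to DEBUG, so other libraries stay at their default level.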

However, I would actually recommend that you use a different tool called fcsu to repair the checksums. It acts on the filesystem level, which bypasses the APIs and makes it possible to fix problems in historical data without generating new versions. (We've also used it to fix 0 size datastreams that cropped up in some of our very old content). Here's a sample command provided by one of our systems administrators:

./fcsu modify legacy --filter=set-fixity --pids=`cat /tmp/pids.txt` --algorithm=md5 --force=true

Obviously, you'll want to be careful with this - try it on a small scale and in a test or QA fedora to make sure you're comfortable before running it on production content! See fcsu documentation and fcrepo-store github for more details and code.

Please report back with what you find. Either way, it sounds like we should make some updates to the documentation to that script to make these points clearer.


kaizimmer1 commented on August 10, 2024

@rlskoeser, thanks a lot for your comprehensive answer :-)

In my fedora.fcfg on my development system, I have autoChecksum=true and checksumAlgorithm=MD5. When starting fedora-checksums, I did not use the -a switch to get historical versions.

As you suggested, I tried fcsu to update a single PID (also with algorithm=MD5). The datastream checksum was updated (without generating a new version), but fedora-checksums still complains about a bad checksum for that datastream. Also, when I download the datastream with wget, the md5sum utility reports a different checksum than the one fcsu/Fedora has saved.

To me it looks like Fedora uses a different implementation of MD5 than fedora-checksums/md5sum. On the other hand - if you use fcsu successfully, it's probably my error :-}


kaizimmer1 commented on August 10, 2024

Hi @rlskoeser - you're absolutely right. The errors reported by the fedora-checksums script were all for datastreams stored as inline XML. The Fedora documentation on checksums mentions that it's "difficult" to ensure the integrity of inline XML datastreams. So maybe it would be best if the script did not check their checksums by default?
In any case, thanks a lot for your advice, hints and of course your work on eulfedora!
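To illustrate why inline XML is awkward here, and what skipping it could look like, here is a rough sketch. It is not how the fedora-checksums script works; the list-of-pairs input and the function names are hypothetical. The only Fedora-specific fact it relies on is that Fedora 3.x marks inline XML datastreams with control group "X". MD5 itself is the same everywhere, but the same XML written out with different whitespace is a different byte sequence and so gets a different digest.

```python
import hashlib

INLINE_XML = "X"  # Fedora 3.x control group code for inline XML datastreams


def md5_hex(data):
    """Hex MD5 digest of a bytes value."""
    return hashlib.md5(data).hexdigest()


def datastreams_to_check(datastreams, skip_inline_xml=True):
    """Return the datastream ids worth validating.

    `datastreams` is a hypothetical list of (dsid, control_group) pairs;
    inline XML entries are left out unless skip_inline_xml is False.
    """
    return [
        dsid
        for dsid, control_group in datastreams
        if not (skip_inline_xml and control_group == INLINE_XML)
    ]


# Two serializations of the same XML content, differing only in whitespace.
compact = b"<dc><title>Example</title></dc>"
indented = b"<dc>\n  <title>Example</title>\n</dc>\n"
print(md5_hex(compact) == md5_hex(indented))  # False: the bytes differ

print(datastreams_to_check([("DC", "X"), ("RELS-EXT", "X"), ("MASTER", "M")]))
```

So a checksum mismatch on an inline XML datastream does not necessarily mean the content is damaged; it may only mean the bytes being hashed are not the same serialization that was hashed originally.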

