Comments (5)
Hey @kaizimmer1 - what you're describing sounds pretty concerning to me, although it seems like it must be something at the Fedora level and not specific to eulfedora.
The eulfedora checksum script uses the fedora api method compareDatastreamChecksum which asks Fedora to recalculate the checksum for the content and compare it with the stored value. So if I'm understanding you correctly, it sounds like Fedora is calculating a different checksum at ingest/creation than it is calculating later on for validation. I'm not sure what could cause that - although I will say we've had occasional, very sporadic instances where Fedora reports a bad checksum for content that is fine when we check it again later. So maybe there's a bug in Fedora...
Something else you could try is comparing the stored checksum and finding the datastream where it's stored on disk within the fedora storage and calculating the checksum locally, without downloading via wget. I do think this is really venturing more into a fedora issue than an eulfedora one, although based on your initial question it would be a good idea to make some updates to the script documentation.
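A minimal sketch of that local check, assuming you have already located the datastream file inside the Fedora datastream store and copied the stored checksum from the datastream profile. The function names, the path, and the checksum value are placeholders for illustration; none of them come from eulfedora or Fedora.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large datastreams don't need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, stored_checksum):
    """Compare the locally computed MD5 with the value Fedora has stored."""
    return file_md5(path) == stored_checksum.strip().lower()
```

Calling `checksum_matches("/path/to/datastream/file", stored_value)` on the file as it sits on disk takes both the download step and the Fedora API out of the comparison.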
from eulfedora.
Thanks for following up. I've added a few more notes to the script documentation; hopefully that will help anyone else who comes upon the script and runs into similar issues.
@kaizimmer1 thanks so much for opening an issue! I apologize that nobody responded to your email yet, but opening an issue is probably better anyway, since then it will be publicly documented for everyone. Thanks also for the nice comment about eulfedora, always good to hear. :-)
So, if the script is running and not causing any errors with an older version of eulfedora, then I think you are probably ok as far as that goes. You might still want to look at upgrading to a newer version, but I recommend you look at the change log first - we made some changes at some point for Fedora 3.8 compatibility, and I'm not certain that they're backwards compatible. (As a side note, I'm hoping that we'll be able to set up a travis-ci build, which should make it possible to check against different versions and see which ones actually work with current eulfedora, but we're not there yet.)
The output you're getting from the script sounds a little surprising, but there are a couple of things I can think of that could be going on:
- it's possible some of those bad checksums are historical versions of datastreams (I know we had lots of those); those are not possible to fix via the API, and eulfedora can't do anything about them. As it is, the way the script "repairs" checksums is actually by saving a new version, which obviously isn't great in a lot of ways.
- Looking at the code and the script documentation I see that it's not obvious, but from what I remember the repair mode works by triggering a save and forcing Fedora to regenerate the checksum. I believe that requires you to have Fedora auto-checksumming turned on (see the autoChecksum value in your fedora.fcfg).
If you want to investigate more, you might try running the script on a list of individual pids, and maybe turn on debug logging so you can see exactly what api calls eulfedora is making.
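If you drive eulfedora from your own Python code rather than the packaged script, debug logging can be switched on with the standard library. This is only a sketch: the logger names "eulfedora" and "urllib3" are assumptions based on the usual `logging.getLogger(__name__)` convention, and older releases of requests may log under a different name.

```python
import logging

# Show timestamps and logger names so individual API calls are easy to trace.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# ASSUMPTION: eulfedora's modules log under the "eulfedora" namespace.
logging.getLogger("eulfedora").setLevel(logging.DEBUG)

# ASSUMPTION: the HTTP layer logs under "urllib3"; this name varies with
# the installed version of requests.
logging.getLogger("urllib3").setLevel(logging.DEBUG)
```

Keeping the root logger at WARNING limits the extra output to the two libraries of interest.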
However, I would actually recommend that you use a different tool called fcsu to repair the checksums. It acts on the filesystem level, which bypasses the APIs and makes it possible to fix problems in historical data without generating new versions. (We've also used it to fix 0 size datastreams that cropped up in some of our very old content). Here's a sample command provided by one of our systems administrators:
./fcsu modify legacy --filter=set-fixity --pids=`cat /tmp/pids.txt` --algorithm=md5 --force=true
Obviously, you'll want to be careful with this - try it on a small scale and in a test or QA fedora to make sure you're comfortable before running it on production content! See fcsu documentation and fcrepo-store github for more details and code.
Please report back with what you find. Either way, it sounds like we should make some updates to the documentation to that script to make these points clearer.
@rlskoeser, thanks a lot for your comprehensive answer :-)
In my fedora.fcfg, on my development system, I have autoChecksum=true and checksumAlgorithm=MD5. When starting fedora-checksums, I did not use the -a switch to get historical versions.
As you suggested, I tried fcsu to update a single PID (also with algorithm=MD5). The datastream checksum was updated (without generating a new version), but fedora-checksums still complains about a bad checksum for that datastream. Also, when I download the datastream with wget, the md5sum utility reports a different checksum than the one fcsu/Fedora has saved.
To me it looks like Fedora and fedora-checksums/md5sum are using different implementations of MD5. On the other hand, if you use fcsu successfully, it's probably my error :-}
Hi @rlskoeser - you're absolutely right. The errors reported by the fedora-checksums script were all for datastreams stored as inline XML. The Fedora checksum documentation says of inline XML that it's "difficult" to ensure the integrity of these datastreams. So maybe it would be best if the script did not check their checksums by default?
However - thanks a lot for your advice, hints and of course your work on Eulfedora!
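One plausible explanation, offered as an assumption rather than confirmed Fedora behavior: inline XML lives inside the object's FOXML and may be re-serialized when it is read or written, so the bytes you download need not be the bytes that were checksummed. The sketch below uses a made-up XML fragment to show that two serializations of the same content produce different MD5 values.

```python
import hashlib


def md5_hex(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


# Two serializations of the same XML content (hypothetical example data).
as_submitted = b"<dc><title>Example</title></dc>"
as_reserialized = b"<dc>\n  <title>Example</title>\n</dc>\n"

# The documents are equivalent as XML, but MD5 only sees the raw bytes,
# so any whitespace or formatting change yields a different digest.
checksums_agree = md5_hex(as_submitted) == md5_hex(as_reserialized)
print(checksums_agree)
```

If this is what is happening, the mismatch says nothing about the MD5 implementation itself, only about which byte stream each tool is hashing.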