
Comments (2)

GerHobbelt commented on June 9, 2024

For Optimizers / Developers

So this scenario had no use for #113 at all. But what MIGHT have been done?

Given that "almost 75% of the files scanned have 1 or more duplicates" plus "about 60% of 'em are dupes", those duplicates could have been found in roughly -- <licks thumb and senses the breeze> -- half the time if the check-head and check-tail content-check phases had been eliminated. After all, for this particular set those phases only cut the candidate list from ~3M down to ~2.2M, so that's merely a ~25% reduction, bought at a noticeable additional HDD access cost.

Maybe a handy command-line option to add: skip phases A+B, only do C?
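To make the trade-off concrete, here is a back-of-envelope comparison in Python. The candidate counts are the ones from this run; the per-file cost constants are pure assumptions (relative units, not measurements), so treat the output as a sketch of the reasoning, not a benchmark.

```python
# Rough cost model for skipping rdfind's first-bytes/last-bytes phases (A+B)
# and going straight to full checksumming (C).
# The two counts come from this run; the cost constants are illustrative only.

CANDIDATES_AFTER_SIZE_GROUPING = 3_000_000   # ~3M files with a size collision
CANDIDATES_AFTER_HEAD_TAIL     = 2_200_000   # ~2.2M left after phases A+B

SEEK_COST     = 1.0   # assumed cost per extra random access (head or tail read)
CHECKSUM_COST = 4.0   # assumed cost of a full read + hash of an average candidate

# Current pipeline: two extra passes (head + tail) over all candidates,
# then a full checksum of whatever survives.
with_head_tail = (2 * SEEK_COST * CANDIDATES_AFTER_SIZE_GROUPING
                  + CHECKSUM_COST * CANDIDATES_AFTER_HEAD_TAIL)

# Proposed pipeline: checksum every size-collision candidate directly.
checksum_only = CHECKSUM_COST * CANDIDATES_AFTER_SIZE_GROUPING

print(f"with head/tail phases: {with_head_tail:,.0f} cost units")
print(f"checksum-only        : {checksum_only:,.0f} cost units")
```

Whether phases A+B pay for themselves hinges entirely on the ratio of a short random read to a full checksum pass and on how many candidates they actually weed out; with only ~25% weeded out here, the extra passes are hard to justify on a spinning disk.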

But wait! What's the expected impact of such an optimization when we do not just want a report but wish to do something about it, e.g. replace duplicates by hardlinks -- arguably one of the most costly dedup operations, since it's a create-hardlink + file-delete action wrapped into one, so plenty of filesystem I/O per conversion?

Optimistically, I'd say you'd have to assume a cost of at least one more 'scan time' even for the cheapest action (file-delete?), while the convert-to-hardlink turned out to be very costly here indeed.
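For reference, a single replace-with-hardlink conversion boils down to roughly the following (a minimal Python sketch, not rdfind's actual implementation; the temporary-name scheme is made up for the example):

```python
import os

def replace_with_hardlink(original: str, duplicate: str) -> None:
    """Replace `duplicate` with a hard link to `original`.

    Illustrative sketch only: link to a temporary name first, then
    atomically rename over the duplicate, so a crash never leaves the
    path missing. Both paths must be on the same filesystem.
    """
    tmp = duplicate + ".dedup-tmp"   # hypothetical temp-name scheme
    os.link(original, tmp)           # metadata write: new directory entry
    os.replace(tmp, duplicate)       # atomic rename; drops the old inode reference
```

Even in the happy path that is at least two metadata-modifying filesystem operations per duplicate, which is why the conversion pass can rival the scan itself in wall-clock time on a spinning disk.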

It would, however, be safe to say that reducing the number of rounds through the filesystem and/or the number of file accesses CAN be beneficial; regrettably, OSes do not hand out APIs for optimizing your disk access in terms of head-seek reduction, so the only way up is to get everything onto an SSD if you want to speed this up (seek times there are basically nil; converting to hardlinks, however, still requires some costly SSD page writes/clears).

In the end, is it worth it to deduplicate a large file set like this (millions of files)?

Given that only a small portion of the 41+M files on that HDD got processed in ~4 days, it might be more opportune to invest in yet another HDD instead and postpone any mandatory deduplication to the end-point application only -- I was wondering about this as I have a large test corpus that is expected to have some (south of 10%) duplicates... 🤔 It might be "smart" to only detect & report the duplicates via rdfind -- as that is by a large margin the fastest way to get intel on them -- and then postprocess that results.txt into something that can be parsed and used by the applications involved in processing such data.
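As a starting point for that postprocessing, something like the sketch below could work. It assumes the whitespace-separated column layout named in the header comment rdfind writes at the top of results.txt (duptype, id, depth, size, device, inode, priority, name) and that duplicate entries carry the id of their first occurrence (negated); both assumptions should be checked against your own results.txt before trusting the output.

```python
#!/usr/bin/env python3
"""Group rdfind's results.txt into {original: [duplicate paths]}.

Sketch only -- verify the column layout against the header comment
in your own results.txt before relying on this.
"""
import sys
from collections import defaultdict

def parse_results(path: str) -> dict[str, list[str]]:
    firsts: dict[int, str] = {}                      # id -> original path
    dupes: dict[int, list[str]] = defaultdict(list)  # id -> duplicate paths
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            # Assumed columns: duptype id depth size device inode priority name
            duptype, ident, _depth, _size, _dev, _inode, _prio, name = \
                line.rstrip("\n").split(maxsplit=7)
            if duptype == "DUPTYPE_FIRST_OCCURRENCE":
                firsts[int(ident)] = name
            else:
                dupes[abs(int(ident))].append(name)
    return {firsts[i]: paths for i, paths in dupes.items() if i in firsts}

if __name__ == "__main__":
    groups = parse_results(sys.argv[1] if len(sys.argv) > 1 else "results.txt")
    n_dupes = sum(len(v) for v in groups.values())
    print(f"{len(groups)} originals, {n_dupes} duplicates that could be removed")
```

From there the downstream application can decide per duplicate group what to do, without a second pass over the data by rdfind itself.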

Of course, the thriftiness-lobe in my brain is on the barricades against this next revolting thought, but it may be unwise to bother deduplicating (very) large file sets unless you're at least moderately sure the deduplication will produce significant space-saving gains, say 50% or better. Otherwise you might be better off begging your employer for more funding for your permanent storage addiction.
I had intended rdfind to check and dedup only files 100 MByte or larger by specifying -minsize 100M, but that apparently got parsed as -minsize 100 without an error or warning notice (bug? #134) -- a fact I sneakily hid in the console snapshot shown. 😉


GerHobbelt commented on June 9, 2024

Related: #114

