
rdfind's Introduction

Rdfind - redundant data find

Rdfind is a command line tool that finds duplicate files. It is useful for compressing backup directories or just finding duplicate files. It compares files based on their content, NOT on their file names.

If you find rdfind useful, drop me an email! I love hearing about how people actually use rdfind. In the unlikely case you want to throw money at rdfind, please use goclimate.

Continuous integration status

The following checks run continuously on both the main and devel branches:

  • Static analyzer (codeql)
  • Static analyzer (cppcheck)
  • Builds and tests on Ubuntu 20.04 and 22.04 with the default compiler and settings
  • Compiles on Mac OS X
  • Builds and executes tests on Ubuntu 20.04 with multiple versions of gcc and clang. Runs builds with address/undefined sanitizers and valgrind. Also performs the tests with a binary compiled in 32 bit mode.

Install

Debian/Ubuntu:

# apt install rdfind

Fedora

# dnf install rdfind

Mac

If you are on Mac, you can install through MacPorts (direct link to Portfile). If you want to compile the source yourself, that is fine. Rdfind is written in C++11 and should compile under any *nix.

Windows

Use Cygwin.

Usage

rdfind [options] directory_or_file_1 [directory_or_file_2] [directory_or_file_3] ...

Without options, a results file will be created in the current directory. For full options, see the man page.

Examples

Basic example, taken from a *nix environment: Look for duplicate files in directory /home/pauls/bilder:

$ rdfind /home/pauls/bilder/
Now scanning "/home/pauls/bilder", found 3301 files.
Now have 3301 files in total.
Removed 0 files due to nonunique device and inode.
Now removing files with zero size...removed 3 files
Total size is 2861229059 bytes or 3 Gib
Now sorting on size:removed 3176 files due to unique sizes.122 files left.
Now eliminating candidates based on first bytes:removed 8 files.114 files left.
Now eliminating candidates based on last bytes:removed 12 files.102 files left.
Now eliminating candidates based on md5 checksum:removed 2 files.100 files left.
It seems like you have 100 files that are not unique
Totally, 24 Mib can be reduced.
Now making results file results.txt

It indicates there are 100 files that are not unique. Let us examine them by looking at the newly created results.txt:

$ cat results.txt
# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURENCE 960 3 4872 2056 5948858 1 /home/pauls/bilder/digitalkamera/horisontbild/.xvpics/test 001.jpg.gtmp.jpg
DUPTYPE_WITHIN_SAME_TREE -960 3 4872 2056 5932098 1 /home/pauls/bilder/digitalkamera/horisontbild/.xvpics/test 001.jpg
.
(intermediate rows removed)
.
DUPTYPE_FIRST_OCCURENCE 1042 2 7904558 2056 6209685 1 /home/pauls/bilder/digitalkamera/skridskotur040103/skridskotur040103 014.avi
DUPTYPE_WITHIN_SAME_TREE -1042 3 7904558 2056 327923 1 /home/pauls/bilder/digitalkamera/saknat/skridskotur040103/skridskotur040103 014.avi

Consider the last two rows. It says that the file skridskotur040103 014.avi exists both in /home/pauls/bilder/digitalkamera/skridskotur040103/ and /home/pauls/bilder/digitalkamera/saknat/skridskotur040103/. I can now remove the one I consider a duplicate by hand if I want to.
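
Because the file name is the last column and may itself contain spaces (as in the example above), any post-processing should read the first seven fields and then take the remainder of the line as the name. Below is a small, hypothetical C++ helper (not part of rdfind) sketching that: it prints the paths of all entries that are not the first occurrence, i.e. the candidates for removal.

// Hypothetical helper (not shipped with rdfind): print the paths of entries
// in results.txt that are duplicates of an earlier file.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
  std::ifstream in("results.txt");
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#') {
      continue; // skip the comment header
    }
    std::istringstream fields(line);
    std::string duptype, id, depth, size, device, inode, priority;
    fields >> duptype >> id >> depth >> size >> device >> inode >> priority;
    std::string name;
    std::getline(fields, name);           // the rest of the line is the file name
    if (!name.empty() && name[0] == ' ') {
      name.erase(0, 1);                   // drop the separator before the name
    }
    // Entries that are not the first occurrence are duplicates of an earlier file.
    // Note: the exact duptype spelling may differ between rdfind versions; match
    // against what your own results.txt contains.
    if (duptype != "DUPTYPE_FIRST_OCCURENCE") {
      std::cout << name << '\n';
    }
  }
}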

Algorithm

Rdfind uses the following algorithm. If N is the number of files to search through, the worst-case effort is O(N log N). Because it sorts files on inode numbers before reading from disk, it is quite fast, and it only reads from disk when needed. A simplified code sketch of the elimination steps follows after the list.

  1. Loop over each argument on the command line. Assign each argument a priority number, in increasing order.
  2. For each argument, list the directory contents recursively and add them to the file list. Assign a directory depth number, starting at 0 for every argument.
  3. If an input argument is a file, add it to the file list.
  4. Loop over the list and find out the sizes of all files.
  5. If the flag -removeidentinode true is given: remove items from the list that have already been added, based on the combination of inode and device number. A group of files that are hardlinked to the same file is collapsed to one entry. Also see the comment on hardlinks under "Caveats / Features" below!
  6. Sort files on size. Remove files with unique sizes from the list.
  7. Sort on device and inode (speeds up file reading). Read a few bytes from the beginning of each file (first bytes).
  8. Remove files from the list that have the same size but different first bytes.
  9. Sort on device and inode (speeds up file reading). Read a few bytes from the end of each file (last bytes).
  10. Remove files from the list that have the same size but different last bytes.
  11. Sort on device and inode (speeds up file reading). Perform a checksum calculation for each file.
  12. Only keep files on the list with the same size and checksum. These are the duplicates.
  13. Sort the list on size, priority number, and depth. The first file in every set of duplicates is considered to be the original.
  14. If the flag -makeresultsfile true is given (the default), print the results file.
  15. If the flag -deleteduplicates true is given, delete (unlink) the duplicate files. Exit.
  16. If the flag -makesymlinks true is given, replace the duplicates with symbolic links to the original. Exit.
  17. If the flag -makehardlinks true is given, replace the duplicates with hard links to the original. Exit.
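
To make the elimination steps concrete, here is a small sketch in C++. This is an illustration only, not rdfind's actual code: the Candidate struct and function names are made up, and the sorting on device/inode for fast disk access is left out.

// Simplified illustration of the elimination steps (6-12) above. Reading the
// files and computing the checksums are assumed to have happened already, so
// only the grouping logic is shown.
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Candidate {
  std::string path;
  std::uintmax_t size;
  std::string firstbytes; // a few bytes from the beginning of the file
  std::string lastbytes;  // a few bytes from the end of the file
  std::string checksum;   // checksum of the whole content
};

// Keep only candidates whose key is shared with at least one other candidate.
template <typename Key>
std::vector<Candidate>
eliminateUnique(const std::vector<Candidate>& in,
                const std::function<Key(const Candidate&)>& keyOf)
{
  std::map<Key, std::vector<Candidate>> groups;
  for (const auto& c : in) {
    groups[keyOf(c)].push_back(c);
  }
  std::vector<Candidate> out;
  for (const auto& group : groups) {
    if (group.second.size() > 1) { // a unique key cannot belong to a duplicate
      out.insert(out.end(), group.second.begin(), group.second.end());
    }
  }
  return out;
}

std::vector<Candidate> findDuplicates(std::vector<Candidate> files)
{
  using SizeAnd = std::pair<std::uintmax_t, std::string>;
  // Step 6: drop files with unique sizes.
  files = eliminateUnique<std::uintmax_t>(
    files, [](const Candidate& c) { return c.size; });
  // Steps 7-10: within each size, drop files whose first or last bytes differ.
  files = eliminateUnique<SizeAnd>(
    files, [](const Candidate& c) { return SizeAnd(c.size, c.firstbytes); });
  files = eliminateUnique<SizeAnd>(
    files, [](const Candidate& c) { return SizeAnd(c.size, c.lastbytes); });
  // Steps 11-12: whatever still shares size and checksum is a duplicate.
  files = eliminateUnique<SizeAnd>(
    files, [](const Candidate& c) { return SizeAnd(c.size, c.checksum); });
  return files;
}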

Development

Building

To build this utility, you need nettle (on Debian-based distros: apt install nettle-dev).

Install from source

Here is how to get and install nettle from source. Please check for the current version before copying the instructions below:

wget https://ftp.gnu.org/gnu/nettle/nettle-3.4.1.tar.gz
wget https://ftp.gnu.org/gnu/nettle/nettle-3.4.1.tar.gz.sig
gpg --recv-keys 28C67298                # omit if you do not want to verify
gpg --verify nettle-3.4.1.tar.gz{.sig,} # omit if you do not want to verify
tar -xf nettle-3.4.1.tar.gz
cd nettle-3.4.1
./configure
make
sudo make install

If you install nettle as non-root, you must create a link in the rdfind directory so that rdfind later can do #include "nettle/nettle_header_files.h" correctly. Use, for instance, the command

ln -s nettle-3.4.1 nettle

Quality

The following methods are used to maintain code quality:

  • builds without warnings on gcc and clang, even with all the suggested warnings from cppbestpractices enabled. Pass --enable-warnings to configure to turn them on.
  • builds with the C++ standards 11, 14, 17 and 2a
  • tests are written for newly found bugs, first to prove the bug and then to prove that it is fixed. Older bugs do not all have tests.
  • tests are also run through valgrind
  • tests are run on address sanitizer builds
  • tests are run on undefined sanitizer builds
  • tests are run with debug iterators enabled
  • builds are made in default mode (debug) as well as release, and also with the flags suggested by Debian's hardening helper dpkg-buildflags
  • builds are made with both libstdc++ (gcc) and libc++ (llvm)
  • clang-format is used; run make format to execute it
  • cppcheck has been run manually and relevant issues are fixed
  • disorderfs is used (if available) to verify independence of file system ordering

There is a helper script that does the test build variants, see do_quality_checks.sh in the project root.

Alternatives

There are some interesting alternatives.

A search on "finding duplicate files" will give you lots of matches.

Performance Comparison

Here is a small benchmark. Times are the "elapsed time" reported by the time command. Each command has been repeated several times in a row, and the result of each run is shown in the table below. As the operating system caches data written to and read from disk, consecutive calls are faster than the first one. The test computer is a 3 GHz PIV with 1 GB RAM and a Maxtor SATA disk with 8 MB cache, running Mandriva 2006.

Test case 1: directory structure with 3,301 files (2,782 MB of jpegs), of which 100 files (24 MB) are redundant
  duff 0.4     (time ./duff -rP dir >slask.txt): 0:01.55 / 0:01.61 / 0:01.58
  Fslint 2.14  (time ./findup dir >slask.txt):   0:02.59 / 0:02.66 / 0:02.58
  Rdfind 1.1.2 (time rdfind dir):                0:00.49 / 0:00.50 / 0:00.49

Test case 2: directory structure with 35,871 files (5,325 MB), of which 10,889 files (233 MB) are redundant
  duff 0.4     (time ./duff -rP dir >slask.txt): 3:24.90 / 0:46.48 / 0:46.20 / 0:45.31
  Fslint 2.14  (time ./findup dir >slask.txt):   1:26.37 / 1:16.36 / 1:15.38 / 0:53.20
  Rdfind 1.1.2 (time rdfind dir):                0:29.37 / 0:07.81 / 0:06.24 / 0:06.17

Note: times are given as minutes:seconds, with hundredths of a second after the decimal point.

Caveats / Features

A group of files hardlinked to a single inode is collapsed to a single entry if -removeidentinode true. If you have two equal files (inodes) and two or more hardlinks to one or more of them, the behaviour might not be what you expect. Each group is collapsed to a single entry, and that single entry will be hardlinked/symlinked/deleted depending on the options you pass to rdfind. This means that rdfind will detect and correct one file at a time; running it multiple times resolves the situation.

This was discovered by a user with a "hardlinks and rsync"-type backup system. There are lots of backup scripts around that use this technique, and Apple Time Machine also uses hardlinks. If a file is moved within the backed-up tree, one gets one group of hardlinked files from before the move and another from after the move. Running rdfind on the entire tree then has to be done multiple times if -removeidentinode true. To illustrate, here is an example demonstrating the behaviour:

$ echo abc>a
$ ln a a1
$ ln a a2
$ cp a b
$ ln b b1
$ ln b b2
$ stat --format="name=%n inode=%i nhardlinks=%h" a* b*
name=a inode=18 nhardlinks=3
name=a1 inode=18 nhardlinks=3
name=a2 inode=18 nhardlinks=3
name=b inode=19 nhardlinks=3
name=b1 inode=19 nhardlinks=3
name=b2 inode=19 nhardlinks=3

Everything is as expected.

$ rdfind -removeidentinode true -makehardlinks true ./a* ./b*
$ stat --format="name=%n inode=%i nhardlinks=%h" a* b*
name=a inode=58930 nhardlinks=4
name=a1 inode=58930 nhardlinks=4
name=a2 inode=58930 nhardlinks=4
name=b inode=58930 nhardlinks=4
name=b1 inode=58931 nhardlinks=2
name=b2 inode=58931 nhardlinks=2

a, a1 and a2 got collapsed into a single entry, and b, b1 and b2 got collapsed into a single entry. So rdfind is left with a and b (depending on which of them comes first in the * expansion). It replaces b with a hardlink to a; b1 and b2 are untouched.

If one runs rdfind repeatedly, the issue is eventually resolved, with one file being corrected on every run.

rdfind's People

Contributors

jbest, kraj, mlissner, pauldreik, ryandesign, shr-project, thomas-mc-work


rdfind's Issues

Differentiate on metadata

Please consider an option to treat files as different unless they have the same permissions, owner/group, and modification time.

Simplified use case as an example: a UNIX/Linux system with a stock ".bash_profile" in each user's home directory. These are not really duplicates because they each have their own owner/group, permissions, and date/time metadata.

At the moment I can use rdfind to generate a list of files that are duplicates by data, but I then have to post-process the results.txt file to determine whether or not they are true duplicates.
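
For reference, this is roughly the comparison being asked for, sketched with lstat(2). The helper below is hypothetical and not part of rdfind; it only illustrates which stat fields would have to agree before two content-identical files are considered true duplicates.

// Illustrative sketch only (not an existing rdfind option): treat two paths as
// "metadata equal" when permissions, owner/group and modification time match.
#include <sys/stat.h>
#include <string>

bool sameMetadata(const std::string& a, const std::string& b)
{
  struct stat sa;
  struct stat sb;
  if (lstat(a.c_str(), &sa) != 0 || lstat(b.c_str(), &sb) != 0) {
    return false; // unreadable: treat as different
  }
  return (sa.st_mode & 07777) == (sb.st_mode & 07777) && // permission bits
         sa.st_uid == sb.st_uid && sa.st_gid == sb.st_gid &&
         sa.st_mtime == sb.st_mtime;
}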

Provide option for file hash proxy process

I have a huge collection of files that is already largely indexed. It would be helpful if rdfind could ask a subprocess for a file's hash rather than having to calculate it on each run.

tell rdfind to store results.txt in a different location

Hi,

not sure if I am doing it right, but if I tell rdfind to scan folders with spaces, it fails.

admin@VMX-Mate-finddupes:~$ rdfind -n true /mnt/nas/Adults/TV\ Shows
(DRYRUN MODE) Now scanning "/mnt/nas/Adults/TV Shows", found 0 files.
(DRYRUN MODE) Now have 0 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.

if I start rdfind in the directory with rdfind ./ it works, but then it can't write the results.txt.

Is there an option such as -outputname to have the results.txt written to a different location, e.g. ~/results.txt?

thanks
chris

Compilation error: ‘symlink’ was not declared in this scope

When compiling under Cygwin with the latest code from git (as of August 10, 2020), I get:

Fileinfo.cc:279:14: error: ‘symlink’ was not declared in this scope

Full session:

$ ./bootstrap.sh
it seems like everything went fine. now try
./configure && make

$ ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.exe
checking for suffix of executables... .exe
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking whether make supports the include directive... yes (GNU style)
checking dependency style of g++... gcc3
checking whether make sets $(MAKE)... (cached) yes
checking how to run the C++ preprocessor... g++ -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
no
checking whether to enable assertions... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking nettle/sha.h usability... yes
checking nettle/sha.h presence... yes
checking for nettle/sha.h... yes
checking for main in -lnettle... yes
checking for stat... yes
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
configure: checking for c++ 11, disable with --disable-cppstandardcheck
checking whether g++ supports C++11 features with -std=c++11... yes
checking check for fallthrough support... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating config.h
config.status: executing depfiles commands

$ make
make  all-am
make[1]: Entering directory '/home/▒▒▒/build/rdfind'
g++ -std=c++11 -DHAVE_CONFIG_H -I.     -g -O2 -MT rdfind.o -MD -MP -MF .deps/rdfind.Tpo -c -o rdfind.o rdfind.cc
mv -f .deps/rdfind.Tpo .deps/rdfind.Po
g++ -std=c++11 -DHAVE_CONFIG_H -I.     -g -O2 -MT Checksum.o -MD -MP -MF .deps/Checksum.Tpo -c -o Checksum.o Checksum.cc
mv -f .deps/Checksum.Tpo .deps/Checksum.Po
g++ -std=c++11 -DHAVE_CONFIG_H -I.     -g -O2 -MT Dirlist.o -MD -MP -MF .deps/Dirlist.Tpo -c -o Dirlist.o Dirlist.cc
mv -f .deps/Dirlist.Tpo .deps/Dirlist.Po
g++ -std=c++11 -DHAVE_CONFIG_H -I.     -g -O2 -MT Fileinfo.o -MD -MP -MF .deps/Fileinfo.Tpo -c -o Fileinfo.o Fileinfo.cc
Fileinfo.cc: In lambda function:
Fileinfo.cc:279:14: error: ‘symlink’ was not declared in this scope
  279 |       return symlink(target.c_str(), filename.c_str());
      |              ^~~~~~~
Fileinfo.cc: In instantiation of ‘int {anonymous}::transactional_operation(const string&, const Func&) [with Func = Fileinfo::makesymlink(const Fileinfo&)::<lambda(const string&)>; std::string = std::basic_string<char>]’:
Fileinfo.cc:280:6:   required from here
Fileinfo.cc:249:20: error: void value not ignored as it ought to be
  249 |   const int ret = f(filename);
      |                   ~^~~~~~~~~~
make[1]: *** [Makefile:657: Fileinfo.o] Error 1
make[1]: Leaving directory '/home/▒▒▒/build/rdfind'
make: *** [Makefile:536: all] Error 2

Ability to filter file names with a regex

Hello,

I need to deduplicate a huge mail folder from a mail server; some files are identical in a lot of mailboxes, but they should not be hardlinked. I looked at the source code, and it looks quite easy to filter the list of files with a regex.

Would you consider merging such a change? If you do, I can implement it carefully and make a merge request, but if you don't, I will just make the change I need quick & dirty.

I just saw the issue about filtering by file name, so I guess the answer is probably no.

Control minimum file size

Sometimes it is of interest to ignore small files. rdfind already ignores empty files by default. The suggestion is to replace the -ignoreempty flag (or complement it) with a -minsize N flag where N is the minimum size in bytes. Using N=0 would mean empty files are included.

This is an old issue, reported by Andrew Buehler on 2013-11-30.

blake2 hash is fast

BLAKE2 (512 bit) hashes about 20% faster than the fastest SHA; it may be worth a try.

Allow setting SomeByteSize for first/last bytes checks

I tried rdfind with firmware update files. The problem with these is that the first 1000 bytes are identical, and even the last 64 (the current default in Fileinfo.hh) don't differ much between the files. So rdfind resorts to calculating checksums, which takes a long time in comparison for large files.

Could you maybe add a parameter to set the SomeByteSize value to something higher?

Space missing in output text

There are spaces missing after the full stop before the last numbers.

Now eliminating candidates based on first bytes:removed 24347 files from list.33115 files left.
Now eliminating candidates based on last bytes:removed 29203 files from list.3912 files left.
Now eliminating candidates based on sha1 checksum:removed 3896 files from list.16 files left.

I think the space needs to be inserted somewhere here.

Restart rdfind job on interruption

Enable rdfind to be re-initiated on large jobs.

Right now I am trying to scan an 8 TB cold-storage drive with Linux OS backups. It is taking literal days, and I've had to kill the job because there's no way to tell what progress it's making.

[feature] Optimize for big files/virtual images

Hi,

I have the need to frequently check a bunch of virtual images for duplicates.

See below an example for a small data set.

Now have 324 files in total.
Removed 179 files due to nonunique device and inode.
Now removing files with zero size from list...removed 3 files
Total size is 3205167303976 bytes or 3 TiB
Now sorting on size:removed 30 files due to unique sizes from list.112 files left.
Now eliminating candidates based on first bytes:removed 6 files from list.106 files left.
Now eliminating candidates based on last bytes:removed 2 files from list.104 files left.
Now eliminating candidates based on sha1 checksum:

The problem is that these are quite huge, but often the first and last bytes are equal, so rdfind goes into full checksum calculation.

A secondary problem is that most of these files are sparse files, which means reading them in full reads a lot of zeros.

Would it be possible, in addition to or instead of the first/last bytes, to check the first/last x megabytes? Say, 1 MB at the beginning would be sufficient, as this covers either a file system header or swap space, which is more likely to differ when the images differ.

It would mean to sometimes read the x MB in addition, but for my use case it would dramatically reduce the total amount of data read.

rdfind creating symlinks and midnight commander

Hi,
there is a "problem" with rdfind creating symlinks and Midnight Commander. If rdfind creates a symlink for a duplicate file in the same path, Midnight Commander does NOT show both of them, but only either the symlink OR the real file. OK, with
find <path> -type l -print
the symlinks can be sorted out, but this is problematic. How can that be circumvented? Maybe check whether symlinks should only be created in a new directory (e.g. 000-symlinks in that folder)?

Restrict By File Extension

Hej, Paul! Excellent tool! (I was actually about to start writing something very similar, but now I've found I don't have to! Great!)

My situation is this: I am using rdfind to recover space on my hard drive. So, I run rdfind . in my home directory to see what's going on. The result is that I have a lot of mp3 files in my Downloads directory and other places on my hard drive that iTunes has copied into itself. Bad iTunes! However, I also have a lot of .py and .js files that are used by virtualenv and npm for my various projects.

I would like to remove the duplicated mp3 files but leave the py and js files untouched. I've read the documentation, but I don't see a way to do this currently in rdfind.

I'm imagining the interface might be something like:

rdfind --include-extension mp3 or rdfind --exclude-extension py,js.

Show status

Could you show a simple progress indicator for each of the steps?

For example this takes quite long:

Now eliminating candidates based on sha1 checksum:

You could display "reading file i of n" until it is finished. An elapsed time would also be helpful.

Add test for ranking correctness

Having a test that verifies that the ranking between files (as described in the man page) works as promised would make refactoring and cleanups easier.

unexpected token `11,noext,mandatory'

Hi there,

there is an error when using configure and I can't figure out what the problem is.
I used the latest release as well as the latest github master branch.
On the gcc side I tried versions 7.4 and 8.3.
What am I missing?

checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... no
checking whether make supports nested variables... yes
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for style of include used by make... none
checking dependency style of g++... none
checking whether make sets $(MAKE)... (cached) no
checking how to run the C++ preprocessor... g++ -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
no
checking whether to enable assertions... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking nettle/sha.h usability... yes
checking nettle/sha.h presence... yes
checking for nettle/sha.h... yes
checking for main in -lnettle... yes
checking for stat... yes
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
./configure: line 4230: syntax error near unexpected token `11,noext,mandatory'
./configure: line 4230: `      AX_CXX_COMPILE_STDCXX(11,noext,mandatory)'

Provide option for deduplication subprocess

I've noted that there seems to be btrfs deduplication on the roadmap. I would suggest that it might be easier to provide an option for a subprocess which performs the deduplication on rdfind's behalf.

Then, regardless of system configuration, a programmer may plug in a tool that simply does whatever "deduplication" means on their filesystem.

Convert options to getopts standard

There might be no way to do it without breaking backwards compatibility, but specifying options to rdfind is really strange as currently implemented.

Long options should really be specified as
rdfind --makehardlinks <path>
instead of
rdfind -makehardlinks true <path>

To be consistent with other command line tool invocations.

recursion limit exceeded running on smb mounts

When a directory is mounted over smb via samba on linux, running rdfind inside that directory spams:

recursion limit exceeded
recursion limit exceeded
recursion limit exceeded
recursion limit exceeded
recursion limit exceeded

and never completes.

Observed the following errors in the kernel log:

[ 1588.565758] CIFS VFS: Close unmatched open
[ 1588.566208] CIFS VFS: Close unmatched open
[ 1588.566521] CIFS VFS: Close unmatched open
[ 1588.566856] CIFS VFS: Close unmatched open
[ 1588.567192] CIFS VFS: Close unmatched open
[ 1588.567573] CIFS VFS: Close unmatched open
[ 1588.567809] CIFS VFS: Close unmatched open
[ 1588.568044] CIFS VFS: Close unmatched open
[ 1588.568425] CIFS VFS: Close unmatched open
[ 1588.568691] CIFS VFS: Close unmatched open

When a directory is mounted in WSL 1 over smb via samba, it also produces errors such as:

Dirlist.cc::handlepossiblefile: This should never happen. FIXME! details on the next row:
possiblefile="./master/.hardinfo"
Dirlist.cc::handlepossiblefile: This should never happen. FIXME! details on the next row:
possiblefile="./master/build/build/ouya/menu/initrd/data"

along with the recursion limit exceeded errors.

Samba host is ubuntu 20.04, running kernel 5.6.11.
Linux client is ubuntu 19.10.
WSL client is ubuntu 18.04.
"This is rdfind version 1.4.1"
Issues do not manifest when running over nfs.

Add Mac OS X support to travis builds

Any Mac users? I would be happy if someone could help me adding mac builds to travis.

There are probably quite a lot of "gnu-isms" in the shell scripts etc., which may be a problem.

Provide way to cache files checksums

I use rdfind to make hardlinks between two big directory trees. The number of files is ~25000. Files are immutable - they can be created and deleted but not modified.

Each time I run rdfind, it excludes about 10% of the files on some criteria, but after that it starts the most time-consuming part: reading first bytes and last bytes. This takes a lot of time because the number of files to read is more than 20000 and the total size of the files is about 40 GB.

My idea is to cache file checksums somewhere, so that when I start rdfind the next time it can read the checksums of old files from the cache and calculate checksums only for those files that are new (that did not exist at the time of the previous run).

I believe this can dramatically reduce execution time.
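
One possible shape for such a cache, as a sketch only (rdfind has no such feature today): key the stored checksum on device, inode, size and modification time, so that a changed file automatically misses the cache and gets re-hashed.

// Sketch of the proposed cache; this is not an existing rdfind feature.
#include <sys/stat.h>
#include <map>
#include <string>
#include <tuple>

using CacheKey = std::tuple<dev_t, ino_t, off_t, time_t>;

class ChecksumCache {
public:
  // Returns true and fills 'checksum' if the file is known and unchanged.
  bool lookup(const struct stat& st, std::string& checksum) const
  {
    const auto it = m_entries.find(keyOf(st));
    if (it == m_entries.end()) {
      return false;
    }
    checksum = it->second;
    return true;
  }
  void store(const struct stat& st, const std::string& checksum)
  {
    m_entries[keyOf(st)] = checksum;
  }

private:
  static CacheKey keyOf(const struct stat& st)
  {
    return CacheKey(st.st_dev, st.st_ino, st.st_size, st.st_mtime);
  }
  // In a real implementation this map would be loaded from and saved to disk
  // between runs.
  std::map<CacheKey, std::string> m_entries;
};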

Unable to scan external HDD with rdfind, I can Read & Write via Terminal.

What I have done to resolve this but failed:

  • rdfind /Volumes/1TB\ HDD2/Pictures/
  • Move copies of jpg files to root Directory of 1TB HDD2
  • Change from jpg to jpeg.

Output:
❯ rdfind /Volumes/1TB\ HDD2/Pictures/
Now scanning "/Volumes/1TB", found 0 files.
Now scanning "HDD2/Pictures", found 0 files.
Now have 0 files in total.
Removed 0 files due to nonunique device and inode.
Total size is 0 bytes or 0 B
Removed 0 files due to unique sizes from list.0 files left.
Now eliminating candidates based on first bytes:removed 0 files from list.0 files left.
Now eliminating candidates based on last bytes:removed 0 files from list.0 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.0 files left.
It seems like you have 0 files that are not unique
Totally, 0 B can be reduced.
Now making results file results.txt

MacOS Catalina
Seagate 1TB

Running with option -makehardlinks true is not idempotent

I might be misled, but I expected a run of rdfind . -makehardlinks to be idempotent on a file system that hasn't changed in the meantime, so a second run should not be able to replace any files with hard links.
But the output suggests that it isn't. See my output of consecutive runs below, running rdfind version 1.3.5 on Ubuntu 18.04 (latest stable).

I am now afraid that unique files are affected, and I'm unsure if I should restore a backup.
Please tell me where I have a misconception.

stefan@meidling:/var/phoenix/pan-uploads/images/episode/4/7$ sudo rdfind . -makehardlinks true
Now scanning ".", found 50145 files.
Now have 50145 files in total.
Removed 1830 files due to nonunique device and inode.
Now removing files with zero size from list...removed 0 files
Total size is 812738601 bytes or 775 MiB
Now sorting on size:removed 5608 files due to unique sizes from list.42707 files left.
Now eliminating candidates based on first bytes:removed 4329 files from list.38378 files left.
Now eliminating candidates based on last bytes:removed 4885 files from list.33493 files left.
Now eliminating candidates based on md5 checksum:removed 527 files from list.32966 files left.
It seems like you have 32966 files that are not unique
Totally, 442 MiB can be reduced.
Now making results file results.txt
Now making hard links.
Making 28744 links.
stefan@meidling:/var/phoenix/pan-uploads/images/episode/4/7$ sudo rdfind . -makehardlinks true
Now scanning ".", found 50146 files.
Now have 50146 files in total.
Removed 30259 files due to nonunique device and inode.
Now removing files with zero size from list...removed 0 files
Total size is 356656677 bytes or 340 MiB
Now sorting on size:removed 7452 files due to unique sizes from list.12435 files left.
Now eliminating candidates based on first bytes:removed 5760 files from list.6675 files left.
Now eliminating candidates based on last bytes:removed 5513 files from list.1162 files left.
Now eliminating candidates based on md5 checksum:removed 532 files from list.630 files left.
It seems like you have 630 files that are not unique
Totally, 4 MiB can be reduced.
Now making results file results.txt
Now making hard links.
Making 315 links.
stefan@meidling:/var/phoenix/pan-uploads/images/episode/4/7$ sudo rdfind . -makehardlinks true
Now scanning ".", found 50145 files.
Now have 50145 files in total.
Removed 30378 files due to nonunique device and inode.
Now removing files with zero size from list...removed 0 files
Total size is 351620489 bytes or 335 MiB
Now sorting on size:removed 7518 files due to unique sizes from list.12249 files left.
Now eliminating candidates based on first bytes:removed 5810 files from list.6439 files left.
Now eliminating candidates based on last bytes:removed 5515 files from list.924 files left.
Now eliminating candidates based on md5 checksum:removed 532 files from list.392 files left.
It seems like you have 392 files that are not unique
Totally, 2 MiB can be reduced.
Now making results file results.txt
Now making hard links.
Making 196 links.
stefan@meidling:/var/phoenix/pan-uploads/images/episode/4/7$ sudo rdfind . -makehardlinks true
Now scanning ".", found 50146 files.
Now have 50146 files in total.
Removed 30434 files due to nonunique device and inode.
Now removing files with zero size from list...removed 0 files
Total size is 350678785 bytes or 334 MiB
Now sorting on size:removed 7548 files due to unique sizes from list.12164 files left.
Now eliminating candidates based on first bytes:removed 5831 files from list.6333 files left.
Now eliminating candidates based on last bytes:removed 5521 files from list.812 files left.
Now eliminating candidates based on md5 checksum:removed 532 files from list.280 files left.
It seems like you have 280 files that are not unique
Totally, 1 MiB can be reduced.
Now making results file results.txt
Now making hard links.
Making 140 links.
stefan@meidling:/var/phoenix/pan-uploads/images/episode/4/7$ sudo rdfind . -makehardlinks true
Now scanning ".", found 50146 files.
Now have 50146 files in total.
Removed 30460 files due to nonunique device and inode.
Now removing files with zero size from list...removed 0 files
Total size is 350342492 bytes or 334 MiB
Now sorting on size:removed 7566 files due to unique sizes from list.12120 files left.
Now eliminating candidates based on first bytes:removed 5839 files from list.6281 files left.
Now eliminating candidates based on last bytes:removed 5521 files from list.760 files left.
Now eliminating candidates based on md5 checksum:removed 532 files from list.228 files left.
It seems like you have 228 files that are not unique
Totally, 1 MiB can be reduced.
Now making results file results.txt
Now making hard links.
Making 114 links.
stefan@meidling:/var/phoenix/pan-uploads/images/episode/4/7$ sudo rdfind . -makehardlinks true
Now scanning ".", found 50146 files.
Now have 50146 files in total.
Removed 30477 files due to nonunique device and inode.
Now removing files with zero size from list...removed 0 files
Total size is 350052283 bytes or 334 MiB
Now sorting on size:removed 7574 files due to unique sizes from list.12095 files left.
Now eliminating candidates based on first bytes:removed 5846 files from list.6249 files left.
Now eliminating candidates based on last bytes:removed 5523 files from list.726 files left.
Now eliminating candidates based on md5 checksum:removed 532 files from list.194 files left.
It seems like you have 194 files that are not unique
Totally, 920 KiB can be reduced.
Now making results file results.txt
Now making hard links.
Making 97 links.
stefan@meidling:/var/phoenix/pan-uploads/images/episode/4/7$ sudo rdfind . -makehardlinks true
Now scanning ".", found 50146 files.
Now have 50146 files in total.
Removed 30490 files due to nonunique device and inode.
Now removing files with zero size from list...removed 0 files
Total size is 349873207 bytes or 334 MiB
Now sorting on size:removed 7581 files due to unique sizes from list.12075 files left.
Now eliminating candidates based on first bytes:removed 5850 files from list.6225 files left.
Now eliminating candidates based on last bytes:removed 5525 files from list.700 files left.
Now eliminating candidates based on md5 checksum:removed 532 files from list.168 files left.
It seems like you have 168 files that are not unique
Totally, 749 KiB can be reduced.
Now making results file results.txt
Now making hard links.
Making 84 links.

Option to Omit a Directory or File

Provide the ability to tell rdfind not to parse part of a tree...

e.g. rdfind -makehardlinks true "/usr/me" -omit "/usr/me/somedirectory"

In my use-case I have a directory with 30 sub-directories... and I need to run rdfind across 29 of the 30.

Use ZFS metadata to compare checksums

I use ZFS to store all my files in a big pool. Sometimes I have duplicates, which I identify with rdfind in a dry run, check the results.txt manually, and (if it's ok with me) re-run rdfind to really delete the duplicate files. So rdfind not only needs to read the remaining files in full twice (at least) to compute the checksums, it also does not leverage the block checksums of every file that are an inherent feature of ZFS (and calculated anyway, at write time of the file).

The ZFS command zdb gives an indication of how this could work. To query which files (and their ZFS object IDs) are on a given ZFS file system (here minitank/fw/video/gopro):

$ zdb -vvv minitank/fw/video/gopro | grep \(type:\ Regular | head -n 100
(...)
		.VolumeIcon.icns = 7 (type: Regular File)
		com.apple.FinderInfo = 9 (type: Regular File)
		VolumeConfiguration.plist = 16 (type: Regular File)
(...)
		GOPR0410.MP4 = 110 (type: Regular File)
		GOPR0428.LRV = 109 (type: Regular File)
		G0030427.JPG = 108 (type: Regular File)
		GOPR0434.MP4 = 121 (type: Regular File)
		G0010416.JPG = 120 (type: Regular File)
		GOPR0433.MP4 = 116 (type: Regular File)
		G0020421.JPG = 115 (type: Regular File)

This allows one to query (for example) the checksums for each block of ZFS object 115 (file G0020421.JPG):

$ zdb -vvvvv minitank/fw/video/gopro 115 | grep "L0 ZFS plain file" | grep -o cksum=.*
cksum=2b0ad5f93fb4:7298cf4a6946d86:9121b965256654b8:1bc1f6f4a2a099d8
cksum=4097453f3d6c:103cb4291526d015:8afcb0a258e95b77:a9e71a3e8f82f19d
cksum=4022edcd7609:100fea58fa9ac828:2ca5db609ae0a854:faea7f497f7daf2c
cksum=404ea3b38c21:1015c2c361081102:bcc31ed291dca604:4b126a2d4052739b
cksum=3f598a2ab67a:fde544c32e664c2:abd956aeed12b9db:7cf2e4d2a6c9bd4f
cksum=3f675000fe35:fd98072b0095081:5a98cae21efdd757:f7fbf8cd6fa73304
cksum=3ef7cee7b23e:fcc06f4ad0f2b00:9588581ab10b1fe6:c6ee212e369f0f79
cksum=3e4bcddec672:f94b6b8790382e5:628b25de42d08d2a:b563887a5e7995a8
cksum=3d43f9295614:f4fc8a2c9ced025:9bf45e95e78116e6:97f66dbb9bcc3da7
cksum=3db1b8dcb8ea:f6898164ef8961d:717da2e5f2a8ae7c:7d1ff9f6e079a349
cksum=3e8278d5f3eb:f969929d7718668:cc9012a3cdbee3ab:be677c0b5973d2e
cksum=3edd74e3736a:fb28bbd504ea3ed:29d1881b6fdf791f:d88937b1fa135fe6
cksum=3e6d420fa43c:f925adf1e0beeae:e6ec2c707a56118d:2b704c513ab8e7a4
cksum=3df8ee864c70:f7c1a08ad5774eb:9f92915c010af70:37f591cd192339ae
cksum=3d95d67f89ef:f60c81b1766b80c:a5ffc708f633a88f:a4fa96a5806af28d
cksum=3e191cd0dcf0:f7a200c5a8e12ce:dae62952d2d776eb:a932d6341e2bf523
cksum=409332d2588d:ffff751b79ce0c4:13eb43c9f54317cc:352b881aa6025856
cksum=53e67ac1208:283c714f9b3a2a1:78069759bb33e8b1:1e280f42ef6e382a

Now, for rdfind, using the zdb command is not a very good idea (except maybe for a PoC), as the output format is clearly not meant for automatic processing (it is also said not to be backward compatible across ZFS releases). But zdb has no real magic; AFAIK it just queries the ZFS API and spits out the resulting information as text.

So when using ZDB, reading the actual files to check for equality would be unnecessary - which would render rdfind even quicker.

In a small experiment I could reproduce the cksum values being stable across different ZFS file systems (bar the checksums of the last block, because of a shorter block length on one of the ZFS file systems) and also intra-file system with a simple copy of the file (but different name and access times etc.).

Add 32 bit testing to continuous integration

Rdfind 1.4.0 failed to build on some of Debian's build machines. It worked on amd64, arm64, mips64, ppc64el, s390x, alpha, ia64, ppc64, riscv64 and sparc64. It failed on armel, armhf, i386, mips, hppa, hurd-i386, m68k, powerpc, powerpcspe, sh4 and x32. The common denominator is that all of those are (I believe) 32 bit platforms.
This demonstrates that building on a single architecture (amd64) is not enough.
Here are some ideas for how to fix this on appveyor and/or travis:

  • build on 32 bit x86 (how? through debian multiarch?)
  • qemu ?
  • docker?
  • x32?

Is there any freely available continuous integration that uses arm? Thankful for input and suggestions.

Improve documentation and error handling of -dryrun

The dryrun option is possible to misunderstand, and the error message is bad. See Debian bug 754663 and Ubuntu bug 1782273.

It is perhaps difficult to get right from just glancing over the documentation, because the option needs a boolean argument (true or false) afterwards. This is a bad design (sorry!) but must remain as is for backwards compatibility. Making the true|false optional would lead to ambiguity, as this awkward example shows:

cd $(dirname $(which true))
rdfind -dryrun true false

It would be good to improve the error output, so that the error message immediately suggests what should be done.
Perhaps the builtin documentation can be made a bit clearer.

Nettle issues on centos

I'm trying to build this on centos 7 with autotools. I have an old version of rdfind with yum.

checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking nettle/sha.h usability... no
checking nettle/sha.h presence... no
checking for nettle/sha.h... no
configure: error:
 nettle header files missing. Please install nettle
 first. If you have already done so and get this error message
 anyway, it may be installed somewhere else, maybe because you
 don't have root access. Pass CPPFLAGS=-I/your/path/to/nettle to configure
 and try again. The path should be so that #include "nettle/sha.h" works.
 On Debian-ish systems, use "apt-get install nettle-dev" to get a system
 wide nettle install.

I tried passing CPPFLAGS=-I"/opt/rfind/nettle" ./configure ... no cigar. sha.h is there and nettle seemed to build correctly.

autotools macro issue

When running autoreconf -fi and then ./configure I get the following error

./configure: line 4176: syntax error near unexpected token `11,noext,mandatory'
./configure: line 4176: `      AX_CXX_COMPILE_STDCXX(11,noext,mandatory)'

Probably the macro is missing? Creating a local m4 folder might help?

Display more accurate space saving if removeidentinode is false

After the run, I see no additional free space, despite the "Totally, 740 Gib can be reduced" message in the console.
Here is my full output:

# df -h /mnt/s4; time rdfind -removeidentinode false -makehardlinks true /mnt/s4/; df -h /mnt/s4
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdf1       7.3T  7.3T     0 100% /mnt/s4

Now scanning "/mnt/s4", found 436762 files.
Now have 436762 files in total.
Now removing files with zero size from list...removed 1008 files
Total size is 9425999506070 bytes or 9 Tib
Now sorting on size:removed 142093 files due to unique sizes from list.293661 files left.
Now eliminating candidates based on first bytes:removed 117002 files from list.176659 files left.
Now eliminating candidates based on last bytes:removed 6037 files from list.170622 files left.
Now eliminating candidates based on md5 checksum:removed 13035 files from list.157587 files left.
It seems like you have 157587 files that are not unique
Totally, 740 Gib can be reduced.
Now making results file results.txt
Now making hard links.
Making 111390 links.

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdf1       7.3T  7.3T  620K 100% /mnt/s4

It seems I'm using the wrong flags; any advice?

NTFS support

It's not clear what happens if you run this on an NTFS volume mounted from Linux. Will NTFS hard links be created?

Assertion failure from Dirlist.cc::handlepossiblefile on unreadable subdirectory

Running 1.4.1 from the Ubuntu package. Works fine except when encountering unreadable subdirectories:

tmp$ mkdir rdfind-test
tmp$ sudo mkdir rdfind-test/subdir
tmp$ sudo touch rdfind-test/subdir/file
tmp$ mkdir rdfind-test/otherdir
tmp$ touch rdfind-test/otherdir/file
tmp$ rdfind rdfind-test/
Now scanning "rdfind-test", found 0 files.
Now have 0 files in total.
Removed 0 files due to nonunique device and inode.
Total size is 0 bytes or 0 B
Removed 0 files due to unique sizes from list.0 files left.
Now eliminating candidates based on first bytes:removed 0 files from list.0 files left.
Now eliminating candidates based on last bytes:removed 0 files from list.0 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.0 files left.
It seems like you have 0 files that are not unique
Totally, 0 B can be reduced.
Now making results file results.txt
tmp$ sudo chmod go-rwx rdfind-test/subdir/
tmp$ rdfind rdfind-test/
Now scanning "rdfind-test"Dirlist.cc::handlepossiblefile: This should never happen. FIXME! details on the next row:
possiblefile="rdfind-test/subdir"
, found 0 files.
Now have 0 files in total.
Removed 0 files due to nonunique device and inode.
Total size is 0 bytes or 0 B
Removed 0 files due to unique sizes from list.0 files left.
Now eliminating candidates based on first bytes:removed 0 files from list.0 files left.
Now eliminating candidates based on last bytes:removed 0 files from list.0 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.0 files left.
It seems like you have 0 files that are not unique
Totally, 0 B can be reduced.
Now making results file results.txt

Optionally show progress bar

This is after suggestion from SB sending me an email, thanks!

While waiting for rdfind to complete, it would be nice to present some kind of feedback in case the session is interactive. During the first step (scanning the inputs), it is difficult to know how much time is left, but for the later steps (reading first bytes, and hashing the contents) it is certainly possible.

The downside is that it complicates the program and slows it down, so it should probably be made an opt-in.

It may also be confusing for users that the progress bar is per step. Perhaps it could say
Step 1/3: ###
and then once that step is finished, the output will look like:
Step 2/3: #

This is the proposal:

Alternative 1 - progress bar

The progress bar could be a character like # (like wget does, but without the percentage and time estimate). The output is # repeated 0 to 80 times, depending on the progress.

  1. During scanning which files exist, present nothing.
  2. During scanning the first bytes of each file, output progress as i/N where i is the current step and N the number of files known to be scanned
  3. During hashing the file contents, output progress as i/N where i is the current step and N the number of files known to be hashed

This will be misleading in case some of the files are unevenly fast to read (one dir on a local ssd, the other on a network share), or the sizes vary a lot.

Alternative 2 - spinner

A spinner alternates between displaying the characters \ | / -, making it appear like a spinning rod.

  1. During scanning which files exist, turn the spinner each time a file is found.
  2. During scanning the first bytes of each file, output progress as i/N where i is the current step and N the number of files known to be scanned
  3. During hashing the file contents, output progress as i/N where i is the current step and N the number of files known to be hashed

The good thing with this is that it is easy to see that the process has not hung, and it is perhaps a bit less visually obstructive than the previous alternative.

Implementation

This does not look difficult to implement, see for instance https://stackoverflow.com/a/14539953
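
A minimal sketch of what alternative 2 could look like in C++ (illustration only, not a patch): a spinner and an i/N counter kept on a single line with a carriage return.

// Minimal sketch of alternative 2 (spinner plus an i/N counter); not rdfind
// code, just an illustration of the proposal.
#include <cstddef>
#include <iostream>

void showProgress(std::size_t i, std::size_t total)
{
  static const char spinner[] = { '\\', '|', '/', '-' };
  // '\r' moves the cursor back to the start of the line so the text is
  // overwritten in place on the next call.
  std::cout << '\r' << spinner[i % 4] << ' ' << i << '/' << total << std::flush;
  if (i == total) {
    std::cout << '\n'; // finish the line once the step is done
  }
}

int main()
{
  const std::size_t n = 200000;
  for (std::size_t i = 1; i <= n; ++i) {
    // ... per-file work (reading first bytes, hashing, ...) would go here ...
    showProgress(i, n);
  }
}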

Feedback on this proposal

Feel free to drop your comments and suggestions here!

Better processing of duplicate inodes

As described in readme, currently identical inodes can be dealt with in one of two ways:

  • with -removeidentinode true, one might need to run rdfind several times to merge two hardlinked groups of identical files (each run will do an md5 computation on these two inodes)
  • with -removeidentinode false, this can be done in one pass, at the expense of doing an md5 computation on each of the hardlinks (I assume so).

I suggest improving this in such a way that instead of "collapsing a group of hardlinked files to a single entry" and dealing with that entry alone, in step 5 of the algorithm we keep a list of all hardlinks linked to this entry and, after step 12 of the algorithm (after checksum comparison), if the "main entry" is still in the list of duplicates, we add all its hardlinks to the list too.
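
To make the suggestion concrete, here is a small data-structure sketch (hypothetical, not rdfind's internals): the extra hardlinks found in step 5 are remembered next to the representative entry instead of being thrown away, and after step 12 they are added back for every representative that turned out to be a duplicate.

// Illustrative data-structure sketch of the suggestion above; these are not
// rdfind's actual types.
#include <sys/types.h>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct HardlinkGroup {
  std::string representative;      // the single entry kept in the candidate list
  std::vector<std::string> others; // hardlinks collapsed away in step 5
};

using InodeKey = std::pair<dev_t, ino_t>;
using HardlinkGroups = std::map<InodeKey, HardlinkGroup>;

// After step 12: for every representative confirmed to be a duplicate, also
// add its collapsed hardlinks to the duplicate list.
std::vector<std::string>
expandDuplicates(const std::vector<InodeKey>& confirmedDuplicates,
                 const HardlinkGroups& groups)
{
  std::vector<std::string> out;
  for (const auto& key : confirmedDuplicates) {
    const auto it = groups.find(key);
    if (it == groups.end()) {
      continue;
    }
    out.push_back(it->second.representative);
    out.insert(out.end(), it->second.others.begin(), it->second.others.end());
  }
  return out;
}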

Prefer base filenames, older files to asciibetical listing

While using rdfind to clean up my ebooks, the tool deleted copies that were downloaded first in favor of those that had a (1).ext in the filename, an artifact of downloading files twice in Firefox.

In the -deleteduplicates process, dropping files that have already been tagged by another program as potential dupes should be preferable.

DUPTYPE_FIRST_OCCURRENCE 69 0 6487394 64769 29229369 3 eBooks/9781787123687(1).epub
DUPTYPE_WITHIN_SAME_TREE -69 0 6487394 64769 29230708 3 eBooks/9781787123687(2).epub
DUPTYPE_WITHIN_SAME_TREE -69 0 6487394 64769 29231261 3 eBooks/9781787123687.epub
DUPTYPE_FIRST_OCCURRENCE 310 0 7356520 64769 28967090 3 eBooks/gamehacking(1).pdf
DUPTYPE_WITHIN_SAME_TREE -310 0 7356520 64769 25559164 3 eBooks/gamehacking.pdf
DUPTYPE_FIRST_OCCURRENCE 28 0 7367823 64769 33296269 3 eBooks/9781783553358-PYTHON_DATA_ANALYSIS(1).pdf
DUPTYPE_WITHIN_SAME_TREE -28 0 7367823 64769 33296270 3 eBooks/9781783553358-PYTHON_DATA_ANALYSIS.pdf

Don't try to make more hard links than the fs allows

Many filesystems have 16-bit (or even smaller) link counters; e.g. ext4 has a limit of 65000 links per file. If there are more duplicate files than that, rdfind tries to link all of them to one, and it will fail to make links after the limit has been reached.

If the limit has been reached, the next processed copy of the file should be used as a target for subsequent duplicates instead of replacing it with a link (and so on if the limit is reached again).

The maximum number of links for any given file or directory should be discoverable with pathconf/fpathconf, and the current link count with stat/fstat.
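
For illustration, such a check could look roughly like the following (the helper name is made up; rdfind does not currently do this):

// Hypothetical helper (not an existing rdfind function): check whether 'extra'
// additional hard links can be made to 'path', using pathconf() for the
// filesystem's per-file link limit and stat() for the current link count.
#include <sys/stat.h>
#include <unistd.h>
#include <cerrno>
#include <string>

bool canAddLinks(const std::string& path, long extra)
{
  struct stat st;
  if (stat(path.c_str(), &st) != 0) {
    return false;
  }
  errno = 0;
  const long maxLinks = pathconf(path.c_str(), _PC_LINK_MAX);
  if (maxLinks < 0) {
    // -1 with errno unchanged means "no limit"; -1 with errno set is an error.
    return errno == 0;
  }
  return static_cast<long>(st.st_nlink) + extra <= maxLinks;
}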

Support filtering by file size to allow parallelization of workloads and smaller job sizes

Support for min and max filesizes to allow for “file size based partitioning” of the deduplication process. One could think of this as “banding” along the spectrum of sizes or sharding.

Why this enhancement is safe:

The size filtering only affects whether or not files are pushed onto the queue to be worked on. It is merely a "cull" of the files that are walked in the directory tree.

Why this enhancement is needed:

We work on several machine learning and deep learning projects which operate on millions of files. Deduplication is critical in certain applications and categorizations. However running rdfind on the entire dataset can lead to jobs that do not finish before the arrival of new data and the performance degrades as the working set grows beyond physical memory. By partitioning on file size, the deduplication operation can be run in parallel (machines and cores) and in smaller batches with no loss of functionality and gains in speed from parallelization. Additionally, since certain size files are more frequent in our applications, it is possible to target deduplication to “the most active file sizes” in the archives.

Other motivations:

I have used this feature for some years and it is valuable to me. I think it would be useful to others. There is no discernible impact on runtime performance. Arguments that running in parallel is fraught with danger only apply if running in parallel over the same size band, which is no different from running two rdfind jobs at once. (This, by the way, happens when using cron and discovering a prior job has not finished, so partitioning into faster, smaller jobs is actually safer in my mind.)

Working Implementation Available (see pull request)

[feature request] different treatment of directories?

Right now "rdfind-makesymlinks|-deleteduplicates true/directory1/ /directory2/" will attempt to prune duplicates in both. Sometimes it would be safer or more convenient if there was an option allowing to:
A: keep /directory1/ from any modifications and remove/replace duplicates from /directory2/ only; or
B: simply don't count duplicates within the same tree at all, only between different trees.
Perhaps (B) is better, as it could be used instead of (A) by running twice - 1st pass without this option pruning the directories that don't need to be preserved, then the 2nd pass with "between directories only" option and the one which should be treated as read-only in the first directory argument, to have it always counted as the "original" instance.
Especially since rdfind already discerns "DUPTYPE_WITHIN_SAME_TREE" vs. "DUPTYPE_OUTSIDE_TREE".

Source files are deleted when hardlinks can't be created

When running the script on multiple devices, the following error appears:

failed to make hardlink /mount1/…/file1 to /mount2/…/file2    
Rdutil.cc: Failed to apply function f on it. 

file1 ends up being deleted. There should be some kind of check to verify that both files live on the same device.
I'm running 1.3.5; I haven't checked the git master.
