diffn is a small program intended for comparing a set of files for pairwise equality. There are ways to do this using standard utilities, but they are inefficient and therefore unsuitable for large files. For example, you could run diff for every pair of files:
for i in <files> ; do for j in <files> ; do diff -q "$i" "$j" ; done ; done
But that would perform every comparison twice, and read every file n times. A different possibility is computing a hash:
md5sum <files> | sort
Then one could visually compare the MD5 sums (with equal ones next to each other because of the sort). But this reads all the data from each file, which makes it slow for large files.
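Both standard-utility approaches can be tightened somewhat, although neither removes the underlying cost. The following is only a sketch: the function names compare_pairs_once and md5_duplicates are made up for illustration, cmp -s stands in for diff -q, and the uniq options used are GNU coreutils extensions. The first helper visits each unordered pair once but still reads files repeatedly; the second prints only the lines whose MD5 sum occurs more than once but still hashes every file completely.

```shell
# Hypothetical helper: compare each unordered pair of files exactly once.
compare_pairs_once() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift                   # "$@" now holds only the files after $first
        for other in "$@"; do
            # cmp -s prints nothing and exits 0 only if the contents match
            if cmp -s -- "$first" "$other"; then
                printf '%s == %s\n' "$first" "$other"
            fi
        done
    done
}

# Hypothetical helper: list only files whose MD5 sum occurs more than once.
# Assumes GNU uniq: -w 32 compares just the 32 hex digits of the hash, and
# --all-repeated=separate prints every member of each duplicate group.
md5_duplicates() {
    md5sum -- "$@" | sort | uniq -w 32 --all-repeated=separate
}
```

Called as compare_pairs_once <files> or md5_duplicates <files>, either helper replaces the visual comparison, but both remain slow for large files for the reasons given above.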
diffn reads each file only once, and only up to the point when it is seen to be different from all others. Before starting to read, it compares file sizes to sort out differently sized files. The way it outputs equal files can be adapted to facilitate reading its output with a different program.
Get diffn: Download gzipped tar archive
For Arch Linux users, this is now in the AUR; the PKGBUILD is also here.
diffn - compare n files for equality
diffn determines which of a set of files have the same content or are otherwise equivalent. It is intended for comparing large, possibly binary files, some of which are duplicates. Because it reads each file only once and stops at the first difference, it is more efficient than n-by-n diff or comparing md5sums. It can also handle special files such as directories, symbolic links, devices, sockets and others.
diffn distinguishes two kinds of equality. Files are considered identical when they refer to the same inode. Symbolic links are considered identical to their targets unless the -l option is given. Otherwise, only hard links are identical. Regular files are considered equal if their contents are the same. Device files are equal if their major and minor numbers are equal. All other special files are not equal to each other.
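The difference between identical and equal regular files can be illustrated with standard shell tests. This is a rough sketch, not diffn's implementation: classify_pair is a hypothetical name, the -ef test (same device and inode, following symbolic links) is assumed to be supported by the shell in use, and device files and other special files are not handled.

```shell
# Hypothetical helper: classify two regular files the way the text describes.
classify_pair() {
    if [ "$1" -ef "$2" ]; then
        # Same device and inode; symbolic links are followed, which matches
        # the default behaviour described above (no -l option).
        echo identical
    elif cmp -s -- "$1" "$2"; then
        # Different inodes, but the same contents.
        echo equal
    else
        echo different
    fi
}
```

For a file, a hard link to it and a copy of it, this prints "identical" for the file and its hard link and "equal" for the file and its copy.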
diffn tries to be as efficient as possible when comparing regular files. File sizes are compared first. Files are read and compared block by block. Once a file is found to be different from all others, it is output and removed from the set. Only duplicate files have to be read completely, and this is unavoidable because there could always be a difference later on.
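The first step, comparing file sizes, can be approximated with standard tools. The sketch below is an illustration under stated assumptions, not diffn's code: same_size_candidates is a hypothetical name, sizes are taken with wc -c, and file names are assumed to contain no tabs or newlines. It prints only those files that share their size with at least one other file, since only these can have equal contents.

```shell
# Hypothetical helper: print "size<TAB>name" for every file whose size is
# shared with at least one other file; all others are certainly unique.
same_size_candidates() {
    for f in "$@"; do
        # Reading from stdin makes wc print the byte count without the name.
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -n | awk -F '\t' '
        { count[$1]++; line[NR] = $0; size[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++)
                if (count[size[i]] > 1) print line[i]
        }'
}
```

Only the files listed by this filter would still need their contents compared; the block-by-block comparison itself is not shown here.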
Each set of equal files is output one file per line. All but the first are indented and prefixed with "==". Files identical to their predecessors are indented further and prefixed with "===". This can be changed with the --*sep options. Because files are output as soon as they are found to be unique, the output is not sorted in any way.
Print brief usage information.
Do not dereference symbolic links. Symlinks will not be considered equal to their targets, or to other symlinks with the same target.
Do not output anything, but return 1 if all files are unique.
Do not output anything, but return 1 if not all files are equal.
Set the separator strings between sets of equal files, between equal files and between identical files, respectively. --eqsep also sets the separator between identical files unless --idsep overrides it. This makes it possible to tailor diffn's output for further automated processing. For example, passing --eqsep " " will print one set of equal files per line.
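As an illustration of such automated processing, the one-set-per-line output can be filtered with awk. This sketch rests on assumptions not confirmed here: that with --eqsep " " each line holds the space-separated names of one set, that unique files appear as lines with a single name, and that no file name contains whitespace. The function name duplicate_sets is hypothetical.

```shell
# Hypothetical filter: keep only lines that name more than one file,
# i.e. sets that contain at least one duplicate.
# Intended use (assuming the output format described above):
#   diffn --eqsep " " <files> | duplicate_sets
duplicate_sets() {
    awk 'NF > 1'
}
```

File names containing spaces would need a separator that cannot occur in them.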
If the -q or -Q option is given, a return value of 1 indicates differences. A return value greater than 1 indicates an error. Otherwise, 0 is returned.
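In a script, the three cases can be told apart with a case statement. The wrapper below is a sketch; diffn_status is a hypothetical name, and its first argument is expected to be the -q or -Q option, followed by the files.

```shell
# Hypothetical wrapper: run diffn quietly and report what its exit status
# means according to the description above.
# Usage: diffn_status -q <files>   or   diffn_status -Q <files>
diffn_status() {
    diffn "$@" > /dev/null
    status=$?
    case $status in
        0) echo "no differences" ;;
        1) echo "differences" ;;
        *) echo "error (exit status $status)" >&2 ;;
    esac
    return "$status"        # pass diffn's own status on to the caller
}
```

What exactly counts as a difference depends on which of the two options was passed.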
If too many files cannot be told apart by stat'ing them (for instance, because they all have the same size), the system may run out of file descriptors when diffn tries to open all of them, and diffn will be unable to proceed.
diffn is (c) 2010 Volker Schatz. It is free software and may be redistributed and/or modified under the terms of the GNU General Public License, version 3 or later.