On 09/24/13 00:12, Dr.Ruud wrote:
I assume this is about paths and filenames. Have you considered an rsync
I use "rsync -n ..." (dry run) frequently.
I also assume that you want to communicate as little as possible, so you
don't have supersets of all strings on all sides. (or it would become a
simple indexing problem)
I also assume that you are more interested in missing items, so
hash-value collisions are not a problem.
My use-case is ~100k files. I'm looking for a hash function that will
have few, if any, collisions.
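
To put a number on "few, if any, collisions" for ~100k items, here is a rough
birthday-bound estimate. It is only a sketch: it assumes the hash output is
uniformly distributed and that nobody is crafting inputs to collide on
purpose. The function name collision_probability is made up for illustration.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among n_items values.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)),
    which assumes uniformly distributed, non-adversarial hash values.
    """
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, collision_probability(100_000, bits))
```

Under those assumptions a 32-bit value is more likely than not to collide at
this scale, while a full 128-bit digest makes an accidental collision
negligible.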
I also assume that the set of string1 is smaller than that of string2,
let's say 100 vs. 10000 different values.
string1 and string2 can be anywhere from the empty string to the entire
contents of files; the largest file I have is ~12 GB.
For local deduplication, you would store paths as a directory name and a
And then have a list of filenames, and per filename in which path it
For combining index values, use something like: ( i1 << N ) | i2.
(where N is the number of bits needed by i2)
Where did you find "( i1 << N ) | i2" for MD5?
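
For reference, the quoted expression packs two small integer indexes into one
integer key; it does not operate on digests. A minimal sketch, with pack and
unpack as hypothetical helper names and n_bits standing in for N:

```python
def pack(i1: int, i2: int, n_bits: int) -> int:
    """Combine two non-negative indexes as (i1 << N) | i2."""
    if i1 < 0 or not 0 <= i2 < (1 << n_bits):
        raise ValueError("i2 must fit in n_bits and both must be non-negative")
    return (i1 << n_bits) | i2


def unpack(key: int, n_bits: int) -> tuple:
    """Recover (i1, i2) from a key produced by pack()."""
    mask = (1 << n_bits) - 1
    return key >> n_bits, key & mask


if __name__ == "__main__":
    key = pack(3, 5, 4)
    print(key, unpack(key, 4))
```

The range check matters: if i2 needs more than N bits, its high bits overlap
i1 and the pair can no longer be recovered.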
I would not involve string concatenation: keep things separate once
separated. Use arrays.
I would prefer comparing two files by comparing two digests, rather than
comparing two arrays of digests.
Use (parts of) md5's of strings, if you need to compare to remote
I use all of the digest.
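
Computing one full digest per file does not require holding the file in
memory, which matters with files around 12 GB. A minimal sketch using
Python's hashlib; the function name file_md5 and the 1 MiB chunk size are
assumptions for illustration:

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the full MD5 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files then compare equal when their digest strings are equal, so only
one short value per file has to be stored or sent.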
So it would be best to first explain *more* about what you are trying to solve.
A single or multiple computers, connected or not?
Suppose one computer sends a concise email about what it has, such that
the other computer can reply with an even more concise email about what it
has, and what it needs. IOW: diff+patch.
I'd like the application(s) to work over SSH, similar to rsync.
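
The "what it has, and what it needs" exchange comes down to set differences
over digests. A minimal sketch, assuming each side has already computed one
digest per file; plan_sync and the sample data are hypothetical, and the
transport (SSH or otherwise) is not shown:

```python
def plan_sync(local: dict, remote_digests: set) -> tuple:
    """Decide what to send and what to request.

    local maps digest -> path for files on this side;
    remote_digests is the set of digests the other side reported.
    Returns (paths_to_send, digests_to_request), both sorted.
    """
    local_digests = set(local)
    paths_to_send = sorted(local[d] for d in local_digests - remote_digests)
    digests_to_request = sorted(remote_digests - local_digests)
    return paths_to_send, digests_to_request


if __name__ == "__main__":
    mine = {"aa11": "photos/a.jpg", "bb22": "photos/b.jpg"}
    theirs = {"bb22", "cc33"}
    print(plan_sync(mine, theirs))
```

Only digests cross the wire in the first round; file contents move only for
the entries that turn out to be missing on one side.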