FAQ
beginners:

I'm looking for a hash function and a related function or operator such
that:

H(string1 . string2) = f(H(string1), H(string2))

H(string1 . string2) = H(string1) op H(string2)

where:

H() is the hash function

string1 is a string

string2 is a string

. is the string concatenation operator

f() is a function

op is a binary operator

Any suggestions?

TIA,

David

## Search Discussions

•  at Sep 23, 2013 at 10:29 pm ⇧

On 23/09/2013 23:17, David Christensen wrote:
beginners:

I'm looking for a hash function and a related function or operator such
that:

H(string1 . string2) = f(H(string1), H(string2))

H(string1 . string2) = H(string1) op H(string2)
Hi David

My immediate thought is that the only hash function that can work like
this is the identity function (or any one-one mapping) because, by
extension, the hash of a string must be equal to f(the hash of each of
its characters).

Not that I can prove this at present, of course!

Could you explain the problem you're trying to solve?

Rob
•  at Sep 23, 2013 at 11:01 pm ⇧

On 09/23/13 15:29, Rob Dixon wrote:
My immediate thought is that the only hash function that can work like
this is the identity function (or any one-one mapping) because, by
extension, the hash of a string must be equal to f(the hash of each of
its characters).
Not that I can prove this at present, of course!
Could you explain the problem you're trying to solve?
Writing scripts that look for duplicate, similar, and/or missing files.

David
•  at Sep 24, 2013 at 1:17 am ⇧
Hi David,

I don't know the answer but... it sounds like NCBI's BLAST to me, which compares nucleotide or protein sequences. NCBI's FTP site provides local BLAST binaries, and bioperl offers some convenient tools to implement it.

Regards,

Jing
On 24 Sep 2013, at 07:01, David Christensen wrote:
On 09/23/13 15:29, Rob Dixon wrote:
My immediate thought is that the only hash function that can work like
this is the identity function (or any one-one mapping) because, by
extension, the hash of a string must be equal to f(the hash of each of
its characters).
Not that I can prove this at present, of course!
Could you explain the problem you're trying to solve?
Writing scripts that look for duplicate, similar, and/or missing files.

David

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/
•  at Sep 24, 2013 at 4:10 am ⇧

On 09/23/13 18:17, Jing Yu wrote:
I don't know the answer but... it sounds like NCBI's BLAST to me, which compares nucleotide or protein sequences. NCBI's FTP site provides local BLAST binaries, and bioperl offers some convenient tools to implement it.
That looks like server-side software, callable via a WWW UI and/or Perl
modules (?).

Where is the design documentation for BLAST -- especially the hashing
algorithms?

David
•  at Sep 23, 2013 at 11:39 pm ⇧

On 09/23/13 15:34, someone wrote:
Er "hash function" as in crypto hashing? a does:
H(string1 . string2) = f(H(string1), H(string2))
H(string1 . string2) = H(string1) op H(string2)
mean that
I'm looking for a hash function and a related function or operator such
that:
f(H(string1), H(string2)) = H(string1) op H(string2)
that is
sub f
{
my (\$string1, \$string2) = @_;
return \$string1 op \$string2;
}
Sorry, this sounds sort of homework-y so ...
Nope. I would have flunked out of a school that asks questions like
that. ;-)

I believe it's really math question, and I don't know the right math.
Implementation could be software or hardware.

As I understand it, the foundation of modern public-key cryptography is
the discrete modulus function, and what happens when it is applied to
prime numbers and to the products of prime numbers. I'm looking for
something like that -- e.g. an algebra for for hash functions when
applied to finite character streams.

David
•  at Sep 24, 2013 at 7:12 am ⇧

On 24/09/2013 00:17, David Christensen wrote:

I'm looking for a hash function and a related function or operator such
that:

H(string1 . string2) = f(H(string1), H(string2))
H(string1 . string2) = H(string1) op H(string2)

where:

H() is the hash function
string1 is a string
string2 is a string
. is the string concatenation operator
f() is a function
op is a binary operator
On 09/23/13 15:29, Rob Dixon wrote:
Could you explain the problem you're trying to solve?
Writing scripts that look for duplicate, similar, and/or
missing files.
I assume this is about paths and filenames. Have you considered an rsync
dry-run?

I also assume that you want to communicate as little as possible, so you
don't have supersets of all strings on all sides. (or it would become a
simple indexing problem)

I also assume that you are more interested in missing items, so
hash-value collisions are not a problem.

I also assume that the set of string1 is smaller than that of string2,
let's say 100 vs. 10000 different values.

For local deduplication, you would store paths as a directory name and a
parent-index:

#table=path
#columns=id,name,pid
1,"",0
2,"usr",1
3."local",2

And then have a list of filenames, and per filename in which path it exists.

#table=file
#columns=id,name

#table=detail
#columns=file_id,path_id,size,md5

For combining index values, use something like: ( i1 << N ) | i2.
(where N is the number of bits needed by i2)

I would not involve string concatenation: keep things separate once
separated. Use arrays.

Use (parts of) md5's of strings, if you need to compare to remote locations.

So best first explain *more* now about what you try to solve.
A single or multiple computers, connected or not?

Suppose 1 computer sends a concise email about what it has, such that
the other computer can reply with an even conciser email about what it
has, and what it needs. IOW: diff+patch.

--
Greetings, Ruud
•  at Sep 24, 2013 at 7:31 pm ⇧

On 09/24/13 00:12, Dr.Ruud wrote:
I assume this is about paths and filenames. Have you considered an rsync
dry-run?
I use "rsync -n ..." frequently.

I also assume that you want to communicate as little as possible, so you
don't have supersets of all strings on all sides. (or it would become a
simple indexing problem)
I also assume that you are more interested in missing items, so
hash-value collisions are not a problem.
My use-case is ~100k files. I'm looking for a hash function that will
have few, if any, collisions.

I also assume that the set of string1 is smaller than that of string2,
let's say 100 vs. 10000 different values.
string1 and string2 can be anywhere from the empty string to the entire
contents of files; the largest file I have is ~12 GB.

For local deduplication, you would store paths as a directory name and a
parent-index:
#table=path
#columns=id,name,pid
1,"",0
2,"usr",1
3."local",2
And then have a list of filenames, and per filename in which path it
exists.
#table=file
#columns=id,name
#table=detail
#columns=file_id,path_id,size,md5
For combining index values, use something like: ( i1 << N ) | i2.
(where N is the number of bits needed by i2)
Where did you find "( i1 << N ) | i2" for MD5?

I would not involve string concatenation: keep things separate once
separated. Use arrays.
I would prefer comparing two files by comparing two digests, rather than
comparing two arrays of digests.

Use (parts of) md5's of strings, if you need to compare to remote
locations.
I use all of the digest.

So best first explain *more* now about what you try to solve.
A single or multiple computers, connected or not?
Suppose 1 computer sends a concise email about what it has, such that
the other computer can reply with an even conciser email about what it
has, and what it needs. IOW: diff+patch.
I'd like the application(s) to work over SSH, similar to rsync.

David
•  at Sep 26, 2013 at 1:53 am ⇧
Hi David,

Another look at it, and I think I've pointed you to a wrong way. BLAST might not what you need. Sorry about this.

Jing
On 25 Sep 2013, at 03:31, David Christensen wrote:
On 09/24/13 00:12, Dr.Ruud wrote:
I assume this is about paths and filenames. Have you considered an rsync
dry-run?
I use "rsync -n ..." frequently.

I also assume that you want to communicate as little as possible, so you
don't have supersets of all strings on all sides. (or it would become a
simple indexing problem)
I also assume that you are more interested in missing items, so
hash-value collisions are not a problem.
My use-case is ~100k files. I'm looking for a hash function that will have few, if any, collisions.

I also assume that the set of string1 is smaller than that of string2,
let's say 100 vs. 10000 different values.
string1 and string2 can be anywhere from the empty string to the entire contents of files; the largest file I have is ~12 GB.

For local deduplication, you would store paths as a directory name and a
parent-index:
#table=path
#columns=id,name,pid
1,"",0
2,"usr",1
3."local",2
And then have a list of filenames, and per filename in which path it
exists.
#table=file
#columns=id,name
#table=detail
#columns=file_id,path_id,size,md5
For combining index values, use something like: ( i1 << N ) | i2.
(where N is the number of bits needed by i2)
Where did you find "( i1 << N ) | i2" for MD5?

I would not involve string concatenation: keep things separate once
separated. Use arrays.
I would prefer comparing two files by comparing two digests, rather than comparing two arrays of digests.

Use (parts of) md5's of strings, if you need to compare to remote
locations.
I use all of the digest.

So best first explain *more* now about what you try to solve.
A single or multiple computers, connected or not?
Suppose 1 computer sends a concise email about what it has, such that
the other computer can reply with an even conciser email about what it
has, and what it needs. IOW: diff+patch.
I'd like the application(s) to work over SSH, similar to rsync.

David

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/
•  at Sep 26, 2013 at 2:46 am ⇧

On 09/25/13 18:53, Jing Yu wrote:
Another look at it, and I think I've pointed you to a wrong way. BLAST might not what you need. Sorry about this.
No problem. The more I look at it, the less I believe there is such a
pair of functions.

David
•  at Sep 27, 2013 at 4:32 pm ⇧
Hi,
On 09/24/2013 12:17 AM, David Christensen wrote:
beginners:

H() is the hash function
as you want the same hash of a substring regardless of the position, I
think you cannot make the hash dependent on the position of each
character in the string. A simple solution could be to add the number
value of each character. This would fullfill

H(string1 . string2) = H(string1) + H(string2)

I am not sure if the modulo operator also fulfils

(a % b) % c = (a % c) % (b % c)

In that case you could also use the modulo on your hash function.
string1 is a string

string2 is a string

. is the string concatenation operator

f() is a function

op is a binary operator

Any suggestions?

TIA,

David

## Related Discussions

 view thread | post
Discussion Overview
 group beginners categories perl posted Sep 23, '13 at 10:17p active Sep 27, '13 at 4:32p posts 11 users 5 website perl.org

### 5 users in discussion

Content

People

Support

Translate

site design / logo © 2022 Grokbase