FAQ
beginners:

I'm looking for a hash function and a related function or operator such
that:

      H(string1 . string2) = f(H(string1), H(string2))

      H(string1 . string2) = H(string1) op H(string2)

      where:

  H() is the hash function

  string1 is a string

  string2 is a string

  . is the string concatenation operator

  f() is a function

  op is a binary operator


Any suggestions?


TIA,

David

Search Discussions

  • Rob Dixon at Sep 23, 2013 at 10:29 pm

    On 23/09/2013 23:17, David Christensen wrote:
    beginners:

    I'm looking for a hash function and a related function or operator such
    that:

    H(string1 . string2) = f(H(string1), H(string2))

    H(string1 . string2) = H(string1) op H(string2)
    Hi David

    My immediate thought is that the only hash function that can work like
    this is the identity function (or any one-one mapping) because, by
    extension, the hash of a string must be equal to f(the hash of each of
    its characters).

    Not that I can prove this at present, of course!

    Could you explain the problem you're trying to solve?

    Rob
  • David Christensen at Sep 23, 2013 at 11:01 pm

    On 09/23/13 15:29, Rob Dixon wrote:
    My immediate thought is that the only hash function that can work like
    this is the identity function (or any one-one mapping) because, by
    extension, the hash of a string must be equal to f(the hash of each of
    its characters).
    Not that I can prove this at present, of course!
    Could you explain the problem you're trying to solve?
    Writing scripts that look for duplicate, similar, and/or missing files.


    David
  • Jing Yu at Sep 24, 2013 at 1:17 am
    Hi David,

    I don't know the answer but... it sounds like NCBI's BLAST to me, which compares nucleotide or protein sequences. NCBI's FTP site provides local BLAST binaries, and bioperl offers some convenient tools to implement it.

    Regards,

    Jing
    On 24 Sep 2013, at 07:01, David Christensen wrote:
    On 09/23/13 15:29, Rob Dixon wrote:
    My immediate thought is that the only hash function that can work like
    this is the identity function (or any one-one mapping) because, by
    extension, the hash of a string must be equal to f(the hash of each of
    its characters).
    Not that I can prove this at present, of course!
    Could you explain the problem you're trying to solve?
    Writing scripts that look for duplicate, similar, and/or missing files.


    David

    --
    To unsubscribe, e-mail: beginners-unsubscribe@perl.org
    For additional commands, e-mail: beginners-help@perl.org
    http://learn.perl.org/
  • David Christensen at Sep 24, 2013 at 4:10 am

    On 09/23/13 18:17, Jing Yu wrote:
    I don't know the answer but... it sounds like NCBI's BLAST to me, which compares nucleotide or protein sequences. NCBI's FTP site provides local BLAST binaries, and bioperl offers some convenient tools to implement it.
    That looks like server-side software, callable via a WWW UI and/or Perl
    modules (?).


    Where is the design documentation for BLAST -- especially the hashing
    algorithms?


    David
  • David Christensen at Sep 23, 2013 at 11:39 pm

    On 09/23/13 15:34, someone wrote:
    Er "hash function" as in crypto hashing? a does:
    H(string1 . string2) = f(H(string1), H(string2))
    H(string1 . string2) = H(string1) op H(string2)
    mean that
    I'm looking for a hash function and a related function or operator such
    that:
    f(H(string1), H(string2)) = H(string1) op H(string2)
    that is
    sub f
    {
    my ($string1, $string2) = @_;
    return $string1 op $string2;
    }
    Sorry, this sounds sort of homework-y so ...
    Nope. I would have flunked out of a school that asks questions like
    that. ;-)


    I believe it's really math question, and I don't know the right math.
    Implementation could be software or hardware.


    As I understand it, the foundation of modern public-key cryptography is
    the discrete modulus function, and what happens when it is applied to
    prime numbers and to the products of prime numbers. I'm looking for
    something like that -- e.g. an algebra for for hash functions when
    applied to finite character streams.


    David
  • Dr.Ruud at Sep 24, 2013 at 7:12 am

    On 24/09/2013 00:17, David Christensen wrote:

    I'm looking for a hash function and a related function or operator such
    that:

    H(string1 . string2) = f(H(string1), H(string2))
    H(string1 . string2) = H(string1) op H(string2)

    where:

    H() is the hash function
    string1 is a string
    string2 is a string
    . is the string concatenation operator
    f() is a function
    op is a binary operator
    On 09/23/13 15:29, Rob Dixon wrote:
    Could you explain the problem you're trying to solve?
    Writing scripts that look for duplicate, similar, and/or
    missing files.
    I assume this is about paths and filenames. Have you considered an rsync
    dry-run?

    I also assume that you want to communicate as little as possible, so you
    don't have supersets of all strings on all sides. (or it would become a
    simple indexing problem)

    I also assume that you are more interested in missing items, so
    hash-value collisions are not a problem.

    I also assume that the set of string1 is smaller than that of string2,
    let's say 100 vs. 10000 different values.


    For local deduplication, you would store paths as a directory name and a
    parent-index:

    #table=path
    #columns=id,name,pid
    1,"",0
    2,"usr",1
    3."local",2

    And then have a list of filenames, and per filename in which path it exists.

    #table=file
    #columns=id,name

    #table=detail
    #columns=file_id,path_id,size,md5


    For combining index values, use something like: ( i1 << N ) | i2.
    (where N is the number of bits needed by i2)

    I would not involve string concatenation: keep things separate once
    separated. Use arrays.

    Use (parts of) md5's of strings, if you need to compare to remote locations.

    So best first explain *more* now about what you try to solve.
    A single or multiple computers, connected or not?

    Suppose 1 computer sends a concise email about what it has, such that
    the other computer can reply with an even conciser email about what it
    has, and what it needs. IOW: diff+patch.

    --
    Greetings, Ruud
  • David Christensen at Sep 24, 2013 at 7:31 pm

    On 09/24/13 00:12, Dr.Ruud wrote:
    I assume this is about paths and filenames. Have you considered an rsync
    dry-run?
    I use "rsync -n ..." frequently.

    I also assume that you want to communicate as little as possible, so you
    don't have supersets of all strings on all sides. (or it would become a
    simple indexing problem)
    I also assume that you are more interested in missing items, so
    hash-value collisions are not a problem.
    My use-case is ~100k files. I'm looking for a hash function that will
    have few, if any, collisions.

    I also assume that the set of string1 is smaller than that of string2,
    let's say 100 vs. 10000 different values.
    string1 and string2 can be anywhere from the empty string to the entire
    contents of files; the largest file I have is ~12 GB.

    For local deduplication, you would store paths as a directory name and a
    parent-index:
    #table=path
    #columns=id,name,pid
    1,"",0
    2,"usr",1
    3."local",2
    And then have a list of filenames, and per filename in which path it
    exists.
    #table=file
    #columns=id,name
    #table=detail
    #columns=file_id,path_id,size,md5
    For combining index values, use something like: ( i1 << N ) | i2.
    (where N is the number of bits needed by i2)
    Where did you find "( i1 << N ) | i2" for MD5?

    I would not involve string concatenation: keep things separate once
    separated. Use arrays.
    I would prefer comparing two files by comparing two digests, rather than
    comparing two arrays of digests.

    Use (parts of) md5's of strings, if you need to compare to remote
    locations.
    I use all of the digest.

    So best first explain *more* now about what you try to solve.
    A single or multiple computers, connected or not?
    Suppose 1 computer sends a concise email about what it has, such that
    the other computer can reply with an even conciser email about what it
    has, and what it needs. IOW: diff+patch.
    I'd like the application(s) to work over SSH, similar to rsync.


    David
  • Jing Yu at Sep 26, 2013 at 1:53 am
    Hi David,

    Another look at it, and I think I've pointed you to a wrong way. BLAST might not what you need. Sorry about this.

    Jing
    On 25 Sep 2013, at 03:31, David Christensen wrote:
    On 09/24/13 00:12, Dr.Ruud wrote:
    I assume this is about paths and filenames. Have you considered an rsync
    dry-run?
    I use "rsync -n ..." frequently.

    I also assume that you want to communicate as little as possible, so you
    don't have supersets of all strings on all sides. (or it would become a
    simple indexing problem)
    I also assume that you are more interested in missing items, so
    hash-value collisions are not a problem.
    My use-case is ~100k files. I'm looking for a hash function that will have few, if any, collisions.

    I also assume that the set of string1 is smaller than that of string2,
    let's say 100 vs. 10000 different values.
    string1 and string2 can be anywhere from the empty string to the entire contents of files; the largest file I have is ~12 GB.

    For local deduplication, you would store paths as a directory name and a
    parent-index:
    #table=path
    #columns=id,name,pid
    1,"",0
    2,"usr",1
    3."local",2
    And then have a list of filenames, and per filename in which path it
    exists.
    #table=file
    #columns=id,name
    #table=detail
    #columns=file_id,path_id,size,md5
    For combining index values, use something like: ( i1 << N ) | i2.
    (where N is the number of bits needed by i2)
    Where did you find "( i1 << N ) | i2" for MD5?

    I would not involve string concatenation: keep things separate once
    separated. Use arrays.
    I would prefer comparing two files by comparing two digests, rather than comparing two arrays of digests.

    Use (parts of) md5's of strings, if you need to compare to remote
    locations.
    I use all of the digest.

    So best first explain *more* now about what you try to solve.
    A single or multiple computers, connected or not?
    Suppose 1 computer sends a concise email about what it has, such that
    the other computer can reply with an even conciser email about what it
    has, and what it needs. IOW: diff+patch.
    I'd like the application(s) to work over SSH, similar to rsync.


    David

    --
    To unsubscribe, e-mail: beginners-unsubscribe@perl.org
    For additional commands, e-mail: beginners-help@perl.org
    http://learn.perl.org/
  • David Christensen at Sep 26, 2013 at 2:46 am

    On 09/25/13 18:53, Jing Yu wrote:
    Another look at it, and I think I've pointed you to a wrong way. BLAST might not what you need. Sorry about this.
    No problem. The more I look at it, the less I believe there is such a
    pair of functions.


    David
  • Gerhard Jungwirth at Sep 27, 2013 at 4:32 pm
    Hi,
    On 09/24/2013 12:17 AM, David Christensen wrote:
    beginners:

    H() is the hash function
    as you want the same hash of a substring regardless of the position, I
    think you cannot make the hash dependent on the position of each
    character in the string. A simple solution could be to add the number
    value of each character. This would fullfill

    H(string1 . string2) = H(string1) + H(string2)

    I am not sure if the modulo operator also fulfils

    (a % b) % c = (a % c) % (b % c)

    In that case you could also use the modulo on your hash function.
    string1 is a string

    string2 is a string

    . is the string concatenation operator

    f() is a function

    op is a binary operator


    Any suggestions?


    TIA,

    David

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupbeginners @
categoriesperl
postedSep 23, '13 at 10:17p
activeSep 27, '13 at 4:32p
posts11
users5
websiteperl.org

People

Translate

site design / logo © 2022 Grokbase