Hi,

I am trying to write a small tool to do "smart downloads" from a "kind-of"
REST web interface.
(The project is on GitHub,
https://github.com/josvazg/slicesync
but it might not have the latest code, and I prefer to explain it here
because this question is more about an algorithm design decision than about
code details.)

The idea is similar to rsync, but using a web interface and SHA1 for the
hashes: hash slices of the file and download only the differences.

My current algorithm is as follows:

Infrastructure:

- The server can dump and hash file slices.
That is, an HTTP client can ask for the file starting at an offset position
and with a size (0 means until the end). It can also ask the server to
produce a bulkhash dump of the SHA1 of every slice of size N from the
beginning to the end of the file. The bulkhash dump also includes the total
file size at the beginning and the total file hash on the last line of the
dump (all in plain text, one piece of data per line; see the example after
this list).

- The client can do the same for local files, and it can also ask the
server for the same information about remote files.
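
For example, a bulkhash dump could look like this (the size and the hashes
below are just placeholders):

4294967296
<sha1 of slice 1, bytes 0 to N>
<sha1 of slice 2, bytes N to 2N>
...
<sha1 of the last slice>
Final: <sha1 of the whole file>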

"Smart" download steps:
1. The client starts a bulkhash dump against the remote (usually big) file
to be downloaded and another against a similar local file (probably an
earlier version of the same thing).
The idea is to download only the parts that are different from (not
present in) the local similar file, and to copy the rest locally.

2. With both input streams, the client compares each hash with its
counterpart and builds a Diffs data structure that basically enumerates
which parts of the remote file are different and need to be downloaded.

3. Building the diffs with big files (512MB, 4GB) takes some time, so as
soon as each slice is compared and I know whether it is different or not, I
can call a function to process that slice, either downloading it from the
server or copying it locally.

4. The final destination output stream has a SHA1 hasher so that I can
check, when done, whether the produced file has the same SHA1 hash as the
remote file and thus the download was successful (a small sketch of this is
shown right after these steps).
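
To illustrate step 4, the destination stream is wired roughly like this
(a simplified standalone sketch, not the actual slicesync code):

package main

import (
    "crypto/sha1"
    "fmt"
    "io"
    "os"
    "strings"
)

func main() {
    out, err := os.Create("destination.tmp")
    if err != nil {
        panic(err)
    }
    defer out.Close()

    h := sha1.New()
    // every byte written goes both to the destination file and to the hasher
    w := io.MultiWriter(out, h)

    // stand-in for the downloaded and locally copied slices
    if _, err := io.Copy(w, strings.NewReader("slice data goes here")); err != nil {
        panic(err)
    }

    // compared, when done, against the remote file's total hash
    fmt.Printf("local sha1: %x\n", h.Sum(nil))
}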

At this point I am NOT using any goroutines or channels; everything
happens on the same thread. I tested various versions of the same idea with
quite a few goroutines going on, and the results were quite slow and
disappointing (not a problem of the goroutines, but of my design and
infrastructure)...

"SO What is the question??..."

I ran a test with a pair of 512MB files and the results were quite good
(=fast). It actually took less time to smart download than to plain
download, even testing everything on the same machine and having to
download 75% of the data.

I was happy... till I tested my 4GB pair of files. The results were even
worse than in my previous disappointing tests. The smart download took
nearly 5 times longer than the plain download of the file, even though it
only had to download 32 different MB out of the 4GB!

I believe this is because the server & local bulkhash producers are a lot
faster than the client can consume their output while it is writing the 4GB
file. I suppose there are lots of memory allocations happening. I guess the
4GB test is pushing some threshold that the 512MB test is not reaching.

MY FIRST question is:
Do you think I can fix this by decoupling the diff-building task from the
data-copying task using goroutines?

I am thinking of using an intermediate goroutine between both tasks: the
diff task sends an update after each slice, and the intermediate goroutine
starts a new copying task as soon as the previous one has finished, using
the latest diff state update.
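
Roughly something like this sketch (made-up names, just to show the
wiring, not real slicesync code):

package main

import "fmt"

// sliceUpdate is one verdict from the diff task: a slice that is either
// different (download it) or equal (copy it from the local file).
type sliceUpdate struct {
    offset, size int64
    different    bool
}

func main() {
    // buffered, so the diff task can keep hashing while slices are copied
    updates := make(chan sliceUpdate, 64)
    done := make(chan bool)

    // copier goroutine: starts the next copy/download as soon as a verdict arrives
    go func() {
        for u := range updates {
            if u.different {
                fmt.Printf("download %d bytes at offset %d\n", u.size, u.offset)
            } else {
                fmt.Printf("copy local %d bytes at offset %d\n", u.size, u.offset)
            }
        }
        done <- true
    }()

    // diff task (stubbed): in the real tool each verdict comes from comparing hashes
    slice := int64(1 << 20)
    for i := int64(0); i < 4; i++ {
        updates <- sliceUpdate{offset: i * slice, size: slice, different: i%2 == 0}
    }
    close(updates)
    <-done
}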

MY SECOND question is:
Do you think this is the best idea?

Sorry for the big post...
(and I hope I explained myself properly)

Thanks in advance,

Jose



  • Péter Szilágyi at Mar 14, 2013 at 2:20 pm
    Hello,

    IMHO, you provided very little information about your benchmarks,
    which themselves are very very sensitive to the test environment.

    Three separate scenarios come to mind:

    - Copying locally from/to same hard drive
    - Copying locally between different hard drives
    - Copying over the network

    These are very very different cases, with different problems:

    - Local same-hdd copy has congestion problems, each thread racing for
    the data.
    - Local dual-hdd copy is not really worth the effort, simple copy will
    be the fastest (i.e. you need to read the data anyway, read ~ write speed).
    - Network copy has the most to gain, because write speed is considerably
    slower than read.

    I'd suggest that you separate and test the above three cases. They cannot
    be compared, thus any "help request" should specifically target one of them.

Another issue that can still crop up during testing is file caches: 2 *
512MB will easily fit into memory, thus your benchmark will effectively be
void. You should pick a file size which is either significantly larger than
your available memory, or ensure that the OS will not cache.

    Cheers,
    Peter

  • Josvazg at Mar 14, 2013 at 5:12 pm
OK, you are right, let's stick with the one really useful use case: copying
over the network...

I deployed the client side on a Windows box, connected to the Linux server
side over a 100Mbps LAN, which is quite fast compared to the final, really
bad use case: access through a random "over the seas" DSL line.

The 4GB test took 1 hour 34 minutes, while the wget download from the same
Go server side takes just under 7 minutes (for the record, the local test
was around 25 minutes, "much better" than this but still very poor).

It is pretty clear to me that there are a lot of copies, allocations and
swapping going on, because the hashes and the bulk data copies are consumed
at very different speeds, so a lot of data gets buffered for too long and
some data gets buffered and then dumped (closing the socket prematurely).

    And I certainly agree with this:
    "Other issues that can still crop up during testing is file caches: 2 *
    512MB will easily fit into memory, thus your benchmark will be effectively
    void. You should pick a file size which is either significantly larger than
    your available memory, or ensure that the OS will not cache."

    Many thanks!

    I think I will NOT go for a goroutine fix just yet. First I will test a few
    tweaks and see what happens.

    Jose

  • Tamás Gulácsi at Mar 14, 2013 at 8:58 pm
    Why not implement zsync?

  • Josvazg at Mar 15, 2013 at 9:00 am
Because I didn't know about it... ;-)

I may well end up using it, but I still want to debug my project
and see if I am able to make it work with acceptable performance.

zsync seems to be a Unix-only compilable tool, which might be a problem
unless I can make it compile on Windows with Cygwin or something similar.
My tool's goal was to be multiplatform (Windows+Linux+Mac at least),
one of the benefits of using the Go standard library. On the other hand,
handling compressed files is a feature I was not planning to do (for now)
but really wanted to have, so zsync is quite a promising option.

    Jose

  • Tamás Gulácsi at Mar 15, 2013 at 12:13 pm
I meant IMPLEMENT zsync, or something alike - the idea is to checksum the blocks once, distribute the checksums so the client can search for missing/changed blocks by itself, without any server work, and then ask only for the required blocks.

  • Josvazg at Mar 15, 2013 at 2:03 pm
When you read the paper, the zsync idea seems much more complex, at least
more than what I was planning to do.

Something alike might be possible, though, if it just implements a
simplified version of the zsync idea.

I am toying with the idea of saving a hash dump (kind of like the zsync
file), so that I save its generation time and also speed up the diff
processing.

Both client and server can pre-generate the hash dump for each file they
manage with a certain default slice size (like 1MiB), as in the sketch
below.
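
Something as simple as this could save the dump next to the file it
describes (just a sketch; saveDump and the ".hashdump" suffix are made-up
names, not part of slicesync):

package main

import (
    "io"
    "os"
    "strings"
)

// saveDump streams a bulkhash dump into a sidecar file next to the data file.
func saveDump(dump io.Reader, datafile string) error {
    f, err := os.Create(datafile + ".hashdump")
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = io.Copy(f, dump)
    return err
}

func main() {
    // stand-in for the stream produced by a real bulkhash dump
    dump := strings.NewReader("4096\n<slice hashes would go here>\nFinal: <file hash>\n")
    if err := saveDump(dump, "bigfile.dat"); err != nil {
        panic(err)
    }
}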

    Jose

  • Kyle Lemons at Mar 16, 2013 at 7:59 am
    Reading a 4GB data file off the harddrive will take a long time. If both
    your client and your server read the data file off their respective drives
    and hash them, you have incurred a significant amount of extra time. This
    can be fine, of course, if what you care about are the bytes transferred
    and not the amount of time the transfer takes. The pathological case here,
    if I understand your algorithm correctly, would be adding 1 byte to the
    beginning of a 4GB file, which would cause every hash to change, so the
    download time becomes t(localhash) + t(remotehash) +
    t(transfer(size/chunksize*hashsize)) + t(diff) + t(transfer(size)) which is
    almost certainly > t(transfer(size)).

  • Josvazg at Mar 16, 2013 at 9:06 am
    You are right, those are the theoretical times.

The thing is that in my tests I had unexpected results. For instance,
t(localhash) or t(remotehash) on a 4GB file were about 2 min, so doing one
after the other should be 4-5 minutes. But I tested a diff operation, that
is, checking one hash dump against another (while calculating them both at
the same time), and it took 9 minutes (nearly twice as long as expected).
Maybe there was something wrong with my code, but now I plan to test how
long it takes to calculate the diffs with pre-generated hash files.

Downloading a pre-generated text-based hash file for 4GB of data in 1MiB
chunks is "instantaneous" (compared to the time it takes to generate it,
especially if you just use SHA1 instead of a more efficient hash
combination).

[One thing I don't quite understand is why rsync uses both a weak/cheap
hash and a strong/expensive one. For security reasons you end up doing
both, which is more expensive than doing just the strong one. How does it
save time or data?]

    Thank you guys for all the feedback!

    Jose

  • Aliaksandr Valialkin at Mar 16, 2013 at 12:09 pm

Reading two files located on the same physical device sequentially is much
faster than reading those files in parallel (provided the files aren't
cached by the OS in RAM). Try reducing the chunk size for large files and
you'll see an even bigger difference between sequential and parallel read
times. Blame slow disk seeks (aka IOPS -
http://en.wikipedia.org/wiki/IOPS ). Let's do some math: with 1MB chunks
your hard drive needs to perform 2*(4GB/1MB) = 8K disk seeks during the
parallel diff computation. This means an HDD with a 10ms seek time will
waste 8K*10ms = 80 seconds just on seeking to the file chunks (without
reading them). So, avoid disk seeks if you want high speed.

  • Tarmigan at Mar 16, 2013 at 4:43 pm

    Here's my understanding of rsync:

    Rsync uses a "rolling" hash to determine the byte boundaries for the
    blocks of the 2nd hash. The rolling hash is special because you can
    efficiency (and incrementally) remove bytes from the beginning of the
    block you are hashing in addition to being able to adding bytes at the
    end. The rolling hash uses a predetermined, fixed size window to
    "roll" along the whole file and hash every possible block of the fixed
    block size. When a predefined hash output is found, the current byte
    position is marked as a boundary for the 2nd hash. A second hash is
    used on the new (variable sized) blocks determined from the marked
    boundaries. The rolling hash is (intentionally) not immune to
    collisions (intentional or unintentional) which is why the 2nd hash is
    used.

    This strategy allows you to change one byte at the beginning of a
    large file and not have to transfer the whole file. The first block
    will change but the rolling hash will eventually synchronize the two
    files and notice all the common blocks.
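
A tiny Go sketch of just the rolling-update idea (this is not rsync's
actual checksum, only an illustration of how the window slides in O(1)):

package main

import "fmt"

// weakSum computes a simple rsync-style weak checksum over a block:
// a is the sum of the bytes, b is a position-weighted sum.
func weakSum(block []byte) (a, b uint32) {
    for i, x := range block {
        a += uint32(x)
        b += uint32(len(block)-i) * uint32(x)
    }
    return
}

// roll slides the window one byte forward in O(1): drop out, append in.
func roll(a, b uint32, window int, out, in byte) (uint32, uint32) {
    a = a - uint32(out) + uint32(in)
    b = b - uint32(window)*uint32(out) + a
    return a, b
}

func main() {
    data := []byte("the quick brown fox jumps over the lazy dog")
    const window = 8
    a, b := weakSum(data[:window])
    for i := window; i < len(data); i++ {
        a, b = roll(a, b, window, data[i-window], data[i])
        // compare against a full recomputation of the same window
        fa, fb := weakSum(data[i-window+1 : i+1])
        if a != fa || b != fb {
            fmt.Println("mismatch at", i)
            return
        }
    }
    fmt.Println("rolled checksums match full recomputation at every position")
}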

    -Tarmigan

  • Kyle Lemons at Mar 16, 2013 at 7:35 pm

    On Sat, Mar 16, 2013 at 2:06 AM, josvazg wrote:

You are right, those are the theoretical times.

The thing is that in my tests I had unexpected results. For instance,
t(localhash) or t(remotehash) on a 4GB file were about 2 min, so doing one
after the other should be 4-5 minutes. But I tested a diff operation, that
is, checking one hash dump against another (while calculating them both at
the same time), and it took 9 minutes (nearly twice as long as expected).
Maybe there was something wrong with my code, but now I plan to test how
long it takes to calculate the diffs with pre-generated hash files.
    I think we'd have to hear more about the methodology behind the parallel
    test to speculate about why it's taking twice as long. When you say "at
    the same time" do you mean on different machines at the same time? If
    they're on the same machine and the same harddrive, it seems likely that
    reading two 4GB files at the same time off the harddrive would take twice
    as long. If you're reading a chunk and then communicating about its hash
    over the network, then you're essentially forcing the slowest read at any
    given time to stall both processes, which also seems like it could balloon
    the time it takes, though whether it's 10% slower or 100% would depend on
    the workloads and harddrives on both systems.

Downloading a pre-generated text-based hash file for 4GB of data in 1MiB
chunks is "instantaneous" (compared to the time it takes to generate it,
especially if you just use SHA1 instead of a more efficient hash
combination).

[One thing I don't quite understand is why rsync uses both a weak/cheap
hash and a strong/expensive one. For security reasons you end up doing
both, which is more expensive than doing just the strong one. How does it
save time or data?]
Does rsync send partial file diffs? I think it only ever operates at file
granularity; they're somewhat different beasts. It checks things like size
and mtime before it checks the hash by default. In these cases a more
expensive hash can prevent transferring files that are actually identical
but were more recently touched. If you tell it to ignore modification
times, though, you can speed up the equality checks by using a less
expensive hash, because it will be running on many more files.

  • Josvazg at Mar 17, 2013 at 9:59 am
    I will try to reply to Aliaksandr, Tarmigan and Kyle...

The funny thing about the tests I was talking about is that one stream was
a remote hash calculation on a remote disk and the other was a local hash
calculation. So they were sequential reads on independent/different
systems, each on its own hard drive. It seemed to me that the most
expensive operation was the hash calculation rather than the IO, because I
think, from memory, that the stream copy without the hash calculation took
half the time... (but don't trust my memory, I should test that to
confirm).

Code mistakes aside, the server reads the file to be downloaded
sequentially, calculates each block hash and sends it through the network
connection. The client reads that stream along with another one it opened
itself, doing the sequential read and hash calculation on the local file.
With that it can figure out the differences.

The tested times (2 min vs 9 minutes) seem to be quite constant, so if it
depends on system load, then that load is strangely constant.

On rsync now...
If I understood it correctly, rsync allows you to detect data shifts. The
first question I have is: "in both directions, or just advancing shifts?"
With the help of that rolling checksum, when I find a mismatch, I start
looking, byte by byte, for the current and probably any of the next
checksums.

The next questions are: for how long? And how many blocks past the
mismatched one do I keep looking?

    Thanks for all the replies and the info!

    Jose

  • Aliaksandr Valialkin at Mar 17, 2013 at 12:11 pm



From the description it looks like the fastest algorithm possible. Probably
your code uses too-small write buffers on the network connections where the
hashes are pushed, and/or it requires an ack for each hash before sending
the next hash over the stream?
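
For example, something like this (a generic sketch, not your code) batches
the small hash lines into bigger writes:

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

// dumpHashes writes many short hash lines through a 64KiB buffer, so each
// line does not become its own tiny write on the underlying connection.
func dumpHashes(conn io.Writer, hashes []string) error {
    bw := bufio.NewWriterSize(conn, 64*1024)
    for _, h := range hashes {
        if _, err := fmt.Fprintln(bw, h); err != nil {
            return err
        }
    }
    return bw.Flush() // push out whatever is still buffered
}

func main() {
    // os.Stdout stands in for the network connection here
    if err := dumpHashes(os.Stdout, []string{"hash-of-slice-1", "hash-of-slice-2"}); err != nil {
        panic(err)
    }
}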

  • Josvazg at Mar 18, 2013 at 9:34 am
    Yeah, I have probably made a mistake somewhere. Let's look at the code:
    (https://github.com/josvazg/slicesync)

    These are the pair of functions that produce the hash dump locally or
    remotely:

// BulkHash calculates the file hash and all hashes of size slice and writes them to w
//
// Output is as follows:
// first text line is the file size
// then there are size / slice lines each with a slice hash for consecutive slices
// finally there is the line "Final: "+total file hash
//
// Post initialization errors are dumped in-line starting as a "Error: "... line
// Nothing more is sent after an error occurs and is dumped to w
func (hnd *LocalHashNDump) BulkHash(filename string, slice int64) (rc io.ReadCloser, err error) {
    file, err := os.Open(calcpath(hnd.Dir, filename)) // For read access
    if err != nil {
        return nil, err
    }
    if slice <= 0 { // protection against infinite loop by bad arguments
        slice = MiB
    }
    fi, err := file.Stat()
    if err != nil {
        return nil, err
    }
    r, w := io.Pipe()
    go bulkHashDump(w, file, slice, fi.Size())
    return r, nil
}

// bulkHashDump produces BulkHash output into the piped writer
func bulkHashDump(w io.WriteCloser, file io.ReadCloser, slice, size int64) {
    defer file.Close()
    defer w.Close()
    fmt.Fprintf(w, "%v\n", size)
    if size > 0 {
        h := newHasher()
        sliceHash := newSliceHasher()
        hashSink := io.MultiWriter(h, sliceHash)
        readed := int64(0)
        var err error
        for pos := int64(0); pos < size; pos += readed {
            toread := slice
            if toread > (size - pos) {
                toread = size - pos
            }
            readed, err = io.CopyN(hashSink, file, toread)
            if err != nil {
                fmt.Fprintf(w, "Error:%s\n", err)
                return
            }
            fmt.Fprintf(w, "%x\n", sliceHash.Sum(nil))
            sliceHash.Reset()
        }
        fmt.Fprintf(w, "Final: %x\n", h.Sum(nil))
    }
}

I use io.CopyN to produce both the block-by-block hashes and the whole-file hash for each slice. The tricky part, maybe, is that I use a goroutine and io.Pipe to push the produced hash dump from a writer to the reader returned on the other side.


And these are the two functions that calculate the differences from the
remote and local hash dump streams:

func calcDiffs(server, filename, alike string, slice int64, fn mixerfn) (*Diffs, error) {
    // local & remote streams opening
    localHnd := &LocalHashNDump{"."}
    lc, err := localHnd.BulkHash(alike, slice)
    if err != nil {
        return nil, err
    }
    defer lc.Close()
    local := bufio.NewReader(lc)
    remoteHnd := &RemoteHashNDump{server}
    rm, err := remoteHnd.BulkHash(filename, slice)
    if err != nil {
        return nil, err
    }
    defer rm.Close()
    remote := bufio.NewReader(rm)
    // local & remote sizes
    lsize, err := readInt64(local)
    if err != nil {
        return nil, err
    }
    rsize, err := readInt64(remote)
    if err != nil {
        return nil, err
    }
    // diff building loop
    diffs := NewDiffs(server, filename, alike, slice, AUTOSIZE)
    if err = diffLoop(diffs, local, remote, lsize, rsize, fn); err != nil {
        return nil, err
    }
    // total hashes
    diffs.AlikeHash, err = readAttribute(local, "Final")
    if err != nil {
        return nil, err
    }
    diffs.Hash, err = readAttribute(remote, "Final")
    if err != nil {
        return nil, err
    }
    return diffs, nil
}

// diffLoop builds the diffs from the hash streams
func diffLoop(diffs *Diffs, local, remote *bufio.Reader, lsize, rsize int64, fn mixerfn) error {
    indiff := false
    end := min(lsize, rsize)
    segment := diffs.Slice
    for pos := int64(0); pos < end; pos += segment {
        if pos+segment > end {
            segment = end - pos
        }
        localHash, err := readString(local)
        if err != nil {
            return err
        }
        remoteHash, err := readString(remote)
        if err != nil {
            return err
        }
        if indiff {
            if localHash == remoteHash {
                indiff = false
            } else {
                diffs.Diffs[len(diffs.Diffs)-1].Size += segment
                diffs.Differences += segment
            }
        } else if localHash != remoteHash {
            diffs.Diffs = append(diffs.Diffs, Diff{pos, segment})
            diffs.Differences += segment
            indiff = true
        }
        if fn != nil {
            fn(pos+segment, indiff)
        }
    }
    if lsize < rsize {
        remaining := rsize - lsize
        diffs.Diffs = append(diffs.Diffs, Diff{lsize, remaining})
        diffs.Differences += remaining
    }
    return nil
}

Do you see any mistakes, or anything that could justify the extra time the diffs take?


    Jose


    El domingo, 17 de marzo de 2013 13:11:04 UTC+1, Aliaksandr Valialkin
    escribió:
    The funny thing about the tests I was talking about is that one stream
    was a remote hash calculation on a remote disk and the other was local hash
    calculation. So they were sequential reads on independent / different
    systems, each on its own harddrive. It seemed to me that the most expensive
    operation was the hash calculus more than the IO, cause I thought, from
    memory, than the stream copy without hash calculus took half the time...
    (but don't trust my memory I should test that to confirm)

    Code mistakes aside, the server reads the file to be downloaded
    sequentially and calculates each block hash and send its through the
    network connection. The client is reading that stream and another it opened
    itself doing the sequential read and hash calculus on the local file. With
    that it can figure out the differences.
    From the descirption it looks like the fastest algorithm possible.
    Probably your code uses too small write buffers on network connections
    where hashes are pushed and/or it requires an ack for each hash before
    sending the next hash over the stream?

    Tested times (2mins vs 9minutes) seem to be quite constant, so if it
    depends on system load, then that load is strangely constant.

    On rsync now...
    If I understood it correctly, rsync allows you to detect data shitfs.
    First question I have is "in both directions or just advancing shifts?"
    With the help of that rolling checksum, when I find a mismatch, I start
    looking, byte by byte, if I find the current and probably any of the next
    checksums.

    Next questions are for how long? and how many blocks apart from the
    mismatched one I keep on looking for?

    Thanks for all the replies and the info!

    Jose

    El sábado, 16 de marzo de 2013 20:35:25 UTC+1, Kyle Lemons escribió:
    On Sat, Mar 16, 2013 at 2:06 AM, josvazg wrote:

    You are right, those are the theoretical times.

    The thing is that in my tests I had unexpected results. For instance,
    t(localhash) or t(remotehash) on a 4GB file where about 2 min, doing one
    after the other should be 4-5 minutes. But I tested a diff operation, that
    is checking a hash against another (while calculating them both at the same
    time) and it took 9 minutes (nearly twice as expected). Maybe there
    was something wrong with my code, but now I plan to test how long it takes
    to calculate the diffs with pregenerated hash files.
    I think we'd have to hear more about the methodology behind the parallel
    test to speculate about why it's taking twice as long. When you say "at
    the same time" do you mean on different machines at the same time? If
    they're on the same machine and the same harddrive, it seems likely that
    reading two 4GB files at the same time off the harddrive would take twice
    as long. If you're reading a chunk and then communicating about its hash
    over the network, then you're essentially forcing the slowest read at any
    given time to stall both processes, which also seems like it could balloon
    the time it takes, though whether it's 10% slower or 100% would depend on
    the workloads and harddrives on both systems.

    Downloading a pre-generated text-based hash file of 4GB of data in 1MiB
    chunks is "instantaneous" (compared to the time it takes to generate it,
    especially if you just use SHA1 instead of a more efficient hash combination).

    [One thing I don't quite understand is why rsync uses both a
    weak/cheap hash and a strong/expensive one. For security reasons you end up
    doing both, which is more expensive than doing just the strong one. How
    does it save time or data?]
    Does rsync send partial file diffs? I think it only ever operates at
    file granularity; they're somewhat different beasts. It checks things like
    size and mtime before it checks the hash by default. In these cases a more
    expensive hash can prevent transferring files that are actually identical
    but were more recently touched. If you tell it to ignore modification
    times, though, you can speed the equality checks by using a less expensive
    hash because it will be running on many more files.

    Thank you guys for all the feedback!

    Jose

    El sábado, 16 de marzo de 2013 08:59:25 UTC+1, Kyle Lemons escribió:
    Reading a 4GB data file off the harddrive will take a long time. If
    both your client and your server read the data file off their respective
    drives and hash them, you have incurred a significant amount of extra time.
    This can be fine, of course, if what you care about are the bytes
    transferred and not the amount of time the transfer takes. The
    pathological case here, if I understand your algorithm correctly, would be
    adding 1 byte to the beginning of a 4GB file, which would cause every hash
    to change, so the download time becomes t(localhash) + t(remotehash) +
    t(transfer(size/chunksize * hashsize)) + t(diff) + t(transfer(size))
    which is almost certainly > t(transfer(size)).

    On Fri, Mar 15, 2013 at 7:03 AM, josvazg wrote:

    When you read the paper, the zsync idea seems much more complex, at
    least more than what I was planning to do.

    Something alike might be possible, though, if it just implements a
    simplified version of the zsync idea.

    I am toying with the idea of saving a hash dump (kind of like the zsync
    file) so that I save its generation time and also speed up the diff
    processing.

    Both client and server can pre-generate the hashdump for each file
    they manage with a certain default slice size (like 1MiB).

    Jose

    El viernes, 15 de marzo de 2013 13:13:44 UTC+1, Tamás Gulácsi
    escribió:
    I meant IMPLEMENT zsync, or something alike - the idea is to checksum
    blocks once and distribute the checksums, so the client can search for
    missing/changed blocks by itself, without server work. Then ask only for the
    required blocks.
  • Aliaksandr Valialkin at Mar 18, 2013 at 1:58 pm
    The problem is in io.Pipe(). According to the Pipe docs (http://golang.org/pkg/io/#Pipe):
    "Pipe creates a *synchronous* in-memory pipe." The key word here is
    "synchronous" - this means the writer goroutines always block (i.e. wait for
    the reader) on the line:

    readed, err = io.CopyN(hashSink, file, toread)

    until the reader goroutine reads data on the lines:

    localHash, err := readString(local)
    remoteHash, err := readString(remote)

    How to fix this? Wrap the writer side of the pipe in a bufio.Writer (http://golang.org/pkg/bufio/#NewWriterSize) at
    the beginning of bulkHashDump(). Don't forget to flush it before closing
    the underlying writer. I.e.:
    ...
    defer w.Close()
    bufW := bufio.NewWriterSize(w, bigBufferSize)
    defer bufW.Flush() // deferred last, so it runs before w.Close()
    // ... and write the hash lines to bufW instead of w
    ...
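
    For what it's worth, here is a minimal, self-contained sketch of that pattern in
    isolation (the names produce, bigBufferSize and the "hash-%d" lines are just
    placeholders for this example, not slicesync code): the producer goroutine writes
    its lines through a bufio.Writer wrapped around the PipeWriter, so it only touches
    the synchronous pipe when the buffer fills or is flushed.

    package main

    import (
        "bufio"
        "fmt"
        "io"
    )

    const bigBufferSize = 1 << 20 // 1MiB, an assumed value

    // produce writes hash-dump-like lines through a buffered writer so it does
    // not block on the synchronous pipe for every single line.
    func produce(w io.WriteCloser, n int) {
        defer w.Close()
        bufW := bufio.NewWriterSize(w, bigBufferSize)
        defer bufW.Flush() // LIFO defers: Flush runs before w.Close()
        for i := 0; i < n; i++ {
            fmt.Fprintf(bufW, "hash-%d\n", i) // stands in for "%x\n" of a slice hash
        }
    }

    func main() {
        r, w := io.Pipe()
        go produce(w, 5)
        scanner := bufio.NewScanner(r)
        for scanner.Scan() {
            fmt.Println("read:", scanner.Text())
        }
    }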


    On Monday, March 18, 2013 11:34:52 AM UTC+2, josvazg wrote:

    Yeah, I have probably made a mistake somewhere. Let's look at the code:
    (https://github.com/josvazg/slicesync)

    These are the pair of functions that produce the hash dump locally or
    remotely:

    // BulkHash calculates the file hash and all hashes of size slice and writes them to w
    //
    // Output is as follows:
    // first text line is the file size
    // then there are size / slice lines each with a slice hash for consecutive slices
    // finally there is the line "Final: "+total file hash
    //
    // Post-initialization errors are dumped in-line starting as an "Error: "... line
    // Nothing more is sent after an error occurs and is dumped to w
    func (hnd *LocalHashNDump) BulkHash(filename string, slice int64) (rc io.ReadCloser, err error) {
        file, err := os.Open(calcpath(hnd.Dir, filename)) // For read access
        if err != nil {
            return nil, err
        }
        if slice <= 0 { // protection against infinite loop by bad arguments
            slice = MiB
        }
        fi, err := file.Stat()
        if err != nil {
            return nil, err
        }
        r, w := io.Pipe()
        go bulkHashDump(w, file, slice, fi.Size())
        return r, nil
    }


    // bulkHashDump produces BulkHash output into the piped writer
    func bulkHashDump(w io.WriteCloser, file io.ReadCloser, slice, size int64) {
        defer file.Close()
        defer w.Close()
        fmt.Fprintf(w, "%v\n", size)
        if size > 0 {
            h := newHasher()
            sliceHash := newSliceHasher()
            hashSink := io.MultiWriter(h, sliceHash)
            readed := int64(0)
            var err error
            for pos := int64(0); pos < size; pos += readed {
                toread := slice
                if toread > (size - pos) {
                    toread = size - pos
                }
                readed, err = io.CopyN(hashSink, file, toread)
                if err != nil {
                    fmt.Fprintf(w, "Error:%s\n", err)
                    return
                }
                fmt.Fprintf(w, "%x\n", sliceHash.Sum(nil))
                sliceHash.Reset()
            }
            fmt.Fprintf(w, "Final: %x\n", h.Sum(nil))
        }
    }

    I use io.CopyN to feed both the per-slice hash and the whole-file hash for each slice. The tricky part, maybe, is that I use a goroutine plus io.Pipe to stream all the produced hash dump from a writer to the returned reader on the other side.
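
    As an illustration of the output format described in the BulkHash comment above,
    a dump for a file of four 1MiB slices would look roughly like this (hash values
    shown as placeholders, not real digests):

    4194304
    <40-hex-char sha1 of slice 1>
    <40-hex-char sha1 of slice 2>
    <40-hex-char sha1 of slice 3>
    <40-hex-char sha1 of slice 4>
    Final: <40-hex-char sha1 of the whole file>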


    And these are the pair of functions that calculate the differences from
    both the remote and local stream of hash dumps:

    func calcDiffs(server, filename, alike string, slice int64, fn mixerfn) (*Diffs, error) {
        // local & remote streams opening
        localHnd := &LocalHashNDump{"."}
        lc, err := localHnd.BulkHash(alike, slice)
        if err != nil {
            return nil, err
        }
        defer lc.Close()
        local := bufio.NewReader(lc)
        remoteHnd := &RemoteHashNDump{server}
        rm, err := remoteHnd.BulkHash(filename, slice)
        if err != nil {
            return nil, err
        }
        defer rm.Close()
        remote := bufio.NewReader(rm)
        // local & remote sizes
        lsize, err := readInt64(local)
        if err != nil {
            return nil, err
        }
        rsize, err := readInt64(remote)
        if err != nil {
            return nil, err
        }
        // diff building loop
        diffs := NewDiffs(server, filename, alike, slice, AUTOSIZE)
        if err = diffLoop(diffs, local, remote, lsize, rsize, fn); err != nil {
            return nil, err
        }
        // total hashes
        diffs.AlikeHash, err = readAttribute(local, "Final")
        if err != nil {
            return nil, err
        }
        diffs.Hash, err = readAttribute(remote, "Final")
        if err != nil {
            return nil, err
        }
        return diffs, nil
    }

    // diffLoop builds the diffs from the hash streams
    func diffLoop(diffs *Diffs, local, remote *bufio.Reader, lsize, rsize int64, fn mixerfn) error {
        indiff := false
        end := min(lsize, rsize)
        segment := diffs.Slice
        for pos := int64(0); pos < end; pos += segment {
            if pos+segment > end {
                segment = end - pos
            }
            localHash, err := readString(local)
            if err != nil {
                return err
            }
            remoteHash, err := readString(remote)
            if err != nil {
                return err
            }
            if indiff {
                if localHash == remoteHash {
                    indiff = false
                } else {
                    diffs.Diffs[len(diffs.Diffs)-1].Size += segment
                    diffs.Differences += segment
                }
            } else if localHash != remoteHash {
                diffs.Diffs = append(diffs.Diffs, Diff{pos, segment})
                diffs.Differences += segment
                indiff = true
            }
            if fn != nil {
                fn(pos+segment, indiff)
            }
        }
        if lsize < rsize {
            remaining := rsize - lsize
            diffs.Diffs = append(diffs.Diffs, Diff{lsize, remaining})
            diffs.Differences += remaining
        }
        return nil
    }
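
    To illustrate what diffLoop accumulates (just a worked example based on the code
    above, with a 1MiB slice): if only the hashes of slices 3, 4 and 5 differ, the
    first mismatch appends Diff{2*MiB, MiB} and the next two mismatches grow that same
    entry, so the result is a single Diff{2*MiB, 3*MiB} and Differences == 3*MiB.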

    Do you see any mistakes or anything that could justify the extra time the diffs take?


    Jose


  • Aliaksandr Valialkin at Mar 18, 2013 at 2:04 pm
    Also, it would probably be better to substitute the Pipe with a buffered channel (http://golang.org/doc/effective_go.html#channels).
    This way you'll avoid the hash values' serialization overhead.
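
    Something along these lines, as a rough sketch only (the sliceHash type, hashSlices
    and the 256-entry buffer are assumptions for illustration, not slicesync code): the
    producer pushes one digest per slice into a buffered channel and only blocks when
    the buffer is full, while the consumer simply ranges over the channel.

    package main

    import (
        "crypto/sha1"
        "fmt"
        "io"
        "os"
    )

    // sliceHash pairs a slice offset with its SHA1 digest.
    type sliceHash struct {
        pos int64
        sum [sha1.Size]byte
    }

    // hashSlices sends one sliceHash per slice on a buffered channel, so the
    // producer only blocks once the 256-entry buffer is full.
    func hashSlices(r io.Reader, slice int64) <-chan sliceHash {
        out := make(chan sliceHash, 256)
        go func() {
            defer close(out)
            buf := make([]byte, slice)
            var pos int64
            for {
                n, err := io.ReadFull(r, buf)
                if n > 0 {
                    out <- sliceHash{pos, sha1.Sum(buf[:n])}
                    pos += int64(n)
                }
                if err != nil {
                    return // io.EOF or io.ErrUnexpectedEOF ends the stream
                }
            }
        }()
        return out
    }

    func main() {
        f, err := os.Open("somefile") // placeholder file name
        if err != nil {
            fmt.Println(err)
            return
        }
        defer f.Close()
        for h := range hashSlices(f, 1<<20) {
            fmt.Printf("%d %x\n", h.pos, h.sum)
        }
    }
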
  • Josvazg at Mar 18, 2013 at 5:13 pm
    Oops!
    Thanks Aliaksandr!
    Already corrected on GitHub...
    What do you suggest as "bigBufferSize"?
    (I am currently using 1MiB)

    As for the buffered channel, I am not convinced about changing that, because right now
    the same function can send the hashes over the network OR to a local pipe
    AND (probably shortly) to a file for later use, saving all the calculation
    time. I believe that with a channel I would lose that flexibility.

    Jose

  • Aliaksandr Valialkin at Mar 18, 2013 at 8:14 pm

    On Monday, March 18, 2013 7:13:40 PM UTC+2, josvazg wrote:
    Oops!
    Thanks Aliaksandr!
    Already corrected on GitHub...
    What do you suggest as "bigBufferSize"?
    (I am currently using 1MiB)
    This depends on the size the serialized hash for each chunk occupies. For
    example, if each serialized hash value occupies 16 bytes, then a 1MiB buffer
    can contain 64K hashes, which is enough to buffer the hashes for a 64GiB file
    if chunkSize=1MiB.
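
    (Applying the same arithmetic to the dump format shown earlier, where each line is
    a SHA1 digest printed with "%x\n": one line is 40 hex characters plus a newline,
    i.e. 41 bytes, so a 1MiB buffer holds roughly 1048576 / 41 ≈ 25K hashes, or about
    25GiB of file at 1MiB slices.)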

    As for the buffered channel, I am not convinced to change that because
    now, the same function can send the hashes over the network OR to a local
    pipe AND (probably shortly) to a file for later use, saving all the
    calculation time. I believe that with the channel I will lose that
    flexibility.
    You can always convert a channel to another kind of stream (a network connection or a file
    stream) and vice versa via a dedicated goroutine.
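
    For example, a minimal sketch of such a bridging goroutine (writeHashes and the
    plain-text line format are assumptions for illustration): it drains a channel of
    hash lines into any io.Writer, whether that is a network connection, a file or a
    pipe, and reports the first error on a result channel.

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "os"
    )

    // writeHashes drains a channel of hash lines into any io.Writer from a
    // dedicated goroutine and reports the first error (or nil) when done.
    func writeHashes(hashes <-chan string, w io.Writer) <-chan error {
        done := make(chan error, 1)
        go func() {
            bw := bufio.NewWriter(w)
            for line := range hashes {
                if _, err := fmt.Fprintln(bw, line); err != nil {
                    done <- err
                    return
                }
            }
            done <- bw.Flush()
        }()
        return done
    }

    func main() {
        hashes := make(chan string, 16)
        done := writeHashes(hashes, os.Stdout)
        hashes <- "da39a3ee5e6b4b0d3255bfef95601890afd80709" // SHA1 of the empty string, as a sample line
        close(hashes)
        fmt.Println("write result:", <-done)
    }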

  • Josvazg at Mar 18, 2013 at 7:22 pm
    Is it possible that crypto/sha1 is 2-3 times slower than shasum?

    I am comparing shasum times on the same file against the same operation in
    my code, and I am getting twice the time with Go as with shasum.

    My function is:

    // doHash is the internal function that calculates the local hash of the
    // given slice of filename
    func doHash(filename string, offset, slice int64) *HashInfo {
        file, err := os.Open(filename) // For read access
        autopanic(err)
        defer file.Close()
        fi, err := file.Stat()
        autopanic(err)
        toread := sliceFile(file, fi.Size(), offset, slice)
        hash := ""
        if toread > 0 {
            h := newSliceHasher()
            _, err = io.CopyN(h, file, toread)
            autopanic(err)
            hash = fmt.Sprintf("%v-%x", sliceHasherName(), h.Sum(nil))
        }
        return &HashInfo{fi.Size(), offset, toread, hash}
    }

    And:

    // sliceHasherName returns the implementation name of the slice hasher
    func sliceHasherName() string {
        return "sha1"
    }
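
    To isolate whether crypto/sha1 itself is the bottleneck, something like this quick
    timing program might help (the file name is a placeholder): it hashes the whole
    file with nothing else in the way, so the time can be compared directly against
    "time shasum <file>" on the same file.

    package main

    import (
        "crypto/sha1"
        "fmt"
        "io"
        "os"
        "time"
    )

    func main() {
        f, err := os.Open("bigfile.dat") // placeholder file name
        if err != nil {
            fmt.Println(err)
            return
        }
        defer f.Close()
        h := sha1.New()
        start := time.Now()
        n, err := io.Copy(h, f) // hash the whole file, nothing else
        if err != nil {
            fmt.Println(err)
            return
        }
        elapsed := time.Since(start)
        fmt.Printf("%x  %d bytes in %v (%.1f MB/s)\n",
            h.Sum(nil), n, elapsed, float64(n)/1e6/elapsed.Seconds())
    }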

    Jose

  • Josvazg at Mar 18, 2013 at 7:23 pm
    I meant:


    // newSliceHasher returns a Hash implementation for each slice
    // (SHA1 in the naive implementation, or rolling+hash in rsync's simulation)
    func newSliceHasher() hash.Hash {
        return sha1.New()
    }

    El lunes, 18 de marzo de 2013 20:22:08 UTC+1, josvazg escribió:
    Is it possible that crypto/sha1 is 2-3 times slower than shasum?

    I am comparing shasum times against my code on the same file and the same
    operation, and I am getting twice the time with Go as with shasum.

    My function is:

    // doHash is the internal function that calculates the local hash of the
    // given slice of filename
    func doHash(filename string, offset, slice int64) *HashInfo {
        file, err := os.Open(filename) // For read access
        autopanic(err)
        defer file.Close()
        fi, err := file.Stat()
        autopanic(err)
        toread := sliceFile(file, fi.Size(), offset, slice)
        hash := ""
        if toread > 0 {
            h := newSliceHasher()
            _, err = io.CopyN(h, file, toread)
            autopanic(err)
            hash = fmt.Sprintf("%v-%x", sliceHasherName(), h.Sum(nil))
        }
        return &HashInfo{fi.Size(), offset, toread, hash}
    }

    And:

    // sliceHasherName returns the implementation name of the slice hasher
    func sliceHasherName() string {
        return "sha1"
    }

    Jose

    El lunes, 18 de marzo de 2013 15:04:14 UTC+1, Aliaksandr Valialkin
    escribió:
    Also, it would probably be better to substitute the Pipe with a buffered
    channel (http://golang.org/doc/effective_go.html#channels). This way you
    avoid the overhead of serializing the hash values into text lines.
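
    A rough sketch of that buffered-channel alternative (sliceDigest and
    bulkHashChan are made-up names for illustration, not slicesync's API):

    import (
        "crypto/sha1"
        "io"
    )

    // sliceDigest carries one slice's hash (or the error that stopped the dump).
    type sliceDigest struct {
        Pos int64
        Sum []byte
        Err error
    }

    // bulkHashChan hashes consecutive slices of r and sends each digest on the
    // returned channel. The buffer lets the hasher run ahead of the consumer.
    func bulkHashChan(r io.Reader, slice int64) <-chan sliceDigest {
        if slice <= 0 {
            slice = 1 << 20 // default to 1MiB slices
        }
        out := make(chan sliceDigest, 64) // buffered: hashing need not wait on the diff loop
        go func() {
            defer close(out)
            h := sha1.New()
            pos := int64(0)
            for {
                n, err := io.CopyN(h, r, slice)
                if n > 0 {
                    out <- sliceDigest{Pos: pos, Sum: h.Sum(nil)}
                    h.Reset()
                    pos += n
                }
                if err == io.EOF {
                    return // source exhausted
                }
                if err != nil {
                    out <- sliceDigest{Pos: pos, Err: err}
                    return
                }
            }
        }()
        return out
    }

    The diff loop would then range over two such channels (local and remote)
    and compare the Sum fields, instead of reading hash lines from
    bufio.Readers.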
    On Monday, March 18, 2013 3:58:56 PM UTC+2, Aliaksandr Valialkin wrote:

    The problem is in io.Pipe(). According to the Pipe docs
    (http://golang.org/pkg/io/#Pipe): "Pipe creates a *synchronous* in-memory
    pipe." The key word here is "synchronous" - it means the writer goroutine
    always blocks (i.e. waits for the reader) on the line:

    readed, err = io.CopyN(hashSink, file, toread)

    until the reader goroutine reads data on the lines:

    localHash, err := readString(local)
    remoteHash, err := readString(remote)

    How to fix this? Wrap the writer side of the pipe in a bufio.Writer
    (http://golang.org/pkg/bufio/#NewWriterSize) at the beginning of
    bulkHashDump(), and don't forget to flush it before closing the underlying
    writer. I.e.:
    ...
    defer w.Close()
    bufW := bufio.NewWriterSize(w, bigBufferSize)
    defer bufW.Flush()
    ...
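
    Applied to the bulkHashDump function quoted further down, that fix would
    look roughly like this (a sketch: bigBufferSize is a made-up constant,
    sha1.New stands in for the newHasher/newSliceHasher helpers, and a "bufio"
    import is assumed):

    func bulkHashDumpBuffered(w io.WriteCloser, file io.ReadCloser, slice, size int64) {
        const bigBufferSize = 64 << 10 // 64KiB of buffered hash lines
        defer file.Close()
        defer w.Close()
        bw := bufio.NewWriterSize(w, bigBufferSize)
        defer bw.Flush() // defers run LIFO, so this flushes before w.Close()
        fmt.Fprintf(bw, "%v\n", size)
        if size > 0 {
            h := sha1.New()  // whole-file hash
            sh := sha1.New() // per-slice hash
            sink := io.MultiWriter(h, sh)
            for pos := int64(0); pos < size; {
                toread := slice
                if toread > size-pos {
                    toread = size - pos
                }
                n, err := io.CopyN(sink, file, toread)
                if err != nil {
                    fmt.Fprintf(bw, "Error:%s\n", err)
                    return
                }
                fmt.Fprintf(bw, "%x\n", sh.Sum(nil))
                sh.Reset()
                pos += n
            }
            fmt.Fprintf(bw, "Final: %x\n", h.Sum(nil))
        }
    }

    The hash lines now accumulate in the bufio.Writer, so the hashing goroutine
    only blocks on the pipe when the buffer fills, instead of once per line.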


    On Monday, March 18, 2013 11:34:52 AM UTC+2, josvazg wrote:

    Yeah, I have probably made a mistake somewhere. Let's look at the code:
    (https://github.com/josvazg/slicesync)

    These are the pair of functions that produce the hash dump locally or
    remotely:

    // BulkHash calculates the file hash and all hashes of size slice and writes them to w
    //
    // Output is as follows:
    // first text line is the file size
    // then there are size / slice lines each with a slice hash for consecutive slices
    // finally there is the line "Final: "+total file hash
    //
    // Post initialization errors are dumped in-line starting as a "Error: "... line
    // Nothing more is sent after an error occurs and is dumped to w
    func (hnd *LocalHashNDump) BulkHash(filename string, slice int64) (rc io.ReadCloser, err error) {
        file, err := os.Open(calcpath(hnd.Dir, filename)) // For read access
        if err != nil {
            return nil, err
        }
        if slice <= 0 { // protection against infinite loop by bad arguments
            slice = MiB
        }
        fi, err := file.Stat()
        if err != nil {
            return nil, err
        }
        r, w := io.Pipe()
        go bulkHashDump(w, file, slice, fi.Size())
        return r, nil
    }


    // bulkHashDump produces BulkHash output into the piped writer
    func bulkHashDump(w io.WriteCloser, file io.ReadCloser, slice, size int64) {
        defer file.Close()
        defer w.Close()
        fmt.Fprintf(w, "%v\n", size)
        if size > 0 {
            h := newHasher()
            sliceHash := newSliceHasher()
            hashSink := io.MultiWriter(h, sliceHash)
            readed := int64(0)
            var err error
            for pos := int64(0); pos < size; pos += readed {
                toread := slice
                if toread > (size - pos) {
                    toread = size - pos
                }
                readed, err = io.CopyN(hashSink, file, toread)
                if err != nil {
                    fmt.Fprintf(w, "Error:%s\n", err)
                    return
                }
                fmt.Fprintf(w, "%x\n", sliceHash.Sum(nil))
                sliceHash.Reset()
            }
            fmt.Fprintf(w, "Final: %x\n", h.Sum(nil))
        }
    }

    I use io.CopyN to feed both the per-slice hash and the whole-file hash for each slice. The tricky part, maybe, is that I use a goroutine and an io.Pipe to stream the produced hash dump from a writer to the returned reader on the other side.


    And these are the pair of functions that calculate the differences from
    both the remote and local stream of hash dumps:

    func calcDiffs(server, filename, alike string, slice int64, fn mixerfn) (*Diffs, error) {
        // local & remote streams opening
        localHnd := &LocalHashNDump{"."}
        lc, err := localHnd.BulkHash(alike, slice)
        if err != nil {
            return nil, err
        }
        defer lc.Close()
        local := bufio.NewReader(lc)
        remoteHnd := &RemoteHashNDump{server}
        rm, err := remoteHnd.BulkHash(filename, slice)
        if err != nil {
            return nil, err
        }
        defer rm.Close()
        remote := bufio.NewReader(rm)
        // local & remote sizes
        lsize, err := readInt64(local)
        if err != nil {
            return nil, err
        }
        rsize, err := readInt64(remote)
        if err != nil {
            return nil, err
        }
        // diff building loop
        diffs := NewDiffs(server, filename, alike, slice, AUTOSIZE)
        if err = diffLoop(diffs, local, remote, lsize, rsize, fn); err != nil {
            return nil, err
        }
        // total hashes
        diffs.AlikeHash, err = readAttribute(local, "Final")
        if err != nil {
            return nil, err
        }
        diffs.Hash, err = readAttribute(remote, "Final")
        if err != nil {
            return nil, err
        }
        return diffs, nil
    }

    // diffLoop builds the diffs from the hash streams
    func diffLoop(diffs *Diffs, local, remote *bufio.Reader, lsize, rsize int64, fn mixerfn) error {
        indiff := false
        end := min(lsize, rsize)
        segment := diffs.Slice
        for pos := int64(0); pos < end; pos += segment {
            if pos+segment > end {
                segment = end - pos
            }
            localHash, err := readString(local)
            if err != nil {
                return err
            }
            remoteHash, err := readString(remote)
            if err != nil {
                return err
            }
            if indiff {
                if localHash == remoteHash {
                    indiff = false
                } else {
                    diffs.Diffs[len(diffs.Diffs)-1].Size += segment
                    diffs.Differences += segment
                }
            } else if localHash != remoteHash {
                diffs.Diffs = append(diffs.Diffs, Diff{pos, segment})
                diffs.Differences += segment
                indiff = true
            }
            if fn != nil {
                fn(pos+segment, indiff)
            }
        }
        if lsize < rsize {
            remaining := rsize - lsize
            diffs.Diffs = append(diffs.Diffs, Diff{lsize, remaining})
            diffs.Differences += remaining
        }
        return nil
    }

    Do you see any mistakes, or anything that could justify the extra time the diffs take?


    Jose


    El domingo, 17 de marzo de 2013 13:11:04 UTC+1, Aliaksandr Valialkin
    escribió:
    The funny thing about the tests I was talking about is that one
    stream was a remote hash calculation on a remote disk and the other was a
    local hash calculation. So they were sequential reads on independent /
    different systems, each on its own harddrive. It seemed to me that the most
    expensive operation was the hash calculation more than the IO, because I
    thought, from memory, that the stream copy without hash calculation took
    half the time... (but don't trust my memory, I should test that to confirm)

    Code mistakes aside, the server reads the file to be downloaded
    sequentially, calculates each block hash and sends it through the network
    connection. The client reads that stream plus another one it opened itself,
    doing the sequential read and hash calculation on the local file. With
    those two it can figure out the differences.
    From the description it looks like the fastest algorithm possible.
    Perhaps your code uses write buffers that are too small on the network
    connections where hashes are pushed, and/or it requires an ack for each
    hash before sending the next one over the stream?

    Tested times (2mins vs 9minutes) seem to be quite constant, so if it
    depends on system load, then that load is strangely constant.

    On rsync now...
    If I understood it correctly, rsync allows you to detect data shifts.
    The first question I have is: in both directions, or just advancing shifts?
    With the help of that rolling checksum, when I find a mismatch, I
    start looking, byte by byte, for the current and probably any of the
    next checksums.

    Next questions are: for how long? And how many blocks apart from the
    mismatched one do I keep looking?

    Thanks for all the replies and the info!

    Jose
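
    On those rolling-checksum questions: the reason rsync-style tools pair a
    weak/cheap hash with a strong/expensive one is that the weak one is a
    rolling checksum that can slide one byte at a time in O(1), so it can be
    tried at every offset; the strong hash is only computed for the few windows
    where the weak one already matches. A minimal sketch of such a rolling
    checksum (illustrative names and formula, not rsync's exact code):

    // weakChecksum is a simple rolling checksum over a fixed-size window:
    // a is the sum of the bytes, b accumulates position-weighted sums.
    type weakChecksum struct {
        a, b uint32
        n    int // window size in bytes
    }

    // newWeak computes the checksum of the initial window.
    func newWeak(window []byte) *weakChecksum {
        w := &weakChecksum{n: len(window)}
        for i, c := range window {
            w.a += uint32(c)
            w.b += uint32(len(window)-i) * uint32(c)
        }
        return w
    }

    // roll slides the window one byte forward in O(1): drop out, take in.
    func (w *weakChecksum) roll(out, in byte) {
        w.a = w.a - uint32(out) + uint32(in)
        w.b = w.b - uint32(w.n)*uint32(out) + w.a
    }

    // sum packs both halves into one 32-bit value to look up in the table
    // of known block checksums.
    func (w *weakChecksum) sum() uint32 {
        return (w.a & 0xffff) | (w.b&0xffff)<<16
    }

    Only when sum() matches one of the known block checksums is the strong hash
    (SHA1 here, MD4/MD5 in rsync) computed over that window to confirm the
    match, which is how the cheap hash saves time overall.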

  • Aliaksandr Valialkin at Mar 18, 2013 at 8:32 pm

    On Monday, March 18, 2013 9:23:05 PM UTC+2, josvazg wrote:
    Is it possible that crypto/sha1 is 2-3 times slower than shasum?
    Yes, it is quite possible, because sha1.go
    (http://golang.org/src/pkg/crypto/sha1/sha1.go) contains a plain,
    unoptimized implementation of SHA1, while shasum probably uses a highly
    optimized one.
    Also, your code has additional overhead which can be avoided - it opens the
    file, calls fstat and seeks to the given offset in the file for each chunk.
    And since crypto/sha1 doesn't implement the io.ReaderFrom interface,
    io.CopyN() makes an additional copy of the file data into a temporary
    buffer before passing it to the hash's Write().
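
    For the CopyN part, a rough illustration of hashing through one reused
    buffer (a sketch, not slicesync code; the copy from the OS into the buffer
    still happens, only io.CopyN's per-call temporary buffer goes away):

    package main

    import (
        "crypto/sha1"
        "fmt"
        "io"
        "os"
    )

    // hashFile hashes a whole file, reading through a single preallocated buffer.
    func hashFile(name string) (string, error) {
        f, err := os.Open(name)
        if err != nil {
            return "", err
        }
        defer f.Close()
        h := sha1.New()
        buf := make([]byte, 1<<20) // one reused 1MiB buffer
        for {
            n, err := f.Read(buf)
            if n > 0 {
                h.Write(buf[:n]) // hash.Hash writes never return an error
            }
            if err == io.EOF {
                break
            }
            if err != nil {
                return "", err
            }
        }
        return fmt.Sprintf("sha1-%x", h.Sum(nil)), nil
    }

    func main() {
        fmt.Println(hashFile(os.Args[1]))
    }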


  • Josvazg at Mar 18, 2013 at 11:03 pm
    Ok then, for the parts I can do something about right now...
    The seek overhead should be negligible, as it is done just once in my test.
    When I hash the whole file I use slice=0=AUTOSIZE, meaning "till the
    end", so the function is called just once.

    As for the CopyN, I don't see how I can implement ReadFrom for a type
    extending hash.Hash while AVOIDING the copy. I will have to use a buffer to
    Read() from the io.Reader anyway, so I will be doing another copy, just
    like CopyN does.

    Or am I missing something?
    Is there a sample of ReadFrom somewhere that avoids a copy?

    Jose
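
    A ReadFrom implementation cannot avoid the copy itself, but it can own the
    buffer so it is allocated once rather than on every io.CopyN call. A rough
    sketch (hypothetical names, assuming the "hash" and "io" imports):

    // hashSink wraps a hash.Hash and owns a single reusable read buffer.
    type hashSink struct {
        h   hash.Hash
        buf []byte
    }

    func newHashSink(h hash.Hash) *hashSink {
        return &hashSink{h: h, buf: make([]byte, 256<<10)}
    }

    // Write keeps hashSink usable anywhere an io.Writer is expected.
    func (s *hashSink) Write(p []byte) (int, error) { return s.h.Write(p) }

    // ReadFrom lets io.Copy/io.CopyN delegate the read loop (and the buffer)
    // to us: bytes are still copied into buf, but buf is reused across calls.
    func (s *hashSink) ReadFrom(r io.Reader) (int64, error) {
        var total int64
        for {
            n, err := r.Read(s.buf)
            if n > 0 {
                s.h.Write(s.buf[:n])
                total += int64(n)
            }
            if err == io.EOF {
                return total, nil
            }
            if err != nil {
                return total, err
            }
        }
    }

    With that, io.CopyN(sink, file, toread) hands the loop to ReadFrom instead
    of allocating its own temporary buffer, which is the only saving available;
    the byte copy into the hash's input stays.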

  • Josvazg at Mar 19, 2013 at 9:43 am
    UPDATE:
    It seems I might have been wrong about the earlier test results; my memory
    is not that good... (I probably mixed results from different machines or
    tools)

    Today I had access to my original test data files and I can now confirm
    that the tests are coherent.

    Hashing the 4GB file locally with Go takes 5 minutes, and that is just one
    pass, getting just one hash (slice = file size, equivalent to shasum but in
    Go, which we know is 2-3 times slower).

    Generating the diffs between a remote and a local 4GB file takes almost
    9 minutes 20 seconds.

    And generating the full hash dump locally takes almost 8 minutes 40
    seconds. That is not bad at all when the diff process takes less than a
    minute more.

    Maybe the "improvement" is also due to the pipe buffering that
    Aliaksandr suggested, which is now implemented.
  • Josvazg at Mar 19, 2013 at 3:37 pm
    UPDATE 2:
    At work I can't observe the x2 or x3 performance penalty from shasum to my
    Go equivalent program. On the Linux box both times are really similar, and
    on the Windows box, compared to sha1sum.exe (not the same as shasum but I
    guess similar), there is a performance penalty, but more in the x1.1-x1.2
    range, not x2 or x3.

    At work both Go versions are 1.0.3. I will have to check at home (my Linux
    and Mac boxes); maybe those had older versions of Go or of the libraries,
    which would justify the performance penalties observed.

    Jose
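
    For anyone wanting to reproduce the comparison, a minimal timing harness
    along these lines can be run against shasum/sha1sum on the same file
    (sketch; the program and file names are whatever you choose):

    package main

    import (
        "crypto/sha1"
        "fmt"
        "io"
        "os"
        "time"
    )

    func main() {
        f, err := os.Open(os.Args[1])
        if err != nil {
            panic(err)
        }
        defer f.Close()
        h := sha1.New()
        start := time.Now()
        n, err := io.Copy(h, f) // one sequential pass, like shasum
        if err != nil {
            panic(err)
        }
        elapsed := time.Since(start)
        fmt.Printf("%x  %s\n", h.Sum(nil), os.Args[1])
        fmt.Printf("%d bytes in %v (%.1f MB/s)\n", n, elapsed,
            float64(n)/1e6/elapsed.Seconds())
    }

    Compare its output with "time shasum <file>" or "time sha1sum <file>" on
    the same box to see whether the gap is really 2-3x.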

    El martes, 19 de marzo de 2013 10:42:58 UTC+1, josvazg escribió:
    UPDATE:
    It seems I might have been wrong with the tests result before, my memory
    is not that good... (I probably mixed results from different machines or
    tools)

    Today, I had access to my original tests data files and I can now confirm
    that tests are coherent.

    Hashing with go the 4GB file locally takes 5 minutes, and I am talking
    just one pass, getting just one hash (slice=file size, equivalent to shasum
    but with go, that we know is 2-3 times slower)

    Generating the diffs between a remote and local 4GB file takes almost
    9minutes and 20seconds.

    And generating the full hash dump locally takes almost 8 minutes 40
    seconds. That is not bad at all when the diff process takes less than a
    minute more.

    Maybe the "improvement" is also due to the pipe buffering that
    Aliaksandr suggested and is now implemented.

    El domingo, 17 de marzo de 2013 10:59:48 UTC+1, josvazg escribió:
    I will try to reply to Aliaksandr, Tarmigan and Kyle...

    The funny thing about the tests I was talking about is that one stream
    was a remote hash calculation on a remote disk and the other was local hash
    calculation. So they were sequential reads on independent / different
    systems, each on its own harddrive. It seemed to me that the most expensive
    operation was the hash calculus more than the IO, cause I thought, from
    memory, than the stream copy without hash calculus took half the time...
    (but don't trust my memory I should test that to confirm)

    Code mistakes aside, the server reads the file to be downloaded
    sequentially and calculates each block hash and send its through the
    network connection. The client is reading that stream and another it opened
    itself doing the sequential read and hash calculus on the local file. With
    that it can figure out the differences.

    Tested times (2mins vs 9minutes) seem to be quite constant, so if it
    depends on system load, then that load is strangely constant.

    On rsync now...
    If I understood it correctly, rsync allows you to detect data shitfs.
    First question I have is "in both directions or just advancing shifts?"
    With the help of that rolling checksum, when I find a mismatch, I start
    looking, byte by byte, if I find the current and probably any of the next
    checksums.

    Next questions are for how long? and how many blocks apart from the
    mismatched one I keep on looking for?

    Thanks for all the replies and the info!

    Jose

    El sábado, 16 de marzo de 2013 20:35:25 UTC+1, Kyle Lemons escribió:
    On Sat, Mar 16, 2013 at 2:06 AM, josvazg wrote:

    You are right, those are the theoretical times.

    The thing is that in my tests I had unexpected results. For instance,
    t(localhash) or t(remotehash) on a 4GB file where about 2 min, doing one
    after the other should be 4-5 minutes. But I tested a diff operation, that
    is checking a hash against another (while calculating them both at the same
    time) and it took 9 minutes (nearly twice as expected). Maybe there
    was something wrong with my code, but now I plan to test how long it takes
    to calculate the diffs with pregenerated hash files.
    I think we'd have to hear more about the methodology behind the parallel
    test to speculate about why it's taking twice as long. When you say "at
    the same time" do you mean on different machines at the same time? If
    they're on the same machine and the same harddrive, it seems likely that
    reading two 4GB files at the same time off the harddrive would take twice
    as long. If you're reading a chunk and then communicating about its hash
    over the network, then you're essentially forcing the slowest read at any
    given time to stall both processes, which also seems like it could balloon
    the time it takes, though whether it's 10% slower or 100% would depend on
    the workloads and harddrives on both systems.

    Downloading a pre-generated text based hash file of 4GB data for 1MiB
    chunks is "instantaneous" (compared to the time it takes to generate it,
    specially if you just use SHA1 instead of a more efficient hash combination)

    [One thing I don't quite understand is why rsync uses both a weak/cheap
    hash and a strong/expensive one. For security reasons you end up doing
    both, which is more expensive than doing just the strong one. How does it
    save time or data?]
    Does rsync send partial file diffs? I think it only ever operates at
    file granularity; they're somewhat different beasts. It checks things like
    size and mtime before it checks the hash by default. In these cases a more
    expensive hash can prevent transferring files that are actually identical
    but were more recently touched. If you tell it to ignore modification
    times, though, you can speed up the equality checks by using a less expensive
    hash because it will be running on many more files.

    Thank you guys for all the feedback!

    Jose

    On Saturday, March 16, 2013 at 08:59:25 UTC+1, Kyle Lemons wrote:
    Reading a 4GB data file off the harddrive will take a long time. If
    both your client and your server read the data file off their respective
    drives and hash them, you have incurred a significant amount of extra time.
    This can be fine, of course, if what you care about are the bytes
    transferred and not the amount of time the transfer takes. The
    pathological case here, if I understand your algorithm correctly, would be
    adding 1 byte to the beginning of a 4GB file, which would cause every hash
    to change, so the download time becomes t(localhash) + t(remotehash) +
    t(transfer(size/chunksize * hashsize)) + t(diff) + t(transfer(size)),
    which is almost certainly > t(transfer(size)).
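
    (For scale, a rough back-of-the-envelope for the hash-dump transfer term
    in that formula, assuming a 4GiB file, 1MiB slices and one 40-hex-char
    SHA1 line per slice; these numbers are illustrative assumptions:)

    package main

    import "fmt"

    func main() {
        const (
            size      = 4 << 30 // 4 GiB file
            slice     = 1 << 20 // 1 MiB slices
            lineBytes = 41      // 40 hex chars + newline per SHA1 line
        )
        slices := size / slice
        fmt.Printf("%d slices -> %d bytes of hash dump (~%.0f KiB)\n",
            slices, slices*lineBytes, float64(slices*lineBytes)/1024)
        // Output: 4096 slices -> 167936 bytes of hash dump (~164 KiB)
    }

    So in the pathological case the dominant costs are the two hashing
    passes, the diff and the full transfer, not the hash dump itself.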

    On Fri, Mar 15, 2013 at 7:03 AM, josvazg wrote:

    When you read the paper, the zsync idea seems much more complex, at
    least more than what I was planning to do.

    Something alike might be possible, though, if it just implements a
    simplified version of the zsync idea.

    I am toying with the idea of saving a hash dump (kind of like the zsync
    file) so that I save its generation time and also speed up the diff
    processing.

    Both client and server can pre-generate the hash dump for each file
    they manage with a certain default slice size (like 1MiB).
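
    Something like the following sketch is what I have in mind for
    pre-generating a dump (simplified and with made-up names, not the actual
    slicesync code; the real dump format may differ in details):

    package main

    import (
        "bufio"
        "crypto/sha1"
        "fmt"
        "io"
        "os"
    )

    const sliceSize = 1 << 20 // default 1MiB slices

    // writeDump writes a bulk hash dump for filename: the total size first,
    // one SHA1 per slice, and the whole-file SHA1 on the last line.
    func writeDump(filename string, out io.Writer) error {
        f, err := os.Open(filename)
        if err != nil {
            return err
        }
        defer f.Close()
        info, err := f.Stat()
        if err != nil {
            return err
        }
        w := bufio.NewWriter(out)
        fmt.Fprintln(w, info.Size()) // first line: total file size
        total := sha1.New()          // whole-file hash, built as we go
        buf := make([]byte, sliceSize)
        for {
            n, err := io.ReadFull(f, buf)
            if n > 0 {
                total.Write(buf[:n])
                fmt.Fprintf(w, "%x\n", sha1.Sum(buf[:n])) // one hash per slice
            }
            if err == io.EOF || err == io.ErrUnexpectedEOF {
                break
            }
            if err != nil {
                return err
            }
        }
        fmt.Fprintf(w, "%x\n", total.Sum(nil)) // last line: whole-file hash
        return w.Flush()
    }

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: hashdump <file>")
            os.Exit(1)
        }
        if err := writeDump(os.Args[1], os.Stdout); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }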

    Jose

    On Friday, March 15, 2013 at 13:13:44 UTC+1, Tamás Gulácsi
    wrote:
    I meant IMPLEMENT zsync, or something like it - the idea is to checksum
    blocks once and distribute the checksums, so the client can search for
    missing/changed blocks by itself, without server work. Then ask only for
    the required blocks.
  • Josvazg at Mar 20, 2013 at 8:37 am
    UPDATE 3:
    (Replying to myself, so that it gets written down somewhere...)

    Pre-computing one side's bulk hash dump (the list of each slice's hash)
    does not make any difference. The diff alone takes the same time it took
    before, when neither was precomputed.

    Pre-computing BOTH sides was HUGE! The naive diff calculation was
    "almost" instantaneous.

    [By naive diff calculation I mean looking for block matches in place,
    while the advanced rsync-like approach would have to re-read the local
    file to look for block displacements.]

    It is clear that it would be best to have those bulk hash files
    precomputed when possible.

    It also makes me suspect that implementing the "naive diff calculation"
    approach might not be that bad at all:

    - I believe that, statistically speaking, changes that shift or shuffle
    the data in big files are not that common. I can think of rearranged
    images, video feeds, or sound files where parts are moved back and forth.
    But most of the time what you have is a file system on an ISO, a binary
    image of some set of files, etc., so the file is not rearranged; usually
    just some files or parts within it changed.

    - Yes, in text files shifts and reshuffles are quite common: you put the
    same text before or after, etc. But for that use case you already have
    better tools, like git, mercurial, etc.

    - The use case for a tool like this is sending VERSIONS of REALLY BIG
    files over REALLY SLOW communication links. Most of the time those files
    are going to be big archives, ISOs, filesystems, binaries and the like.

    So it could just work the following way (see the sketch after this list):
    1. Attempt the naive diff calculation approach first.
    2. ONLY if the NAIVE diff calculation renders POOR local data reuse
    (configurable %?), kick off the rsync-like advanced diff seek using the
    rolling checksums.
    3. Download the file as instructed by the diffs calculated in step 1 or 2.
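
    As a rough sketch of that control flow (the names, signatures and the
    threshold are placeholders I'm making up here, not the real slicesync
    API):

    package sketch

    // Diffs would describe which remote slices must be downloaded and how
    // much local data can be reused.
    type Diffs struct {
        ReuseRatio float64 // fraction of the remote file found locally
    }

    // smartDownload tries the cheap naive diff first and only falls back to
    // the rolling-checksum diff when local reuse is poor.
    func smartDownload(remote, local string,
        naiveDiff, rollingDiff func(remote, local string) (Diffs, error),
        download func(remote, local string, d Diffs) error,
        minReuse float64, // e.g. 0.5: below this, the naive diff is not worth it
    ) error {
        d, err := naiveDiff(remote, local) // 1. cheap in-place comparison first
        if err != nil {
            return err
        }
        if d.ReuseRatio < minReuse { // 2. poor reuse: pay for the rsync-like pass
            if d, err = rollingDiff(remote, local); err != nil {
                return err
            }
        }
        return download(remote, local, d) // 3. fetch only what the diff says
    }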

    And now the bad news...

    On my first download tests, it seems that calculating the hash while
    reconstructing the file makes the process take twice the time (12 min)
    compared to a plain full download with wget (6 min) on a 100Mbps link,
    which of course is not the proper use case. I have to see whether there
    is room for improvement (= I made a mistake somewhere) or I have reached
    a hard limit.
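
    One thing I still want to try (just a sketch of the idea, not tested
    code from the project): moving the SHA1 work off the write path into its
    own goroutine, so hashing overlaps with network and disk I/O instead of
    adding to it. Whether it helps depends on where the 12 minutes really go.

    package sketch

    import (
        "crypto/sha1"
        "io"
    )

    // asyncHashWriter forwards every Write to the underlying writer and,
    // concurrently, to a SHA1 hasher running in its own goroutine.
    type asyncHashWriter struct {
        w    io.Writer
        ch   chan []byte
        done chan []byte
    }

    func newAsyncHashWriter(w io.Writer) *asyncHashWriter {
        a := &asyncHashWriter{w: w, ch: make(chan []byte, 16), done: make(chan []byte)}
        go func() {
            h := sha1.New()
            for b := range a.ch {
                h.Write(b)
            }
            a.done <- h.Sum(nil)
        }()
        return a
    }

    func (a *asyncHashWriter) Write(p []byte) (int, error) {
        cp := make([]byte, len(p)) // copy, since the caller may reuse p before it is hashed
        copy(cp, p)
        a.ch <- cp
        return a.w.Write(p)
    }

    // Sum stops the hashing goroutine and returns the SHA1 of everything written.
    func (a *asyncHashWriter) Sum() []byte {
        close(a.ch)
        return <-a.done
    }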

    Jose

    On Tuesday, March 19, 2013 at 16:37:19 UTC+1, josvazg wrote:
    UPDATE 2:
    At work I can't observe the x2 or x3 performance penalty from shasum to
    my Go equivalent program. On the Linux box both times are really similar,
    and on the Windows box, compared to sha1sum.exe (not the same as shasum
    but I guess similar), there is a performance penalty, but more in the
    x1.1 or x1.2 range, not x2 or x3.

    At work both Go versions are 1.0.3. I will have to check at home (my
    Linux and Mac boxes); maybe those have older versions of Go or its
    libraries that would explain the performance penalties I observed.

    Jose

    On Tuesday, March 19, 2013 at 10:42:58 UTC+1, josvazg wrote:
    UPDATE:
    It seems I might have been wrong about the test results before; my memory
    is not that good... (I probably mixed results from different machines or
    tools)

    Today I had access to my original test data files and I can now confirm
    that the tests are consistent.

    Hashing the 4GB file locally with Go takes 5 minutes, and I am talking
    about just one pass, getting just one hash (slice = file size, equivalent
    to shasum but in Go, which we know is 2-3 times slower).
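
    (For reference, that single-pass hash boils down to something like the
    following; buffer sizes and error handling in the real code may differ:)

    package main

    import (
        "bufio"
        "crypto/sha1"
        "fmt"
        "io"
        "os"
    )

    // A shasum-like single pass: one SHA1 over the whole file in one read pass.
    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: sha1file <file>")
            os.Exit(1)
        }
        f, err := os.Open(os.Args[1])
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        defer f.Close()
        h := sha1.New()
        if _, err := io.Copy(h, bufio.NewReaderSize(f, 1<<20)); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Printf("%x  %s\n", h.Sum(nil), os.Args[1])
    }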

    Generating the diffs between a remote and a local 4GB file takes almost
    9 minutes and 20 seconds.

