Hi,

I have this code
[https://github.com/quatrilio/find-duplicated-files/blob/master/walk.go]
that simply calculates the hash (sha256sum) of each file in a specific
directory and outputs a JSON file with the results (adding other info such
as the size of each file).

I know it's CPU intensive (sha256), but I want to know if it's possible to
make it faster and have it consume less memory (it does not use much memory
as it is).

Can someone audit my code?

Thanks in advance,
Xan.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


  • Xan at Jul 16, 2013 at 4:11 pm
    Scanning the 21692 files in my home directory takes almost 14 minutes:

    $ ./walk --help
    Usage of ./walk:
       -output="./output.json": The JSON output file in which we save the results
       -path="./": Source of files
       -pattern="*": Pattern search expression of searching files
    [xan@gerret find-duplicated-files]$ time ./walk -path=/home/xan/
    Find Duplicated Files: Go Walk Hash Calculation...
         * Pattern: *
         * Route: /home/xan/
         * Output filename: ./output.json
    Written 21692 entries.

    real 13m53.147s
    user 8m12.507s
    sys 1m54.597s


  • Stephen Day at Jul 16, 2013 at 4:15 pm
    There are two techniques that immediately stand out as problematic:

    1. The output file is being opened multiple times. This is probably costly
    and definitely unnecessary. Implement VisitFile as a method on a struct
    with a file member or implement it as a closure in main so that the same
    Writer can be used on each call.

    2. When Message is written to the output file with io.WriteString,
    "content" is converted to a string, causing an unnecessary copy of the
    []byte returned by json.Marshal. Just call f.Write directly with the
    buffer returned by json.Marshal.

    I hope this gets you started!
  • Xan at Jul 17, 2013 at 3:22 pm
    Thanks for the comments.

    On Tuesday, July 16, 2013 18:15:07 UTC+2, Stephen Day wrote:
    > There are two techniques that immediately stand out as problematic:
    >
    > 1. The output file is being opened multiple times. This is probably costly
    > and definitely unnecessary. Implement VisitFile as a method on a struct
    > with a file member or implement it as a closure in main so that the same
    > Writer can be used on each call.

    Can you help me with that? How can I open the file once, before the main
    procedure (filepath.Walk), and keep it available like a global variable?

    > 2. When Message is written to the output file with io.WriteString method,
    > "content" is cast to a string, causing an unnecessary copy from the []byte
    > returned by json.Marshal. Just directly call f.Write with the buffer
    > returned by json.Marshal.

    The second point is implemented:
    https://github.com/quatrilio/find-duplicated-files/commits/master
  • Stephen Day at Jul 18, 2013 at 10:51 pm
    You can either implement the walk function as an anonymous function within
    the function in which you open the file, closing over the os.File variable,
    or implement your walk function as a method on a type containing the file.
  • Xan at Jul 19, 2013 at 2:01 pm
    What's the simplest (and most efficient) way to open the output JSON file
    only *once*? Can anyone give some code?

    Thanks,

  • Stephen Day at Jul 20, 2013 at 8:52 pm
    Here is a simple example that uses a
    closure: http://play.golang.org/p/5hj6BTZP-8.

    It just uses os.Stdout, but you can replace that with your output file.
    It runs in about 4 seconds on an i5 MacBook Air when walking the Go source
    tree.
  • Andrey mirtchovski at Jul 16, 2013 at 4:34 pm

    > I know it's CPU intensive (sha256) but I want to know if it's possible to be
    > more faster and consume less memory (it consumes fewer memory)

    it would be good if you could define what 'faster' and 'fewer memory'
    mean here. it's easy to make your program faster, it's also possible
    to make it generate less garbage (resulting in a smaller memory
    footprint) but it's hard to do both, especially if you set your
    targets too ambitiously. that said:

    - please use gofmt to make your code easier on the eyes.
    - you're not really using any of the concurrency primitives available
    in Go. sha256 calculation is cpu-bound, therefore you would do well to
    do more than one calculation at the same time. for that you may want
    to spin up goroutines (or hand off work to previously started goroutines)
    that independently calculate the hash for each file, then send the
    result to a collector goroutine

    it would help if we put it in perspective using the Go tree:
    - on my machine your code takes 4 seconds to hash all 12323 files (by
    the way, it reports that it has hashed 12324, but only 12323 appear in
    the output log)
    - your code takes, at maximum, 3.8MB of memory to do the job (as
    reported by /usr/bin/time -l on OSX):

    $ /usr/bin/time -l ./t -path=/Users/aam/go
    Find Duplicated Files: Go Walk Hash Calculation...
         * Pattern: *
         * Route: /Users/aam/go
         * Output filename: ./output.json
    Written 12324 entries.
             3.70 real 2.51 user 1.18 sys
        3776512 maximum resident set size

    - if you _only_ changed the buffer size on line 65 to something larger
    than 100 (8192 for example) that time drops to 2.5 seconds with only
    marginal memory consumption increase (you're much better off if you
    use io.Copy instead of doing this for loop by yourself, by the way):

    $ /usr/bin/time -l ./t -path=/Users/aam/go
    Find Duplicated Files: Go Walk Hash Calculation...
         * Pattern: *
         * Route: /Users/aam/go
         * Output filename: ./output.json
    Written 12324 entries.
             2.49 real 1.96 user 0.33 sys
        3981312 maximum resident set size


    from here on your best bet is using the concurrency primitives in Go.
    I wrote a similar program for a different purpose and banged it into
    shape just now to illustrate the benefits of concurrency. the program
    is here: http://mirtchovski.com/go/sha256.go. it allows you to vary
    the number of OS threads used, as well as the number of worker
    goroutines that read files. in its default form (1 worker, 1 cpu) it
    takes about 2.1 seconds and 4.3MB of memory:

    $ /usr/bin/time -l ./sha256 -workers=1 -cpus 1 ~/go > /dev/null
    2013/07/16 10:24:05 starting...
    2013/07/16 10:24:07 completed in 2.11673696s. 12323 hashed.
             2.12 real 1.91 user 0.21 sys
        4300800 maximum resident set size

    however when we crank up the concurrency to 16 workers on 8 cpus we
    lower the time taken to 0.65 seconds at the expense of 7MB memory
    footprint:

    $ /usr/bin/time -l ./sha256 -workers=16 -cpus=8 ~/go > /dev/null
    2013/07/16 10:28:17 starting...
    2013/07/16 10:28:18 completed in 646.008192ms. 12323 hashed.
             0.65 real 3.55 user 0.43 sys
        7094272 maximum resident set size

    hope this helps,
    andrey

    NB: all timings are done on hot caches

  • Xancorreu at Jul 17, 2013 at 3:18 pm

    On 16/07/13 18:34, andrey mirtchovski wrote:
    >> I know it's CPU intensive (sha256) but I want to know if it's possible to be
    >> more faster and consume less memory (it consumes fewer memory)

    First of all, thank you very much for this post. It is very useful for me.

    > - if you _only_ changed the buffer size on line 65 to something larger
    > than 100 (8192 for example) that time drops to 2.5 seconds with only
    > marginal memory consumption increase

    Done: https://github.com/quatrilio/find-duplicated-files/commits/master

    > (you're much better off if you
    > use io.Copy instead of doing this for loop by yourself, by the way):

    How can I achieve this:

        funcio.Reset() // reset the hash state
        // Read the file contents until EOF and feed them to the hash.
        file, err := os.Open(fp) // open for read access
        if err != nil {
            log.Fatal(err)
        }
        defer file.Close()
        data := make([]byte, 8192)
        for {
            count, err := file.Read(data)
            if count > 0 {
                funcio.Write(data[:count]) // update the SHA-256 state with this chunk
            }
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
        }

    with io.Copy?

    I will read your code and try to adapt it to my case. I'm a newbie in Go.

    Thanks,
    Xan.

  • Andrey mirtchovski at Jul 17, 2013 at 3:24 pm

    > How can I achieve this:
    > funcio.Reset() //Reset hash buffer
    > // Read file contents until EOF and add these contents to hash buffer
    > file, err := os.Open(fp) // For read access.
    > if err != nil {
    >     log.Fatal(err)
    > }

    here, instead of the read/write loop, you would put:

          io.Copy(funcio, file)

    and handle errors appropriately. you can defer file.Close() before the
    call to io.Copy.

  • Xancorreu at Jul 18, 2013 at 8:18 am

    On 17/07/13 17:24, andrey mirtchovski wrote:
    > here, instead of the read/write loop, you would put:
    >
    > io.Copy(funcio, file)
    >
    > and handle errors appropriately. you can defer file.Close() before the
    > call to io.Copy.

    And what happens with funcio.Write? How does the hash state get updated?

    Thanks,
    Xan.

  • Xancorreu at Jul 18, 2013 at 8:31 am

    On 17/07/13 17:24, andrey mirtchovski wrote:
    > here, instead of the read/write loop, you would put:
    >
    > io.Copy(funcio, file)
    >
    > and handle errors appropriately. you can defer file.Close() before the
    > call to io.Copy.

    It gives me an error:

    $ go build walk.go
    # command-line-arguments
    ./walk.go:95: no new variables on left side of :=
    ./walk.go:95: cannot assign int to n (type int64) in multiple assignment

    See
    https://github.com/quatrilio/find-duplicated-files/blob/cf613f9a990dac245b4d530de70c96649c0899b4/walk.go

    Thanks,
    Xan.

  • Alex at Jul 18, 2013 at 3:14 pm
    The error is rather descriptive. The problem is that you had already
    declared a variable named n on line 68, which types n as an int64. Then on
    line 95 you used :=, which you cannot do when every variable on the left
    side is already declared in that scope. However, changing it to = wouldn't
    work either, as Write returns an int, yet n has been typed as int64,
    remember. I'd just rename the variable 'n' on lines 95 and 96 to something
    else and leave the :=.

    Good Luck,
    Ales
  • Xan at Jul 18, 2013 at 7:14 pm
    Thanks. I was confused.

    On the other hand,

    find . -type f -exec sha256sum {} + > find.txt

    is faster than my program. Can I modify find to do what I do with my
    program?

    Would it use more or less memory (and the same question for speed)?


Discussion Overview
group: golang-nuts
categories: go
posted: Jul 16, '13 at 3:40p
active: Jul 20, '13 at 8:52p
posts: 14
users: 4
website: golang.org
