Hello all!

I'm currently working on a Go library that ingests a large corpus of text
(on the order of tens of GBs) and outputs a CSV file that maps each unique
word to a unique index. I've been developing it on a late 2009 MacBook Pro
running Mac OS X 10.9 with 4 GB of RAM, and was able to do some significant
memory optimization. However, when I ran the same code on an Ubuntu machine
(24 GB RAM), the memory usage blew up and the program crashed. I ran the
memory profiler available through Go's "runtime" library on both machines to
compare the output:
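
For reference, here is a minimal sketch of one way to produce heap profiles
like the ones below, using the standard runtime/pprof package (the test
script linked further down may wire this up differently):

    import (
        "os"
        "runtime/pprof"
    )

    // writeHeapProfile dumps the current heap allocations to path; the file
    // can then be inspected with `go tool pprof <binary> <path>` and the
    // `top` command, which produces listings like the ones below.
    func writeHeapProfile(path string) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close()
        return pprof.WriteHeapProfile(f)
    }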

Mac OS X 10.9:

(pprof) top
Total: 50.5 MB
     33.0  65.3%  65.3%     34.5  68.3%  github.com/gtfierro/tokenizer.process
     15.0  29.7%  95.0%     15.5  30.7%  github.com/gtfierro/tokenizer.tokenize
      1.5   3.0%  98.0%      1.5   3.0%  evacuate
      0.5   1.0%  99.0%      0.5   1.0%  cnew
      0.5   1.0% 100.0%      0.5   1.0%  newdefer

This memory usage is consistent with the "runtime" CPU profiler output,
which shows time being spent converting []byte to string in the
tokenizer.tokenize method and building the map in tokenizer.process. In the
Ubuntu output below, about 2.5 GB of memory is taken up by the bytes.Replace
method, followed by the []byte <-> string conversion in tokenizer.tokenize.
It doesn't seem like Go is garbage collecting when the code is run on
Ubuntu: running with GOGCTRACE=1, the Ubuntu machine takes up more and more
memory, while OSX regularly discards the unneeded []byte/string allocations
as the program progresses.

Ubuntu 12.04:

(pprof) top
Total: 4739.5 MB
   2522.0  53.2%  53.2%   2522.0  53.2%  bytes.Replace
   2191.0  46.2%  99.4%   2191.0  46.2%  github.com/gtfierro/tokenizer.tokenize
     18.5   0.4%  99.8%     18.5   0.4%  github.com/gtfierro/tokenizer.process
      6.5   0.1% 100.0%      6.5   0.1%  runtime.malg
      1.0   0.0% 100.0%      1.0   0.0%  runtime.deferproc
      0.5   0.0% 100.0%   4714.5  99.5%  github.com/gtfierro/tokenizer.deliver
      0.0   0.0% 100.0%      6.5   0.1%  github.com/gtfierro/tokenizer.CreateDict
      0.0   0.0% 100.0%      6.5   0.1%  main.main
      0.0   0.0% 100.0%      6.5   0.1%  runtime.main
      0.0   0.0% 100.0%      6.5   0.1%  runtime.newproc
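
To make the profile a bit more concrete, the pattern involved looks roughly
like the sketch below (hypothetical names, not the actual library code):
every bytes.Replace call returns a freshly allocated copy of its input, and
every string(tok) conversion generally copies the token, whether or not that
word has been seen before.

    import "bytes"

    // simplified sketch of the allocating pattern; names are hypothetical
    func tokenize(line []byte, words chan<- string) {
        // bytes.Replace always returns a newly allocated copy of line
        line = bytes.Replace(line, []byte("\r\n"), []byte(" "), -1)
        for _, tok := range bytes.Fields(line) {
            // string(tok) allocates a new string for every token,
            // even for words that have already been seen
            words <- string(tok)
        }
    }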


The test script I'm running is available
at http://play.golang.org/p/CUM90ZH25o, and the library I'm developing is
at https://github.com/gtfierro/tokenizer (the relevant code is in dict.go
and tokenizer.go).

I'm curious as to why the memory usage is so different. Is this because of
how Ubuntu handles memory vs how OSX handles memory? Is this an issue with
the Go compiler? Or is there something drastically wrong with my code?

Any help would be greatly appreciated, and I'm happy to run additional
benchmarks and the like.

Thanks!

Gabe


  • Dave Cheney at Oct 25, 2013 at 9:57 pm
    Are you using the same version of Go on both platforms?

    I cannot reproduce the issue:

    lucky(~/src) % /usr/bin/time -v ./pro /usr/share/dict/words
    /usr/share/dict/words
    Creating token dictionary
    Finished reading input file /usr/share/dict/words
    Finished creating token dictionary with 83837 items
             Command being timed: "./pro /usr/share/dict/words"
             User time (seconds): 16.11
             System time (seconds): 2.68
             Percent of CPU this job got: 175%
             Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.70
             Average shared text size (kbytes): 0
             Average unshared data size (kbytes): 0
             Average stack size (kbytes): 0
             Average total size (kbytes): 0
             Maximum resident set size (kbytes): 96348
             Average resident set size (kbytes): 0
             Major (requiring I/O) page faults: 0
             Minor (reclaiming a frame) page faults: 184652
             Voluntary context switches: 555525
             Involuntary context switches: 3691
             Swaps: 0
             File system inputs: 0
             File system outputs: 2464
             Socket messages sent: 0
             Socket messages received: 0
             Signals delivered: 0
             Page size (bytes): 4096
             Exit status: 0
  • Gabe Fierro at Oct 25, 2013 at 10:37 pm
    I realize now that checking the versions should have been one of the first
    things that occurred to me... thanks for the reminder, though! I only had
    go1 installed on my Ubuntu machine, compared with go1.1.2 on OSX. I now
    have go1.1.2 on the Ubuntu machine, and the memory usage and general
    performance have improved, but still aren't on par with OSX. Also, it
    seems more of the memory is being taken up by "cnew" now, which doesn't
    show up in my OSX memory benchmarks at all.

    Ubuntu 12.04 running go1.1.2:

    (pprof) top
    Total: 5899.5 MB
       4947.0  83.9%  83.9%   4947.0  83.9%  cnew
        810.5  13.7%  97.6%   2213.0  37.5%  github.com/gtfierro/tokenizer.tokenize
         66.5   1.1%  98.7%     72.0   1.2%  github.com/gtfierro/tokenizer.process
         57.5   1.0%  99.7%     57.5   1.0%  newdefer
          9.5   0.2%  99.9%      9.5   0.2%  runtime.malg
          5.5   0.1%  99.9%      5.5   0.1%  evacuate
          2.5   0.0% 100.0%   5817.0  98.6%  github.com/gtfierro/tokenizer.deliver
          0.5   0.0% 100.0%      0.5   0.0%  runtime.parforalloc
          0.0   0.0% 100.0%      0.5   0.0%  bufio.(*Reader).ReadBytes

    For reproducing the issue, though: the program works fine until the input
    size gets too large. On my Ubuntu machine running go1, it dies processing
    an input file of 3 GB or more. I have a 7zipped file in an S3 bucket here:
    http://fungpatdownloads.s3.amazonaws.com/200000.7z. The benchmark above
    was run using this file. With go1.1.2, it can handle that size fine. I'm
    running benchmarks on larger files now.
  • Dave Cheney at Oct 25, 2013 at 10:39 pm
    For Ubuntu Precise I recommend using the 1.1.2 or 1.2rc2 installer
    from the download area of the golang.org website.
  • Dave Cheney at Oct 26, 2013 at 12:03 am
    Your application is creating millions of goroutines.

    goroutine 3438 [chan send]:
    github.com/gtfierro/tokenizer.deliver(0xc2211f72a0, 0x54, 0x54)
             /home/dfc/src/github.com/gtfierro/tokenizer/dict.go:62 +0xdf8
    created by github.com/gtfierro/tokenizer.CreateDict
             /home/dfc/src/github.com/gtfierro/tokenizer/dict.go:92 +0x18f

    Goroutines are cheap, but they are not free; unless those goroutines
    can be serviced faster than they are created, your memory usage will
    bloat.
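
    One common idiom for bounding this is a buffered channel used as a
    counting semaphore, so that only a fixed number of deliver goroutines
    are in flight at once. A minimal sketch (the chunk type and call shape
    are hypothetical, not the actual tokenizer code):

        // bound the number of in-flight goroutines with a buffered channel
        // acting as a counting semaphore (names are hypothetical)
        func fanOut(chunks [][]byte, deliver func([]byte)) {
            const maxInFlight = 64
            sem := make(chan struct{}, maxInFlight)
            for _, chunk := range chunks {
                sem <- struct{}{} // blocks once maxInFlight goroutines exist
                go func(c []byte) {
                    defer func() { <-sem }()
                    deliver(c)
                }(chunk)
            }
            // fill the semaphore to wait for the remaining goroutines
            for i := 0; i < maxInFlight; i++ {
                sem <- struct{}{}
            }
        }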
  • Gabe Fierro at Oct 27, 2013 at 9:34 pm
    With the new compiler, I was able to see that a large amount of memory
    was being used by the string <-> []byte conversions, so I'm fixing those
    up. I'll also look into common idioms for limiting the number of
    goroutines created.
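
    For the string <-> []byte side, one possible shape for that fix (a sketch
    with hypothetical names, not the final code) is to intern each unique
    word in the dictionary map, so the conversion is paid once per distinct
    word rather than once per occurrence:

        // look up tokens by []byte and keep at most one string per unique
        // word in the dictionary (hypothetical names)
        type dict struct {
            index map[string]int
        }

        func (d *dict) add(tok []byte) int {
            // the string(tok) in the lookup may still allocate, but only
            // one string per unique word is retained in the map
            if id, ok := d.index[string(tok)]; ok {
                return id
            }
            id := len(d.index)
            d.index[string(tok)] = id
            return id
        }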

    Thanks so much for your help!
