Hello,

Memory usage by the gc compiler series is always a problem for
small-memory ARM devices. For example, compiling the Go files in the net
package with 5g takes approximately 108MB of RAM. Last night I
annotated gc/subr.c:mal to try to figure out where all the memory was
going. Here are some of the details. The struct types, if known, are
listed to the right.

dfc@qnap:~/go/src/pkg/net$ sort /tmp/mal.txt | uniq -c | sort -nr | head -n 20
272852 mal 216 // Node
163507 mal 120 // Type
137702 mal 12 // NodeList ?
106986 mal 8 // Strlit / Val ?
97424 mal 128
51894 mal 68 // Mpint
33183 mal 40 // Sym
4917 mal 13
1549 mal 16
975 mal 7
961 mal 224
817 mal 14
802 mal 11
765 mal 10
676 mal 9
561 mal 5
537 mal 15
533 mal 6
432 mal 17
405 mal 32

The biggest cost is the 272,852 Node structures, which consume in excess
of 50MB on their own; after that, the other main offenders do their
share, including 8MB of Strlit structures. Obviously any saving, no
matter how small (alignment considered), will have a large payoff for
all the compilers.

Interestingly, I applied https://codereview.appspot.com/6650054/ and
found no improvement, possibly because bitfields are not implemented
or offer no advantage under gcc on ARM (I tried both -Os and -O3).

Suggestions warmly welcomed.

Cheers

Dave


  • Robert Griesemer at Dec 7, 2012 at 4:29 pm
    Somebody familiar with the gc compiler should probably answer this,
    but superficially it appears that Nodes are a "poor man's" union in
    the sense that they hold the fields for all possible uses of a Node,
    but I suspect that for any given Node only a few fields are used at a
    time.

    It probably would take some clever refactoring but I wouldn't be
    surprised if the amount of memory used by nodes could be reduced by a
    factor of 2 or more (for instance by moving the fields used only by
    functions into a different struct, and doing similar regroupings for
    other fields as well).

    Russ will know better.

    - gri
  • Russ Cox at Dec 7, 2012 at 4:49 pm
    I'd love to see this get fixed. I think we need a little more data. We
    know we have lots of Nodes, but the next step would be to know what
    kind of Nodes we have. If you can run the compiler using the C++
    version of pprof (Linux/x86-64; http://code.google.com/p/gperftools;
    use the LD_PRELOAD mode) then you can do pprof --svg to generate an
    SVG version of the call graph. That will tell us what allocates Nodes,
    which works as a proxy for what kinds of Nodes they are.

    Thanks.
    Russ
  • Daniel Morsing at Dec 7, 2012 at 5:16 pm

    This won't give good results, since the Node structures are allocated
    from a pool via the mal() function.

    Every C standard library is probably going to have a pretty well
    optimized malloc implementation, so I don't know if there are any
    advantages to keeping mal() around.

    Regards,
    Daniel Morsing
  • Russ Cox at Dec 7, 2012 at 5:35 pm

    > This won't give good results since the Node structures are allocated
    > from a pool via the mal() function.
    Ah yes. To do this test you'd have to make mal call malloc for each allocation.
    > Every C standard library is probably going to have a pretty well
    > optimized malloc implementation, so I don't know if there are any
    > advantages to keeping mal() around.
    Try it and see. I'm not as optimistic as you are. Measure both total
    process memory usage and run time.

    Russ
  • Dave Cheney at Dec 7, 2012 at 9:15 pm

    > > This won't give good results since the Node structures are allocated
    > > from a pool via the mal() function.
    > Ah yes. To do this test you'd have to make mal call malloc for each allocation.
    I'll find a way to annotate these and report back.
    > > Every C standard library is probably going to have a pretty well
    > > optimized malloc implementation, so I don't know if there are any
    > > advantages to keeping mal() around.
    > Try it and see. I'm not as optimistic as you are. Measure both total
    > process memory usage and run time.
    I have in the past experimented with replacing mal with malloc, etc.,
    and found no improvement in memory usage. This was on a 6g machine, so
    the effect of the malloc overhead was not easily observable (net
    compilation time on amd64 is on the order of 400ms, but close to 10x
    longer on 5g). Mal is very efficient, but could probably be made a
    little more swap-friendly if NHUNK were larger, say 1MB.
  • Dave Cheney at Dec 8, 2012 at 12:12 am
    I've hacked cmd/gc to use calloc instead of mal and am getting some
    results (under 8g):
    Total: 74.3 MB
    37.7 50.7% 50.7% 37.7 50.7% nod
    16.7 22.5% 73.2% 16.7 22.5% typ
    7.4 9.9% 83.1% 7.4 9.9% clearp (inline)
    5.4 7.3% 90.4% 12.7 17.1% prog
    3.3 4.5% 94.9% 3.3 4.5% _yylex (inline)
    1.1 1.5% 96.4% 1.1 1.5% pkglookup
    0.7 1.0% 97.3% 0.7 1.0% list1
    0.5 0.6% 98.0% 0.5 0.6% nodconst
    0.4 0.6% 98.6% 0.4 0.6% push
    0.2 0.3% 98.9% 0.2 0.3% convconst
    0.2 0.3% 99.1% 0.2 0.3% remal
    0.2 0.3% 99.4% 0.2 0.3% addmove
    0.2 0.2% 99.6% 0.2 0.2% rega
    0.1 0.1% 99.7% 0.1 0.1% typecheckdef

    Can anyone with more perftools fu recommend some way of getting the
    callers of nod/typ/prog?

    CL for the above: https://codereview.appspot.com/6900053

    Cheers

    Dave
  • Dave Cheney at Dec 8, 2012 at 1:44 am
    After a bit of fiddling (-O0 is required to get a full stack trace), PTAL


  • Rémy Oudompheng at Dec 7, 2012 at 8:05 pm

    A very large number of Nodes are created and immediately discarded
    during parsing of imports. A significant part of that happens during
    the import of methods: whenever a package's exported symbol references
    a type from another package, that type's definition and methods are put
    in the package's export data.

    Since many packages reference, for example, time.Time or big.Int
    (notably the crypto packages, and big.Int in turn refers to big.nat),
    the method signatures for these types are imported many times, and they
    are very Node-hungry. Many of these nodes are immediately discarded
    because the type definition has actually already been imported.

    Rémy.
  • Rémy Oudompheng at Dec 8, 2012 at 2:31 pm

    Hello,

    I have written down two ideas that reduce the number of Nodes:
    * CL6905055: introduce a new type Field (and associated FieldList) for
    use by the parser, notably to denote arguments or imported functions.
    This new type Field is much smaller than a Node.
    https://codereview.appspot.com/6905055/
    * CL6902064: recycles ONAME nodes that are generated in each call to regopt().
    https://codereview.appspot.com/6902064/

    Using 6g on an amd64 machine to compile net/http uses up:
    * 118MB resident memory before CL6856126
    * 103MB resident memory after CL6856126 (already submitted)
    * 93MB resident memory after CL6905055
    * 89MB resident memory after CL6902064.

    I'd be interested in feedback on these ideas.

    Rémy.
  • Rémy Oudompheng at Dec 8, 2012 at 4:58 pm

    On 2012/12/8 Rémy Oudompheng wrote:
    > Using 6g on an amd64 machine to compile net/http uses up:
    > * 118MB resident memory before CL6856126
    > * 103MB resident memory after CL6856126 (already submitted)
    > * 93MB resident memory after CL6905055
    > * 89MB resident memory after CL6902064.
    I have updated the latter CL so that it goes down to 72MB by saving
    unnecessary work on duplicate imported functions/methods.

    Rémy.
  • Luuk van Dijk at Dec 8, 2012 at 8:52 pm
    I had a minimal CL along the same lines (reusing nodes from discarded
    method imports):

    https://codereview.appspot.com/6906056

    I haven't had a chance to benchmark it yet.

    /L

  • Luuk van Dijk at Dec 8, 2012 at 8:52 pm
    Hm, mine doesn't do much yet because $$ isn't connected to very much.
    I'll look closer later.

  • Luuk van Dijk at Dec 8, 2012 at 8:52 pm
    For the record, from build -a in net/http, these are the potential
    savings of addmethod() calls; e.g. for package net, only 208 out of
    3437 addmethod() calls are first ones.

    Maybe it would be even more efficient to skip over all import lines for
    symbols that are in already-imported packages, and forgo the
    consistency check. That would require some reorganisation of the lexer.

    0/0 crypto
    4/4 crypto/aes
    0/0 crypto/hmac
    1/1 crypto/rc4
    3/3 crypto/sha1
    3/3 crypto/sha256
    0/0 crypto/subtle
    0/0 errors
    0/0 math
    0/0 runtime
    0/0 runtime/cgo
    0/0 sort
    1/1 strconv
    0/0 sync
    0/0 sync/atomic
    0/0 unicode
    0/0 unicode/utf8
    56/56 bufio
    25/47 bytes
    21/21 crypto/cipher
    20/20 crypto/des
    12/12 crypto/md5
    22/22 encoding/base64
    56/56 encoding/hex
    81/81 encoding/pem
    21/21 hash
    16/16 hash/crc32
    17/18 io
    16/16 math/rand
    62/65 net/url
    15/29 path
    26/36 reflect
    25/46 strings
    25/84 syscall
    60/102 compress/flate
    79/137 math/big
    55/145 time
    138/203 compress/gzip
    144/144 crypto/dsa
    152/249 crypto/ecdsa
    154/281 crypto/elliptic
    310/404 crypto/rand
    147/388 crypto/rsa
    177/177 crypto/x509/pkix
    348/739 encoding/asn1
    130/152 encoding/binary
    258/510 fmt
    201/420 io/ioutil
    160/212 log
    201/249 mime
    340/832 mime/multipart
    324/498 net/textproto
    117/912 os
    186/480 path/filepath
    474/3084 crypto/tls
    318/1694 crypto/x509
    208/3437 net
    694/3540 net/http


  • Dave Cheney at Dec 8, 2012 at 11:03 pm
    Thank you to Rémy and Luuk, these are awesome suggestions. I've also
    been working on trimming the Node and Type structures themselves; here
    is one suggestion:

    https://codereview.appspot.com/6868088

    Cheers

    Dave
  • Rémy Oudompheng at Dec 8, 2012 at 11:51 pm

    Widths of types and array lengths can overflow 32-bit integers on
    64-bit architectures, so this change doesn't seem to be correct.

    Rémy.
  • Dave Cheney at Dec 8, 2012 at 11:52 pm
    Yes, you are correct, but using vlongs on 32-bit arches is wasteful.
    Can I use intptr here as an arch-specific width type?

  • Rémy Oudompheng at Dec 9, 2012 at 12:12 am

    If you do something like this, it can be arch-specific with respect to
    the target (32-bit on 5g, 8g, 64-bit on 6g), but not with respect to
    the host architecture. It should be possible to use 6g on an ARM host,
    and have 64-bit widths anyway.

    Rémy.
  • Russ Cox at Dec 9, 2012 at 4:20 pm
    Probably the lowest hanging fruit would be to process each package
    import just once during a compilation. Right now if two files in
    package p import "os" the import data gets parsed into memory twice
    during the compilation of p, one for each import statement. If the
    first import filled a cache mapping import path to list of imported
    symbols (can just make up a fake symbol and hang it off of the Sym*),
    the second import could walk that list declaring those names instead
    of rereading the .a file. It's possible that having Pkg* means we
    don't even need a list, but it's been a while since I looked at those
    data structures. That will reduce I/O in addition to memory usage.

    Renamed or dot imports may or may not require a little extra effort;
    it would be fine to get it working with unrenamed imports first.

    Another, small optimization: for import _, the import could just skip
    the parsing entirely: the only thing it needs to know is whether there
    is a metadata line containing:

    func @"".init()

    Another, larger optimization: when importing a package we've not
    imported before, we may still have seen pieces in the export data from
    other packages. If the lexer during imports pulled in one line at a
    time from the file and checked the symbol name in the prefix (var
    @"".Name or whatever) in the symbol table, it could throw away lines
    that describe already-known symbols. The code would have to be
    slightly smart, in that when it discards a type it needs to discard
    the method lines that follow too, if any. The reader should use Brdstr
    not Brdline, so as to handle input lines larger than bufio's block
    size.

    Russ
  • Rémy Oudompheng at Dec 9, 2012 at 7:04 pm

    On 2012/12/9 Russ Cox wrote:
    > Probably the lowest hanging fruit would be to process each package
    > import just once during a compilation. Right now if two files in
    > package p import "os" the import data gets parsed into memory twice
    > during the compilation of p, one for each import statement.
    The difference is impressive (on amd64, compiling the net package:
    153MB->68MB; net/http: 98MB->65MB).
    > If the first import filled a cache mapping import path to list of
    > imported symbols (can just make up a fake symbol and hang it off of
    > the Sym*), the second import could walk that list declaring those
    > names instead of rereading the .a file. It's possible that having
    > Pkg* means we don't even need a list, but it's been a while since I
    > looked at those data structures. That will reduce I/O in addition to
    > memory usage.
    Symbols are already put in the right places (it seems), so there is
    essentially nothing to do.
    See https://codereview.appspot.com/6903059

    Maybe Dave has some nice measurements to do.

    Rémy.
  • Dave Cheney at Dec 9, 2012 at 8:36 pm
    Wow. This is looking really good. pkg/net dropped from 95MB to 43MB.
    I'm bringing my 386 machine up to the latest code and patches and will
    produce some SVGs soon. I'll also report some figures from juju, where
    the list of Deps is growing rapidly.

    Thank you for such an enthusiastic response; fewer calls to mal mean
    fewer calls to memset, which means faster and smaller compilations for
    everyone. This doesn't just benefit tiny rubber-band-powered arm5
    machines. For example, the Nexus 7 is a 4-core armv7 host with a GB of
    RAM; even with most of the desktop shut down you'd be lucky to get half
    of that for user-space programs, so these memory reductions help
    current hardware use all the cores available to it.

    Cheers

    Dave

  • Dave Cheney at Dec 9, 2012 at 9:01 pm
    Compare this graph to the one posted several days ago (q.svg), the
    results are striking.

    valgrind reports the number of allocations is roughly half the original value

    ==17008== HEAP SUMMARY:
    ==17008== in use at exit: 44,971,011 bytes in 419,610 blocks
    ==17008== total heap usage: 456,943 allocs, 37,333 frees, 47,316,991
    bytes allocated

  • Dave Cheney at Dec 9, 2012 at 10:54 pm
    Some folks contacted me off-list as they were having difficulties
    reading the SVGs. Please try these temporary links; I've found that
    Chrome works well.

    http://rc.cheney.net/before.svg

    http://rc.cheney.net/6903059.svg
  • Anthony Starks at Dec 13, 2012 at 3:19 am
    Here's a comparison of 5g and 5l build times of godoc on the Raspberry
    Pi (512MB version).

    pi@raspberrypi ~/go/src/cmd/godoc $ go version
    go version devel +afac768ad2fe Wed Dec 12 21:38:52 2012 +1100 linux/arm

    pi@raspberrypi ~/go/src/cmd/godoc $ /usr/bin/time -v
    /home/pi/go/pkg/tool/linux_arm/5g -o $WORK/cmd/godoc/_obj/_go_.5 -p
    cmd/godoc -D _/home/pi/go/src/cmd/godoc -I $WORK ./codewalk.go
    ./dirtrees.go ./filesystem.go ./format.go ./godoc.go ./index.go ./main.go
    ./parser.go ./play-local.go ./play.go ./snippet.go ./spec.go ./template.go
    ./throttle.go ./utils.go ./zip.go

    Command being timed: "/home/pi/go/pkg/tool/linux_arm/5g -o
    /tmp/go-build279461654/cmd/godoc/_obj/_go_.5 -p cmd/godoc -D
    _/home/pi/go/src/cmd/godoc -I /tmp/go-build279461654 ./codewalk.go
    ./dirtrees.go ./filesystem.go ./format.go ./godoc.go ./index.go ./main.go
    ./parser.go ./play-local.go ./play.go ./snippet.go ./spec.go ./template.go
    ./throttle.go ./utils.go ./zip.go"

    User time (seconds): 4.02
    System time (seconds): 0.27
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.30
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 51984
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 13058
    Voluntary context switches: 1
    Involuntary context switches: 96
    Swaps: 0
    File system inputs: 0
    File system outputs: 3096
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0


    ...

    pi@raspberrypi ~/go/src/cmd/godoc $ /usr/bin/time -v
    /home/pi/go/pkg/tool/linux_arm/5l -o $WORK/cmd/godoc/_obj/a.out -L $WORK
    $WORK/cmd/godoc.a

    Command being timed: "/home/pi/go/pkg/tool/linux_arm/5l -o
    /tmp/go-build279461654/cmd/godoc/_obj/a.out -L /tmp/go-build279461654
    /tmp/go-build279461654/cmd/godoc.a"

    User time (seconds): 9.49
    System time (seconds): 0.81
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.33
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 113816
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 28643
    Voluntary context switches: 1
    Involuntary context switches: 243
    Swaps: 0
    File system inputs: 0
    File system outputs: 13976
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0



Discussion Overview
group: golang-dev
categories: go
posted: Dec 7, '12 at 9:39a
active: Dec 13, '12 at 3:19a
posts: 24
users: 7
website: golang.org
