Sorry about this total noob post, but the "contributor" docs say pretty
clearly to first post here...

I am interested in changing the memory layout of structs in Go. The layout
defined in C is nearly as bad as it gets when it comes to high speed memory
intensive applications. Your inner loops typically access 2 or 3 fields of
any given object, and the rest of the object's fields clobber otherwise
potentially useful data in the cache. If the default memory layout instead
groups the same field from all instances of a struct type together, average
cache performance improves greatly (my place and route tools run 60% faster!).
To make this
work, pointers are implemented under the hood as regular ints, and used to
index into these arrays of properties. I've developed C code with this
method for over 20 years, and it always produces far faster algorithms.
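
For illustration, here is a rough Go-flavored sketch of the layout I mean
(the names are invented, and this is not how Go lays out structs today): each
field lives in its own array, and an object is referred to by its index.

    // Conventional layout: one struct per object, fields interleaved in memory.
    type Cell struct {
        X, Y          float64
        Width, Height float64
        Name          string
        Fixed         bool
    }

    // Peeled layout: one slice per field, so all instances of a field are
    // contiguous. A "pointer" to a cell is just an index.
    type CellRef uint32

    var (
        cellX, cellY          []float64
        cellWidth, cellHeight []float64
        cellName              []string
        cellFixed             []bool
    )

    // An inner loop that only touches X and Y streams through two dense
    // arrays instead of dragging every other field into the cache.
    func totalPosition(cells []CellRef) (sum float64) {
        for _, c := range cells {
            sum += cellX[c] + cellY[c]
        }
        return sum
    }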

For my implementation of a data structure generator for C that makes use
of this layout, see DataDraw:

     http://datadraw.sourceforge.net/

I assume that Go structs today are implemented as C structs. That's too
bad, because this simple change under the hood could make Go faster than C
in many memory intensive applications.

Is this something I should be encouraged or discouraged from pursuing?

Thanks,
Bill



  • Ian Lance Taylor at Jan 21, 2015 at 3:02 am

    On Tue, Jan 20, 2015 at 6:39 PM, Bill Cox wrote:
    I am interested in changing the memory layout of structs in Go. The layout
    defined in C is nearly as bad as it gets when it comes to high speed memory
    intensive applications. Your inner loops typically access 2 or 3 fields of
    any given object, and the rest of the object's fields clobber otherwise
    potentially useful data in the cache. If the default memory layout is to
    group fields of the same struct type together, average cache performance
    improves greatly (my place and route tools run 60% faster!). To make this
    work, pointers are implemented under the hood as regular ints, and used to
    index into these arrays of properties. I've developed C code with this
    method for over 20 years, and it always produces far faster algorithms.

    For my implementation of a data structure generator for C that makes use of
    this layout, see DataDraw:

    http://datadraw.sourceforge.net/
    It sounds like you are suggesting an optimization that applies to
    applications that write loops over large slices of structs where the
    loops only access a few fields from each struct. For such an
    application, it's better to split up the structs so that the fields
    accessed by the loop are contiguous in memory. This allows the
    application to fit more structs into a single cache line, and
    therefore the loop runs faster.
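
    As a concrete, hypothetical illustration of that transformation (invented
    types):

    // Before: the loop drags Name and Metadata through the cache as well.
    type Body struct {
        X, Y, Z  float64
        Name     string
        Metadata [64]byte
    }

    // After: the hot fields live in their own slice, parallel to the cold ones.
    type BodyHot struct{ X, Y, Z float64 }
    type BodyCold struct {
        Name     string
        Metadata [64]byte
    }

    func translate(hot []BodyHot, dx float64) {
        for i := range hot {
            hot[i].X += dx // only hot data enters the cache
        }
    }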

    It's possible that I have completely misunderstood, in which case
    please correct me. I did not understand what datadraw has to do with
    this, as datadraw seems to be a much higher level program.

    I assume that Go structs today are implemented as C structs. That's too
    bad, because this simple change under the hood could make Go faster than C
    in many memory intensive applications.

    Is this something I should be encouraged or discouraged from pursuing?
    What exactly do you propose to do?

    Ian

  • Bill Cox at Jan 21, 2015 at 4:18 pm

    On Tue, Jan 20, 2015 at 7:02 PM, Ian Lance Taylor wrote:

    It sounds like you are suggesting an optimization that applies to
    applications that write loops over large slices of structs where the
    loops only access a few fields from each struct. For such an
    application, it's better to split up the structs so that the fields
    accessed by the loop are contiguous in memory. This allows the
    application to fit more structs into a single cache line, and
    therefore the loop runs faster.
    You are correct. This is exactly what I am proposing, but I would make the
    default layout put each field in its own separate slice. This results in far
    better average cache performance. Also, you don't want two kinds of
    pointers hanging around to deal with two slices of the same object. With
    this method, the object reference is its array index rather than a pointer.

    For even better performance, we could optimize the memory layout based on
    the inner loop code seen in applications. This would require a late-stage
    global optimization step, but it would be a pretty major speed enhancer for
    memory intensive applications.

    The reason to go all the way to arrays of individual fields is that our
    CPUs have fast indexed indirect addressing when the element size matches an
    addressing mode. If objects were, say, 391 bytes each, the compiler would
    have to introduce a 3-5 cycle latency integer multiply to compute each
    address. In some cases, such as red-black trees or any case where you know
    you will access 2 to 4 fields together in a loop, those fields should be
    packed into a small struct whose size still avoids the multiply.

    Another significant improvement is that you can use a right-sized integer
    to reference objects, rather than 64-bit pointers. I never once needed a
    64-bit integer in my EDA applications, though at Google scale, some
    applications will probably have over 4 billion objects of a given type.
    This not only saves considerable memory, but it also improves cache
    performance.
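
    To make that concrete, here is a rough sketch (invented names, with index 0
    reserved as the nil reference) of how a red-black-tree-style node could be
    laid out this way:

    type NodeRef uint32 // a 4-byte reference instead of an 8-byte pointer

    // The fields walked together on every traversal step are packed into one
    // small record. Scaling an index by its 12-byte stride costs a shift and
    // an add rather than an integer multiply, and five of these fit in a
    // 64-byte cache line.
    type nodeLinks struct {
        left, right NodeRef
        red         bool
    }

    var (
        links []nodeLinks // hot: touched on every traversal step
        key   []int64     // warm: touched on every comparison
        value []string    // cold: touched only on a successful lookup
    )

    // Search touches only links and key; value stays out of the cache.
    func search(root NodeRef, k int64) (NodeRef, bool) {
        for n := root; n != 0; {
            switch {
            case k < key[n]:
                n = links[n].left
            case k > key[n]:
                n = links[n].right
            default:
                return n, true
            }
        }
        return 0, false
    }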

    It's possible that I have completely misunderstood, in which case
    please correct me. I did not understand what datadraw has to do with
    this, as datadraw seems to be a much higher level program.

    DataDraw is a code generator, similar to what we use for protocol buffers.
    It generates all the access functions needed to hide how object memory is
    referenced under the hood. All memory access is done through inline
    function calls. This introduces zero additional overhead with gcc
    nowadays. It used to be that I had to use macros rather than inline
    functions to get the best performance, but the compiler's inlining has
    improved.
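
    DataDraw emits C, but the idea translates directly. A hand-written Go
    sketch of the kind of accessors it generates (names invented here) would
    look like:

    type Net uint32 // opaque reference handed out by the generated API

    // Generated storage: one array per property, hidden behind accessors.
    var (
        netName     []string
        netFirstPin []uint32
    )

    // Generated accessors: trivial enough for the compiler to inline, so a
    // call like n.Name() costs the same as a plain field load.
    func (n Net) Name() string         { return netName[n] }
    func (n Net) SetName(s string)     { netName[n] = s }
    func (n Net) FirstPin() uint32     { return netFirstPin[n] }
    func (n Net) SetFirstPin(p uint32) { netFirstPin[n] = p }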

    Is this something I should be encouraged or discouraged from pursuing?
    What exactly do you propose to do?
    The first step will be doing a lot of reading and learning so I'm no longer
    such a noob :-)

    The primary objective would be to make Go faster than C for most large
    memory applications. I am confident this can be done whenever the C struct
    layout is not visible to the application. I need to understand the Go
    source code and the language much better before I can be confident of the
    steps involved, but at this point I guess the steps would look like:

    - Fork the code. The first implementation would be to prove the concept,
    not to create clean code ready for integration with the main Go code base.
    - Refactor the code so that all structs declared in Go are accessed through
    an abstract inline functional interface using abstract references in C
    - Get the current memory layout working with the new abstract interface,
    passing the Go test suite.
    - Create an alternate implementation for this abstract interface which uses
    arrays of individual properties for storage, and 32-bit integers for
    references rather than 64-bit pointers. Try again to pass the Go test
    suite.
    - Benchmark various real applications using the new memory layout vs the
    production Go compiler.

    If the benchmarks are compelling enough, I would propose doing this exercise
    again from scratch. The second time I always do a better job :-) The
    second implementation would be done with the intention of being accepted
    for integration into the Go compiler, either the regular one or the gcc
    front end. Which would be best for this project?

    Additional features would be needed, such as a way to specify 64-bit (or
    16-bit) index sizes for object references, and the compiler may need to be
    able to generate either the old or the new layout based on some flag. A global
    optimizer that combines some fields into small structs for improved cache
    performance in inner loops could be implemented, as well as compiler hints
    to guide combination of fields.

    Does this make sense?

    Thanks,
    Bill

  • Ian Lance Taylor at Jan 21, 2015 at 6:10 pm

    On Wed, Jan 21, 2015 at 8:18 AM, Bill Cox wrote:
    - Fork the code. The first implementation would be to prove the concept,
    not to create clean code ready for integration with the main Go code base.
    - Refactor the code so that all structs declared in Go are accessed through
    an abstract inline functional interface using abstract references in C
    - Get the current memory layout working with the new abstract interface,
    passing the Go test suite.
    - Create an alternate implementation for this abstract interface which uses
    arrays of individual properties for storage, and 32-bit integers for
    references rather than 64-bit pointers. Try again to pass the Go test
    suite.
    - Benchmark various real applications using the new memory layout vs the
    production Go compiler.
    It's going to be impossible to do this in general. The Go language
    has an unsafe.Offsetof function, similar to the C offsetof macro.
    Reflection information has a StructField type, and that type has an
    Offset field. So it is possible, albeit awkward, to write Go code
    that takes a pointer to a struct and accesses the fields using pointer
    offsets. We can't break that code. The same issues arise in C, of
    course.
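
    For example, code along these lines is legal today and observes the
    layout directly, so the field offsets are effectively part of the
    program's behavior:

    package main

    import (
        "fmt"
        "reflect"
        "unsafe"
    )

    type T struct {
        A int32
        B int64
    }

    func main() {
        t := T{A: 1, B: 2}

        // Read B through a raw offset from the start of the struct.
        off := unsafe.Offsetof(t.B)
        p := unsafe.Pointer(uintptr(unsafe.Pointer(&t)) + off)
        fmt.Println(*(*int64)(p)) // prints 2

        // The same offset is exposed through reflection.
        f, _ := reflect.TypeOf(t).FieldByName("B")
        fmt.Println(f.Offset == off) // prints true
    }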

    In general I think you are describing an optimization that I know
    under the name struct-reorg. See
    http://www.research.ibm.com/haifa/dept/svt/papers/golovanevsky.pdf .
    However, that work did not really pan out in GCC. At least for C/C++
    code, it turns into a benchmark game: it's applicable to so few real
    world cases that you wind up optimizing for benchmarks rather than
    actual programs. The result is an optimization that is relatively
    expensive in terms of compile time that never actually helps.

    It's never going to make sense to apply this to all structs. It only
    makes sense when programs contain loops over large slices of structs
    where the loop bodies only access a subset of the fields in the
    struct. That is a small subset of real programs.

    Given the constraints on when the optimization applies, and the
    constraints imposed by type reflection and Offsetof on struct layout
    in Go, I don't think this will work as a compiler change.

    My feeling about this kind of thing has long been that it is much more
    effective to not try to do this as a compiler optimization, but
    instead to do this as a tool that analyzes a program and suggests
    areas where it can be improved. In this case, the tool would look for
    loop bodies over slices of structs where only a subset of the fields
    are accessed. Then it can report a potential optimization: break up
    the struct into smaller structs and use slices of those. The
    programmer can see this advice and, if appropriate, apply it to the
    program. My experience is that it is approximately ten thousand times
    easier for the programmer to transform the program safely and
    efficiently than it is for the compiler to do so.

    Ian

  • Bill Cox at Jan 21, 2015 at 7:17 pm
    Thanks for taking the time for a thoughtful reply. Comments below.
    On Wed, Jan 21, 2015 at 10:10 AM, Ian Lance Taylor wrote:

    It's going to be impossible to do this in general. The Go language
    has an unsafe.Offsetof function, similar to the C offsetof macro.
    Reflection information has a StructField type, and that type has an
    Offset field. So it is possible, albeit awkward, to write Go code
    that takes a pointer to a struct and accesses the fields using pointer
    offsets. We can't break that code. The same issues arise in C, of
    course.
    I could introduce this as either a new type (other than struct) or as some
    sort of option or hint about the struct so that existing code uses the
    existing struct layout.

    In general I think you are describing an optimization that I know
    under the name struct-reorg. See
    http://www.research.ibm.com/haifa/dept/svt/papers/golovanevsky.pdf .
    Thanks for that link. That paper describes what I do as "peeling".
    However, this statement in the paper is wrong:

    "Structure peeling however is not always possible.
    For example, for data structures interconnected
    by pointers, like link lists and trees, connecting
    pointers cannot be separated from the
    rest of the data in the struct. In this case the
    structure can still be decomposed using additional
    pointers to keep its original unity, as described
    next."

    For peeling to work on data structures with pointers, we have to globally
    replace the pointers to the peeled structures with integer indexes, and
    replace all star operators with inline function accessor methods (which are
    just as fast). This saves space if the index is 32 bits while pointers are
    64, and generally runs faster due to better cache performance. We also have
    to allocate all these fields together in arrays, whereas C structures tend
    to be allocated out of larger blocks of memory that can be scattered
    throughout the heap rather than being fully sequential. If Go currently just
    allocates structs in a single heap where structs of different types get
    intermixed, then this change will also speed it up, even when using the
    regular struct layout, since cache performance generally improves when
    like-type data is contiguous in memory.
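
    In Go terms the rewrite is mechanical. A sketch (invented names, with
    index 0 used as nil):

    // Pointer-linked version.
    type node struct {
        val  int
        next *node
    }

    func sum(head *node) int {
        s := 0
        for p := head; p != nil; p = p.next {
            s += p.val
        }
        return s
    }

    // Peeled version: the pointer becomes an index, the dereference becomes
    // an (inlinable) slice lookup, and val and next each live in their own
    // dense array.
    var (
        val  []int
        next []uint32
    )

    func sumPeeled(head uint32) int {
        s := 0
        for i := head; i != 0; i = next[i] {
            s += val[i]
        }
        return s
    }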

    However, that work did not really pan out in GCC. At least for C/C++
    code, it turns into a benchmark game: it's applicable to so few real
    world cases that you wind up optimizing for benchmarks rather than
    actual programs. The result is an optimization that is relatively
    expensive in terms of compile time that never actually helps.
    They only peeled structures that no one points to. It is pretty uncommon
    for such structures to dominate memory and thrash the cache. Replacing
    pointers with integer indexes globally is going to be quite challenging in
    a C compiler. In a code generator, it is comparatively easy.

    It's never going to make sense to apply this to all structs. It only
    makes sense when programs contain loops over large slices of structs
    where the loop bodies only access a subset of the fields in the
    struct. That is a small subset of real programs.
    Programs where speed matters tend to fall into two categories: CPU bound
    and cache bound. CPU bound programs are not typically going to be affected
    by this change. Cache bound programs, which are very common nowadays,
    generally benefit a great deal from full structure peeling.

    I can prove this through benchmarks, and not just a chosen few. Most large
    Go programs that use a lot of memory will do. At my previous job, we
    auto-placed and routed large chips. I could switch from regular C
    structures and pointers to fully peeled structs and integer indexes with a
    compiler flag. Our place and route code on modern Intel hardware ran 60%
    faster with fully peeled data. This is not some hand picked benchmark, but
    a full 600K line EDA tool flow compiled with gcc.

    Given the constraints on when the optimization applies, and the
    constraints imposed by type reflection and Offsetof on struct layout
    in Go, I don't think this will work as a compiler change.
    I agree that this change would break some existing reflection code. Some
    sort of addition in syntax, pragmas, flags, or some such is needed to keep
    from breaking this code.

    My feeling about this kind of thing has long been that it is much more
    effective to not try to do this as a compiler optimization, but
    instead to do this as a tool that analyzes a program and suggests
    areas where it can be improved. In this case, the tool would look for
    loop bodies over slices of structs where only a subset of the fields
    are accessed. Then it can report a potential optimization: break up
    the struct into smaller structs and use slices of those. The
    programmer can see this advice and, if appropriate, apply it to the
    program. My experience is that it is approximately ten thousand times
    easier for the programmer to transform the program safely and
    efficiently than it is for the compiler to do so.
    Rather than a complex compiler optimization, this style of peeling is a
    change in the code generator to use a different data layout. It does not
    add much complexity in my experience. It is nothing like what was
    described in that paper, where we need to formally prove that struct layout
    information has not escaped. This is a simple memory layout, allocation,
    and accessor change. It likely would affect many lines of code throughout
    the code base, but in a uniform way. Hopefully some awk/sed/grep scripts
    can do the job.

    Why don't I go ahead and hack up a version of Go implementing this change,
    so we can test the impact on real-world benchmarks? Which would be a
    better compiler to hack for this change, the GCC Go front end or the gc
    compiler? I'm such a noob, I don't know where to look for documentation on
    these compilers. Is there a wiki I should read, or should I just go read
    the code?

    Thanks again,
    Bill

  • Ian Lance Taylor at Jan 21, 2015 at 7:42 pm

    On Wed, Jan 21, 2015 at 11:17 AM, Bill Cox wrote:
    I agree that this change would break some existing reflection code. Some
    sort of addition in syntax, pragmas, flags, or some such is needed to keep
    from breaking this code.
    I want to be clear that this is something we are extremely unlikely to
    do in Go. For one thing, it would break the Go 1 guarantee
    (http://golang.org/doc/go1compat.html).

    Why don't I go ahead and hack up a version of Go implementing this change, so
    we can test the impact on real-world benchmarks? Which would be a better
    compiler to hack for this change, the GCC Go front end or the gc compiler?
    I'm such a noob, I don't know where to look for documentation on these
    compilers. Is there a wiki I should read, or should I just go read the
    code?
    Both compilers are quite underdocumented. I wrote the gccgo frontend,
    so I find it easier to work with that code. The gc compiler is in the
    process of being converted from C to Go, a process that will take many
    months to complete. I wouldn't try complex work with that compiler
    until that work is done.

    Ian

  • Bill Cox at Jan 21, 2015 at 9:26 pm

    On Wed, Jan 21, 2015 at 11:42 AM, Ian Lance Taylor wrote:
    On Wed, Jan 21, 2015 at 11:17 AM, Bill Cox wrote:

    I agree that this change would break some existing reflection code. Some
    sort of addition in syntax, pragmas, flags, or some such is needed to keep
    from breaking this code.
    I want to be clear that this is something we are extremely unlikely to
    do in Go. For one thing, it would break the Go 1 guarantee
    (http://golang.org/doc/go1compat.html).

    Breaking backwards compatibility is not allowed. I get it. If this
    feature is integrated, it will have to be invoked with some new construct.
    For example, adding a keyword other than struct for peeled structs
    (pstruct? record? I don't like them). Maybe a modifier keyword like
    peeled? I have no idea, but breaking backwards compatibility is not an
    option.

    I think it will be useful to understand the potential benefits of peeled
    structs. I won't be upset if my code does not get integrated. The first
    version would be written specifically as throw-away code.

    Why don't I go ahead and hack up a version of Go implementing this
    change, so
    we can test the impact on real-world benchmarks? Which would be a better
    compiler to hack for this change, the GCC Go front end or the gc compiler?
    I'm such a noob, I don't know where to look for documentation on these
    compilers. Is there a wiki I should read, or should I just go read the
    code?
    Both compilers are quite underdocumented. I wrote the gccgo frontend,
    so I find it easier to work with that code. The gc compiler is in the
    process of being converted from C to Go, a process that will take many
    months to complete. I wouldn't try complex work with that compiler
    until that work is done.
    This is a tough call... It's throwaway code, so I probably just want to do
    the work where it's easiest, possibly the stable 1.4 branch of gc. On the
    other hand, I read that gccgo generates faster code, and I want to compare
    results to C and C++, and I want the fastest possible result. My goal is
    for Go to win vs C on many memory intensive algorithms. Then again, I
    probably won't be able to compile many Google internal programs which are
    mostly built with gc, and I want to benchmark those. Which code base do
    you suspect I will find easier to work with? In my experience, a lot
    depends on the community, and you're here answering dumb noob questions.
    That's a strong point in favor of going with gccgo. What do you think?

    Bill

  • Ian Lance Taylor at Jan 21, 2015 at 11:06 pm

    On Wed, Jan 21, 2015 at 1:26 PM, Bill Cox wrote:
    This is a tough call... It's throwaway code, so I probably just want to do
    the work where it's easiest, possibly the stable 1.4 branch of gc. On the
    other hand, I read that gccgo generates faster code, and I want to compare
    results to C and C++, and I want the fastest possible result. My goal is
    for Go to win vs C on many memory intensive algorithms. Then again, I
    probably won't be able to compile many Google internal programs which are
    mostly built with gc, and I want to benchmark those. Which code base do you
    suspect I will find easier to work with? In my experience, a lot depends on
    the community, and you're here answering dumb noob questions. That's a
    strong point in favor of going with gccgo. What do you think?
    I don't know, but I'll comment that in general anything that can be
    built with gc can be built with gccgo. They are quite compatible at
    the language level.

    Ian

  • Egon at Jan 22, 2015 at 8:11 am
    BTW, Jonathan Blow recently showed some interesting ideas about how to write
    high-level code while still controlling the memory layout
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA).

    Two interesting ideas from there (converted into Go-ish approach).

    First one is a "soa" qualifier that modifies the memory layout
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA#t=2852):

    type Particle soa struct {
        X, Y, Z float64
    }
    type Particles [4]Particle
    // memory layout for "Particles" would be
    // X, X, X, X, Y, Y, Y, Y, Z, Z, Z, Z

    And the other idea was implicit identity function
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA#t=4133)

    var particles = [256]ParticleInfo{}

    type ParticleInfo struct{ X, Y, Z float64 }

    type Particle struct {
        Index uint8

        // embedding an implicit function
        func(p *Particle) identity() *ParticleInfo {
            return &particles[p.Index]
        }
    }

    func example() {
        p := Particle{1}
        p.X = p.Y * p.Z
    }

    // would be converted by the compiler to:
    func example() {
        p := Particle{1}
        pi := p.identity()
        pi.X = pi.Y * pi.Z
    }

    Basically it's embedding a function that implicitly is called "as a getter"
    for the properties.

    The first idea seems like something that could fit into Go - although I
    have no clue how much effort it needs. Obviously it would introduce a new
    type "soa struct". This would probably resolve most of the problems that
    need different layout without having to convert lots of code to that layout
    manually.

    The second idea could also simply be implemented as:

    type Particle uint8
    func (p Particle) Info() *ParticleInfo { return &particles[int(p)] }

    func example() {
        p := Particle(1)
        pi := p.Info()
        pi.X = pi.Y * pi.Z
    }

    So it is probably less useful.

    Basically, I don't think optimizer needs to automatically determine what
    and how to reorganize - it's probably sufficient when you can see what
    needs to be reorganized and mark things for reorganization (which also
    could be done by a preprocessor).

    + Egon

    *PS: I'm not suggesting that these should be implemented for Go; I'm simply
    saying that they were interesting.*

  • Sebastien Binet at Jan 22, 2015 at 8:28 am

    On Thu, Jan 22, 2015 at 9:11 AM, Egon wrote:
    BTW. Jonathan Blow recently showed some interesting ideas how to write
    high-level but at the same time keep different memory layout
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA).

    Two interesting ideas from there (converted into Go-ish approach).

    First one is a "soa" qualifier that modifies the memory layout
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA#t=2852):

    type Particle soa struct {
    X, Y, Z float64
    }
    type Particles [4]Particle
    // memory layout for "Particles" would be
    // X, X, X, X, Y, Y, Y, Y, Z, Z, Z, Z
    alternatively, a "simple" 'eg/gorename', go-fix or go-generate could
    also be a useful tool to transform a Particle/Particles
    Array-Of-Structs into a Struct-Of-Arrays... (or slices)

    -s

  • Bill Cox at Jan 22, 2015 at 5:12 pm

    On Thu, Jan 22, 2015 at 12:28 AM, Sebastien Binet wrote:
    On Thu, Jan 22, 2015 at 9:11 AM, Egon wrote:
    BTW. Jonathan Blow recently showed some interesting ideas how to write
    high-level but at the same time keep different memory layout
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA).

    Two interesting ideas from there (converted into Go-ish approach).

    First one is a "soa" qualifier that modifies the memory layout
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA#t=2852):

    type Particle soa struct {
    X, Y, Z float64
    }
    type Particles [4]Particle
    // memory layout for "Particles" would be
    // X, X, X, X, Y, Y, Y, Y, Z, Z, Z, Z
    This is very cool. On the down side, this approach probably loses garbage
    collection, and users will have a difficult time freeing a single
    particle. Also, reallocation to a larger or smaller array will require
    quite a bit of overhead.

    Nowadays, programs compiled with gcc on Linux (at least Ubuntu) have some
    very cool memory features. The following code allocates 15TiB of memory,
    and runs in 10ms on my laptop:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000
    #define SIZE (1llu << 34u)

    int main() {
       uint8_t **arrays = malloc(N*sizeof(uint8_t *));
       uint64_t i;
       uint64_t total = 3;
       for(i = 0; i < N; i++) {
         arrays[i] = (uint8_t *)malloc(SIZE);
         if(arrays[i] == NULL) {
           printf("Unable to allocate %llu bytes\n", SIZE);
           exit(1);
         }
         arrays[i][0] = i*total;
         uint64_t j = (total*12345) % (i+1);
         total += arrays[j][0];
       }
       printf("total %lu\n", total);
       return 0;
    }

    If I loop forever before returning, top shows I have 15TiB of virtual
    memory allocated. What's happening is that Linux doesn't back a page with
    physical memory until I either read or write it. This means we can use malloc
    effectively for implementing memory efficient dynamic arrays. This allows
    peeled structs to be reallocated rarely, since the "capacity" we allocate
    can be several times more than we currently need, without wasting
    significant memory.
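
    The same reservation trick should be possible from Go as well. A rough,
    Linux-only sketch using the syscall package (untested):

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        const capacity = 1 << 40 // reserve 1 TiB of address space

        // Anonymous, private, MAP_NORESERVE mapping: the kernel hands out
        // address space now and physical pages only when they are touched.
        buf, err := syscall.Mmap(-1, 0, capacity,
            syscall.PROT_READ|syscall.PROT_WRITE,
            syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS|syscall.MAP_NORESERVE)
        if err != nil {
            panic(err)
        }

        // Only the pages we actually write get backed by real memory, so a
        // growable per-field array can be given a huge capacity up front.
        buf[0] = 1
        buf[capacity/2] = 2
        fmt.Println(buf[0], buf[capacity/2])
    }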

    Rather than leaving management of memory to the user, the memory layout
    needs to be reworked, and this means substantial changes to the garbage
    collector. The garbage collector should deal with compacting data in these
    arrays automatically when needed to recover fragmented memory.

    For this to be efficient, we need the X, Y, and Z fields of Particle
    objects to live in separate malloc-ed memory arrays.

    alternatively, a "simple" 'eg/gorename', go-fix or go-generate could
    also be a useful tool to transform a Particle/Particles
    Array-Of-Structs into a Struct-Of-Arrays... (or slices)
    I am not familiar with eg/gorename, go-fix or go-generate. Are these code
    generators for Go? I really need to read the LRM...

    Bill

  • Sebastien Binet at Jan 22, 2015 at 5:17 pm
    Bill,
    On Thu, Jan 22, 2015 at 6:12 PM, Bill Cox wrote:
    alternatively, a "simple" 'eg/gorename', go-fix or go-generate could
    also be a useful tool to transform a Particle/Particles
    Array-Of-Structs into a Struct-Of-Arrays... (or slices)

    I am not familiar with eg/gorename, go-fix or go-generate. Are these code
    generators for Go? I really need to read the LRM...
    go generate is (the scaffolding for) a code generator.
    see: http://blog.golang.org/generate

    eg, gorename and 'go fix' are code refactoring/rewriting tools.
    https://godoc.org/golang.org/x/tools/cmd/gorename
    https://godoc.org/golang.org/x/tools/cmd/eg

    hth,
    -s

  • Egon at Jan 22, 2015 at 7:00 pm

    On Thursday, 22 January 2015 19:13:30 UTC+2, Bill Cox wrote:
    On Thu, Jan 22, 2015 at 12:28 AM, Sebastien Binet <seb....@gmail.com> wrote:
    On Thu, Jan 22, 2015 at 9:11 AM, Egon <egon...@gmail.com> wrote:
    BTW. Jonathan Blow recently showed some interesting ideas how to write
    high-level but at the same time keep different memory layout
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA).

    Two interesting ideas from there (converted into Go-ish approach).

    First one is a "soa" qualifier that modifies the memory layout
    (https://www.youtube.com/watch?v=ZHqFrNyLlpA#t=2852):

    type Particle soa struct {
    X, Y, Z float64
    }
    type Particles [4]Particle
    // memory layout for "Particles" would be
    // X, X, X, X, Y, Y, Y, Y, Z, Z, Z, Z
    This is very cool. On the down side, this approach probably loses garbage
    collection, and users will have a difficult time freeing a single
    particle. Also, reallocation to a larger or smaller array will require
    quite a bit of overhead.
    You would batch such things:

    type ParticleBatch [32]Particle
    var Particles []ParticleBatch

    Although, yes there is an overhead with resizing and manual management.

    Now days programs compiled with gcc on Linux (at least Ubuntu) have some
    very cool memory features. The following code allocates 15TiB of memory,
    and runs in 10ms on my laptop:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000
    #define SIZE (1llu << 34u)

    int main() {
    uint8_t **arrays = malloc(N*sizeof(uint8_t *));
    uint64_t i;
    uint64_t total = 3;
    for(i = 0; i < N; i++) {
    arrays[i] = (uint8_t *)malloc(SIZE);
    if(arrays[i] == NULL) {
    printf("Unable to allocate %llu bytes\n", SIZE);
    }
    arrays[i][0] = i*total;
    uint64_t j = (total*12345) % (i+1);
    total += arrays[j][0];
    }
    printf("total %lu\n", total);
    return 0;
    }

    If I loop forever before returning, top shows I have 15TiB allocated of
    virtual memory. What's happening is Linux doesn't allocate the memory
    until I either read or write any given page. This means we can use malloc
    effectively for implementing memory efficient dynamic arrays. This allows
    peeled structs to be reallocated rarely, since the "capacity" we allocate
    can be several times more than we currently need, without wasting
    significant memory.

    Rather than leaving management of memory to the user, the memory layout
    needs to be reworked, and this means substantial changes to the garbage
    collector. The garbage collector should deal with compacting data in these
    arrays automatically when needed to recover fragmented memory.

    For this to be efficient, we need the X, Y, and Z fields of Particle
    objects to live in separate malloc-ed memory arrays.

    alternatively, a "simple" 'eg/gorename', go-fix or go-generate could
    also be a useful tool to transform a Particle/Particles
    Array-Of-Structs into a Struct-Of-Arrays... (or slices)
    I am not familiar with eg/gorename, go-fix or go-generate. Are these code
    generators for Go? I really need to read the LRM...

    Bill
  • Bill Cox at Jan 22, 2015 at 11:00 pm
    On Thu, Jan 22, 2015 at 11:00 AM, Egon wrote:
    This is very cool. On the down side, this approach probably loses
    garbage collection, and users will have a difficult time freeing a single
    particle. Also, reallocation to a larger or smaller array will require
    quite a bit of overhead.
    You would batch such things:

    type ParticleBatch [32]Particle
    var Particles []ParticleBatch

    Although, yes there is an overhead with resizing and manual management.
    Thanks for the video link. Batches are how I normally allocate AOS memory
    layouts. For batch-processing tools where the embarrassing pauses are no
    problem (such as the place-and-route tools I am used to working on), it is
    faster and simpler overall to allocate everything in one batch for SOA
    memory layouts. For an initial version, I'd prefer the simpler approach of
    one array per field.

    Garbage collection looks tricky in optimized gcc code. How is it possible
    to know the type of a value in a register, or even on the stack after
    optimization? I read something that makes me think gccgo does
    'conservative' GC, which does not work as well when using integer
    indexes rather than pointers. In our place and route system, we
    auto-generated recursive destructors. To free an object, you had to call
    its destroy method, so I have not dealt with GC.

    One nice property of SOA memory layout is it becomes simple to dynamically
    add fields to existing objects. You just malloc and initialize a new
    array. This allowed our various tools to extend existing objects and add
    tool specific data to them. For example, the placer could add a bounding
    box to all nets, and free them when placement is done.
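
    In the slice-per-field world, that kind of extension is just one more
    parallel slice (invented names):

    type NetRef uint32

    // Core field, owned by the shared database.
    var netName []string

    // The placer temporarily extends every net with a bounding box by
    // allocating one more parallel array, and drops it when placement is done.
    type BBox struct{ X0, Y0, X1, Y1 int32 }

    var netBBox []BBox

    func startPlacement(numNets int) { netBBox = make([]BBox, numNets) }
    func bbox(n NetRef) *BBox        { return &netBBox[n] }
    func endPlacement()              { netBBox = nil } // let the GC reclaim it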

    Bill

  • Jesper Louis Andersen at Jan 22, 2015 at 11:12 pm

    On Thu, Jan 22, 2015 at 9:28 AM, Sebastien Binet wrote:

    alternatively, a "simple" 'eg/gorename', go-fix or go-generate could
    also be a useful tool to transform a Particle/Particles
    Array-Of-Structs into a Struct-Of-Arrays... (or slices)
    This is the approach taken by most games anyway, for other structures. They
    have elaborate schemes where you feed in a nice structure and a compiler
    (go generate) then figures out an alternative, efficient in-memory layout
    and provides you with appropriate functions to access it. The smart thing
    is that iteration over the layout can reap the benefit of cache locality,
    providing much faster processing. But as a programmer, you don't
    need to worry about the in-memory layout a priori.

    My hunch is that you need to say explicitly when you want the alternative
    layout. If it were generally applicable, I would imagine lots of languages
    would have picked up the automated transformation by now. It is fairly well
    known folklore, and I find it hard to believe that no compiler writer would
    have seen the light. For heavyweight processing of data, however,
    cache locality and memory bandwidth are so important nowadays that you
    can't ignore them.


    --
    J.

  • Egon at Jan 21, 2015 at 7:49 am

    On Wednesday, 21 January 2015 04:39:52 UTC+2, Bill Cox wrote:
    Sorry about this total noob post, but the "contributor" docs say pretty
    clearly to first post here...

    I am interested in changing the memory layout of structs in Go. The
    layout defined in C is nearly as bad as it gets when it comes to high speed
    memory intensive applications. Your inner loops typically access 2 or 3
    fields of any given object, and the rest of the object's fields clobber
    otherwise potentially useful data in the cache. If the default memory
    layout is to group fields of the same struct type together, average cache
    performance improves greatly (my place and route tools run 60% faster!).
    To make this work, pointers are implemented under the hood as regular
    ints, and used to index into these arrays of properties. I've developed C
    code with this method for over 20 years, and it always produces far faster
    algorithms.

    For my implementation of a data structure generator for C that makes use
    of this layout, see DataDraw:

    http://datadraw.sourceforge.net/

    I assume that Go structs today are implemented as C structs. That's too
    bad, because this simple change under the hood could make Go faster than C
    in many memory intensive applications.

    Is this something I should be encouraged or discouraged from pursuing?
    Also, could you explain some of the real-world problems and/or show some
    code examples? Currently the text is rather vague. Maybe there are
    alternative ways to make those things faster or to approach the problem.

    + Egon

