Is there anything like that? With current transistor budgets and clever implementations for channels in hw, could we have massive, efficient parallelism with perhaps smaller word size?

Fixed word size is so arbitrary, like predefined address lengths? Can't we have hierarchical memory? And parallelised small-word operations even for things like multiplication of big numbers. I understand 64-bit fixed everything consumes huge amounts of transistors, certainly exponentially more than small word sizes.

--


  • Anssi Porttikivi at Nov 22, 2012 at 4:32 pm
    Looks like the current crop has 10^10 transistors, while 16-bit ones had 10^5. So you could have 100,000 of them, to run your goroutines, if you were not fascinated with BIG words.

    --
  • Michael Jones at Nov 22, 2012 at 4:29 pm
    Your math is suspect here. Big/fast CPUs have only a small percentage of
    their space and transistors in ALU/registerfile/dispatch. Most is in
    hierarchies of cache. Cache is good for all bit sizes...
    --
    Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
    650-335-5765

    --
  • Bryanturley at Nov 22, 2012 at 5:27 pm
    They are selling desktop-level 6-core CPUs, maybe 8-core now.
    The first mass-market dual-core desktop CPU came out in 2005, so in 7-ish
    years we have 4x the cores.
    Who knows what happens in the next 5 years.

    If you have the money, four sockets of quad-core CPUs is 16 cores.

    --
  • Minux at Nov 23, 2012 at 9:33 am

    On Fri, Nov 23, 2012 at 12:17 AM, Anssi Porttikivi wrote:

    Is there anything like that? With current transistor budgets and clever
    implementations for channels in hw, could we have massive, efficient
    parallelism with perhaps smaller word size?

    Fixed word size is so arbitrary, like predefined address lengths? Can't we
    have hierarchical memory? And parallelised small-word operations even for
    things like multiplication of big numbers. I understand 64-bit fixed
    everything consumes huge amounts of transistors, certainly exponentially
    more than small word sizes.
    This is a field of active exploration.

    However, I'm afraid that with massively parallel processors, globally
    shared memory is not feasible, so
    Go's concurrency model might not be a perfect fit for them.

    The current trend in that field (for example, Network-on-Chip) favors many
    cores with small local memory and a dedicated routing co-processor for
    each core, connected by a mesh communication network.

    Word size is not a problem at all (the hardware required for an n-bit ALU
    is at most O(n^2)); in fact, in modern processors the ALU is a vanishingly
    small part of the silicon. The controller and cache account for the major
    part of it.

    --
  • John Nagle at Jan 25, 2013 at 10:17 pm

    On 11/23/2012 1:33 AM, minux wrote:
    On Fri, Nov 23, 2012 at 12:17 AM, Anssi Porttikivi wrote:

    Is there anything like that?
    The Intel MIC is a potential target. 60 cores for about $2000.
    Runs x86 instructions, mostly. There was an attempt to
    port Go to the MIC in 2011 (http://communities.intel.com/thread/25692)
    but it doesn't seem to have been finished. The MIC is now out
    as a product (as the "Xeon Phi"), but Go isn't mentioned by Intel.
    They're using C, C++, FORTRAN, and OpenMP.

    While this is a shared-memory multiprocessor, you have to lay
    out memory so as not to force too many cache misses. Otherwise
    you won't get the benefit of all those cores. CPU dispatching
    has to be smart about which threads run on which CPUs. Locking
    has to be carefully worked out so as not to stall too many CPUs.
    Garbage collection tends to have a big impact.

    So once Go is running, there are many low-level issues to
    be dealt with to get the performance up.

    John Nagle

    --
  • Anssi Porttikivi at Nov 23, 2012 at 10:52 am
    Yes, my transistor math was stupid, my word size babble was stupid, and I
    know nothing about chip design. But the fact remains that you could
    implement 10^5 CPU cores on a chip with today's 10^10-transistor hw, if
    they were no more complex than the 10^5-transistor 80286 used to be.

    And the question of using silicon for cache comes down to this: maybe you
    could use it for local core memory instead? Just divide your cache and
    your everything into 10^5 small cores...

    And if you ignore my stupid ideas, I think we could still discuss: Is Go
    suitable for massively parallel designs? What kind of? Could they be
    cost-effective mass-market solutions? I am a Go fanatic, but I think the
    answer is yes to both!

    --
  • ⚛ at Nov 23, 2012 at 1:04 pm

    On Friday, November 23, 2012 11:52:14 AM UTC+1, Anssi Porttikivi wrote:

    Yes, my transistor math was stupid, my word size babble was stupid, and I
    know nothing about chip design. But the fact remains that you could
    implement 10^5 CPU cores on a chip with today's 10^10-transistor hw, if
    they were no more complex than the 10^5-transistor 80286 used to be.

    Dividing 10^10 transistors into N compartments of 10^5 transistors each
    cannot yield N=10^5, because the communication network between the N CPUs
    also requires transistors and space. A high-speed interconnect requires
    many transistors.

    The 80286 doesn't have an on-CPU cache, an FPU, or performance-monitoring
    registers.

    With a hundred or more CPUs, each CPU may need to have its own local memory.

    And the question of using silicon for cache comes down to this: maybe you
    could use it for local core memory instead? Just divide your cache and
    your everything into 10^5 small cores...

    And if you ignore my stupid ideas, I think we could still discuss: Is Go
    suitable for massively parallel designs? What kind of? Could they be
    cost-effective mass-market solutions? I am a Go fanatic, but I think the
    answer is yes to both!
    In my opinion, Go as such isn't suitable for massively parallel designs.

    --
  • Anssi Porttikivi at Nov 23, 2012 at 11:01 am
    The tenet of not using global memory should not be so hard for Go
    algorithms. Just pass chunks of globals as value parameters, do all
    inner-loop processing in parallel goroutines with local memory allocated in
    local core hw, and then coordinate with channels, as locally as possible.
    Pass the results back to global memory.
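
    In today's Go, that pattern might look like this minimal sketch (the
    chunk type and the sizes are made up for illustration). Note the
    fixed-size array: every channel send copies the data instead of sharing
    it, so no goroutine touches the shared state while the workers run.

    package main

    import "fmt"

    type chunk struct {
        id   int
        data [256]int // fixed-size array: copied in full on every send
    }

    func worker(in <-chan chunk, out chan<- chunk) {
        for c := range in {
            for i := range c.data {
                c.data[i] *= 2 // all work happens on the worker's local copy
            }
            out <- c // pass the result back by value
        }
    }

    func main() {
        in := make(chan chunk)
        out := make(chan chunk)
        for w := 0; w < 4; w++ {
            go worker(in, out)
        }
        go func() {
            for id := 0; id < 16; id++ {
                in <- chunk{id: id} // each send copies the whole chunk
            }
            close(in)
        }()
        for n := 0; n < 16; n++ {
            fmt.Println("chunk", (<-out).id, "done")
        }
    }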

    --
  • Minux at Nov 23, 2012 at 1:22 pm

    On Fri, Nov 23, 2012 at 6:56 PM, Anssi Porttikivi wrote:

    The tenet of not using global memory should not be so hard for Go
    algorithms. Just pass chunks of globals as value parameters, do all
    inner-loop processing in parallel goroutines with local memory allocated in
    local core hw, and then coordinate with channels, as locally as possible.
    Pass the results back to global memory.
    Why fight the language?
    IMHO, Go is simply not practical on massively parallel processors that
    don't have globally shared memory.

    Note that Go is designed for today's multicore processors that share
    memory.
    And the right language for many-core machines that don't share global
    memory is still unknown to us. Lots of people are doing research in this
    field.

    --
  • Anssi Porttikivi at Nov 23, 2012 at 3:41 pm
    If you want a new kind of mass-market massive parallelism to replace
    Wintel, you can create a new language for that too. And rewrite all the
    software in the world. That is a possibility.

    But if Go becomes a huge success, which I am betting on, and creates a
    culture of parallel algorithms, that is something that changes the
    starting point. Then you can come up with a massively parallel cheap chip
    and just provide a Go compiler.

    --
  • Minux at Nov 23, 2012 at 4:13 pm

    On Fri, Nov 23, 2012 at 11:41 PM, Anssi Porttikivi wrote:

    If you want a new kind of mass-market massive parallelism to replace
    Wintel, you can create a new language for that too. And rewrite all the
    software in the world. That is a possibility.
    That's why massively parallel processors have not yet taken over the
    world: familiar languages are ineffective on them.
    But if Go becomes a huge success, which I am betting on, and creates a
    culture of parallel algorithms, that is something that changes the
    starting point. Then you can come up with a massively parallel cheap chip
    and just provide a Go compiler.
    I really hope we can find a way to build massively parallel machines
    while at the same time retaining the globally shared memory and the easy
    memory consistency model we enjoy today on x86 (or x64) processors.
    However, that is not an easy thing to do. Predicting the future is
    impossible; maybe someone will come up with a novel way to do it
    (although I doubt it).

    --
  • Steve wang at Nov 23, 2012 at 4:24 pm
    Maybe we should look forward to quantum computers, with which all these
    difficult things can maybe be done with ease. :D
    --
  • Bryanturley at Nov 23, 2012 at 4:36 pm
    If you stick with the share-by-communicating model, massively parallel Go
    could work.
    That is how the machines in the top500 work: various message-passing
    algorithms.

    *Perhaps* in 20 years Go 3.0 denies all shared memory and forces only
    channels on the new 1024-core x86 non-cache-coherent chips ;)
    Assuming ARM doesn't kill them by then.

    But right now you can make 1024+ goroutines on your quad-core CPU, and
    possibly get speed boosts as more cores become available.
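
    A toy sketch of exactly that (the squaring is placeholder work): the
    program below starts 1024 goroutines, and the same binary simply spreads
    them over however many cores GOMAXPROCS allows.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        runtime.GOMAXPROCS(runtime.NumCPU()) // use every core the machine has
        var wg sync.WaitGroup
        results := make([]int, 1024)
        for i := 0; i < 1024; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                results[i] = i * i // stand-in for real work
            }(i)
        }
        wg.Wait()
        fmt.Println("first, last:", results[0], results[1023])
    }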

    --
  • Aram Hăvărneanu at Nov 23, 2012 at 6:03 pm

    If you stick with the share-by-communicating model, massively parallel Go
    could work.
    You can't share memory by communicating if your nodes do not share memory.

    --
    Aram Hăvărneanu

    --
  • Bryanturley at Nov 24, 2012 at 4:50 am

    On Friday, November 23, 2012 12:03:28 PM UTC-6, Aram Hăvărneanu wrote:
    If you stick with the share-by-communicating model, massively parallel Go
    could work.
    You can't share memory by communicating if your nodes do not share memory.
    Really? How did my computer get the message you just sent to this mailing
    list? They don't share memory, but they have surely communicated.
    http://en.wikipedia.org/wiki/Network_interface_card ;)


    --
  • A Vansteenkiste at Nov 24, 2012 at 1:30 am
    These guys promise a cheap board with lots and lots of ARM
    cores: http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone
    Would be very nice if it runs Go.

    -Arne.

    --
  • Matt Kane's Brain at Nov 24, 2012 at 4:57 am
    Only two ARM cores are on each board. The 16 or 64 cores on the eval
    boards are their own architecture, called Epiphany.
    --
    matt kane's brain
    http://hydrogenproject.com

    --
  • Bryanturley at Nov 24, 2012 at 5:10 am
    http://www.anandtech.com/show/6418/amd-will-build-64bit-arm-based-opteron-cpus-for-servers-production-in-2014
    I suspect this will have a lot more cores than the x86 world, minus
    perhaps the Xeon Phi 2014 edition.
    That is pure speculation though; there are no details on what it will be
    at release time.

    --
  • Steve wang at Jan 25, 2013 at 8:12 pm
    How many cores can an x86 system get up to? Of course, on the basis that
    all these cores share global memory and it still works efficiently.
    It seems that Intel has slowed down the pace of cramming more and more
    cores into one CPU.
    How much will this tendency influence the development of the concurrent
    programming that Go was born for?
  • Niklas Schnelle at Jan 25, 2013 at 8:28 pm
    You mean something like this:
    http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone
    --
  • ⚛ at Jan 25, 2013 at 8:30 pm

    On Friday, January 25, 2013 9:12:22 PM UTC+1, steve wang wrote:

    How many cores can an x86 system get up to?

    Infinitely many cores. The maximum level of parallelism does not depend on
    the x86 architecture itself - it depends on the structure of code and data
    the CPUs are processing.

    --
  • Ron minnich at Jan 25, 2013 at 8:33 pm

    On Fri, Jan 25, 2013 at 12:30 PM, ⚛ wrote:
    On Friday, January 25, 2013 9:12:22 PM UTC+1, steve wang wrote:

    How many cores can an x86 system get up to?

    Infinitely many cores. The maximum level of parallelism does not depend on
    the x86 architecture itself - it depends on the structure of code and data
    the CPUs are processing.
    um, as long as you don't think memory bandwidth has any importance.

    ron

    --
  • ⚛ at Jan 25, 2013 at 8:53 pm

    On Friday, January 25, 2013 9:33:20 PM UTC+1, ron minnich wrote:

    On Fri, Jan 25, 2013 at 12:30 PM, ⚛ <0xe2.0x...@gmail.com> wrote:
    On Friday, January 25, 2013 9:12:22 PM UTC+1, steve wang wrote:

    How many cores can an x86 system get up to?

    Infinitely many cores. The maximum level of parallelism does not depend on
    the x86 architecture itself - it depends on the structure of code and data
    the CPUs are processing.
    Clarification: I meant the x86 instruction set architecture, memory
    protection, paging, etc. That is: the programmer interface to the CPU.

    um, as long as you don't think memory bandwidth has any importance.
    Memory bandwidth is a program parameter. Efficiency means that programs
    are able to compute how to utilize the bandwidth.

    The x86 architecture (programmer interface) as such plays no role in how
    big the memory bandwidth can get over the coming decades.

    --
  • Niklas Schnelle at Jan 25, 2013 at 8:38 pm
    By the way, there are Linux systems with shared memory and a single
    kernel image for > 1024 cores.
    For example: https://en.wikipedia.org/wiki/SGI_Altix#Altix_4000
    On the other hand, many problems become memory-bandwidth bound easily
    even at lower core counts. I see this phenomenon quite often on my
    Opteron-based 16-core workstation, especially since getting NUMA right is
    pretty hard.
    I think, however, that Go can in fact be pretty good on many-core
    systems. Maybe several cores will share quite a bit of memory; after all,
    very often one just needs access to more data. Even on graphics cards all
    memory is accessible, though not in arbitrary ways.
    So maybe it will be processes sharing memory among a number of
    goroutines, with communication via channels.
    It wouldn't be too hard to have the channel concept extended for
    inter-chip networking, to coordinate a bunch of processes that don't
    share memory.
    --
  • Michael Jones at Jan 25, 2013 at 8:48 pm
    The simple answer is 4 CPUs on a motherboard, each of the E5-4650L class
    (8 cores, 2-way SMT), for a total of 4x8 = 32 physical cores and 64
    logical cores. The next-gen version of this is on my shopping list...
    Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1
    650-335-5765
  • Bryanturley at Jan 25, 2013 at 9:02 pm

    On Friday, January 25, 2013 2:12:22 PM UTC-6, steve wang wrote:
    How many cores can an x86 system get up to? Of course, on the basis that
    all these cores share global memory and it still works efficiently.
    It seems that Intel has slowed down the pace of cramming more and more
    cores into one CPU.
    How much will this tendency influence the development of the concurrent
    programming that Go was born for?
    By x86 system do you mean a single computer? If not, this system has
    299,008 Opteron cores:
    http://www.olcf.ornl.gov/titan/
  • Rodrigo Chacon at Jan 26, 2013 at 12:22 am
    What about the Parallella [1] board? They promise up to 64 ARM v9 cores. :)

    [1] http://www.adapteva.com/
    --
  • Minux at Jan 26, 2013 at 12:33 am

    On Sat, Jan 26, 2013 at 4:28 AM, Rodrigo Chacon wrote:

    What about the Parallella [1] board? They promise up to 64 ARM v9 cores. :)
    It's a custom RISC CPU core, not ARM, and certainly not v9. (I'm not even
    aware of any commercially available 64-bit ARMv8 implementations at this
    point; I'm happy to be proved otherwise, as I'm eager to own a 64-bit
    ARMv8 evaluation board to port Go to.)
  • Rodrigo Chacon at Jan 26, 2013 at 2:34 pm
    You're right. I saw ARM v9 on this [1] page and posted it. Looking again:
    yes, custom RISC. My bad. :)

    [1] http://www.adapteva.com/products/eval-kits/parallella/
  • Bryanturley at Jan 26, 2013 at 5:07 pm

    On Saturday, January 26, 2013 8:34:26 AM UTC-6, Rodrigo Chacon wrote:
    You're right. I saw ARM v9 on this [1] page and posted it. Looking again:
    yes, custom RISC. My bad. :)
    ARM's names are confusing.
    The first ARM CPU I worked with was an ARM7TDMI, which implemented ARMv4.
    ARM7s use either the ARMv3 or ARMv4 instruction set, and ARM11s did
    ARMv6, for example.
    So ARM9 (ARMv5) != ARMv9

    --
  • Minux at Jan 26, 2013 at 5:27 pm
    Off topic, I just can't help but complain about the naming issue...
    On Sun, Jan 27, 2013 at 1:07 AM, bryanturley wrote:
    On Saturday, January 26, 2013 8:34:26 AM UTC-6, Rodrigo Chacon wrote:

    You're right. I saw ARM v9 on this [1] page and posted it. Looking again:
    yes, custom RISC. My bad. :)
    ARM's names are confusing.
    Surely they are!
    The first ARM CPU I worked with was an ARM7TDMI, which implemented ARMv4.
    Strictly speaking, the ARM7TDMI implements ARMv4T ;-) because ARMv4
    refers to version 4 of the ARM architecture without the Thumb instruction
    set (for example, the venerable SA-110).
    ARM7s use either the ARMv3 or ARMv4 instruction set, and ARM11s did
    ARMv6, for example.
    So ARM9 (ARMv5) != ARMv9
    Right. ARM9 doesn't guarantee ARMv5. For example, the ARM9TDMI is in fact
    an ARMv4T core.

    What's more confusing is that after ARM introduced the Cortex family to
    solve the naming issue (the ARM9 ARMv4 vs ARMv5 confusion), people tend
    to assume all CPUs of the family implement the ARMv7 architecture, as the
    earlier members do have this trait (Cortex-M3, Cortex-A8, and Cortex-R4).
    Then ARM introduced the Cortex-M0, which implements the ARMv6-M
    architecture, and soon the confusion came back, only in a different form.
    Now not even all of the Cortex-A cores implement ARMv7, because they
    recently introduced the Cortex-A53/57 for the ARMv8 architecture....

    They are really not good at naming things besides using numbers.
  • Watson Ladd at Jan 26, 2013 at 2:51 am
    Pick up Patterson, Computer Architecture, read Chapter 8, and you will
    begin to understand the problems.
    Maintaining a global state across distributed systems is hard and
    consumes bandwidth. Having a single global memory doesn't scale beyond
    about 8-16 processors because the memory bus becomes constrained.
    Frequencies for RAM are limited by power consumption much more than CPU
    frequencies, due to hardware considerations.

    Once we reach the kinds of core numbers you are talking about, NUMA won't
    do. Even if NUMA did, controlling sharing is crucial to performance, and
    Go does not let the programmer direct sharing and scheduling in enough
    detail for this. Once we get to message passing you are in the land of
    OpenMPI, and that is not going to be the easy writing of goroutines. (You
    could write the node software in Go, but I'm not sure why you would want
    to.)

    When you have high interconnect bandwidth, crossbar switches are overly
    expensive. As a result, not all pairs of nodes have full bandwidth at the
    same time. Communication patterns have to be hand-optimized for each
    algorithm, often relying on hardware details. Magic bullets would be nice
    for this, but they probably don't exist. And lastly, Fortran has had so
    much poured into it that it will be hard to beat.

    Sincerely,
    Watson Ladd
    --
  • John Nagle at Jan 26, 2013 at 6:42 pm

    On 1/25/2013 6:51 PM, Watson Ladd wrote:
    On Thursday, November 22, 2012 10:17:42 AM UTC-6, Anssi Porttikivi
    wrote:
    Is there anything like that? With current transistor budgets and
    clever implementations for channels in hw, could we have massive,
    efficient parallelism with perhaps smaller word size?
    There's a long history of "Build it and they will come"
    multiprocessors. It's straightforward to build machines with
    huge numbers of intercommunicating processors, and it's been
    done many times. The ILLIAC IV, the BBN Butterfly, the NCube
    Hypercube, the Transputer, the Sony PS3 - the hardware worked, but
    the things were too hard to program. Many people tried very hard to
    make those things go, but in each case, more conventional architectures
    won out.

    Three architectures are known to work - shared-memory
    multiprocessors, clusters, and graphics processing units.
    Much of the cutting-edge thinking in parallelism today involves GPUs,
    which are the one big success in non-shared-memory machines.
    Supercomputers today are usually clusters of shared-memory
    multiprocessors, and the software is explicitly sending messages
    across the internal network to communicate. Usually with the
    MPI protocol.
    Pick up Patterson, Computer Architecture, read Chapter 8 and you
    will begin to understand the problems. Maintaining a global state
    across distributed systems is hard and consumes bandwidth. Having a
    single global memory doesn't scale beyond about 8-16 processors
    because the memory bus becomes constrained.
    Current thinking seems to be that you can get into the 40-60
    processor range with shared memory before the cache traffic limits
    the system speed. The Intel MIC is at 50. But each CPU is roughly
    comparable to an x86 CPU of a decade ago, while the inter-cache
    bus is 512 bits wide and state of the art.
    Once we reach the kinds of core numbers you are talking about NUMA
    won't do. Even if NUMA did controlling sharing is crucial to
    performance, and Go does not let the programmer direct sharing and
    scheduling in enough detail for this. Once we get to message
    passing you are in the land of OpenMPI, and that is not going to be
    easy writing of goroutines. (You could write the node software in Go,
    but I'm not sure why you would want to)
    If Go enforced "Do not communicate by sharing memory;
    instead, share memory by communicating", it would map well to the
    OpenMPI/hypercube/Cell model, where each CPU has local memory and
    all intercommunication is via message passing. Such machines are
    straightforward to build but tough to program. Go has some
    potential as an alternative to MPI, but channels will need more
    functions for that to work, and goroutines will need more isolation.

    The Cell in its PS3 form was painful to program. With only 256K
    of RAM per processor, about all you could do with the Cell processors
    was run sequential DSP-type algorithms which processed data as
    it flowed through. Great for audio processing and video compression,
    bad for almost everything else. For most PS3 games, the single PowerPC
    CPU and the NVidia graphics processor are doing most of the work,
    with a few auxiliary functions offloaded to the eight Cell processors.
    If the Cell had 16MB of RAM per processor, it might have been different.
    Then you could get some real work done in each Cell processor before
    you had to ship the data out.
    Communication patterns have to be hand optimized for each algorithm,
    often relying on hardware details.
    Painfully true. It's even worse when all CPUs are not the same.
    The PS3 has two different instruction sets to deal with; three if
    you count the GPU.

    John Nagle
  • Anssi Porttikivi at Jan 30, 2013 at 12:15 pm
    I am still wondering. If you just refuse to reference global variables in
    your goroutines, and if the channels refuse to pass pointers to global
    memory, why can't that work on message-passing parallel hardware?

    The compiler can enforce that. Or it can be a "soft" limitation, allowing
    you to break it, with the compiler/runtime providing some inefficient
    back-up implementation for limited shared memory. In the extreme case
    that would be simulated with message passing and sw locks. Or it could be
    implemented with a limited, slow shared hardware bus to shared memory.

    So you design Go algorithms for message-passing-by-value goroutines, if
    the hw is great for that.

    You could have an optimized, automatically or manually controlled,
    topology for less than fully cross-switched core connections. The
    compiler/scheduler/programmer can try to allocate busy channels so that
    their end-point goroutines run on well-connected core pairs.

    But you could still have occasional access to global data structures, or
    run slow-bandwidth "signalling" channels between non-connected cores. If
    you access a few kilobytes once a second, it doesn't matter if it isn't
    super efficient. Even software emulation on strictly message-passing
    hardware could work.

    --
  • Brendan Tracey at Mar 5, 2013 at 5:41 am
    I would like to second Anssi's questions about Go's suitability for being
    an MPI replacement. It does seem that if you share memory by
    communicating, the shared-memory restriction shouldn't actually be that
    tight. I know it's not easy, but is this something in Go's development
    plans?
    --
  • Bryanturley at Mar 5, 2013 at 5:46 am

    On Monday, March 4, 2013 11:41:23 PM UTC-6, Brendan Tracey wrote:
    I would like to second Anssi's questions about Go's suitability for being
    an MPI replacement. It does seem that if you share memory by
    communicating, the shared-memory restriction shouldn't actually be that
    tight. I know it's not easy, but is this something in Go's development
    plans?
    It could probably be used in place of MPI *IF* you were staying inside
    one machine with MPI to begin with.
    Channels need help with networking, and I have yet to see (and don't
    expect to see) channels + RDMA, so...

    --
  • Kamil Kisiel at Mar 5, 2013 at 6:24 am
    I think it would be more interesting to have a modified Go runtime that
    implemented channels over MPI and a scheduler that could dispatch
    goroutines to different MPI nodes.
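
    A rough sketch of the idea, with TCP plus encoding/gob standing in for a
    real MPI transport (exportChan and importChan are made-up helpers, not an
    existing package): code on each side keeps using ordinary channel
    operations while the values travel over the wire.

    package main

    import (
        "encoding/gob"
        "fmt"
        "io"
        "net"
    )

    // exportChan drains ch and ships each value, gob-encoded, to addr.
    func exportChan(ch <-chan int, addr string) {
        conn, err := net.Dial("tcp", addr)
        if err != nil {
            return
        }
        defer conn.Close()
        enc := gob.NewEncoder(conn)
        for v := range ch {
            if enc.Encode(v) != nil {
                return
            }
        }
    }

    // importChan accepts one connection and feeds ch until the peer closes.
    func importChan(ch chan<- int, ln net.Listener) {
        conn, err := ln.Accept()
        if err != nil {
            return
        }
        dec := gob.NewDecoder(conn)
        for {
            var v int
            if err := dec.Decode(&v); err != nil {
                if err != io.EOF {
                    fmt.Println("decode:", err)
                }
                close(ch)
                return
            }
            ch <- v
        }
    }

    func main() {
        // Both "nodes" live in one process here, just to keep the demo runnable.
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            panic(err)
        }
        in := make(chan int)  // the receiving node's channel
        out := make(chan int) // the sending node's channel
        go importChan(in, ln)
        go exportChan(out, ln.Addr().String())
        go func() {
            for i := 0; i < 5; i++ {
                out <- i
            }
            close(out) // closing out ends the stream, which closes in
        }()
        for v := range in {
            fmt.Println("received:", v)
        }
    }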
    --
  • John Nagle at Mar 5, 2013 at 7:13 am

    On 3/4/2013 9:46 PM, bryanturley wrote:
    On Monday, March 4, 2013 11:41:23 PM UTC-6, Brendan Tracey wrote:

    I would like to second Anssi's questions about Go's suitability for being
    an MPI replacement. It does seem that if you share memory by
    communicating, the shared-memory restriction shouldn't actually be that
    tight. I know it's not easy, but is this something in Go's development
    plans?
    It could probably be used in place of MPI *IF* you were staying inside
    one machine with MPI to begin with.
    Channels need help with networking, and I have yet to see (and don't
    expect to see) channels + RDMA, so...
    To do that, you'd need a type of channel that won't pass
    references, but otherwise works like existing channels.
    Compiler-supported deep copy would be needed for performance
    (deep copy via reflection is painfully slow). There'd also
    have to be a form of remote goroutine that doesn't pass references
    closure-style. That would enforce "share memory by communicating"
    and allow shipping out work to CPUs that don't share memory.

    Then you could run a Go program on a whole cluster of
    machines.

    John Nagle
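
    To make the deep-copy point concrete, a rough sketch (deepCopy is a
    made-up helper; a gob round-trip stands in for the compiler support
    described above, and it is slow for exactly the reason given: it reflects
    and serializes). It does guarantee the receiver shares no memory with the
    sender:

    package main

    import (
        "bytes"
        "encoding/gob"
        "fmt"
    )

    type Msg struct {
        ID   int
        Data []int // a reference type: a plain channel send would share it
    }

    // deepCopy round-trips v through gob so dst holds no pointers into src.
    func deepCopy(dst, src interface{}) error {
        var buf bytes.Buffer
        if err := gob.NewEncoder(&buf).Encode(src); err != nil {
            return err
        }
        return gob.NewDecoder(&buf).Decode(dst)
    }

    func main() {
        ch := make(chan Msg, 1)
        orig := Msg{ID: 1, Data: []int{1, 2, 3}}

        var copy Msg
        if err := deepCopy(&copy, &orig); err != nil {
            panic(err)
        }
        ch <- copy // the receiver gets its own backing array

        orig.Data[0] = 99        // mutating the original...
        got := <-ch
        fmt.Println(got.Data[0]) // ...does not affect the copy: prints 1
    }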

    --
