On 1/25/2013 6:51 PM, Watson Ladd wrote:
On Thursday, November 22, 2012 10:17:42 AM UTC-6, Anssi Porttikivi wrote:
Is there anything like that? With current transistor budgets and
clever implementations for channels in hw, could we have massive,
efficient parallelism with perhaps smaller word size?
There's a long history of "Build it and they will come"
multiprocessors. It's straightforward to build machines with
huge numbers of intercommunicating processors, and it's been
done many times. The ILLIAC IV, the BBN Butterfly, the NCube
Hypercube, the Transputer, the Sony PS3 - the hardware worked, but
the things were too hard to program. Many people tried very hard to
make those things go, but in each case, more conventional architectures
won out.
Three architectures are known to work - shared-memory
multiprocessors, clusters, and graphics processing units.
Much of the cutting-edge thinking in parallelism today involves GPUs,
which are the one big success in non-shared-memory machines.
Supercomputers today are usually clusters of shared-memory
multiprocessors, and the software explicitly sends messages
across the internal network to communicate, usually with an
MPI library.
Pick up Hennessy and Patterson's Computer Architecture, read Chapter 8,
and you will begin to understand the problems. Maintaining a global
state across distributed systems is hard and consumes bandwidth. A
single global memory doesn't scale beyond about 8-16 processors
because the memory bus becomes the bottleneck.
Current thinking seems to be that you can get into the 40-60
processor range with shared memory before the cache traffic limits
the system speed. The Intel MIC is at 50. But each CPU is roughly
comparable to an x86 CPU of a decade ago, while the inter-cache
bus is 512 bits wide and state of the art.
Once we reach the kinds of core counts you are talking about, NUMA
won't do. Even if it did, controlling sharing is crucial to
performance, and Go does not let the programmer direct sharing and
scheduling in enough detail for this. Once we get to message
passing you are in the land of OpenMPI, and that is not going to be
as easy as writing goroutines. (You could write the node software in Go,
but I'm not sure why you would want to.)
If Go enforced "Do not communicate by sharing memory;
instead, share memory by communicating", it would map well to the
OpenMPI/hypercube/Cell model, where each CPU has local memory and
all intercommunication is via message passing. Such machines are
straightforward to build but tough to program. Go has some
potential as an alternative to MPI, but channels will need more
functions for that to work, and goroutines will need more isolation.
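As a rough sketch of what "share memory by communicating" looks like in Go as it stands — a minimal message-passing pipeline where the producer hands data off over a channel and never touches it again, so ownership moves with the message instead of being guarded by locks (function names here are illustrative, not from any real MPI binding):

```go
package main

import "fmt"

// produce sends fixed-size blocks of data down the channel. After a
// send, the producer never touches the slice again: ownership moves
// with the message, so no locking is needed.
func produce(out chan<- []int) {
	for i := 0; i < 3; i++ {
		out <- []int{i, i + 1, i + 2}
	}
	close(out)
}

// pipelineSum wires the producer to a consumer over a channel and
// returns the total of everything sent through it.
func pipelineSum() int {
	blocks := make(chan []int)
	go produce(blocks)
	total := 0
	for block := range blocks {
		for _, v := range block {
			total += v
		}
	}
	return total
}

func main() {
	fmt.Println(pipelineSum()) // prints 18
}
```

The catch is that nothing in the language enforces this discipline: the producer *could* keep a reference to the slice after sending it, which is exactly the isolation goroutines would need to gain before the model maps onto hardware with genuinely separate memories.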
The Cell in its PS3 form was painful to program. With only 256K
of local memory per coprocessor, about all you could do with the Cell
coprocessors was run sequential DSP-type algorithms which processed
data as it flowed through. Great for audio processing and video
compression, bad for almost everything else.
bad for almost everything else. For most PS3 games, the single PowerPC
CPU and the NVidia graphics processor are doing most of the work,
with a few auxiliary functions offloaded to the eight Cell processors.
If the Cell had 16MB of RAM per processor, it might have been different.
Then you could get some real work done in each Cell processor before
you had to ship the data out.
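That flow-through style maps naturally onto a goroutine pipeline: each stage holds only a small working set and streams samples onward, much as an SPE streams data through its local store. A sketch of the idea, with made-up stage names (a gain stage feeding a clipper):

```go
package main

import "fmt"

// gain scales each sample — a stand-in for a DSP stage that fits in a
// small local memory and streams data through.
func gain(in <-chan float64, factor float64) <-chan float64 {
	out := make(chan float64)
	go func() {
		for s := range in {
			out <- s * factor
		}
		close(out)
	}()
	return out
}

// clip limits samples to [-limit, limit], another flow-through stage.
func clip(in <-chan float64, limit float64) <-chan float64 {
	out := make(chan float64)
	go func() {
		for s := range in {
			if s > limit {
				s = limit
			} else if s < -limit {
				s = -limit
			}
			out <- s
		}
		close(out)
	}()
	return out
}

// process feeds samples through the two-stage pipeline and collects
// the results.
func process(samples []float64) []float64 {
	in := make(chan float64)
	go func() {
		for _, s := range samples {
			in <- s
		}
		close(in)
	}()
	var out []float64
	for s := range clip(gain(in, 2.0), 1.0) {
		out = append(out, s)
	}
	return out
}

func main() {
	fmt.Println(process([]float64{0.25, 0.75, -0.9}))
}
```

Each stage here only ever holds one sample, which is why the pattern worked on the Cell's tiny local stores — and why anything needing a large random-access working set did not.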
Communication patterns have to be hand optimized for each algorithm,
often relying on hardware details.
Painfully true. It's even worse when all CPUs are not the same.
The PS3 has two different instruction sets to deal with; three if
you count the GPU.