We have a small Apple cluster running Hadoop. But another option,
built into the Apple Server OS, is their Xgrid, which they promote
for "supercomputer" scientific applications.

Any thoughts on the relative merits? I know that it depends on the
application. We want to do pattern recognition, turning raster
images into collections of the objects we discover in the images.
Another focus for us is NLP, esp. phrasal analysis.

- Bob Futrelle


  • Ted Dunning at Dec 5, 2007 at 5:18 am
    If you are looking at large numbers of independent images, then Hadoop
    should be close to perfect for this analysis (the problem is
    embarrassingly parallel). If you are looking at video, then you can
    still do quite well by building what is essentially a probabilistic
    list of recognized items in the video stream in the map phase, giving
    all frames from a single shot the same reduce key. Then, in the reduce
    phase, you can correlate the possible objects and their probabilities
    according to object persistence models. It would be good to do another
    pass after that to do scene-to-scene correlations. This formulation
    gives you near-perfect parallelism as well.
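
    [A minimal sketch of the shot-keyed formulation above, against the
    classic org.apache.hadoop.mapred API (details vary across Hadoop
    releases). The "shotId<TAB>frameRef" input layout and the detect()
    helper are hypothetical stand-ins for a real input format and a real
    recognizer.]

      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Iterator;
      import java.util.Map;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      public class ShotObjects {

        // Map: one record per frame; every detection is keyed by the shot
        // the frame belongs to, so a whole shot meets in one reduce call.
        public static class FrameMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
          public void map(LongWritable offset, Text line,
                          OutputCollector<Text, Text> out, Reporter reporter)
              throws IOException {
            String[] fields = line.toString().split("\t", 2);
            // detect() stands in for a real recognizer; it returns
            // "label:probability" strings for one frame.
            for (String hit : detect(fields[1])) {
              out.collect(new Text(fields[0]), new Text(hit));
            }
          }
          private String[] detect(String frameRef) {
            return new String[] { "car:0.83", "person:0.41" };  // placeholder
          }
        }

        // Reduce: all per-frame detections for one shot arrive together,
        // so a persistence model can combine them (a naive max is shown).
        public static class ShotReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
          public void reduce(Text shotId, Iterator<Text> hits,
                             OutputCollector<Text, Text> out, Reporter reporter)
              throws IOException {
            Map<String, Double> best = new HashMap<String, Double>();
            while (hits.hasNext()) {
              String[] parts = hits.next().toString().split(":");
              double p = Double.parseDouble(parts[1]);
              Double prev = best.get(parts[0]);
              if (prev == null || p > prev) {
                best.put(parts[0], p);
              }
            }
            // Emit one combined "label:probability" estimate per shot.
            for (Map.Entry<String, Double> e : best.entrySet()) {
              out.collect(shotId, new Text(e.getKey() + ":" + e.getValue()));
            }
          }
        }
      }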

    For NLP, the problem at the level of phrasal analysis can also be made
    trivially parallel if you have large numbers of documents. Again, you
    may need to do a secondary pass to find duplicated references across
    multiple documents, but this is usually far less intensive than the
    original analysis.
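
    [A similar sketch of the per-document NLP pass, assuming one
    KeyValueTextInputFormat-style (docId, document) record per document;
    extractPhrases() is a hypothetical stand-in for a real phrasal
    analyzer, and the reducer is the cheap cross-document pass mentioned
    above.]

      import java.io.IOException;
      import java.util.HashSet;
      import java.util.Iterator;
      import java.util.Set;

      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      public class PhraseIndex {

        // Map: run the expensive phrasal analysis independently on each
        // document; key every phrase found by its normalized form.
        public static class PhraseMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
          public void map(Text docId, Text body,
                          OutputCollector<Text, Text> out, Reporter reporter)
              throws IOException {
            for (String phrase : extractPhrases(body.toString())) {
              out.collect(new Text(phrase.trim().toLowerCase()), docId);
            }
          }
          private String[] extractPhrases(String text) {
            return text.split("[.,;]");  // placeholder for a real parser
          }
        }

        // Reduce: phrases seen in more than one document are candidate
        // duplicated references across documents.
        public static class CrossDocReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
          public void reduce(Text phrase, Iterator<Text> docs,
                             OutputCollector<Text, Text> out, Reporter reporter)
              throws IOException {
            Set<String> ids = new HashSet<String>();
            while (docs.hasNext()) {
              ids.add(docs.next().toString());
            }
            if (ids.size() > 1) {
              out.collect(phrase, new Text(ids.toString()));
            }
          }
        }
      }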

    Standard scientific HPC architectures are all about facilitating
    arbitrary communication patterns across process boundaries. This is
    exceedingly hard to do well, and few systems attain really good
    performance. Hadoop, by contrast, is built around a primitive so
    simple that it can be implemented really well on cheap commodity
    hardware. What is a bit surprising is that so many problems can be
    well expressed as map-reduce programs. Sometimes this is only true at
    really large scale, where correlations become small (allowing the map
    phase to do useful work on many sub-units); sometimes it requires
    relatively large intermediate data (as with many graph algorithms).
    The fact is, however, that it works remarkably well.
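
    [To make the "really simple primitive" point concrete, a hedged
    driver for the shot sketch above: the entire contract with the
    framework fits in a few lines. Helper locations such as
    FileInputFormat.setInputPaths vary between Hadoop releases.]

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;

      public class ShotObjectsDriver {
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(ShotObjects.class);
          conf.setJobName("shot-objects");

          // The whole contract with the framework: where the data lives,
          // what the key/value types are, and which classes map and reduce.
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(Text.class);
          conf.setMapperClass(ShotObjects.FrameMapper.class);
          conf.setReducerClass(ShotObjects.ShotReducer.class);

          JobClient.runJob(conf);  // blocks until the job completes
        }
      }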
    On 12/4/07 7:12 PM, "Bob Futrelle" wrote:

    We want to do pattern recognition, turning raster images into
    collections of the objects we discover in the images. Another focus
    for us is NLP, esp. phrasal analysis.
  • Bob Futrelle at Dec 5, 2007 at 10:05 am
    You've written a spirited statement about the strengths of Hadoop.
    But I'd still be interested in hearing from someone who might
    understand why an Xgrid cluster, with its attendant management
    system, would or would not be equally good for these problems. After
    all, there are a reasonable number of Xgrid customers who are getting
    their work done.

    Maybe I'll need to learn more about both and also engage in some
    discussions with the Xgrid community. I do intend to bring up the
    Xgrid system on our cluster to see how it works for us. That'll
    certainly deepen my understanding of both.

    Thanks for the detailed reply.

    - Bob

  • Bob Futrelle at Dec 5, 2007 at 10:42 am
    For the record, here's the Apple Xgrid (hype?) page:

    http://www.apple.com/server/macosx/technology/xgrid.html

    - Bob
  • Ted Dunning at Dec 5, 2007 at 5:23 pm
    Sorry about not addressing this (and I appreciate your gentle prod).

    The Xgrid would likely work well on these problems. They are, after all,
    nearly trivial to parallelize because of clean communication patterns.

    Consider an alternative problem of solving n-body gravitational dynamics for
    n > 10^6 bodies. Here there is nearly universal communication.

    As another example, last week I heard from some Sun engineers that one
    of their HPC systems had to satisfy a requirement for checkpointing
    large numerical computations, in which a large number of computational
    nodes were required to dump tens of terabytes of checkpoint data to
    disk in less than 10 seconds.

    Finally, many of these HPC systems are designed to fit the entire working
    set into memory so that high numerical computational throughput can be
    maintained. In this regime, communications have to work on memory
    time-scales rather than disk time-scales.

    None of these three example problems is very suitable for Hadoop.

    The sample problems you gave are a different matter.

    On 12/5/07 2:04 AM, "Bob Futrelle" wrote:

    why an Xgrid cluster, with its attendant management system,
    would or would not be equally good for these problems
  • Ted Dunning at Dec 5, 2007 at 5:55 pm
    I just read the Xgrid page, and it is clear that Apple has pushed on
    the following parameters (they may be doing lots of other cool stuff
    that I don't know about):

    A) auto-configuration
    B) wider distribution of computation
    C) local checkpointing of processes for restarts

    What they have apparently not done includes:

    X) map/reduce
    Y) magic process restarts in the face of failure (see map/reduce)
    Z) distributed file system

    When newbies try to run Hadoop, they ALWAYS seem to run headlong into
    the lack of (A) (how many times has somebody essentially said, "I have
    a totally screwed up DNS and Hadoop won't run"?).

    Item (B) is probably a bad thing for Hadoop, given the bandwidth
    required for the shuffle phase.

    Item (C) is inherent in map-reduce and is pretty neutral either way.
  • Bob Futrelle at Dec 5, 2007 at 6:01 pm
    All this feedback is informative and valuable -- Thanks!

    - Bob Futrelle
    Northeastern U.


Discussion Overview
group: common-user
categories: hadoop
posted: Dec 5, '07 at 3:13a
active: Dec 5, '07 at 6:01p
posts: 7
users: 2 (Bob Futrelle: 4 posts, Ted Dunning: 3 posts)
website: hadoop.apache.org...
irc: #hadoop
