FAQ
Are any Hadoop implementations planning to add the "enterprise features"
found in Platform MapReduce?

http://www.youtube.com/watch?v=QV4wJifsqbQ
http://www.youtube.com/watch?v=cDfZTx-BOyY
http://www.youtube.com/watch?v=MEKXo-1hnkQ

Platform said that its MapReduce implementation totally replaces the
JobTracker, while the rest of the Hadoop stack is unchanged. Is there
a Hadoop API that would allow external batch systems (like Grid Engine
or Open Grid Scheduler, PBS, Condor, SLURM, etc) to plug into Hadoop?

--Chi


  • Robert Evans at Sep 12, 2011 at 8:23 pm
    Chi,

    Most of these features are things that Hadoop is working on. There is an HA branch in progress that should go into trunk relatively soon.

    As far as batch system integration is concerned, if what you care about is the scheduling of jobs (which jobs run when, and on which machines), you can write your own scheduler against the standard scheduler API.

    --Bobby Evans

  • Ted Dunning at Sep 13, 2011 at 2:26 am
    See mapr.com

    We have added many enterprise features onto Hadoop including snapshots,
    mirroring, NFS access, high availability and higher performance.

    Since this mailing list is primarily for Apache Hadoop, you should contact
    me off-line if you would like more information.
  • Steve Loughran at Sep 13, 2011 at 11:50 am

    It would be nice to have a scheduler that worked with the other native
    schedulers -and have those schedulers work with Hadoop- so neither
    system will overcommit. The other schedulers aren't topology aware so
    will run code anywhere; it's Hadoop that cares more where stuff goes.

    The way Hadoop is architected you could do the hadoop side of the
    scheduler without breaking anything; I don't know about the others.
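    The idea above (prefer nodes that hold the data, and never hand out slots
    another scheduler has already claimed) can be illustrated with a rough
    sketch. This is not Hadoop's scheduler API; the Node class, place_task
    function and other_scheduler_slots map are hypothetical names invented
    for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    rack: str
    free_slots: int
    blocks: set = field(default_factory=set)  # IDs of data blocks stored locally


def place_task(block_id, nodes, other_scheduler_slots=None):
    """Pick a node for a task reading block_id, preferring data locality.

    other_scheduler_slots is a hypothetical map of node name -> slots already
    claimed by a second (grid) scheduler; they are subtracted so that neither
    scheduler overcommits a node.
    """
    claimed = other_scheduler_slots or {}

    def available(node):
        return node.free_slots - claimed.get(node.name, 0)

    candidates = [n for n in nodes if available(n) > 0]
    if not candidates:
        return None  # nothing free: leave the task queued rather than overcommit

    racks_with_block = {n.rack for n in nodes if block_id in n.blocks}

    def locality_rank(node):
        if block_id in node.blocks:
            return 0  # node-local
        if node.rack in racks_with_block:
            return 1  # rack-local
        return 2      # off-rack

    # Best locality first; among equals, the node with the most spare slots.
    best = min(candidates, key=lambda n: (locality_rank(n), -available(n)))
    best.free_slots -= 1
    return best.name
```

    A topology-unaware scheduler would skip the locality ranking and simply
    take any candidate, which is the behaviour described above.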
  • Steve Loughran at Sep 13, 2011 at 12:21 pm
    As an aside, if you ask for the white paper you get a PDF that
    exaggerates the limits of Hadoop.

    http://info.platform.com/rs/platform/images/Whitepaper_Top5ChallengesforHadoopMapReduceintheEnterprise.pdf

    Mostly focusing on a critique of the scheduler -which MR-279 will fix in
    Hadoop 0.23- they say

    "It is designed to be used by IT departments that
    have an army of developers to help fix any issues they
    encounter"

    I don't believe this. Cloudera and Hortonworks will do this for a fee
    -as will Platform. In most organisations the R&D effort doesn't go into
    the Hadoop codebase, it goes into writing the analysis code, which is
    why things like Pig and Hive help -they make it easier.


    "Their (Cloudera's) distribution is based on open source
    which is still an unproven large-scale enterprise full stack
    solution. There are many shortcomings in the open source
    distribution, including the workload management
    capabilities.

    Other open source commercial distributions are
    emerging, with IBM and EMC entering the marketplace.
    However, all of these offerings are based on open source
    code and inevitably inherit the strengths and weaknesses
    of that code base and architectural design. "

    Ted will point out that MapR's MR engine isn't limited, as will Brisk,
    while Arun will view that statement in the past tense. Doug and Tom will
    pick up on the word "unproven" too. Which enterprises plan to have
    Hadoop clusters bigger than Yahoo or Facebook?

    Furthermore, as Platform only puts in their own scheduler, leaving the
    filesystem alone, it's a bit weak to critique the architecture of the
    open source distro. Not a way to make friends -or get your bug fixes in.
    Or indeed, promise better scalability.

    "Therefore they cannot meet the enterprise-class requirements for "big
    data" problems as already mentioned."

    This is daft. The only thing Platform brings to the table is a scheduler
    that works with "legacy" grid workloads and a console to see what's
    going on. I don't see that being tangibly more enterprise-class than the
    existing JT -which does persist after an outage. With HDFS underneath a
    new scheduler doesn't even remove the filesystem SPOFs, so the only way
    to get an HA cluster is to swap in a premium filesystem.

    The other thing the marketing blurb gets wrong is its claim that Hadoop
    only works with one distributed file system. Not so. You can read in and
    out of any filesystem, file:// being a handy one that works with NFS
    mount points too.

    Overall, a disappointing white paper, as all it can do to criticise open
    source Hadoop is spread fear about the number of developers you need to
    maintain it, and the limitations of the Hadoop scheduler vs their
    scheduler, that being the only thing that differs between the Platform
    product and the full OSS release.

    I missed a talk at the local university by a Platform sales rep last
    month, though I did get to offend one of the authors of Condor
    instead [1], by pointing out that all grid schedulers contain a major
    assumption: that storage access times are constant across your cluster.
    They are if you can pay for something like GPFS, but you don't get 50TB of
    GPFS storage for $2500, which is what adding 25*2TB SATA drives would
    cost if you stuck them on your compute nodes; $7500 for a fully
    replicated 50TB. That's why I'm not a fan of grid systems: the costs of
    storage and networking aren't taken into account. Then there are the
    availability issues with the larger filesystems, which are a topic for
    another day.
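
    The disk figures above can be reproduced with a few lines of arithmetic.
    The $100 price per 2TB drive and the replication factor of 3 are not
    stated in the post; they are inferred here from $2500 / 25 drives and
    $7500 / $2500, and raw_disk_cost is a hypothetical helper name.

```python
def raw_disk_cost(usable_tb, drive_tb=2, drive_price=100, replication=1):
    """Return (drive_count, total_cost) for usable_tb of data.

    drive_price=100 and replication=3 (used below) are assumptions inferred
    from the figures quoted in the post, not values stated there.
    """
    raw_tb = usable_tb * replication
    drives = -(-raw_tb // drive_tb)  # ceiling division
    return drives, drives * drive_price


print(raw_disk_cost(50))                 # (25, 2500): unreplicated
print(raw_disk_cost(50, replication=3))  # (75, 7500): three copies of every block
```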

    I look forward to them giving a talk at any forthcoming London HUG event
    and will try to do a follow-on talk introducing MR-279 and arguing in
    favour of an OSS solution because the turnaround time on defects is faster.

    -Steve

    [1] Miron Livny, facing the camera, two to the left of Sergey Melnik
    (the author of Dremel, with the camera): http://flic.kr/p/akUzE7
  • Brian Bockelman at Sep 13, 2011 at 1:47 pm

    For what it's worth - I do know folks who have done (are doing) data locality with Condor. Condor is wonderfully flexible, easily flexible enough to shoot yourself in the foot. There was also a grad student who did work in allowing Condor to fire up Hadoop datanodes and job trackers directly.

    For the most part you are right though - all these systems have long treated nodes as individual, independent units (either because the systems were job-oriented, not data-oriented, or because they ran at supercomputing centers where money was no concern).

    This is starting to change, but change is always frustratingly slow. On the upside, we now have single Condor pools that span 80 sites around the globe and it is easy to have two Condor pools interoperate and exchange jobs. So, each system has its own strengths and weaknesses.

    Brian
  • Milind Bhandarkar at Sep 19, 2011 at 7:21 pm
    For those who are not aware of six-year-old history :-), in October 2005
    Sameer, Owen and I made a trip to Madison, Wisconsin to meet with Miron
    Livny, who built Condor, to explore how matchmaking could be used with
    MapReduce (pre-Hadoop). (We believed it could be used for
    locality-aware scheduling.) One thing that came out of that meeting was
    that the Condor folks were not ready to incorporate multi-threading, which
    we felt was needed for scheduler responsiveness.

    - Milind

Discussion Overview
group: common-dev @
category: hadoop
posted: Sep 12, '11 at 6:04p
active: Sep 19, '11 at 7:21p
posts: 7
users: 6
website: hadoop.apache.org...
irc: #hadoop
