FAQ
Hadoop newbie here.

I wrapped my company's entity extraction product in a Hadoop task
and gave it a large file, on the order of 100 MB.
I have 4 VMs running on a 24-core CPU server: two of them are
slave nodes, one is the namenode, and one is the job tracker.
It turned out that processing the same amount of data takes longer
with Hadoop than processing it serially.

I am curious how I can see the advantage of
Hadoop. Is having many physical machines essential?
Would I need to process terabytes of data? What would be
the minimum setup where I could see the advantage
of Hadoop?
----
T. "Kuro" Kurosaka


  • Brian Bockelman at Aug 31, 2011 at 12:04 pm
    Hi Kuro,

    A 100MB file should take 1 second to read; typically, MR jobs get scheduled on the order of seconds. So, it's unlikely you'll see any benefit.

    You'll probably want to have a look at Amdahl's law:

    http://en.wikipedia.org/wiki/Amdahl%27s_law
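For intuition, Amdahl's law can be sketched in Python; the parallel fractions and worker counts below are illustrative assumptions, not measurements from this cluster:

```python
def amdahl_speedup(p, n):
    """Ideal overall speedup when a fraction p of the work is
    parallelizable and runs on n workers (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)

# If fixed per-job overhead (scheduling, task startup) eats half the
# wall time of a short job, only p = 0.5 of it parallelizes (assumed):
print(amdahl_speedup(0.5, 2))    # 2 slave nodes -> at most ~1.33x
print(amdahl_speedup(0.5, 100))  # even 100 workers stay below 2x
print(amdahl_speedup(0.99, 100)) # long compute-bound jobs fare better
```

The takeaway: for a job that finishes in seconds, the non-parallelizable overhead dominates, so adding workers barely helps.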

    Brian
    On Aug 31, 2011, at 3:48 AM, Teruhiko Kurosaka wrote: (original message quoted above)
  • Teruhiko Kurosaka at Aug 31, 2011 at 11:30 pm
    Brian,
    This particular task is compute-heavy; the computation itself takes on the order of minutes.
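A back-of-envelope break-even check: the ~30-second fixed per-job overhead below is an assumed figure for Hadoop's job scheduling and task startup, not a measurement from this cluster. The point is that once compute time dominates that overhead, splitting the work starts to pay off even on a small cluster.

```python
def parallel_wall_time(serial_secs, workers, overhead_secs):
    """Wall time if the work splits evenly across workers, plus a
    fixed framework overhead (job scheduling, task startup)."""
    return serial_secs / workers + overhead_secs

serial = 5 * 60   # computation "on the order of minutes" (assumed: 5 min)
overhead = 30     # assumed fixed per-job overhead, in seconds
for workers in (1, 2, 4):
    print(workers, parallel_wall_time(serial, workers, overhead))
# 2 workers: 300/2 + 30 = 180 s vs. 300 s serial -- already a win
```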
    ----
    T. "Kuro" Kurosaka
    Subject: Re: How big data and/or how many machines do I need to take advantage of Hadoop?

Discussion Overview
group: common-user
category: hadoop
posted: Aug 31, '11 at 8:49a
active: Aug 31, '11 at 11:30p
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop
