FAQ
Hi,

Has anyone here used Hadoop to process more than 3TB of data? If so, we
would like to know how many machines you used in your cluster and
about the hardware configuration. The objective is to understand how to
handle huge data in a Hadoop cluster.

--
With Regards,
Karthik

  • Harsh J at Jul 6, 2011 at 10:52 am
    Have you taken a look at http://wiki.apache.org/hadoop/PoweredBy? It
    contains information relevant to your question, if not a detailed
    answer.
    On Wed, Jul 6, 2011 at 4:13 PM, Karthik Kumar wrote:
    Hi,

    Has anyone here used hadoop to process more than 3TB of data? If so we
    would like to know how many machines you used in your cluster and
    about the hardware configuration. The objective is to know how to
    handle huge data in Hadoop cluster.

    --
    With Regards,
    Karthik


    --
    Harsh J
  • Karthik Kumar at Jul 6, 2011 at 11:06 am
    Hi,

    I wanted to know the time required to process huge datasets and the number
    of machines used for them.
    On 7/6/11, Harsh J wrote:
    Have you taken a look at http://wiki.apache.org/hadoop/PoweredBy? It
    contains information relevant to your question, if not a detailed
    answer.
    On Wed, Jul 6, 2011 at 4:13 PM, Karthik Kumar wrote:
    Hi,

    Has anyone here used hadoop to process more than 3TB of data? If so we
    would like to know how many machines you used in your cluster and
    about the hardware configuration. The objective is to know how to
    handle huge data in Hadoop cluster.

    --
    With Regards,
    Karthik


    --
    Harsh J

    --
    With Regards,
    Karthik
  • Harsh J at Jul 6, 2011 at 11:24 am
    Karthik,

    That's a highly process-dependent question, I think -- what you would
    do with the data determines the time it takes. No two
    applications are the same, in my experience.
    On Wed, Jul 6, 2011 at 4:35 PM, Karthik Kumar wrote:
    Hi,

    I wanted to know the time required to process huge datasets and number
    of machines used for them.
    On 7/6/11, Harsh J wrote:
    Have you taken a look at http://wiki.apache.org/hadoop/PoweredBy? It
    contains information relevant to your question, if not a detailed
    answer.

    On Wed, Jul 6, 2011 at 4:13 PM, Karthik Kumar <karthik84kumar@gmail.com>
    wrote:
    Hi,

    Has anyone here used hadoop to process more than 3TB of data? If so we
    would like to know how many machines you used in your cluster and
    about the hardware configuration. The objective is to know how to
    handle huge data in Hadoop cluster.

    --
    With Regards,
    Karthik


    --
    Harsh J

    --
    With Regards,
    Karthik


    --
    Harsh J
  • Steve Loughran at Jul 6, 2011 at 11:32 am

    On 06/07/11 11:43, Karthik Kumar wrote:
    Hi,

    Has anyone here used hadoop to process more than 3TB of data? If so we
    would like to know how many machines you used in your cluster and
    about the hardware configuration. The objective is to know how to
    handle huge data in Hadoop cluster.
    This is too vague a question. What do you mean by "process"? Scan through
    some logs looking for values? You could do that on a single machine if
    you weren't in a rush and had enough disks; you'd just be very IO
    bound, and to be honest HDFS needs a minimum number of machines to
    become fault tolerant. Do complex matrix operations that use lots of RAM
    and CPU? You'll need more machines.

    If your cluster has a block size of 512MB, then a 3TB file fits into
    (3*1024*1024)/512 = 6144 blocks, so you can't usefully have more than 6144
    machines anyway - that's your theoretical maximum, even if your name is
    Facebook or Yahoo!
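
    To make that arithmetic concrete, a minimal sketch of the block-count
    calculation in Python (the 3TB input and 512MB block size are just the
    figures from the paragraph above):

        # Back-of-envelope: how many HDFS blocks does the input occupy?
        # Each block becomes (at most) one map task, so this is also the
        # theoretical maximum number of machines that can work on it at once.

        def block_count(data_size_mb, block_size_mb):
            # Ceiling division: a partial final block still counts as a block.
            return -(-data_size_mb // block_size_mb)

        data_size_mb = 3 * 1024 * 1024   # 3TB expressed in MB
        block_size_mb = 512              # block size used in the example above

        print(block_count(data_size_mb, block_size_mb))  # 6144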

    What you are looking for is somewhere between 10 and 6144, the exact
    number driven by:
    - how much compute you need to do, and how fast you want it done
      (this controls the number of CPUs and the amount of RAM)
    - how much total HDD storage you anticipate needing
    - whether you want to do leading-edge GPU work (good performance on
      some tasks, but limited work per machine)
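
    As a rough illustration of the storage factor, a back-of-envelope sketch
    in Python; the replication factor, overhead allowance and per-node disk
    are assumed figures for illustration, not recommendations:

        import math

        def nodes_for_storage(raw_tb, replication=3, overhead=1.3, disk_per_node_tb=24):
            # replication:      HDFS replication factor (3 is the usual default)
            # overhead:         headroom for temp/intermediate data (assumed 30%)
            # disk_per_node_tb: raw disk per worker, e.g. 12 x 2TB drives
            required_tb = raw_tb * replication * overhead
            return math.ceil(required_tb / disk_per_node_tb)

        print(nodes_for_storage(3))    # 1  -- 3TB raw barely registers as a storage problem
        print(nodes_for_storage(300))  # 49 -- storage starts to drive node count at larger scale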

    You can use benchmarking tools like gridmix3 to get some more data on
    the characteristics of your workload, which you can then take to your
    server supplier to say "this is what we need, what can you offer?"
    Otherwise everyone is just guessing.

    Remember also that you can add more racks later, but you will need to
    plan ahead on datacentre space, power and, very importantly, how you are
    going to expand the networking. Life is simplest if everything fits into
    one rack, but if you plan to expand you need a roadmap of how to
    connect that rack to new ones, which means adding fast interconnect
    between the different top-of-rack switches. You also need to worry about
    how to get data in and out fast.


    -Steve
  • Steve Loughran at Jul 6, 2011 at 11:33 am

    On 06/07/11 11:43, Karthik Kumar wrote:
    Hi,

    Has anyone here used hadoop to process more than 3TB of data? If so we
    would like to know how many machines you used in your cluster and
    about the hardware configuration. The objective is to know how to
    handle huge data in Hadoop cluster.
    Actually, I've just thought of a simpler answer: 40. It's completely
    random, but if said with confidence it's as valid as any other answer to
    your current question.
  • Michel Segel at Jul 6, 2011 at 12:18 pm
    Wasn't the answer 42? ;-P

    Looking at your calc...
    You forgot to factor in the number of slots per node, so the number of
    machines is only a fraction of that. Assume 10 slots per node (10 because
    it makes the math easier).

    Then you need only 300 machines. You could then name your cluster lambda.
    (Another literary reference...)

    300 machines is a manageable cluster.

    I agree that the initial question is vague and the only true answer is 'it depends...'
    But if they want to build out a cluster of 300 machines... I've got a guy... :-)



    Sent from a remote device. Please excuse any typos...

    Mike Segel
    On Jul 6, 2011, at 6:32 AM, Steve Loughran wrote:
    On 06/07/11 11:43, Karthik Kumar wrote:
    Hi,

    Has anyone here used hadoop to process more than 3TB of data? If so we
    would like to know how many machines you used in your cluster and
    about the hardware configuration. The objective is to know how to
    handle huge data in Hadoop cluster.
    Actually, I've just thought of simpler answer. 40. It's completely random, but if said with confidence it's as valid as any other answer to your current question.
  • Steve Loughran at Jul 6, 2011 at 4:44 pm

    On 06/07/11 13:18, Michel Segel wrote:
    Wasn't the answer 42? ;-P

    42 = 40 workers + NN + secondary NN, assuming the JT runs on the
    secondary's machine or on one of the worker nodes
    Looking at your calc...
    You forgot to factor in the number of slots per node.
    So the number is only a fraction. Assume 10 slots per node. (10 because it makes the math easier.)
    I thought something was wrong. Then I thought of the server revenue and
    decided not to look that hard.
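
    For reference, the slots-per-node arithmetic discussed above, written out
    explicitly; a sketch with assumed figures (the 6144 blocks from the
    earlier 3TB/512MB example and 10 map slots per node), not anyone's
    actual cluster:

        import math

        def nodes_needed(blocks, slots_per_node, waves=1):
            # Each block becomes one map task; each node runs slots_per_node tasks
            # at a time, so accepting more waves means fewer nodes (and a longer job).
            return math.ceil(blocks / (slots_per_node * waves))

        blocks = 6144  # 3TB at a 512MB block size, from the earlier calculation
        print(nodes_needed(blocks, slots_per_node=10, waves=1))  # 615 -- one wave of maps
        print(nodes_needed(blocks, slots_per_node=10, waves=2))  # 308 -- roughly the "300" above
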
  • Karthik Kumar at Jul 7, 2011 at 6:33 am
    Hi,

    Thanks a lot for your timely help. Your valuable answers helped us to
    understand what kind of hardware to use when it comes to huge data.

    With Regards,
    Karthik
    On 7/6/11, Steve Loughran wrote:
    On 06/07/11 13:18, Michel Segel wrote:
    Wasn't the answer 42? ;-P

    42 = 40 + NN +2ary NN, assuming the JT runs on 2ary or on one of the
    worker nodes
    Looking at your calc...
    You forgot to factor in the number of slots per node.
    So the number is only a fraction. Assume 10 slots per node. (10 because it
    makes the math easier.)
    I thought something was wrong. Then I thought of the server revenue and
    decided not to look that hard.

    --
    With Regards,
    Karthik
  • M. C. Srivas at Jul 6, 2011 at 3:37 pm
    We ran the following on a 10+1 machine cluster (2 quad-core CPUs, 24GB DRAM,
    12x2TB drives, 2 NICs each) running the 0.20.2 release:

    - 3.5TB terasort took ~4.5 hrs
    - 10TB terasort took ~12.5 hrs
    - 20TB terasort took >24 hrs

    So yeah, Hadoop can handle it. If you want faster times, you'll have to try:
    - adding more machines
    - using some other distro
    - or both
    On Wed, Jul 6, 2011 at 3:43 AM, Karthik Kumar wrote:

    Hi,

    Has anyone here used hadoop to process more than 3TB of data? If so we
    would like to know how many machines you used in your cluster and
    about the hardware configuration. The objective is to know how to
    handle huge data in Hadoop cluster.

    --
    With Regards,
    Karthik
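
    To put those terasort figures in perspective, a rough extrapolation
    sketch; the per-worker throughput is derived from the 3.5TB/4.5hr data
    point above, and the assumption of linear scaling with worker count is an
    idealisation that real clusters only approximate:

        def estimated_hours(data_tb, workers, tb_per_hour_per_worker=0.078):
            # 0.078 TB/hr/worker ~= 3.5TB / (4.5hr * 10 workers), the throughput
            # implied by the 10-worker cluster described above.
            return data_tb / (workers * tb_per_hour_per_worker)

        print(round(estimated_hours(3.5, 10), 1))  # ~4.5  -- reproduces the quoted run
        print(round(estimated_hours(10, 10), 1))   # ~12.8 -- close to the quoted ~12.5
        print(round(estimated_hours(10, 40), 1))   # ~3.2  -- a hypothetical 40-worker cluster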

Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Jul 6, '11 at 10:44a
active: Jul 7, '11 at 6:33a
posts: 10
users: 5
website: hadoop.apache.org...
irc: #hadoop
