How many nodes does one man want?
Hi,
I am working on implementing some machine learning algorithms using
MapReduce. I want to know: if I have data that takes 5-6 hours to train on a
normal machine, will putting in 2-3 more nodes have an effect? I read this in
the Yahoo Hadoop tutorial:
"Executing Hadoop on a limited amount of data on a small number of nodes may
not demonstrate particularly stellar performance as the overhead involved in
starting Hadoop programs is relatively high. Other parallel/distributed
programming paradigms such as MPI (Message Passing Interface) may perform
much better on two, four, or perhaps a dozen machines."

I have at my disposal 3 laptops, each with 4 GB RAM and 150 GB of hard disk
space... I have 600 MB of training data....
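
A rough back-of-envelope check (assuming the 2009-era default HDFS block
size of 64 MB and a mostly parallelizable training phase, neither of which
is stated in the thread):

  600 MB / 64 MB per block ~= 10 input splits, i.e. up to ~10 map tasks
  speedup(n) = 1 / ((1 - p) + p / n)   [Amdahl's law]
  with an assumed parallel fraction p = 0.95 and n = 3 nodes:
  speedup ~= 2.7, so 5-6 hours could drop to roughly 2

Per-job startup overhead in Hadoop is typically seconds to tens of seconds,
negligible against a multi-hour job; the tutorial's warning applies mainly
to short jobs on small clusters.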


  • Kevin Peterson at Mar 27, 2009 at 10:44 pm

    On Thu, Mar 26, 2009 at 4:38 PM, Sid123 wrote:
    [quoted message trimmed]
    I'd say don't bother. Not because adding two machines won't roughly
    double your performance (it may come close), but because of the hassle
    of setting up Hadoop, copying data in and out of HDFS, restructuring
    your code within the map-reduce paradigm, and so on (a sketch of that
    boilerplate follows this message).

    I have a machine learning task that takes about an hour on my machine.
    I find running it locally significantly more convenient than running it
    on Hadoop, even though I'm already working within Hadoop. Of course,
    some of this inconvenience is due to EC2, not Hadoop itself. If I could
    run it from inside Eclipse, it might be a different story.
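
To make the "restructuring" above concrete, here is a minimal sketch of the
driver and stub map/reduce classes a training job needed under the 2009-era
org.apache.hadoop.mapred API. TrainJob, TrainMapper, and TrainReducer are
hypothetical placeholders (no code appears in the thread); real learning
logic would replace the stub bodies:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TrainJob {

  // Hypothetical mapper: in a real job this would parse one training
  // example per line and emit partial statistics for the learner.
  public static class TrainMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text("stats"), line);
    }
  }

  // Hypothetical reducer: in a real job this would merge the partial
  // statistics into a model; here it just counts examples.
  public static class TrainReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      int n = 0;
      while (values.hasNext()) { values.next(); n++; }
      out.collect(key, new Text("examples=" + n));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(TrainJob.class);
    conf.setJobName("train");
    // Input must already be in HDFS (e.g. via "hadoop fs -put"), which is
    // part of the copy-in/copy-out hassle described above.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(TrainMapper.class);
    conf.setReducerClass(TrainReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    JobClient.runJob(conf);  // blocks until the job finishes
  }
}

Running it still means packaging a jar and staging data, e.g. something
like "hadoop fs -put train.txt /data/" followed by
"hadoop jar train.jar TrainJob /data/train.txt /out", which is exactly the
setup overhead being weighed against the speedup here.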

Discussion Overview
group: common-user
categories: hadoop
posted: Mar 26, '09 at 11:38p
active: Mar 27, '09 at 10:44p
posts: 2
users: 2 (Kevin Peterson: 1 post, Sid123: 1 post)
website: hadoop.apache.org...
irc: #hadoop
