All,

I have been running terasort on a 480-node Hadoop cluster. I also collected CPU, memory, disk and network statistics during the run. The system stats are quite interesting. I can post them once I have put them together in a presentable format (if there is interest). However, while looking at the data, I noticed something odd.

I thought, intuitively, that all the systems in the cluster would show more or less similar behaviour: shifted in time perhaps, but with the same overall graph.

Just to confirm it, I took 5 random nodes and looked at the CPU, disk, network, etc. activity while the sort was running. Strangely enough, it was not so. Two of the 5 systems were seriously busy, with heavy I/O and lots of disk and network activity. On the other three systems, the CPU was more or less 100% idle, with only slight network and I/O activity.

Is that normal and/or expected? Shouldn't all the nodes be utilized in more or less the same manner over the length of the run?

I generated the data for the sort using teragen (128 MB block size, replication = 3).

I would also be interested in other people's sort timings. Is there some place where people can post sort numbers (not just the record)?

I will post the actual graphs of the 5 nodes tomorrow, if there is interest. (There are some logistical issues with posting them tonight.)

I am using CDH3B3, though I don't think this is specific to CDH3B3.

Sorry for the cross post.

Raj

  • Bharath vissapragada at Jan 11, 2011 at 5:47 am
    Ravi,

    Please post the figures and graphs. Figures for large clusters
    (> 200 nodes) are certainly interesting.

    Thanks
  • Adarsh Sharma at Jan 11, 2011 at 6:37 am
    If possible, please also post your configuration parameters, such as
    *dfs.data.dir* and *mapred.local.dir*, the map and reduce parameters, Java, etc.


    Thanks
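For readers gathering the settings requested above, here is a minimal Python sketch that pulls selected properties out of a Hadoop site file. It assumes the usual configuration/property/name/value XML layout. The directory paths in the sample and the helper name pick_properties are hypothetical and are not taken from Raj's cluster; only the 128 MB block size and replication factor of 3 come from the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical site file contents; the directory paths are made up for illustration.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/mapred/local</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""


def pick_properties(site_xml, wanted):
    """Return {name: value} for every property whose name is in `wanted`."""
    root = ET.fromstring(site_xml.strip())
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in wanted:
            found[name] = (prop.findtext("value") or "").strip()
    return found


if __name__ == "__main__":
    for key, value in pick_properties(
        SAMPLE_SITE_XML, {"dfs.data.dir", "mapred.local.dir"}
    ).items():
        print(f"{key} = {value}")
```

On a real cluster the same function could be fed the text of the site files in the Hadoop conf directory; which file holds which property depends on the installation.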

  • Phil Whelan at Jan 11, 2011 at 6:40 am
    Hi Raj,
    Two of the 5 systems were seriously busy, big IO with lots of disk and network activity. The other three systems, CPU was more or less 100% idle, slight network and I/O.
    This process defaults to just 2 map tasks, so only 2 nodes are
    utilized. Did you try setting the mapred.map.tasks option? I found a very
    similar question and answer here:

    http://www.mail-archive.com/common-user@hadoop.apache.org/msg00005.html
    1. The data is generated in such a way that it is not balanced
    across my cluster. This is because the data is generated with 2 maps.
    These are due to the default #maps/#reduces in Map-Reduce.
    Use:
    $ bin/hadoop jar hadoop-*-dev-examples.jar teragen -Dmapred.map.tasks=8000 10000000000 /tera/in
    $ bin/hadoop jar hadoop-*-dev-examples.jar terasort -Dmapred.reduce.tasks=5300 /tera/in /tera/out
    Arun
    Hope that helps.

    Thanks,
    Phil
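As a rough check on the numbers in the quoted commands: TeraGen writes fixed-size 100-byte rows, so the row count and the number of maps determine how much data each map task writes. The Python sketch below only does that arithmetic; the function name teragen_layout is hypothetical.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_layout(total_rows, num_maps):
    """Return (total_bytes, rows_per_map, bytes_per_map) for an even split."""
    total_bytes = total_rows * ROW_BYTES
    rows_per_map = total_rows // num_maps
    bytes_per_map = rows_per_map * ROW_BYTES
    return total_bytes, rows_per_map, bytes_per_map


if __name__ == "__main__":
    # Row count and map counts taken from the thread: 8000 maps vs. the default of 2.
    for maps in (8000, 2):
        total, rows, per_map = teragen_layout(10_000_000_000, maps)
        print(f"{maps} maps: {total} bytes total, {rows} rows and {per_map} bytes per map")
```

With 8000 maps each task writes 125,000,000 bytes, a little under one 128 MB block. With the default of 2 maps each task would write half of the data set, and since HDFS normally places the first replica of a block on the node doing the writing, that alone could skew where the data lands.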
  • Raj V at Jan 11, 2011 at 4:05 pm
    I used 9500 maps.

    The number of maps defaults to 2 for teragen. For terasort, it would depend on the number of input files, the dfs.block.size and the number of nodes.
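To illustrate how the number of input files and the block size drive the map count for the sort, here is a simplified Python sketch. It assumes one split per block of each input file and ignores minimum/maximum split size settings and the small tolerance the input format allows on a file's last split. The file sizes are hypothetical, since the thread does not give the data set size.

```python
MIB = 1024 * 1024


def estimate_map_tasks(file_sizes, block_size):
    """Estimate map tasks as one split per block, counted per input file."""
    # Integer ceiling division, so a partly filled last block still counts as a split.
    return sum(-(-size // block_size) for size in file_sizes if size > 0)


if __name__ == "__main__":
    block = 128 * MIB  # block size mentioned in the thread
    # Hypothetical layout: 9500 teragen output files of 100 MiB each.
    print(estimate_map_tasks([100 * MIB] * 9500, block))
    # Hypothetical layout: the same number of files, but 300 MiB each.
    print(estimate_map_tasks([300 * MIB] * 9500, block))
```

Under these assumptions, files smaller than one block give one map per file, while larger files give several maps each.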



    Raj

    From: Phil Whelan <phil123@gmail.com>
    To: common-user@hadoop.apache.org; Raj V <rajvish@yahoo.com>
    Cc:
    Sent: Monday, January 10, 2011 10:39:29 PM
    Subject: Re: TeraSort question.

  • Ted Dunning at Jan 11, 2011 at 4:23 pm
    Raj,

    Do you have the job history files? That would be very useful. I would be
    happy to create some swimlane and related graphs for you if you can send me
    the history files.
  • Raj V at Jan 11, 2011 at 4:41 pm
    Ted


    Thanks. I have all the graphs I need, including the map-reduce timeline and the system activity for all the nodes while the sort was running. I will publish them once I have them in a presentable format.

    For legal reasons, I really don't want to send the complete job history files.

    My question is still this: when running terasort, would the CPU, disk and network utilization of all the nodes be more or less similar, or completely different?

    Sometime during the day, I will post the system data from 5 nodes, and that will probably explain my question better.

    Raj
  • Niels Basjes at Jan 11, 2011 at 7:07 pm
    Raj,

    Have a look at the graph shown here:
    http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_1.1_--_Generating_Task_Timelines

    It should make clear that the number of tasks varies greatly over the
    lifetime of a job.
    Depending on the number of nodes available, this may leave nodes idle.

    Niels

    --
    Kind regards,

    Niels Basjes
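The point about task counts varying over a job's lifetime can be made concrete with a small Python sketch that counts how many tasks are running at each moment, which is the information a task timeline plots. The task intervals below are hypothetical and are not taken from Raj's job history.

```python
def running_tasks(intervals, t):
    """Count tasks whose half-open interval [start, end) covers time t."""
    return sum(1 for start, end in intervals if start <= t < end)


def timeline(intervals, step):
    """Sample the number of running tasks every `step` time units."""
    horizon = max(end for _, end in intervals)
    return [(t, running_tasks(intervals, t)) for t in range(0, horizon, step)]


# Hypothetical (start, end) times: a wave of four tasks, a smaller second wave,
# then a single long-running task at the tail of the job.
TASKS = [(0, 10), (0, 10), (0, 10), (0, 10), (10, 20), (10, 20), (20, 60)]

if __name__ == "__main__":
    for t, count in timeline(TASKS, 10):
        print(f"t={t:3d}  running={count}")
```

In this toy data the count drops from 4 to 2 to 1, so during the tail of the job most nodes would have nothing to run regardless of how many are available.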
  • Steve Loughran at Jan 13, 2011 at 11:05 am

    On 11/01/11 16:40, Raj V wrote:
    My question is still this. When running terasort, would the CPU, disk and network utilization of all the nodes be more or less similar or completely different?
    They can be different. The JT pushes out work to machines when they
    report in, so some may get more work than others and therefore generate
    more local data. This will have follow-on consequences. In a live system
    things are different, as the work tends to follow the data, so machines
    with (or near) the data you need get the work.

    It's really hard to say whether "the cluster is working right". When
    bringing one up, everyone is really guessing about expected performance.

    -Steve
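A deliberately simplified Python model of "work is handed out when machines report in" shows how some nodes can end up with everything and others with nothing. This is a toy, not Hadoop's actual scheduler: it ignores data locality, per-heartbeat limits and load balancing, and the node names, slot count and task count are all hypothetical.

```python
def assign_on_heartbeat(heartbeat_order, slots_per_node, pending_tasks):
    """Give each reporting node as many pending tasks as it has free slots."""
    assigned = {}
    remaining = pending_tasks
    for node in heartbeat_order:
        take = min(slots_per_node, remaining)
        assigned[node] = take
        remaining -= take
    return assigned


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    # Four pending tasks and two slots per node: the first two nodes to report
    # in take all the work, and the remaining three stay idle.
    print(assign_on_heartbeat(nodes, slots_per_node=2, pending_tasks=4))
```

In this model, whenever there are fewer pending tasks than free slots in the cluster, the nodes that happen to report in first are the only busy ones.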
  • Raj V at Jan 11, 2011 at 8:31 pm
    Can't attach the PDF file that shows the different maps.

    The file is too big.

  • Raj V at Jan 13, 2011 at 4:51 pm
    Steve

    Let me plot the graphs for all the nodes. I picked 6 random nodes out of 480; 2 of these were really busy and the other 4 were idle. Either that makes me very lucky or the cluster was underutilized.

    I would have found it acceptable if different nodes were utilized in different ways, but in my case 2 nodes had serious CPU, network and disk activity and the others were completely idle.

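Raj's "very lucky or underutilized" remark can be checked with a hypergeometric calculation: if a given number of the 480 nodes were idle, how likely is it that at least 4 of 6 randomly picked nodes are idle? The Python sketch below assumes nodes are sampled without replacement. The idle-node counts tried in the example are hypothetical, and the function name is made up for illustration.

```python
from math import comb


def prob_at_least_idle(total_nodes, idle_nodes, sample_size, min_idle):
    """P(at least `min_idle` idle nodes in a random sample without replacement)."""
    busy_nodes = total_nodes - idle_nodes
    favourable = sum(
        comb(idle_nodes, k) * comb(busy_nodes, sample_size - k)
        for k in range(min_idle, sample_size + 1)
    )
    return favourable / comb(total_nodes, sample_size)


if __name__ == "__main__":
    # Hypothetical idle-node counts for a 480-node cluster.
    for idle in (48, 240, 320):
        p = prob_at_least_idle(480, idle, sample_size=6, min_idle=4)
        print(f"{idle} idle nodes: P(>= 4 of 6 idle) = {p:.4f}")
```

If only a tenth of the cluster were idle, seeing 4 idle nodes in a sample of 6 would be very unlikely, so the observation points towards a large share of idle nodes rather than luck.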

Discussion Overview
group: common-user @
category: hadoop
posted: Jan 11, '11 at 5:07a
active: Jan 13, '11 at 4:51p
posts: 11
users: 7
website: hadoop.apache.org...
irc: #hadoop
