optimization help needed
Preparing a Hadoop presentation here. For demonstration I start up a 5
machine m1.large cluster in EC2 via the Cloudera scripts ($ hadoop-ec2
launch-cluster my-hadoop-cluster 5). Then I send a 500 MB XML file into
HDFS. The Mapper receives an XML block as the key, selects an email
address from the XML, and uses it as the key for the reducer, with the
original XML as the value. The Reducer just aggregates the number of
XML blocks per email address.
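
(For reference, a rough sketch of a job along these lines, using the old
org.apache.hadoop.mapred API. The class names are hypothetical, the <email>
element name is assumed, and an input format that hands each mapper one XML
block as the record key is assumed but not shown.)

import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class EmailXmlJob {

  // Receives one XML block as the key (from the assumed XML input format),
  // pulls out the email address, and emits (email, original XML block).
  public static class EmailMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    private static final Pattern EMAIL = Pattern.compile("<email>(.*?)</email>");

    public void map(Text xmlBlock, Text ignored,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Matcher m = EMAIL.matcher(xmlBlock.toString());
      String email = m.find() ? m.group(1) : "unknown";
      out.collect(new Text(email), xmlBlock);
    }
  }

  // Counts how many XML blocks arrived for each email address.
  public static class CountReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, IntWritable> {
    public void reduce(Text email, Iterator<Text> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int count = 0;
      while (values.hasNext()) { values.next(); count++; }
      out.collect(email, new IntWritable(count));
    }
  }
}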

Running this on the cluster takes about 2:30 min. The framework uses 8
Mappers (spills) and 2 Reducers. The file contains about 600,000 XML
elements. How can I speed up processing time? One thing I can think of
is to have more than just 2 email addresses in the sample document, so
that more than 2 reducers can run in parallel. Why did the framework
choose 8 mappers and not more? Maybe my sample data is too small to
benefit from parallel processing.

Thanks in advance

  • Gang Luo at Mar 17, 2010 at 1:16 pm
    Hi,
    you can control the number of reducers with JobConf.setNumReduceTasks(n). The number of mappers is determined by (file size) / (split size); by default the split size is 64 MB. Since your dataset is not very large, there should be no big difference if you change these.

    If you are only interested in the number of blocks per email address, you don't need to send the "original xml" as the value in the intermediate result. This reduces the amount of data sent from mappers to reducers. Using a combiner to pre-aggregate the data may also help (see the sketch after this message).

    -Gang
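
(For reference, a rough sketch of both suggestions, not code from the thread:
the mapper emits (email, 1) instead of the whole XML block, the same sum
reducer is reused as the combiner, and the reducer count is set explicitly.
Old org.apache.hadoop.mapred API; class names and the <email> element are
hypothetical, and the XML-splitting input format is assumed but not shown.)

import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class EmailSumJob {

  // Emits (email, 1) only -- the XML block itself never leaves the mapper.
  public static class SumMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, IntWritable> {
    private static final Pattern EMAIL = Pattern.compile("<email>(.*?)</email>");
    private static final IntWritable ONE = new IntWritable(1);

    public void map(Text xmlBlock, Text ignored,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      Matcher m = EMAIL.matcher(xmlBlock.toString());
      out.collect(new Text(m.find() ? m.group(1) : "unknown"), ONE);
    }
  }

  // Sums partial counts; works both as the combiner and as the reducer.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text email, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(email, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(EmailSumJob.class);
    conf.setJobName("email-count");
    // conf.setInputFormat(...);              // the assumed XML-splitting input format, not shown
    conf.setMapperClass(SumMapper.class);
    conf.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side
    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setNumReduceTasks(4);                // explicit reducer count; the mapper count
                                              // still follows from file size / split size
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}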
  • Reik Schatz at Mar 17, 2010 at 2:14 pm
    Very good input not to send the "original xml" over to the reducers. As
    for JobConf.setNumReduceTasks(n): isn't that just a hint, with the real
    number determined by the Partitioner I use, which will be the default
    HashPartitioner? One other thought I had: what will happen if the values
    list sent to a single Reducer is too big to fit into memory?

    /Reik
  • Gang Luo at Mar 17, 2010 at 2:44 pm
    Hi Reik,
    the number of reducers is not a hint (the number of mappers is a hint). The default hash partitioner hashes each key and assigns the record to a reducer based on the hash modulo the number of reducers. If the values list is too large to fit into heap memory, you will get an exception and the job will fail after several attempts. You may need to increase the heap size for each task via JobConf.set("mapred.child.java.opts", "-Xmx***m") (see the sketch after this message).

    -Gang
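
(For reference, a short illustration of the two points above, not code from the
thread. The default HashPartitioner assigns a key to a reducer by taking the
key's hash modulo the number of reduce tasks, and the per-task heap is raised
through the mapred.child.java.opts property; the 512m value below is just an
example.)

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class PartitionAndHeapExample {
  public static void main(String[] args) {
    // Which reducer a given key ends up on: hash(key) mod (number of reducers).
    HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
    int numReducers = 4;
    int partition = partitioner.getPartition(
        new Text("someone@example.com"), new Text(""), numReducers);
    System.out.println("key goes to reducer " + partition);

    // Raise the child JVM heap if a task runs out of memory.
    JobConf conf = new JobConf();
    conf.set("mapred.child.java.opts", "-Xmx512m");
  }
}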
  • Reik Schatz at Mar 17, 2010 at 2:48 pm
    Thanks Gang, I will do some testing tomorrow - skip sending the whole XML,
    maybe add some Reducers - and see where I end up.

