Grokbase › Groups › HBase › user › April 2009
FAQ
Hi,
I have a 4-node cluster with the following configuration:

1) Master: 7.5 GB memory, dual-core CPU, running the Hadoop NN/DN/TT/JT, the HBase
Master, and an HBase Region Server
2) 2 slaves: 1.7 GB memory each, single-core CPU, running a Hadoop DN/TT and an HBase
Region Server

The DataNodes on the slaves are at about 66% disk usage, while the DataNode on the
master is at about 36%.

mapred.tasktracker.map.tasks.maximum: 12 (master), 4 (slaves)
mapred.tasktracker.reduce.tasks.maximum: 12 (master), 4 (slaves)
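
These are per-TaskTracker settings; as a rough sketch, this is how they look in
hadoop-site.xml on the master (the slaves use 4 for both values):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>12</value>
</property>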

Here is the job: I read a few hundred CSV files recursively from a specified
directory on HDFS and parse them line by line. The first line of each file is a
"column list" for that particular file. The map tasks parse the lines, and the
reduce tasks write the parsed results into HBase. The total input size is about
2.6 GB.

CSV ==> <NamedRowOffset, Text> == (map) ==>
<ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> == (reduce) ==>
<ImmutableBytesWritable, BatchUpdate>

Note: NamedRowOffset is a custom class that carries the current file name,
column names, etc.
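
For context, the reduce side looks roughly like the simplified sketch below (old
org.apache.hadoop.mapred API; the class name and column handling are illustrative):

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.HbaseMapWritable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Collapses all parsed columns for one row key into a single BatchUpdate.
public class CsvToHBaseReducer extends MapReduceBase
    implements Reducer<ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>,
                       ImmutableBytesWritable, BatchUpdate> {

  public void reduce(ImmutableBytesWritable row,
                     Iterator<HbaseMapWritable<byte[], byte[]>> values,
                     OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
                     Reporter reporter) throws IOException {
    BatchUpdate update = new BatchUpdate(row.get());
    while (values.hasNext()) {
      // Each value maps "family:qualifier" column names (as bytes) to cell values.
      for (Map.Entry<byte[], byte[]> cell : values.next().entrySet()) {
        update.put(cell.getKey(), cell.getValue());
      }
    }
    output.collect(row, update);
  }
}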

I tried different numbers of map and reduce tasks, and the total throughput
varies. I am trying to answer:

1) What are the best numbers of map and reduce tasks in my particular
scenario?
2) Besides the number of map and reduce tasks, do any other parameters
matter?
3) What is the common approach to observing and fine-tuning these parameters
(considering both Hadoop and HBase)?

Regards,
Yan

  • Liu Yan at Apr 6, 2009 at 2:20 pm
    My questions are in the previous email. Here are some of my observations:
    -- 24 map tasks is the total capacity of my cluster, but when I specified 24
    maps, only 18 map tasks were launched. When I specified 32 maps, 24 were
    launched. When I specified 12 maps, 10 were launched. According to the
    documentation, the number of map tasks specified by the application is only a
    hint to the framework. How can I deduce the actual number from this hint?
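
    My understanding so far is that the hint ends up as the numSplits argument to
    the InputFormat, which decides the real split count; a minimal sketch (old
    org.apache.hadoop.mapred API; the input path is made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    public class MapCountHint {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileInputFormat.setInputPaths(conf, new Path("/input/csv"));  // hypothetical input dir
        conf.setNumMapTasks(24);  // only a hint to the framework

        // The actual map count equals the number of splits the InputFormat returns,
        // which depends on file and block boundaries, not directly on the hint.
        InputSplit[] splits = conf.getInputFormat().getSplits(conf, conf.getNumMapTasks());
        System.out.println("Maps that will actually run: " + splits.length);
      }
    }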

    -- I noticed the following task summary in the log:
    Task A:
    File Systems
    HDFS bytes read 152,379,850
    Local bytes read 559,188,620
    Local bytes written 1,118,100,276
    Task B:
    File Systems
    HDFS bytes read 8,725
    Local bytes written 31,316

    Does this mean that Task B only read from HDFS, while Task A read a
    significant amount of data from its local disk (and hence performed better)?
    Task A also read many bytes from HDFS; is reducing this as much as possible a
    good direction for performance improvement?

    -- I also noticed the following summary:
    Task C:
    Map-Reduce Framework
    Combine output records 0
    Map input records 111
    Map output bytes 30,791
    Map input bytes 533
    Combine input records 0
    Map output records 111
    Task D:
    Map-Reduce Framework
    Combine output records 0
    Map input records 1,391,167
    Map output bytes 554,233,388
    Map input bytes 152,371,658
    Combine input records 0
    Map output records 1,391,145

    It seems Task D is much heavier than Task C. Of my 10 map tasks, 8 are
    similar to Task D (heavy) and 2 are similar to Task C (very light). Why does
    this happen, and can I control it?

    -- I tried different numbers of reduce tasks. This is important because the
    time spent in the map phase is small (10 to 15 minutes) while the reduce phase
    dominates (5 to 8 hours).
    When I specified 4 reducers (the same number as my region servers), I got the
    best throughput (a little under 4 hours). When I specified 6 or 8 reducers, I
    got much worse results (6 to 8 hours). Should the number of reduce tasks be
    exactly the same as the number of region servers?
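
    Since the reducer count is just a job setting, it is cheap to experiment with;
    a sketch of the relevant job setup (reusing the hypothetical CsvToHBaseReducer
    sketched earlier in the thread):

    import org.apache.hadoop.hbase.io.BatchUpdate;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.mapred.JobConf;

    public class JobSetup {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setReducerClass(CsvToHBaseReducer.class);  // hypothetical reducer from the earlier sketch
        conf.setOutputKeyClass(ImmutableBytesWritable.class);
        conf.setOutputValueClass(BatchUpdate.class);
        conf.setNumReduceTasks(4);  // 4 reducers (one per region server) gave the best throughput here
      }
    }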

    Regards,
    Yan


