Grokbase Groups HBase user April 2010
Hi guys,

I have an HBase table of about 8 GB and I want to run an MR job against it. It runs extremely slowly in my case. One thing I noticed is that the job runs only 2 map tasks. Is there any way to set a bigger number of map tasks? I saw a method for this in the mapred package, but can't find anything like it in the new mapreduce package.

I run my MR job on a single machine in cluster mode.


_____________________________________________________________
Sign up for your free SaturnFans email account at http://webmail.saturnfans.com/

Search Discussions

  • Amandeep Khurana at Apr 11, 2010 at 3:05 am
You can set the number of map tasks in your job config to a big number (e.g.
100000), and the library will automatically spawn one map task per region.

    -ak


    Amandeep Khurana
    Computer Science Graduate Student
    University of California, Santa Cruz

  • Andriy Kolyadenko at Apr 11, 2010 at 3:18 am
    Hi,

thanks for the quick response. I tried the following in my code:

    job.getConfiguration().setInt("mapred.map.tasks", 10000);

but unfortunately I get the same result.

    Any other ideas?

  • Amandeep Khurana at Apr 11, 2010 at 3:38 am
Ah, I still use the old API for the job configuration. On the JobConf
object, you can call the setNumMapTasks() method.
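A minimal sketch of that old-API call (the driver class name here is a placeholder, and a real job would still need its input/output wiring before submission):

```java
import org.apache.hadoop.mapred.JobConf;

public class SetMapTasksExample {
    public static JobConf configure() {
        // JobConf is the old ("org.apache.hadoop.mapred") API's job configuration.
        JobConf conf = new JobConf(SetMapTasksExample.class);
        // Note: this is only a hint to the framework. Against an HBase table,
        // TableInputFormat still creates at most one input split (and therefore
        // one map task) per region.
        conf.setNumMapTasks(100000);
        return conf;
    }
}
```

As the rest of the thread points out, for an HBase input the effective number of map tasks remains bounded by the region count regardless of this setting.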

    -ak


    Amandeep Khurana
    Computer Science Graduate Student
    University of California, Santa Cruz

  • Jean-Daniel Cryans at Apr 11, 2010 at 9:15 am
A map job against an HBase table by default cannot have more tasks than the
number of regions in that table.

Also, you want to enable scanner caching. Pass a Scan object to
TableMapReduceUtil.initTableMapperJob that is configured with
scan.setCaching(some_value), where the value is the number of
rows to fetch every time we hit a region server with next(). On rows
of 100-200 bytes, our jobs are usually configured with 1000 up to
10000.

Finally, is your job running in local mode or against a job tracker? Even
though HBase uses HDFS, it usually doesn't know about the job tracker unless
you configure HBase's classpath with Hadoop's conf.
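Put together, the setup described above looks roughly like this sketch. The table name, column handling, and mapper are placeholders, and API details follow the 0.20-era `org.apache.hadoop.hbase.mapreduce` package (`HBaseConfiguration.create()` may be `new HBaseConfiguration()` on the oldest versions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ScanCachingJob {

    // Placeholder mapper; replace with your real per-row logic.
    static class MyMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
            // ... per-row work ...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "scan-mytable");   // "mytable" is a placeholder
        job.setJarByClass(ScanCachingJob.class);

        Scan scan = new Scan();
        scan.setCaching(1000);       // rows fetched per next() RPC to a region server
        scan.setCacheBlocks(false);  // commonly disabled for full-table MR scans

        TableMapReduceUtil.initTableMapperJob(
            "mytable", scan, MyMapper.class,
            Text.class, IntWritable.class, job);
        job.setNumReduceTasks(0);    // map-only job for this sketch

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The caching value trades memory per RPC against round trips: with 100-200 byte rows, 1000-10000 rows per next() keeps the scanner from going back to the region server for every row.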

    J-D

  • Ted Yu at Apr 11, 2010 at 1:31 pm
I noticed that mapreduce.Export.createSubmittableJob() doesn't call setCaching()
in 0.20.3.

Should a call to setCaching() be added?

    Thanks
  • Jean-Daniel Cryans at Apr 11, 2010 at 2:10 pm
Yes, an option could be added, along with a write-buffer option for Import.

    J-D
  • Ted Yu at Apr 11, 2010 at 2:56 pm
    https://issues.apache.org/jira/browse/HBASE-2434 has been logged.
  • Andriy Kolyadenko at Apr 11, 2010 at 9:22 pm
My table is about 8 GB and, according to the logs, has about 70 regions.

I don't understand the logic, but scan.setCaching(1000) increased the number of map tasks from 2 to 65 and improved performance significantly. Thanks for the hint!



  • Jean-Daniel Cryans at Apr 11, 2010 at 10:13 pm
Regarding your higher number of mappers: you probably did something else
that corrected it (for example, you restarted MapReduce and it picked up
a change you had made to the configuration). Scanner caching is totally
unrelated to the number of map tasks.

Glad to hear it's much faster.

    J-D


Discussion Overview
group: user@hbase
categories: hbase, hadoop
posted: Apr 11, 2010 at 2:59a
active: Apr 11, 2010 at 10:13p
posts: 10
users: 4
website: hbase.apache.org
