Grokbase Groups: Hive user, May 2011

Hi Guys

I use Flume to store log files, and Hive to query them.

Flume always stores small files with the suffix .seq, and I now have over 35
thousand seq files. Every time I launch my query script, 35 thousand map
tasks are created, and it takes a very long time for them to complete.

I also tried setting CombineHiveInputFormat, but with that option the task
seems to run slowly, because the total size of the data folder is over 700 MB
and my testing environment has only 3 data nodes. I also tried adding
mapred.map.tasks=5 after the CombineHiveInputFormat setting, but it doesn't
seem to work: there is always only one map task once CombineHiveInputFormat
is set.
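
For reference, a minimal sketch of the settings described above, as they
would be entered in the Hive CLI (the class name is Hive's standard combine
input format; the map count is just the value tried in this thread):

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- a hint only: once splits are combined, the split computation,
    -- not this value, decides how many map tasks actually run
    set mapred.map.tasks=5;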

Can you please show me a solution that lets me set the number of map tasks
freely?

BTW: the Hadoop version is 0.20 and Hive is 0.5.

Richard


  • Junxian Yan at Jun 1, 2011 at 7:38 am
    Today I tried CombineHiveInputFormat and set the max split size for the
    Hadoop input, and it seems I can get the expected number of map tasks. But
    another problem is that the map tasks consume the CPU heavily, almost 100%.

    I just ran a query with a simple WHERE condition over test files whose
    total size is about 30 MB, spread across about 10 thousand small files. The
    execution time was over 700 s, which is killing us. Because the files are
    generated by Flume, they are all seq files.
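
    A sketch of the combination described above, assuming the standard Hadoop
    split-size property that CombineHiveInputFormat honors (the 64 MB cap is
    only an illustration):

        set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
        -- cap each combined split at ~64 MB; ~700 MB of input then yields
        -- roughly 11 map tasks instead of one per small file
        set mapred.max.split.size=67108864;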


    R
  • Igor Tatarinov at Jun 1, 2011 at 5:12 pm
    Can you pre-aggregate your historical data to reduce the number of files?

    We used to partition our data by date, but that created too many output
    files, so now we partition by month.
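
    A sketch of that layout, with hypothetical table and column names:

        -- hypothetical tables: one partition per month instead of per day
        CREATE TABLE logs_by_month (ts STRING, line STRING)
        PARTITIONED BY (month STRING)
        STORED AS SEQUENCEFILE;

        -- rewriting a month of data into a single partition also collapses
        -- thousands of small daily files into a few large ones
        INSERT OVERWRITE TABLE logs_by_month PARTITION (month='2011-05')
        SELECT ts, line FROM logs_by_day
        WHERE ts LIKE '2011-05%';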

    I do find it odd that Hive (0.6) can't merge compressed output files. We
    could have gotten away with daily partitioning if Hive could merge small
    files. I tried disabling compression, but it actually caused some execution
    problems (perhaps xcievers-related, I am not sure).
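
    For reference, the merge knobs in question (these options exist in Hive of
    this era, though, as noted above, they did not cover compressed output):

        set hive.merge.mapfiles=true;       -- merge small files left by map-only jobs
        set hive.merge.mapredfiles=true;    -- merge small files left by map-reduce jobs
        set hive.merge.size.per.task=256000000;  -- rough target merged file size, in bytes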
  • Edward Capriolo at Jun 1, 2011 at 7:38 pm

    We have open-sourced our file crusher/optimizer; your post reminded me to
    throw our new V2 version over the open-source fence.

    http://www.jointhegrid.com/hadoop_filecrush/index.jsp

    I know many are looking for an in-Hive solution, but the file crusher does
    the job for us.

    Edward
