Grokbase Groups Hive user June 2010
Subject: merging the size of the reduce output
Hi,
I'm running the latest version of trunk (r953172). I'm doing a
dynamic partition insert overwrite query which generates a lot of small
files in each of the partitions. I was hoping this could be solved by
setting hive.merge.mapredfiles to true. However, it seems that whenever the
job is submitted the property is always set to false, so it doesn't have any
effect. I also tried modifying this property in hive-default.xml, but
that didn't work either.

Thanks,
Sammy

  • Ted Yu at Jun 13, 2010 at 3:41 pm
    Looking at
    ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java,
    hive.merge.mapredfiles takes effect if your job has a reducer.
    Otherwise you should set hive.merge.mapfiles to true.
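
    A minimal sketch of the two settings mentioned above, as they might be
    issued in a Hive session before the insert statement. Which one applies
    depends on whether the query plan has a reduce stage; setting both is an
    assumption made here for illustration, not something the source file is
    known to require.

    ```sql
    -- Merge small output files produced by map-only jobs.
    set hive.merge.mapfiles=true;
    -- Merge small output files produced by jobs that have a reduce stage.
    set hive.merge.mapredfiles=true;
    ```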

  • Sammy Yu at Jun 14, 2010 at 5:42 am
    Hi,
    I have both hive.merge.mapfiles and hive.merge.mapredfiles set to true
    via the shell tool and the hive-default.xml configuration file. However, it
    appears that the job configuration is somehow changed before the job is
    submitted. Is there another condition that can cause this to happen?

    Thanks,
    Sammy

  • Yongqiang He at Jun 14, 2010 at 5:56 am
    I think there is another parameter, "hive.merge.smallfiles.avgsize", that
    determines whether to run the merge job based on the average size of the
    output files. The default for that parameter is 16M, so if the average
    output size is larger than 16M, the merge will not happen.
    Maybe you can try increasing that value to see whether it helps.

    Thanks
    Yongqiang
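
    A sketch of raising that threshold, assuming the value is given in bytes
    (so the 16M default would correspond to 16000000). The 256000000 figure is
    only an illustrative choice, not a recommendation.

    ```sql
    -- Illustration: 100 output files totalling 800 MB average 8 MB each,
    -- which is below the 16M default, so a merge would be considered.
    -- Raising the threshold makes larger average sizes qualify as well.
    set hive.merge.smallfiles.avgsize=256000000;
    ```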

  • Viraj Bhat at Jul 2, 2010 at 6:31 am
    Hi Yongqiang,

    I am facing a similar situation. I am using the latest trunk of Hive with
    dynamic partitioning, and it is a map-only job that converts files from
    gzip-compressed text to RC format.

    The query for the task looks similar to:

    FROM gztable
    INSERT OVERWRITE TABLE rctable
    ...
    PARTITION(datestamp, partitionlevel1, partitionlevel1)
    SELECT ...
    ..
    set hive.merge.mapredfiles=true;
    set hive.merge.mapfiles=true;
    set hive.merge.smallfiles.avgsize=256000000;
    set hive.merge.size.smallfiles.avgsize=256000000;

    When I run a job, I see that the following are set to false in the
    job.xml when the job starts up:

    hive.merge.mapfiles = false;
    hive.merge.mapredfiles = false;

    Is this a bug with dynamic partitioning? Is there something else I need
    to set to get this to work and remove the small files I might be generating?

    Viraj
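
    One way to check what the Hive session itself holds before the job is
    launched is to print the properties in the CLI. This is only a sketch: it
    shows the session values and does not by itself explain why job.xml
    reports something different.

    ```sql
    -- In the Hive CLI, "set <name>;" with no value prints the current setting.
    set hive.merge.mapfiles;
    set hive.merge.mapredfiles;
    set hive.merge.smallfiles.avgsize;
    ```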



  • Viraj Bhat at Jul 2, 2010 at 6:41 am
    Okay, I read that dealing with small files when doing dynamic partitioning
    is a work in progress: https://issues.apache.org/jira/browse/HIVE-1307

    There was a suggestion to try

    hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

    for Hadoop 0.20 when running queries on this partition.

    Viraj
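
    A sketch of how that suggestion might be applied in a session. The table
    name rctable comes from the earlier post; the query itself is hypothetical.
    Note that this only combines small files into fewer input splits at read
    time; it does not reduce the number of files stored in the partition.

    ```sql
    -- Combine many small input files into fewer splits when reading.
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- Hypothetical query over the table written by the dynamic partition insert.
    SELECT COUNT(1) FROM rctable;
    ```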



  • John Sichi at Jul 2, 2010 at 6:59 am
    Ning is currently out on vacation; I think he'll be back to working on this when he returns.

    JVS

  • Viraj Bhat at Jul 2, 2010 at 11:46 pm
    Hi John,
    Thanks again for letting me know. This can be overcome, though, by using
    the CombineInputFormat; unfortunately, I am not using that branch ;)
    Also, a large number of small files in some partitions causes poor
    utilization of the NameNode.
    Please let me know if you need help with the patch.
    Thanks
    Viraj

  • John Sichi at Jul 7, 2010 at 1:28 am
    I'm sure Ning will appreciate any help you can give, so if you make progress, feel free to upload an updated patch.

    JVS

  • Yongqiang He at Jul 7, 2010 at 1:52 am
    Hi Viraj,

    Ning assigned this JIRA to me offline to finish. Unfortunately, I don't
    think I can pick up this JIRA in the coming 2-3 weeks, so it would be
    awesome if you could help finish this one.

    Thanks
    Yongqiang

Discussion Overview
group: user @
categories: hive, hadoop
posted: Jun 13, '10 at 6:23a
active: Jul 7, '10 at 1:52a
posts: 10
users: 5
website: hive.apache.org
