Hello,

I'm trying out the nice mapjoin query hint, and it's basically working
great, but I've run into an odd thing, and was wondering if anyone
could help out:

I have a query which looks like:

INSERT OVERWRITE DIRECTORY 'hive_lookup_new_conversions/2009/11/06/15/45'
SELECT /*+ MAPJOIN(l)*/ t.portal_id, l.lead_id, t.visit_time
...

It's consuming data from a lot of smallish partitions -- about
300-400. Because of the mapside join, it's doing the SELECT step
quite fast -- just 300+ separate map tasks, each of which runs very
quickly. The overall amount of data output is tiny.

But it looks like then, to do the INSERT OVERWRITE DIRECTORY (back
into normal HDFS), it's running *another* 300+ map tasks to get the
(at this point tiny) amount of data funneled through a single reducer.

Is there any way to tell hive to just do the reducer from the initial
step? E.g. to do the mapside join, but also run a single reducer to
collect the results directly (rather than stopping all the mappers,
and starting up another set immediately)?

Or is there a better way to address this?

Thanks,
-Dan Milstein


  • Namit Jain at Nov 6, 2009 at 9:39 pm
    Most probably, it is trying to reduce the number of output files.
    Can you run an EXPLAIN on the query to see the plan?

    You can turn off compaction by setting:

    hive.merge.mapfiles

    to false
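
    For illustration, a minimal sketch of how this could look in a Hive
    session. The property name comes from the reply above; the comments
    are assumptions about typical behavior, not something confirmed in
    this thread. With merging turned off, the output directory will most
    likely hold one small file per map task rather than a single file.

    ```sql
    -- Print the current value of the property for this session.
    SET hive.merge.mapfiles;

    -- Turn off the merge of small files produced by map-only jobs.
    -- This applies only to the current session.
    SET hive.merge.mapfiles=false;

    -- To inspect the plan, prefix the original statement with EXPLAIN,
    -- i.e. EXPLAIN INSERT OVERWRITE DIRECTORY ... SELECT ...
    -- (the query body is elided in the original post, so it is not
    -- reproduced here).
    ```

    Running the EXPLAIN once before and once after changing the setting
    should show whether the extra stage disappears from the plan.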


    Thanks,
    -namit

    -----Original Message-----
    From: Dan Milstein
    Sent: Friday, November 06, 2009 1:34 PM
    To: hive-user@hadoop.apache.org
    Subject: mapjoin / insert overwrite directory question


Discussion Overview
group: user @
categories: hive, hadoop
posted: Nov 6, '09 at 9:34p
active: Nov 6, '09 at 9:39p
posts: 2
users: 2
website: hive.apache.org
