I'm trying out the mapjoin query hint, and it's mostly working
great, but I've run into something odd and was wondering if anyone
could help out:
I have a query which looks like:
INSERT OVERWRITE DIRECTORY 'hive_lookup_new_conversions/
SELECT /*+ MAPJOIN(l)*/ t.portal_id, l.lead_id, t.visit_time
It's consuming data from a lot of smallish partitions -- about
300-400. Because of the mapside join, it's doing the SELECT step
quite fast -- just 300+ separate map tasks, each of which runs very
quickly. The overall amount of data output is tiny.
But it looks like, to then do the INSERT OVERWRITE DIRECTORY (back
into normal HDFS), it's running *another* 300+ map tasks, just to get
the (at this point, tiny) amount of data funneled through a single reducer.
Is there any way to tell Hive to run the reducer as part of the initial
step? E.g. to do the mapside join, but also run a single reducer to
collect the results directly (rather than finishing all the mappers
and then immediately starting up another set)?
Or is there a better way to address this?
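
One setting that may be relevant here, offered as a sketch rather than a confirmed fix: the second job resembles Hive's small-file merge stage, which runs after a map-only job when hive.merge.mapfiles is enabled (it is on by default). Assuming that is what is happening, disabling it for the session should skip the extra job, at the cost of leaving one small output file per mapper in the target directory:

```sql
-- Assumption: the second job is Hive's post-job small-file merge,
-- not part of the map-side join plan itself.
SET hive.merge.mapfiles=false;

-- ...then run the same INSERT OVERWRITE DIRECTORY query unchanged.
```

Running EXPLAIN on the query before and after changing the setting should show whether the additional stage disappears from the plan.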