FAQ
I'm trying to create a local OS file from a Hive query:

INSERT OVERWRITE LOCAL DIRECTORY '../../dwh_out/click_term_20091219.dat'
SELECT a.date_key, a.deal_id, a.is_roi, a.search_query,
a.traffic_source, a.country_id FROM str_click_term_final a

I expected to get a file called "click_term_20091219.dat" in the directory
"../../dwh_out"; instead I got a directory named
"../../dwh_out/click_term_20091219.dat" containing multiple files like
"attempt_200912182309_0102_m_000036_0*",
"attempt_200912182309_0102_m_000037_0*", etc.

Any idea how I can get a single file? (I know I can "cat" the files at the OS
level, but I'm looking for a Hive solution.)



Thanks.


Lee Sagi | Data Warehouse Tech Lead & Architect


  • David Lerman at Dec 22, 2009 at 1:18 am
    The nature of Hadoop is that it runs tasks in parallel, so any Hadoop job
    will produce one output file per reducer (or per mapper in your case, since
    your query doesn't do any grouping or joining and therefore doesn't use any
    reducers). In general, there are a couple of options for merging the output
    of a Hadoop job (sketched in the example after this list):

    * You can ask Hadoop to merge the output after the fact with "hadoop fs
    -getmerge <src dir> <local file>" (which essentially does the cat for you).

    * You can force a single reducer with "set mapred.reduce.tasks=1;". This
    will likely slow down the query, since the reduce stage will run on only
    one node, but it will result in one output file.

    * You can add an extra map-reduce job (or just a reduce step, if your job
    is map-only) to the end of the pipeline which just merges the results.
    Check out the hive.merge.mapfiles and hive.merge.mapredfiles options in
    hive/conf/hive-default.xml, which tell Hive to do this for you (these were
    added relatively recently, so make sure you're using a recent build).
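
    A rough Hive CLI sketch of those three options (the HDFS path and local
    file name below are placeholders, and the hive.merge.* settings depend on
    your build, so treat this as illustrative rather than exact):

    -- Option 1: write to an HDFS directory, then merge the per-task files
    -- into a single local file with getmerge.
    INSERT OVERWRITE DIRECTORY '/tmp/click_term_out'
    SELECT a.date_key, a.deal_id, a.is_roi, a.search_query,
    a.traffic_source, a.country_id FROM str_click_term_final a;
    dfs -getmerge /tmp/click_term_out /tmp/click_term_20091219.dat;

    -- Option 2: force a single reducer (only helps if the query
    -- actually has a reduce stage).
    set mapred.reduce.tasks=1;

    -- Option 3: ask Hive to add a merge step itself (recent builds only).
    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;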

    In your particular case, the query you're running doesn't use any reducers,
    which complicates things a bit. You could add a GROUP BY clause to the
    statement to force Hive to use a reducer and then set mapred.reduce.tasks=1,
    which would merge the output (see the sketch below). You should also be able
    to do the same thing by setting hive.merge.mapfiles to true, which should add
    a reduce step that just merges the output - but for some reason that wasn't
    working on my build.
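
    For this query, the single-reducer workaround might look roughly like the
    following. Note that grouping by all of the selected columns is just a way
    to force a reduce stage; it also deduplicates rows, so only use it if that
    is acceptable. The target is still a directory - it will just contain a
    single data file.

    set mapred.reduce.tasks=1;
    INSERT OVERWRITE LOCAL DIRECTORY '../../dwh_out/click_term_20091219.dat'
    SELECT a.date_key, a.deal_id, a.is_roi, a.search_query,
    a.traffic_source, a.country_id
    FROM str_click_term_final a
    GROUP BY a.date_key, a.deal_id, a.is_roi, a.search_query,
    a.traffic_source, a.country_id;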
