Hello,

The original map-reduce paper states: "After successful completion, the
output of the map-reduce execution is available in the R output files (one
per reduce task, with file names as specified by the user)." However, when
using Hadoop's TextOutputFormat, all the reducer outputs are combined in a
single file called part-00000. I was wondering how and when this merging
process is done. When the reducer calls output.collect(key,value), is this
record written to a local temporary output file in the reducer's disk and
then these local files (a total of R) are later merged into one single file
with a final thread or is it directly written to the final output file
(part-00000)? I am asking this because I'd like to get an ordered sample of
the final output data, i.e., one record out of every 1000 or something
similar and I don't want to run a serial process that iterates on the final
output file.

Thanks,
Jim

  • Tienduc_dinh at Jan 11, 2009 at 1:24 pm
    A single part-00000 file means that only one reduce task is configured
    for your job.

    Hope this helps.

    Tien Duc Dinh
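
To illustrate the point above, here is a minimal sketch in plain Python (not Hadoop code) of how the number of reduce tasks maps to output file names. The helper name `part_file_names` is hypothetical, and the zero-padded five-digit `part-NNNNN` pattern is assumed from the `part-00000` name mentioned in the question.

```python
def part_file_names(num_reduce_tasks):
    """Return one output file name per reduce task (hypothetical helper)."""
    # Assumed naming pattern: "part-" followed by the reduce index, zero-padded to 5 digits.
    return ["part-%05d" % i for i in range(num_reduce_tasks)]


print(part_file_names(1))  # a job with one reduce task leaves a single file
print(part_file_names(3))  # a job with three reduce tasks leaves three files
```

So seeing only part-00000 suggests the job ran with a single reduce task, not that several reducer outputs were merged.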


  • Stefan Will at Jan 11, 2009 at 8:59 pm
    Jim,

    As far as I know, the number of output partitions does not depend on
    the OutputFormat used.

    If you want to sample your output file, I'd suggest you write a new MR job
    that uses a random number generator to sample your output files, and outputs
    text key/value pairs in the mapper, and uses exactly one reducer with the
    TextOutputFormat. You don't even need to supply a reducer class if your
    mapper outputs Text/Text key/value pairs.

    -- Stefan
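
The sampling job Stefan describes could look roughly like the following. This is a plain-Python simulation rather than a real MapReduce job: the function names, the three in-memory "part files", the seed, and the 1-in-1000 rate are all assumptions made for illustration.

```python
import random


def sample_mapper(records, rate, rng):
    """Map phase: emit each (key, value) record with probability `rate`."""
    for key, value in records:
        if rng.random() < rate:
            yield key, value


def single_reducer(sampled):
    """With exactly one reduce task, all sampled records end up in one sorted output."""
    return sorted(sampled)


# Three hypothetical reducer output files left behind by the original job.
part_files = [
    [("key%06d" % k, "v%d" % k) for k in range(p, 30000, 3)]
    for p in range(3)
]

rng = random.Random(42)  # fixed seed only so that the sketch is repeatable
sampled = []
for part in part_files:
    sampled.extend(sample_mapper(part, 0.001, rng))

sample = single_reducer(sampled)
print(len(sample), sample[:3])
```

Random sampling gives roughly, not exactly, one record per 1000. The map side runs in parallel over all the part files, and only the small sample passes through the single reducer.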

    From: Jim Twensky <jim.twensky@gmail.com>
    Reply-To: <core-user@hadoop.apache.org>
    Date: Sun, 11 Jan 2009 01:55:35 -0600
    To: <core-user@hadoop.apache.org>
    Subject: Merging reducer outputs into a single part-00000 file

  • Rasit OZDAS at Jan 14, 2009 at 8:46 am
    Jim,

    As far as I know, no operation is performed after the Reducer.
    At first glance, it looks as if all the tasks are seeing the same key.
    That can be the result of one of the following cases:
    - the input format reads the same key for every task.
    - the mapper collects every incoming key-value pair under the same key.
    - the reducer does the same.

    But if you are a little experienced, you already know this.
    An ordered list means one final file, or am I missing something?

    Hope this helps,
    Rasit


    --
    M. Raşit ÖZDAŞ
  • Owen O'Malley at Jan 14, 2009 at 5:24 pm

    On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote:

    > Jim,
    >
    > As far as I know, there is no operation done after Reducer.

    Correct, other than output promotion, which moves the output file to
    the final filename.

    > But if you are a little experienced, you already know these.
    > Ordered list means one final file, or am I missing something?

    There is no value and a lot of cost associated with creating a single
    file for the output. The question is how you want the keys divided
    between the reduces (and therefore output files). The default
    partitioner hashes the key and mods by the number of reduces, which
    "stripes" the keys across the output files. You can use the
    mapred.lib.InputSampler to generate good partition keys and
    mapred.lib.TotalOrderPartitioner to get completely sorted output based
    on the partition keys. With the total order partitioner, each reduce
    gets an increasing range of keys and thus has all of the nice
    properties of a single file without the costs.

    -- Owen
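
The difference between the two partitioning schemes Owen describes can be sketched in plain Python. This is a simulation under stated assumptions, not the Hadoop API: the hash mimics Java's `String.hashCode()` (real Hadoop key types such as Text hash differently), `pick_split_points` is a hypothetical stand-in for what a sampler would produce, and `range_partition` only imitates the idea behind a total order partitioner.

```python
import bisect


def java_string_hash(s):
    """Java's String.hashCode() as a signed 32-bit integer (exact for BMP characters)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_reduces):
    """Default-style partitioning: hash the key and mod by the number of reduces."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduces


def pick_split_points(sample_keys, num_reduces):
    """Choose num_reduces - 1 split keys at even intervals of a sorted sample."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_reduces
    return [ordered[int(step * i)] for i in range(1, num_reduces)]


def range_partition(key, split_points):
    """Send a key to the reduce whose key range contains it."""
    return bisect.bisect_right(split_points, key)


def run_reduces(keys, num_reduces, partition):
    """Group keys by reduce and sort within each reduce, as the shuffle would."""
    outputs = [[] for _ in range(num_reduces)]
    for key in keys:
        outputs[partition(key)].append(key)
    return [sorted(out) for out in outputs]


def concat(parts):
    """Read the output files back to back, in file-name order."""
    return [k for part in parts for k in part]


keys = ["k%04d" % i for i in range(0, 1000, 7)]
R = 4
hashed = run_reduces(keys, R, lambda k: hash_partition(k, R))
splits = pick_split_points(keys[::5], R)  # sample every fifth key
ranged = run_reduces(keys, R, lambda k: range_partition(k, splits))

print(concat(hashed) == sorted(keys))  # False: keys are striped across the files
print(concat(ranged) == sorted(keys))  # True: each file holds an increasing key range
```

With range partitioning, reading the output files in order yields the same sequence as a single sorted file, while the work is still spread across all the reduces.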
  • Jim Twensky at Jan 15, 2009 at 12:34 am
    Owen and Rasit,

    Thank you for the responses. I figured out that mapred.reduce.tasks was
    set to 1 in my hadoop-default.xml and I hadn't overridden it in my
    hadoop-site.xml configuration file.

    Jim
