How to merge several SequenceFiles into one?

Hi all,

There are lots of SequenceFiles in HDFS; how can I merge them into one
SequenceFile?

Thanks for your suggestion.

-Lin

  • Jason at May 12, 2011 at 5:21 am
    An M/R job with a single reducer would do the job. This way you can
    use the distributed sort and merge/combine/dedupe key/value pairs as
    you wish.
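    A minimal sketch of the kind of job Jason describes, against the
    newer org.apache.hadoop.mapreduce API; the MergeDriver class name,
    command-line paths, and Text key/value types are assumptions, not
    something from this thread:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

        public class MergeDriver {
          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "merge-sequence-files");
            job.setJarByClass(MergeDriver.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            // The default Mapper and Reducer pass records through unchanged;
            // substitute a custom Reducer here to combine or dedupe per key.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(1); // one reducer -> one merged output file
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

    The output directory will then contain a single part-r-00000
    SequenceFile holding all the sorted records.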
  • 丛林 at May 12, 2011 at 11:16 am
    Dear Jason,

    If the order of the keys in the sequence files is not important to me
    (in other words, the sort is not necessary), how can I skip the
    distributed sort and save those resources?

    Thanks for your suggestion.

    Best Wishes,

    -Lin

  • Christoph Schmitz at May 12, 2011 at 11:45 am
    Hi Lin,

    You could run a map-only job, i.e. read your data and output it from the mapper without any reducer at all (set mapred.reduce.tasks=0 or, equivalently, use job.setNumReduceTasks(0)).

    That way, you parallelize over your inputs through a number of mappers and do not have any sort/shuffle/reduce overhead.
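
    A minimal sketch of the map-only setup described above; note that
    with zero reducers each map task writes its own output file directly
    to HDFS, so this alone does not merge anything, which is the point
    Lin raises below:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        public class MapOnlyJob {
          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "map-only");
            // Equivalent to mapred.reduce.tasks=0: the reduce phase and the
            // sort/shuffle feeding it are skipped; each mapper writes its
            // output straight to HDFS as a separate file.
            job.setNumReduceTasks(0);
            // ... input/output formats and paths configured as usual ...
          }
        }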

    Regards,
    Christoph

  • 丛林 at May 12, 2011 at 12:30 pm
    Hi Christoph,

    If there is no reducer, how can these sequence files be merged?

    Thanks for your advice.

    Best Wishes,

    -Lin

  • Christoph Schmitz at May 12, 2011 at 1:45 pm
    Oops, sorry, I answered in the wrong thread. I intended to reply to the "How to create a SequenceFile faster" issue.

    Regards,
    Christoph

  • Panayotis Antonopoulos at May 25, 2011 at 1:32 am
    I would like to merge some SequenceFiles as well, so any help would be great!

    Although the single-reducer solution works, my files are small, so I don't need the distribution it provides. I think I will write a simple Java program that reads these files and merges them.
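
    A minimal sketch of such a program, assuming the inputs sit in a
    single directory, are all SequenceFiles, and share the same key and
    value classes (the SequenceFileMerger name and argument handling are
    illustrative):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.util.ReflectionUtils;

        public class SequenceFileMerger {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path(args[0]);
            Path output = new Path(args[1]);

            SequenceFile.Writer writer = null;
            for (FileStatus status : fs.listStatus(inputDir)) {
              SequenceFile.Reader reader =
                  new SequenceFile.Reader(fs, status.getPath(), conf);
              Writable key = (Writable)
                  ReflectionUtils.newInstance(reader.getKeyClass(), conf);
              Writable value = (Writable)
                  ReflectionUtils.newInstance(reader.getValueClass(), conf);
              if (writer == null) {
                // Create the output with the key/value types of the first input.
                writer = SequenceFile.createWriter(fs, conf, output,
                    reader.getKeyClass(), reader.getValueClass());
              }
              // Copy every record as-is; no sorting, no dedupe.
              while (reader.next(key, value)) {
                writer.append(key, value);
              }
              reader.close();
            }
            if (writer != null) {
              writer.close();
            }
          }
        }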
  • Niels Basjes at May 25, 2011 at 7:26 pm
    Hi,
    The simplest way to do that is to create a job where:
    - input format = SequenceFile
    - map = identity mapper
    - reduce = identity reducer
    - output format = SequenceFile
    and
    job.setNumReduceTasks(1)
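
    A sketch of that job against the old org.apache.hadoop.mapred API,
    which ships IdentityMapper and IdentityReducer out of the box; the
    IdentityMerge class name, argument paths, and Text key/value types
    are assumptions:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.SequenceFileInputFormat;
        import org.apache.hadoop.mapred.SequenceFileOutputFormat;
        import org.apache.hadoop.mapred.lib.IdentityMapper;
        import org.apache.hadoop.mapred.lib.IdentityReducer;

        public class IdentityMerge {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(IdentityMerge.class);
            conf.setJobName("merge-sequence-files");
            conf.setInputFormat(SequenceFileInputFormat.class);
            conf.setOutputFormat(SequenceFileOutputFormat.class);
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(Text.class);   // assumes Text keys and values
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(1);            // one reducer -> one output file
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
          }
        }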

    However, I think it is a useless thing to do.
    SequenceFiles are really only useful inside a Hadoop cluster, as
    input for later jobs, and having multiple files only helps Hadoop
    scale out.

    So my question to you: Why do you want that?



    --
    Best regards / Met vriendelijke groeten,

    Niels Basjes
