HDFS efficiently concat&split files
Hi all,

Does anyone know how to efficiently concatenate 2 different files in
HDFS, as well as split a file into 2 different ones?
I did this by reading from one file and writing to another. Of course,
this is very slow; a lot of I/O time is spent. Since this is only a
splitting or joining job, I am wondering if I can do it faster.

Also, what can I do in order to control a reducer's output file size?
This could be a solution to the previous question. If I were able to do
this, further concatenations and splits would not be necessary.

Thank you for your help.
Best,
Teodor
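
For reference, the read-then-write approach described above looks roughly
like this (a minimal sketch; the class name and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class NaiveConcat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder paths -- substitute your own files.
        Path[] sources = { new Path("/data/part-a"), new Path("/data/part-b") };
        FSDataOutputStream out = fs.create(new Path("/data/merged"));

        // Every byte streams through the client: read from HDFS, write
        // back to HDFS. This is the slow approach described above.
        for (Path src : sources) {
            FSDataInputStream in = fs.open(src);
            IOUtils.copyBytes(in, out, conf, false); // false: keep 'out' open
            in.close();
        }
        out.close();
    }
}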


  • Harsh J at Aug 19, 2010 at 4:57 pm
    Hello,

    Why are you looking to concatenate or split files on HDFS? I'm just
    curious, because using directories as inputs and outputs works fine with
    Hadoop MR and HDFS, as the latter uses a block storage concept at its
    core.


    --
    Harsh J
    www.harshj.com
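
    To make the directory point concrete: a job accepts a whole directory as
    input, e.g. a previous job's output with all its part-* files, so no
    manual concatenation is needed. A minimal driver sketch (class and path
    names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DirInputDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "second-pass");
            job.setJarByClass(DirInputDriver.class);
            // Identity map/reduce by default; set your own classes here.

            // A directory works as input: every part-* file the previous
            // job wrote under it is read.
            FileInputFormat.addInputPath(job, new Path("/output-of-job1"));
            FileOutputFormat.setOutputPath(job, new Path("/output-of-job2"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }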
  • Teodor Macicas at Aug 19, 2010 at 5:12 pm
    Hello,

    I was expecting this question.
    The reason is that I want to run 2 MR jobs on the same data in the
    following manner: the output of the 1st job is collected, and then I want
    to create bins of a certain number of bytes which will be the input for
    the next job. At the end I want to isolate each processed bin's results
    [the 2nd reducers' outputs].

    Anyway, I do have a reason for wanting to do this.
    Any ideas?

    Thank you.
    -Tedy
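
    A hedged note on sizing such bins: with the default output format, each
    reducer writes exactly one output file, so if the total output size can
    be estimated, the number of reduce tasks controls the approximate size
    of each file. A sketch (both byte figures below are assumptions to
    replace with your own estimates):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BinSizedJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "binned-output");

            long estimatedOutputBytes = 10L * 1024 * 1024 * 1024; // assume ~10 GB
            long targetBinBytes = 256L * 1024 * 1024;             // want ~256 MB bins

            // One output file per reducer, so the reducer count fixes the
            // approximate file size (assuming keys distribute evenly).
            int reducers = (int) Math.max(1, estimatedOutputBytes / targetBinBytes);
            job.setNumReduceTasks(reducers);
        }
    }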
  • Xiujin yang at Aug 20, 2010 at 8:45 am
    Hi

    With MapReduce it is easy to make the first job's output the second
    job's input. You just need to point the second job at that output path
    and it will work.
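
    In driver code, that chaining is just one shared Path object (a sketch;
    mapper/reducer setup is omitted and the paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path intermediate = new Path("/tmp/job1-output"); // shared path

            Job first = new Job(conf, "first");
            FileInputFormat.addInputPath(first, new Path("/input"));
            FileOutputFormat.setOutputPath(first, intermediate);
            if (!first.waitForCompletion(true)) System.exit(1);

            // The second job reads the first job's output directory as-is.
            Job second = new Job(conf, "second");
            FileInputFormat.addInputPath(second, intermediate);
            FileOutputFormat.setOutputPath(second, new Path("/final-output"));
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }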

    Xiujinyang
  • Teodor Macicas at Aug 20, 2010 at 10:35 am
    Hi,

    Basically, you are right. But in my case the second input is a
    combination of the previous outputs.
    As I already said, I want only a certain number of bytes to be the
    second input. Hence, I need to split some files.

    Also, I need to concatenate the final reducers' outputs. Does anyone
    know how to concatenate 2 files faster in HDFS?

    Thank you.
    Best,
    Teodor
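
    For the byte-exact split, one client-side option is to seek/copy: write
    the first N bytes to one file and the remainder to another. Note that a
    raw byte offset can land in the middle of a record, so record boundaries
    are your own responsibility. A minimal sketch (class name, paths, and
    the cut point are placeholder assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitByOffset {
        // Copy 'length' bytes (or everything to EOF when length < 0).
        static void copy(FSDataInputStream in, FSDataOutputStream out,
                         long length) throws Exception {
            byte[] buf = new byte[64 * 1024];
            long remaining = (length < 0) ? Long.MAX_VALUE : length;
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) break; // end of file
                out.write(buf, 0, n);
                remaining -= n;
            }
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path src = new Path("/data/big-file"); // placeholder path
            long cut = 512L * 1024 * 1024;         // split point: 512 MB

            FSDataInputStream in = fs.open(src);
            FSDataOutputStream first = fs.create(new Path("/data/half-1"));
            copy(in, first, cut);
            first.close();

            // The stream is now positioned at 'cut'; the rest goes here.
            FSDataOutputStream second = fs.create(new Path("/data/half-2"));
            copy(in, second, -1);
            second.close();
            in.close();
        }
    }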
  • Teodor Macicas at Aug 20, 2010 at 8:46 pm
    Hi again,

    Please, can anyone suggest how to join (concatenate) 2 HDFS files into
    one bigger file with less I/O time?

    Thank you.
    Regards,
    Teodor
  • Harsh J at Aug 20, 2010 at 9:03 pm
    Appending one file to the other on HDFS itself would be the optimal
    way to do this, but you can't use the append support in
    apache-hadoop-hdfs 0.20.x and earlier releases. It's only available in
    trunk / 0.21.

    A similar question asked previously got some replies as well; you
    might want to check it out:
    http://www.mentby.com/Group/hadoop-core-user/concatenating-files-on-hdfs.html
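
    On releases where append is available (trunk / 0.21, as noted above),
    only the second file's bytes have to be moved; the first file grows in
    place instead of being rewritten. A minimal sketch, assuming append
    support is enabled on the cluster (class name and paths are
    placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class AppendConcat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path a = new Path("/data/part-a"); // grows in place
            Path b = new Path("/data/part-b"); // appended onto 'a'

            // append() throws on releases without append support.
            FSDataOutputStream out = fs.append(a);
            FSDataInputStream in = fs.open(b);
            IOUtils.copyBytes(in, out, conf, true); // true: close both streams

            fs.delete(b, false); // drop the now-duplicated second file
        }
    }

    The second file's data still streams through the client, but the first
    file is never copied.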



    --
    Harsh J
    www.harshj.com

Discussion Overview
group: common-user@hadoop.apache.org
categories: hadoop
posted: Aug 19, 2010 at 10:59 AM
active: Aug 20, 2010 at 9:03 PM
posts: 7
users: 3
website: hadoop.apache.org
irc: #hadoop
