HDFS File Appending
Hi,

We have a requirement where a huge number of small files must be pushed to
HDFS and then analyzed with Pig.
To get around the classic "small files issue" we merge the files and push a
bigger file into HDFS, but we are losing time in this merging step of our
pipeline.

If we could append directly to an existing file in HDFS, we would save this
"merging files" time.

Can you please suggest whether there is a newer stable version of Hadoop we
can move to for appending?

Thanks and Regards,
Jagaran


  • Xiaobo Gu at Jun 17, 2011 at 1:27 am
    Please refer to FileUtil.copyMerge.
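
    For reference, here is a minimal sketch of that merge approach, assuming the
    small files are staged in a local directory and concatenated into one HDFS
    file. The paths are hypothetical, and the seven-argument copyMerge signature
    is the one from the 0.20-era API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem localFs = FileSystem.getLocal(conf); // source: local staging dir
            FileSystem hdfs = FileSystem.get(conf);         // destination: default HDFS

            // Concatenate every file under the staging directory into one HDFS file,
            // inserting a newline after each source file so records stay line-separated.
            FileUtil.copyMerge(localFs, new Path("/tmp/staging"),        // hypothetical source
                               hdfs, new Path("/data/merged/batch.dat"), // hypothetical target
                               false,                                    // keep source files
                               conf,
                               "\n");
        }
    }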
  • Jagaran das at Jun 17, 2011 at 2:30 am
    Is this with the Hadoop 0.20.203.0 API?

    Does that mean that files in HDFS version 0.20.20 are still immutable, and
    that there is no way to append to an existing file in HDFS?

    We need to resolve this urgently, as we have to set up the pipeline in
    production accordingly.

    Regards,
    Jagaran
  • Xiaobo Gu at Jun 17, 2011 at 3:01 am
    You can merge multiple files into a new one; there is no way to append to
    an existing file.
  • Jagaran das at Jun 17, 2011 at 4:52 am
    Thanks a lot, Xiaobo.

    I have tried the code below with HDFS version 0.20.20 and it worked.
    Is it not stable yet?

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HadoopFileWriter {
        public static void main(String[] args) throws Exception {
            try {
                URI uri = new URI("hdfs://localhost:9000/Users/jagarandas/Work-Assignment/Analytics/analytics-poc/hadoop-0.20.203.0/data/test.dat");
                Path pt = new Path(uri);
                // Relies on fs.default.name in the loaded configuration pointing
                // at hdfs://localhost:9000, so the default filesystem is this HDFS.
                FileSystem fs = FileSystem.get(new Configuration());
                BufferedWriter br;
                if (fs.isFile(pt)) {
                    // File already exists: open it for append and start a new line.
                    br = new BufferedWriter(new OutputStreamWriter(fs.append(pt)));
                    br.newLine();
                } else {
                    // File does not exist yet: create it (overwrite = true).
                    br = new BufferedWriter(new OutputStreamWriter(fs.create(pt, true)));
                }
                String line = args[0];
                System.out.println(line);
                br.write(line);
                br.close();
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("Failed to write to HDFS");
            }
        }
    }

    Thanks a lot for your help.

    Regards,
    Jagaran
  • Jagaran das at Jun 17, 2011 at 6:15 pm
    Please help me with this; I need it very urgently.

    Regards,
    Jagaran
  • Tsz Wo (Nicholas), Sze at Jun 17, 2011 at 6:45 pm
    Hi Jagaran,

    Short answer: the append feature is not in any release. In this sense, it is
    not stable. Below are more details on the append feature status.

    - 0.20.x (includes release 0.20.2)
    There are known bugs in append. The bugs may cause data loss.

    - 0.20-append
    There was an effort to fix the known append bugs, but there are no releases.
    I heard Facebook was using it (with additional patches?) in production, but I
    do not have the details.

    - 0.21
    It has a new append design (HDFS-265). However, the 0.21.0 release is only a
    minor release. It has not undergone testing at scale and should not be
    considered stable or suitable for production. Also, 0.21 development has been
    discontinued. Newly discovered bugs may not be fixed.

    - 0.22, 0.23
    Not yet released.

    Regards,
    Tsz-Wo
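
    Given that status, a defensive pattern is to attempt FileSystem.append() and
    fall back to writing a separate file when the running HDFS build rejects
    appends. The sketch below is editorial and hypothetical (the helper class and
    the ".part" fallback naming are not from this thread):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOrCreate {
        // Returns a stream positioned at the end of the file when append is
        // available, otherwise a stream to a fresh side file.
        public static FSDataOutputStream open(FileSystem fs, Path p) throws IOException {
            if (fs.exists(p)) {
                try {
                    return fs.append(p);   // only works on builds with append enabled
                } catch (IOException appendNotSupported) {
                    // e.g. 0.20.x with append disabled: write a new side file instead.
                    return fs.create(new Path(p.getParent(), p.getName() + ".part"), false);
                }
            }
            return fs.create(p, false);
        }
    }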




  • Jagaran das at Jun 17, 2011 at 7:04 pm
    Thanks a lot, guys.

    Another query for production:

    Is there any way to purge the HDFS job and history logs on a time basis?
    For example, we want to keep only the last 30 days of logs; their size is
    growing a lot in production.

    Thanks again.

    Regards,
    Jagaran
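
    For the retention question, a minimal sketch (editorial, not from the thread)
    is to walk the log directory with the FileSystem API and delete anything whose
    modification time is older than 30 days. The directory path is hypothetical;
    the actual job/history log location depends on your configuration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PurgeOldLogs {
        public static void main(String[] args) throws Exception {
            long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
            FileSystem fs = FileSystem.get(new Configuration());
            Path logDir = new Path("/logs/history");   // hypothetical log directory
            for (FileStatus status : fs.listStatus(logDir)) {
                if (!status.isDir() && status.getModificationTime() < cutoff) {
                    fs.delete(status.getPath(), false);   // non-recursive delete
                }
            }
        }
    }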



  • 潘飞 at Jun 20, 2011 at 8:08 am
    It seems that CDH3 (Hadoop 0.20.2) supports the append feature.


    --
    Stay Hungry. Stay Foolish.
  • Madhu phatak at Jun 21, 2011 at 10:11 am
    HDFS does not support appending, I think. I am not sure about Pig, but if
    you are using Hadoop directly you can zip the files and use the zip as the
    input to the jobs.
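
    A commonly used variant of that idea (editorial note, not suggested in the
    thread) is to pack the small files into a single SequenceFile keyed by file
    name, which Hadoop jobs can read natively. A minimal sketch with hypothetical
    local and HDFS paths:

    import java.io.File;
    import java.io.FileInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path target = new Path("/data/packed/batch.seq");   // hypothetical target

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, target, Text.class, BytesWritable.class);
            try {
                for (File f : new File("/tmp/staging").listFiles()) {   // hypothetical source dir
                    byte[] contents = new byte[(int) f.length()];
                    FileInputStream in = new FileInputStream(f);
                    try {
                        IOUtils.readFully(in, contents, 0, contents.length);
                    } finally {
                        in.close();
                    }
                    // Key = original file name, value = raw file contents.
                    writer.append(new Text(f.getName()), new BytesWritable(contents));
                }
            } finally {
                writer.close();
            }
        }
    }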

Discussion Overview
group: common-user@hadoop.apache.org
categories: hadoop
posted: Jun 17, '11 at 12:34a
active: Jun 21, '11 at 10:11a
posts: 10
users: 5
website: hadoop.apache.org...
irc: #hadoop
