Hi,
We have a custom setup on Hadoop CDH3u0 that reads from *.gz files, but because of HADOOP-6835 (support for concatenated gzip files) we are not able to read them. We have plans to upgrade, but not yet.
Since Flume writes concatenated gz files, is it possible to change Flume to write gz files that are not concatenated? We have a 5-minute bucket. I have changed HDFSCompressedDataStream.java as follows, to buffer the data for the full 5 minutes and write it to the gz file only once. Is it okay to do this?

@Override
public void append(Event e, FlumeFormatter fmt) throws IOException {
  if (isFinished) {
    cmpOut.resetState();
    isFinished = false;
  }
  if (result == null) {
    result = fmt.getBytes(e);
    System.out.println("1. sundi flume debug writing to sink");
  } else {
    System.out.println("2. sundi flume debug writing to sink");
    byte[] bValue = fmt.getBytes(e);
    byte[] addToResult = new byte[result.length + bValue.length];
    System.arraycopy(result, 0, addToResult, 0, result.length);
    System.arraycopy(bValue, 0, addToResult, result.length, bValue.length);
    result = addToResult;
  }
  // byte[] bValue = fmt.getBytes(e);
  // cmpOut.write(bValue);
}

@Override
public void sync() throws IOException {
  // We must use finish() and resetState() here -- flush() is apparently not
  // supported by the compressed output streams (it's a no-op).
  // Also, since resetState() writes headers, avoid calling it without an
  // additional write/append operation.
  // Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
  if (!isFinished) {
    System.out.println("3. sundi flume debug writing to sink");
    cmpOut.write(result);
    result = null;
    cmpOut.finish();
    isFinished = true;
  }
  fsOut.flush();
  fsOut.sync();
}
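For context, the "concatenated" files Flume produces are ordinary multi-member gzip streams: each sync() writes a complete gzip member (header, deflate data, trailer) back to back. A reader that understands multiple members recovers everything; the pre-HADOOP-6835 Hadoop codec stops after the first member. Here is a minimal sketch of that round trip using only java.util.zip (the class name ConcatGzDemo is mine, and it assumes a reasonably modern JDK whose GZIPInputStream reads all members):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatGzDemo {
    // Build a "concatenated" gz stream: two complete gzip members back to
    // back, which is roughly what finish()/resetState() per sync() produces.
    static byte[] twoMembers() throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        for (String chunk : new String[] {"first sync\n", "second sync\n"}) {
            GZIPOutputStream gz = new GZIPOutputStream(file);
            gz.write(chunk.getBytes(StandardCharsets.UTF_8));
            gz.finish(); // ends this member; 'file' stays open for the next
        }
        return file.toByteArray();
    }

    // A multi-member-aware reader returns the contents of every member,
    // not just the first.
    static String decompressAll(byte[] data) throws IOException {
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) > 0; ) {
            out.write(buf, 0, n);
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        System.out.print(decompressAll(twoMembers()));
    }
}
```

On a JDK whose GZIPInputStream handles multi-member streams, this prints both chunks; a reader affected by the HADOOP-6835 limitation would see only "first sync". That is why upgrading the reader, rather than making the writer buffer whole buckets in memory, is the usual fix.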

  • Mike Percy at Mar 27, 2013 at 5:04 pm
    Trying to write non-concatenated files violates Flume's durability
    guarantees. If Flume crashes you will very likely lose data since there
    will be an arbitrary amount of data in the compressor buffer. I strongly
    encourage you not to do this.

    Regards,
    Mike

  • Sundi at Mar 27, 2013 at 5:13 pm
    Thanks a lot Mike.

Discussion Overview
group: cdh-user
categories: hadoop
posted: Mar 27, '13 at 4:48p
active: Mar 27, '13 at 5:13p
posts: 3
users: 2 (Sundi: 2 posts, Mike Percy: 1 post)
website: cloudera.com
irc: #hadoop
