Hi,

We have a custom setup on Hadoop CDH3u0 that reads *.gz files, but because our version is affected by HADOOP-6835 (concatenated gzip files are not handled by the decompressor), we are not able to read them. We have plans to upgrade, but not right now.

Since Flume writes concatenated gz files (each sync() finishes a gzip member and the next append() starts a new one via resetState()), is it possible to change Flume to write gz files that are not concatenated? We have a bucket of 5 minutes. I have changed HDFSCompressedDataStream.java as follows, to buffer the data for the full 5 minutes and write it to the gz file only once. Is it okay to do this?
@Override
public void append(Event e, FlumeFormatter fmt) throws IOException {
  if (isFinished) {
    cmpOut.resetState();
    isFinished = false;
  }
  if (result == null) {
    result = fmt.getBytes(e);
    System.out.println("1. sundi flume debug writing to sink");
  } else {
    System.out.println("2. sundi flume debug writing to sink");
    byte[] bValue = fmt.getBytes(e);
    byte[] addToResult = new byte[result.length + bValue.length];
    System.arraycopy(result, 0, addToResult, 0, result.length);
    System.arraycopy(bValue, 0, addToResult, result.length, bValue.length);
    result = addToResult;
  }
  // Original per-event compressed write, now disabled:
  // byte[] bValue = fmt.getBytes(e);
  // cmpOut.write(bValue);
}
@Override
public void sync() throws IOException {
  // We must use finish() and resetState() here -- flush() is apparently not
  // supported by the compressed output streams (it's a no-op).
  // Also, since resetState() writes headers, avoid calling it without an
  // additional write/append operation.
  // Note: There are bugs in Hadoop & JDK w/ pure-java gzip; see HADOOP-8522.
  if (!isFinished && result != null) {  // guard: skip if nothing buffered yet
    System.out.println("3. sundi flume debug writing to sink");
    cmpOut.write(result);  // write the whole 5-minute bucket in one shot
    result = null;
    cmpOut.finish();
    isFinished = true;
  }
  fsOut.flush();
  fsOut.sync();
}
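
I was also considering a ByteArrayOutputStream instead of growing the byte array with System.arraycopy on every event, since the copy cost grows quadratically with the number of events per bucket. Something along these lines, with the same cmpOut/fsOut/isFinished fields and FlumeFormatter API as above (untested sketch; the "buffer" field name is just my placeholder):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Sketch only: accumulate events in memory, compress once per bucket.
private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

@Override
public void append(Event e, FlumeFormatter fmt) throws IOException {
  if (isFinished) {
    cmpOut.resetState();
    isFinished = false;
  }
  buffer.write(fmt.getBytes(e));  // amortized growth, no per-event full copy
}

@Override
public void sync() throws IOException {
  if (!isFinished && buffer.size() > 0) {
    cmpOut.write(buffer.toByteArray());  // single write -> single gzip member
    buffer.reset();
    cmpOut.finish();
    isFinished = true;
  }
  fsOut.flush();
  fsOut.sync();
}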