Hi Guys,

I have 2 TB of data to process for my MSc work, but I share resources with
other students and don't have that much space.
So I gzipped my files into splits with sizes similar to the HDFS block size
(to benefit from multiple maps).

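For reference, here is a sketch of how such block-sized gzip splits can be
produced before uploading; the 64 MB block size, file names, and target path
are illustrative assumptions, not details from this thread:

# Cut the raw log into ~64 MB pieces (assumed HDFS block size), then gzip
# each piece. gzip is not splittable, so each .gz file is consumed by
# exactly one mapper; keeping each file near one block preserves locality
# while still yielding one map per file.
split -b 64m big.log big.log.part-
for p in big.log.part-*; do gzip "$p"; done
hadoop fs -put big.log.part-*.gz /user/cdh-hadoop/mscdata/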
The problem is that I'm getting a lot of these errors:

java.io.IOException: incorrect data check
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
at java.io.InputStream.read(InputStream.java:85)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
at msc.pig.EdgeLoader.getValidFields(Unknown Source)
at msc.pig.EdgeLoader.getNext(Unknown Source)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Does anyone have a suggestion on how I can check which file(s) have this problem?


Thanks

--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.: 55 31 34741485
Lab.: 55 31 34095840


  • Charles Gonçalves at Feb 10, 2011 at 11:56 pm
    Hi Guys,
    It's not a Hadoop solution, but it works:

    for i in `hadoop fs -ls /user/cdh-hadoop/mscdata/edgecast/201010/ | awk '{print $8}'`; do
        echo $i
        hadoop fs -cat $i | gzip -t
    done

    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0000.log.gz
    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0001.log.gz
    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0002.log.gz
    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0003.log.gz
    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0004.log.gz
    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0005.log.gz
    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0006.log.gz

    gzip: stdin: invalid compressed data--crc error

    gzip: stdin: invalid compressed data--length error
    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0007.log.gz
    /user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0008.log.gz


    The file listed immediately before the error messages is the corrupted
    one (here, wpc_NORM_201010-0001-0006.log.gz).
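
    A slightly tidier variant of the same idea (my sketch, not from the
    thread) checks gzip's exit status and prints only the broken files, so
    there is no output to scan by eye:

    for i in `hadoop fs -ls /user/cdh-hadoop/mscdata/edgecast/201010/ | awk '{print $8}'`; do
        # gzip -t returns non-zero when the integrity check fails; the
        # pipeline's exit status is gzip's, so the test catches it.
        if ! hadoop fs -cat $i | gzip -t 2>/dev/null; then
            echo "CORRUPT: $i"
        fi
    done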


    --
    *Charles Ferreira Gonçalves *
    http://homepages.dcc.ufmg.br/~charles/
    UFMG - ICEx - Dcc
    Cel.: 55 31 87741485
    Tel.: 55 31 34741485
    Lab.: 55 31 34095840
