Grokbase Groups Hive user August 2011
After I load some 15 files, each about 150 MB in size, into a partition
of a table and run a select count(*) on the table, I keep getting an
error. In the JobTracker web interface, this turns out to be due to a
number of checksum errors, like this:

org.apache.hadoop.fs.ChecksumException: Checksum error:
/blk_8155249261522439492:of:/user/hive/warehouse/att_log/collect_time=1313592519963/load.dat
at 51794944
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1660)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2257)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2307)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:66)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:32)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

I have tried reformatting the filesystem and reloading, but the problem
persists, although the number of corrupted blocks varies each time. I
have also tried setting "io.skip.checksum.errors" to true, but it
still makes no difference.
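
For reference, this is roughly how I set it (a sketch; it assumes the
property is passed from the Hive session into the job configuration, and
att_log is the table name taken from the warehouse path above):

  # per session, before the failing query
  hive -e "SET io.skip.checksum.errors=true; SELECT COUNT(*) FROM att_log;"

  # or cluster-wide, in core-site.xml on every node:
  # <property>
  #   <name>io.skip.checksum.errors</name>
  #   <value>true</value>
  # </property>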

I use fsck to see when the file corruption happens. Oddly, right after
the data is loaded, fsck detects no corrupted blocks; it is only after
the select count(*) that fsck detects corrupted blocks.
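
These are roughly the fsck checks I run (the warehouse path comes from
the errors above):

  # report the table's files, their blocks and the block locations
  hadoop fsck /user/hive/warehouse/att_log -files -blocks -locations

  # summary for the whole namespace, including the corrupt block count
  hadoop fsck /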

If I just run hadoop fs -cat on the HDFS file, I get an error like this:

org.apache.hadoop.fs.ChecksumException: Checksum error:
/blk_6876231585863639009:of:/user/hive/warehouse/att_log/collect_time=1313592542265/load.dat
at 376832
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1660)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2257)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2307)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:53)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:118)
at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:356)
at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1934)
at org.apache.hadoop.fs.FsShell.cat(FsShell.java:350)
at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1568)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1790)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1916)
11/08/17 11:22:41 WARN hdfs.DFSClient: Found Checksum error for
blk_6876231585863639009_1004 from 192.168.50.192:50010 at 376832
11/08/17 11:22:41 INFO hdfs.DFSClient: Could not obtain block
blk_6876231585863639009_1004 from node: java.io.IOException: No live
nodes contain current block
11/08/17 11:22:41 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1
IOException, will wait for 1879.3346505085124 msec.
cat: Checksum error:
/blk_6876231585863639009:of:/user/hive/warehouse/att_log/collect_time=1313592542265/load.dat
at 376832


Does anyone know how to get around this issue? Thanks.


  • W S Chung at Aug 19, 2011 at 10:26 pm
    For some reason, my question sent two days ago has again never shown
    up, even though I can google it. I apologize if you have seen this
    question before.

    After loading around 2 GB or so of data in a few files into Hive, the
    "select count(*) from table" query keeps failing. The JobTracker UI
    gives the following error:

    org.apache.hadoop.fs.ChecksumException: Checksum error:
    /blk_8155249261522439492:of:/user/hive/warehouse/att_log/collect_time=1313592519963/load.dat
    at 51794944
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1660)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2257)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2307)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:66)
    at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:32)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:159)

    fsck reports that there are corrupted blocks. I have tried dropping the
    table and reloading a few times. As far as I can see, the behavior is
    somewhat different every time, in terms of how many blocks are corrupted
    and how many files I have loaded before the corrupted blocks appear.
    Sometimes the corrupted blocks show up right after the data is loaded
    and sometimes only after the "select count(*)" query is made. I have
    tried setting "io.skip.checksum.errors" to true, but it has no effect
    at all.

    I know that a checksum error is usually an indication of a hardware
    problem. But we are running Hive on an NFS cluster with ECC memory.
    Our system admin here is not willing to believe that our high-quality
    hardware has so many issues. I did try installing a simpler single-node
    Hive on another machine, and the problem does not appear in that
    install after the data is loaded. Can someone give me some pointers on
    what else to try? Thanks.
  • Aggarwal, Vaibhav at Aug 20, 2011 at 12:58 am
    This is a really curious case.

    How many replicas of each block do you have?

    Are you able to copy the data directly using the HDFS client?
    You could try the hadoop fs -copyToLocal command and see if it can copy the data from HDFS correctly.
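
    For example, something like this (the path is taken from the stack
    trace in your original mail):

      hadoop fs -copyToLocal /user/hive/warehouse/att_log/collect_time=1313592519963/load.dat /tmp/load.dat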

    That would help you verify that the issue really is at the HDFS layer (though it does look like that from the stack trace).

    Which file format are you using?

    Thanks
    Vaibhav

  • W S Chung at Aug 22, 2011 at 3:04 pm
    I tried using hadoop fs -copyToLocal. I also get a stack trace, like this:

    11/08/22 10:53:57 INFO fs.FSInputChecker: Found checksum error:
    b[1024, 1536]=31325431393a32313a31315a7c3137342e3235332e3234352e3232377c39376261623664642d353062342d343461612d383235642d6537336238646434336563337c36373842303935453945304431374635383833344135464336423341424646357c342e327c313931393638200a323031312d30352d31325431393a32313a31315a7c3137342e3235332e3234352e3232377c39376261623664642d353062342d343461612d383235642d6537336238646434336563337c36373842303935453945304031374635383833344135464336423341424646357c342e322e317c313931393638200a323031312d30352d31325431393a32323a33395a7c3137342e3235332e3234352e3232377c39376261623664642d353062342d343461612d383235642d6537336238646434336563337c36373842303935453945304431374635383833344135464336423341424646357c362e322e317c313837373837200a323031312d30352d31325431393a32323a34335a7c3137342e3235332e3234352e3232377c39376261623664642d353062342d343461612d383235642d6537336238646434336563337c36373842303935453945304431374635383833344135464336423341424646357c362e337c3138373738375f61745f706f736974696f6e5f3835200a323031312d30352d31325431393a32323a34335a7c3137342e
    org.apache.hadoop.fs.ChecksumException: Checksum error:
    /blk_2722854101062410251:of:/user/hive/warehouse/att_log/collect_time=1314024490064/load.dat
    at 64635904
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1158)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1718)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1770)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:53)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:72)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:320)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:248)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:199)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:1754)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:1880)
    11/08/22 10:53:57 WARN hdfs.DFSClient: Found Checksum error for
    blk_2722854101062410251_1038 from 192.168.50.192:50010 at 64635904
    11/08/22 10:53:57 INFO hdfs.DFSClient: Could not obtain block
    blk_2722854101062410251_1038 from any node: java.io.IOException: No
    live nodes contain current block
    copyToLocal: Checksum error:
    /blk_2722854101062410251:of:/user/hive/warehouse/att_log/collect_time=1314024490064/load.dat
    at 64635904


    I managed to load two files (by using the Java API copyFromLocal call
    and then a 'load data inpath' statement to load the data into the
    table). hadoop fsck does not show corrupted blocks until I run the
    'select count(*)' query after loading the second file. 'hadoop fs
    -copyToLocal' also only fails after hadoop fsck shows corrupted blocks.
    For the first loaded file, 'hadoop fs -copyToLocal' works fine. It does
    look like the problem is with HDFS.
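
    Roughly the load sequence, shown here as the shell equivalent of the
    Java copyFromLocal call (the staging path is just an example; the table
    and partition names come from the HDFS paths above):

      hadoop fs -copyFromLocal /local/data/load.dat /user/hive/staging/load.dat
      hive -e "LOAD DATA INPATH '/user/hive/staging/load.dat'
               INTO TABLE att_log PARTITION (collect_time='1314024490064');"
      # the corrupted blocks only show up after this step
      hive -e "SELECT COUNT(*) FROM att_log;"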

    I originally discovered this issue on a two-node cluster with a
    replication factor of 2. But I am now testing on a pseudo-distributed
    install with only one node and a replication factor of 1.

    I am using text files. I would like to try SequenceFile, since I
    understand the "io.skip.checksum.errors" setting only applies to
    SequenceFiles. But the only way I know to load data into a table stored
    as SequenceFile is to first load the text files into a table stored as
    TextFile and then use an 'insert ... select' to copy the data into the
    SequenceFile table. The 'insert ... select' already fails with the same
    problem as running a query on the TextFile table. Is there any other
    way to load a SequenceFile table?
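
    To spell out the two-step route I mean (a sketch; the att_log_seq name
    is made up, and col1, col2 have to be replaced with the real data
    columns of att_log, excluding the partition column):

      hive -e "
        CREATE TABLE att_log_seq LIKE att_log;
        ALTER TABLE att_log_seq SET FILEFORMAT SEQUENCEFILE;
        INSERT OVERWRITE TABLE att_log_seq PARTITION (collect_time='1314024490064')
        SELECT col1, col2 FROM att_log WHERE collect_time='1314024490064';"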


