Regions assigned while master is splitting logs of recently crashed server; regionserver tries to execute incomplete log
------------------------------------------------------------------------------------------------------------------------

Key: HBASE-1099
URL: https://issues.apache.org/jira/browse/HBASE-1099
Project: Hadoop HBase
Issue Type: Bug
Environment: apurtell 25 node cluster
Reporter: stack
Priority: Blocker
Fix For: 0.19.0


In the master log, I see the master trying to process the log split of a crashed server. It has split two of three logs. The server that just crashed comes back online. Balancing cuts in and the master starts assigning the new server regions. The new regionserver starts opening regions and messing with the log file that the master is still trying to write. It makes for a mess. Here is how the events play out, with a focus on a region that gets opened while the master is processing the split:

{code}
2008-12-29 16:16:38,456 INFO org.apache.hadoop.hbase.master.ServerManager: XX.XX.XX.53:60020 lease expired
...
2008-12-29 16:16:39,494 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for path hdfs://XX.XX.XX.XX:50000/data/hbase/content/1526904420/oldlogfile.log and region content,36946541ed9a62f419cf7238d32a6a38,1230448587552
...
2008-12-29 16:17:19,686 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: XX.XX.XX.53:60020
....
2008-12-29 16:17:22,480 DEBUG org.apache.hadoop.hbase.master.BaseScanner: Current assignment of content,36946541ed9a62f419cf7238d32a6a38,1230448587552 is not valid. serverInfo: address: XX.XX.XX.53:60020, startcode: 1230585439593, load: (requests=3, regions=0, usedHeap=28, maxHeap=1777), passed startCode: 1230577089035, storedInfo.startCode: 1230585439593 Region is not unassigned, assigned or pending
...
2008-12-29 16:17:23,622 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region content,36946541ed9a62f419cf7238d32a6a38,1230448587552 to server XX.XX.XX.53:60020
...
2008-12-29 16:17:26,632 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: content,36946541ed9a62f419cf7238d32a6a38,1230448587552 from XX.XX.XX.53:60020
...
2008-12-29 16:17:29,666 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: content,36946541ed9a62f419cf7238d32a6a38,1230448587552 from XX.XX.XX.53:60020
....
2008-12-29 16:17:31,933 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Applied 100001 total edits from hdfs://XX.XX.XX.XX:50000/data/hbase/log_XX.XX.XX.53_1230577089035_60020/hlog.dat.1230582612406
2008-12-29 16:17:31,941 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 3 of 3: hdfs://XX.XX.XX.XX:50000/data/hbase/log_XX.XX.XX.53_1230577089035_60020/hlog.dat.1230584516314
....
2008-12-29 16:17:34,522 INFO org.apache.hadoop.dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.LeaseExpiredException: No lease on /data/hbase/content/1526904420/oldlogfile.log File does not exist. [Lease. Holder: DFSClient_-1506530059, pendingcreates: 45]
at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1172)
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1103)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
...
{code}

On the regionserver side, I see this when it tries to open the above region:

{code}
2008-12-29 21:17:28,811 WARN org.apache.hadoop.hbase.regionserver.HStore: Exception processing reconstruction log hdfs://XX.XX.XX.XX:50000/data/hbase/content/1526904420/oldlogfile.log opening [B@12183272 -- continuing. Probably lack-of-HADOOP-1700 causing DATA LOSS!
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1933)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1833)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
at org.apache.hadoop.hbase.regionserver.HStore.doReconstructionLog(HStore.java:351)
at org.apache.hadoop.hbase.regionserver.HStore.runReconstructionLog(HStore.java:296)
at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:236)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1624)
at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:270)
at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1364)
at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1335)
at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1251)
at java.lang.Thread.run(Thread.java:619)
...
{code}
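
To make the failure mode concrete: when a regionserver opens a region, it replays the per-region reconstruction log (the oldlogfile.log above) that the master writes while splitting a dead server's HLogs. The following is only an illustrative sketch, not the HStore code, and the key/value handling is generic; it shows why reading that SequenceFile while the master is still writing it surfaces as the EOFException in the second trace.

{code}
// Illustrative sketch of replaying a per-region reconstruction log.
// Not the actual HStore.doReconstructionLog(); types are handled generically.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ReconstructionLogReplaySketch {
  public static void replay(FileSystem fs, Path oldLogFile, Configuration conf)
      throws Exception {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, oldLogFile, conf);
    try {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      // next() reads the next record; if the master has not finished writing
      // (or never closed) the file, a truncated record shows up here as the
      // java.io.EOFException seen in the regionserver trace.
      while (reader.next(key, val)) {
        // apply the recovered edit to the region's memcache/store here
      }
    } finally {
      reader.close();
    }
  }
}
{code}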

  • Andrew Purtell (JIRA) at Dec 30, 2008 at 12:36 am
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659712#action_12659712 ]

    Andrew Purtell commented on HBASE-1099:
    ---------------------------------------

    We have a service recovery framework that will restart a crashed regionserver after approximately 90 seconds.
  • Jim Kellerman (JIRA) at Jan 5, 2009 at 6:56 pm
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jim Kellerman reassigned HBASE-1099:
    ------------------------------------

    Assignee: Jim Kellerman
  • stack (JIRA) at Jan 6, 2009 at 5:56 am
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661052#action_12661052 ]

    stack commented on HBASE-1099:
    ------------------------------

    Jim, if you want me to work on this or on hbase-1096, just hand it over (you have all outstanding 0.19.0 issues bar hbase-856).
  • Jim Kellerman (JIRA) at Jan 6, 2009 at 5:21 pm
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jim Kellerman reassigned HBASE-1099:
    ------------------------------------

    Assignee: stack (was: Jim Kellerman)
  • stack (JIRA) at Jan 6, 2009 at 11:06 pm
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1099:
    -------------------------

    Attachment: 1099.patch

    First cut.

    Changed the Set of dead servers to a Map whose value records whether or not that server's logs have been split. Changed BaseScanner so it doesn't give out regions that were on a dead server unless the logs have been split. Also, if a regionserver comes in and we still have ghost references to it in the master -- i.e. its lease has not expired or it is on the dead-servers list -- we now put the regionserver into a holding pattern until the references are cleared.

    Testing now.
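
    For illustration only -- this is not the attached patch, and the class and method names below are made up -- here is a minimal sketch of the bookkeeping the comment describes: a map of dead servers recording whether their logs have been split, a check consulted before handing out a region, and a check used to park a rejoining regionserver while ghost references remain.

    {code}
    // Hypothetical sketch of the dead-server bookkeeping described in the
    // comment above; not the actual 1099.patch code.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class MasterDeadServerBookkeepingSketch {
      // dead server address -> have its HLogs been split yet?
      private final Map<String, Boolean> deadServers =
        new ConcurrentHashMap<String, Boolean>();

      public void serverLeaseExpired(String server) {
        deadServers.put(server, Boolean.FALSE);   // logs not split yet
      }

      public void logsSplit(String server) {
        deadServers.put(server, Boolean.TRUE);    // regions may be reassigned
      }

      public void referencesCleared(String server) {
        deadServers.remove(server);               // no more ghost references
      }

      // BaseScanner-style check: don't give out a region whose last host is
      // a dead server still being split.
      public boolean okToAssign(String lastHost) {
        Boolean split = deadServers.get(lastHost);
        return split == null || split.booleanValue();
      }

      // ServerManager-style check: park a reporting regionserver while the
      // master still holds references to its previous incarnation.
      public boolean holdRegionServer(String reportingServer) {
        return deadServers.containsKey(reportingServer);
      }
    }
    {code}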

  • stack (JIRA) at Jan 7, 2009 at 12:34 am
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1099:
    -------------------------

    Attachment: 1099v2.patch

    Here is a tested version. It seems to work.

    I see things like this in the log if I start up a regionserver before its lease has expired:

    {code}
    2009-01-07 00:20:20,076 [IPC Server handler 6 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Lease still held on XX.XX.XX.139:60020... waiting on it
    {code}

    then....

    {code}
    2009-01-07 00:20:30,267 [ServerManager.leaseChecker] INFO org.apache.hadoop.hbase.master.ServerManager: XX.XX.XX.139:60020 lease expired
    2009-01-07 00:20:30,355 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 9 log(s) in hdfs://XX.XX.XX.u.powerset.com:9000/hbasetrunk2/log_208.76.44.139_1231286896346_60020
    2009-01-07 00:20:30,357 [HMaster] DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 1 of 9: hdfs://XX.XX.XX.u.powerset.com:9000/hbasetrunk2/log_208.76.44.139_1231286896346_60020/hlog.dat.1231287184359
    ....
    {code}

    Though the lease has expired, the reporting regionserver is still parked during split processing; you'll see these messages:

    {code}
    2009-01-07 00:20:50,097 [IPC Server handler 6 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: On list of dead servers: address: XX.XX.XX.139:60020, startcode: 1231287559888, load: (requests=0, regions=0, usedHeap=25, maxHeap=1244) .... waiting on removal
    {code}

    Assignments continue -- of splits (see below) -- but regions from the dead server are not assigned:

    {code}
    tTable,0000696213,1231287650867 from XX.XX.XX.141:60020
    2009-01-07 00:20:58,553 [IPC Server handler 2 on 60000] INFO org.apache.hadoop.hbase.master.RegionManager: assigning region TestTable,0000632029,1231287650867 to server XX.XX.XX.141:60020
    2009-01-07 00:20:58,568 [IPC Server handler 2 on 60000] INFO org.apache.hadoop.hbase.master.RegionManager: assigning region TestTable,0000696213,1231287650867 to server XX.XX.XX.141:60020
    2009-01-07 00:21:00,107 [IPC Server handler 6 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: On list of dead servers: address: XX.XX.XX.139:60020, startcode: 1231287559888, load: (requests=0, regions=0, usedHeap=25, maxHeap=1244)....
    {code}


    ... but it is still not right. I see that we'll still give out regions from the dead server in some cases. Need to add more checks.
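
    The two waits visible in the DEBUG lines above can be summarized as a sketch (the interface and method names are assumed, not the real ServerManager API): a rejoining regionserver is held first while its old lease is still live, then while its old entry remains on the dead-servers list, and only afterwards is it registered and eligible for assignments.

    {code}
    // Hypothetical sketch of the startup gate implied by the log messages above.
    public class RegionServerStartupGateSketch {
      public interface MasterState {
        boolean leaseStillHeld(String serverAddress);    // old incarnation's lease not yet expired
        boolean onDeadServerList(String serverAddress);  // logs still splitting / refs not cleared
      }

      // Returns true only when the reporting server may be registered.
      public static boolean mayRegister(MasterState master, String serverAddress) {
        if (master.leaseStillHeld(serverAddress)) {
          // corresponds to "Lease still held on ... waiting on it"
          return false;
        }
        if (master.onDeadServerList(serverAddress)) {
          // corresponds to "On list of dead servers ... waiting on removal"
          return false;
        }
        return true;
      }
    }
    {code}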
  • stack (JIRA) at Jan 7, 2009 at 9:03 pm
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1099:
    -------------------------

    Attachment: 1099v8.patch

    Did a bunch of testing, mostly killing half of my little cluster and seeing how it recovered. In the new patch we do things like hold incoming new regionservers until the old lease has expired and its logs have been split before going ahead and registering them. Also shut down assignment of regions by the meta scanners if the regions were on a dead server. Testing turned up interesting issues, like needing to tickle lease renewal between lease expiration and while waiting on the log split to complete, but complete testing is not possible at this stage because the resultant cluster churn, with new servers coming online and loads being rebalanced, provokes the outstanding issues in 0.19.0. I'm going to commit what I have, then test when the last 0.19.0 issues are committed. Will open new issues for anything found.
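
    As a rough sketch of the hold described above (illustrative names only; the real change is in 1099v8.patch, which is not reproduced here): the master parks a rejoining regionserver until the old incarnation's lease has expired and its logs have been split, renewing the waiting server's lease in the meantime so that the wait itself does not expire it.

    {code}
    // Hypothetical sketch of holding a rejoining regionserver until it is safe
    // to register it; not the actual patch code.
    public class HoldUntilLogsSplitSketch {
      public interface MasterState {
        boolean oldLeaseExpired(String server);
        boolean logsSplit(String server);
        void renewLease(String server);   // keep the *new* incarnation's lease alive while it waits
      }

      public static void holdUntilSafe(MasterState master, String server)
          throws InterruptedException {
        while (!(master.oldLeaseExpired(server) && master.logsSplit(server))) {
          master.renewLease(server);      // the lease "tickle" mentioned in the comment
          Thread.sleep(1000);             // simple poll; real code would block on notification
        }
        // now safe to register the server and let it take region assignments
      }
    }
    {code}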
  • stack (JIRA) at Jan 7, 2009 at 9:05 pm
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack resolved HBASE-1099.
    --------------------------

    Resolution: Fixed
  • Andrew Purtell (JIRA) at Jan 7, 2009 at 9:39 pm
    [ https://issues.apache.org/jira/browse/HBASE-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661718#action_12661718 ]

    Andrew Purtell commented on HBASE-1099:
    ---------------------------------------

    Thanks stack. I'm going to try this out now.
