Splitting log in a hostile environment -- bad hdfs -- we drop logs
------------------------------------------------------------------

Key: HBASE-642
URL: https://issues.apache.org/jira/browse/HBASE-642
Project: Hadoop HBase
Issue Type: Bug
Reporter: stack
Fix For: 0.1.3, 0.2.0


The master has noticed that the regionserver that was carrying the .META. region, among others, has died, and it goes to split its logs:

{code}
2008-05-18 19:58:01,292 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 2: hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.017
2008-05-18 19:58:01,408 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/1060231198/oldlogfile.log and region categories,2864153,1211005494348
2008-05-18 19:58:01,573 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log and region categories,5992242,1211005494349
{code}
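
For context, the split that emits the messages above reads each hlog.dat.N in the dead server's log directory and rewrites its edits into a per-region oldlogfile.log under that region's directory, which the region replays when it is next opened. A minimal sketch of that loop follows; Edit, EditWriter, readEdits and openOldLogWriter are invented stand-ins for illustration, not the actual org.apache.hadoop.hbase.HLog internals (which use SequenceFile readers and writers on HDFS):

{code}
// Hedged sketch of the per-region log split; all helper names are illustrative
// stand-ins, not the real 0.1.x HLog API.
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogSplitSketch {

  static class Edit {
    String region;    // region the edit belongs to
    byte[] payload;   // serialized key/value
  }

  interface EditWriter {
    void append(Edit e) throws IOException;
    void close() throws IOException;
  }

  // Rewrite every edit in the dead server's log files into one
  // <region>/oldlogfile.log per region so the region can replay it on reopen.
  static void splitLog(List<String> logFiles) throws IOException {
    Map<String, EditWriter> writers = new HashMap<String, EditWriter>();
    try {
      for (int i = 0; i < logFiles.size(); i++) {
        // "Splitting i of n: <path>"
        for (Edit e : readEdits(logFiles.get(i))) {
          EditWriter w = writers.get(e.region);
          if (w == null) {
            // "Creating new log file writer for path .../oldlogfile.log and region ..."
            w = openOldLogWriter(e.region);
            writers.put(e.region, w);
          }
          // Any IOException while creating or appending here is the failure
          // mode this issue describes.
          w.append(e);
        }
      }
    } finally {
      for (EditWriter w : writers.values()) {
        w.close();
      }
    }
  }

  // Hypothetical helpers; in HBase these are SequenceFile readers/writers on HDFS.
  static List<Edit> readEdits(String logFile) throws IOException {
    throw new UnsupportedOperationException("stand-in");
  }

  static EditWriter openOldLogWriter(String region) throws IOException {
    throw new UnsupportedOperationException("stand-in");
  }
}
{code}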

The master can't write to HDFS for some reason, so it can't complete the log split:

{code}
2008-05-18 19:59:15,265 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: Read timed out
2008-05-18 19:59:15,266 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_7852777250062244002
2008-05-18 19:59:15,268 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.219.207:50010
2008-05-18 19:59:39,945 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.254.30.79:60020. Already tried 6 time(s).
2008-05-18 20:00:21,274 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.254.30.79:50010
2008-05-18 20:00:21,275 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_7007215478628265924
2008-05-18 20:00:21,277 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.219.207:50010
2008-05-18 20:00:40,955 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.254.30.79:60020. Already tried 7 time(s).
2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.254.30.79:50010
2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_2374125514769088471
2008-05-18 20:01:31,180 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.219.207:50010
2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: Read timed out
2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-621042589816139684
2008-05-18 20:01:40,148 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.219.207:50010
..
{code}

Weirdly, the above is complaining that it can't connect to the datanode running on the same host as the master.

Eventually the split fails with:

{code}
2008-05-18 20:24:28,393 WARN org.apache.hadoop.hbase.HMaster: Processing pending operations: ProcessServerShutdown of 10.254.30.79:60020
java.io.IOException: java.io.IOException: Could not complete write to file /hbase/categories/1060231198/oldlogfile.log by DFSClient_520078809
at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:343)
at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1116)
2008-05-18 20:24:28,394 DEBUG org.apache.hadoop.hbase.HMaster: Main processing loop: ProcessServerShutdown of 10.254.30.79:60020
2008-05-18 20:24:28,394 INFO org.apache.hadoop.hbase.HMaster: process shutdown of server 10.254.30.79:60020: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2008-05-18 20:24:28,395 INFO org.apache.hadoop.hbase.HLog: splitting 1 log(s) in hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020
2008-05-18 20:24:28,395 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 1: hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018
2008-05-18 20:24:28,399 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log already exists. Copying existing file to new file
2008-05-18 20:25:03,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.254.30.79:60020. Already tried 9 time(s).
2008-05-18 20:25:28,401 WARN org.apache.hadoop.hbase.HLog: Exception processing hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018 -- continuing. Possible DATA LOSS!
java.net.SocketTimeoutException: timed out waiting for rpc response
...
{code}

And we just move on to the next log -- we notice the earlier attempt at distributing the edits and try to pick them up (though in this case the file is likely empty) -- but the split of the new log also fails.

Though the log says 'Possible DATA LOSS!', we keep going. The .META. and other regions are reassigned and deployed, though they are likely missing edits.

In this catastrophic case, I'd say the master should not move on, and if it has to, it should go down rather than reassign regions and try to keep going.
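
What the proposal amounts to, roughly, is letting the split failure stop the shutdown processing instead of being swallowed, so regions are never reassigned without their edits. A hedged sketch of the two shapes, with invented names (processShutdownLenient/Strict, splitDeadServerLog, abortMaster) standing in for the real HMaster code:

{code}
// Hedged sketch only: these are illustrative names, not the actual HMaster API.
import java.io.IOException;

public class FailFastSketch {

  // Current behavior (simplified): warn and keep going, which is how regions
  // end up redeployed while missing write-ahead-log edits.
  static void processShutdownLenient(String deadServer) {
    try {
      splitDeadServerLog(deadServer);
    } catch (IOException e) {
      System.err.println("Possible DATA LOSS splitting log for " + deadServer + ": " + e);
    }
    reassignRegions(deadServer);   // happens regardless of the split outcome
  }

  // Proposed behavior (simplified): never reassign on a failed split; either
  // let the shutdown be retried later or take the master down.
  static void processShutdownStrict(String deadServer) throws IOException {
    try {
      splitDeadServerLog(deadServer);
    } catch (IOException e) {
      abortMaster("Cannot split log for " + deadServer + "; refusing to drop edits", e);
      throw e;
    }
    reassignRegions(deadServer);   // only reached after a successful split
  }

  static void splitDeadServerLog(String deadServer) throws IOException { /* HLog.splitLog(...) */ }
  static void reassignRegions(String deadServer) { /* assign regions to live servers */ }
  static void abortMaster(String why, Throwable t) { /* shut the master down */ }
}
{code}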

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • stack (JIRA) at May 24, 2008 at 9:40 pm
    [ https://issues.apache.org/jira/browse/HBASE-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-642:
    ------------------------

    Summary: Splitting log in a hostile environment -- bad hdfs -- we drop write-ahead-log edits (was: Splitting log in a hostile environment -- bad hdfs -- we drop logs)
  • stack (JIRA) at May 28, 2008 at 6:16 pm
    [ https://issues.apache.org/jira/browse/HBASE-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack resolved HBASE-642.
    -------------------------

    Resolution: Fixed
    Assignee: stack

    Committed branch and trunk.

    Changed the try/catch around the reconstruction-log replay so that we only catch and pass over EOFExceptions. On any other IOException, just report it and let the region fail on each open attempt until there is human intervention: if an IOException comes up while playing the reconstruction log, something is badly wrong, probably in HDFS.

    {code}
    Index: src/java/org/apache/hadoop/hbase/HStore.java
    ===================================================================
    --- src/java/org/apache/hadoop/hbase/HStore.java (revision 660681)
    +++ src/java/org/apache/hadoop/hbase/HStore.java (working copy)
    @@ -825,13 +825,21 @@
    
         try {
           doReconstructionLog(reconstructionLog, maxSeqId, reporter);
    +    } catch (EOFException e) {
    +      // Presume we got here because of lack of HADOOP-1700; for now keep going
    +      // but this is probably not what we want long term. If we got here there
    +      // has been data-loss
    +      LOG.warn("Exception processing reconstruction log " + reconstructionLog +
    +        " opening " + this.storeName +
    +        " -- continuing. Probably lack-of-HADOOP-1700 causing DATA LOSS!", e);
         } catch (IOException e) {
    -      // Presume we got here because of some HDFS issue or because of a lack of
    -      // HADOOP-1700; for now keep going but this is probably not what we want
    -      // long term. If we got here there has been data-loss
    +      // Presume we got here because of some HDFS issue. Don't just keep going.
    +      // Fail to open the HStore. Probably means we'll fail over and over
    +      // again until human intervention but alternative has us skipping logs
    +      // and losing edits: HBASE-642.
           LOG.warn("Exception processing reconstruction log " + reconstructionLog +
    -        " opening " + this.storeName +
    -        " -- continuing. Probably DATA LOSS!", e);
    +        " opening " + this.storeName, e);
    +      throw e;
         }
    {code}
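
    Read straight through, the replay logic after this patch amounts to the following (reconstructed from the diff above; the surrounding HStore constructor code is elided):

    {code}
    try {
      doReconstructionLog(reconstructionLog, maxSeqId, reporter);
    } catch (EOFException e) {
      // Lack of HADOOP-1700 (appends) can leave a truncated reconstruction log;
      // for now keep going, though edits beyond the truncation point are lost.
      LOG.warn("Exception processing reconstruction log " + reconstructionLog +
        " opening " + this.storeName +
        " -- continuing. Probably lack-of-HADOOP-1700 causing DATA LOSS!", e);
    } catch (IOException e) {
      // Anything else probably means HDFS is unhealthy: log it and rethrow so
      // the store (and region) fails to open instead of silently dropping edits.
      LOG.warn("Exception processing reconstruction log " + reconstructionLog +
        " opening " + this.storeName, e);
      throw e;
    }
    {code}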

