Rolling out Hadoop/HBase updates
HBase user mailing list | June 2010
Hey,

I've been thinking about how we do our configuration and code updates for
Hadoop and HBase, and was wondering what others do and what the best
practice is to avoid errors with HBase.

Currently we do a rolling update where we restart the services on one node
at a time: we shut down the region server, then restart the datanode and
tasktracker, depending on what we are updating and what has changed. But
with this I have occasionally found errors with the HBase cluster afterwards
due to a corrupt META table, which I think could have been caused by restarting
the datanode, or maybe by not waiting long enough for the cluster to sort out
losing a region server before moving on to the next.
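
Concretely, the per-node sequence we run is roughly the following (a sketch
only; the script paths assume the standard Hadoop/HBase bin directories and
the exact steps vary with what has changed):

# stop HBase's regionserver on this node first
$HBASE_HOME/bin/hbase-daemon.sh stop regionserver
# then the Hadoop daemons we need to update
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
# ... push new configs / jars here ...
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
$HBASE_HOME/bin/hbase-daemon.sh start regionserver
# then wait for regions to settle before moving to the next node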

The most recent error upon restarting a node was:

2010-06-29 10:46:44,970 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
files,3822b1ea8ae015f3ec932cafaa282dd211d768ad,1275145898366
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)

2010-06-29 10:46:44,970 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
HRegionServer: file system not available
java.io.IOException: File system is not available
at
org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:129)


This was followed by the below for every region being served:

2010-06-29 10:46:44,996 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
documents,082595c0-6d01-11df-936c-0026b95e484c,1275676410202
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)


After updating all the nodes, all the region servers shut down after a
few minutes, reporting the following:

2010-06-29 11:21:59,508 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-1437671530216085093_2565663 bad datanode[0]
10.0.11.4:50010

2010-06-29 11:22:09,481 FATAL org.apache.hadoop.hbase.regionserver.HLog:
Could not append. Requesting close of hlog
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)


2010-06-29 11:22:09,482 FATAL
org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with
ioe:
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)

2010-06-29 11:22:10,344 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close log in
abort
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)


This was fixed by restarting the master and starting the region servers
again, but it would be nice to know how to roll out changes more cleanly.

How do other people here roll out updates to HBase / Hadoop? What order do
you restart services in and how long do you wait before moving to the next
node?

Just so you know, we currently have 5 nodes and will be adding another 10
soon.

Thanks,

--
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

  • Stack at Jun 29, 2010 at 2:50 pm
    Hey Dan

    Are you using raw Apache Hadoop? If so, any patches? Do you have HDFS-630?

    Looking at the errors below, the regionserver thinks the filesystem is gone. Is there nothing in the log before the exception pasted below?

    In general you want to let the regionservers finish their shutdown. Any chance you are not letting this happen?

    For a rolling restart you should do the master first, then the regionservers. You get the best results if the cluster is quiet at the time; otherwise regions in transition can be "lost" over the master restart (to be fixed in HBase 0.90.0).
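
    In script form, that order might look something like the sketch below (purely illustrative; the wait and the host list are up to you, and it assumes the same install paths on every node):

    # restart the master first
    $HBASE_HOME/bin/hbase-daemon.sh stop master
    $HBASE_HOME/bin/hbase-daemon.sh start master
    # then each regionserver in turn, letting each finish its shutdown
    for rs in $(cat $HBASE_HOME/conf/regionservers); do
      ssh "$rs" "$HBASE_HOME/bin/hbase-daemon.sh stop regionserver"
      ssh "$rs" "$HBASE_HOME/bin/hbase-daemon.sh start regionserver"
      sleep 60   # give the master time to reassign regions before the next node
    done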

    Stack
    On Jun 29, 2010, at 6:43 AM, Dan Harvey wrote:

    [quoted message snipped]
  • Michael Segel at Jun 29, 2010 at 2:51 pm
    Dan,

    I don't think you can do that, because your 'new/updated' node will clash with the rest of the cloud.
    (We're talking code here, not just cloud tuning parameters; read: different jars...)

    If you're going to push an update out, then it has to be an 'all or nothing' push.

    Since we're using Cloudera's release, moving from CDH2 to CDH3 means a full backup, taking the cloud down, removing the software completely, and then installing the new CDH3. Outside of that major switch, if we were going from one sub-release to another, it would just be a $> yum update hadoop-0.20 call on each node.
    Again, you have to take the cloud down to do that.

    So the bottom line... if you're going to do upgrades, you'll need to plan for some downtime.
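
    A rough sketch of what that all-or-nothing push looks like (the host list file, paths, and package name are illustrative; assumes passwordless ssh and a user that can run yum on every node):

    # take HBase down first, then MapReduce and HDFS
    $HBASE_HOME/bin/stop-hbase.sh
    $HADOOP_HOME/bin/stop-all.sh
    # update every node to the same build before anything comes back up
    for node in $(cat all_nodes.txt); do
      ssh "$node" "yum -y update hadoop-0.20"
    done
    # bring the stack back up in reverse order
    $HADOOP_HOME/bin/start-all.sh
    $HBASE_HOME/bin/start-hbase.sh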

    HTH

    -Mike
    From: dan.harvey@mendeley.com
    Date: Tue, 29 Jun 2010 14:43:26 +0100
    Subject: Rolling out Hadoop/HBase updates
    To: user@hbase.apache.org

    [quoted message snipped]
  • Dan Harvey at Jul 4, 2010 at 5:13 pm
    Hey,

    We're using stock CDH2 without any patches, so I'm not sure if we have
    HDFS-630 or not. For HBase we're currently on 0.20.3, and we will be testing and
    moving to 0.20.5 soon.

    What I did with this rollout of config-only changes was take one region
    server down at a time and restart the datanode on the same server. From what I
    gather, I should have shut down all the region servers before
    restarting any of the datanodes?

    I guess if I split it into different parts it would be:

    - HBase: rolling updates for point/config releases are supported
    - Update the master first
    - Then update the region servers in turn

    - HDFS: datanodes don't support rolling updates? (Maybe better asked on the hdfs
    list, I guess)
    - Take down HBase
    - Take down datanodes
    - Update all the datanodes code/configs
    - Start datanodes
    - Start HBase

    Would you be able to let me know which of these I've got right/wrong?
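
    For the HDFS part, something like the following is what I have in mind (a rough sketch only; the slaves file and the ssh loop are just how I picture doing it):

    # stop HBase cluster-wide first so no regionserver has open HDFS files
    $HBASE_HOME/bin/stop-hbase.sh
    # push the new configs / code, then restart each datanode
    for node in $(cat $HADOOP_HOME/conf/slaves); do
      ssh "$node" "$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode"
      ssh "$node" "$HADOOP_HOME/bin/hadoop-daemon.sh start datanode"
    done
    # bring HBase back once HDFS is healthy again
    $HBASE_HOME/bin/start-hbase.sh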

    Thanks,
    On 29 June 2010 15:50, Michael Segel wrote:


    [quoted message snipped]
  • Dan Harvey at Jul 4, 2010 at 5:38 pm
    I've just looked into HDFS-630 and it looks like it was added in
    CDH2 0.20.1+169.89, and we're currently on 0.20.1+169.68. Would updating to
    that release, so we have the patch, help prevent some of these issues?
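
    (For what it's worth, this is roughly how I checked which build we're on; the package name assumes the CDH RPM install:)

    # prints the Hadoop version and build, e.g. 0.20.1+169.68
    hadoop version
    # or query the CDH package directly
    rpm -q hadoop-0.20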

    Thanks,
    On 4 July 2010 18:12, Dan Harvey wrote:

    [quoted message snipped]
  • Stack at Jul 5, 2010 at 5:34 pm

    On Sun, Jul 4, 2010 at 10:36 AM, Dan Harvey wrote:
    I've just looked into HDFS-630 and it looks like it was added in
    CDH2 0.20.1+169.89, and we're currently on 0.20.1+169.68. Would updating to
    that release, so we have the patch, help prevent some of these issues?
    For sure Dan. HDFS-630 will help at a minimum.
    St.Ack

