Hi folks. I'd like to run the following data loss scenario by you to see if
we are doing something obviously wrong with our setup here.

Setup:

- Hadoop 0.20.1
- HBase 0.20.3
- 1 master node running the NameNode, SecondaryNameNode, JobTracker,
HMaster and a single ZooKeeper (no ZooKeeper quorum right now)
- 4 worker nodes, each running a DataNode, TaskTracker and RegionServer
- dfs.replication is set to 2
- Host: Amazon EC2
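
For reference, the replication factor is just the standard hdfs-site.xml
setting; ours looks roughly like this:

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>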

Up until yesterday, we were frequently experiencing
HBASE-2077<https://issues.apache.org/jira/browse/HBASE-2077>,
which kept bringing our RegionServers down. What we realized, though, is that
we were losing data (a few hours' worth) with just one out of four
regionservers going down. This is problematic since we replicate at 2x
across 4 nodes, so at least one other node should theoretically be able to
serve the data that the downed regionserver can't.

Questions:

- When a regionserver goes down unexpectedly, the only data that
theoretically gets lost is whatever didn't make it to the WAL, right? Or
wrong? (A small client-side sketch follows this list.) E.g.
http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
- We ran a hadoop fsck on our cluster (command sketched below this list)
and verified the replication factor as well as that there were no
under-replicated blocks. So why was our data not available from another node?
- If the log gets rolled every 60 minutes by default (we haven't touched
the defaults; the relevant config snippet is below this list), how can we
lose data from up to 24 hours ago?
- When the downed regionserver comes back up, shouldn't that data be
available again? Ours wasn't.
- In such scenarios, is there a recommended approach for restoring the
regionserver that goes down? We first just brought them back up by logging
onto the node itself and manually restarting them. Now we have automated
cron jobs (sketched below this list) that probe their ports and restart
them within two minutes if they go down.
- Are there ways to recover such lost data?
- Are versions 0.89 / 0.90 addressing any of these issues?
- Curiosity question: when a regionserver goes down, does the master try
to replicate that node's data on another node to satisfy the dfs.replication
ratio?
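
To make the first question concrete: as far as I can tell from the 0.20
client API, an edit only bypasses the WAL if you explicitly ask for it.
A minimal sketch (table, family and values are made up):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalDemo {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("fam"), Bytes.toBytes("col"), Bytes.toBytes("val"));
        // Default is true: the edit is appended to the WAL before going
        // into the memstore. false trades durability for write speed.
        put.setWriteToWAL(false);
        table.put(put);
      }
    }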
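
The fsck check was roughly this (assuming the default /hbase root dir);
the summary reports the average replication and any under-replicated
blocks:

    hadoop fsck /hbase -files -blocks -locations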
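
The roll interval we're referring to is, if I'm reading the defaults
right, hbase.regionserver.logroll.period in hbase-site.xml, in
milliseconds:

    <property>
      <name>hbase.regionserver.logroll.period</name>
      <value>3600000</value>
    </property>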
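
The watchdog is nothing fancy; roughly the following, run from cron every
minute (the port is the regionserver default, the install path is ours):

    #!/bin/sh
    # crude regionserver watchdog
    if ! nc -z localhost 60020; then
      /usr/local/hbase/bin/hbase-daemon.sh start regionserver
    fi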

For now, we have upgraded our HBase to 0.20.6, which is supposed to contain
the HBASE-2077 <https://issues.apache.org/jira/browse/HBASE-2077> fix (though
no one has verified it yet). Lars' blog also suggests that Hadoop 0.21.0 is
the way to go to avoid the file-append issues, but it's not production-ready
yet. Should we stick with 0.20.1? Upgrade to 0.20.2?

Any tips here are definitely appreciated. I'll be happy to provide more
information as well.

-GS


  • Todd Lipcon at Sep 19, 2010 at 10:00 pm
    Hi George,

    The data loss problems you mentioned below are known issues when running on
    stock Apache 0.20.x hadoop.

    You should consider upgrading to CDH3b2, which includes a number of HDFS
    patches that allow HBase to durably store data. You'll also have to upgrade
    to HBase 0.89 - we ship a version as part of CDH that will work well.

    Thanks
    -Todd

    --
    Todd Lipcon
    Software Engineer, Cloudera
  • George P. Stathis at Sep 20, 2010 at 7:40 pm
    Thanks Todd. We are not quite ready to move to 0.89 yet. We have made custom
    modifications to the transactional contrib sources, which are now taken out
    of 0.89. We are planning on moving to 0.90 when it comes out and, at that
    point, either migrating our customizations or moving back to the
    out-of-the-box features (which will require a rewrite of our code).

    We are well aware of the CDH distros, but at the time we started with HBase
    there was none that included it. I think CDH3 is the first one to include
    HBase, correct? And is 0.89 the only version supported?

    Moreover, are we saying that there is no way to prevent stock HBase 0.20.6
    and Hadoop 0.20.2 from losing data when a single node goes down? It does not
    matter that the data is replicated; it will still get lost?

    -GS
  • Ryan Rawson at Sep 20, 2010 at 7:53 pm
    Hey,

    The problem is that stock 0.20 Hadoop won't let you read from a
    non-closed file; it will report the length as 0. So if a
    regionserver crashes, the last WAL file that is still open becomes 0
    length and the data within it is unreadable. That, specifically, is the
    data loss problem. You could always make it so your regionservers
    rarely crash - this is possible, btw, and I did it for over a year.
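
    In client terms the symptom looks like this; a rough sketch (the WAL
    path is made up) against the stock 0.20 FileSystem API:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class WalLengthCheck {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // hypothetical WAL of a crashed regionserver, never closed
            Path wal = new Path("/hbase/.logs/rs1,60020,1284930000000/hlog.dat.0");
            // under stock 0.20 this prints 0 even though the edits reached
            // the datanodes, so log splitting recovers nothing from it
            System.out.println(wal + " length=" + fs.getFileStatus(wal).getLen());
          }
        }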

    But you will want to run CDH3 or the append-branch releases to get the
    series of patches that fix this hole. It also happens that only 0.89
    runs on them. I would like to avoid the Hadoop "everyone uses 0.20
    forever" problem and talk about what we could do to help you get on
    0.89. Over here at SU we've made a commitment to the future of 0.89
    and are running it in production. Let us know what else you'd need.

    -ryan

  • George P. Stathis at Sep 20, 2010 at 8:23 pm
    Thanks for the response, Ryan. I have no doubt that 0.89 can be used in
    production and that it has strong support. I just wanted to avoid moving to
    it now because we have limited resources, and it would put a dent in our
    roadmap if we were to fast-track the migration now. Specifically, we are
    using HBASE-2438 and HBASE-2426 to support pagination across indexes. So we
    either have to migrate those to 0.89 or somehow go stock and still be able
    to support pagination across region servers.

    Of course, if the choice is between migrating or losing more data, data
    safety comes first. But if we can buy two or three more months of time and
    avoid region server crashes (like you did for a year), maybe we can go that
    route for now. What do we need to do to achieve that?

    -GS

    PS: Out of curiosity, I understand the WAL append issue for a single
    regionserver when it comes to losing the data on a single node. But if that
    data is also being replicated on another region server, why wouldn't it be
    available there? Or is the WAL shared across multiple region servers
    (maybe that's what I'm missing)?

  • Ryan Rawson at Sep 20, 2010 at 8:56 pm
    When you say replication, what exactly do you mean? In normal HDFS, as
    you write, the data is sent to 3 nodes, yes; but with the flaw I
    outlined it doesn't matter, because the datanodes and namenode will
    pretend a data block just didn't exist if it wasn't closed properly.

    So even with the most careful white-glove handling of HBase, you will
    eventually have a crash, and you will lose data without 0.89/CDH3 et al.
    You can circumvent this by storing the data elsewhere and spooling it
    into HBase, or perhaps just not minding if you lose data (yes, those
    applications exist).
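
    If you went the spooling route, a bare-bones sketch with the 0.20
    client API might look like this (spool format, path, table and column
    names are all made up):

        import java.io.BufferedReader;
        import java.io.FileReader;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class SpoolReplay {
          public static void main(String[] args) throws Exception {
            // writes were first appended to a durable local spool file;
            // this replays them into HBase once the cluster is healthy
            HTable table = new HTable(new HBaseConfiguration(), "events");
            BufferedReader spool = new BufferedReader(
                new FileReader("/var/spool/hbase/events.log"));
            String line;
            while ((line = spool.readLine()) != null) {
              String[] fields = line.split("\t"); // row <TAB> value
              Put put = new Put(Bytes.toBytes(fields[0]));
              put.add(Bytes.toBytes("d"), Bytes.toBytes("v"),
                  Bytes.toBytes(fields[1]));
              table.put(put);
            }
            spool.close();
            table.flushCommits(); // flush any client-side buffered edits
          }
        }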

    Looking at those JIRAs, the first is already on trunk, which is
    0.89. The second isn't, alas. At this point, transactional HBase just
    isn't being actively maintained by any committer and we are reliant on
    kind people's contributions. So I can't promise when it will hit
    0.89/0.90.

    -ryan

  • George P. Stathis at Sep 21, 2010 at 12:20 am

    On Mon, Sep 20, 2010 at 4:55 PM, Ryan Rawson wrote:

    When you say replication, what exactly do you mean? In normal HDFS, as
    you write, the data is sent to 3 nodes, yes; but with the flaw I
    outlined it doesn't matter, because the datanodes and namenode will
    pretend a data block just didn't exist if it wasn't closed properly.

    That's the part I was not understanding. I do now. Thanks.

    So even with the most careful white-glove handling of HBase, you will
    eventually have a crash, and you will lose data without 0.89/CDH3 et al.
    You can circumvent this by storing the data elsewhere and spooling it
    into HBase, or perhaps just not minding if you lose data (yes, those
    applications exist).

    Looking at those JIRAs, the first is already on trunk, which is
    0.89. The second isn't, alas. At this point, transactional HBase just
    isn't being actively maintained by any committer and we are reliant on
    kind people's contributions. So I can't promise when it will hit
    0.89/0.90.

    Are you aware of any indexing alternatives in 0.89?

  • Ryan Rawson at Sep 21, 2010 at 12:44 am
    Hi,

    Sorry, I don't. I think the current transactional/indexed person is
    working on bringing it up to 0.89; perhaps they would enjoy your help
    in testing or porting the code?

    I'll poke a few people into replying.

    -ryan
  • Stack at Sep 23, 2010 at 4:00 am
    Hey George:

    James Kennedy is working on getting transactional HBase working with
    HBase TRUNK. Watch HBASE-2641 for the drop of changes needed in core
    so that his GitHub THBase can use HBase core.

    St.Ack
  • George P. Stathis at Sep 23, 2010 at 4:43 pm
    I'm there. Thanks St.Ack.
    On Wed, Sep 22, 2010 at 11:59 PM, Stack wrote:

    Hey George:

    James Kennedy is working on getting transactional hbase working w/
    hbase TRUNK. Watch HBASE-2641 for the drop of changes needed in core
    to make it so his github THBase can use HBase core.

    St.Ack
    On Mon, Sep 20, 2010 at 5:43 PM, Ryan Rawson wrote:
    hi,

    sorry i dont. i think the current transactional/indexed person is
    working on bringing it up to 0.89, perhaps they would enjoy your help
    in testing or porting the code?

    I'll poke a few people into replying.

    -ryan
    On Mon, Sep 20, 2010 at 5:19 PM, George P. Stathis wrote:
    On Mon, Sep 20, 2010 at 4:55 PM, Ryan Rawson wrote:

    When you say replication what exactly do you mean? In normal HDFS, as
    you write the data is sent to 3 nodes yes, but with the flaw I
    outlined, it doesnt matter because the datanodes and namenode will
    pretend a data block just didnt exist if it wasnt closed properly.
    That's the part I was not understanding. I do now. Thanks.

    So even with the most careful white glove handling of hbase, you will
    eventually have a crash and you will lose data w/o 0.89/CDH3 et. al.
    You can circumvent this by storing the data elsewhere and spooling
    into hbase, or perhaps just not minding if you lose data (yes those
    applications exist).

    Looking at those JIRAs in question, the first is already on trunk
    which is 0.89. The second isn't alas. At this point the
    transactional hbase just isnt being actively maintained by any
    committer and we are reliant on kind people's contributions. So I
    can't promise when it will hit 0.89/0.90.
    Are you aware of any indexing alternatives in 0.89?

    -ryan


    On Mon, Sep 20, 2010 at 1:21 PM, George P. Stathis <
    gstathis@traackr.com>
    wrote:
    Thanks for the response Ryan. I have no doubt that 0.89 can be used
    in
    production and that it has strong support. I just wanted to avoid
    moving
    to
    it now because we have limited resources and it would put a dent in
    our
    roadmap if we were to fast track the migration now. Specifically, we
    are
    using HBASE-2438 and HBASE-2426 to support pagination across indexes.
    So
    we
    either have to migrate those to 0.89 or somehow go stock and be able
    to
    support pagination across region servers.

    Of course, if the choice is between migrating or losing more data,
    data
    safety comes first. But if we can buy two or three more months of
    time
    and
    avoid region server crashes (like you did for a year), maybe we can
    go
    that
    route for now. What do we need to do achieve that?

    -GS

    PS: Out of curiosity, I understand the WAL log append issue for a
    single
    regionserver when it comes to losing the data on a single node. But
    if
    that
    data is also being replicated on another region server, why wouldn't
    it
    be
    available there? Or is the WAL log shared across multiple region
    servers
    (maybe that's what I'm missing)?

    On Mon, Sep 20, 2010 at 3:52 PM, Ryan Rawson wrote:

    Hey,

    The problem is that the stock 0.20 hadoop won't let you read from a
    non-closed file. It will report that length as 0. So if a
    regionserver crashes, that last WAL log that is still open becomes 0
    length and the data within it is unreadable. That specifically is the
    problem of data loss. You could always make it so your regionservers
    rarely crash - this is possible btw and I did it for over a year.
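
    To make that failure mode concrete, here is a minimal sketch using the
    Hadoop FileSystem API; the WAL path below is an illustrative assumption,
    but the zero length is exactly what a reader on stock 0.20 sees when the
    writer died without closing the file:

        // Illustrative sketch: query the length of a regionserver's WAL file.
        // On stock Hadoop 0.20, a file that was never closed (e.g. after a
        // regionserver crash) reports a length of 0, so its last block is
        // effectively invisible regardless of dfs.replication.
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class WalLengthCheck {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical WAL path; the layout is /hbase/.logs/<server>/<log>
            Path wal = new Path("/hbase/.logs/regionserver-1/hlog.1285000000000");
            FileStatus status = fs.getFileStatus(wal);
            // Prints 0 on stock 0.20 if the writer crashed before close().
            System.out.println("Reported length: " + status.getLen());
          }
        }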

    But you will want to run CDH3 or the append-branch releases to get the
    series of patches that fix this hole. It also happens that only 0.89
    runs on it. I would like to avoid the hadoop "everyone uses 0.20
    forever" problem and talk about what we could do to help you get on
    0.89. Over here at SU we've made a commitment to the future of 0.89
    and are running it in production. Let us know what else you'd need.
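
    For reference, on an append-capable build (CDH3 or branch-0.20-append)
    the switch that matters is typically dfs.support.append in
    hdfs-site.xml; treat this as an illustrative snippet, since exact
    property handling varies by build:

        <!-- hdfs-site.xml on a CDH3 / branch-0.20-append build (illustrative);
             stock Apache 0.20.x does not implement a working append/sync. -->
        <property>
          <name>dfs.support.append</name>
          <value>true</value>
        </property>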

    -ryan

    On Mon, Sep 20, 2010 at 12:39 PM, George P. Stathis wrote:

    Thanks Todd. We are not quite ready to move to 0.89 yet. We have made
    custom modifications to the transactional contrib sources which are now
    taken out of 0.89. We are planning on moving to 0.90 when it comes out
    and, at that point, either migrate our customizations or move back to
    the out-of-the-box features (which will require a re-write of our code).

    We are well aware of the CDH distros, but at the time we started with
    hbase there was none that included HBase. I think CDH3 is the first one
    to include HBase, correct? And is 0.89 the only one supported?

    Moreover, are we saying that there is no way to prevent stock hbase
    0.20.6 and hadoop 0.20.2 from losing data when a single node goes down?
    It does not matter if the data is replicated, it will still get lost?

    -GS

    On Sun, Sep 19, 2010 at 5:58 PM, Todd Lipcon <todd@cloudera.com> wrote: [...]

  • Andrew Purtell at Sep 22, 2010 at 9:58 pm
    While 0.89/0.90 is the way to go, there is also the 0.20-append branch of Hadoop, in the hadoop-common repo, which is better than nothing if using HBase 0.20:

    http://github.com/apache/hadoop-common/tree/branch-0.20-append

    There is also an amalgamation of 0.20-append and Yahoo Secure Hadoop 0.20.104:

    http://github.com/apurtell/hadoop-common/tree/yahoo-hadoop-0.20.104-append

    I'd recommend the former unless you also want strong authentication via Kerberos.
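
    Roughly, grabbing and building that branch would look like the
    following (an untested sketch; the build target is an assumption, so
    check the branch's build.xml):

        # Hedged sketch: fetch and build the 0.20-append branch.
        git clone git://github.com/apache/hadoop-common.git
        cd hadoop-common
        git checkout branch-0.20-append
        ant jar    # build target is an assumption; consult build.xml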

    Best regards,

    - Andy

    Why is this email five sentences or less?
    http://five.sentenc.es/
  • George P. Stathis at Sep 23, 2010 at 1:22 am
    Thanks Andy, it's good to know there is an alternative. We'll attempt to go
    to 0.89 but if we can't get reliable indexing, we may have to go with this
    hadoop-append branch.

    -GS
    On Wed, Sep 22, 2010 at 5:57 PM, Andrew Purtell wrote:

    [...]


  • Andrew Purtell at Sep 23, 2010 at 3:18 pm
    Something else to consider is after 0.90, or whenever the coprocessor framework goes in, we will quite possibly build some kind of secondary indexing capability as a coprocessor (see HBASE-2000 and sub-issues). This stuff won't be backported, at least by us.
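
    Purely as a hypothetical illustration of the idea - no such API existed
    at the time of this thread, so the observer interface, method
    signature, and table names below are all assumptions loosely modeled on
    the HBASE-2000 design discussions:

        // Hypothetical sketch only: a region-side hook mirroring each put on
        // an assumed cf:email column into an index table, inverted as
        // (value -> primary row key). Not a shipped API at the time.
        import java.io.IOException;
        import java.util.List;
        import org.apache.hadoop.hbase.KeyValue;
        import org.apache.hadoop.hbase.client.HTableInterface;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
        import org.apache.hadoop.hbase.coprocessor.ObserverContext;
        import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
        import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
        import org.apache.hadoop.hbase.util.Bytes;

        public class SecondaryIndexObserver extends BaseRegionObserver {
          @Override
          public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                             Put put, WALEdit edit, boolean writeToWAL)
              throws IOException {
            List<KeyValue> kvs =
                put.get(Bytes.toBytes("cf"), Bytes.toBytes("email"));
            if (kvs == null || kvs.isEmpty()) {
              return; // nothing to index in this put
            }
            // Write the inverted entry before the primary put commits.
            Put indexPut = new Put(kvs.get(0).getValue());
            indexPut.add(Bytes.toBytes("ref"), Bytes.toBytes("row"), put.getRow());
            HTableInterface index =
                ctx.getEnvironment().getTable(Bytes.toBytes("mytable_index"));
            try {
              index.put(indexPut);
            } finally {
              index.close();
            }
          }
        }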

    Best regards,

    - Andy

    Why is this email five sentences or less?
    http://five.sentenc.es/

    [...]
  • George P. Stathis at Sep 23, 2010 at 4:57 pm
    Thanks Andy. Good to know that's coming up. I started following
    HBASE-2038. It does seem that it's quite a bit out (I'm guessing well
    into 2011). I think we will definitely be interested in migrating any
    indexes we have to the Coprocessor model. We definitely prefer to go
    with features supported in core. Until then, though,
    hbase-transactional-tableindexed seems to be our best bet unless there
    is something else folks here suggest we do.

    -GS
    On Thu, Sep 23, 2010 at 11:17 AM, Andrew Purtell wrote:

    [...]
