long write operations and data recovery
If I have a write operation that takes a while between opening and closing the file,
what is the effect of a node doing that writing crashing in the middle? For example,
suppose I have large logs that I write to continually, rolling them every N minutes
(say every hour for the sake of discussion). If I have the file opened and am 90% done
with my writes and things crash, what happens to the data I've "written" ... realizing
that at some level, data isn't visible to the rest of the cluster until the file is closed.

--
Steve Sapovits
Invite Media - http://www.invitemedia.com
ssapovits@invitemedia.com
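
A minimal sketch of the write pattern being asked about, using the stock Hadoop
FileSystem API (the class name and path below are made up for illustration). Nothing
written to the stream is guaranteed to be visible to the rest of the cluster until
close() returns:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RollingLogWriter {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // One file per roll interval; the path here is hypothetical.
            FSDataOutputStream out = fs.create(new Path("/logs/app-current.log"));
            try {
                // Records written here are not guaranteed to be visible to
                // readers until close() below succeeds.
                out.write("one log record\n".getBytes("UTF-8"));
                // ... many more writes over the course of the hour ...
            } finally {
                // If the writer crashes before this point, the Namenode's lease
                // recovery (described in the first reply below) eventually closes
                // the file, typically discarding the partially written last block.
                out.close();
            }
        }
    }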


  • dhruba Borthakur at Feb 26, 2008 at 6:30 pm
    The Namenode maintains a lease for every open file that is being written
    to. If the client that was writing to the file disappears, the Namenode
    will do "lease recovery" after expiry of the lease timeout (1 hour). The
    lease recovery process (in most cases) will remove the last block from
    the file (it was not fully written because the client crashed before it
    could fill up the block) and close the file.

    Thanks,
    dhruba
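
    As a toy illustration of that rule (not the actual Namenode code -- the class and
    field names below are invented), the recovery decision amounts to something like:

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of the lease-recovery rule described above.
    class LeaseRecoveryModel {
        static final long HARD_LIMIT_MS = 60L * 60 * 1000; // roughly the 1-hour lease timeout

        static class OpenFile {
            List<byte[]> completeBlocks = new ArrayList<byte[]>(); // fully written blocks
            byte[] partialLastBlock;   // block the writer was still filling
            long lastLeaseRenewalMs;   // writers renew their lease while alive
            boolean closed;
        }

        // If the writer has not renewed its lease within the hard limit,
        // drop the incomplete last block and close the file on its behalf.
        static void recoverIfExpired(OpenFile f, long nowMs) {
            if (!f.closed && nowMs - f.lastLeaseRenewalMs > HARD_LIMIT_MS) {
                f.partialLastBlock = null; // the unfinished tail is lost
                f.closed = true;           // fully written blocks remain readable
            }
        }
    }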

    -----Original Message-----
    From: Steve Sapovits
    Sent: Monday, February 25, 2008 5:13 PM
    To: core-user@hadoop.apache.org
    Subject: long write operations and data recovery


    If I have a write operation that takes a while between opening and closing the file,
    what is the effect of a node doing that writing crashing in the middle? For example,
    suppose I have large logs that I write to continually, rolling them every N minutes
    (say every hour for the sake of discussion). If I have the file opened and am 90% done
    with my writes and things crash, what happens to the data I've "written" ... realizing
    that at some level, data isn't visible to the rest of the cluster until the file is closed.

    --
    Steve Sapovits
    Invite Media - http://www.invitemedia.com
    ssapovits@invitemedia.com
  • Steve Sapovits at Feb 29, 2008 at 12:02 am

    dhruba Borthakur wrote:

    The Namenode maintains a lease for every open file that is being written
    to. If the client that was writing to the file disappears, the Namenode
    will do "lease recovery" after expiry of the lease timeout (1 hour). The
    lease recovery process (in most cases) will remove the last block from
    the file (it was not fully written because the client crashed before it
    could fill up the block) and close the file.
    How does replication affect this? If there's at least one replicated client still
    running, I assume that takes care of it?

    --
    Steve Sapovits
    Invite Media - http://www.invitemedia.com
    ssapovits@invitemedia.com
  • Steve Sapovits at Feb 29, 2008 at 12:55 am

    How does replication affect this? If there's at least one replicated
    client still running, I assume that takes care of it?
    Never mind -- I get this now after reading the docs again.

    My remaining point of failure question concerns name nodes. The docs say manual
    intervention is still required if a name node goes down. How is this typically managed
    in production environments? It would seem even a short name node outage in a
    data intensive environment would lead to data loss (no name node to give the data
    to).

    --
    Steve Sapovits
    Invite Media - http://www.invitemedia.com
    ssapovits@invitemedia.com
  • Joydeep Sen Sarma at Feb 29, 2008 at 2:07 am
    We have had a lot of peace of mind by building a data pipeline that does
    not assume that hdfs is always up and running. If the application is
    primarily non real-time log processing - I would suggest
    batch/incremental copies of data to hdfs that can catch up automatically
    in case of failures/downtimes.

    We have an rsync-like map-reduce job that monitors log directories and
    keeps pulling new data in (and I suspect a lot of other users do similar
    stuff as well). Might be a useful notion to generalize and put in
    contrib.
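
    A single-process stand-in for that catch-up copier (the real thing is a map-reduce
    job; the local and HDFS paths below are made up) could look like:

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IncrementalLogCopier {
        public static void main(String[] args) throws Exception {
            File localLogDir = new File("/var/log/app");      // hypothetical source directory
            Path hdfsLogDir = new Path("/data/raw-logs");     // hypothetical target directory
            FileSystem fs = FileSystem.get(new Configuration());
            fs.mkdirs(hdfsLogDir);
            File[] logs = localLogDir.listFiles();
            if (logs == null) return;
            for (File f : logs) {
                Path target = new Path(hdfsLogDir, f.getName());
                // Copy only rolled (closed) files that HDFS does not have yet, so the
                // job can simply be re-run to catch up after an HDFS outage.
                if (!fs.exists(target)) {
                    fs.copyFromLocalFile(new Path(f.getAbsolutePath()), target);
                }
            }
        }
    }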


    -----Original Message-----
    From: Steve Sapovits
    Sent: Thursday, February 28, 2008 4:54 PM
    To: core-user@hadoop.apache.org
    Subject: Re: long write operations and data recovery

    How does replication affect this? If there's at least one replicated
    client still running, I assume that takes care of it?
    Never mind -- I get this now after reading the docs again.

    My remaining point of failure question concerns name nodes. The docs say manual
    intervention is still required if a name node goes down. How is this typically
    managed in production environments? It would seem even a short name node outage
    in a data intensive environment would lead to data loss (no name node to give the
    data to).

    --
    Steve Sapovits
    Invite Media - http://www.invitemedia.com
    ssapovits@invitemedia.com
  • Ted Dunning at Feb 29, 2008 at 2:10 am
    This is exactly what we do as well. We also have auto-detection of
    modifications and downstream processing so that back-filling in the presence
    of error corrections is possible (the errors can be old processing code or file
    munging).

    On 2/28/08 6:06 PM, "Joydeep Sen Sarma" wrote:

    We have had a lot of peace of mind by building a data pipeline that does
    not assume that hdfs is always up and running. If the application is
    primarily non real-time log processing - I would suggest
    batch/incremental copies of data to hdfs that can catch up automatically
    in case of failures/downtimes.

    We have an rsync-like map-reduce job that monitors log directories and
    keeps pulling new data in (and I suspect a lot of other users do similar
    stuff as well). Might be a useful notion to generalize and put in
    contrib.
  • dhruba Borthakur at Feb 29, 2008 at 5:55 am
    I agree with Joydeep. For batch processing, it is sufficient to make the
    application not assume that HDFS is always up and active. However, for
    real-time applications that are not batch-centric, it might not be
    sufficient. There are a few things that HDFS could do to better handle
    Namenode outages:

    1. Make clients handle transient Namenode downtime. This requires that
    Namenode restarts are fast, that clients can handle long Namenode outages,
    etc.
    2. Design HDFS Namenode to be a set of two, an active one and a passive
    one. The active Namenode could continuously forward transactions to the
    passive one. In case of failure of the active Namenode, the passive
    could take over. This type of High-Availability would probably be very
    necessary for non-batch-type-applications.

    Thanks,
    dhruba
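
    A minimal sketch of what option 1 might look like on the client side (a
    hypothetical helper, not an existing Hadoop API): wrap HDFS calls so that a short
    Namenode outage shows up as extra latency rather than a hard failure.

    import java.io.IOException;
    import java.util.concurrent.Callable;

    class RetryingHdfsCall {
        // Retry an HDFS operation with a growing pause between attempts.
        static <T> T withRetries(Callable<T> op, int maxAttempts, long baseSleepMs) throws Exception {
            IOException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return op.call();
                } catch (IOException e) {              // e.g. Namenode unreachable or restarting
                    last = e;
                    Thread.sleep(baseSleepMs * attempt); // back off a little more each time
                }
            }
            throw last;                                // Namenode stayed down too long
        }
    }

    Calls such as fs.exists() or fs.create() could be routed through withRetries with a
    handful of attempts and a few seconds of backoff.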

    -----Original Message-----
    From: Joydeep Sen Sarma
    Sent: Thursday, February 28, 2008 6:06 PM
    To: core-user@hadoop.apache.org
    Subject: RE: long write operations and data recovery

    We have had a lot of peace of mind by building a data pipeline that does
    not assume that hdfs is always up and running. If the application is
    primarily non real-time log processing - I would suggest
    batch/incremental copies of data to hdfs that can catch up automatically
    in case of failures/downtimes.

    We have an rsync-like map-reduce job that monitors log directories and
    keeps pulling new data in (and I suspect a lot of other users do similar
    stuff as well). Might be a useful notion to generalize and put in
    contrib.


    -----Original Message-----
    From: Steve Sapovits
    Sent: Thursday, February 28, 2008 4:54 PM
    To: core-user@hadoop.apache.org
    Subject: Re: long write operations and data recovery

    How does replication affect this? If there's at least one replicated
    client still running, I assume that takes care of it?
    Never mind -- I get this now after reading the docs again.

    My remaining point of failure question concerns name nodes. The docs say manual
    intervention is still required if a name node goes down. How is this typically
    managed in production environments? It would seem even a short name node outage
    in a data intensive environment would lead to data loss (no name node to give the
    data to).

    --
    Steve Sapovits
    Invite Media - http://www.invitemedia.com
    ssapovits@invitemedia.com
  • Ted Dunning at Feb 29, 2008 at 4:40 pm
    In our case, we looked at the problem and decided that Hadoop wasn't
    feasible for our real-time needs in any case. There were several issues,

    - first of all, map-reduce itself didn't seem very plausible for real-time
    applications. That left hbase and hdfs as the capabilities offered by
    hadoop (for real-time stuff)

    - hbase was far too immature to consider using. Also, the read rate from
    hbase is not that impressive compared, say, to a bank of a dozen or more
    memcaches.

    - hdfs won't handle nearly the volume of files that we need to work with.
    In our main delivery system (one of many needs), we have nearly a billion
    (=10^9) files that we have to be able to export at high data rates. That
    just isn't feasible in hadoop without lots of extra work.

    The upshot is that we use hadoop extensively for batch operations where it
    really shines. The other nice effect is that we don't have to worry all
    that much about HA (at least not real-time HA) since we don't do real-time
    with hadoop.

    On 2/28/08 9:53 PM, "dhruba Borthakur" wrote:

    I agree with Joydeep. For batch processing, it is sufficient to make the
    application not assume that HDFS is always up and active. However, for
    real-time applications that are not batch-centric, it might not be
    sufficient. There are a few things that HDFS could do to better handle
    Namenode outages:

    1. Make clients handle transient Namenode downtime. This requires that
    Namenode restarts are fast, that clients can handle long Namenode outages,
    etc.
    2. Design HDFS Namenode to be a set of two, an active one and a passive
    one. The active Namenode could continuously forward transactions to the
    passive one. In case of failure of the active Namenode, the passive
    could take over. This type of High-Availability would probably be very
    necessary for non-batch-type-applications.

    Thanks,
    dhruba

    -----Original Message-----
    From: Joydeep Sen Sarma
    Sent: Thursday, February 28, 2008 6:06 PM
    To: core-user@hadoop.apache.org
    Subject: RE: long write operations and data recovery

    We have had a lot of peace of mind by building a data pipeline that does
    not assume that hdfs is always up and running. If the application is
    primarily non real-time log processing - I would suggest
    batch/incremental copies of data to hdfs that can catch up automatically
    in case of failures/downtimes.

    We have an rsync-like map-reduce job that monitors log directories and
    keeps pulling new data in (and I suspect a lot of other users do similar
    stuff as well). Might be a useful notion to generalize and put in
    contrib.


    -----Original Message-----
    From: Steve Sapovits
    Sent: Thursday, February 28, 2008 4:54 PM
    To: core-user@hadoop.apache.org
    Subject: Re: long write operations and data recovery

    How does replication affect this? If there's at least one replicated
    client still running, I assume that takes care of it?
    Never mind -- I get this now after reading the docs again.

    My remaining point of failure question concerns name nodes. The docs say manual
    intervention is still required if a name node goes down. How is this typically
    managed in production environments? It would seem even a short name node outage
    in a data intensive environment would lead to data loss (no name node to give the
    data to).
  • Steve Sapovits at Feb 29, 2008 at 7:19 pm

    Ted Dunning wrote:

    In our case, we looked at the problem and decided that Hadoop wasn't
    feasible for our real-time needs in any case. There were several
    issues,

    - first of all, map-reduce itself didn't seem very plausible for
    real-time applications. That left hbase and hdfs as the capabilities
    offered by hadoop (for real-time stuff)
    We'll be using map-reduce batch mode, so we're okay there.
    The upshot is that we use hadoop extensively for batch operations
    where it really shines. The other nice effect is that we don't have
    to worry all that much about HA (at least not real-time HA) since we
    don't do real-time with hadoop.
    What I'm struggling with is the write side of things. We'll have a huge
    amount of data to write that's essentially a log format. It would seem
    that writing that outside of HDFS then trying to batch import it would
    be a losing battle -- that you would need the distributed nature of HDFS
    to do very large volume writes directly and wouldn't easily be able to take
    some other flat storage model and feed it in as a secondary step without
    having the HDFS side start to lag behind.

    The realization is that Name Node could go down so we'll have to have a
    backup store that might be used during temporary outages, but that
    most of the writes would be direct HDFS updates.

    The alternative would seem to be to end up with a set of distributed files
    without some unifying distributed file system (e.g., like lots of Apache
    web logs on many many individual boxes) and then have to come up with
    some way to funnel those back into HDFS.

    --
    Steve Sapovits
    Invite Media - http://www.invitemedia.com
    ssapovits@invitemedia.com
  • Ted Dunning at Feb 29, 2008 at 7:33 pm
    Unless your volume is MUCH higher than ours, I think you can get by with a
    relatively small farm of log consolidators that collect and concatenate
    files.

    If each log line is 100 bytes after compression (that is huge really) and
    you have 10,000 events per second (also pretty danged high) then you are
    only writing 1MB/s. If you need a day of buffering (=100,000 seconds), then
    you need 100GB of buffer storage. These are very, very moderate
    requirements for your ingestion point.
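
    Spelling that estimate out (same assumed figures as above):

    public class IngestEstimate {
        public static void main(String[] args) {
            long bytesPerLine = 100;       // compressed log line (deliberately generous)
            long eventsPerSecond = 10000;  // also on the high side
            long bytesPerSecond = bytesPerLine * eventsPerSecond;  // 1,000,000 B/s, about 1 MB/s
            long bufferSeconds = 100000;   // roughly a day
            long bufferBytes = bytesPerSecond * bufferSeconds;     // 1e11 B, about 100 GB
            System.out.println(bytesPerSecond + " bytes/s, " + bufferBytes + " bytes of buffer");
        }
    }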

    On 2/29/08 11:18 AM, "Steve Sapovits" wrote:

    Ted Dunning wrote:
    In our case, we looked at the problem and decided that Hadoop wasn't
    feasible for our real-time needs in any case. There were several
    issues,

    - first of all, map-reduce itself didn't seem very plausible for
    real-time applications. That left hbase and hdfs as the capabilities
    offered by hadoop (for real-time stuff)
    We'll be using map-reduce batch mode, so we're okay there.
    The upshot is that we use hadoop extensively for batch operations
    where it really shines. The other nice effect is that we don't have
    to worry all that much about HA (at least not real-time HA) since we
    don't do real-time with hadoop.
    What I'm struggling with is the write side of things. We'll have a huge
    amount of data to write that's essentially a log format. It would seem
    that writing that outside of HDFS then trying to batch import it would
    be a losing battle -- that you would need the distributed nature of HDFS
    to do very large volume writes directly and wouldn't easily be able to take
    some other flat storage model and feed it in as a secondary step without
    having the HDFS side start to lag behind.

    The realization is that Name Node could go down so we'll have to have a
    backup store that might be used during temporary outages, but that
    most of the writes would be direct HDFS updates.

    The alternative would seem to be to end up with a set of distributed files
    without some unifying distributed file system (e.g., like lots of Apache
    web logs on many many individual boxes) and then have to come up with
    some way to funnel those back into HDFS.
  • dhruba Borthakur at Feb 29, 2008 at 7:53 pm
    It would be nice if a layer on top of the dfs client could be built to handle
    disconnected operation. That layer can cache files on local disk if HDFS
    is unavailable. It can then upload those files into HDFS when HDFS
    service comes back online. I think such a service will be helpful for
    most HDFS installations.

    Thanks,
    dhruba
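
    A rough sketch of such a layer (a hypothetical class, not an existing Hadoop API):
    try HDFS first, spill to a local spool directory if the write fails, and drain the
    spool once HDFS is reachable again. The paths below are made up.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class SpoolingUploader {
        private final File spoolDir = new File("/var/spool/hdfs-uploads"); // hypothetical location

        // Write to HDFS if possible; otherwise keep the data on local disk for later.
        void store(String name, byte[] data) throws IOException {
            try {
                FileSystem fs = FileSystem.get(new Configuration());
                FSDataOutputStream out = fs.create(new Path("/incoming/" + name));
                out.write(data);
                out.close();
            } catch (IOException hdfsDown) {
                spoolDir.mkdirs();
                FileOutputStream out = new FileOutputStream(new File(spoolDir, name));
                out.write(data);
                out.close();
            }
        }

        // Once HDFS service is back, upload anything that accumulated locally.
        void drainSpool() throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            File[] pending = spoolDir.listFiles();
            if (pending == null) return;
            for (File f : pending) {
                fs.copyFromLocalFile(new Path(f.getAbsolutePath()), new Path("/incoming/" + f.getName()));
                f.delete();
            }
        }
    }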

    -----Original Message-----
    From: Ted Dunning
    Sent: Friday, February 29, 2008 11:33 AM
    To: core-user@hadoop.apache.org
    Subject: Re: long write operations and data recovery


    Unless your volume is MUCH higher than ours, I think you can get by with a
    relatively small farm of log consolidators that collect and concatenate files.

    If each log line is 100 bytes after compression (that is huge really) and
    you have 10,000 events per second (also pretty danged high) then you are
    only writing 1MB/s. If you need a day of buffering (=100,000 seconds), then
    you need 100GB of buffer storage. These are very, very moderate
    requirements for your ingestion point.

    On 2/29/08 11:18 AM, "Steve Sapovits" wrote:

    Ted Dunning wrote:
    In our case, we looked at the problem and decided that Hadoop wasn't
    feasible for our real-time needs in any case. There were several
    issues,

    - first of all, map-reduce itself didn't seem very plausible for
    real-time applications. That left hbase and hdfs as the capabilities
    offered by hadoop (for real-time stuff)
    We'll be using map-reduce batch mode, so we're okay there.
    The upshot is that we use hadoop extensively for batch operations
    where it really shines. The other nice effect is that we don't have
    to worry all that much about HA (at least not real-time HA) since we
    don't do real-time with hadoop.
    What I'm struggling with is the write side of things. We'll have a huge
    amount of data to write that's essentially a log format. It would seem
    that writing that outside of HDFS then trying to batch import it would
    be a losing battle -- that you would need the distributed nature of HDFS
    to do very large volume writes directly and wouldn't easily be able to take
    some other flat storage model and feed it in as a secondary step without
    having the HDFS side start to lag behind.

    The realization is that Name Node could go down so we'll have to have a
    backup store that might be used during temporary outages, but that
    most of the writes would be direct HDFS updates.

    The alternative would seem to be to end up with a set of distributed files
    without some unifying distributed file system (e.g., like lots of Apache
    web logs on many many individual boxes) and then have to come up with
    some way to funnel those back into HDFS.
  • Joydeep Sen Sarma at Feb 29, 2008 at 9:55 pm
    I would agree with Ted. You should easily be able to get 100MBps write
    throughput on a standard Netapp box (with read bandwidth left over -
    since the peak write throughput rating is more than twice that). Even
    at an average write throughput rate of 50MBps - the daily data volume
    would be (drumroll ..) 4+TB!

    So buffer to a decent box and copy stuff over ..
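
    Spelling that out with the same assumed 50MBps average:

    public class BufferBoxEstimate {
        public static void main(String[] args) {
            long writeMBps = 50;                       // conservative average, per the note above
            long secondsPerDay = 24 * 60 * 60;         // 86,400
            long dailyMB = writeMBps * secondsPerDay;  // 4,320,000 MB, i.e. 4+ TB per day
            System.out.println(dailyMB / 1000000.0 + " TB per day");
        }
    }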

    -----Original Message-----
    From: Ted Dunning
    Sent: Friday, February 29, 2008 11:33 AM
    To: core-user@hadoop.apache.org
    Subject: Re: long write operations and data recovery


    Unless your volume is MUCH higher than ours, I think you can get by with a
    relatively small farm of log consolidators that collect and concatenate files.

    If each log line is 100 bytes after compression (that is huge really) and
    you have 10,000 events per second (also pretty danged high) then you are
    only writing 1MB/s. If you need a day of buffering (=100,000 seconds), then
    you need 100GB of buffer storage. These are very, very moderate
    requirements for your ingestion point.

    On 2/29/08 11:18 AM, "Steve Sapovits" wrote:

    Ted Dunning wrote:
    In our case, we looked at the problem and decided that Hadoop wasn't
    feasible for our real-time needs in any case. There were several
    issues,

    - first of all, map-reduce itself didn't seem very plausible for
    real-time applications. That left hbase and hdfs as the capabilities
    offered by hadoop (for real-time stuff)
    We'll be using map-reduce batch mode, so we're okay there.
    The upshot is that we use hadoop extensively for batch operations
    where it really shines. The other nice effect is that we don't have
    to worry all that much about HA (at least not real-time HA) since we
    don't do real-time with hadoop.
    What I'm struggling with is the write side of things. We'll have a huge
    amount of data to write that's essentially a log format. It would seem
    that writing that outside of HDFS then trying to batch import it would
    be a losing battle -- that you would need the distributed nature of HDFS
    to do very large volume writes directly and wouldn't easily be able to take
    some other flat storage model and feed it in as a secondary step without
    having the HDFS side start to lag behind.

    The realization is that Name Node could go down so we'll have to have a
    backup store that might be used during temporary outages, but that
    most of the writes would be direct HDFS updates.

    The alternative would seem to be to end up with a set of distributed files
    without some unifying distributed file system (e.g., like lots of Apache
    web logs on many many individual boxes) and then have to come up with
    some way to funnel those back into HDFS.
  • Andy Li at Feb 29, 2008 at 10:38 pm
    What about a hot standby namenode?

    Using a write-ahead log to survive a crash and recover seems fine for small I/O.
    For large volumes, the write-ahead log will actually take up a lot of the system's
    I/O capacity, since it means two I/Os per block (the log plus the actual data).
    This falls back to how current database designs implement crash recovery.

    Another thing I don't see in the picture is how Hadoop manages file system calls.
    Each OS has a different file system implementation, and I believe that calling
    'write' or 'flush' does not really flush the data to disk. I'm not sure whether
    this is unavoidable and OS-dependent, but I cannot find any documents describing
    how Hadoop handles this.

    P.S. I handle HA and fail-over mechanisms in my own application, but I think for
    a framework it should be transparent (or semi-transparent) to the user.

    -annndy
    On Fri, Feb 29, 2008 at 1:54 PM, Joydeep Sen Sarma wrote:

    I would agree with Ted. You should easily be able to get 100MBps write
    throughput on a standard Netapp box (with read bandwidth left over -
    since the peak write throughput rating is more than twice that). Even
    at an average write throughput rate of 50MBps - the daily data volume
    would be (drumroll ..) 4+TB!

    So buffer to a decent box and copy stuff over ..

    -----Original Message-----
    From: Ted Dunning
    Sent: Friday, February 29, 2008 11:33 AM
    To: core-user@hadoop.apache.org
    Subject: Re: long write operations and data recovery


    Unless your volume is MUCH higher than ours, I think you can get by with a
    relatively small farm of log consolidators that collect and concatenate files.

    If each log line is 100 bytes after compression (that is huge really) and
    you have 10,000 events per second (also pretty danged high) then you are
    only writing 1MB/s. If you need a day of buffering (=100,000 seconds), then
    you need 100GB of buffer storage. These are very, very moderate
    requirements for your ingestion point.

    On 2/29/08 11:18 AM, "Steve Sapovits" wrote:

    Ted Dunning wrote:
    In our case, we looked at the problem and decided that Hadoop wasn't
    feasible for our real-time needs in any case. There were several
    issues,

    - first of all, map-reduce itself didn't seem very plausible for
    real-time applications. That left hbase and hdfs as the capabilities
    offered by hadoop (for real-time stuff)
    We'll be using map-reduce batch mode, so we're okay there.
    The upshot is that we use hadoop extensively for batch operations
    where it really shines. The other nice effect is that we don't have
    to worry all that much about HA (at least not real-time HA) since we
    don't do real-time with hadoop.
    What I'm struggling with is the write side of things. We'll have a huge
    amount of data to write that's essentially a log format. It would seem
    that writing that outside of HDFS then trying to batch import it would
    be a losing battle -- that you would need the distributed nature of HDFS
    to do very large volume writes directly and wouldn't easily be able to take
    some other flat storage model and feed it in as a secondary step without
    having the HDFS side start to lag behind.

    The realization is that Name Node could go down so we'll have to have a
    backup store that might be used during temporary outages, but that
    most of the writes would be direct HDFS updates.

    The alternative would seem to be to end up with a set of distributed files
    without some unifying distributed file system (e.g., like lots of Apache
    web logs on many many individual boxes) and then have to come up with
    some way to funnel those back into HDFS.
  • Jason Venner at Feb 29, 2008 at 6:53 am
    Us also. Pulling in data from external machines and then running a pipeline of
    simple map/reduces is our standard pattern.

    Joydeep Sen Sarma wrote:
    We have had a lot of peace of mind by building a data pipeline that does
    not assume that hdfs is always up and running. If the application is
    primarily non real-time log processing - I would suggest
    batch/incremental copies of data to hdfs that can catch up automatically
    in case of failures/downtimes.

    We have an rsync-like map-reduce job that monitors log directories and
    keeps pulling new data in (and I suspect a lot of other users do similar
    stuff as well). Might be a useful notion to generalize and put in
    contrib.


    -----Original Message-----
    From: Steve Sapovits
    Sent: Thursday, February 28, 2008 4:54 PM
    To: core-user@hadoop.apache.org
    Subject: Re: long write operations and data recovery


    How does replication affect this? If there's at least one replicated
    client still running, I assume that takes care of it?
    Never mind -- I get this now after reading the docs again.

    My remaining point of failure question concerns name nodes. The docs say manual
    intervention is still required if a name node goes down. How is this typically
    managed in production environments? It would seem even a short name node outage
    in a data intensive environment would lead to data loss (no name node to give the
    data to).
