Hi again,

Could anybody explain the scanning period policy of DataBlockScanner?
That is, how often does it wake up and scan a block file?
Looking at the code, I found

static final long DEFAULT_SCAN_PERIOD_HOURS = 21*24L; // three weeks


but surely it does not wake up and pick a random block
to verify only once every three weeks, right?

Thanks a lot,
Thanh


  • Brian Bockelman at Oct 14, 2010 at 12:08 am
    Hi Thanh,

    The scan period is the period over which Hadoop *attempts* to complete an entire node scan. That is, if it's set to 3 weeks, HDFS will try to scan each block once every 3 weeks.

    Obviously, depending on the bandwidth you have made available to the scanning thread, you can specify impossibly small periods.

    Brian
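    For reference, that three-week figure is only the compiled-in default; deployments normally override it through the datanode configuration. Below is a minimal sketch of how the period would be resolved, assuming the dfs.datanode.scan.period.hours property that DataBlockScanner reads in this line (non-positive values fall back to the default); it is an illustration, not the exact HDFS source.

    import org.apache.hadoop.conf.Configuration;

    public class ScanPeriodSketch {
      static final long DEFAULT_SCAN_PERIOD_HOURS = 21 * 24L; // three weeks

      // Resolve the scan period in milliseconds from the datanode configuration.
      static long resolveScanPeriodMs(Configuration conf) {
        long hours = conf.getLong("dfs.datanode.scan.period.hours", 0);
        if (hours <= 0) {
          hours = DEFAULT_SCAN_PERIOD_HOURS; // fall back to the three-week default
        }
        return hours * 3600L * 1000L; // hours -> milliseconds
      }
    }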
  • Thanh Do at Oct 14, 2010 at 12:14 am
    Brian,

    When you say it *attempts* to complete an *entire* node scan,
    do you mean, for example, that if a node has 100 block files, it will
    try to verify all 100 blocks every 3 weeks?
    That is, on average, a block is scanned once every (3 weeks / 100) interval?

    Thanks
    Thanh

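    To make the arithmetic concrete: 3 weeks is 21 * 24 = 504 hours, so with 100 blocks the average spacing would be about 504 / 100, roughly 5 hours between verifications of any one block, assuming the scans are spread evenly over the period.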
  • Brian Bockelman at Oct 14, 2010 at 12:19 am
    Hi Thanh,

    That is correct. Last time I read the code, Hadoop scheduled the block verifications randomly throughout the period in order to avoid periodic effects (i.e., high load every N minutes).

    Brian
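    As an illustration of that random spreading (a sketch of the idea only, not the actual DataBlockScanner code): each block's next verification time can be drawn uniformly from the scan period, so the scan load stays roughly even instead of spiking every N minutes.

    import java.util.Random;

    class RandomScanScheduleSketch {
      static final long SCAN_PERIOD_MS = 21 * 24L * 3600L * 1000L; // three weeks

      // Pick a uniformly random verification time within the next scan period.
      static long nextVerificationTime(Random rand, long nowMs) {
        return nowMs + (long) (rand.nextDouble() * SCAN_PERIOD_MS);
      }
    }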
  • Thanh Do at Oct 14, 2010 at 12:30 am
    Hi Brian,

    If this is the case, then is there any chance that
    the DataBlockScanner somehow cannot finish
    verifying all the blocks within three weeks
    (e.g., if a node has a very large number of blocks)?

    Thanh
  • Brian Bockelman at Oct 14, 2010 at 12:37 am

    On Oct 13, 2010, at 7:29 PM, Thanh Do wrote:

    Hi Brian,

    If this is the case, then is there any chance that
    the DataBlockScanner somehow cannot finish
    verifying all the blocks within three weeks
    (e.g., if a node has a very large number of blocks)?
    Yes. At some point I'd really like to figure out what percentage of our blocks actually get scanned at our site; I suspect some go very long without a scan.

    Brian
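    (If memory serves, the datanode in this line also exposes the scanner's bookkeeping through its HTTP interface, which would be one way to check what actually gets scanned at a site: something along the lines of http://<datanode-host>:50075/blockScannerReport?listblocks, though the port and path may differ by version.)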
  • Thanh Do at Oct 14, 2010 at 12:45 am
    Oh, now I see the problem.
    The implication here is that some blocks might not be
    scanned for a very long time, because the scanner
    may not finish scanning all the blocks within 3 weeks,
    and then it starts over again...

    Interesting. Thanks for the prompt reply, Brian.

    Thanh


  • Thanh Do at Nov 24, 2010 at 1:41 am
    Sorry for digging up this old thread.

    Brian, is this the reason you want to add a "data-level" scan
    to HDFS, as in HDFS-221?

    It seems to me that a very rarely read block could
    be silently corrupted, because the DataBlockScanner
    may never finish its scanning job within 3 weeks...

  • Brian Bockelman at Nov 24, 2010 at 1:53 am

    On Nov 23, 2010, at 7:41 PM, Thanh Do wrote:

    Sorry for digging up this old thread.

    Brian, is this the reason you want to add a "data-level" scan
    to HDFS, as in HDFS-221?

    It seems to me that a very rarely read block could
    be silently corrupted, because the DataBlockScanner
    may never finish its scanning job within 3 weeks...
    Why? What if you restarted your datanode once every 2 weeks? Last I checked, HDFS randomly assigns blocks to be verified throughout the time interval. If you have too many blocks and an insufficient time interval, then, because HDFS also rate-limits the scanner, you can easily come up with a case where blocks won't get verified.

    The reason one wants a data-level scan is that an admin may want to manually verify that all copies of a file are good (well, "good" compared to the checksum... maybe the user corrupted it before uploading it :). It'd be a great debugging tool to put site admins' minds at ease.

    Brian
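    As a rough back-of-envelope sketch of that rate-limit argument (all numbers below are illustrative assumptions; if memory serves, the scanner in this line throttles itself to roughly 8 MB/s):

    public class ScanCapacitySketch {
      public static void main(String[] args) {
        long scanRateBytesPerSec = 8L * 1024 * 1024; // assumed ~8 MB/s scanner cap
        long periodSec = 21L * 24 * 3600;            // three-week scan period

        // Maximum amount of block data one datanode can verify in a period.
        long maxScannableBytes = scanRateBytesPerSec * periodSec;
        System.out.printf("max data verifiable per period: ~%d TB%n",
            maxScannableBytes / 1_000_000_000_000L);
        // Roughly 15 TB at these numbers; a node holding more block data than
        // that cannot have every block verified within the three-week period.
      }
    }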

Discussion Overview
Group: hdfs-dev @ hadoop
Posted: Oct 14, '10 at 12:01a
Active: Nov 24, '10 at 1:53a
Posts: 9
Users: 2 (Thanh Do: 5 posts, Brian Bockelman: 4 posts)
Website: hadoop.apache.org...
IRC: #hadoop
