Hello All,

I am confused about how MapReduce tasks select data blocks when
processing user requests.

Since block replication places copies of a single data block on
multiple datanodes, how are data blocks selected uniquely during job
processing? How does the framework guarantee that the same block is
never chosen twice (or three times) for different mapper tasks?


Thank you

-Mehal


  • Rishi Yadav at Feb 9, 2013 at 5:00 am
    Hi Mehal,

    When a client makes a read request for a certain file, say foo.txt,
    the namenode sends back information about the first block (its block
    ID) and the datanodes it resides on.

    It's the client that decides which datanode to pull the data from.
    If the first request fails, the client can retry and fetch another
    replica of the block from another datanode. This process repeats
    until all the data is read.
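
    For example, here is a minimal sketch (with a placeholder path) of
    how a client can ask the namenode for the replica locations of each
    block through the standard FileSystem API:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ShowBlockLocations {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                // Placeholder path - substitute your own file.
                Path file = new Path("/user/mehal/foo.txt");
                FileStatus status = fs.getFileStatus(file);
                // Ask the namenode which datanodes hold each block.
                BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation b : blocks) {
                    System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(),
                        String.join(",", b.getHosts()));
                }
            }
        }

    Whichever host the client actually reads a given block from, the
    block itself is read exactly once.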

    Thanks and Regards,

    Rishi Yadav

    (o) 408.988.2000x113 || (f) 408.716.2726

    InfoObjects Inc || http://www.infoobjects.com (Big Data Solutions)

    INC 500 Fastest growing company in 2012 || 2011

    Best Place to work in Bay Area 2012 - SF Business Times and the
    Silicon Valley / San Jose Business Journal

    2041 Mission College Boulevard, #280 || Santa Clara, CA 95054



    On Fri, Feb 8, 2013 at 4:40 PM, Mehal Patel wrote:

  • Harsh J at Feb 9, 2013 at 5:13 am
    Hi Mehal,

    > I am confused about how MapReduce tasks select data blocks when
    > processing user requests.

    I suggest reading chapter 6 of Tom White's Hadoop: The Definitive
    Guide, titled "How MapReduce Works". It explains almost everything
    you need to know in very clear language, and this or other such good
    books should help you in general.

    > Since block replication places copies of a single data block on
    > multiple datanodes, how are data blocks selected uniquely during
    > job processing?

    The first point to clear up is that MapReduce is not hard-tied to
    HDFS. It generates splits on any filesystem, and the splits it
    generates for a given input path are unique. Each split maps to
    exactly one task, so a task's input is fixed at job-submit time.
    Each split is defined by its path, its start offset into the file,
    and the length to be processed after that offset - which "uniquely"
    defines it.

    > How does it guarantee that the same block is never chosen twice
    > (or three times) for different mapper tasks?

    See above - each "block" (or "split", in MR terms) is defined by its
    start offset and length. No two splits generated for a single file
    are ever the same, because that is how we generate them - so the
    case you're worried about cannot arise.
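
    As a minimal sketch of that bookkeeping (the helper class and the
    fixed split size below are hypothetical; the real logic lives in
    FileInputFormat.getSplits), splits over one file are generated in
    non-overlapping strides:

        import java.util.ArrayList;
        import java.util.List;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.lib.input.FileSplit;

        public class SplitSketch {
            // Walk the file in fixed-size strides, so each byte of the
            // file lands in exactly one split - regardless of how many
            // datanodes hold replicas of the underlying blocks.
            static List<FileSplit> splitsFor(Path file, long fileLen,
                                             long splitSize) {
                List<FileSplit> splits = new ArrayList<>();
                for (long start = 0; start < fileLen; start += splitSize) {
                    long length = Math.min(splitSize, fileLen - start);
                    // Replica hosts only affect scheduling locality,
                    // not split identity, so they are left empty here.
                    splits.add(new FileSplit(file, start, length,
                                             new String[0]));
                }
                return splits;
            }
        }

    Replication only gives the scheduler a choice of hosts to run each
    task on; it never changes which bytes the task reads.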
    On Sat, Feb 9, 2013 at 6:10 AM, Mehal Patel wrote:


    --
    Harsh J

Discussion Overview
group: hdfs-user
categories: hadoop
posted: Feb 9, '13 at 12:41a
active: Feb 9, '13 at 5:13a
posts: 3
users: 3
website: hadoop.apache.org...
irc: #hadoop
