When an input split spans cross block boundary, the split location should be the host having most of bytes on it.
------------------------------------------------------------------------------------------------------------------

Key: HADOOP-3293
URL: https://issues.apache.org/jira/browse/HADOOP-3293
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Reporter: Runping Qi




--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Jothi Padmanabhan (JIRA) at Oct 29, 2008 at 5:33 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan reassigned HADOOP-3293:
    -----------------------------------------

    Assignee: Jothi Padmanabhan
  • Jothi Padmanabhan (JIRA) at Oct 30, 2008 at 10:02 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643792#action_12643792 ]

    Jothi Padmanabhan commented on HADOOP-3293:
    -------------------------------------------

    The reason for this is that FileInputFormat.getBlockIndex() returns the index of the block that contains the split's starting offset. Instead, it should identify all the blocks that the split spans and then choose the block that contributes the most data to the split.

    We could use the following approach:

    {code}
    // First calculate numBlocks, the number of blocks the split spans
    if (numBlocks == 1) {
      return startIndex;
    } else if (numBlocks == 2) {
      return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
    } else {
      return startIndex + 1;
    }
    {code}

    The rationale here is that if there are more than two blocks, we are guaranteed that block 2 is contributing its entire block length for this split.
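
A minimal Java sketch of the same idea, generalized to measure each block's overlap with the split directly. The class and method names are hypothetical and this is not the actual FileInputFormat code; it assumes the caller already has each block's offset and length. Unlike the three-way heuristic above, ties go to the earlier block.

```java
// Hypothetical helper for illustration; not the actual FileInputFormat code.
final class SplitBlockChooser {
    /**
     * Returns the index of the block that contributes the most bytes to the
     * split [splitStart, splitStart + splitLength).
     * blockOffsets[i] and blockLengths[i] describe block i of the file.
     * Ties go to the earlier block.
     */
    static int blockWithMostBytes(long[] blockOffsets, long[] blockLengths,
                                  long splitStart, long splitLength) {
        long splitEnd = splitStart + splitLength;
        int best = -1;
        long bestBytes = 0;
        for (int i = 0; i < blockOffsets.length; i++) {
            long blockEnd = blockOffsets[i] + blockLengths[i];
            // Overlap between the split and block i; non-positive means none.
            long overlap = Math.min(splitEnd, blockEnd)
                    - Math.max(splitStart, blockOffsets[i]);
            if (overlap > bestBytes) {
                bestBytes = overlap;
                best = i;
            }
        }
        if (best < 0) {
            throw new IllegalArgumentException("Split does not overlap any block");
        }
        return best;
    }

    public static void main(String[] args) {
        // Block size 100; the split covers 20 bytes of block 0, all of
        // block 1, and 10 bytes of block 2.
        long[] offsets = {0, 100, 200};
        long[] lengths = {100, 100, 100};
        System.out.println(blockWithMostBytes(offsets, lengths, 80, 130)); // prints 1
    }
}
```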

    Note that we cannot identify the block index based on the amount of data contributed by each individual host, because of the replication factor.
    For example, consider the following case (assume DFS block size = 100):
    Block 1 contributes 20 bytes and its hosts are A, B, C
    Block 2 contributes 100 bytes and its hosts are A, D, E
    Block 3 contributes 10 bytes and its hosts are D, E, F

    If we aggregate on a per-host basis, host A, having contributed 120 bytes, would be the ideal choice. However, if we choose Block 1 as the index to be returned, hosts B and C would also be treated as data-local, which is suboptimal.
    Thoughts?
  • Jothi Padmanabhan (JIRA) at Oct 30, 2008 at 11:56 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643943#action_12643943 ]

    Jothi Padmanabhan commented on HADOOP-3293:
    -------------------------------------------

    bq. If we aggregate on a per host basis, host A having contributed 120 bytes would be the ideal choice. However, if we choose Block 1 as the index to be returned, even hosts B &C would be treated as data local, which is sub optimal.

    To make this clear: having decided that A is a good host, we should also have a good way to pick the correct block from the list of blocks that reside on A. In this case, we should choose between Block 1 and Block 2. If Block 1 is chosen, the result is suboptimal, as hosts B and C hold only 20 bytes of the split.

  • Jothi Padmanabhan (JIRA) at Oct 30, 2008 at 3:17 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643989#action_12643989 ]

    Jothi Padmanabhan commented on HADOOP-3293:
    -------------------------------------------

    Since blkIndex is used only to identify the hosts,
    {code}
    int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining,
                                 splitSize);
    splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                             blkLocations[blkIndex].getHosts()));
    {code}

    we could also modify getBlockIndex() to return a list of the hosts that contain the most data for that split. For example, if the split was
    Block 1, 80 bytes, hosts A, B, C
    Block 2, 100 bytes, hosts A, D, E
    Block 3, 70 bytes, hosts D, F, B

    we would identify the hosts and their contributions (in bytes) as
    A 180
    B 150
    C 80
    D 170
    E 100
    F 70

    and could return A, D, B.
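
The per-host aggregation in this example can be sketched in Java as follows. The class name, method name, and parameters are hypothetical, chosen for illustration; this is not the code from the attached patch. It assumes the caller already knows how many bytes each block contributes to the split and which hosts hold a replica of it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the code from the patch.
final class SplitHostRanker {
    /**
     * Sums, per host, the bytes of every block replica that host holds for
     * the split, then returns the topN hosts with the largest totals.
     * bytesInSplit[i] is what block i contributes; hosts[i] lists its replicas.
     */
    static List<String> topHosts(long[] bytesInSplit, String[][] hosts, int topN) {
        Map<String, Long> bytesPerHost = new LinkedHashMap<>();
        for (int i = 0; i < bytesInSplit.length; i++) {
            for (String host : hosts[i]) {
                bytesPerHost.merge(host, bytesInSplit[i], Long::sum);
            }
        }
        List<String> ranked = new ArrayList<>(bytesPerHost.keySet());
        // Stable sort, largest contribution first; ties keep first-seen order.
        ranked.sort((a, b) -> Long.compare(bytesPerHost.get(b), bytesPerHost.get(a)));
        return ranked.subList(0, Math.min(topN, ranked.size()));
    }

    public static void main(String[] args) {
        // The three-block split from the example above.
        long[] bytes = {80, 100, 70};
        String[][] hosts = {{"A", "B", "C"}, {"A", "D", "E"}, {"D", "F", "B"}};
        System.out.println(topHosts(bytes, hosts, 3)); // prints [A, D, B]
    }
}
```

Returning three hosts here mirrors the replication factor in the example; the cutoff is a parameter in the sketch because the thread does not fix it.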

  • Runping Qi (JIRA) at Oct 30, 2008 at 4:41 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644025#action_12644025 ]

    Runping Qi commented on HADOOP-3293:
    ------------------------------------

    In the above case, I'd say that the preferred hosts for the split should be in the order A, D, B, E, C, F.
    We should also aggregate the bytes over the racks of those hosts.
    For example, suppose C, E, and F share the same rack while the other nodes are on different racks.
    Then host E (or F, or even C) will offer better rack locality than the other hosts.
    In practice, rack locality is almost as good as node locality.

  • Jothi Padmanabhan (JIRA) at Oct 31, 2008 at 6:29 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644224#action_12644224 ]

    Jothi Padmanabhan commented on HADOOP-3293:
    -------------------------------------------

    Runping's point on aggregating bytes over racks to determine rack locality makes sense.

    The problem is that the JobClient is unaware of the topology. Some ways to build topology awareness are:
    # Make the JobClient query the topology service and build its own topology awareness. The problem with this approach is that we need to ensure that the topology scripts used by the JobClient and the JobTracker are always in sync.
    # Let the client get back rack information along with the hosts when it queries the FS for block locations (fs.getFileBlockLocations). We could add a new method, fs.getResolvedFileBlockLocations, that returns the hosts with the rack information. The default implementation would just return the hosts; DFS would override this method and return the rack information along with the hosts. We are guaranteed correct topology information, as the Namenode and JobTracker would be using the same topology information.

    The second approach looks better. Thoughts?
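
A rough Java sketch of the shape the second approach could take. Every class name, host name, and rack name below is hypothetical, and the block locations are hard-coded; this is not the Hadoop FileSystem API, only an illustration of a default implementation that returns bare hosts and a topology-aware override that adds the rack.

```java
import java.util.Arrays;

// All names are hypothetical; this is not the Hadoop FileSystem API.
class BlockLocationSource {
    /** Hosts holding a replica of the block (hard-coded for the sketch). */
    public String[] getFileBlockLocations() {
        return new String[] {"hostA", "hostB"};
    }

    /** Default: no topology knowledge, so return the bare host names. */
    public String[] getResolvedFileBlockLocations() {
        return getFileBlockLocations();
    }
}

class RackAwareBlockLocationSource extends BlockLocationSource {
    /** A topology-aware file system prefixes each host with its rack. */
    @Override
    public String[] getResolvedFileBlockLocations() {
        String[] hosts = getFileBlockLocations();
        String[] resolved = new String[hosts.length];
        for (int i = 0; i < hosts.length; i++) {
            resolved[i] = rackOf(hosts[i]) + "/" + hosts[i];
        }
        return resolved;
    }

    // Stand-in for the rack mapping the Namenode would already know.
    private String rackOf(String host) {
        return host.equals("hostA") ? "/rack1" : "/rack2";
    }
}

final class ResolvedLocationsDemo {
    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                new BlockLocationSource().getResolvedFileBlockLocations()));
        System.out.println(Arrays.toString(
                new RackAwareBlockLocationSource().getResolvedFileBlockLocations()));
    }
}
```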


  • Jothi Padmanabhan (JIRA) at Oct 31, 2008 at 9:19 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644244#action_12644244 ]

    Jothi Padmanabhan commented on HADOOP-3293:
    -------------------------------------------

    Runping, just to make sure I understand this correctly, consider the following hypothetical scenario

    Rack 1 has hosts A,B,C
    Rack 2 has host D

    A contributes 25, B contributes 20, C contributes 15, D contributes 40

    Then the preference of hosts would be A,B,C,D as A, B and C have an intra-rack contribution of 60 and external contribution of 40, whereas D has intra rack contribution of 40 and external contribution of 60.
    Is this correct?
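
The ordering described in this scenario can be sketched in Java: rank racks by their total contribution, then rank hosts within a rack by their own contribution. The class and method names are hypothetical. The sketch assumes, as the scenario does, that a rack's total is the plain sum of its hosts' bytes; with replicated blocks, a rack total should count each block only once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
final class RackAwareHostOrder {
    /**
     * Orders hosts by the total bytes on their rack (largest first), then by
     * the host's own bytes (largest first). Racks with equal totals are kept
     * apart by comparing rack names.
     */
    static List<String> order(Map<String, Long> bytesPerHost,
                              Map<String, String> rackOfHost) {
        Map<String, Long> bytesPerRack = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : bytesPerHost.entrySet()) {
            bytesPerRack.merge(rackOfHost.get(e.getKey()), e.getValue(), Long::sum);
        }
        List<String> hosts = new ArrayList<>(bytesPerHost.keySet());
        hosts.sort((a, b) -> {
            String rackA = rackOfHost.get(a);
            String rackB = rackOfHost.get(b);
            if (!rackA.equals(rackB)) {
                int byRack = Long.compare(bytesPerRack.get(rackB), bytesPerRack.get(rackA));
                return byRack != 0 ? byRack : rackA.compareTo(rackB);
            }
            return Long.compare(bytesPerHost.get(b), bytesPerHost.get(a));
        });
        return hosts;
    }

    public static void main(String[] args) {
        Map<String, Long> bytes = new LinkedHashMap<>();
        bytes.put("D", 40L);
        bytes.put("C", 15L);
        bytes.put("B", 20L);
        bytes.put("A", 25L);
        Map<String, String> racks = new LinkedHashMap<>();
        racks.put("A", "rack1");
        racks.put("B", "rack1");
        racks.put("C", "rack1");
        racks.put("D", "rack2");
        // rack1 totals 60 bytes, rack2 totals 40.
        System.out.println(order(bytes, racks)); // prints [A, B, C, D]
    }
}
```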

  • Runping Qi (JIRA) at Oct 31, 2008 at 12:38 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644273#action_12644273 ]

    Runping Qi commented on HADOOP-3293:
    ------------------------------------

    yeh.

  • dhruba borthakur (JIRA) at Oct 31, 2008 at 11:08 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644460#action_12644460 ]

    dhruba borthakur commented on HADOOP-3293:
    ------------------------------------------
    bq. DFS would override this method and will return the rack information along with the hosts.

    This is a good idea, but returning only the rack location might not work in the general case, when there are more than two levels in the network topology. Knowing the name of a rack might not tell you how close it is to another rack. But getFileBlockLocations could return the complete path of the host in the network topology. I will provide a patch for this one. See HADOOP-4567.
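
To illustrate why a full topology path carries more information than a rack name, here is a small Java sketch that computes the hop count between two nodes from their paths. The class name, method name, and path layout ("/datacenter/rack/host") are assumptions made for the example, not something the thread specifies.

```java
// Hypothetical helper; the "/datacenter/rack/host" layout is an assumption.
final class TopologyDistance {
    /**
     * Number of hops between two nodes given their full topology paths:
     * hops up from A to the closest common ancestor plus hops down to B.
     * Both paths must start with "/".
     */
    static int distance(String pathA, String pathB) {
        String[] a = pathA.substring(1).split("/");
        String[] b = pathB.substring(1).split("/");
        int common = 0;
        while (common < a.length && common < b.length
                && a[common].equals(b[common])) {
            common++;
        }
        return (a.length - common) + (b.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/dc1/rack1/hostA", "/dc1/rack1/hostB")); // prints 2
        System.out.println(distance("/dc1/rack1/hostA", "/dc1/rack2/hostC")); // prints 4
        System.out.println(distance("/dc1/rack1/hostA", "/dc2/rack9/hostD")); // prints 6
    }
}
```

With rack names alone, the second and third cases would look identical (a different rack); the full path lets a client tell a nearby rack from a distant one.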
  • Jothi Padmanabhan (JIRA) at Nov 13, 2008 at 6:05 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-3293:
    --------------------------------------

    Attachment: hadoop-3293.patch

    Initial Patch for review
  • Jothi Padmanabhan (JIRA) at Nov 13, 2008 at 6:07 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-3293:
    --------------------------------------

    Status: Patch Available (was: Open)
  • Tsz Wo (Nicholas), SZE (JIRA) at Nov 15, 2008 at 12:41 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647771#action_12647771 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-3293:
    ------------------------------------------------
    bq. Initial Patch for review

    For an initial patch, you might not want to submit it, since the patch will likely need to be updated. Also, you might want to test it locally before submitting it.
  • Jothi Padmanabhan (JIRA) at Nov 15, 2008 at 1:07 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647781#action_12647781 ]

    Jothi Padmanabhan commented on HADOOP-3293:
    -------------------------------------------

    Sorry, I should have said "Patch for review"; the patch was locally tested.
    I also ran a test to demonstrate the performance improvement from the patch. I allocated a 440-node cluster and ran randomwriter with 40 maps, each map writing 25G of output. I then killed the task trackers on the nodes that ran the maps. I then ran a modified sort (no map output, no reduces) with a minimum input split of 10G. I found that, averaged over three runs, the patch was about 17 seconds faster than trunk (175 secs as opposed to 192 secs).
  • Tsz Wo (Nicholas), SZE (JIRA) at Nov 15, 2008 at 1:15 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647783#action_12647783 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-3293:
    ------------------------------------------------
    bq. the Patch was locally tested.

    Have you run all unit tests with "ant test"? It seems that the patch has some problems and is currently stuck in Hudson. See http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3593/
  • Jothi Padmanabhan (JIRA) at Nov 15, 2008 at 1:19 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647785#action_12647785 ]

    Jothi Padmanabhan commented on HADOOP-3293:
    -------------------------------------------

    Yes, I did
  • Tsz Wo (Nicholas), SZE (JIRA) at Nov 15, 2008 at 11:51 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647913#action_12647913 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-3293:
    ------------------------------------------------

    I see. There must be something wrong with the Hudson machine. It seems to have had a lot of problems recently.
  • Hadoop QA (JIRA) at Nov 17, 2008 at 7:01 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648097#action_12648097 ]

    Hadoop QA commented on HADOOP-3293:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12393879/hadoop-3293.patch
    against trunk revision 714107.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3594/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3594/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3594/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3594/console

    This message is automatically generated.
  • Devaraj Das (JIRA) at Nov 18, 2008 at 10:06 am
    [ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das updated HADOOP-3293:
    --------------------------------

    Resolution: Fixed
    Fix Version/s: 0.20.0
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Jothi!

Posted to common-dev (hadoop) on Apr 21, '08 at 11:18 pm; last active Nov 18, '08 at 10:06 am; 19 posts.