DFSClient should choose a block that is local to the node where the client is running
-------------------------------------------------------------------------------------

Key: HADOOP-2060
URL: https://issues.apache.org/jira/browse/HADOOP-2060
Project: Hadoop
Issue Type: Bug
Components: dfs
Reporter: Runping Qi



While chasing down the DFSClient code to see how data locality affects DFS read throughput,
I realized that DFSClient does not appear to use data locality information (at least not obviously)
when it chooses which of the available replicas to read a block from.
Here is the relevant code:
{code}
/**
 * Pick the best node from which to stream the data.
 * Entries in <i>nodes</i> are already in the priority order
 */
private DatanodeInfo bestNode(DatanodeInfo nodes[],
                              AbstractMap<DatanodeInfo, DatanodeInfo> deadNodes)
    throws IOException {
  if (nodes != null) {
    for (int i = 0; i < nodes.length; i++) {
      if (!deadNodes.containsKey(nodes[i])) {
        return nodes[i];
      }
    }
  }
  throw new IOException("No live nodes contain current block");
}

private DNAddrPair chooseDataNode(LocatedBlock block)
    throws IOException {
  int failures = 0;
  while (true) {
    DatanodeInfo[] nodes = block.getLocations();
    try {
      DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
      InetSocketAddress targetAddr = DataNode.createSocketAddr(chosenNode.getName());
      return new DNAddrPair(chosenNode, targetAddr);
    } catch (IOException ie) {
      String blockInfo = block.getBlock() + " file=" + src;
      if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
        throw new IOException("Could not obtain block: " + blockInfo);
      }

      if (nodes == null || nodes.length == 0) {
        LOG.info("No node available for block: " + blockInfo);
      }
      LOG.info("Could not obtain block " + block.getBlock() + " from any node: " + ie);
      try {
        Thread.sleep(3000);
      } catch (InterruptedException iex) {
      }
      deadNodes.clear(); //2nd option is to remove only nodes[blockId]
      openInfo();
      failures++;
      continue;
    }
  }
}
{code}

It seems to pick the first live replica in the list.
This means that even if a replica is local to the node where the client runs,
it may actually pick a remote replica.

Map/reduce tries to schedule a mapper on a node that holds a copy of its input split.
However, if the DFSClient does not use that information when choosing a replica to read from, the mapper may well have to read the data
over the network even though a local replica is available.
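
For illustration, here is a minimal, self-contained sketch of the behaviour this report asks for: prefer a replica that lives on the local host, and otherwise fall back to the first live entry, just as bestNode() does today. The ReplicaChooser class, the plain host strings and the deadHosts set are invented for this sketch; it is not DFSClient code.

{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: choose among the replica hosts returned for a
// block, preferring one on the local machine.
public class ReplicaChooser {

  /** Pick a replica host, preferring one on the local machine. */
  static String chooseReplica(List<String> replicaHosts, Set<String> deadHosts)
      throws UnknownHostException {
    String localHost = InetAddress.getLocalHost().getHostName();
    // First pass: a live replica on this host, if any.
    for (String host : replicaHosts) {
      if (!deadHosts.contains(host) && host.equals(localHost)) {
        return host;
      }
    }
    // Second pass: otherwise the first live replica, as bestNode() does today.
    for (String host : replicaHosts) {
      if (!deadHosts.contains(host)) {
        return host;
      }
    }
    throw new IllegalStateException("No live nodes contain current block");
  }
}
{code}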



I hope I have missed something and misunderstood the code.
Otherwise, this will be a serious performance problem.




  • Raghu Angadi (JIRA) at Oct 15, 2007 at 8:07 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534955 ]

    Raghu Angadi commented on HADOOP-2060:
    --------------------------------------

    The nodes array in {{chooseDataNode()}} is sorted by the NameNode with the expectation that the client accesses the entries in order. The NameNode's sorting puts the node local to the client first in the array. If you notice otherwise, please report it.
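
    For illustration, here is a minimal sketch of the idea described above: rank each replica by its distance from the reading client (node-local first, then rack-local, then off-rack) and return the list in that order, so the client only has to walk it front to back. The ReplicaSorter class, the rank() heuristic and the hostToRack map are assumptions invented for this sketch, not the actual NameNode sorting code.

    {code}
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only: order replica hosts by proximity to the client.
    public class ReplicaSorter {

      // Lower rank = "closer": 0 = node-local, 1 = rack-local, 2 = off-rack.
      static int rank(String replicaHost, String clientHost,
                      Map<String, String> hostToRack) {
        if (replicaHost.equals(clientHost)) return 0;
        String replicaRack = hostToRack.get(replicaHost);
        String clientRack = hostToRack.get(clientHost);
        if (replicaRack != null && replicaRack.equals(clientRack)) return 1;
        return 2;
      }

      /** Return the replica hosts reordered so the "closest" ones come first. */
      static List<String> sortByDistance(List<String> replicaHosts, String clientHost,
                                         Map<String, String> hostToRack) {
        List<String> sorted = new ArrayList<>(replicaHosts);
        sorted.sort(Comparator.comparingInt(h -> rank(h, clientHost, hostToRack)));
        return sorted;
      }
    }
    {code}

    Because the sort is stable, replicas at the same distance keep their original order.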

  • Runping Qi (JIRA) at Oct 15, 2007 at 8:39 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534962 ]

    Runping Qi commented on HADOOP-2060:
    ------------------------------------

    OK, that is the part I didn't get.

    I'll examine the location list to confirm.

    Thanks,

  • Runping Qi (JIRA) at Nov 2, 2007 at 1:44 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Runping Qi resolved HADOOP-2060.
    --------------------------------

    Resolution: Invalid


    The suggested feature is already implemented.

