DFS data node should not use hard coded 10 minutes as write timeout.
--------------------------------------------------------------------

Key: HADOOP-3124
URL: https://issues.apache.org/jira/browse/HADOOP-3124
Project: Hadoop Core
Issue Type: Bug
Reporter: Runping Qi



This problem happens in 0.17 trunk.

I saw reducers wait 10 minutes while writing data to dfs and then time out.
The client retried and timed out again after another 19 minutes.

After looking into the code, it seems that the dfs data node uses 10 minutes as the timeout for writing data into the data node pipeline.
I think we have three issues:

1. The 10 minutes timeout value is too big for writing a chunk of data (64K) through the data node pipeline.
2. The timeout value should not be hard coded.
3. Different datanodes in a pipeline should use different timeout values for writing to the downstream.
A reasonable one may be (20 secs * numOfDataNodesInTheDownStreamPipe).
For example, if the replication factor is 3, the client uses 60 secs, the first data node uses 40 secs, and the second datanode uses 20 secs.
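The proposed staircase of timeouts can be sketched as a small calculation (a minimal illustration only; the class and constant names are hypothetical, not from any Hadoop patch):

```java
// Sketch of the proposed per-position write timeout.
// Names and structure are illustrative, not from an actual patch.
public class ProposedWriteTimeout {
    static final long PER_NODE_TIMEOUT_MS = 20_000L; // proposed 20 secs per downstream node

    /**
     * replication: total pipeline length (e.g. 3);
     * position: 0 for the client, 1 for the first datanode, and so on.
     */
    public static long writeTimeoutMs(int replication, int position) {
        int downstreamNodes = replication - position; // nodes the data still flows through
        return PER_NODE_TIMEOUT_MS * downstreamNodes;
    }
}
```

With replication 3 this yields 60 secs for the client, 40 secs for the first datanode, and 20 secs for the second, matching the example above.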


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Raghu Angadi (JIRA) at Mar 28, 2008 at 9:52 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583206#action_12583206 ]

    Raghu Angadi commented on HADOOP-3124:
    --------------------------------------
    bq. 3. Different datanodes in a pipeline should use different timeout values for writing to the downstream.

    This is already the case: the timeout is 10 min + 5 sec * (number of datanodes - position in the pipeline). The client's position is 0, the first datanode's position is 1, etc.
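    The scaling described above can be written out as a sketch (the class and method names are hypothetical; the 10-minute base and 5-second step are the values quoted in the comment):

```java
// Sketch of the existing timeout scaling described in the comment.
// Constants mirror the quoted values; names are illustrative only.
public class ExistingWriteTimeout {
    static final long BASE_TIMEOUT_MS = 10L * 60 * 1000; // the hard-coded 10 min
    static final long PER_NODE_STEP_MS = 5_000L;         // 5 sec per remaining node

    /** position 0 is the client, 1 the first datanode, and so on. */
    public static long writeTimeoutMs(int numDatanodes, int position) {
        return BASE_TIMEOUT_MS + PER_NODE_STEP_MS * (numDatanodes - position);
    }
}
```

    So with three datanodes the client waits 10 min 15 sec and each downstream node 5 sec less; the hard-coded base dominates, which is why points 1 and 2 of the report still apply.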

    +1 for making this a config.

    Note that there was no timeout for this before 0.17; the client would get stuck forever. 10 min was added as a very conservative value. What should the default be?

    Though not relevant here, probably we need different write timeouts while receiving a block and while sending a block.

    I am curious to know if you have any info on why one of the datanodes was not able to read for 10 minutes.
  • Runping Qi (JIRA) at Mar 29, 2008 at 2:18 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583322#action_12583322 ]

    Runping Qi commented on HADOOP-3124:
    ------------------------------------


    bq. I am curious to know if you have any info on why one of the datanode's was not able to read for 10minutes.

    Me too. It may deserve a separate JIRA.
    In my latest case, I believe it was not due to a slow disk/network card or overloaded machines.
    I believe it is similar to HADOOP-3033.


  • Runping Qi (JIRA) at Mar 29, 2008 at 2:26 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Runping Qi updated HADOOP-3124:
    -------------------------------

    Component/s: dfs
  • Raghu Angadi (JIRA) at Apr 8, 2008 at 2:14 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi reassigned HADOOP-3124:
    ------------------------------------

    Assignee: Raghu Angadi
  • Raghu Angadi (JIRA) at Apr 8, 2008 at 3:26 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Affects Version/s: 0.17.0
  • Raghu Angadi (JIRA) at Apr 8, 2008 at 3:26 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Attachment: HADOOP-3124.patch

    The attached patch adds an internal config variable "dfs.datanode.socket.write.timeout".

    Also, when this is set to 0, the DataNode uses standard sockets instead of NIO sockets. Runping, could you try this patch with the value set to 0 while looking at HADOOP-3132?
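    The behavior described (a configurable timeout, with 0 falling back to plain blocking sockets) might look roughly like the following; the Properties-based lookup is a simplified stand-in for Hadoop's Configuration API, and the 10-minute default is assumed from the previous hard-coded value:

```java
import java.util.Properties;

// Simplified sketch of the configurable write timeout described above.
// Properties stands in for Hadoop's Configuration; only the property
// name is taken from the patch, everything else is illustrative.
public class WriteTimeoutConfig {
    static final String KEY = "dfs.datanode.socket.write.timeout";
    static final long DEFAULT_MS = 10L * 60 * 1000; // previous hard-coded 10 min

    public static long socketWriteTimeoutMs(Properties conf) {
        String v = conf.getProperty(KEY);
        return (v == null) ? DEFAULT_MS : Long.parseLong(v);
    }

    /** Per the comment, a value of 0 means plain blocking sockets, no NIO timeout. */
    public static boolean useNioSocketsWithTimeout(long timeoutMs) {
        return timeoutMs > 0;
    }
}
```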
  • Runping Qi (JIRA) at Apr 8, 2008 at 6:14 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586668#action_12586668 ]

    Runping Qi commented on HADOOP-3124:
    ------------------------------------


    I ran the gridmix against a build where the write timeout was set to 2 minutes.
    It completed smoothly.


  • Runping Qi (JIRA) at Apr 8, 2008 at 6:18 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586669#action_12586669 ]

    Runping Qi commented on HADOOP-3124:
    ------------------------------------


    There seems to be at least one place where the constant is still used as the timeout value:
    {code}
    @@ -848,7 +859,7 @@
         /* utility function for sending a respose */
         private static void sendResponse(Socket s, short opStatus) throws IOException {
           DataOutputStream reply =
    -        new DataOutputStream(new SocketOutputStream(s, WRITE_TIMEOUT));
    +        new DataOutputStream(NetUtils.getOutputStream(s, WRITE_TIMEOUT));
    {code}

    Is this intended?

  • Raghu Angadi (JIRA) at Apr 8, 2008 at 6:28 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586670#action_12586670 ]

    Raghu Angadi commented on HADOOP-3124:
    --------------------------------------

    Yes, otherwise I need to make socketWriteTimeout static. Actually, I will just pass it to sendResponse() in the next patch. For now it won't matter, since sendResponse() is not used after large writes.
  • Runping Qi (JIRA) at Apr 9, 2008 at 1:44 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587028#action_12587028 ]

    Runping Qi commented on HADOOP-3124:
    ------------------------------------



    What is the default timeout?

    I suggest 2 minutes at max.

  • Raghu Angadi (JIRA) at Apr 9, 2008 at 10:10 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587399#action_12587399 ]

    Raghu Angadi commented on HADOOP-3124:
    --------------------------------------


    2 minutes is fine for writes... though it does not really improve much. Would it matter in the absence of HADOOP-3132?

    I am more concerned about clients reading from DFS, since this timeout currently applies to those connections as well. Currently DFSClient treats these connection failures as real errors and will try a different datanode. I think we need to fix DFSClient before being more aggressive about this timeout.

    0.17 would be the first release that has such a timeout. I am not sure if we should have an aggressive value.

    That said, I am not strongly opposed to reducing it.
  • Raghu Angadi (JIRA) at Apr 9, 2008 at 10:12 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587399#action_12587399 ]

    rangadi edited comment on HADOOP-3124 at 4/9/08 3:08 PM:
    --------------------------------------------------------------

    2 minutes is fine for writes... though it does not really improve much. Would it matter in the absence of HADOOP-3132?

    I am more concerned about clients reading from DFS, since this timeout currently applies to those connections as well. Currently DFSClient treats these connection failures as real errors and will try a different datanode. I think we need to fix DFSClient before being more aggressive about this timeout.

    0.17 would be the first release that has such a timeout. I am not sure if we should have an aggressive value in the first release.

    That said, I am not strongly opposed to reducing it.

  • Raghu Angadi (JIRA) at Apr 10, 2008 at 11:18 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Fix Version/s: 0.18.0
  • Raghu Angadi (JIRA) at Apr 14, 2008 at 10:30 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Attachment: HADOOP-3124.patch

    Updated patch attached. Changes are:
    # WRITE_TIMEOUT is no longer used except as the default for the config variable (Runping's comment)
    # The default is changed to 8 min instead of 10; 10 min happened to match the default task timeout.
  • dhruba borthakur (JIRA) at Apr 15, 2008 at 10:37 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589264#action_12589264 ]

    dhruba borthakur commented on HADOOP-3124:
    ------------------------------------------

    I think we should put this patch in 0.17 and set the write timeout to a value less than 10 minutes. This means that the write will time out before the entire task fails. The failed write will be retried and the task will probably succeed.
  • dhruba borthakur (JIRA) at Apr 15, 2008 at 10:37 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589265#action_12589265 ]

    dhruba borthakur commented on HADOOP-3124:
    ------------------------------------------

    +1 Code looks good.
  • Raghu Angadi (JIRA) at Apr 15, 2008 at 10:47 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589267#action_12589267 ]

    Raghu Angadi commented on HADOOP-3124:
    --------------------------------------

    Yes. This patch lowers the value to 8 min. I think 2 min is too short, because 1 min leads to multiple false errors on the cluster I am using for HADOOP-3132. Currently we have this timeout only to catch rare exceptions. I made sure that there are no changes to any logic in the patch other than using regular sockets when the timeout is 0. This is good for 0.17.
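The two conventions in this comment (an 8-minute default, and a configured value of 0 meaning "no timeout, use regular blocking sockets") can be sketched as follows. The class and method names are illustrative only, not the actual Hadoop source:

```java
// Hypothetical sketch of the convention described above: a configured write
// timeout of 0 falls back to a plain blocking socket (no timeout at all),
// while an unset (negative) value uses the 8-minute default from the patch.
public class WriteTimeoutSketch {

    // 8 min default, per the comment above.
    static final long DEFAULT_WRITE_TIMEOUT_MS = 8 * 60 * 1000;

    // Negative means "not configured" -> default; 0 means "disable timeout".
    static long effectiveTimeout(long configuredMs) {
        return configuredMs < 0 ? DEFAULT_WRITE_TIMEOUT_MS : configuredMs;
    }

    // A timed socket stream is only needed when a positive timeout is in
    // effect; otherwise a regular blocking socket stream is used.
    static boolean needsTimedStream(long configuredMs) {
        return effectiveTimeout(configuredMs) > 0;
    }

    public static void main(String[] args) {
        System.out.println(effectiveTimeout(-1)); // default: 480000 ms
        System.out.println(needsTimedStream(0));  // false: plain socket
    }
}
```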
  • Raghu Angadi (JIRA) at Apr 15, 2008 at 10:47 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Apr 16, 2008 at 8:05 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589459#action_12589459 ]

    Hadoop QA commented on HADOOP-3124:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12380122/HADOOP-3124.patch
    against trunk revision 645773.

    @author +1. The patch does not contain any @author tags.

    tests included -1. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2249/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2249/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2249/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2249/console

    This message is automatically generated.
  • Raghu Angadi (JIRA) at Apr 16, 2008 at 4:30 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589629#action_12589629 ]

    Raghu Angadi commented on HADOOP-3124:
    --------------------------------------

    Regarding unit tests: this patch essentially just makes a constant configurable.

    Regarding what the default should be (and why I think 2 min is certainly too low):

    My understanding of what this timeout is for:
    - Only to catch rare exceptions (like some bugs, hardware failures, kernel hangs etc).
    - Should be long enough that writes don't fail just because a node is currently loaded.

    What this is *not* for :
    - To improve performance.
    - To reduce long tail because of slow nodes.
    -- This needs to be handled at a different level (e.g. NameNode not scheduling so many blocks on such nodes, speculative execution in M/R)
    - Unlike at the M/R level or at an application level, DFS does not know whether some data it is being asked to write can easily be regenerated by another task or can be discarded. So it should try its best to write to the requested number of replicas.

    If you define this timeout to mean something else, then it is quite possible that a much smaller timeout is ok. Please suggest a different value (preferably redefining it).

    8 min may not be the right value either. Even on 4-disk nodes, just running the 'generateData' stage of gridmix, we have seen that 2 min is not enough. On a heavily loaded cluster running multiple jobs on 2-disk machines, it might need to be much larger. That's why making this configurable helps.

    One change we could make is to use different write-timeout values for data written to DFS and for data read from DFS (DataNode-to-DFSClient writes).
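The staggered pipeline scheme mentioned earlier in the thread (a base timeout plus a per-node increment, so each upstream writer waits slightly longer than the node below it) can be sketched as a small helper. The constants are the ones quoted in the discussion; the class and method names are ours:

```java
// Sketch of the staggered pipeline write timeout: base value plus a
// per-node increment times the number of remaining downstream nodes.
// An upstream writer therefore always times out after its downstream does.
public class PipelineTimeoutSketch {

    static final long BASE_MS = 10 * 60 * 1000; // 10 min base (pre-patch value)
    static final long PER_NODE_MS = 5 * 1000;   // 5 s per downstream node

    // position 0 is the client, 1 the first datanode, and so on.
    static long writeTimeoutMs(int numDataNodes, int position) {
        return BASE_MS + PER_NODE_MS * (numDataNodes - position);
    }

    public static void main(String[] args) {
        // Replication 3: the client waits longest, the last datanode least.
        System.out.println(writeTimeoutMs(3, 0)); // 615000 ms
        System.out.println(writeTimeoutMs(3, 3)); // 600000 ms
    }
}
```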


  • Raghu Angadi (JIRA) at Apr 16, 2008 at 9:42 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Fix Version/s: (was: 0.18.0)
    0.17.0
  • Runping Qi (JIRA) at Apr 16, 2008 at 9:44 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589738#action_12589738 ]

    Runping Qi commented on HADOOP-3124:
    ------------------------------------


    +1

    The grid operators can configure an appropriate write timeout value based on their own specific situations.
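With the constant made configurable, an operator could set the timeout in hadoop-site.xml along these lines. The property name and value shown here are our reading of the patch, not a verified reference; check the committed patch for the exact key:

```
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- milliseconds; 0 disables the write timeout entirely -->
  <value>480000</value>
</property>
```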


  • Raghu Angadi (JIRA) at Apr 16, 2008 at 10:06 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    I just committed this.
  • Raghu Angadi (JIRA) at Apr 17, 2008 at 12:52 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Environment: Makes DataNode socket write timeout configurable. User impact : none.
  • Raghu Angadi (JIRA) at Apr 17, 2008 at 12:58 am
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Raghu Angadi updated HADOOP-3124:
    ---------------------------------

    Environment: (was: Makes DataNode socket write timeout configurable. User impact : none.)
    Release Note: Makes DataNode socket write timeout configurable. User impact : none.
  • Hudson (JIRA) at Apr 17, 2008 at 12:14 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589990#action_12589990 ]

    Hudson commented on HADOOP-3124:
    --------------------------------

    Integrated in Hadoop-trunk #463 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/463/])

Discussion Overview
group: common-dev @ hadoop.apache.org
posted: Mar 28, 2008
active: Apr 17, 2008
posts: 27