FAQ
Add counters to show number of key/values that have been sorted and merged in the maps and reduces
--------------------------------------------------------------------------------------------------

Key: HADOOP-2774
URL: https://issues.apache.org/jira/browse/HADOOP-2774
Project: Hadoop Core
Issue Type: Bug
Reporter: Owen O'Malley


For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Search Discussions

  • Doug Cutting (JIRA) at Feb 4, 2008 at 6:15 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565450#action_12565450 ]

    Doug Cutting commented on HADOOP-2774:
    --------------------------------------

    Is this proportional to the number of comparisons made or the number of disk reads? I.e., do you ignore in-memory sorting?
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Runping Qi (JIRA) at Feb 4, 2008 at 6:27 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565455#action_12565455 ]

    Runping Qi commented on HADOOP-2774:
    ------------------------------------


    Additional useful info should also be collected/computed/reported/logged. This info includes the
    total number of spills (by a mapper), the expected number of levels of merges,
    the number of merges at each level, the progress of the merge process, etc.

    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Owen O'Malley (JIRA) at Feb 4, 2008 at 11:57 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565607#action_12565607 ]

    Owen O'Malley commented on HADOOP-2774:
    ---------------------------------------

    Doug, it is the number of records written to disk, which clearly is the same as the number read + 1.

    Runping, my proposed counters gives you the size of the spills, which is much more important than the number of spills. By dividing by the number of either map outputs or reduce inputs, you would get the effective spill count.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Oct 13, 2008 at 5:29 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi reassigned HADOOP-2774:
    ------------------------------------

    Assignee: Ravi Gummadi
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Oct 21, 2008 at 11:44 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641358#action_12641358 ]

    Ravi Gummadi commented on HADOOP-2774:
    --------------------------------------

    Owen, Would you please explain your statement "it is the number of records written to disk, which clearly is the same as the number read + 1. " ?
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Runping Qi (JIRA) at Oct 26, 2008 at 4:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642803#action_12642803 ]

    Runping Qi commented on HADOOP-2774:
    ------------------------------------


    @Owen the sizes of the spills may not necessarily the same for all spills.
    Thus it is still necessary to know the exact number of spills, instead of inferring it from the sizes.

    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Owen O'Malley (JIRA) at Oct 26, 2008 at 5:08 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642804#action_12642804 ]

    Owen O'Malley commented on HADOOP-2774:
    ---------------------------------------

    Ravi, my statement was about the map side. With no records spilled, each map will write each output record once. Since each record that is spilled before the final merge, is by definition read exactly once as it is merged into the next larger file. On the reduce side, the number of records written in the shuffle is exactly the same as the number of records read.

    Runping, for the most part, the important piece of information is how many records we read and write to disk, since that is the major performance problem. For instance writing 10 files with 1m records each versus writing 2 files with 5m records each will have very similar performance, because it is bound on the disk i/o time. Conversely, the difference between reading and writing 10m records versus 5m records would be substantial.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Runping Qi (JIRA) at Oct 26, 2008 at 5:51 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642808#action_12642808 ]

    Runping Qi commented on HADOOP-2774:
    ------------------------------------


    Owen, I think the number of spills is more important than you think.
    Consider these two cases:

    1. a task writes 1000 spills, each 10000 records, totalling 10M records.
    2. a task writes 100 spills, each with 100000 records, totaling 10M records.

    With mergeing factor being 100, the case 2 does not second level merge, while the case 2 does.
    The case 1 requires 2x read/writes of the case 1.

    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Owen O'Malley (JIRA) at Oct 26, 2008 at 11:13 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642828#action_12642828 ]

    Owen O'Malley commented on HADOOP-2774:
    ---------------------------------------

    Sure and in my proposal includes the records written as intermediates in the merge. So roughly it looks like:

    case 1 = first level spill (10 m row writes) + second level (10 m row writes) + final write (10 m row writes) = 30 m
    case 2 = first level (10 m row writes) + final write (10 m row writes) = 20 m

    which shows the 2 versus 3 levels.

    However, consider case 3 a map writes 500 spills of 20,000 records. It would be merged as:
    second level: 5 pieces + 100 pieces + 100 pieces + 100 pieces + 100 pieces
    final level: 5 big pieces + 95 small pieces
    total = first level (10 m) + second level (405 * 20k = 8.1m) + final write (10m) = 28.1 m

    which is a much better indication of the performance than either levels (3) or first level spills (500).
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Runping Qi (JIRA) at Oct 27, 2008 at 5:15 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642962#action_12642962 ]

    Runping Qi commented on HADOOP-2774:
    ------------------------------------

    I still want to know the number of first first level spills, in addition to all other numbers.

    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Owen O'Malley (JIRA) at Oct 27, 2008 at 5:25 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642965#action_12642965 ]

    Owen O'Malley commented on HADOOP-2774:
    ---------------------------------------

    What are you trying to accomplish with the number of first level spills? It clearly doesn't change the performance at all.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Runping Qi (JIRA) at Oct 27, 2008 at 6:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643001#action_12643001 ]

    Runping Qi commented on HADOOP-2774:
    ------------------------------------

    it tells me how effective is the spill thread.
    It also gives me some hint as to how to optimize my heapsite/sort.mb setting.

    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi

    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 11, 2008 at 7:34 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Fix Version/s: 0.20.0
    Status: Patch Available (was: Open)

    The following 2 new counters
    (1) Map First Level Spills(Number of first level spills in map task) and
    (2) Spilled Records(number of records spilled to disk) --- both in Maps and Reduces
    are added with this patch.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 11, 2008 at 7:36 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Attachment: HADOOP-2774.patch

    patch attached
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Hadoop QA (JIRA) at Nov 12, 2008 at 12:48 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646758#action_12646758 ]

    Hadoop QA commented on HADOOP-2774:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12393723/HADOOP-2774.patch
    against trunk revision 713122.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3578/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3578/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3578/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3578/console

    This message is automatically generated.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Sharad Agarwal (JIRA) at Nov 12, 2008 at 12:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646889#action_12646889 ]

    Sharad Agarwal commented on HADOOP-2774:
    ----------------------------------------

    Instead of adding code to extract counter at all over the code, what if we intercept the append in IFile.Writer. I believe all intermediate record writes happen thru it. Something like this could be added in Task.java and MapTask/ReduceTask can use this to create Writer:

    {code}
    protected Writer createWriter(Configuration conf, FSDataOutputStream out,
    Class keyClass, Class valueClass, CompressionCodec codec) throws IOException {
    return new Writer(conf, out, keyClass, valueClass, codec) {
    public void append(Object key, Object value) throws IOException {
    super.append(key, value);
    spilledRecordsCounter.increment(1);
    }

    public void append(DataInputBuffer key, DataInputBuffer value)
    throws IOException {
    super.append(key, value);
    spilledRecordsCounter.increment(1);
    }
    };
    }
    {code}
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Sharad Agarwal (JIRA) at Nov 12, 2008 at 1:00 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646890#action_12646890 ]

    Sharad Agarwal commented on HADOOP-2774:
    ----------------------------------------

    Also it would be desirable to write a test case for this.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 14, 2008 at 9:27 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Attachment: HADOOP-2774.patch

    New patch attached here.
    As Sharad pointed out, since most of the writes to disk happen through IFile.Writer.append(), this new code maintains a count of records written.
    I verified the spilled records count with wordcount example and test inputs with different values of io.sort.mb, io.sort.factor and different number of maps & reduces.

    Please do the review of code changes.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 18, 2008 at 9:34 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648549#action_12648549 ]

    Ravi Gummadi commented on HADOOP-2774:
    --------------------------------------

    It is difficult to have a testcase for this issue since the number of spilled records depend on a number of factors like the value of io.sort.mb, io.sort.factor, etc. Also, it is not guaranteed that the number will not change when we make changes in the merge/sort algorithm in the future. I tested the patch manually.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Devaraj Das (JIRA) at Nov 18, 2008 at 9:38 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das updated HADOOP-2774:
    --------------------------------

    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Ravi!
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 18, 2008 at 11:54 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas reopened HADOOP-2774:
    -----------------------------------


    I reverted this because using a non-volatile, static long to track records written from IFile is patently incorrect, particularly in the reduce, where there can be dozens of writers.

    Also, the record count should be part of the IFile format, not stored with the index and passed in the header.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 19, 2008 at 3:51 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648903#action_12648903 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    bq. It is difficult to have a testcase for this issue since the number of spilled records depend on a number of factors like the value of io.sort.mb, io.sort.factor, etc. Also, it is not guaranteed that the number will not change when we make changes in the merge/sort algorithm in the future.

    This can- and should- have a unit test. While there are several, configurable parameters that determine the number of spills, it can be calculated and verified. If there are changes to the framework that invalidate the unit test, it can be updated or removed when that happens.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 19, 2008 at 4:49 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648912#action_12648912 ]

    Ravi Gummadi commented on HADOOP-2774:
    --------------------------------------

    Chris,
    Will make the static long member in Writer as volatile.

    The record count is kept in IndexRecord and sent as part of http header to reduceNode to avoid processing the records when directly written to disk in shuffleToDisk.
    How about tapping the IFile.Reader for count of records in reduce phase and IFile.Writer in Map phase ? Then there is no need to keep the count of records in IndexRecord(and send it as part of http header) or in IFile.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Arun C Murthy (JIRA) at Nov 19, 2008 at 6:31 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648919#action_12648919 ]

    Arun C Murthy commented on HADOOP-2774:
    ---------------------------------------

    bq. Will make the static long member in Writer as volatile.

    I don't think the answer is making it 'volatile', the correct implementation is to keep the no. of records in the IFile trailer along with the checksum.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Devaraj Das (JIRA) at Nov 19, 2008 at 7:09 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648925#action_12648925 ]

    Devaraj Das commented on HADOOP-2774:
    -------------------------------------

    (Sorry that the bug escaped my eye during the commit)
    Here are my thoughts:
    1) In the map task, use a static volatile counter in the IFile.Writer that maintains the total count of records spilled to disk so far.
    2) In the reduce task, use a static volatile counter in the IFile.Reader that maintains the total count of records read from disk so far.
    In the above no information is exchanged between the map and reduce tasks, and the ifile format is not touched too.

    I agree that keeping the information in IFile is in line with keeping the information self-contained, but there is tradeoff in the implementation complexity and the usefulness of that approach versus the one i propose here.. Thoughts?
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Devaraj Das (JIRA) at Nov 19, 2008 at 3:19 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649075#action_12649075 ]

    Devaraj Das commented on HADOOP-2774:
    -------------------------------------

    We should use AtomicLong instead of volatile if we go with the approach of having static longs in the Reader/Writer classes.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 19, 2008 at 9:25 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649167#action_12649167 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    Sorry, I was unclear. The issue is not the volatility of the aggregator, but that it is static. There are a number of contexts- JVM reuse, for one- where it simply will not work. Aggregating disparate counters in a shared variable only to push its _deltas_ to another aggregator is not an expected approach. A static counter 1) is already implemented in the Counter interface; duplicating its functionality to feed it is suspect and 2) is almost certainly more difficult to get right for all our use cases than some of the alternatives.

    Sharad's proposal seems very reasonable. I'd suggest one variant: adding a Counter formal to the IFile.Reader and IFile.Writer constructors. In the map, creating each Writer with a counter to track each record hitting disk should be accurate. In the reduce, instead of incrementing the counter as the segment is written to disk from the fetch and intermediate merges, updating it as it is *read* from disk will yield the correct value at the end of the job. So for the final merge into the reduce and the intermediate, on-disk merges, a counter will be provided. This makes it unnecessary to transfer the record count to the reduce, lets the IFile format remain exactly as it is, and should be fairly easy to implement.

    Thoughts?
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Hudson (JIRA) at Nov 19, 2008 at 10:19 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649182#action_12649182 ]

    Hudson commented on HADOOP-2774:
    --------------------------------

    Integrated in Hadoop-trunk #665 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/665/])
    Revert

    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Devaraj Das (JIRA) at Nov 20, 2008 at 5:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649272#action_12649272 ]

    Devaraj Das commented on HADOOP-2774:
    -------------------------------------

    Yes Chris, that's a very valid point on the JVM reuse. I completely overlooked that. So the basic idea of having counters in the IFile.Reader and IFile.Writer and not touching the IFile format remains. But I'd like to avoid polluting the IFile class with formal Counters if possible.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 20, 2008 at 7:23 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649290#action_12649290 ]

    Ravi Gummadi commented on HADOOP-2774:
    --------------------------------------

    I see two approaches now:
    (1) Pass a formal Counters object to the Ifile Reader/Writer and Merger.merge APIs. This would work I think, but the only problem, as Devaraj pointed out is that we are making Ifile and Merger classes dependent on a core MapReduce Counters feature. The Ifile/Merger classes currently does mostly IO related stuff and knows nothing about mapred.Counters/Tasks,etc.

    (2) The other approach is to have a callback kind of a mechanism. Define an interface like IfileDiskOperationsMonitor with two methods recordsReadFromDisk(long num) and recordsWrittenToDisk(long num). The Ifile.Reader could invoke
    IFileDiskOperationsMonitor.recordsReadFromDisk(numRead) whenever
    Reader.close() is called. The Task class could implement the interface and could update its own copy of the relevant Counter then. Similarly for Ifile.Writer (invokes IFileDiskOperationsMonitor.recordsWrittenToDisk(numWritten)).
    The Ifile.Reader,Writer and Merger classes could have an additional argument for IFileDiskOperationsMonitor in the respective constructors (or could have setters in the classes). The argument is the Task object itself since that implements the interface.
    This seems fairly generic and in the future could be used by any other potential user of Ifile/Merger classes (outside MapReduce)..

    Thoughts?
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 20, 2008 at 8:59 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649304#action_12649304 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    (1) IFile is a package-private class, no? If it's not visible outside of the mapred package, using other components in that package doesn't seem like pollution to me. If the class responsible for emitting records counts the number of records it emits using the class responsible for counting, that seems like a victory for reuse and coherence. Is there a particular reason the two should be kept separate? Indiscriminate coupling of unrelated components is to be avoided, certainly, but this strikes me as a plain win.

    (2) Just to make sure I understand: the proposal is to add a new interface, IFileDiskOperationsMonitor, which Task would implement. The Task would pass a reference to itself (or an inner class would pass a reference to its containing instance) to the IFile.\{Reader,Writer\} instance, which would hold that reference until it closes, when it would pass its internal count to the IFileDiskOperations instance, which would update the counter. Is that correct?
    * Why the indirection? Why create a new type for monitoring disk operations passing through a particular, intermediate format, a format already limited to a package that already contains a type that implements a superset of the new type's functionality?
    * Task should not gain a new interface each time we want to track a new quantity or type of quantity. Further, the IFile format is not part of the Task type. Metrics from the IFile format certainly are not.
    * If the issue is performance, there's no reason why the counter can't be updated in the same way.
    * Passing a named counter in some contexts is far more readable than passing \*Task.this in some contexts, but not in others.

    bq. This seems fairly generic and in the future could be used by any other potential user of Ifile/Merger classes (outside MapReduce)
    And yet it is emphatically less generic than the Counters, which already provide an interface to consumers outside the mapred package.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Devaraj Das (JIRA) at Nov 20, 2008 at 12:31 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649354#action_12649354 ]

    Devaraj Das commented on HADOOP-2774:
    -------------------------------------

    I was opposing the introduction of the mapred.Counters in IFile/Merger for the reason that IFile/Merger classes deal today with IO/merge only, and could very well be moved to the org.apache.hadoop.core package as part of the package split work. But this may not be preferred after all since we are also developing a more reusable file format TFile.
    So I am okay either way I guess..
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 20, 2008 at 2:43 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649379#action_12649379 ]

    Ravi Gummadi commented on HADOOP-2774:
    --------------------------------------

    So I will do the following:
    I will have a new constructor of IFile.Writer(Wrapper of existing constructor) that will take spilledRecordsCounter as parameter. This constructor is called from MapTask. Writer will have a long that gets updated in append() and spilledRecordsCounter is updated in Writer.close().
    Similarly new constructor in IFile.Reader(wrapper of existing constructor) that will take spilledRecordsCounter as parameter. This constructor is called from ReduceTask. Reader will have a long that gets updated in next() and spilledRecordsCounterr is updated in Reader.close().

    Since Merger.merge is called from both Map and Reduce, inside Merger, we won't have context information of whether called from Map or Reduce. So I will send 2 counters(say readCounter, writeCounter) to merge. In MapTask, Merger.merge(/* other params */, null, spilledRecordsCounter) and in ReduceTask, Merger.merge(/*other params */, spilledRecordsCounter, null) is called. Merger.merge( ) will call the new constructors with Reader(/*other params*/, readCounter) and Writer(/*other params*/, writeCounter) sothat Writes are counted in Map and Reads are counted in Reduce.

    Thoughts ?
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 20, 2008 at 11:27 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649530#action_12649530 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    Does the merger require two different counters? Isn't it sufficient to pass an optional counter for each record the merger emits? The caller should have enough context to know whether the segments are coming to/from disk or memory... right?

    Overall, the plan sounds good. This section of code is particular enough that it might be easier to reason about a patch, even a partial one. There are some edge cases that may limit the implementation.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Jothi Padmanabhan (JIRA) at Nov 21, 2008 at 6:19 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649611#action_12649611 ]

    Jothi Padmanabhan commented on HADOOP-2774:
    -------------------------------------------

    bq. Does the merger require two different counters? Isn't it sufficient to pass an optional counter for each record the merger emits? The caller should have enough context to know whether the segments are coming to/from disk or memory... right?

    The two counters are for counting the number of records read and the number of records written, not for determining whether the records came from disk/memory. Since Merger does not know whether a particular call is from Map or Reduce, we need to specify both the couters and use the appropriate counter at the Map/Reduce task level. Can we do without this?
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 21, 2008 at 7:07 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649617#action_12649617 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    bq. The two counters are for counting the number of records read and the number of records written, not for determining whether the records came from disk/memory.

    I'm confused. The map needs to count spilled records as it writes to disk; that's straightforward. The reduce- since some of its fetched segments are written directly to disk- either needs a count from the map as in the original patch, or it should count records as it reads them from disk. Why does the merger need two counters, particularly since the caller is the only one that knows whether it's going to disk or to a reduce? When would the number of records read from its segments differ from the number of records it ultimately emits? Passing an optional object to the merge that gets pinged every time it emits a record is both straightforward as an API change and sufficient for this particular use case. One could probably add the counter to the Merger without changing IFile.Reader and get the same semantics. I like the symmetry of adding counters to both, but either way is fine.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 21, 2008 at 7:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649621#action_12649621 ]

    Ravi Gummadi commented on HADOOP-2774:
    --------------------------------------

    As ShuffleToDisk( ) is causing the problem of counting records when directly written to disk without processing, we decided to count the records when they are read from disk in reduce phase(which is same as the number of records written to disk in reuce phase). As Merger.merge is called from both Map and Reduce, merge should be able to count records written to disk(useful when merge is called from Map phase) and records read from disk(useful wnhen merge is called from Reduce phase) separately --- thus 2 counters.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 21, 2008 at 8:09 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649625#action_12649625 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    1) If there's only one spill in the map, there will be no merge. The counter needs to be given to IFile.Writer and never the merger on the map side.
    2) The merger just needs to count the number of records it emits. That's all it can do; it doesn't know if it's feeding an IFile.Writer, the reduce, or the combiner. The critical piece is figuring out when it is appropriate to give the counter to an IFile.Writer or the merger.
    3) Records read by a merge and records written by the merge are the same quantity. What is the distinction? That the preceding comment passes null for one of the two parameters in each of the preceding cases suggests that their mutual exclusivity is clear. That they are the same quantity is required by the semantics of the merge. A use case would be helpful
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 21, 2008 at 8:59 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649627#action_12649627 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    OK. Talked with Jothi offline. The second counter is handling the _intermediate merge_, which creates its own Writer independent of the counter for records emitted from the RawKeyValueIterator. Specifying how/whether to count these records requires more than the single, Counter parameter to the merge.

    I'm +1 on the approach as described.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 21, 2008 at 6:20 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Attachment: HADOOP-2774.patch

    Attached the patch with the new code changes.
    Please review the patch and provide your comments. Thanks.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 24, 2008 at 11:31 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650176#action_12650176 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    I haven't been through the rest of the code in detail, but the test case shouldn't need to read/write so much data to test the spill counters.
    * Would it work if you used a smaller io.sort.mb and calibrated the size of your data to trigger a fixed number of spills? In the current version, spills should be triggered based on the number of records, which is a property the test isn't controlling strictly.
    * Why run the combiner? Isn't each word coming out of each map unique?
    * It might be necessary to set mapred.child.java.opts explicitly to make sure the memory limit stays fixed, even for different client configurations. Does it not work with mapred.job.shuffle.buffer.percent = 0?
    * The test cannot create its scratch directory in the working dir. It should use the test.build.data property as the root for its temporary data. It should also clean up when the test completes.
    * testCounters looks like a unit test and only emits log messages. It seems unnecessary and less readable than putting the asserts inline with the unit test
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 24, 2008 at 4:44 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Attachment: HADOOP-2774.patch

    Thanks Chris for the comments.

    (1) Would it work if you used a smaller io.sort.mb and calibrated the size of your data to trigger a fixed number of spills? In the current version, spills should be triggered based on the number of records, which is a property the test isn't controlling strictly.

    Yes. It would work. I reduced the size of input files by a factor of 100 and made io.sort.mb=1 now.

    (2) Why run the combiner? Isn't each word coming out of each map unique?

    Wanted to test the path of combiner getting called in Map & Reduce phases --- so have combiner. Words are repeated(fixed number of times) in the input files.

    (3)It might be necessary to set mapred.child.java.opts explicitly to make sure the memory limit stays fixed, even for different client configurations. Does it not work with mapred.job.shuffle.buffer.percent = 0?

    OK, Setting mapred.child.java.opts explicitly now. Made mapred.job.shuffle.buffer.percent=0.

    (4)The test cannot create its scratch directory in the working dir. It should use the test.build.data property as the root for its temporary data. It should also clean up when the test completes.

    Done.

    (5)testCounters looks like a unit test and only emits log messages. It seems unnecessary and less readable than putting the asserts inline with the unit test

    OK. Changed the name of the method to validateCounters. As 2 jobs are run in testSpillCounter()(1 with 3 i/p files and another with 4 i/p files), validateCounters() is called twice. So not inlining it. Hope that is better readable now.


    Attached the patch with the above changes. Please review and provide your comments.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 25, 2008 at 2:37 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650448#action_12650448 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    bq. Wanted to test the path of combiner getting called in Map & Reduce phases — so have combiner. Words are repeated(fixed number of times) in the input files.
    Ah; I'd missed that. OK.

    bq. OK, Setting mapred.child.java.opts explicitly now. Made mapred.job.shuffle.buffer.percent=0.
    Sorry, I meant: if .001 was chosen to get around a bug, i.e. that mapred.job.shuffle.\* couldn't be zero, then the ratio is insufficient to get the behavior the test required. If the property can be zero, then it's not necessary (no need to set mapred.child.java.opts). Wait... this test uses LocalJobRunner. The mapred.job.shuffle\* parameter should do absolutely nothing; the segments will always be read from disk. Unless this test uses the Mini\*Clusters, there's no reason to configure the memory management at the reduce.

    On the map side, for a fixed number of spills, io.sort.record.percent and io.sort.spill.percent should probably be controlled so the number of spills is deterministic. Since you've already sized the test for the defaults, setting them in the test is probably sufficient.

    Other points:
    * The test case should clean up when it fails, as well. Can't it just delete {{TEST_ROOT_DIR}}?
    * IFile.{Reader,Writer} doesn't need another constructor. IFile is a package-private class, so its interface can be changed in a backwards-incompatible way.
    * Quick nit:
    {noformat}
    + if(readRecordsCounter != null)
    + readRecordsCounter.increment(numRecordsRead);
    {noformat}
    This doesn't conform to the [coding standards|http://java.sun.com/docs/codeconv/html/CodeConventions.doc6.html#449].
    * Sorry to revive this, but I really don't see the value in tracking the number of first level spills. Since the changes supporting it are not related to the rest of the patch, I suggest it be pushed to a different JIRA.

    Other than these minor points, the patch looks good.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 25, 2008 at 4:45 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650457#action_12650457 ]

    Ravi Gummadi commented on HADOOP-2774:
    --------------------------------------

    The wordcount example works with mapred.job.shuffle.buffer.percent=0 also(No need of that small +ve value) even without LocalJobRunner. So removing setting mapred.child.java.opts. Since this testcase uses LocalJobRunner, removed the setting of mapred.job.shuffle.buffer.percent.

    OK. Setting io.sort.record.percent and io.sort.spill.percent to the default values in the testcase.

    OK. Now the testcase removes the testdir(and not each file separately). Removes the testdir even in the case of failure.

    Just didn't want to change all the calls to constructors of IFile.Reader and IFile.Writer in all files to have this extra parameter(null in most cases). So added new constructors.

    OK. Added "{ }" for the if statements.

    Map First Level Spills: Runping wanted this for
    [ Show » ] Runping Qi - 27/Oct/08 11:32 AM it tells me how effective is the spill thread. It also gives me some hint as to how to optimize my heapsite/sort.mb setting.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 25, 2008 at 5:20 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Attachment: HADOOP-2774.patch

    Attached the patch with the modifications discussed.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 25, 2008 at 6:02 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650473#action_12650473 ]

    Chris Douglas commented on HADOOP-2774:
    ---------------------------------------

    bq. Just didn't want to change all the calls to constructors of IFile.Reader and IFile.Writer in all files to have this extra parameter(null in most cases). So added new constructors.
    After removing the old constructors, the compiler complains about 7 unmatched calls. One cstr, {{Reader(Configuration, FileSystem, Path, CompressionCodec)}} doesn't have any callers left. You've already done most of the work on this...

    bq. Map First Level Spills: Runping wanted this for [...]
    Yes, but the changes don't have to be part of this issue. It's a separate change to the counters, and any debate about its usefulness needn't hold up the rest of the patch.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 25, 2008 at 7:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Attachment: HADOOP-2774.patch

    OK. Removed old constructors of IFile.Reader and IFile.Writer.

    Removed code of "Map First Level Spills". We can have a separate JIRA for that.

    Attached the patch with changes
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 25, 2008 at 8:50 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Status: Patch Available (was: Reopened)

    A counter for number of records spilled to disk in maps & reduces is added.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Chris Douglas (JIRA) at Nov 25, 2008 at 10:26 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-2774:
    ----------------------------------

    Status: Open (was: Patch Available)

    I don't think this counts single-record spills, where the record is too large to fit in the serialization buffer. +1 once it's fixed.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
  • Ravi Gummadi (JIRA) at Nov 25, 2008 at 11:42 am
    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-2774:
    ---------------------------------

    Attachment: HADOOP-2774.patch

    Thanks Chris. Fixed the issue of spillSingleRecord()'s Writer.

    Attached the patch and submitting for patch testing.
    Add counters to show number of key/values that have been sorted and merged in the maps and reduces
    --------------------------------------------------------------------------------------------------

    Key: HADOOP-2774
    URL: https://issues.apache.org/jira/browse/HADOOP-2774
    Project: Hadoop Core
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Ravi Gummadi
    Fix For: 0.20.0

    Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch


    For each *pass* of the sort and merge, I would like a count of the number of records. So for example, if the map output 100 records and they were sorted once, the counter would be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level merge, it may not be a multiple of the number of map output records. This would let the users easily see if they have values like io.sort.mb or io.sort.factor set too low.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcommon-dev @
categorieshadoop
postedFeb 2, '08 at 8:13a
activeNov 26, '08 at 4:43p
posts56
users1
websitehadoop.apache.org...
irc#hadoop

1 user in discussion

Hudson (JIRA): 56 posts

People

Translate

site design / logo © 2022 Grokbase