want input sampler & sorted partitioner
---------------------------------------

Key: HADOOP-3019
URL: https://issues.apache.org/jira/browse/HADOOP-3019
Project: Hadoop Core
Issue Type: New Feature
Components: mapred
Reporter: Doug Cutting


The input sampler should generate a small, random sample of the input, saved to a file.

The partitioner should read the sample file and partition keys into relatively even-sized key-ranges, where the partition numbers correspond to key order.

Note that when the sampler is used for partitioning, the number of samples required is proportional to the number of reduce partitions. 10x the intended reducer count should give good results.
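
As a rough illustration of the partitioning idea described above, the following minimal sketch sorts a sample, picks evenly spaced split points, and maps each key to a partition by binary search, so that partition numbers follow key order. The class and method names are hypothetical and are not Hadoop APIs; it assumes String keys in natural order and a sample with no duplicate split points.

```java
import java.util.Arrays;

// Hypothetical sketch: names are illustrative, not Hadoop APIs.
class RangePartitionSketch {

    /** Picks numPartitions - 1 evenly spaced split points from a sample of keys. */
    static String[] splitPoints(String[] sample, int numPartitions) {
        if (numPartitions < 1 || sample.length < numPartitions) {
            throw new IllegalArgumentException("need at least one sample per partition");
        }
        String[] sorted = sample.clone();
        Arrays.sort(sorted);
        String[] splits = new String[numPartitions - 1];
        double step = sorted.length / (double) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[(int) Math.round(step * i)];
        }
        return splits;
    }

    /** Maps a key to a partition number so that partition order follows key order. */
    static int partition(String[] splits, String key) {
        int pos = Arrays.binarySearch(splits, key);
        // Found: the key equals split point pos and opens partition pos + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] sample = {"d", "a", "f", "b", "e", "c"};
        String[] splits = splitPoints(sample, 3);
        for (String key : new String[] {"a", "c", "d", "e", "z"}) {
            System.out.println(key + " -> partition " + partition(splits, key));
        }
    }
}
```

With a larger sample (for example, about 10x the reducer count, as suggested above), the split points track the key distribution more closely and the ranges come out more even.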

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Doug Cutting (JIRA) at Mar 14, 2008 at 4:34 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578840#action_12578840 ]

    Doug Cutting commented on HADOOP-3019:
    --------------------------------------

    Implementation thoughts:
    - the sampler can be implemented as an inputformat.
    - a generic sampling job class can configure the sampling input format and a single identity reducer.
    - samples should come from random positions in input files, since input files are frequently themselves sorted.


  • Amar Kamat (JIRA) at Mar 14, 2008 at 7:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578899#action_12578899 ]

    Amar Kamat commented on HADOOP-3019:
    ------------------------------------

    Should this be a part of examples like sort? Users can use it the way they use other examples.
  • Doug Cutting (JIRA) at Mar 14, 2008 at 8:14 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578915#action_12578915 ]

    Doug Cutting commented on HADOOP-3019:
    --------------------------------------
    bq. Should this be a part of examples like sort?

    I guess it could live with the examples, but I was thinking that this would be more like mapred/lib. The sampler should be generic enough that folks won't have to modify it to find it useful: it should work for different key/value types and for sequencefile and text data.
  • Enis Soztutar (JIRA) at Mar 17, 2008 at 8:49 am
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579347#action_12579347 ]

    Enis Soztutar commented on HADOOP-3019:
    ---------------------------------------

    The sampler can be written easily once Filters are in (HADOOP-449). I intend to come up with a patch today.
  • Chris Douglas (JIRA) at Sep 15, 2008 at 10:40 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Attachment: 3019-0.patch

    Adapted TotalOrderPartitioner and input sampler from HADOOP-3402, with the following changes:
    * Adds two other kinds of samplers
    * Made memcmp-able types (Text, BytesWritable) use the trie; other key types do a binary search over the partition keyset
    * Adds a unit test for the partitioner
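
The thread does not show the trie itself, so the following is only a simplified sketch of why byte-comparable keys allow a faster lookup: a 256-entry table indexed by the key's first byte narrows the range of split points that still needs a binary search. It is a one-level stand-in for a trie, not the patch's data structure. All names are hypothetical, and it assumes non-empty keys and split points sorted in unsigned byte order.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a one-level, first-byte index over sorted split points.
class FirstByteIndexSketch {
    private final byte[][] splits;            // sorted, non-empty split points
    private final int[] start = new int[257]; // start[b] = number of split points whose first byte is < b

    FirstByteIndexSketch(byte[][] sortedSplits) {
        this.splits = sortedSplits;
        int[] count = new int[256];
        for (byte[] s : sortedSplits) {
            count[s[0] & 0xff]++;
        }
        for (int b = 0; b < 256; b++) {
            start[b + 1] = start[b] + count[b];
        }
    }

    /** Returns the number of split points <= key, which is the key's partition. */
    int partition(byte[] key) {
        int b = key[0] & 0xff;
        int lo = start[b];
        int hi = start[b + 1];
        // Only split points sharing the key's first byte need comparing.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(splits[mid], key) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Lexicographic comparison of unsigned bytes, as memcmp would order them. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        FirstByteIndexSketch index = new FirstByteIndexSketch(
            new byte[][] {utf8("apple"), utf8("avocado"), utf8("banana")});
        for (String key : new String[] {"ant", "apricot", "azure", "cherry"}) {
            System.out.println(key + " -> partition " + index.partition(utf8(key)));
        }
    }
}
```

Key types that are only comparable through their own `compareTo` cannot be indexed by byte prefix this way, which is why a plain binary search over the whole keyset is the fallback.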
  • Chris Douglas (JIRA) at Sep 15, 2008 at 10:42 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Fix Version/s: 0.19.0
    Assignee: Chris Douglas
    Status: Patch Available (was: Open)

    This won't compile until HADOOP-4151 is in, but marking it PA for review.
  • Chris Douglas (JIRA) at Sep 15, 2008 at 11:04 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631175#action_12631175 ]

    Chris Douglas commented on HADOOP-3019:
    ---------------------------------------

    Results of test-patch with HADOOP-4151 applied:
    {noformat}
    [exec] +1 overall.

    [exec] +1 @author. The patch does not contain any @author tags.

    [exec] +1 tests included. The patch appears to include 18 new or modified tests.

    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    {noformat}
  • Chris Douglas (JIRA) at Sep 15, 2008 at 11:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Attachment: 3019-1.patch

    Updated patch to refer to BinaryComparable instead of MemComparable and moved the change to bin/hadoop to the correct patch.
  • Hadoop QA (JIRA) at Sep 16, 2008 at 10:00 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631568#action_12631568 ]

    Hadoop QA commented on HADOOP-3019:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12390146/3019-1.patch
    against trunk revision 696002.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/console

    This message is automatically generated.
  • Chris Douglas (JIRA) at Sep 17, 2008 at 12:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Attachment: 3019-2.patch

    Changed random sampling to be less dominated by keys in the latter part of each sampled split.
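
The thread does not say how the patch changed the sampling, so the following is not the patch's code. It sketches one standard technique with the property described, reservoir sampling (Algorithm R), in which every key in a stream has the same chance of ending up in the sample regardless of its position. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of reservoir sampling; not taken from the patch.
class ReservoirSketch {

    /** Returns up to numSamples keys, each input key being equally likely to be kept. */
    static <K> List<K> sample(Iterable<K> keys, int numSamples, Random rnd) {
        List<K> reservoir = new ArrayList<>();
        int seen = 0;
        for (K key : keys) {
            seen++;
            if (reservoir.size() < numSamples) {
                // Fill phase: keep the first numSamples keys.
                reservoir.add(key);
            } else {
                // Replace phase: keep the new key with probability numSamples / seen.
                int j = rnd.nextInt(seen);
                if (j < numSamples) {
                    reservoir.set(j, key);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(i);
        }
        System.out.println(sample(keys, 10, new Random()));
    }
}
```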
  • Chris Douglas (JIRA) at Sep 17, 2008 at 5:18 am
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Status: Open (was: Patch Available)
  • Chris Douglas (JIRA) at Sep 17, 2008 at 5:18 am
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Status: Patch Available (was: Open)
  • Runping Qi (JIRA) at Sep 17, 2008 at 6:10 am
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631664#action_12631664 ]

    Runping Qi commented on HADOOP-3019:
    ------------------------------------


    Sorry for jumping in late.

    Since the sample points are kept in an array and sorted in memory, the sample size is severely limited.
    Why not consider using a MapReduce job to generate the sampling points and the partition file?


  • Chris Douglas (JIRA) at Sep 17, 2008 at 9:36 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Status: Open (was: Patch Available)

    bq. Since the sample points are kept in an array and sorted in memory, the sample size is severely limited. Why not consider using a MapReduce job to generate the sampling points and the partition file?

    The client-side sampler is limited, no question, but it usually takes only a few seconds to run (unlike a distributed job), generates decent results, and can easily be rolled into the user's driver. The distributed sampler (planned; I am writing it) can be more accurate, but will take longer. The client-side sampler also needs to use the map class, so that the sampling is on the map output keytype and distribution rather than the input.

    The latter requires that most of the InputSampler be rewritten to use MapRunnable, so I'm cancelling the patch for now.
  • Chris Douglas (JIRA) at Sep 19, 2008 at 6:57 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Status: Patch Available (was: Open)

    Submitting last patch to make 0.19.
  • Chris Douglas (JIRA) at Sep 19, 2008 at 10:49 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Attachment: 3019-3.patch

    Updated based on Owen's feedback:
    * Changed RandomSampler to sample evenly across all splits, rather than evenly from each split
    * Used double instead of float for sampling rate

    This patch also modifies the sort example to demonstrate how to use InputSampler in a job. This requires examples to depend on tools.
  • Chris Douglas (JIRA) at Sep 19, 2008 at 11:09 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Attachment: 3019-4.patch

    Fixed a javadoc warning (the ant javadoc target didn't know about tools).
  • Chris Douglas (JIRA) at Sep 19, 2008 at 11:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Attachment: 3019-5.patch

    More updates on Owen's feedback:
    * RandomSampler includes the selected element when selecting
    * Validate ordering of partition file when configuring TotalOrderPartitioner
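
The ordering check can be pictured with the short sketch below. It is an illustration under assumptions, not the committed code: the names are hypothetical, keys are compared with `compareTo`, and equal adjacent keys are rejected along with out-of-order ones (the thread does not say whether duplicates are allowed).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: rejects a partition keyset that is not strictly increasing.
class SplitOrderCheckSketch {

    static <K extends Comparable<K>> void validate(List<K> splitPoints) {
        for (int i = 1; i < splitPoints.size(); i++) {
            K previous = splitPoints.get(i - 1);
            K current = splitPoints.get(i);
            if (previous.compareTo(current) >= 0) {
                throw new IllegalArgumentException(
                    "split points out of order at index " + i + ": " + previous + " >= " + current);
            }
        }
    }

    public static void main(String[] args) {
        validate(Arrays.asList("a", "c", "e")); // passes silently
        try {
            validate(Arrays.asList("a", "e", "c"));
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

Checking once at configuration time matters because a binary search or trie lookup over an unsorted keyset would silently route keys to the wrong partitions rather than fail.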
  • Chris Douglas (JIRA) at Sep 19, 2008 at 11:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632871#action_12632871 ]

    Chris Douglas commented on HADOOP-3019:
    ---------------------------------------

    test-patch results for 3019-4:
    {noformat}
    [exec] +1 overall.

    [exec] +1 @author. The patch does not contain any @author tags.

    [exec] +1 tests included. The patch appears to include 5 new or modified tests.

    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    {noformat}
  • Owen O'Malley (JIRA) at Sep 19, 2008 at 11:41 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Owen O'Malley updated HADOOP-3019:
    ----------------------------------

    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Chris!
  • Hudson (JIRA) at Sep 22, 2008 at 3:22 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633309#action_12633309 ]

    Hudson commented on HADOOP-3019:
    --------------------------------

    Integrated in Hadoop-trunk #611 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/])
  • Chris Douglas (JIRA) at Sep 24, 2008 at 6:32 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-3019:
    ----------------------------------

    Release Note: Adds a partitioner capable of effecting a total order of output data. Also includes an input sampler for generating the partition keyset for TotalOrderPartitioner, useful where the map's input keytype and distribution approximates its output.

    Added a release note.
  • Robert Chansler (JIRA) at Oct 21, 2008 at 7:57 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Chansler updated HADOOP-3019:
    ------------------------------------

    Release Note: Added a partitioner that effects a total order of output data, and an input sampler for generating the partition keyset for TotalOrderPartitioner for when the map's input keytype and distribution approximates its output. (was: Adds a partitioner capable of effecting a total order of output data. Also includes an input sampler for generating the partition keyset for TotalOrderPartitioner, useful where the map's input keytype and distribution approximates its output.)
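To make the release note concrete: a total order results when each key is sent to the partition whose key range contains it, so that concatenating the reducers' outputs in partition order yields globally sorted data. The following is a minimal sketch in Python using a binary search over sorted split points. The function name `get_partition` and the example split points are hypothetical, and this is not the TotalOrderPartitioner source.

```python
from bisect import bisect_right


def get_partition(key, split_points):
    """Return the index of the key range that contains key.

    split_points must be sorted; n split points define n + 1 partitions.
    A key equal to a split point goes to the higher partition.
    """
    return bisect_right(split_points, key)


split_points = ["g", "n", "t"]  # hypothetical cut points for 4 reducers
keys = ["apple", "kiwi", "n", "pear", "zebra"]
partitions = [get_partition(k, split_points) for k in keys]
```

Because the partition index never decreases as the key increases, every key in partition i sorts at or before every key in partition i + 1, which is the property the release note describes.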

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 14, '08 at 4:26p
active: Oct 21, '08 at 7:57p
posts: 24
users: 1
website: hadoop.apache.org...
irc: #hadoop
