Grokbase Groups Pig dev April 2011
[ https://issues.apache.org/jira/browse/PIG-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015390#comment-13015390 ]

Zubair Nabi commented on PIG-483:
---------------------------------

Hi,

I would like more information about this project. To determine whether the data set is small, is sampling an option?

Thanks,
Zubair
PERFORMANCE: different strategies for large and small order bys
---------------------------------------------------------------

Key: PIG-483
URL: https://issues.apache.org/jira/browse/PIG-483
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Labels: gsoc2011

Currently Pig always does a multi-pass order by, where it first determines a distribution for the keys and then orders in a second pass. This avoids the necessity of having a single reducer. However, in cases where the data is small enough to fit into a single reducer, this is inefficient. For small data sets it would be good to detect the small size of the set and do the order by in a single pass with a single reducer.
This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

  • Daniel Dai (JIRA) at Apr 4, 2011 at 5:38 pm
    [ https://issues.apache.org/jira/browse/PIG-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015511#comment-13015511 ]

    Daniel Dai commented on PIG-483:
    --------------------------------

    The focus here is on using a single reduce, not on the size of the data. Currently, when sorting, Pig samples the data in the first map-reduce job and then does the sort in the second. However, if we detect that the order by uses only one reduce, sampling is not necessary.
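The decision described in this comment can be sketched in a few lines of Python. This is only an illustration, not Pig's planner code: the function name `plan_order_by` and the job labels `"sample"` and `"sort"` are hypothetical.

```python
def plan_order_by(num_reducers):
    """Return the MapReduce jobs a planner might schedule for an ORDER BY.

    Hypothetical helper for illustration; not part of Pig.
    """
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    if num_reducers == 1:
        # A single reducer receives every key, so no partition
        # boundaries are needed and the sampling job can be skipped.
        return ["sort"]
    # With several reducers, keys are sampled first so that the sort
    # job can assign key ranges to reducers.
    return ["sample", "sort"]
```

For example, `plan_order_by(1)` yields only the sort job, while `plan_order_by(4)` yields the sampling job followed by the sort job.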
  • Zubair Nabi (JIRA) at Apr 8, 2011 at 2:53 pm
    [ https://issues.apache.org/jira/browse/PIG-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017479#comment-13017479 ]

    Zubair Nabi commented on PIG-483:
    ---------------------------------

    But how can one make the call that the data is small enough to apply a single-reduce 'order-by'? As I understand it, the distribution helps with proper load balancing in the case of skewed datasets. The first MapReduce pass, the sampling, is used to build a partitioner, and in the second pass that partitioner is used in conjunction with the order-by key as the grouping key. This ensures that every reduce gets a fair workload. So, without any a priori knowledge, how can we determine whether we need a two-stage order-by or a single-stage order-by with a single reduce?
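The two-pass scheme described in this comment can be simulated in plain Python. This is a minimal sketch under stated assumptions, not Pig's implementation: all function names, the sample size and the fixed seed are hypothetical, and the sketch does not split a heavily repeated key across reducers.

```python
import bisect
import random


def build_boundaries(keys, num_reducers, sample_size=100, seed=0):
    """First pass: sample the keys and pick num_reducers - 1 split points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced quantiles of the sample approximate the key distribution.
    return [sample[i * len(sample) // num_reducers] for i in range(1, num_reducers)]


def partition(key, boundaries):
    """Second pass: route a key to the reducer that owns its range."""
    return bisect.bisect_right(boundaries, key)


def two_pass_order_by(keys, num_reducers):
    boundaries = build_boundaries(keys, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition(key, boundaries)].append(key)
    # Each reducer sorts its own range; concatenating the ranges in
    # reducer order gives a total order over all keys.
    return [key for bucket in buckets for key in sorted(bucket)]
```

With `num_reducers=1` the boundary list is empty and every key goes to reducer 0, so the sampling pass contributes nothing to the result.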
  • Daniel Dai (JIRA) at Apr 8, 2011 at 9:44 pm
    [ https://issues.apache.org/jira/browse/PIG-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017668#comment-13017668 ]

    Daniel Dai commented on PIG-483:
    --------------------------------

    No, the number of reducers is currently determined before you launch Pig jobs. Data distribution does not affect the number of reduces. It is determined by the "PARALLEL" statement, the default_parallel level, the Hadoop config entry and the input data size. If Pig decides to use only one reduce, there is no need for the sampling job.
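As a rough sketch of how the four inputs named in this comment could be combined before any job is launched: the precedence order, the parameter names and the default values below are assumptions made for illustration, not something the comment states.

```python
import math


def resolve_reducers(parallel=None, default_parallel=None,
                     hadoop_reduce_tasks=None, input_bytes=0,
                     bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Pick a reducer count without looking at the key distribution.

    Hypothetical helper; the precedence and defaults are assumptions.
    """
    # Assumed precedence: PARALLEL clause, then default_parallel,
    # then the Hadoop configuration entry.
    for explicit in (parallel, default_parallel, hadoop_reduce_tasks):
        if explicit is not None and explicit > 0:
            return explicit
    # Otherwise estimate from the input size, capped and at least 1.
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(estimate, max_reducers))
```

Under these assumptions a small input with no explicit setting resolves to one reducer, which is the case where the sampling job could be skipped.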

posted Apr 4, '11 at 1:11 pm
active Apr 8, '11 at 9:44 pm