Grokbase Groups Pig user July 2009
Hi,

Recently my cluster configuration changed from 11 reducers to 12. Since then,
on every job using Pig, only 3 reducers do actual work and output results,
while the others quickly finish and output zero-byte files. The result of
the entire job is OK, but getting there takes longer because of the uneven
work distribution. Plain MR jobs are fine and do have an even work
distribution. Could this be something in Pig?
(I'm using the latest trunk.)

Thanks,
Tamir

  • Dmitriy Ryaboy at Jul 2, 2009 at 2:06 pm
    Tamir, can you provide example queries that result in this behavior, and
    describe or provide the input data?

    -D
  • Alan Gates at Jul 2, 2009 at 3:23 pm
    For most operations Pig uses the default Hadoop partitioner. We do
    set our own partitioner for order by, but at the moment I believe
    that's it. How are you launching Pig? Is it possible it's picking up
    an old hadoop-site.xml file or something?

    Alan.
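
    For context, the default Hadoop partitioner mentioned above chooses a reducer
    from the key's hashCode() modulo the number of reduce tasks. A minimal
    standalone sketch of that logic (mirroring Hadoop's HashPartitioner one-liner,
    not Pig's internal code path):

    public class DefaultPartitionSketch {
        static int getPartition(Object key, int numReduceTasks) {
            // mask off the sign bit so the result is non-negative, then bucket by modulo
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            // a long key is autoboxed, so Long.hashCode() decides the bucket
            System.out.println(getPartition(1234567L, 12));   // long key
            System.out.println(getPartition("1234567", 12));  // same digits as text
        }
    }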
  • Tamir Kamara at Jul 5, 2009 at 9:19 am
    Hi,

    The config files are up to date and have the 12 reducers.
    I've been able to verify that this only happens when a UDF is used. I mostly
    use eval functions. An example script:

    a01 = load 'file' as (key: long, value: int);
    -- the same load for a02 through a31
    b = cogroup a01 by key, a02 by key, ..., a31 by key;
    DEFINE MEDMAD14 pigUDF.MedMad('14');
    c = foreach b generate group as key, flatten(MEDMAD14(a01, a02, ..., a31));

    The MEDMAD function iterates over the input and produces a bag of rolling
    median and MAD values, computed with the window size passed in the DEFINE.

    If the cogroup is followed by parallel 11 then all is fine (11 equal result
    parts), but if I use parallel 12 then I get only 3 large files and 9
    zero-byte files.

    What do you think?


    Thanks,
    Tamir
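
    As background on how such a UDF is wired up, here is a bare skeleton under
    assumptions, not Tamir's actual MedMad implementation: the string '14' from
    the DEFINE is handed to the UDF's String constructor, and exec() receives one
    tuple whose fields are the cogrouped bags a01..a31. The rolling median/MAD
    computation itself is omitted.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    // Hypothetical skeleton only; the real computation is left out.
    public class MedMad extends EvalFunc<DataBag> {
        private final int window;

        public MedMad(String window) {            // receives '14' from the DEFINE
            this.window = Integer.parseInt(window);
        }

        @Override
        public DataBag exec(Tuple input) throws IOException {
            DataBag out = BagFactory.getInstance().newDefaultBag();
            // input.get(0) .. input.get(input.size() - 1) are the cogrouped bags
            // (a01 .. a31); the rolling median/MAD over `window` values would be
            // computed here and appended to `out` as tuples.
            return out;
        }
    }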

  • Tamir Kamara at Jul 5, 2009 at 10:29 am
    Please disregard my comment about the UDF. Even without it - with just
    "store b into ..." in the query from my previous message - the same behavior
    with 11 vs. 12 reducers is seen.

  • Alan Gates at Jul 6, 2009 at 7:42 pm
    What is the distribution of the keys? Is it fairly uniform, a Gaussian
    distribution, or a power-law distribution? It seems like the hash
    function is not well chosen for 12 reducers. We use Long.hashCode()
    to get hash values, so as long as the keys are well distributed the
    hash codes should be as well.

    Can you attach a sample of the data (or at least the keys)?

    Alan.
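
    For reference, Long.hashCode() is defined as the XOR of the value's upper and
    lower 32 bits, so for keys that fit in 32 bits the "hash" is just the key
    itself. A quick illustration:

    public class LongHashDemo {
        public static void main(String[] args) {
            for (long k : new long[] {7L, 1200L, 1L << 40}) {
                int manual = (int) (k ^ (k >>> 32));   // the formula from java.lang.Long
                System.out.println(k + " -> " + Long.valueOf(k).hashCode()
                                   + " (manual: " + manual + ")");
            }
        }
    }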
  • Tamir Kamara at Jul 7, 2009 at 7:46 am
    Hi Alan,

    The distribution is not uniform, as can be seen from the image.
    Long.hashCode() is essentially the original number (it just XORs the high
    and low 32 bits), so if the key values aren't well distributed the "hash"
    won't be either. I dropped the casting of my key to long to force a real
    hash of the key values and got a nice spread of the work across all
    reducers.
    What should normally be done to avoid this problem?

    Thanks,
    Tamir
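
    To make the effect concrete, here is a small standalone sketch. The
    multiples-of-4 keys are an assumption chosen to reproduce the "3 of 12
    reducers" pattern, not the actual data from this thread, and hashing the
    key's decimal string is only a rough stand-in for leaving the key uncast:

    public class ReducerSpreadDemo {
        // same formula as Hadoop's default HashPartitioner
        static int partition(int hashCode, int reducers) {
            return (hashCode & Integer.MAX_VALUE) % reducers;
        }

        public static void main(String[] args) {
            int reducers = 12;
            int[] asLong = new int[reducers];
            int[] asText = new int[reducers];
            for (long key = 4; key <= 4000; key += 4) {    // assumed skewed key set
                asLong[partition(Long.valueOf(key).hashCode(), reducers)]++;
                asText[partition(Long.toString(key).hashCode(), reducers)]++;
            }
            // Long.hashCode() of these keys is the key itself, so only reducers
            // 0, 4 and 8 ever receive work; the string hash spreads the keys out.
            for (int r = 0; r < reducers; r++) {
                System.out.println("reducer " + r + ": as long = " + asLong[r]
                                   + ", as text = " + asText[r]);
            }
        }
    }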

  • Ted Dunning at Jul 7, 2009 at 4:05 pm
    Pretty much what you did is the right thing to do.

    Bad hashes are just bad: they do what you saw when you have particular
    numbers of reducers. You might convert to a long by some means other than
    casting, but the basic fix is not to use bad hashes.
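
    One way to read "convert to a long by some means other than casting" is to
    run the key through a proper mixing function before it reaches the
    partitioner. A hedged sketch: the mixer below is the 64-bit finalizer used
    by MurmurHash3, and wrapping it in a Pig EvalFunc called MixKey is purely
    hypothetical, not something proposed in this thread. Note that grouping on
    MixKey(key) instead of key means the group field holds the mixed value, so
    the original key has to be recovered from the grouped bags; simply leaving
    the key un-cast, as Tamir did, is the lighter-weight fix.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical UDF: scrambles a long key so its low bits no longer mirror
    // the raw value handed to the default partitioner.
    public class MixKey extends EvalFunc<Long> {
        private static long fmix64(long k) {      // MurmurHash3 64-bit finalizer
            k ^= k >>> 33;
            k *= 0xff51afd7ed558ccdL;
            k ^= k >>> 33;
            k *= 0xc4ceb9fe1a85ec53L;
            k ^= k >>> 33;
            return k;
        }

        @Override
        public Long exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return fmix64(((Number) input.get(0)).longValue());
        }
    }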

Discussion Overview
group: user @ pig.apache.org
categories: pig, hadoop
posted: Jul 2, '09 at 11:55a
active: Jul 7, '09 at 4:05p
posts: 8
users: 4
website: pig.apache.org
