Hi folks, at the last Pig contributor meeting, the piggybank question was
discussed -- namely, how to make it easier to contribute to.
(By the way, the contributor meetings are generally open to all comers --
sign up for the pig-dev list if you are interested in that type of thing.)

Here's a section of the notes I sent to Pig-dev that documents the results
of the piggybank discussion. How do you, as users, feel about this plan?

Piggybank.
Kevin Weil led a discussion of the piggybank. There are a few problems with
it -- it's released on the Pig schedule, and has quite a few barriers to
submission that are, anecdotally at least, preventing people from
contributing. Several options were discussed, with the group finally
settling on starting a community-curated GitHub project for piggybank. It
will have a number of committers from different companies, and will aim to
make it easy for folks to contribute (all contribs will still have to have
tests, and be Apache 2.0-licensed). More details will be forthcoming as we
figure them out. Initially this project will be seeded with the current
Piggybank functions some time after 0.8 is branched. The initial list of
committers is Kevin Weil (Twitter), Dmitriy Ryaboy (Twitter), Carl Steinbach
(Cloudera), and Russell Jurney (LinkedIn). Yahoo will also nominate someone.
Please send us any thoughts you might have on this subject. It was suggested
that a lot of common code might be shared with Hive UDFs, which have the
same problems as Piggybank does, and that perhaps the project can be another
collaboration point between the projects. It is not clear how that would
work; Carl will talk to other Hive people.

  • Corbin Hoenes at Aug 28, 2010 at 3:18 am
    I really like this idea. I'd like to see more sharing of UDFs out in
    the open.

    What barriers to submission are removed by this move? How does a UDF
    make it into piggybank now vs. before?

  • Milind A Bhandarkar at Aug 28, 2010 at 6:40 pm
    +1 on the direction.

    A few questions:

    1. With Pig marching towards becoming a TLP at Apache, can Piggybank become a full-fledged subproject (with its own releases and all)?
    2. Or, since the ultimate goal is to have a common UDF repository for both Pig and Hive, would it make sense to make it an incubator project, with a name that does not indicate a Pig dependency?
    3. I see parallels between Howl and the proposed Piggybank, since they aspire to become common components in both Hive and Pig distributions. What are the long-term plans for Howl as far as hosting is concerned?

    - Milind

    ________________________________________
    From: Dmitriy Ryaboy [dvryaboy@gmail.com]
    Sent: Friday, August 27, 2010 2:13 PM
    To: pig-user@hadoop.apache.org
    Subject: Request for Comments: Piggybank future

  • Dmitriy Ryaboy at Aug 29, 2010 at 9:12 pm
    Hi folks,

    I'll try to address both Corbin's and Milind's questions. This is just my
    opinion; I'm open to criticism/suggestions/corrections.

    There are several barriers that are being removed.

    First, piggybank will no longer be bound to the Pig release schedule. At the
    moment, I am not sure there will be "releases" of piggybank, as such -- we
    might just tag snapshots with their own git branches and move on. This
    allows the code to develop at a much faster pace, while possibly sacrificing
    some of the stability and permanence of Apache-style releases. I feel that
    this is ok, as piggybank was always subject to less stringent testing, and
    the attitude towards it has long been "it might work, and you might have to
    tweak it if it doesn't".

    Second, moving to GitHub makes it easy for people to cook their own versions
    of piggybank if they want to -- they just have to fork the main repository
    and apply changes as needed. The committers can pull in all, or some, of the
    changes, if they are desirable. This puts such mutations in the public view,
    as opposed to what's happening now, where they either don't happen, or
    happen on people's unseen svn exports.

    Third, this allows contributions to piggybank for older versions of Pig. At
    the moment, for example, there isn't really a way to contribute a Pig 0.6
    loader -- the current svn trunk is on the new API, so such contributions
    won't compile. Something could be contributed for a 0.6 branch, but that
    won't see the light of day unless the Pig team decides to do a 0.6.1 release,
    which is highly unlikely and kind of a maintenance nightmare. This is why,
    for example, my HBase loader changes wound up in Elephant-Bird instead of
    Pig proper -- I didn't have a good way of getting them out there otherwise.
    On GitHub, we will be able to just keep a 0.6 branch that folks using that
    version can keep moving.

    Bottom line is that we are sacrificing the benefits of a stately, strict
    Apache workflow in order to gain agility and decrease barriers to
    contribution. I personally feel that this is ok because piggybank is not so
    much a software project as a collection of individual, distinct libraries.
    It's kind of the CPAN of Pig, and no one versions all modules of CPAN in one
    go -- the whole thing would get bogged down if that were to happen. Granted,
    CPAN lets you pull down specific versions of individual modules, and this
    doesn't... but let's take it one step at a time.

    I think the bit about Hive interoperation might be a bit overstated. The
    observation was just that Hive has the same problem with user-defined
    functions, and some common code might be reused since the two projects are
    often used to achieve similar goals. So if the Hive people wanted to
    collaborate on the common bits, and put their UDFs into /hive while we put
    ours into /pig, we agreed that would be a good thing. There is no intent, at
    the moment, to build some generic UDF interface that would allow one to
    write UDFs for both Hive and Pig at once. Though that would be cool.

    -Dmitriy
  • Corbin Hoenes at Aug 31, 2010 at 2:40 pm
    All sounds reasonable; thanks for explaining the thought process.
  • Russell Jurney at Aug 31, 2010 at 7:01 pm
    I'm pretty excited about this. This removes all the pain of contributing
    UDFs.

    Russ
  • Alan Gates at Aug 30, 2010 at 5:12 pm

    On Aug 28, 2010, at 11:39 AM, Milind A Bhandarkar wrote:

    > +1 on the direction.
    >
    > A few questions:
    >
    > 1. With Pig marching towards becoming a TLP at Apache, can Piggybank
    > become a full-fledged subproject (with its own releases and all)?
    > 2. Or, since the ultimate goal is to have a common UDF repository for
    > both Pig and Hive, would it make sense to make it an incubator
    > project, with a name that does not indicate a Pig dependency?

    I agree with Dmitriy that this is not necessarily the ultimate goal.

    > 3. I see parallels between Howl and the proposed Piggybank, since they
    > aspire to become common components in both Hive and Pig
    > distributions. What are the long-term plans for Howl as far as hosting
    > is concerned?

    The stated plan with Howl has been to put it in the Incubator.

    Alan.

Discussion Overview
group: user @
categories: pig, hadoop
posted: Aug 27, '10 at 9:14p
active: Aug 31, '10 at 7:01p
posts: 7
users: 6
website: pig.apache.org
