Grokbase Groups Pig user October 2010
Hi Pig Users,

I am currently writing a UDF loader. In one of my use cases, one line in the
input stream results in multiple tuples. Has anyone encountered or solved this
issue on their end?

The getNext method as currently structured returns only a single Tuple, but I
want it to return a List<Tuple>. Let me know if there is a use case out there
like mine. I am coding it up to return List<Tuple>, which is more
flexible than returning only one tuple.

Thanks,

John

  • Alan Gates at Oct 27, 2010 at 10:47 pm
    The easiest way to do this might be to have your loader return a
    single tuple that contains bag, with all of the tuples you want to
    return in that bag. Then your next statement can be a foreach with a
    flatten to turn each of those into its own record.

    A = load 'foo' as (b:bag{});
    B = foreach A generate flatten(b);

    This way you do not have to modify Pig's internal code to make your
    use case work.

    Alan.
  • John Hui at Oct 28, 2010 at 3:36 pm
    I looked into returning a data bag as an option. The problem is that the
    loader interface requires me to return a Tuple object:

    public Tuple getNext() throws IOException {

    DataBag is not a subclass of Tuple, so this means I
    will need to change Pig's internal code for my loader to return a bag
    of tuples. Right?

    John
  • Dmitriy Ryaboy at Oct 28, 2010 at 3:42 pm
    Alan means return a tuple containing a single bag of many tuples. (Don't
    try to make Pig work with a loader that returns a bag instead of a tuple;
    you'll be up to your neck in the visitor pattern in no time if you
    start heading in that direction.)

    The alternative is to change what constitutes a record for your loader:
    use a different InputFormat/RecordReader to produce the records as
    needed, instead of feeding you lines.

    -D
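
The second alternative above (having the reader, rather than the loader, decide what a record is) can be sketched in plain Java. This is only a stand-in under stated assumptions: the class `MultiRecordReaderSketch`, its methods `next()` and `getCurrent()`, and the `;` separator are all hypothetical, and the class does not extend any Hadoop or Pig type. It only illustrates the buffering idea: split each line once, then hand out one record per call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Plain-Java stand-in for a record reader; it does NOT extend any Hadoop class.
class MultiRecordReaderSketch {
    private final Iterator<String> lines;                      // underlying line source
    private final Deque<String> pending = new ArrayDeque<>();  // records split from lines, not yet returned
    private String currentRecord;                              // record made current by the last next()

    MultiRecordReaderSketch(Iterator<String> lines) {
        this.lines = lines;
    }

    // Hypothetical splitting rule: records inside one line are separated by ';'.
    private static List<String> splitLine(String line) {
        List<String> out = new ArrayList<>();
        for (String piece : line.split(";")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // Advance to the next record; returns false once the input is exhausted.
    boolean next() {
        // Refill the buffer, skipping lines that yield no records.
        while (pending.isEmpty() && lines.hasNext()) {
            pending.addAll(splitLine(lines.next()));
        }
        if (pending.isEmpty()) {
            currentRecord = null;
            return false;
        }
        currentRecord = pending.poll();
        return true;
    }

    // The record selected by the most recent successful next().
    String getCurrent() {
        return currentRecord;
    }

    public static void main(String[] args) {
        MultiRecordReaderSketch reader =
            new MultiRecordReaderSketch(Arrays.asList("a;b;c", "", "d").iterator());
        while (reader.next()) {
            System.out.println(reader.getCurrent());
        }
    }
}
```

With a reader shaped like this, the loader sees one record per call and getNext() can keep returning a single Tuple.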
  • John Hui at Oct 28, 2010 at 3:50 pm
    If I return a single bag with many tuples, how can I split that into
    multiple tuples? Can you give me an example of how this works?

    Let me read up on the InputFormat and see if I can work my way around it.

    Why can't getNext return a generic type T instead of being coupled to the
    Tuple data type? Isn't more flexibility good in this case, given that the
    LoadFunc class was meant to be extended for different use cases?

    Thanks for all your responses. It really helps to know I'm not stuck in a
    hole all by myself!

    John
  • Alan Gates at Oct 28, 2010 at 3:50 pm

    On Oct 28, 2010, at 8:36 AM, John Hui wrote:

    > I look into the return data bag as an option. The problem is the Loader
    > interface require me to return a Tuple object.
    >
    > public Tuple getNext() throws IOException {
    >
    > but the DataBag interface is not a derive class of Tuple so this means I
    > will need to change the internal code for pig for my loader to return a
    > bag of tuples. Right?

    No. If at the end of your getNext() you have a List<Tuple> tuples,
    then return:

    return TupleFactory.getInstance().newTuple(BagFactory.getInstance().newDefaultBag(tuples));

    This will give you a tuple, which has a single field, which is a bag.
    Within that bag will be all your tuples. If your next Pig Latin
    statement is

    B = foreach A generate flatten($0);

    then B will contain each of your records as individual records.

    Alan.
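
To make the data shape in this reply concrete, here is a minimal plain-Java model of it. It deliberately does not use Pig's Tuple, DataBag, TupleFactory or BagFactory classes: `SingleBagTuple`, `parseLine`, `getNext` and `flatten` are hypothetical stand-ins, and the `key=v1,v2` line format is invented for the illustration. In a real loader, the one-line return statement above would do the wrapping, and the foreach with flatten would do the unnesting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "one tuple holding one bag"; these are NOT Pig's classes.
class BagWrapSketch {

    // Stand-in for the outer tuple: exactly one field, and that field is a bag of rows.
    static final class SingleBagTuple {
        final List<List<String>> bag;

        SingleBagTuple(List<List<String>> bag) {
            this.bag = bag;
        }
    }

    // Hypothetical parser: a line "key=v1,v2,v3" yields one (key, value) row per value.
    static List<List<String>> parseLine(String line) {
        List<List<String>> rows = new ArrayList<>();
        String[] parts = line.split("=", 2);
        if (parts.length < 2 || parts[1].isEmpty()) {
            return rows; // nothing to emit for this line
        }
        for (String value : parts[1].split(",")) {
            rows.add(Arrays.asList(parts[0], value));
        }
        return rows;
    }

    // Models the loader call: one input line in, one outer tuple (wrapping a bag) out.
    static SingleBagTuple getNext(String line) {
        return new SingleBagTuple(parseLine(line));
    }

    // Models the flatten step: every row in every bag becomes its own record.
    static List<List<String>> flatten(List<SingleBagTuple> loaded) {
        List<List<String>> out = new ArrayList<>();
        for (SingleBagTuple tuple : loaded) {
            out.addAll(tuple.bag);
        }
        return out;
    }

    public static void main(String[] args) {
        List<SingleBagTuple> loaded = new ArrayList<>();
        loaded.add(getNext("k1=a,b,c"));
        loaded.add(getNext("k2=d"));
        for (List<String> row : flatten(loaded)) {
            System.out.println(row);
        }
    }
}
```

The point of the model is that the loader still returns exactly one object per input line, so the getNext() signature never has to change; the fan-out to many records happens in the following statement.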
  • John Hui at Oct 28, 2010 at 3:52 pm
    Awesome Alan, let me try that out and see if it works.

    John

Discussion Overview
group: user @
categories: pig, hadoop
posted: Oct 27, '10 at 10:39p
active: Oct 28, '10 at 3:52p
posts: 7
users: 3
website: pig.apache.org
