Grokbase Groups Pig user March 2011
I have these "rows"

({(155495400)})
({(199027860),(199027860),(149167529),(203508790),(198488630)})
({(174255619),(201077556),(199051606),(198778302)})

I believe the correct way to describe them is that each row/tuple holds a
bag that contains tuples of size 1. Is that right?

Anyway, is there something native, or a UDF, that I can use to convert them
to this format?

(155495400)
(199027860 199027860 149167529 203508790 198488630)
(174255619 201077556 199051606 198778302)

Maybe if I explain what we are trying to do it would help.

We have logs mapping users to product views in a tab-delimited format.

foo\t1234
bar\t1234
foo\t4423
baz\t5563

We simply want product views grouped by user and output on one line.

1234 4423
1234
5563

The first line above would be from the user foo, the second from bar and the third from baz.

Thanks
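
For reference, the transformation being asked for can be sketched outside Pig. The following is a minimal plain-Java illustration, not a Pig script or UDF. The class `ViewsByUser` and the method `groupViews` are hypothetical names introduced here, and the sketch assumes every log line has the form user, tab, product id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Pig: groups "user\tproduct" log lines by user.
final class ViewsByUser {

    // Returns one space-separated line of product ids per user,
    // in the order each user first appears in the log.
    static List<String> groupViews(List<String> logLines) {
        Map<String, StringJoiner> viewsByUser = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] fields = line.split("\t", 2);
            if (fields.length < 2) {
                continue; // skip lines without a tab separator
            }
            viewsByUser
                .computeIfAbsent(fields[0], user -> new StringJoiner(" "))
                .add(fields[1]);
        }
        List<String> result = new ArrayList<>();
        for (StringJoiner views : viewsByUser.values()) {
            result.add(views.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "foo\t1234", "bar\t1234", "foo\t4423", "baz\t5563");
        groupViews(log).forEach(System.out::println);
    }
}
```

Run on the four sample log lines, this prints `1234 4423`, `1234` and `5563`, one line per user in order of first appearance.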

  • Dave butlerdi at Mar 31, 2011 at 3:54 pm
    Karmasphere Studio would be good for this. It is really a front end for
    Hadoop, but it can run on small amounts of data independently.

  • Jonathan Coveney at Mar 31, 2011 at 5:10 pm
    You can definitely do this with a UDF. You simply take the tuples as input
    and concatenate them together. Be wary of memory limits, since the
    intermediate result can get large. It may be more practical to let the
    output be a tuple whose elements are the rows:

    (199027860,199027860,149167529,203508790,198488630)

    The input to your UDF will then be a tuple whose first element is a bag, and
    the output will be a tuple of all the elements. It is quite easy to write
    something that does this; take a look at the UDF documentation and ask if
    you need any help.
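
    The concatenation described above can be sketched in plain Java. This is
    only a stand-in for the body of such a UDF, not the UDF itself: a real one
    would extend Pig's EvalFunc and read a DataBag, whereas this sketch models
    the bag as a list of single-field lists so it can run without Pig on the
    classpath. The class `BagFlattener` and its methods are hypothetical names.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.StringJoiner;

    // Plain-Java stand-in for the UDF logic; none of these types are Pig's API.
    final class BagFlattener {

        // Copies the first field of every tuple in the bag into one flat tuple.
        static List<Object> toTuple(List<List<Object>> bag) {
            List<Object> flat = new ArrayList<>(bag.size());
            for (List<Object> tuple : bag) {
                if (!tuple.isEmpty()) {
                    flat.add(tuple.get(0));
                }
            }
            return flat;
        }

        // Joins the same values into one line using the given delimiter.
        static String toLine(List<List<Object>> bag, String delimiter) {
            StringJoiner line = new StringJoiner(delimiter);
            for (Object value : toTuple(bag)) {
                line.add(String.valueOf(value));
            }
            return line.toString();
        }

        public static void main(String[] args) {
            List<List<Object>> bag = Arrays.asList(
                Arrays.<Object>asList(174255619L),
                Arrays.<Object>asList(201077556L),
                Arrays.<Object>asList(199051606L),
                Arrays.<Object>asList(198778302L));
            System.out.println(toLine(bag, " "));
        }
    }
    ```

    Both methods hold every value of one bag in memory at once, which is the
    memory concern raised above for very large groups.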

  • Mark at Apr 1, 2011 at 2:30 pm
    I created the following:

    http://pastie.org/1743857

    And I'm using it in the following way:

    register 'target/pig-1.0-SNAPSHOT.jar';
    rows = LOAD 'foo' AS (user:chararray, item:long);
    grouped = GROUP rows BY user;
    -- rows.item is the bag of item values collected for each user
    final = FOREACH grouped GENERATE FLATTEN(com.mycompany.pig.udf.Foo(rows.item));

    Does that look about right? Is there any particular reason why I need to
    flatten at the end? When I try to output a simple tuple from the
    EvalFunc it is always a tuple inside a tuple.

    Thanks

  • Dmitriy Ryaboy at Apr 1, 2011 at 5:09 pm
    Right, Pig always returns a Tuple that contains whatever your UDF returns --
    so if you return a string, it returns a Tuple with a String in it.
    Unfortunately that also means that if you return a Tuple, you get a Tuple in
    a Tuple.

    We probably shouldn't do that, but at this point changing the behavior would
    break a lot of people's existing Pig code :(.

    D
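
    The nesting described above can be pictured with a toy model in plain Java.
    This is not Pig's API or implementation: tuples are modeled as lists, and
    `TupleWrapping`, `wrap` and `flatten` are hypothetical names used only to
    show why a returned tuple ends up one level deeper and how flattening
    removes that level.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Toy model of the nesting only; none of this is Pig's API.
    final class TupleWrapping {

        // Models the wrapping: the UDF result becomes the single field
        // of an enclosing tuple.
        static List<Object> wrap(Object udfResult) {
            List<Object> outer = new ArrayList<>();
            outer.add(udfResult);
            return outer;
        }

        // Models flattening one level: any field that is itself a tuple
        // is replaced by its own fields.
        static List<Object> flatten(List<Object> tuple) {
            List<Object> out = new ArrayList<>();
            for (Object field : tuple) {
                if (field instanceof List) {
                    out.addAll((List<?>) field);
                } else {
                    out.add(field);
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<Object> udfResult = Arrays.<Object>asList("val1", "val2", "val3");
            List<Object> wrapped = wrap(udfResult);
            System.out.println(wrapped);          // [[val1, val2, val3]]
            System.out.println(flatten(wrapped)); // [val1, val2, val3]
        }
    }
    ```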
  • Mark at Apr 1, 2011 at 7:14 pm
    How would I return a list of values?

    (val1, val2, val3...)

    I tried returning a List<Object>; however, I get a tuple that contains
    a tuple with a list of values, and I have to flatten it to get the
    desired behavior.

    ((val1, val2, val3...))

    Thanks

Discussion Overview
group: user @
categories: pig, hadoop
posted: Mar 31, '11 at 3:50p
active: Apr 1, '11 at 7:14p
posts: 6
users: 4
website: pig.apache.org