Grokbase Groups Pig user April 2012
Hi,
I'm storing data into a partitioned table using Hive in RCFile format,
but I want to use Pig to do the aggregation of that data.

In my array<string> column in Hive, I have colon-delimited data, e.g.

:0:12:21:99:

With the lateral view and explode functions in Hive, I can output each value
as a separate row.

In Pig, I think I need to use FLATTEN, but it just outputs the array as a
single field, and I can't see where to specify that the colon is the
value separator.

register /opt/pig/trunk/bin/piggybank.jar
mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
    org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array<string>');
opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id;
dump opt;



Thanks

Malc


  • Norbert Burger at Apr 6, 2012 at 3:01 pm
    Malcolm -- typically, you'd use STRSPLIT and an optional FLATTEN to tokenize
    a chararray on some delimiter. So the following should work:

    opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as s_seg_id;

    Norbert

  • Malcolm Tye at Apr 11, 2012 at 10:59 pm
    Hi Norbert,
    I don't seem to be getting what I'm after. If my data looks like this

    1133957209,61:0:1
    4524524233,21:0

    I want to produce

    1133957209,61
    1133957209,0
    1133957209,1
    4524524233,21
    4524524233,0

    I changed the LOAD statement to

    mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
        org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID string,seg_ids array');
    opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as s_seg_id;

    I don't seem to be getting the cross product, just something like the
    following

    1133957209,61,0,1
    4524524233,21,0

    Any ideas ?


    Thanks

    Malc


    -----Original Message-----
    From: Norbert Burger
    Sent: 06 April 2012 16:01
    To: user@pig.apache.org
    Subject: Re: "Exploding" a Hive array<string> in Pig from an RCFile


  • Norbert Burger at Apr 12, 2012 at 3:14 am
    A little wonky, but try wrapping the flattened tuple elements in a bag, and
    then re-flattening that:

    A = LOAD 'test.txt' USING PigStorage(',') AS
    (C_SUB_ID:chararray,seg_ids:chararray);
    B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
    C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));

    Only flattened bags produce the columns-to-rows transformation that you're
    trying to make. Flattened tuples, on the other hand, simply explode the
    tuple into its component elements as extra columns, without creating the
    multiple rows ("cross product") in your relation. A custom UDF would be
    another option here.

    Norbert
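
The tuple-versus-bag distinction described above can be modelled outside Pig. The following is a minimal sketch in Python, not Pig Latin: the helper names (strsplit, to_bag, flatten_tuple, flatten_bag) are hypothetical stand-ins for the Pig built-ins, and the sample rows are the two records from earlier in the thread.

```python
from typing import List, Tuple

def strsplit(value: str, delim: str) -> Tuple[str, ...]:
    """Stand-in for STRSPLIT: returns ONE tuple holding all tokens.
    Assumption: a literal delimiter (Pig's STRSPLIT takes a regex)."""
    return tuple(value.split(delim))

def to_bag(*fields: str) -> List[Tuple[str]]:
    """Stand-in for TOBAG: wraps each field in its own single-field tuple."""
    return [(f,) for f in fields]

def flatten_tuple(key: str, tup: Tuple[str, ...]) -> List[Tuple[str, ...]]:
    """FLATTEN on a tuple: fields become extra columns; still one row."""
    return [(key, *tup)]

def flatten_bag(key: str, bag: List[Tuple[str]]) -> List[Tuple[str, ...]]:
    """FLATTEN on a bag: one output row per tuple in the bag."""
    return [(key, *tup) for tup in bag]

# The two sample records from the thread.
rows = [("1133957209", "61:0:1"), ("4524524233", "21:0")]

# FLATTEN(STRSPLIT(...)): one wide row per input record.
as_columns = [out
              for key, segs in rows
              for out in flatten_tuple(key, strsplit(segs, ":"))]

# FLATTEN(TOBAG(...)) over the split fields: one row per segment id.
as_rows = [out
           for key, segs in rows
           for out in flatten_bag(key, to_bag(*strsplit(segs, ":")))]

for r in as_columns:
    print(",".join(r))
for r in as_rows:
    print(",".join(r))
```

The first loop prints the wide rows Malcolm reported (for example 1133957209,61,0,1); the second prints one row per segment id, which is the output he wanted.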

  • Aniket Mokashi at Apr 12, 2012 at 9:39 am
    Hi Malcolm,

    Arrays are converted to tuples, and FLATTEN should work on them directly. I
    don't think you need to worry about the delimiter (assuming Hive knows how
    to deserialize it). By the way, does RCFile require a delimiter to store
    arrays? I am not sure about that.

    Thanks,
    Aniket


    --
    "...:::Aniket:::... Quetzalco@tl"
  • Malcolm Tye at May 3, 2012 at 12:30 pm
    Hi Norbert,
    Thanks for your answer. I'm just documenting the problems I experienced
    and will reply to the list soon with a detailed answer.


    Thanks for your help


    Malc



Discussion Overview
group: user @
categories: pig, hadoop
posted: Apr 5, '12 at 12:59p
active: May 3, '12 at 12:30p
posts: 6
users: 3
website: pig.apache.org
