Grokbase Groups Pig user May 2011
FAQ
I am sorry if this has been asked in the past. I can't seem to find
information on it.

I have two questions, but they are somewhat related.

#1) Let's say you are tracking messages and extracting the hash tags from
the message and storing them as one field (#hash1#hash2#hash3). This means
you might have a line that looks something like the following:
2343 2011-05-06T03:04:00.000Z username
some+message+goes+here#with+#hash+#tags #with#hash#tags some other
info

How can I get the # of tweets per hash tag? Also, how can I get the # of
tweets per user per hash tag?
I know I can use the STRSPLIT function to split on '#'. That will give me a
bag of hash tags. How can I then group by these such that each hash tag has
a set of tweets?


#2) Let's say you have a field that has a fairly small, but still unknown
number of unique values (say between 20-5). I know I can group by these
fields to get a count by doing something like so:

A = LOAD '/some/dir' Using PigStorage (date, directive);

B = GROUP A by (date, directive);

C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);

But now I want to end up something like the following:

2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
1983

If I knew the directives ahead of time, I know I can do something like the
following:

D = GROUP C BY date;

E = FOREACH D {
DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
}

But how do I do this w/o having to hardcode the filters? Am I thinking about
this all wrong?

Thanks very much for you help,
Christian

Search Discussions

  • Xiaomeng Wan at May 6, 2011 at 9:29 pm
    you can group on group, like this:

    A = LOAD '/some/dir' Using PigStorage (date, directive);

    B = GROUP A by (date, directive);

    C = FOREACH B GENERATE FLATTEN(group) as (date, directive), COUNT(A) as cnt;

    D = group c by date;

    E = foreach D generate group as date, c.(directive,cnt) as cnts;

    Shawn
    On Fri, May 6, 2011 at 3:14 PM, Christian wrote:
    I am sorry if this has been asked in the past. I can't seem to find
    information on it.

    I have two questions, but they are somewhat related.

    #1) Let's say you are tracking messages and extracting the hash tags from
    the message and storing them as one field (#hash1#hash2#hash3). This means
    you might have a line that looks something like the following:
    2343    2011-05-06T03:04:00.000Z    username
    some+message+goes+here#with+#hash+#tags    #with#hash#tags   some    other
    info

    How can I get the # of tweets per hash tag? Also, how can I get the # of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That will give me a
    bag of hash tags. How can I then group by these such that each hash tag has
    a set of tweets?


    #2) Let's say you have a field that has a fairly small, but still unknown
    number of unique values (say between 20-5). I know I can group by these
    fields to get a count by doing something like so:

    A = LOAD '/some/dir' Using PigStorage (date, directive);

    B = GROUP A by (date, directive);

    C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);

    But now I want to end up something like the following:

    2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do something like the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I thinking about
    this all wrong?

    Thanks very much for you help,
    Christian
  • Jacob at May 6, 2011 at 9:30 pm
    Christian,

    I've answered inline:
    On Fri, 2011-05-06 at 15:14 -0600, Christian wrote:
    I am sorry if this has been asked in the past. I can't seem to find
    information on it.

    I have two questions, but they are somewhat related.

    #1) Let's say you are tracking messages and extracting the hash tags from
    the message and storing them as one field (#hash1#hash2#hash3). This means
    you might have a line that looks something like the following:
    2343 2011-05-06T03:04:00.000Z username
    some+message+goes+here#with+#hash+#tags #with#hash#tags some other
    info

    How can I get the # of tweets per hash tag? Also, how can I get the # of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That will give me a
    bag of hash tags. How can I then group by these such that each hash tag has
    a set of tweets?
    You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
    the hashtag itself.

    #2) Let's say you have a field that has a fairly small, but still unknown
    number of unique values (say between 20-5). I know I can group by these
    fields to get a count by doing something like so:

    A = LOAD '/some/dir' Using PigStorage (date, directive);

    B = GROUP A by (date, directive);

    C = FOREACH B GENERATE FLATTEN(group), COUNT(A.date);

    But now I want to end up something like the following:

    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do something like the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date), 'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I thinking about
    this all wrong?
    It's really a matter of how you structure your data ahead of time.
    Imagine the data looking like this instead (call it X):

    201101,directive1
    201101,directive1
    201101,directive2
    201101,directive2
    201101,directive2
    201101,directive3
    201102,directive2
    201102,directive4
    201103,directive1


    then, a simple:

    Y = GROUP X BY (date,directive);
    Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
    num_occurrences;

    would result in:

    201101,directive1,2
    201101,directive2,3
    201101,directive3,1
    201102,directive2,1
    201102,directive4,1
    201103,directive1,1

    At least, that's what it _seems_ like you're asking for.


    --jacob
    @thedatachef

    Thanks very much for you help,
    Christian
  • Christian at May 6, 2011 at 9:39 pm

    #1) Let's say you are tracking messages and extracting the hash tags from
    the message and storing them as one field (#hash1#hash2#hash3). This means
    you might have a line that looks something like the following:
    2343 2011-05-06T03:04:00.000Z username
    some+message+goes+here#with+#hash+#tags #with#hash#tags some other
    info

    How can I get the # of tweets per hash tag? Also, how can I get the # of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That will give me a
    bag of hash tags. How can I then group by these such that each hash tag has
    a set of tweets?
    You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
    the hashtag itself.
    If each message has an unknown number of hashtags, will a 'FLATTEN' given me
    an unknown # of fields? If so, how do I know which field to group by? I
    don't want to group by messages that have the exact hash tags. I want all
    messages that have one of the hash tags.

    But now I want to end up something like the following:
    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do something like the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
    'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I thinking about
    this all wrong?
    It's really a matter of how you structure your data ahead of time.
    Imagine the data looking like this instead (call it X):

    201101,directive1
    201101,directive1
    201101,directive2
    201101,directive2
    201101,directive2
    201101,directive3
    201102,directive2
    201102,directive4
    201103,directive1

    This is how my data looks (row and column wise)
    then, a simple:

    Y = GROUP X BY (date,directive);
    Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
    num_occurrences;

    would result in:

    201101,directive1,2
    201101,directive2,3
    201101,directive3,1
    201102,directive2,1
    201102,directive4,1
    201103,directive1,1

    At least, that's what it _seems_ like you're asking for.

    I've gotten that far. I'm actually asking for the being able to put those
    into columns and not rows.
    --jacob
    @thedatachef

    Thanks Jacob!
    -Christian
    Thanks very much for you help,
    Christian
  • Jacob at May 6, 2011 at 10:00 pm

    On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:
    #1) Let's say you are tracking messages and extracting the hash tags from
    the message and storing them as one field (#hash1#hash2#hash3). This means
    you might have a line that looks something like the following:
    2343 2011-05-06T03:04:00.000Z username
    some+message+goes+here#with+#hash+#tags #with#hash#tags some other
    info

    How can I get the # of tweets per hash tag? Also, how can I get the # of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That will give me a
    bag of hash tags. How can I then group by these such that each hash tag has
    a set of tweets?
    You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
    the hashtag itself.
    If each message has an unknown number of hashtags, will a 'FLATTEN' given me
    an unknown # of fields? If so, how do I know which field to group by? I
    don't want to group by messages that have the exact hash tags. I want all
    messages that have one of the hash tags.
    Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
    NOT a bag. If you could get a bag then you could do the following (I'm
    throwing out some fields for now):

    A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
    B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
    hashtag;
    C = GROUP B BY hastag;

    Then C will contain a key (the hashtag) and a bag containing all the
    tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
    to do the same as STRSPLIT but that returns a bag instead.

    ie.

    #foobar tweet text,#foobar
    this tweet has #two #hashtags,#two#hashtags
    another #foobar tweet,#foobar

    will yield:

    #foobar, {(#foobar tweet text, #foobar),(another #foobar tweet,
    #foobar)}
    #two, {(this tweet has #two #hashtags, #two)}
    #hashtags, {(this tweet has #two #hashtags, #hashtags)}

    But now I want to end up something like the following:
    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do something like the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
    'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I thinking about
    this all wrong?
    It's really a matter of how you structure your data ahead of time.
    Imagine the data looking like this instead (call it X):

    201101,directive1
    201101,directive1
    201101,directive2
    201101,directive2
    201101,directive2
    201101,directive3
    201102,directive2
    201102,directive4
    201103,directive1

    This is how my data looks (row and column wise)
    then, a simple:

    Y = GROUP X BY (date,directive);
    Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
    num_occurrences;

    would result in:

    201101,directive1,2
    201101,directive2,3
    201101,directive3,1
    201102,directive2,1
    201102,directive4,1
    201103,directive1,1

    At least, that's what it _seems_ like you're asking for.

    I've gotten that far. I'm actually asking for the being able to put those
    into columns and not rows.
    --jacob
    @thedatachef

    Thanks Jacob!
    -Christian
    Thanks very much for you help,
    Christian
  • Christian at May 6, 2011 at 10:06 pm
    Thank you for taking the time to explain this to me Jacob!

    Am I stuck with hard-coding for my other question?

    Instead of:
    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
    1983
    --
    2011-05-01 32423 3433 1983

    would also do as long as I could count on the column order.


    Thanks,
    Christian
    On Fri, May 6, 2011 at 3:59 PM, jacob wrote:
    On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:

    #1) Let's say you are tracking messages and extracting the hash tags
    from
    the message and storing them as one field (#hash1#hash2#hash3). This means
    you might have a line that looks something like the following:
    2343 2011-05-06T03:04:00.000Z username
    some+message+goes+here#with+#hash+#tags #with#hash#tags some other
    info

    How can I get the # of tweets per hash tag? Also, how can I get the #
    of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That will
    give me
    a
    bag of hash tags. How can I then group by these such that each hash
    tag
    has
    a set of tweets?
    You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
    the hashtag itself.
    If each message has an unknown number of hashtags, will a 'FLATTEN' given me
    an unknown # of fields? If so, how do I know which field to group by? I
    don't want to group by messages that have the exact hash tags. I want all
    messages that have one of the hash tags.
    Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
    NOT a bag. If you could get a bag then you could do the following (I'm
    throwing out some fields for now):

    A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
    B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
    hashtag;
    C = GROUP B BY hastag;

    Then C will contain a key (the hashtag) and a bag containing all the
    tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
    to do the same as STRSPLIT but that returns a bag instead.

    ie.

    #foobar tweet text,#foobar
    this tweet has #two #hashtags,#two#hashtags
    another #foobar tweet,#foobar

    will yield:

    #foobar, {(#foobar tweet text, #foobar),(another #foobar tweet,
    #foobar)}
    #two, {(this tweet has #two #hashtags, #two)}
    #hashtags, {(this tweet has #two #hashtags, #hashtags)}

    But now I want to end up something like the following:
    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do something
    like
    the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
    'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I
    thinking
    about
    this all wrong?
    It's really a matter of how you structure your data ahead of time.
    Imagine the data looking like this instead (call it X):

    201101,directive1
    201101,directive1
    201101,directive2
    201101,directive2
    201101,directive2
    201101,directive3
    201102,directive2
    201102,directive4
    201103,directive1

    This is how my data looks (row and column wise)
    then, a simple:

    Y = GROUP X BY (date,directive);
    Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
    num_occurrences;

    would result in:

    201101,directive1,2
    201101,directive2,3
    201101,directive3,1
    201102,directive2,1
    201102,directive4,1
    201103,directive1,1

    At least, that's what it _seems_ like you're asking for.

    I've gotten that far. I'm actually asking for the being able to put
    those
    into columns and not rows.
    --jacob
    @thedatachef

    Thanks Jacob!
    -Christian
    Thanks very much for you help,
    Christian
  • Jacob at May 6, 2011 at 10:14 pm

    On Fri, 2011-05-06 at 16:06 -0600, Christian wrote:
    Thank you for taking the time to explain this to me Jacob!

    Am I stuck with hard-coding for my other question?

    Instead of:
    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
    1983
    --
    2011-05-01 32423 3433 1983

    would also do as long as I could count on the column order.
    The best you can do here is to do a GROUP BY on the date (like Xiaomeng
    Wan indicated) after you get the counts. Then, I think theres a UDF in
    contrib that allows you to go from a bag to a tuple.

    Best of luck,

    --jacob
    @thedatachef
    Thanks,
    Christian
    On Fri, May 6, 2011 at 3:59 PM, jacob wrote:
    On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:

    #1) Let's say you are tracking messages and extracting the hash tags
    from
    the message and storing them as one field (#hash1#hash2#hash3). This means
    you might have a line that looks something like the following:
    2343 2011-05-06T03:04:00.000Z username
    some+message+goes+here#with+#hash+#tags #with#hash#tags some other
    info

    How can I get the # of tweets per hash tag? Also, how can I get the #
    of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That will
    give me
    a
    bag of hash tags. How can I then group by these such that each hash
    tag
    has
    a set of tweets?
    You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
    the hashtag itself.
    If each message has an unknown number of hashtags, will a 'FLATTEN' given me
    an unknown # of fields? If so, how do I know which field to group by? I
    don't want to group by messages that have the exact hash tags. I want all
    messages that have one of the hash tags.
    Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
    NOT a bag. If you could get a bag then you could do the following (I'm
    throwing out some fields for now):

    A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
    B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
    hashtag;
    C = GROUP B BY hastag;

    Then C will contain a key (the hashtag) and a bag containing all the
    tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
    to do the same as STRSPLIT but that returns a bag instead.

    ie.

    #foobar tweet text,#foobar
    this tweet has #two #hashtags,#two#hashtags
    another #foobar tweet,#foobar

    will yield:

    #foobar, {(#foobar tweet text, #foobar),(another #foobar tweet,
    #foobar)}
    #two, {(this tweet has #two #hashtags, #two)}
    #hashtags, {(this tweet has #two #hashtags, #hashtags)}

    But now I want to end up something like the following:
    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do something
    like
    the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
    'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I
    thinking
    about
    this all wrong?
    It's really a matter of how you structure your data ahead of time.
    Imagine the data looking like this instead (call it X):

    201101,directive1
    201101,directive1
    201101,directive2
    201101,directive2
    201101,directive2
    201101,directive3
    201102,directive2
    201102,directive4
    201103,directive1

    This is how my data looks (row and column wise)
    then, a simple:

    Y = GROUP X BY (date,directive);
    Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
    num_occurrences;

    would result in:

    201101,directive1,2
    201101,directive2,3
    201101,directive3,1
    201102,directive2,1
    201102,directive4,1
    201103,directive1,1

    At least, that's what it _seems_ like you're asking for.

    I've gotten that far. I'm actually asking for the being able to put
    those
    into columns and not rows.
    --jacob
    @thedatachef

    Thanks Jacob!
    -Christian
    Thanks very much for you help,
    Christian
  • Dmitriy Ryaboy at May 8, 2011 at 1:42 am
    FWIW -- the reason STRSPLIT returns a Tuple is that the more common
    case is thought to be splitting a string of a known format and trying
    to get some part of it.

    so, "foreach address_book generate STRSPLIT(phone_number, '-') as
    (area_code, top_3, bottom_4);"

    RegexExtractAll (whatever it's called these days) should return a bag, iirc.

    D
    On Fri, May 6, 2011 at 2:59 PM, jacob wrote:
    On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:

    #1) Let's say you are tracking messages and extracting the hash tags from
    the message and storing them as one field (#hash1#hash2#hash3). This means
    you might have a line that looks something like the following:
    2343    2011-05-06T03:04:00.000Z    username
    some+message+goes+here#with+#hash+#tags    #with#hash#tags   some other
    info

    How can I get the # of tweets per hash tag? Also, how can I get the # of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That will give me a
    bag of hash tags. How can I then group by these such that each hash tag has
    a set of tweets?
    You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
    the hashtag itself.
    If each message has an unknown number of hashtags, will a 'FLATTEN' given me
    an unknown # of fields? If so, how do I know which field to group by? I
    don't want to group by messages that have the exact hash tags. I want all
    messages that have one of the hash tags.
    Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
    NOT a bag. If you could get a bag then you could do the following (I'm
    throwing out some fields for now):

    A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
    B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
    hashtag;
    C = GROUP B BY hastag;

    Then C will contain a key (the hashtag) and a bag containing all the
    tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
    to do the same as STRSPLIT but that returns a bag instead.

    ie.

    #foobar tweet text,#foobar
    this tweet has #two #hashtags,#two#hashtags
    another #foobar tweet,#foobar

    will yield:

    #foobar,   {(#foobar tweet text, #foobar),(another #foobar tweet,
    #foobar)}
    #two,      {(this tweet has #two #hashtags, #two)}
    #hashtags, {(this tweet has #two #hashtags, #hashtags)}

    But now I want to end up something like the following:
    2011-05-01    DIRECTIVE1    32423    DIRECTIVE2    3433    DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do something like the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
    'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I thinking about
    this all wrong?
    It's really a matter of how you structure your data ahead of time.
    Imagine the data looking like this instead (call it X):

    201101,directive1
    201101,directive1
    201101,directive2
    201101,directive2
    201101,directive2
    201101,directive3
    201102,directive2
    201102,directive4
    201103,directive1

    This is how my data looks (row and column wise)
    then, a simple:

    Y = GROUP X BY (date,directive);
    Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
    num_occurrences;

    would result in:

    201101,directive1,2
    201101,directive2,3
    201101,directive3,1
    201102,directive2,1
    201102,directive4,1
    201103,directive1,1

    At least, that's what it _seems_ like you're asking for.

    I've gotten that far. I'm actually asking for the being able to put those
    into columns and not rows.
    --jacob
    @thedatachef

    Thanks Jacob!
    -Christian
    Thanks very much for you help,
    Christian
  • Jacob Perkins at May 8, 2011 at 2:56 am
    Dmitriy,

    I see your point. It would definitely be nice to have a builtin for
    returning a bag though. I'd actually be happy if
    TOBAG(FLATTEN(STRSPLIT(X,','))) worked.

    --jacob
    @thedatachef
    On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
    FWIW -- the reason STRSPLIT returns a Tuple is that the more common
    case is thought to be splitting a string of a known format and trying
    to get some part of it.

    so, "foreach address_book generate STRSPLIT(phone_number, '-') as
    (area_code, top_3, bottom_4);"

    RegexExtractAll (whatever it's called these days) should return a bag, iirc.

    D
    On Fri, May 6, 2011 at 2:59 PM, jacob wrote:
    On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:

    #1) Let's say you are tracking messages and extracting the hash tags from
    the message and storing them as one field (#hash1#hash2#hash3). This means
    you might have a line that looks something like the following:
    2343 2011-05-06T03:04:00.000Z username
    some+message+goes+here#with+#hash+#tags #with#hash#tags some other
    info

    How can I get the # of tweets per hash tag? Also, how can I get the # of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That will give me a
    bag of hash tags. How can I then group by these such that each hash tag has
    a set of tweets?
    You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP BY' on
    the hashtag itself.
    If each message has an unknown number of hashtags, will a 'FLATTEN' given me
    an unknown # of fields? If so, how do I know which field to group by? I
    don't want to group by messages that have the exact hash tags. I want all
    messages that have one of the hash tags.
    Oh, that's right, STRSPLIT (rather uselessly) yields a nested tuple and
    NOT a bag. If you could get a bag then you could do the following (I'm
    throwing out some fields for now):

    A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
    B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
    hashtag;
    C = GROUP B BY hastag;

    Then C will contain a key (the hashtag) and a bag containing all the
    tweets with that hashtag. You'll have to write 'MySplittingUDF' yourself
    to do the same as STRSPLIT but that returns a bag instead.

    ie.

    #foobar tweet text,#foobar
    this tweet has #two #hashtags,#two#hashtags
    another #foobar tweet,#foobar

    will yield:

    #foobar, {(#foobar tweet text, #foobar),(another #foobar tweet,
    #foobar)}
    #two, {(this tweet has #two #hashtags, #two)}
    #hashtags, {(this tweet has #two #hashtags, #hashtags)}

    But now I want to end up something like the following:
    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433 DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do something like the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
    'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I thinking about
    this all wrong?
    It's really a matter of how you structure your data ahead of time.
    Imagine the data looking like this instead (call it X):

    201101,directive1
    201101,directive1
    201101,directive2
    201101,directive2
    201101,directive2
    201101,directive3
    201102,directive2
    201102,directive4
    201103,directive1

    This is how my data looks (row and column wise)
    then, a simple:

    Y = GROUP X BY (date,directive);
    Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive), COUNT(X) AS
    num_occurrences;

    would result in:

    201101,directive1,2
    201101,directive2,3
    201101,directive3,1
    201102,directive2,1
    201102,directive4,1
    201103,directive1,1

    At least, that's what it _seems_ like you're asking for.

    I've gotten that far. I'm actually asking for the being able to put those
    into columns and not rows.
    --jacob
    @thedatachef

    Thanks Jacob!
    -Christian
    Thanks very much for you help,
    Christian
  • Alan Gates at May 10, 2011 at 9:34 pm
    TOKENIZE takes a string and returns a bag. It's issue is right now it
    only allows you to split on whitespace. It would make sense to
    generalize this to take a delimiter.

    Alan.
    On May 7, 2011, at 7:55 PM, Jacob Perkins wrote:

    Dmitriy,

    I see your point. It would definitely be nice to have a builtin for
    returning a bag though. I'd actually be happy if
    TOBAG(FLATTEN(STRSPLIT(X,','))) worked.

    --jacob
    @thedatachef
    On Sat, 2011-05-07 at 18:41 -0700, Dmitriy Ryaboy wrote:
    FWIW -- the reason STRSPLIT returns a Tuple is that the more common
    case is thought to be splitting a string of a known format and trying
    to get some part of it.

    so, "foreach address_book generate STRSPLIT(phone_number, '-') as
    (area_code, top_3, bottom_4);"

    RegexExtractAll (whatever it's called these days) should return a
    bag, iirc.

    D

    On Fri, May 6, 2011 at 2:59 PM, jacob <jacob.a.perkins@gmail.com>
    wrote:
    On Fri, 2011-05-06 at 15:38 -0600, Christian wrote:

    #1) Let's say you are tracking messages and extracting the hash
    tags from
    the message and storing them as one field (#hash1#hash2#hash3).
    This means
    you might have a line that looks something like the following:
    2343 2011-05-06T03:04:00.000Z username
    some+message+goes+here#with+#hash+#tags #with#hash#tags some other
    info

    How can I get the # of tweets per hash tag? Also, how can I get
    the # of
    tweets per user per hash tag?
    I know I can use the STRSPLIT function to split on '#'. That
    will give me a
    bag of hash tags. How can I then group by these such that each
    hash tag has
    a set of tweets?
    You will need to 'FLATTEN' the bag of hashtags then do a 'GROUP
    BY' on
    the hashtag itself.
    If each message has an unknown number of hashtags, will a
    'FLATTEN' given me
    an unknown # of fields? If so, how do I know which field to group
    by? I
    don't want to group by messages that have the exact hash tags. I
    want all
    messages that have one of the hash tags.
    Oh, that's right, STRSPLIT (rather uselessly) yields a nested
    tuple and
    NOT a bag. If you could get a bag then you could do the following
    (I'm
    throwing out some fields for now):

    A = LOAD 'tweets_and_meta' AS (text:chararray, hashtags:chararray);
    B = FOREACH A GENERATE text, FLATTEN(MySplittingUDF(hashtags)) AS
    hashtag;
    C = GROUP B BY hastag;

    Then C will contain a key (the hashtag) and a bag containing all the
    tweets with that hashtag. You'll have to write 'MySplittingUDF'
    yourself
    to do the same as STRSPLIT but that returns a bag instead.

    ie.

    #foobar tweet text,#foobar
    this tweet has #two #hashtags,#two#hashtags
    another #foobar tweet,#foobar

    will yield:

    #foobar, {(#foobar tweet text, #foobar),(another #foobar tweet,
    #foobar)}
    #two, {(this tweet has #two #hashtags, #two)}
    #hashtags, {(this tweet has #two #hashtags, #hashtags)}

    But now I want to end up something like the following:
    2011-05-01 DIRECTIVE1 32423 DIRECTIVE2 3433
    DIRECTIVE3
    1983

    If I knew the directives ahead of time, I know I can do
    something like the
    following:

    D = GROUP C BY date;

    E = FOREACH D {
    DIRECTIVE1 = FILTER type_count by directive == 'DIRECTIVE1';
    DIRECTIVE2 = FILTER type_count by directive == 'DIRECTIVE2';
    DIRECTIVE3 = FILTER type_count by directive == 'DIRECTIVE3';
    GENERATE group, 'DIRECTIVE1', COUNT(DIRECTIVE1.date),
    'DIRECTIVE2',
    COUNT(DIRECTIVE2.date), 'DIRECTIVE3', COUNT(DIRECTIVE3.date);
    }

    But how do I do this w/o having to hardcode the filters? Am I
    thinking about
    this all wrong?
    It's really a matter of how you structure your data ahead of time.
    Imagine the data looking like this instead (call it X):

    201101,directive1
    201101,directive1
    201101,directive2
    201101,directive2
    201101,directive2
    201101,directive3
    201102,directive2
    201102,directive4
    201103,directive1

    This is how my data looks (row and column wise)
    then, a simple:

    Y = GROUP X BY (date,directive);
    Z = FOREACH Y GENERATE FLATTEN(group) AS (date,directive),
    COUNT(X) AS
    num_occurrences;

    would result in:

    201101,directive1,2
    201101,directive2,3
    201101,directive3,1
    201102,directive2,1
    201102,directive4,1
    201103,directive1,1

    At least, that's what it _seems_ like you're asking for.

    I've gotten that far. I'm actually asking for the being able to
    put those
    into columns and not rows.
    --jacob
    @thedatachef

    Thanks Jacob!
    -Christian
    Thanks very much for you help,
    Christian

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedMay 6, '11 at 9:14p
activeMay 10, '11 at 9:34p
posts10
users5
websitepig.apache.org

People

Translate

site design / logo © 2021 Grokbase