Grokbase Groups Pig user July 2011
I have a dataset where each tuple is a term. I then do two filter
operations: one to find all terms that have numbers, and one to find all
terms that don't have numbers.

Oddly, there are some terms that don't fit into either group (not really sure
how). So at this point I have 3 bags: all terms, terms with numbers, and
terms without numbers.

What I'm trying to find out is which terms are in the list of all terms but
are not in either of the two filtered bags. I thought I'd use the DIFF
function, but it only compares two bags within the same tuple. So
somehow I think I need to create a new relation that has all three bags at
the same level (in the same row?). Then I could use the DIFF function.

Any ideas?

The script I have so far is shown below...

terms = LOAD 'terms' AS (term:chararray, min:float, max:float, count:int);
terms = FOREACH terms GENERATE term;

-- bag of all terms
allTerms = GROUP terms ALL;

-- bag of terms without numbers
nonNumbers = FILTER terms BY NOT (term MATCHES '^.*[0-9].*$');
nonNumbers = GROUP nonNumbers ALL;

-- bag of terms with numbers
withNumbers = FILTER terms BY (term MATCHES '^.*[0-9].*$');
withNumbers = GROUP withNumbers ALL;
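
One possible explanation for the terms that land in neither group, offered as a guess since the data isn't shown: if term is null (for example, an empty field in the input), MATCHES evaluates to null rather than true or false, NOT of null is still null, and FILTER only keeps tuples whose condition is true. A minimal sketch for checking this, where nullTerms and nullCount are hypothetical aliases not in the original script:

```pig
-- Assumes `terms` is the single-column relation from the script above.
-- A null term fails both MATCHES filters, so isolate those tuples directly.
nullTerms = FILTER terms BY term IS NULL;

-- COUNT_STAR counts tuples even when their fields are null
-- (COUNT would skip tuples whose first field is null).
nullTerms = GROUP nullTerms ALL;
nullCount = FOREACH nullTerms GENERATE COUNT_STAR(nullTerms);

DUMP nullCount;
```

If there are no null terms, the GROUP ... ALL produces no rows and the DUMP prints nothing, which would rule this explanation out.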


  • William Dowling at Jul 7, 2011 at 1:42 pm
    You could use two rounds of the outer-join/filter-by-null idiom. For example, after the first round you would get allTermsMinusNonNumbers like this:

    grunt> sh cat allTerms
    aa
    bb
    cc
    11
    22
    33
    grunt> sh cat nonNumbers
    cc
    grunt> allTerms = load 'allTerms' as (term:chararray);
    grunt> nonNumbers = load 'nonNumbers' as (term:chararray);
    grunt> j1 = join allTerms by term left outer, nonNumbers by term;
    grunt> allTermsMinusNonNumbers = filter j1 by nonNumbers::term is null;
    grunt>
    grunt> dump allTermsMinusNonNumbers

    (11,)
    (22,)
    (33,)
    (aa,)
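
The second round could follow the same pattern. This is only a sketch under stated assumptions: the file 'withNumbers' and the aliases remaining, j2, unmatched, and leftovers are hypothetical names that do not appear in the session above.

```pig
-- Assumes allTermsMinusNonNumbers from the first round above, plus a
-- hypothetical file 'withNumbers' holding one term per line.
withNumbers = load 'withNumbers' as (term:chararray);

-- Keep only the left-hand term so the second join has a simple key name.
remaining = foreach allTermsMinusNonNumbers generate allTerms::term as term;

-- Second round: left outer join, then keep rows with no match on the right.
j2 = join remaining by term left outer, withNumbers by term;
unmatched = filter j2 by withNumbers::term is null;
leftovers = foreach unmatched generate remaining::term as term;

dump leftovers;
```

Whatever is left in leftovers is in the full list but in neither filtered set, which is the set difference the original question asks for.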

    William F Dowling
    Sr Technical Specialist, Software Engineering
    Thomson Reuters


    -----Original Message-----
    From: turbocodr@gmail.com On Behalf Of John Conwell
    Sent: Wednesday, July 06, 2011 6:28 PM
    To: user@pig.apache.org
    Subject: Manually build tuple from three group relations

