I have a dataset where each tuple is a term. I then do two filter
operations: one to find all terms that contain numbers, and one to find
all terms that don't.

Oddly, some terms don't fit into either group (I'm not really sure how).
So at this point I have three bags: all terms, terms with numbers, and
terms without numbers.

What I'm trying to find out is which terms are in the bag of all terms but
not in either of the two filtered bags. I thought I'd use the DIFF
function, but it only compares bags that are fields of the same tuple. So I
think I somehow need to create a new relation that has all three bags in a
single tuple (row?). Then I could use the DIFF function.

Any ideas?

The script I have so far is shown below...

terms = LOAD 'terms' AS (term:chararray, min:float, max:float, count:int);
terms = FOREACH terms GENERATE term;

--bag of all terms
allTerms = GROUP terms ALL;

--bag of terms without numbers
nonNumbers = FILTER terms BY NOT (term MATCHES '^.*[0-9].*$');
nonNumbers = GROUP nonNumbers ALL;

--bag of terms with numbers
withNumbers = FILTER terms BY (term MATCHES '^.*[0-9].*$');
withNumbers = GROUP withNumbers ALL;
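
For reference, here is a rough, untested sketch of the kind of result I am
after, written as an anti-join with COGROUP and IsEmpty rather than DIFF. The
aliases matched, grouped, neither and leftover are names I made up for
illustration. My guess (an assumption, not something I have confirmed on this
data) is that the stray terms are nulls: MATCHES on a null term evaluates to
null, and FILTER drops a row whose condition is null, so such a row would fail
both filters.

```pig
-- Untested sketch; matched, grouped, neither and leftover are made-up aliases.
terms = LOAD 'terms' AS (term:chararray, min:float, max:float, count:int);
terms = FOREACH terms GENERATE term;

-- the same two filters as in my script, but without the GROUP ... ALL step
nonNumbers = FILTER terms BY NOT (term MATCHES '^.*[0-9].*$');
withNumbers = FILTER terms BY (term MATCHES '^.*[0-9].*$');

-- every term that landed in one of the two filtered relations
matched = UNION nonNumbers, withNumbers;

-- one output tuple per distinct term, holding one bag from each input
grouped = COGROUP terms BY term, matched BY term;

-- keep only the groups where no filtered row shares the term
neither = FILTER grouped BY IsEmpty(matched);
leftover = FOREACH neither GENERATE FLATTEN(terms);

DUMP leftover;
```

As far as I understand it, COGROUP keeps null keys from different inputs in
separate groups, so null terms should come out of this with an empty matched
bag. DIFF would instead need the bags side by side in one tuple and small
enough to fit in memory, which is why I have sketched it this way.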