First of all, I'm using an old version of Pig, the one that ran on Hadoop
12.1, and yes, I will upgrade soon...
Below are some requests/questions based on my use of Pig so far:
1: If you have 1 billion files (purposely exaggerating) where approx. 50% of
the files are related to one segment and 50% to another segment,
then I guess the Pig script for isolating the segments would be something like:
files = LOAD 'path/to/1_billion_files' AS (segment);
segmentA = FILTER files BY (segment == 'a');
segmentB = FILTER files BY (segment == 'b');
STORE segmentA INTO 'segmentA.dat';
STORE segmentB INTO 'segmentB.dat';
So the question is: are all 1 billion files read and filtered twice? If so
(I guess they are), would it be possible to do
something like this (just to avoid the overhead of 1 billion extra reads):
STORE SPLIT segmentA INTO 'segmentA.dat', segmentB INTO 'segmentB.dat';
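If the existing SPLIT operator already gives a single pass over the input, the same thing could presumably be written as below. This is only a sketch: the path and output names are the placeholders from my example above, and I have not checked whether my old version actually reads the input once or twice when run this way.

```pig
-- Load the input once (placeholder path from the example above).
files = LOAD 'path/to/1_billion_files' AS (segment);

-- Route each record to one of two relations in a single statement.
SPLIT files INTO segmentA IF segment == 'a',
                 segmentB IF segment == 'b';

-- Store each segment separately (placeholder output names).
STORE segmentA INTO 'segmentA.dat';
STORE segmentB INTO 'segmentB.dat';
```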
2: Would it be possible to allow the use of asterisks (wildcards) in the path
given to LOAD, for example:
files = LOAD 'batches/*/batch/*/segments';
3: Allowing user-defined Hadoop job names when executing a script. I have a
feeling that this one is in the newest version, true?
Appreciate any comments anyone might have, thanks :-)