I am not sure if this will work as you expect.
Depending on which implementation of PigStorage you end up using, it
might exhibit different behavior.
If I am not wrong, currently, if you specify something like:
A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
your code will end up generating a tuple of 4 fields: fileName will
always be null, and the actual file name you inserted through MyLoader
will end up as the 4th field, so it is not 'seen' by Pig.
Essentially, the runtime tuple is not consistent with the script schema.
I am not sure what happens if you do a join, etc. with this tuple.
Note: this is implementation-specific behavior, which could probably
be worked around with an equally implementation-specific hack such as
"tuple.set(tuple.size() - 1, fileName)" [if you know fileName is
the last field expected].
As you would expect, that is brittle code.
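
To make the mismatch concrete, here is a minimal stand-alone sketch. It
does not use the Pig API: plain lists stand in for tuples, and the
helper names (padToSchema, appendFileName, setLastField) are
hypothetical. It assumes the parent loader pads each row with nulls up
to the schema size, as described above, and that the file itself has
two columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone illustration only: lists stand in for Pig tuples.
class SchemaPaddingDemo {

    // Assumed parent-loader behavior: pad the parsed row with nulls
    // until it has as many fields as the AS (...) schema declares.
    static List<Object> padToSchema(List<Object> parsed, int schemaSize) {
        List<Object> tuple = new ArrayList<>(parsed);
        while (tuple.size() < schemaSize) {
            tuple.add(null);
        }
        return tuple;
    }

    // Subclass behavior from the thread: append after the parent padded.
    static List<Object> appendFileName(List<Object> tuple, String fileName) {
        List<Object> out = new ArrayList<>(tuple);
        out.add(fileName);
        return out;
    }

    // The workaround: overwrite the last slot instead of appending.
    // Only correct if fileName is known to be the last schema field.
    static List<Object> setLastField(List<Object> tuple, String fileName) {
        List<Object> out = new ArrayList<>(tuple);
        out.set(out.size() - 1, fileName);
        return out;
    }

    public static void main(String[] args) {
        List<Object> parsed = Arrays.<Object>asList("a", "b"); // 2 columns in the file
        List<Object> padded = padToSchema(parsed, 3);          // schema declares 3 fields

        System.out.println(appendFileName(padded, "google")); // [a, b, null, google]
        System.out.println(setLastField(padded, "google"));   // [a, b, google]
    }
}
```

The second variant only lines up with the schema because the padding
happens to leave exactly one empty slot at the end; if the parent stops
padding, it would overwrite a real column instead.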
From a while back, I remember facing issues with Pig's implicit
conversion to/from bytearray, the implicit project that was introduced,
the insertion of nulls to extend a tuple to the specified schema (the
behavior above), etc.
So you would become dependent on implementation changes.
I don't think BinStorage and PigStorage were written with
inheritance in mind ...
Regards,
Mridul
On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
Hi,
In Pig 0.6 you can extend the PigStorage and grab the name of the file with
something like this:
@Override
public void bindTo(String fileName, BufferedPositionedInputStream is,
                   long offset, long end) throws IOException {
    super.bindTo(fileName, is, offset, end);
    // In your case, match with a regexp and get the group with the
    // name only (e.g. google, baidu).
    this.fileName = fileName;
}

@Override
public Tuple getNext() throws IOException {
    Tuple next = super.getNext();
    if (next != null) {
        next.append(fileName);
    }
    return next;
}
Then you can group on the name and split on it.
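
The regexp step mentioned in the comment could look roughly like the
sketch below. It is plain Java with no Pig dependency; the class name
SourceNameExtractor and the pattern are assumptions, based only on the
file names in the original question (<name>_<yyyy>_<mm>_<dd>.csv,
possibly preceded by a directory path).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the leading name out of a file path.
class SourceNameExtractor {

    // Assumed naming scheme: <name>_<yyyy>_<mm>_<dd>.csv at the end of the path.
    private static final Pattern FILE_PATTERN =
            Pattern.compile("([^/_]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    // Returns the name part, or the input unchanged if it does not match.
    static String extract(String fileName) {
        Matcher m = FILE_PATTERN.matcher(fileName);
        return m.find() ? m.group(1) : fileName;
    }

    public static void main(String[] args) {
        System.out.println(extract("google_2009_12_21.csv"));         // google
        System.out.println(extract("/data/in/baidu_2010_01_01.csv")); // baidu
        System.out.println(extract("notes.txt"));                     // notes.txt
    }
}
```

In bindTo you would then store extract(fileName) instead of the raw
file name, so every row carries just "google" or "baidu".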
Thanks,
Romain
On Mon, Mar 1, 2010 at 3:09 AM, Jumping wrote:
Hi,
Can Pig recognize the names of the files it is importing? If so, how?
I want to combine the files according to their file names.
Example:
google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....
Sort and combine them by name in a Pig script, then output two files:
google_all.csv and baidu_all.csv.
Best Regards,
Jumping Qu