Hi,
This question follows up on a thread from back in June:
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10490.html
I would like to do something similar. My logs are laid out as
/logs/<hostname>/<date>.log and I would like to process only the logs
that fall within a date range.
First I tried the approach suggested by Brian: writing a subroutine in the
driver that descends through the file system starting at /logs and builds
a list of input files.
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10492.html
This approach did not work for me when I tried to use inputs from S3. It
kept failing with java.lang.IllegalArgumentException: Wrong FS.
Then I tried the second approach that was suggested: writing a custom
InputFormat that recursively traverses directories for files. This
approach worked for S3 inputs.
But I would like to pass two dates to my InputFormat so that it can use
them as a date range to filter the files. I got stuck here because I
couldn't figure out how to pass date parameters to the InputFormat.
In my driver, I set the InputFormat as follows:
conf.setInputFormat(FilterFileTextInputFormat.class);
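The filtering itself is not the hard part. To illustrate what I mean by a
date range filter, here is a minimal sketch of the check I have in mind.
It is plain Java with no Hadoop dependency. The class name LogDateFilter
and the method inRange are hypothetical, and it assumes the <date> part of
the file name looks like 2009-06-15, which may not match your logs.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration only; not part of Hadoop.
final class LogDateFilter {
    // Assumption: files are named <yyyy-MM-dd>.log, e.g. 2009-06-15.log
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ISO_LOCAL_DATE;
    private static final String SUFFIX = ".log";

    private LogDateFilter() {}

    // Returns true if the file name at the end of the path is a date
    // that falls within [start, end], both ends inclusive.
    static boolean inRange(String path, LocalDate start, LocalDate end) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (!name.endsWith(SUFFIX)) {
            return false;
        }
        String datePart = name.substring(0, name.length() - SUFFIX.length());
        try {
            LocalDate date = LocalDate.parse(datePart, FORMAT);
            return !date.isBefore(start) && !date.isAfter(end);
        } catch (DateTimeParseException e) {
            return false; // file name is not a date, so skip it
        }
    }
}
```

What I am missing is how to get the start and end dates from the driver
into the InputFormat so that a check like this can run there.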
Any ideas on how I can get either approach to work?
Thanks,
David