Hi,

This question refers to a thread that was asked back in June.
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10490.html

I would like to do something similar. I have logs in a format like
/logs/<hostname>/<date>.log, and I would like to select which logs to
process based on a date range.

First I tried the approach Brian suggested: writing a subroutine in the
driver that descends through the file system starting at /logs and builds
a list of input files.
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10492.html

This approach did not work for me when I tried to use inputs from S3; it
kept failing with java.lang.IllegalArgumentException: Wrong FS.
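(For reference, in case it helps others hitting the same thing: "Wrong FS"
typically means FileSystem.get(conf) returned the cluster's default
filesystem, usually HDFS, while the paths being listed were s3n:// URIs.
Asking the Path itself for its filesystem avoids this. A minimal sketch
using the old mapred API — the s3n bucket name here is just the example
from this thread:)

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ListS3Logs {

    // Descend from dir and collect all file paths under it.
    static void collect(FileSystem fs, Path dir, List<Path> out)
            throws IOException {
        for (FileStatus stat : fs.listStatus(dir)) {
            if (stat.isDir()) {
                collect(fs, stat.getPath(), out);
            } else {
                out.add(stat.getPath());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(ListS3Logs.class);
        Path root = new Path("s3n://serverlogs/");
        // FileSystem.get(conf) resolves against fs.default.name (usually
        // HDFS) and throws "Wrong FS" when handed an s3n:// path.
        // Path.getFileSystem resolves the filesystem from the path's own
        // URI scheme instead.
        FileSystem fs = root.getFileSystem(conf);
        List<Path> inputs = new ArrayList<Path>();
        collect(fs, root, inputs);
    }
}
```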

Then I tried the second suggested approach: writing a custom InputFormat
that recursively traverses directories for files. This worked for S3
inputs, but I would like to pass two dates to my InputFormat so that it
can use them as a date range to filter the files. I got stuck here because
I couldn't figure out how to pass date parameters to the InputFormat.
In my driver, I set the InputFormat as follows:
conf.setInputFormat(FilterFileTextInputFormat.class);
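(For reference: parameters can be passed through the JobConf itself, since
the old-API InputFormat is handed the JobConf in getSplits() — and
FileInputFormat subclasses can override the protected listStatus(JobConf)
hook. A hedged sketch; FilterFileTextInputFormat and the property names
are placeholders from this thread, not real Hadoop classes:)

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical version of the custom InputFormat discussed in this thread.
public class FilterFileTextInputFormat extends TextInputFormat {

    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        // Read the date range the driver stashed in the job configuration.
        String start = job.get("logfilter.date.start"); // e.g. "20090517"
        String end   = job.get("logfilter.date.end");   // e.g. "20090519"

        List<FileStatus> keep = new ArrayList<FileStatus>();
        for (FileStatus stat : super.listStatus(job)) {
            String name = stat.getPath().getName();   // cluster3-20090518.gz
            String date = name.replaceAll("\\D", ""); // keep only the digits
            // yyyyMMdd compares correctly as a plain string.
            if (date.compareTo(start) >= 0 && date.compareTo(end) <= 0) {
                keep.add(stat);
            }
        }
        return keep.toArray(new FileStatus[keep.size()]);
    }
}
```

The driver side would then be:

```java
conf.set("logfilter.date.start", "20090517");
conf.set("logfilter.date.end", "20090519");
conf.setInputFormat(FilterFileTextInputFormat.class);
```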



Any ideas on how I can get either approach to work?

thanks,
David


  • David_ca at Jul 31, 2009 at 5:08 pm
The way I solved this problem was to create a file where each line is a
file path on S3.
    For example,
    s3n://serverlogs/cluster3-20090517.gz
    s3n://serverlogs/cluster3-20090518.gz
    s3n://serverlogs/cluster3-20090519.gz

Then I filter the log filenames, which contain the date, against a date
range. The lines that fall within the range are used as input for the job.
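(The filtering step described above can be sketched in plain Java; the
paths and date bounds are just the examples from this message, and the
final FileInputFormat.addInputPath call is one way the kept paths might be
handed to the job:)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateRangeFilter {

    // The filename embeds the date as yyyyMMdd, which compares correctly
    // as a plain string, so no date parsing is needed.
    static final Pattern DATE = Pattern.compile("(\\d{8})");

    static List<String> filter(List<String> paths, String start, String end) {
        List<String> keep = new ArrayList<String>();
        for (String path : paths) {
            Matcher m = DATE.matcher(path);
            if (m.find()) {
                String date = m.group(1);
                if (date.compareTo(start) >= 0 && date.compareTo(end) <= 0) {
                    keep.add(path);
                }
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "s3n://serverlogs/cluster3-20090516.gz",
            "s3n://serverlogs/cluster3-20090517.gz",
            "s3n://serverlogs/cluster3-20090518.gz",
            "s3n://serverlogs/cluster3-20090519.gz");
        List<String> kept = filter(paths, "20090517", "20090518");
        System.out.println(kept);
        // Each kept path would then be added as a job input, e.g.
        // FileInputFormat.addInputPath(conf, new Path(p));
    }
}
```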
    On Thu, Jul 30, 2009 at 7:28 AM, David_ca wrote:


Discussion Overview
group: common-user @ hadoop
posted: Jul 30, '09 at 1:29p
active: Jul 31, '09 at 5:08p
posts: 2
users: 1 (David_ca: 2 posts)
website: hadoop.apache.org...
irc: #hadoop
