Hi,

I have a job that processes raw data inside tarballs. As job input I have a
text file listing the full HDFS path of the files that need to be processed,
e.g.:
...
/user/eric/file451.tar.gz
/user/eric/file452.tar.gz
/user/eric/file453.tar.gz
...

Each mapper gets one line of input at a time, moves the tarball to local
storage, unpacks it, and processes all the files inside.
This works very well. However, chances are high that a mapper gets to
process a file that is not stored locally on that node, so it needs to be
transferred.

My question: is there any way to get better locality in a job as described
above?

Best regards,
Eric


  • Joey Echeverria at May 9, 2011 at 6:25 pm
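    You could write your own input format class to handle breaking out the
    tar files for you. If you subclass FileInputFormat and use the tarballs
    themselves as job input (rather than a text file of paths), each split
    carries the block locations of its file, so the scheduler can try to
    run each mapper on a node that already holds the data. Hadoop can also
    handle decompressing the files based on the .gz file extension. Gzip is
    not splittable, so your input format should return false from
    isSplitable so that each tarball becomes a single split. Your input
    format would then just need to use a Java tar file library (e.g.
    http://code.google.com/p/jtar/) to give your mappers access to the
    files underneath.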

    -Joey

    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
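
To illustrate the "access to the files underneath" step from the reply above, here is a minimal sketch of walking the entries of a gzip-compressed tar stream. It is an assumption-laden stand-in, not the approach from the thread: it uses only the JDK instead of jtar, the names `TarGzSketch`, `readTarGz`, and `buildTarGz` are hypothetical, and it handles only plain regular-file entries (no long-name or PAX extensions, no checksum verification). In a real record reader the input stream would come from the HDFS file behind the split rather than from memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class TarGzSketch {
    static final int BLOCK = 512; // tar headers and data padding use 512-byte blocks

    /** Reads regular-file entries (name -> contents) from a gzip-compressed tar stream. */
    public static Map<String, byte[]> readTarGz(InputStream raw) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(raw));
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without trailing zero blocks
            }
            if (header[0] == 0) break; // empty name: treat as the end-of-archive marker
            String name = field(header, 0, 100);                    // entry name
            int size = Integer.parseInt(field(header, 124, 12), 8); // size is octal ASCII
            byte type = header[156];                                // '0' or NUL = regular file
            byte[] data = new byte[size];
            in.readFully(data);
            in.readFully(new byte[(BLOCK - size % BLOCK) % BLOCK]); // skip block padding
            if (type == '0' || type == 0) files.put(name, data);
        }
        return files;
    }

    /** Extracts a NUL-terminated ASCII header field and trims surrounding spaces. */
    static String field(byte[] b, int off, int len) {
        int end = off;
        while (end < off + len && b[end] != 0) end++;
        return new String(b, off, end - off, StandardCharsets.US_ASCII).trim();
    }

    /** Test fixture only: writes a bare-bones tar.gz (no checksums) into memory. */
    public static byte[] buildTarGz(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                byte[] header = new byte[BLOCK];
                put(header, 0, e.getKey());
                put(header, 124, String.format("%011o", e.getValue().length));
                header[156] = '0';
                out.write(header);
                out.write(e.getValue());
                out.write(new byte[(BLOCK - e.getValue().length % BLOCK) % BLOCK]);
            }
            out.write(new byte[2 * BLOCK]); // two zero blocks end the archive
        }
        return bytes.toByteArray();
    }

    static void put(byte[] header, int off, String s) {
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, header, off, src.length);
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> input = new LinkedHashMap<>();
        input.put("a.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        input.put("dir/b.txt", "world!".getBytes(StandardCharsets.US_ASCII));
        Map<String, byte[]> out = readTarGz(new ByteArrayInputStream(buildTarGz(input)));
        for (Map.Entry<String, byte[]> e : out.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}
```

Reading entries straight from the stream avoids the copy-to-local-disk and unpack steps described in the question. For large archives, a real implementation would likely hand each entry to the mapper as it is read instead of collecting everything in a map.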

Discussion Overview
group: mapreduce-user @
categories: hadoop
posted: May 9, '11 at 9:48a
active: May 9, '11 at 6:25p
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop
