FAQ
Hi,

I was writing a test MapReduce program and noticed that the
input file was always broken down into separate lines and fed
to the mapper. However, in my case I need to process the whole
file in the mapper, since there are dependencies between
lines in the input file. Is there any way I can achieve this --
process the whole input file, either text or binary, in the mapper?

Thank you,

Ming Yang


  • Ted Dunning at Oct 15, 2007 at 3:49 pm
    Use a list of file names as your map input. Then your mapper can read a
    line, use it to open that file, and read and process the whole file itself.

    This is similar to the problem of web crawling, where the input is a list of
    URLs.
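
A rough sketch of this indirect approach, assuming the old org.apache.hadoop.mapred API of that era (the class name, key/value types, and the processing stub are illustrative, not from the thread): the job's input is a small text file listing one HDFS path per line, and each map() call opens and reads the file named on its line.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper for the "list of file names" approach: the job's
    // input is a text file of HDFS paths, so each map() call receives one
    // path and reads that file in full itself.
    public class FileNameMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private JobConf conf;

      public void configure(JobConf conf) {
        this.conf = conf;
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        Path file = new Path(value.toString());     // each input line is a path
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = fs.open(file);
        try {
          // Read and process the whole file here, emitting output as needed;
          // the actual processing depends on the file format.
        } finally {
          in.close();
        }
      }
    }
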
  • Rick Cox at Oct 15, 2007 at 4:58 pm
    You can also gzip each input file. Hadoop will not split a compressed
    input file (but will automatically decompress it before feeding it to
    your mapper).

    rick
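
For the gzip route, no special code should be needed in the job itself; a minimal driver sketch under that assumption (class names and path arguments are placeholders, and the old JobConf-style API of the time is assumed) just points the job at a directory of pre-gzipped files:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Illustrative driver: the input directory holds one .gz file per logical
    // input. TextInputFormat recognizes the .gz extension, keeps each file as
    // a single (unsplit) map input, and decompresses it transparently before
    // handing lines to the mapper.
    public class GzipWholeFileDriver {
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(GzipWholeFileDriver.class);
        conf.setJobName("whole-file-via-gzip");
        conf.setInputFormat(TextInputFormat.class);
        conf.setInputPath(new Path(args[0]));    // directory of .gz files
        conf.setOutputPath(new Path(args[1]));
        // conf.setMapperClass(...); conf.setReducerClass(...); as usual.
        JobClient.runJob(conf);
      }
    }

Note that the mapper still sees the file line by line; gzipping only guarantees that all of a file's lines go to the same map task, in order.
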
  • Ming Yang at Oct 15, 2007 at 5:09 pm
    Thank you guys, the information is very helpful!

    Ming

  • Ted Dunning at Oct 15, 2007 at 5:10 pm
    That doesn't quite do what the poster requested. They wanted to pass the
    entire file to the mapper.

    That requires a custom input format or an indirect input approach (list of
    file names in input).

  • Ming Yang at Oct 15, 2007 at 6:02 pm
    I just did a test by extending TextInputFormat
    and overriding isSplitable(FileSystem fs, Path file) to always
    return false. However, in my mapper I still see the input
    file split into lines. I did set the input format in the
    job configuration, and isSplitable(...) -> false did get called
    during job execution. Did I do anything wrong, or
    is this the behavior I should expect?

    Thanks,

    Ming

  • Aaron Kimball at Oct 15, 2007 at 6:42 pm
    That will make the input file generate a single InputSplit, meaning
    all the data from that file will go to the same Mapper. But it doesn't
    say -how- the data gets from the file to the mapper; that's controlled
    by the RecordReader instance returned by TextInputFormat. You'll need to
    write a RecordReader that slurps the entire file in at once.

    - Aaron

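
A rough sketch of such a RecordReader against the old org.apache.hadoop.mapred API (the class name and the NullWritable/BytesWritable key/value choice are illustrative, not from the thread): it ignores line boundaries and returns the entire file as a single record.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;

    // Hypothetical reader: one record per split, where the value is the raw
    // bytes of the whole file (works for text or binary input).
    public class WholeFileRecordReader
        implements RecordReader<NullWritable, BytesWritable> {

      private final FileSplit split;
      private final JobConf conf;
      private boolean processed = false;

      public WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
      }

      public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) {
          return false;                    // only one record per file
        }
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
          in = fs.open(file);
          IOUtils.readFully(in, contents, 0, contents.length);
          value.set(contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        processed = true;
        return true;
      }

      public NullWritable createKey() { return NullWritable.get(); }

      public BytesWritable createValue() { return new BytesWritable(); }

      public long getPos() throws IOException {
        return processed ? split.getLength() : 0;
      }

      public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
      }

      public void close() throws IOException {
        // nothing is held open between next() calls
      }
    }
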
  • Ted Dunning at Oct 15, 2007 at 8:54 pm
    You didn't do anything wrong. You just didn't finish the job.

    You need to override getRecordReader as well so that it returns the contents
    of the file (or a lazy version of same) as a single record.

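
Putting the two overrides together, a minimal input format sketch might look like the following (again the old org.apache.hadoop.mapred API is assumed and the names are illustrative). It extends FileInputFormat rather than TextInputFormat because the value type changes from Text to BytesWritable, and it reuses the WholeFileRecordReader sketched above:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical input format: one split per file, one record per split.
    public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      // Step 1: never split a file, so a single map task sees the whole file.
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }

      // Step 2: hand the whole file to the mapper as one record.
      public RecordReader<NullWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
      }
    }

A mapper consuming this format would then be declared as Mapper<NullWritable, BytesWritable, ...>, and the job would select the format with conf.setInputFormat(WholeFileInputFormat.class).
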
  • Ming Yang at Oct 15, 2007 at 9:23 pm
    Thank you! After tracing the code I realized that I should override
    getRecordReader(...) as well to return the whole content of the file,
    i.e. to finish the job. :)

  • Ted Dunning at Oct 15, 2007 at 9:39 pm
    If you have time, update the wiki FAQ on this so that the next person has an
    easy time figuring this question out.

  • Owen O'Malley at Oct 15, 2007 at 4:26 pm

    See the wiki FAQ: http://wiki.apache.org/lucene-hadoop/FAQ#10
