question about file input format
I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some simple
tasks. I'm trying to send each whole file in the input directory to the
mapper without splitting it line by line. How should I set the input
format class? I know I could derive a customized FileInputFormat class
and override the isSplitable function, but I have no idea how to
implement the record reader. Any suggestions or sample code would
be greatly appreciated.

Thanks in advance,
Grace

  • Arun C Murthy at Aug 18, 2011 at 12:11 am
    What file format do you want to use?

    If it's Text or SequenceFile, or any other existing derivative of FileInputFormat, just override isSplitable and rely on the format's existing RecordReader.

    Arun
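
    A minimal sketch of what Arun describes, assuming the old org.apache.hadoop.mapred
    API that ships with 0.20.2; the class name NonSplittingTextInputFormat is
    illustrative, not from the thread. Each file then becomes a single split, so one
    mapper sees the whole file, while the inherited LineRecordReader still hands it
    over line by line:

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.TextInputFormat;

        // One split per file: the file is never cut at block boundaries, so a single
        // map task receives all of its records.
        public class NonSplittingTextInputFormat extends TextInputFormat {
            @Override
            protected boolean isSplitable(FileSystem fs, Path file) {
                return false;
            }
        }

    It would be wired in on the JobConf with conf.setInputFormat(NonSplittingTextInputFormat.class).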
  • Harsh J at Aug 18, 2011 at 2:36 am
    Zhixuan,

    You'll require two things here, as you've deduced correctly:

    Under your InputFormat:
    - isSplitable -> return false
    - getRecordReader -> a simple implementation that reads the whole
    file's bytes into an array (or your own construct) and passes it
    along (as part of next(), etc.).

    For example, here's a simple record reader implementation you can
    return (untested, but you'll get the idea of reading whole files, and
    porting it to the new API is easy as well): https://gist.github.com/1153161

    P.S. Since you are reading whole files into memory, keep an eye on
    memory usage (the above example has a 10 MB limit per file, for
    instance). You could easily run out of memory if you don't handle such
    cases properly.


    --
    Harsh J
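
    Since the linked gist may eventually disappear, here is an untested sketch in the
    same spirit, written against the old (0.20.x) org.apache.hadoop.mapred API. The
    names WholeFileInputFormat and WholeFileRecordReader are illustrative, and unlike
    the gist there is no per-file size cap, so Harsh's memory caveat applies in full:

        import java.io.IOException;

        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileSplit;
        import org.apache.hadoop.mapred.InputSplit;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;

        /** Delivers each input file to the mapper as a single BytesWritable record. */
        public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
            @Override
            protected boolean isSplitable(FileSystem fs, Path file) {
                return false;                           // never split a file
            }

            @Override
            public RecordReader<NullWritable, BytesWritable> getRecordReader(
                    InputSplit split, JobConf job, Reporter reporter) throws IOException {
                return new WholeFileRecordReader((FileSplit) split, job);
            }
        }

        class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
            private final FileSplit split;
            private final JobConf conf;
            private boolean processed = false;          // true once the single record is out

            WholeFileRecordReader(FileSplit split, JobConf conf) {
                this.split = split;
                this.conf = conf;
            }

            public boolean next(NullWritable key, BytesWritable value) throws IOException {
                if (processed) {
                    return false;                       // only one record per file
                }
                byte[] contents = new byte[(int) split.getLength()];
                Path file = split.getPath();
                FileSystem fs = file.getFileSystem(conf);
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            public NullWritable createKey() { return NullWritable.get(); }
            public BytesWritable createValue() { return new BytesWritable(); }
            public long getPos() { return processed ? split.getLength() : 0; }
            public float getProgress() { return processed ? 1.0f : 0.0f; }
            public void close() throws IOException { }
        }

    The mapper then receives one (NullWritable, BytesWritable) pair per file.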
  • Zhixuan Zhu at Aug 18, 2011 at 2:06 pm
    Thanks so much for your help! I'll study the sample code and see what I
    should do. My mapper will actually invoke another shell process to read
    in the file and do its job. I just need to get the input file names and
    pass them to the separate process from my mapper. In that case I don't
    need to read the file into memory, right? How should I implement the
    next() function accordingly?

    Thanks again,
    Grace

  • Harsh J at Aug 18, 2011 at 3:03 pm
    Grace,

    In that case you may simply set the key/value to dummy or null values
    and return true just once (the same unread/read logic applies as in the
    example). Then, using the input file name (via map.input.file or the
    InputSplit), pass it to your spawned process and have it do the work.
    You'll just be omitting the reading under next().


    --
    Harsh J
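
    A rough sketch of this variant, again with the old 0.20.x API; the class names and
    the /path/to/tool command are illustrative, and the record reader would be returned
    from a getRecordReader() just like the one in the earlier sketch:

        import java.io.IOException;

        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileSplit;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;

        /** Emits exactly one record per file; the value is just the file's path. */
        class FileNameRecordReader implements RecordReader<NullWritable, Text> {
            private final FileSplit split;
            private boolean done = false;

            FileNameRecordReader(FileSplit split) {
                this.split = split;
            }

            public boolean next(NullWritable key, Text value) throws IOException {
                if (done) {
                    return false;                       // the single dummy record is spent
                }
                value.set(split.getPath().toString());  // pass the path, not the bytes
                done = true;
                return true;
            }

            public NullWritable createKey() { return NullWritable.get(); }
            public Text createValue() { return new Text(); }
            public long getPos() { return 0; }
            public float getProgress() { return done ? 1.0f : 0.0f; }
            public void close() { }
        }

        /** Spawns the external tool once per input file. */
        class FileNameMapper extends MapReduceBase
                implements Mapper<NullWritable, Text, Text, NullWritable> {
            private String inputFile;

            @Override
            public void configure(JobConf job) {
                // Old-API tasks also expose the current split's file as "map.input.file".
                inputFile = job.get("map.input.file");
            }

            public void map(NullWritable key, Text value,
                            OutputCollector<Text, NullWritable> out, Reporter reporter)
                    throws IOException {
                String file = (inputFile != null) ? inputFile : value.toString();
                Process p = new ProcessBuilder("/path/to/tool", file).start();
                try {
                    int rc = p.waitFor();               // block until the tool finishes
                    out.collect(new Text(file + " exited with " + rc), NullWritable.get());
                } catch (InterruptedException e) {
                    throw new IOException("interrupted waiting for child: " + e);
                }
            }
        }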
  • Zhixuan Zhu at Aug 18, 2011 at 3:50 pm
    Thanks very much for the prompt reply! It makes perfect sense. I'll give
    it a try.

    Grace

  • Jinsong Hu at Oct 3, 2011 at 5:06 am
    Hi there,
    I just thought of an idea. When we format a disk, the block size is
    usually 1K to 4K. For HDFS, the block size is usually 64M.
    I wonder: if we change the raw file system's block size to something
    significantly bigger, say 1M or 8M, will that improve disk I/O
    performance for Hadoop's HDFS?
    Currently, I notice that the MapR distribution uses MFS, its own file
    system, which results in a 4x performance gain in terms of disk I/O. I
    just wonder whether, by tuning the host OS parameters, we can achieve
    better disk I/O performance with just the regular Apache Hadoop
    distribution.
    I understand that making the block size bigger can waste some disk
    space for small files. However, for disks dedicated to HDFS, where
    most of the files are very big, I just wonder if it is a good idea.
    Does anybody have any comments?

    Jimmy
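
    For what it's worth, the 64M figure above is a Hadoop-level setting, independent of
    the block size the local ext3/ext4/XFS partition was formatted with. A small
    illustrative snippet (the property name is the 0.20.x-era one; the path and sizes
    are made up):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class BlockSizeDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Cluster-wide default HDFS block size (64 MB unless overridden), e.g. 128 MB:
                conf.setLong("dfs.block.size", 128L * 1024 * 1024);

                FileSystem fs = FileSystem.get(conf);
                // ...or choose a block size per file at create time
                // (overwrite, bufferSize, replication, blockSize):
                FSDataOutputStream out = fs.create(new Path("/tmp/blocksize-demo"),
                        true, 4096, (short) 3, 256L * 1024 * 1024);
                out.close();
            }
        }

    Changing the underlying local file system's block size, by contrast, is a
    mkfs-time decision on each DataNode.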
  • Niels Basjes at Oct 3, 2011 at 6:13 am
    Have you tried it to see what difference it makes?

    --
    Kind regards,
    Niels Basjes
    (Sent from mobile)
  • Ted Dunning at Oct 3, 2011 at 2:44 pm
    The MapR system allocates files with 8K blocks internally, so I doubt
    that any improvement you see from a larger block size under HDFS will
    matter much, and it could seriously confuse your underlying file system.

    The performance advantage for MapR has more to do with a better file system
    design and much more direct data paths than it has to do with block size on
    disk. Changing the block size on the HDFS partition isn't going to help
    that.
  • M. C. Srivas at Oct 9, 2011 at 6:01 am
    By default, Linux file systems use a 4K block size. A block size of 4K
    means all I/O happens 4K at a time. Any *updates* to data smaller than
    4K will result in a read-modify-write cycle on disk; i.e., if a file is
    extended from 1K to 2K, the fs will read in the 4K, memcpy the region
    from 1K to 2K into the VM page, then write out 4K again.

    If you make the block size 1M, the read-modify-write cycle will read in
    1M and write out 1M. I don't think you want that to happen (imagine the
    HBase WAL writing a few hundred bytes at a time).

    It also means that, on average, you will waste 512K of disk per file
    (vs. 2K with a 4K block size).

    By the way, MapR uses 8K as the native block size on disk.

    If you insist on HDFS, try using XFS underneath; it does a much better
    job than ext3 or ext4 for Hadoop in terms of how data is laid out on
    disk. But its memory footprint is at least twice that of ext3, so it
    will gobble up a lot more memory on your box.


  • Steve Loughran at Oct 10, 2011 at 10:49 am

    How stable have you found XFS? I know people have worked a lot on ext4
    and I am using it locally, even if something (VirtualBox) tells me off
    for doing so. I know the Lustre people are using it underneath their
    DFS, and with wide use it does tend to get debugged by others before
    you trust it with your data.
  • M. C. Srivas at Oct 10, 2011 at 1:52 pm
    XFS was created by Silicon Graphics in the early 1990s and was designed
    for streaming workloads. The Linux port came in 2002 or so.

    I've used it extensively for the past 8 years. It is very stable, and many
    NAS companies have embedded it in their products. In particular, it works
    well even when the disk starts getting full. ext4 tends to have problems
    with multiple streams (it seeks too much), and ext3 has a fragmentation
    problem.

    (MapR's disk layout is even better than XFS's ... couldn't resist)

  • Brian Bockelman at Oct 10, 2011 at 2:11 pm
    I can provide another data point here: XFS works very well in modern Linuxes (in the 2.6.9 era it had many memory management headaches, especially around the switch to 4K stacks), and its advantage is significant when you run file systems over 95% occupied.

    Brian

Discussion Overview
group: common-dev@hadoop.apache.org
categories: hadoop
posted: Aug 17, '11 at 10:59p
active: Oct 10, '11 at 2:11p
posts: 13
users: 9
website: hadoop.apache.org...
irc: #hadoop
