FAQ
Hello Everyone,

As far as I know, when my Java program opens a sequence file from HDFS for map calculations, SequenceFile.Reader.next(key, value) will actually read the file in dfs.block.size chunks and then grab it record by record from memory.

Is that right?

.. I tried a simple program with an input of about 6 MB, but the memory allocated was 13 MB! .. which might be a fragmentation problem, but I doubt it.

Thank you,
Maha


  • Harsh J at Apr 1, 2011 at 7:00 am

    On Fri, Apr 1, 2011 at 9:00 AM, maha wrote:
    Hello Everyone,

    As far as I know, when my Java program opens a sequence file from HDFS for map calculations, SequenceFile.Reader.next(key, value) will actually read the file in dfs.block.size chunks and then grab it record by record from memory.

    Is that right?
    The dfs.block.size part is partially right when applied in MapReduce
    (actually, it would look for sync points for the read start and read
    end). And no, the reader does not load the entire file into memory in
    one go. It buffers and reads off the stream just like any other
    reader. (A small sketch of this sync-bounded reading follows this
    reply.)

    Could we have some more information on what your java program does,
    and what exactly you are measuring? :)
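
    As a rough illustration of the sync-bounded reading described above, here is a minimal sketch of reading a slice of a SequenceFile record by record between two sync points, roughly what a MapReduce record reader does for a split. The path, the split boundaries, and the Text key/value classes are placeholder assumptions, not taken from the thread.

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class SyncBoundedRead {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/maha/input.seq"); // hypothetical path

            long start = 0L;              // assumed split start
            long end = 64L * 1024 * 1024; // assumed split end (one HDFS block)

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
              if (start > reader.getPosition()) {
                reader.sync(start); // seek to the first sync mark after 'start'
              }
              Text key = new Text();   // substitute the file's real key class
              Text value = new Text(); // substitute the file's real value class
              while (true) {
                long pos = reader.getPosition();
                // next() deserializes one record off the buffered stream;
                // the file is never pulled into memory as a whole.
                if (!reader.next(key, value)) {
                  break;
                }
                // process(key, value) ...
                // Stop once the read position has passed 'end' and a sync
                // mark has been seen, i.e. honour the split boundary.
                if (pos >= end && reader.syncSeen()) {
                  break;
                }
              }
            } finally {
              reader.close();
            }
          }
        }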
  • Maha at Apr 3, 2011 at 1:19 am
    Hi Harsh,

    My job is for a Similarity Search application. But my aim for now is to measure the I/O overhead when my mapper.map() opens a sequence file and starts to read it record by record with:

    SequenceFile.Reader.next(key,value);

    I want to make sure that "next" here is I/O efficient. Otherwise, I will need to write it myself to read a block at a time and then parse it in my program using the "sync" hints.


    So, what you meant, in other words, is that the reader will buffer a couple of records (the ones between two syncs) into memory and then use "next" to read them from memory .. right? If yes, what parameter is used for the buffer size? (A small timed read loop along these lines is sketched after this message.)

    Thank you,
    Maha


    On Mar 31, 2011, at 11:59 PM, Harsh J wrote:
    On Fri, Apr 1, 2011 at 9:00 AM, maha wrote:
    Hello Everyone,

    As far as I know, when my Java program opens a sequence file from HDFS for map calculations, SequenceFile.Reader.next(key, value) will actually read the file in dfs.block.size chunks and then grab it record by record from memory.

    Is that right?
    The dfs.block.size part is partially right when applied in MapReduce
    (actually, it would look for sync points for the read start and read
    end). And no, the reader does not load the entire file into memory in
    one go. It buffers and reads off the stream just like any other
    reader.

    Could we have some more information on what your java program does,
    and what exactly you are measuring? :)

    --
    Harsh J
    http://harshj.com
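
    Along the lines of what Maha describes, a minimal timed read loop might look like the sketch below. It is a standalone example under stated assumptions (the path and the Text key/value classes are made up); inside a real Mapper the same loop would typically sit in setup() or map(), with the Configuration taken from the job context.

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class TimedSequenceRead {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/maha/input.seq"); // hypothetical path

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            Text key = new Text();   // substitute the file's real key class
            Text value = new Text(); // substitute the file's real value class

            long records = 0;
            long startNs = System.nanoTime();
            try {
              // Each next() call reads exactly one record from the buffered
              // input stream, so memory use stays bounded by the stream
              // buffer plus the current key/value objects.
              while (reader.next(key, value)) {
                records++;
              }
            } finally {
              reader.close();
            }
            long elapsedMs = (System.nanoTime() - startNs) / 1000000L;
            System.out.println(records + " records in " + elapsedMs + " ms");
          }
        }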
  • Harsh J at Apr 3, 2011 at 6:03 am
    Hello,
    On Sun, Apr 3, 2011 at 6:49 AM, maha wrote:
    Hi Harsh,

    My job is for a Similarity Search application. But my aim for now is to measure the I/O overhead when my mapper.map() opens a sequence file and starts to read it record by record with:

    SequenceFile.Reader.next(key,value);

    I want to make sure that "next" here is I/O efficient. Otherwise, I will need to write it myself to read a block at a time and then parse it in my program using the "sync" hints.
    You can have a look at the SequenceFile.Reader class's source code,
    perhaps - it should clear up any doubts you're having.
    what parameter is used for the buffer size?
    Records are not all loaded into memory. Records are read using the
    key/value size information off the buffered input stream.

    You can specify a buffer size while constructing a Reader object for
    SequenceFiles, or the "io.file.buffer.size" value is used as a
    default.
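
    Put together, a hedged sketch of the two buffer-size routes Harsh mentions (the path and sizes are placeholders; the option-based constructor is only available in later Hadoop releases):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;

        public class ReaderBufferSize {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();

            // Route 1: rely on the configuration default. The reader's stream
            // buffer falls back to io.file.buffer.size (4096 bytes unless
            // overridden) when no explicit size is given.
            conf.setInt("io.file.buffer.size", 128 * 1024);

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/maha/input.seq"); // hypothetical path

            SequenceFile.Reader viaConf = new SequenceFile.Reader(fs, path, conf);
            viaConf.close();

            // Route 2 (newer Hadoop releases): pass the buffer size directly
            // as a reader option when constructing the Reader.
            SequenceFile.Reader viaOption = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path),
                SequenceFile.Reader.bufferSize(128 * 1024));
            viaOption.close();
          }
        }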
