Hi,

I am writing a MapReduce job using the MapRunnable interface.
The input format is SequenceFileInputFormat.

Each sequence-file record contains a key-value pair of type <Text key, Text value> (Text: org.apache.hadoop.io.Text).

The "key" Text object contains a small string, whereas the "value" Text object contains a large XML string.
The "value" Text object can hold data as large as 100 to 300 MB.

I convert the "value" Text object to a String using the value.toString() method.
This throws an OutOfMemoryError for large data in the "value" object.

Is there another way to convert a large Text object to a Java String?
Alternatively, can I limit the number of records the RecordReader delivers to the run method, so that total memory utilization stays bounded?

Thanks,
- Bhushan
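
A minimal sketch of the setup described above, using the old org.apache.hadoop.mapred API; the class name XmlMapRunner and the parsing step are placeholders rather than code from the thread:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// MapRunnable hands the whole RecordReader to run(), so the loop
// below pulls records one at a time into a single reused Text
// instance. The OutOfMemoryError happens at value.toString(), which
// decodes the entire byte payload into a second full-size copy.
public class XmlMapRunner implements MapRunnable<Text, Text, Text, Text> {

  public void configure(JobConf job) {
  }

  public void run(RecordReader<Text, Text> reader,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    Text key = reader.createKey();
    Text value = reader.createValue();  // reused for every record
    while (reader.next(key, value)) {
      String xml = value.toString();    // allocates a copy of up to 300 MB
      // ... parse xml and emit results via output.collect(...) ...
      reporter.progress();
    }
  }
}

Note that because run() owns this loop, only one record is in memory at a time; the pressure comes from decoding a single huge value, not from buffering many records.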




  • Steve Gao at Oct 29, 2009 at 7:41 pm
Does anybody have a similar issue? If you store XML files in HDFS, how can you make sure a chunk read by a mapper does not contain partial data of an XML segment?

    For example:

    <title>
    <book>book1</book>
    <author>me</author>
    ..............what if this is the boundary of a chunk?...................
    <year>2009</year>
    <book>book2</book>

    <author>me</author>

    <year>2009</year>
    <book>book3</book>

    <author>me</author>

    <year>2009</year>
    </title>
  • Amandeep Khurana at Oct 29, 2009 at 9:13 pm
    Store the entire xml in one line...
    On 10/29/09, Steve Gao wrote:
    Does anybody have a similar issue? If you store XML files in HDFS, how can you make sure a chunk read by a mapper does not contain partial data of an XML segment?



    --


    Amandeep Khurana
    Computer Science Graduate Student
    University of California, Santa Cruz
  • Brian Bockelman at Oct 29, 2009 at 9:51 pm
    Hey Steve,

    I think I've run across code in SVN that is a splitter for XML entries like this. Look at StreamXmlRecordReader; I think it does what you want.

    Brian
    On Oct 29, 2009, at 4:12 PM, Amandeep Khurana wrote:

    Store the entire xml in one line...
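    A hedged sketch of wiring up the reader Brian mentions from a Java driver, assuming the 0.20-era hadoop-streaming jar is on the classpath; the driver class name is a placeholder, and the <book> begin/end tags are taken from the example above:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.streaming.StreamInputFormat;

    // Make each input record one complete <book>...</book> element,
    // regardless of where HDFS block boundaries fall. StreamInputFormat
    // consults the stream.recordreader.* properties to instantiate
    // StreamXmlRecordReader.
    public class XmlJobDriver {
      public static JobConf configure() {
        JobConf conf = new JobConf(XmlJobDriver.class);
        conf.setInputFormat(StreamInputFormat.class);
        conf.set("stream.recordreader.class",
                 "org.apache.hadoop.streaming.StreamXmlRecordReader");
        conf.set("stream.recordreader.begin", "<book>");
        conf.set("stream.recordreader.end", "</book>");
        return conf;
      }
    }

    From the streaming command line, the same reader is selected with -inputreader "StreamXmlRecord,begin=<book>,end=</book>".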
  • Steve Gao at Nov 16, 2009 at 8:03 pm
    Thanks, but this is not a neat solution when the XML block is very large.
    Does anybody have another solution? Thanks!

    --- On Thu, 10/29/09, Amandeep Khurana wrote:

    From: Amandeep Khurana <[email protected]>
    Subject: Re: What if an XML file is across the boundary of HDFS chunks?
    To: [email protected]
    Date: Thursday, October 29, 2009, 5:12 PM

    Store the entire xml in one line...
  • Brian Bockelman at Nov 16, 2009 at 8:06 pm
    Hey Steve,

    Look at the mailing list archives; there's a specialized input splitter, suggested by at least two different people, that you could use.

    Brian
    On Nov 16, 2009, at 2:02 PM, Steve Gao wrote:

    Thanks, but this is not a neat solution when the XML block is very large.
    Does anybody have another solution? Thanks!

  • Jason Venner at Dec 22, 2009 at 3:16 pm
    The Text class supports low-level access to the underlying byte array in the Text object.

    You can call getBytes() directly and then incrementally transcode the bytes into characters using the charset encoder tools, or call the charAt method to get the characters one by one. The bytesToCodePoint method provides a simpler interface for sequentially working through the data.
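    A minimal sketch of the bytesToCodePoint approach described above; the TextScanner class and the parser hand-off are placeholders:

    import java.nio.ByteBuffer;

    import org.apache.hadoop.io.Text;

    // Walk the Text's backing byte array without ever materializing a
    // full String copy. Text.bytesToCodePoint() decodes one UTF-8 code
    // point at the buffer's current position and advances the position.
    public class TextScanner {
      public static void scan(Text value) {
        // getBytes() exposes the backing array; only the first
        // getLength() bytes hold valid record data.
        ByteBuffer buf = ByteBuffer.wrap(value.getBytes(), 0, value.getLength());
        while (buf.hasRemaining()) {
          int codePoint = Text.bytesToCodePoint(buf);
          // ... feed codePoint to an incremental XML parser here,
          // instead of calling value.toString() ...
        }
      }
    }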
    On Thu, Oct 29, 2009 at 4:18 AM, bhushan_mahale wrote:

    I convert the "value" Text object to a String using the value.toString() method.
    This throws an OutOfMemoryError for large data in the "value" object.

    Is there another way to convert a large Text object to a Java String?


    --
    Pro Hadoop, a book to guide you from beginner to hadoop mastery,
    http://www.amazon.com/dp/1430219424?tag=jewlerymall
    www.prohadoopbook.com a community for Hadoop Professionals
  • Mark Kerzner at Dec 22, 2009 at 3:29 pm
    Bhushan,

    have you considered simply raising the memory limit for Hadoop? 100-300 MB is
    not that much, and 2 GB is a very modest memory requirement for today's
    machines. For comparison, a small EC2 instance has 1.7 GB.
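    A sketch of this suggestion, assuming the 0.20-era property mapred.child.java.opts controls the task JVM flags (its default was -Xmx200m, far below a 300 MB record plus its decoded String copy):

    import org.apache.hadoop.mapred.JobConf;

    // Give each task JVM a 2 GB heap instead of the 200 MB default.
    public class HeapConfig {
      public static JobConf withLargerHeap() {
        JobConf conf = new JobConf();
        conf.set("mapred.child.java.opts", "-Xmx2048m");
        return conf;
      }
    }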
    On Tue, Dec 22, 2009 at 9:10 AM, Jason Venner wrote:

    The Text class supports low-level access to the underlying byte array in the Text object. The bytesToCodePoint method provides a simpler interface for sequentially working through the data.

Discussion Overview
group: common-user
categories: hadoop
posted: Oct 29, '09 at 12:19p
active: Dec 22, '09 at 3:29p
posts: 9
users: 6
website: hadoop.apache.org...
irc: #hadoop
