FAQ
Hi All,

I am looking to store some very large XML files in HDFS and then process them using MapReduce.



Is there a utility that uploads XML files to HDFS while making sure the split of a file into blocks doesn't break an element (i.e. half of an element on one block and half on another)?



Any suggestions on how to work this out would be greatly appreciated.



Thanks



Bari


  • Hyunsik Choi at Jul 30, 2009 at 12:17 pm
    Hi,

    Actually, I don't know of any well-made XML InputFormat or
    RecordReader.
    To the best of my knowledge, StreamXmlRecordReader (
    http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html
    ) from Hadoop Streaming is the only solution.

    Good luck!

    --
    Hyunsik Choi
    Database & Information Systems Group, Korea University
    http://diveintodata.org
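
    For reference, the Hadoop Streaming docs show StreamXmlRecordReader being
    selected with the -inputreader option. A minimal sketch of a Python mapper
    for such a job might look like the following; note that the <page> and
    <title> element names are hypothetical placeholders, and the sketch assumes
    each XML record arrives as a single line on stdin:

    ```python
    #!/usr/bin/env python
    # Sketch of a streaming mapper for a job launched with something like:
    #   hadoop jar hadoop-streaming.jar \
    #       -inputreader "StreamXmlRecord,begin=<page>,end=</page>" \
    #       -mapper mapper.py ...
    # Assumes each XML record arrives as one line on stdin; <page> and
    # <title> are placeholder element names.
    import sys
    import xml.etree.ElementTree as ET


    def map_record(record):
        """Parse one XML record and emit (title, 1) for a count-style job."""
        root = ET.fromstring(record)
        title = root.findtext("title", default="")
        return title, 1


    def main():
        for line in sys.stdin:
            record = line.strip()
            if not record:
                continue
            key, count = map_record(record)
            print("%s\t%d" % (key, count))


    if __name__ == "__main__":
        main()
    ```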



  • Joerg Rieger at Aug 10, 2009 at 7:07 pm
    Hello,

    while flipping through the cloud9 collections, I came across an XML
    InputFormat class:

    http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/api/edu/umd/cloud9/collection/XMLInputFormat.html

    I haven't used it myself, but it might be worth a try.


    Joerg

  • Aaron Kimball at Aug 11, 2009 at 11:02 pm
    Wasim,

    RecordReader implementations should never require that records avoid
    spanning multiple blocks. The start and end offsets into a file in an
    InputSplit are taken as soft limits, not hard ones. The RecordReader
    implementations that ship with Hadoop behave this way, and any that you
    author should do the same. If a logical record continues past the split's
    end offset, the reader keeps reading data from the next block until it
    finds the end of the record. Similarly, if a RecordReader has a start
    offset > 0, it scans forward until the first end-of-record followed by any
    beginning-of-record marker, discarding that data (as it was processed by
    the previous InputSplit), and only then does it begin reading records into
    its map task.

    - Aaron
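
    To illustrate the behavior Aaron describes, here is a rough Python
    simulation of those soft-limit semantics (toy <rec>…</rec> markers, not
    real Hadoop code): each split's reader scans forward to the first record
    beginning at or after its start offset, and keeps reading whole records as
    long as they start before the end offset, even if the last one runs past it.

    ```python
    # Toy simulation of the "soft limit" InputSplit semantics described above.
    # Not Hadoop code: <rec>/</rec> stand in for whatever record markers a
    # real XML record reader would be configured with.

    def read_split(data, start, end, begin="<rec>", finish="</rec>"):
        """Return the complete records belonging to the split [start, end).

        A record belongs to the split whose range contains its first byte;
        the reader happily reads past `end` to finish its last record, and
        skips a record that started before `start` (the previous split's
        reader already consumed it).
        """
        records = []
        pos = data.find(begin, start)   # first record starting at or after `start`
        while pos != -1 and pos < end:  # records starting at >= end go to the next split
            close = data.find(finish, pos)
            if close == -1:             # truncated final record: nothing more to read
                break
            close += len(finish)
            records.append(data[pos:close])
            pos = data.find(begin, close)
        return records


    data = "<rec>a</rec><rec>bb</rec><rec>ccc</rec>"
    # Split the file at arbitrary byte offsets, mid-record or not: every
    # record comes out exactly once, in order.
    splits = [(0, 10), (10, 20), (20, len(data))]
    out = [r for s, e in splits for r in read_split(data, s, e)]
    ```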



Discussion Overview
group: common-user
categories: hadoop
posted: Jul 30, '09 at 8:31a
active: Aug 11, '09 at 11:02p
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop
