Wasim,

RecordReader implementations should never require that elements not be
spread across multiple blocks. The start and end offsets into a file in an
InputSplit are taken as soft limits, not hard ones. The RecordReader
implementations that come with Hadoop behave this way, and any that you
author should do the same. If a logical record continues past the split's
end offset, the RecordReader keeps reading data from the next block until it
finds the end of the record. Similarly, if a RecordReader has a start
offset > 0, it scans forward until the first end-of-record followed by a
beginning-of-record marker, ignoring the data before that point (it was
already processed by the reader of the previous InputSplit), and only then
does it begin reading records into its map task.
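
To make the soft-limit idea concrete, here is a minimal sketch in plain
Java. It does not use the Hadoop API: the class name SoftSplitReader, the
method readSplit, and the <rec>...</rec> tags are all made up for
illustration, and the "file" is just an in-memory string. It assumes the
simplest ownership rule for tag-delimited records, namely that a record
belongs to the split in which its start tag begins, so a reader skips ahead
to the next start tag and may read past its end offset to finish a record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, in-memory illustration of soft split limits (not Hadoop code).
class SoftSplitReader {
    static final String START_TAG = "<rec>";   // assumed record delimiters
    static final String END_TAG = "</rec>";

    /** Returns the records owned by the split [start, end) of data. */
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        while (true) {
            // Skip forward to the next start tag; anything before it
            // belongs to a record owned by an earlier split.
            int begin = data.indexOf(START_TAG, pos);
            // Stop once the next record starts at or beyond the end offset.
            if (begin < 0 || begin >= end) {
                break;
            }
            // Read to the end of the record, even if that is past 'end'.
            int close = data.indexOf(END_TAG, begin);
            if (close < 0) {
                break; // truncated record at end of data; ignore it here
            }
            int recordEnd = close + END_TAG.length();
            records.add(data.substring(begin, recordEnd));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<rec>a</rec><rec>bb</rec><rec>ccc</rec>";
        int splitSize = 7; // deliberately smaller than any record
        for (int s = 0; s < data.length(); s += splitSize) {
            int e = Math.min(s + splitSize, data.length());
            System.out.println("[" + s + "," + e + ") -> " + readSplit(data, s, e));
        }
    }
}
```

With this rule every record is returned by exactly one split, no matter
where the boundaries fall; some splits simply yield no records.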

- Aaron

On Mon, Aug 10, 2009 at 12:07 PM, Joerg Rieger wrote:

Hello,

while flipping through the cloud9 collections, I came across an XML
InputFormat class:


http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/api/edu/umd/cloud9/collection/XMLInputFormat.html

I haven't used it myself, but it might be worth a try.


Joerg



On 30.07.2009, at 14:16, Hyunsik Choi wrote:

Hi,
Actually, I don't know of any well-made XML InputFormat or
RecordReader.
To the best of my knowledge, StreamXmlRecordReader (

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html
) of Hadoop streaming is the only solution.

Good luck!

--
Hyunsik Choi
Database & Information Systems Group, Korea University
http://diveintodata.org



On Thu, Jul 30, 2009 at 5:30 PM, Wasim Bari wrote:


Hi All,

I am looking to store some really big XML files in HDFS and then
process them using MapReduce.

Is there a utility that uploads XML files to HDFS while making sure that
splitting a file into blocks doesn't break an element (I mean half of an
element on one block and half on another)?

Any suggestions to work this out will be greatly appreciated.



Thanks



Bari