On Aug 11, 2009 at 11:02 pm:
RecordReader implementations should never require that records be confined to a
single block. The start and end offsets into a file in an InputSplit are soft
limits, not hard ones. The RecordReader implementations that ship with Hadoop
behave this way, and any that you author should do the same. If a logical
record continues past the split's end offset, the reader keeps reading data
from the next block until it finds the end of the record. Similarly, if a
RecordReader has a start offset > 0, it scans forward until the first
end-of-record followed by a beginning-of-record marker, discarding that data
(it was already processed by the previous InputSplit), and only then does it
begin reading records.
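To make that concrete, here is an untested sketch of the convention for
newline-delimited records, written against the old mapred API; it is
essentially what Hadoop's own LineRecordReader does, though the class name
here is made up:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.util.LineReader;

    // Treats the InputSplit's offsets as soft limits: skips the partial
    // record at the front of the split and reads past the end offset to
    // finish the last record that starts inside the split.
    public class SoftLimitLineRecordReader
        implements RecordReader<LongWritable, Text> {
      private final long start;
      private final long end;
      private final FSDataInputStream in;
      private final LineReader reader;
      private long pos;

      public SoftLimitLineRecordReader(JobConf conf, FileSplit split)
          throws IOException {
        start = split.getStart();
        end = start + split.getLength();
        Path file = split.getPath();
        in = file.getFileSystem(conf).open(file);
        long offset = start;
        if (start != 0) {
          // Back up one byte before skipping to the next newline. If the
          // split boundary falls exactly on a record boundary, we then
          // discard only the previous record's trailing newline, so no
          // record is lost and none is read twice.
          offset = start - 1;
          in.seek(offset);
        }
        reader = new LineReader(in, conf);
        if (start != 0) {
          // The partial record straddling our start offset was already
          // handled by the previous split's reader; throw it away.
          offset += reader.readLine(new Text());
        }
        pos = offset;
      }

      public boolean next(LongWritable key, Text value) throws IOException {
        // Soft limit: only start a record whose first byte lies inside the
        // split, but let the read itself run past `end` into the next block.
        if (pos >= end) {
          return false;
        }
        key.set(pos);
        int bytesRead = reader.readLine(value);
        if (bytesRead == 0) {
          return false; // genuine end of file
        }
        pos += bytesRead;
        return true;
      }

      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return pos; }
      public void close() throws IOException { in.close(); }

      public float getProgress() {
        if (start == end) {
          return 0.0f;
        }
        return Math.min(1.0f, (pos - start) / (float) (end - start));
      }
    }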
On Mon, Aug 10, 2009 at 12:07 PM, Joerg Rieger wrote:
While flipping through the cloud9 collections, I came across an XML
InputFormat. I haven't used it myself, but it might be worth a try.
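If you want to give it a shot, wiring it into a job should look roughly like
this (untested, and the class name and the xmlinput.start/xmlinput.end keys
are from memory, so check them against the cloud9 source):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import edu.umd.cloud9.collection.XMLInputFormat;

    public class XmlJobSetup {
      public static JobConf configure() {
        JobConf conf = new JobConf(XmlJobSetup.class);
        conf.setInputFormat(XMLInputFormat.class);
        // One logical record = everything between these two tags, even if
        // that span crosses an HDFS block boundary.
        conf.set("xmlinput.start", "<page>");
        conf.set("xmlinput.end", "</page>");
        FileInputFormat.addInputPath(conf, new Path("/data/big.xml"));
        return conf;
      }
    }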
On 30.07.2009, at 14:16, Hyunsik Choi wrote:
Actually, I don't know of any well-made XML InputFormat or RecordReader.
To the best of my knowledge, StreamXmlRecordReader (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html)
from Hadoop Streaming is the only solution.
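If the streaming command line is not an option, the same reader can be driven
from a plain Java job, roughly like this (untested; the stream.recordreader.*
property names should be double-checked against your Hadoop version):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.streaming.StreamInputFormat;

    public class StreamXmlJobSetup {
      public static JobConf configure() {
        JobConf conf = new JobConf(StreamXmlJobSetup.class);
        // StreamInputFormat instantiates whatever reader is named here.
        conf.setInputFormat(StreamInputFormat.class);
        conf.set("stream.recordreader.class",
            "org.apache.hadoop.streaming.StreamXmlRecordReader");
        conf.set("stream.recordreader.begin", "<page>"); // start-of-record tag
        conf.set("stream.recordreader.end", "</page>");  // end-of-record tag
        FileInputFormat.addInputPath(conf, new Path("/data/big.xml"));
        return conf;
      }
    }

From the streaming command line itself, the equivalent is the -inputreader
option with its begin/end arguments.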
Database & Information Systems Group, Korea University
http://diveintodata.org
On Thu, Jul 30, 2009 at 5:30 PM, Wasim Bari wrote:
I am looking to store some really big XML files in HDFS and then
process them using MapReduce.
Do we have some utility which uploads the XML files to HDFS while making sure
that the split-up of a file into blocks doesn't break an element (meaning half
an element on one block and half on another)?
Any suggestions to work this out will be appreciated greatly.