It can be done! :) I'll see if I can contribute the code that I use for
these sorts of things... I've been parsing through nearly a terabyte of XML
on a regular basis using mapreduce since early this year.
One way to do it would be to define a variant of TextInputFormat that,
instead of using end-of-line to delimit records, uses a custom string,
perhaps a regexp. Just define a regexp for an XML element that you can
cleanly treat independently, and away you go. Splits might not fall across
block boundaries as cleanly as you'd like, but it should be possible to
make it perform much better than single-machine, single-parser work would.
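For what it's worth, here is a minimal, untested sketch of that idea
against the org.apache.hadoop.mapred API. Everything in it is an
assumption: the <record> tags, the class names, and the use of fixed byte
strings where a regexp scan would be more general. Boundary handling is
also simplified.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class XmlElementInputFormat extends FileInputFormat<LongWritable, Text> {
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new XmlElementRecordReader((FileSplit) split, job);
  }

  static class XmlElementRecordReader implements RecordReader<LongWritable, Text> {
    // Fixed byte strings for brevity; a regexp-based scan would be more general.
    private static final byte[] START_TAG = "<record>".getBytes();
    private static final byte[] END_TAG = "</record>".getBytes();
    private final FSDataInputStream in;
    private final long start, end;

    XmlElementRecordReader(FileSplit split, JobConf job) throws IOException {
      start = split.getStart();
      end = start + split.getLength();
      FileSystem fs = split.getPath().getFileSystem(job);
      in = fs.open(split.getPath());
      in.seek(start);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      // Look for a start tag that begins inside this split; the record
      // itself is allowed to run past the split boundary.
      if (in.getPos() < end && readUntilMatch(START_TAG, null)) {
        DataOutputBuffer buf = new DataOutputBuffer();
        buf.write(START_TAG);
        if (readUntilMatch(END_TAG, buf)) {
          key.set(in.getPos());
          value.set(buf.getData(), 0, buf.getLength());
          return true;
        }
      }
      return false;
    }

    // Scans forward for 'match', copying bytes into 'buf' when non-null.
    private boolean readUntilMatch(byte[] match, DataOutputBuffer buf)
        throws IOException {
      int i = 0;
      while (true) {
        int b = in.read();
        if (b == -1) return false;
        if (buf != null) buf.write(b);
        if (b == match[i]) {
          if (++i == match.length) return true;
        } else {
          i = (b == match[0]) ? 1 : 0;
          // Give up only between records, never inside one.
          if (buf == null && in.getPos() >= end) return false;
        }
      }
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return in.getPos(); }
    public float getProgress() throws IOException {
      return end == start ? 1.0f
          : Math.min(1.0f, (in.getPos() - start) / (float) (end - start));
    }
    public void close() throws IOException { in.close(); }
  }
}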
Of course, worst cases kill you. It might be a problem if one of your
segments is a couple of gigabytes long, for instance, as is the case with my
dataset if you use the highest-level XML container. The current mapreduce
code really requires that your Writable instances fit in memory.
On 9/27/06, Vetle Roeim wrote:
On Wed, 27 Sep 2006 06:37:06 +0200, Feng Jiang wrote:
One principle is that the input file must be a sequence of pairs, and you
must have an input formatter for the input file; otherwise you cannot use
mapreduce directly.
Technically, yes, but that pair could just as well consist of XML and some
insignificant value. The most obvious example is processing log files,
where data might be read using TextInputFormat and the key (the byte
offset of each line) simply discarded.
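As a toy illustration of that point (the class name and the
leading-log-level line format are invented for the example), a mapper can
simply ignore the offset key:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LogLineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    // The offset key is never used; only the line content matters here.
    // Counts lines by log level, assuming a "LEVEL rest-of-line" format.
    String level = line.toString().split(" ", 2)[0];
    out.collect(new Text(level), ONE);
  }
}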
[...]
in my example, XML parsing to CSV seems to be a one-to-one mapping, e.g.
<book>
<title>hadoop</title>
<author>peter</author>
<ISBN>121332</ISBN>
</book>
In this case, you might have to write your own implementation of
InputFormat that reads each entire XML fragment into some kind of data
structure, along the lines of the sketch above; the map step itself could
then look like the sketch below.
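To illustrate: assuming some XML-aware InputFormat (like the earlier
sketch, with the tags changed to <book>/</book>) delivers each complete
fragment as the map input value, the one-to-one XML-to-CSV step might look
like the following. The regex extraction is a stand-in for a real XML
parser, and all names are invented:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BookToCsvMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {
  // Matches <title>...</title>, <author>...</author>, <ISBN>...</ISBN>.
  private static final Pattern FIELD =
      Pattern.compile("<(title|author|ISBN)>(.*?)</\\1>");

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, NullWritable> out, Reporter reporter)
      throws IOException {
    String title = "", author = "", isbn = "";
    Matcher m = FIELD.matcher(value.toString());
    while (m.find()) {
      if ("title".equals(m.group(1)))  title  = m.group(2);
      if ("author".equals(m.group(1))) author = m.group(2);
      if ("ISBN".equals(m.group(1)))   isbn   = m.group(2);
    }
    // Each <book> element maps to exactly one CSV line.
    out.collect(new Text(title + "," + author + "," + isbn), NullWritable.get());
  }
}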
would become (CSV)
hadoop,peter,121332
so using mapreduce seems unsuitable?
thanks.
--
Vetle Roeim
Team Manager, Information Systems
Opera Software ASA <URL:http://www.opera.com/>
--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp