FAQ
Hello,

I have a large XML file, over 10GB, with a simple format like

<book>
<title></title>
<author></author>
...
</book>

I parse the XML and convert it into another format, e.g. CSV.
Currently, the parsing is performed on a single server and is
slow (it takes a few hours).

Is Hadoop a good solution for splitting the XML file and spreading
the XML parsing across several machines in a cluster?

Thanks for any comment.


  • Feng Jiang at Sep 25, 2006 at 3:44 am
    MapReduce doesn't know anything about your application logic. As long as you
    can split the big XML into a lot of small XML files, Hadoop can help you.

    1. Split this big XML file into, say, 10,000 small XML files.
    2. Each small XML file becomes one key/value pair in a sequence file.
    3. Then use MapReduce to read the sequence file and parse the records; say
    you have 10 map & reduce tasks.
    4. Finally you have 10 output files, which contain the format you want.
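Steps 1 and 2 above can be sketched in plain Java (a stand-in for the real job, not code from this thread). The splitter below assumes, as in the sample, that each record's opening <book> and closing </book> tags sit on their own lines; each returned string could then be written as one value in a sequence file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class XmlSplitter {
    // Scan the big file once and cut it into self-contained
    // <book>...</book> records (step 1); each returned string
    // would become one value in the sequence file (step 2).
    static List<String> splitRecords(Reader in) throws IOException {
        List<String> records = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        StringBuilder current = new StringBuilder();
        boolean inRecord = false;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().startsWith("<book>")) inRecord = true;
            if (inRecord) current.append(line).append('\n');
            if (line.trim().startsWith("</book>")) {
                records.add(current.toString());
                current.setLength(0);
                inRecord = false;
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<book>\n<title>a</title>\n</book>\n<book>\n<title>b</title>\n</book>\n";
        System.out.println(splitRecords(new StringReader(sample)).size()); // 2
    }
}
```

In a real pipeline you would stream the 10GB file and write each record out as it is found, rather than hold the whole record list in memory.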
    On 9/24/06, howard chen wrote:

    [...]
  • Howard chen at Sep 27, 2006 at 2:59 am

    On 9/25/06, Feng Jiang wrote:
    [...]

    Hello,

    in my example, XML parsing to CSV seems to be a one-to-one mapping, e.g.

    <book>
    <title>hadoop</title>
    <author>peter</author>
    <ISBN>121332</ISBN>
    </book>

    would become (CSV)
    hadoop,peter,121332
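A one-to-one record transform is exactly the shape of a map function, and MapReduce allows map-only jobs (zero reduce tasks), so this case still fits. A hypothetical sketch of the per-record conversion using the JDK's built-in DOM parser (plain Java, not tied to Hadoop):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class BookToCsv {
    // Parse one <book> fragment and emit a CSV line: title,author,ISBN.
    static String toCsv(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        String title  = doc.getElementsByTagName("title").item(0).getTextContent();
        String author = doc.getElementsByTagName("author").item(0).getTextContent();
        String isbn   = doc.getElementsByTagName("ISBN").item(0).getTextContent();
        return title + "," + author + "," + isbn;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><title>hadoop</title><author>peter</author>"
                   + "<ISBN>121332</ISBN></book>";
        System.out.println(toCsv(xml)); // hadoop,peter,121332
    }
}
```

In a Hadoop job this method would be the body of map(), emitting the CSV line as the output value; with the number of reduce tasks set to zero, the map output is written directly to the output files.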

    So is MapReduce not suitable here?

    thanks.
  • Feng Jiang at Sep 27, 2006 at 4:37 am
    One principle is that the input file must be a sequence of pairs, and you
    must have an input formatter for the input file; otherwise you cannot use
    MapReduce directly.

    For your case, it seems that the input file does not consist of a sequence
    of pairs, so it may not be suitable for MapReduce.
    On 9/27/06, howard chen wrote:
    [...]
  • Vetle Roeim at Sep 27, 2006 at 9:25 am

    On Wed, 27 Sep 2006 06:37:06 +0200, Feng Jiang wrote:

    One principle is that the input file must be a sequence of pairs, and you
    must have an input formatter for the input file; otherwise you cannot use
    MapReduce directly.
    Technically, yes, but that pair could just as well consist of XML and some
    insignificant value. The most obvious example is processing log files,
    where data might be read using TextInputFormat, and the line number simply
    discarded.


    [...]
    in my example, XML parsing to CSV seems to be a one-to-one mapping, e.g.

    <book>
    <title>hadoop</title>
    <author>peter</author>
    <ISBN>121332</ISBN>
    </book>
    In this case, you might have to write your own implementation of
    InputFormat that reads the entire XML fragment into some kind of data
    structure.

    would become (CSV)
    hadoop,peter,121332



    --
    Vetle Roeim
    Team Manager, Information Systems
    Opera Software ASA <URL: http://www.opera.com/ >
  • Bryan A. P. Pendleton at Sep 27, 2006 at 8:25 pm
    It can be done! :) I'll see if I can contribute the code that I use for
    these sorts of things... I've been parsing through nearly a terabyte of XML
    on a regular basis using mapreduce since early this year.

    One way to do it would be to define a variant of TextInputFormat that,
    instead of using end-of-line to delimit elements, uses a custom string,
    perhaps a regexp. Just define a regexp for the XML element that you can
    cleanly treat independently, and away you go. The splits might not fall as
    neatly across block boundaries as you'd like, but it should be possible to
    make it perform much better than single-machine, single-parser work would.
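A minimal illustration of the regexp-delimited reading described above, outside Hadoop (the class name and shape are invented for this sketch): the core loop of a record reader that pulls whole <book> elements out of a buffer. A real InputFormat variant would apply this incrementally over file splits rather than to an in-memory string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexRecordReader {
    // Instead of newline-delimited records (TextInputFormat's model),
    // a regexp matches one whole XML element that can be treated
    // independently; each match becomes one input record.
    private static final Pattern BOOK =
            Pattern.compile("<book>.*?</book>", Pattern.DOTALL);

    static List<String> records(CharSequence buffer) {
        List<String> out = new ArrayList<>();
        Matcher m = BOOK.matcher(buffer);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("<book>x</book> junk <book>y</book>"));
        // [<book>x</book>, <book>y</book>]
    }
}
```

The reluctant `.*?` quantifier stops each match at the first closing tag, which is what makes the elements cleanly separable.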

    Of course, worst cases kill you. It might be a problem if one of your
    segments is a couple of gigabytes long, for instance, as is the case with my
    dataset if you use the highest-level XML container. The current mapreduce
    code really requires that your Writable instances fit in memory.
    On 9/27/06, Vetle Roeim wrote:
    [...]


    --
    Bryan A. P. Pendleton
    Ph: (877) geek-1-bp
  • Howard chen at Sep 28, 2006 at 4:14 pm

    On 9/28/06, Bryan A. P. Pendleton wrote:
    [...]
    hello,

    this sounds good. i would appreciate it if you can share your code (just
    the outline is okay), because i am new to mapreduce, thanks.

    regards,
    howa

Discussion Overview
group: common-user
categories: hadoop
posted: Sep 24, '06 at 2:28p
active: Sep 28, '06 at 4:14p
posts: 7
users: 4
website: hadoop.apache.org...
irc: #hadoop
