FAQ
Hello everyone,
I am building a Hadoop "app" to quickly index a corpus of documents.
This app will accept one or more XML files that contain the corpus.
Each document is made up of several sections: title, authors,
body... These sections are not static and depend on the collection. Here's
a sample of what the XML input file looks like:

<document id='1'>
<field name='title'> the divine comedy </field>
<field name='author'>Dante</field>
<field name='body'>halfway along our life's path.......</field>
</document>
<document id='2'>

...

</document>

I would like to discuss some implementation choices:

- What is the best way to "tell" my Hadoop app which sections to expect
between the <document> and </document> tags?

- Is it more appropriate to implement a record reader that passes the
whole content of the document tag to the mapper, or one that passes it
section by section? I was also wondering which parser to use, a DOM-like
one or a SAX-like one. Is there an efficient library you would recommend?

- Do you know of any library I could use to process text? By text
processing I mean common preprocessing operations like tokenization,
stopword elimination... I was thinking of using Lucene's engine. Could it
be a bottleneck?

I am looking forward to reading your opinions.

Thanks,

Marco
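
For readers following the thread, here is a minimal sketch of one way to approach the parsing and preprocessing questions above. It assumes a record reader has already handed the mapper one whole <document> element as a string. It uses the JDK's streaming StAX parser (a pull-style alternative to SAX) and discovers field names from the name attribute, so the set of sections does not have to be declared in advance. The class name, the method names, and the stopword list are hypothetical, and the tokenizer is a plain-Java stand-in for illustration, not Lucene's analyzer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; not part of Hadoop, Mahout, or Lucene.
final class DocumentFieldSketch {

    // Illustrative stopword list only; a real one would be much larger.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and"));

    /** Parses one <document> element into a map of field name to field text. */
    public static Map<String, String> parseFields(String documentXml) {
        Map<String, String> fields = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(documentXml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "field".equals(reader.getLocalName())) {
                        // Field names come from the input, not from a fixed schema.
                        String name = reader.getAttributeValue(null, "name");
                        // getElementText() consumes text up to the closing </field>.
                        fields.put(name, reader.getElementText().trim());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed document XML", e);
        }
        return fields;
    }

    /** Lowercases, splits on non-letter/non-digit runs, and drops stopwords. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<document id='1'>"
                + "<field name='title'> the divine comedy </field>"
                + "<field name='author'>Dante</field>"
                + "</document>";
        // Print each field name with its tokenized content.
        for (Map.Entry<String, String> entry : parseFields(xml).entrySet()) {
            System.out.println(entry.getKey() + " -> " + tokenize(entry.getValue()));
        }
    }
}
```

Passing the whole document to the mapper keeps the record reader simple and lets the mapper decide how to treat each field. A streaming parser avoids building a full DOM tree for every record, which may matter when documents are large.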


  • Lance Norskog at Jan 29, 2011 at 4:18 am
    Look at the Reuters example in the Mahout project: http://mahout.apache.org
    --
    Lance Norskog
    goksron@gmail.com
  • Marco Didonna at Jan 29, 2011 at 8:58 am

    On 01/29/2011 05:17 AM, Lance Norskog wrote:
    Look at the Reuters example in the Mahout project: http://mahout.apache.org
    Ehm, could you point me to it? I cannot find it.

    Thanks
  • Ted Yu at Jan 29, 2011 at 4:26 pm
    FYI: $MAHOUT_HOME/examples/bin/build-reuters.sh
  • James Seigel at Jan 29, 2011 at 8:19 pm
    Has anyone tried to do the Reuters example with both approaches? I seem to have problems getting them to run.

    Cheers
    James.


Discussion Overview
group: common-user @
categories: hadoop
posted: Jan 28, '11 at 10:50a
active: Jan 29, '11 at 8:19p
posts: 5
users: 4
website: hadoop.apache.org...
irc: #hadoop