FAQ
I want to index an Urdu-language corpus (200 documents in CES XML DTD
format). Is it necessary to break the XML file into 200 separate files,
or can it be indexed in its original form using Lucene? Kindly guide me
in this regard.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Steven Rowe at Oct 24, 2007 at 3:14 pm
    Hi Liaqat,

    Liaqat Ali wrote:
    I want to index an Urdu-language corpus (200 documents in CES XML DTD
    format). Is it necessary to break the XML file into 200 separate files,
    or can it be indexed in its original form using Lucene? Kindly guide me
    in this regard.

    A Lucene document is composed of one or more fields. You will choose
    which fields each document will have. In your initial implementation,
    you may choose to extract all text from each document and place it in a
    single indexed text field.

    It is your responsibility to locate and open your input sources and
    break them up or combine them to produce the document field data -
    Lucene does not provide this functionality for you.

    It is your choice whether you break the input files before you index
    them or as part of the indexing process - in either case, it is your
    responsibility, not Lucene's. This choice will depend on the parsing
    library you choose, the size of the corpus, and the amount of memory
    available on the machine on which you perform the indexing. If the
    corpus is small, and/or you process the source XML file with a parser
    which does not hold the entire contents in memory (e.g. SAX), and/or the
    machine has lots of memory, it should be okay to construct document
    fields on-the-fly, instead of first splitting the original file up.
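
    The on-the-fly approach can be sketched roughly as follows, using only
    the JDK's SAX parser. This is a minimal sketch under assumptions, not a
    definitive implementation: the class name CorpusSplitter and the method
    forEachDocument are hypothetical, and the element that wraps each
    document (for example "cesDoc") is an assumption that has to be checked
    against the actual corpus. The sink callback is where a Lucene document
    with a single text field would be built and added to the index.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: streams one large XML file and emits one text blob per document. */
public class CorpusSplitter {

    /**
     * Parses the XML with SAX (no full in-memory tree) and passes the
     * concatenated character data of every docElement to sink.
     * docElement is the qualified name of the per-document wrapper element;
     * which element that is depends on the corpus (assumption).
     */
    public static void forEachDocument(InputStream xml, String docElement,
                                       Consumer<String> sink) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(xml, new DefaultHandler() {
            // Non-null only while the parser is inside a docElement.
            private StringBuilder text;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(docElement)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (text != null && qName.equals(docElement)) {
                    // One complete document: hand its text to the caller,
                    // e.g. to build and index one Lucene document.
                    sink.accept(text.toString().trim());
                    text = null;
                }
            }
        });
    }
}
```

    Only one document's text is held in memory at a time, so the original
    file does not have to be split beforehand. Note that this sketch
    collects all character data inside the wrapper element, including any
    header text, and does not handle nested wrapper elements.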

    Steve

    --
    Steve Rowe
    Center for Natural Language Processing
    http://www.cnlp.org/tech/lucene.asp


Discussion Overview
group: java-user
categories: lucene
posted: Oct 24, '07 at 1:03p
active: Oct 24, '07 at 3:14p
posts: 2
users: 2
website: lucene.apache.org