Problems indexing large documents
Hi all! I have a problem... When I index text documents in English, there
is no problem, but when I index Spanish text documents (and they're big),
a lot of information from the document doesn't get indexed (I suppose it
is due to the Analyzer, but if the document is less than 400 KB it works
perfectly). However, I want to index ALL the strings in the document with
no stop words. Is this possible?

Thanks in advance
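
For reference, stop-word removal can be disabled at the Analyzer. A minimal
sketch against the Lucene 1.9/2.0-era API current at the time (the index
path and sample text are placeholders, not from the thread):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class NoStopWords {
        public static void main(String[] args) throws Exception {
            // An empty stop-word list disables stop-word removal, so
            // every token the tokenizer produces is indexed.
            StandardAnalyzer analyzer = new StandardAnalyzer(new String[0]);
            IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);

            Document doc = new Document();
            doc.add(new Field("body", "un texto en español", Field.Store.NO,
                              Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }
    }

As the replies below point out, though, the missing text here is caused by a
per-field length cap, not by the stop-word list.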


  • Daniel Naber at Jun 9, 2006 at 8:25 pm

On Friday 09 June 2006 21:31, manu mohedano wrote:

Hi all! I have a problem... When I index text documents in English,
there is no problem, but when I index Spanish text documents (and
they're big), a lot of information from the document doesn't get
indexed
    Read the FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ,
    item "Why am I getting no hits / incorrect hits?"

    --
    http://www.danielnaber.de

  • Pasha Bizhan at Jun 9, 2006 at 8:27 pm
    Hi,
    From: manu mohedano
Hi all! I have a problem... When I index text documents in
English, there is no problem, but when I index Spanish text
documents (and they're big), a lot of information from the
document doesn't get indexed (I suppose it is due to the
Analyzer, but if the document is less than 400 KB it works
perfectly). However, I want to index ALL the strings in the
document with no stop words. Is this possible?
Read the javadoc for DEFAULT_MAX_FIELD_LENGTH at
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

    Pasha Bizhan
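
    The relevant default is DEFAULT_MAX_FIELD_LENGTH, 10,000 tokens per
    field; tokens beyond the cap are silently dropped, which is why only
    the larger documents appear to lose text. A minimal sketch of lifting
    the cap (Lucene 1.9/2.0-era API; the path and analyzer are
    placeholders):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;

        public class LiftTokenCap {
            public static void main(String[] args) throws Exception {
                IndexWriter writer = new IndexWriter("/path/to/index",
                        new StandardAnalyzer(), true);
                // The default is 10,000 tokens per field; raise it so
                // large documents are indexed in full (at the cost of
                // more memory during indexing).
                writer.setMaxFieldLength(Integer.MAX_VALUE);
                writer.close();
            }
        }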



  • Rob Staveley (Tom) at Jun 10, 2006 at 6:21 am
I'm trying to come to terms with
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
too. I've been attempting to index large text files as single Lucene
documents, passing them as a java.io.Reader to keep memory usage down. I
was assuming (like, I suspect, manu mohedano) that an unstored field
could be of any length and that maxFieldLength was only applicable to
stored fields. Do we in fact need to break the document into manageable
parts?
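
    For what it's worth, the javadoc describes maxFieldLength as the
    maximum number of terms indexed per field, so it applies to indexed
    tokens whether or not the field is stored; a Reader-backed field is
    unstored but still subject to the cap. A sketch (paths are
    placeholders):

        import java.io.FileReader;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;

        public class ReaderFieldDemo {
            public static void main(String[] args) throws Exception {
                IndexWriter writer = new IndexWriter("/path/to/index",
                        new StandardAnalyzer(), true);
                writer.setMaxFieldLength(Integer.MAX_VALUE); // lift the cap
                Document doc = new Document();
                // Field(String, Reader) is indexed and tokenized but never
                // stored; the cap counts its indexed tokens all the same.
                doc.add(new Field("body", new FileReader("/path/to/big.txt")));
                writer.addDocument(doc);
                writer.close();
            }
        }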

    -----Original Message-----
    From: Pasha Bizhan
    Sent: 09 June 2006 21:35
    To: java-user@lucene.apache.org
    Subject: RE: Problems indexing large documents

    Hi,
    From: manu mohedano
Hi all! I have a problem... When I index text documents in
English, there is no problem, but when I index Spanish text
documents (and they're big), a lot of information from the
document doesn't get indexed (I suppose it is due to the
Analyzer, but if the document is less than 400 KB it works
perfectly). However, I want to index ALL the strings in the
document with no stop words. Is this possible?
Read the javadoc for DEFAULT_MAX_FIELD_LENGTH at
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

    Pasha Bizhan



  • Rob Staveley (Tom) at Jun 10, 2006 at 7:01 am
The answer was of course in the FAQ -
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71

Breaking large documents into manageable chunks isn't ideal. I need to
index e-mail with attachments, which are frequently large. Currently each
message part corresponds to a Lucene Document, but that means I am
discarding terms beyond maxFieldLength. Having to span a message part
across multiple Lucene Documents is ugly for various reasons - e.g. a
search returns multiple Documents with different relevance, but more than
one of these Documents refers to the same message part.

    Two thoughts:

(1) If the sentence "XX YY XX ZZ XX" were indexed, does that count as 3
terms in this context or 5? If repeated terms are not counted, I can
probably cope by increasing the heap size and raising maxFieldLength to
deal with realistic vocabularies, and I ought to be able to handle most
large documents.

(2) Lucene wishlist thought... Would it be realistic to have an option
for Field indexing which isn't entirely in RAM? The client code knows
when a Field is going to be a big one, because it can look at the file
size before passing the java.io.Reader to the Field. If we could have a
flag in Field that says "do this the slow way, because the calling code
already knows this is a big one", and Otis, Eric & Co could work their
magic, we could perhaps have large Lucene Documents without running out
of heap space. maxFieldLength = -1 could perhaps denote what's needed?
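
    On (1): the cap counts tokens as they are indexed rather than unique
    terms, so "XX YY XX ZZ XX" counts as 5, not 3. On keeping one Lucene
    Document per message part: fields added to a Document under the same
    name are indexed one after another, so a part fed in as several chunks
    still lands in a single Document. A sketch (the chunk source is
    hypothetical, not from the thread):

        import java.io.Reader;
        import java.util.List;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;

        public class OneDocPerPart {
            // 'chunks' is a hypothetical list of Readers over one message part.
            static Document build(List<Reader> chunks) {
                Document doc = new Document();
                for (Reader chunk : chunks) {
                    // Same-name fields are indexed back-to-back, so a search
                    // returns a single Document for the whole part.
                    doc.add(new Field("body", chunk));
                }
                return doc;
            }
        }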

    -----Original Message-----
    From: Rob Staveley (Tom)
    Sent: 10 June 2006 07:22
    To: java-user@lucene.apache.org
    Subject: RE: Problems indexing large documents

I'm trying to come to terms with
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
too. I've been attempting to index large text files as single Lucene
documents, passing them as a java.io.Reader to keep memory usage down. I
was assuming (like, I suspect, manu mohedano) that an unstored field
could be of any length and that maxFieldLength was only applicable to
stored fields. Do we in fact need to break the document into manageable
parts?

    -----Original Message-----
    From: Pasha Bizhan
    Sent: 09 June 2006 21:35
    To: java-user@lucene.apache.org
    Subject: RE: Problems indexing large documents

    Hi,
    From: manu mohedano
Hi all! I have a problem... When I index text documents in
English, there is no problem, but when I index Spanish text
documents (and they're big), a lot of information from the
document doesn't get indexed (I suppose it is due to the
Analyzer, but if the document is less than 400 KB it works
perfectly). However, I want to index ALL the strings in the
document with no stop words. Is this possible?
Read the javadoc for DEFAULT_MAX_FIELD_LENGTH at
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

    Pasha Bizhan



  • Manu mohedano at Jun 10, 2006 at 4:28 am
Problem solved! Thanks a lot, guys!

Discussion Overview
group: java-user @ lucene.apache.org
categories: lucene
posted: Jun 9, 2006 at 7:32 PM
active: Jun 10, 2006 at 7:01 AM
posts: 6
users: 4
