  • Leonard Gestrin at Aug 3, 2009 at 2:49 am
    Hello,
    I have a question about the KEYWORD field type and searching/updating. I am getting strange behavior that I can't quite comprehend.
    My index is created using StandardAnalyzer, which is used for both writing and searching. It has three fields:

    userpin - alphanumeric field which is stored as TEXT
    documentkey - alphanumeric field which is stored as TEXT
    contents - text of document which is stored as TEXT

    When I try to update a document, I create a Term to find the document by documentKey and I use

    org.apache.lucene.index.IndexWriter.updateDocument(term, pDocument);

    to do the update. Lucene fails to find the document by the term, and I end up with duplicate documents in the index.
    When I changed the index to define documentKey as KEYWORD, the updates started to work fine.
    However, searching for documentKey using StandardAnalyzer stopped working.

    It appears that Lucene is using KeywordAnalyzer to search for the term during update, even though the indexer is opened with StandardAnalyzer.

    The sample values that are stored in documentKey are: "L2222FAHBHMF", "L2222FAHBHAS".
    I noticed that if documentKey is a numeric value, both KeywordAnalyzer and StandardAnalyzer can find the documents by it, so the reader can find and the indexer can update without any problems. With alphanumeric values I can't get both to work.
    Any help is appreciated.
    Thanks
    Leonard

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ian Lea at Aug 3, 2009 at 9:21 am
    Hi


    Storing documentkey as TEXT passes it through StandardAnalyzer, which
    lowercases it, so the index holds "l2222fahbhmf" rather than
    "L2222FAHBHMF". When you changed it to KEYWORD, it was stored as is, so
    the updateDocument(term, doc) call worked, but searching failed because
    StandardAnalyzer lowercased the query text. Numeric keys worked
    everywhere because lowercasing does not change them.
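The four outcomes described in this thread can be reproduced with a small model. This is a minimal sketch in plain Java, not Lucene code: the class `TermLookupSketch` and its methods are hypothetical names, and analysis is assumed to be nothing more than lowercasing a single token (the real StandardAnalyzer does more than that).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of the term mismatch; this is NOT Lucene code.
class TermLookupSketch {

    // Assumption: "analysis" is reduced to lowercasing one token.
    static String analyze(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Terms produced when the field is analyzed at index time (TEXT-style).
    static Set<String> indexAnalyzed(String... values) {
        Set<String> terms = new HashSet<>();
        for (String value : values) {
            terms.add(analyze(value));
        }
        return terms;
    }

    // Terms produced when the field is indexed as is (KEYWORD-style).
    static Set<String> indexAsIs(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    // A hand-built term is compared exactly, with no analysis applied.
    static boolean rawTermMatches(Set<String> index, String term) {
        return index.contains(term);
    }

    // A search that runs the query text through the analyzer first.
    static boolean analyzedQueryMatches(Set<String> index, String queryText) {
        return index.contains(analyze(queryText));
    }

    public static void main(String[] args) {
        String key = "L2222FAHBHMF";
        Set<String> textField = indexAnalyzed(key);
        Set<String> keywordField = indexAsIs(key);

        System.out.println("TEXT, raw term:          " + rawTermMatches(textField, key));
        System.out.println("TEXT, analyzed query:    " + analyzedQueryMatches(textField, key));
        System.out.println("KEYWORD, raw term:       " + rawTermMatches(keywordField, key));
        System.out.println("KEYWORD, analyzed query: " + analyzedQueryMatches(keywordField, key));
        System.out.println("TEXT, numeric raw term:  "
                + rawTermMatches(indexAnalyzed("12345"), "12345"));
    }
}
```

In this model the raw term misses the analyzed field and the analyzed query misses the as-is field, which matches the failed update and the failed search reported in the question. The numeric key matches either way because lowercasing leaves it unchanged.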

    See "Why is it important to use the same analyzer type during indexing
    and search?" in the FAQ.

    The best solution is probably to store it as KEYWORD and use
    PerFieldAnalyzerWrapper to specify KeywordAnalyzer for documentkey.
    The javadocs have an example showing what you need.


    TEXT and KEYWORD haven't been around in Lucene for a while. You might
    like to consider upgrading. It is good practice anyway to mention which
    version you are using when asking questions.


    --
    Ian.
  • Leonard Gestrin at Aug 3, 2009 at 5:35 pm
    Hi Ian,
    Thank you for the reply.
    I have recently upgraded the application to Lucene 2.4.1.

    I did not realize that during an update operation the standard analyzer is not invoked on the term the same way it is for searching, even though the indexer is opened with it. I am a newbie on Lucene (I inherited the project) - I will do some reading as you suggested.
    Thanks
    Leonard
  • Otis Gospodnetic at Aug 4, 2009 at 2:40 pm
    Leonard,

    Make sure the "key" or "id" fields are not analyzed, and that should solve your problems.
    Are you using an older version of Lucene?

    Otis
    --
    Sematext is hiring -- http://sematext.com/about/jobs.html?mls
    Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

  • Erick Erickson at Aug 3, 2009 at 1:21 pm
    When you construct a Term manually, no analyzers are applied; it's constructed
    with whatever you put in there, just as you specify it. So,
    indeed, it "looks" like a KeywordAnalyzer is being used, but in reality
    no analysis is being done.

    So what's happening is that when you index with StandardAnalyzer,
    your tokens are getting lower-cased and, I assume, you construct
    your Term with upper case, so it's not found.

    You probably want to use KeywordAnalyzer for your document IDs
    as it's intended to pass through the input without change. Then, as you
    see, you'll be able to find things.

    See PerFieldAnalyzerWrapper for the way to use different
    analyzers on different fields, both at index time and at
    search time; it might help.
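The per-field idea can be illustrated without Lucene. The following is a rough sketch in plain Java under stated assumptions: `PerFieldAnalysisSketch` is a hypothetical class, an analyzer is modelled as a simple `String -> String` function (real analyzers produce token streams), and the default analyzer only lowercases. It is not the PerFieldAnalyzerWrapper API itself; the javadocs mentioned earlier in the thread remain the place to look for that.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of per-field analysis; this is NOT Lucene code.
class PerFieldAnalysisSketch {
    // Analyzer used for any field without its own entry.
    private final Function<String, String> defaultAnalyzer;
    // Field-specific overrides, keyed by field name.
    private final Map<String, Function<String, String>> perField = new HashMap<>();

    PerFieldAnalysisSketch(Function<String, String> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void addAnalyzer(String field, Function<String, String> analyzer) {
        perField.put(field, analyzer);
    }

    // Picks the analyzer for the field, falling back to the default.
    String analyze(String field, String value) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(value);
    }

    public static void main(String[] args) {
        // Default: lowercase everything (stand-in for a standard analyzer).
        PerFieldAnalysisSketch analysis =
                new PerFieldAnalysisSketch(s -> s.toLowerCase(Locale.ROOT));
        // Key field: pass the value through unchanged (stand-in for a keyword analyzer).
        analysis.addAnalyzer("documentkey", Function.identity());

        String key = "L2222FAHBHMF";
        // Using the same object for indexing and searching keeps both sides consistent.
        String indexedTerm = analysis.analyze("documentkey", key);
        String queryTerm = analysis.analyze("documentkey", key);

        System.out.println("key unchanged:        " + indexedTerm.equals(key));
        System.out.println("index matches query:  " + indexedTerm.equals(queryTerm));
        System.out.println("contents analyzed as: "
                + analysis.analyze("contents", "Some Document Text"));
    }
}
```

Because the key field is left untouched on both sides, a hand-built term with the original upper-case value and an analyzed query for the same value would refer to the same indexed term, while other fields still go through the default analysis.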

    See Luke (google Lucene Luke) for a wonderful tool that
    allows you to look at your index and see what's actually
    stored as well as what the effect of different analyzers is.

    Best
    Erick
  • Leonard Gestrin at Aug 3, 2009 at 5:37 pm
    Thank you

Discussion Overview
group: java-user
categories: lucene
posted: Aug 3, '09 at 2:44a
active: Aug 4, '09 at 2:40p
posts: 7
users: 4
website: lucene.apache.org