I currently have a project that indexes multiple file formats. There is a
second index that I use to keep track of files: because the database queries
are too slow, we query the index and use an ID field to fetch the records
from the database.

However, I've started to run into some issues with the StandardAnalyzer. We
were using different analyzers at one point, so we moved all analyzer
creation into this function:

public static Analyzer getAnalyzer()
{
    // Empty stop-word set: no words are removed as stop words.
    Hashtable htStopWords = new Hashtable();
    Analyzer analyzer = new StandardAnalyzer(
        Lucene.Net.Util.Version.LUCENE_29, htStopWords);

    return analyzer;
}

So all functions should now be using a StandardAnalyzer.

As far as I know, a StandardAnalyzer uses a LowerCaseFilter to convert all
strings to lower case, and in some cases that is what happens. To get all
documents in an index, we use a field called SearchAll, store the word
"SearchAll" in it, and then search for that.

The document to write is created in this function:

public Document getFileInfoDoc()
{
    Document doc = new Document();
    doc.Add(new Field("FieldId", this.FieldID, Field.Store.YES,
        Field.Index.NOT_ANALYZED));
    doc.Add(new Field("SelectAll", "SelectAll", Field.Store.NO,
        Field.Index.ANALYZED));
    doc.Add(new Field("FilePath", this.FilePath, Field.Store.YES,
        Field.Index.ANALYZED));

    return doc;
}

In one case we call this code

Document doc = getFileInfoDoc();
Analyzer analyzer = getAnalyzer();
indexWriter.UpdateDocument(new Term("FileId", this.FileId.ToString()), doc,
    analyzer);

This code writes to the indexWriter, but DOES NOT ALWAYS apply the
LowerCaseFilter to the string stored in SelectAll.

To rebuild the index, we DeleteAllDocs from the index and loop through each
file to be stored. For each file, we call getFileInfoDoc from above and then
the following 2 lines of code:

Analyzer analyzer = getAnalyzer();
iwCurrent.UpdateDocument(new Term("FileId", iFileID.ToString()), doc,
    analyzer);

This USUALLY stores the SearchAll field as lower case, but sometimes it
still fails and writes it as upper case.



Is there anything I am missing to make sure the LowerCaseFilter is applied?
I don't particularly want to lower-case the text in my own code: a second
index we use may have the same issue, but it contains the full contents of
each file, and lower-casing that may have a major impact on performance.


Thanks in advance,

Trevor Watson


  • Digy at Mar 5, 2011 at 12:13 am
    Hi Trevor,

    Lucene.Net is intended to be deterministic code :) So "NOT ALWAYS" or
    "USUALLY" should mean a bug either in Lucene.Net or in your code. I would
    recommend reviewing your code and using Luke (http://www.getopt.org/luke/)
    to inspect your index and see what it actually contains.

    DIGY

    PS: Don't try to search an index with an analyzer different from the one
    it was created with.
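
    For a programmatic check alongside Luke, the indexed terms of one field
    can be listed directly from an IndexReader. This is only a minimal
    sketch, assuming the Lucene.Net 2.9.x API (method-style accessors such as
    Term() and Text(); later releases expose some of these as properties).
    The class name IndexTermDump and the method DumpTerms are hypothetical
    and are not part of Lucene.Net or of Trevor's code.

    ```csharp
    using System;
    using Lucene.Net.Index;

    public static class IndexTermDump
    {
        // Prints every indexed term of one field with its document frequency.
        // Assumption: Lucene.Net 2.9.x, where Term(), Field() and Text() are methods.
        public static void DumpTerms(Lucene.Net.Store.Directory directory, string fieldName)
        {
            IndexReader reader = IndexReader.Open(directory, true); // true = read-only
            try
            {
                // Terms(Term) positions the enumeration at the first term
                // that is equal to or greater than the one supplied.
                TermEnum terms = reader.Terms(new Term(fieldName, ""));
                try
                {
                    do
                    {
                        Term term = terms.Term();
                        // Stop when the enumeration is exhausted or moves to another field.
                        if (term == null || term.Field() != fieldName)
                        {
                            break;
                        }
                        Console.WriteLine("{0}:{1} (docs: {2})",
                            term.Field(), term.Text(), terms.DocFreq());
                    } while (terms.Next());
                }
                finally
                {
                    terms.Close();
                }
            }
            finally
            {
                reader.Close();
            }
        }
    }
    ```

    Calling DumpTerms with the Directory the IndexWriter writes to and the
    field name "SelectAll" shows whether any mixed-case terms are present in
    the index.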


    -----Original Message-----
    From: Trevor Watson
    Sent: Friday, March 04, 2011 9:04 PM
    To: lucene-net-user@lucene.apache.org
    Subject: [Lucene.Net] StandardAnalyzer and lowercase

  • Trevor Watson at Mar 7, 2011 at 5:13 pm
    Thanks for the response!

    We changed our code so we always use a StandardAnalyzer, which as far as
    I know should always use a LowerCaseFilter when an IndexWriter writes to
    an index. However, using Luke shows that this isn't the case.

    Does the StandardAnalyzer use a LowerCaseFilter?
    Should it be stored in the index without capitalization?
    Is there a way to force it to make all data lower case without using
    C#'s ToLower()? Would it be best to write an analyzer that extends
    StandardAnalyzer to use a ToLower?

    Thanks in advance.

    Trevor
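
    One way to check the first question without involving an index at all is
    to run a string through the analyzer and print the tokens it emits. This
    is a minimal sketch, assuming Lucene.Net 2.9.x (the TermAttribute cast
    and the Term() method follow that version's API and differ in later
    releases). The class name AnalyzerCheck is hypothetical.

    ```csharp
    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.Tokenattributes;

    public static class AnalyzerCheck
    {
        public static void Main()
        {
            Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

            // The field name only matters for analyzers that treat fields differently.
            TokenStream stream = analyzer.TokenStream("SelectAll", new StringReader("SelectAll"));

            // Assumption: 2.9.x attribute API, where AddAttribute takes a Type.
            TermAttribute term = (TermAttribute) stream.AddAttribute(typeof(TermAttribute));

            // Print each token exactly as the analyzer would hand it to the index.
            while (stream.IncrementToken())
            {
                Console.WriteLine(term.Term());
            }
            stream.Close();
        }
    }
    ```

    If the analyzer lowercases, the printed token is all lower case; a
    mixed-case token here would point at the analyzer rather than at the
    indexing code.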
  • Erik Hatcher at Mar 7, 2011 at 5:24 pm
    In Java Lucene, StandardAnalyzer lowercases (I can only speak to Java Lucene though).

    Stored values are _never_ affected by analysis, though. What goes in is what gets stored. Analysis is a completely different step and location in the index.

    Erik
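
    The distinction between the stored value and the indexed terms can be
    seen by writing one document and reading both back. This is a minimal
    sketch, assuming Lucene.Net 2.9.x and an in-memory RAMDirectory. Unlike
    the code in the question, the field here uses Field.Store.YES so that
    there is a stored value to read. The class name StoredVersusIndexed is
    hypothetical.

    ```csharp
    using System;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    public static class StoredVersusIndexed
    {
        public static void Main()
        {
            RAMDirectory directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory,
                new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
                true, IndexWriter.MaxFieldLength.UNLIMITED);

            // Store.YES keeps the original text; Index.ANALYZED runs the analyzer
            // to produce the searchable terms. These are two separate outputs.
            Document doc = new Document();
            doc.Add(new Field("SelectAll", "SelectAll", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
            writer.Close();

            IndexReader reader = IndexReader.Open(directory, true); // true = read-only

            // Stored value: returned as it was passed to the Field.
            Console.WriteLine("stored:  " + reader.Document(0).Get("SelectAll"));

            // Indexed term: the first term of the field, as produced by the analyzer.
            TermEnum terms = reader.Terms(new Term("SelectAll", ""));
            Console.WriteLine("indexed: " + terms.Term().Text());

            terms.Close();
            reader.Close();
        }
    }
    ```

    The "stored" line reflects the text exactly as it was given to the Field,
    while the "indexed" line reflects whatever the analyzer produced, which
    is the value a search is matched against.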


Discussion Overview
group: lucene-net-user
categories: lucene
posted: Mar 4, '11 at 7:29p
active: Mar 7, '11 at 5:24p
posts: 4
users: 4
website: lucene.apache.org
