Case Sensitivity
Hi All,

Once I have indexed a bunch of documents with StandardAnalyzer (and the
effort needed to reindex them is not worthwhile), is there a way to search
the index without case sensitivity?
I do not use any sophisticated analyzer that makes use of
LowerCaseTokenizer.

Please let me know if there is a solution to work around this case
sensitivity problem.

Many thanks
Dino


  • Dino Korah at Aug 13, 2008 at 4:16 pm
    Also, I would like to highlight the version of Lucene I am using: it is 2.0.0.

  • Steven A Rowe at Aug 13, 2008 at 4:31 pm
    Hi Dino,

    StandardAnalyzer incorporates StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. Any index you create using it will only provide case-insensitive matching.
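
    (For reference, its tokenStream() amounts to roughly this chain; a sketch
    against the Lucene 2.x API, not the actual source:)

        // classes from org.apache.lucene.analysis and org.apache.lucene.analysis.standard
        public TokenStream tokenStream(String fieldName, Reader reader) {
          TokenStream ts = new StandardTokenizer(reader); // words, acronyms, emails, ...
          ts = new StandardFilter(ts);                    // strips possessives and acronym dots
          ts = new LowerCaseFilter(ts);                   // this step is what makes matching caseless
          ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS); // drops common stop words
          return ts;
        }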

    Steve
  • Erick Erickson at Aug 13, 2008 at 4:48 pm
    What analyzer are you using at *query* time? I suspect that's where your
    problem lies if you indeed "don't use any sophisticated analyzers", since
    you *are* using a sophisticated analyzer at index time. You almost
    invariably want to use the same analyzer at query time and at index time.
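
    For instance, a minimal sketch of keeping the two in sync (Lucene 2.x API;
    the path and field name are made up):

        Analyzer analyzer = new StandardAnalyzer();

        // index time
        IndexWriter writer = new IndexWriter("/tmp/index", analyzer, true);
        Document doc = new Document();
        doc.add(new Field("body", "Some MIXed CasE text", Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // query time: the same analyzer, so MIXED, mixed and MiXeD all match
        Query q = new QueryParser("body", analyzer).parse("MIXED");
        Hits hits = new IndexSearcher("/tmp/index").search(q);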

    Please start a separate thread with your second question. Google
    "Thread Hijacking" for the explanation of why that's a good idea.

    Best
    Erick
  • Sergey Kabashnyuk at Aug 14, 2008 at 8:48 am
    Hello.

    I have a similar question.

    I need to implement:
    1. Case-sensitive search.
    2. Lower-case search for a concrete field.
    3. Upper-case search for a concrete field.

    For now I use
    new Field("PROPERTIES",
              content,
              Field.Store.NO,
              Field.Index.NO_NORMS,
              Field.TermVector.NO)
    for the original string and do case-sensitive searches.

    But does anyone have an idea of how to implement the second and third types
    of search?

    Thanks

    --
    Sergey Kabashnyuk
    eXo Platform SAS

  • Erick Erickson at Aug 14, 2008 at 1:45 pm
    About the only way to do this that I know of is to
    index the data three times, once without any case
    changing, once uppercased and once lowercased.
    You'll have to watch your analyzer, probably making
    up your own (easily done, see the synonym analyzer
    in Lucene in Action).
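
    (A sketch of the three-field variant with untokenized terms; the field
    names are made up:)

        String content = "Some Mixed Case Value";
        Document doc = new Document();
        // original case, for case-sensitive matches
        doc.add(new Field("text", content, Field.Store.NO, Field.Index.UN_TOKENIZED));
        // folded copies, for matching lowercased / uppercased query terms
        doc.add(new Field("text_lower", content.toLowerCase(), Field.Store.NO, Field.Index.UN_TOKENIZED));
        doc.add(new Field("text_upper", content.toUpperCase(), Field.Store.NO, Field.Index.UN_TOKENIZED));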

    Your example doesn't tell us anything, since the critical
    information is the *analyzer* you use, both at query and
    at index times. The analyzer is responsible for any
    transformations, like case folding, tokenizing, etc.

    But what is your use-case for needing both upper and
    lower case comparisons? I have a hard time coming
    up with a reason to do both that wouldn't be satisfied
    by just a caseless search.

    Best
    Erick
  • Sergey Kabashnyuk at Aug 14, 2008 at 2:32 pm
    Thanks for your reply, Erick.


    In the example I wanted to show that I store the field as Field.Index.NO_NORMS.

    As I understand it, that means the field contains the original string
    regardless of which analyzer I choose (StandardAnalyzer by default).

    All queries I build myself, without using parsers.
    For example: new TermQuery(new Term("field", "MaMa"));


    I agree with you about the possible implementation,
    but it increases the index size several times over.

    But are there other possibilities, such as a custom query, possibly
    similar to RegexQuery/RegexTermEnum, that would compare terms
    at its own discretion?


    --
    Sergey Kabashnyuk
    eXo Platform SAS

  • Doron Cohen at Aug 14, 2008 at 2:43 pm

    > In the example I wanted to show that I store the field as
    > Field.Index.NO_NORMS. As I understand it, that means the field
    > contains the original string regardless of which analyzer I choose
    > (StandardAnalyzer by default).
    This would be achieved by UN_TOKENIZED.

    The NO_NORMS just guides Lucene to avoid normalizing
    results by document length for this field (and to avoid allocating
    resources for that).
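
    (Sketched out with the Lucene 2.x constants; "doc" and "content" stand for
    your document and whatever string is being indexed:)

        // UN_TOKENIZED: bypasses the analyzer, indexed as one exact,
        // case-preserving term; norms are still stored
        doc.add(new Field("text", content, Field.Store.NO, Field.Index.UN_TOKENIZED));

        // NO_NORMS: per the 2.x javadoc, also indexed without the analyzer,
        // and additionally skips the one-byte-per-field length norms
        doc.add(new Field("text", content, Field.Store.NO, Field.Index.NO_NORMS));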

    Other than that, I join Erick in wondering why all three options are needed.
    It would help the list to help you if you provided
    a few simple examples of: document, query, expected result.

    Doron
  • Erick Erickson at Aug 14, 2008 at 3:18 pm
    Be aware that StandardAnalyzer lowercases all the input,
    both at index and query times. Field.Store.YES will store
    the original text without any transformations, so doc.get(<field>)
    will return the original text. However, no matter what the
    Field.Store value, the *indexed* tokens (when you use
    Field.Index.TOKENIZED)
    are passed through the analyzer.

    For instance, indexing "MIXed CasE TEXT" in a
    field called "myfield" with Field.Store.YES,
    Field.Index.TOKENIZED would index the
    following tokens (with StandardAnalyzer).
    mixed
    case
    text

    and searches (with StandardAnalyzer) would match
    any case in the query terms (e.g. MIXED would hit,
    as would mixed as would CaSE).

    However, doc.get("myfield") would return
    "MIXed CasE TEXT"

    As Doron said, though, a few use cases would
    help us provide better answers.

    Best
    Erick

  • Andre Rubin at Aug 14, 2008 at 10:17 pm
    Sergey,

    Based on a recent discussion I posted
    (http://www.nabble.com/Searching-Tokenized-x-Un_tokenized-td18882569.html),
    you cannot use UN_TOKENIZED, because you can't have any analyzer run
    through it.

    My suggestion: use a tokenized field and a custom-made Analyzer.
    I haven't figured out all the details for you, but I think it's possible.

    Andre
  • Sergey Kabashnyuk at Aug 15, 2008 at 7:52 am
    Hello

    Here's my use case. Content of the field:

    Doc1 -
    Field - "text" - "Field Without Norms"

    Doc2 -
    Field - "text" - "field without norms"

    Doc3 -
    Field - "text" - "FIELD WITHOUT NORMS"


    Query and expected result:
    1. new Term("text", "Field Without Norms") -> doc1
    2. new Term("text", "field without norms") -> doc2
    3. new Term("text", "FIELD WITHOUT NORMS") -> doc3
    4. lowercase("text", "field without norms") -> doc1, doc2, doc3
    5. uppercase("text", "FIELD WITHOUT NORMS") -> doc1, doc2, doc3

    I store the "text" field like:
    new Field("text", content, Field.Store.NO, Field.Index.NO_NORMS, Field.TermVector.NO)
    using StandardAnalyzer, and queries 1-3 work exactly as I need. The
    question is how to create queries 4 and 5.

    Thanks
    Sergey Kabashnyuk
    eXo Platform SAS

  • Doron Cohen at Aug 16, 2008 at 8:01 pm
    Hi Sergey, seems like cases 4 and 5 are equivalent,
    both meaning case-insensitive, right? Otherwise please
    explain the difference.

    If it is required to support both case sensitive
    (cases 1,2,3) and case insensitive (case 4/5) then
    both forms must be saved in the index - in two separate
    fields (as Erick mentioned, I think).
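
    (A sketch of that two-field layout; the second field name is made up:)

        // index time: an exact field for queries 1-3, a folded one for 4/5
        doc.add(new Field("text", content, Field.Store.NO, Field.Index.NO_NORMS));
        doc.add(new Field("text_lower", content.toLowerCase(), Field.Store.NO, Field.Index.NO_NORMS));

        // query time
        Query exact = new TermQuery(new Term("text", "Field Without Norms"));          // doc1 only
        Query caseless = new TermQuery(new Term("text_lower", "field without norms")); // doc1, doc2, doc3
        // an uppercase query is the same search: fold it before building the term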

    Hope this helps,
    Doron
  • Dino Korah at Aug 19, 2008 at 12:59 pm
    Hi Guys,
    From the discussion here, what I could understand was: if I am using
    StandardAnalyzer on TOKENIZED fields, for both indexing and querying, I
    shouldn't have any problems with case. But if I have any UN_TOKENIZED
    fields, there will be problems if I do not case-normalize them myself before
    adding them as a field to the document.

    In my case I have a mixed scenario. I am indexing emails, and the email
    addresses are indexed UN_TOKENIZED. I do have a second set of custom-
    tokenized fields, which keep the tokens in individual fields with the same name.

    For example, if the email had a from address "John Smith"
    <J.Smith@world.net>, my document looks like this

    ------------------8<----------------
    to: ... - UN_TOKENIZED
    from: J.Smith@world.net - UN_TOKENIZED
    From-tokenized: John - UN_TOKENIZED
    From-tokenized: Smith - UN_TOKENIZED
    From-tokenized: J - UN_TOKENIZED
    From-tokenized: Smith - UN_TOKENIZED
    From-tokenized: world.net - UN_TOKENIZED
    From-tokenized: world - UN_TOKENIZED
    From-tokenized: net - UN_TOKENIZED
    Subject: ... - TOKENIZED
    Body: ... - TOKENIZED
    ------------------8<----------------

    Does it mean that, wherever I use UN_TOKENIZED, the values do not go through
    StandardAnalyzer before getting indexed, but they do when they are searched
    on? If that is the case, do I need to normalise them before adding them to
    the document?

    I also would like to know if it is better to employ an EmailAnalyzer that
    makes a TokenStream out of the given email address, rather than using a
    simplistic function that gives me a list of string pieces and adding them
    one by one. With searches, would both approaches give the same result?
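
    (For concreteness, a sketch of how those fields are added; the down-casing
    on the UN_TOKENIZED values is exactly the normalisation in question, and
    the "subject"/"body" variables are made up:)

        // UN_TOKENIZED values bypass the analyzer, so fold case by hand
        doc.add(new Field("from", "J.Smith@world.net".toLowerCase(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("From-tokenized", "John".toLowerCase(), Field.Store.NO, Field.Index.UN_TOKENIZED));
        // ...one Field per token: Smith, J, Smith, world.net, world, net...

        // TOKENIZED fields go through StandardAnalyzer, which lowercases them
        doc.add(new Field("Subject", subject, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("Body", body, Field.Store.NO, Field.Index.TOKENIZED));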

    Many thanks,
    Dino



  • Steven A Rowe at Aug 19, 2008 at 4:43 pm
    Hi Dino,

    I think you'd benefit from reading some FAQ answers, like:

    "Why is it important to use the same analyzer type during indexing and search?"
    <http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c>

    Also, have a look at the AnalysisParalysis wiki page for some hints:
    <http://wiki.apache.org/lucene-java/AnalysisParalysis>
    On 08/19/2008 at 8:57 AM, Dino Korah wrote:

    > From the discussion here, what I could understand was: if I am using
    > StandardAnalyzer on TOKENIZED fields, for both indexing and querying,
    > I shouldn't have any problems with case.

    If by "shouldn't have any problems with case" you mean "can match case-insensitively", then this is true.

    > But if I have any UN_TOKENIZED fields, there will be problems if I do
    > not case-normalize them myself before adding them as a field to the
    > document.

    Again, assuming that by "case-normalize" you mean "downcase", and that you want case-insensitive matching, and that you use the StandardAnalyzer (or some other downcasing analyzer) at query-time, then this is true.

    > Does it mean that, wherever I use UN_TOKENIZED, the values do not go
    > through StandardAnalyzer before getting indexed, but they do when they
    > are searched on?

    This is true.

    > If that is the case, do I need to normalise them before adding them to
    > the document?

    If you want case-insensitive matching, then yes, you do need to normalize them before adding them to the document.

    > I also would like to know if it is better to employ an EmailAnalyzer
    > that makes a TokenStream out of the given email address, rather
    > than using a simplistic function that gives me a list of string pieces
    > and adding them one by one. With searches, would both approaches
    > give the same result?

    Yes, both approaches give the same result. When you add string pieces one-by-one, you are adding multiple same-named fields. By contrast, the EmailAnalyzer approach would add a single field, and would allow you to control positions (via Token.setPositionIncrement(): <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)>), e.g. to improve phrase handling. Also, if you make up an EmailAnalyzer, you can use it to search against your tokenized email field, along with other analyzer(s) on other field(s), using the PerFieldAnalyzerWrapper <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>.
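
    (A sketch of the wrapper; EmailAnalyzer here is the hypothetical custom
    analyzer discussed above, not a Lucene class:)

        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("from", new EmailAnalyzer()); // your custom email analyzer

        // use the same wrapper at index and query time
        IndexWriter writer = new IndexWriter("/tmp/index", wrapper, true);
        Query q = new QueryParser("from", wrapper).parse("j.smith@world.net");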

    Steve

  • Dino Korah at Aug 20, 2008 at 3:22 pm
    Hi Steve,

    Thanks a lot for that.

    I have a question on TokenStreams and email addresses, but I will post it
    on a separate thread.

    Many thanks,
    Dino


  • Andre Rubin at Aug 21, 2008 at 7:21 am
    Just to add to that: as I said before, in my case I found it more useful not
    to use UN_TOKENIZED. Instead, I used a TOKENIZED field with a custom analyzer
    that uses KeywordTokenizer (entire input as only one token) followed by
    LowerCaseFilter. This way I get the best of both worlds.

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.KeywordTokenizer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class KeywordLowerAnalyzer extends Analyzer {
        // Emits the entire field value as a single lowercased token,
        // so matches are exact except for case.
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream result = new KeywordTokenizer(reader);
            result = new LowerCaseFilter(result);
            return result;
        }
    }
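
    Usage is then just (a sketch; the path and field name are made up):

        IndexWriter writer = new IndexWriter("/tmp/index", new KeywordLowerAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("email", "J.Smith@World.net", Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // the same analyzer lowercases the query term, so any case matches
        Query q = new QueryParser("email", new KeywordLowerAnalyzer()).parse("j.smith@world.net");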
  • Dino Korah at Aug 22, 2008 at 9:50 am
    That is very clever. With that, the text we index goes through the
    analyser but does not get tokenized, and it hits the analyser the same way
    when we search, again untokenized.

    Brilliant!!


  • Dino Korah at Aug 26, 2008 at 11:12 am
    A few more case-sensitivity questions.

    Based on the discussion on http://markmail.org/message/q7dqr4r7o6t6dgo5 and
    on this thread, is it right to say that a field, if either UN_TOKENIZED or
    NO_NORMS-ized, doesn't get analyzed while indexing? Which means we need
    to case-normalize (down-case) those fields beforehand?

    Does it mean that, if I can afford it, I should use norms?

    Many thanks,
    Dino



    -----Original Message-----
    From: Steven A Rowe
    Sent: 19 August 2008 17:43
    To: java-user@lucene.apache.org
    Subject: RE: Case Sensitivity

    Hi Dino,

    I think you'd benefit from reading some FAQ answers, like:

    "Why is it important to use the same analyzer type during indexing and
    search?"
    <http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c4
    4472d10961ba63c>

    Also, have a look at the AnalysisParalysis wiki page for some hints:
    <http://wiki.apache.org/lucene-java/AnalysisParalysis>
    On 08/19/2008 at 8:57 AM, Dino Korah wrote:
    From the discussion here what I could understand was, if I am using
    StandardAnalyzer on TOKENIZED fields, for both Indexing and Querying,
    I shouldn't have any problems with cases.
    If by "shouldn't have problems with cases" you mean "can match
    case-insensitively", then this is true.
    But if I have any UN_TOKENIZED fields there will be problems if I do
    not case-normalize them myself before adding them as a field to the
    document.
    Again, assuming that by "case-normalize" you mean "downcase", and that you
    want case-insensitive matching, and that you use the StandardAnalyzer (or
    some other downcasing analyzer) at query-time, then this is true.
    In my case I have a mixed scenario. I am indexing emails and the email
    addresses are indexed UN_TOKENIZED. I do have a second set of custom
    tokenized field, which keep the tokens in individual fields with same
    name. [...]
    Does it mean that where ever I use UN_TOKENIZED, they do not get
    through the StandardAnalyzer before getting Indexed, but they do when
    they are searched on?
    This is true.
    If that is the case, Do I need to normalise them before adding to
    document?
    If you want case-insensitive matching, then yes, you do need to normalize
    them before adding them to the document.
    I also would like to know if it is better to employ an EmailAnalyzer
    that makes a TokenStream out of the given email address, rather than
    using a simplistic function that gives me a list of string pieces and
    adding them one by one. With searches, would both the approaches give
    same result?
Yes, both approaches give the same result. When you add string pieces
one-by-one, you are adding multiple same-named fields. By contrast, the
EmailAnalyzer approach would add a single field, and would allow you to
control positions (via Token.setPositionIncrement():
<http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)>),
e.g. to improve phrase handling. Also, if you make up an EmailAnalyzer, you
can use it to search against your tokenized email field, along with other
analyzer(s) on other field(s), using the PerFieldAnalyzerWrapper
<http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>.
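
For instance, a sketch of that per-field wiring (EmailAnalyzer is a
hypothetical custom class, not something that ships with Lucene):

// StandardAnalyzer everywhere, except the hypothetical EmailAnalyzer on "email".
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("email", new EmailAnalyzer());
QueryParser parser = new QueryParser("body", analyzer);

Passing the same wrapper to your IndexWriter keeps index-time and query-time
analysis in sync.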

    Steve

  • Dino Korah at Aug 26, 2008 at 1:18 pm
    I think I should rephrase my question.

[ Context: using the out-of-the-box StandardAnalyzer for indexing and searching. ]

Is it right to say that a field, if either UN_TOKENIZED or NO_NORMS-ized
(field.setOmitNorms(true)), doesn't get analyzed while indexing?
Which means that when we search, the query still goes through the analyzer,
so we need to handle those fields differently in the analyzer we use for
searching? Doesn't it mean that a setOmitNorms(true) field also doesn't get
tokenized?

What is the best solution if one were to add a set of fields UN_TOKENIZED and
others TOKENIZED, with a few of the latter set using setOmitNorms(true) (the
index writer is plain StandardAnalyzer based)? A per-field analyzer at query
time?

    Many thanks,
    Dino


  • Otis Gospodnetic at Aug 27, 2008 at 5:31 am
    Dino, you lost me half-way through your email :(

    NO_NORMS does not mean the field is not tokenized.
    UN_TOKENIZED does mean the field is not tokenized.


Otis
--
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


  • Michael McCandless at Aug 27, 2008 at 9:37 am
    Actually, as confusing as it is, Field.Index.NO_NORMS means
    Field.Index.UN_TOKENIZED plus field.setOmitNorms(true).

Probably we should rename it to Field.Index.UN_TOKENIZED_NO_NORMS?

    Mike

    Otis Gospodnetic wrote:
    Dino, you lost me half-way through your email :(

    NO_NORMS does not mean the field is not tokenized.
    UN_TOKENIZED does mean the field is not tokenized.


  • Dino Korah at Aug 27, 2008 at 10:41 am
    Thanks Otis & Mike.

Probably we should keep it the way it is now. It would be better to include
more information on the various combinations of these options and their effect
on the final result (the set of terms that get into the index). It would be
nicer if we could mention the search scenario as well. To be honest, it took me
a while to get a grip on it.

On the same topic, what would be the effect of the following code?

Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED);
f.setOmitNorms(true);

Would that be equivalent to

Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.NO_NORMS);

And Field.Index.TOKENIZED has no effect after f.setOmitNorms(true)?


    Many thanks,
    Dino


  • Daniel Naber at Aug 27, 2008 at 10:47 am

On Wednesday, 27 August 2008, Michael McCandless wrote:

Probably we should rename it to Field.Index.UN_TOKENIZED_NO_NORMS?
I think it's enough if the API doc explains it; no need to rename it.
What's more confusing is that (UN_)TOKENIZED should actually be called
(UN_)ANALYZED, IMHO.

    Regards
    Daniel

    --
    http://www.danielnaber.de

  • Michael McCandless at Aug 27, 2008 at 11:38 am
Or ... split the two notions apart so that you have
Field.Index.[UN_]ANALYZED and, separately, Field.Index.[NO_]NORMS, which could
then be combined in all 4 combinations (we'd have to fix the Parameter class
to let you build up a new Parameter by combining existing ones...).
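
Purely to illustrate the idea, a hypothetical sketch (no such API exists
today; the combine() call is invented):

// Hypothetical combinable index parameters; not real Lucene API.
Field f = new Field("body", bodyText, Field.Store.NO,
    Field.Index.ANALYZED.combine(Field.Index.NO_NORMS));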

    I think naming things well is just as important as good javadocs
    explaining things.

    But: I think these changes should probably wait until we work out how
    to refactor AbstractField/Fieldable/Field?

    Mike

  • Otis Gospodnetic at Aug 27, 2008 at 5:33 am
    Dino,

    If a field is not tokenized then it is indexed as is.
    For example: "Dino Korah" would get indexed just like that. It would not get split into multiple tokens, it would not be lowercased, it would not have any stop words removed from it, etc.
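
For example, a sketch of what that implies at search time (field name and
value are illustrative): the reliable way to hit such a field is an exact
TermQuery, using exactly the case that was indexed:

// Matches only the exact indexed value, including case.
Query q = new TermQuery(new Term("from", "Dino Korah"));

A QueryParser built on StandardAnalyzer would instead lower-case and split the
input into "dino" and "korah", neither of which exists as a term in that field.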

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


  • Otis Gospodnetic at Aug 27, 2008 at 3:09 pm
Nah, I think the names are fine, I simply forgot. I looked at the javadocs; they clearly say NO_NORMS doesn't get passed through an Analyzer. Maybe in 3.0 we can switch to NOT_ANALYZED, as suggested, to reflect reality more closely.


    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


  • Michael McCandless at Aug 27, 2008 at 11:27 pm
OK, I'll open an issue to do this renaming in 3.0, which actually means
we do the renaming in 2.4 or 2.9 (deprecating the old names) and then
remove them in 3.0.

    Mike
    On Aug 27, 2008, at 11:08 AM, Otis Gospodnetic wrote:

Nah, I think the names are fine, I simply forgot. I looked at the
javadocs; they clearly say NO_NORMS doesn't get passed through an
Analyzer. Maybe in 3.0 we can switch to NOT_ANALYZED, as suggested,
to reflect reality more closely.


  • Dino Korah at Aug 28, 2008 at 8:59 am
Looks like my question went unnoticed amid the more important Jira
discussion. :(

On the same topic, what would be the effect of the following code?

    Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED);
    f.setOmitNorms(true);

    Would that be equivalent to

    Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.NO_NORMS);

And Field.Index.TOKENIZED has no effect after f.setOmitNorms(true)?

    Many thanks,
    Dino



  • Karl Wettin at Aug 28, 2008 at 9:19 am

On 28 Aug 2008, at 10:58, Dino Korah wrote:

Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED);
f.setOmitNorms(true);

Would that be equivalent to

Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.NO_NORMS);

And Field.Index.TOKENIZED has no effect after f.setOmitNorms(true)?
    Yes, those two have the same effect.


    karl

  • Andrzej Bialecki at Aug 28, 2008 at 9:47 am

    Karl Wettin wrote:

    Yes, those two have the same effect.
    I don't think so - these two scenarios are different.

    When you create a Field using Index.NO_NORMS, the constructor makes sure
    that:
    isIndexed = true;
    isTokenized = false;
    omitNorms = true;

    When you create a Field using Index.TOKENIZED, the constructor sets
    these flags:
    isIndexed = true;
    isTokenized = true;

    Then, when you call setOmitNorms(true), it does NOT affect isTokenized,
    it sets only omitNorms. So the flags are set now like this:
    isIndexed = true;
    isTokenized = true;
    omitNorms = true;

    The end result of processing such a field is (I believe) conceptually
    equivalent to adding as many Fields as there are tokens, each with
    omitNorms=true.
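
A small sketch of the difference (getter names as on Fieldable in this era;
treat them as an assumption if your version differs):

Field a = new Field("body", bodyText, Field.Store.NO, Field.Index.NO_NORMS);
// a.isTokenized() == false, a.getOmitNorms() == true

Field b = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED);
b.setOmitNorms(true);
// b.isTokenized() == true, b.getOmitNorms() == true (analyzed, norms omitted)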


    --
    Best regards,
    Andrzej Bialecki <><


  • Karl Wettin at Aug 28, 2008 at 9:53 am

On 28 Aug 2008, at 11:46, Andrzej Bialecki wrote:

I don't think so - these two scenarios are different. [...]
    Oh, you are of course right, I was too quick to read. Sorry.


    karl


  • Otis Gospodnetic at Aug 28, 2008 at 4:56 pm
    So in other words, it *is* possible to have the field both tokenized and its norms omitted?


    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


  • Andrzej Bialecki at Aug 28, 2008 at 5:40 pm

    Otis Gospodnetic wrote:
    So in other words, it *is* possible to have the field both tokenized and its norms omitted?
    Yes. Probably this is an unintended side-effect of adding setOmitNorms,
    but I think it's useful and IMHO we should keep it.


    --
    Best regards,
    Andrzej Bialecki <><


  • Otis Gospodnetic at Aug 28, 2008 at 5:42 pm
Yes. And I think I used this "trick" a couple of years ago, but have since forgotten about it. :)

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


    ----- Original Message ----
    From: Andrzej Bialecki <ab@getopt.org>
    To: java-user@lucene.apache.org
    Sent: Thursday, August 28, 2008 1:39:21 PM
    Subject: Re: Case Sensitivity

    Otis Gospodnetic wrote:
    So in other words, it *is* possible to have the field both tokenized and its norms omitted?

    Yes. Probably this is an unintended side-effect of adding setOmitNorms,
    but I think it's useful and IMHO we should keep it.


    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Aug 28, 2008 at 5:45 pm
    In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this
    issue:

    https://issues.apache.org/jira/browse/LUCENE-1366
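    Usage would then look something like the sketch below (the constant
    name is only the one proposed in the issue and could still change
    before release; assumes the usual org.apache.lucene.document imports):

    Document doc = new Document();
    // Analyzed/tokenized and indexed, but with norms omitted:
    doc.add(new Field("body", bodyText, Field.Store.NO,
        Field.Index.ANALYZED_NO_NORMS));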

    Mike

    Otis Gospodnetic wrote:
    Yes. And I think I have used this "trick" a couple of years ago,
    but have since forgotten about it. :)

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


    ----- Original Message ----
    From: Andrzej Bialecki <ab@getopt.org>
    To: java-user@lucene.apache.org
    Sent: Thursday, August 28, 2008 1:39:21 PM
    Subject: Re: Case Sensitivity

    Otis Gospodnetic wrote:
    So in other words, it *is* possible to have the field both tokenized and its norms omitted?

    Yes. Probably this is an unintended side-effect of adding setOmitNorms, but I think it's useful and IMHO we should keep it.


    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Yonik Seeley at Aug 28, 2008 at 5:51 pm

    On Thu, Aug 28, 2008 at 1:44 PM, Michael McCandless wrote:

    In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue:
    I wasn't originally going to add a Field.Index at all for omitNorms,
    but Doug suggested it.
    The problem with this type-safe way of doing things is the
    combinatorial explosion.

    -Yonik

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Aug 28, 2008 at 6:17 pm

    Yonik Seeley wrote:

    On Thu, Aug 28, 2008 at 1:44 PM, Michael McCandless wrote:
    In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this
    issue:
    I wasn't originally going to add a Field.Index at all for omitNorms,
    but Doug suggested it.
    The problem with this type-safe way of doing things is the
    combinatorial explosion.
    Yeah I realize that. Now that we have omitTF as an option we could
    really go crazy ;)

    I figured since we already have NOT_ANALYZED_NO_NORMS we may as well
    round it out with ANALYZED_NO_NORMS, and then stop there. Plus,
    people have been surprised that you could do ANALYZED_NO_NORMS, yet it
    is useful.

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Anthony Urso at Sep 12, 2008 at 1:48 am

    On Thu, Aug 28, 2008 at 11:16 AM, Michael McCandless wrote:

    Yonik Seeley wrote:
    I wasn't originally going to add a Field.Index at all for omitNorms,
    but Doug suggested it.
    The problem with this type-safe way of doing things is the
    combinatorial explosion.
    Yeah I realize that. Now that we have omitTF as an option we could really
    go crazy ;)

    I figured since we already have NOT_ANALYZED_NO_NORMS we may as well round
    it out with ANALYZED_NO_NORMS, and then stop there. Plus, people have been
    surprised that you could do ANALYZED_NO_NORMS, yet it is useful.
    Why not make this flag field into a bitmap?

    Cheers,
    Anthony

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Sep 19, 2008 at 12:12 pm

    Anthony Urso wrote:

    On Thu, Aug 28, 2008 at 11:16 AM, Michael McCandless wrote:
    Yonik Seeley wrote:
    I wasn't originally going to add a Field.Index at all for omitNorms,
    but Doug suggested it.
    The problem with this type-safe way of doing things is the
    combinatorial explosion.
    Yeah I realize that. Now that we have omitTF as an option we could really go crazy ;)

    I figured since we already have NOT_ANALYZED_NO_NORMS we may as well round it out with ANALYZED_NO_NORMS, and then stop there. Plus, people have been surprised that you could do ANALYZED_NO_NORMS, yet it is useful.
    Why not make this flag field into a bitmap?
    I think that makes sense, at some point in the future (when we clean
    up Fieldable/AbstractField/Field?). This way you can OR together
    things like NORMS/NO_NORMS, ANALYZED/NOT_ANALYZED, INCLUDE_TF/OMIT_TF,
    etc.
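    A rough sketch of what a bitmask-style API could look like
    (hypothetical names, illustration only -- nothing like this exists in
    Lucene today):

    public class FieldFlagsSketch {
        static final int INDEXED  = 1 << 0;
        static final int ANALYZED = 1 << 1;
        static final int NO_NORMS = 1 << 2;
        static final int OMIT_TF  = 1 << 3;

        public static void main(String[] args) {
            // Callers OR flags together instead of picking one enum constant:
            int flags = INDEXED | ANALYZED | NO_NORMS;
            // Testing individual flags with a bitwise AND:
            System.out.println("analyzed: " + ((flags & ANALYZED) != 0)); // true
            System.out.println("omitTf:   " + ((flags & OMIT_TF) != 0));  // false
        }
    }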

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Andrzej Bialecki at Sep 19, 2008 at 12:22 pm

    Michael McCandless wrote:

    Anthony Urso wrote:
    On Thu, Aug 28, 2008 at 11:16 AM, Michael McCandless wrote:
    Yonik Seeley wrote:
    I wasn't originally going to add a Field.Index at all for omitNorms,
    but Doug suggested it.
    The problem with this type-safe way of doing things is the
    combinatorial explosion.
    Yeah I realize that. Now that we have omitTF as an option we could really go crazy ;)

    I figured since we already have NOT_ANALYZED_NO_NORMS we may as well round it out with ANALYZED_NO_NORMS, and then stop there. Plus, people have been surprised that you could do ANALYZED_NO_NORMS, yet it is useful.
    Why not make this flag field into a bitmap?
    I think that makes sense, at some point in the future (when we clean up
    Fieldable/AbstractField/Field?). This way you can OR together things
    like NORMS/NO_NORMS, ANALYZED/NOT_ANALYZED, INCLUDE_TF/OMIT_TF, etc.
    +1 on that. AFAIR the original motivation for these type-safe enumerations was that some combinations of flags are invalid/unsupported, whereas with plain flags you would discover an invalid combination only at runtime. But the problems with this approach seem to outweigh the benefits ...

    Perhaps we could provide static methods on Fieldable that test the validity of flag combinations with a particular version of Lucene?
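    Something along these lines, perhaps (hypothetical method, shown only
    to illustrate the idea; the actual rules would depend on the release):

    // Not an existing Fieldable method -- a sketch of a validity check:
    public static boolean isValidCombination(boolean indexed,
            boolean tokenized, boolean omitNorms) {
        // An unindexed field cannot be tokenized into the inverted index:
        if (!indexed && tokenized) {
            return false;
        }
        // Norms only exist for indexed fields, so keeping them makes no
        // sense when the field is not indexed:
        if (!indexed && !omitNorms) {
            return false;
        }
        return true;
    }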

    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Andrzej Bialecki at Aug 28, 2008 at 5:52 pm

    Michael McCandless wrote:

    In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue:

    https://issues.apache.org/jira/browse/LUCENE-1366
    This has consequences when searching - so if we expose it, the javadoc
    has to be really good at explaining what's going on :)
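    For example, with norms omitted there is no field-length
    normalization, so a short field and a long field score identically for
    the same term. A quick sketch against the 2.x API (the constructors
    shown here were deprecated in later releases):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class NormsEffectSketch {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            writer.addDocument(doc("lucene"));                      // short field
            writer.addDocument(doc("lucene is one term among many "
                + "other terms in a much longer field"));           // long field
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            TopDocs hits = searcher.search(
                new TermQuery(new Term("body", "lucene")), null, 10);
            // With norms omitted, both documents get the same score:
            for (int i = 0; i < hits.scoreDocs.length; i++) {
                System.out.println(hits.scoreDocs[i].doc + " -> "
                    + hits.scoreDocs[i].score);
            }
            searcher.close();
        }

        static Document doc(String bodyText) {
            Document d = new Document();
            Field f = new Field("body", bodyText, Field.Store.NO,
                Field.Index.TOKENIZED);
            f.setOmitNorms(true); // analyzed, indexed, no norms
            d.add(f);
            return d;
        }
    }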


    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Aug 28, 2008 at 6:18 pm

    Andrzej Bialecki wrote:

    Michael McCandless wrote:
    In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this
    issue:
    https://issues.apache.org/jira/browse/LUCENE-1366
    This has consequences when searching - so if we expose it, the
    javadoc has to be really good at explaining what's going on :)
    Agreed, I'll fix the javadocs and mark these as Expert.

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
