Hi,

We have been observing the following problem while tokenizing using Lucene's
StandardAnalyzer: the tokens we get differ from machine to machine. I suspect
it has something to do with the locale settings on the individual machines.

For example, the word 'CÃ(c)sar' comes out as the single token 'CÃ(c)sar' on
machine 1, while on machine 2 it is split into [cã, sar].

Could someone please tell me what might be going on?

Thx
PM
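
A minimal sketch of the kind of tokenization call being described, assuming the
intended word is "César" and the Lucene 2.x TokenStream API of that era (the
class name and field name are illustrative only):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDump {
    public static void main(String[] args) throws Exception {
        // The word is written as a Unicode escape so the source file's own
        // encoding cannot garble it.
        TokenStream ts = new StandardAnalyzer()
                .tokenStream("body", new StringReader("C\u00E9sar"));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t.termText());
        }
    }
}

On a correctly behaving machine this should print a single lowercased token
along the lines of "césar".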


  • Steven A Rowe at Apr 22, 2008 at 7:03 pm
    Hi Prashant,
    Which version of Lucene are you using? Is it the same on both machines?

    I ask because Lucene recently switched StandardTokenizer lexer generation from JavaCC to JFlex, for performance reasons (increased throughput).

    Also, my email viewer displays the word in question as the following sequence of characters:

    1. Capital "C"
    2. Capital "A" with a tilde ("~") above it
    3. Left parenthesis
    4. Lowercase "c"
    5. Right parenthesis
    6. Lowercase "s"
    7. Lowercase "a"
    8. Lowercase "r"

    Is this the correct character sequence? (Sometimes UTF-8 can look similar to this when it's interpreted as Latin-1.)

    Steve
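
    A minimal sketch of the mojibake Steve is describing, using only the JDK:
    the UTF-8 bytes of "César" reinterpreted as Latin-1 come out as "CÃ©sar".

    public class MojibakeDemo {
        public static void main(String[] args) throws Exception {
            byte[] utf8 = "C\u00E9sar".getBytes("UTF-8");     // "César" encoded as UTF-8
            String misread = new String(utf8, "ISO-8859-1");  // decoded with the wrong charset
            System.out.println(misread);                      // prints CÃ©sar
        }
    }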


  • Prashant Malik at Apr 22, 2008 at 8:45 pm
    Yes, the versions of Lucene and Java are exactly the same on the different
    machines. In fact, we un-jarred Lucene, re-jarred it with our jar, and are
    running from the same NFS mounts on both machines.

    We have also tried with Lucene 2.2.0 and 2.3.1, with the same result.

    About the actual string: you have it right up to character 2;
    characters 3, 4, and 5 are a single character.

    Thx
    PM
  • Steven A Rowe at Apr 22, 2008 at 8:51 pm
    Hi Prashant,

    What is the Unicode code point of that single character at positions 3-5?

    Steve
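
    One JDK-only way to answer that question (the string literal is a
    placeholder; substitute the word exactly as it is read on each machine):

    public class CodePointDump {
        public static void main(String[] args) {
            String word = "C\u00E9sar"; // placeholder input
            for (int i = 0; i < word.length(); i += Character.charCount(word.codePointAt(i))) {
                System.out.printf("U+%04X%n", word.codePointAt(i));
            }
        }
    }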
  • Chris Hostetter at Apr 22, 2008 at 9:04 pm

    I didn't do an in-depth code read, but a quick skim of
    StandardTokenizerImpl didn't turn up any questionable uses of APIs that
    might have different behavior depending on the default locale/charset of
    the JVM running it ... everything is simple char- or String-based access.

    Are you *certain* that you are providing Lucene with the Strings you think
    you are? Is it possible that you are using a FileReader, or an
    InputStreamReader constructed without an explicit charset, which relies on
    the default character encoding of the JVM (and which may not be correct
    for the data you are reading in)?
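
    A minimal sketch of the explicit-charset alternative being suggested here
    ("docs.txt" and UTF-8 are hypothetical; the point is naming the charset
    instead of relying on the JVM default):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class ExplicitCharsetRead {
        public static void main(String[] args) throws Exception {
            // Unlike new FileReader("docs.txt"), this names the encoding explicitly.
            Reader reader = new InputStreamReader(new FileInputStream("docs.txt"), "UTF-8");
            int c;
            while ((c = reader.read()) != -1) {
                System.out.print((char) c);
            }
            reader.close();
        }
    }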

    Can you write a simple JUnit test that fails on one machine and passes on
    the other? If so, I'd love to see that test, along with the output of this
    code...

    // Dump all JVM system properties, URL-encoding the values so that any
    // non-ASCII characters show up unambiguously.
    java.util.Enumeration e = System.getProperties().propertyNames();
    while (e.hasMoreElements()) {
        String prop = (String) e.nextElement();
        System.out.println(prop + " = "
            + java.net.URLEncoder.encode(System.getProperty(prop), "US-ASCII"));
    }


    -Hoss
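
    A minimal sketch of the kind of JUnit test being asked for, assuming the
    Lucene 2.x TokenStream API and that the intended word is "César"; the
    expected token is a guess at what the correctly behaving machine produces:

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import junit.framework.TestCase;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AccentedTokenTest extends TestCase {
        public void testAccentedWordStaysOneToken() throws Exception {
            TokenStream ts = new StandardAnalyzer()
                    .tokenStream("body", new StringReader("C\u00E9sar"));
            List<String> tokens = new ArrayList<String>();
            Token t;
            while ((t = ts.next()) != null) {
                tokens.add(t.termText());
            }
            // StandardAnalyzer lowercases, so the whole word should come back
            // as the single token "césar" if nothing is being mis-decoded.
            assertEquals(Arrays.asList("c\u00e9sar"), tokens);
        }
    }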

