Hi All!



Let's say I have a filter that produces new tokens based on the original ones.

How bad would it be if my filter set the start offset of each token to 0 and the end offset to
the length of the token?

An example (based on the phrase "How are you?"):



Original token:

[you?] (8,12)



New tokens:

[you] (0,3)

[?] (0,1)



It wouldn't be hard to calculate the correct offsets for left-to-right
languages, and it is a bit more challenging for right-to-left ones,
but for mixed text it is quite hard.



Thanks.


  • Robert Muir at Jul 20, 2009 at 5:42 pm
    Obender, I don't think it's as difficult as you think. Your filter does
    not need to be aware of this issue at all.
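
A minimal sketch of why the filter can stay direction-agnostic, assuming offsets are plain character positions into the original text: the pieces of a split token can take their offsets from the parent token's start plus their position inside it. The `SubTokenOffsets` class, the `Piece` type, and `splitTrailingQuestionMark` are hypothetical names for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

class SubTokenOffsets {
    /** Hypothetical value type: a term plus its [start, end) offsets in the original text. */
    static final class Piece {
        final String term;
        final int start;
        final int end;

        Piece(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits a trailing '?' off a token; offsets stay relative to the original text. */
    static List<Piece> splitTrailingQuestionMark(String term, int tokenStart) {
        List<Piece> out = new ArrayList<>();
        if (term.length() > 1 && term.endsWith("?")) {
            int cut = term.length() - 1; // position of '?' inside the token
            out.add(new Piece(term.substring(0, cut), tokenStart, tokenStart + cut));
            out.add(new Piece("?", tokenStart + cut, tokenStart + term.length()));
        } else {
            out.add(new Piece(term, tokenStart, tokenStart + term.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        // "you?" occupies offsets 8..12 in "How are you?"
        for (Piece p : splitTrailingQuestionMark("you?", 8)) {
            System.out.println("[" + p.term + "] (" + p.start + "," + p.end + ")");
        }
    }
}
```

For the example from the question this prints `[you] (8,11)` and `[?] (11,12)`. Nothing in the arithmetic depends on the script of the text, because it only uses positions in the stored string.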

    In Unicode, right-to-left languages are encoded in the data in logical order
    (the order in which the text is read). The rendering system is what lays the
    text out from right to left for display.

    For example, in Arabic, "Robert 1234" displays as روبرت 1234
    On your monitor, read from left to right, this looks like 1, 2, 3, 4, space, teh, reh,
    beh, waw, reh.

    But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
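
The storage order described above can be checked with a few lines of plain Java. This is only an illustrative sketch: the class name `LogicalOrderDemo` is made up, and the string is written with `\u` escapes so that the source file itself shows the order in memory regardless of how an editor renders it.

```java
class LogicalOrderDemo {
    // "Robert 1234" in Arabic script: reh, waw, beh, reh, teh, space, 1, 2, 3, 4
    static final String TEXT = "\u0631\u0648\u0628\u0631\u062A 1234";

    /** Lists each char index with its code unit in hex, in storage (logical) order. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(i).append(": U+")
              .append(String.format("%04X", (int) s.charAt(i)))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(TEXT));
        // The digits come after the letters in memory, even though they
        // appear on the left when the text is rendered right to left.
        System.out.println("index of \"1234\": " + TEXT.indexOf("1234"));
    }
}
```

Index 0 is the letter reh (U+0631) and the digits start at index 6, which is the order a tokenizer sees.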

    --
    Robert Muir
    rcmuir@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • OBender at Jul 20, 2009 at 5:58 pm
    Robert,

    I'm not sure you are correct on this one.

    If I have a Hebrew phrase:
    [טוֹב עֶרֶב]
    Then the first token that the filter receives is:
    [עֶרֶב] (0,5)
    and the second is:
    [טוֹב] (6,10)
    Which means that it counts from right to left (both words and offsets).

    Am I missing something?

  • Robert Muir at Jul 20, 2009 at 6:06 pm
    Obender, this is not true.
    The text you pasted is the following in Unicode:

    \N{HEBREW LETTER TET}
    \N{HEBREW LETTER VAV}
    \N{HEBREW POINT HOLAM}
    \N{HEBREW LETTER BET}
    \N{SPACE}
    \N{HEBREW LETTER AYIN}
    \N{HEBREW POINT SEGOL}
    \N{HEBREW LETTER RESH}
    \N{HEBREW POINT SEGOL}
    \N{HEBREW LETTER BET}

    You can use this utility to see how your text is encoded:
    http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

    For more information on directionality in Unicode, see
    http://unicode.org/reports/tr9/
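
The same listing can also be produced locally. The sketch below is one possible way, not the utility's own implementation: it decodes the percent-encoded value taken from the link above with `java.net.URLDecoder` and prints each code point's name with `Character.getName`. The class name `CodePointNames` and its helper methods are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;

class CodePointNames {
    // The value of the "b" parameter from the link in the message above.
    static final String QUERY_VALUE =
            "%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91";

    /** Decodes a percent-encoded UTF-8 query value ('+' becomes a space). */
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Returns the Unicode name of every code point, in storage order. */
    static List<String> names(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(Character.getName(cp)));
        return out;
    }

    public static void main(String[] args) {
        for (String name : names(decode(QUERY_VALUE))) {
            System.out.println(name);
        }
    }
}
```

Because the names are read straight from the stored string, the output does not depend on how a terminal, debugger, or mail client chooses to render right-to-left text.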

  • OBender at Jul 20, 2009 at 6:17 pm
    Well, the only thing I can say is that the order of tokens I presented is what I see in the debugger.
    It is what input.next(reusableToken) gives me, in that exact order and with those exact offsets.

  • OBender at Jul 20, 2009 at 6:18 pm
    Hold on a second, the phrase in the link you included does not have the words in the correct order!

  • OBender at Jul 20, 2009 at 6:20 pm
    This is how it should be written:
    http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91

  • Robert Muir at Jul 20, 2009 at 6:25 pm
    Obender, I think something in your environment / display environment
    might be causing some confusion.

    Are you using Microsoft Windows? If so, please verify that support for
    right-to-left languages is enabled [Control Panel / Regional and
    Language Options]. It is possible you are "seeing something different"
    because your rendering system is not actually rendering right-to-left
    text in the right-to-left direction.

    Second, instead of using a debugger, I would recommend using Luke to
    look at the resulting tokens from your analyzer.

  • OBender at Jul 20, 2009 at 6:53 pm
    Here is the simple code. If you run it with English and with Hebrew, you will see that for English the tokens are returned from the left of the phrase to the right, and for Hebrew from the right to the left.

    Again, I'm talking about tokens here, not the individual letters.

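    // XFilter.java
    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Pass-through filter that prints every token it receives.
    public class XFilter extends TokenFilter
    {
        protected XFilter( TokenStream tokenStream ) {
            super( tokenStream );
        }

        @Override
        public Token next( final Token reusableToken ) throws IOException
        {
            Token nextToken = input.next( reusableToken );
            // Print the token as-is; an empty line marks the end of the stream.
            System.out.println( nextToken != null ? nextToken : "" );
            return nextToken;
        }
    }

    // SimpleWhitespaceAnalyzer.java (same package as XFilter, whose constructor is protected)
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Splits on whitespace, then routes the tokens through the printing filter.
    public class SimpleWhitespaceAnalyzer extends Analyzer
    {
        @Override
        public TokenStream tokenStream( final String fieldName, final Reader reader )
        {
            TokenStream ts = new WhitespaceTokenizer( reader );
            ts = new XFilter( ts );

            return ts;
        }
    }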

  • Robert Muir at Jul 20, 2009 at 7:02 pm
    Obender, I ran your code and it did what I expected (but not what you pasted):

    First token is: (טוֹב,0,4)
    Second token is: (עֶרֶב,5,10)

    I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
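
The offsets above can be reproduced without Lucene. The following is a plain-Java stand-in for whitespace tokenization (a sketch, not Lucene's `WhitespaceTokenizer`; `WhitespaceOffsets` and `tokenize` are hypothetical names) that reports each token with its start and end offsets in the stored string. It is run on both word orders discussed in this thread, on the assumption that the two sets of numbers reported here came from strings that differed only in the order of the two words.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceOffsets {
    // The phrase as listed code point by code point earlier in the thread (tet first).
    static final String TOV_EREV = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
    // The same two words in the other order (ayin first).
    static final String EREV_TOV = "\u05E2\u05B6\u05E8\u05B6\u05D1 \u05D8\u05D5\u05B9\u05D1";

    /** Splits on whitespace and formats each token as (term,start,end). */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i; // first character of a new token
            } else if (boundary && start >= 0) {
                out.add("(" + text.substring(start, i) + "," + start + "," + i + ")");
                start = -1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(TOV_EREV)); // offsets 0,4 then 5,10
        System.out.println(tokenize(EREV_TOV)); // offsets 0,5 then 6,10
    }
}
```

In both cases the offsets grow from the start of the stored string; only the word that comes first in storage changes.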

    On Mon, Jul 20, 2009 at 2:53 PM, OBenderwrote:
    Here is the simple code. If you run it with English and with Hebrew you will see that in case of English tokens returned from the left of the phrase to the right and with Hebrew from the right to the left.

    Again I'm talking about tokens not the individual letters here.

    public class XFilter extends TokenFilter
    {
    protected XFilter( TokenStream tokenStream ) {
    super( tokenStream );
    }

    @Override
    public Token next( final Token reusableToken ) throws IOException
    {
    Token nextToken = input.next( reusableToken );
    System.out.println( nextToken != null? nextToken: "" );
    return nextToken;
    }
    }

    public class SimpleWhitespaceAnalyzer extends Analyzer
    {
    @Override
    public TokenStream tokenStream( final String fieldName, final Reader reader )
    {
    TokenStream ts  = new WhitespaceTokenizer( reader );
    ts                      = new XFilter( ts );

    return ts;
    }
    }

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 2:26 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, I think something in your environment / display environment
    might be causing some confusion.

    Are you using microsoft windows? If so, please verify that support for
    right-to-left languages is enabled [control panel/regional and
    language options]. It is possible you are "seeing something different"
    because your rendering system is not actually rendering right-to-left
    text in right-to-left direction!!!!

    Second, Instead of using a debugger, I would recommend using Luke to
    look at resulting tokens from your analyzer.

    On Mon, Jul 20, 2009 at 2:21 PM, OBenderwrote:
    This is how it should be written:
    http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 2:07 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, This is not true.
    the text you pasted is the following in unicode:

    \N{HEBREW LETTER TET}
    \N{HEBREW LETTER VAV}
    \N{HEBREW POINT HOLAM}
    \N{HEBREW LETTER BET}
    \N{SPACE}
    \N{HEBREW LETTER AYIN}
    \N{HEBREW POINT SEGOL}
    \N{HEBREW LETTER RESH}
    \N{HEBREW POINT SEGOL}
    \N{HEBREW LETTER BET}

    you can use this utility to see how your text is encoded:
    http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

    For more information on directionality in unicode, see
    http://unicode.org/reports/tr9/

    On Mon, Jul 20, 2009 at 1:59 PM, OBenderwrote:
    Robert,

    I'm not sure you are correct on this one.

    If I have a Hebrew phrase:
    [טוֹב עֶרֶב]
    Then first token that filter receives is:
    [עֶרֶב] (0,5)
    and the second is:
    [טוֹב] (6,10)
    Which means that it counts from right to left (words and indexes).

    Am I missing something?

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 1:43 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, I don't think it's as difficult as you think. Your filter does
    not need to be aware of this issue at all.

    In Unicode, right-to-left languages are encoded in the data in logical order.
    The rendering system is what converts it to a right-to-left display
    for RTL languages.

    For example in Arabic, "Robert 1234" displays as روبرت 1234
    To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
    beh, waw, reh

    But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
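
To see the split between stored order and display order in code, here is a small sketch that uses the JDK's java.text.Bidi class. The class name LogicalOrderDemo and its helper are invented for this example. It shows that the digits sit at the end of the stored string and prints the directional runs a renderer would lay out (odd levels are right-to-left, even levels left-to-right).

```java
import java.text.Bidi;

final class LogicalOrderDemo {
    // "Robert 1234" in Arabic, written in logical (reading) order:
    // reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
    static final String TEXT = "\u0631\u0648\u0628\u0631\u062A 1234";

    // Describes each directional run found by the bidi algorithm.
    static String describeRuns(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            sb.append("chars ").append(bidi.getRunStart(i))
              .append("..").append(bidi.getRunLimit(i))
              .append(" level ").append(bidi.getRunLevel(i))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Index in the stored string, independent of where the digits appear on screen.
        System.out.println("digits start at char index " + TEXT.indexOf("1234"));
        System.out.print(describeRuns(TEXT));
    }
}
```

A tokenizer only ever sees the stored order, so its offsets count up from the first stored character whatever the run levels are.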

    2009/7/20 OBender <osya_bender@hotmail.com>:
    Hi All!



    Let's say I have a filter that produces new tokens based on the original ones.

    How bad will it be if my filter sets the start of each token to 0 and the end to
    the length of the token?

    An example (based on the phrase "How are you?"):

    Original token:

    [you?] (8,12)

    New tokens:

    [you] (0,3)

    [?] (0,1)

    It wouldn't be so hard to calculate the right numbers for left-to-right
    languages, and it is a bit more challenging to do it for right-to-left ones,
    but for mixed text it is quite hard.



    Thanks.
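
Since offsets count characters in stored order, a filter can usually keep them meaningful by adding the original token's start offset to positions inside the term, and the same arithmetic applies to left-to-right, right-to-left, and mixed text. The sketch below is plain Java with no Lucene dependency; SubTokenOffsets, Piece, and split are hypothetical names. As a simplification it treats any trailing non-letter, non-digit character as punctuation, so a term ending in a combining mark would need extra handling.

```java
import java.util.ArrayList;
import java.util.List;

final class SubTokenOffsets {
    // Hypothetical value type: one piece of a token plus its offsets in the source text.
    static final class Piece {
        final String text;
        final int start;
        final int end;

        Piece(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + "] (" + start + "," + end + ")";
        }
    }

    // Splits trailing punctuation off a term; offsets stay relative to the source text.
    static List<Piece> split(String term, int tokenStart) {
        int cut = term.length();
        while (cut > 0 && !Character.isLetterOrDigit(term.charAt(cut - 1))) {
            cut--;
        }
        List<Piece> pieces = new ArrayList<>();
        if (cut > 0) {
            pieces.add(new Piece(term.substring(0, cut), tokenStart, tokenStart + cut));
        }
        if (cut < term.length()) {
            pieces.add(new Piece(term.substring(cut), tokenStart + cut, tokenStart + term.length()));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // The token from the question: [you?] (8,12)
        System.out.println(split("you?", 8));
    }
}
```

Keeping real offsets matters mostly for features that map a term back to the source text, such as highlighting; resetting every token to start at 0 would lose that mapping.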


  • Robert Muir at Jul 20, 2009 at 7:17 pm
    Obender, based on your previous comments (that you see text displayed
    in the wrong order), I again recommend that you enable support for RTL
    languages in your operating system, as I mentioned earlier. If you are
    using a Windows-based OS, this is not enabled by default!

    I think you are seeing things in the incorrect order, and this is
    causing confusion for you!

    On Mon, Jul 20, 2009 at 3:02 PM, Robert Muir wrote:
    Obender, I ran your code and it did what I expected (but not what you pasted):

    First token is: (טוֹב,0,4)
    Second token is: (עֶרֶב,5,10)

    I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
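
The offsets reported above can be reproduced without Lucene at all. What follows is a rough stand-in for whitespace tokenization in plain Java, not the WhitespaceTokenizer implementation itself; the class name WhitespaceOffsets is invented for this sketch. It walks the string from index 0 upward and records where each run of non-whitespace characters starts and ends.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceOffsets {
    // Returns "(term,start,end)" strings, scanning the stored characters from index 0 upward.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                tokens.add("(" + text.substring(start, i) + "," + start + "," + i + ")");
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same phrase as in the thread, escaped so the stored order is unambiguous.
        String phrase = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        tokenize(phrase).forEach(System.out::println);
    }
}
```

The word beginning with TET comes out first with offsets 0 to 4 and the word beginning with AYIN second with offsets 5 to 10, because that is the order in which the characters are stored.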

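    On Mon, Jul 20, 2009 at 2:53 PM, OBender wrote:
    Here is the simple code. If you run it with English and with Hebrew, you will see that with English the tokens are returned from the left of the phrase to the right, and with Hebrew from the right to the left.

    Again, I'm talking about tokens here, not the individual letters.

    // Imports assume the Lucene 2.x package layout; each public class goes in its own file.
    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Pass-through filter that prints each token (term and offsets) as it goes by.
    public class XFilter extends TokenFilter
    {
        protected XFilter( TokenStream tokenStream ) {
            super( tokenStream );
        }

        @Override
        public Token next( final Token reusableToken ) throws IOException
        {
            Token nextToken = input.next( reusableToken );
            // A null token marks the end of the stream; print an empty line for it.
            System.out.println( nextToken != null ? nextToken : "" );
            return nextToken;
        }
    }

    // Whitespace tokenizer wrapped in the printing filter above.
    public class SimpleWhitespaceAnalyzer extends Analyzer
    {
        @Override
        public TokenStream tokenStream( final String fieldName, final Reader reader )
        {
            TokenStream ts = new WhitespaceTokenizer( reader );
            ts = new XFilter( ts );

            return ts;
        }
    }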

  • OBender at Jul 20, 2009 at 7:33 pm
    I've checked, and it appears to be enabled.

  • Robert Muir at Jul 20, 2009 at 7:48 pm
    Obender, does the following text appear like the image in the link, or not?

    שומר אחי

    http://farm1.static.flickr.com/3/10445435_75b4546703.jpg?v=0


  • OBender at Jul 20, 2009 at 8:41 pm
    No, it is reversed in the e-mail. Funny though, when I paste it into Excel the words appear in the right order.
    Thanks for all the help.

    Maybe you have an idea of what the problem could be.
    Here is how my data gets read and indexed.

    I have a UTF-8 CSV file that is produced from Excel.
    I read it in with Java (preserving the UTF-8 encoding). At this point the strings in the debugger look correct.
    I insert it into the DB (MySQL), which is also UTF-8.
    Then I read it back and put it into the index.

    It looks like in the UTF-8 CSV file the words are in "reverse" order from the grammar standpoint (left to right, e.g., EREV leftmost, then TOV). Should a UTF-8 CSV file preserve the natural (language-specific) order of words?
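
One way to take the display layer out of this question is to read the CSV with an explicit UTF-8 charset and print the stored code points as hex, so that no editor, debugger, or mail client gets a chance to reorder anything. This is only a sketch; CsvOrderCheck and its methods are hypothetical names, and the file path is taken from the command line. Some exports begin with a byte order mark, which would show up here as U+FEFF at the start of the first line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

final class CsvOrderCheck {
    // Reads every line as UTF-8, never the platform default charset.
    static List<String> readUtf8Lines(Path csv) throws IOException {
        return Files.readAllLines(csv, StandardCharsets.UTF_8);
    }

    // Renders a string as "U+05D8 U+05D5 ..." so the stored order is visible as plain ASCII.
    static String toCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) throws IOException {
        for (String line : readUtf8Lines(Paths.get(args[0]))) {
            System.out.println(toCodePoints(line));
        }
    }
}
```

If the first code points on a line belong to the word that is read first, the file is in logical order and any reversal seen elsewhere comes from rendering.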


  • OBender at Jul 20, 2009 at 9:07 pm
    Never mind, I think I got it.

    -----Original Message-----
    From: OBender
    Sent: Monday, July 20, 2009 4:42 PM
    To: java-user@lucene.apache.org
    Subject: RE: question on custom filter

    No, it is reversed in the e-mail. Funny though, when I paste it into Excel the words come out in the right order.
    Thanks for all the help.

    Maybe you have an idea what the problem could be.
    Here is how my data gets read and indexed:

    I have a UTF-8 CSV file that is produced from Excel.
    I read it in with Java (preserving the UTF-8 encoding). At this point the strings look correct in the debugger.
    I insert it into the DB (MySQL), which is also UTF-8.
    Then I read it back and put it into the index.

    It looks like the words in the UTF-8 CSV file are in "reverse" order from the grammatical standpoint (left to right, e.g., EREV leftmost, then TOV). Should a UTF-8 CSV file preserve the natural (language-specific) order of words?
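
    One way to check what the file really contains, independent of how an editor, debugger or mail client draws it, is to read it with an explicit UTF-8 charset and print the code units as hex. This is a rough sketch in plain Java; the class name, method names and sample file are made up for illustration, and the byte order mark handling assumes the exporting program may have written one at the start of the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only; names and the sample file are not from the thread.
class Utf8FirstLine {

    // Reads the first line of a file as UTF-8 and drops a leading byte order mark, if any.
    static String firstLine(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            if (line == null) {
                return "";
            }
            return line.startsWith("\uFEFF") ? line.substring(1) : line;
        }
    }

    // Formats each UTF-16 code unit as U+XXXX so the stored order can be read
    // without relying on any right-to-left rendering.
    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample file: byte order mark, then EREV, a space, then TOV, in logical order.
        Path file = Files.createTempFile("sample", ".csv");
        String content = "\uFEFF\u05E2\u05B6\u05E8\u05B6\u05D1 \u05D8\u05D5\u05B9\u05D1";
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
        System.out.println(hex(firstLine(file)));
        Files.delete(file);
    }
}
```

    If the first code unit printed is U+05E2 (AYIN), the file stores EREV first, whatever side of the screen it appears on.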


    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 3:49 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, does the following text appear like the image in the link, or not?

    שומר אחי

    http://farm1.static.flickr.com/3/10445435_75b4546703.jpg?v=0


    On Mon, Jul 20, 2009 at 3:34 PM, OBender wrote:
    I've checked, and it appears to be enabled.

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 3:18 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, based on your previous comments (that you see text displayed
    in the wrong order), I again recommend that you enable support for RTL
    languages in your operating system, as I mentioned earlier. If you are
    using a Windows-based OS, this is not enabled by default!

    I think you are seeing things in the incorrect order, and this is
    causing confusion for you!

    On Mon, Jul 20, 2009 at 3:02 PM, Robert Muir wrote:
    Obender, I ran your code and it did what I expected (but not what you pasted):

    First token is: (טוֹב,0,4)
    Second token is: (עֶרֶב,5,10)

    I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
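
    The offsets reported above can be reproduced without Lucene by scanning the stored characters from index 0 upward. The following is only a sketch of whitespace splitting in plain Java, not the WhitespaceTokenizer implementation; the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java whitespace splitter that records offsets.
class WhitespaceOffsets {

    // Returns one {start, end} pair per whitespace-separated token,
    // counted in stored (logical) character order.
    static List<int[]> offsets(String text) {
        List<int[]> result = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip any whitespace before the next token.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                result.add(new int[] {start, i});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // TOV, a space, then EREV, in the stored order reported in the thread.
        String text = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        for (int[] o : offsets(text)) {
            System.out.println(text.substring(o[0], o[1]) + " (" + o[0] + "," + o[1] + ")");
        }
    }
}
```

    The vowel points count as characters of their own, which is why the first word spans four offsets and the second five.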

    On Mon, Jul 20, 2009 at 2:53 PM, OBender wrote:
    Here is the simple code. If you run it with English and with Hebrew, you will see that for English the tokens are returned from the left of the phrase to the right, and for Hebrew from the right to the left.

    Again, I'm talking about tokens here, not the individual letters.

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Pass-through filter that prints every token it receives.
    public class XFilter extends TokenFilter
    {
        protected XFilter( TokenStream tokenStream ) {
            super( tokenStream );
        }

        @Override
        public Token next( final Token reusableToken ) throws IOException
        {
            Token nextToken = input.next( reusableToken );
            System.out.println( nextToken != null ? nextToken : "" );
            return nextToken;
        }
    }

    // Whitespace tokenizer wrapped in the printing filter.
    public class SimpleWhitespaceAnalyzer extends Analyzer
    {
        @Override
        public TokenStream tokenStream( final String fieldName, final Reader reader )
        {
            TokenStream ts = new WhitespaceTokenizer( reader );
            ts = new XFilter( ts );

            return ts;
        }
    }

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 2:26 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, I think something in your environment / display environment
    might be causing some confusion.

    Are you using Microsoft Windows? If so, please verify that support for
    right-to-left languages is enabled [Control Panel / Regional and
    Language Options]. It is possible you are "seeing something different"
    because your rendering system is not actually rendering right-to-left
    text in the right-to-left direction!

    Second, instead of using a debugger, I would recommend using Luke to
    look at the resulting tokens from your analyzer.

    On Mon, Jul 20, 2009 at 2:21 PM, OBender wrote:
    This is how it should be written:
    http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 2:07 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, this is not true.
    The text you pasted is the following in Unicode:

    \N{HEBREW LETTER TET}
    \N{HEBREW LETTER VAV}
    \N{HEBREW POINT HOLAM}
    \N{HEBREW LETTER BET}
    \N{SPACE}
    \N{HEBREW LETTER AYIN}
    \N{HEBREW POINT SEGOL}
    \N{HEBREW LETTER RESH}
    \N{HEBREW POINT SEGOL}
    \N{HEBREW LETTER BET}

    You can use this utility to see how your text is encoded:
    http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

    For more information on directionality in Unicode, see
    http://unicode.org/reports/tr9/
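
    For mixed text, the JDK's java.text.Bidi class can show how a string stored in logical order breaks into directional runs for display. This is a small sketch, assuming the base direction should be taken from the first strong character; the BidiRuns class and the analyze method are illustrative names, not anything from Lucene.

```java
import java.text.Bidi;

// Illustrative only: inspects directional runs of a logically ordered string.
class BidiRuns {

    // Analyzes the text, taking the base direction from the first strong
    // character and falling back to left-to-right when there is none.
    static Bidi analyze(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    }

    public static void main(String[] args) {
        // Three Hebrew letters, a space, then three Latin letters, in stored order.
        String text = "\u05E2\u05E8\u05D1 abc";
        Bidi bidi = analyze(text);
        System.out.println("base left-to-right: " + bidi.baseIsLeftToRight());
        System.out.println("mixed directions:   " + bidi.isMixed());
        for (int i = 0; i < bidi.getRunCount(); i++) {
            // Odd levels are right-to-left runs, even levels are left-to-right runs.
            System.out.println("run " + i + ": [" + bidi.getRunStart(i) + ","
                    + bidi.getRunLimit(i) + ") level " + bidi.getRunLevel(i));
        }
    }
}
```

    The run boundaries are expressed as offsets into the stored string, so a filter that works on stored offsets never needs to know which way a run is drawn.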

    On Mon, Jul 20, 2009 at 1:59 PM, OBender wrote:
    Robert,

    I'm not sure you are correct on this one.

    If I have a Hebrew phrase:
    [טוֹב עֶרֶב]
    Then the first token that the filter receives is:
    [עֶרֶב] (0,5)
    and the second is:
    [טוֹב] (6,10)
    Which means that it counts from right to left (words and indexes).

    Am I missing something?

  • OBender at Jul 20, 2009 at 7:29 pm
    Interesting, the question now is why am I seeing (even in println) what I'm seeing :)
    I'm reading a string from the file which is in UTF-8 encoding. Could this somehow be related...?

  • Robert Muir at Jul 20, 2009 at 7:33 pm
    Obender, I think your input is incorrect. The Hebrew text you pasted
    in your example appears incorrect. It's going to be hard for me to
    communicate this, since I think your computer is not displaying Hebrew
    correctly :)

    But the text you sent as an example was [טוֹב עֶרֶב]

    Shouldn't the adjective follow the noun, like this: עֶרֶב טוֹב

    This makes me think your input is incorrect because it's being rendered
    incorrectly; as I mentioned, RTL support isn't enabled by default in Windows.
    But your input appears correct to you :)

  • OBender at Jul 20, 2009 at 8:05 pm
    OK, that makes a lot of sense (the input being incorrect).
    Let's just verify that :)

    At the end of the line:
    "but the text you sent as an example was" what I see is the word TOV [טוֹב] on the left and EREV [עֶרֶב] on the right.
    So it reads (for me) EREV TOV, which is correct.

    At the end of the line:
    "Shouldn't the adjective follow the noun like this" what I see is the word EREV [עֶרֶב] on the left and TOV [טוֹב] on the right.
    So it reads (for me) TOV EREV, which is not correct.

    Is the above the way you see the Hebrew text, or is it the other way around for you :) ?

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 3:34 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, I think your input is incorrect. The hebrew text you pasted
    in your example appears incorrect. Its gonna be hard for me to
    communicate this since I think your computer is not displaying hebrew
    correctly :)

    but the text you sent as an example was [טוֹב עֶרֶב]

    Shouldn't the adjective follow the noun like this: עֶרֶב טוֹב

    This makes me think your input is incorrect because its being rendered
    incorrectly, as I mentioned this isn't enabled by default in windows.
    But your input appears correct to you :)

    On Mon, Jul 20, 2009 at 3:29 PM, OBenderwrote:
    Interesting, the question now is why am I seeing (even in println) what I'm seeing :)
    I'm reading a string from the file which is in UTF-8 encoding. Could this somehow be related...?

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 3:03 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, i ran your code and it did what I expected (but not what you pasted):

    First token is: (טוֹב,0,4)
    Second token is: (עֶרֶב,5,10)

    I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.

    On Mon, Jul 20, 2009 at 2:53 PM, OBenderwrote:
    Here is the simple code. If you run it with English and with Hebrew you will see that in case of English tokens returned from the left of the phrase to the right and with Hebrew from the right to the left.

    Again I'm talking about tokens not the individual letters here.

    public class XFilter extends TokenFilter
    {
    protected XFilter( TokenStream tokenStream ) {
    super( tokenStream );
    }

    @Override
    public Token next( final Token reusableToken ) throws IOException
    {
    Token nextToken = input.next( reusableToken );
    System.out.println( nextToken != null? nextToken: "" );
    return nextToken;
    }
    }

    public class SimpleWhitespaceAnalyzer extends Analyzer
    {
    @Override
    public TokenStream tokenStream( final String fieldName, final Reader reader )
    {
    TokenStream ts = new WhitespaceTokenizer( reader );
    ts = new XFilter( ts );

    return ts;
    }
    }

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 2:26 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, I think something in your environment / display environment
    might be causing some confusion.

    Are you using microsoft windows? If so, please verify that support for
    right-to-left languages is enabled [control panel/regional and
    language options]. It is possible you are "seeing something different"
    because your rendering system is not actually rendering right-to-left
    text in right-to-left direction!!!!

    Second, Instead of using a debugger, I would recommend using Luke to
    look at resulting tokens from your analyzer.

    On Mon, Jul 20, 2009 at 2:21 PM, OBenderwrote:
    This is how it should be written:
    http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 2:07 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, This is not true.
    the text you pasted is the following in unicode:

    \N{HEBREW LETTER TET}
    \N{HEBREW LETTER VAV}
    \N{HEBREW POINT HOLAM}
    \N{HEBREW LETTER BET}
    \N{SPACE}
    \N{HEBREW LETTER AYIN}
    \N{HEBREW POINT SEGOL}
    \N{HEBREW LETTER RESH}
    \N{HEBREW POINT SEGOL}
    \N{HEBREW LETTER BET}

    you can use this utility to see how your text is encoded:
    http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

    For more information on directionality in unicode, see
    http://unicode.org/reports/tr9/

    On Mon, Jul 20, 2009 at 1:59 PM, OBenderwrote:
    Robert,

    I'm not sure you are correct on this one.

    If I have a Hebrew phrase:
    [טוֹב עֶרֶב]
    Then first token that filter receives is:
    [עֶרֶב] (0,5)
    and the second is:
    [טוֹב] (6,10)
    Which means that it counts from right to left (words and indexes).

    Am I missing something?

    -----Original Message-----
    From: Robert Muir
    Sent: Monday, July 20, 2009 1:43 PM
    To: java-user@lucene.apache.org
    Subject: Re: question on custom filter

    Obender, I don't think its as difficult as you think. Your filter does
    not need to be aware of this issue at all.

    In unicode, right-to-left languages are encoded in the data in logical order.
    The rendering system is what converts it to display in right-to-left
    for RTL languages.

    For example in Arabic, "Robert 1234" displays as روبرت 1234
    To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
    beh, waw, reh

    But the unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.

    2009/7/20 OBender <osya_bender@hotmail.com>:
    Hi All!



    Let say I have a filter that produces new tokens based on the original ones.

    How bad will it be if my filter sets the start of each token to 0 and end to
    the length of a token?

    An example (based on the phrase "How are you?":



    Original token:

    [you?] (8,12)



    New tokens:

    [you] (0,3)

    [?] (0,1)



    It wouldn't be so hard to calculate the right numbers for left to right
    languages and it is a bit more challenging to do it for right to left ones
    but for mixed text it is quite hard.



    Thanks.



Discussion Overview
group: java-user
categories: lucene
posted: Jul 20, '09 at 2:41p
active: Jul 20, '09 at 9:07p
posts: 18
users: 2
website: lucene.apache.org

2 users in discussion

OBender: 11 posts
Robert Muir: 7 posts
