FAQ
Hi,

I would like to know whether StandardAnalyzer supports searching for Chinese
words.

Also, to support Chinese searching, does the application need any particular
encoding?

I'm currently using Jetty as the web server and JSP for the application; search
results are saved to an XML file and displayed using XSL. Is an encoding
needed for any of these files (XML, XSL, etc.), as well as when parsing the
query?

Thanks a lot.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Chris Lu at Jun 17, 2007 at 6:09 pm
    There are three things to watch out for with Chinese or other CJK languages:

    1. The content source or database needs to be encoded in UTF-8.
    2. StandardAnalyzer doesn't support Chinese words well. Use either
    ChineseAnalyzer or CJKAnalyzer. In my experience, CJKAnalyzer is a
    little better.
    3. The user's query should be encoded in UTF-8.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

  • Lee Li Bin at Jun 18, 2007 at 12:48 pm
    Hi,

    I still have a problem searching for Chinese words.
    The XML file that is the data source, and the analyzer, have already been
    encoded. I have tested StandardAnalyzer, CJKAnalyzer, and ChineseAnalyzer,
    but I still can't get any results.

    1. Do we need any encoding configuration in Apache Tomcat for Chinese
    search using Lucene?

    2. Do we need to use a JSP meta tag / page encoding? What should the
    encoding be for JSP?



    Regards,
    Lee Li Bin

  • Mathieu Lecarme at Jun 18, 2007 at 12:58 pm

    Try a simple JUnit test first; after that you can fight with the UTF-8 parameters.

    M.

  • Lee Li Bin at Jun 18, 2007 at 1:16 pm
    Hi,

    Indexing is not a problem: when I open the index file in Notepad, I can see
    Chinese text matching my data source (XML).

    The problem is with searching. I have tried using UTF-8 in the JSP and
    getBytes with 'utf-8', ISO8859_1, or Cp1252 in the Java servlet, but search
    results are still not displayed for Chinese terms.

    My data source mixes English and Chinese text. Searching works for English
    terms, but Chinese characters are displayed as '???' in the result output.

    Please advise or send some samples / solutions.

    Thanks.
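
    A note on the '???' symptom: one common cause (an assumption here, since the
    thread never shows the servlet code) is that the text passes through a
    charset that cannot represent Chinese characters. The sketch below uses only
    the JDK; the class and method names are made up for illustration, and the
    Chinese text is written with Unicode escapes so the source file's own
    encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encode with a charset that has no Chinese characters, then decode again.
    // String.getBytes replaces every unmappable character with '?'.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // The same round trip through UTF-8 loses nothing.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mixed = "Lucene \u4e2d\u6587"; // English plus two Chinese characters
        System.out.println(roundTripLatin1(mixed));             // prints "Lucene ??"
        System.out.println(roundTripUtf8(mixed).equals(mixed)); // prints "true"
    }
}
```

    Once a character has been turned into '?', no later getBytes / new String
    conversion can bring it back, so the fix has to be made where the text is
    first encoded.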

  • Chris Lu at Jun 18, 2007 at 6:02 pm
    Basically, wherever you see an encoding setting, it should be UTF-8.

    The servlet also has an encoding setting. In your case, change the
    Tomcat setting.
    The encoding also matters when rendering the JSP page.
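
    To see why the container-side setting matters, here is a small JDK-only
    sketch of what happens when a percent-encoded query parameter is decoded
    with the wrong charset. The class name is hypothetical, and it only mimics
    the decoding step a servlet container performs. In a real deployment the
    corresponding knobs are usually request.setCharacterEncoding("UTF-8") for
    POST bodies, the URIEncoding attribute on Tomcat's connector for GET
    parameters, and the pageEncoding / contentType attributes of the JSP page
    directive; which ones apply depends on the container and its version.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryDecodingDemo {
    // Decode a raw, percent-encoded parameter value using the given charset name.
    static String decode(String rawParam, String charset) throws UnsupportedEncodingException {
        return URLDecoder.decode(rawParam, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String query = "\u4e2d\u6587"; // two Chinese characters

        // What a browser would send for a UTF-8 encoded form.
        String raw = URLEncoder.encode(query, "UTF-8");
        System.out.println(raw); // prints "%E4%B8%AD%E6%96%87"

        // Decoded as UTF-8, the original query comes back.
        System.out.println(decode(raw, "UTF-8").equals(query)); // prints "true"

        // Decoded as ISO-8859-1, each of the six bytes becomes its own
        // character, so the analyzer never sees the Chinese text at all.
        System.out.println(decode(raw, "ISO-8859-1").length()); // prints "6"
    }
}
```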

  • Karl wettin at Jun 18, 2007 at 8:03 pm
    A year or two ago I hacked Lucene to use UTF-16 instead of UTF-8, as CJK
    characters are represented by 3 bytes in UTF-8 and 2 bytes in
    UTF-16. It is a simple hack.

    However, it did not save me that much, as I had a mixed Latin and CJK
    corpus, and I reverted. I still think it is something worth
    considering. Perhaps it might be worth implementing a per-index, per-document,
    or per-field string encoding strategy.
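
    The size trade-off described above can be checked with a few lines of plain
    Java. This is only an illustration of the byte counts, not the hack itself;
    the class and method names are invented, and UTF_16BE is used so that no
    byte-order mark is counted.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes as UTF-16, big-endian, without a byte-order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String cjk = "\u4e2d\u6587";          // two Chinese characters
        String latin = "lucene";              // six ASCII letters
        String mixed = latin + " " + cjk;     // a mixed corpus in miniature

        System.out.println(utf8Bytes(cjk) + " vs " + utf16Bytes(cjk));     // prints "6 vs 4"
        System.out.println(utf8Bytes(latin) + " vs " + utf16Bytes(latin)); // prints "6 vs 12"
        System.out.println(utf8Bytes(mixed) + " vs " + utf16Bytes(mixed)); // prints "13 vs 18"
    }
}
```

    UTF-16 wins on the pure CJK string and loses on the Latin one, which is
    consistent with the limited savings reported for a mixed corpus.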




  • Chris Lu at Jun 18, 2007 at 8:37 pm
    Hi, Karl,

    Thanks for sharing this experience.

    I did find that CJKAnalyzer somehow behaves differently from
    ChineseAnalyzer. When I tried to highlight the matched term,
    ChineseAnalyzer didn't work for some reason, but I didn't investigate it.

    This is a useful clue.

  • Karl wettin at Jun 18, 2007 at 8:45 pm
    Don't they differ in tokenization? One of them uses grams, the other
    does not, right? That would be another thing that might mess it up. But
    then I never looked at the highlighter, so I can only guess.

    --
    karl
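
    To make the tokenization difference concrete, here is a rough sketch of the
    two strategies in plain Java. It does not use Lucene's classes; as far as
    I know, ChineseAnalyzer emitted one token per Chinese character while
    CJKAnalyzer emitted overlapping two-character tokens, but treat that mapping
    as an assumption. The class and method names are invented, and the code
    assumes the text contains only CJK characters from the Basic Multilingual
    Plane.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkTokenizationSketch {
    // One token per character.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping two-character tokens; a single character is kept as is.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587\u641c\u7d22"; // four Chinese characters
        System.out.println(unigrams(text).size()); // prints "4"
        System.out.println(bigrams(text).size());  // prints "3"
    }
}
```

    The two strategies produce different terms with different offsets for the
    same text, so anything that depends on term positions, such as a
    highlighter, can plausibly behave differently between them.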

  • Lee Li Bin at Jun 19, 2007 at 8:19 am
    Hi,

    Thanks, guys, for helping me.

    I forgot to use the same analyzer for searching that I used for indexing;
    that's why I couldn't search for Chinese words. :)
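
    The resolution above can be reproduced without Lucene using a toy inverted
    index. Everything in this sketch is hypothetical (the class, the two
    stand-in "analyzers", the AND-style search); it only shows why terms
    produced at query time must match the terms produced at index time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

public class AnalyzerMismatchSketch {
    // Stand-in analyzer: one token per character.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Stand-in analyzer: overlapping two-character tokens.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Build a map from each term to the ids of the documents containing it.
    static Map<String, Set<Integer>> index(List<String> docs,
                                           Function<String, List<String>> analyzer) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyzer.apply(docs.get(id))) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // Return the ids of documents that contain every term of the analyzed query.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query,
                               Function<String, List<String>> analyzer) {
        Set<Integer> hits = null;
        for (String term : analyzer.apply(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        if (hits == null) {
            return new TreeSet<Integer>();
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Collections.singletonList("\u4e2d\u6587\u641c\u7d22");
        Map<String, Set<Integer>> postings = index(docs, AnalyzerMismatchSketch::bigrams);
        String query = "\u4e2d\u6587";

        // Same analyzer on both sides: the document is found.
        System.out.println(search(postings, query, AnalyzerMismatchSketch::bigrams));  // prints "[0]"
        // Different analyzer at query time: no terms match, so no results.
        System.out.println(search(postings, query, AnalyzerMismatchSketch::unigrams)); // prints "[]"
    }
}
```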



  • Otis Gospodnetic at Jun 22, 2007 at 10:37 am
    Regarding point #2 in Chris Lu's first reply, in case none of those analyzers works for you for some reason, you could always try using these:

    $ ll analyzers/src/java/org/apache/lucene/analysis/ngram/
    total 48
    -rw-rw-r-- 1 otis otis 4934 Mar 2 16:32 EdgeNGramTokenFilter.java
    -rw-rw-r-- 1 otis otis 4617 Feb 21 15:33 EdgeNGramTokenizer.java
    -rw-rw-r-- 1 otis otis 3257 Mar 2 17:12 NGramTokenFilter.java
    -rw-rw-r-- 1 otis otis 3103 Mar 2 16:33 NGramTokenizer.java
    drwxrwxr-x 7 otis otis 4096 May 31 10:11 .svn/

    Otis
    --
    Lucene Consulting -- http://lucene-consulting.com/



Discussion Overview
group: java-user @
categories: lucene
posted: Jun 17, '07 at 2:51p
active: Jun 22, '07 at 10:37a
posts: 11
users: 5
website: lucene.apache.org
