FAQ
Hi,

I'm new to Lucene. I downloaded Lucene 2.4.1.

I have an XML file that contains a few special characters such as 'å', 'ø', '°', etc. (these are Danish language elements).

How can I search for these?





Uday Kumar Reddy Maddigatla

Software Engineer(Progrator|gatetrade)

MACH India(Operations)

Mobile: + 91-9963000377

Uday.Maddigatla@ness.com

ukma@mach.com

www.ness.com


  • Erick Erickson at Apr 21, 2009 at 2:35 pm
    Take a look at DutchAnalyzer. The problem you'll have is if you're indexing
    this document along with a bunch of documents from other languages.
    You could search the mail archive for extensive discussions of indexing/
    searching documents from several languages.

    Best
    Erick
  • Uday kumar maddigatla at Apr 22, 2009 at 6:58 am
    Hi,

    Thanks for your reply.

    I was able to find the DutchAnalyzer.

    When indexing my documents, I passed an instance of DutchAnalyzer as an
    argument to the IndexWriter class.

    After this, when I search for terms that contain the Danish elements, they
    are still not found.

    Please tell me how to use DutchAnalyzer in my application. A sample or a
    series of steps would help.

    I have also attached my indexing code (.java file):
    http://www.nabble.com/file/p23170710/IndexFiles.java IndexFiles.java

    Please help me with this.

    --
    View this message in context: http://www.nabble.com/How-to-search-special-characters-in-LUcene-tp23150039p23170710.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Apr 22, 2009 at 12:51 pm
    Are you *also* using the DutchAnalyzer for your *query*?

    Please show us the index and search code (simplified as much
    as possible), then we'll be able to provide better suggestions.

    Also, tell us a bit more about your goals here. Is this an
    index entirely of Dutch documents? Or is it a mixed-language
    index?

    Think about getting a copy of Luke and
    1> examining your index to see what's *really* there
    2> examining the effects of using different parsers on
    your *query*.
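The question about the query analyzer matters because a term is found only when the query text is reduced to the same token that was stored at index time. The sketch below is a toy model in plain Java, not Lucene code: `AnalyzerMismatchDemo`, `LOWERCASE`, and `AS_IS` are made-up names, the "index" is just a set of tokens, and the Danish letters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

class AnalyzerMismatchDemo {
    // Two stand-in "analyzers" (hypothetical, not Lucene classes).
    static final UnaryOperator<String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);
    static final UnaryOperator<String> AS_IS = s -> s;

    // Split on whitespace and store each token after normalization.
    static Set<String> index(String text, UnaryOperator<String> analyzer) {
        Set<String> terms = new HashSet<>();
        for (String word : text.trim().split("\\s+")) {
            terms.add(analyzer.apply(word));
        }
        return terms;
    }

    // Normalize the query the same way, then look it up.
    static boolean search(Set<String> terms, String query, UnaryOperator<String> analyzer) {
        return terms.contains(analyzer.apply(query));
    }

    public static void main(String[] args) {
        // The text is "Amtsgarden Hojbjerg" with a-ring and o-stroke.
        Set<String> terms = index("Amtsg\u00e5rden H\u00f8jbjerg", LOWERCASE);
        // Same analyzer on both sides: the term is found.
        System.out.println(search(terms, "H\u00f8jbjerg", LOWERCASE));
        // Different analyzer at query time: the term is missed.
        System.out.println(search(terms, "H\u00f8jbjerg", AS_IS));
    }
}
```

In Lucene terms, the same idea would mean handing the query parser the analyzer that was given to the IndexWriter.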

    Best
    Erick
  • Uday kumar maddigatla at Apr 23, 2009 at 5:43 am
    Hi,

    Here are the details of my goals:
    1. I want to use Lucene for mixed languages.
    2. I want to index documents that are in English, Danish, etc.
    I'm attaching my IndexFiles.java file.

    When I run the search, I give the index path as well as the documents
    folder.

    If I pass StandardAnalyzer to the IndexWriter, I am able to search for
    English words.

    How can I use DutchAnalyzer so that IndexFiles.java indexes the Danish
    elements?

    In the code I attached, you can see 'C:\test3'. This is the location where
    I want to store my indexes.

    I pass the documents folder location as a command-line argument.

    The content of my documents looks like this:

    <com:Note><![CDATA[Kreditnota til udligning af faktura nr. 13927 pga skal
    opsplittes
    hhv. byggeplads og skat
    Vedr. : Amtsgården Århus, Lyseng Allé 1, 8270 Højbjerg
    Bygning B
    SES Journal nr. : 42895-0001
    SES Navision nr.: Navision 9800124
    SES Ansvarlig : Martin Krøldrup Nielsen
    SES rådgiver : Friis & Moltke A/S
    Hermed fremsendes faktura på ekstra tømrerarbejde.
    Byggeplads Amtsgården B-4
    jvf. vedlagte specifikation - aftaleseddel nr. 12.]]></com:Note>

    I'm searching for a word like rådgiver. When I look at the result, it is
    clearly searching for "r dgiver". It is omitting the Danish element.

    Please help me with this.
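A letter that vanishes or turns into other characters is consistent with a character-encoding mismatch, although the thread never confirms that this is the cause here. The sketch below is a plain-Java illustration under that assumption: `EncodingMismatchDemo` and its methods are made-up names, and it assumes the text was written as UTF-8 but decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Decode UTF-8 bytes with the wrong charset, as a misconfigured reader would.
    static String decodeWrongly(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decode the same bytes with the charset they were written in.
    static String decodeCorrectly(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The query word from the post, with a-ring as its second letter.
        String word = "r\u00e5dgiver";
        // The a-ring is two bytes in UTF-8, so the wrong decoding yields 9 characters.
        System.out.println(decodeWrongly(word).length());
        // The right decoding round-trips the original 8 characters.
        System.out.println(decodeCorrectly(word).equals(word));
    }
}
```

If this is what is happening, reading the files with an explicit charset, for example `new InputStreamReader(new FileInputStream(file), "UTF-8")`, avoids depending on the platform default that `FileReader` uses.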



    http://www.nabble.com/file/p23190555/IndexFiles.java IndexFiles.java
    http://www.nabble.com/file/p23190555/SearchFiles.java SearchFiles.java
  • Erick Erickson at Apr 23, 2009 at 1:35 pm
    OK, this is a much different problem than you were originally
    asking about, effectively "how to index/search mixed language
    documents".

    This topic has been discussed multiple times on the user list; I
    think your first step should be to search the archive. I *was*
    going to find the old searchable mail archive, but those clever folks
    at Lucid Imagination have something new, see:

    http://www.lucidimagination.com/search/p:lucene?q=multiple+languages

    Once you've had a chance to look that over I think you'll be off and
    running.

    Best
    Erick
  • Uday kumar maddigatla at Apr 24, 2009 at 7:53 am
    Hi, thanks for your reply.

    After going through the site you gave, I understood that
    StandardAnalyzer is enough to handle these special characters.

    I'm attaching a class called AnalysisDemo.java. Running that class is
    what lets me say the above (i.e. that StandardAnalyzer is enough).

    Here is the output when I ran that Java file.
    Analyzing "Vedr. : Amtsgården Århus, Lyseng Allé 1, 8270 Højbjerg"
    org.apache.lucene.analysis.WhitespaceAnalyzer:
    [Vedr.] [:] [Amtsgården] [Århus,] [Lyseng] [Allé] [1,] [8270] [Højbjerg]

    org.apache.lucene.analysis.SimpleAnalyzer:
    [vedr] [amtsgården] [århus] [lyseng] [allé] [højbjerg]

    org.apache.lucene.analysis.StopAnalyzer:
    [vedr] [amtsgården] [århus] [lyseng] [allé] [højbjerg]

    org.apache.lucene.analysis.standard.StandardAnalyzer:
    [vedr] [amtsgården] [århus] [lyseng] [allé] [1] [8270] [højbjerg]

    org.apache.lucene.analysis.snowball.SnowballAnalyzer:
    [vedr] [amtsgården] [århus] [lyseng] [allé] [1] [8270] [højbjerg]

    By the above out put we can say that StandardAnalyzer is enough to get rid
    of danish elements.

    The only problem is that when I search for any term that includes Danish
    elements (such as højbjerg), it is not found.

    I also checked with Luke. I entered my sample text containing the Danish
    elements and selected StandardAnalyzer as the analyzer. When I clicked
    Analyze, it clearly produced index terms for the Danish words.

    I also tried loading my index directory into Luke. After loading the
    index, I searched for a word containing a Danish element, but this time
    it failed and showed nothing (i.e. 0 results).

    So I think the problem is either in building the index or in searching
    it.

    I went through the site you gave; that is how I was able to do this
    research.

    Please help me with this.


    http://www.nabble.com/file/p23211629/AnalysisDemo.java AnalysisDemo.java
  • Erick Erickson at Apr 24, 2009 at 1:48 pm
    I'm puzzled why you say

    "By the above out put we can say that StandardAnalyzer is
    enough to get rid of danish elements."

    It does NOT get rid of the accents, according to your own output.
    If your goal is to go ahead and index multiple language documents
    in a single index then search it, I'd recommend using the
    ISOLatin1AccentFilter. Compare the output of an analyzer like:

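    import java.io.Reader;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MyAnalyzer extends StandardAnalyzer {
        public TokenStream tokenStream(String s, Reader reader) {
            return new ISOLatin1AccentFilter(super.tokenStream(s, reader));
        }
    }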
    and you'll see that all the accents disappear, which will play much
    more nicely with queries that don't have accents.
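To make the effect concrete, here is a minimal stand-alone sketch of accent folding in plain Java. It is not the Lucene filter itself: `AccentFoldDemo` and `fold` are made-up names, the mapping covers only the characters that appear in this thread, and the letters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
class AccentFoldDemo {
    // Replace a handful of Latin-1 letters with unaccented ASCII equivalents.
    // Characters outside this small table are copied through unchanged.
    static String fold(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            switch (c) {
                case '\u00e5': out.append('a'); break; // a with ring
                case '\u00c5': out.append('A'); break; // A with ring
                case '\u00f8': out.append('o'); break; // o with stroke
                case '\u00d8': out.append('O'); break; // O with stroke
                case '\u00e9': out.append('e'); break; // e with acute
                case '\u00c9': out.append('E'); break; // E with acute
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tokens taken from the analyzer output earlier in the thread.
        System.out.println(fold("r\u00e5dgiver"));
        System.out.println(fold("h\u00f8jbjerg"));
        System.out.println(fold("all\u00e9"));
    }
}
```

With folding applied on both the indexing and the query side, a query typed as rådgiver and one typed as radgiver would reduce to the same term.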

    Best
    Erick
    On Fri, Apr 24, 2009 at 3:53 AM, uday kumar maddigatla wrote:


    Hi Thanks for your reply.

    After gone threw with the site which you given... i understood that
    StandardAnalyzer is enough to handle these special characters.

    i'm attaching one class called AnalysisDemo.java. By executing that class
    i'm able to say the above sentance(i.e StandardAnalyzer is enough).

    Here is the out put when i ran the above java file.
    Analzying "Vedr. : Amtsgården Århus, Lyseng Allé 1, 827
    Højbjerg"
    org.apache.lucene.analysis.WhitespaceAnalyzer:
    [Vedr.] [:] [Amtsgården] [Århus,] [Lyseng] [Allé] [1,]
    [8270] [Højbjerg]

    org.apache.lucene.analysis.SimpleAnalyzer:
    [vedr] [amtsgården] [århus] [lyseng] [allé] [højbjerg]

    org.apache.lucene.analysis.StopAnalyzer:
    [vedr] [amtsgården] [århus] [lyseng] [allé] [højbjerg]

    org.apache.lucene.analysis.standard.StandardAnalyzer:
    [vedr] [amtsgården] [århus] [lyseng] [allé] [1] [8270]
    [højbjerg]

    org.apache.lucene.analysis.snowball.SnowballAnalyzer:
    [vedr] [amtsgården] [århus] [lyseng] [allé] [1] [8270]
    [højbjerg]

    By the above out put we can say that StandardAnalyzer is enough to get rid
    of danish elements.

    But only problem is when i'm searching the any term which includes the
    danish elements(like højbjerg...)

    it is unable to find out.

    Even i checked with LUKE. In that i given my sample text which contains
    the
    danish elements and selected the StandardAnalyzer as analyser. when i click
    analyze in that it cleary making index of danish words.

    and also i givne one try on luke by loading my index directory in to luke.
    after loading my index i searched for a word which contains the danish
    element, But this time it was failed. It was shown nothing(i.e o resluts).

    As in my sense the problem might be making the indexes or in searching the
    item.


    I gone threw the site which you given. From that i'm able to do this kind
    of
    reaserch work.

    Please help me in this.


    Erick Erickson wrote:
    OK, this is a much different problem than you were originally
    asking about, effectively "how to index/search mixed language
    documents".

    This topic has been discussed multiple times on the user list, I
    think your first step should be to search the archive. I *was*
    going to find the old searchable mail archive, but those clever folks
    at Lucid Imagination have something new, see:

    http://www.lucidimagination.com/search/p:lucene?q=multiple+languages

    Once you've had a chance to look that over I think you'll be off and
    running.

    Best
    Erick

    On Thu, Apr 23, 2009 at 1:43 AM, uday kumar maddigatla wrote:
    Hi

    Here are the details about my goals.
    1. I want to use Lucene for mixed languages.
    2. I want to index documents which are either English or Danish etc.
    I'm attaching my IndexFiles.java file.

    When I'm searching I give the index path location as well as the
    documents folder.

    If I pass StandardAnalyzer as an argument to IndexWriter, it is able
    to search the English characters.

    How can I use DutchAnalyzer to make this IndexFiles.java index
    the Danish elements?

    In the code which I attached, you can see 'C:\test3'. This is the location
    where I want to store my indexes.

    I'm giving the documents folder location as a command line argument.

    In my document the content will be like this:

    <com:Note><![CDATA[Kreditnota til udligning af faktura nr. 13927 pga
    skal
    opsplittes
    hhv. byggeplads og skat
    Vedr. : Amtsgården Århus, Lyseng Allé 1, 8270 Højbjerg
    Bygning B
    SES Journal nr. : 42895-0001
    SES Navision nr.: Navision 9800124
    SES Ansvarlig : Martin Krøldrup Nielsen
    SES rådgiver : Friis & Moltke A/S
    Hermed fremsendes faktura på ekstra tømrerarbejde.
    Byggeplads Amtsgården B-4
    jvf. vedlagte specifikation - aftaleseddel nr. 12.]]></com:Note>

    I'm searching for a word like rådgiver. When I look at the result it is
    clearly searching for r dgiver. It is omitting the Danish element.

    Please help me with this.
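
A query that turns "rådgiver" into "r dgiver" is consistent with the å being lost before any analyzer sees it, for example when bytes are decoded with a charset that does not match the file or the console. The thread does not confirm that this is the cause here, so the following is only a sketch under that assumption. The class name CharsetSketch is made up for illustration, and it uses only the JDK, not Lucene. It shows how the same bytes come out intact or damaged depending on the charset used to decode them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    // Decode raw bytes with an explicitly chosen charset instead of the platform default.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "rådgiver", written with a Unicode escape so the source file encoding does not matter.
        byte[] utf8 = "r\u00e5dgiver".getBytes(StandardCharsets.UTF_8);

        // Decoded with the matching charset: the original word comes back.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));

        // Decoded with a mismatched charset: the two UTF-8 bytes of å
        // become two separate characters, so the term no longer matches.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

If the indexing or search code reads text through a reader that relies on a default charset, passing the charset explicitly (for instance via InputStreamReader with a charset name) is one way to rule this out.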



    Erick Erickson wrote:
    Are you *also* using the DutchAnalyzer for your *query*?

    Please show us the index and search code (simplified as much
    as possible), then we'll be able to provide better suggestions.

    Also, tell us a bit more about your goals here. Is this an
    index entirely of Dutch documents? Or is it a mixed-language
    index?

    Think about getting a copy of Luke and
    1> examining your index to see what's *really* there
    2> examining the effects of using different parsers on
    your *query*.

    Best
    Erick
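
The question about using the same analyzer for the *query* can be illustrated without Lucene. The sketch below is a toy in-memory index, not Lucene's API: the class TinyIndexSketch, its methods, and the lowercase normalizer are all assumptions made for illustration. It stores terms after one normalization step and shows that a lookup only succeeds when the query term goes through the same step.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

public class TinyIndexSketch {
    // term -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final UnaryOperator<String> indexNormalizer;

    public TinyIndexSketch(UnaryOperator<String> indexNormalizer) {
        this.indexNormalizer = indexNormalizer;
    }

    // Split on anything that is not a Unicode letter or digit, normalize, and store.
    public void add(int docId, String text) {
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) continue;
            postings.computeIfAbsent(indexNormalizer.apply(raw), k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a single term after applying the caller's query-side normalizer.
    public Set<Integer> search(String term, UnaryOperator<String> queryNormalizer) {
        return postings.getOrDefault(queryNormalizer.apply(term), Collections.emptySet());
    }

    public static void main(String[] args) {
        UnaryOperator<String> lower = s -> s.toLowerCase(Locale.ROOT);
        TinyIndexSketch index = new TinyIndexSketch(lower);
        index.add(1, "SES r\u00e5dgiver : Friis & Moltke A/S");

        // Same normalization on both sides: document 1 is found.
        System.out.println(index.search("R\u00e5dgiver", lower));
        // Query left unnormalized: the stored term differs, so nothing is found.
        System.out.println(index.search("R\u00e5dgiver", UnaryOperator.identity()));
    }
}
```

In Lucene terms the normalizer corresponds to the analyzer, which is why the analyzer given at indexing time and the one used to parse the query normally need to agree.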

  • Uday kumar maddigatla at Apr 27, 2009 at 11:26 am
    Hi,
    thanks for your reply. Please suggest what I should do now.

    I want to index documents which contain multiple languages.


    I am really waiting to complete this with your help.

    Please, please help me.

    Erick Erickson wrote:
    I'm puzzled why you say

    "By the above output we can say that StandardAnalyzer is
    enough to get rid of danish elements."

    It does NOT get rid of the accents, according to your own output.
    If your goal is to go ahead and index multiple language documents
    in a single index then search it, I'd recommend using the
    ISOLatin1AccentFilter. Compare the output of an analyzer like:

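    import java.io.Reader;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    // Wraps StandardAnalyzer's token stream in a filter that strips Latin-1 accents.
    public class MyAnalyzer extends StandardAnalyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
        }
    }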
    and you'll see that all the accents disappear, which will play much
    more nicely with queries that don't have accents.

    Best
    Erick
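
To get a feel for what accent folding does to the sample address without putting Lucene on the classpath, here is a minimal stand-alone sketch. It is not ISOLatin1AccentFilter itself: the class name AccentFoldSketch and its small mapping table are assumptions for illustration, covering only a few characters that occur in the sample text, whereas the real filter handles many more.

```java
import java.util.HashMap;
import java.util.Map;

public class AccentFoldSketch {
    // Hypothetical, deliberately tiny mapping; Unicode escapes keep the
    // source independent of the file encoding.
    private static final Map<Character, String> FOLD = new HashMap<>();
    static {
        FOLD.put('\u00e5', "a");  // å
        FOLD.put('\u00c5', "A");  // Å
        FOLD.put('\u00f8', "o");  // ø
        FOLD.put('\u00d8', "O");  // Ø
        FOLD.put('\u00e9', "e");  // é
        FOLD.put('\u00c9', "E");  // É
        FOLD.put('\u00e6', "ae"); // æ
        FOLD.put('\u00c6', "AE"); // Æ
    }

    // Replace every mapped character with its unaccented form; keep the rest as-is.
    public static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = FOLD.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // amtsgården århus allé højbjerg
        System.out.println(fold("amtsg\u00e5rden \u00e5rhus all\u00e9 h\u00f8jbjerg"));
    }
}
```

Whatever folding is chosen, it only helps if the query text is folded the same way as the indexed text, so a query for "hojbjerg" and one for "højbjerg" end up as the same term.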


Discussion Overview
group: java-user
categories: lucene
posted: Apr 21, '09 at 6:41a
active: Apr 27, '09 at 11:26a
posts: 10
users: 2
website: lucene.apache.org
