Analyzer query
Hi,

In our application we are using Lucene.Net version 2.1.0.2.

For globalization we need to use several other analyzers, such as:

GermanAnalyzer
RussianAnalyzer
BrazilianAnalyzer
etc.

These require namespaces such as Lucene.Net.Analysis.Br.
With the present DLL I do not get any of these analyzers.

Is it possible to use these analyzers with this version?
If not, which version of the DLL should we use?

Thanks,
Sudhanya






  • Michael Garski at Apr 25, 2008 at 5:25 pm
    Sudhanya,

    Those analyzers have not yet been ported to .Net from the Java Lucene project. Check out the analyzers in the contrib section of the Java Lucene source code - I don't believe they would be too difficult to port over.

    Michael
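
    For what it's worth, once one of those contrib analyzers is ported it should drop
    in wherever StandardAnalyzer is used today. A minimal sketch, assuming a
    hypothetical Lucene.Net.Analysis.De.GermanAnalyzer ported from the Java contrib
    class, with the stock Lucene.Net 2.x indexing API ("german-index" and "content"
    are placeholder names):

    using Lucene.Net.Analysis;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    class GermanIndexingSketch
    {
        static void Main()
        {
            // Hypothetical: assumes a GermanAnalyzer ported from the Java contrib
            // class org.apache.lucene.analysis.de.GermanAnalyzer into
            // Lucene.Net.Analysis.De.
            Analyzer analyzer = new Lucene.Net.Analysis.De.GermanAnalyzer();

            // Stock Lucene.Net 2.x indexing calls.
            IndexWriter writer = new IndexWriter("german-index", analyzer, true);

            Document doc = new Document();
            doc.Add(new Field("content", "Die Analysatoren zerlegen den Text.",
                              Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
            writer.Close();
        }
    }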

  • Sudhanya Chatterjee at Apr 29, 2008 at 12:58 pm
    Hi,

    The StandardAnalyzer fulfills all requirements in our application, apart
    from one.

    The input text that goes for indexing is:

    My name is Sudhanya.Chatterjee is my second name. Cost of book is 50.6

    StandardTokenizer tokenizes correctly on punctuation, but when a dot is not
    followed by a space it treats the whole thing as a single token.

    In the above case "Sudhanya" and "Chatterjee" should come out as two
    tokens, while 50.6 should stay one token. In other words, a dot with digits
    around it (as in 50.6) should be kept intact.

    This is one extra rule I want on top of the existing StandardAnalyzer
    rules.

    How can I add this requirement to the existing StandardAnalyzer behavior,
    keeping the rest of its rules intact?



    Thanks,

    Sudhanya
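
    For reference, the behavior described above can be reproduced by dumping tokens
    with the Lucene.Net 2.x stream API (a minimal sketch; "content" is just a
    placeholder field name):

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    class TokenDump
    {
        static void Main()
        {
            string text =
                "My name is Sudhanya.Chatterjee is my second name. Cost of book is 50.6";

            Analyzer analyzer = new StandardAnalyzer();
            TokenStream ts = analyzer.TokenStream("content", new StringReader(text));

            // With the stock StandardAnalyzer, "Sudhanya.Chatterjee" comes out as a
            // single (lowercased) token, just as "50.6" does.
            Lucene.Net.Analysis.Token t;
            while ((t = ts.Next()) != null)
                Console.WriteLine(t.TermText());
        }
    }
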










  • Digy at Apr 29, 2008 at 4:44 pm
    Something like that?



    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    public class MyTokenStream : TokenStream
    {
        TokenStream _TokenStream;
        Lucene.Net.Analysis.Token _SecondPart = null;

        public MyTokenStream(TokenStream tokenStream)
        {
            _TokenStream = tokenStream;
        }

        bool IsNumber(string s)
        {
            for (int i = 0; i < s.Length; i++)
                if (Char.IsNumber(s[i]) == false) return false;
            return true;
        }

        public override Lucene.Net.Analysis.Token Next()
        {
            Lucene.Net.Analysis.Token t = null;

            // Return the buffered second half of a previously split token.
            if (_SecondPart != null)
            {
                t = _SecondPart;
                _SecondPart = null;
                return t;
            }

            t = _TokenStream.Next();
            if (t == null) return null;

            string s = t.TermText();
            if (s.Contains(".") == false) return t;

            string[] parts = s.Split(new char[] { '.' },
                StringSplitOptions.RemoveEmptyEntries);

            // Give up unless there are exactly two parts (also guards against
            // tokens like "abc." that would otherwise throw below).
            if (parts.Length != 2) return t;

            // Keep numbers such as 50.6 intact.
            if (IsNumber(parts[0]) && IsNumber(parts[1])) return t;

            // Form token "abcd" from the first part of abcd.defg
            Lucene.Net.Analysis.Token retToken = new Lucene.Net.Analysis.Token(
                parts[0], t.StartOffset(), t.StartOffset() + parts[0].Length);
            retToken.SetPositionIncrement(t.GetPositionIncrement());

            // Form token "defg" from the second part of abcd.defg
            int start = t.StartOffset() + parts[0].Length + 1;
            _SecondPart = new Lucene.Net.Analysis.Token(parts[1], start,
                start + parts[1].Length);

            return retToken;
        }
    }

    public class MyAnalyzer : StandardAnalyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            return new MyTokenStream(base.TokenStream(fieldName, reader));
        }
    }
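
    A minimal usage sketch for the classes above (the index path and field name are
    placeholders; the IndexWriter/QueryParser/IndexSearcher calls are the stock
    Lucene.Net 2.x ones). The important point is to use the same analyzer at index
    time and at query time, so "Sudhanya.Chatterjee" is split the same way on both
    sides:

    using Lucene.Net.Analysis;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    class MyAnalyzerUsage
    {
        static void Main()
        {
            Analyzer analyzer = new MyAnalyzer();

            // Index time: "my-index" and "content" are placeholder names.
            IndexWriter writer = new IndexWriter("my-index", analyzer, true);
            Document doc = new Document();
            doc.Add(new Field("content",
                "My name is Sudhanya.Chatterjee is my second name. Cost of book is 50.6",
                Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
            writer.Close();

            // Query time: the same analyzer, so a search for "chatterjee" matches.
            QueryParser parser = new QueryParser("content", analyzer);
            Query query = parser.Parse("chatterjee");

            IndexSearcher searcher = new IndexSearcher("my-index");
            Hits hits = searcher.Search(query);
            System.Console.WriteLine(hits.Length());
            searcher.Close();
        }
    }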





    DIGY



  • Ben Martz at Apr 29, 2008 at 5:41 pm
    Have you looked at the standard WordDelimiterFilter from Solr by chance? In
    keeping with the spirit of Lucene.Net, I did a simple line-by-line
    conversion from Java just a couple of days ago and it might fulfill your
    needs. I'm going to try to attach it to this email - hopefully it doesn't
    mess up the list engine.

    Cheers,
    Ben

    /**
     * Splits words into subwords and performs optional transformations on subword groups.
     * Words are split into subwords with the following rules:
     *  - split on intra-word delimiters (by default, all non alpha-numeric characters).
     *    "Wi-Fi" -> "Wi", "Fi"
     *  - split on case transitions
     *    "PowerShot" -> "Power", "Shot"
     *  - split on letter-number transitions
     *    "SD500" -> "SD", "500"
     *  - leading and trailing intra-word delimiters on each subword are ignored
     *    "//hello---there, 'dude'" -> "hello", "there", "dude"
     *  - trailing "'s" are removed for each subword
     *    "O'Neil's" -> "O", "Neil"
     *    Note: this step isn't performed in a separate filter because of possible
     *    subword combinations.
     *
     * The <b>combinations</b> parameter affects how subwords are combined:
     *  - combinations="0" causes no subword combinations.
     *    "PowerShot" -> 0:"Power", 1:"Shot" (0 and 1 are the token positions)
     *  - combinations="1" means that in addition to the subwords, maximum runs of
     *    non-numeric subwords are catenated and produced at the same position of the
     *    last subword in the run.
     *    "PowerShot" -> 0:"Power", 1:"Shot" 1:"PowerShot"
     *    "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
     *    "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL",
     *    2:"SuperDuperXL", 3:"500" 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
     *
     * One use for WordDelimiterFilter is to help match words with different subword
     * delimiters. For example, if the source text contained "wi-fi" one may want
     * "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so is to
     * specify combinations="1" in the analyzer used for indexing, and combinations="0"
     * (the default) in the analyzer used for querying. Given that the current
     * StandardTokenizer immediately removes many intra-word delimiters, it is
     * recommended that this filter be used after a tokenizer that does not do this
     * (such as WhitespaceTokenizer).
     *
     * @author yonik
     * @version $Id: WordDelimiterFilter.java 472574 2006-11-08 18:25:52Z yonik $
     */
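
    As the comment above says, this filter should sit after a tokenizer that
    preserves intra-word delimiters, such as WhitespaceTokenizer. A minimal wiring
    sketch (the WordDelimiterFilter constructor shown here is an assumption based on
    the Solr original of that era - generate word parts, generate number parts, then
    the three catenate flags - so adjust it to whatever the ported class actually
    exposes):

    using System.IO;
    using Lucene.Net.Analysis;

    public class WordDelimiterAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            // WhitespaceTokenizer keeps "wi-fi", "PowerShot", etc. intact so the
            // filter can do the splitting itself.
            TokenStream result = new WhitespaceTokenizer(reader);

            // Assumed constructor: (input, generateWordParts, generateNumberParts,
            // catenateWords, catenateNumbers, catenateAll).
            result = new WordDelimiterFilter(result, 1, 1, 0, 0, 0);

            result = new LowerCaseFilter(result);
            return result;
        }
    }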

    --
    13:37 - Someone stole the precinct toilet. The cops have nothing to go on.
    14:37 - Officers dispatched to a daycare where a three-year-old was
    resisting a rest.
    21:11 - Hole found in nudist camp wall. Officers are looking into it.

Discussion Overview
group: lucene-net-user
categories: lucene
posted: Apr 25, '08 at 1:24p
active: Apr 29, '08 at 5:41p
posts: 5
users: 4
website: lucene.apache.org
