Help With Tokenization
I have a field that I currently store as a comma-separated list of Guid objects. This field is crucial to our search strategy.



I can't figure out how to get those guid objects to be tokenized. I'm playing with the idea of a custom Analyzer and TokenFilter to try and do this, but I'm not sure that's the way to go here.



As you can tell, I'm pretty new to Lucene and can't find any good documentation. :)



Thanks



Chris Martin

Software Developer - myKB.com

http://mykb.com | chris.martin@mykb.com | +1 480-424-6952 x124


  • Joe Shaw at Mar 16, 2007 at 9:15 pm
    Hi,
    On Fri, 2007-03-16 at 16:23 -0400, Martin, Chris wrote:
    I have a field that I currently store as a comma separated list of
    Guid objects. This field is crucial to our search strategy.

    I can't figure out how to get those guid objects to be tokenized. I'm
    playing with the idea of a custom Analyzer and TokenFilter to try and
    do this but, I'm not sure that's the way to go here.
    To tokenize a field, pass in Field.Index.TOKENIZED to the constructor.
    You'll also need to pass an analyzer to the IndexWriter constructor. Or
    do you mean something more specific?

    Joe
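    A minimal sketch of Joe's two steps, assuming the Lucene.Net API of that era (roughly 1.9/2.0); the "index" path and the GUID value are illustrative:

    ```csharp
    // Sketch only: assumes Lucene.Net 1.9/2.0-era APIs.
    // Requires: using Lucene.Net.Analysis.Standard; using Lucene.Net.Documents;
    //           using Lucene.Net.Index;

    // Pass an analyzer to the IndexWriter constructor...
    StandardAnalyzer analyzer = new StandardAnalyzer();
    IndexWriter writer = new IndexWriter("index", analyzer, true);

    // ...and mark the field as tokenized when adding it to a document.
    Document doc = new Document();
    doc.Add(new Field("directories",
                      "4c1052c1-62d8-4369-9134-984f4d68c556, 64867646-215e-41f7-8c9c-cd797b13bc58",
                      Field.Store.YES,
                      Field.Index.TOKENIZED));
    writer.AddDocument(doc);
    writer.Close();
    ```

    With this setup the analyzer decides how the field's text is broken into tokens; as the rest of the thread shows, that decision depends on the analyzer's own rules, not just on the TOKENIZED flag.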
  • Martin, Chris at Mar 16, 2007 at 9:29 pm
    Hi Joe,

    I do that already. My problem is that the field is not being tokenized.

    Say I have the following:

    String guids = "4c1052c1-62d8-4369-9134-984f4d68c556,64867646-215e-41f7-8c9c-cd797b13bc58";

    I store that in a field named "directories". But the value is stored literally, as it is written above. In order to search by one of the guids, I'd like to be able to store both Guids as tokens.

    Oh crap!!! I just realized that they weren't being tokenized because there is no space between the comma and the guid! <slapHead />.

    I did a test and it works as expected.

    Thanks for your help (attempted at least. hah)

    Chris Martin
    Software Developer – myKB.com
    http://mykb.com  chris.martin@mykb.com +1 480-424-6952 x124


  • Joe Shaw at Mar 17, 2007 at 2:47 pm
    Hi Chris,

    Martin, Chris wrote:
    I do that already. My problem is that the field is not being
    tokenized.

    Say I have the following:

    String guids =
    "4c1052c1-62d8-4369-9134-984f4d68c556,64867646-215e-41f7-8c9c-cd797b13bc58";

    I store that in a field named, "directories". But, the value is
    stored literally as it is written above. In order to search by one of
    the guids, I'd like to be able to store both Guids as tokens.

    Oh crap!!! I just realized that they weren't being tokenized because
    there is no space between the comma and the guid! <slapHead />.
    Well, the field was being tokenized in both cases, according to the rules
    of the analyzer you are using. It just happens that that analyzer
    tokenizes "a,b,c" as one token and "a, b, c" as three.

    Are you using an analyzer that stems terms? If so, you might get some
    unexpected results in very rare cases. In this case you probably want to
    break the field up yourself and add the values untokenized.

    Joe
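    Joe's suggestion of breaking the field up yourself starts with a plain split on the comma; a self-contained sketch (the GUID values come from the thread, the rest is illustrative):

    ```csharp
    using System;

    class SplitGuids
    {
        static void Main()
        {
            // The comma-separated value exactly as stored in the "directories" field.
            string guids = "4c1052c1-62d8-4369-9134-984f4d68c556,64867646-215e-41f7-8c9c-cd797b13bc58";

            // Split on the comma ourselves, so each GUID becomes one value,
            // independent of any analyzer's tokenization rules.
            string[] parts = guids.Split(new char[] { ',' },
                                         StringSplitOptions.RemoveEmptyEntries);

            foreach (string g in parts)
                Console.WriteLine(g.Trim());
        }
    }
    ```

    Each resulting value could then be added to the document as its own untokenized field value (e.g. with `Field.Index.UN_TOKENIZED` in the Lucene.Net API of that era), so that an exact TermQuery on a single GUID matches regardless of stemming or tokenizer rules.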
  • Digy at Mar 17, 2007 at 2:53 am
    Hi Chris,

    You can write your own analyzer, as below, to split the text (and use it
    in both indexing and searching).


    DIGY


    // Requires: using System; using System.Collections.Generic; using System.IO;
    public class MyAnalyzer : Lucene.Net.Analysis.Analyzer
    {
        public override Lucene.Net.Analysis.TokenStream TokenStream(string fieldName, TextReader reader)
        {
            Lucene.Net.Analysis.TokenStream result =
                new Lucene.Net.Analysis.Standard.StandardTokenizer(reader);
            result = new Lucene.Net.Analysis.Standard.StandardFilter(result);
            result = new SplitterFilter(result);
            return result;
        }

        class SplitterFilter : Lucene.Net.Analysis.TokenStream
        {
            Lucene.Net.Analysis.TokenStream stream = null;
            List<Lucene.Net.Analysis.Token> tokens = new List<Lucene.Net.Analysis.Token>();

            char[] Separators = new char[] { ',', '-' };

            public SplitterFilter(Lucene.Net.Analysis.TokenStream stream)
            {
                this.stream = stream;
            }

            public override Lucene.Net.Analysis.Token Next()
            {
                // Drain any sub-tokens queued up from a previous call first.
                if (tokens.Count > 0)
                {
                    Lucene.Net.Analysis.Token t = tokens[0];
                    tokens.RemoveAt(0);
                    return t;
                }

                Lucene.Net.Analysis.Token token = stream.Next();
                if (token == null) return null;

                string termText = token.TermText();
                string[] subTokens = termText.Split(Separators,
                                                    StringSplitOptions.RemoveEmptyEntries);
                if (subTokens.Length > 1)
                {
                    // Queue every sub-token after the first, preserving character
                    // offsets relative to the original token; return the first directly.
                    int tokenOffset = subTokens[0].Length;
                    for (int i = 1; i < subTokens.Length; i++)
                    {
                        tokenOffset = termText.IndexOf(subTokens[i], tokenOffset);
                        tokens.Add(new Lucene.Net.Analysis.Token(
                            subTokens[i],
                            token.StartOffset() + tokenOffset,
                            token.StartOffset() + tokenOffset + subTokens[i].Length,
                            Lucene.Net.Analysis.Standard.StandardTokenizerConstants.tokenImage[
                                Lucene.Net.Analysis.Standard.StandardTokenizerConstants.ALPHANUM]));
                        tokenOffset += subTokens[i].Length;
                    }

                    return new Lucene.Net.Analysis.Token(
                        subTokens[0],
                        token.StartOffset(),
                        token.StartOffset() + subTokens[0].Length,
                        Lucene.Net.Analysis.Standard.StandardTokenizerConstants.tokenImage[
                            Lucene.Net.Analysis.Standard.StandardTokenizerConstants.ALPHANUM]);
                }
                else
                {
                    return token;
                }
            }

            public override void Close()
            {
                this.stream.Close();
            }
        }
    }
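    Using DIGY's analyzer on both the indexing and the searching side might look like the sketch below (the path and values are illustrative). Note that because '-' is also a separator, a single GUID in a query is itself broken into its five segments, so quoting it as a phrase keeps them in order:

    ```csharp
    // Sketch only: assumes the MyAnalyzer class above plus Lucene.Net 1.9/2.0-era APIs.
    MyAnalyzer analyzer = new MyAnalyzer();

    // Index with the splitting analyzer.
    Lucene.Net.Index.IndexWriter writer =
        new Lucene.Net.Index.IndexWriter("index", analyzer, true);
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    doc.Add(new Lucene.Net.Documents.Field(
        "directories",
        "4c1052c1-62d8-4369-9134-984f4d68c556,64867646-215e-41f7-8c9c-cd797b13bc58",
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.TOKENIZED));
    writer.AddDocument(doc);
    writer.Close();

    // Search with the same analyzer so the query text is split the same way.
    Lucene.Net.QueryParsers.QueryParser parser =
        new Lucene.Net.QueryParsers.QueryParser("directories", analyzer);
    Lucene.Net.Search.Query query =
        parser.Parse("\"4c1052c1-62d8-4369-9134-984f4d68c556\"");
    ```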

  • Chris Martin at Mar 17, 2007 at 3:58 am
    Dude! Thanks DIGY! This will prove to be useful in the future for sure. :)

    Chris Martin
    Software Developer – myKB.com
    http://mykb.com  chris@mykb.com  +1 602-326-5200
  • Digy at Mar 17, 2007 at 5:09 pm
    Hi Chris,

    The sample code "MyAnalyzer" is not just for the future; if you like, you can use it in your code directly,
    or modify it to suit your needs.
    It splits tokens that contain '-' and ','.

    DIGY


Discussion Overview
group: lucene-net-user
categories: lucene
posted: Mar 16, '07 at 8:23p
active: Mar 17, '07 at 5:09p
posts: 7
users: 3
website: lucene.apache.org

3 users in discussion: Chris Martin (3 posts), Joe Shaw (2 posts), Digy (2 posts)
