Hi, all
I currently need a TokenFilter to break the token season07 into two tokens:
season 07

I tried PatternReplaceCharFilter to replace "season07" with "season 07",
but the resulting offsets are not correct for highlighting. For this reason, I want
to implement a TokenFilter, but I do not know how to deal with the offsets.
My implementation currently follows EdgeNGramTokenFilter:
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class AlphaNumberTokenFilter extends TokenFilter
{
    // Copy of the term currently being split; null means "fetch the next input token".
    private char[] curTermBuffer;
    private int curTermLength;
    // Index into curTermBuffer where the next emitted part starts.
    private int currentOffset;
    // Start offset of the current input token in the original text.
    private int baseOffset;

    private TermAttribute termAtt;
    private OffsetAttribute offsetAtt;

    protected AlphaNumberTokenFilter(TokenStream input)
    {
        super(input);
        this.termAtt = addAttribute(TermAttribute.class);
        this.offsetAtt = addAttribute(OffsetAttribute.class);
    }

    @Override
    public final boolean incrementToken() throws IOException
    {
        while (true)
        {
            if (curTermBuffer == null)
            {
                if (!input.incrementToken())
                {
                    return false;
                }
                curTermBuffer = (char[]) termAtt.termBuffer().clone();
                curTermLength = termAtt.termLength();
                currentOffset = 0;
                baseOffset = offsetAtt.startOffset();
            }
            if (currentOffset < curTermLength)
            {
                // Emit the part up to the next letter-to-digit boundary, if any.
                for (int i = currentOffset; i < curTermLength - 1; i++)
                {
                    if (Character.isLetter(curTermBuffer[i]) && Character.isDigit(curTermBuffer[i + 1]))
                    {
                        int start = currentOffset;
                        int end = i + 1;
                        offsetAtt.setOffset(baseOffset + start, baseOffset + end);
                        termAtt.setTermBuffer(curTermBuffer, start, end - start);
                        currentOffset = end;
                        return true;
                    }
                }
                // No further boundary: emit the rest of the term.
                int start = currentOffset;
                int end = curTermLength;
                offsetAtt.setOffset(baseOffset + start, baseOffset + end);
                termAtt.setTermBuffer(curTermBuffer, start, end - start);
                currentOffset = curTermLength;
                return true;
            }
            // Term exhausted: move on to the next input token.
            curTermBuffer = null;
        }
    }

    @Override
    public void reset() throws IOException
    {
        super.reset();
        curTermBuffer = null;
    }
}
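
For reference, the offset arithmetic that the filter above relies on can be sketched in plain Java, without Lucene. This is only an illustration: the class AlphaDigitSplitter, the Part type, and the split method are hypothetical names, not part of any Lucene API. It assumes that each character of the term maps one-to-one onto a character of the original text, so a part's offsets are simply the token's start offset plus the part's index range within the term.

```java
import java.util.ArrayList;
import java.util.List;

public class AlphaDigitSplitter {

    /** One emitted piece: its text plus start/end offsets in the original input. */
    public static final class Part {
        public final String text;
        public final int startOffset;
        public final int endOffset;

        public Part(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return text + "[" + startOffset + "," + endOffset + ")";
        }
    }

    /**
     * Splits a term at every letter-to-digit boundary.
     * baseOffset is the start offset of the whole term in the original text
     * (hypothetical parameter; it plays the role of offsetAtt.startOffset()).
     */
    public static List<Part> split(String term, int baseOffset) {
        List<Part> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i + 1 < term.length(); i++) {
            if (Character.isLetter(term.charAt(i)) && Character.isDigit(term.charAt(i + 1))) {
                // Boundary between index i and i + 1: emit [start, i + 1).
                parts.add(new Part(term.substring(start, i + 1), baseOffset + start, baseOffset + i + 1));
                start = i + 1;
            }
        }
        if (start < term.length()) {
            // Emit whatever remains after the last boundary.
            parts.add(new Part(term.substring(start), baseOffset + start, baseOffset + term.length()));
        }
        return parts;
    }

    public static void main(String[] args) {
        // A term "season07" that starts at offset 10 in the original text.
        System.out.println(split("season07", 10)); // prints [season[10,16), 07[16,18)]
    }
}
```

If a CharFilter or an earlier filter has changed the length of the text, this one-to-one assumption no longer holds, and the offsets computed this way would need to be corrected accordingly.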

--
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang


  • Koji Sekiguchi at Dec 15, 2009 at 11:17 am

    Weiwei Wang wrote:
    > Hi, all
    > I currently need a TokenFilter to break token season07 into two tokens
    > season 07

    I'd recommend looking at WordDelimiterFilter in Solr.

    Koji

    --
    http://www.rondhuit.com/en/


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Uwe Schindler at Dec 15, 2009 at 11:23 am
    And if you do it yourself, don't forget to call clearAttributes() whenever
    you produce new tokens (otherwise you may get bugs in the token increments). In
    the old Token API it's Token.clear()... Just a warning!

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

  • Paul Taylor at Dec 15, 2009 at 12:25 pm

    Uwe Schindler wrote:
    > And if you do it yourself, don't forget to call clearAttributes() whenever
    > you produce new tokens (otherwise you may get bugs in the token increments). In
    > the old Token API it's Token.clear()... Just a warning!

    This comment has worried me. Is this OK, or am I meant to call
    clearAttributes() somewhere?


    public class StripLeadingZeroFilter extends TokenFilter {

        private TermAttribute termAtt;

        /**
         * Construct filtering <i>in</i>.
         */
        public StripLeadingZeroFilter(TokenStream in) {
            super(in);
            termAtt = (TermAttribute) addAttribute(TermAttribute.class);
        }

        /**
         * Removes a single leading zero from the token, if present.
         */
        public final boolean incrementToken() throws java.io.IOException {
            if (!input.incrementToken()) {
                return false;
            }

            char[] buffer = termAtt.termBuffer();
            final int bufferLength = termAtt.termLength();

            // Guard against empty terms before looking at the first char.
            if (bufferLength > 0 && buffer[0] == '0') {
                // Shift the remaining chars left by one, in place.
                for (int i = 1; i < bufferLength; i++) {
                    buffer[i - 1] = buffer[i];
                }
                termAtt.setTermLength(bufferLength - 1);
            }
            return true;
        }
    }


    Thanks, Paul

  • Uwe Schindler at Dec 15, 2009 at 12:47 pm

    >> And if you do it yourself, don't forget to call clearAttributes() whenever
    >> you produce new tokens (otherwise you may get bugs in the token increments). In
    >> the old Token API it's Token.clear()... Just a warning!
    > This comment has worried me. Is this OK, or am I meant to call
    > clearAttributes() somewhere?

    Your filter is fine. As noted, you should call clearAttributes() when you
    produce new tokens, but you are only modifying existing ones. The example
    Weiwei mentioned was splitting a token into two, so for the second
    generated token you really must initialize all attributes to their default values.

    That is the reason for the warning.
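
To illustrate the warning, here is a minimal plain-Java simulation of what can go wrong when a second token is generated without resetting the shared attribute state. It does not use Lucene: the Attrs class, its fields, and emitSplit are hypothetical stand-ins for a stream's attributes and for clearAttributes(). The scenario assumes that an upstream filter left a position increment of 2 on the incoming token (for example, because a stop word was removed before it).

```java
import java.util.ArrayList;
import java.util.List;

public class ClearAttributesDemo {

    /** Hypothetical stand-in for the attribute state shared along a token stream. */
    public static final class Attrs {
        public String term = "";
        public int positionIncrement = 1; // assumed default value

        /** Plays the role of clearAttributes(): reset everything to defaults. */
        public void clear() {
            term = "";
            positionIncrement = 1;
        }
    }

    /**
     * Emits two tokens from one incoming token whose position increment
     * was set upstream. Returns a rendering of each emitted token.
     */
    public static List<String> emitSplit(int upstreamIncrement, String first,
                                         String second, boolean clearBeforeSecond) {
        Attrs attrs = new Attrs();
        attrs.positionIncrement = upstreamIncrement; // state left by the upstream filter

        List<String> out = new ArrayList<>();

        // First token: reuses the incoming token's attributes, only the term changes.
        attrs.term = first;
        out.add(attrs.term + "(+" + attrs.positionIncrement + ")");

        // Second token: newly generated, so stale state must be reset first.
        if (clearBeforeSecond) {
            attrs.clear();
        }
        attrs.term = second;
        out.add(attrs.term + "(+" + attrs.positionIncrement + ")");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emitSplit(2, "season", "07", false)); // prints [season(+2), 07(+2)]
        System.out.println(emitSplit(2, "season", "07", true));  // prints [season(+2), 07(+1)]
    }
}
```

Without the reset, the generated token silently inherits the increment of the token it was split from. A filter that only modifies the current token, like StripLeadingZeroFilter, never generates a second token and so does not run into this.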

  • Weiwei Wang at Dec 15, 2009 at 12:48 pm
    WordDelimiterFilter is implemented against the old TokenStream API, where
    next(Token) is called rather than incrementToken().

Discussion Overview
group: java-user
categories: lucene
posted: Dec 15, '09 at 9:02a
active: Dec 15, '09 at 12:48p
posts: 6
users: 4
website: lucene.apache.org
