FAQ
Hello,
I'm interested in knowing how these tokenizers work together.
The API doc for TeeTokenFilter
http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/TeeTokenFilter.html

has this sample code:
SinkTokenizer sink1 = new SinkTokenizer(null);
SinkTokenizer sink2 = new SinkTokenizer(null);

TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);

TokenStream final3 = new EntityDetect(sink1);
TokenStream final4 = new URLDetect(sink2);

with an explanation that reads "sink1 and sink2 will both get tokens from both reader1 and reader2 after whitespace tokenizer",
but I don't understand how the input from reader1 and reader2 is mixed together.
Will sink1 first return the reader1 text, then the reader2 text?
Or are they mixed randomly?

-Kuro


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Grant Ingersoll at Aug 22, 2008 at 9:46 pm

    On Aug 22, 2008, at 3:47 PM, Teruhiko Kurosaka wrote:

    Hello,
    I'm interested in knowing how these tokenizers work together.
    The API doc for TeeTokenFilter
    http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/TeeTokenFilter.html

    has this sample code:
    SinkTokenizer sink1 = new SinkTokenizer(null);
    SinkTokenizer sink2 = new SinkTokenizer(null);

    TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
    TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);

    TokenStream final3 = new EntityDetect(sink1);
    TokenStream final4 = new URLDetect(sink2);

    with an explanation that reads "sink1 and sink2 will both get tokens
    from both reader1 and reader2 after whitespace tokenizer",
    but I don't understand how the input from reader1 and reader2 is
    mixed together.
    Will sink1 first return the reader1 text, then the reader2 text?
    It depends on the order the fields are added. If source1 is used
    first, then reader1 will be first.

    Try out the code at the bottom. I get the following if source1 is
    first:

    ------
    final 1
    (a,0,1)
    (b,2,3)
    (c,4,5)
    (d,6,7)
    (f,8,9)
    (g,10,11)
    -------- end final 1 -------
    ------
    final 2
    (h,0,1)
    (i,2,3)
    (J,4,5)
    (k,6,7)
    (L,8,9)
    (m,10,11)
    -------- end final 2 -------
    ------
    final 3
    (a,0,1)
    (c,4,5)
    (F,8,9)
    (g,10,11)
    (h,0,1)
    (i,2,3)
    (J,4,5)
    (k,6,7)
    (L,8,9)
    (m,10,11)
    -------- end final 3 -------
    ------
    final 4
    (a,0,1)
    (b,2,3)
    (c,4,5)
    (d,6,7)
    (F,8,9)
    (h,0,1)
    (i,2,3)
    (J,4,5)
    (k,6,7)
    (L,8,9)
    -------- end final 4 -------

    and this if final2 is first:

    ------
    final 2
    (h,0,1)
    (i,2,3)
    (J,4,5)
    (k,6,7)
    (L,8,9)
    (m,10,11)
    -------- end final 2 -------
    ------
    final 1
    (a,0,1)
    (b,2,3)
    (c,4,5)
    (d,6,7)
    (f,8,9)
    (g,10,11)
    -------- end final 1 -------
    ------
    final 3
    (h,0,1)
    (i,2,3)
    (J,4,5)
    (k,6,7)
    (L,8,9)
    (m,10,11)
    (a,0,1)
    (c,4,5)
    (F,8,9)
    (g,10,11)
    -------- end final 3 -------
    ------
    final 4
    (h,0,1)
    (i,2,3)
    (J,4,5)
    (k,6,7)
    (L,8,9)
    (a,0,1)
    (b,2,3)
    (c,4,5)
    (d,6,7)
    (F,8,9)
    -------- end final 4 -------




    import java.io.IOException;
    import java.io.StringReader;
    import junit.framework.TestCase;
    import org.apache.lucene.analysis.*;

    public class SinkTest extends TestCase {
        public void testSink() throws Exception {
            StringReader reader1 = new StringReader("a b c d F g");
            StringReader reader2 = new StringReader("h i J k L m");

            SinkTokenizer sink1 = new SinkTokenizer(null);
            SinkTokenizer sink2 = new SinkTokenizer(null);

            TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
            TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);

            TokenStream final1 = new LowerCaseFilter(source1);
            TokenStream final2 = source2;
            String[] stops1 = {"b", "d"};
            TokenStream final3 = new StopFilter(sink1, stops1);
            String[] stops2 = {"m", "g"};
            TokenStream final4 = new StopFilter(sink2, stops2);

            printTokens(final1, "final 1");
            printTokens(final2, "final 2");
            printTokens(final3, "final 3");
            printTokens(final4, "final 4");
        }

        private void printTokens(TokenStream input, String label) throws IOException {
            Token next = new Token();
            System.out.println("------");
            System.out.println(label);
            while ((next = input.next(next)) != null) {
                System.out.println(next);
            }
            System.out.println("-------- end " + label + " -------");
        }
    }
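A stripped-down model may make the ordering concrete. Nothing below is Lucene's API (TeeSinkSketch, ListSource, and this TeeFilter are made up for illustration); it only mimics the pattern: a tee copies every token it passes through into a shared sink, so the sink's contents come out in exactly the order the sources were consumed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of the Tee/Sink pattern (not Lucene's classes).
public class TeeSinkSketch {
    interface TokenStream { String next(); }  // returns null when exhausted

    // A fixed list of tokens, standing in for WhitespaceTokenizer(reader).
    static class ListSource implements TokenStream {
        private final Iterator<String> it;
        ListSource(String... tokens) { this.it = List.of(tokens).iterator(); }
        public String next() { return it.hasNext() ? it.next() : null; }
    }

    // Passes tokens through unchanged, copying each one into the sink.
    static class TeeFilter implements TokenStream {
        private final TokenStream input;
        private final List<String> sink;
        TeeFilter(TokenStream input, List<String> sink) {
            this.input = input;
            this.sink = sink;
        }
        public String next() {
            String t = input.next();
            if (t != null) sink.add(t);  // side effect: buffer into the sink
            return t;
        }
    }

    static void drain(TokenStream ts) { while (ts.next() != null) { } }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        TokenStream source1 = new TeeFilter(new ListSource("a", "b"), sink);
        TokenStream source2 = new TeeFilter(new ListSource("h", "i"), sink);

        drain(source1);  // consume source1 completely first...
        drain(source2);  // ...then source2
        System.out.println(sink);  // prints [a, b, h, i]
    }
}
```

Draining source2 first would leave [h, i, a, b] in the sink instead: nothing is mixed randomly, the interleaving is purely the order in which the consumers pull tokens.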


  • Teruhiko Kurosaka at Aug 25, 2008 at 11:30 pm
    Thank you, Grant and (Koji) Sekiguchi-san.

    but I don't
    understand how the input from reader1 and reader2 is mixed together.
    Will sink1 first return the reader1 text, then the reader2 text?
    It depends on the order the fields are added. If source1 is
    used first, then reader1 will be first.
    This puzzles me. Is this really useful if how SinkTokenizer
    and TeeTokenFilter behave depends on the order in which they are read?
    I've read the source code of these tokenizers, but that
    didn't answer my question.

    This is an excerpt from Sekiguchi-san's code sample:

    Analyzer analyzer = new Analyzer() {

    public TokenStream tokenStream(String field, Reader in) {
    return new TeeTokenFilter(
        new TeeTokenFilter( new SenTokenizer( in, SEN_CONF ), sinkPerson ),
        sinkOrg );
    }
    };

    TokenFilter exPerson = new EntityExtractor( sinkPerson, T_PERSON );
    TokenFilter exOrg = new EntityExtractor( sinkOrg, T_ORG );
    IndexWriter writer = new IndexWriter( INDEX, analyzer, true );
    Document doc = new Document();
    doc.add( new Field( F_BODY, CONTENT, Store.YES, Index.TOKENIZED ) );
    doc.add( new Field( F_PERSON, exPerson ) );
    doc.add( new Field( F_ORG, exOrg ) );
    writer.addDocument( doc );

    It seems that the code works as expected only if the token stream from
    the analyzer on CONTENT is read completely, then the token stream from
    sinkPerson is read completely, followed by that from sinkOrg.

    Does Lucene's core guarantee that a field's token stream is read completely
    before the next field's token stream is read, in the order the Fields are add()'ed?

    - Kuro

  • Grant Ingersoll at Aug 26, 2008 at 1:15 pm

    On Aug 25, 2008, at 7:29 PM, Teruhiko Kurosaka wrote:

    Thank you, Grant and (Koji) Sekiguchi-san.

    but I don't
    understand how the input from reader1 and reader2 is mixed together.
    Will sink1 first return the reader1 text, then the reader2 text?
    It depends on the order the fields are added. If source1 is
    used first, then reader1 will be first.
    This puzzles me. Is this really useful if how SinkTokenizer
    and TeeTokenFilter behave depends on the order in which they are read?
    Fields in a Document are added as a List, so the Field ordering is
    always the same.
    I've read the source code of these Tokenizers but that
    didn't solve my question.

    This is an excerpt from Sekiguchi-san's code sample:

    Analyzer analyzer = new Analyzer() {

    public TokenStream tokenStream(String field, Reader in) {
    return new TeeTokenFilter(
        new TeeTokenFilter( new SenTokenizer( in, SEN_CONF ), sinkPerson ),
        sinkOrg );
    }
    };

    TokenFilter exPerson = new EntityExtractor( sinkPerson, T_PERSON );
    TokenFilter exOrg = new EntityExtractor( sinkOrg, T_ORG );
    IndexWriter writer = new IndexWriter( INDEX, analyzer, true );
    Document doc = new Document();
    doc.add( new Field( F_BODY, CONTENT, Store.YES, Index.TOKENIZED ) );
    doc.add( new Field( F_PERSON, exPerson ) );
    doc.add( new Field( F_ORG, exOrg ) );
    writer.addDocument( doc );

    It seems that the code works as expected only if the token stream from
    the analyzer on CONTENT is read completely, then the token stream from
    sinkPerson is read completely, followed by that from sinkOrg.

    Does Lucene's core guarantee that a field's token stream is read
    completely before the next field's token stream is read, in the order
    the Fields are add()'ed?
    Yes, it processes all of one Field first, then the next one. If it
    doesn't, then we have a bug, IMO, and we will have to have a different
    approach for the Tee/Sink.
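That guarantee can also be sketched as a toy model. Nothing below is Lucene's indexing code (indexDocument, the field names, and the capitalized-token-means-person rule are all made up for illustration); it only shows the consequence: because each field is drained completely, in add order, a sink-backed field replays a fully populated buffer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Toy model of sequential field consumption (not Lucene's indexing code).
public class FieldOrderSketch {
    // Drain every field completely, one after another, in add order.
    static List<List<String>> indexDocument(Map<String, Supplier<List<String>>> fields) {
        List<List<String>> consumed = new ArrayList<>();
        for (Supplier<List<String>> field : fields.values()) {
            consumed.add(field.get());
        }
        return consumed;
    }

    public static void main(String[] args) {
        List<String> sinkPerson = new ArrayList<>();  // shared buffer, like SinkTokenizer

        Map<String, Supplier<List<String>>> doc = new LinkedHashMap<>();  // preserves add order
        doc.put("F_BODY", () -> {
            List<String> tokens = new ArrayList<>();
            for (String tok : List.of("Alice", "visited", "Bob")) {
                tokens.add(tok);
                // Stand-in for the tee + entity detector: capitalized tokens go to the sink.
                if (Character.isUpperCase(tok.charAt(0))) sinkPerson.add(tok);
            }
            return tokens;
        });
        doc.put("F_PERSON", () -> sinkPerson);  // replays whatever the sink has buffered

        // F_BODY is drained before F_PERSON is ever read, so the sink is complete.
        System.out.println(indexDocument(doc));  // prints [[Alice, visited, Bob], [Alice, Bob]]
    }
}
```

If the consumer interleaved fields instead, F_PERSON could be read while the sink was still half full, which is exactly the failure mode the question above is probing.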

    -Grant

  • Koji Sekiguchi at Aug 23, 2008 at 8:48 am
    Hi Kurosaka-san,

    I wrote an article on my blog several months ago about SinkTokenizer
    and TeeTokenFilter.

    See:
    http://lucene.jugem.jp/?eid=172

    Sorry, but it's all written in Japanese...

    Koji


    Teruhiko Kurosaka wrote:
    Hello,
    I'm interested in knowing how these tokenizers work together.
    The API doc for TeeTokenFilter
    http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/TeeTokenFilter.html

    has this sample code:
    SinkTokenizer sink1 = new SinkTokenizer(null);
    SinkTokenizer sink2 = new SinkTokenizer(null);

    TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
    TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);

    TokenStream final3 = new EntityDetect(sink1);
    TokenStream final4 = new URLDetect(sink2);

    with an explanation that reads "sink1 and sink2 will both get tokens from both reader1 and reader2 after whitespace tokenizer",
    but I don't understand how the input from reader1 and reader2 is mixed together.
    Will sink1 first return the reader1 text, then the reader2 text?
    Or are they mixed randomly?

    -Kuro



Discussion Overview
group: java-user
categories: lucene
posted: Aug 22, '08 at 7:48p
active: Aug 26, '08 at 1:15p
posts: 5
users: 3
website: lucene.apache.org
