Hey everyone,



I'm trying to do the following:



1. Run a string through a simple tokenizer (e.g. WhitespaceTokenizer)

2. Run the resulting tokens through both my current tokenizer and
StandardTokenizer, to isolate the tokens that differ between the two.
(Background: I want to do this so that I can maintain my current
tokenization scheme while also allowing matches against the
StandardTokenizer scheme.)

~~

Example:



Original string: "Bob v2document"



1. "Bob v2document" => ["Bob", "v2document"]

2. Run each of these through my current tokenizer and StandardTokenizer. For
"v2document", let's say that my tokenizer spits out ["v", "2", "document"]
and StandardTokenizer spits out ["v2document"]. I would take the latter
token and append it, and get something like ["Bob", "v", "2", "document",
GAP_SO_PHRASE_QUERIES_SKIP, "v2document"].
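
To make the comparison concrete, here's roughly the per-token helper I have
in mind (just a sketch: "retokenize" is my own name for it, and the
setReader/reset reuse API differs across Lucene versions; some releases use
Tokenizer.reset(Reader) instead of setReader):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Re-tokenize a single whitespace token and collect the resulting terms.
// Works with any Tokenizer (my current one or StandardTokenizer).
static List<String> retokenize(Tokenizer tok, String token) throws IOException {
    List<String> terms = new ArrayList<String>();
    tok.setReader(new StringReader(token)); // older Lucene: tok.reset(reader)
    tok.reset();
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    while (tok.incrementToken()) {
        terms.add(term.toString());
    }
    tok.end();
    tok.close(); // releases the reader; the Tokenizer itself stays reusable
    return terms;
}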

~~

I'll be doing this at scale whenever I index a document, and I'm concerned
about performance. For every single original token, I'll need to build a
StringReader and pass it through both Tokenizers.
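
My plan to keep this cheap is to construct each Tokenizer once per field and
reuse it across all tokens, so the only per-token allocation is the
StringReader itself (which, as far as I know, is just a thin wrapper around
the string). Roughly, using the retokenize() helper sketched above
(MyCurrentTokenizer is a placeholder for my tokenizer, and constructor
arguments vary by Lucene version):

// whitespaceTokens: the List<String> output of step 1 above
Tokenizer std = new StandardTokenizer();   // some versions need Version/Reader args
Tokenizer mine = new MyCurrentTokenizer(); // placeholder for my current tokenizer
for (String token : whitespaceTokens) {
    List<String> myTerms  = retokenize(mine, token);
    List<String> stdTerms = retokenize(std, token);
    if (!myTerms.equals(stdTerms)) {
        // append stdTerms after a position gap, tied back to this token
    }
}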


I know that in general, what I'm doing seems more appropriate for
TokenFilters, but I see two issues:


- It looks like there's no TokenFilter version of StandardTokenizer that
applies the rules to each token. Maybe I'm overlooking something?

- If I pass ["Bob", "v2document"] through two TokenFilters separately, I'll
get something like ["Bob", "v", "2", "document"] / ["Bob", "v2document"],
and then I won't know which output tokens belong to which original token.
(I sketch a possible workaround below.)
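
The closest workaround I've come up with for both points is a custom
TokenFilter that embeds a Tokenizer: for each incoming token, it re-runs the
term text through the embedded StandardTokenizer and emits any differing
output stacked at the same position (positionIncrement = 0, synonym-style)
rather than appended after a gap, which keeps each variant tied to its
original token. A rough sketch (StandardVariantFilter is just my working
name; offset handling and per-version constructor details are glossed over):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class StandardVariantFilter extends TokenFilter {
    private final Tokenizer std;          // e.g. a StandardTokenizer instance
    private final CharTermAttribute stdTerm;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
        addAttribute(PositionIncrementAttribute.class);
    private final Deque<String> pending = new ArrayDeque<String>();

    public StandardVariantFilter(TokenStream input, Tokenizer std) {
        super(input);
        this.std = std;
        this.stdTerm = std.addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit a buffered variant stacked on its original token's position.
            termAtt.setEmpty().append(pending.poll());
            posIncAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        // Re-tokenize the current term and queue any output that differs.
        String original = termAtt.toString();
        std.setReader(new StringReader(original)); // older Lucene: std.reset(reader)
        std.reset();
        while (std.incrementToken()) {
            String t = stdTerm.toString();
            if (!t.equals(original)) {
                pending.add(t);
            }
        }
        std.end();
        std.close();
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}

(Stacking at the same position is a different placement choice than the
gap-append in my example above; either could work for my purposes.)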



Any suggestions for doing this in a performant way?


Thanks!

Tavi
