Hey everyone,

I'm trying to do the following:

1. Run a string through a simple tokenizer (i.e. WhitespaceTokenizer)

2. Run the resultant tokens through my current tokenizer as well as
StandardTokenizer, in order to isolate the tokens that are different between
them. (Background: I want to do this so that I can maintain my current
tokenization scheme while allowing for matches against the StandardTokenizer



Original string: "Bob v2document"

1. "Bob v2document" => ["Bob", "v2document"]

2. Run each of these through my current tokenizer and StandardTokenizer. For
"v2document", let's say that my tokenizer spits out ["v", "2", "document"]
and StandardTokenizer spits out ["v2document"]. I would take the latter
token and append it, and get something like ["Bob", "v", "2", "document",


I'll be doing this at scale when I index any document, and I'm concerned
about performance. For every single original token, I'll need to a build a
StringReader and pass it through both Tokenizers.

I know that in general, what I'm doing seems more appropriate for
TokenFilter's, but I see two issues:

- It looks like there's no TokenFilter version of StandardTokenizer that
applies the rules to each token. Maybe I'm overlooking something?

- If I pass ["Bob", "v2document"] through two TokenFilters separately, I'll
get something like: ["Bob", "v", "2", "document"] / ["Bob", "v2document"].
And then I won't know which tokens belong to which original token.

Any suggestions for doing this in a performant way?



Search Discussions

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
postedFeb 10, '11 at 9:02p
activeFeb 10, '11 at 9:02p

1 user in discussion

Tavi Nathanson: 1 post



site design / logo © 2022 Grokbase