Hi,

I was wondering if it's possible to get the token offset based on the
position in the original text.

My problem is I'm working on my own "Snippet Generator" and I'm giving a
token index (call it t) as input and need to make a snippet of the original
text. I want the Snippet to be some number of tokens (call it n tokens).
But to make the Snippet easier to read I want to see if it's close to the
end of a paragraph (if it is I'll make more of the Snippet before the token
than usual). So I'm scanning the original text forward some number of
characters looking for a new line or tab. If I find it I'd like to get the
token before that new line (and its offset, call it y). Once I have the
offset I know I have y - t tokens after my token, and finally I know I should
put n-(y-t) tokens before my token and can successfully make my Snippet.

Thanks in advance!

--JP
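
The window arithmetic described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SnippetWindow` and `around` are hypothetical names, token indices are taken to be 0-based, and y is assumed to be at or after t. Following the post's arithmetic literally, the matched token itself is counted on top of the n context tokens.

```java
final class SnippetWindow {
    /**
     * t = index of the matched token,
     * y = index of the last token before the paragraph break (assumed y >= t),
     * n = number of context tokens wanted around the match.
     * Returns {firstTokenIndex, lastTokenIndex}, both inclusive.
     */
    static int[] around(int t, int y, int n) {
        int after = Math.min(y - t, n);       // tokens available after the match, capped at n
        int before = n - after;               // the rest of the budget goes before the match
        int first = Math.max(0, t - before);  // assumption: indices are 0-based, so clamp at 0
        return new int[] { first, t + after };
    }

    public static void main(String[] args) {
        int[] w = around(10, 12, 6);
        System.out.println(w[0] + ".." + w[1]); // prints 6..12
    }
}
```

With t = 10, y = 12 and n = 6 there are 2 tokens after the match, so the remaining 4 go before it.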


  • John Paul Sondag at Jul 5, 2007 at 9:28 pm
    Hi,

    I never got a response to this and thought maybe I was too wordy.

    I'm wondering if there's a way where given a position in the original text
    you can retrieve the token index that is nearest to that position using the
    StandardToken/StandardTokenizer classes?



    --JP
  • Chris Hostetter at Jul 6, 2007 at 5:24 pm
    : I never got a response to this and thought maybe I was too wordy.
    :
    : I'm wondering if there's a way where given a position in the original text
    : you can retrieve the token index that is nearest to that position using the
    : StandardToken/StandardTokenizer classes?

    i may not be understanding the question, but wouldn't that just be...

    TokenStream s = getTokenStreamForOriginalText();
    Token t = null;
    // advance to the token at the index you know; next() returns null at end of stream
    for (int i = 0; i < thePositionYouKnow; i++) {
        t = s.next();
    }
    return t;

    ?


    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • John Paul Sondag at Jul 6, 2007 at 5:51 pm
    I thought that went to the "index" of the token. I may not understand it
    completely, but this is how I currently view the TokenStream.

    For example, if my text was the following:

    This is an Example

    "This" has index 1, "is" has index 2, "an" has index 3, and "Example" has
    index 4. What I have is the actual "character position" in the original
    text. "This" is characters 0-3, "is" is characters 5-6, "an" is characters
    8-9, and "Example" is characters 11-17. I know that given Token 4 (Example)
    I can get the startOffset and endOffset (11 and 17). What I'm wondering is,
    given a character offset, can I get a token index? (I.e., given character
    offset 12, it would return 4, because "Example" is the token closest to
    character 12.)

    --JP
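
The index and character-position table from this example can be reproduced with a small sketch. It does not use Lucene: `TokenSpans`, `Span`, and the whitespace tokenizer are hypothetical stand-ins, so a real analyzer such as StandardTokenizer may split text differently. The sketch stores the end offset as one past the last character, which is how Lucene's `Token.endOffset()` is documented, so the real API would be expected to report 18 rather than 17 for "Example".

```java
import java.util.ArrayList;
import java.util.List;

final class TokenSpans {
    /** Hypothetical stand-in for a token: its text plus character offsets into the source. */
    static final class Span {
        final String text;
        final int start; // offset of the first character
        final int end;   // offset one past the last character

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and records where each token sits in the original text. */
    static List<Span> tokenize(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // skip whitespace between tokens
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            // consume one run of non-whitespace characters
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) spans.add(new Span(text.substring(start, i), start, i));
        }
        return spans;
    }

    public static void main(String[] args) {
        List<Span> spans = tokenize("This is an Example");
        for (int k = 0; k < spans.size(); k++) {
            Span s = spans.get(k);
            // 1-based index and inclusive character range, as written in the post
            System.out.println((k + 1) + " " + s.text + " " + s.start + "-" + (s.end - 1));
        }
    }
}
```

The last line printed is `4 Example 11-17`, matching the numbering used in the message.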
  • Chris Hostetter at Jul 6, 2007 at 6:02 pm
    : "This" has index 1, "is" has index 2, "an" has index 3, and "Example" has index 4.
    : What I have is the actual "character position" in the original text. "This"

    in that case, you'll have to do a while loop over next() calls and check
    the startOffset (or endOffset) of each token, counting as you go, until
    you find the one you are looking for.



    -Hoss
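
A minimal sketch of that scan, assuming the start offsets have already been collected in order (the `OffsetLookup` class and `tokenIndexAt` method are hypothetical names, not part of Lucene). With the Lucene 2.x API the loop would instead call `next()` on the TokenStream until it returns null and read `startOffset()` from each Token.

```java
final class OffsetLookup {
    /**
     * Returns the 1-based index of the last token that starts at or before
     * charOffset, or 0 if the offset comes before every token.
     * starts must hold the token start offsets in ascending order.
     */
    static int tokenIndexAt(int[] starts, int charOffset) {
        int index = 0;
        for (int start : starts) {
            if (start > charOffset) {
                break; // this token begins after the position we care about
            }
            index++;   // count every token seen so far
        }
        return index;
    }

    public static void main(String[] args) {
        // start offsets of "This", "is", "an", "Example" in "This is an Example"
        int[] starts = { 0, 5, 8, 11 };
        System.out.println(tokenIndexAt(starts, 12)); // prints 4
    }
}
```

For the snippet use case, passing the character position of the newline gives the index of the token just before it. The scan is linear in the number of tokens, which matches the loop suggested above.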



Discussion Overview
group: java-user @
categories: lucene
posted: Jul 3, '07 at 5:50p
active: Jul 6, '07 at 6:02p
posts: 5
users: 2
website: lucene.apache.org
