Hi,

I need to modify the StandardAnalyzer so that it will tokenize zip codes
that look like this:

92626-2646

I think the part I need to modify is in here - specifically:

<HAS_DIGIT> <P> <ALPHANUM>

// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
     | <HAS_DIGIT> <P> <ALPHANUM>
     | <HAS_DIGIT> <M>
     | <HAS_DIGIT> (<P> <HAS_DIGIT>)+ <M>
     | <LETTER> (<P> <LETTER>)+
     | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
     | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
      )
>



Is there a way to keep that line so that the StandardAnalyzer works as
is - but tokenize anything that looks like

(<HAS_DIGIT> <P>) | (<HAS_DIGIT> <P> <HAS_DIGIT>) or even better:

(<DIGIT><DIGIT><DIGIT><DIGIT><DIGIT><P>) |
(<DIGIT><DIGIT><DIGIT><DIGIT><DIGIT><P><DIGIT><DIGIT><DIGIT><DIGIT>)

I have zip codes that look like 92626, 92626-, and 92626-2646.

I've tried adding both of those lines to the "SKIP" section - but to no
avail.


  • Mark Miller at Jan 12, 2007 at 1:38 am
    I would try adding this (or your regex):

    <ZIPCODE: <DIGIT><DIGIT><DIGIT><DIGIT><DIGIT> (("-" <DIGIT><DIGIT><DIGIT><DIGIT>)|(<DIGIT><DIGIT><DIGIT><DIGIT>)) >

    between the EMAIL and HOST lines or something,


    And change this:

    org.apache.lucene.analysis.Token next() throws IOException :
    {
      Token token = null;
    }
    {
      ( token = <ALPHANUM> |
        token = <APOSTROPHE> |
        token = <ACRONYM> |
        token = <COMPANY> |
        token = <EMAIL> |
        token = <HOST> |
        token = <NUM> |
        token = <CJ> |
        token = <EOF>
      )
      {
        if (token.kind == EOF) {
          return null;
        } else {
          return
            new org.apache.lucene.analysis.Token(token.image,
                                                 token.beginColumn, token.endColumn,
                                                 tokenImage[token.kind]);
        }
      }
    }

    TO:

    org.apache.lucene.analysis.Token next() throws IOException :
    {
      Token token = null;
    }
    {
      ( token = <ALPHANUM> |
        token = <APOSTROPHE> |
        token = <ACRONYM> |
        token = <ZIPCODE> |
        token = <COMPANY> |
        token = <EMAIL> |
        token = <HOST> |
        token = <NUM> |
        token = <CJ> |
        token = <EOF>
      )
      {
        if (token.kind == EOF) {
          return null;
        } else {
          return
            new org.apache.lucene.analysis.Token(token.image,
                                                 token.beginColumn, token.endColumn,
                                                 tokenImage[token.kind]);
        }
      }
    }
    - Mark
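
Before regenerating the tokenizer, it can help to check which of the zip code forms from the question a candidate pattern covers. The sketch below is only a java.util.regex approximation of the ZIPCODE token suggested above: it does not model how the JavaCC lexer picks the longest match among competing tokens such as NUM. The class name ZipPatternCheck and the ALL_THREE_FORMS variant are assumptions made for illustration, not part of Lucene.

```java
import java.util.regex.Pattern;

// Illustrative only: a java.util.regex stand-in for the ZIPCODE token,
// not part of Lucene or of the generated tokenizer.
class ZipPatternCheck {
    // Transcription of the suggested token: five digits followed by
    // either "-" plus four digits, or four digits with no separator.
    static final Pattern AS_SUGGESTED =
        Pattern.compile("\\d{5}(-\\d{4}|\\d{4})");

    // Assumed variant covering the three forms listed in the question:
    // 92626, 92626- and 92626-2646.
    static final Pattern ALL_THREE_FORMS =
        Pattern.compile("\\d{5}(-(\\d{4})?)?");

    // True only when the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"92626", "92626-", "92626-2646", "926262646"};
        for (String s : samples) {
            System.out.println(s
                + " suggested=" + matches(AS_SUGGESTED, s)
                + " variant=" + matches(ALL_THREE_FORMS, s));
        }
    }
}
```

Read this way, the token as written requires the four-digit extension, so a bare 92626 or 92626- would fall through to the other token rules.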

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Jan 12, 2007 at 2:11 am
    Would it be simpler just to modify the input with a regex rather than risk
    messing with StandardAnalyzer? Or wouldn't that do what you need?
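
A minimal sketch of the regex idea, assuming the text can be rewritten before it reaches the analyzer. The class ZipPreprocessor and the method splitZipCodes are hypothetical names, and the pattern is one guess at what counts as a zip code in this data. Only the hyphen in a five-digit (optionally plus four-digit) group is replaced by a space, so hyphenated words such as E-A-R are left alone and the string keeps its length.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class ZipPreprocessor {
    // Five digits, a hyphen, then either four digits or nothing.
    // The lookarounds stop the pattern from firing inside longer
    // digit-and-hyphen runs such as phone or serial numbers.
    private static final Pattern ZIP =
        Pattern.compile("(?<![\\d-])(\\d{5})-(\\d{4}|)(?![\\d-])");

    // Replace the hyphen of each zip-like match with a single space,
    // so a downstream tokenizer sees two separate digit runs.
    static String splitZipCodes(String text) {
        return ZIP.matcher(text).replaceAll("$1 $2");
    }

    public static void main(String[] args) {
        String input = "All-In-One is located in 92226-4446 and has an E-A-R";
        System.out.println(splitZipCodes(input));
    }
}
```

The rewritten string would then be handed to the unmodified StandardAnalyzer. Whether rewriting the indexed text is acceptable depends on requirements the thread does not spell out.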
  • Van Nguyen at Jan 12, 2007 at 11:54 pm
    It won't do what I need. I may have something like:

    "All-In-One is located in 92226-4446 and has an E-A-R"

    I want it to be tokenized as follows:

    all
    one
    located
    92226
    4446
    E-A-R

    Right now... it is tokenizing it as this:

    all
    one
    located
    92226-4446
    E-A-R



  • Mark Miller at Jan 13, 2007 at 12:16 am

    That's the type of information you give when you ask the question the
    first time (not to be a pompous ass or anything <g>). The problem is
    that your zip code is matched by NUM:

    <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
         | <HAS_DIGIT> <P> <ALPHANUM>
         | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
         | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
         | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
         | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
          )
    >

    You could try removing the first two OR options. Other than that, it
    gets tricky. And if you remove them, then other things they would
    normally match (other than zip codes) will no longer be matched.

    - Mark

  • Mark Miller at Jan 13, 2007 at 1:01 am
    Figures...I don't even think removing those pieces from OR will
    work...that will just skip both pieces because they will appear as pure
    numbers. What you want is a bit tricky. I will think about it if someone
    else doesn't chime in...it is difficult to recognize one token and then
    return it as two, though not impossible of course...

    - Mark
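
One way to picture "recognize one token and then return it as two" is a post-processing pass over the tokens the analyzer already produced. The sketch below works on plain strings so that it stays independent of any Lucene version. In Lucene itself the same logic would more likely sit in a custom TokenFilter, which is not shown here. The class ZipTokenSplitter and its pattern are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing step; not a Lucene class.
class ZipTokenSplitter {
    // A whole token made of five digits, a hyphen, and an optional
    // four-digit extension (covers 92626- and 92626-2646).
    private static final Pattern ZIP = Pattern.compile("(\\d{5})-(\\d{4})?");

    // Return a new list in which every zip-like token is replaced by
    // its digit groups; all other tokens pass through unchanged.
    static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            Matcher m = ZIP.matcher(token);
            if (m.matches()) {
                out.add(m.group(1));
                if (m.group(2) != null) {
                    out.add(m.group(2)); // extension present
                }
            } else {
                out.add(token);
            }
        }
        return out;
    }
}
```

Applied to the token stream reported earlier in the thread (all, one, located, 92226-4446, E-A-R), this yields the six tokens asked for, with E-A-R left intact because it does not match the zip pattern.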


Discussion Overview
group: java-user
categories: lucene
posted: Jan 12, '07 at 1:13a
active: Jan 13, '07 at 1:01a
posts: 6
users: 3
website: lucene.apache.org
