FAQ
Hey all,

I am using the StandardAnalyzer with my own list of stop words (which is
more comprehensive than the default list), and my expectation was that this
would omit these stop words from the index when data is indexed using this
analyzer. However, I am seeing stop words in the term vector for documents
indexed with this analyzer.

Is this expected behaviour? Is there any way I can force these stop words
to be omitted from the index? Having them in the index is wreaking havoc
with term vector analysis to determine document similarity.

Thanks.

Search Discussions

  • Otis Gospodnetic at Sep 2, 2006 at 4:26 pm
    They shouldn't be in the index. You must be using StandardAnalyzer incorrectly, or maybe you think you are using it, but are really using something else.

    Otis

    ----- Original Message ----
    From: Jason Polites <jason.polites@gmail.com>
    To: java-user@lucene.apache.org
    Sent: Saturday, September 2, 2006 9:05:27 AM
    Subject: Stop words in index

    Hey all,

    I am using the StandardAnalyzer with my own list of stop words (which is
    more comprehensive than the default list), and my expectation was that this
    would omit these stop words from the index when data is indexed using this
    analyzer. However, I am seeing stop words in the term vector for documents
    indexed with this analyzer.

    Is this expected behaviour? Is there any way I can force these stop words
    to be omitted from the index? Having them in the index is wreaking havoc
    with term vector analysis to determine document similarity.

    Thanks.




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jason Polites at Sep 2, 2006 at 4:56 pm
    Roger that. I'll double check my code.

    Thanks.
    On 9/3/06, Otis Gospodnetic wrote:

    They shouldn't be in the index. You must be using StandardAnalyzer
    incorrectly, or maybe you think you are using it, but are really using
    something else.

    Otis

    ----- Original Message ----
    From: Jason Polites <jason.polites@gmail.com>
    To: java-user@lucene.apache.org
    Sent: Saturday, September 2, 2006 9:05:27 AM
    Subject: Stop words in index

    Hey all,

    I am using the StandardAnalyzer with my own list of stop words (which is
    more comprehensive than the default list), and my expectation was that
    this
    would omit these stop words from the index when data is indexed using this
    analyzer. However, I am seeing stop words in the term vector for
    documents
    indexed with this analyzer.

    Is this expected behaviour? Is there any way I can force these stop words
    to be omitted from the index? Having them in the index is wreaking havoc
    with term vector analysis to determine document similarity.

    Thanks.




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jason Polites at Sep 4, 2006 at 6:29 am
    Hey,

    Just a quick addendum to this original issue.

    My first need was to ensure that stop words were not stored in the index
    (which your helpful suggestion led me to confim); however this has raised a
    second more scary issue.

    It seems that because stop words are excluded from the index, quoted
    sequences of tokens which include stop words will yield no results.

    That is,

    In the default StandardAnalyzer, the stop word list contains the word "on".
    If I have a document which contains the phrase "Disney on Ice", the index
    will show only "Disney" and "Ice", but not "on".

    This is fine, and if the user searches for:

    Disney on Ice

    They will get a match. But, it seems that a search for:

    "Disney on Ice"

    With the quotations indicating the desire for an "exact match", the absence
    of stop words in the index means this yields zero results.

    Am I going crazy here?
    On 9/3/06, Jason Polites wrote:

    Roger that. I'll double check my code.

    Thanks.

    On 9/3/06, Otis Gospodnetic wrote:

    They shouldn't be in the index. You must be using StandardAnalyzer
    incorrectly, or maybe you think you are using it, but are really using
    something else.

    Otis

    ----- Original Message ----
    From: Jason Polites <jason.polites@gmail.com>
    To: java-user@lucene.apache.org
    Sent: Saturday, September 2, 2006 9:05:27 AM
    Subject: Stop words in index

    Hey all,

    I am using the StandardAnalyzer with my own list of stop words (which is
    more comprehensive than the default list), and my expectation was that
    this
    would omit these stop words from the index when data is indexed using
    this
    analyzer. However, I am seeing stop words in the term vector for
    documents
    indexed with this analyzer.

    Is this expected behaviour? Is there any way I can force these stop
    words
    to be omitted from the index? Having them in the index is wreaking
    havoc
    with term vector analysis to determine document similarity.

    Thanks.




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Hostetter at Sep 4, 2006 at 6:53 am
    : In the default StandardAnalyzer, the stop word list contains the word "on".
    : If I have a document which contains the phrase "Disney on Ice", the index
    : will show only "Disney" and "Ice", but not "on".

    : "Disney on Ice"
    :
    : With the quotations indicating the desire for an "exact match", the absence
    : of stop words in the index means this yields zero results.
    :
    : Am I going crazy here?

    you should still get a match *if* you use the same analyzer (with the same
    set of stop words) in your query parser as you did when you indexed your
    documents. if so, then "Disney on Ice" should become a phrase search for
    the two successive tokens "disney" and "ice" which is what should have
    cone in your doc when it was indexed using that analyzer.

    :
    : On 9/3/06, Jason Polites wrote:
    : >
    : > Roger that. I'll double check my code.
    : >
    : > Thanks.
    : >
    : >
    : > On 9/3/06, Otis Gospodnetic wrote:
    : > >
    : > > They shouldn't be in the index. You must be using StandardAnalyzer
    : > > incorrectly, or maybe you think you are using it, but are really using
    : > > something else.
    : > >
    : > > Otis
    : > >
    : > > ----- Original Message ----
    : > > From: Jason Polites <jason.polites@gmail.com>
    : > > To: java-user@lucene.apache.org
    : > > Sent: Saturday, September 2, 2006 9:05:27 AM
    : > > Subject: Stop words in index
    : > >
    : > > Hey all,
    : > >
    : > > I am using the StandardAnalyzer with my own list of stop words (which is
    : > > more comprehensive than the default list), and my expectation was that
    : > > this
    : > > would omit these stop words from the index when data is indexed using
    : > > this
    : > > analyzer. However, I am seeing stop words in the term vector for
    : > > documents
    : > > indexed with this analyzer.
    : > >
    : > > Is this expected behaviour? Is there any way I can force these stop
    : > > words
    : > > to be omitted from the index? Having them in the index is wreaking
    : > > havoc
    : > > with term vector analysis to determine document similarity.
    : > >
    : > > Thanks.
    : > >
    : > >
    : > >
    : > >
    : > > ---------------------------------------------------------------------
    : > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    : > > For additional commands, e-mail: java-user-help@lucene.apache.org
    : > >
    : > >
    : >
    :



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jason Polites at Sep 4, 2006 at 7:17 am
    right ok. I may have been tinkering with the analyzer between searches.

    Thanks
    On 9/4/06, Chris Hostetter wrote:

    : In the default StandardAnalyzer, the stop word list contains the word
    "on".
    : If I have a document which contains the phrase "Disney on Ice", the
    index
    : will show only "Disney" and "Ice", but not "on".

    : "Disney on Ice"
    :
    : With the quotations indicating the desire for an "exact match", the
    absence
    : of stop words in the index means this yields zero results.
    :
    : Am I going crazy here?

    you should still get a match *if* you use the same analyzer (with the same
    set of stop words) in your query parser as you did when you indexed your
    documents. if so, then "Disney on Ice" should become a phrase search for
    the two successive tokens "disney" and "ice" which is what should have
    cone in your doc when it was indexed using that analyzer.

    :
    : On 9/3/06, Jason Polites wrote:
    : >
    : > Roger that. I'll double check my code.
    : >
    : > Thanks.
    : >
    : >
    : > On 9/3/06, Otis Gospodnetic wrote:
    : > >
    : > > They shouldn't be in the index. You must be using StandardAnalyzer
    : > > incorrectly, or maybe you think you are using it, but are really
    using
    : > > something else.
    : > >
    : > > Otis
    : > >
    : > > ----- Original Message ----
    : > > From: Jason Polites <jason.polites@gmail.com>
    : > > To: java-user@lucene.apache.org
    : > > Sent: Saturday, September 2, 2006 9:05:27 AM
    : > > Subject: Stop words in index
    : > >
    : > > Hey all,
    : > >
    : > > I am using the StandardAnalyzer with my own list of stop words
    (which is
    : > > more comprehensive than the default list), and my expectation was
    that
    : > > this
    : > > would omit these stop words from the index when data is indexed
    using
    : > > this
    : > > analyzer. However, I am seeing stop words in the term vector for
    : > > documents
    : > > indexed with this analyzer.
    : > >
    : > > Is this expected behaviour? Is there any way I can force these stop
    : > > words
    : > > to be omitted from the index? Having them in the index is wreaking
    : > > havoc
    : > > with term vector analysis to determine document similarity.
    : > >
    : > > Thanks.
    : > >
    : > >
    : > >
    : > >
    : > >
    ---------------------------------------------------------------------
    : > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    : > > For additional commands, e-mail: java-user-help@lucene.apache.org
    : > >
    : > >
    : >
    :



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedSep 2, '06 at 1:05p
activeSep 4, '06 at 7:17a
posts6
users3
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase