Does anyone know how much stop words are supposed to affect the index size?

I did an experiment of building an index once with, and once without,
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.

Thus, the index grows by 45% in this case, which I found surprising, as I
expected it to not grow as much. I haven't dug into the details of the
Lucene file formats but thought compression (field/term vector/sparse
lists/vints) would negate the effect of stopwords to a large extent.

Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36

-- Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Chris Hostetter at Jan 14, 2005 at 12:44 am
    : The corpus is the English Wikipedia, and I indexed the title and body of
    : the articles. I used a list of 525 stop words.
    :
    : With stopwords removed the index is 227MB.
    : With stopwords kept the index is 331MB.

    That doesn't seem horribly surprising.

    Consider that for every Term in the index, Lucene keeps track of the
    list of <docId, freq> pairs for every document that contains that term.

    Assume that something has to be in at least 25% of the docs before you
    decide it's worth making it a stop word. Your URL indicates you are
    dealing with 400k docs, which means that for each stop word, the space
    needed to store the int pairs for <docId, freq> is...

    (4B + 4B) * 100,000 =~ 780KB (per stop word Term, minimum)

    ...not counting any indexing structures that may be used internally to
    improve the lookup of a Term. Assuming some of those words are in more or
    less than 25% of your documents, that could easily account for a
    difference of 100MB.
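
The estimate above can be checked with a few lines of Java. This is only a sketch of the arithmetic in the post, not Lucene code: it takes the uncompressed 4 bytes per int at face value, and the class and method names are made up for illustration. Lucene's on-disk postings are compressed, so real per-term sizes would likely be smaller.

```java
// Back-of-the-envelope check of the posting-list estimate; not Lucene code.
final class PostingEstimate {

    // Uncompressed bytes for one term's <docId, freq> pairs,
    // assuming two 4-byte ints per document that contains the term.
    static long bytesPerTerm(long totalDocs, double docFraction) {
        long docsWithTerm = Math.round(totalDocs * docFraction);
        return docsWithTerm * (4L + 4L);
    }

    // How many terms of that size are needed to cover a given size difference
    // (integer division, rounded up).
    static long termsToCover(long diffBytes, long perTermBytes) {
        return (diffBytes + perTermBytes - 1) / perTermBytes;
    }

    public static void main(String[] args) {
        long perTerm = bytesPerTerm(400_000, 0.25);   // 400k docs, term in 25% of them
        long diff = (331L - 227L) * 1024 * 1024;      // 104MB gap between the two indexes
        System.out.println(perTerm + " bytes per stop word");
        System.out.println(termsToCover(diff, perTerm) + " such terms would cover the gap");
    }
}
```

Under these assumptions, well under half of the 525 stop words would need to be that frequent to explain the gap.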

    I suspect that an interesting exercise would be to use some of the code
    I've seen tossed around on this list that lets you iterate over all Terms
    and find the most common ones to help you determine your stopword list
    programmatically. Then remove/reindex any documents that have each word as
    you add it to your stoplist (one word at a time) and watch your index
    shrink.
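
A minimal sketch of the selection step, assuming the per-term document frequencies have already been collected into a plain map (in practice they would come from iterating the index's terms; that part is not shown). The class name, method name, and the 25% threshold parameter are illustrative, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Picks stop word candidates from precomputed document frequencies.
final class StopwordCandidates {

    // Returns the terms that occur in at least minFraction of all documents,
    // most frequent first.
    static List<String> candidates(Map<String, Integer> docFreq,
                                   int totalDocs, double minFraction) {
        List<Map.Entry<String, Integer>> hits = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minFraction * totalDocs) {
                hits.add(e);
            }
        }
        // Sort by document frequency, descending.
        hits.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : hits) {
            terms.add(e.getKey());
        }
        return terms;
    }
}
```

Adding the returned terms to the stop list one at a time, reindexing after each, would show how much each word contributes to the index size.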




    -Hoss


  • Doug Cutting at Jan 14, 2005 at 6:01 pm

    David Spencer wrote:
    : Does anyone know how much stop words are supposed to affect the index size?
    :
    : I did an experiment of building an index once with, and once without,
    : stop words.
    :
    : The corpus is the English Wikipedia, and I indexed the title and body of
    : the articles. I used a list of 525 stop words.
    :
    : With stopwords removed the index is 227MB.
    : With stopwords kept the index is 331MB.

    The unstopped version is indeed bigger and slower to build, but it's
    only slower to search when folks search on stop words. One approach to
    minimizing stopwords in searches (used by, e.g. Nutch & Google) is to
    index all stop words but remove them from queries unless they're (a) in
    a phrase or (b) explicitly required with a "+". (It might be nice if
    Lucene included a query parser that had this feature.)
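
The query-side rule described above could look roughly like this. It is a sketch under assumptions: the query syntax is reduced to whitespace-separated clauses, double-quoted phrases, and a leading "+", and the class and method names are hypothetical. It is not Lucene's or Nutch's query parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Drops stop words from a query unless they are in a phrase or required with "+".
final class QueryStopFilter {

    // A clause is a quoted phrase (optionally prefixed with + or -)
    // or a run of non-whitespace characters.
    private static final Pattern CLAUSE = Pattern.compile("[+-]?\"[^\"]*\"|\\S+");

    static List<String> filter(String query, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        Matcher m = CLAUSE.matcher(query);
        while (m.find()) {
            String clause = m.group();
            boolean inPhrase = clause.contains("\"");
            boolean required = clause.startsWith("+");
            boolean isStop = stopWords.contains(clause.toLowerCase(Locale.ROOT));
            // Keep everything except bare, optional stop words.
            if (inPhrase || required || !isStop) {
                kept.add(clause);
            }
        }
        return kept;
    }
}
```

With stop words "the" and "to", the query `the "to be" +the lucene` keeps the phrase, the required `+the`, and `lucene`, and drops only the bare leading `the`.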

    Nutch also optimizes phrase searches involving a few very common stop
    words (e.g., "the", "a", "to") by indexing these as bigrams and
    converting phrases involving them to bigram phrases. So, if someone
    searches for "to be or not to be" then this turns into a search for
    "to-be be or not-to to-be" which is considerably faster since it
    involves rarer terms. But the more words you bigram the bigger the
    index gets and the slower updates get, so you probably can't afford to
    do this for your full stop list. (It might be nice if Lucene included
    support for this technique too!)
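
The phrase rewrite in the example can be reproduced with a short sketch. The pairing rule used here (form a bigram whenever a word or its successor is in the common-word set, and stop once a bigram covers the last word) is inferred from the "to be or not to be" example, not taken from Nutch's source. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Rewrites a phrase so that very common words are searched as bigrams.
final class CommonBigrams {

    static List<String> optimizePhrase(List<String> words, Set<String> common) {
        List<String> out = new ArrayList<>();
        int last = words.size() - 1;
        for (int i = 0; i <= last; i++) {
            // Pair this word with the next one if either of them is common.
            boolean pair = i < last
                    && (common.contains(words.get(i)) || common.contains(words.get(i + 1)));
            if (pair) {
                out.add(words.get(i) + "-" + words.get(i + 1));
                if (i + 1 == last) {
                    break; // the bigram already covers the final word
                }
            } else {
                out.add(words.get(i));
            }
        }
        return out;
    }
}
```

With the common set {"the", "a", "to"}, the words of "to be or not to be" come out as to-be, be, or, not-to, to-be, matching the rewritten phrase quoted above. The index would need the same bigrams emitted at indexing time for these terms to match.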

    Doug


Discussion Overview
group: java-user
category: lucene
posted: Jan 14, '05 at 12:03a
active: Jan 14, '05 at 6:01p
posts: 3
users: 3
website: lucene.apache.org
