Does anyone know how much stop words are supposed to affect the index size?
I ran an experiment, building the index once with stop words and once
without.
The corpus is the English Wikipedia, and I indexed the title and body of
the articles. I used a list of 525 stop words.
With stop words removed, the index is 227 MB.
With stop words kept, the index is 331 MB.
Thus, the index grows by about 46% in this case, which I found
surprising, as I expected it not to grow as much. I haven't dug into the
details of the Lucene file formats, but I thought compression (field/term
vector/sparse lists/VInts) would negate the effect of stop words to a
large extent.
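For anyone who wants to check the arithmetic, here is a minimal sketch in Python using only the two sizes quoted above. The function names are my own and purely illustrative. It reports the difference two ways: as growth relative to the smaller index, and as the share of the larger index that the extra data accounts for.

```python
# Index sizes reported in this post, in megabytes.
SIZE_WITHOUT_STOP_WORDS_MB = 227
SIZE_WITH_STOP_WORDS_MB = 331


def relative_growth(before: float, after: float) -> float:
    """Growth from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


def share_of_total(before: float, after: float) -> float:
    """Portion of `after` that is made up of the added data."""
    return (after - before) / after


growth = relative_growth(SIZE_WITHOUT_STOP_WORDS_MB, SIZE_WITH_STOP_WORDS_MB)
share = share_of_total(SIZE_WITHOUT_STOP_WORDS_MB, SIZE_WITH_STOP_WORDS_MB)

print(f"Growth when stop words are kept: {growth:.1%}")
print(f"Share of the larger index due to stop words: {share:.1%}")
```

The two figures answer different questions: the first is how much bigger the index gets when stop words are kept, the second is how much would be saved by dropping them from the larger index.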
Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36
-- Dave