|| at Aug 21, 2009 at 5:42 pm
On Fri, 2009-08-21 at 17:22 +0200, dimazest at gmail.com wrote:
I use python xapian bindings to stem strings and get this behavior:
Python 2.4.6 (#1, Jul 24 2009, 19:28:46)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
s = xapian.Stem('en')
Is it a bug or a feature, that for the word 'Editing' different result
is returned than for edit?
I think the stemmer is ignoring uppercase token prefixes. So in the
second case it's actually stemming the word "diting". This likely
related Xapian's term prefixes, which are all uppercase:http://xapian.org/docs/omega/termprefixes.html
The stemming algorithm treats English words starting with
consonant-vowel-consonant differently, to handle words like duping ->
dupe, doting -> dote etc.
Actually, it's more complicated than that:http://snowball.tartarus.org/algorithms/english/stemmer.html