[Xapian-discuss] Stemming behavior

Dimazest
Aug 21, 2009 at 3:22 pm
Hello,

I use the Python Xapian bindings to stem strings and get this behavior:

Python 2.4.6 (#1, Jul 24 2009, 19:28:46)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xapian
>>> xapian.version_string()
'1.0.14'
>>> s = xapian.Stem('en')
>>> s('editing')
'edit'
>>> s('Editing')
'Edite'

Is it a bug or a feature that a different result is returned for
'Editing' than for 'editing'?

Thanks a lot,
Dima

2 responses

  • John Leach at Aug 21, 2009 at 5:42 pm

    On Fri, 2009-08-21 at 17:22 +0200, dimazest at gmail.com wrote:
    > I use the Python Xapian bindings to stem strings and get this behavior:
    >
    > Python 2.4.6 (#1, Jul 24 2009, 19:28:46)
    > [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import xapian
    > >>> xapian.version_string()
    > '1.0.14'
    > >>> s = xapian.Stem('en')
    > >>> s('editing')
    > 'edit'
    > >>> s('Editing')
    > 'Edite'
    >
    > Is it a bug or a feature that a different result is returned for
    > 'Editing' than for 'editing'?

    Hi Dima,

    I think the stemmer is ignoring uppercase token prefixes. So in the
    second case it's actually stemming the word "diting". This is likely
    related to Xapian's term prefixes, which are all uppercase:

    http://xapian.org/docs/omega/termprefixes.html

    The stemming algorithm treats English words starting with
    consonant-vowel-consonant differently, to handle words like duping ->
    dupe, doting -> dote etc.

    Actually, it's more complicated than that:

    http://snowball.tartarus.org/algorithms/english/stemmer.html

    John.
  • Dimazest at Aug 21, 2009 at 8:12 pm
    Richard and John, thank you for the replies.

    I'll lowercase the input.

    --
    Dima
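
The workaround Dima settles on, lowercasing the input before stemming, could be sketched roughly as follows. The helper name `stem_lowercased` is hypothetical and not part of Xapian; only `xapian.Stem('en')` and calling the stemmer object come from the session in the thread. The sketch assumes the Xapian Python bindings may or may not be installed, so the import is guarded.

```python
try:
    import xapian  # Xapian Python bindings; optional for this sketch
except ImportError:
    xapian = None


def stem_lowercased(stem, word):
    """Lowercase `word`, then pass it to `stem`.

    `stem` is any callable stemmer, for example xapian.Stem('en').
    This helper is illustrative only; it is not a Xapian API.
    """
    return stem(word.lower())


if xapian is not None:
    s = xapian.Stem('en')
    # Going by the session in the thread, this should now give the
    # same stem as s('editing') rather than 'Edite'.
    print(stem_lowercased(s, 'Editing'))
```

Taking the stemmer as a parameter keeps the normalization step separate from the choice of stemmer, so the same wrapper can be used wherever terms are stemmed, at both index and query time.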
