Is there an approved standard library/function/algorithm for comparing
two similar strings and returning a percentage match?

I am aware of soundEx.py / .c, which is based on the grammar and
phonetics of words, but from what I have read it seems to be flawed,
and was thus removed from the Python standard library.

I have noticed similar techniques in other languages which are based
on shift matrices, working out the minimum number of changes to
transform string A into string B.

I am more looking for one which looks at
words/
chars/
char-order/
length/
similarity
perhaps omitting spaces and a library of common words (the, a, and, mr, mrs, ...),
with a weighted scoring mechanism...

Thanks in advance...
Clayton Brown / Emmie Osawa


  • Tim Peters at Sep 2, 2001 at 2:13 am
    [Clayton Brown - Emmie Osawa]
    > Is there an approved standard library/function/algorithm for comparing
    > two similar strings and returning a percentage match?

    See the std difflib module in Python 2.1; the guts of that appeared in
    earlier Python releases as part of the ndiff.py utility; it implements an
    algorithm related to Ratcliff and Obershelp's "gestalt" pattern matching.

    > I am aware of soundEx.py / .c which is based on the grammar and
    > phonetics of words, but from what I have read it seems to be flawed..
    > and thus removed from the python standard library.

    It was removed more because Soundex isn't well-defined (even Knuth's
    definition changed between editions 2 and 3 of TAoCP volume 3), and it was a
    PITA to keep arguing about which was "the right" version. The version we
    had didn't correspond to any known published version anyway. In any case,
    Soundex was specifically designed to help match Anglo and some West European
    surnames, and uses beyond that were always ill-advised.

    > I have noticed similar techniques in other languages which are based
    > on shift matrices, working out the minimum number of changes to
    > transform string A into string B.

    There are dozens of possibilities.

    > I am more looking for one which looks at
    > words/
    > chars/
    > char-order/
    > length/
    > similarity
    > perhaps omitting spaces, and a common library (the,a,and,mr,mrs......)
    > with a weighted scoring mechanism...

    In that case, there are thousands of possibilities <0.5 wink>.

    difflib-offers-one-ly y'rs - tim
  • Jay Parlar at Sep 2, 2001 at 4:08 am
    I do believe that the difflib library is what you desire, and more specifically, SequenceMatcher from said library. If I
    remember correctly, it's described quite well in the documentation. Hope this helps!

    Jay Parlar
    ----------------------------------------------------------------
    Software Engineering III
    McMaster University
    Hamilton, Ontario, Canada

    "Though there are many paths
    At the foot of the mountain
    All those who reach the top
    See the same moon."
  • Skip Montanaro at Sep 2, 2001 at 1:42 pm
    Clayton> I am aware of soundEx.py / .c which is based on the grammar and
    Clayton> phonetics of words, but from what I have read it seems to be
    Clayton> flawed.. and thus removed from the python standard library.

    Not removed because it was flawed, but because it wasn't really
    general-purpose enough for continued inclusion. The soundex algorithm was
    originally meant to group similar sounding surnames together. While you can
    apply it to different string scoring applications, it probably won't be as
    good as it is at surname grouping. It's also a bit dated. I suspect there
    are much better scoring algorithms out there even for the surname problem.

    --
    Skip Montanaro (skip at pobox.com)
    http://www.mojam.com/
    http://www.musi-cal.com/
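
For readers landing on this thread, the two approaches discussed above can be sketched roughly as follows. This is only an illustration, not anything the posters wrote: `difflib.SequenceMatcher` and its `ratio()` method are the real standard-library API Tim and Jay point to, but the helper names (`normalize`, `percent_match`, `edit_distance`), the noise-word list, and the choice to strip spaces and punctuation are assumptions taken from the wish list in the question. The edit-distance function is a plain Levenshtein distance, one of the "minimum number of changes" techniques the question mentions. The code assumes a current Python 3 rather than the Python 2.1 of the thread.

```python
import difflib

# Hypothetical noise-word list, taken from the examples in the question.
NOISE_WORDS = {"the", "a", "and", "mr", "mrs"}


def normalize(text, noise_words=NOISE_WORDS):
    """Lowercase, drop noise words, and remove all whitespace."""
    words = [w.strip(".,") for w in text.lower().split()]
    return "".join(w for w in words if w and w not in noise_words)


def percent_match(a, b):
    """Return SequenceMatcher's similarity ratio scaled to 0-100."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b))
    return 100.0 * matcher.ratio()


def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to b[:j]
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]
```

With these definitions, `percent_match("abcd", "bcde")` gives 75.0 (three matching characters out of eight in total, doubled), and `edit_distance("kitten", "sitting")` gives 3. A weighted score along the lines the question asks for could combine the two, but the weights would have to be tuned to the data at hand.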

Discussion Overview
group: python-list @
categories: python
posted: Sep 2, '01 at 12:50a
active: Sep 2, '01 at 1:42p
posts: 4
users: 4
website: python.org