FAQ
I'm generally pleased with difflib.SequenceMatcher: It's probably not
the best available string matcher out there, but it's in the standard
library and I've seen worse in the wild. One thing that kind of
bothers me is that it's sensitive to which argument you pick as "seq1"
and which you pick as "seq2":

Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
import difflib
difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
0.44444444444444442
difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
0.66666666666666663
>>>

Is this a bug? I am guessing the algorithm is implemented correctly,
and that it's just an inherent property of the algorithm used. It's
certainly not what I'd call a desirably property. Are there any
simple adjustments that can be made without sacrificing (too much)
performance?

John

Search Discussions

  • Wolfgang Rohdewald at Nov 24, 2010 at 8:49 am

    On Mittwoch 24 November 2010, John Yeung wrote:
    Are there any
    simple adjustments that can be made without sacrificing (too
    much) performance?
    difflib.SequenceMatcher(None,*sorted(('BYRD','BRADY'))).ratio()
    0.66666666666666663

    --
    Wolfgang
  • Peter Otten at Nov 24, 2010 at 9:43 am

    John Yeung wrote:

    I'm generally pleased with difflib.SequenceMatcher: It's probably not
    the best available string matcher out there, but it's in the standard
    library and I've seen worse in the wild. One thing that kind of
    bothers me is that it's sensitive to which argument you pick as "seq1"
    and which you pick as "seq2":

    Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
    (Intel)] on
    win32
    Type "help", "copyright", "credits" or "license" for more information.
    import difflib
    difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
    0.44444444444444442
    difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
    0.66666666666666663
    Is this a bug? I am guessing the algorithm is implemented correctly,
    and that it's just an inherent property of the algorithm used. It's
    certainly not what I'd call a desirably property. Are there any
    simple adjustments that can be made without sacrificing (too much)
    performance?
    def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
    return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0

    I'm expecting 50% performance loss ;)

    Seriously, have you tried to calculate the ratio with realistic data?
    Without looking into the source I would expect the two ratios to get more
    similar.

    Peter
  • John Machin at Nov 24, 2010 at 10:01 pm

    On Nov 24, 8:43?pm, Peter Otten wrote:
    John Yeung wrote:
    I'm generally pleased with difflib.SequenceMatcher: ?It's probably not
    the best available string matcher out there, but it's in the standard
    library and I've seen worse in the wild. ?One thing that kind of
    bothers me is that it's sensitive to which argument you pick as "seq1"
    and which you pick as "seq2":
    Python 2.6.1 (r261:67517, Dec ?4 2008, 16:51:00) [MSC v.1500 32 bit
    (Intel)] on
    win32
    Type "help", "copyright", "credits" or "license" for more information.
    import difflib
    difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()
    0.44444444444444442
    difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()
    0.66666666666666663
    Is this a bug? ?I am guessing the algorithm is implemented correctly,
    and that it's just an inherent property of the algorithm used. ?It's
    certainly not what I'd call a desirably property. ?Are there any
    simple adjustments that can be made without sacrificing (too much)
    performance?
    def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
    ? ? return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0

    I'm expecting 50% performance loss ;)

    Seriously, have you tried to calculate the ratio with realistic data?
    Without looking into the source I would expect the two ratios to get more
    similar.

    Peter
    Surnames are extremely realistic data. The OP should consider using
    Levenshtein distance, which is "symmetric". A good (non-naive)
    implementation should be much faster than difflib.

    ratio = 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppython-list @
categoriespython
postedNov 24, '10 at 7:45a
activeNov 24, '10 at 10:01p
posts4
users4
websitepython.org

People

Translate

site design / logo © 2022 Grokbase