FAQ
Hi all,
I'd like to ask about the availability of a text diff library, like
difflib, which would support the detection of moved text blocks.
Currently I am almost happy with the results of
difflib.SequenceMatcher in my app (comparing different versions of
natural language texts), however the only drawback seems to be the
missing detection of moves of the text parts. I was thinking of simply
recomparing the deleted and inserted blocks using difflib again, but
this obviously won't be a general solution.
I found several algorithm discussions, but unfortunately no suitable
python implementation. (E.g. Medite -
http://www-poleia.lip6.fr/~ganascia/Medite_Project - seems to be
implemented in Python but it targets some rather special and probably
much more complex textological issues, than my current needs.)
Does maybe someone know such python library (or possibly a way of
enhancing difflib) for this task (i.e character-wise comparing of
texts - detecting insertion, deletion, substitution and move of text
blocks)?

Thanks in advance,
Vlastimil Brom

Search Discussions

  • Chris Torek at Jul 14, 2011 at 4:47 am
    In article <mailman.1002.1310591600.1164.python-list at python.org>
    Vlastimil Brom wrote:
    I'd like to ask about the availability of a text diff library, like
    difflib, which would support the detection of moved text blocks.
    If you allow arbitrary moves, the "minimal edit distance" problem
    (string-to-string edit) becomes substantially harder. If you only
    allow insert, delete, or in-place-substitute, you have what is
    called the "Levenshtein distance" case. If you allow transpositions
    you get "Damerau-Levenshtein". These are both solveable with a
    dynamic programming algorithm. Once you allow move operations,
    though, the problem becomes NP-complete.

    See http://pages.cs.brandeis.edu/~shapird/publications/JDAmoves.pdf
    for instance. (They give an algorithm that produces "usually
    acceptable" results in polynomial time.)
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Intel require I note that my opinions are not those of WRS or Intel
    Salt Lake City, UT, USA (40?39.22'N, 111?50.29'W) +1 801 277 2603
    email: gmail (figure it out) http://web.torek.net/torek/index.html
  • Vlastimil Brom at Jul 15, 2011 at 9:49 pm

    2011/7/14 Chris Torek <nospam at torek.net>:
    In article <mailman.1002.1310591600.1164.python-list at python.org>
    Vlastimil Brom ?wrote:
    I'd like to ask about the availability of a text diff library, like
    difflib, which would support the detection of moved text blocks.
    If you allow arbitrary moves, the "minimal edit distance" problem
    (string-to-string edit) becomes substantially harder. ?If you only
    allow insert, delete, or in-place-substitute, you have what is
    called the "Levenshtein distance" case. ?If you allow transpositions
    you get "Damerau-Levenshtein". ?These are both solveable with a
    dynamic programming algorithm. ?Once you allow move operations,
    though, the problem becomes NP-complete.

    See http://pages.cs.brandeis.edu/~shapird/publications/JDAmoves.pdf
    for instance. ?(They give an algorithm that produces "usually
    acceptable" results in polynomial time.)
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Thanks for the references and explanation!
    I do realise the added complexity with taking the moves into account;
    given that, my current needs and the usually satisfying results
    obtained easily with difflib, I am not going to try to implement some
    more complex diffing algorithm.
    However, it turns out that the mentioned naive approach with just
    recomparing the text additions and removals may be partially viable -
    with some luck, i.e. given, the relevant segments are identified as
    deletions and inserts and isolated by difflib in the first place (and
    not subsumed under larger changes or split).

    For illustration, the rough simplified code is attached (sorry for the
    style and possible quirks...)
    Just now the more similar text segments are just collected, it would
    be also possible to sort them on their similarity ratio; the current
    approach also allows to highlight potentially multiple moved segments.

    Comments and suggestions are, of course, welcome,
    regards,
    vbr

    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    # # # # # # # # # # # # # # # # # # # # #

    #! Python
    # -*- coding: utf-8 -*-

    import difflib
    import itertools

    def compare_moves(a, b, similarity_threshold=0.6):
    """
    Poor man's text comparison with simple moves check. Compares two
    strings using difflib
    and additionally tries to detect moved blocks
    by comparing similar deleted and inserted segments with each other
    - given the similarity_threshold.
    """

    seq_matcher = difflib.SequenceMatcher(isjunk=None, a=a, b=b, autojunk=False)
    diff_raw = [[tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]] for tag, i1,
    i2, j1, j2 in seq_matcher.get_opcodes()]

    deleted, inserted = {}, {}
    for tag, i1, i2, j1, j2 in seq_matcher.get_opcodes():
    if tag == 'delete':
    deleted[(i1, i2)] = [tag, i1, i2, j1, j2, a[i1:i2]]
    elif tag == 'insert':
    inserted[(i1, i2)] = [tag, i1, i2, j1, j2, b[j1:j2]]

    possibly_moved_blocks = []
    for deleted_item, inserted_item in
    itertools.product(deleted.values(), inserted.values()):
    similarity_ratio = difflib.SequenceMatcher(isjunk=None,
    a=deleted_item[5], b=inserted_item[5], autojunk=False).ratio()
    if similarity_ratio >= similarity_threshold:
    possibly_moved_blocks.append([deleted_item, inserted_item,
    similarity_ratio])

    print diff_raw
    print possibly_moved_blocks


    if __name__ == "__main__":
    compare_moves("abcXYZdeABfghijklmnopABBCq",
    "ABCDabcdeACfgXYXZhijklmnopq", similarity_threshold = 0.6)

    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    # # # # # #
    # output: #
    [['insert', 0, 0, 0, 4, '', 'ABCD'], ['equal', 0, 3, 4, 7, 'abc',
    'abc'], ['delete', 3, 6, 7, 7, 'XYZ', ''], ['equal', 6, 9, 7, 10,
    'deA', 'deA'], ['replace', 9, 10, 10, 11, 'B', 'C'], ['equal', 10, 12,
    11, 13, 'fg', 'fg'], ['insert', 12, 12, 13, 17, '', 'XYXZ'], ['equal',
    12, 21, 17, 26, 'hijklmnop', 'hijklmnop'], ['delete', 21, 25, 26, 26,
    'ABBC', ''], ['equal', 25, 26, 26, 27, 'q', 'q']]

    [[['delete', 21, 25, 26, 26, 'ABBC'], ['insert', 0, 0, 0, 4, 'ABCD'],
    0.75], [['delete', 3, 6, 7, 7, 'XYZ'], ['insert', 12, 12, 13, 17,
    'XYXZ'], 0.8571428571428571]]
    -------------- next part --------------
    A non-text attachment was scrubbed...
    Name: compare_moves.py
    Type: application/octet-stream
    Size: 1473 bytes
    Desc: not available
    URL: <http://mail.python.org/pipermail/python-list/attachments/20110715/fe993130/attachment.obj>

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppython-list @
categoriespython
postedJul 13, '11 at 9:13p
activeJul 15, '11 at 9:49p
posts3
users2
websitepython.org

2 users in discussion

Vlastimil Brom: 2 posts Chris Torek: 1 post

People

Translate

site design / logo © 2022 Grokbase