I am wondering if there is a way to diff 2 versions of a reStructuredText
document that differ only by line breaks within paragraphs such that those
differences do not trigger a diff entry. In other words, I wonder if there is a
tool out there where:

this is one
reStructuredText
paragraph

Is considered equivalent to:

this is one reStructuredText paragraph

Does anyone have any ideas how this can be accomplished, especially with respect
to VCS differences, e.g. svn?

Thanks!

Jeffrey.

  • Martin Blais at Mar 2, 2009 at 8:21 pm

    On Mon, 2 Mar 2009 17:33:46 +0000 (UTC), "Jeffrey C. Jacobs" <docutils.org.timehorse at neverbox.com> said:
    I am wondering if there is a way to diff 2 versions of a reStructuredText
    document that differ only by line breaks within paragraphs such that those
    differences do not trigger a diff entry. In other words, I wonder if there is a
    tool out there where:

    this is one
    reStructuredText
    paragraph

    Is considered equivalent to:

    this is one reStructuredText paragraph

    Does anyone have any ideas how this can be accomplished, especially with respect
    to VCS differences, e.g. svn?

    If the differences are only whitespace, xxdiff has an option to keep those gray in the GUI.

    tangerine:~/p/xxdiff/src$ xxdiff --list-resource | grep Hunk
    Accel.ToggleIgnorePerHunkWhitespace: ""
    IgnorePerHunkWhitespace: False
    tangerine:~/p/xxdiff/src$

    Otherwise you can write a 40-line Python script that parses GNU diff output and filters those changes out of the diff hunks.
  • Gael Varoquaux at Mar 2, 2009 at 8:42 pm

    On Mon, Mar 02, 2009 at 05:33:46PM +0000, Jeffrey C. Jacobs wrote:
    I am wondering if there is a way to diff 2 versions of a reStructuredText
    document that differ only by line breaks within paragraphs such that those
    differences do not trigger a diff entry. In other words, I wonder if there is a
    tool out there where:

    this is one
    reStructuredText
    paragraph

    Is considered equivalent to:

    this is one reStructuredText paragraph

    wdiff.

    Gaël
  • Jeffrey C. Jacobs at Mar 3, 2009 at 5:57 pm

    Gael Varoquaux <gael.varoquaux <at> normalesup.org> writes:

    wdiff.

    Thanks for the suggestions! Unfortunately, one thing I forgot to mention
    was that the concatenations should not span different paragraphs. Thus:

    Hello! World!

    is not the same as:

    Hello!

    World!

    Since the first represents only 1 paragraph, but the second 2.

    Instead, I propose the following Python script that diffs the docutils
    trees instead of the original text files. I don't know how it could tell
    whether the 2 inputs are reStructuredText documents rather than regular
    text documents and only perform the doc-tree step for rst, and I welcome
    suggestions for improvements, but so far this does a good job of what I am
    trying to achieve. Such a tool could be handy to rst documenters in
    cases where, through years of editing, a document has accumulated lines
    that go beyond 80 columns and the file is then re-wrapped to bring it back
    in line. That produces massive standard diffs when the result really should
    be more or less the same document. This script could be used to confirm
    that the two versions of a document are more or less the same.

    ----------

    #!/usr/bin/env python3

    import sys
    import subprocess
    import tempfile
    import docutils.core
    import os
    import re

    # Regexps for removing inconsequential characters
    # (re.L is dropped: Python 3 rejects it for str patterns)
    trimwhite = re.compile(r'(?<!>)\n\s*(?![< ])', re.M | re.U)
    webspace = re.compile(r'(?<=[.?!):])\s{2,}(?=[\w\d(])', re.M | re.U)
    repl = r' '

    if __name__ == '__main__':
        # To do: verify that document 1 and document 2 are both
        # reStructuredText documents

        # Last 2 parameters are the left-hand-side and right-hand-side files
        lhs, rhs = sys.argv[-2:]

        # Parse the left and right files into docutils tree strings
        # (pformat() gives the same pseudo-XML as the default writer, as str)
        with open(lhs, encoding='utf-8') as f:
            lhss1 = docutils.core.publish_doctree(f.read()).pformat()
        with open(rhs, encoding='utf-8') as f:
            rhss2 = docutils.core.publish_doctree(f.read()).pformat()

        # Concatenate multi-line text that lies within a node
        lhss1, lhsr1 = trimwhite.subn(repl, lhss1)
        rhss2, rhsr2 = trimwhite.subn(repl, rhss2)
        #sys.stdout.write('Removed returns (left, right): %d, %d\n' %
        #                 (lhsr1, rhsr2))

        # Trim multiple white spaces between full-stop (.?!) and the next phrase
        lhss1, lhsr1 = webspace.subn(repl, lhss1)
        rhss2, rhsr2 = webspace.subn(repl, rhss2)
        #sys.stdout.write('Removed double space (left, right): %d, %d\n' %
        #                 (lhsr1, rhsr2))

        # Make sure the last line is properly terminated
        lhss1 += '\n'
        rhss2 += '\n'

        # Allocate temporary files to hold the left and right doc-trees
        lhsh1, lhst1 = tempfile.mkstemp(text=True)
        rhsh2, rhst2 = tempfile.mkstemp(text=True)

        # Open the left and right temp files for writing
        lhso1 = os.fdopen(lhsh1, 'w', encoding='utf-8')
        rhso2 = os.fdopen(rhsh2, 'w', encoding='utf-8')

        # Write the doc-trees to the temp files
        lhso1.write(lhss1)
        rhso2.write(rhss2)

        # Close the temp files
        lhso1.close()
        rhso2.close()

        # Spawn [UNIX] diff and wait for it to complete
        # Stdout and Stderr are passed directly to this application
        sp = subprocess.Popen(['diff'] + sys.argv[1:-2] + [lhst1, rhst2])
        sp.wait()

        # Delete the temp files
        os.remove(lhst1)
        os.remove(rhst2)
  • Martin Blais at Mar 3, 2009 at 5:59 pm

    On Tue, 3 Mar 2009 17:57:12 +0000 (UTC), "Jeffrey C. Jacobs" <docutils.org.timehorse at neverbox.com> said:
    Gael Varoquaux <gael.varoquaux <at> normalesup.org> writes:
    wdiff.
    Thanks for the suggestions! Unfortunately, one thing I forgot to mention
    was that the concatenations should not span different paragraphs. Thus:

    Hello! World!

    is not the same as:

    Hello!

    World!

    Since the first represents only 1 paragraph, but the second 2.

    Instead, I propose the following Python script that diffs the docutils
    trees instead of the original text files. I don't know how it could tell
    whether the 2 inputs are reStructuredText documents rather than regular
    text documents and only perform the doc-tree step for rst, and I welcome
    suggestions for improvements, but so far this does a good job of what I am
    trying to achieve. Such a tool could be handy to rst documenters in
    cases where, through years of editing, a document has accumulated lines
    that go beyond 80 columns and the file is then re-wrapped to bring it back
    in line. That produces massive standard diffs when the result really should
    be more or less the same document. This script could be used to confirm
    that the two versions of a document are more or less the same.

    This is great. BTW if you want to inspect your diffs graphically, you can tell xxdiff to use your program to compute the differences. It'll work if your program outputs POSIX diffs (which it likely does, because you're invoking GNU diff).
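
For anyone who wants to try Martin's earlier suggestion of filtering whitespace-only changes out of diff output, here is a rough sketch of how it might look. It is not his script. The function name `filter_whitespace_hunks` is made up for illustration, and it assumes a unified diff (`diff -u`) covering a single file. Note that it also treats an added or removed blank line as "only whitespace", so it does not respect the paragraph boundaries Jeffrey asks about.

```python
import re

# Matches the "@@ -a,b +c,d @@" line that starts each unified-diff hunk
HUNK_HEADER = re.compile(r'^@@ .* @@')


def _words(lines):
    """Return the whitespace-separated words of a list of lines."""
    return ' '.join(lines).split()


def filter_whitespace_hunks(diff_text):
    """Drop hunks whose removed and added text differ only in whitespace.

    Assumes diff_text is a unified diff for one file. Returns the
    filtered diff, or an empty string if no hunk survives.
    """
    header, hunks, current = [], [], None
    for line in diff_text.splitlines():
        if HUNK_HEADER.match(line):
            current = [line]          # start a new hunk
            hunks.append(current)
        elif current is None:
            header.append(line)       # "---" / "+++" lines before the first hunk
        else:
            current.append(line)

    kept = []
    for hunk in hunks:
        removed = [l[1:] for l in hunk[1:] if l.startswith('-')]
        added = [l[1:] for l in hunk[1:] if l.startswith('+')]
        # Keep the hunk only if the word sequences really differ
        if _words(removed) != _words(added):
            kept.extend(hunk)

    if not kept:
        return ''
    return '\n'.join(header + kept) + '\n'
```

Feeding it the diff of the two example paragraphs from the original question should give back an empty string, while a hunk that changes an actual word is passed through untouched.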

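Jeffrey's requirement that joins must not span paragraphs can also be approximated without docutils, by normalizing each blank-line-separated paragraph before diffing. This is a different, much cruder technique than his doc-tree script: it knows nothing about reStructuredText, so literal blocks, line blocks and tables, where line breaks matter, would be flattened too. The names `normalize_paragraphs` and `paragraph_diff` are invented for this sketch.

```python
import difflib
import re


def normalize_paragraphs(text):
    """Split text on blank lines and collapse whitespace inside each paragraph."""
    paragraphs = re.split(r'\n\s*\n', text.strip())
    return [' '.join(p.split()) for p in paragraphs if p.strip()]


def paragraph_diff(old_text, new_text):
    """Return unified-diff lines computed over the normalized paragraphs."""
    old = normalize_paragraphs(old_text)
    new = normalize_paragraphs(new_text)
    return list(difflib.unified_diff(old, new, 'old', 'new', lineterm=''))
```

With this, re-wrapping a paragraph produces no diff lines at all, whereas "Hello! World!" versus "Hello!", blank line, "World!" still shows up as a change because the paragraph count differs.
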
Discussion Overview
group: doc-sig
categories: python
posted: Mar 2, '09 at 5:33p
active: Mar 3, '09 at 5:59p
posts: 5
users: 3
website: python.org
