FAQ
In all kinds of circumstances it would be very useful to call an
external filter to process some data, and read the results back in.
What I needed was something like popen(), only working for both
reading and writing. However, such a thing is hard to write in a
simple-minded fashion because of deadlocks that occur when handling
more than several bytes of data. Deadlocks can either be caused by
both programs waiting for not-yet-generated input, or (in my case) by
both their writes being blocked waiting for the other to read.
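
The threshold at which the naive approach deadlocks is the kernel's pipe buffer capacity. As a concrete illustration (a POSIX-only sketch, written in later-Python syntax), one can measure that capacity by filling a non-blocking pipe until a write would block:

```python
import fcntl
import os

def pipe_capacity():
    """Return how many bytes a pipe accepts before write() would block."""
    r, w = os.pipe()
    # Make the write end non-blocking, so a full pipe raises
    # BlockingIOError (EAGAIN) instead of blocking this process.
    flags = fcntl.fcntl(w, fcntl.F_GETFL)
    fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    total = 0
    try:
        while True:
            total += os.write(w, b"x" * 1024)
    except BlockingIOError:
        pass
    os.close(r)
    os.close(w)
    return total
```

Once both sides have more than this many bytes in flight, each write() blocks waiting for the other side to read, and the deadlock described above occurs.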

The usual choices are to:

a) Write a deadlock-free communication protocol and use it on both
ends. This is rarely a good solution, because the program that
needs to be invoked is in most cases an external filter that knows
nothing about our deadlock problems.

b) Use PTY's instead of pipes. Many programmers prefer to avoid this
path because of the added system resources that the PTY's require,
and because of the increased complexity.

Given these choices, most people opt to use a temporary file and get
it over with.

However, a colleague with whom I discussed this problem thought of a
third solution: break the circularity by using separate processes for
reading and writing. This can be done whenever reading and writing
are independent, i.e. when the data read from the subprocess does not
influence future writes to it.

The function below implements that idea. Usage is something like:

rwpopen("""Some long string here...""", "sed", ["s/long/short/"])
-> 'Some short string here...'

I've put the function to good use in a program I'm writing. In
addition to getting rid of temporary files, the whole operation timed
faster than using a tmpfile (that was on a single-CPU machine). The
function will, of course, work only under Unix and its lookalikes.
Additional information is embedded in the docstring and the comments.

I'd like to hear feedback. Do other people find such a thing useful?
Is there a fundamental flaw or a possibility of a deadlock that I'm
missing?


import os
import sys

def rwpopen(input, command, args=[]):
    """Execute command with args, pipe input into it, and read it back.
    Return the result read from the command.

    Normally, when a process tries to write to a child process and
    read back its output, a deadlock condition can occur easily,
    either by both processes waiting for not-yet-generated input, or
    by both their write()s being blocked waiting for the other to
    read.

    This function prevents deadlocks by using separate processes for
    reading and writing, at the expense of an additional fork(). That
    way the process that writes to an exec'ed command and the process
    that reads from the command are fully independent, and no deadlock
    can occur. The child process exits immediately after writing.

    More precisely: the current process (A) forks off a process B,
    which in turn forks off a process C. While C does the usual
    dup, close, exec thing, B merely writes the data to the pipe and
    exits. Independently of B, A reads C's response. A deadlock
    cannot occur because A and B are independent of each other -- even
    if B's write() is stopped because it filled up the pipe buffer, A
    will happily keep reading C's output, and B's write() will be
    resumed shortly.
    """
    # XXX Redo this as a class, with overridable methods for reading
    # and writing.
    #
    # XXX Provide error-checking and propagate exceptions from child
    # to parent. This would require either wait()ing on the child
    # (which is a bag of worms), or opening another pipe for
    # transmitting error messages or serialized exception objects.
    #
    # XXX This function expects the system to wait for the child upon
    # receiving SIGCHLD. This should be the case on most systems as
    # long as SIGCHLD is handled by SIG_DFL. If this is not the case,
    # zombies will remain.

    def safe_traceback():
        # Child processes catch exceptions so that they can exit using
        # os._exit() without fanfare. They use this function to print
        # the traceback to stderr before dying.
        import traceback
        sys.stderr.write("Error in child process, pid %d.\n" % os.getpid())
        sys.stderr.flush()
        traceback.print_exc()
        sys.stderr.flush()

    # It would be nice if Python provided a way to see if pipes are
    # bidirectional. In that case, we could open only one pipe
    # instead of two, with p_readfd == p_writefd and c_readfd ==
    # c_writefd.
    p_readfd, c_writefd = os.pipe()
    c_readfd, p_writefd = os.pipe()
    if os.fork():
        # Parent
        for fd in (c_readfd, c_writefd, p_writefd):
            os.close(fd)
        # Convert the pipe fd to a file object, so we can use its
        # read() method to read all data.
        fp = os.fdopen(p_readfd, 'r')
        result = fp.read()
        fp.close()                      # Will close p_readfd.
        return result
    else:
        # Child
        try:
            if os.fork():
                # Still the same child
                os.write(p_writefd, input)
            else:
                # Grandchild
                try:
                    # Redirect the pipe to stdin.
                    os.close(0)
                    os.dup(c_readfd)
                    # Redirect stdout to the pipe.
                    os.close(1)
                    os.dup(c_writefd)
                    # Now close unneeded descriptors.
                    for fd in (c_readfd, c_writefd, p_readfd, p_writefd):
                        os.close(fd)
                    # Finally, execute the external command.
                    os.execvp(command, [command] + args)
                except:
                    safe_traceback()
                    os._exit(127)
        except:
            safe_traceback()
            os._exit(127)
        else:
            os._exit(0)
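
For readers on later Pythons, the same double-fork scheme might look like the sketch below (an adaptation, not the original: later Pythons make pipe fds non-inheritable, hence dup2, bytes are required, and write() has to be looped for large inputs):

```python
import os

def rwpopen3(data, command, args=()):
    """Double-fork sketch: process A (the caller) only reads, process B
    only writes, process C runs the command, so no deadlock can occur.
    Unix-only; takes and returns bytes."""
    p_read, c_write = os.pipe()   # C's stdout -> A
    c_read, p_write = os.pipe()   # B -> C's stdin
    if os.fork():
        # A (parent): close unused ends, read everything C writes.
        for fd in (c_read, c_write, p_write):
            os.close(fd)
        chunks = []
        while True:
            chunk = os.read(p_read, 8192)
            if not chunk:
                break
            chunks.append(chunk)
        os.close(p_read)
        os.wait()                 # reap B; orphaned C is reaped by init
        return b"".join(chunks)
    if os.fork():
        # B (child): write the input to C, then exit immediately.
        try:
            for fd in (p_read, c_read, c_write):
                os.close(fd)
            view = memoryview(data)
            while view:           # write() may be partial; loop
                view = view[os.write(p_write, view):]
            os.close(p_write)
        finally:
            os._exit(0)
    # C (grandchild): wire the pipes to stdin/stdout and exec.
    try:
        os.dup2(c_read, 0)        # dup2 leaves fd 0 inheritable
        os.dup2(c_write, 1)
        for fd in (p_read, p_write, c_read, c_write):
            os.close(fd)
        os.execvp(command, [command] + list(args))
    finally:
        os._exit(127)
```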

  • Tim Evans at Oct 17, 1999 at 2:08 am
    Looks good. Could this also be done using threads?

    Possibility:

    create two threads, one for reading and the other for writing. Then
    you can pass strings to the writing thread and get back the buffered
    result from the reading thread. This allows interactive communication
    without the danger of deadlocks.

    Has this been done before or should I give it a go?
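
    For what it's worth, the two-thread idea can be sketched as follows (written against today's standard library for brevity; subprocess here is just a stand-in for the fork/exec plumbing):

```python
import subprocess
import threading

def rwpopen_threaded(data, argv):
    """Two-thread version: a writer thread feeds the child's stdin
    while the main thread drains stdout, so neither end can block
    the other indefinitely."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

    def writer():
        proc.stdin.write(data)
        proc.stdin.close()     # EOF tells the filter to finish

    t = threading.Thread(target=writer)
    t.start()
    out = proc.stdout.read()   # blocks only until the child is done
    t.join()
    proc.wait()
    return out
```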

    --
    Tim Evans
  • Hrvoje Niksic at Oct 17, 1999 at 5:20 pm

    "Tim Evans" <tre17 at student.canterbury.ac.nz> writes:

    Looks good. Could this also be done using threads?
    Probably. But this wouldn't work on machines without threading
    support, and the task is just too basic to *require* threads.

    I don't buy threads as a buzzword -- I believe my particular problem
    is solved much more naturally using an helper process. The additional
    process touches very little data and modifies none, so COW should make
    it inexpensive. My timings show that this is indeed the case.
    create two threads, one for reading and the other for writing. Then
    you can pass strings to the writing thread and get back the buffered
    result from the reading thread. This allows interactive
    communication without the danger of deadlocks.
    I think deadlocks can occur as long as there is the writing and
    reading thread/process depend on each other.
  • Oleg Broytmann at Oct 17, 1999 at 12:14 pm
    Hi!

    On 17 Oct 1999, Hrvoje Niksic wrote:

    > def rwpopen(input, command, args=[]):
    >                             ^^^^^^^

    There is a well-known problem with passing a mutable object as a
    default. Not sure if it was fixed in recent versions of Python...

    Oleg.
    ----
    Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
    Programmers don't die, they just GOSUB without RETURN.
  • Fredrik Lundh at Oct 17, 1999 at 12:33 pm

    Oleg Broytmann wrote:

    > On 17 Oct 1999, Hrvoje Niksic wrote:
    > > def rwpopen(input, command, args=[]):
    > >                             ^^^^^^^
    > There is a well-known problem with passing a mutable object as a
    > default. Not sure if it was fixed in recent versions of Python...

    well, that's only a problem if you modify the object
    inside the function...

    (and no, it hasn't been fixed. I doubt it can be
    fixed without breaking stuff).
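
    The gotcha only bites when the default object is actually mutated; a minimal demonstration (append_bad and append_good are made-up names):

```python
def append_bad(item, acc=[]):
    # The default list is created once, at def time, and shared
    # across calls -- mutations accumulate between invocations.
    acc.append(item)
    return acc

def append_good(item, acc=None):
    # The usual idiom: use None as a sentinel and build a fresh
    # list on each call.
    if acc is None:
        acc = []
    acc.append(item)
    return acc
```

    rwpopen as posted is safe because it never mutates args.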

    ...

    but to make the code a bit more flexible, I'd change
    the execvp call to:

    os.execvp(command, (command,) + tuple(args))

    or, if you prefer:

    os.execvp(command, [command] + list(args))

    (this allows the caller to use *any* kind of sequence,
    not just a list).
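
    The difference is easy to see with a non-list sequence; a tiny sketch (argv_for is a hypothetical helper name):

```python
def argv_for(command, args):
    # list(args) accepts any sequence or iterable -- tuple,
    # generator, etc. -- whereas [command] + args requires a list.
    return [command] + list(args)
```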

    </F>

    <!-- coming monday:
    http://www.pythonware.com/people/fredrik/librarybook.htm
    (the eff-bot guide to) the standard python library. -->
  • Oleg Broytmann at Oct 17, 1999 at 12:36 pm

    On Sun, 17 Oct 1999, Fredrik Lundh wrote:

    > Oleg Broytmann wrote:
    > > There is a well-known problem with passing a mutable object as a
    > > default. Not sure if it was fixed in recent versions of Python...
    > well, that's only a problem if you modify the object
    > inside the function...

    Sooner or later you forget about it, modify args, and then what? :)
    No, I'd better avoid this completely until Python fixes it.

    > (and no, it hasn't been fixed. I doubt it can be
    > fixed without breaking stuff).

    Mmm??? Is there a line of code that *relies* on that misfeature?

    Oleg.
    ----
    Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
    Programmers don't die, they just GOSUB without RETURN.
  • Fredrik Lundh at Oct 17, 1999 at 12:53 pm

    > > (and no, it hasn't been fixed. I doubt it can be
    > > fixed without breaking stuff).
    > Mmm??? Is there a line of code that *relies* on that misfeature?

    yes, there's tons of code that relies on the
    fact that the default values are evaluated once,
    and more importantly, that they are evaluated
    in the namespace where the function/lambda
    is defined.

    in fact, it's currently the only reasonable way
    to pass local variables into a nested namespace
    (like when using lambdas). it's also often used
    to speed things up, by binding commonly used
    globals to local names.

    ...

    but sure, I'm sure Guido is open for proposals. I
    don't think you can get away with "always evaluate
    them on each call," though...

    </F>

    <!-- coming monday:
    http://www.pythonware.com/people/fredrik/librarybook.htm
    (the eff-bot guide to) the standard python library. -->
  • Oleg Broytmann at Oct 17, 1999 at 12:59 pm
    Hi!

    I marked it with the word "misfeature", but of course I meant only the
    problem with mutable types. Sure, I use

        lambda x, y=z: ...

    often.

    (But that's another problem. After 10 years with Pascal, I was used
    to local functions that have access to the outer function's variables.
    Learning to use lambdas was a little painful for me. And I still hope
    Python will have local functions sometime... maybe in 2.0+.)

    On Sun, 17 Oct 1999, Fredrik Lundh wrote:

    > > Mmm??? Is there a line of code that *relies* on that misfeature?
    > yes, there's tons of code that relies on the
    > fact that the default values are evaluated once,
    > and more importantly, that they are evaluated
    > in the namespace where the function/lambda
    > is defined.
    >
    > in fact, it's currently the only reasonable way
    > to pass local variables into a nested namespace
    > (like when using lambdas). it's also often used
    > to speed things up, by binding commonly used
    > globals to local names.
    >
    > but sure, I'm sure Guido is open for proposals. I
    > don't think you can get away with "always evaluate
    > them on each call," though...
    Oleg.
    ----
    Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
    Programmers don't die, they just GOSUB without RETURN.
  • Hrvoje Niksic at Oct 17, 1999 at 5:17 pm

    Oleg Broytmann <phd at sun.med.ru> writes:

    > On Sun, 17 Oct 1999, Fredrik Lundh wrote:
    > > well, that's only a problem if you modify the object
    > > inside the function...
    > Sooner or later you forget about it, modify args, and then what?

    No, I don't. It's not generally nice to make destructive
    modifications on sequences passed to a function as arguments, so I
    don't do that with ARGS, regardless of the default value.

    > :) No, I'd better avoid this completely until Python fixes it.

    Your choice, not mine.

    > > (and no, it hasn't been fixed. I doubt it can be fixed without
    > > breaking stuff).
    > Mmm??? Is there a line of code that *relies* on that misfeature?

    But of course. A misfeature to you is a feature to someone else.
  • Hrvoje Niksic at Oct 17, 1999 at 5:14 pm

    "Fredrik Lundh" <fredrik at pythonware.com> writes:

    > but to make the code a bit more flexible, I'd change
    > the execvp call to:
    >
    >     os.execvp(command, (command,) + tuple(args)) [...]
    >
    > (this allows the caller to use *any* kind of sequence, not just a
    > list).

    Thanks for the suggestion; I've now made that change.
  • Donn Cave at Oct 18, 1999 at 7:00 pm
    Quoth Hrvoje Niksic <hniksic at srce.hr>:
    > In all kinds of circumstances it would be very useful to call an
    > external filter to process some data, and read the results back in.
    > What I needed was something like popen(), only working for both
    > reading and writing. [...]
    >
    > I'd like to hear feedback. Do other people find such a thing useful?
    > Is there a fundamental flaw or a possibility of a deadlock that I'm
    > missing?
    Interesting idea. I was inspired to try a slightly different
    approach, which I will append here.

    It's definitely a solution, possibly the only general one, for
    deadlocks caused by the pipe buffer size. That's an interesting
    problem, but I think a relatively unusual one. In order to get
    here, your processes need to be ignoring their input so it stalls
    in the pipe ... for example, the parent might wait() for the child
    and then read its output, while the child is stuck trying to
    finish writing its large output. But I am having a hard time
    thinking of an example where it isn't easily avoided. I'm also
    surprised that the intermediate process would be more economical
    than a temporary file, so I wonder if the resources were all
    accounted for. Temporary files do have the liability that their
    filesystem may run out of space, but then it seems like a much
    safer way to buffer large transfers.

    By far the most common intractable deadlock problem is internal
    buffering in a command that uses C I/O and hasn't flushed its
    own buffer. This is where the pty device comes in, and to my
    knowledge it's the only general cure. It works because C I/O
    switches to line buffering with a tty device. But again this
    problem can be easily avoided in a situation where all the input
    for the command can be written before you wait for its output -
    just close the pipe after you're finished writing to it! The
    problem really arises when you're trying to conduct an exchange
    that really needs to alternate reads and writes, like, try to
    write a line to "awk", read awk's output, and then write another
    line to the same awk process. To do this, you need a pty device.
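
    The pty interaction described above can be sketched roughly as below (later-Python syntax; 'cat' in the usage is just a stand-in filter, and real code would need timeouts and EIO handling -- the point is that giving the child a tty for stdout makes C stdio line-buffered, so each written line can be read back before the next is written):

```python
import os
import pty
import subprocess

def interact(lines, argv):
    """Write a line, read the reply, repeat -- via a pty on stdout."""
    master, slave = pty.openpty()
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=slave)
    os.close(slave)               # parent keeps only the master end
    replies = []
    for line in lines:
        proc.stdin.write(line)
        proc.stdin.flush()
        # Because the child's stdout is a tty, stdio flushes per
        # line, so the reply arrives without waiting for EOF.
        replies.append(os.read(master, 1024))
    proc.stdin.close()
    proc.wait()
    os.close(master)
    return replies
```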

    Anyway, here's my attempt at the 3rd process solution. I made
    both processes children of the calling process, the 3rd process
    copies I/O both ways, and the caller can issue reads and writes
    to the command at its convenience. It's a subclass of a normal
    1-stage read/write command. The 3rd process avoids blocking on
    reads or writes with the select system call, which is specific
    to UNIX.

    # Donn Cave, University Computing Services, University of Washington
    # donn at u.washington.edu
    #----------------------------
    import os
    import select
    import sys
    import traceback

    # External command, with plain read and write dual pipe.
    #
    # ex.   cmd = RWPipe('/bin/sh', ('sh', '-c', 'nslookup'))
    #       os.write(cmd.input, 'set q=any\n')
    #       os.write(cmd.input, 'nosuchhost\n')
    #       os.close(cmd.input)
    #       while 1:
    #           x = os.read(cmd.output, 8192)
    #           if not x:
    #               break
    #           print 'output:', x
    #       status = cmd.wait()
    #
    # I/O is unbuffered UNIX read/write, caller may make file objects.
    #
    class RWPipe:
        def __init__(self, command, argv, environ = None):
            self.command = command
            self.argv = argv
            if environ is None:
                self.environ = os.environ
            else:
                self.environ = environ
            self.start()

        def pipexec(self, pipes):
            for unit, use in pipes:
                os.dup2(use, unit)
            os.execve(self.command, self.argv, self.environ)

        def setpipes(self, rp, wp, xp):
            # Close unused pipe ends.
            for p in rp:
                # Using read end here.
                os.close(p[1])
            for p in wp:
                # Using write end here.
                os.close(p[0])
            for p in xp:
                # Not using this pipe here.
                os.close(p[0])
                os.close(p[1])

        def start(self):
            tocmd = os.pipe()
            frcmd = os.pipe()
            pid = os.fork()
            if not pid:
                try:
                    self.setpipes([tocmd], [frcmd], [])
                    self.pipexec([(0, tocmd[0]), (1, frcmd[1])])
                finally:
                    traceback.print_exc()
                    os._exit(127)
            self.pid = pid
            self.setpipes([frcmd], [tocmd], [])
            self.input = tocmd[1]
            self.output = frcmd[0]

        def wait(self):
            p, s = os.waitpid(self.pid, 0)
            return (s >> 8) & 0x7f

    # Industrial strength external command, with an intermediate process
    # that copies I/O, buffering as necessary to avoid deadlock due to
    # system pipe buffer size limit.
    #
    class BigRWPipe(RWPipe):
        def buffer(self, xferunits):
            # Transfer I/O between pipes: self.buffer([(from, to), ...])
            #
            xfers = []
            for r, w in xferunits:
                xfers.append((r, w, ''))
            while xfers:
                wsel = []
                rsel = []
                esel = []
                nxf = []
                for r, w, buf in xfers:
                    # Compile select masks for active units.
                    if w >= 0:
                        if buf:
                            # Only check for write if any
                            # data buffered to write.
                            wsel.append(w)
                    elif r >= 0:
                        # If dest invalid, close source.
                        # Will cause SIGPIPE in source proc.
                        os.close(r)
                        r = -1
                    if r >= 0:
                        rsel.append(r)
                        esel.append(r)
                    elif w >= 0 and not buf:
                        # If source invalid and no data,
                        # close dest. Will usually cause
                        # dest to finish normally.
                        os.close(w)
                        w = -1
                    if w >= 0:
                        esel.append(w)
                    if w >= 0 or r >= 0:
                        nxf.append((r, w, buf))
                xfers = nxf
                if not xfers:
                    break

                rdset, wdset, edset = select.select(rsel, wsel, esel)

                nxf = []
                for r, w, buf in xfers:
                    if r in rdset:
                        b = os.read(r, 8192)
                        if b:
                            buf = buf + b
                        else:
                            os.close(r)
                            r = -1
                    if r in edset:
                        r = -1
                    if w in wdset:
                        n = os.write(w, buf)
                        buf = buf[n:]
                    if w in edset:
                        w = -1
                    if r >= 0 or w >= 0:
                        nxf.append((r, w, buf))
                xfers = nxf

        def start(self):
            frcmd = os.pipe()
            tocmd = os.pipe()
            frmed = os.pipe()
            tomed = os.pipe()

            pid = os.fork()
            if not pid:
                # Set up the buffer process.
                try:
                    self.setpipes([frcmd, tomed], [tocmd, frmed], [])
                    self.buffer([(frcmd[0], frmed[1]),
                                 (tomed[0], tocmd[1])])
                except:
                    traceback.print_exc()
                    sys.exit(1)
                sys.exit(0)
            self.med = pid

            pid = os.fork()
            if not pid:
                try:
                    self.setpipes([tocmd], [frcmd], [tomed, frmed])
                    self.pipexec([(0, tocmd[0]), (1, frcmd[1])])
                finally:
                    traceback.print_exc()
                    os._exit(127)
            self.pid = pid
            self.setpipes([frmed], [tomed], [frcmd, tocmd])
            self.output = frmed[0]
            self.input = tomed[1]

        def wait(self):
            p, s = os.waitpid(self.med, 0)
            p, s = os.waitpid(self.pid, 0)
            return (s >> 8) & 0x7f
  • Justin Sheehy at Oct 27, 1999 at 7:30 pm

    Hrvoje Niksic <hniksic at srce.hr> writes:

    > In all kinds of circumstances it would be very useful to call an
    > external filter to process some data, and read the results back in.
    > What I needed was something like popen(), only working for both
    > reading and writing.

    In what way was the popen2 standard module insufficient?

    I use popen2.popen2() and popen2.popen3() fairly frequently, and am
    trying to see what your code would buy you that you can't do with
    those functions.
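
    One answer is that popen2 hands back both pipes but leaves deadlock avoidance to the caller: with large data flowing in both directions you still have to multiplex them yourself, e.g. with select(). A single-process sketch of that (later-Python syntax; rwpopen sidesteps the need for it by dedicating a process to each direction):

```python
import os
import select
import subprocess

def rwpopen_select(data, argv):
    """Feed data to argv and read its output, multiplexing both
    pipes with select() so neither side can block forever."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    wfd = proc.stdin.fileno()
    rfd = proc.stdout.fileno()
    out = []
    pending = data
    done_writing = False
    while True:
        wlist = [] if done_writing else [wfd]
        rready, wready, _ = select.select([rfd], wlist, [])
        if wready:
            # "Writable" guarantees at least PIPE_BUF (>= 512)
            # bytes fit, so a small write cannot block.
            n = os.write(wfd, pending[:512])
            pending = pending[n:]
            if not pending:
                proc.stdin.close()
                done_writing = True
        if rready:
            chunk = os.read(rfd, 8192)
            if not chunk:
                break
            out.append(chunk)
    proc.wait()
    return b"".join(out)
```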

    -Justin

Discussion Overview
group: python-list
categories: python
posted: Oct 17, '99 at 12:47a
active: Oct 27, '99 at 7:30p
posts: 12
users: 6
website: python.org
