Dear Python dev community,

I'm CTO at a small software company that makes music visualization
software (you can check us out at www.soundspectrum.com). About two
years ago we made the decision to use embedded python in a couple of
our new products, given all the great things about python. We were
close to using lua but for various reasons we decided to go with
python. However, over the last two years, there's been one area of
grief that sometimes makes me think twice about our decision to go
with python...

Some background first... Our software is used for entertainment and
centers around real time, high-performance graphics, so python's
performance, embedded flexibility, and stability are the most
important issues for us. Our software targets a large cross section
of hardware and we currently ship products for Win32, OS X, and the
iPhone and since our customers are end users, our products have to be
robust, have a tidy install footprint, and be foolproof. Basically,
we use embedded python and use it to wrap our high performance C++
class set which wraps OpenGL, DirectX and our own software renderer.
In addition to wrapping our C++ frameworks, we use python to perform
various "worker" tasks on worker thread (e.g. image loading and
processing). However, we require *true* thread/interpreter
independence so python 2 has been frustrating at time, to say the
least. Please don't start with "but really, python supports multiple
interpreters" because I've been there many many times with people.
And, yes, I'm aware of the multiprocessing module added in 2.6, but
that stuff isn't lightweight and isn't suitable at all for many
environments (including ours). The bottom line is that if you want to
perform independent processing (in python) on different threads, using
the machine's multiple cores to the fullest, then you're out of luck
under python 2.

Sadly, the only way we could get truly independent interpreters was to
put python in a dynamic library, have our installer make a *duplicate*
copy of it during the installation process (e.g. python.dll/.bundle ->
python2.dll/.bundle) and load each one explicitly in our app, so we
can get truly independent interpreters. In other words, we load a
fresh dynamic lib for each thread-independent interpreter (you can't
reuse the same dynamic library because the OS will just reference the
already-loaded one).
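
To make the trick concrete, here is a minimal Win32 sketch of the idea
(illustrative only -- the name of the renamed copy is hypothetical, and
error handling is omitted):

    #include <windows.h>

    typedef void (*Py_Initialize_t)(void);
    typedef int  (*PyRun_SimpleString_t)(const char *);

    void start_private_interpreter(void)
    {
        /* Loading a *renamed copy* keeps the OS loader from handing back
           the already-loaded python dll, so this call site gets a fresh,
           fully independent interpreter. */
        HMODULE h = LoadLibraryA("python25_2.dll");  /* hypothetical copy made by the installer */
        Py_Initialize_t      init = (Py_Initialize_t)GetProcAddress(h, "Py_Initialize");
        PyRun_SimpleString_t run  = (PyRun_SimpleString_t)GetProcAddress(h, "PyRun_SimpleString");

        init();
        run("print 'hello from a private interpreter'");
    }
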
From what I gather from the python community, the basis for not
offering "real" muti-threaded support is that it'd add to much
internal overhead--and I couldn't agree more. As a high performance C
and C++ guy, I fully agree that thread safety should be at the high
level, not at the low level. BUT, the lack of truly independent
interpreters is what ultimately prevents using python in cool,
powerful ways. This shortcoming alone has caused game developers--
both large and small--to choose other embedded interpreters over
python (e.g. Blizzard chose lua over python). For example, Apple's
QuickTime API is powerful in that high-level instance objects can
leverage performance gains associated with multi-threaded processing.
Meanwhile, the QuickTime API simply lists the responsibilities of the
caller regarding thread safety and that's all it needs to do. In
other words, CPython doesn't need to step in and provide a threadsafe
environment; it just needs to establish the rules and make sure that
its own implementation supports those rules.

More than once, I had actually considered expending company resources
to develop a high performance, truly independent interpreter
implementation of the python core language and modules but in the end
estimated that the size of that project would just be too much, given
our company's current resources. Should such an implementation ever
be developed, it would be very attractive for companies to support,
fund, and/or license. The truth is, we just love python as a
language, but its lack of true interpreter independence (in an
interpreter as well as in a thread sense) remains a *huge* liability.

So, my question becomes: is python 3 ready for true multithreaded
support?? Can we finally abandon our Frankenstein approach of loading
multiple identical dynamic libs to achieve truly independent
interpreters?? I've reviewed all the new python 3 C API module stuff,
and all I have to say is: whew--better late than never!! So, although
that solves modules offering truly independent interpreter support,
the following questions remain:

- In python 3, the C module API now supports true interpreter
independence, but have all the modules in the python codebase been
converted over? Are they all now truly compliant? It will only take
a single static/global state variable in a module to potentially cause
no end of pain in a multiple interpreter environment! Yikes!

- How close is python 3 really to true multithreaded use? The
assumption here is that the caller ensures safety (e.g. ensuring that
neither interpreter is in use when serializing data from one to
another).

I believe that true python independent thread/interpreter support is
paramount and should become the top priority because this is the key
consideration used by developers when they're deciding which
interpreter to embed in their app. Until there's a hello world that
demonstrates running independent python interpreters on multiple app
threads, lua will remain the clear choice over python. Python 3 needs
true interpreter independence and multi-threaded support!


Thanks,
Andy O'Meara

  • Thomas Heller at Oct 22, 2008 at 5:45 pm

    Andy schrieb:
    Dear Python dev community,

    [...] Basically,
    we use embedded python and use it to wrap our high performance C++
    class set which wraps OpenGL, DirectX and our own software renderer.
    In addition to wrapping our C++ frameworks, we use python to perform
    various "worker" tasks on worker thread (e.g. image loading and
    processing). However, we require *true* thread/interpreter
    independence so python 2 has been frustrating at time, to say the
    least. [...]
    Sadly, the only way we could get truly independent interpreters was to
    put python in a dynamic library, have our installer make a *duplicate*
    copy of it during the installation process (e.g. python.dll/.bundle ->
    python2.dll/.bundle) and load each one explicitly in our app, so we
    can get truly independent interpreters. In other words, we load a
    fresh dynamic lib for each thread-independent interpreter (you can't
    reuse the same dynamic library because the OS will just reference the
    already-loaded one).
    Interesting questions you ask.

    A random note: py2exe also does something similar for executables built
    with the 'bundle = 1' option. The python.dll and .pyd extension modules
    in this case are not loaded into the process in the 'normal' way (with
    some kind of windows LoadLibrary() call); instead they are loaded by code
    in py2exe that /emulates/ LoadLibrary - the code segments are loaded into
    memory, fixups are made for imported functions, and the segments are
    marked executable.

    The result is that separate COM objects implemented as Python modules and
    converted into separate dlls by py2exe do not share their interpreters even
    if they are running in the same process. Of course this only works on windows.
    In effect this is similar to using /statically/ linked python interpreters
    in separate dlls. Can't you do something like that?
    So, my question becomes: is python 3 ready for true multithreaded
    support?? Can we finally abandon our Frankenstein approach of loading
    multiple identical dynamic libs to achieve truly independent
    interpreters?? I've reviewed all the new python 3 C API module stuff,
    and all I have to say is: whew--better late than never!! So, although
    that solves modules offering truly independent interpreter support,
    the following questions remain:

    - In python 3, the C module API now supports true interpreter
    independence, but have all the modules in the python codebase been
    converted over? Are they all now truly compliant? It will only take
    a single static/global state variable in a module to potentially cause
    no end of pain in a multiple interpreter environment! Yikes!
    I don't think this is the case (currently). But you could submit patches
    to Python so that at least the 'official' modules (builtin and extensions)
    would behave correctly in the case of multiple interpreters. At least
    this is a much lighter task than writing your own GIL-less interpreter.

    My 2 cents,

    Thomas
  • Andy at Oct 22, 2008 at 6:45 pm
    Hi Thomas -

    I appreciate your thoughts and time on this subject.
    The result is that separate COM objects implemented as Python modules and
    converted into separate dlls by py2exe do not share their interpreters even
    if they are running in the same process. Of course this only works on windows.
    In effect this is similar to using /statically/ linked python interpreters
    in separate dlls. Can't you do something like that?
    You're definitely correct that homebrew loading and linking would do
    the trick. However, because our python stuff makes callbacks into our
    C/C++, that complicates the linking process (if I understand you
    correctly). Also, then there's the problem of OS X.

    - In python 3, the C module API now supports true interpreter
    independence, but have all the modules in the python codebase been
    converted over? Are they all now truly compliant? It will only take
    a single static/global state variable in a module to potentially cause
    no end of pain in a multiple interpreter environment! Yikes!
    I don't think this is the case (currently). But you could submit patches
    to Python so that at least the 'official' modules (builtin and extensions)
    would behave correctly in the case of multiple interpreters. At least
    this is a much lighter task than writing your own GIL-less interpreter.
    I agree -- and I've been considering that (or rather, having our
    company hire/pay part of the python dev community to do the work). To
    consider that, the question becomes, how many modules are we talking
    about do you think? 10? 100? I confess that I'm not familiar enough
    with the full C python suite to have a good idea of how much work
    we're talking about here.

    Regards,
    Andy
  • Terry Reedy at Oct 22, 2008 at 9:15 pm

    Andy wrote:

    I agree -- and I've been considering that (or rather, having our
    company hire/pay part of the python dev community to do the work). To
    consider that, the question becomes, how many modules are we talking
    about do you think? 10? 100?
    In your Python directory, everything in Lib is Python, I believe.
    Everything in DLLs is compiled C extensions. I see about 15 for Windows
    3.0. These reflect two separate directories in the source tree. Builtin
    classes are part of pythonxx.dll in the main directory. I have no idea
    if things such as lists (from listobject.c), for instance, are a
    potential problem for you.

    You could start with the module of most interest to you, or perhaps a
    small one, and see if it needs patching (from your viewpoint) and how
    much effort it would take to meet your needs.

    Terry Jan Reedy
  • Stefan Behnel at Oct 24, 2008 at 2:19 pm

    Terry Reedy wrote:
    Everything in DLLs is compiled C extensions. I see about 15 for Windows
    3.0.
    Ah, weren't those wonderful times back in the days of Win3.0, when DLL-hell was
    inhabited by only 15 libraries? *sigh*

    ... although ... wait, didn't Win3.0 have more than that already? Maybe you
    meant Windows 1.0?

    SCNR-ly,

    Stefan
  • Terry Reedy at Oct 24, 2008 at 4:09 pm

    Stefan Behnel wrote:
    Terry Reedy wrote:
    Everything in DLLs is compiled C extensions. I see about 15 for Windows
    3.0.
    Ah, weren't those wonderful times back in the days of Win3.0, when DLL-hell was
    inhabited by only 15 libraries? *sigh*

    ... although ... wait, didn't Win3.0 have more than that already? Maybe you
    meant Windows 1.0?

    SCNR-ly,
    Is that the equivalent of a smiley? Or did you really not understand
    what I wrote?
  • Sturlamolden at Oct 24, 2008 at 1:35 pm
    Instead of "appdomains" (one interpreter per thread), or free
    threading, you could use multiple processes. Take a look at the new
    multiprocessing module in Python 2.6. It has roughly the same
    interface as Python's threading and queue modules, but uses processes
    instead of threads. Processes are scheduled independently by the
    operating system. The objects in the multiprocessing module also tend
    to have much better performance than their threading and queue
    counterparts. If you have a problem with threads due to the GIL, the
    multiprocessing module will most likely take care of it.

    There is a fundamental problem with using homebrew loading of multiple
    (but renamed) copies of PythonXX.dll that is easily overlooked. That
    is, extension modules (.pyd) are DLLs as well. Even if required by two
    interpreters, they will only be loaded into the process image once.
    Thus you have to rename all of them as well, or you will get havoc
    with refcounts. Not to speak of what will happen if a Windows HANDLE
    is closed by one interpreter while still needed by another. It is
    almost guaranteed to bite you, sooner or later.

    There are other options as well:

    - Use IronPython. It does not have a GIL.

    - Use Jython. It does not have a GIL.

    - Use pywin32 to create isolated outproc COM servers in Python. (I'm
    not sure what the effect of inproc servers would be.)

    - Use os.fork() if your platform supports it (Linux, Unix, Apple,
    Cygwin, Windows Vista SUA). This is the standard posix way of doing
    multiprocessing. It is almost unbeatable if you have a fast copy-on-
    write implementation of fork (that is, all platforms except Cygwin).
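
    For what it's worth, a minimal sketch of the fork approach from an
    embedding C program (POSIX only; illustrative, with error handling
    omitted):

        #include <Python.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
            pid_t pid = fork();          /* copy-on-write duplicate of the process         */
            if (pid == 0) {
                Py_Initialize();         /* child: its own, fully independent interpreter  */
                PyRun_SimpleString("print('hello from the child process')");
                Py_Finalize();
                _exit(0);
            }
            Py_Initialize();             /* parent: runs its own interpreter in parallel   */
            PyRun_SimpleString("print('hello from the parent process')");
            Py_Finalize();
            waitpid(pid, NULL, 0);
            return 0;
        }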
  • Andy O'Meara at Oct 24, 2008 at 1:58 pm

    On Oct 24, 9:35 am, sturlamolden wrote:
    Instead of "appdomains" (one interpreter per thread), or free
    threading, you could use multiple processes. Take a look at the new
    multiprocessing module in Python 2.6.
    That's mentioned earlier in the thread.
    There is a fundamental problem with using homebrew loading of multiple
    (but renamed) copies of PythonXX.dll that is easily overlooked. That
    is, extension modules (.pyd) are DLLs as well.
    Tell me about it--there's all kinds of problems and maintenance
    liabilities with our approach. That's why I'm here talking about this
    stuff.
    There are other options as well:

    - Use IronPython. It does not have a GIL.

    - Use Jython. It does not have a GIL.

    - Use pywin32 to create isolated outproc COM servers in Python. (I'm
    not sure what the effect of inproc servers would be.)

    - Use os.fork() if your platform supports it (Linux, Unix, Apple,
    Cygwin, Windows Vista SUA). This is the standard posix way of doing
    multiprocessing. It is almost unbeatable if you have a fast copy-on-
    write implementation of fork (that is, all platforms except Cygwin).
    This is discussed earlier in the thread--they're unfortunately all
    out.
  • Sturlamolden at Oct 24, 2008 at 2:32 pm

    On Oct 24, 3:58 pm, "Andy O'Meara" wrote:

    This is discussed earlier in the thread--they're unfortunately all
    out.
    It occurs to me that tcl is doing what you want. Have you ever thought
    of not using Python?

    That aside, the fundamental problem is what I perceive as a fundamental
    design flaw in Python's C API. In Java JNI, each function takes a
    JNIEnv* pointer as its first argument. There is nothing that
    prevents you from embedding several JVMs in a process. Python can
    create embedded subinterpreters, but it works differently. It swaps
    subinterpreters like a finite state machine: only one is concurrently
    active, and the GIL is shared. The approach is fine, except it kills
    free threading of subinterpreters. The argument seems to be that
    Apache's mod_python somehow depends on it (for reasons I don't
    understand).
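
    For concreteness, a minimal sketch of that swapping model as it exists
    today (Python already initialized, caller holds the GIL; illustrative
    only):

        #include <Python.h>

        void run_in_subinterpreter(const char *code)
        {
            PyThreadState *main_ts = PyThreadState_Get();
            PyThreadState *sub = Py_NewInterpreter();  /* create a subinterpreter and swap to it */
            PyRun_SimpleString(code);                  /* still serialized by the *shared* GIL   */
            Py_EndInterpreter(sub);                    /* destroy it; no thread state is current */
            PyThreadState_Swap(main_ts);               /* swap back to the main interpreter      */
        }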
  • Andy O'Meara at Oct 24, 2008 at 2:58 pm

    That aside, the fundamental problem is what I perceive as a fundamental
    design flaw in Python's C API. In Java JNI, each function takes a
    JNIEnv* pointer as its first argument. There is nothing that
    prevents you from embedding several JVMs in a process. Python can
    create embedded subinterpreters, but it works differently. It swaps
    subinterpreters like a finite state machine: only one is concurrently
    active, and the GIL is shared.
    Bingo, it seems that you've hit it right on the head there. Sadly,
    that's why I regard this thread as largely futile (but I'm an optimist
    when it comes to cool software communities so here I am). I've been
    afraid to say it for fear of getting mauled by everyone here, but I
    would definitely agree that if there was a context (i.e. environment)
    object passed around, then perhaps we'd have the best of all worlds.
    *winces*

    This is discussed earlier in the thread--they're unfortunately all
    out.
    It occurs to me that tcl is doing what you want. Have you ever thought
    of not using Python?
    Bingo again. Our research says that the options are tcl, perl
    (although it's generally untested and not recommended by the
    community--definitely dealbreakers for a commercial user like us), and
    lua. Also, I'd rather saw off my own right arm than adopt perl, so
    that's out. :^)

    As I mentioned, we're looking to either (1) support a python dev
    community effort, (2) make our own high-performance python interpreter
    (that uses an env object as you described), or (3) drop python and go
    to lua. I'm favoring them in the order I list them, but the more I
    discuss the issue with folks here, the more people seem to be
    unfortunately very divided on (1).

    Andy
  • Greg at Oct 25, 2008 at 5:26 am

    Andy O'Meara wrote:

    I would definitely agree that if there was a context (i.e. environment)
    object passed around, then perhaps we'd have the best of all worlds.
    Moreover, I think this is probably the *only* way that
    totally independent interpreters could be realized.

    Converting the whole C API to use this strategy would be
    a very big project. Also, on the face of it, it seems like
    it would render all existing C extension code obsolete,
    although it might be possible to do something clever with
    macros to create a compatibility layer.

    Another thing to consider is that passing all these extra
    pointers around everywhere is bound to have some effect
    on performance. The idea mightn't go down too well if it
    slows things significantly in the case where you're only
    using one interpreter.

    --
    Greg
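
    Purely as a hypothetical illustration of that context-passing idea
    (none of these names exist in CPython today), such an API, plus the
    macro-based compatibility layer mentioned above, might look like:

        /* hypothetical -- no such API exists in CPython today */
        void example(void)
        {
            PyEnv *env = PyEnv_New();                    /* one fully isolated interpreter           */
            PyObject *n = PyEnv_Long_FromLong(env, 42);  /* every call names the interpreter it uses */
            PyEnv_Run_SimpleString(env, "print(42)");
            PyEnv_DecRef(env, n);
            PyEnv_Free(env);
        }

        /* a compatibility shim for existing extensions could, in principle,
           hide the extra argument behind a macro bound to a current-env pointer: */
        #define PyLong_FromLong(v) PyEnv_Long_FromLong(_py_current_env, (v))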
  • Martin v. Löwis at Oct 22, 2008 at 6:14 pm

    - In python 3, the C module API now supports true interpreter
    independence, but have all the modules in the python codebase been
    converted over?
    No, none of them.
    Are they all now truly compliant? It will only take
    a single static/global state variable in a module to potentially cause
    no end of pain in a multiple interpreter environment! Yikes!
    So you will have to suffer pain.
    - How close is python 3 really to true multithreaded use?
    Python is as thread-safe as ever (i.e. completely thread-safe).
    I believe that true python independent thread/interpreter support is
    paramount and should become the top priority because this is the key
    consideration used by developers when they're deciding which
    interpreter to embed in their app. Until there's a hello world that
    demonstrates running independent python interpreters on multiple app
    threads, lua will remain the clear choice over python. Python 3 needs
    true interpreter independence and multi-threaded support!
    So what patches to achieve that goal have you contributed so far?

    In open source, pleas have nearly zero effect; code contributions are
    what have effect.

    I don't think any of the current committers has a significant interest
    in supporting multiple interpreters (and I say that as the one who wrote
    and implemented PEP 3121). To make a significant change, you need to
    start with a PEP, offer to implement it once accepted, and offer to
    maintain the feature for five years.

    Regards,
    Martin
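
    For readers who haven't met PEP 3121: the pattern it enables is to keep
    a module's state in the module object instead of in C statics. A
    minimal sketch (module and field names are made up for illustration):

        #include <Python.h>

        typedef struct {
            PyObject *cached_value;        /* per-module state, not a C static */
        } module_state;

        static struct PyModuleDef examplemodule = {
            PyModuleDef_HEAD_INIT,
            "example",                     /* module name                      */
            NULL,                          /* docstring                        */
            sizeof(module_state),          /* size of the per-module state     */
            NULL, NULL, NULL, NULL, NULL   /* methods, reload, traverse, clear, free */
        };

        PyMODINIT_FUNC
        PyInit_example(void)
        {
            PyObject *m = PyModule_Create(&examplemodule);
            if (m != NULL) {
                module_state *st = (module_state *)PyModule_GetState(m);
                st->cached_value = NULL;   /* each interpreter's copy gets its own state */
            }
            return m;
        }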
  • Andy at Oct 22, 2008 at 7:26 pm

    - In python 3, the C module API now supports true interpreter
    independence, but have all the modules in the python codebase been
    converted over?
    No, none of them.
    :^)
    - How close is python 3 really to true multithreaded use?
    Python is as thread-safe as ever (i.e. completely thread-safe).
    If you're referring to the fact that the GIL does that, then you're
    certainly correct. But if you've got multiple CPUs/cores and actually
    want to use them, that GIL means you might as well forget about them.
    So please take my use of "true multithreaded" to mean "turning off"
    the GIL and pushing the responsibility of object safety to the client/API
    level (such as in my QuickTime API example).
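
    (For contrast, the mechanism that exists today only lets pure C work
    run in parallel while the GIL is released around it; Python code itself
    still serializes. A minimal sketch of that pattern in an extension
    function -- decode_image() is a hypothetical stand-in for the C work:)

        #include <Python.h>

        extern void decode_image(const char *path);  /* hypothetical pure-C helper */

        static PyObject *
        load_image(PyObject *self, PyObject *args)
        {
            const char *path;
            if (!PyArg_ParseTuple(args, "s", &path))
                return NULL;

            Py_BEGIN_ALLOW_THREADS       /* release the GIL: other Python threads may run */
            decode_image(path);          /* must not touch any Python objects here        */
            Py_END_ALLOW_THREADS         /* re-acquire the GIL before returning to Python */

            Py_RETURN_NONE;
        }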

    I believe that true python independent thread/interpreter support is
    paramount and should become the top priority because this is the key
    consideration used by developers when they're deciding which
    interpreter to embed in their app. Until there's a hello world that
    demonstrates running independent python interpreters on multiple app
    threads, lua will remain the clear choice over python. Python 3 needs
    true interpreter independence and multi-threaded support!
    So what patches to achieve that goal have you contributed so far?

    In open source, pleas have nearly zero effect; code contributions are
    what have effect.
    This is just my second email, please be a little patient. :^) But
    more seriously, I do represent a company ready, able, and willing to
    fund the development of features that we're looking for, so please
    understand that I'm definitely not coming to the table empty-handed
    here.

    I don't think any of the current committers has a significant interest
    in supporting multiple interpreters (and I say that as the one who wrote
    and implemented PEP 3121). To make a significant change, you need to
    start with a PEP, offer to implement it once accepted, and offer to
    maintain the feature for five years.
    Nice to meet you! :^) Seriously though, thank you for all your work on
    3121 and taking the initiative with it! It's definitely the first
    step toward what attracts companies like ours to embedding an
    interpreted language: unrestricted interpreter and thread-
    independent use.

    I would *love* for our company to be 10 times larger and be able to
    add another zero to what we'd be able to hire/offer the python dev
    community for work that we're looking for, but we unfortunately have
    limits at the moment. And I would love to see python become the
    leading choice when companies look to use an embedded interpreter, and
    I offer my comments here to paint a picture of what can make python
    more appealing to commercial software developers. Hopefully, the
    python dev community doesn't underestimate the dev funding that could
    potentially come in from companies if python grew in certain ways!

    So, that said, I represent a company willing to fund the development
    of features that move python towards thread-independent operation. No
    software engineer can deny that we're entering a new era of
    multithreaded processing where support frameworks (such as python)
    need to be open minded with how they're used in a multi-threaded
    environment--that's all I'm saying here.

    Anyway, I can definitely tell you and anyone else interested that
    we're willing to put our money where our wish-list is. As I mentioned
    in my previous post to Thomas, the next step is to get an
    understanding of the options available that will satisfy our needs.
    We have a budget for this, but it's not astronomical (it's driven by
    the cost associated with dropping python and going with lua--or,
    making our own pared-down interpreter implementation). Please let me
    be clear--I love python (as a language) and I don't want to switch.
    BUT, we have to be able to run interpreters in different threads (and
    get unhindered/full CPU core performance--ie. no GIL).

    Thoughts? Also, please feel free to email me off-list if you prefer.

    Oh, while I'm at it, if anyone in the python dev community (or anyone
    that has put real work into python) is interested in our software,
    email me and I'll hook you up with a complimentary copy of the
    products that use python (music visuals for iTunes and WMP).

    Regards,
    Andy
  • Martin v. Löwis at Oct 22, 2008 at 7:55 pm

    I would *love* for our company to be 10 times larger and be able to
    add another zero to what we'd be able to hire/offer the python dev
    community for work that we're looking for, but we unfortunately have
    limits at the moment.
    There is another thing about open source that you need to consider:
    you don't have to do it all on your own.

    It needs somebody to take the lead, start a project, define a plan,
    and small steps to approach it. If it's really something that the
    community desperately needs, and if you make it clear that you will
    just lead, but get nowhere without contributions, then the
    contributions will come in.

    If there won't be any contributions, then the itch in the
    community isn't that strong that it needs scratching.

    Regards,
    Martin
  • Terry Reedy at Oct 22, 2008 at 9:34 pm

    Andy wrote:

    This is just my second email, please be a little patient. :^)
    As a 10-year veteran, I welcome new contributors with new viewpoints and
    information.
    more appealing to commercial software developers. Hopefully, the
    python dev community doesn't underestimate the dev funding that could
    potentially come in from companies if python grew in certain ways!
    This seems to be something of a chicken-and-egg problem.
    So, that said, I represent a company willing to fund the development
    of features that move python towards thread-independent operation.
    Perhaps you know of and can persuade other companies to contribute to
    such focused effort.
    No
    software engineer can deny that we're entering a new era of
    multithreaded processing where support frameworks (such as python)
    need to be open minded with how they're used in a multi-threaded
    environment--that's all I'm saying here.
    The *current* developers seem to be more interested in exploiting
    multiple processors with multiprocessing. Note that Google chose that
    route for Chrome (as I understood their comic introduction). 2.6 and 3.0
    come with a new multiprocessing module that mimics the threading module
    api fairly closely. It is now being backported to run with 2.5 and 2.4.

    Advances in multithreading will probably require new ideas and
    development energy.

    Terry Jan Reedy
  • Jesse Noller at Oct 22, 2008 at 9:49 pm

    On Wed, Oct 22, 2008 at 5:34 PM, Terry Reedy wrote:
    The *current* developers seem to be more interested in exploiting multiple
    processors with multiprocessing. Note that Google chose that route for
    Chrome (as I understood their comic introduction). 2.6 and 3.0 come with a
    new multiprocessing module that mimics the threading module api fairly
    closely. It is now being backported to run with 2.5 and 2.4.
    That's not exactly correct. Multiprocessing was added to 2.6 and 3.0
    as an *additional* method for parallel/concurrent programming that
    allows you to use multiple cores - however, as I noted in the PEP:

    " In the future, the package might not be as relevant should the
    CPython interpreter enable "true" threading, however for some
    applications, forking an OS process may sometimes be more
    desirable than using lightweight threads, especially on those
    platforms where process creation is fast and optimized."

    Multiprocessing is not a replacement for a "free threading" future
    (ergo my mentioning Adam Olsen's work) - it is a tool in the
    "batteries included" box. I don't want my cheerleading and driving of
    this to somehow imply that the rest of Python-Dev thinks this is
    the "silver bullet" or final answer in concurrency.

    However, a free-threaded python has a lot of implications, and if we
    were to do it, it requires we not only "drop" the GIL - it also
    requires we consider the ramifications of enabling true threading ala
    Java et al - just having "true threads" lying around is great if
    you've spent a ton of time learning locking, avoiding shared data/etc,
    stepping through and cursing poor debugger support for multiple
    threads, etc.

    This is why I've been a fan of Adam's approach - enabling free
    threading via GIL removal is actually secondary to the project's
    stated goal: Enable Safe Threading.

    In any case, I've jumped the rails - let's just say there's room in
    python for multiprocessing, threading and possibly a concurrent
    package ala java.util.concurrent - but it really does have to be
    thought out and done right.

    Speaking of which: If you wanted "real" threads, you could use a
    combination of JCC (http://pypi.python.org/pypi/JCC/) and Jython. :)

    -jesse
  • Jesse Noller at Oct 22, 2008 at 9:21 pm

    On Wed, Oct 22, 2008 at 12:32 PM, Andy wrote:
    And, yes, I'm aware of the multiprocessing module added in 2.6, but
    that stuff isn't lightweight and isn't suitable at all for many
    environments (including ours). The bottom line is that if you want to
    perform independent processing (in python) on different threads, using
    the machine's multiple cores to the fullest, then you're out of luck
    under python 2.
    So, as the guy-on-the-hook for multiprocessing, I'd like to know what
    you might suggest for it to make it more apt for your - and other -
    environments.

    Additionally, have you looked at:
    https://launchpad.net/python-safethread
    http://code.google.com/p/python-safethread/w/list
    (By Adam Olsen)

    -jesse
  • Rhamphoryncus at Oct 22, 2008 at 10:06 pm

    On Oct 22, 10:32 am, Andy wrote:
    Dear Python dev community,

    [...] I believe that true python independent thread/interpreter support is
    paramount and should become the top priority because this is the key
    consideration used by developers when they're deciding which
    interpreter to embed in their app. Until there's a hello world that
    demonstrates running independent python interpreters on multiple app
    threads, lua will remain the clear choice over python. Python 3 needs
    true interpreter independence and multi-threaded support!
    What you describe, truly independent interpreters, is not threading at
    all: it is processes, emulated at the application level, with all the
    memory cost and none of the OS protections. True threading would
    involve sharing most objects.

    Your solution depends on what you need:
    * Killable "threads" -> OS processes
    * multicore usage (GIL removal) -> OS processes or alternative Python
    implementations (PyPy/Jython/IronPython)
    * Sane shared objects -> safethread
  • Andy at Oct 23, 2008 at 1:04 am

    What you describe, truly independent interpreters, is not threading at
    all: it is processes, emulated at the application level, with all the
    memory cost and none of the OS protections. True threading would
    involve sharing most objects.

    Your solution depends on what you need:
    * Killable "threads" -> OS processes
    * multicore usage (GIL removal) -> OS processes or alternative Python
    implementations (PyPy/Jython/IronPython)
    * Sane shared objects -> safethread

    I realize what you're saying, but it's better said there's two issues
    at hand:

    1) Independent interpreters (this is the easier one--and solved, in
    principle anyway, by PEP 3121, by Martin v. Löwis, but is FAR from
    being carried through in modules as he pointed out). As you point
    out, this doesn't directly relate to multi-threading BUT it is
    intimately tied to the issue because if, in principle, every module
    used instance data (rather than static data), then python would be
    WELL on its way to "free threading" (as Jesse Noller calls it), or as
    I was calling it "true multi-threading".

    2) Barriers to "free threading". As Jesse describes, this is simply
    just the GIL being in place, but of course it's there for a reason.
    It's there because (1) doesn't hold and there was never any specs/
    guidance put forward about what should and shouldn't be done in multi-
    threaded apps (see my QuickTime API example). Perhaps if we could go
    back in time, we would not put the GIL in place, strict guidelines
    regarding multithreaded use would have been established, and PEP 3121
    would have been mandatory for C modules. Then again--screw that, if I
    could go back in time, I'd just go for the lottery tickets!! :^)

    Anyway, I've been at this issue for quite a while now (we're
    approaching our 3rd release cycle), so I'm pretty comfortable with the
    principles at hand. I'd say your comments share the theme of others
    here, so perhaps consider where end-user software
    houses (like us) are coming from. Specifically, developing commercial
    software for end users imposes some restrictions that open source
    development communities aren't often as sensitive to, namely:

    - Performance -- emulation is a no-go (e.g. Jython)
    - Maturity and Licensing -- experimental/academic projects are no-go
    (PyPy)
    - Cross platform support -- love it or hate it, Win32 and OS X are all
    that matter when you're talking about selling (and supporting)
    software to the masses. I'm just the messenger here (ie. this is NOT
    flamebait). We publish for OS X, so IronPython is therefore out.

    Basically, our company is at a crossroads where we really need light,
    clean "free threading" as Jesse calls it (e.g. on the iPhone, using
    our python drawing wrapper to do primary drawing while running python
    jobs on another thread doing image decoding and processing). In our
    current iPhone app, we achieve this by using two python bundles
    (dynamic libs) in the way I described in my initial post. Sure, this
    solves our problem, but it's pretty messy, sucks up resources, and has
    been a pain to maintain.

    Moving forward, please understand my posts here are also intended to
    give the CPython dev community a glimpse of the issues that may not be
    as visible to you guys (as they are for dev houses like us). For
    example, it'd be pretty cool if Blizzard went with python instead of
    lua, wouldn't you think? But some of the issues I've raised here no
    doubt factor in to why end-user dev houses ultimately may have to pass
    up python in favor of another interpreted language.

    Bottom line: why give prospective devs any reason to turn down python--
    there's just so many great things about python!

    Regards,
    Andy
  • Rhamphoryncus at Oct 23, 2008 at 2:06 am

    On Oct 22, 7:04 pm, Andy wrote:
    What you describe, truly independent interpreters, is not threading at
    all: it is processes, emulated at the application level, with all the
    memory cost and none of the OS protections. True threading would
    involve sharing most objects.
    Your solution depends on what you need:
    * Killable "threads" -> OS processes
    * multicore usage (GIL removal) -> OS processes or alternative Python
    implementations (PyPy/Jython/IronPython)
    * Sane shared objects -> safethread
    I realize what you're saying, but it's better said there's two issues
    at hand:

    1) Independent interpreters (this is the easier one--and solved, in
    principle anyway, by PEP 3121, by Martin v. Löwis, but is FAR from
    being carried through in modules as he pointed out). As you point
    out, this doesn't directly relate to multi-threading BUT it is
    intimately tied to the issue because if, in principle, every module
    used instance data (rather than static data), then python would be
    WELL on its way to "free threading" (as Jesse Noller calls it), or as
    I was calling it "true multi-threading".
    If you want processes, use *real* processes. Your arguments fail to
    gain traction because you don't provide a good, justified reason why
    they don't and can't work.

    Although isolated interpreters would be convenient to you, it's a
    specialized use case, and bad language design. There's far more use
    cases that aren't isolated (actual threading), so why exclude them?

    2) Barriers to "free threading". ?As Jesse describes, this is simply
    just the GIL being in place, but of course it's there for a reason.
    It's there because (1) doesn't hold and there was never any specs/
    guidance put forward about what should and shouldn't be done in multi-
    threaded apps (see my QuickTime API example). ?Perhaps if we could go
    back in time, we would not put the GIL in place, strict guidelines
    regarding multithreaded use would have been established, and PEP 3121
    would have been mandatory for C modules. ?Then again--screw that, if I
    could go back in time, I'd just go for the lottery tickets!! :^)
    You seem confused. PEP 3121 is for isolated interpreters (ie emulated
    processes), not threading.

    Getting threading right would have been a massive investment even back
    then, and we probably wouldn't have as mature a Python as we do
    today. Make no mistake, the GIL has substantial benefits. It may be
    old and tired, surrounded by young bucks, but it's still winning most
    of the races.

    Anyway, I've been at this issue for quite a while now (we're
    approaching our 3rd release cycle), so I'm pretty comfortable with the
    principles at hand. I'd say your comments share the theme of others
    here, so perhaps consider where end-user software
    houses (like us) are coming from. Specifically, developing commercial
    software for end users imposes some restrictions that open source
    development communities aren't often as sensitive to, namely:

    - Performance -- emulation is a no-go (e.g. Jython)
    Got some real benchmarks to back that up? How about testing it on a
    16 core (or more) box and seeing how it scales?

    - Maturity and Licensing -- experimental/academic projects are no-go
    (PyPy)
    - Cross platform support -- love it or hate it, Win32 and OS X are all
    that matter when you're talking about selling (and supporting)
    software to the masses. I'm just the messenger here (ie. this is NOT
    flamebait). We publish for OS X, so IronPython is therefore out.
    You might be able to use Java on one, IronPython on another, and PyPy
    in between. Regardless, my point is that CPython will *never* remove
    the GIL. It cannot be done in an effective, highly scalable fashion
    without a total rewrite.

    Basically, our company is at a crossroads where we really need light,
    clean "free threading" as Jesse calls it (e.g. on the iPhone, using
    our python drawing wrapper to do primary drawing while running python
    jobs on another thread doing image decoding and processing). In our
    current iPhone app, we achieve this by using two python bundles
    (dynamic libs) in the way I described in my initial post. Sure, this
    solves our problem, but it's pretty messy, sucks up resources, and has
    been a pain to maintain.
    Is the iPhone multicore, or is it an issue of fairness (ie a soft
    realtime app)?

    Moving forward, please understand my posts here are also intended to
    give the CPython dev community a glimpse of the issues that may not be
    as visible to you guys (as they are for dev houses like us). For
    example, it'd be pretty cool if Blizzard went with python instead of
    lua, wouldn't you think? But some of the issues I've raised here no
    doubt factor in to why end-user dev houses ultimately may have to pass
    up python in favor of another interpreted language.

    Bottom line: why give prospective devs any reason to turn down python--
    there's just so many great things about python!
    I'd like to see python used more, but fixing these things properly is
    not as easy as believed. Those in the user community see only their
    immediate problem (threads don't use multicore). People like me see
    much bigger problems. We need consensus on the problems, and how to
    solve them, and a commitment to invest what's required.
  • Andy at Oct 23, 2008 at 4:31 am

    You seem confused. PEP 3121 is for isolated interpreters (ie emulated
    processes), not threading.
    Please reread my points--inherently isolated interpreters (ie. the top
    level object) are indirectly linked to thread independence. I don't
    want to argue, but you seem hell-bent on not hearing what I'm trying
    to say here.
    Got some real benchmarks to back that up? How about testing it on a
    16 core (or more) box and seeing how it scales?
    I don't care to argue with you, and you'll have to take it on faith
    that I'm not spouting hot air. But just to put this to rest, I'll
    make it clear in this Jython case:

    You can't sell software to end users and expect them to have a recent,
    working java distro. Look around you: no real commercial software
    title that sells to soccer moms and gamers uses java. There's method
    to commercial software production, so please don't presume that you
    know my job, product line, and customers better than me, ok?

    Just to put things in perspective, I already have exposed my company
    to more support and design liability than I knew I was getting into by
    going with python (as a result of all this thread safety and
    interpreter independence business). I'd love to go into that one, but
    it's frankly just not a good use of my time right now. Please just
    accept that when someone says an option is a deal breaker, then it's a
    deal breaker. This isn't some dude's masters thesis project here--we
    pay our RENT and put our KIDS through school because we sell and ship
    software that works and keeps people entertained.
    I'd like to see python used more, but fixing these things properly is
    not as easy as believed. Those in the user community see only their
    immediate problem (threads don't use multicore). People like me see
    much bigger problems. We need consensus on the problems, and how to
    solve them, and a commitment to invest what's required.
    Well, you seem to come down pretty hard on people that are at your
    doorstep saying they're WILLING and INTERESTED in supporting python
    development. And, you're exactly right: users see only their
    immediate problem--but that's the definition of being a user. If
    users saw the whole picture from the dev side, then they'd be
    developers, not users.

    Please consider that you're representing the python dev community
    here; I'm your friend here, not your enemy.

    Andy
  • Rhamphoryncus at Oct 23, 2008 at 6:51 am

    On Oct 22, 10:31 pm, Andy wrote:
    You seem confused. PEP 3121 is for isolated interpreters (ie emulated
    processes), not threading.
    Please reread my points--inherently isolated interpreters (ie. the top
    level object) are indirectly linked to thread independence. I don't
    want to argue, but you seem hell-bent on not hearing what I'm trying
    to say here.
    I think the confusion is a matter of context. Your app, written in C
    or some other non-python language, shares data between the threads and
    thus treats them as real threads. However, from python's perspective
    nothing is shared, and thus it is processes.

    Although this contradiction is fine for embedding purposes, python is
    a general purpose language, and needs to be capable of directly
    sharing objects. Imagine you wanted to rewrite the bulk of your app
    in python, with only a relatively small portion left in a C extension
    module.

    Got some real benchmarks to back that up? How about testing it on a
    16 core (or more) box and seeing how it scales?
    I don't care to argue with you, and you'll have to take it on faith
    that I'm not spouting hot air. But just to put this to rest, I'll
    make it clear in this Jython case:

    You can't sell software to end users and expect them to have a recent,
    working java distro. Look around you: no real commercial software
    title that sells to soccer moms and gamers uses java. There's method
    to commercial software production, so please don't presume that you
    know my job, product line, and customers better than me, ok?

    Just to put things in perspective, I already have exposed my company
    to more support and design liability than I knew I was getting into by
    going with python (as a result of all this thread safety and
    interpreter independence business). I'd love to go into that one, but
    it's frankly just not a good use of my time right now. Please just
    accept that when someone says an option is a deal breaker, then it's a
    deal breaker. This isn't some dude's masters thesis project here--we
    pay our RENT and put our KIDS through school because we sell and ship
    software that works and keeps people entertained.
    Consider it accepted. I understand that PyPy/Jython/IronPython don't
    fit your needs. Likewise though, CPython cannot fit my needs. What
    we both need simply does not exist today.

    I'd like to see python used more, but fixing these things properly is
    not as easy as believed. Those in the user community see only their
    immediate problem (threads don't use multicore). People like me see
    much bigger problems. We need consensus on the problems, and how to
    solve them, and a commitment to invest what's required.
    Well, you seem to come down pretty hard on people that are at your
    doorstep saying they're WILLING and INTERESTED in supporting python
    development. And, you're exactly right: users see only their
    immediate problem--but that's the definition of being a user. If
    users saw the whole picture from the dev side, then they'd be
    developers, not users.

    Please consider that you're representing the python dev community
    here; I'm your friend here, not your enemy.
    I'm sorry if I came across harshly. My intent was merely to push you
    towards supporting long-term solutions, rather than short-term ones.
  • Martin v. Löwis at Oct 24, 2008 at 7:07 am

    You seem confused. PEP 3121 is for isolated interpreters (ie emulated
    processes), not threading.
    Just a small remark: this wasn't the primary objective of the PEP.
    The primary objective was to support module cleanup in a reliable
    manner, to eventually allow modules to be garbage-collected properly.
    However, I also kept the isolated interpreters feature in mind there.

    Regards,
    Martin
  • Christian Heimes at Oct 23, 2008 at 7:24 am

    Andy wrote:
    2) Barriers to "free threading". As Jesse describes, this is simply
    just the GIL being in place, but of course it's there for a reason.
    It's there because (1) doesn't hold and there was never any specs/
    guidance put forward about what should and shouldn't be done in multi-
    threaded apps (see my QuickTime API example). Perhaps if we could go
    back in time, we would not put the GIL in place, strict guidelines
    regarding multithreaded use would have been established, and PEP 3121
    would have been mandatory for C modules. Then again--screw that, if I
    could go back in time, I'd just go for the lottery tickets!! :^)
    I'm very - not absolutely, but very - sure that Guido and the initial
    designers of Python would have added the GIL anyway. The GIL makes
    Python faster on single core machines and more stable on multi core
    machines. Other language designers think the same way. Ruby recently got
    a GIL. The article
    http://www.infoq.com/news/2007/05/ruby-threading-futures explains the
    rationales for a GIL in Ruby. The article also holds a quote from Guido
    about threading in general.

    Several people inside and outside the Python community think that
    threads are dangerous and don't scale. The paper
    http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf sums it
    up nicely. It explains why modern processors are going to cause more and
    more trouble with the Java approach to threads, too.

    Python *must* gain means of concurrent execution of CPU bound code
    eventually to survive on the market. But it must get the right means or
    we are going to suffer the consequences.

    Christian
  • Rhamphoryncus at Oct 23, 2008 at 9:24 pm

    On Oct 23, 11:30 am, Glenn Linderman wrote:
    On approximately 10/23/2008 12:24 AM, came the following characters from
    the keyboard of Christian Heimes:
    Andy wrote:
    2) Barriers to "free threading". ?As Jesse describes, this is simply
    just the GIL being in place, but of course it's there for a reason.
    It's there because (1) doesn't hold and there was never any specs/
    guidance put forward about what should and shouldn't be done in multi-
    threaded apps (see my QuickTime API example). ?Perhaps if we could go
    back in time, we would not put the GIL in place, strict guidelines
    regarding multithreaded use would have been established, and PEP 3121
    would have been mandatory for C modules. ?Then again--screw that, if I
    could go back in time, I'd just go for the lottery tickets!! :^)
    I've been following this discussion with interest, as it certainly seems
    that multi-core/multi-CPU machines are the coming thing, and many
    applications will need to figure out how to use them effectively.
    I'm very - not absolutely, but very - sure that Guido and the initial
    designers of Python would have added the GIL anyway. The GIL makes
    Python faster on single core machines and more stable on multi core
    machines. Other language designers think the same way. Ruby recently
    got a GIL. The article
    http://www.infoq.com/news/2007/05/ruby-threading-futures explains the
    rationales for a GIL in Ruby. The article also holds a quote from
    Guido about threading in general.
    Several people inside and outside the Python community think that
    threads are dangerous and don't scale. The paper
    http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf sums
    it up nicely. It explains why modern processors are going to cause
    more and more trouble with the Java approach to threads, too.
    Reading this PDF paper is extremely interesting (albeit somewhat
    dependent on understanding abstract theories of computation; I have
    enough math background to follow it, sort of, and most of the text can
    be read even without fully understanding the theoretical abstractions).

    I have already heard people talking about "Java applications are
    buggy". ?I don't believe that general sequential programs written in
    Java are any buggier than programs written in other languages... so I
    had interpreted that to mean (based on some inquiry) that complex,
    multi-threaded Java applications are buggy. And while I also don't
    believe that complex, multi-threaded programs written in Java are any
    buggier than complex, multi-threaded programs written in other
    languages, it does seem to be true that Java is one of the currently
    popular languages in which to write complex, multi-threaded programs,
    because of its language support for threads and concurrency primitives.
    These reports were from people that are not programmers, but are field
    IT people, that have bought and/or support software and/or hardware with
    drivers, that are written in Java, and seem to have non-ideal behavior,
    (apparently only) curable by stopping/restarting the application or
    driver, or sometimes requiring a reboot.

    The paper explains many traps that lead to complex, multi-threaded
    programs being buggy, and being hard to test. I have worked with
    parallel machines, applications, and databases for 25 years, and can
    appreciate the succinct expression of the problems explained within the
    paper, and can, from experience, agree with its premises and
    conclusions. Parallel applications only have been commercial successes
    when the parallelism is tightly constrained to well-controlled patterns
    that could be easily understood. Threads, especially in "cooperation"
    with languages that use memory pointers, have the potential to get out
    of control, in inexplicable ways.
    Although the paper is correct in many ways, I find it fails to
    distinguish the core of the problem from the chaff surrounding it, and
    thus is used to justify poor language designs.

    For example, the amount of interaction may be seen as a spectrum: at
    one end is C or Java threads, with complicated memory models, and a
    tendency to just barely control things using locks. At the other end
    would be completely isolated processes with no form of IPC. The former
    is considered the worst possible, while the latter is the best
    possible (purely sequential).

    However, the latter is too weak for many uses. At a minimum we'd like
    some pipes to communicate. Helps, but it's still too weak. What if
    you have a large amount of data to share, created at startup but
    otherwise not modified? So we add some read only types and ways to
    define your own read only types. A couple of those types need a
    process associated with them, so we make sure process handles are
    proper objects too.

    What have we got now? It's more on the thread end of the spectrum
    than the process end, but it's definitely not a C or Java thread, and
    it's definitely not an OS process. What is it? Does it have the
    problems in the paper? Only some? Which?
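
    As a concrete point on that spectrum, the "large read-only data created
    at startup" case can already be approximated with plain processes; the
    sketch below is only illustrative (LOOKUP and worker are invented
    names), and on fork-based platforms the table is inherited
    copy-on-write rather than truly shared:

        import multiprocessing

        # Large read-only table built once at startup.  On platforms that
        # fork (Linux, OS X), worker processes inherit these pages
        # copy-on-write, so they can read the table with no pickling and
        # no locking.
        LOOKUP = dict((i, i * i) for i in range(100000))

        def worker(key):
            return LOOKUP[key]      # read-only access, no synchronization

        if __name__ == "__main__":
            pool = multiprocessing.Pool(processes=4)
            print(pool.map(worker, [10, 20, 30]))   # [100, 400, 900]
            pool.close()
            pool.join()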

    Another peeve I have is his characterization of the observer pattern.
    The generalized form of the problem exists both in single-threaded
    sequential programs, in the form of unexpected reentrancy, and in
    message passing, in the form of unbounded CPU usage or an unbounded
    number of pending messages.

    Perhaps threading makes it much worse; I've heard many anecdotes that
    would support that. Or perhaps the real problem is the lack of automatic
    deadlock detection, which would give a clear and diagnosable error for
    you to fix. Certainly, the mystery and severity of a deadlock could
    explain how much it scares people. Either way, the paper says nothing.
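
    The single-threaded reentrancy case is easy to demonstrate. A tiny,
    contrived sketch (all names invented) in which an observer re-enters
    the notification loop that is calling it:

        class Subject(object):
            def __init__(self):
                self.observers = []
                self.value = 0

            def set_value(self, value):
                self.value = value
                for obs in list(self.observers):
                    obs(self, value)

        def logger(subject, value):
            print("value is now %d" % value)

        def meddler(subject, value):
            if value < 3:
                subject.set_value(value + 1)   # re-enters set_value; no threads involved

        s = Subject()
        s.observers.append(meddler)
        s.observers.append(logger)
        s.set_value(0)
        # Prints 3, 2, 1, 0: the notifications unwind out of order, and the
        # last report (0) no longer matches s.value, which is left at 3.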

    Python *must* gain means of concurrent execution of CPU bound code
    eventually to survive on the market. But it must get the right means
    or we are going to suffer the consequences.
    This statement, after reading the paper, seems somewhat in line with the
    author's premise that language acceptability requires that a language be
    self-contained/monolithic, and potentially sufficient to implement
    itself. That seems to also be one of the reasons that Java is used
    today for threaded applications. It does seem to be true, given current
    hardware trends, that _some mechanism_ must be provided to obtain the
    benefit of multiple cores/CPUs to a single application, and that Python
    must either implement or interface to that mechanism to continue to be a
    viable language for large scale application development.

    Andy seems to want an implementation of independent Python processes
    implemented as threads within a single address space, that can be
    coordinated by an outer application. This actually corresponds to the
    model promulgated in the paper as being most likely to succeed. Of
    course, it maps nicely into a model using separate processes,
    coordinated by an outer process, also. The differences seem to be:

    1) Most applications are historically perceived as corresponding to
    single processes. Language features for multi-processing are rare, and
    such languages are not in common use.

    2) A single address space can be convenient for the coordinating outer
    application. It does seem simpler and more efficient to simply "copy"
    data from one memory location to another, rather than send it in a
    message, especially if the data are large. On the other hand,
    coordination of memory access between multiple cores/CPUs effectively
    causes memory copies from one cache to the other, and if memory is
    accessed from multiple cores/CPUs regularly, the underlying hardware
    implements additional synchronization and copying of data, potentially
    each time the memory is accessed. Being forced to do message passing of
    data between processes can actually be more efficient than access to
    shared memory at times. I should note that in my 25 years of parallel
    development, all the systems created used a message passing paradigm,
    partly because the multiple CPUs often didn't share the same memory
    chips, much less the same address space, and that a key feature of all
    the successful systems of that nature was an efficient inter-CPU message
    passing mechanism. I should also note that Herb Sutter has a recent
    series of columns in Dr Dobbs regarding multi-core/multi-CPU parallelism
    and a variety of implementation pitfalls, that I found to be very
    interesting reading.
    Try looking at it on another level: when your CPU wants to read from a
    bit of memory controlled by another CPU it sends them a message
    requesting they get it for us. They send back a message containing
    that memory. They also note we have it, in case they want to modify
    it later. We also note where we got it, in case we want to modify it
    (and not wait for them to do modifications for us).

    Message passing vs shared memory isn't really a yes/no question. It's
    about ratios, usage patterns, and tradeoffs. *All* programs will
    share data, but in what way? If it's just the code itself you can
    move the cache validation into software and simplify the CPU, making
    it faster. If the shared data is a lot more than that, and you use it
    to coordinate accesses, then it'll be faster to have it in hardware.

    It's quite possible they'll come up with something that seems quite
    different, but in reality is the same sort of rearrangement. Add
    hardware support for transactions, move the caching partly into
    software, etc.
    I have noted the multiprocessing module that is new to Python 2.6/3.0
    being feverishly backported to Python 2.5, 2.4, etc... indicating that
    people truly find the model/module useful... seems that this is one way,
    in Python rather than outside of it, to implement the model Andy is
    looking for, although I haven't delved into the details of that module
    yet, myself. I suspect that a non-Python application could load one
    embedded Python interpreter, and then indirectly use the multiprocessing
    module to control other Python interpreters in other processors. I
    don't know that multithreading primitives such as described in the paper
    are available in the multiprocessing module, but perhaps they can be
    implemented in some manner using the tools that are provided; in any
    case, some interprocess communication primitives are provided via this
    new Python module.
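
    A rough sketch of that sort of arrangement, using only the primitives
    the multiprocessing module ships with (the names are invented, and the
    squaring stands in for real work):

        import multiprocessing

        def worker(tasks, results):
            # Runs in a separate process, i.e. a fully independent interpreter.
            for item in iter(tasks.get, None):       # None is the shutdown signal
                results.put(item * item)             # placeholder for real work

        if __name__ == "__main__":
            tasks = multiprocessing.Queue()
            results = multiprocessing.Queue()
            procs = [multiprocessing.Process(target=worker, args=(tasks, results))
                     for _ in range(4)]
            for p in procs:
                p.start()
            for n in range(16):
                tasks.put(n)
            for _ in procs:
                tasks.put(None)                      # one shutdown signal per worker
            out = sorted(results.get() for _ in range(16))
            for p in procs:
                p.join()
            print(out)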

    There could be opportunity to enhance Python with process creation and
    process coordination operations, rather than have it depend on
    easy-to-implement-incorrectly coordination patterns or
    easy-to-use-improperly libraries/modules of multiprocessing primitives
    (this is not a slam of the new multiprocessing module, which appears to
    be filling a present need in rather conventional ways, but just to point
    out that ideas promulgated by the paper, which I suspect 2 years later
    are still research topics, may be a better abstraction than the
    conventional mechanisms).

    One thing Andy hasn't yet explained (or I missed) is why any of his
    application is coded in a language other than Python. I can think of a
    number of possibilities:

    A) (Historical) It existed, then the desire for extensions was seen, and
    Python was seen as a good extension language.

    B) Python is inappropriate (performance?) for some of the algorithms
    (but should they be coded instead as Python extensions, with the core
    application being in Python?)

    C) Unavailability of Python wrappers for particularly useful 3rd-party
    libraries

    D) Other?
    "It already existed" is definitely the original reason, but now it
    includes single-threaded performance and multi-threaded scalability.
    Although the idea of "just write an extension that releases the GIL"
    is a common suggestion, it needs to be fairly coarse to be effective,
    and ensure little of the CPU time is left in python. If the app
    spreads its CPU time around, it is likely impossible to use python
    effectively.
  • Rhamphoryncus at Oct 24, 2008 at 9:16 pm

    On Oct 24, 3:02 pm, Glenn Linderman wrote:
    On approximately 10/23/2008 2:24 PM, came the following characters from the
    keyboard of Rhamphoryncus:
    On Oct 23, 11:30 am, Glenn Linderman wrote:


    On approximately 10/23/2008 12:24 AM, came the following characters from
    the keyboard of Christian Heimes
    Andy wrote:
    I'm very - not absolute, but very - sure that Guido and the initial
    designers of Python would have added the GIL anyway. The GIL makes
    Python faster on single core machines and more stable on multi core
    machines.
    Actually, the GIL doesn't make Python faster; it is a design decision that
    reduces the overhead of lock acquisition, while still allowing use of global
    variables.

    Using finer-grained locks has higher run-time cost; eliminating the use of
    global variables has a higher programmer-time cost, but would actually run
    faster and more concurrently than using a GIL. Especially on a
    multi-core/multi-CPU machine.
    Those "globals" include classes, modules, and functions. You can't
    have *any* objects shared. Your interpreters are entirely isolated,
    much like processes (and we all start wondering why you don't use
    processes in the first place.)

    Or use safethread. It imposes safe semantics on shared objects, so
    you can keep your global classes, modules, and functions. Still need
    garbage collection though, and on CPython that means refcounting and
    the GIL.

    Another peeve I have is his characterization of the observer pattern.
    The generalized form of the problem exists in both single-threaded
    sequential programs, in the form of unexpected reentrancy, and message
    passing, with infinite CPU usage or infinite number of pending
    messages.
    So how do you get reentrancy in a single-threaded sequential program? I
    think only via recursion? Which isn't a serious issue for the observer
    pattern. If you add interrupts, then your program is no longer sequential.
    Sorry, I meant recursion. Why isn't it a serious issue for
    single-threaded programs? Just the fact that it's much easier to
    handle when it does happen?

    Try looking at it on another level: when your CPU wants to read from a
    bit of memory controlled by another CPU it sends them a message
    requesting they get it for us. They send back a message containing
    that memory. They also note we have it, in case they want to modify
    it later. We also note where we got it, in case we want to modify it
    (and not wait for them to do modifications for us).
    I understand that level... one of my degrees is in EE, and I started college
    wanting to design computers (at about the time the first microprocessor chip
    came along, and they, of course, have now taken over). But I was side-lined
    by the malleability of software, and have mostly practiced software during
    my career.

    Anyway, that is the level that Herb Sutter was describing in the Dr Dobbs
    articles I mentioned. And the overhead of doing that at the level of a cache
    line is high, if there is lots of contention for particular memory locations
    between threads running on different cores/CPUs. So to achieve concurrency,
    you must not only limit explicit software locks, but must also avoid memory
    layouts where data needed by different cores/CPUs are in the same cache
    line.
    I suspect they'll end up redesigning the caching to use a size and
    alignment of 64 bits (or smaller). Same cache line size, but with
    masking.

    You still need to minimize contention of course, but that should at
    least be more predictable. Having two unrelated mallocs contend could
    suck.

    Message passing vs shared memory isn't really a yes/no question. It's
    about ratios, usage patterns, and tradeoffs. *All* programs will
    share data, but in what way? If it's just the code itself you can
    move the cache validation into software and simplify the CPU, making
    it faster. If the shared data is a lot more than that, and you use it
    to coordinate accesses, then it'll be faster to have it in hardware.
    I agree there are tradeoffs... unfortunately, the hardware architectures
    vary, and the languages don't generally understand the hardware. So then it
    becomes an OS API, which adds the overhead of an OS API call to the cost of
    the synchronization... It could instead be (and in clever applications is) a
    non-portable assembly level function that wraps an OS locking or waiting
    API.
    In practice I highly doubt we'll see anything that doesn't extend
    traditional threading (posix threads, whatever MS has, etc).

    Nonetheless, while putting the shared data accesses in hardware might be
    more efficient per unit operation, there are still tradeoffs: A software
    solution can group multiple accesses under a single lock acquisition; the
    hardware probably doesn't have enough smarts to do that. So it may well
    require many more hardware unit operations for the same overall concurrently
    executed function, and the resulting performance may not be any better.
    Speculative ll/sc? ;)

    Sidestepping the whole issue, by minimizing shared data in the application
    design, avoiding not only software lock calls but also hardware cache
    contention, is going to provide the best performance... it isn't the things
    you do efficiently that make software fast--it is the things you don't do
    at all.
    Minimizing contention, certainly. Minimizing the shared data itself
    is iffier though.
  • Adam Olsen at Oct 25, 2008 at 1:07 am

    On Fri, Oct 24, 2008 at 5:38 PM, Glenn Linderman wrote:
    On approximately 10/24/2008 2:16 PM, came the following characters from the
    keyboard of Rhamphoryncus:
    On Oct 24, 3:02 pm, Glenn Linderman wrote:


    On approximately 10/23/2008 2:24 PM, came the following characters from
    the keyboard of Rhamphoryncus:
    On Oct 23, 11:30 am, Glenn Linderman wrote:

    On approximately 10/23/2008 12:24 AM, came the following characters from
    the keyboard of Christian Heimes:
    Andy wrote:
    I'm very - not absolute, but very - sure that Guido and the initial
    designers of Python would have added the GIL anyway. The GIL makes
    Python faster on single core machines and more stable on multi core
    machines.
    Actually, the GIL doesn't make Python faster; it is a design decision
    that reduces the overhead of lock acquisition, while still allowing use
    of global variables.

    Using finer-grained locks has higher run-time cost; eliminating the use
    of global variables has a higher programmer-time cost, but would
    actually run faster and more concurrently than using a GIL. Especially
    on a multi-core/multi-CPU machine.
    Those "globals" include classes, modules, and functions. You can't
    have *any* objects shared. Your interpreters are entirely isolated,
    much like processes (and we all start wondering why you don't use
    processes in the first place.)
    Indeed; isolated, independent interpreters are one of the goals. It is,
    indeed, much like processes, but in a single address space. It allows the
    master process (Python or C for the embedded case) to be coded using memory
    references and copies and pointer swaps instead of using semaphores, and
    potentially multi-megabyte message transfers.

    It is not clear to me that with the use of shared memory between processes,
    that the application couldn't use processes, and achieve many of the same
    goals. On the other hand, the code to create and manipulate processes and
    shared memory blocks is harder to write and has more overhead than the code
    to create and manipulate threads, which can, when told, access any memory
    block in the process. This allows the shared memory to be resized more
    easily, or more blocks of shared memory created more easily. On the other
    hand, the creation of shared memory blocks shouldn't be a high-use operation
    in a program that has sufficient number crunching to do to be able to
    consume multiple cores/CPUs.
    Or use safethread. It imposes safe semantics on shared objects, so
    you can keep your global classes, modules, and functions. Still need
    garbage collection though, and on CPython that means refcounting and
    the GIL.
    Sounds like safethread has 35-40% overhead. Sounds like too much, to me.
    The specific implementation of safethread, which attempts to remove
    the GIL from CPython, has significant overhead and had very limited
    success at being scalable.

    The monitor design proposed by safethread has no inherent overhead and
    is completely scalable.


    --
    Adam Olsen, aka Rhamphoryncus
  • Terry Reedy at Oct 25, 2008 at 3:39 am

    Glenn Linderman wrote:

    For example, Python presently has a rather stupid algorithm for string
    concatenation.
    Python the language has syntax and semantics. Python implementations
    have algorithms that fulfill the defined semantics.
    It allocates only the exactly necessary space for the
    concatenated string. This is a brilliant move, when you realize that
    strings are immutable, and once allocated can never change, but the
    operation

    for line in mylistofstrings:
        string = string + line

    is basically O(N-squared) as a result. The better algorithm would
    double the size of memory allocated for string each time there is not
    enough room to add the next line, and that reduces the cost of the
    algorithm to O(N).
    If there is more than one reference to a guaranteed immutable object,
    such as a string, the 'stupid' algorithm seems necessary to me. In-place
    modification of a shared immutable would violate semantics.

    However, if you do

    string = ''
    for line in strings:
        string += line

    so that there is only one reference and you tell the interpreter that
    you don't mind the old value being updated, then I believe in 2.6, if
    not before, CPython does overallocation and in-place extension. (I am
    not sure about s=s+l.) But this is just ref-counted CPython.
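
    A small sketch of the two idioms being compared; the first relies on
    that CPython-specific, single-reference optimization, while the second
    is linear by construction on any implementation (mylistofstrings is
    just a placeholder for any list of strings):

        # May be quadratic: each concatenation can copy everything built so
        # far, unless CPython's in-place resize optimization applies (only
        # one reference to the string).
        def concat_naive(mylistofstrings):
            string = ''
            for line in mylistofstrings:
                string += line
            return string

        # Linear, and guaranteed by the language rather than by an
        # implementation detail: collect the pieces and join them once.
        def concat_join(mylistofstrings):
            return ''.join(mylistofstrings)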

    Terry Jan Reedy
  • Terry Reedy at Oct 25, 2008 at 4:54 pm

    Glenn Linderman wrote:
    On approximately 10/24/2008 8:39 PM, came the following characters from
    the keyboard of Terry Reedy:
    Glenn Linderman wrote:
    For example, Python presently has a rather stupid algorithm for
    string concatenation.
    Yes, CPython2.x, x<=5 did.
    Python the language has syntax and semantics. Python implementations
    have algorithms that fulfill the defined semantics.
    I can buy that, but when Python is not qualified, CPython should be
    assumed, as it predominates.
    People do that, and it sometimes leads to unnecessary confusion. As to
    the present discussion, is it about
    * changing Python, the language
    * changing all Python implementations
    * changing CPython, the leading implementation
    * branching CPython with a compiler switch, much as there was one for
    including Unicode or not.
    * forking CPython
    * modifying an existing module
    * adding a new module
    * making better use of the existing facilities
    * some combination of the above
    Of course, the latest official release
    should probably also be assumed, but that is so recent,
    People do that, and it sometimes leads to unnecessary confusion. People
    routinely post version-specific problems and questions without
    specifying the version (or platform when relevant). In a month or so,
    there will be *2* latest official releases. There will be more
    confusion without qualification.
    few have likely
    upgraded as yet... I should have qualified the statement.
    * Is the target of this discussion 2.7 or 3.1 (some changes would be 3.1
    only).

    [diversion to the side topic]
    If there is more than one reference to a guaranteed immutable object,
    such as a string, the 'stupid' algorithm seems necessary to me.
    In-place modification of a shared immutable would violate semantics.
    Absolutely. But after the first iteration, there is only one reference
    to string.
    Which is to say, 'string' is the only reference to the object it refers
    to. You are right, so I presume that the optimization described would
    then kick in. But I have not read the code, and CPython optimizations
    are not part of the *language* reference.

    [back to the main topic]

    There is some discussion/debate/confusion about how much of the stdlib
    is 'standard Python library' versus 'standard CPython library'. [And
    there is some feeling that standard Python modules should have a default
    Python implementation that any implementation can use until it
    optionally replaces it with a faster compiled version.] Hence my
    question about the target of this discussion and the first three options
    listed above.

    Terry Jan Reedy
  • Patrick Stinson at Oct 24, 2008 at 3:26 pm
    I'm not finished reading the whole thread yet, but I've got some
    things below to respond to this post with.
    On Thu, Oct 23, 2008 at 9:30 AM, Glenn Linderman wrote:
    On approximately 10/23/2008 12:24 AM, came the following characters from the
    keyboard of Christian Heimes:
    Andy wrote:
    2) Barriers to "free threading". As Jesse describes, this is simply
    just the GIL being in place, but of course it's there for a reason.
    It's there because (1) doesn't hold and there was never any specs/
    guidance put forward about what should and shouldn't be done in multi-
    threaded apps (see my QuickTime API example). Perhaps if we could go
    back in time, we would not put the GIL in place, strict guidelines
    regarding multithreaded use would have been established, and PEP 3121
    would have been mandatory for C modules. Then again--screw that, if I
    could go back in time, I'd just go for the lottery tickets!! :^)

    I've been following this discussion with interest, as it certainly seems
    that multi-core/multi-CPU machines are the coming thing, and many
    applications will need to figure out how to use them effectively.
    I'm very - not absolute, but very - sure that Guido and the initial
    designers of Python would have added the GIL anyway. The GIL makes Python
    faster on single core machines and more stable on multi core machines. Other
    language designers think the same way. Ruby recently got a GIL. The article
    http://www.infoq.com/news/2007/05/ruby-threading-futures explains the
    rationales for a GIL in Ruby. The article also holds a quote from Guido
    about threading in general.

    Several people inside and outside the Python community think that threads
    are dangerous and don't scale. The paper
    http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf sums it up
    nicely. It explains why modern processors are going to cause more and more
    trouble with the Java approach to threads, too.
    Reading this PDF paper is extremely interesting (albeit somewhat dependent
    on understanding abstract theories of computation; I have enough math
    background to follow it, sort of, and most of the text can be read even
    without fully understanding the theoretical abstractions).

    I have already heard people talking about "Java applications are buggy". I
    don't believe that general sequential programs written in Java are any
    buggier than programs written in other languages... so I had interpreted
    that to mean (based on some inquiry) that complex, multi-threaded Java
    applications are buggy. And while I also don't believe that complex,
    multi-threaded programs written in Java are any buggier than complex,
    multi-threaded programs written in other languages, it does seem to be true
    that Java is one of the currently popular languages in which to write
    complex, multi-threaded programs, because of its language support for
    threads and concurrency primitives. These reports were from people that are
    not programmers, but are field IT people, that have bought and/or support
    software and/or hardware with drivers, that are written in Java, and seem to
    have non-ideal behavior, (apparently only) curable by stopping/restarting
    the application or driver, or sometimes requiring a reboot.

    The paper explains many traps that lead to complex, multi-threaded programs
    being buggy, and being hard to test. I have worked with parallel machines,
    applications, and databases for 25 years, and can appreciate the succinct
    expression of the problems explained within the paper, and can, from
    experience, agree with its premises and conclusions. Parallel applications
    only have been commercial successes when the parallelism is tightly
    constrained to well-controlled patterns that could be easily understood.
    Threads, especially in "cooperation" with languages that use memory
    pointers, have the potential to get out of control, in inexplicable ways.

    Python *must* gain means of concurrent execution of CPU bound code
    eventually to survive on the market. But it must get the right means or we
    are going to suffer the consequences.
    This statement, after reading the paper, seems somewhat in line with the
    author's premise that language acceptability requires that a language be
    self-contained/monolithic, and potentially sufficient to implement itself.
    That seems to also be one of the reasons that Java is used today for
    threaded applications. It does seem to be true, given current hardware
    trends, that _some mechanism_ must be provided to obtain the benefit of
    multiple cores/CPUs to a single application, and that Python must either
    implement or interface to that mechanism to continue to be a viable language
    for large scale application development.

    Andy seems to want an implementation of independent Python processes
    implemented as threads within a single address space, that can be
    coordinated by an outer application. This actually corresponds to the model
    promulgated in the paper as being most likely to succeed. Of course, it
    maps nicely into a model using separate processes, coordinated by an outer
    process, also. The differences seem to be:

    1) Most applications are historically perceived as corresponding to single
    processes. Language features for multi-processing are rare, and such
    languages are not in common use.

    2) A single address space can be convenient for the coordinating outer
    application. It does seem simpler and more efficient to simply "copy" data
    from one memory location to another, rather than send it in a message,
    especially if the data are large. On the other hand, coordination of memory
    access between multiple cores/CPUs effectively causes memory copies from one
    cache to the other, and if memory is accessed from multiple cores/CPUs
    regularly, the underlying hardware implements additional synchronization and
    copying of data, potentially each time the memory is accessed. Being forced
    to do message passing of data between processes can actually be more
    efficient than access to shared memory at times. I should note that in my
    25 years of parallel development, all the systems created used a message
    passing paradigm, partly because the multiple CPUs often didn't share the
    same memory chips, much less the same address space, and that a key feature
    of all the successful systems of that nature was an efficient inter-CPU
    message passing mechanism. I should also note that Herb Sutter has a recent
    series of columns in Dr Dobbs regarding multi-core/multi-CPU parallelism and
    a variety of implementation pitfalls, that I found to be very interesting
    reading.

    I have noted the multiprocessing module that is new to Python 2.6/3.0 being
    feverishly backported to Python 2.5, 2.4, etc... indicating that people
    truly find the model/module useful... seems that this is one way, in Python
    rather than outside of it, to implement the model Andy is looking for,
    although I haven't delved into the details of that module yet, myself. I
    suspect that a non-Python application could load one embedded Python
    interpreter, and then indirectly use the multiprocessing module to control
    other Python interpreters in other processors. I don't know that
    multithreading primitives such as described in the paper are available in
    the multiprocessing module, but perhaps they can be implemented in some
    manner using the tools that are provided; in any case, some interprocess
    communication primitives are provided via this new Python module.

    There could be opportunity to enhance Python with process creation and
    process coordination operations, rather than have it depend on
    easy-to-implement-incorrectly coordination patterns or
    easy-to-use-improperly libraries/modules of multiprocessing primitives (this
    is not a slam of the new multiprocessing module, which appears to be filling
    a present need in rather conventional ways, but just to point out that ideas
    promulgated by the paper, which I suspect 2 years later are still research
    topics, may be a better abstraction than the conventional mechanisms).

    One thing Andy hasn't yet explained (or I missed) is why any of his
    application is coded in a language other than Python. I can think of a
    number of possibilities:

    A) (Historical) It existed, then the desire for extensions was seen, and
    Python was seen as a good extension language.

    B) Python is inappropriate (performance?) for some of the algorithms (but
    should they be coded instead as Python extensions, with the core application
    being in Python?)

    C) Unavailability of Python wrappers for particularly useful 3rd-party
    libraries

    D) Other?
    We develop virtual instrument plugins for music production using
    AudioUnit, VST, and RTAS on Windows and OS X. While our dsp engine's
    code has to be written in C/C++ for performance reasons, the gui could
    have been written in python. But, we didn't because:

    1) Our project lead didn't know python, and the project began with
    little time for him to learn it.
    2) All of our third-party libs (for dsp, plugin-wrappers, etc) are
    written in C++, so it would be far easier to write and debug our app if
    written in the same language. Could I do it now? Yes. Could we do it
    then? No.

    ** Additionally **, we would have run into this problem, which is very
    appropriate to this thread:

    3) Adding python as an audio scripting language in the audio thread
    would have caused concurrency issues if our GUI had been written in
    python, since audio threads are not allowed to make blocking calls
    (e.g. acquiring the GIL).

    OK, I'll continue reading the thread now :)
  • Andy O'Meara at Oct 24, 2008 at 3:42 pm
    Glenn, great post and points!
    Andy seems to want an implementation of independent Python processes
    implemented as threads within a single address space, that can be
    coordinated by an outer application. This actually corresponds to the
    model promulgated in the paper as being most likely to succeed.
    Yeah, that's the idea--let the highest levels run and coordinate the
    show.
    It does seem simpler and more efficient to simply "copy"
    data from one memory location to another, rather than send it in a
    message, especially if the data are large.
    That's the rub... In our case, we're doing image and video
    manipulation--stuff not good to be messaging from address space to
    address space. The same argument holds for numerical processing with
    large data sets. The workers handing back huge data sets via
    messaging isn't very attractive.
    One thing Andy hasn't yet explained (or I missed) is why any of his
    application is coded in a language other than Python.
    Our software runs in real time (so performance is paramount),
    interacts with other static libraries, depends on worker threads to
    perform real-time image manipulation, and leverages Windows and Mac OS
    API concepts and features. Python's performance hits have generally
    been a huge challenge with our animators because they often have to go
    back and massage their python code to improve execution performance.
    So, in short, there are many reasons why we use python as a part
    rather than a whole.

    The other area of pain that I mentioned in one of my other posts is
    that what we ship, above all, can't be flaky. The lack of module
    cleanup (intended to be addressed by PEP 3121), using a duplicate copy
    of the python dynamic lib, and namespace black magic to achieve
    independent interpreters are all examples that have made using python
    for us much more challenging and time-consuming than we ever
    anticipated.

    Again, if it turns out nothing can be done about our needs (which
    appears to be more and more like the case), I think it's important for
    everyone here to consider the points raised here in the last week.
    Moreover, realize that the python dev community really stands to gain
    from making python usable as a tool (rather than a monolith). This
    fact alone has caused lua to *rapidly* rise in popularity with
    software companies looking to embed a powerful, lightweight
    interpreter in their software.

    As a python language fan and enthusiast, don't let lua win! (I say
    this endearingly of course--I have the utmost respect for both
    communities and I only want to see CPython be an attractive pick when
    a company is looking to embed a language that won't intrude upon their
    app's design).


    Andy
  • Patrick Stinson at Oct 24, 2008 at 4:08 pm
    As a side note to the performance question, we are executing python
    code in an audio thread that is used in all of the top-end music
    production environments. We have found the language to perform
    extremely well when executed at control-rate frequency, meaning we
    aren't doing DSP computations, just responding to less-frequent events
    like user input and MIDI messages.

    So we are sitting on this music platform with unimaginable possibilities
    in the music world (of which python does not play a role), but those
    little CPU spikes caused by the GIL at low latencies won't let us have
    it. AFAIK, there is no music scripting language out there that would
    come close, and yet we are sooooo close! This is a big deal.
    On Fri, Oct 24, 2008 at 7:42 AM, Andy O'Meara wrote:

    Glenn, great post and points!
    Andy seems to want an implementation of independent Python processes
    implemented as threads within a single address space, that can be
    coordinated by an outer application. This actually corresponds to the
    model promulgated in the paper as being most likely to succeed.
    Yeah, that's the idea--let the highest levels run and coordinate the
    show.
    It does seem simpler and more efficient to simply "copy"
    data from one memory location to another, rather than send it in a
    message, especially if the data are large.
    That's the rub... In our case, we're doing image and video
    manipulation--stuff not good to be messaging from address space to
    address space. The same argument holds for numerical processing with
    large data sets. The workers handing back huge data sets via
    messaging isn't very attractive.
    One thing Andy hasn't yet explained (or I missed) is why any of his
    application is coded in a language other than Python.
    Our software runs in real time (so performance is paramount),
    interacts with other static libraries, depends on worker threads to
    perform real-time image manipulation, and leverages Windows and Mac OS
    API concepts and features. Python's performance hits have generally
    been a huge challenge with our animators because they often have to go
    back and massage their python code to improve execution performance.
    So, in short, there are many reasons why we use python as a part
    rather than a whole.

    The other area of pain that I mentioned in one of my other posts is
    that what we ship, above all, can't be flaky. The lack of module
    cleanup (intended to be addressed by PEP 3121), using a duplicate copy
    of the python dynamic lib, and namespace black magic to achieve
    independent interpreters are all examples that have made using python
    for us much more challenging and time-consuming than we ever
    anticipated.

    Again, if it turns out nothing can be done about our needs (which
    appears to be more and more like the case), I think it's important for
    everyone here to consider the points raised here in the last week.
    Moreover, realize that the python dev community really stands to gain
    from making python usable as a tool (rather than a monolith). This
    fact alone has caused lua to *rapidly* rise in popularity with
    software companies looking to embed a powerful, lightweight
    interpreter in their software.

    As a python language fan and enthusiast, don't let lua win! (I say
    this endearingly of course--I have the utmost respect for both
    communities and I only want to see CPython be an attractive pick when
    a company is looking to embed a language that won't intrude upon their
    app's design).


    Andy
  • Patrick Stinson at Oct 29, 2008 at 7:03 am
    Close, I work currently for EastWest :)

    Well, I actually like almost everything else about CPython;
    considering my audio work, the only major problem I've had is with the
    GIL. I like the purist community, and I like the code, since
    integrating it on both platforms has been relatively clean, and
    required *zero* support. Frankly, with the exception of some windows
    deployment issues relating to static linking of libpython and some
    extensions, it's been a dream lib to use.

    Further, I really appreciate the discussions that happen in these
    lists, and I think that this particular problem is a wonderful example
    of a situation that requires tons of miscellaneous opinions and input
    from all angles - especially at this stage. I think that this problem
    has lots of standing discussion and lots of potential solutions and/or
    workarounds, and it would be cool for someone to aggregate and
    paraphrase that stuff into a page to assist those thinking about doing
    some patching. That's probably something that the coder would do
    themselves though.
    On Fri, Oct 24, 2008 at 10:25 AM, Andy O'Meara wrote:

    So we are sitting on this music platform with unimaginable possibilities
    in the music world (of which python does not play a role), but those
    little CPU spikes caused by the GIL at low latencies won't let us have
    it. AFAIK, there is no music scripting language out there that would
    come close, and yet we are sooooo close! This is a big deal.

    Perfectly said, Patrick. It pains me to know how widespread python
    *could* be in commercial software!

    Also, good points about people being longwinded and that "code talks".

    Sadly, the time alone I've spent in the last couple days on this
    thread is scary, but I'm committed now, I guess. :^( I look at the
    length of the posts of some of these guys and I have to wonder what
    the heck they do for a living!

    As I mentioned, however, I'm close to just blowing the whistle on this
    crap and start making CPythonES (as I call it, in the spirit of the
    "ES" in "OpenGLES"). Like you, we just want the core features of
    python in a clean, tidy, *reliable* fashion--something that we can
    ship and not lose sleep (or support hours) over. Basically, I imagine
    developing an interpreter designed for dev houses like yours and mine
    (you're Ableton or Propellerhead, right?)--a python version of lua, if
    you will. The nice thing about it is that it could start fresh and
    small, but I have a feeling it would really catch on because every
    commercial dev house would choose it over CPython any day of the week
    and it would be completely disjoint from CPython.

    Andy
  • Rhamphoryncus at Oct 24, 2008 at 8:09 pm

    On Oct 24, 1:02 pm, Glenn Linderman wrote:
    On approximately 10/24/2008 8:42 AM, came the following characters from
    the keyboard of Andy O'Meara:
    Glenn, great post and points!
    Thanks. I need to admit here that while I've got a fair bit of
    professional programming experience, I'm quite new to Python -- I've not
    learned its internals, nor even the full extent of its rich library. So
    I have some questions that are partly about the goals of the
    applications being discussed, partly about how Python is constructed,
    and partly about how the library is constructed. I'm hoping to get a
    better understanding of all of these; perhaps once a better
    understanding is achieved, limitations will be understood, and maybe
    solutions will be achievable.

    Let me define some speculative Python interpreters; I think the first is
    today's Python:

    PyA: Has a GIL. PyA threads can run within a process; but are
    effectively serialized to the places where the GIL is obtained/released.
    Needs the GIL because that solves lots of problems with non-reentrant
    code (an example of non-reentrant code, is code that uses global (C
    global, or C static) variables -- note that I'm not talking about Python
    vars declared global... they are only module global). In this model,
    non-reentrant code could include pieces of the interpreter, and/or
    extension modules.

    PyB: No GIL. PyB threads acquire/release a lock around each reference to
    a global variable (like "with" feature). Requires massive recoding of
    all code that contains global variables. Reduces performance
    significantly by the increased cost of obtaining and releasing locks.

    PyC: No locks. Instead, recoding is done to eliminate global variables
    (interpreter requires a state structure to be passed in). Extension
    modules that use globals are prohibited... this eliminates large
    portions of the library, or requires massive recoding. PyC threads do
    not share data between threads except by explicit interfaces.

    PyD: (A hybrid of PyA & PyC). The interpreter is recoded to eliminate
    global variables, and each interpreter instance is provided a state
    structure. There is still a GIL, however, because globals are
    potentially still used by some modules. Code is added to detect use of
    global variables by a module, or some contract is written whereby a
    module can be declared to be reentrant and global-free. PyA threads will
    obtain the GIL as they would today. PyC threads would be available to be
    created. PyC instances refuse to call non-reentrant modules, but also
    need not obtain the GIL... PyC threads would have limited module support
    initially, but over time, most modules can be migrated to be reentrant
    and global-free, so they can be used by PyC instances. Most 3rd-party
    libraries today are starting to care about reentrancy anyway, because of
    the popularity of threads.
    PyE: objects are reclassified as shareable or non-shareable, many
    types are now only allowed to be shareable. A module and its classes
    become shareable with the use of a __future__ import, and their
    shareddict uses a read-write lock for scalability. Most other
    shareable objects are immutable. Each thread is run in its own
    private monitor, and thus protected from the normal threading memory
    model nasties. Alas, this gives you all the semantics, but you still
    need scalable garbage collection.. and CPython's refcounting needs the
    GIL.

    Our software runs in real time (so performance is paramount),
    interacts with other static libraries, depends on worker threads to
    perform real-time image manipulation, and leverages Windows and Mac OS
    API concepts and features. Python's performance hits have generally
    been a huge challenge with our animators because they often have to go
    back and massage their python code to improve execution performance.
    So, in short, there are many reasons why we use python as a part
    rather than a whole.
    [...]
    As a python language fan and enthusiast, don't let lua win! (I say
    this endearingly of course--I have the utmost respect for both
    communities and I only want to see CPython be an attractive pick when
    a company is looking to embed a language that won't intrude upon their
    app's design).
    I agree with the problem, and desire to make python fill all niches,
    but let's just say I'm more ambitious with my solution. ;)
  • Glenn Linderman at Oct 24, 2008 at 8:59 pm
    On approximately 10/24/2008 1:09 PM, came the following characters from
    the keyboard of Rhamphoryncus:
    On Oct 24, 1:02 pm, Glenn Linderman wrote:

    On approximately 10/24/2008 8:42 AM, came the following characters from
    the keyboard of Andy O'Meara:

    Glenn, great post and points!
    Thanks. I need to admit here that while I've got a fair bit of
    professional programming experience, I'm quite new to Python -- I've not
    learned its internals, nor even the full extent of its rich library. So
    I have some questions that are partly about the goals of the
    applications being discussed, partly about how Python is constructed,
    and partly about how the library is constructed. I'm hoping to get a
    better understanding of all of these; perhaps once a better
    understanding is achieved, limitations will be understood, and maybe
    solutions will be achievable.

    Let me define some speculative Python interpreters; I think the first is
    today's Python:

    PyA: Has a GIL. PyA threads can run within a process; but are
    effectively serialized to the places where the GIL is obtained/released.
    Needs the GIL because that solves lots of problems with non-reentrant
    code (an example of non-reentrant code, is code that uses global (C
    global, or C static) variables -- note that I'm not talking about Python
    vars declared global... they are only module global). In this model,
    non-reentrant code could include pieces of the interpreter, and/or
    extension modules.

    PyB: No GIL. PyB threads acquire/release a lock around each reference to
    a global variable (like "with" feature). Requires massive recoding of
    all code that contains global variables. Reduces performance
    significantly by the increased cost of obtaining and releasing locks.

    PyC: No locks. Instead, recoding is done to eliminate global variables
    (interpreter requires a state structure to be passed in). Extension
    modules that use globals are prohibited... this eliminates large
    portions of the library, or requires massive recoding. PyC threads do
    not share data between threads except by explicit interfaces.

    PyD: (A hybrid of PyA & PyC). The interpreter is recoded to eliminate
    global variables, and each interpreter instance is provided a state
    structure. There is still a GIL, however, because globals are
    potentially still used by some modules. Code is added to detect use of
    global variables by a module, or some contract is written whereby a
    module can be declared to be reentrant and global-free. PyA threads will
    obtain the GIL as they would today. PyC threads would be available to be
    created. PyC instances refuse to call non-reentrant modules, but also
    need not obtain the GIL... PyC threads would have limited module support
    initially, but over time, most modules can be migrated to be reentrant
    and global-free, so they can be used by PyC instances. Most 3rd-party
    libraries today are starting to care about reentrancy anyway, because of
    the popularity of threads.
    PyE: objects are reclassified as shareable or non-shareable, many
    types are now only allowed to be shareable. A module and its classes
    become shareable with the use of a __future__ import, and their
    shareddict uses a read-write lock for scalability. Most other
    shareable objects are immutable. Each thread is run in its own
    private monitor, and thus protected from the normal threading memory
    model nasties. Alas, this gives you all the semantics, but you still
    need scalable garbage collection.. and CPython's refcounting needs the
    GIL.
    Hmm. So I think your PyE is an attempt to be more
    explicit about what I said above in PyC: PyC threads do not share data
    between threads except by explicit interfaces. I consider your
    definitions of shared data types somewhat orthogonal to the types of
    threads, in that both PyA and PyC threads could use these new shared
    data items.

    I think/hope that you meant that "many types are now only allowed to be
    non-shareable"? At least, I think that should be the default; they
    should be within the context of a single, independent interpreter
    instance, so other interpreters don't even know they exist, much less
    how to share them. If so, then I understand most of the rest of your
    paragraph, and it could be a way of providing shared objects, perhaps.

    I don't understand the comment that CPython's refcounting needs the
    GIL... yes, it needs the GIL if multiple threads see the object, but not
    for private objects... only one thread uses the private objects... so
    today's refcounting should suffice... with each interpreter doing its
    own refcounting and collecting its own garbage.

    Shared objects would have to do refcounting in a protected way, under
    some lock. One "easy" solution would be to have just two types of
    objects; non-shared private objects in a thread, and global shared
    objects; access to global shared objects would require grabbing the GIL,
    and then accessing the object, and releasing the GIL. An interface
    could allow for grabbing and releasing the GIL around a block of accesses to
    shared objects (with GIL:). This could reduce the number of GIL
    acquires. Then the reference counting for those objects would also be
    done under the GIL, and the garbage collecting? By another PyA thread,
    perhaps, that grabs the GIL by default? Or a PyC one that explicitly
    grabs the GIL and does a step of global garbage collection?

    A more complex, more parallel solution would allow for independent
    groups of shared objects. Of course, once there is more than one lock
    involved, there is more potential for deadlock, but it also provides for
    more parallelism. So a shared object might inherit from a "concurrency
    group" which would have a lock that could be acquired (with conc_group:)
    for access to those data items. Again, the reference counting would be
    done under that lock for that group of objects, and garbage collecting
    those objects would potentially require that lock as well...

    The solution with multiple concurrency groups allows for such groups to
    contain a single shared object, or many (probably related) shared
    objects. So the application gets a choice of the granularity of sharing
    and locking, and can choose the number of locks to optimize performance
    and achieve correctness. This sort of shared data among threads,
    though, suffers in the limit from all the problems described in the
    Berkeley paper. More reliable programs might be achieved by using
    straight PyC threads, and some very limited "data ports" that can be
    combined using a higher-order flow control concept, as outlined in the
    paper.
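
    A hypothetical sketch of what such a "concurrency group" could look
    like at the Python level, with an ordinary lock standing in for
    whatever the interpreter would really use (ConcurrencyGroup and the
    guarded objects are invented names):

        import threading

        class ConcurrencyGroup(object):
            # One lock guarding a set of related shared objects; entered
            # with a 'with' block, much like the 'with conc_group:' idea above.
            def __init__(self, name):
                self.name = name
                self._lock = threading.RLock()

            def __enter__(self):
                self._lock.acquire()
                return self

            def __exit__(self, *exc_info):
                self._lock.release()
                return False

        # Two independent groups: locking one does not block users of the other.
        image_group = ConcurrencyGroup("images")
        stats_group = ConcurrencyGroup("stats")

        shared_images = []      # guarded by image_group
        shared_stats = {}       # guarded by stats_group

        def worker():
            with image_group:   # acquire only the lock this data actually needs
                shared_images.append("frame-0001")
            with stats_group:
                shared_stats["frames"] = shared_stats.get("frames", 0) + 1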

    While Python might be extended with these flow control concepts, they
    could be added gradually over time, and in the embedded case, could be
    implemented in some other language.


    --
    Glenn
  • Rhamphoryncus at Oct 24, 2008 at 9:15 pm

    On Oct 24, 2:59 pm, Glenn Linderman wrote:
    On approximately 10/24/2008 1:09 PM, came the following characters from
    the keyboard of Rhamphoryncus:
    PyE: objects are reclassified as shareable or non-shareable, many
    types are now only allowed to be shareable. A module and its classes
    become shareable with the use of a __future__ import, and their
    shareddict uses a read-write lock for scalability. Most other
    shareable objects are immutable. Each thread is run in its own
    private monitor, and thus protected from the normal threading memory
    model nasties. Alas, this gives you all the semantics, but you still
    need scalable garbage collection.. and CPython's refcounting needs the
    GIL.
    Hmm. So I think your PyE is an attempt to be more
    explicit about what I said above in PyC: PyC threads do not share data
    between threads except by explicit interfaces. I consider your
    definitions of shared data types somewhat orthogonal to the types of
    threads, in that both PyA and PyC threads could use these new shared
    data items.
    Unlike PyC, there's a *lot* shared by default (classes, modules,
    function), but it requires only minimal recoding. It's as close to
    "have your cake and eat it too" as you're gonna get.

    I think/hope that you meant that "many types are now only allowed to be
    non-shareable"? ?At least, I think that should be the default; they
    should be within the context of a single, independent interpreter
    instance, so other interpreters don't even know they exist, much less
    how to share them. ?If so, then I understand most of the rest of your
    paragraph, and it could be a way of providing shared objects, perhaps.
    There aren't multiple interpreters under my model. You only need
    one. Instead, you create a monitor, and run a thread on it. A list
    is not shareable, so it can only be used within the monitor it's
    created within, but the list type object is shareable.

    I've no interest in *requiring* a C/C++ extension to communicate
    between isolated interpreters. Without that they're really no better
    than processes.
  • Adam Olsen at Oct 25, 2008 at 12:59 am

    On Fri, Oct 24, 2008 at 4:48 PM, Glenn Linderman wrote:
    On approximately 10/24/2008 2:15 PM, came the following characters from the
    keyboard of Rhamphoryncus:
    On Oct 24, 2:59 pm, Glenn Linderman wrote:


    On approximately 10/24/2008 1:09 PM, came the following characters from
    the keyboard of Rhamphoryncus:
    PyE: objects are reclassified as shareable or non-shareable, many
    types are now only allowed to be shareable. A module and its classes
    become shareable with the use of a __future__ import, and their
    shareddict uses a read-write lock for scalability. Most other
    shareable objects are immutable. Each thread is run in its own
    private monitor, and thus protected from the normal threading memory
    module nasties. Alas, this gives you all the semantics, but you still
    need scalable garbage collection... and CPython's refcounting needs the
    GIL.
    Hmm. So I think your PyE is an attempt to be more
    explicit about what I said above in PyC: PyC threads do not share data
    between threads except by explicit interfaces. I consider your
    definitions of shared data types somewhat orthogonal to the types of
    threads, in that both PyA and PyC threads could use these new shared
    data items.
    Unlike PyC, there's a *lot* shared by default (classes, modules,
    functions), but it requires only minimal recoding. It's as close to
    "have your cake and eat it too" as you're gonna get.
    Yes, but I like my cake frosted with performance; Guido's non-acceptance of
    granular locks in the blog entry someone referenced was due to the slowdown
    incurred with granular locking and shared objects. Your PyE model, with
    highly granular sharing, will likely suffer the same fate.
    No, my approach includes scalable performance. Typical paths will
    involve *no* contention (ie no locking). Classes and modules use
    shareddict, which is based on a read-write lock built into the
    interpreter, so it's uncontended for read-only usage patterns. Pretty
    much everything else is immutable.

    Of course that doesn't include the cost of garbage collection.
    CPython's refcounting can't scale.
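
    For readers trying to picture the shareddict idea, here is a rough
    pure-Python stand-in for the read-write-lock behaviour being described
    (the names _RWLock and SharedDict are invented; a Python-level lock
    obviously cannot deliver the uncontended fast path meant here, so this
    only illustrates the semantics: many concurrent readers, one exclusive
    writer):

    import threading

    class _RWLock:
        """Toy readers-writer lock, for illustration only."""
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0
            self._writing = False

        def acquire_read(self):
            with self._cond:
                while self._writing:
                    self._cond.wait()
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

        def acquire_write(self):
            with self._cond:
                while self._writing or self._readers:
                    self._cond.wait()
                self._writing = True

        def release_write(self):
            with self._cond:
                self._writing = False
                self._cond.notify_all()

    class SharedDict:
        """Readers proceed concurrently; a writer gets exclusive access."""
        def __init__(self):
            self._lock = _RWLock()
            self._data = {}

        def __getitem__(self, key):
            self._lock.acquire_read()
            try:
                return self._data[key]
            finally:
                self._lock.release_read()

        def __setitem__(self, key, value):
            self._lock.acquire_write()
            try:
                self._data[key] = value
            finally:
                self._lock.release_write()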

    The independent threads model, with only slight locking for a few explicitly
    shared objects, has a much better chance of getting better performance
    overall. With one thread running, it would be the same as today; with
    multiple threads, it should scale at the same rate as the system... minus
    any locking done at the higher level.
    So use processes with a little IPC for these expensive-yet-"shared"
    objects. multiprocessing does it already.

    I think/hope that you meant that "many types are now only allowed to be
    non-shareable"? At least, I think that should be the default; they
    should be within the context of a single, independent interpreter
    instance, so other interpreters don't even know they exist, much less
    how to share them. If so, then I understand most of the rest of your
    paragraph, and it could be a way of providing shared objects, perhaps.
    There aren't multiple interpreters under my model. You only need
    one. Instead, you create a monitor, and run a thread on it. A list
    is not shareable, so it can only be used within the monitor it's
    created within, but the list type object is shareable.
    The python interpreter code should be sharable, having been written in C,
    and being/becoming reentrant. So in that sense, there is only one
    interpreter. Similarly, any other reentrant C extensions would be that way.
    On the other hand, each thread of execution requires its own interpreter
    context, so that would have to be independent for the threads to be
    independent. It is the combination of code+context that I call an
    interpreter, and there would be one per thread for PyC threads. Bytecode
    for loaded modules could potentially be shared, if it is also immutable.
    However, that could be in my mental "phase 2", as it would require an extra
    level of complexity in the interpreter as it creates shared bytecode...
    there would be a memory savings from avoiding multiple copies of shared
    bytecode, likely, and maybe also a compilation performance savings. So it
    sounds like a win, but it is a win that can be deferred for initial simplicity,
    to prove the concept is or is not workable.

    A monitor allows a single thread to run at a time; that is the same
    situation as the present GIL. I guess I don't fully understand your model.
    To use your terminology, each monitor is a context. Each thread
    operates in a different monitor. As you say, most C functions are
    already thread-safe (reentrant). All I need to do is avoid letting
    multiple threads modify a single mutable object (such as a list) at a
    time, which I do by containing it within a single monitor (context).
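
    A toy sketch of the monitor idea as described here (Monitor is an
    invented name, and this only captures the "one thread inside at a
    time" part; in the model above, the interpreter itself would refuse
    to let an unshareable object escape the monitor it was created in):

    import threading

    class Monitor:
        """At most one thread inside at a time; mutable objects live inside it."""
        def __init__(self):
            self._lock = threading.Lock()

        def __enter__(self):
            self._lock.acquire()
            return self

        def __exit__(self, *exc):
            self._lock.release()

    jobs = Monitor()
    with jobs:                     # enter the monitor before touching its data
        pending = []               # unshareable list, confined to this monitor
        pending.append("decode frame 1")
    # The *type* `list` would be shareable; the `pending` instance would not be.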


    --
    Adam Olsen, aka Rhamphoryncus
  • Greg at Oct 25, 2008 at 6:29 am

    Rhamphoryncus wrote:
    A list
    is not shareable, so it can only be used within the monitor it's
    created within, but the list type object is shareable.
    Type objects contain dicts, which allow arbitrary values
    to be stored in them. What happens if one thread puts
    a private object in there? It becomes visible to other
    threads using the same type object. If it's not safe
    for sharing, bad things happen.

    Python's data model is not conducive to making a clear
    distinction between "private" and "shared" objects,
    except at the level of an entire interpreter.
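
    A tiny illustration of the leak being pointed at, written against
    today's CPython where everything is implicitly shared anyway (Widget
    and cache are invented names):

    import threading

    class Widget(object):          # imagine the Widget *type* is shared across threads
        pass

    def worker():
        Widget.cache = []          # a "private" mutable list, hung off the shared type...
        Widget.cache.append("supposedly thread-local")

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    print(Widget.cache)            # ...is now visible (and mutable) from the main thread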

    --
    Greg
  • Rhamphoryncus at Oct 25, 2008 at 10:22 pm

    On Oct 25, 12:29 am, greg wrote:
    Rhamphoryncus wrote:
    A list
    is not shareable, so it can only be used within the monitor it's
    created within, but the list type object is shareable.
    Type objects contain dicts, which allow arbitrary values
    to be stored in them. What happens if one thread puts
    a private object in there? It becomes visible to other
    threads using the same type object. If it's not safe
    for sharing, bad things happen.

    Python's data model is not conducive to making a clear
    distinction between "private" and "shared" objects,
    except at the level of an entire interpreter.
    Shareable type objects (enabled by a __future__ import) use a
    shareddict, which requires all keys and values to themselves be
    shareable objects.

    Although it's a significant semantic change, in many cases it's easy
    to deal with: replace mutable (unshareable) global constants with
    immutable ones (ie list -> tuple, set -> frozenset). If you've got
    some global state you move it into a monitor (which doesn't scale, but
    that's your design). The only time this really fails is when you're
    deliberately storing arbitrary mutable objects from any thread, and
    later inspecting them from any other thread (such as our new ABC
    system's cache). If you want to store an object, but only to give it
    back to the original thread, I've got a way to do that.
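
    To put that recoding in code form, it might look roughly like this
    (module contents invented for illustration):

    # Before: module-level globals that would be unshareable under PyE
    SUPPORTED_CODECS = ["h264", "theora", "dirac"]
    DEFAULT_FLAGS = {"loop", "mute"}

    # After: immutable equivalents, which would be shareable
    SUPPORTED_CODECS = ("h264", "theora", "dirac")
    DEFAULT_FLAGS = frozenset(["loop", "mute"])

    # Genuinely mutable global state would move behind a monitor; a plain
    # lock-guarded holder is used here only as a stand-in for that monitor.
    import threading

    class _Stats(object):
        def __init__(self):
            self._lock = threading.Lock()
            self.frames_decoded = 0

        def incr(self, n=1):
            with self._lock:
                self.frames_decoded += n

    stats = _Stats()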
  • Andy O'Meara at Oct 24, 2008 at 8:51 pm
    Another great post, Glenn!! Very well laid-out and posed!! Thanks for
    taking the time to lay all that out.
    Questions for Andy: is the type of work you want to do in independent
    threads mostly pure Python? Or with libraries that you can control to
    some extent? Are those libraries reentrant? Could they be made
    reentrant? How much of the Python standard library would need to be
    available in reentrant mode to provide useful functionality for those
    threads? I think you want PyC
    I think you've defined everything perfectly, and you're of
    course correct about my love for the PyC model. :^)

    Like any software that's meant to be used without restrictions, our
    code and frameworks always use a context object pattern so that
    there's never any non-const global/shared data. I would go as far as to
    say that this is the case with more performance-oriented software than
    you may think, since it's usually a given for us to have to be parallel
    friendly in as many ways as possible. Perhaps Patrick can back me up
    there.

    As to what modules are "essential"... As you point out, once
    reentrant module implementations caught on in a PyC or hybrid world, I
    think we'd start to see real effort to whip them into compliance--
    there's just so much to be gained imho. But to answer the question,
    there are the obvious ones (operator, math, etc), string/buffer
    processing (string, re), C bridge stuff (struct, array), and OS basics
    (time, file system, etc). Nice-to-haves would be buffer and image
    decompression (zlib, libpng, etc), crypto modules, and xml. As far as
    I can imagine, all of these modules already contain
    little, if any, global data, so I have to believe they'd be super easy
    to make "PyC happy". Patrick, what would you see you guys using?

    That's the rub... In our case, we're doing image and video
    manipulation--stuff not good to be messaging from address space to
    address space. The same argument holds for numerical processing with
    large data sets. The workers handing back huge data sets via
    messaging isn't very attractive.
    In the multiprocessing module environment could you not use shared
    memory, then, for the large shared data items?
    As I understand things, the multiprocessing module puts stuff in a child
    process (i.e. a separate address space), so the only way to get stuff to/
    from it is via IPC, which can include a shared/mapped memory region.
    Unfortunately, a shared address region doesn't work when you have
    large and opaque objects (e.g. a rendered CoreVideo movie in the
    QuickTime API or 300 megs of audio data that just went through a
    DSP). Then you've got the hit of serialization if you've got
    intricate data structures (that would normally need to be
    serialized, such as a hashtable or something). Also, if I may speak
    for commercial developers out there who are just looking to get the
    job done without new code, it's usually preferable to just use a
    single high level sync object (for when the job is complete) than to
    start a child process and use IPC. The former is just WAY less
    code, plain and simple.


    Andy
  • Jesse Noller at Oct 24, 2008 at 9:02 pm

    On Fri, Oct 24, 2008 at 4:51 PM, Andy O'Meara wrote:

    In the multiprocessing module environment could you not use shared
    memory, then, for the large shared data items?
    As I understand things, the multiprocessing module puts stuff in a child
    process (i.e. a separate address space), so the only way to get stuff to/
    from it is via IPC, which can include a shared/mapped memory region.
    Unfortunately, a shared address region doesn't work when you have
    large and opaque objects (e.g. a rendered CoreVideo movie in the
    QuickTime API or 300 megs of audio data that just went through a
    DSP). Then you've got the hit of serialization if you've got
    intricate data structures (that would normally need to be
    serialized, such as a hashtable or something). Also, if I may speak
    for commercial developers out there who are just looking to get the
    job done without new code, it's usually preferable to just use a
    single high level sync object (for when the job is complete) than to
    start a child process and use IPC. The former is just WAY less
    code, plain and simple.
    Are you familiar with the API at all? Multiprocessing was designed to
    mimic threading in about every way possible; the only restriction on
    shared data is that it must be serializable, but even then you can
    override or customize the behavior.

    Also, inter process communication is done via pipes. It can also be
    done with messages if you want to tweak the manager(s).
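
    For readers following along, a minimal example of the API being
    referred to (the worker function and payload are invented; note that
    everything crossing the Pipe gets pickled, which is exactly the
    serialization cost debated below):

    from multiprocessing import Process, Pipe

    def worker(conn):
        job = conn.recv()                      # pickled on the way through the pipe
        conn.send({"frames": job["count"], "status": "done"})
        conn.close()

    if __name__ == "__main__":
        parent_conn, child_conn = Pipe()
        p = Process(target=worker, args=(child_conn,))
        p.start()
        parent_conn.send({"count": 300})       # payload must be picklable
        print(parent_conn.recv())
        p.join()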

    -jesse
  • Andy O'Meara at Oct 24, 2008 at 11:50 pm

    Are you familiar with the API at all? Multiprocessing was designed to
    mimic threading in about every way possible; the only restriction on
    shared data is that it must be serializable, but even then you can
    override or customize the behavior.

    Also, inter process communication is done via pipes. It can also be
    done with messages if you want to tweak the manager(s).
    I apologize in advance if I don't understand something correctly, but
    as I understand them, everything has to be serialized in order to go
    through IPC. So when you're talking about thousands of objects,
    buffers, and/or large OS opaque objects (e.g. memory-resident video
    and images), that seems like a pretty rough hit of run-time resources.

    Please don't misunderstand my comments to suggest that multiprocessing
    isn't great stuff. On the contrary, it's very impressive and it
    singlehandedly catapults python *way* closer to efficient CPU bound
    processing than it ever was before. All I mean to say is that in the
    case where you're using a shared address space with a worker pthread per
    spare core to do CPU bound work, it's a really big win not to have to
    serialize stuff. And in the case of hundreds of megs of data and/or
    thousands of data structure instances, it's a deal breaker to
    serialize and unserialize everything just so that it can be sent
    through IPC. It's a deal breaker for most performance-centric apps
    because of the unnecessary runtime resource hit and because now all
    those data structures being passed around have to have accompanying
    serialization code written (and maintained) for them. That's
    actually what I meant when I made the comment that a high level sync
    object in a shared address space is "better" than sending it all
    through IPC (when the data sets are wild and crazy). From a C/C++
    point of view, I would venture to say that it's always a huge win to
    just stick those "embarrassingly easy" parallelization cases into a
    thread with a sync object rather than forking and using IPC and having to
    write all the serialization code. And in the case of huge data types--
    such as video or image rendering--it makes me nervous to think of
    serializing it all just so it can go through IPC when it could just be
    passed using a pointer change and a single sync object.
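
    As a concrete picture of that "pointer change and a single sync
    object" pattern (names invented; and of course, under today's GIL the
    pure-Python portion of such a worker would not actually run in
    parallel, which is the whole complaint):

    import threading

    results = {}                        # shared address space: nothing is copied
    done = threading.Event()            # the single high-level sync object

    def analyze(frames):
        # ... heavy (ideally C-level) work on the frames happens here ...
        results["analysis"] = {"frame_count": len(frames)}   # hand back a reference
        done.set()

    frames = [bytearray(1024) for _ in range(300)]   # stand-in for big opaque buffers
    threading.Thread(target=analyze, args=(frames,)).start()
    done.wait()                         # wait for completion; no pickling anywhere
    analysis = results["analysis"]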

    So, if I'm missing something and there's a way to pass data structures
    without serialization, then I'd definitely like to learn more (sorry
    in advance if I missed something there). When I took a look at
    multiprocessing my concerns were:
    - serialization (discussed above)
    - maturity (are we ready to bet the farm that mp is going to work
    properly on the platforms we need it to?)

    Again, I'm psyched that multiprocessing appeared in 2.6 and it's a
    huge huge step in getting everyone to unlock the power of python!
    But, then some of the tidbits described above are additional data
    points for you and others to chew on. I can tell you they're pretty
    important points for any performance-centric software provider (us,
    game developers--from EA to Ambrosia, and A/V production app
    developers like Patrick).

    Andy
  • Andy O'Meara at Oct 25, 2008 at 8:43 pm

    On Oct 24, 10:24 pm, Glenn Linderman wrote:
    And in the case of hundreds of megs of data
    ... and I would be surprised at someone that would embed hundreds of
    megs of data into an object such that it had to be serialized... seems
    like the proper design is to point at the data, or a subset of it, in a
    big buffer. Then data transfers would just transfer the offset/length
    and the reference to the buffer.
    and/or thousands of data structure instances,
    ... and this is another surprise! You have thousands of objects (data
    structure instances) to move from one thread to another?
    Heh, no, we're actually in agreement here. I'm saying that in the
    case where the data sets are large and/or intricate, a single top-
    level pointer changing hands is *always* the way to go rather than
    serialization. For example, suppose you had some nifty python code
    and C procs that were doing lots of image analysis, outputting tons of
    intricate and rich data structures. Once the thread is done with that
    job, all that output is trivially transferred back to the appropriate
    thread by a pointer changing hands.
    Of course, I know that data get large, but typical multimedia streams
    are large, binary blobs. I was under the impression that processing
    them usually proceeds along the lines of keeping offsets into the blobs,
    and interpreting, etc. Editing is usually done by making a copy of a
    blob, transforming it or a subset in some manner during the copy
    process, resulting in a new, possibly different-sized blob.
    No, you're definitely right-on, with the additional point that the
    representation of multimedia usually employs intricate and diverse
    data structures (imagine the data structure representation of a movie
    encoded in a modern codec, such as H.264, complete with paths, regions,
    pixel flow, geometry, transformations, and textures). As we both
    agree, that's something that you *definitely* want to move around via
    a single pointer (and not in a serialized form). Hence, my position
    that apps that use python can't be forced to go through IPC or else:
    (a) there's a performance/resource waste to serialize and unserialize
    large or intricate data sets, and (b) they're required to write and
    maintain serialization code that otherwise doesn't serve any other
    purpose.

    Andy
  • Andy O'Meara at Oct 27, 2008 at 2:03 am

    And in the case of hundreds of megs of data
    ... and I would be surprised at someone that would embed hundreds of
    megs of data into an object such that it had to be serialized... seems
    like the proper design is to point at the data, or a subset of it, in a
    big buffer. Then data transfers would just transfer the offset/length
    and the reference to the buffer.
    and/or thousands of data structure instances,
    ... and this is another surprise! You have thousands of objects (data
    structure instances) to move from one thread to another?
    I think we miscommunicated there--I'm actually agreeing with you. I
    was trying to make the same point you were: that intricate and/or
    large structures are meant to be passed around by a top-level pointer,
    not using serialization/messaging. This is what I've been trying
    to explain to others here; that IPC and shared memory unfortunately
    aren't viable options, leaving app threads (rather than child
    processes) as the solution.

    Of course, I know that data get large, but typical multimedia streams
    are large, binary blobs. I was under the impression that processing
    them usually proceeds along the lines of keeping offsets into the blobs,
    and interpreting, etc. Editing is usually done by making a copy of a
    blob, transforming it or a subset in some manner during the copy
    process, resulting in a new, possibly different-sized blob.

    Your instincts are right. I'd only add that when you're talking
    about data structures associated with an intricate video format, the
    complexity and depth of the data structures is insane -- the LAST
    thing you want to burn cycles on is serializing and unserializing that
    stuff (so IPC is out)--again, we're already on the same page here.

    I think at one point you made the comment that shared memory is a
    solution to handle large data sets between a child process and the
    parent. Although this is certainly true in principle, it doesn't hold
    up in practice since complex data structures often contain 3rd party
    and OS API objects that have their own allocators. For example, in
    video encoding, there's TONS of objects that comprise memory-resident
    video from all kinds of APIs, so the idea of having them allocated
    from a shared/mapped memory block isn't even possible. Again, I only
    raise this to offer evidence that doing real-world work in a child
    process is a deal breaker--a shared address space is just way too much
    to give up.


    Andy
  • James Mills at Oct 27, 2008 at 2:11 am

    On Mon, Oct 27, 2008 at 12:03 PM, Andy O'Meara wrote:
    I think we miscommunicated there--I'm actually agreeing with you. I
    was trying to make the same point you were: that intricate and/or
    large structures are meant to be passed around by a top-level pointer,
    not using serialization/messaging. This is what I've been trying
    to explain to others here; that IPC and shared memory unfortunately
    aren't viable options, leaving app threads (rather than child
    processes) as the solution.
    Andy,

    Why don't you just use a temporary file
    system (ram disk) to store the data that
    your app is manipulating? All you need to
    pass around then is a file descriptor.

    --JamesMills

    --
    --
    -- "Problems are solved by method"
  • Michael Sparks at Oct 28, 2008 at 9:34 am

    Glenn Linderman wrote:

    so a 3rd party library might be called to decompress the stream into a
    set of independently allocated chunks, each containing one frame (each
    possibly consisting of several allocations of memory for associated
    metadata) that is independent of other frames
    We use a combination of a dictionary + RGB data for this purpose. Using a
    dictionary works out pretty nicely for the metadata, and obviously one
    attribute holds the frame data as a binary blob.

    http://www.kamaelia.org/Components/pydoc/Kamaelia.Codec.YUV4MPEG gives some
    idea of structure and usage. The example given there is this:

    Pipeline( RateControlledFileReader("video.dirac", readmode="bytes", ...),
              DiracDecoder(),
              FrameToYUV4MPEG(),
              SimpleFileWriter("output.yuv4mpeg")
    ).run()

    Now all of those components are generator components.

    That's useful since:
    a) we can structure the code to show what it does more clearly, and it
    still runs efficiently inside a single process
    b) We can change this over to using multiple processes trivially:

    ProcessPipeline(
        RateControlledFileReader("video.dirac", readmode="bytes", ...),
        DiracDecoder(),
        FrameToYUV4MPEG(),
        SimpleFileWriter("output.yuv4mpeg")
    ).run()

    This version uses multiple processes (under the hood using Paul Boddie's
    pprocess library, since this support predates the multiprocessing module
    support in python).

    The big issue with *this* version, however, is that due to pprocess (and
    friends) pickling data to be sent across OS pipes, the data throughput on
    this would be lousy. Specifically in this example, if we could change it
    such that the high level API was this:

    ProcessPipeline(
        RateControlledFileReader("video.dirac", readmode="bytes", ...),
        DiracDecoder(),
        FrameToYUV4MPEG(),
        SimpleFileWriter("output.yuv4mpeg"),
        use_shared_memory_IPC = True,
    ).run()

    That would be pretty useful, for some hopefully obvious reasons. I suppose
    ideally we'd just use shared_memory_IPC for everything and just go back to
    this:

    ProcessPipeline(
        RateControlledFileReader("video.dirac", readmode="bytes", ...),
        DiracDecoder(),
        FrameToYUV4MPEG(),
        SimpleFileWriter("output.yuv4mpeg")
    ).run()

    But essentially for us, this is an optimisation problem, not a "how do I
    even begin to use this" problem. Since it is an optimisation problem, it
    also strikes me as reasonable to consider it OK to special-purpose and
    specialise such links until you get an approach that's reasonable for
    general purpose data.

    In theory, poshmodule.sourceforge.net, with a bit of TLC, would be a good
    candidate, or a good starting point, for that optimisation work
    (since it does work in Linux, contrary to a reply in the thread - I've not
    tested it under windows :).

    If someone's interested in building that, then someone redoing our MiniAxon
    tutorial using processes & shared memory IPC rather than generators would
    be a relatively gentle/structured approach to dealing with this:

    * http://www.kamaelia.org/MiniAxon/

    The reason I suggest that is because any time we think about fiddling and
    creating a new optimisation approach or concurrency approach, we tend to
    build a MiniAxon prototype to flesh out the various issues involved.


    Michael
  • Andy O'Meara at Oct 28, 2008 at 2:23 pm

    On Oct 26, 10:11 pm, "James Mills" wrote:
    On Mon, Oct 27, 2008 at 12:03 PM, Andy O'Meara wrote:
    I think we miscommunicated there--I'm actually agreeing with you. I
    was trying to make the same point you were: that intricate and/or
    large structures are meant to be passed around by a top-level pointer,
    not using serialization/messaging. This is what I've been trying
    to explain to others here; that IPC and shared memory unfortunately
    aren't viable options, leaving app threads (rather than child
    processes) as the solution.
    Andy,

    Why don't you just use a temporary file
    system (ram disk) to store the data that
    your app is manipulating? All you need to
    pass around then is a file descriptor.

    --JamesMills
    Unfortunately, it's the penalty of serialization and unserialization.
    When you're talking about stuff like memory-resident images and video
    (complete with their intricate and complex codecs), then the only
    option is to be passing around a couple of pointers rather than take the
    hit of serialization (which is huge for video, for example). I've
    gone into more detail in some other posts but I could have missed
    something.


    Andy
  • Andy O'Meara at Oct 28, 2008 at 4:14 pm

    On Oct 27, 10:55 pm, Glenn Linderman wrote:


    And I think we still are miscommunicating! Or maybe communicating anyway!

    So when you said "object", I actually don't know whether you meant
    Python object or something else. I assumed Python object, which may not
    have been correct... but read on, I think the stuff below clears it up.


    Then when you mentioned thousands of objects, I imagined thousands of
    Python objects, and somehow transforming the blob into same... and back
    again.
    My apologies to you and others here on my use of "objects" -- I use
    the term generically and mean it to *not* refer to python objects (for
    all the reasons discussed here). Python only makes up a small
    part of our app, hence my habit of using "objects" to refer to other APIs'
    allocated and opaque objects (including our own and OS APIs). For all
    the reasons we've discussed, in our world, python objects don't travel
    around outside of our python C modules -- when python objects need to
    be passed to other parts of the app, they're converted into their non-
    python (portable) equivalents (ints, floats, buffers, etc--but most of
    the time, the objects are PyCObjects, so they can enter and leave a
    python context with negligible overhead). I venture to say this is
    pretty standard when any industry app uses a package (such as python),
    for various reasons:
    - Portability/Future (e.g. if we do decide to drop Python and go
    with Lua, the changes are limited to only one region of code).
    - Sanity (having any API's objects show up in places "far away"
    goes against easy-to-follow code).
    - MT flexibility (because we never use static/global
    storage, we have all kinds of options when it comes to
    multithreading). For example, recall that by throwing python in
    multiple dynamic libs, we were able to achieve the GIL-less
    interpreter independence that we want (albeit ghetto and a pain).



    Andy
  • Patrick Stinson at Oct 29, 2008 at 10:45 pm
    If you are dealing with "lots" of data like in video or sound editing,
    you would just keep the data in shared memory and send the reference
    over IPC to the worker process. Otherwise, if you marshal and send you
    are looking at a temporary doubling of the memory footprint of your
    app because the data will be copied, and marshaling overhead.
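
    A rough sketch of the pattern Patrick describes, using the 2.6-era
    multiprocessing primitives (the worker, offsets and sizes are
    invented; as noted further down, this only works when the application
    controls where the big data gets allocated):

    import multiprocessing as mp

    def worker(shared_buf, q):
        while True:
            item = q.get()
            if item is None:              # sentinel: no more work
                break
            offset, length = item         # only this tiny tuple crossed the pipe
            chunk = shared_buf[offset:offset + length]   # read out of shared memory
            # ... process the chunk / write results back into shared_buf ...

    if __name__ == "__main__":
        # one big shared buffer for the raw frames, allocated up front
        shared_buf = mp.Array('B', 16 * 1024 * 1024)     # 16 MB of shared bytes
        q = mp.Queue()
        p = mp.Process(target=worker, args=(shared_buf, q))
        p.start()
        q.put((0, 4096))                  # "frame data lives at offset 0, 4 KB long"
        q.put(None)
        p.join()
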
    On Fri, Oct 24, 2008 at 3:50 PM, Andy O'Meara wrote:

    Are you familiar with the API at all? Multiprocessing was designed to
    mimic threading in about every way possible; the only restriction on
    shared data is that it must be serializable, but even then you can
    override or customize the behavior.

    Also, inter process communication is done via pipes. It can also be
    done with messages if you want to tweak the manager(s).
    I apologize in advance if I don't understand something correctly, but
    as I understand them, everything has to be serialized in order to go
    through IPC. So when you're talking about thousands of objects,
    buffers, and/or large OS opaque objects (e.g. memory-resident video
    and images), that seems like a pretty rough hit of run-time resources.

    Please don't misunderstand my comments to suggest that multiprocessing
    isn't great stuff. On the contrary, it's very impressive and it
    singlehandedly catapults python *way* closer to efficient CPU bound
    processing than it ever was before. All I mean to say is that in the
    case where you're using a shared address space with a worker pthread per
    spare core to do CPU bound work, it's a really big win not to have to
    serialize stuff. And in the case of hundreds of megs of data and/or
    thousands of data structure instances, it's a deal breaker to
    serialize and unserialize everything just so that it can be sent
    through IPC. It's a deal breaker for most performance-centric apps
    because of the unnecessary runtime resource hit and because now all
    those data structures being passed around have to have accompanying
    serialization code written (and maintained) for them. That's
    actually what I meant when I made the comment that a high level sync
    object in a shared address space is "better" than sending it all
    through IPC (when the data sets are wild and crazy). From a C/C++
    point of view, I would venture to say that it's always a huge win to
    just stick those "embarrassingly easy" parallelization cases into a
    thread with a sync object rather than forking and using IPC and having to
    write all the serialization code. And in the case of huge data types--
    such as video or image rendering--it makes me nervous to think of
    serializing it all just so it can go through IPC when it could just be
    passed using a pointer change and a single sync object.

    So, if I'm missing something and there's a way to pass data structures
    without serialization, then I'd definitely like to learn more (sorry
    in advance if I missed something there). When I took a look at
    multiprocessing my concerns were:
    - serialization (discussed above)
    - maturity (are we ready to bet the farm that mp is going to work
    properly on the platforms we need it to?)

    Again, I'm psyched that multiprocessing appeared in 2.6 and it's a
    huge huge step in getting everyone to unlock the power of python!
    But, then some of the tidbits described above are additional data
    points for you and others to chew on. I can tell you they're pretty
    important points for any performance-centric software provider (us,
    game developers--from EA to Ambrosia, and A/V production app
    developers like Patrick).

    Andy
  • Jesse Noller at Oct 30, 2008 at 1:26 pm

    On Wed, Oct 29, 2008 at 8:05 PM, Glenn Linderman wrote:
    On approximately 10/29/2008 3:45 PM, came the following characters from the
    keyboard of Patrick Stinson:
    If you are dealing with "lots" of data like in video or sound editing,
    you would just keep the data in shared memory and send the reference
    over IPC to the worker process. Otherwise, if you marshal and send you
    are looking at a temporary doubling of the memory footprint of your
    app because the data will be copied, and marshaling overhead.
    Right. Sounds, and is, easy, if the data is all directly allocated by the
    application. But when pieces are allocated by 3rd party libraries, that use
    the C-runtime allocator directly, then it becomes more difficult to keep
    everything in shared memory.

    One _could_ replace the C-runtime allocator, I suppose, but that could have
    some adverse effects on other code, that doesn't need its data to be in
    shared memory. So it is somewhat between a rock and a hard place.

    By avoiding shared memory, such problems are sidestepped... until you run
    smack into the GIL.
    If you do not have shared memory: You don't need threads, ergo: You
    don't get penalized by the GIL. Threads are only useful when you need
    large in-memory data structures to be shared and
    modified by a pool of workers.

    -jesse
  • Glenn Linderman at Oct 30, 2008 at 10:54 pm
    On approximately 10/30/2008 6:26 AM, came the following characters from
    the keyboard of Jesse Noller:
    On Wed, Oct 29, 2008 at 8:05 PM, Glenn Linderman wrote:

    On approximately 10/29/2008 3:45 PM, came the following characters from the
    keyboard of Patrick Stinson:
    If you are dealing with "lots" of data like in video or sound editing,
    you would just keep the data in shared memory and send the reference
    over IPC to the worker process. Otherwise, if you marshal and send you
    are looking at a temporary doubling of the memory footprint of your
    app because the data will be copied, and marshaling overhead.
    Right. Sounds, and is, easy, if the data is all directly allocated by the
    application. But when pieces are allocated by 3rd party libraries, that use
    the C-runtime allocator directly, then it becomes more difficult to keep
    everything in shared memory.

    One _could_ replace the C-runtime allocator, I suppose, but that could have
    some adverse effects on other code, that doesn't need its data to be in
    shared memory. So it is somewhat between a rock and a hard place.

    By avoiding shared memory, such problems are sidestepped... until you run
    smack into the GIL.
    If you do not have shared memory: You don't need threads, ergo: You
    don't get penalized by the GIL. Threads are only useful when you need
    large in-memory data structures to be shared and
    modified by a pool of workers.
    The whole point of this thread is to talk about large in-memory data
    structures that are shared and modified by a pool of workers.

    My reference to shared memory was specifically referring to the concept
    of sharing memory between processes... a particular OS feature that is
    called shared memory.

    The need for sharing memory among a pool of workers is still the
    premise. Threads do that automatically, without the need for the OS
    shared memory feature, that brings with it the need for a special
    allocator to allocate memory in the shared memory area vs the rest of
    the address space.

    Not to pick on you, particularly, Jesse, but this particular response
    made me finally understand why there has been so much repetition of the
    same issues and positions over and over and over in this thread: instead
    of comprehending the whole issue, people are responding to small
    fragments of it, with opinions that may be perfectly reasonable for that
    fragment, but missing the big picture, or the explanation made when the
    same issue was raised in a different sub-thread.

    --
    Glenn -- http://nevcal.com/
    ===========================
    A protocol is complete when there is nothing left to remove.
    -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
