Hi, I am observing periods of qfiles/in backlogs in the 400-600 message
count range that take 1-2 hours to clear with the standard Mailman 2.1.9 +
SpamAssassin setup (the vette log shows these messages process in an
average of ~10 seconds each).

Is there an easy way to parallelize what looks like a single serialized
Mailman queue? I see some posts re: multi-slice, but nothing definitive.

I would also like the option of working this into an overall load-balancing
scheme where I have multiple SMTP nodes behind an F5 load balancer and the
nodes share an NFS backend...

Many thanks for pointers to any HOWTO docs or wiki notes on this

Fletcher.


  • Mark Sapiro at Jun 20, 2008 at 4:01 pm

    Fletcher Cocquyt wrote:
    Hi, I am observing periods of qfiles/in backlogs in the 400-600 message
    count range that take 1-2 hours to clear with the standard Mailman 2.1.9 +
    SpamAssassin setup (the vette log shows these messages process in an
    average of ~10 seconds each)

    Is SpamAssassin invoked from Mailman, or from the MTA before Mailman? If
    this is plain Mailman, 10 seconds is a hugely long time to process a
    single post through IncomingRunner.

    If you have some SpamAssassin interface like
    <http://sourceforge.net/tracker/index.php?func=detail&aid=640518&group_id=103&atid=300103>
    that calls spamd from a Mailman handler, you might consider moving
    SpamAssassin ahead of Mailman and using something like
    <http://sourceforge.net/tracker/index.php?func=detail&aid=840426&group_id=103&atid=300103>
    or just header_filter_rules instead.

    Is there an easy way to parallelize what looks like a single serialized
    Mailman queue?
    I see some posts re: multi-slice, but nothing definitive

    See the section of Defaults.py headed with

    #####
    # Qrunner defaults
    #####

    In order to run multiple, parallel IncomingRunner processes, you can
    either copy the entire QRUNNERS definition from Defaults.py to
    mm_cfg.py
    and change

    ('IncomingRunner', 1), # posts from the outside world

    to

    ('IncomingRunner', 4), # posts from the outside world


    which says run 4 IncomingRunner processes, or you can just add
    something like

    QRUNNERS[QRUNNERS.index(('IncomingRunner',1))] = ('IncomingRunner',4)

    to mm_cfg.py. You can use any power of two for the number.
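    Both approaches work because QRUNNERS is just a Python list of
    (runner name, slice count) tuples that mm_cfg.py can rewrite after it
    imports Defaults.py. A minimal sketch (the list below mirrors the stock
    2.1 Defaults.py for illustration; check your own copy, since entries
    can vary slightly by version):

```python
# mm_cfg.py sketch -- QRUNNERS is normally inherited from Defaults.py
# via "from Defaults import *"; it is spelled out in full here so the
# example is self-contained.
QRUNNERS = [
    ('ArchRunner', 1),     # messages for the archiver
    ('BounceRunner', 1),   # for processing the qfile/bounces directory
    ('CommandRunner', 1),  # commands and bounces from the outside world
    ('IncomingRunner', 1), # posts from the outside world
    ('NewsRunner', 1),     # outgoing messages to the nntpd
    ('OutgoingRunner', 1), # outgoing messages to the smtpd
    ('VirginRunner', 1),   # internally crafted (virgin birth) messages
    ('RetryRunner', 1),    # retry temporarily failed deliveries
]

# Mark's one-liner: slice the 'in' queue across 4 parallel
# IncomingRunner processes (the slice count must be a power of 2).
QRUNNERS[QRUNNERS.index(('IncomingRunner', 1))] = ('IncomingRunner', 4)
```

    Each slice hashes queue entries into its own portion of qfiles/in, so
    the parallel runners never contend for the same message.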

    I would also like the option of working this into an overall load-balancing
    scheme where I have multiple SMTP nodes behind an F5 load balancer and the
    nodes share an NFS backend...

    The following search will return some information.

    <http://www.google.com/search?q=site%3Amail.python.org++inurl%3Amailman++%22load+balancing%22>

    --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Fletcher Cocquyt at Jun 24, 2008 at 12:55 am
    Mark, many thanks for your (as always) very helpful response - I added the
    one-liner to mm_cfg.py to increase the IncomingRunner slices to 16.
    Now I am observing (via memory trend graphs) an acceleration of what looks
    like a memory leak - maybe from Python - currently at 2.4

    I am compiling the latest 2.5.2 to see if that helps - for now the
    workaround is to restart Mailman occasionally.

    (and yes, the SpamAssassin checks are the source of the 4-10 second delay -
    now those happen in parallel x16 - so no spikes in the backlog...)

    Thanks again

    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Brad Knowles at Jun 24, 2008 at 5:44 am

    On 6/23/08, Fletcher Cocquyt wrote:

    (and yes the spamassassin checks are the source of the 4-10 second delay -
    now those happen in parallel x16 - so no spikes in the backlog...)
    Search the FAQ for "performance". Do all such spam/virus/DNS/etc...
    checking up front, and run a second copy of your MTA with all these
    checks disabled. Have Mailman deliver to the second copy of the MTA,
    because you will have already done all this stuff on input.

    There's no sense running the same message through SpamAssassin (or
    whatever) thousands of times, if you can do it once on input and then
    never again.
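    In Mailman terms, the delivery half of this setup is just two
    mm_cfg.py settings (SMTPHOST and SMTPPORT are real Mailman 2.1 knobs
    from Defaults.py; the port number and the "second unfiltered listener"
    are assumptions here, to illustrate the shape):

```python
# mm_cfg.py sketch: hand Mailman's outgoing mail to a second MTA
# listener that has the SpamAssassin/virus checks disabled, so each
# message is scanned only once -- on its way in -- and never again on
# its way out to list members.  Port 10025 is a hypothetical example;
# configure your MTA to run an unfiltered instance there.
SMTPHOST = 'localhost'
SMTPPORT = 10025
```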

    --
    Brad Knowles <brad at shub-internet.org>
    LinkedIn Profile: <http://tinyurl.com/y8kpxu>
  • Fletcher Cocquyt at Jul 1, 2008 at 4:19 pm
    An update - I've upgraded to the latest stable Python (2.5.2) and it's made
    no difference to the process growth:
    Config:
    Solaris 10 x86
    Python 2.5.2
    Mailman 2.1.9 (8 Incoming queue runners - the leak rate increases with this)
    SpamAssassin 3.2.5

    At this point I am looking for ways to isolate the suspected memory leak - I
    am looking at using dtrace: http://blogs.sun.com/sanjeevb/date/200506

    Any other tips appreciated!

    Initial (immediately after a /etc/init.d/mailman restart):
    last pid: 10330;  load averages: 0.45, 0.19, 0.15    09:13:33
    93 processes: 92 sleeping, 1 on cpu
    CPU states: 98.6% idle, 0.4% user, 1.0% kernel, 0.0% iowait, 0.0% swap
    Memory: 1640M real, 1160M free, 444M swap in use, 2779M swap free

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    10314 mailman 1 59 0 9612K 7132K sleep 0:00 0.35% python
    10303 mailman 1 59 0 9604K 7080K sleep 0:00 0.15% python
    10305 mailman 1 59 0 9596K 7056K sleep 0:00 0.14% python
    10304 mailman 1 59 0 9572K 7036K sleep 0:00 0.14% python
    10311 mailman 1 59 0 9572K 7016K sleep 0:00 0.13% python
    10310 mailman 1 59 0 9572K 7016K sleep 0:00 0.13% python
    10306 mailman 1 59 0 9556K 7020K sleep 0:00 0.14% python
    10302 mailman 1 59 0 9548K 6940K sleep 0:00 0.13% python
    10319 mailman 1 59 0 9516K 6884K sleep 0:00 0.15% python
    10312 mailman 1 59 0 9508K 6860K sleep 0:00 0.12% python
    10321 mailman 1 59 0 9500K 6852K sleep 0:00 0.14% python
    10309 mailman 1 59 0 9500K 6852K sleep 0:00 0.13% python
    10307 mailman 1 59 0 9500K 6852K sleep 0:00 0.13% python
    10308 mailman 1 59 0 9500K 6852K sleep 0:00 0.12% python
    10313 mailman 1 59 0 9500K 6852K sleep 0:00 0.12% python

    After 8 hours:
    last pid: 9878;  load averages: 0.14, 0.12, 0.13    09:12:18
    97 processes: 96 sleeping, 1 on cpu
    CPU states: 97.2% idle, 1.2% user, 1.6% kernel, 0.0% iowait, 0.0% swap
    Memory: 1640M real, 179M free, 2121M swap in use, 1100M swap free

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    10123 mailman 1 59 0 314M 311M sleep 1:57 0.02% python
    10131 mailman 1 59 0 310M 307M sleep 1:35 0.01% python
    10124 mailman 1 59 0 309M 78M sleep 0:45 0.10% python
    10134 mailman 1 59 0 307M 81M sleep 1:27 0.01% python
    10125 mailman 1 59 0 307M 79M sleep 0:42 0.01% python
    10133 mailman 1 59 0 44M 41M sleep 0:14 0.01% python
    10122 mailman 1 59 0 34M 30M sleep 0:43 0.39% python
    10127 mailman 1 59 0 31M 27M sleep 0:40 0.26% python
    10130 mailman 1 59 0 30M 26M sleep 0:15 0.03% python
    10129 mailman 1 59 0 28M 24M sleep 0:19 0.10% python
    10126 mailman 1 59 0 28M 25M sleep 1:07 0.59% python
    10132 mailman 1 59 0 27M 24M sleep 1:00 0.46% python
    10128 mailman 1 59 0 27M 24M sleep 0:16 0.01% python
    10151 mailman 1 59 0 9516K 3852K sleep 0:05 0.01% python
    10150 mailman 1 59 0 9500K 3764K sleep 0:00 0.00% python

    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Vidiot at Jul 1, 2008 at 5:09 pm

    Solaris 10 x86
    Python 2.5.2
    Mailman 2.1.9 (8 Incoming queue runners - the leak rate increases with this)
    SpamAssassin 3.2.5

    At this point I am looking for ways to isolate the suspected memory leak - I
    am looking at using dtrace: http://blogs.sun.com/sanjeevb/date/200506

    Any other tips appreciated!
    I'd start by installing 2.1.11, which was just released yesterday.

    MB
    --
    e-mail: vidiot at vidiot.com /~\ The ASCII
    [I've been to Earth. I know where it is. ] \ / Ribbon Campaign
    [And I'm gonna take us there. Starbuck 3/25/07] X Against
    Visit - URL: http://vidiot.com/ / \ HTML Email
  • Fletcher Cocquyt at Jul 1, 2008 at 7:56 pm
    I'm having a hard time finding the release notes for 2.1.11 - can you please
    provide a link?
    (I want to see where it details any memory leak fixes since 2.1.9)

    thanks

    On 7/1/08 10:09 AM, "Vidiot" wrote:

    I'd start by installing 2.1.11, which was just released yesterday.

    MB
    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Vidiot at Jul 1, 2008 at 8:05 pm

    I'm having a hard time finding the release notes for 2.1.11 - can you please
    provide a link?
    (I want to see where it details any memory leak fixes since 2.1.9)
    There should be one on the list.org website. If not, I do not know where
    it is. Should also be in the package.

    MB
    --
    e-mail: vidiot at vidiot.com /~\ The ASCII
    [I've been to Earth. I know where it is. ] \ / Ribbon Campaign
    [And I'm gonna take us there. Starbuck 3/25/07] X Against
    Visit - URL: http://vidiot.com/ / \ HTML Email
  • Fletcher Cocquyt at Jul 1, 2008 at 8:28 pm
    Not finding a "leak" ref - save an irrelevant (for this runner issue) admindb
    one:

    god at irt-smtp-02:mailman-2.1.11 1:26pm 58 # ls
    ACKNOWLEDGMENTS  BUGS  FAQ  INSTALL  Mailman  Makefile.in  NEWS
    README  README-I18N.en  README.CONTRIB  README.NETSCAPE
    README.USERAGENT  STYLEGUIDE.txt  TODO  UPGRADING  bin  configure
    configure.in  contrib  cron  doc  gnu-COPYING-GPL  install-sh
    messages  misc  mkinstalldirs  scripts  src  templates  tests
    god at irt-smtp-02:mailman-2.1.11 1:26pm 59 # egrep -i leak *
    NEWS: (Tokio Kikuchi's i18n patches), 862906 (unicode prefix leak in
    admindb),


    Thanks



    On 7/1/08 1:05 PM, "Vidiot" wrote:

    I'm having a hard time finding the release notes for 2.1.11 - can you please
    provide a link?
    (I want to see where it details any memory leak fixes since 2.1.9)
    There should be one on the list.org website. If not, I do not know where
    it is. Should also be in the package.

    MB
    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Mark Sapiro at Jul 1, 2008 at 10:37 pm

    Fletcher Cocquyt wrote:
    Not finding a "leak" ref - save an irrelevant (for this runner issue) admindb

    Nothing has been done in Mailman to fix any memory leaks. As far as I
    know, nothing has been done to create any either.

    If there is a leak, it is most likely in the underlying Python and not
    a Mailman issue per se.

    I am curious. You say this problem was exacerbated when you went from
    one IncomingRunner to eight (sliced) IncomingRunners. The
    IncomingRunner instances themselves should be processing fewer
    messages each, and I would expect them to leak less. The other runners
    are doing the same as before so I would expect them to be the same
    unless by solving your 'in' queue backlog, you're just handling a
    whole lot more messages.

    Also, in an 8 hour period, I would expect that RetryRunner and
    CommandRunner and, unless you are doing a lot of mail -> news
    gatewaying, NewsRunner to have done virtually nothing.

    In this snapshot

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    10123 mailman 1 59 0 314M 311M sleep 1:57 0.02% python
    10131 mailman 1 59 0 310M 307M sleep 1:35 0.01% python
    10124 mailman 1 59 0 309M 78M sleep 0:45 0.10% python
    10134 mailman 1 59 0 307M 81M sleep 1:27 0.01% python
    10125 mailman 1 59 0 307M 79M sleep 0:42 0.01% python
    10133 mailman 1 59 0 44M 41M sleep 0:14 0.01% python
    10122 mailman 1 59 0 34M 30M sleep 0:43 0.39% python
    10127 mailman 1 59 0 31M 27M sleep 0:40 0.26% python
    10130 mailman 1 59 0 30M 26M sleep 0:15 0.03% python
    10129 mailman 1 59 0 28M 24M sleep 0:19 0.10% python
    10126 mailman 1 59 0 28M 25M sleep 1:07 0.59% python
    10132 mailman 1 59 0 27M 24M sleep 1:00 0.46% python
    10128 mailman 1 59 0 27M 24M sleep 0:16 0.01% python
    10151 mailman 1 59 0 9516K 3852K sleep 0:05 0.01% python
    10150 mailman 1 59 0 9500K 3764K sleep 0:00 0.00% python

    Which processes correspond to which runners? And why are the two
    processes that have apparently done the least the ones that have grown
    the most?

    In fact, why are none of these 15 PIDs the same as the ones from 8
    hours earlier, or was that snapshot actually from after the above were
    restarted?

    --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Fletcher Cocquyt at Jul 1, 2008 at 11:20 pm

    On 7/1/08 3:37 PM, "Mark Sapiro" wrote:

    Fletcher Cocquyt wrote:
    Not finding a "leak" ref - save an irrelevant (for this runner issue) admindb

    Nothing has been done in Mailman to fix any memory leaks. As far as I
    know, nothing has been done to create any either.
    OK, thanks for confirming that - I will not prioritize a Mailman
    2.1.9 -> 2.1.11 upgrade
    If there is a leak, it is most likely in the underlying Python and not
    a Mailman issue per se.
    Agreed - hence my first priority was to upgrade from Python 2.4.x to 2.5.2
    (the latest on python.org) - but upgrading did not help this
    I am curious. You say this problem was exacerbated when you went from
    one IncomingRunner to eight (sliced) IncomingRunners. The
    IncomingRunner instances themselves should be processing fewer
    messages each, and I would expect them to leak less. The other runners
    are doing the same as before so I would expect them to be the same
    unless by solving your 'in' queue backlog, you're just handling a
    whole lot more messages.
    Also, in an 8 hour period, I would expect that RetryRunner and
    CommandRunner and, unless you are doing a lot of mail -> news
    gatewaying, NewsRunner to have done virtually nothing.

    In this snapshot

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    10123 mailman 1 59 0 314M 311M sleep 1:57 0.02% python
    10131 mailman 1 59 0 310M 307M sleep 1:35 0.01% python
    10124 mailman 1 59 0 309M 78M sleep 0:45 0.10% python
    10134 mailman 1 59 0 307M 81M sleep 1:27 0.01% python
    10125 mailman 1 59 0 307M 79M sleep 0:42 0.01% python
    10133 mailman 1 59 0 44M 41M sleep 0:14 0.01% python
    10122 mailman 1 59 0 34M 30M sleep 0:43 0.39% python
    10127 mailman 1 59 0 31M 27M sleep 0:40 0.26% python
    10130 mailman 1 59 0 30M 26M sleep 0:15 0.03% python
    10129 mailman 1 59 0 28M 24M sleep 0:19 0.10% python
    10126 mailman 1 59 0 28M 25M sleep 1:07 0.59% python
    10132 mailman 1 59 0 27M 24M sleep 1:00 0.46% python
    10128 mailman 1 59 0 27M 24M sleep 0:16 0.01% python
    10151 mailman 1 59 0 9516K 3852K sleep 0:05 0.01% python
    10150 mailman 1 59 0 9500K 3764K sleep 0:00 0.00% python

    Which processes correspond to which runners? And why are the two
    processes that have apparently done the least the ones that have grown
    the most?

    In fact, why are none of these 15 PIDs the same as the ones from 8
    hours earlier, or was that snapshot actually from after the above were
    restarted?
    Yes, I snapshotted the current leaked state, then restarted and snapped
    those new PIDs to show the size diff.

    Here is the current leaked state since the cron 13:27 restart only 3
    hours ago:
    last pid: 20867;  load averages: 0.53, 0.47, 0.24    16:04:15
    91 processes: 90 sleeping, 1 on cpu
    CPU states: 99.1% idle, 0.3% user, 0.6% kernel, 0.0% iowait, 0.0% swap
    Memory: 1640M real, 77M free, 1509M swap in use, 1699M swap free

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    24167 mailman 1 59 0 311M 309M sleep 0:28 0.02% python
    24158 mailman 1 59 0 308M 305M sleep 0:30 0.01% python
    24169 mailman 1 59 0 303M 301M sleep 0:28 0.01% python
    24165 mailman 1 59 0 29M 27M sleep 0:09 0.03% python
    24161 mailman 1 59 0 29M 27M sleep 0:12 0.07% python
    24164 mailman 1 59 0 28M 26M sleep 0:07 0.01% python
    24172 mailman 1 59 0 26M 24M sleep 0:04 0.01% python
    24160 mailman 1 59 0 26M 24M sleep 0:08 0.01% python
    24162 mailman 1 59 0 26M 23M sleep 0:10 0.01% python
    24166 mailman 1 59 0 26M 23M sleep 0:04 0.01% python
    24171 mailman 1 59 0 25M 23M sleep 0:04 0.02% python
    24163 mailman 1 59 0 24M 22M sleep 0:04 0.01% python
    24168 mailman 1 59 0 19M 17M sleep 0:03 0.02% python
    24170 mailman 1 59 0 9516K 6884K sleep 0:01 0.01% python
    24159 mailman 1 59 0 9500K 6852K sleep 0:00 0.00% python

    And the mapping to the runners:
    god at irt-smtp-02:mailman-2.1.11 4:16pm 66 # /usr/ucb/ps auxw | egrep mailman |
    awk '{print $2 " " $11}'
    24167 --runner=IncomingRunner:5:8
    24165 --runner=BounceRunner:0:1
    24158 --runner=IncomingRunner:7:8
    24162 --runner=VirginRunner:0:1
    24163 --runner=IncomingRunner:1:8
    24166 --runner=IncomingRunner:0:8
    24168 --runner=IncomingRunner:4:8
    24169 --runner=IncomingRunner:2:8
    24171 --runner=IncomingRunner:6:8
    24172 --runner=IncomingRunner:3:8
    24160 --runner=CommandRunner:0:1
    24161 --runner=OutgoingRunner:0:1
    24164 --runner=ArchRunner:0:1
    24170 /bin/python
    24159 /bin/python

    Thanks for the analysis,
    Fletcher
  • Mark Sapiro at Jul 2, 2008 at 1:14 am

    Fletcher Cocquyt wrote:
    Here is the current leaked state since the cron 13:27 restart only 3
    hours ago:
    last pid: 20867;  load averages: 0.53, 0.47, 0.24    16:04:15
    91 processes: 90 sleeping, 1 on cpu
    CPU states: 99.1% idle, 0.3% user, 0.6% kernel, 0.0% iowait, 0.0% swap
    Memory: 1640M real, 77M free, 1509M swap in use, 1699M swap free

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    24167 mailman 1 59 0 311M 309M sleep 0:28 0.02% python
    24158 mailman 1 59 0 308M 305M sleep 0:30 0.01% python
    24169 mailman 1 59 0 303M 301M sleep 0:28 0.01% python
    24165 mailman 1 59 0 29M 27M sleep 0:09 0.03% python
    24161 mailman 1 59 0 29M 27M sleep 0:12 0.07% python
    24164 mailman 1 59 0 28M 26M sleep 0:07 0.01% python
    24172 mailman 1 59 0 26M 24M sleep 0:04 0.01% python
    24160 mailman 1 59 0 26M 24M sleep 0:08 0.01% python
    24162 mailman 1 59 0 26M 23M sleep 0:10 0.01% python
    24166 mailman 1 59 0 26M 23M sleep 0:04 0.01% python
    24171 mailman 1 59 0 25M 23M sleep 0:04 0.02% python
    24163 mailman 1 59 0 24M 22M sleep 0:04 0.01% python
    24168 mailman 1 59 0 19M 17M sleep 0:03 0.02% python
    24170 mailman 1 59 0 9516K 6884K sleep 0:01 0.01% python
    24159 mailman 1 59 0 9500K 6852K sleep 0:00 0.00% python

    And the mapping to the runners:
    god at irt-smtp-02:mailman-2.1.11 4:16pm 66 # /usr/ucb/ps auxw | egrep mailman |
    awk '{print $2 " " $11}'
    24167 --runner=IncomingRunner:5:8
    24165 --runner=BounceRunner:0:1
    24158 --runner=IncomingRunner:7:8
    24162 --runner=VirginRunner:0:1
    24163 --runner=IncomingRunner:1:8
    24166 --runner=IncomingRunner:0:8
    24168 --runner=IncomingRunner:4:8
    24169 --runner=IncomingRunner:2:8
    24171 --runner=IncomingRunner:6:8
    24172 --runner=IncomingRunner:3:8
    24160 --runner=CommandRunner:0:1
    24161 --runner=OutgoingRunner:0:1
    24164 --runner=ArchRunner:0:1
    24170 /bin/python
    24159 /bin/python

    What are these last 2? Presumably they are the missing NewsRunner and
    RetryRunner, but what is the extra stuff in the ps output causing $11
    to be the python command and not the runner option? And again, why are
    these two, which presumably have done nothing, seemingly the biggest?

    Here are some additional thoughts.

    Are you sure there is an actual leak? Do you know that, if you just let
    them run, they don't reach some stable size and remain there, as
    opposed to growing so large that they eventually throw a MemoryError
    exception and get restarted by mailmanctl?

    If you allowed them to do that once, the MemoryError traceback might
    provide a clue.

    Caveat! I know very little about Python's memory management. Some of
    what follows may be wrong.

    Here's what I think - Python allocates more memory (from the OS) as
    needed to import additional modules and create new objects. Imports
    don't go away, but objects that are destroyed or become unreachable
    (e.g. a file object that is closed, or a message object whose only
    reference gets assigned to something else) become candidates for
    garbage collection and ultimately the memory allocated to them is
    collected and reused (assuming no leaks). I *think* however, that no
    memory is ever actually freed back to the OS. Thus, Python processes
    that run for a long time can grow, but don't shrink.

    Now, IncomingRunner in particular can get very large if large messages
    are arriving, even if those messages are ultimately not processed very
    far. Incoming runner reads the entire message into memory and then
    parses it into a message object which is even bigger than the message
    string. So, if someone happens to send a 100MB attachment to a list,
    IncomingRunner is going to need over 200MB before it ever looks at the
    message itself. This memory will later become available for other use
    within that IncomingRunner instance, but I don't think it is ever
    freed back to the OS.
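    The raw-string-plus-parsed-object cost is easy to see with the stdlib
    email package that Mailman builds on (a sketch with an illustrative
    1 MB body, not a measurement of Mailman itself):

```python
import email

# Build a raw message with a ~1 MB body, then parse it the way
# IncomingRunner does.  While parsing, the raw string and the parsed
# Message object are both live at once, so peak memory exceeds twice
# the message size before any handler has even looked at it.
body = "A" * (1024 * 1024)
raw = ("From: sender@example.com\n"
       "To: list@example.com\n"
       "Subject: big attachment\n"
       "\n" + body)

msg = email.message_from_string(raw)  # second full copy of the payload

print("raw string: %d bytes, payload copy: %d bytes"
      % (len(raw), len(msg.get_payload())))
```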

    Also, I see very little memory change between the 3 hour old snapshot
    above and the 8 hour old one from your prior post. If this is really a
    memory leak, I'd expect the 8 hour old ones to be perhaps twice as big
    as the 3 hour old ones.

    Also, do you have any really big lists with big config.pck files? If
    so, runners will grow as they instantiate that (those) big list(s).

    --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Mark Sapiro at Jul 2, 2008 at 2:16 am

    Mark Sapiro wrote:

    Which processes correspond to which runners? And why are the two
    processes that have apparently done the least the ones that have grown
    the most?

    and

    What are these last 2? Presumably they are the missing NewsRunner and
    RetryRunner, but what is the extra stuff in the ps output causing $11
    to be the python command and not the runner option? And again, why are
    these two, which presumably have done nothing, seemingly the biggest?
    Doh! I finally noticed these are in K and the others are in M, so that
    question is answered at least - the two that haven't done anything
    actually haven't grown.

    --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Fletcher Cocquyt at Jul 2, 2008 at 3:26 am
    pmap shows it's the heap:

    god at irt-smtp-02:in 8:08pm 64 # pmap 24167
    24167: /bin/python /opt/mailman-2.1.9/bin/qrunner
    --runner=IncomingRunner:5:8
    08038000 64K rwx-- [ stack ]
    08050000 940K r-x-- /usr/local/stow/Python-2.5.2/bin/python
    0814A000 172K rwx-- /usr/local/stow/Python-2.5.2/bin/python
    08175000 312388K rwx-- [ heap ]
    CF210000 64K rwx-- [ anon ]
    <--many small libs -->
    total 318300K

    Whether it's a leak or not, we need to understand why the heap is growing
    and put a limit on its growth to avoid exhausting memory and swapping into
    oblivion...
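    Until the cause is isolated, one blunt way to put a limit on the growth
    (a sketch only, not something Mailman provides; the 512 MB figure is an
    arbitrary example) is to cap each runner's address space with
    resource.setrlimit, so a runaway process dies with MemoryError and gets
    restarted by mailmanctl instead of dragging the box into swap:

```python
import resource

def cap_address_space(max_bytes):
    """Tighten this process's virtual-memory limit; allocations past
    the cap then raise MemoryError instead of pushing the host into
    swap.  (RLIMIT_AS availability varies by platform; on some Solaris
    builds it corresponds to RLIMIT_VMEM.)"""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard == resource.RLIM_INFINITY or hard > max_bytes:
        hard = max_bytes  # only ever lower the hard limit, never raise it
    resource.setrlimit(resource.RLIMIT_AS, (min(max_bytes, hard), hard))

# Hypothetical integration point: call this early in a small wrapper
# around bin/qrunner, e.g. cap_address_space(512 * 1024 * 1024).
```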

    None of the lists seem too big (du -sk sizes, in KB):
    god at irt-smtp-02:lists 8:24pm 73 # du -sk */*pck | sort -nr | head | awk
    '{print $1}'
    1392
    1240
    1152
    1096
    912
    720
    464
    168
    136
    112

    Researching Python heap allocation....

    thanks



    On 7/1/08 6:14 PM, "Mark Sapiro" wrote:

    Fletcher Cocquyt wrote:
    Here is the current leaked state since the the cron 13:27 restart only 3
    hours ago:
    last pid: 20867; load averages: 0.53, 0.47, 0.24
    16:04:15
    91 processes: 90 sleeping, 1 on cpu
    CPU states: 99.1% idle, 0.3% user, 0.6% kernel, 0.0% iowait, 0.0% swap
    Memory: 1640M real, 77M free, 1509M swap in use, 1699M swap free

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    24167 mailman 1 59 0 311M 309M sleep 0:28 0.02% python
    24158 mailman 1 59 0 308M 305M sleep 0:30 0.01% python
    24169 mailman 1 59 0 303M 301M sleep 0:28 0.01% python
    24165 mailman 1 59 0 29M 27M sleep 0:09 0.03% python
    24161 mailman 1 59 0 29M 27M sleep 0:12 0.07% python
    24164 mailman 1 59 0 28M 26M sleep 0:07 0.01% python
    24172 mailman 1 59 0 26M 24M sleep 0:04 0.01% python
    24160 mailman 1 59 0 26M 24M sleep 0:08 0.01% python
    24162 mailman 1 59 0 26M 23M sleep 0:10 0.01% python
    24166 mailman 1 59 0 26M 23M sleep 0:04 0.01% python
    24171 mailman 1 59 0 25M 23M sleep 0:04 0.02% python
    24163 mailman 1 59 0 24M 22M sleep 0:04 0.01% python
    24168 mailman 1 59 0 19M 17M sleep 0:03 0.02% python
    24170 mailman 1 59 0 9516K 6884K sleep 0:01 0.01% python
    24159 mailman 1 59 0 9500K 6852K sleep 0:00 0.00% python

    And the mapping to the runners:
    god at irt-smtp-02:mailman-2.1.11 4:16pm 66 # /usr/ucb/ps auxw | egrep mailman
    awk '{print $2 " " $11}'
    24167 --runner=IncomingRunner:5:8
    24165 --runner=BounceRunner:0:1
    24158 --runner=IncomingRunner:7:8
    24162 --runner=VirginRunner:0:1
    24163 --runner=IncomingRunner:1:8
    24166 --runner=IncomingRunner:0:8
    24168 --runner=IncomingRunner:4:8
    24169 --runner=IncomingRunner:2:8
    24171 --runner=IncomingRunner:6:8
    24172 --runner=IncomingRunner:3:8
    24160 --runner=CommandRunner:0:1
    24161 --runner=OutgoingRunner:0:1
    24164 --runner=ArchRunner:0:1
    24170 /bin/python
    24159 /bin/python

    What are these last 2? Presumably they are the missing NewsRunner and
    RetryRunner, but what is the extra stuff in the ps output causing $11
    to be the python command and not the runner option? And again, why are
    these two, which presumably have done nothing, seemingly the biggest.

    Here's some additional thought.

    Are you sure there is an actual leak? Do you know that, if you just let
    them run, they don't reach some stable size and remain there, as
    opposed to growing so large that they eventually throw a MemoryError
    exception and get restarted by mailmanctl?

    If you allowed them to do that once, the MemoryError traceback might
    provide a clue.

    Caveat! I know very little about Python's memory management. Some of
    what follows may be wrong.

    Here's what I think: Python allocates more memory (from the OS) as
    needed to import additional modules and create new objects. Imports
    don't go away, but objects that are destroyed or become unreachable
    (e.g., a file object that is closed, or a message object whose only
    reference gets reassigned) become candidates for garbage collection,
    and ultimately the memory allocated to them is collected and reused
    (assuming no leaks). I *think*, however, that no memory is ever
    actually freed back to the OS. Thus, Python processes that run for a
    long time can grow, but don't shrink.
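    That "grow but don't shrink" behavior shows up in a process's memory
    high-water mark, which only ever rises. A standalone sketch (not
    Mailman code; the `resource` module is POSIX-only):

    ```python
    import resource

    def peak_rss():
        # ru_maxrss is the process's peak resident size (KB on Linux, bytes
        # on some other platforms); it is a high-water mark, not current use.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    before = peak_rss()
    big = 'x' * (100 * 1024 * 1024)   # simulate holding a 100 MB message body
    during = peak_rss()
    del big                           # memory is reusable *within* this process...
    after = peak_rss()

    # ...but the high-water mark never comes back down.
    assert during >= before
    assert after >= during            # deleting the object does not lower the peak
    ```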

    Now, IncomingRunner in particular can get very large if large messages
    are arriving, even if those messages are ultimately not processed very
    far. IncomingRunner reads the entire message into memory and then
    parses it into a message object which is even bigger than the message
    string. So, if someone happens to send a 100MB attachment to a list,
    IncomingRunner is going to need over 200MB before it ever looks at the
    message itself. This memory will later become available for other use
    within that IncomingRunner instance, but I don't think it is ever
    freed back to the OS.
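    A rough sketch of that double cost, using a synthetic message and the
    standard email module (Python 3 spelling; the message and sizes here
    are illustrative, not from this site):

    ```python
    import email
    import sys

    # Build a synthetic ~2 MB message body, then parse it the way scripts/post
    # and IncomingRunner would: the raw string and the parsed Message object
    # are both alive at the same time.
    raw = 'From: user@example.com\nSubject: test\n\n' + 'payload ' * 250000
    msg = email.message_from_string(raw)
    body = msg.get_payload()

    # Peak memory use is therefore more than double the wire size of the message.
    assert len(body) == 250000 * len('payload ')
    assert sys.getsizeof(raw) >= len(raw)   # the string alone costs at least its length
    ```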

    Also, I see very little memory change between the 3-hour-old snapshot
    above and the 8-hour-old one from your prior post. If this were really
    a memory leak, I'd expect the 8-hour-old ones to be perhaps twice as
    big as the 3-hour-old ones.

    Also, do you have any really big lists with big config.pck files? If
    so, the runners will grow as they instantiate that (those) big list(s).
    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Mark Sapiro at Jul 2, 2008 at 4:22 am

    Fletcher Cocquyt wrote:
    Pmap shows it's the heap

    god at irt-smtp-02:in 8:08pm 64 # pmap 24167
    24167: /bin/python /opt/mailman-2.1.9/bin/qrunner
    --runner=IncomingRunner:5:8
    08038000 64K rwx-- [ stack ]
    08050000 940K r-x-- /usr/local/stow/Python-2.5.2/bin/python
    0814A000 172K rwx-- /usr/local/stow/Python-2.5.2/bin/python
    08175000 312388K rwx-- [ heap ]
    CF210000 64K rwx-- [ anon ]
    <--many small libs -->
    total 318300K

    Whether it's a leak or not, we need to understand why the heap is growing
    and put a limit on its growth to avoid exhausting memory and swapping into
    oblivion...

    At this point, I don't think it's a leak.

    Your runners start out at about 9.5 MB. Most of your working runners
    grow to about the 20-40 MB range, which I don't think is unusual for a
    site with some config.pck files approaching 1.4 MB.

    Only your IncomingRunners seem to grow really big, and I think that is
    because you are seeing occasional, very large messages, or perhaps it
    has something to do with your custom spam filtering interface.

    Does your MTA limit incoming message size?

    In any case, I know you're reluctant to just let it run, but I think if
    you did let it run for a couple of days, the IncomingRunners wouldn't
    get any bigger than the ~310 MB you're already seeing after 3 hours,
    and the rest of the runners would remain in the 10-50 MB range.

    I don't think you'll see a lot of paging activity in that 300+ MB,
    because I suspect that most of the time nothing is going on in most of
    that memory.

    None of the lists seem too big:
    god at irt-smtp-02:lists 8:24pm 73 # du -sk */*pck | sort -nr | head | awk '{print $1}'
    1392
    1240
    1152
    1096
    912
    720
    464
    168
    136
    112

    Researching Python heap allocation...

    You may also be interested in the FAQ article at
    <http://wiki.list.org/x/94A9>.

    --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Brad Knowles at Jul 2, 2008 at 4:48 am

    On 7/1/08, Mark Sapiro wrote:

    In any case, I know you're reluctant to just let it run, but I think if
    you did let it run for a couple of days that the IncomingRunners
    wouldn't get any bigger than the 310 +- MB that you're already seeing
    after 3 hours, and the rest of the runners would remain in the 10 - 50
    MB range.
    The thing I've discovered when doing detailed memory/performance
    analysis of Mailman queue runners in the past is that, by far, the
    largest share of the memory in use is actually shared across all
    the processes, so in this case you'd only take that ~310MB hit once.

    Some OSes make it more clear than others that this memory is being
    shared, and conversely some OSes appear to count this shared memory
    as actually belonging to multiple separate processes and end up
    vastly overstating the amount of real memory that is being allocated.
    You may also be interested in the FAQ article at
    <http://wiki.list.org/x/94A9>.
    It would be interesting to have all those same commands run for the
    system in question, to compare with the numbers in the FAQ.

    --
    Brad Knowles <brad at shub-internet.org>
    LinkedIn Profile: <http://tinyurl.com/y8kpxu>
  • Brad Knowles at Jul 2, 2008 at 5:14 am

    On 7/1/08, Fletcher Cocquyt wrote:

    Pmap shows its the heap

    god at irt-smtp-02:in 8:08pm 64 # pmap 24167
    24167: /bin/python /opt/mailman-2.1.9/bin/qrunner
    --runner=IncomingRunner:5:8
    08038000 64K rwx-- [ stack ]
    08050000 940K r-x-- /usr/local/stow/Python-2.5.2/bin/python
    0814A000 172K rwx-- /usr/local/stow/Python-2.5.2/bin/python
    08175000 312388K rwx-- [ heap ]
    CF210000 64K rwx-- [ anon ]
    <--many small libs -->
    total 318300K
    And when I do the same thing on the mail server for python.org (which
    hosts over 100 lists, including some pretty active lists with large
    numbers of subscribers), on the largest queue runner we have
    (ArchRunner at 41m), I see:

    # pmap 1040 | sort -nr -k 2 | head
    total 45800K
    0815f000 23244K rwx-- [ anon ]
    40f61000 4420K rw--- [ anon ]
    40a0f000 2340K rw--- [ anon ]
    408aa000 1300K rw--- [ anon ]
    40745000 1300K rw--- [ anon ]
    40343000 1160K r-x-- /usr/lib/i686/cmov/libcrypto.so.0.9.8
    4009c000 1092K r-x-- /lib/libc-2.3.6.so
    41844000 1040K rw--- [ anon ]
    08048000 944K r-x-- /usr/local/bin/python

    No heap showing up anywhere. Doing the same for our IncomingRunner, I get:

    # pmap 1043 | sort -nr -k 2 | head
    total 23144K
    0815f000 7740K rwx-- [ anon ]
    40b12000 1560K rw--- [ anon ]
    40745000 1300K rw--- [ anon ]
    40cb8000 1168K rw--- [ anon ]
    40347000 1160K r-x-- /usr/lib/i686/cmov/libcrypto.so.0.9.8
    4009c000 1092K r-x-- /lib/libc-2.3.6.so
    4098d000 1040K rw--- [ anon ]
    08048000 944K r-x-- /usr/local/bin/python
    4063b000 936K rw--- [ anon ]

    Again, no heap.
    None of the lists seem too big:
    god at irt-smtp-02:lists 8:24pm 73 # du -sk */*pck | sort -nr | head | awk '{print $1}'
    1392
    1240
    1152
    1096
    912
    720
    464
    168
    136
    112
    Where did you do this? In the /usr/local/mailman directory?

    When I did this in /usr/local/mailman, all of the .pck files that
    showed up were actually held messages in the data/ directory, not in
    lists/. This would mean that they were individual messages that had
    been pickled and then held for moderation, not pickles for lists.

    Doing the same in /usr/local/mailman/lists, I find that one of our
    smaller mailing lists (python-help, seventeen recipients) has the
    largest list pickle (1044 kilobytes). We have a total of 150 lists,
    and here's the current subscription count of the five biggest lists:

    4075 Python-list
    3305 Tutor
    2600 Mailman-Users
    2329 Mailman-announce
    1528 Python-announce-list

    Of these, python-list and tutor frequently get between twenty and a
    hundred or more messages in a day. However, here are their respective
    config.pck files, using the same "du -sk" command from above:

    904 tutor/config.pck
    652 python-list/config.pck
    476 mailman-users/config.pck
    324 mailman-announce/config.pck
    208 python-announce-list/config.pck

    --
    Brad Knowles <brad at shub-internet.org>
    LinkedIn Profile: <http://tinyurl.com/y8kpxu>
  • Brad Knowles at Jul 2, 2008 at 5:21 am

    On 7/1/08, Fletcher Cocquyt wrote:

    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine
    BTW, in case it hasn't come through yet -- I am very sensitive to
    your issues. In my "real" life, I am currently employed as a Sr.
    System Administrator at the University of Texas at Austin, with
    ~50,000 students and ~20,000 faculty and staff, and one of my jobs is
    helping out with both the mail system administration and the mailing
    list system administration.

    So, just because I post messages quoting the current statistics we're
    seeing on python.org, that doesn't mean I'm not sensitive to the
    problems you're seeing. All I'm saying is that we're not currently
    seeing them on python.org, so it may be a bit more difficult for us
    to directly answer your questions, although we'll certainly do
    everything we can to help.

    --
    Brad Knowles <brad at shub-internet.org>
    LinkedIn Profile: <http://tinyurl.com/y8kpxu>
  • Brad Knowles at Jul 2, 2008 at 4:58 am

    On 7/1/08, Mark Sapiro wrote:

    In this snapshot

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    10123 mailman 1 59 0 314M 311M sleep 1:57 0.02% python
    10131 mailman 1 59 0 310M 307M sleep 1:35 0.01% python
    10124 mailman 1 59 0 309M 78M sleep 0:45 0.10% python
    10134 mailman 1 59 0 307M 81M sleep 1:27 0.01% python
    10125 mailman 1 59 0 307M 79M sleep 0:42 0.01% python
    10133 mailman 1 59 0 44M 41M sleep 0:14 0.01% python
    10122 mailman 1 59 0 34M 30M sleep 0:43 0.39% python
    10127 mailman 1 59 0 31M 27M sleep 0:40 0.26% python
    10130 mailman 1 59 0 30M 26M sleep 0:15 0.03% python
    10129 mailman 1 59 0 28M 24M sleep 0:19 0.10% python
    10126 mailman 1 59 0 28M 25M sleep 1:07 0.59% python
    10132 mailman 1 59 0 27M 24M sleep 1:00 0.46% python
    10128 mailman 1 59 0 27M 24M sleep 0:16 0.01% python
    10151 mailman 1 59 0 9516K 3852K sleep 0:05 0.01% python
    10150 mailman 1 59 0 9500K 3764K sleep 0:00 0.00% python

    Which processes correspond to which runners? And why are the two
    processes that have apparently done the least the ones that have grown
    the most?
    In contrast, the mail server for python.org shows the following:

    top - 06:54:48 up 29 days, 9:09, 4 users, load average: 1.05, 1.08, 0.95
    Tasks: 151 total, 1 running, 149 sleeping, 0 stopped, 1 zombie
    Cpu(s): 0.2% user, 1.1% system, 0.0% nice, 98.7% idle

    PID USER PR VIRT NI RES SHR S %CPU TIME+ %MEM COMMAND
    1040 mailman 9 42960 0 41m 12m S 0 693:59.44 2.1 ArchRunner:0:1 -s
    1041 mailman 9 22876 0 20m 7488 S 0 478:18.62 1.0 BounceRunner:0:1
    1045 mailman 9 20412 0 19m 10m S 0 3031:12 0.9 OutgoingRunner:0:
    1043 mailman 9 20476 0 18m 4968 S 0 127:02.62 0.9 IncomingRunner:0:
    1042 mailman 9 18564 0 17m 7316 S 0 11:34.14 0.9 CommandRunner:0:1
    1046 mailman 11 17276 0 15m 10m S 1 66:32.16 0.8 VirginRunner:0:1
    1044 mailman 9 11568 0 9964 5184 S 0 12:34.04 0.5 NewsRunner:0:1 -s

    And those are the only Python-related processes that show up in the
    first twenty lines.

    --
    Brad Knowles <brad at shub-internet.org>
    LinkedIn Profile: <http://tinyurl.com/y8kpxu>
  • Fletcher Cocquyt at Jul 2, 2008 at 6:05 am
    I did a test - I disabled the SpamAssassin integration and watched the heap
    grow steadily - I do not believe it's SA-related:

    god at irt-smtp-02:mailman-2.1.9 10:51pm 68 # pmap 22804 | egrep heap
    08175000 14060K rwx-- [ heap ]
    god at irt-smtp-02:mailman-2.1.9 10:51pm 69 # pmap 22804 | egrep heap
    08175000 16620K rwx-- [ heap ]
    god at irt-smtp-02:mailman-2.1.9 10:52pm 70 # pmap 22804 | egrep heap
    08175000 16620K rwx-- [ heap ]
    god at irt-smtp-02:mailman-2.1.9 10:53pm 75 # pmap 22804 | egrep heap
    08175000 18924K rwx-- [ heap ]
    god at irt-smtp-02:mailman-2.1.9 10:54pm 81 # pmap 22804 | egrep heap
    08175000 19692K rwx-- [ heap ]
    god at irt-smtp-02:mailman-2.1.9 10:55pm 82 # pmap 22804 | egrep heap
    08175000 19692K rwx-- [ heap ]

    Trying to find a way to look at the contents of the heap, or at least limit
    its growth.
    Or is there not a way to expire & restart mailman processes, analogous to the
    apache httpd process expiration (designed to mitigate this kind of resource
    growth over time)?
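    One blunt way to "at least limit its growth", sketched below with an
    assumed 256 MB ceiling (an illustration, not anything Mailman ships):
    set an address-space rlimit early in the process, so a runaway runner
    gets a MemoryError, and a mailmanctl restart, instead of swapping the
    box into oblivion.

    ```python
    import resource

    CAP = 256 * 1024 * 1024   # assumed ceiling; tune to the site's real runners

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (CAP, hard))
    try:
        # An allocation well past the cap now fails fast instead of swapping.
        hog = bytearray(512 * 1024 * 1024)
    except MemoryError:
        hog = None   # in a runner, mailmanctl would restart the process here
    finally:
        # Restore the original limit (a real runner would keep the cap in place).
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    assert hog is None
    ```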

    thanks
    On 7/1/08 9:58 PM, "Brad Knowles" wrote:
    On 7/1/08, Mark Sapiro wrote:

    In this snapshot

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    10123 mailman 1 59 0 314M 311M sleep 1:57 0.02% python
    10131 mailman 1 59 0 310M 307M sleep 1:35 0.01% python
    10124 mailman 1 59 0 309M 78M sleep 0:45 0.10% python
    10134 mailman 1 59 0 307M 81M sleep 1:27 0.01% python
    10125 mailman 1 59 0 307M 79M sleep 0:42 0.01% python
    10133 mailman 1 59 0 44M 41M sleep 0:14 0.01% python
    10122 mailman 1 59 0 34M 30M sleep 0:43 0.39% python
    10127 mailman 1 59 0 31M 27M sleep 0:40 0.26% python
    10130 mailman 1 59 0 30M 26M sleep 0:15 0.03% python
    10129 mailman 1 59 0 28M 24M sleep 0:19 0.10% python
    10126 mailman 1 59 0 28M 25M sleep 1:07 0.59% python
    10132 mailman 1 59 0 27M 24M sleep 1:00 0.46% python
    10128 mailman 1 59 0 27M 24M sleep 0:16 0.01% python
    10151 mailman 1 59 0 9516K 3852K sleep 0:05 0.01% python
    10150 mailman 1 59 0 9500K 3764K sleep 0:00 0.00% python

    Which processes correspond to which runners? And why are the two
    processes that have apparently done the least the ones that have grown
    the most?
    In contrast, the mail server for python.org shows the following:

    top - 06:54:48 up 29 days, 9:09, 4 users, load average: 1.05, 1.08, 0.95
    Tasks: 151 total, 1 running, 149 sleeping, 0 stopped, 1 zombie
    Cpu(s): 0.2% user, 1.1% system, 0.0% nice, 98.7% idle

    PID USER PR VIRT NI RES SHR S %CPU TIME+ %MEM COMMAND
    1040 mailman 9 42960 0 41m 12m S 0 693:59.44 2.1 ArchRunner:0:1 -s
    1041 mailman 9 22876 0 20m 7488 S 0 478:18.62 1.0 BounceRunner:0:1
    1045 mailman 9 20412 0 19m 10m S 0 3031:12 0.9 OutgoingRunner:0:
    1043 mailman 9 20476 0 18m 4968 S 0 127:02.62 0.9 IncomingRunner:0:
    1042 mailman 9 18564 0 17m 7316 S 0 11:34.14 0.9 CommandRunner:0:1
    1046 mailman 11 17276 0 15m 10m S 1 66:32.16 0.8 VirginRunner:0:1
    1044 mailman 9 11568 0 9964 5184 S 0 12:34.04 0.5 NewsRunner:0:1 -s

    And those are the only Python-related processes that show up in the
    first twenty lines.
    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Mark Sapiro at Jul 2, 2008 at 3:15 pm

    Fletcher Cocquyt wrote:
    I did a test - I disabled the SpamAssassin integration and watched the heap
    grow steadily - I do not believe it's SA-related:

    OK.

    Does your MTA limit the size of incoming messages? Can it?

    At some point in the next day or so, I'm going to make a modified
    scripts/post script which will queue incoming messages in qfiles/bad
    and then move them to qfiles/in only if they are under a certain size.
    I'm really curious to see if that will help.

    Trying to find a way to look at the contents of the heap or at least limit
    its growth.
    Or is there not a way to expire & restart mailman processes analogous to the
    apache httpd process expiration (designed to mitigate this kind of resource
    growth over time)?

    bin/mailmanctl could be modified to do this automatically; currently it
    only does it on command (restart) or signal (SIGINT). But I gather
    you're already running a cron that does a periodic restart.

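    For reference, such a periodic restart can be a single crontab entry
    for the mailman user (hypothetical path and schedule; adjust to the
    local install):

    ```
    # Restart the qrunners nightly at 03:30 to reclaim memory held by
    # long-running processes (hypothetical install path).
    30 3 * * * /usr/local/mailman/bin/mailmanctl restart
    ```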
    --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Fletcher Cocquyt at Jul 2, 2008 at 3:55 pm

    On 7/2/08 8:15 AM, "Mark Sapiro" wrote:

    Fletcher Cocquyt wrote:
    I did a test - I disabled the SpamAssassin integration and watched the heap
    grow steadily - I do not believe it's SA-related:

    OK.

    Does your MTA limit the size of incoming messages? Can it?
    No, Yes
    # maximum message size
    #O MaxMessageSize=0
    At some point in the next day or so, I'm going to make a modified
    scripts/post script which will queue incoming messages in qfiles/bad
    and then move them to qfiles/in only if they are under a certain size.
    I'm really curious to see if that will help.
    Yes, having a global incoming maxmessagesize limit and handler (what will
    the sender receive back?) for mailman would be useful.
    Trying to find a way to look at the contents of the heap or at least limit
    its growth.
    Or is there not a way to expire & restart mailman processes analogous to the
    apache httpd process expiration (designed to mitigate this kind of resource
    growth over time)?

    bin/mailmanctl could be modified to do this automatically, but
    currently only does it on command (restart) or signal (SIGINT), but I
    gather you're already running a cron that does a periodic restart.
    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Mark Sapiro at Jul 3, 2008 at 3:01 am

    Fletcher Cocquyt wrote:
    On 7/2/08 8:15 AM, "Mark Sapiro" wrote:

    At some point in the next day or so, I'm going to make a modified
    scripts/post script which will queue incoming messages in qfiles/bad
    and then move them to qfiles/in only if they are under a certain size.
    I'm really curious to see if that will help.
    Yes, having a global incoming maxmessagesize limit and handler (what will
    the sender receive back?) for mailman would be useful.

    The attached 'post' file is a modified version of scripts/post.

    It does the following compared to the normal script.

    The normal script reads the message from the pipe from the MTA and
    queues it in the 'in' queue for processing by an IncomingRunner. This
    script receives the message and instead queues it in the 'bad' queue.
    It then looks at the size of the 'bad' queue entry (a Python pickle
    that will be just slightly larger than the message text). If the size
    is less than MAXSIZE bytes (a parameter near the beginning of the
    script, currently set to 1000000, but which you can change as you
    desire), it moves the queue entry from the 'bad' queue to the 'in'
    queue for processing.

    The end result is queue entries smaller than MAXSIZE will be processed
    normally, and entries >= MAXSIZE will be left in the 'bad' queue for
    manual examination (with bin/dumpdb or bin/show_qfiles) and either
    manual deletion or manual moving to the 'in' queue for processing.

    The delivery is accepted by the MTA in either case so the poster sees
    nothing unusual.

    This is not intended to be used in a normal production environment. It
    is only intended as a debug aid to see if IncomingRunners will not
    grow so large if incoming message size is limited.

    --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Barry Warsaw at Jul 3, 2008 at 3:05 am

    On Jul 2, 2008, at 11:01 PM, Mark Sapiro wrote:

    The attached 'post' file is a modified version of scripts/post.
    Hi Mark, there was no attachment.
    It does the following compared to the normal script.

    The normal script reads the message from the pipe from the MTA and
    queues it in the 'in' queue for processing by an IncomingRunner. This
    script receives the message and instead queues it in the 'bad' queue.
    It then looks at the size of the 'bad' queue entry (a Python pickle
    that will be just slightly larger than the message text). If the size
    is less than MAXSIZE bytes (a parameter near the beginning of the
    script, currently set to 1000000, but which you can change as you
    desire), it moves the queue entry from the 'bad' queue to the 'in'
    queue for processing.
    I'm not sure 'bad' should be used. Perhaps a separate queue called
    'raw'? It is nice that files > MAXSIZE need only be left in 'bad'.

    - -Barry
  • Mark Sapiro at Jul 3, 2008 at 3:15 am

    Barry Warsaw wrote:
    On Jul 2, 2008, at 11:01 PM, Mark Sapiro wrote:
    The attached 'post' file is a modified version of scripts/post.
    Hi Mark, there was no attachment.

    Yes, I know. I was just about to resend. It is attached here. The MUA I
    used to send the previous message gives any attachment without an
    extension Content-Type: application/octet-stream, so the list's content
    filtering removed it.

    It does the following compared to the normal script.
    The normal script reads the message from the pipe from the MTA and
    queues it in the 'in' queue for processing by an IncomingRunner. This
    script receives the message and instead queues it in the 'bad' queue.
    It then looks at the size of the 'bad' queue entry (a Python pickle
    that will be just slightly larger than the message text). If the size
    is less than MAXSIZE bytes (a parameter near the beginning of the
    script, currently set to 1000000, but which you can change as you
    desire), it moves the queue entry from the 'bad' queue to the 'in'
    queue for processing.
    I'm not sure 'bad' should be used. Perhaps a separate queue called
    'raw'? It is nice that files > MAXSIZE need only be left in 'bad'.

    If we're going to do something like this going forward, we can certainly
    change the queue. For this 'debug' effort, I wanted to keep it simple
    and use an existing mm_cfg queue name.

    - --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan

    -------------- next part --------------
    An embedded and charset-unspecified text was scrubbed...
    Name: post
    URL: <http://mail.python.org/pipermail/mailman-users/attachments/20080702/9c83b8ce/attachment.txt>
  • Barry Warsaw at Jul 3, 2008 at 3:55 am

    On Jul 2, 2008, at 11:15 PM, Mark Sapiro wrote:

    Yes, I know. I was just about to resend. It is attached here. The MUA I
    used to send the previous message gives any attachment without an
    extension Content-Type: application/octet-stream, so the list's content
    filtering removed it.
    Ah, np.
    It does the following compared to the normal script.
    The normal script reads the message from the pipe from the MTA and
    queues it in the 'in' queue for processing by an IncomingRunner. This
    script receives the message and instead queues it in the 'bad' queue.
    It then looks at the size of the 'bad' queue entry (a Python pickle
    that will be just slightly larger than the message text). If the size
    is less than MAXSIZE bytes (a parameter near the beginning of the
    script, currently set to 1000000, but which you can change as you
    desire), it moves the queue entry from the 'bad' queue to the 'in'
    queue for processing.
    I'm not sure 'bad' should be used. Perhaps a separate queue called
    'raw'? It is nice that files > MAXSIZE need only be left in 'bad'.

    If we're going to do something like this going forward, we can certainly
    change the queue. For this 'debug' effort, I wanted to keep it simple
    and use an existing mm_cfg queue name.
    Excellent point. A couple of very minor comments on the file, but
    other than that, it looks great. (I know you copied this from the
    original file, but still I can't resist. ;)
    # Copyright (C) 1998,1999,2000,2001,2002 by the Free Software Foundation, Inc. 1998-2008
    #
    # This program is free software; you can redistribute it and/or
    # modify it under the terms of the GNU General Public License
    # as published by the Free Software Foundation; either version 2
    # of the License, or (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program; if not, write to the Free Software
    # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

    """Accept posts to a list and handle them properly.

    The main advertised address for a list should be filtered to this program,
    through the mail wrapper. E.g. for list `test at yourdomain.com', the `test'
    alias would deliver to this script.

    Stdin is the mail message, and argv[1] is the name of the target mailing list.

    """

    import os
    import sys

    import paths
    from Mailman import mm_cfg
    from Mailman import Utils
    from Mailman.i18n import _
    from Mailman.Queue.sbcache import get_switchboard
    from Mailman.Logging.Utils import LogStdErr

    LogStdErr("error", "post")

    MAXSIZE = 1000000


    def main():
        # TBD: If you've configured your list or aliases so poorly as to get
        # either of these first two errors, there's little that can be done to
        # save your messages. They will be lost. Minimal testing of new lists
        # should avoid either of these problems.
        try:
            listname = sys.argv[1]
        except IndexError:
            print >> sys.stderr, _('post script got no listname.')
            sys.exit(1)
        # Make sure the list exists
        if not Utils.list_exists(listname):
            print >> sys.stderr, _('post script, list not found: %(listname)s')
            sys.exit(1)
        # Immediately queue the message for the incoming qrunner to process. The
        # advantage to this approach is that messages should never get lost --
        # some MTAs have a hard limit to the time a filter prog can run. Postfix
        # is a good example; if the limit is hit, the proc is SIGKILL'd giving us
        # no chance to save the message.
        bdq = get_switchboard(mm_cfg.BADQUEUE_DIR)
        filebase = bdq.enqueue(sys.stdin.read(),
                               listname=listname,
                               tolist=1, _plaintext=1)
    Should probably use True there instead of 1.
        frompath= os.path.join(mm_cfg.BADQUEUE_DIR, filebase + '.pck')
        topath= os.path.join(mm_cfg.INQUEUE_DIR, filebase + '.pck')
    Space in front of the =

        if os.stat(frompath).st_size < MAXSIZE:
            os.rename(frompath,topath)
    Space after the comma.

    if __name__ == '__main__':
        main()
    - -Barry
  • Barry Warsaw at Jul 2, 2008 at 4:03 pm

    On Jul 2, 2008, at 11:15 AM, Mark Sapiro wrote:

    At some point in the next day or so, I'm going to make a modified
    scripts/post script which will queue incoming messages in qfiles/bad
    and then move them to qfiles/in only if they are under a certain size.
    I'm really curious to see if that will help.
    This should be moved to mailman-developers, but in general it's an
    interesting idea. In MM3 I've split the incoming queue into two
    separate queues. The incoming queue now solely determines the
    disposition of the message, i.e. held, rejected, discarded or
    accepted. If accepted, the message is moved to a pipeline queue where
    it's munged for delivery (i.e. headers and footers added, etc.).

    MM3 also has an LMTP queue runner, which I'd like to make the default
    delivery mechanism for 3.0 and possibly 2.2 (yes, I still have a todo
    to back port MM3's new process architecture to 2.2). Although it's
    not there right now, it would be trivial to add a check on the raw
    size of the message before it's parsed. If it's too large then it can
    be rejected before the email package attempts to parse it, and that
    would give the upstream LMTP client (i.e. your MTA) a better diagnostic.

    It still makes sense to put a size limit in your MTA so it never hits
    the LMTP server because the string will still be in the Python
    process's memory. But at least you won't pay the penalty for parsing
    such a huge message just to reject it later.
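    A sketch of that raw-size check (hypothetical accept_data() hook, not
    the actual MM3 API; the 552 reply is the conventional "message too
    big" SMTP status, and MAXSIZE here mirrors the post-script parameter):

    ```python
    import email

    MAXSIZE = 1000000   # assumed limit, mirroring the post-script parameter

    def accept_data(raw_bytes):
        """Decide the message's fate before any parsing happens."""
        if len(raw_bytes) >= MAXSIZE:
            # Reject before the email package builds a Message tree, so the
            # process never holds both the raw text and the parsed object.
            return '552 message size exceeds fixed maximum message size'
        email.message_from_bytes(raw_bytes)   # parse only accepted messages
        return '250 Ok'

    assert accept_data(b'x' * 2000000).startswith('552')
    assert accept_data(b'Subject: hi\n\nbody').startswith('250')
    ```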
    Trying to find a way to look at the contents of the heap or at
    least limit
    its growth.
    Or is there not a way expire & restart mailman processes analogous
    to the
    apache httpd process expiration (designed to mitigate this kind of
    resource
    growth over time)?

    bin/mailmanctl could be modified to do this automatically, but
    currently only does it on command (restart) or signal (SIGINT), but I
    gather you're already running a cron that does a periodic restart.
    This is a good idea. It might be better to do this in
    Runner._doperiodic().

    - -Barry
  • Brad Knowles at Jul 2, 2008 at 4:22 pm

    Fletcher Cocquyt wrote:

    Or is there not a way to expire & restart mailman processes analogous to the
    apache httpd process expiration (designed to mitigate this kind of resource
    growth over time)?
    You can do "mailmanctl restart", but that's not really a proper solution to
    this problem.

    --
    Brad Knowles <brad at shub-internet.org>
    LinkedIn Profile: <http://tinyurl.com/y8kpxu>
  • Fletcher Cocquyt at Jul 2, 2008 at 5:12 pm
    I am hopeful our esteemed code maintainers are thinking the built-in restart
    idea is a good one:

    BW wrote:
    Or is there not a way to expire & restart mailman processes analogous to the
    apache httpd process expiration (designed to mitigate this kind of resource
    growth over time)?

    bin/mailmanctl could be modified to do this automatically, but
    currently only does it on command (restart) or signal (SIGINT), but I
    gather you're already running a cron that does a periodic restart.
    This is a good idea. It might be better to do this in
    Runner._doperiodic().

    On 7/2/08 9:22 AM, "Brad Knowles" wrote:

    Fletcher Cocquyt wrote:
    Or is there not a way to expire & restart mailman processes analogous to the
    apache httpd process expiration (designed to mitigate this kind of resource
    growth over time)?
    You can do "mailmanctl restart", but that's not really a proper solution to
    this problem.
    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Barry Warsaw at Jul 2, 2008 at 5:14 pm

    On Jul 2, 2008, at 1:12 PM, Fletcher Cocquyt wrote:

    I am hopeful our esteemed code maintainers are thinking the built in
    restart
    idea is a good one:
    Optionally, yes. By default, I'm not so sure.

    - -Barry
  • Fletcher Cocquyt at Jul 2, 2008 at 8:54 pm
    I had a parallel thread on the dtrace list to get memleak.d running

    http://blogs.sun.com/sanjeevb/date/200506

    - I just got this stack trace from a 10-second sample of the most actively
    growing python mailman process. The output is explained by Sanjeev on his
    blog, but I'm hoping the stack trace will point the analysis towards a
    cause for the abnormal growth of my mailman processes.

    I will see if the findleaks.pl analysis of this output returns anything

    Thanks!

    0 42246 realloc:return Ptr=0x824c268 Oldptr=0x0 Size
    libc.so.1`realloc+0x33a
    python`addcleanup+0x45
    python`convertsimple+0x145d
    python`vgetargs1+0x259
    python`_PyArg_ParseTuple_SizeT+0x1d
    python`posix_listdir+0x55
    python`PyEval_EvalFrameEx+0x59ff
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalFrameEx+0x49ff
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalCode+0x22
    python`PyRun_FileExFlags+0xaf
    python`PyRun_SimpleFileExFlags+0x156
    python`Py_Main+0xa6b
    python`main+0x17
    python`_start+0x80

    0 42249 free:entry Ptr=0x824c268
    0 42244 lmalloc:return Ptr=0xcf890300 Size
    libc.so.1`lmalloc+0x143
    libc.so.1`opendir+0x3e
    python`posix_listdir+0x6d
    python`PyEval_EvalFrameEx+0x59ff
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalFrameEx+0x49ff
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalCode+0x22
    python`PyRun_FileExFlags+0xaf
    python`PyRun_SimpleFileExFlags+0x156
    python`Py_Main+0xa6b
    python`main+0x17
    python`_start+0x80

    0 42244 lmalloc:return Ptr=0xcf894000 Size�92
    libc.so.1`lmalloc+0x143
    libc.so.1`opendir+0x3e
    python`posix_listdir+0x6d
    python`PyEval_EvalFrameEx+0x59ff
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalFrameEx+0x49ff
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalCode+0x22
    python`PyRun_FileExFlags+0xaf
    python`PyRun_SimpleFileExFlags+0x156
    python`Py_Main+0xa6b
    python`main+0x17
    python`_start+0x80

    0 42249 free:entry Ptr=0x86d78f0
    ^C
    0 42246 realloc:return Ptr=0x824c268 Oldptr=0x0 Size
    libc.so.1`realloc+0x33a
    python`addcleanup+0x45
    python`convertsimple+0x145d
    python`vgetargs1+0x259
    python`_PyArg_ParseTuple_SizeT+0x1d
    python`posix_listdir+0x55
    python`PyEval_EvalFrameEx+0x59ff
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalFrameEx+0x49ff
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalCode+0x22
    python`PyRun_FileExFlags+0xaf
    python`PyRun_SimpleFileExFlags+0x156
    python`Py_Main+0xa6b
    python`main+0x17
    python`_start+0x80

    0 42249 free:entry Ptr=0x824c268
    0 42244 lmalloc:return Ptr=0xcf890300 Size
    libc.so.1`lmalloc+0x143
    libc.so.1`opendir+0x3e
    python`posix_listdir+0x6d
    python`PyEval_EvalFrameEx+0x59ff
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalFrameEx+0x49ff
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalCode+0x22
    python`PyRun_FileExFlags+0xaf
    python`PyRun_SimpleFileExFlags+0x156
    python`Py_Main+0xa6b
    python`main+0x17
    python`_start+0x80

    0 42244 lmalloc:return Ptr=0xcf894000 Size�92
    libc.so.1`lmalloc+0x143
    libc.so.1`opendir+0x3e
    python`posix_listdir+0x6d
    python`PyEval_EvalFrameEx+0x59ff
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalFrameEx+0x49ff
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalCode+0x22
    python`PyRun_FileExFlags+0xaf
    python`PyRun_SimpleFileExFlags+0x156
    python`Py_Main+0xa6b
    python`main+0x17
    python`_start+0x80

    0 42249 free:entry Ptr=0x86d78f0



    On 7/2/08 10:14 AM, "Barry Warsaw" wrote:

    On Jul 2, 2008, at 1:12 PM, Fletcher Cocquyt wrote:

    I am hopeful our esteemed code maintainers are thinking the built in
    restart
    idea is a good one:
    Optionally, yes. By default, I'm not so sure.

    - -Barry

    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Fletcher Cocquyt at Jul 2, 2008 at 9:16 pm
    Below is the findleaks output from a ~5-minute sample of a python runner - I
    will take a larger sample to see if this is representative or not:
    (again the reference is http://blogs.sun.com/sanjeevb/date/200506 )

    Thanks

    god at irt-smtp-02:~ 2:10pm 67 # ./findleaks.pl ./ml.out
    Ptr=0xcf890340 Size
    libc.so.1`lmalloc+0x143
    libc.so.1`opendir+0x3e
    python`posix_listdir+0x6d
    python`PyEval_EvalFrameEx+0x59ff
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalFrameEx+0x49ff
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalCode+0x22
    python`PyRun_FileExFlags+0xaf
    python`PyRun_SimpleFileExFlags+0x156
    python`Py_Main+0xa6b
    python`main+0x17
    python`_start+0x80

    ---------
    Ptr=0xcf894000 Size�92
    libc.so.1`lmalloc+0x143
    libc.so.1`opendir+0x3e
    python`posix_listdir+0x6d
    python`PyEval_EvalFrameEx+0x59ff
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalFrameEx+0x49ff
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalFrameEx+0x6133
    python`PyEval_EvalCodeEx+0x57f
    python`PyEval_EvalCode+0x22
    python`PyRun_FileExFlags+0xaf
    python`PyRun_SimpleFileExFlags+0x156
    python`Py_Main+0xa6b
    python`main+0x17
    python`_start+0x80

    ---------


    On 7/2/08 1:54 PM, "Fletcher Cocquyt" wrote:

    I had a parallel thread on the dtrace list to get memleak.d running

    http://blogs.sun.com/sanjeevb/date/200506

    - I just got this stack trace from a 10 second sample of the most actively
    growing python mailman process - the output is explained by Sanjeev on his
    blog, but I'm hoping the stack trace will point the analysis towards a cause
    for why my mailman processes are growing abnormally

    I will see if the findleaks.pl analysis of this output returns anything

  • Tim Bell at Jul 2, 2008 at 6:37 am

    Back at the beginning of this thread, Fletcher Cocquyt wrote:

    Config:
    Solaris 10 x86
    Python 2.5.2
    Mailman 2.1.9 (8 Incoming queue runners - the leak rate increases with this)
    SpamAssassin 3.2.5

    At this point I am looking for ways to isolate the suspected memory leak - I
    am looking at using dtrace: http://blogs.sun.com/sanjeevb/date/200506

    Any other tips appreciated!
    With Solaris 10, you can interpose the libumem library when starting those python processes.
    This gives you different malloc()/free() allocators including extra instrumentation that is
    low enough in overhead to run in a production environment, and (when combined with mdb) a
    powerful set of debugging tools.

    Set LD_PRELOAD, UMEM_DEBUG, UMEM_LOGGING environment variables in the parent process before
    starting python so they will inherit the settings. If you have to, you could replace 'python'
    with a script that sets what you want in the environment and then runs the python executable.
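
    [Editorial sketch] The wrapper-script idea Tim describes amounts to setting
    the libumem environment and re-exec'ing the real interpreter. UMEM_DEBUG=default
    and UMEM_LOGGING=transaction are the commonly documented Solaris debug settings;
    the wrapper function itself is hypothetical:

    ```python
    import os
    import sys

    # Environment a wrapper would set before exec'ing the real python so
    # that libumem is interposed with debugging enabled (Solaris 10).
    UMEM_ENV = {
        "LD_PRELOAD": "libumem.so.1",
        "UMEM_DEBUG": "default",        # audit, contents, guards
        "UMEM_LOGGING": "transaction",  # transaction log for mdb ::findleaks
    }

    def exec_with_umem(argv):
        """Re-exec this interpreter with libumem interposed; forked
        children (the qrunners) inherit the environment."""
        env = dict(os.environ, **UMEM_ENV)
        os.execvpe(sys.executable, [sys.executable] + list(argv), env)
    ```

    With the processes running this way, `mdb -p <pid>` and dcmds like
    `::findleaks` can then inspect the native allocations.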

    I know this will be looking at the lower, native layers of the problem, and you may not see the
    upper (python) part of the stack very well, but libumem has been a big help to me so I thought
    I would mention it.

    Here are two references... there are many more if you start searching:

    Identifying Memory Management Bugs Within Applications
    Using the libumem Library
    http://access1.sun.com/techarticles/libumem.html

    Solaris Modular Debugger Guide
    http://docs.sun.com/db/doc/806-6545

    Hope this helps - this is too long, so I'll stop now.

    Tim
  • Brad Knowles at Jun 20, 2008 at 4:57 pm

    Fletcher Cocquyt wrote:

    Hi, I am observing periods of qfiles/in backlogs in the 400-600 message
    count range that take 1-2hours to clear with the standard Mailman 2.1.9 +
    Spamassassin (the vette log shows these messages process in an avg of ~10
    seconds each)
    Search the FAQ for performance. The short URL for the web page is
    <http://wiki.list.org/x/AgA3>.

    --
    Brad Knowles <brad at python.org>
    Member of the Python.org Postmaster Team & Co-Moderator of the
    mailman-users and mailman-developers mailing lists
  • Fletcher Cocquyt at Jul 2, 2008 at 4:01 pm
    Last night I added
    ulimit -v 50000

    To the /etc/init.d/mailman script and restarted

    And I am seeing the processes stop growing at 49M - and so far no adverse
    effects.

    I view this as a workaround until the underlying cause is determined...
    But it also allows me to bump my incoming runners from 8 to 16 and
    manage/enforce the overall memory footprint - I like it (so far)
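
    [Editorial sketch] For reference, the same cap can be expressed with Python's
    resource module — roughly what `ulimit -v 50000` does for each forked runner.
    Note the unit difference (ulimit -v takes KB, setrlimit takes bytes); the
    function names here are illustrative, and on Solaris the underlying limit is
    RLIMIT_VMEM, which is typically aliased to RLIMIT_AS:

    ```python
    import resource

    def vmem_limit_bytes(kb=50000):
        # ulimit -v is specified in KB; setrlimit wants bytes.
        return kb * 1024

    def cap_virtual_memory(kb=50000):
        """Apply the equivalent of `ulimit -v 50000`: cap this process's
        address space at ~50 MB.  Forked children inherit the limit."""
        limit = vmem_limit_bytes(kb)
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    ```

    Putting `ulimit -v 50000` in the init script before `mailmanctl start` has
    the same effect for every runner the master process forks.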


    last pid: 8371; load averages: 0.07, 0.11, 0.16
    08:57:41
    91 processes: 90 sleeping, 1 on cpu
    CPU states: 98.2% idle, 0.5% user, 1.3% kernel, 0.0% iowait, 0.0% swap
    Memory: 1640M real, 792M free, 697M swap in use, 2602M swap free

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    7565 mailman 1 59 0 49M 46M sleep 0:07 0.01% python
    7571 mailman 1 59 0 35M 31M sleep 0:04 0.01% python
    7551 mailman 1 59 0 35M 32M sleep 0:18 0.02% python
    7554 mailman 1 59 0 31M 27M sleep 0:22 0.02% python
    7568 mailman 1 59 0 31M 28M sleep 0:03 0.01% python
    7574 mailman 1 59 0 30M 28M sleep 0:03 0.01% python
    7570 mailman 1 59 0 30M 27M sleep 0:14 0.01% python
    7560 mailman 1 59 0 30M 27M sleep 0:12 0.15% python
    7562 mailman 1 59 0 27M 25M sleep 0:03 0.02% python
    7566 mailman 1 59 0 27M 25M sleep 0:03 0.01% python
    7569 mailman 1 59 0 27M 25M sleep 0:04 0.01% python
    7561 mailman 1 59 0 27M 24M sleep 0:03 0.01% python
    7558 mailman 1 59 0 26M 24M sleep 0:19 0.02% python
    7572 mailman 1 59 0 26M 23M sleep 0:03 0.01% python
    7567 mailman 1 59 0 26M 23M sleep 0:03 0.01% python

    On 7/1/08 11:43 PM, "Fletcher Cocquyt" wrote:

    Hey thanks, I really appreciated the responsiveness and helpful analysis by
    the folks on this list

    BTW I am experimenting with ulimit -v to limit the growth of mailman procs
    (added it to the init script)
    Will see what the growth looks like overnight and report back - thanks


    On 7/1/08 10:21 PM, "Brad Knowles" wrote:
    On 7/1/08, Fletcher Cocquyt wrote:

    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine
    BTW, in case it hasn't come through yet -- I am very sensitive to
    your issues. In my "real" life, I am currently employed as a Sr.
    System Administrator at the University of Texas at Austin, with about
    ~50,000 students and ~20,000 faculty and staff, and one of my jobs is
    helping out with both the mail system administration and the mailing
    list system administration.

    So, just because I post messages quoting the current statistics we're
    seeing on python.org, that doesn't mean I'm not sensitive to the
    problems you're seeing. All I'm saying is that we're not currently
    seeing them on python.org, so it may be a bit more difficult for us
    to directly answer your questions, although we'll certainly do
    everything we can to help.
    --
    Fletcher Cocquyt
    Senior Systems Administrator
    Information Resources and Technology (IRT)
    Stanford University School of Medicine

    Email: fcocquyt at stanford.edu
    Phone: (650) 724-7485
  • Fletcher Cocquyt at Jul 2, 2008 at 5:17 pm
    Interesting, the growth per python is limited to 50M by ulimit -v 50000, but
    I'm seeing each one gradually take up that limit - then what? (stay tuned!
    - mailman fails to malloc?)

    load averages: 0.14, 0.11, 0.11
    10:14:43
    120 processes: 119 sleeping, 1 on cpu
    CPU states: 93.9% idle, 3.7% user, 2.4% kernel, 0.0% iowait, 0.0% swap
    Memory: 1640M real, 646M free, 858M swap in use, 2436M swap free

    PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
    7566 mailman 1 59 0 49M 46M sleep 0:08 0.01% python
    7565 mailman 1 59 0 49M 46M sleep 0:09 0.02% python
    7557 mailman 1 59 0 48M 46M sleep 0:08 0.01% python
    7569 mailman 1 59 0 48M 45M sleep 0:09 0.01% python
    7552 mailman 1 59 0 47M 45M sleep 0:08 0.01% python
    7561 mailman 1 59 0 47M 45M sleep 0:08 0.06% python
    7570 mailman 1 59 0 47M 44M sleep 0:20 0.01% python
    7571 mailman 1 59 0 35M 31M sleep 0:05 0.01% python
    7551 mailman 1 59 0 35M 32M sleep 0:31 0.16% python
    7554 mailman 1 59 0 31M 27M sleep 0:37 0.35% python
    7568 mailman 1 59 0 31M 28M sleep 0:05 0.01% python
    7574 mailman 1 59 0 30M 28M sleep 0:05 0.37% python
    7560 mailman 1 59 0 30M 27M sleep 0:20 0.02% python
    7562 mailman 1 59 0 27M 25M sleep 0:06 0.03% python
    7558 mailman 1 59 0 26M 24M sleep 0:33 0.35% python

    On 7/2/08 9:01 AM, "Fletcher Cocquyt" wrote:

  • Mark Sapiro at Jul 2, 2008 at 11:29 pm

    Fletcher Cocquyt wrote:
    Interesting, the growth per python is limited to 50M by ulimit -v 50000, but
    I'm seeing each one gradually take up that limit - then what ? (stay tuned!
    - mailman fails to malloc?)

    Look in Mailman's error and qrunner logs.

    --
    Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan

Discussion Overview
group: mailman-users @
categories: python
posted: Jun 20, '08 at 10:33a
active: Jul 3, '08 at 3:55a
posts: 37
users: 7
website: list.org
