I'm running mailman 2.1.8 on Solaris 10 (06/06)

After working fine for a month, my mailman install broke this morning (I
know something must have changed but I can't see what!) Mail sent to lists
ends up in /usr/local/mailman/qfiles/out . If I run unshunt, nothing
changes. If I run unshunt /usr/local/mailman/qfiles/out, the files go away
from the qfiles directory but no mail is received by the users.

I am not getting any error messages relating to this in any file that I can
see. Nothing in any file in /usr/local/mailman/logs, nothing in any system
or mail error files.

Tearing my hair out here. Thanks for any clues as to what might be causing
this or where to LOOK.

Betsy

  • Dragon at Sep 22, 2006 at 10:34 pm
    Elizabeth Schwartz sent the message below at 15:15 9/22/2006:
    I'm running mailman 2.1.8 on Solaris 10 (06/06)

    After working fine for a month, my mailman install broke this morning (I
    know something must have changed but I can't see what!) Mail sent to lists
    ends up in /usr/local/mailman/qfiles/out . If I run unshunt, nothing
    changes. If I run unshunt /usr/local/mailman/qfiles/out, the files go away
    from the qfiles directory but no mail is received by the users.

    I am not getting any error messages relating to this in any file that I can
    see. Nothing in any file in /usr/local/mailman/logs, nothing in any system
    or mail error files.

    Tearing my hair out here. Thanks for any clues as to what might be causing
    this or where to LOOK.
    ---------------- End original message. ---------------------

    You did not mention if you checked that all of your qrunners are
    running. If your Outgoing runner is not working, it would result in
    the behavior you see.

    Try the following command to see what qrunners are actually functioning:

    ps aux | grep mailman

    Have you tried a restart of mailman?
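
As a rough sketch of the check suggested above, the following Python helper scans `ps` output for qrunner processes and reports which expected runners are absent. It assumes each qrunner's command line contains `--runner=<Name>`, which is how Mailman 2.1 qrunners usually appear but should be verified on your system. The runner list and the function name `missing_runners` are illustrative.

```python
import re

# Runner names commonly started by Mailman 2.1; adjust to match your install.
EXPECTED_RUNNERS = [
    "ArchRunner", "BounceRunner", "CommandRunner", "IncomingRunner",
    "NewsRunner", "OutgoingRunner", "VirginRunner", "RetryRunner",
]

def missing_runners(ps_output, expected=EXPECTED_RUNNERS):
    """Return the expected runner names that do not appear in the ps output.

    Assumes (unverified for every platform) that each qrunner command line
    contains an argument of the form --runner=Name:slice:count.
    """
    running = set(re.findall(r"--runner=(\w+)", ps_output))
    return [name for name in expected if name not in running]
```

Feed it the text printed by your `ps` command (some platforms truncate long command lines, in which case the match may fail even though the runner is alive).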


    Dragon

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Venimus, Saltavimus, Bibimus (et naribus canium capti sumus)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  • Brad Knowles at Sep 22, 2006 at 10:43 pm

    At 6:15 PM -0400 9/22/06, Elizabeth Schwartz wrote:

    Tearing my hair out here. Thanks for any clues as to what might be causing
    this or where to LOOK.

    I'd be willing to bet that you got a message that came in which had
    some malformed MIME structures, and that has clogged up the queue for
    that list. More recent versions of Mailman are more resistant to
    this kind of problem, but not completely immune. If you're not
    already running the latest version (2.1.9), I'd suggest looking into
    making that upgrade.


    As for where to look, any time you have questions one of the best
    places to start looking is in the FAQ (see
    <http://www.python.org/cgi-bin/faqw-mm.py>) and the documentation
    (see <http://www.list.org/docs.html>). In particular, FAQ 3.14 may
    be useful to you.

    You should also check the archives of the list, and the instructions
    for doing that are in FAQ 1.18.

    --
    Brad Knowles, <brad at stop.mail-abuse.org>

    "Those who would give up essential Liberty, to purchase a little
    temporary Safety, deserve neither Liberty nor Safety."

    -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
    Assembly to the Governor, November 11, 1755

    Founding Individual Sponsor of LOPSA. See <http://www.lopsa.org/>.
  • Elizabeth Schwartz at Sep 22, 2006 at 10:55 pm
    To follow up to my own post:
    I have rebooted twice and restarted twice
    I see no interesting error messages in any logs
    A few messages occasionally are getting through
    I have read the FAQ but found nothing applicable.

    If there's a MIME message gunking up the works, where *is* it? The qfiles
    subdirectories are all empty
  • Elizabeth Schwartz at Sep 22, 2006 at 10:56 pm
    (and yep, all the qrunner processes are running and look OK, from here)
  • Brad Knowles at Sep 22, 2006 at 11:15 pm

    At 6:55 PM -0400 9/22/06, Elizabeth Schwartz wrote:

    I have read the FAQ but found nothing applicable.

    FAQ 3.14 has most of the relevant advice that I could provide. If
    you didn't find anything there that was helpful to you, then I'm not
    sure I can say much of anything more.

    If there's a MIME message gunking up the works, where *is* it? The qfiles
    subdirectories are all empty

    It might be in qfiles/shunt/, but if that was the case then it
    shouldn't be gumming up the rest of the works. If it was gumming up
    the works, it should be in qfiles/in/.

    An alternative may be that your MTA keeps trying to deliver it to
    Mailman, but is not successful. So, the message isn't recorded in
    the Mailman queues, and the MTA keeps trying to redeliver the message.


    Either way, this should show up in the Mailman logs and in the MTA logs.

    --
    Brad Knowles, <brad at stop.mail-abuse.org>

    "Those who would give up essential Liberty, to purchase a little
    temporary Safety, deserve neither Liberty nor Safety."

    -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
    Assembly to the Governor, November 11, 1755

    Founding Individual Sponsor of LOPSA. See <http://www.lopsa.org/>.
  • Mark Sapiro at Sep 23, 2006 at 12:11 am

    Elizabeth Schwartz wrote:
    To follow up to my own post:
    I have rebooted twice and restarted twice
    I see no interesting error messages in any logs
    A few messages occasionally are getting through
    I have read the FAQ but found nothing applicable.

    If there's a MIME message gunking up the works, where *is* it? The qfiles
    subdirectories are all empty

    If the qfiles/* directories are all empty, Mailman has nothing to do.
    Are there new posts being archived and not being sent to the list?

    Your previous report indicated that posts were being archived, but were
    sitting in qfiles/out and not being delivered. This is a symptom of
    OutgoingRunner not processing or some problem between OutgoingRunner
    and the outgoing MTA, but then you ran "bin/unshunt qfiles/out" which
    may have caused the loss of all the queued, outgoing messages.

    Now you may have a different symptom.

    There is probably not a message gunking up the works. If there is one,
    it is most likely in lists/<listname>/digest.mbox, but the symptom of
    that is different from what you've reported.

    --
    Mark Sapiro <msapiro at value.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Elizabeth Schwartz at Sep 23, 2006 at 2:51 am
    Sigh... OK, thank you all for the help. I think I understand now!

    My theory: my **original** problem was a large HTML message gunking up the
    out box. Having cleaned that out, it sounds like I then proceeded to trash
    an afternoon's worth of messages by trying to flush the queue incorrectly.
    Fortunately these are all low-volume lists and the messages *did* reach the
    list archives. But still, embarrassing. But it does explain why none of the
    troubleshooting steps in the FAQ revealed any problems. I'm not clear why
    new messages failed to go out right away, but possibly with all the restarts
    and such I made some temporary error with the qrunner processes or some
    other such thing. However it happened, mail is now flowing smoothly.

    Can someone point me to a detailed explanation of the path a message takes
    from the time it is first forwarded to mailman, to the time when it is sent
    out to the MTA for delivery? Some time elapses there, and I'd like to
    understand in detail what's going on.

    thank you all for all your help.
  • Mark Sapiro at Sep 23, 2006 at 4:07 am

    Elizabeth Schwartz wrote:
    My theory: my **original** problem was a large HTML message gunking up the
    out box.

    I'm not sure about this part. What's in Mailman's 'smtp' log?

    Having cleaned that out, it sounds like I then proceeded to trash
    an afternoon's worth of messages by trying to flush the queue incorrectly.

    I'm afraid so.

    Can someone point me to a detailed explanation of the path a message takes
    from the time it is first forwarded to mailman, to the time when it is sent
    out to the MTA for delivery? Some time elapses there, and I'd like to
    understand in detail what's going on.

    There shouldn't be much delay in Mailman. There may be delay in the
    incoming MTA and in the SMTP handoff to the outgoing MTA (here again,
    Mailman's 'smtp' log will show how much) and in the outgoing MTA
    itself, but processing through Mailman is fairly quick.

    Here's a rough sketch of what happens.

    The incoming MTA receives a post for list and pipes it to the wrapper
    with the command "| path/to/mail/mailman post list".

    The wrapper invokes the scripts/post script which queues the message in
    the qfiles/in queue for list.

    IncomingRunner picks up the queue entry and processes it through the
    pipeline of handler modules. The pipeline is normally the one defined
    as GLOBAL_PIPELINE in Defaults.py. The handler modules do various
    things to the message and can cause it to be rejected, discarded or
    held for approval, but assuming that it is a valid post which will be
    processed all the way through, after various other handlers are
    called, the message is passed through

      ToDigest - which appends it to the list's digest.mbox and triggers a
      digest if the size threshold is reached.
      ToArchive - which queues a copy in qfiles/archive where it is picked
      up by ArchRunner and archived
      ToUsenet - which, if mail to news gatewaying is done for the list,
      queues a copy in qfiles/news for NewsRunner
      a couple of housekeeping handlers and finally
      ToOutgoing - which queues the message in qfiles/out

    and then IncomingRunner is done with this message.

    OutgoingRunner picks up the entry from qfiles/out, and passes it to the
    delivery module, normally SMTPDirect, to deliver it to the MTA for the
    recipients (the list of which was built earlier by the CalcRecips
    handler which is part of the IncomingRunner pipeline).
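
The flow sketched above can be modelled with a toy Python program. This is not Mailman's code: the queues are plain in-memory lists standing in for the qfiles/* directories, the pipeline is abbreviated, and every function name here is hypothetical. It only illustrates how an entry moves from 'in' to 'out' while its pipeline list is consumed.

```python
from collections import defaultdict

# Toy stand-ins for the qfiles/* directories: queue name -> list of entries.
queues = defaultdict(list)

# Abbreviated pipeline; the real GLOBAL_PIPELINE in Defaults.py has more handlers.
PIPELINE = ["CalcRecips", "ToDigest", "ToArchive", "ToUsenet", "ToOutgoing"]

def post(message):
    """Model the wrapper and scripts/post: queue the message in 'in'."""
    queues["in"].append({"msg": message, "pipeline": list(PIPELINE), "recips": []})

def incoming_runner(members, gateway_to_news=False):
    """Pop each handler name off the entry's pipeline and apply it in turn."""
    while queues["in"]:
        entry = queues["in"].pop(0)
        while entry["pipeline"]:
            handler = entry["pipeline"].pop(0)
            if handler == "CalcRecips":
                entry["recips"] = list(members)
            elif handler == "ToDigest":
                queues["digest.mbox"].append(entry["msg"])
            elif handler == "ToArchive":
                queues["archive"].append(entry["msg"])
            elif handler == "ToUsenet" and gateway_to_news:
                queues["news"].append(entry["msg"])
            elif handler == "ToOutgoing":
                queues["out"].append(entry)

def outgoing_runner():
    """Model delivery: hand each 'out' entry and its recipients to the MTA."""
    delivered = []
    while queues["out"]:
        entry = queues["out"].pop(0)
        delivered.append((entry["msg"], entry["recips"]))
    return delivered
```

Calling `post(...)`, then `incoming_runner(...)`, then `outgoing_runner()` walks one message through the model. Note that an entry sitting in 'out' has an empty pipeline list.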

    --
    Mark Sapiro <msapiro at value.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Mark Sapiro at Sep 23, 2006 at 4:32 am

    Mark Sapiro wrote:
    Elizabeth Schwartz wrote:
    Having cleaned that out, it sounds like I then proceeded to trash
    an afternoon's worth of messages by trying to flush the queue incorrectly.

    I'm afraid so.

    In case anyone is interested in the details, what actually happened
    when you ran bin/unshunt against qfiles/out is unshunt processed each
    queue entry as follows:

    It looked for the "original queue". This is placed in the message
    metadata by the shunting process to tell unshunt to which queue to
    restore the entry. Since the message was never shunted, there was no
    "original queue" so unshunt put it in the default 'in' queue. There it
    was picked up by IncomingRunner. Now the issue is that the message
    metadata in the queue entry has a 'pipeline' attribute which happens
    to be the empty list because when it was previously passed to the
    ToOutgoing handler, that was the last entry that was popped off the
    list leaving the list empty. Then, IncomingRunner proceeded to pass
    the message through the empty pipeline and was immediately finished
    without doing anything.

    Thus, the messages were all moved from the 'out' queue to the 'in'
    queue where IncomingRunner effectively recognized that it had already
    completely processed the message so this time it discarded the queue
    entry without doing anything.
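
The sequence described above can be reproduced in a few lines of Python. This is a simplified model, not the actual unshunt or IncomingRunner source. The metadata key `whichq` is what I believe Mailman 2.1 uses for the "original queue", but treat that name as an assumption.

```python
def unshunt_target(msgdata, default_queue="in"):
    """Return the queue an entry is restored to.

    Uses the recorded original queue if the entry was really shunted,
    otherwise falls back to the default 'in' queue.
    """
    return msgdata.get("whichq", default_queue)

def run_pipeline(msgdata):
    """Run whatever handlers remain in the entry's pipeline; return their names."""
    ran = []
    while msgdata["pipeline"]:
        ran.append(msgdata["pipeline"].pop(0))
    return ran

# An entry taken from qfiles/out: every handler has already been popped,
# and nothing ever recorded an original queue because it was never shunted.
out_entry = {"pipeline": []}

# A genuinely shunted entry carries its original queue and remaining handlers.
shunted_entry = {"whichq": "out", "pipeline": []}
```

In this model `unshunt_target(out_entry)` is `"in"` and `run_pipeline(out_entry)` runs no handlers at all, which mirrors how the messages vanished without being delivered.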

    The moral is "don't unshunt anything which wasn't shunted to begin
    with".

    --
    Mark Sapiro <msapiro at value.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Patrick Bogen at Sep 25, 2006 at 2:55 pm

    On 9/22/06, Mark Sapiro wrote:
    The moral is "don't unshunt anything which wasn't shunted to begin
    with".

    Might it be worthwhile to add a cautionary note to unshunt's help
    files, to the effect that it should ONLY be used on qfiles/shunt, and
    that its use on other queues will probably result in lost messages?

    Alternatively, maybe have it detect things that weren't actually
    shunted (i.e., messages that don't have the original queue attribute
    you mentioned), and refuse to operate on these without being, say,
    --forced ?
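
A guard along the lines suggested here might look like the sketch below. It is purely hypothetical: it is not a patch to bin/unshunt, the `force` parameter does not correspond to any existing option, and the `whichq` key name is an assumption about how the original queue is recorded.

```python
class NotShuntedError(Exception):
    """Raised when an entry carries no record of having been shunted."""

def choose_target_queue(msgdata, force=False, default_queue="in"):
    """Pick the queue to restore an entry to, refusing unshunted entries.

    Entries with a recorded original queue go back there. Entries without
    one are rejected unless the caller explicitly forces the default queue.
    """
    if "whichq" in msgdata:
        return msgdata["whichq"]
    if force:
        return default_queue
    raise NotShuntedError("no original queue recorded; refusing without force")
```

Raising by default keeps a mistaken run against qfiles/out from silently moving entries, while still leaving an explicit escape hatch.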

    --
    - Patrick Bogen
  • Mark Sapiro at Sep 25, 2006 at 10:01 pm

    Patrick Bogen wrote:
    Might it be worthwhile to add a cautionary note to unshunt's help
    files, to the effect that it should ONLY be used on qfiles/shunt, and
    that its use on other queues will probably result in lost messages?

    Alternatively, maybe have it detect things that weren't actually
    shunted (i.e., messages that don't have the original queue attribute
    you mentioned), and refuse to operate on these without being, say,
    --forced ?

    These are both excellent suggestions.

    --
    Mark Sapiro <msapiro at value.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Elizabeth Schwartz at Sep 26, 2006 at 12:51 pm
    Thanks again for all your help. I checked in last night and mailman was hung
    again, but this time I saw that the OutgoingRunner process was missing, and
    there are errors in the error log:

    Sep 23 08:10:17 2006 (2180) Master qrunner detected subprocess exit
    (pid: 1592, sig: None, sts: 1, class: OutgoingRunner, slice: 1/1) [restarting]
    Sep 23 08:10:18 2006 (1602) OutgoingRunner qrunner started.
    Sep 23 08:11:34 2006 (2435) Master qrunner detected subprocess exit
    (pid: 1598, sig: None, sts: 1, class: OutgoingRunner, slice: 1/1) [restarting]
    Sep 23 08:11:34 2006 (2435) Qrunner OutgoingRunner reached maximum restart limit of 10, not restarting.

    Will add a check to kick mailman if OutgoingRunner is not running, although
    I'd like to understand why OutgoingRunner is dying.
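
One possible shape for such a check, as a sketch only: scan the qrunner log text for the "reached maximum restart limit" message quoted above and report which runners the master has given up on. The function name is made up, and what to do on detection (alerting, restarting Mailman) is left to the caller.

```python
import re

# Matches the message quoted in the log excerpt; \s+ tolerates line wrapping.
RESTART_LIMIT = re.compile(
    r"Qrunner (\w+) reached maximum restart\s+limit\s+of (\d+), not restarting"
)

def runners_given_up(log_text):
    """Return {runner name: restart limit} for runners the master stopped restarting."""
    return {m.group(1): int(m.group(2)) for m in RESTART_LIMIT.finditer(log_text)}
```

A cron job could read the tail of the qrunner log, call `runners_given_up`, and raise an alert when the result is non-empty. This detects the symptom only; it does not explain why the runner exited.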

    Another question: is there any parallelism of processing files in the out
    queue or are they done sequentially? We have one very big list with 751
    members that takes quite a while to get through any one message, and it
    seems like when I had a couple of messages in qfiles/out for this list, that
    messages for my little test list weren't going through.
  • Brad Knowles at Sep 26, 2006 at 6:12 pm

    At 8:51 AM -0400 9/26/06, Elizabeth Schwartz wrote:

    Another question: is there any parallelism of processing files in the out
    queue or are they done sequentially?

    By default, it's single-threaded -- but not really "sequential". It
    has more to do with how the directory entries are written, and not
    what you or I would think of as "sequential".

    There are ways to get more than one Outgoing queue runner working,
    but that's a very non-standard configuration, and takes more work to
    maintain. I would like to see Mailman move towards a hashed queue
    mechanism (like postfix), so that it would be a lot easier to have
    multiple Outgoing queue runners working at the same time.

    --
    Brad Knowles, <brad at stop.mail-abuse.org>

    "Those who would give up essential Liberty, to purchase a little
    temporary Safety, deserve neither Liberty nor Safety."

    -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
    Assembly to the Governor, November 11, 1755

    Founding Individual Sponsor of LOPSA. See <http://www.lopsa.org/>.
  • Barry Warsaw at Sep 26, 2006 at 6:45 pm

    On Sep 26, 2006, at 2:12 PM, Brad Knowles wrote:
    At 8:51 AM -0400 9/26/06, Elizabeth Schwartz wrote:

    Another question: is there any parallelism of processing files
    in the out
    queue or are they done sequentially?

    By default, it's single-threaded -- but not really "sequential". It
    has more to do with how the directory entries are written, and not
    what you or I would think of as "sequential".

    There are ways to get more than one Outgoing queue runner working,
    but that's a very non-standard configuration, and takes more work to
    maintain. I would like to see Mailman move towards a hashed queue
    mechanism (like postfix), so that it would be a lot easier to have
    multiple Outgoing queue runners working at the same time.

    Actually, Mailman does implement a hashed queue of sorts for its
    queue runners. Every queue file is assigned a hash and a timestamp,
    encoded in the file name. The timestamp is so that qrunners can
    handle the files in FIFO order. The hash creates a "hash space" for
    when multiple qrunners are used per queue. In that case, each
    qrunner is responsible for a slice of the hash space, so that they
    can run concurrently without having to deal with expensive and tricky
    locks.

    Multiple qrunners are a supported configuration, although I believe
    their use is rare. In fact, qrunners can contend for list locks
    which can reduce their concurrency, but the OutgoingRunner is one
    place where write access to the list data isn't necessary and care
    was taken so that multiple OutgoingRunners could maximize their
    concurrency.

    (fyi: there is an end-case bug in the hash space algorithm in Mailman
    pre-2.1.9. I don't think anyone's ever hit it, but Mark found it
    through visual inspection of the code. Fixed in 2.1.9.)
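
To illustrate the idea of a timestamp plus a hash in the file name, with each qrunner owning a slice of the hash space, here is a small Python sketch. It is not Mailman's Switchboard code: the exact file-name format, the use of SHA-1, and the slice arithmetic are assumptions chosen to show the technique.

```python
import hashlib
import time

HASH_SPACE = 1 << 160  # number of possible SHA-1 digests

def queue_filename(payload, now=None):
    """Build a name of the form '<timestamp>+<hex digest>' for a queue entry."""
    now = time.time() if now is None else now
    stamp = "%.6f" % now
    digest = hashlib.sha1(payload + stamp.encode("ascii")).hexdigest()
    return stamp + "+" + digest

def slice_for(filename, numslices):
    """Map the digest part of a file name onto a slice index in [0, numslices)."""
    digest = filename.split("+", 1)[1]
    return int(digest, 16) * numslices // HASH_SPACE

def files_for_runner(filenames, slice_index, numslices):
    """Return the files owned by one runner, oldest first (FIFO by timestamp)."""
    mine = [f for f in filenames if slice_for(f, numslices) == slice_index]
    return sorted(mine, key=lambda f: float(f.split("+", 1)[0]))
```

Because every file name maps to exactly one slice, runners configured with different slice indices never pick up the same entry, so no lock on individual queue files is needed in this model.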

    Cheers,
    - -Barry
  • Brad Knowles at Sep 27, 2006 at 1:07 am

    At 2:45 PM -0400 9/26/06, Barry Warsaw wrote:

    Actually, Mailman does implement a hashed queue of sorts for its
    queue runners. Every queue file is assigned a hash and a timestamp,
    encoded in the file name. The timestamp is so that qrunners can
    handle the files in FIFO order. The hash creates a "hash space" for
    when multiple qrunners are used per queue. In that case, each
    qrunner is responsible for a slice of the hash space, so that they
    can run concurrently without having to deal with expensive and
    tricky locks.

    But you're still using a single directory as an on-disk queue, and
    that single directory has to be completely locked, operated on, and
    then unlocked every single time you want to create a new file, delete
    an old file, or rename a file.

    These synchronous meta-data operations are what *kill* the
    performance of programs like postfix and sendmail in large sites,
    where the cost differential can be thousands, tens of thousands,
    hundreds of thousands, or even millions of times when compared to a
    true hashed directory scheme.


    Do an "ls" on a directory with millions of files on an SGI box
    running Irix and XFS (which incorporates its own hashed directory
    scheme internally). Then do the same command on virtually any other
    box on virtually any other filesystem. Compare the difference in
    performance.


    There's a reason why postfix ships out-of-the-box with directory
    hashing turned on.

    Even just two levels of hashed directories with hexadecimal
    subdirectory names will mean that you could have over a hundred queue
    runners all going at once, with very little likelihood of them
    stepping on each other's toes. If you locked down each queue runner
    to its own subdirectory, you could have 256 of them. Using two
    characters of base-32 hashing at each level, you could get 1024 queue
    runners with just one level of hashing.
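
The directory-hashing arithmetic above can be sketched in Python as follows. This is an illustration of the general technique, not something Mailman or postfix implements in this form. The function names, the choice of SHA-1, and the layout are all assumptions.

```python
import hashlib
import os

def hashed_path(queue_dir, filename, levels=2, chars_per_level=1):
    """Place a queue file under nested subdirectories named after hex digits
    of a hash of its name, e.g. queue_dir/a/3/filename for two levels."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [
        digest[i * chars_per_level:(i + 1) * chars_per_level]
        for i in range(levels)
    ]
    return os.path.join(queue_dir, *(parts + [filename]))

def bucket_count(alphabet_size, chars_per_level, levels):
    """Number of leaf directories for a given hashing scheme."""
    return alphabet_size ** (chars_per_level * levels)
```

With these definitions, `bucket_count(16, 1, 2)` gives the 256 leaf directories for two hexadecimal levels, and `bucket_count(32, 2, 1)` gives the 1024 for two base-32 characters at one level.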

    I've done lots of MTA tuning in my time, and directory hashing has to
    be the single biggest performance win that I have ever encountered.

    Multiple qrunners are a supported configuration, although I
    believe their use is rare. In fact, qrunners can contend
    for list locks which can reduce their concurrency, but the
    OutgoingRunner is one place where write access to the list
    data isn't necessary and care was taken so that multiple
    OutgoingRunners could maximize their concurrency.

    It's good that we allow an application level of concurrency within
    the Outgoing queue runners, and that we can avoid file locking. But
    we're not getting any real concurrency within the filesystem until we
    can physically break the queue down into multiple chunks and operate
    on each chunk in a manner that is completely and totally independent
    of all the other chunks.

    --
    Brad Knowles, <brad at stop.mail-abuse.org>

    "Those who would give up essential Liberty, to purchase a little
    temporary Safety, deserve neither Liberty nor Safety."

    -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
    Assembly to the Governor, November 11, 1755

    Founding Individual Sponsor of LOPSA. See <http://www.lopsa.org/>.
  • Barry Warsaw at Sep 27, 2006 at 3:56 pm

    On Sep 26, 2006, at 9:07 PM, Brad Knowles wrote:

    But you're still using a single directory as an on-disk queue, and
    that single directory has to be completely locked, operated on, and
    then unlocked every single time you want to create a new file,
    delete an old file, or rename a file.

    You've made this point before and each time you do, I remember that
    it's a good one. :) Brad, would you mind adding this to the Mailman
    2.2 wiki page? I think it's a worthy feature to add.

    - -Barry
  • Brad Knowles at Sep 27, 2006 at 5:35 pm

    At 11:56 AM -0400 9/27/06, Barry Warsaw wrote:

    You've made this point before and each time you do, I remember
    that it's a good one. :) Brad, would you mind adding this to
    the Mailman 2.2 wiki page?

    Will do.

    I think it's a worthy feature to add.

    Thanks!

    --
    Brad Knowles, <brad at stop.mail-abuse.org>

    "Those who would give up essential Liberty, to purchase a little
    temporary Safety, deserve neither Liberty nor Safety."

    -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
    Assembly to the Governor, November 11, 1755

    Founding Individual Sponsor of LOPSA. See <http://www.lopsa.org/>.
  • Mark Sapiro at Sep 27, 2006 at 3:12 pm

    Elizabeth Schwartz wrote:
    I checked in last night and mailman was hung
    again, but this time I saw that the OutgoingRunner process was missing, and
    there are errors in the error log:

    The 'error' log or the 'qrunner' log?

    Sep 23 08:10:17 2006 (2180) Master qrunner detected subprocess exit
    (pid: 1592, sig: None, sts: 1, class: OutgoingRunner, slice: 1/1) [restarting]
    Sep 23 08:10:18 2006 (1602) OutgoingRunner qrunner started.
    Sep 23 08:11:34 2006 (2435) Master qrunner detected subprocess exit
    (pid: 1598, sig: None, sts: 1, class: OutgoingRunner, slice: 1/1) [restarting]
    Sep 23 08:11:34 2006 (2435) Qrunner OutgoingRunner reached maximum restart limit of 10, not restarting.

    Are there any other messages (in Mailman's error log or elsewhere) from
    these times indicating why OutgoingRunner exited with status 1?

    Will add a check to kick mailman if OutgoingRunner is not running, although
    I'd like to understand why OutgoingRunner is dying.

    So would I. Please check other logs - system logs too.

    --
    Mark Sapiro <msapiro at value.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan
  • Mark Sapiro at Sep 22, 2006 at 11:57 pm

    Elizabeth Schwartz wrote:
    After working fine for a month, my mailman install broke this morning (I
    know something must have changed but I can't see what!) Mail sent to lists
    ends up in /usr/local/mailman/qfiles/out . If I run unshunt, nothing
    changes. If I run unshunt /usr/local/mailman/qfiles/out, the files go away
    from the qfiles directory but no mail is received by the users.


    unshunt should only be run to move files from qfiles/shunt back to
    their original queue after fixing the problem that caused them to be
    shunted in the first place. The optional directory argument to unshunt
    is intended for cases where you've moved files from qfiles/shunt to
    some other (non-queue) directory. Running unshunt on a qfiles/
    directory other than qfiles/shunt at best will do nothing and at worst
    will result in lost messages (as it apparently did here).

    I am not getting any error messages relating to this in any file that I can
    see. Nothing in any file in /usr/local/mailman/logs, nothing in any system
    or mail error files.

    Are there entries in the files in /usr/local/mailman/logs? If so, what
    is in 'error', 'qrunner', 'smtp' and 'smtp-failure' whether you think
    it's relevant or not. If not, then your Mailman logs are somewhere
    else. Look at the LOG_DIR setting in Defaults.py/mm_cfg.py.
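
For gathering the requested log contents in one go, a small Python helper such as the following could be used. It is only a sketch: the function name is hypothetical, and the directory passed in should be whatever LOG_DIR resolves to on your system (the thread suggests /usr/local/mailman/logs, but that is exactly what needs confirming).

```python
import os

# The four logs asked about in this message.
LOG_NAMES = ["error", "qrunner", "smtp", "smtp-failure"]

def tail_logs(log_dir, names=LOG_NAMES, lines=20):
    """Return {log name: last `lines` lines, or None if the file is missing}."""
    report = {}
    for name in names:
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            report[name] = None  # a missing log hints that LOG_DIR points elsewhere
            continue
        with open(path, errors="replace") as fh:
            report[name] = fh.read().splitlines()[-lines:]
    return report
```

If every entry comes back as `None`, the logs are probably being written to a different directory than the one you checked.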


    --
    Mark Sapiro <msapiro at value.net> The highway is for gamblers,
    San Francisco Bay Area, California better use your sense - B. Dylan

Discussion Overview
group: mailman-users @
categories: python
posted: Sep 22, '06 at 10:15p
active: Sep 27, '06 at 5:35p
posts: 20
users: 6
website: list.org