Greetings,

I have looked through the archives for something similar to my
issue, and I noticed that by searching on "disk full", I get similar
reports beginning in roughly July of 08.

As with these other reports, I have noticed *tremendous*
disappearing space. When I tried to find the actual files, I was
unsuccessful. Interestingly, if I stop mailman and then restart it, the
"missing" space miraculously reappears!

So, now that the background is over with, here's where I find
myself (besides just looking stupid):

(1) Yesterday I enabled VERP, and it appeared to be working well. At the
time I turned on VERP, I had around 5gb of free space (which, before VERP,
would take about two weeks to "disappear").
(2) Around 2pm today, the disk was full, and mailman died.
(3) My inkling of something being wrong was this on the web interface:

"Bug in Mailman version 2.1.11rc2

We're sorry, we hit a bug!

Please inform the webmaster for this site of this problem. Printing of
traceback and other system information has been explicitly inhibited,
but the webmaster can find this information in the Mailman error logs. "

(4) Upon looking at the system in response to the above missive, I checked
and saw the system was out of space again. I did what I always do - shut
down mailman (which usually drops ~5gb of "missing" space), and then
restart it. Everything before today has come up roses doing this.

(5) There is nothing in any of the logs that indicate why this message is
continuing to poke fun at me.

(6) I have looked through the various manuals, pdfs, etc, and cannot find
anything about explicitly enabling logging so that I can get a better
handle on this.


Oddly, as a mailman user since around 2001, this is the first real problem
I've had: Great platform!!! Mediocre admin though, and one begging for
help as well.

All the best,

//Alif

--
Yours,
J.A. Terranson
sysadmin_at_mfn.org
0xpgp_key_mgmt_is_broken-dont_bother


  • Mark Sapiro at Nov 18, 2008 at 10:53 pm

    J.A. Terranson wrote:
    I have looked through the archives for something similar to my
    issue, and I noticed that by searching on "disk full", I get similar
    reports beginning in roughly July of 08.

    As with these other reports, I have noticed *tremendous*
    disappearing space. When I tried to find the actual files, I was
    unsuccessful. Interestingly, if I stop mailman and then restart it, the
    "missing" space miraculously reappears!

    Is this Solaris? If so, see the thread beginning at
    <http://mail.python.org/pipermail/mailman-users/2008-July/062359.html>
    which is about an alleged memory leak.

    If you're running out of disk, and restarting the processes solves it,
    it may be swap space that's eating up the disk.

    So, now that the background is over with, here's where I find
    myself (besides just looking stupid):

    (1) Yesterday I enabled VERP, and it appeared to be working well, At the
    time I turned on VERP, I had around 5gb of free space (which would take
    about two weeks to "disappear" before VERP).

    5gb of free disk space doesn't seem like a lot these days.

    (2) Around 2pm today, the disk was full, and mailman died.

    Enabling VERP might cause the MTA to use a lot more queue space, but I
    don't see that it would affect Mailman much.

    (3) My inkling of something being wrong was this on the web interface:

    "Bug in Mailman version 2.1.11rc2

    We're sorry, we hit a bug!

    Please inform the webmaster for this site of this problem. Printing of
    traceback and other system information has been explicitly inhibited,
    but the webmaster can find this information in the Mailman error logs. "

    (4) Upon looking at the system in response to the above missive, I checked
    and saw the system ws out of space again. I did what I always do - shut
    down mailman (which usually drops ~5gb of "missing" space, and then
    restart it. Everything before today has come up roses doing this.

    So are you saying that this time you didn't recover any disk space, or
    just that the web error didn't go away? If the latter, it seems likely
    that the disk space error caused a config.pck file to be corrupted and
    that is the cause of the recurrent "bug". What is the traceback from
    the most recent of these from the error log?

    (5) There is nothing in any of the logs that indicate why this message is
    continuing to poke fun at me.

    There almost certainly is something in Mailman's error log unless the
    logfile just can't be grown to accommodate the message.

    --
    Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
    San Francisco Bay Area, California    better use your sense - B. Dylan
  • J.A. Terranson at Nov 19, 2008 at 11:44 pm

    On Tue, 18 Nov 2008, Mark Sapiro wrote:

    J.A. Terranson wrote:
    I have looked through the archives for something similar to my
    issue, and I noticed that by searching on "disk full", I get similar
    reports beginning in roughly July of 08.

    As with these other reports, I have noticed *tremendous*
    disappearing space. When I tried to find the actual files, I was
    unsuccessful. Interestingly, if I stop mailman and then restart it, the
    "missing" space miraculously reappears!

    Is this Solaris? If so, see the thread beginning at
    <http://mail.python.org/pipermail/mailman-users/2008-July/062359.html>
    which is about an alleged memory leak.
    FreeBSD 6.3, and the issue in the above thread doesn't look like the same
    thing.
    If you're running out of disk, and restarting the processes solves it,
    it may be swap space that's eating up the disk.
    I considered this, however swap is simply not being used at any time. I
    put up a cron job to monitor, and swap use is literally zero all the way
    up to the crash for lack of space. Honestly, it *feels* like some huge
    log file somewhere, but I can find no viable explanation for it. This is
    a single partitioned box, with a single dedicated user (running 4 mailman
    lists and a few websites). Using du I see space being eaten, but no
    indication as to where. I see /usr and /var "growing", but looking into
    them shows no file(s) that could account for the amount of missing space.

    So, now that the background is over with, here's where I find
    myself (besides just looking stupid):

    (1) Yesterday I enabled VERP, and it appeared to be working well, At the
    time I turned on VERP, I had around 5gb of free space (which would take
    about two weeks to "disappear" before VERP).

    5gb of free disk space doesn't seem like a lot these days.
    Agreed. But then, they aren't doing much either.

    (2) Around 2pm today, the disk was full, and mailman died.

    Enabling VERP might cause the MTA to use a lot more queue space, but I
    don't see that it would affect Mailman much.
    The only difference was in the rate at which the "loss" accrued. Roughly
    a 6x increase.

    (3) My inkling of something being wrong was this on the web interface:

    "Bug in Mailman version 2.1.11rc2

    We're sorry, we hit a bug!

    Please inform the webmaster for this site of this problem. Printing of
    traceback and other system information has been explicitly inhibited,
    but the webmaster can find this information in the Mailman error logs. "

    (4) Upon looking at the system in response to the above missive, I checked
    and saw the system ws out of space again. I did what I always do - shut
    down mailman (which usually drops ~5gb of "missing" space, and then
    restart it. Everything before today has come up roses doing this.

    So are you saying that this time you didn't recover any disk space or
    just that the web error didn't go away. If the latter, it seems likely
    that the disk space error caused a config.pck file to be corrupted and
    that is the cause of the recurrent "bug". What is the traceback from
    the most recent of these from the error log?
    I apologize for the lack of clarity. I'm saying that the space did come
    back, as always, but this time was unique in throwing up this web message.
    All of the mailman core functionality appeared to be running normally
    (lots of traffic back and forth), but the web UI was dead.

    (5) There is nothing in any of the logs that indicate why this message is
    continuing to poke fun at me.
    There almost certainly is something in Mailman's error log unless the
    logfile just can't be grown to accommodate the message.
    No, there really isn't. I have combed through all of them (bounce, error,
    mischief, post, qrunner, smtp & failure, subscribe and vette. Did I miss
    anything?), with no sign of anything being wrong.

    I have several other mailman systems, and I have always seen a traceback
    or slew of messages when something went south, but nothing here. Also, of
    note, this is the only mailman with the disappearing disk issue. My other
    boxen are all running *really* old versions, and this new customer build
    is doubling as my canary: so far, I see BIG improvements in throughput,
    but this disk thing has me crazy. If I stop mailman when the drive hits
    99%, I instantly get my 5gb back. It feels like I'm writing a file that I
    cannot see, but I don't think this is physically possible (anyone know
    otherwise?).

    I spent a few hours mucking around with the pickles trying to figure what
    broke, and finally gave up due to screaming users: I rebuilt. The new
    build acts *just* like the last one (the reason for the delay in answering
    your kind reply was to see if the rebuild would get rid of this). I've
    lost about a gig over 24 hours, and I have NO idea where it's going. I
    stopped the job while writing this paragraph just to double check, and
    yes, I get it all back when the job is terminated. Very odd indeed.

    I'm not comfy with debuggers, so I'm at the mercy of others.

    Have I missed any log files? Is there somewhere specific I should be
    looking? Is there some way to (easily) increase logging details to try
    and track this down?

    The answers to this and other important questions await. On the next
    episode of MailSoap. <cue jingle>

    Seriously though, I appreciate your response, and the time spent on this.

    All the best,

    //Alif

    --
    Yours,
    J.A. Terranson
    sysadmin_at_mfn.org
    0xpgp_key_mgmt_is_broken-dont_bother
  • Gary Algier at Nov 20, 2008 at 2:58 am

    J.A. Terranson wrote:
    On Tue, 18 Nov 2008, Mark Sapiro wrote:

    J.A. Terranson wrote:
    I have looked through the archives for something similar to my
    issue, and I noticed that by searching on "disk full", I get similar
    reports beginning in roughly July of 08.

    As with these other reports, I have noticed *tremendous*
    disappearing space. When I tried to find the actual files, I was
    unsuccessful. Interestingly, if I stop mailman and then restart it, the
    "missing" space miraculously reappears!
    Is this Solaris? If so, see the thread beginning at
    <http://mail.python.org/pipermail/mailman-users/2008-July/062359.html>
    which is about an alleged memory leak.
    FreeBSD 6.3, and the issue in the above thread doesn't look like the same
    thing.
    If you're running out of disk, and restarting the processes solves it,
    it may be swap space that's eating up the disk.
    I considered this, however swap is simply not being used at any time. I
    put up a cron job to monitor, and swap use is literally zero all the way
    up to the crash for lack of space. Honestly, it *feels* like some huge
    log file somewhere, but I can find no viable explanation for it. This is
    a single partitioned box, with a single dedicated user (running 4 mailman
    lists and a few websites). Using du I see space being eaten, but no
    indication as to where. I see /usr and /var "growing", but looking into
    them shows no file(s) that could account for the amount of missing space.

    So, now that the background is over with, here's where I find
    myself (besides just looking stupid):

    (1) Yesterday I enabled VERP, and it appeared to be working well, At the
    time I turned on VERP, I had around 5gb of free space (which would take
    about two weeks to "disappear" before VERP).
    5gb of free disk space doesn't seem like a lot these days.
    Agreed. But then, they arent doing much either.

    (2) Around 2pm today, the disk was full, and mailman died.
    Enabling VERP might cause the MTA to use a lot more queue space, but I
    don't see that it would affect Mailman much.
    The only difference was in the rate at which the "loss" accrued. Roughly
    a 6x increase.

    (3) My inkling of something being wrong was this on the web interface:

    "Bug in Mailman version 2.1.11rc2

    We're sorry, we hit a bug!

    Please inform the webmaster for this site of this problem. Printing of
    traceback and other system information has been explicitly inhibited,
    but the webmaster can find this information in the Mailman error logs. "

    (4) Upon looking at the system in response to the above missive, I checked
    and saw the system ws out of space again. I did what I always do - shut
    down mailman (which usually drops ~5gb of "missing" space, and then
    restart it. Everything before today has come up roses doing this.
    So are you saying that this time you didn't recover any disk space or
    just that the web error didn't go away. If the latter, it seems likely
    that the disk space error caused a config.pck file to be corrupted and
    that is the cause of the recurrent "bug". What is the traceback from
    the most recent of these from the error log?
    I apologize for the lack of clarity. Im saying that the space did come
    back, as always, but this time was unique in throwing up this web message.
    All of the mailman core functionality appeared to be running normally
    (lots of traffic back and forth), but the web UI was dead.

    (5) There is nothing in any of the logs that indicate why this message is
    continuing to poke fun at me.
    There almost certainly is something in Mailman's error log unless the
    logfile just can't be grown to accommodate the message.
    No, there really isn't. I have combed through all of them (bounce, error,
    mischeif, post qrunner, smtp & failure, subscribe and vette. Did I miss
    anything?), with no sign of anything being wrong.

    I have several other mailman systems, and I have always seen a traceback
    or slew of messages when something went south, but nothing here. Also, of
    note, this is the only mailman with the disappearing disk issue. My other
    boxen are all running *really* old versions, and this new customer build
    is doubling as my canary: so far, I see BIG improvements in throughput,
    but this disk thing has me crazy. If I stop mailman when the drive hits
    99%, I instantly get my 5gb back. It feels like Im writing a file that I
    cannot see, but I dont think this is physically possible (anyone know
    otherwise?).
    Yes, this is very possible:
    1. open a file.
    2. write data to it.
    3. delete it
    If the file is not closed, the space will still be in use, but there
    won't be any entry in the parent directory for it. You can test for
    this by cd-ing to the base of the file system which is running out of
    space. Run "du -dks .", then "df -k .". The two usage numbers should
    be the same, within a few k. If different, then the used space is not
    reflected in any directory.
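
    The du-versus-df comparison described above can also be sketched in
    Python. This is only an illustration under stated assumptions: the
    function names are made up for this example, the walk needs enough
    privileges to read every directory, and the comparison is only
    meaningful when the starting directory is the mount point of the file
    system in question. Note also that du flags differ between systems: on
    FreeBSD, "-d" takes a depth argument and "-x" is the flag that keeps du
    on one file system, so the command quoted above may need adjusting.

```python
import os


def _lstat_or_none(path):
    """Return os.lstat(path), or None if the entry vanished or is unreadable."""
    try:
        return os.lstat(path)
    except OSError:
        return None


def usage_from_walk(root):
    """Bytes allocated to everything reachable by name under root (roughly du)."""
    root_dev = os.lstat(root).st_dev
    seen = set()   # (device, inode) pairs, so hard-linked files count once
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that live on other mounted file systems.
        kept = []
        for name in dirnames:
            st = _lstat_or_none(os.path.join(dirpath, name))
            if st is not None and st.st_dev == root_dev:
                kept.append(name)
        dirnames[:] = kept
        # Count the directory itself plus every non-directory entry in it.
        paths = [dirpath] + [os.path.join(dirpath, n) for n in filenames]
        for path in paths:
            st = _lstat_or_none(path)
            if st is None or st.st_dev != root_dev:
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_blocks * 512   # st_blocks is in 512-byte units
    return total


def usage_from_filesystem(root):
    """Bytes the file system itself reports as in use (roughly df)."""
    vfs = os.statvfs(root)
    return (vfs.f_blocks - vfs.f_bfree) * vfs.f_frsize


def report(root="."):
    """Print both figures and the gap between them, in KiB."""
    walked = usage_from_walk(root)
    reported = usage_from_filesystem(root)
    print("reachable through directories: %d KiB" % (walked // 1024))
    print("in use per the file system:    %d KiB" % (reported // 1024))
    print("unaccounted for:               %d KiB" % ((reported - walked) // 1024))
```

    Calling report("/") as root on a single-partition box would print the
    two totals side by side. A small gap is normal (file system metadata is
    not visible to a directory walk); a gap of several gigabytes would
    point toward space held by files that no longer have a name.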

    If this is the case, you may be able to find out which process has the
    open, unlinked file using "lsof". Run it as "lsof -s -p PID" once for
    each Mailman process. The offender should report an open file whose
    name lsof either can't resolve or shows as a path that no longer
    exists. The flag "-s" tells it to report the size. This may help
    identify a large file. The ability of lsof to report the name of
    open files may vary by OS, however.

    Rereading the man page for lsof, I just noticed the "+L" option.
    Using "+aL1" (that is plus aye ell one) causes it to select unlinked
    open files. Perhaps this will help.

    I hope this will help ID which process, at least. Perhaps that will
    give clues.
    I spent a few hours mucking around with the pickles trying to figure what
    broke, and finally gave up due to screaming users: I rebuilt. The new
    build acts *just* like the last one (the reason for the delay in answering
    your kind reply was to see if the rebuild would get rid of this). Ive
    lost about a gig over 24 hours, and I have NO idea where its going. I
    stopped the job while writing this paragraph just to double check, and
    yes, I get it all back when the job is terminated. Very odd indeed.

    Im not comfy with debuggers, so Im at the mercy of others.

    Have I missed any log files? Is there somewhere specific I should be
    looking? Is there some way to (easily) increase logging details to try
    and track this down?

    The answers to this and other important questions await. On the next
    episode of MailSoap. <cue jingle>

    Seriously though, I appreciate your response, and the time spent on this.

    All the best,

    //Alif

    --
    Gary Algier, WB2FWZ gaa at ulticom.com +1 856 787 2758
    Ulticom Inc., 1020 Briggs Rd, Mt. Laurel, NJ 08054 Fax:+1 856 866 2033

    Nielsen's First Law of Computer Manuals:
    People don't read documentation voluntarily.
  • J.A. Terranson at Nov 20, 2008 at 5:07 am
    A quick note to all three of you who responded with the lsof suggestion.

    Thank you! The first time I saw the lsof suggestion I wanted to kick
    myself in the back of the head! It's been *so* long since I've had any
    need, that I had simply forgotten all about it. It appears that it is no
    longer part of the FreeBSD standard distribution (I had to go fetch it,
    although the last time I remember using lsof - ~2000ish or so - I
    am pretty sure it was already present and ready for service)...

    I will be mucking with it tonight and I'll let everyone know what I find.

    One last note, to everyone: I always cringe when I have to break down and
    send out a "help" to a support list such as this, as so many of them will
    land you more "RTFM and come back when you find it!" than civil
    replies. This has been an inspiring experience - there is yet hope for
    the Intarwebs :-)

    Thanks again, and I'll be back in a few days (or less).

    //Alif

    --
    Yours,
    J.A. Terranson
    sysadmin_at_mfn.org
    0xpgp_key_mgmt_is_broken-dont_bother
  • Stephen J. Turnbull at Nov 20, 2008 at 3:26 am
    J.A. Terranson writes:
    It feels like Im writing a file that I cannot see, but I dont think
    this is physically possible (anyone know otherwise?).
    Oh, indeed it is possible, and happens with log files all the time.
    All you need to do is start a process that doesn't close its logfile
    until it exits, then rm the logfile. A variant of this technique is
    also used to create "secure" scratch files (what other programs can't
    see, they can't touch).

    In Unix file system semantics, rm simply removes the entry from the
    directory, making the file inaccessible by name, but the inode holding
    all the space allocation details still exists, and the process with the
    open file descriptor can continue writing to it. However, when the
    process exits, the file descriptor is closed, the inode and the space
    become garbage, and they get freed.
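
    The semantics described above can be reproduced in a few lines of
    Python. This is a minimal sketch, assuming a local Unix file system
    for the temporary directory; the function name and the returned
    dictionary keys are made up for the illustration and have nothing to
    do with Mailman itself.

```python
import os
import tempfile


def unlinked_file_demo(payload=b"x" * 65536):
    """Write to a file, unlink it, keep writing, and report what fstat sees."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                           # the directory entry is gone ...
        name_still_exists = os.path.exists(path)
        os.write(fd, payload)                     # ... but the descriptor still works
        st = os.fstat(fd)                         # the inode is still allocated
        return {
            "name_still_exists": name_still_exists,
            "link_count": st.st_nlink,
            "size": st.st_size,
        }
    finally:
        os.close(fd)                              # only now can the space be freed
```

    On such a system the result should show that the name is gone and the
    link count has dropped to zero, while the size keeps growing with every
    write. That is the situation in which du sees nothing and df keeps
    counting the space as used.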
    Have I missed any log files? Is there somewhere specific I should be
    looking? Is there some way to (easily) increase logging details to try
    and track this down?
    Unlikely. A more direct approach is lsof ("list open files").
    Mailman has a bunch of processes, though, so make sure you've
    identified the one you need to look at. You want the -c (check
    processes running certain command names) or -p (check processes for
    certain PIDs) options. Here's a look at my shell on Mac OS X:

    chibi:SeminarSEA steve$ lsof -p 3771
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    bash 3771 steve cwd VDIR 14,2 952 40850782 /Users/steve/Work/Teaching/MBA/SeminarSEA
    bash 3771 steve txt VREG 14,2 581636 10221991 /bin/bash
    bash 3771 steve txt VREG 14,2 1797576 21766582 /usr/lib/dyld
    bash 3771 steve txt VREG 14,2 4402196 39746339 /usr/lib/libSystem.B.dylib
    bash 3771 steve txt VREG 14,2 304580 39764872 /usr/lib/libncurses.5.4.dylib
    bash 3771 steve 0u VCHR 4,6 0t72917 42347780 /dev/ttyp6
    bash 3771 steve 1u VCHR 4,6 0t72917 42347780 /dev/ttyp6
    bash 3771 steve 2u VCHR 4,6 0t72917 42347780 /dev/ttyp6
    bash 3771 steve 255u VCHR 4,6 0t72917 42347780 /dev/ttyp6

    In the FD column, "cwd" and "txt" are files that have been read into
    the process space in some sense; they are not subject to IO. The
    numerical FDs are the ones of interest; here they are all just the
    attached TTY (0, 1, and 2 are stdin, stdout, and stderr, of course).
    bash apparently isn't writing or reading any regular files at the
    moment.
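
    For a process with many open files, picking the numbered descriptors
    out of the FD column by eye gets tedious. The sketch below does it in
    Python, using a few rows copied from the output above as sample input.
    The function name is made up for this example, and splitting on
    whitespace is an assumption that holds for these rows but can break on
    real lsof output, where some columns may be empty or file names may
    contain spaces.

```python
import re

# Header plus a few rows copied from the lsof output quoted above.
SAMPLE = """\
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
bash 3771 steve cwd VDIR 14,2 952 40850782 /Users/steve/Work/Teaching/MBA/SeminarSEA
bash 3771 steve txt VREG 14,2 581636 10221991 /bin/bash
bash 3771 steve 0u VCHR 4,6 0t72917 42347780 /dev/ttyp6
bash 3771 steve 1u VCHR 4,6 0t72917 42347780 /dev/ttyp6
bash 3771 steve 2u VCHR 4,6 0t72917 42347780 /dev/ttyp6
bash 3771 steve 255u VCHR 4,6 0t72917 42347780 /dev/ttyp6
"""

# A numbered descriptor is a run of digits, optionally followed by an
# access mode letter (r, w or u).
FD_PATTERN = re.compile(r"^(\d+)([rwu]?)")


def numbered_descriptors(lsof_text):
    """Return (fd number, type, name) for rows whose FD column is numeric."""
    rows = []
    for line in lsof_text.splitlines()[1:]:       # skip the header row
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue                              # row does not have all nine columns
        match = FD_PATTERN.match(fields[3])
        if match:
            rows.append((int(match.group(1)), fields[4], fields[8]))
    return rows
```

    Applied to SAMPLE, this keeps only descriptors 0, 1, 2 and 255 and
    drops the "cwd" and "txt" rows. When hunting for the disappearing
    space, the rows worth a closer look would be numbered descriptors that
    point at regular files rather than at a terminal.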

    Although Mac OS X uses the Mach microkernel, userland is based on
    FreeBSD, so Your Mileage Should Not Vary (much).

    I haven't actually looked at a file with no links in maybe a decade
    (over precisely the issue I started with, I needed to free up space
    fast so I nuked an unimportant log file ... but the process hadn't
    closed it so I didn't get any space back :-P), so I'm not sure exactly
    what you're looking for. But I bet it sticks out like a sore
    thumb. ;-) I suppose there may be a way to look at its content
    (perhaps in gdb?) which might help to identify what is going on.
  • Mark Sapiro at Nov 20, 2008 at 5:28 pm
    J.A. Terranson wrote:
    On Tue, 18 Nov 2008, Mark Sapiro wrote:

    Enabling VERP might cause the MTA to use a lot more queue space, but I
    don't see that it would affect Mailman much.
    The only difference was in the rate at which the "loss" accrued. Roughly
    a 6x increase.

    Mailman's VERP would only affect OutgoingRunner and SMTPDirect.py, and
    all it would do is cause additional SMTP transactions with the MTA.
    Thus, this might help localize the problem to this specific area.

    There almost certainly is something in Mailman's error log unless the
    logfile just can't be grown to accommodate the message.
    No, there really isn't. I have combed through all of them (bounce, error,
    mischeif, post qrunner, smtp & failure, subscribe and vette. Did I miss
    anything?), with no sign of anything being wrong.

    Near the beginning of Mailman's scripts/driver, you will see

    STEALTH_MODE = 1

    if you change this to

    STEALTH_MODE = 0

    or

    STEALTH_MODE = False

    The traceback from the "bug" message should be included with the
    message. This may help if this particular issue recurs.

    --
    Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
    San Francisco Bay Area, California    better use your sense - B. Dylan
