[PostgreSQL-Hackers] Aborted Transaction During Vacuum

G. Anthony Reina
Aug 12, 1999 at 11:26 pm
I've been getting a strange error from the vacuum command. When I try
'vacuum verbose analyze' the vacuum goes through the tables fine until
just after finishing one particular table. Then I get the error:

NOTICE: AbortTransaction and not in in-progress state
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
We have lost the connection to the backend, so further processing is
impossible. Terminating.


When I try to vacuum the tables individually, I get no problems with
aborted backends.

Does anyone know what is going on here?

-Tony

PostgreSQL 6.5.1 on RH Linux 6.0, PII/400 MHz, 512 Meg RAM, 18 Gig HD

12 responses

  • Tom Lane at Aug 13, 1999 at 1:29 am

    "G. Anthony Reina" <reina@nsi.edu> writes:
    I've been getting a strange error from the vacuum command. When I try
    'vacuum verbose analyze' the vacuum goes through the tables fine until
    just after finishing one particular table. Then I get the error:
    NOTICE: AbortTransaction and not in in-progress state
    pqReadData() -- backend closed the channel unexpectedly.
    Interesting. Is there any additional message appearing in the
    postmaster log? Is a core file being generated? (look in the
    data/base/ subdirectory for the database in question) If there
    is a corefile, a debugger backtrace from it would be helpful.

    regards, tom lane
  • G. Anthony Reina at Aug 13, 1999 at 5:20 pm

    Tom Lane wrote:
    Interesting. Is there any additional message appearing in the
    postmaster log? Is a core file being generated? (look in the
    data/base/ subdirectory for the database in question) If there
    is a corefile, a debugger backtrace from it would be helpful.

    regards, tom lane
    Tom,

    I tried the 'vacuum verbose analyze' again today. I get the same error
    with the AbortTransaction. There is a core file generated but no pg_vlock
    file. The core is over 1 Gig in size (38 Megs gzipped) so I'm not so sure
    you'd want to get that (you can have it if you want it though!). Plus,
    there seems to be nothing written to the postmaster.log file (I re-started
    the postmaster before the vacuum using 'nohup postmaster -i -B 15000 -o -F >
    postmaster.log &').
    Oliver Elphick (Oliver.Elphick@lfix.co.uk) told me that this sounds
    like a bug you patched for him a few days ago:
    It does sound very like the bug that was found a few days back where the
    pg_vlock file gets deleted by a mistaken call to vc_abort(). This only
    gets called in the analyze code.
    We're probably talking about the same bug. When I try just 'vacuum verbose'
    without the analyze, the vacuum completes just fine. So it must be within
    the analyze code.

    -Tony
  • Tom Lane at Aug 13, 1999 at 5:38 pm

    "G. Anthony Reina" <reina@nsi.edu> writes:
    I tried the 'vacuum verbose analyze' again today. I get the same error
    with the AbortTransaction. There is a core file generated but no pg_vlock
    file. The core is over 1 Gig in size (38 Megs gzipped) so I'm not so sure
    you'd want to get that (you can have it if you want it though!).
    It wouldn't do me any good anyway without a copy of the executable and a
    copy of gdb built for whatever platform you are on. I was hoping you
    could run gdb on the corefile there and just send the backtrace output
    ("gdb postgres-executable core-file" and then say "bt").
    Oliver Elphick (Oliver.Elphick@lfix.co.uk) told me that this sounds
    like a bug you patched for him a few days ago:
    I doubt it's the same bug --- the error message generated by the other
    bug was different...

    regards, tom lane
  • Oliver Elphick at Aug 13, 1999 at 6:05 pm

    Tom Lane wrote:
    Oliver Elphick (Oliver.Elphick@lfix.co.uk) told me that this sounds
    like a bug you patched for him a few days ago:
    I doubt it's the same bug --- the error message generated by the other
    bug was different...
    The error message I posted was from the log output at level 3, which he doesn't
    seem to be running; if he switches to logging at that level, he should look for
    messages that mention pg_vlock.

    As far as I recall, the only error message he has listed so far is from
    psql, which merely records the unlink failure at the end of the vacuum.
    Everything else looks remarkably like the vc_abort bug.

    --
    Vote against SPAM: http://www.politik-digital.de/spam/
    ========================================
    Oliver Elphick Oliver.Elphick@lfix.co.uk
    Isle of Wight http://www.lfix.co.uk/oliver
    PGP key from public servers; key ID 32B8FAA1
    ========================================
    "Watch ye therefore, and pray always, that ye may be
    accounted worthy to escape all these things that shall
    come to pass, and to stand before the Son of man."
    Luke 21:36
  • Tom Lane at Aug 13, 1999 at 6:10 pm

    "Oliver Elphick" <olly@lfix.co.uk> writes:
    As far as I recall, the only error message he has listed so far is from
    psql, which merely records the unlink failure at the end of the vacuum.
    Everything else looks remarkably like the vc_abort bug.
    But the vc_abort problem didn't cause a backend coredump --- it just
    reported an error and failed to complete the vacuum, no?

    regards, tom lane
  • Oliver Elphick at Aug 13, 1999 at 6:28 pm

    Tom Lane wrote:
    "Oliver Elphick" <olly@lfix.co.uk> writes:
    As far as I recall, the only error message he has listed so far is from
    psql, which merely records the unlink failure at the end of the vacuum.
    Everything else looks remarkably like the vc_abort bug.
    But the vc_abort problem didn't cause a backend coredump --- it just
    reported an error and failed to complete the vacuum, no?

    I got a coredump too; I never mentioned it, because I found the proximate
    cause and curing that made it go away. When unlink failed in vc_shutdown,
    it called ELOG and a segfault occurred a little later.

    I forgot about that when your patch fixed the original problem.

    --
    Vote against SPAM: http://www.politik-digital.de/spam/
    ========================================
    Oliver Elphick Oliver.Elphick@lfix.co.uk
    Isle of Wight http://www.lfix.co.uk/oliver
    PGP key from public servers; key ID 32B8FAA1
    ========================================
    "Watch ye therefore, and pray always, that ye may be
    accounted worthy to escape all these things that shall
    come to pass, and to stand before the Son of man."
    Luke 21:36
  • Tom Lane at Aug 13, 1999 at 9:33 pm

    "Oliver Elphick" <olly@lfix.co.uk> writes:
    But the vc_abort problem didn't cause a backend coredump --- it just
    reported an error and failed to complete the vacuum, no?
    I got a coredump too; I never mentioned it, because I found the proximate
    cause and curing that made it go away. When unlink failed in vc_shutdown,
    it called ELOG and a segfault occurred a little later.
    Ah, I wish I'd known that. So what Tony is seeing is exactly the same
    behavior you observed. OK, I feel better now --- I thought the coredump
    was probably some platform-specific misbehavior that only Tony was seeing.

    We still need to figure out what is causing it, because I can see no
    reason for a coredump after vc_shutdown elog()s. Something is being
    clobbered that should not be. But it sounds like installing the
    vc_abort patch will get Tony on his feet, and then we can look for the
    secondary bug at our leisure.

    regards, tom lane
  • G. Anthony Reina at Aug 13, 1999 at 6:34 pm

    Tom Lane wrote:

    But the vc_abort problem didn't cause a backend coredump --- it just
    reported an error and failed to complete the vacuum, no?

    regards, tom lane
    Here's the error message again:

    NOTICE: --Relation ex_ellipse_proc--
    NOTICE: --Relation ex_ellipse_proc--
    NOTICE: Pages 2545: Changed 0, Reapped 0, Empty 0, New 0; Tup 30535: Vac 0,
    Keep/VTL 0/0, Crash 0, UnUsed 0,
    MinLen 660, MaxLen 660; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
    Pages 0/0. Elapsed 1/0 sec.
    NOTICE: Pages 2545: Changed 0, Reapped 0, Empty 0, New 0; Tup 30535: Vac 0,
    Keep/VTL 0/0, Crash 0, UnUsed 0,
    MinLen 660, MaxLen 660; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
    Pages 0/0. Elapsed 1/0 sec.
    NOTICE: Index pkex_ellipse_proc: Pages 138; Tuples 30535. Elapsed 0/0 sec.
    NOTICE: Index pkex_ellipse_proc: Pages 138; Tuples 30535. Elapsed 0/0 sec.
    NOTICE: --Relation ex_ellipse_cell--
    NOTICE: --Relation ex_ellipse_cell--
    NOTICE: Pages 370: Changed 0, Reapped 0, Empty 0, New 0; Tup 6109: Vac 0,
    Keep/VTL 0/0, Crash 0, UnUsed 0,
    MinLen 80, MaxLen 2736; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
    Pages 0/0. Elapsed 0/0 sec.
    NOTICE: Pages 370: Changed 0, Reapped 0, Empty 0, New 0; Tup 6109: Vac 0,
    Keep/VTL 0/0, Crash 0, UnUsed 0,
    MinLen 80, MaxLen 2736; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
    Pages 0/0. Elapsed 0/0 sec.
    NOTICE: Index pkex_ellipse_cell: Pages 42; Tuples 6109. Elapsed 0/0 sec.
    NOTICE: Index pkex_ellipse_cell: Pages 42; Tuples 6109. Elapsed 0/0 sec.
    NOTICE: --Relation ex_ellipse_opto--
    NOTICE: --Relation ex_ellipse_opto--
    NOTICE: Pages 26356: Changed 0, Reapped 0, Empty 0, New 0; Tup 71475: Vac
    0, Keep/VTL 0/0, Crash 0, UnUsed 0,
    MinLen 1760, MaxLen 6108; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
    Pages 0/0. Elapsed 3/2 sec.
    NOTICE: Pages 26356: Changed 0, Reapped 0, Empty 0, New 0; Tup 71475: Vac
    0, Keep/VTL 0/0, Crash 0, UnUsed 0,
    MinLen 1760, MaxLen 6108; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
    Pages 0/0. Elapsed 3/2 sec.
    NOTICE: Index pkex_ellipse_opto: Pages 357; Tuples 71475. Elapsed 0/0 sec.
    NOTICE: Index pkex_ellipse_opto: Pages 357; Tuples 71475. Elapsed 0/0 sec.
    NOTICE: --Relation ex_ellipse_opto_proc--
    NOTICE: --Relation ex_ellipse_opto_proc--
    NOTICE: Pages 14742: Changed 0, Reapped 0, Empty 0, New 0; Tup 30535: Vac
    0, Keep/VTL 0/0, Crash 0, UnUsed 0,
    MinLen 1816, MaxLen 5900; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
    Pages 0/0. Elapsed 1/1 sec.
    NOTICE: Pages 14742: Changed 0, Reapped 0, Empty 0, New 0; Tup 30535: Vac
    0, Keep/VTL 0/0, Crash 0, UnUsed 0,
    MinLen 1816, MaxLen 5900; Re-using: Free/Avail. Space 0/0; EndEmpty/Avail.
    Pages 0/0. Elapsed 1/1 sec.
    NOTICE: Index pkex_ellipse_opto_proc: Pages 138; Tuples 30535. Elapsed 0/0
    sec.
    NOTICE: Index pkex_ellipse_opto_proc: Pages 138; Tuples 30535. Elapsed 0/0
    sec.
    ERROR: vacuum: can't destroy lock file!
    NOTICE: AbortTransaction and not in in-progress state
    NOTICE: AbortTransaction and not in in-progress state
    pqReadData() -- backend closed the channel unexpectedly.
    This probably means the backend terminated abnormally
    before or while processing the request.
    We have lost the connection to the backend, so further processing is
    impossible. Terminating.
    [postgres@bigred ~]$



    -Tony
  • G. Anthony Reina at Aug 13, 1999 at 8:38 pm

    Tom Lane wrote:

    However, there should have been an "ERROR" message if something reported
    an error. You said you only saw "NOTICE: AbortTransaction and not in
    in-progress state" and not any "ERROR" before or after it? The NOTICE
    presumably came out of AbortTransaction itself, after whatever went
    wrong went wrong...
    Yes, I have an ERROR message (either I didn't notice it before or it is
    new):

    NOTICE: Index pkex_ellipse_opto_proc: Pages 138; Tuples 30535. Elapsed 0/0
    sec.
    NOTICE: Index pkex_ellipse_opto_proc: Pages 138; Tuples 30535. Elapsed 0/0
    sec.
    ERROR: vacuum: can't destroy lock file!
    NOTICE: AbortTransaction and not in in-progress state
    NOTICE: AbortTransaction and not in in-progress state
    pqReadData() -- backend closed the channel unexpectedly.
    This probably means the backend terminated abnormally
    before or while processing the request.
    We have lost the connection to the backend, so further processing is
    impossible. Terminating.


    It looks like the error is either occurring on table ex_ellipse_opto_proc or
    the next table in the list, ex_ellipse_proc. However, I think the error is
    more general than that. It appears to occur just before the last table in
    the database gets vacuumed. Here's the list of my tables:

    Database = db01
    +------------------+----------------------------------+----------+
    |      Owner       |             Relation             |   Type   |
    +------------------+----------------------------------+----------+
    | postgres         | center_out                       | table    |
    | postgres         | center_out_analog                | table    |
    | postgres         | center_out_analog_proc           | table    |
    | postgres         | center_out_cell                  | table    |
    | postgres         | center_out_cell_proc             | table    |
    | postgres         | center_out_opto                  | table    |
    | postgres         | center_out_opto_proc             | table    |
    | postgres         | center_out_pref_direction        | table    |
    | postgres         | center_out_proc                  | table    |
    | postgres         | electrode                        | table    |
    | postgres         | ellipse                          | table    |
    | postgres         | ellipse_analog                   | table    |
    | dan              | ellipse_analog_proc              | table    |
    | postgres         | ellipse_cell                     | table    |
    | dan              | ellipse_cell_proc                | table    |
    | postgres         | ellipse_opto                     | table    |
    | dan              | ellipse_opto_proc                | table    |
    | dan              | ellipse_proc                     | table    |
    | dan              | ex_ellipse                       | table    |
    | dan              | ex_ellipse_analog_proc           | table    |
    | dan              | ex_ellipse_cell                  | table    |
    | dan              | ex_ellipse_cell_proc             | table    |
    | dan              | ex_ellipse_opto                  | table    |
    | dan              | ex_ellipse_opto_proc             | table    |  <---- ERROR occurs somewhere after here
    | dan              | ex_ellipse_proc                  | table    |
    +------------------+----------------------------------+----------+

    Yesterday, I was adding the tables back in one by one from a previous pg_dump.
    The error didn't pop up until after I had about 9 or 10 tables restored. I
    didn't think about it then, but it may have always occurred after the
    second-to-last table in the list. But don't hold me to that.

    In any case, I'll try to re-build everything as you've asked to get a
    better error message. Maybe if I go through it step by step again, you'll be
    able to help me find where the error is taking place.

    Thanks Tom and Oliver. I'll get back to you when I finish the rebuild.

    -Tony
  • Tom Lane at Aug 13, 1999 at 9:23 pm

    "G. Anthony Reina" <reina@nsi.edu> writes:
    ERROR: vacuum: can't destroy lock file!
    NOTICE: AbortTransaction and not in in-progress state
    NOTICE: AbortTransaction and not in in-progress state
    pqReadData() -- backend closed the channel unexpectedly.
    Ah-hah! Oliver is right, then, at least in part --- that error message
    from vacuum suggests that the vc_abort bug *is* biting you. However,
    there may be more going on, because what Oliver and others observed did
    not include a coredump (at least not that I heard about).

    You can probably suppress the problem by installing the patch I posted
    to pgsql-patches a few days ago. However, I'd appreciate it if you'd
    first try to reproduce the problem again with debug/assert turned on,
    so that we can get some idea whether there is an additional bug that's
    only biting you and not the previous reporters.

    BTW, if vc_abort is involved then the occurrence of the problem probably
    depends on whether any other backends are using the database and what
    they are doing. (The vc_abort bug escaped notice up till last week
    because it doesn't trigger when vacuum is the only thing running.)
    Do you have other clients running when you do the vacuum? What are
    they doing?

    regards, tom lane
  • G. Anthony Reina at Aug 14, 1999 at 4:05 am
    Tom and Oliver,

    I added the patch and re-built everything. No more problems with the
    vacuum analyze (knock on wood since this is still Friday the 13th).

    Tom, does this patch address the problem you found in your
    EndPortalAllocMode comments? I'm just wondering whether another patch
    will be needed or if this one should cover the problem.

    Thanks again for the help guys.
    -Tony


    Tom wrote:
    (a) vacuum.c does a CommitTransaction to commit its final table's
    worth of fixes.
    (b) CommitTransaction calls EndPortalAllocMode.
    (c) vacuum calls vc_shutdown, which tries to remove pg_vlock,
    and reports an error when the unlink() call fails.
    (d) during error cleanup, AbortTransaction is called.
    (e) AbortTransaction calls EndPortalAllocMode. There has been no
    intervening StartPortalAllocMode, so the portal's context stack
    is empty. EndPortalAllocMode tries to free a nonexistent memory
    context (or, if you have asserts turned on, dies with an assert
    failure). Ka-boom.
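
To make the (a)-(e) sequence above easier to follow, here is a minimal, self-contained
C sketch of the same control flow. The function names mirror the ones discussed in the
thread, but the bodies are toy stand-ins rather than the 6.5.1 sources, and
portal_stack_depth is an invented stand-in for the portal's memory-context stack.

    /*
     * Toy model of the crash chain: commit pops the only portal context,
     * then the failed unlink in vc_shutdown reports an error outside any
     * transaction, and the abort path tries to pop the (now empty) stack.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static int portal_stack_depth = 0;   /* contexts pushed by StartPortalAllocMode */

    static void StartPortalAllocMode(void) { portal_stack_depth++; }

    static void EndPortalAllocMode(void)
    {
        if (portal_stack_depth == 0)
        {
            /* (e) nothing left to pop: the real backend frees a nonexistent
             * memory context here (or trips an assert) and dumps core */
            fprintf(stderr, "EndPortalAllocMode: empty context stack -- ka-boom\n");
            exit(1);
        }
        portal_stack_depth--;
    }

    static void AbortTransaction(void)
    {
        fprintf(stderr, "NOTICE:  AbortTransaction and not in in-progress state\n");
        EndPortalAllocMode();            /* (d)-(e) error cleanup pops again */
    }

    static void CommitTransaction(void)
    {
        EndPortalAllocMode();            /* (b) commit pops the stack */
    }

    static void vc_shutdown(void)
    {
        /* (c) pg_vlock is already gone (removed by the mistaken vc_abort),
         * so the unlink() fails and an error is reported ... */
        fprintf(stderr, "ERROR:  vacuum: can't destroy lock file!\n");
        AbortTransaction();              /* (d) ... triggering abort cleanup */
    }

    int main(void)
    {
        StartPortalAllocMode();          /* context pushed while vacuuming */
        CommitTransaction();             /* (a)+(b) final table committed; stack empty */
        vc_shutdown();                   /* (c)-(e) */
        return 0;
    }

Compiled and run, it prints the same ERROR/NOTICE pair seen in Tony's session and then
stops at the point where the real backend would dump core.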
  • Tom Lane at Aug 14, 1999 at 2:29 pm

    "G. Anthony Reina" <reina@nsi.edu> writes:
    Tom, does this patch address the problem you found in your
    EndPortalAllocMode comments? I'm just wondering whether another patch
    will be needed or if this one should cover the problem.
    I think you need not worry. EndPortalAllocMode has a bug that ought to
    be fixed IMHO, but it's only triggered if an error is reported outside
    of any transaction context, and there is darn little backend processing
    that happens outside of any transaction context. This vacuum shutdown
    check might be the only such error, in fact. In short, it's something
    to clean up for 6.6, but I doubt it's worth issuing a 6.5 patch for.

    regards, tom lane
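
As a footnote to Tom's last point: the cleanup he has in mind for 6.6 would presumably
amount to making the abort path tolerate an empty portal context stack. The following
is a purely hypothetical sketch of that kind of guard, reusing the invented
portal_stack_depth stand-in from the toy model earlier; it is not the actual
PostgreSQL change.

    #include <stdio.h>

    /* Hypothetical guard only -- not the actual 6.6 fix.  The idea is simply
     * that popping an empty portal context stack should be a no-op (or at
     * worst a NOTICE), not a reason to free a nonexistent context. */
    static int portal_stack_depth = 0;

    static void EndPortalAllocMode(void)
    {
        if (portal_stack_depth == 0)
        {
            /* error reported outside any transaction context: nothing to pop */
            fprintf(stderr, "NOTICE:  EndPortalAllocMode: empty context stack, ignoring\n");
            return;
        }
        portal_stack_depth--;            /* normal case: pop the top context */
    }

    int main(void)
    {
        EndPortalAllocMode();            /* harmless now, instead of a coredump */
        return 0;
    }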
