Had a drive failure on a RAID 5 array in a backup box that a couple of postgres databases sit on. One of the databases is a slony subscriber to a production database and the other is a test-environment database.

The drive was offline...brought it back online, hoping it would start a rebuild...which it didn't. Almost immediately I started getting errors from slony:

could not access status of transaction 2463273456
could not open file "pg_clog/0937": No such file or directory
...
etc.
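
For what it's worth, the band-aid I've seen come up on the lists when a clog segment goes missing like that is to recreate it filled with zeroes so reads stop failing; zeroes read back as "transaction in progress", so rows from those transactions just look uncommitted. Assuming the default 8 kB block size, a segment covers 1,048,576 transactions and is a 256 kB file, so for the segment named in the error it would be something like:

dd if=/dev/zero of=/db/pg_clog/0937 bs=256k count=1

That's strictly a salvage step to get a dump out, not a repair.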

Looks like the subscriber database had some issues (at least with one specific table). In addition, trying to access the other (test) database yielded an error accessing pg_namespace.

So....reseated the drive, which started a rebuild. I stopped postgres. When the rebuild is done (or if it fails, I'll replace the drive), I'll restart postgres and see what happens.

Question...should I just re-initdb and restore the databases from backup? Should I have done something differently once I noticed the failure? I've had drive failures before on this box and either rebuilt the array or replaced the drive with no postgres issues (although the amount of traffic was much less than it is now).
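
To make the first option concrete, the nuke-and-reload I have in mind would look roughly like this -- the dump file name is a placeholder, and the slony database shouldn't even need a dump, since re-subscribing would copy everything back from the production origin:

# postgres is already stopped; clear out (or archive) the old contents of /db
# first, since initdb wants an empty data directory
initdb -D /db
pg_ctl -D /db start
createdb test
psql test < test_backup.dump   # reload the test database from its pg_dump
# then rebuild the slony node with slonik and let the subscription re-copy the data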

Any help would be appreciated.



  • Jeff Amiel at Jan 18, 2007 at 7:29 pm
    raid rebuilt...
    ran fsck

    PARTIALLY TRUNCATED INODE I=612353
    SALVAGE? yes

    INCORRECT BLOCK COUNT I=612353 (544 should be 416)
    CORRECT? yes

    PARTIALLY TRUNCATED INODE I=612389
    SALVAGE? yes

    INCORRECT BLOCK COUNT I=612389 (544 should be 416)
    CORRECT? yes

    INCORRECT BLOCK COUNT I=730298 (676448 should be 675520)
    CORRECT? yes

    root@back-app-1# find /db -inum 612353
    /db/pg_clog/0952

    root@back-app-1# find /db -inum 612389
    /db/pg_clog/0951

    root@back-app-1# find /db -inum 730298
    /db/base/1093090/1212223
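
    That last path breaks down as base/<database OID>/<relfilenode>, so in principle
    the names behind 1093090 and 1212223 can be looked up -- something along these
    lines, where $DBNAME stands for whatever the first query returns (and assuming
    the catalogs will still answer queries at all):

    psql -c "select datname from pg_database where oid = 1093090"
    psql -d "$DBNAME" -c "select relname from pg_class where relfilenode = 1212223"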

    hmmm...wanted to see what table that third one belonged to, so I tried:

    test=# select oid, relname from pg_class order by oid;

    ERROR: could not access status of transaction 2485385834
    DETAIL: could not open file "pg_clog/0942": No such file or directory
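
    Sanity check: assuming the default 8 kB block size, a clog segment covers
    32768 * 32 = 1,048,576 transactions, so that xid really does map to segment 0942:

    printf '%04X\n' $(( 2485385834 / 1048576 ))    # prints 0942

    So that's yet another clog segment gone, beyond the two fsck salvaged (0951 and 0952).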

    So....am I screwed here...should I just re-initdb and restore the entire kit and kaboodle from scratch?

  • Matthew Peter at Jan 18, 2007 at 10:48 pm
    Wow. I just noticed I have the same problem today after a vacuum, as well as
    a degraded array. Musta been a time-release Y2k7 bug. Hopefully I didn't lose
    anything too important.



  • Tom Lane at Jan 18, 2007 at 11:43 pm

    Jeff Amiel writes:
    > ran fsck
    > PARTIALLY TRUNCATED INODE I=612353
    > SALVAGE? yes
    > INCORRECT BLOCK COUNT I=612353 (544 should be 416)
    > CORRECT? yes
    > root@back-app-1# find /db -inum 612353
    > /db/pg_clog/0952

    Yech. So much for RAID reliability ... maybe you need to reconfigure
    the array for more redundancy?

    > So....am I screwed here...should I just re-initdb and restore the entire kit and kaboodle from scratch?

    Given that it's just a backup machine, it's probably not worth heroics
    to try to recover. I'm not sure that you could trust any data you got
    out of it, anyway --- corrupted pg_clog is likely to lead to
    inconsistency in the form of partially-applied transactions, which can
    be hard to detect.

    regards, tom lane
  • Jeff Amiel at Jan 18, 2007 at 11:54 pm

    Tom Lane wrote:
    > Yech. So much for RAID reliability ... maybe you need to reconfigure
    > the array for more redundancy?

    Yeah...I'm not sure if I screwed the pooch by trying to bring the drive
    back 'online'.....in the past we've just re-seated it and the raid
    card 'does its thing' and rebuilds or takes it offline again.
