Subject: [BUGS] possible database corruption
From: Chris Anderson
To: pgsql-bugs@postgresql.org
cc: John Maddalozzo
Date: Fri, 12 Jan 2001 21:16:31 -0600 (CST)
Status:

============================================================================
                        POSTGRESQL BUG REPORT TEMPLATE
============================================================================

Your name               : Chris Anderson
Your email address      : chris@journyx.com

System Configuration
---------------------
  Architecture (example: Intel Pentium)         : Intel Pentium 3 (x2)

  Operating System (example: Linux 2.0.26 ELF)  : Linux 2.2.14 (SMP)

  PostgreSQL version (example: PostgreSQL-7.0)  : PostgreSQL-7.0.3

  Compiler used (example: gcc 2.8.0)            : egcs-2.91.66

Please enter a FULL description of your problem:
------------------------------------------------

We are using postgresql as the backend for an online service where we
host a web based application for customers. Each customer has their own
copy of the application server (written in python) which maintains three
persistent connections to postgres. We presently have a single postgres
instance on a dedicated machine which maintains 94 databases and around
280 connections. These connections are initiated from four additional
servers which provide the application to the customer. These machines
are all running the same version of linux as the database server,
however their pgres clients are only at version 6.5.

This solution has worked very well for us in the past, but now we are
experiencing very strange behavior which seems to be the result of
periodic corruption in the database files. Sometimes, immediately after
we create a new database, it will somehow become corrupted, and trying
to access it will cause postmaster to crash, thereby killing everyone
else's connections. Note that not all types of access will cause it to
crash; however, a vacuum will almost always do the trick. Actually,
selects and inserts usually work just fine.
However, it does tend to lead toward a general instability in the
server, and we see postgres crashes quite regularly after it happens.

We cannot predict when this will happen, though we've been seeing it
almost weekly now, but once it does happen any new databases created
will exhibit the exact same behavior every time. Once this happens, the
only way I've been able to recover from the problem is to wipe the data
directory and restore from a pg_dump. Deleting the offending database
and recreating it will not do the trick.

The server itself has never locked up, there are no known filesystem
errors, and I have been very careful to clean up any lingering shm stuff
before reinvoking postmaster.

When postmaster dies, it does dump a core, which I can provide. The
stack trace looks like this:

-- begin gdb output --
GNU gdb 19991004
Copyright 1998 Free Software Foundation, Inc.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `/usr/local/pgres/bin/postgres localhost postgres d'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libcrypt.so.1...done.
Reading symbols from /lib/libnsl.so.1...done.
Reading symbols from /lib/libdl.so.2...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /lib/libutil.so.1...done.
Reading symbols from /usr/lib/libreadline.so.3...done.
Reading symbols from /lib/libtermcap.so.2...done.
Reading symbols from /usr/lib/libncurses.so.4...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
Reading symbols from /lib/libnss_files.so.2...done.
#0  0x81253ea in GetRawDatabaseInfo ()
(gdb) where
#0  0x81253ea in GetRawDatabaseInfo ()
#1  0x8125016 in InitPostgres ()
#2  0x80ebed5 in PostgresMain ()
#3  0x80d6652 in DoBackend ()
#4  0x80d6231 in BackendStartup ()
#5  0x80d55ea in ServerLoop ()
#6  0x80d5074 in PostmasterMain ()
#7  0x80ab866 in main ()
#8  0x401049cb in __libc_start_main (main=0x80ab800, argc=6,
    argv=0xbffffb64, init=0x8064084 <_init>, fini=0x812a0cc <_fini>,
    rtld_fini=0x4000ae60 <_dl_fini>, stack_end=0xbffffb5c)
    at ../sysdeps/generic/libc-start.c:92
-- end gdb output --

Needless to say, this is quite disconcerting, and absolutely _any_ input
you could provide would be invaluable.

Please describe a way to repeat the problem. Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------

As I mentioned above, it is difficult to predict when it will start
happening; however, we have only ever seen this once the number of
connections started getting pretty high.

If it is significant, postmaster is started with the following options:

  su -l postgres -c '/usr/local/pgres/bin/postmaster -i -N 512 -B 2048 2>&1 > /var/log/postgres.log'

If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

Well, I know how to repair it, but what I am most interested in is how
to prevent it, or at least how to debug what may be causing the problem
in the first place.
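
On the debugging side, one detail of the postmaster invocation quoted
above may be worth checking. This is a sketch, not a diagnosis. In a
POSIX shell, redirections are applied left to right, so "2>&1 > file"
first points stderr at wherever stdout was going, and only then sends
stdout to the file. If the command runs as written, backend error
messages may not be reaching /var/log/postgres.log. The snippet below
uses a hypothetical emit function as a stand-in for postmaster, and
temporary files in place of the real log, to show the difference between
the two orderings.

```shell
#!/bin/sh
# emit is a hypothetical stand-in for a server that writes to both streams.
emit() {
    echo "stdout line"
    echo "stderr line" >&2
}

# Temporary directory used instead of the real log location.
log_dir=$(mktemp -d)

# Ordering from the report: stderr is duplicated onto the current stdout
# (here, the command substitution) before stdout is sent to the file,
# so the stderr text escapes the log and is captured in $escaped.
escaped=$(emit 2>&1 > "$log_dir/as_reported.log")

# Reversed ordering: stdout goes to the file first, then stderr follows it,
# so both streams end up in the log.
emit > "$log_dir/reversed.log" 2>&1

echo "escaped the log with the reported ordering: $escaped"
echo "--- as_reported.log ---"
cat "$log_dir/as_reported.log"
echo "--- reversed.log ---"
cat "$log_dir/reversed.log"
```

If the same behavior applies to the real command, writing the
redirection as "> /var/log/postgres.log 2>&1" would be one way to get
the backend's error output into the log before the next crash.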