[tor-bugs] #32692 [Internal Services/Tor Sysadmin Team]: meronense database server crashes on I/O errors
Tor Bug Tracker & Wiki
blackhole at torproject.org
Sat Dec 7 17:05:27 UTC 2019
#32692: meronense database server crashes on I/O errors
-----------------------------------------------------+-----------------
Reporter: anarcat | Owner: tpa
Type: defect | Status: new
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
-----------------------------------------------------+-----------------
meronense crashed today. times in this first snippet are local (UTC-5):
{{{
08:35:18 <nsa> tor-nagios: [meronense] system - filesystem check is
WARNING: /dev/mapper/vg_meronense-srv (/home) has 2 errors
09:03:53 <nsa> tor-nagios: [meronense] process - postgresql96 - master is
CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (postgres), command
name postgres, args
/usr/lib/postgresql/9.6/bin/postgres
09:08:33 <nsa> tor-nagios: [meronense] system - all services running is
CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
09:22:43 <nsa> tor-nagios: [meronense] system - all services running is
WARNING: degraded
}}}
i restarted postgres when i noticed:
{{{
10:11:50 <anarcat> restarted postgres on meronense, unknown reason crash
}}}
but it crashed again.
the errors actually started earlier (times in UTC), in
/var/log/postgresql/postgresql-9.6-main.log:
{{{
2019-12-07 12:09:30 GMT WARNING: corrupted statistics file
"pg_stat_tmp/global.stat"
[repeats until]
2019-12-07 12:44:28 GMT WARNING: corrupted statistics file
"pg_stat_tmp/global.stat"
2019-12-07 12:44:28 GMT LOG: using stale statistics instead of current
ones because stats collector is not responding
2019-12-07 12:44:28 GMT WARNING: corrupted statistics file
"pg_stat_tmp/global.stat"
2019-12-07 12:47:26 GMT PANIC: could not fsync file "base/16401/17372":
Input/output error
2019-12-07 12:47:26 GMT LOG: checkpointer process (PID 1284) was
terminated by signal 6: Aborted
2019-12-07 12:47:26 GMT LOG: terminating any other active server
processes
2019-12-07 12:47:26 GMT WARNING: terminating connection because of crash
of another server process
2019-12-07 12:47:26 GMT DETAIL: The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly
corrupted shared memory.
2019-12-07 12:47:26 GMT HINT: In a moment you should be able to reconnect
to the database and repeat your command.
2019-12-07 12:47:26 GMT LOG: archiver process (PID 1288) exited with exit
code 1
2019-12-07 12:47:26 GMT WARNING: terminating connection because of crash
of another server process
2019-12-07 12:47:26 GMT DETAIL: The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly
corrupted shared memory.
2019-12-07 12:47:26 GMT HINT: In a moment you should be able to reconnect
to the database and repeat your command.
2019-12-07 12:47:26 GMT CONTEXT: SQL statement "UPDATE bwhist
SET read = array_merge(read, insert_read),
written = array_merge(written, insert_written),
dirread = array_merge(dirread, insert_dirread),
dirwritten = array_merge(dirwritten, insert_dirwritten)
WHERE fingerprint = insert_fingerprint AND date = insert_date"
PL/pgSQL function
insert_bwhist(character,date,bigint[],bigint[],bigint[],bigint[]) line 12
at SQL statement
2019-12-07 12:47:36 GMT LOG: statistics collector process (PID 1289) was
terminated by signal 9: Killed
2019-12-07 12:47:36 GMT LOG: all server processes terminated;
reinitializing
2019-12-07 12:47:58 GMT LOG: database system was interrupted; last known
up at 2019-12-07 11:02:32 GMT
2019-12-07 12:48:09 GMT LOG: could not fsync file
"./base/16401/17372_fsm": Input/output error
2019-12-07 12:48:09 GMT LOG: could not fsync file
"./base/16401/17372_vm": Input/output error
2019-12-07 12:48:09 GMT LOG: could not fsync file "./base/16401/17488":
Input/output error
2019-12-07 12:48:17 GMT LOG: could not fsync file
"./pg_stat_tmp/global.stat": Input/output error
2019-12-07 12:48:17 GMT LOG: database system was not properly shut down;
automatic recovery in progress
2019-12-07 12:48:17 GMT LOG: redo starts at 3D7/57E470D8
2019-12-07 12:48:17 GMT LOG: redo done at 3D7/5BB04860
2019-12-07 13:49:37 GMT PANIC: could not fsync file "base/16401/17488":
Input/output error
2019-12-07 13:49:37 GMT LOG: startup process (PID 20894) was terminated
by signal 6: Aborted
2019-12-07 13:49:37 GMT LOG: aborting startup due to startup process
failure
2019-12-07 13:49:37 GMT LOG: database system is shut down
}}}
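the PANIC on fsync means the kernel returned an I/O error while postgres was flushing a data file, so the next step is usually to line those timestamps up against the kernel log on the guest (and on kvm4). a minimal sketch of that check, run here against a made-up dmesg excerpt rather than the real host's log:

```shell
# Made-up dmesg excerpt standing in for the real guest kernel log; on
# meronense itself this would come from `dmesg -T` or `journalctl -k`.
dmesg_sample='[Sat Dec  7 12:47:20 2019] blk_update_request: I/O error, dev vda, sector 123456
[Sat Dec  7 12:47:21 2019] EXT4-fs error (device dm-3): ext4_wait_block_bitmap: Cannot read block bitmap'
# Pull out only the lines that indicate storage-level trouble.
printf '%s\n' "$dmesg_sample" | grep -Ei 'i/o error|ext4-fs error'
```

if the kernel log shows errors on the virtual disk at the same timestamps, the problem sits below postgres, in the guest's storage stack or on kvm4.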
here's my attempt at restarting it:
{{{
2019-12-07 15:11:33 GMT FATAL: the database system is starting up
2019-12-07 15:11:33 GMT FATAL: the database system is starting up
2019-12-07 15:11:34 GMT LOG: could not stat file
"./pg_stat_tmp/global.stat": Structure needs cleaning
2019-12-07 15:11:34 GMT LOG: database system was not properly shut down;
automatic recovery in progress
2019-12-07 15:11:34 GMT FATAL: the database system is starting up
2019-12-07 15:11:34 GMT FATAL: the database system is starting up
2019-12-07 15:11:35 GMT FATAL: the database system is starting up
2019-12-07 15:11:35 GMT FATAL: the database system is starting up
2019-12-07 15:11:36 GMT FATAL: the database system is starting up
2019-12-07 15:11:36 GMT LOG: redo starts at 3D7/57E470D8
2019-12-07 15:11:36 GMT FATAL: the database system is starting up
2019-12-07 15:11:37 GMT FATAL: the database system is starting up
2019-12-07 15:11:37 GMT FATAL: the database system is starting up
2019-12-07 15:11:38 GMT FATAL: the database system is starting up
2019-12-07 15:11:38 GMT LOG: incomplete startup packet
2019-12-07 15:11:46 GMT LOG: redo done at 3D7/5BB04860
2019-12-07 15:48:19 GMT PANIC: could not fsync file "base/16401/17488":
Input/output error
2019-12-07 15:48:19 GMT LOG: startup process (PID 27916) was terminated
by signal 6: Aborted
2019-12-07 15:48:19 GMT LOG: aborting startup due to startup process
failure
2019-12-07 15:48:19 GMT LOG: database system is shut down
}}}
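the "could not stat file ... Structure needs cleaning" line is the interesting one: that string is strerror(EUCLEAN), the errno ext4 returns when it detects on-disk corruption, which means the filesystem itself was damaged and would normally need an offline fsck before postgres can run safely again. a quick way to confirm the mapping (a sketch, assuming python3 is available):

```shell
# "Structure needs cleaning" is strerror(EUCLEAN), errno 117 on Linux:
# ext4's way of reporting on-disk corruption. The usual remedy is an
# offline fsck of the affected filesystem (here vg_meronense-srv).
# On Linux this prints: 117 Structure needs cleaning
python3 -c 'import errno, os; print(errno.EUCLEAN, os.strerror(errno.EUCLEAN))'
```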
then i shut down the host from kvm4 and restarted it. after that, postgres
was able to recover:
{{{
2019-12-07 16:54:17 GMT LOG: incomplete startup packet
2019-12-07 16:54:17 GMT LOG: database system shutdown was interrupted;
last known up at 2019-12-07 15:26:47 GMT
2019-12-07 16:54:18 GMT FATAL: the database system is starting up
2019-12-07 16:54:18 GMT FATAL: the database system is starting up
2019-12-07 16:54:18 GMT LOG: database system was not properly shut down;
automatic recovery in progress
2019-12-07 16:54:18 GMT LOG: redo starts at 3D7/57E470D8
2019-12-07 16:54:19 GMT FATAL: the database system is starting up
2019-12-07 16:54:19 GMT FATAL: the database system is starting up
2019-12-07 16:54:20 GMT FATAL: the database system is starting up
2019-12-07 16:54:20 GMT FATAL: the database system is starting up
2019-12-07 16:54:21 GMT FATAL: the database system is starting up
2019-12-07 16:54:21 GMT FATAL: the database system is starting up
2019-12-07 16:54:22 GMT FATAL: the database system is starting up
2019-12-07 16:54:22 GMT FATAL: the database system is starting up
2019-12-07 16:54:23 GMT FATAL: the database system is starting up
2019-12-07 16:54:23 GMT LOG: incomplete startup packet
2019-12-07 16:54:23 GMT LOG: redo done at 3D7/5BB04860
2019-12-07 16:57:28 GMT LOG: MultiXact member wraparound protections are
now enabled
2019-12-07 16:57:28 GMT LOG: autovacuum launcher started
2019-12-07 16:57:28 GMT LOG: database system is ready to accept
connections
}}}
disk arrays are fine according to mdadm on kvm4, the parent host:
{{{
root at kvm4:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5]
[raid4] [raid10]
md2 : active raid1 sda[2] sdb[3]
3906887488 blocks super 1.2 [2/2] [UU]
bitmap: 12/30 pages [48KB], 65536KB chunk
md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
479002944 blocks super 1.2 [2/2] [UU]
bitmap: 3/4 pages [12KB], 65536KB chunk
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
20955136 blocks super 1.2 [2/2] [UU]
unused devices: <none>
}}}
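for reference, the "[UU]" brackets are what mark the arrays as healthy; a failed raid1 member would show up as "_" (e.g. "[U_]"). a small sketch of that check, run here against a copy of the mdstat output above rather than /proc/mdstat itself:

```shell
# Copy of the kvm4 /proc/mdstat array lines from this ticket; on the real
# host this would be read straight from /proc/mdstat.
mdstat_sample='md2 : active raid1 sda[2] sdb[3]
      3906887488 blocks super 1.2 [2/2] [UU]
md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      479002944 blocks super 1.2 [2/2] [UU]
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      20955136 blocks super 1.2 [2/2] [UU]'
# A degraded array would contain "_" in the [..] status field, e.g. "[U_]".
degraded=$(printf '%s\n' "$mdstat_sample" | grep -c '_]' || true)
echo "arrays with failed members: $degraded"
```

since mdadm sees the arrays as clean, the I/O errors presumably come from somewhere between the guest and the physical disks; checking SMART on kvm4's drives would be a reasonable next step, as mdadm alone won't surface every media error.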
this is the second time i've seen this happen. not sure what's going on. i
need to run now; i would love another pair of eyes on this.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32692>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online