
Londiste process dead this morning

I came in this morning and found that the londiste replay job that replicates into our disaster recovery database had silently failed. The process was still running, but replication lag was over 11 hours at that point, which implies that something went wrong around 10 pm last night. The postgres logs for the DR database were completely empty after the midnight rotation, which is very unusual: we log at the “mod” level so that we can see all the row-level replication activity in real time if we need to.

I tried to gracefully stop the replay job, but that had no effect, so I had to kill it. Restarting the job did the trick and replication lag immediately started to come down.

My nagios check didn’t fire off an alert, which was a bit puzzling until I realized that the monitoring system is running an older version of the replication lag check script that only measures a single londiste job. Now that we have multiple replay jobs, it needs to be updated.

I already have a new version of the nagios plugin that handles lag measurements for multiple replicas tested and ready to go. Now I guess I need to actually put it into production.
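
For anyone curious what a multi-job lag check can look like, here is a minimal sketch in Python. It is not my actual plugin. The function name, the job names and the thresholds are all made up for illustration, and it assumes the lag for each replay job has already been collected in seconds by some other means. The only fixed part is the standard nagios exit code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Dict, Tuple

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_lag(lags: Dict[str, float], warn: float, crit: float) -> Tuple[int, str]:
    """Return (exit_code, status_line) for a set of replay job lags.

    `lags` maps a (hypothetical) replay job name to its lag in seconds.
    The overall status is the worst status of any single job, so one
    stalled job cannot hide behind healthy ones.
    """
    if not lags:
        # No measurements at all is not the same thing as "no lag".
        return UNKNOWN, "UNKNOWN: no replay jobs reported"

    worst = OK
    parts = []
    for name in sorted(lags):
        lag = lags[name]
        if lag >= crit:
            status = CRITICAL
        elif lag >= warn:
            status = WARNING
        else:
            status = OK
        # OK < WARNING < CRITICAL, so max() keeps the worst status seen.
        worst = max(worst, status)
        parts.append(f"{name}={lag:.0f}s")

    return worst, f"{LABELS[worst]}: replication lag " + ", ".join(parts)
```

A wrapper script would print the status line and exit with the returned code. With a 5 minute warning threshold and a 30 minute critical threshold, a job that is 11 hours behind (39600 seconds) would come back CRITICAL even if every other job is caught up, which is exactly the case the old single-job check missed.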

I wish I could figure out what caused londiste to fail last night. There was nothing in any of the system logs and no trail of breadcrumbs to follow for further investigation. Londiste has been pretty trouble-free since we started using it in January. I’m not sure why we are starting to see issues now…
