Monitoring daemons
2008-10-20 13:22
Monitoring daemons for unexpected exit is tricky.
Polling is popular. Check /var/run/... for the PID, get that process's details from /proc/, check that it's not a zombie, and go back to sleep. But polling is resource-hungry, and even one second is a long time for some services (such as an anycast DNS server at an ISP).
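A minimal sketch of that polling check in Python, assuming a Linux-style /proc and a PID file containing a single number. The function names, the `proc_root` parameter and the one-second default interval are illustrative choices, not part of any particular tool.

```python
import os
import time


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def process_state(pid, proc_root="/proc"):
    """Return the one-letter state from <proc_root>/<pid>/stat, or None if unreadable."""
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as f:
            stat = f.read()
        # The command name sits in parentheses and may itself contain
        # spaces or ')', so split on the last ')' before reading the state.
        return stat.rsplit(")", 1)[1].split()[0]
    except (OSError, IndexError):
        return None


def daemon_alive(pidfile, proc_root="/proc"):
    """True if the PID file names a process that exists and is not a zombie."""
    pid = read_pid(pidfile)
    if pid is None:
        return False
    state = process_state(pid, proc_root)
    return state is not None and state != "Z"


def poll(pidfile, on_failure, interval=1.0):
    """Sleep between checks; call on_failure once the daemon is gone."""
    while daemon_alive(pidfile):
        time.sleep(interval)
    on_failure()
```

Calling `poll()` with the daemon's PID file and a callback blocks until the daemon disappears. The failure is noticed up to `interval` seconds late, which is exactly the weakness described above.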
The next approach is to have a parent process. The parent starts, becomes a daemon, and then starts the monitored daemon process with some flag which keeps that process in the foreground. When the child exits, the parent is told. The problem here: what happens if the parent exits? The child is then killed. So monitoring the process for reliability reasons decreases reliability. Hmmm, not quite where we want to be.
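The parent-process approach can be sketched in a few lines of Python. This is only an illustration: the parent's own daemonisation is left out, and `supervise` and `on_exit` are hypothetical names.

```python
import subprocess


def supervise(argv, on_exit):
    """Start argv as a foreground child, block until it exits, then report it.

    argv should include whatever flag keeps the monitored daemon in the
    foreground; that flag differs from daemon to daemon.
    """
    child = subprocess.Popen(argv)
    status = child.wait()          # returns as soon as the child exits
    on_exit(child.pid, status)     # the parent is told immediately, no polling
    return status
```

The exit is reported without delay, but the monitored daemon's fate is now tied to the supervising process, which is the drawback described above.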
What we want is an unrelated process which monitors process exits. I've looked at lots of ways of doing that, and the way which works is to use the process accounting TASKSTATS system. The listening daemon is told of every process exit over the TASKSTATS channel. When you see an exit for the PID of interest, check that the parent PID is 0 (i.e. it is not the daemon launcher task which is exiting, but the launched daemon itself), then run a program to take whatever action is necessary.
For an anycast service, that action is simply downing the interface with the anycast service's address -- the routing daemon will withdraw the route for the service and customers will use another instance of the service on that anycast address.
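The decision made for each exit notification can be sketched as below. Receiving the TASKSTATS messages themselves (a netlink socket) is not shown; the sketch assumes something else has already extracted the exiting PID and its parent PID. The function names and the interface name `dummy0` are hypothetical.

```python
import subprocess


def is_watched_daemon_exit(exit_pid, exit_ppid, watched_pid):
    """Apply the check described above: the PID of interest, with parent PID 0."""
    return exit_pid == watched_pid and exit_ppid == 0


def handle_exit(exit_pid, exit_ppid, watched_pid, action_argv, run=subprocess.call):
    """Run the action program if this exit is the watched daemon's.

    Returns True if the action was run, False if the event was ignored.
    """
    if not is_watched_daemon_exit(exit_pid, exit_ppid, watched_pid):
        return False
    run(action_argv)
    return True


# Example action for an anycast service: take down the interface carrying
# the anycast address. "dummy0" is a placeholder interface name.
ANYCAST_DOWN = ["ip", "link", "set", "dev", "dummy0", "down"]
```

Passing `ANYCAST_DOWN` as `action_argv` gives the behaviour described for the anycast case; any other command list works the same way.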