Perl processes that got stuck kept me from sleep last night. I’m not sure what happened, but they were probably waiting for a database that had reached its max_connections limit for that particular user. When Apache hit its maximum number of processes, monitoring paged me. Fortunately this all runs in a cluster, so no service was interrupted.
A few days ago I found myself killing Apache processes from a user whose script had created an endless loop. Also, some scripts our users upload generate a lot of load (which is OK for a while, but not for too long). So I thought I’d write something in bash to help me with all this 🙂
The script below is what I’ve come up with so far, and it seems to work pretty well. What it does is allow user processes to run for 10 minutes, then kick in and kill any process that is still there.
Apache processes run as the owner of the account (using mod_ruid), so I can actually see which user an Apache process belongs to. Perl, of course, also runs as the account user. Making use of this is the basic idea of the script. Sample ‘ps aux’ output:
user1    24891  1.1  1.8 112132 61000 ?  S 18:54 0:02 /usr/sbin/apache2 -k start
www-data 24894  0.1  0.8  82624 30240 ?  S 18:54 0:00 /usr/sbin/apache2 -k start
user2    24900  0.2  0.9  82984 31552 ?  S 18:54 0:00 /usr/sbin/apache2 -k start
www-data 25201  4.8  1.2  95540 43296 ?  S 18:57 0:00 /usr/sbin/apache2 -k start
www-data 25202  1.0  0.8  82972 30016 ?  S 18:57 0:00 /usr/sbin/apache2 -k start
user2    25213  6.0  0.1   8992  5692 ?  S 18:57 0:00 /usr/local/bin/perl -- /var/www/site.com/HTML/script.cgi
For safety, I ignore FTP (pure-ftpd in my case) and SSH processes. And of course only UIDs from 1000 and up are taken into account. So the ‘www-data’ processes never get killed; ‘user1’ and ‘user2’ processes will be killed when they do not complete within 10 minutes.
#!/bin/bash
# 2012-03-04, remi: kill processes from users (UID 1000 and up) that have
# been running for more than 600 seconds.
# For safety, exclude some processes that users might legitimately run (ftp, ssh, etc).

# get processes (the '=' headers are suppressed, so no 'tail' needed)
ps -eo uid,pid,cmd:9,lstart --no-headers |
# we never want to kill pure-ftpd
grep -v "pure-ftpd" |
# nor sshd
grep -v "sshd" |
# nor bash
grep -v "bash" |
# ignore all sbin processes, incl. Apache
grep -v "/usr/sbin" |
# loop over the remaining processes
while read PROC_UID PROC_PID PROC_CMD PROC_LSTART; do
    # only interested in user processes, so ignore system processes
    if [ "$PROC_UID" -ge 1000 ]; then
        # how long has this process been running?
        # (don't call this SECONDS: that is a special bash variable)
        AGE=$(( $(date +%s) - $(date -d "$PROC_LSTART" +%s) ))
        # 600 seconds should be more than enough
        if [ "$AGE" -gt 600 ]; then
            # output PIDs to be killed on the final line of this script
            echo "$PROC_PID"
            # save a log entry for debugging
            cat /proc/$PROC_PID/cmdline >> /var/log/killed.log 2>&1
            echo ", details: " >> /var/log/killed.log
            date >> /var/log/killed.log
            ls -la /proc/$PROC_PID/ >> /var/log/killed.log 2>&1
        fi
    fi
done |
# finally, kill them! (-r: don't run kill when there is nothing to kill)
xargs -r kill
Problems occur when the ‘cmd’ field contains spaces: since the script also splits its fields on spaces, the positional variables shift and the date parsing breaks. I worked around that by limiting the output to only 9 characters (‘cmd:9’). In my specific case that did the trick, but I’d like to know if there’s a better way to handle it 🙂
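One cleaner approach might be to put ‘cmd’ last in the ps output and let ‘read’ collect the remainder, spaces included, into its final variable. The sketch below also uses the ‘etimes’ field (elapsed time in seconds), which newer procps versions support and which saves the two ‘date’ calls; I haven’t battle-tested this variant, so it prints what it would kill instead of killing:

```shell
#!/bin/bash
# Alternative sketch. Assumptions: a procps new enough to support the
# 'etimes' output field (check with 'ps -eo etimes' first).
# Putting 'cmd' last lets 'read' collect the whole command line, spaces
# and all, into the final variable -- no more 'cmd:9' truncation.
ps -eo uid=,pid=,etimes=,cmd= |
while read -r PROC_UID PROC_PID PROC_AGE PROC_CMD; do
    # only interested in user processes
    [ "$PROC_UID" -ge 1000 ] || continue
    # skip the safe daemons and anything from /usr/sbin (incl. Apache)
    case "$PROC_CMD" in
        *pure-ftpd*|*sshd*|*bash*|/usr/sbin/*) continue ;;
    esac
    # older than 600 seconds? print the PID
    [ "$PROC_AGE" -gt 600 ] && echo "$PROC_PID"
done |
# dry run: swap 'echo "would kill:"' for 'kill' once the output looks right
xargs -r echo "would kill:"
```

The ‘uid=’ style (an empty header after each column) suppresses the header line per column, so no ‘tail’ or ‘--no-headers’ is needed.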
Just run this script from cron every minute:
* * * * * /usr/local/bin/killSlowUserProcesses.sh > /dev/null 2>&1
I hope this will bring me a good night’s sleep tonight 😉
Update: I’ve disabled the killing of Apache processes, since mod_ruid switches users around, including back and forth to www-data. Even if the process has been running for some time, it is not certain that the current user has been running it all along. I need to think about this one a while longer 😉 I updated the script above by adding this line:
grep -v "/usr/sbin" |