www.lonelycoder.com outage

Added by Andreas Smas almost 10 years ago

Yesterday (25th of April) www.lonelycoder.com had an outage rendering the website inaccessible for almost 24 hours.

I think the root cause of the problem was a bug in Phusion Passenger (the glue between Apache and the Ruby on Rails app) causing all worker processes to mark themselves as immune to the OOM killer. This is obviously very bad, as it causes the Linux kernel to kill all other (much more important) processes first under low-memory conditions.
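A quick way to spot such immune workers is to read the OOM-killer adjustment out of /proc. This is just a sketch: kernels of that era exposed /proc/&lt;pid&gt;/oom_adj (where -17 meant "never kill"), while modern kernels use oom_score_adj (-1000 = never kill), so pick the file your kernel has.

```shell
# List each ruby1.8 worker's OOM-killer adjustment so immune processes
# stand out (oom_score_adj: -1000 = never kill, 0 = default).
for pid in $(ps -C ruby1.8 -o pid=); do
    printf '%s %s\n' "$pid" "$(cat /proc/$pid/oom_score_adj 2>/dev/null)"
done
# To strip the immunity again (as root): echo 0 > /proc/<pid>/oom_score_adj
```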

I've upgraded Passenger to a new version where this bug is fixed. Also, our hosting provider has doubled the amount of RAM available to the server, so a similar outage should hopefully not happen again.


Added by Andreas Smas almost 10 years ago

It seems there is more to it. Apparently the Redmine Ruby processes leak memory. Perhaps I need to drop Passenger and run Redmine in a different setup so the processes are not so long-lived.

Added by Andreas Smas almost 10 years ago

Crashed again yesterday. Sigh. Runaway Ruby processes consuming all available RAM.

I've logged RAM usage for the processes every second, and they seem to leak a little memory constantly rather than in huge slabs.
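A per-second sampler along these lines will do; the log path is illustrative, and `sample` is a helper name I've made up for the sketch:

```shell
# RSS sampler (sketch): one sample() call appends "epoch pid rss_kb" lines
# for every ruby1.8 process. /tmp/ruby-rss.log is an illustrative path.
log=/tmp/ruby-rss.log
sample() {
    now=$(date +%s)
    ps -C ruby1.8 -o pid=,rss= | while read pid rss; do
        echo "$now $pid $rss" >> "$log"
    done
}
sample
# For a continuous per-second log:  while true; do sample; sleep 1; done
```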

After reading a bit of Passenger's documentation I found the PassengerMaxRequests option, which can force-kill a Rails worker process after it has served N requests. I've set this value to 100 now. Let's see how that works.
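In the Apache vhost configuration this is a one-line directive; the sketch below uses the value of 100 mentioned above, so a worker is recycled after 100 requests, bounding how much a slow leak can accumulate:

```apache
# Recycle each Passenger worker after it has served 100 requests,
# so a slowly leaking process cannot grow without bound.
PassengerMaxRequests 100
```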

Added by Andreas Smas almost 10 years ago

PassengerMaxRequests is to no avail, because the problem is that one of the workers enters an infinite loop.
I think it's related to the PDF generator (spiders crawling the site seem to be the only ones downloading those PDFs).

I could, of course, just block the PDFs and be done with it, but I would really like to track this issue down once and for all so I can also file a bug with the Redmine devs.

I've cranked up the Passenger log level so I can see which worker it routes each request to.
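Assuming the Apache module's log-level directive from Passenger versions of that era (PassengerLogLevel, with values from 0 up to 3), the vhost change would look something like:

```apache
# Raise Passenger's verbosity (0 = errors only; 3 was the most verbose
# level in the 2.x/3.x series) so the log shows per-worker request routing.
PassengerLogLevel 3
```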

To track down the issue I have a small script running that catches leaking processes and SIGSTOPs them, so I can use gcore(1) to get a core dump and also send SIGABRT to the worker to get a Ruby traceback in the log files.


# Every 10 seconds, SIGSTOP any ruby1.8 process whose RSS exceeds ~200 MB.
while true; do
    b=`ps -C ruby1.8 -o rss=,pid= | awk '{ if($1 > 200000) print $2}'`
    for a in $b; do
        echo $a is bad, stopped
        kill -STOP $a
    done
    sleep 10
done

The hunt is on...