Always catch errors

On the 31st of Jan my NAS stopped responding, no idea what was going on and with zero response to the power button I did a hard reset, I spent the next few hours double checking all my config to find out what the hell had happened. I couldn’t find a solid reason, but at least none of the hardware was failing which give me some good news, I marked it down as an odd issue and carried on.

The same happened tonight, exact same result but this time I was prepared to some point. After attempting to login in the console and seeing memory allocation errors, then SSH dying on its arse, I checked my Munin install and notice the memory was heavy swapping. This machine has about 8GB of RAM but at any time its using about 600mb, at first I thought it was a memory leak in something but usually OOM Killer does a good job smiting any unruly processes. Then I checked my process list and noticed it was well over 4,000 sleeping processes, something had obviously gone wrong.

On my Deluge setup, due to the instability of a few of the trackers I use, I have a small Python script that checks the current state of the torrent and if they’re “red" it restarts them. Deluge’s API uses the Twisted framework to make everything async and accordingly a lot easier to work with, this was my first venture into the land of Twisted and it seems I made an error; I didn’t catch the “unable to connect" error. So after it was unable to connect the Twisted reactor was sitting there and running constantly, and as this job was running every 5 minutes it stacked up over 24 hours and killed the machine.

So, its always worth checking for errors, and not assuming that it’ll sort itself out. Lesson learned.

comments powered by Disqus