Always catch errors
On the 31st of Jan my NAS stopped responding. I had no idea what was going on, and with zero response from the power button I did a hard reset. I spent the next few hours double-checking all my config to find out what the hell had happened. I couldn’t find a solid reason, but at least none of the hardware was failing, which gave me some good news. I marked it down as an odd issue and carried on.
The same happened tonight, with the exact same result, but this time I was prepared, up to a point. After attempting to log in at the console and seeing memory allocation errors, then SSH dying on its arse, I checked my Munin install and noticed the machine was swapping heavily. This machine has about 8GB of RAM but at any time it’s using about 600MB. At first I thought it was a memory leak in something, but usually the OOM killer does a good job of smiting any unruly processes. Then I checked my process list and noticed there were well over 4,000 sleeping processes. Something had obviously gone wrong.
On my Deluge setup, due to the instability of a few of the trackers I use, I have a small Python script that checks the current state of the torrents and, if they’re “red”, restarts them. Deluge’s API uses the Twisted framework to make everything async and accordingly a lot easier to work with. This was my first venture into the land of Twisted and it seems I made an error: I didn’t catch the “unable to connect” error. So when the script was unable to connect, the Twisted reactor just sat there running forever and the process never exited. As this job was running every 5 minutes, the stuck processes stacked up over 24 hours and killed the machine.
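For what it’s worth, here is a rough sketch of the kind of guard that would have avoided this. It isn’t my actual script. The handler names, the `exit_state` dict and the 60 second timeout are all made up for illustration, and I’m assuming `client.connect()` from `deluge.ui.client` is called with its defaults for a local daemon. The idea is simply that every path, whether success, failure or just taking too long, ends with the reactor being stopped so the process can exit.

```python
import sys

CONNECT_TIMEOUT = 60  # seconds; an arbitrary illustrative value


def on_connected(result, reactor):
    """Callback: the real torrent checks would go here (omitted)."""
    print("connected to the daemon")
    reactor.stop()


def on_connect_failed(failure, reactor, exit_state):
    """Errback: report the problem and stop the reactor so the job exits."""
    print("could not connect: %s" % failure.getErrorMessage(), file=sys.stderr)
    exit_state["code"] = 1
    if reactor.running:
        reactor.stop()


def give_up(reactor, exit_state):
    """Watchdog: fires only if the reactor is still running after the timeout."""
    print("timed out, giving up", file=sys.stderr)
    exit_state["code"] = 1
    if reactor.running:
        reactor.stop()


def main():
    # Imported here so the reactor is only installed when the job really runs.
    from twisted.internet import reactor
    from deluge.ui.client import client  # assumes Deluge is installed

    exit_state = {"code": 0}
    d = client.connect()  # assumption: default host/port of a local daemon
    d.addCallback(on_connected, reactor)
    # Added after the callback, so it also catches errors raised in on_connected.
    d.addErrback(on_connect_failed, reactor, exit_state)
    reactor.callLater(CONNECT_TIMEOUT, give_up, reactor, exit_state)
    reactor.run()  # returns once one of the handlers calls reactor.stop()
    return exit_state["code"]

# The cron script would finish with: sys.exit(main())
```

The errback is the important bit. The timeout is just belt and braces, for the case where neither the callback nor the errback ever fires.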
So, it’s always worth checking for errors, and not assuming that it’ll sort itself out. Lesson learned.