Thursday, December 22, 2011

The three most fatal bugs, ever.

There are bugs and then there are BUGs. The bugs are usually fixed and forgotten, but the BUGs stay with you forever. I'd like to tell you about three that are worth telling.

The first one occurred (do bugs actually occur? pardon my English ;-) ) while I was working for FriendScout in 2005. We had a nice monitoring tool that showed a table representing the production system, with a row for each webserver in the farm, and each row changed color according to the status of its server: green if the server was replying, red if not. One day in August the rows started to turn red, one server after the next. After the fourth server it was over. Then the situation repeated. Once, twice, thrice... We searched and searched and found nothing. It looked as if some user was doing something weird that killed the servers, but what?

Finally Oliver, a fellow engineer, found it, and it was a picture. But not just a picture: it was a PDF file named .jpg. A lady wanted to upload a picture that was actually a PDF. PDF wasn't a supported mime type (only JPEGs and GIFs were back then), so the upload servlet rejected the file. But the lady was (or thought she was) smarter than us and renamed the file to .jpg. Now it passed the mime check all right and was handed over to the image processing software, a more or less standard tool called ImageMagick. But the tool was also smarter than us and ignored the file name, extension and mime type altogether; instead it looked into the file and detected that it was a PDF. After this glorious discovery it tried to call a PDF processing tool, Ghostscript. But since we never actually wanted to process or even accept PDFs, Ghostscript wasn't installed on the servers, and the attempt to start it crashed the ImageMagick lib. Since the code was native, it took the whole JVM with it. Ouch.
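Looking back, the cheap defence would have been to look at the first bytes of the upload ourselves before handing it to any native tool. Here is a minimal sketch of that idea in Java. It is not the original servlet code, and the names UploadSniffer, Kind, sniff and isAcceptedImage are made up for illustration. It only knows the well-known file signatures of JPEG (FF D8 FF), GIF (GIF87a / GIF89a) and PDF (%PDF-).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the code we actually ran at the time.
final class UploadSniffer {

    enum Kind { JPEG, GIF, PDF, UNKNOWN }

    private static final byte[] JPEG_MAGIC = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] GIF87_MAGIC = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89_MAGIC = "GIF89a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Guesses the file type from its leading bytes, ignoring name and declared mime type. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, JPEG_MAGIC)) return Kind.JPEG;
        if (startsWith(head, GIF87_MAGIC) || startsWith(head, GIF89_MAGIC)) return Kind.GIF;
        if (startsWith(head, PDF_MAGIC)) return Kind.PDF;
        return Kind.UNKNOWN;
    }

    /** Only content that really looks like a JPEG or GIF may go on to image processing. */
    static boolean isAcceptedImage(byte[] head) {
        Kind kind = sniff(head);
        return kind == Kind.JPEG || kind == Kind.GIF;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A PDF that was renamed to holiday.jpg still starts with "%PDF-".
        byte[] renamedPdf = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(renamedPdf) + " accepted=" + isAcceptedImage(renamedPdf));
    }
}
```

The point is simply that the decision is made on the content, the same way the image tool made it, so both sides agree on what the file is before any native code gets to see it.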

The second bug is about the power of the debug log. Back in a prehistoric version of ASG I was hacking on the very first version of this site. It worked absolutely great, the customer loved it and everything. After some time the customer called and told me that the site was getting slower each day. I looked at my local copy - fast as the wind. Live installation - everything ok, except that adding new items (machines back then) took about a minute. I had been searching for the bug for three days when the customer called again and said that creating a new CMS item now took about 2 minutes, and that the only thing he had done was add 100 items. Now that was at least something. I double-checked the logs - nothing. I reviewed the code - nothing. I had no good profiler (yes, if I had had MoSKito back then, I would have found it faster), so I started to add time measurements everywhere, and after a long hunt I finally found the line - it was a log.debug statement.

The service that was responsible for storing the items in the CMS had one small innocent line: log.debug(cache). Since debug output was off, the log call had no visible effect, hence I had nothing in the logs. But the cache got bigger and bigger with each added item, and the effort to execute its toString method, which printed out all contained elements, was growing constantly. From 0 seconds in the unit test to 180 in production. That taught me to use if log.isDebugEnabled()... At least something ;-)
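To make the effect visible, here is a small sketch of the guard. It uses java.util.logging from the JDK as a stand-in so that it runs without any extra jars, so isLoggable(Level.FINE) plays the role of isDebugEnabled(). It also assumes the log message is built eagerly (here by string concatenation) before the logger decides to drop it. The class names and the counting Cache are invented for the demo, this is not the original CMS code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Demo only: GuardedDebugLog and Cache are hypothetical names.
final class GuardedDebugLog {

    /** Stand-in for the CMS cache; counts how often the expensive toString() runs. */
    static final class Cache {
        final List<String> items = new ArrayList<>();
        int toStringCalls = 0;

        @Override
        public String toString() {
            toStringCalls++;
            return String.join(",", items); // cost grows with every stored item
        }
    }

    static final Logger LOG = Logger.getLogger("cms.demo");

    /** The message string is built (and toString() called) even if FINE is switched off. */
    static void storeUnguarded(Cache cache, String item) {
        cache.items.add(item);
        LOG.fine("cache: " + cache);
    }

    /** The guard skips building the message when FINE is switched off. */
    static void storeGuarded(Cache cache, String item) {
        cache.items.add(item);
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("cache: " + cache);
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // debug output off, as in production

        Cache unguarded = new Cache();
        Cache guarded = new Cache();
        for (int i = 0; i < 100; i++) {
            storeUnguarded(unguarded, "item" + i);
            storeGuarded(guarded, "item" + i);
        }
        System.out.println("unguarded toString calls: " + unguarded.toStringCalls);
        System.out.println("guarded toString calls:   " + guarded.toStringCalls);
    }
}
```

Since Java 8 the same logger also accepts a supplier, as in LOG.fine(() -> "cache: " + cache), which defers building the message in the same way without the explicit if.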

And this was the second bug. Now, for my personal all-time favorite super-bug, we have to go back in text and time. It also happened at FriendScout, but in 2003, before they hired me (and this may well have been the reason why ;-)). The platform was pretty unstable and run by people who were thinking more like admins than like developers. And since an admin usually doesn't understand why something is broken, he has a standard method to fix things -> the infamous ctrl-alt-del. In this case it was a little bit more subtle than that: some smart (admin) guy wrote a script that watched the log files and searched them for the keyword 'FATAL'. If it found an occurrence of FATAL, it assumed that a bad-bad-bad error had happened and triggered a full application restart. And there were quite a few restarts over the year until they had to review this strategy. The reason for the review was a customer who called the customer desk and asked: "Why does the system ALWAYS crash when I log in?"

The answer was easy: it was her login name. It was femme_fatale.
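I never saw the original script, so what follows is only a sketch of the difference between "the word appears somewhere in the line" and "the line was logged at level FATAL". It assumes a log layout of date, time, level, message separated by single spaces, and it assumes the script matched case-insensitively (the login name was lowercase, after all). FatalWatcher and both method names are made up for the example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical reconstruction; the log layout "<date> <time> <LEVEL> <message>" is an assumption.
final class FatalWatcher {

    // Matches only when the third space-separated field of the line is exactly FATAL.
    private static final Pattern FATAL_LEVEL = Pattern.compile("^\\S+ \\S+ FATAL\\b");

    /** What the restart script presumably did: look for the keyword anywhere in the line. */
    static boolean naiveIsFatal(String line) {
        return line.toUpperCase(Locale.ROOT).contains("FATAL");
    }

    /** Safer: only react when the level field says FATAL, whatever the message contains. */
    static boolean levelIsFatal(String line) {
        return FATAL_LEVEL.matcher(line).find();
    }

    public static void main(String[] args) {
        String login = "2003-05-12 10:15:00 INFO login ok user=femme_fatale";
        String crash = "2003-05-12 10:15:01 FATAL out of database connections";

        System.out.println("login: naive=" + naiveIsFatal(login) + " level=" + levelIsFatal(login));
        System.out.println("crash: naive=" + naiveIsFatal(crash) + " level=" + levelIsFatal(crash));
    }
}
```

Whether restarting on every FATAL is a good idea at all is a different question, but at least the lady could have logged in.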

So what about you? Do you have funny bugs worth mentioning? Tell me please! ;-)
