Monday, June 18, 2012

Don't believe in availability, test for it!


Back in the year 2004 Helmut Oertel and I were conducting a technical due diligence of a job portal for the scout group. One of the topics was backup and disaster recovery, and the questions went like this:

- What happens if your database crashes?
- We have a stand-by database server.
- Is this a hot- or cold-standby?
- It's a hot standby, but we currently switched it off.
- So it's a cold standby then?
- Aehm... probably yes...
- Have you actually ever tested it?
- .... silence ....

Now, this is what you would call an epic fail. Having a spare DB in theory but never testing it is actually worse than not having one at all. In the latter case, you at least don't have a false sense of safety.

FriendScout, for example, did it better. They had two database servers in master/slave failover mode, partially coded by themselves, and they switched master and slave every week or so. This way they actually knew that both master and slave were able to work. In fact, when they had a small problem with one of the less important databases and its master failed on Dec 24th (no kidding), they didn't detect it until January 3rd, because the failover was performed so smoothly that no customer was affected.

Another good example is Parship. They have a DistributeMe-based SOA, and each important service is replicated in multiple instances with the FailoverToNextNode failing strategy. This means that if service Foo on server 1 fails, the call is retried with server 2, and so on. This is needed in production or staging, but in a local development environment it is an unnecessary overhead to start nearly 100 services on your dev machine, so they start only one instance of each service and let failover routing do its job. If the client (a web controller or something) issues a request to service X, instance 1 (based on mod-based routing of the user id, for example), and that instance is not running, the failover mechanism finds a working instance of this service. In fact they achieve two goals: 1) they save resources and 2) they continuously test their failover strategies. The proof came when a buggy FailoverRouter was committed to the trunk (a combination of version incompatibilities, nearly impossible to find with unit tests). One could expect that such an error would only be detected in a real failover situation in the production environment, but thanks to continuous failover testing it was detected 20 minutes after the commit.
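To make the idea concrete, here is a minimal sketch of "fail over to the next node" on top of mod-based routing. It is not DistributeMe's or Parship's code: the class FailoverToNextNodeSketch, the exception InstanceUnavailableException and the modelling of an instance as a plain Function are all assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical exception signalling that an instance could not be reached. */
class InstanceUnavailableException extends RuntimeException {
    InstanceUnavailableException(String message) { super(message); }
}

/** Illustrative failover strategy; not taken from DistributeMe. */
class FailoverToNextNodeSketch {

    /**
     * Calls the instance picked by mod-based routing first and, if it is
     * unavailable, walks through the remaining instances in order.
     */
    static <T> T call(List<Function<Long, T>> instances, long userId) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        int start = (int) Math.floorMod(userId, (long) instances.size());
        InstanceUnavailableException last = null;
        for (int attempt = 0; attempt < instances.size(); attempt++) {
            Function<Long, T> instance = instances.get((start + attempt) % instances.size());
            try {
                return instance.apply(userId);
            } catch (InstanceUnavailableException e) {
                last = e; // remember the failure and try the next node
            }
        }
        throw new IllegalStateException("no instance available", last);
    }
}
```

With only one instance running on a dev machine, every call routed to a missing instance takes the catch branch, which is exactly why the failover path gets exercised all day long.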

But let's come to the point.

Talking about DistributeMe, it has a set of built-in interceptors which are meant for availability testing.
However, the concept of interceptors is common to many (and probably most) middleware solutions, so you can easily implement the same thing with CORBA or whatever your platform offers.

Before we start, some ontology:

You might get the feeling that I mix up service and server. I don't. In the SOA world, at least as we understand it:
  • a Service is a component which offers some functionality (methods) and whose behavior is defined by a contract (interface). In other words, a service is some code.
  • a Server is a node in the distributed system which contains (runs) one or multiple services. In other words, a server is a Java VM. The most usual situation, however, is that one server runs one service. In a mod- or round-robin-distributed, failover-enabled system, a service runs in multiple servers simultaneously.
  • a Servant (CORBA slang) is the service implementation instance inside the server.
So whenever a client is talking to a logical component, it is talking to a service, but the physical data is transferred to a service instance in a server.

So, what are the most interesting cases you should test for? The easiest and most obvious one is surely:

Server is not there.

"Server is not there" is a very common case. The server could have crashed or failed to start, the real or virtual machine could have crashed, and so on. However, this failure is easy to detect. If the server is not there, the reaction comes immediately.
To emulate this behavior, the interceptor simply throws NoConnectionToServerException before the call can even leave the stub (client-side interceptor) or before the call can be delegated to the servant (server-side interceptor).
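A rough sketch of such an interceptor could look like the following. The BeforeCallInterceptor hook and the ServerUnavailableInterceptor class are hypothetical and only mimic what a middleware would give you, and NoConnectionToServerException is declared locally as a stand-in for the real exception class.

```java
import java.util.Set;

/** Local stand-in for the exception named in the text; not the DistributeMe class. */
class NoConnectionToServerException extends RuntimeException {
    NoConnectionToServerException(String message) { super(message); }
}

/** Hypothetical hook that the middleware would invoke before a call leaves the stub. */
interface BeforeCallInterceptor {
    void beforeCall(String serviceId, String methodName);
}

/** Pretends that the listed services are down by failing before any network I/O. */
class ServerUnavailableInterceptor implements BeforeCallInterceptor {
    private final Set<String> unavailableServices;

    ServerUnavailableInterceptor(Set<String> unavailableServices) {
        this.unavailableServices = unavailableServices;
    }

    @Override
    public void beforeCall(String serviceId, String methodName) {
        if (unavailableServices.contains(serviceId)) {
            throw new NoConnectionToServerException(
                "simulated outage of " + serviceId + " while calling " + methodName);
        }
        // Services that are not listed pass through untouched.
    }
}
```

Keeping the list of "dead" services configurable lets you kill one service at a time and watch how the rest of the system reacts.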

Server is slow. 

"Server is slow" is a more complicated and more dangerous situation. Generally, I have encountered two reasons for an otherwise healthy service to become slow. They are surely not the only possible ones, but they are the most probable:
  • Unexpected DB Problems
  • Continuous Full GC cycles
An extreme version of a slow server is a server that never replies, for example due to a deadlock or thread starvation, but this case is pretty similar to the above in terms of consequences.

So what happens if the server is very slow? Well, it depends on your architecture and your middleware. For example, RMI has no thread pool limitations by default. This means that a slow but popular server would virtually drain all the threads from your web servers, until they have nothing left and no user request can be processed. I have seen service instances with over 10,000 threads waiting for them, which can get pretty ugly, because those queues will a) take time to process and b) overload newly started instances of the slow service, causing the problem to repeat.

DistributeMe offers a SlowDownInterceptor which simply puts the current thread to sleep for a given amount of time, simulating a slow-responding service. Of course you are free to create your own; producing full CPU load on a machine instead would be an interesting option, especially regarding cross effects with other services on that machine.
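If your middleware has no such interceptor, the sleeping variant takes only a few lines to write. The sketch below is my own illustration, not the source of DistributeMe's SlowDownInterceptor; the class name SlowDownSketch and the idea of wrapping the call in a Supplier are assumptions.

```java
import java.util.function.Supplier;

/** Illustrative slow-down wrapper; names are made up for this example. */
class SlowDownSketch {
    private final long delayMillis;

    SlowDownSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Sleeps for the configured delay, then performs the real call. */
    <T> T intercept(Supplier<T> call) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while simulating a slow server", e);
        }
        return call.get();
    }
}
```

Sleeping blocks the calling thread just like a real slow server would, so thread pools and queues in front of the service fill up in a realistic way.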

So, how do you protect your system against slow servers? Well, it's not uncomplicated, because it's not that easy to interrupt a (hanging) thread from outside without messing up your system state.
Right now, we have two weapons against slow servers:
  • Concurrency control
  • Asynchronous method calls
Concurrency control allows us to limit the number of requests that can be sent to the server/servant. It can be applied on the client and/or the server side. Different approaches are possible, but the simplest one is to count active connections and not allow new connections once the limit is reached. Applying different limits on the two sides also works well. For example, if you have a server that can handle 100 parallel connections and 10 clients, it makes sense to apply a limit of 100 on the server side and a limit of 20-30 on the client side. This prevents one client from occupying all threads and letting other clients starve.
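The "count active connections" approach can be sketched with a plain java.util.concurrent.Semaphore. The classes ConcurrencyLimiter and ConcurrencyLimitReachedException are hypothetical names for this example; a real middleware would plug the same logic into its client or server interceptor chain.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Thrown when the configured number of parallel calls is already in flight. */
class ConcurrencyLimitReachedException extends RuntimeException {
    ConcurrencyLimitReachedException(String message) { super(message); }
}

/** Rejects new calls immediately once the limit is reached instead of queueing them. */
class ConcurrencyLimiter {
    private final Semaphore permits;
    private final int limit;

    ConcurrencyLimiter(int limit) {
        this.limit = limit;
        this.permits = new Semaphore(limit);
    }

    <T> T call(Supplier<T> request) {
        if (!permits.tryAcquire()) { // non-blocking: fail fast rather than wait
            throw new ConcurrencyLimitReachedException(
                "limit of " + limit + " parallel calls reached");
        }
        try {
            return request.get();
        } finally {
            permits.release(); // always give the slot back, even on failure
        }
    }
}
```

Following the numbers from the example above, you would create the limiter with 100 on the server side and with something like 25 in each client.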

Asynchronous method calls, on the other hand, give you full control over the acceptable call duration. You can state that if the result is not there after 2 seconds, it can be skipped altogether, and provide alternative information to the user. However, they aren't un-tricky, since they rely on internal thread pools that are hard to configure and tune. But you have to pick your poison, and this one is probably the more pleasant to swallow.
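In plain Java, the "give up after 2 seconds" rule can be sketched with an ExecutorService and Future.get with a timeout. This is an illustration under my own assumptions, not how DistributeMe implements asynchronous calls; TimeLimitedCaller and callOrFallback are made-up names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Runs a call on a pool thread and stops waiting for it after a deadline. */
class TimeLimitedCaller {
    private final ExecutorService pool;

    TimeLimitedCaller(ExecutorService pool) {
        this.pool = pool;
    }

    <T> T callOrFallback(Callable<T> remoteCall, long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort; a truly hung call may ignore the interrupt
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself failed; treat it like a missing result
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```

The caller gets its thread back after the timeout, but the pool thread may stay busy until the slow call returns, so the pool size is exactly the tuning problem mentioned above.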

Flipping server

A flipping server emulates that strange, unpredictable behavior that sometimes happens, for example by serving only 10% of the requests slowly, or by throwing an exception from time to time. Flipping at 1 percent or less can be considered a rare error, a condition that drives a lot of devops crazy worldwide. Testing for it will help you tune your monitoring systems to be fine-grained enough.
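A flipping interceptor boils down to a random decision per call. The sketch below fails a configurable fraction of calls; FlippingSketch and SimulatedFailureException are hypothetical names, and the Random is injected so that a test run can be repeated with the same seed.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Marks failures that were injected on purpose, so they are easy to spot in logs. */
class SimulatedFailureException extends RuntimeException {
    SimulatedFailureException(String message) { super(message); }
}

/** Lets most calls through and fails a configurable fraction of them. */
class FlippingSketch {
    private final double failureRate; // 0.10 means roughly 10% of the calls fail
    private final Random random;

    FlippingSketch(double failureRate, Random random) {
        this.failureRate = failureRate;
        this.random = random;
    }

    <T> T intercept(Supplier<T> call) {
        // nextDouble() is uniform in [0, 1), so this is true with probability failureRate
        if (random.nextDouble() < failureRate) {
            throw new SimulatedFailureException("simulated sporadic failure");
        }
        return call.get();
    }
}
```

Replacing the throw with a sleep gives you the "10% of the requests are slow" flavor instead.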

How to test

Once you have done your homework and prepared your system for partial failure, and installed and configured all interceptors, the hard part begins: the testing.
You will need some things for successful testing:
  • a reliable test which produces a fair load on the system and checks all aspects of the system.
  • an integration/test system to run this test against.
  • a deployed version of your code.
  • and a lot of patience.
Since you will probably never have a complete, reliable, up-to-date, automatic test with detailed error reporting, I strongly recommend that you grab the best QA guy your company has and do it together with him. If you are part of the DevOps team in your company, fine; if not, you should have someone from that team on board too, because they are the ones who will have to solve in production the problems you are trying to simulate, and you will help them a lot if they can see and learn the symptoms.

Once you have all the guys you need, you are ready to enter the never-ending iteration: 
 - run the test 
 - detect the failures (where the system reacts badly to a service failure)
 - fix it.

If you don't find anything to fix, try out new interceptors, manipulate data in the db, switch bytes in the packets, interrupt threads. Make availability testing part of your development process, and you will have the really responsive, robust, and almost unbreakable system you want.

AND last but not least - it's fun!
