Friday, November 30, 2012

Moving on

I don't really like this blog anymore ;-)
Despite its weaknesses and the fact that it's PHP-based, I'm going to abandon this blog in favor of our new devblog, which runs on WordPress (yeah, PHP, I know...) but is just much simpler to edit and use.
See you there.

Friday, August 17, 2012

MoSKito 1.5.0 goes after threads.

As MoSKito 1.5.0 hits the Maven repository today, it brings some new thread-related features with it.
Here's a short overview of the 4 new screens added in 1.5.0 for investigating threading issues.

NOTE: We now have an official anotheria blog: blog.anotheria.net
This post has moved there:
http://blog.anotheria.net/?p=9

Thursday, August 16, 2012

Everything gets better once Oracle buys you

Well, almost.
Recently I had to recover my Sun developer account, which I hadn't used for 5 years.
I was hoping to vote on this ugly bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7180557 with localhost and Java 7 on my Mac, which I wrote about in the previous post. Well - no. The process itself was amusing enough, though, since they require you to specify both username and email; in my case (and I suppose in many other cases) they are the same, which is not that easy to figure out.

However, Sun's, and now Oracle's, mailing system works much better than the profile update dialog I was forced to fill out and submit:

[screenshot: the response from the profile update dialog, revealing the GlassFish version]

Well, at least we now know what GlassFish version they are running ;-)

Oracle Kills getLocalhost on MacOS X in Java 7

This is going to be another one of those "how things that can't happen actually happen" posts.
I recently updated my Mac to Java 7, pretty late admittedly. Right after that, DistributeMe services started behaving strangely, to say the least. On startup a service registers itself in the registry with a unique service identifier, which looks like this:

rmi://a_b_c_FooService.qhxgttedwr@192.168.1.113:9252@20120816162707


Here 192.168.1.113 is the IP address of the machine the service runs on. This address is used by the client once the latter wants to connect to the service.

After the upgrade to Java 7, the service registered itself in the registry with this identifier:

rmi://a_b_c_FooService.qhxgttedwr@unknown:9252@20120816162707

Of course no client was able to resolve a host named unknown. I started investigating where this came from, and, well, it was in one of those code sections...


 
private static String getHostName(){
  try{
    InetAddress localhost = InetAddress.getLocalHost();
    String host = localhost.getHostAddress();
    // optional host name mappings from the configuration
    Map<String, String> mappings = configuration.getMappings();
    String mappedHost = mappings.get(host);
    return mappedHost == null ? host : mappedHost;
  }catch(UnknownHostException e){
    // the "can't happen anyway" branch that produced the unknown host above
    return "unknown";
  }
}

OK, returning "unknown" here is certainly not the best idea (one of those "can't happen anyway" bugs), but why the heck did InetAddress.getLocalHost() fail?

I wrote a very small program to verify it:

package localhostbug;

import java.net.InetAddress;

public class PrintLocalhost {
  public static void main(String[] args) throws Exception{
    InetAddress localhost = InetAddress.getLocalHost();
    String host = localhost.getHostAddress();
    System.out.println("host: "+host);
  }
}

Running it with Java 6 and Java 7 shows the difference, see for yourself:

$JAVA6_HOME/bin/java -cp classes localhostbug.PrintLocalhost
host: 192.168.140.200
and
$JAVA7_HOME/bin/java -cp classes localhostbug.PrintLocalhost
Exception in thread "main" java.net.UnknownHostException: colin.speedport.ip: colin.speedport.ip: nodename nor servname provided, or not known
 at java.net.InetAddress.getLocalHost(InetAddress.java:1438)
 at localhostbug.PrintLocalhost.main(PrintLocalhost.java:7)
Caused by: java.net.UnknownHostException: colin.speedport.ip: nodename nor servname provided, or not known
 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
 at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
 at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
 at java.net.InetAddress.getLocalHost(InetAddress.java:1434)
 ... 1 more

After some googling I found a bug in the Oracle bug database and a similar issue in OpenJDK. It seems Java and Mac are becoming less and less of a love story.
Sad.
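
If you run into the same problem and just need an address for your service to register with, one possible fallback - just a sketch of mine, not DistributeMe code - is to enumerate the network interfaces once getLocalHost() fails:

import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Enumeration;

public class LocalAddressFallback {
  public static String resolveLocalAddress() {
    // first try the usual way...
    try {
      return InetAddress.getLocalHost().getHostAddress();
    } catch (UnknownHostException e) {
      // ...and if the hostname is not resolvable (the Java 7 on Mac case), ask the interfaces directly
    }
    try {
      Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
      while (interfaces.hasMoreElements()) {
        NetworkInterface nic = interfaces.nextElement();
        if (nic.isLoopback() || !nic.isUp())
          continue;
        Enumeration<InetAddress> addresses = nic.getInetAddresses();
        while (addresses.hasMoreElements()) {
          InetAddress address = addresses.nextElement();
          if (address instanceof Inet4Address)
            return address.getHostAddress();
        }
      }
    } catch (SocketException ignored) {
      // no usable interfaces either
    }
    return "unknown"; // still better to fail loudly in real code
  }
}

Adding the machine's hostname to /etc/hosts is another common workaround for this particular lookup failure.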

Monday, June 25, 2012

MoSKito 1.4.3 released

After our first iPhone app arrived in the store on Friday, we also released 1.4.3 of the classic version over the weekend. As the number suggests, this release doesn't bring major breakthroughs (at least not yet), but it does bring some UI improvements for a better user experience.

Larger graphs


First, and simplest of all, we changed the default size of the on-the-fly graphs for producers from 600x300 to 1200x600, making them four times bigger. Now you can use the whole size of your monitor instead of only a small section:

[screenshot: the enlarged producer charts]

Of course they are a bit smaller here, because otherwise they would kill the layout of the blog. However, if the new size is too big, or still too small for you, you can now configure it yourself by simply adding a file named mskwebui.json to your classpath (WEB-INF/classes is easiest).
The file should have the typical JSON config format:


{
  "producerChartWidth": 1200,
  "producerChartHeight": 600
}



Of course all ConfigureMe features like cascading environments and on-the-fly reconfiguration are supported.


Producer filtering 


This is actually a feature people have been asking for for a long time: filtering by producer names. It is especially useful if you have a lot of producers with similar names.
Here is an example:

[screenshot: filtering producers by name - enlarge the image for details]

Finally the accumulators overview now offers a new link:

[screenshot: the new link in the accumulators overview]

which leads to the new:

Single Accumulator View

Single accumulator view offers a quick glance at one accumulator and provides not only the graph, but also the data behind the graph for quick analysis:

[screenshot: the single accumulator view]

The new version can be obtained as usual from our nexus repository.


Enjoy ;-)

Monday, June 18, 2012

Don't believe in availability, test for it!

Preamble


Back in 2004, Helmut Oertel and I were conducting a technical due diligence of a job portal for the scout group. One of the topics was backup and disaster recovery, and the questioning went like this:

- What happens if your database crashes?
- We have a stand-by database server.
- Is this a hot- or cold-standby?
- It's a hot standby, but we currently switched it off.
- So it's a cold standby then?
- Aehm... probably yes...
- Have you actually ever tested it?
- .... silence ....

Now, this is something you would call an epic fail. Having a spare DB in theory, but never testing it, is actually worse than not having one at all. In the latter case, you at least don't have a false feeling of safety.

FriendScout, for example, did it better: they had two database servers in a master/slave failover mode, partially coded by themselves, and they switched master and slave every week or so. This way they actually knew that both master and slave were able to work. In fact, when they had a small problem with one of the less important databases and the master failed on Dec 24th (no kidding), they didn't detect it until January 3rd, because the failover was performed so smoothly that no customer was affected.

Another good example is Parship. They have a DistributeMe-based SOA, and each important service is replicated in multiple instances with the FailoverToNextNode failing strategy. This means that if service Foo on server 1 fails, the call is retried on server 2, and so on. This is needed in production or staging, but in a local development environment it is unnecessary overhead to start nearly 100 services on your dev machine, so they start only one instance of each service and let the failover routing do its job. If the client (a web controller or something) issues a request to server X, instance 1 (based on mod-based routing of the user id, for example), and that instance is not running, the failover mechanism finds a working instance of the service. In fact they achieve two goals: 1) they save resources and 2) they continuously test their failover strategies. The proof came as a buggy FailoverRouter was committed to the trunk (a combination of version incompatibilities, nearly impossible to find with unit tests). One could expect that such an error would only be detected in a real failover situation in the production environment, but thanks to continuous failover testing it was detected 20 minutes after the commit.
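
Just to illustrate the idea - class and method names below are made up for this sketch, they are not the actual DistributeMe router API - a failover-to-next-node call could look roughly like this:

import java.util.List;

// Hypothetical sketch of mod-based routing with failover to the next instance.
public class FailoverToNextNodeSketch {

  interface FooService {
    String doSomething(long userId) throws Exception;
  }

  private final List<FooService> instances;

  public FailoverToNextNodeSketch(List<FooService> instances) {
    this.instances = instances;
  }

  public String call(long userId) throws Exception {
    if (instances.isEmpty())
      throw new IllegalStateException("no service instances configured");
    int home = (int) (userId % instances.size()); // mod-based routing picks the "home" instance
    Exception lastError = null;
    for (int i = 0; i < instances.size(); i++) {
      FooService candidate = instances.get((home + i) % instances.size());
      try {
        return candidate.doSomething(userId); // first reachable instance wins
      } catch (Exception e) {
        lastError = e; // instance down or misbehaving - fail over to the next one
      }
    }
    throw lastError; // all instances failed
  }
}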

But let's come to the point.


Talking about DistributeMe, it has a set of built-in interceptors which are meant for availability testing.
However, the concept of interceptors is common to many (and probably most) middleware solutions, so you can easily implement the same thing with CORBA or whatever your platform offers.



Before we start, some ontology:

you might get the feeling that I mix up service and server. I don't. In the SOA world, at least as we understand it:
  • a Service is a component which offers some services (methods) and whose behavior is defined by a contract (interface). In other words, a Service is some code.
  • a Server is a node in the distributed system which contains (runs) one or multiple services. In other words, a server is a Java VM. The most usual situation, however, is that one server runs one service. In a mod- or round-robin-distributed, failover-enabled system a service runs on multiple servers simultaneously.
  • a Servant (CORBA slang) is the service implementation inside the server.
So whenever a client talks to a logical component, it talks to a service, but the physical call goes to a service instance (the servant) in a server.


So, what are the most interesting cases you should test for? The easiest and most obvious one is surely:

Server is not there.


Server is not there is a very common case. The server could have crashed, failed to start, the real or virtual machine could have crashed, and so on. However, this failure is easy to detect: if the server is not there, the reaction comes immediately.
To emulate this behavior the interceptor simply throws a NoConnectionToServerException before the call can leave the stub (client-side interceptor) or before the call can be delegated to the servant (server-side interceptor).
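
To give an impression of what such an interceptor looks like, here is a minimal sketch. The interface and class names are hypothetical and heavily simplified - this is not the real DistributeMe API, only the NoConnectionToServerException name is taken from above:

// Hypothetical, simplified interceptor API - the real DistributeMe interfaces look different.
interface ClientSideInterceptor {
  // Called before the request leaves the stub; throwing here aborts the call.
  void beforeRequest(String serviceId, String methodName) throws Exception;
}

// Stand-in for the real DistributeMe exception of the same name.
class NoConnectionToServerException extends Exception {
  NoConnectionToServerException(String message) { super(message); }
}

// Simulates a crashed or never-started server: every call fails immediately.
class ServerIsNotThereInterceptor implements ClientSideInterceptor {
  @Override
  public void beforeRequest(String serviceId, String methodName) throws Exception {
    throw new NoConnectionToServerException("simulated outage of " + serviceId);
  }
}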

Server is slow. 

Server is slow is a more complicated and more dangerous situation. Generally I have encountered two reasons for an otherwise healthy service to become slow. They are surely not the only possible ones, but they are the most probable:
  • Unexpected DB Problems
  • Continuous Full GC cycles
An extreme version of a slow server is a never-replying server, for example due to deadlock or thread starvation, but this case is pretty similar to the above in terms of consequences.

So what happens if the server is very slow? Well, it depends on your architecture and your middleware. For example, RMI has no thread pool limitations by default. This means that a popular slow server will virtually drain all the threads from your web servers, until they have nothing left and no user request can be processed. I have seen service instances with over 10,000 threads waiting for them, which can get pretty ugly, because those queues will a) take time to process and b) overload a newly started instance of the slow service, causing the problem to repeat.

DistributeMe offers a SlowDownInterceptor which simply puts the current thread to sleep for a given amount of time, simulating a slowly responding service. Of course you are free to create your own; producing full CPU load on a machine instead would be an interesting option, especially regarding cross-effects with other services on that machine.
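
As a sketch, reusing the hypothetical ClientSideInterceptor interface from the previous listing (the real SlowDownInterceptor in DistributeMe differs in its details):

// Simulates a slow server by delaying every call for a configurable time.
class SlowDownInterceptor implements ClientSideInterceptor {
  private final long delayMillis;

  SlowDownInterceptor(long delayMillis) {
    this.delayMillis = delayMillis;
  }

  @Override
  public void beforeRequest(String serviceId, String methodName) throws Exception {
    // Holding the calling thread is exactly what a slow servant does to its clients.
    Thread.sleep(delayMillis);
  }
}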

So, how do you protect your system against slow servers? Well, it's not uncomplicated, because it's not that easy to interrupt a (hanging) thread from outside without messing up your system state.
Right now, we have two weapons against slow servers:
  • Concurrency control
  • Asynchronous method calls
Concurrency control allows us to limit the number of requests that can be sent to the server/servant. It can be applied on the client and/or the server side. Different approaches are possible, but the simplest one is to count active requests and refuse new ones once the limit is reached. It also gives good results to apply different limits on both sides. For example, if you have a server that can handle 100 parallel requests and 10 clients, it makes sense to apply a limit of 100 on the server side and a limit of 20-30 on the client side. This prevents one client from occupying all threads and letting the other clients starve.
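
As a minimal sketch with plain java.util.concurrent - my own illustration, not a specific DistributeMe class - such a counting limiter could look like this:

import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Rejects new requests once 'limit' of them are already in flight, instead of queueing them up.
class ConcurrencyLimiter {
  private final Semaphore permits;

  ConcurrencyLimiter(int limit) {
    this.permits = new Semaphore(limit);
  }

  <T> T execute(Callable<T> call) throws Exception {
    if (!permits.tryAcquire())
      throw new IllegalStateException("concurrency limit reached, rejecting request");
    try {
      return call.call();
    } finally {
      permits.release();
    }
  }
}

On the client side you would wrap the stub call with the small limit (the 20-30 from the example above), on the server side the servant invocation with the large one (100).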


Asynchronous method calls, on the other hand, give you full control over the acceptable call duration. You can state that if the result is not there after 2 seconds, it can be skipped altogether, and alternative information can be provided to the user. However, they aren't without their own tricks, since they rely on internal thread pools that are hard to configure and tune. But you have to die one death, and this one is probably the more pleasant one.
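
Here is a sketch of the pattern with plain java.util.concurrent (DistributeMe's own asynchronous stubs differ in the details):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class AsyncCallWithTimeout {
  private static final ExecutorService POOL = Executors.newFixedThreadPool(50);

  // Returns the service result, or the fallback if the call takes longer than timeoutMillis.
  static String callWithTimeout(Callable<String> serviceCall, long timeoutMillis, String fallback) {
    Future<String> future = POOL.submit(serviceCall);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // give up on the slow call...
      return fallback;     // ...and show the user alternative information instead
    } catch (Exception e) {
      return fallback;
    }
  }
}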

Flipping server

A flipping server emulates the strange, unpredictable behavior that sometimes happens: for example, serving only 10% of the requests slowly, or throwing an exception from time to time. Flipping at 1 percent or less can be considered a rare error, a condition that drives a lot of devops people crazy worldwide. Testing for it will help you tune your monitoring systems to be fine-grained enough.
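
One more sketch on top of the hypothetical interface from above - a flipping interceptor that misbehaves for a configurable fraction of the calls:

// Misbehaves (here: sleeps) for a configurable fraction of calls, e.g. 0.1 for 10%, 0.01 for the "rare error" case.
class FlippingInterceptor implements ClientSideInterceptor {
  private final double flipRate;

  FlippingInterceptor(double flipRate) {
    this.flipRate = flipRate;
  }

  @Override
  public void beforeRequest(String serviceId, String methodName) throws Exception {
    if (Math.random() < flipRate)
      Thread.sleep(5000); // or throw an exception here instead, from time to time
  }
}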

How to test

Once you have done your homework, prepared your system for partial failure, and installed and configured all interceptors, the hard part begins - the testing.
You will need some things for successful testing: 
  • a reliable test which produces a fair load on the system and checks all aspects of it.
  • an integration/test system to run this test against.
  • a deployed version of your code.
  • and a lot of patience.
Since you will probably never have a complete, reliable, up-to-date, automatic test with detailed error reporting, I strongly recommend you grab the best QA guy your company has and do it together with him. If you are part of the DevOps team in your company - fine; if not, you should have someone from it on board too, because these are the guys who will have to solve the problems you are simulating when they happen in production, and you will help them a lot if they can see and learn the symptoms.

Once you have all the guys you need, you are ready to enter the never-ending iteration: 
 - run the test 
 - detect the failures (where the system reacts badly to a service failure)
 - fix it.
repeat. 

If you don't find anything to fix, try out new interceptors, manipulate data in the DB, switch bytes in the packets, interrupt threads. Make availability testing part of your development process, and you will get the really responsive, robust, and almost unbreakable system you want.

AND last but not least - it's fun!

Monday, March 12, 2012

anotheria starts to become common knowledge

Well, everything starts small. Unless it starts big, but let's stick with things that start small. Recently I was searching linkedin.com for anotheria, and this is what I found in the skills profile of someone who was never officially on our payroll (but who is known to me, of course ;-)):

[screenshot: the LinkedIn skills section listing anotheria]

Well, it's a start ;-)