These days I am playing with a large-scale messaging system (a broker network). It is great when it works, but when one of these things starts to go wrong, you do not want to be there.
One big downside of asynchronous messaging is that it is so hard to debug, especially when messages jump from node to node and you are not the author of the code (so you do not know the magic places to put stdouts :( ).
The following is a little trick that helped me. It is not a silver bullet, but it does give some comfort. You need to start with the following.
1. Log the message ID, or some other unique ID, with every log message. With luck, the developers have already done this.
2. Set log4j to print the timestamp as the first thing in each log statement (for example, a PatternLayout conversion pattern that begins with %d{ISO8601}).
Assuming you were able to get all the logs into the same directory (in my case I have NFS mounted across all nodes, so that was easy), the following command lists everything related to a given message in real time order, so you can walk through it and tell what happened. (sed strips the file name prefix that grep adds, and sort then simply orders the lines by their log4j timestamps.)
grep message-id *.log | sed 's/^[^:]*\.log://' | sort
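Here is a minimal sketch of the same idea wrapped in a reusable shell function. The name trace_message and its arguments are made up for illustration. It assumes a grep that supports -h (GNU and BSD grep do), which drops the file name prefix so the sed step is not needed, and it assumes every log line starts with a timestamp that sorts correctly as plain text.

```shell
# trace_message ID [DIR]
# Print every line in DIR/*.log that mentions ID, oldest first.
# Hypothetical helper: assumes each line begins with a sortable timestamp.
trace_message() {
    msg_id="$1"
    log_dir="${2:-.}"
    # -h: do not prefix matches with the file name
    # -F: treat the ID as a literal string, not a regular expression
    # LC_ALL=C: sort by raw bytes so the order does not depend on the locale
    grep -hF -- "$msg_id" "$log_dir"/*.log | LC_ALL=C sort
}
```

Call it as trace_message msg-42 /path/to/logs, where both the ID and the directory are placeholders for your own values.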
It is a simple command, but it can be very useful. Also, if you need to merge all the logs in time order, do cat * | sort. As you can guess, there are many variations of this. Actually, maybe messaging system developers should settle on some standard format for logging message sending, receiving and routing, which would let people write log mining code/scripts that can uncover problems.
If anyone knows a useful log mining tool, please! please! drop me a note :).
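One variation, as a rough sketch: when merging everything, it helps to remember which node each line came from. The function name merge_logs and the tab plus [file] suffix are my own choices for illustration, not any standard format. It again assumes each line starts with a timestamp that sorts as plain text, and it will scatter multi-line entries such as stack traces.

```shell
# merge_logs [DIR]
# Merge DIR/*.log into one stream in timestamp order,
# tagging each line with the file it came from.
# Hypothetical helper: assumes a sortable timestamp at the start of each line.
merge_logs() {
    log_dir="${1:-.}"
    awk '{
        # keep only the base name of the current input file
        n = split(FILENAME, parts, "/")
        print $0 "\t[" parts[n] "]"
    }' "$log_dir"/*.log | LC_ALL=C sort
}
```

Putting the file name at the end of the line rather than the front keeps the timestamp as the sort key.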