Sunday, May 11, 2014

Why does throughput drop when going through an ESB?

We often measure the overhead of an ESB by first hitting the backend service directly, measuring throughput and latency, and then repeating the test through the ESB and comparing the two results. Typically, latency is higher and throughput is lower when going through the ESB.

Going via the ESB adds a network hop, so everyone agrees that latency must increase. But why does throughput decrease?

This question troubled us for a long time, and Paul described the answer in http://pzf.fremantle.org/2012/09/understanding-esb-performance.html. The gist is that the ESB does twice the I/O: it reads the request, sends it to the backend server, reads the response, and sends it back to the client, whereas the backend server only reads the request and writes the response. Hence, if the backend is very fast (it does little more than read the message and respond) and I/O is the bottleneck, throughput through the ESB will be around 50% of direct throughput.
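
To make the doubled I/O concrete, here is a minimal pass-through proxy sketch in Java (illustrative only, not the ESB's actual code; the class name and ports are made up, and it assumes each message fits in a single read). The proxy performs four socket operations per message where the backend performs only two:

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class RelaySketch {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(8280)) {
                byte[] buf = new byte[8192];
                while (true) {
                    try (Socket client = server.accept();
                         Socket backend = new Socket("localhost", 9000)) {
                        int n = client.getInputStream().read(buf);   // 1. read the request
                        backend.getOutputStream().write(buf, 0, n);  // 2. send it to the backend
                        n = backend.getInputStream().read(buf);      // 3. read the response
                        client.getOutputStream().write(buf, 0, n);   // 4. send it back to the client
                    }
                }
            }
        }
    }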

Now, what if the backend service is slow? For example, add an 80ms sleep to the backend service. It turns out that throughput still decreases. In this case, the client uses many threads (e.g. 20). Each thread sends a request, waits for the response to come back, and then sends the next request. If you use a load client like JMeter or Apache Bench, this is what happens. (This is the Saturation Test, as Paul explained.) A sketch of such a client follows.
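
Here is a minimal sketch of this closed-loop pattern, assuming Java 11's HttpClient and an illustrative endpoint URL (neither is from the original tests). Because each thread blocks on every response, its rate is capped at 1000/latency requests per second:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.atomic.AtomicLong;

    public class ClosedLoopClient {
        public static void main(String[] args) throws InterruptedException {
            HttpClient http = HttpClient.newHttpClient();
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://localhost:8280/service")).build();
            AtomicLong completed = new AtomicLong();
            for (int i = 0; i < 20; i++) {           // 20 threads, as in the text
                Thread t = new Thread(() -> {
                    while (true) {
                        try {
                            // Blocks until the response arrives; only then
                            // is the next request sent (closed loop).
                            http.send(req, HttpResponse.BodyHandlers.discarding());
                            completed.incrementAndGet();
                        } catch (Exception e) {
                            return;
                        }
                    }
                });
                t.setDaemon(true);
                t.start();
            }
            Thread.sleep(10_000);                    // measure for 10 seconds
            System.out.println("throughput ~ " + completed.get() / 10 + " req/s");
        }
    }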

The reduction in throughput happens for the following reason.

  1. Let's say that when we hit the backend service directly, a request takes t ms. 
  2. Going through the ESB, latency is always higher than going direct. Let's say it takes t+2 ms. 
  3. Since each client thread sends one message after the other, in the direct case each thread sends 1000/t messages per second. 
  4. Mediating through the ESB, each thread sends only 1000/(t+2) messages per second. 
  5. So throughput in the direct case is always higher than in the via-ESB case if you use a client that waits for the response before sending the next message. 
  6. Applying this to some numbers: assume latencies of 99ms direct and 111ms through the ESB. Then the throughput should be about 300 vs. 267 requests per second (see the worked calculation after this list). 
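
The arithmetic behind step 6, as a small snippet: with the same number of waiting client threads, throughput scales inversely with latency, so the direct throughput drops by the latency ratio 99/111 (the 300 req/s starting figure is the assumed direct measurement):

    public class ThroughputMath {
        public static void main(String[] args) {
            double direct = 300.0;                  // assumed direct throughput (req/s)
            double viaEsb = direct * 99.0 / 111.0;  // scale by the latency ratio, ~267 req/s
            System.out.printf("direct=%.0f req/s, via ESB=%.0f req/s%n", direct, viaEsb);
        }
    }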


What is the problem? Since we use only a few clients, the delay in each response limits how many messages are put into the system. However, a real use case is different: there you have thousands to millions of clients that send requests independently, not one after the other in lockstep. In such cases (e.g. the simulated test, as Paul explained), the same problem does not arise.

Alternatively, you can write a client that sends the next request at a fixed interval, regardless of how long the response takes.
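
A sketch of such an open-loop client, again assuming an illustrative endpoint URL and rate: a timer fires requests on a fixed schedule, so the offered load no longer depends on response latency.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class OpenLoopClient {
        public static void main(String[] args) throws InterruptedException {
            HttpClient http = HttpClient.newHttpClient();
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://localhost:8280/service")).build();
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            // Fire a request every 5 ms (~200 req/s) without waiting for
            // replies, so the send rate is independent of response time.
            timer.scheduleAtFixedRate(
                    () -> http.sendAsync(req, HttpResponse.BodyHandlers.discarding()),
                    0, 5, TimeUnit.MILLISECONDS);
            Thread.sleep(10_000);                   // run for 10 seconds
            timer.shutdownNow();
        }
    }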
