Friday, May 30, 2014

ACM DEBS Grand Challenge 2014: Smart grids, 4 Billion events, throughout in range of 100Ks

ACM Distributed Event Based Systems (DEBS) happened at Mumbai this week. One of the highlight of the conference is DEBS Grand Challenge. 

Introduced at 2011, and happening for the forth time, the grand challenge poses a real world data set and a set of problems that would keep distributed event enthusiastic busy for few months. Selected solutions are invited to the conference, and final showdown and announcement happen at the conference. 

2013 challenge includes a football (soccer) game, which I blogged earlier. 2014 challenge is an smart metering use case, timely one given the advent of IoT, that included 4B (yes with a capital B) events collected over 6 weeks from 40 houses and 2000 sensors. You can find more details from (also Zbigniew Jerzak, The DEBS 2014 Grand Challenge, ACM Distributed Event based System, 2014)

In 2014 challenge, four solutions were accepted out of about 25 submissions: from Dresden University of Technology (Germany), Imperial College London, Fraunhofer Institute (Germany), and WSO2.   

Problem involves two queries: predicting load and finding outlier sensors. The predicting algorithm is given, as this is a event-processing challenge, not an machine learning challenge. Both queries needed a single node solution as well as a distributed solution. 

First query was easy to parallelize, and all four solutions had completed this successfully, posting single node throughputs of few thousands up to 300k and 400K events per second. Scaling first query turns out to be much trickier, and solutions had posted throughput of  0.85M with 4 nodes, 1.7 with 6 nodes, 1.1 with 50 nodes (events per second).

Second query includes a median calculated over 24 hour sliding window, with about 1000 events per second. This means about 74million events and calculated over sliding window sliding in 1 second. This had everyone swearing. Single node solutions gave throughputs ranging from few 100k to few millions. 

Distributed solution for query two, well no one had it solved. All distributed solutions were slower than the single node solution. Obviously, there are lot more work to be done by the community at large. 

There was stiff competition and everyone was at their toes though the presentations. Overall winner was solution from Fraunhofer Institute (#3) and audience award went to Imperial College London (#2).

It was lot of fun, and among my parting thought was "If 40 houses generate 4B events, how we are going to handle millions of houses?" 

You can find our presentation from slideshare and paper from "Solving the grand challenge using an opensource CEP engine". I will write a detailed blog about our solution soon. Other solutions presentations and papers are available in ACM library

Solving DEBS Grand Challenge with WSO2 CEP from Srinath Perera

Looking forward to the next year DEBS grand challenge. 

Sunday, May 11, 2014

Why throughput reduces when going though an ESB?

Often we measure the overhead of an ESB by first hitting the backend service directly measuring throughput and latency, and then going though an ESB and comparing the two. Often, latency is higher and throughput is lesser when going though an ESB.

Going via ESB means an additional network hop, so everyone agrees that latency must be increased. But why throughput reduces?

Question has troubled us for a long time, and Paul described the answer in Simple gist of the answer is that ESB do twice IO (it reads, sends to backend server, reads the response, and sends it back as oppose to backend server who just reads and responds to the messages). Hence, if the backend is very fast (if it just reads the message and responds), though ESB throughput will be around 50%.

Now, what if backend service is slow? E.g. add 80ms sleep to the backend service. It turns out even then the throughput has decreased. In this case, the client uses many threads (e.g. 20). Each thread sends a request, waits for a response to come back, and then sends the next request. E.g. If you use some load client like JMeter or Apache Bench, this is what happening. (This is the Saturation Test as Paul explained).

Reduction of throughput is caused due to following reason.

  1. Lets say that when we hit backend service, it takes time t ms. 
  2. When go though ESB, latency will always be more than going direct. Lets say it takes t+2ms. 
  3. Since client sends one message after the other, in the direct case it will send 1000/t messages. 
  4. Mediating though ESB client will only send 1000/(t+2) messages. 
  5. So the throughput of the to direct case is always higher than via ESB case if you use a client that wait for response to come before sending the next message. 
  6. Applying this to some numbers: let us assume latencies are 99ms vs. 111ms directly and though ESB respectively. Then the throughput should be 300 vs 267. 

What is the problem? Since we use few clients, then the delay in response affect how many messages are put to the system. However, a real usecase is different. Then, you will have thousands to millions of clients that send requests randomly without sending requests one after the other in lockstep. In such cases (e.g. simulated test as Paul explained), the same problem does not arise.

 Alternatively you can write a client that sends the next request in a fix time regardless of how long the response will take.