I was at O'Reilly Strata last week, and interest in realtime analytics was certainly at its peak.
Realtime analytics, or at least what people call realtime analytics, comes in two flavours.
- Realtime Streaming Analytics (static queries are given once and do not change; they process data as it comes in, without storing it). CEP, Apache Storm, Apache Samza, etc. are examples of this.
- Realtime Interactive/Ad-hoc Analytics (users issue ad-hoc, dynamic queries and the system responds). Druid, SAP HANA, VoltDB, MemSQL, and Apache Drill are examples of this.
In this post, I am focusing on Realtime Streaming Analytics. (Ad-hoc analytics uses a SQL-like query language anyway.)
Still, when thinking about realtime analytics, people think only of counting use cases. However, that is just the tip of the iceberg. Because of the time dimension inherent in realtime use cases, there is a lot more you can do. Let us look at a few common patterns.
- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)
- Preprocessing: filtering, transformations (e.g. data cleanup)
- Alerts, thresholds (e.g. alarm on high temperature)
- Data correlation, detecting missing events, detecting erroneous data (e.g. detecting failed sensors)
- Joining event streams (e.g. detecting a hit on a soccer ball)
- Merge with data in a database, collect, update data conditionally
- Detecting Event Sequence Patterns (e.g. small transaction followed by large transaction)
- Tracking - following some related entity’s state in space, time, etc. (e.g. location of airline baggage, vehicles, tracking wildlife)
- Detecting trends – rise, turn, fall, outliers, complex trends like triple bottom, etc. (e.g. algorithmic trading, SLA, load balancing)
- Learning a Model (e.g. Predictive maintenance)
- Predicting next value and corrective actions (e.g. automated car)
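To make one of these patterns concrete, here is a minimal Python sketch of "counting with windows". It is only an illustration, not how any particular CEP engine does it: the class name `SlidingWindowCounter` and its methods are made up for this example, and it assumes events arrive in timestamp order (in seconds).

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen in the last `window_seconds` seconds.

    Hypothetical helper for illustration; assumes timestamps arrive in order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()  # event times, oldest first

    def add(self, timestamp):
        """Record one event and return the count inside the window."""
        self.timestamps.append(timestamp)
        self._expire(timestamp)
        return len(self.timestamps)

    def _expire(self, now):
        # Keep only events in the half-open window (now - window, now].
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()


# Failure events at these times (seconds), counted over a one-hour window.
counter = SlidingWindowCounter(window_seconds=3600)
counts = [counter.add(t) for t in (0, 600, 1800, 3600, 4300)]
print(counts)  # [1, 2, 3, 3, 3]
```

Even this toy version has to decide what the window boundary means and silently breaks if events arrive out of order, which is part of why these pieces are worth writing once and reusing.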
Why do we need a SQL-like query language for Realtime Streaming Analytics?
Each of the above has come up in use cases, and we have implemented them using SQL-like CEP query languages. Knowing the internals of implementing core CEP concepts like sliding windows and temporal query patterns, I do not think every streaming use case developer should rewrite them. The algorithms are not trivial, and they are very hard to get right!
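As a rough illustration of what hand-writing a temporal pattern involves, here is a Python sketch of "small transaction followed by large transaction". Everything in it is an assumption made for the example: the function name, the `(timestamp, card_id, amount)` event shape, the thresholds, and the ten-minute limit. It also assumes events arrive in time order.

```python
def detect_small_then_large(events, small_max=10.0, large_min=1000.0,
                            within_seconds=600):
    """Yield (small, large) pairs where a small transaction on a card is
    followed by a large one on the same card within `within_seconds`.

    `events` is an iterable of (timestamp, card_id, amount) in time order.
    Thresholds are illustrative defaults, not recommended values.
    """
    last_small = {}  # card_id -> most recent small transaction
    for event in events:
        timestamp, card_id, amount = event
        if amount <= small_max:
            last_small[card_id] = event
        elif amount >= large_min:
            # Consume the pending small transaction for this card, if any.
            small = last_small.pop(card_id, None)
            if small is not None and timestamp - small[0] <= within_seconds:
                yield small, event


events = [
    (0, "card-1", 2.0),
    (30, "card-2", 5.0),
    (120, "card-1", 2500.0),   # small then large within 10 minutes: match
    (5000, "card-2", 1800.0),  # large arrives too late: no match
]
matches = list(detect_small_then_large(events))
print(matches)  # [((0, 'card-1', 2.0), (120, 'card-1', 2500.0))]
```

Note what this sketch leaves out: state for cards that never see a large transaction is never cleaned up, out-of-order events are not handled, and only the latest small transaction per card is remembered. Those are exactly the details a shared, well-tested implementation should take care of.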
Instead, we need higher levels of abstraction. We should implement those once and for all, and reuse them. The best lesson comes from Hive and Hadoop, which do exactly that for batch analytics. I have explained Big Data with Hive many times, and most people get it right away. Hive has become the major programming API for most Big Data use cases.
Following is a list of reasons for a SQL-like query language.
- Realtime analytics is hard. Not every developer wants to hand-implement sliding windows, temporal event patterns, etc.
- Easy to follow and learn for people who know SQL, which is pretty much everybody
- SQL-like languages are expressive, short, sweet and fast!!
- SQL-like languages define core operations that cover 90% of problems
- Experts can dig in when they like!
- Realtime analytics runtimes can better optimize execution with a SQL-like model. Most optimisations are already studied, and there is a lot you can just borrow from database optimisations.
Finally, what are such languages? There are a lot defined in the world of Complex Event Processing (e.g. WSO2 Siddhi, Esper, Tibco StreamBase, IBM InfoSphere Streams, etc.). SQLstream has a fully ANSI SQL compliant version of it. Last week I did a talk at Strata discussing this problem in detail and how CEP could fit the bill. You can find the slide deck below.