Tuesday, November 17, 2015

Introduction to Anomaly Detection: Concepts and Techniques

Please find the post at https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/

Thursday, October 15, 2015

Thinking Deeply about IoT Analytics


Big data has solved many IoT analytics challenges, especially the system challenges related to large-scale data management, learning, and data visualization. However, significant thinking and work are still required to match IoT use cases to analytics systems.

Following are the highlights.


  • How fast do we need results? Real-time vs. batch, or a combination.
  • How much data should we keep? Based on the use cases and the incoming data rate, we might choose to keep none, a summary, or everything. Edge analytics is a related aspect of the same problem.
  • From analytics, do we want hindsight, insight, or foresight? This decides between aggregation and machine learning methods. Techniques such as time series and spatiotemporal algorithms will also play a key role in IoT use cases.
  • What is the response from the system when we have an actionable insight? Show a visualization, send alerts, or do automatic control.
  • Finally, we discussed the shape of IoT data, a few reusable scenarios, and the potential of building middleware solutions for those scenarios.


Please find the full post at https://iwringer.wordpress.com/2015/10/15/thinking-deeply-about-iot-analytics/

Thursday, September 24, 2015

Thursday, September 17, 2015

What Can Data Science and Big Data Do for Sri Lanka?

https://iwringer.wordpress.com/2015/09/14/what-can-data-science-and-big-data-can-do-for-sri-lanka/

Dissecting the Big Data Twitter Community through a Big Data Lens

https://iwringer.wordpress.com/2015/09/17/disecting-big-data-twitter-community-through-a-big-data-lense/

Wednesday, September 2, 2015

Analysis of Retweeting Patterns in Sri Lankan 2015 General Election

Post at https://iwringer.wordpress.com/2015/08/25/analysis-of-retweeting-patterns-in-sri-lankan-general-election/

Tuesday, March 17, 2015

Introducing WSO2 Analytics Platform: Note for Architects

WSO2 has had several analytics products, WSO2 BAM and WSO2 CEP (or Big Data products, if you prefer the term), for some time. We are adding WSO2 Machine Learner, a product to create, evaluate, and deploy predictive models, to that mix very soon. This post describes how all of those fit into a single story.

The following picture summarises what you can do with the platform.



Let's look at each stage depicted in the picture in detail.

Stage 1: Collecting Data

There are two things for you to do.

Define Streams - Just like you create tables before you put data into a database, you first define streams before sending events. A stream is a description of what your data looks like (a schema). You will use the same streams to write queries in the second stage. You do this via the CEP or BAM admin console (https://host:9443/carbon) or via the Sensor API described in the next step.
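For illustration, here is a minimal sketch of a stream definition in the databridge JSON format; the stream name, fields, and types are made up for this example, and the exact attributes may differ between releases.

{
  "name": "org.example.temperature",
  "version": "1.0.0",
  "description": "Temperature readings from building sensors",
  "metaData": [
    {"name": "sensorId", "type": "STRING"}
  ],
  "payloadData": [
    {"name": "roomNo", "type": "STRING"},
    {"name": "temp", "type": "DOUBLE"}
  ]
}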


Publish Events - Now you can publish events. We provide a single Sensor API to publish events to both the batch and realtime pipelines. The Sensor API is available as Java clients (Thrift, JMS, Kafka), JavaScript clients (WebSocket and REST), and 100s of connectors via WSO2 ESB. See How to Publish Your own Events (Data) to WSO2 Analytics Platform (BAM, CEP) for details on how to write your own data publisher.

Stage 2: Analyse Data

Now it is time to analyse the data. There are two ways to do this: writing queries (analytics) and building machine learning models (predictive analytics).

Write Queries

For both batch and realtime processing, you can write SQL-like queries. For batch queries we support Hive SQL, and for realtime queries we support the Siddhi Event Query Language.

Example 1: Realtime Query (e.g. Calculate Average Temperature over 1 minute sliding window from the Temperature Stream) 

from TemperatureStream#window.time(1 min)
  select roomNo, avg(temp) as avgTemp
  insert into HotRoomsStream ;

Example 2: Batch Query (e.g. Calculate Average Temperature for each hour from the Temperature Stream)

insert overwrite table TemperatureHistory
  select getHour(ts) as hour, avg(t) as avgT, buildingId
  from TemperatureStream group by buildingId, getHour(ts);


Build Machine Learning (Predictive Analytics) Models

Predictive analytics lets us learn “logic” from examples when that logic is too complex to write by hand. For example, we can build “a model” to find fraudulent transactions. To that end, we can use machine learning algorithms to train the model with historical data about fraudulent and non-fraudulent transactions.


The WSO2 Analytics platform supports predictive analytics in multiple forms:
  1. Use the WSO2 Machine Learner (2015 Q2) wizard to build machine learning models, and use them with your business logic. For example, WSO2 CEP, BAM, and ESB will support running those models.
  2. R is a widely used language for statistical computing. We can build models using R, export them as PMML (an XML description of machine learning models), and use the models within WSO2 CEP. You can also call R scripts directly from CEP queries.
  3. WSO2 CEP also includes several streaming regression and anomaly detection operators; a rough standalone illustration of the idea is sketched below.
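As a rough illustration of what a streaming anomaly detection operator does internally (this is a standalone sketch, not WSO2 code), the following Java snippet keeps a sliding window of recent values and flags readings that are more than three standard deviations away from the window mean.

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sliding-window z-score anomaly detector (illustration only).
public class ZScoreAnomalyDetector {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();

    public ZScoreAnomalyDetector(int windowSize) {
        this.windowSize = windowSize;
    }

    // Returns true if the new value is more than 3 standard deviations from the window mean.
    public boolean isAnomaly(double value) {
        boolean anomaly = false;
        if (window.size() == windowSize) {
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
            double stdDev = Math.sqrt(variance);
            anomaly = stdDev > 0 && Math.abs(value - mean) > 3 * stdDev;
            window.removeFirst(); // slide the window forward
        }
        window.addLast(value);
        return anomaly;
    }

    public static void main(String[] args) {
        ZScoreAnomalyDetector detector = new ZScoreAnomalyDetector(50);
        for (int i = 0; i < 200; i++) {
            double temp = 25 + Math.random();   // normal readings around 25
            if (i == 150) temp = 80;            // injected spike
            if (detector.isAnomaly(temp)) {
                System.out.println("Anomaly at reading " + i + ": " + temp);
            }
        }
    }
}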

Stage 3: Communicate the Results

OK, now we have some results, and we need to communicate them to the users or systems that care about them. That communication can take three forms.
  1. Alerts detect special conditions and cover the last mile to notify users (e.g. email, SMS, push notifications to a mobile app, a pager, or triggering a physical alarm). This can be easily done with CEP (see the example query after this list).
  2. Dashboards visualize data to give the “overall idea” at a glance (e.g. a car dashboard). They support customizing and creating the user's own dashboards. When there is a special condition, they draw the user's attention to it and let the user drill down and find details. The upcoming WSO2 BAM and CEP 2015 Q2 releases will have a wizard to start from your data and build custom visualizations, with support for drill-downs as well.
  3. APIs expose data to users outside the organisational boundary, often consumed by mobile phones. WSO2 API Manager is one of the leading API solutions, and you can use it to expose your data as APIs. In later releases, we are planning to add support for exposing data as APIs via a wizard.
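For example, a simple CEP alert can be written as a Siddhi query in the same style as the earlier examples; the stream and attribute names here are made up for illustration.

from TemperatureStream[temp > 50]
  select roomNo, temp
  insert into HighTemperatureAlertStream;

An output adaptor can then be attached to HighTemperatureAlertStream to send the actual email or SMS.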

Why choose WSO2 Analytics Platform?

Reason 1: One platform for realtime, batch, and combined processing - with a single API for publishing events, and with support for implementing combined use cases like the following:
  1. Run similar queries in the batch pipeline and the realtime pipeline (a.k.a. the Lambda Architecture).
  2. Train a machine learning model (e.g. a fraud detection model) in the batch pipeline, and use it in the realtime pipeline (use cases: fraud detection, segmentation, predicting the next value, predicting churn).
  3. Detect conditions in the realtime pipeline, but switch to detailed analysis using the data stored in the batch pipeline (e.g. fraud, giving deals to customers on an e-commerce site).
Reason 2: Performance - WSO2 CEP can process 100K+ events per second and is one of the fastest realtime processing engines around. WSO2 CEP was a finalist in the DEBS Grand Challenge 2014, where it processed 0.8 million events per second with 4 nodes.

Reason 3: A scalable realtime pipeline that runs SQL-like CEP queries on top of Storm - Users write queries using the SQL-like Siddhi Event Query Language, which provides higher-level operators for building complex realtime queries. See SQL-like Query Language for Real-time Streaming Analytics for more details.
For batch processing, we use Apache Spark (from the 2015 Q2 release onward), and for realtime processing, users can run those queries in one of two modes.
  1. Run the queries on two CEP nodes, one node acting as the HA backup for the other. Since WSO2 CEP can process in excess of a hundred thousand events per second, this choice is sufficient for many use cases.
  2. Partition the queries and streams, build an Apache Storm topology running CEP nodes as Storm spouts, and run it on top of Apache Storm. Please see the slide deck Scalable Realtime Analytics with declarative SQL like Complex Event Processing Scripts. This enables users to write complex queries, as supported by Complex Event Processing, while still scaling the computation for large data streams.
Reason 4: Support for predictive analytics - support for building machine learning models, comparing them and selecting the best one, and using them within real-life distributed deployments.


Almost forgot: all of this is open source under the Apache License. Most design decisions are discussed publicly at architecture@wso2.org.


Refer to the following talk at WSO2Con Europe for more details (slides).



If you find this interesting, please try it out. Please reach out to me directly or through http://wso2.com/contact/ if you want more information.




Monday, March 16, 2015

How to Publish Your own Events (Data) to WSO2 Analytics Platform (BAM, CEP)

We collect data via a Sensor API (a.k.a. agents), send it to the servers (WSO2 CEP and WSO2 BAM), process it, and do something with the results. You can find more information about the big picture in the slide deck http://www.slideshare.net/hemapani/introduction-to-large-scale-data-analysis-with-wso2-analytics-platform

This post describes how you can collect data.

We provide a single Sensor API to publish events to both the batch and realtime pipelines. The Sensor API is available as Java clients (Thrift, JMS, Kafka), JavaScript clients (WebSocket and REST), and 100s of connectors via WSO2 ESB.

Let's see how we can use the Java Thrift client to publish events.

First of all, you need CEP or BAM running. Download, unzip, and run WSO2 CEP or WSO2 BAM (via bin/wso2server.sh). 

Now, let's write a client. Add the jars given in Appendix A, or add the POM dependencies given in Appendix B to your Maven POM file, to set up the classpath.

The Java client would look like the following.
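The original embedded example is not reproduced here, so the following is a rough sketch of such a client using the databridge Thrift DataPublisher; the stream name, attributes, paths, and credentials are placeholders, and the exact method signatures may differ between releases, so check the samples bundled with your pack.

import org.wso2.carbon.databridge.agent.thrift.DataPublisher;

public class TemperaturePublisher {
    public static void main(String[] args) throws Exception {
        // Point to the client truststore shipped with the CEP/BAM pack (placeholder path).
        System.setProperty("javax.net.ssl.trustStore",
                "/path/to/wso2cep/repository/resources/security/client-truststore.jks");
        System.setProperty("javax.net.ssl.trustStorePassword", "wso2carbon");

        // Thrift endpoint and default credentials of a local CEP/BAM server.
        DataPublisher dataPublisher = new DataPublisher("tcp://localhost:7611", "admin", "admin");

        // Define the stream (the schema); this returns a stream id used when publishing.
        String streamId = dataPublisher.defineStream("{" +
                "  'name':'org.example.temperature'," +
                "  'version':'1.0.0'," +
                "  'payloadData':[" +
                "      {'name':'roomNo','type':'STRING'}," +
                "      {'name':'temp','type':'DOUBLE'}" +
                "  ]" +
                "}");

        // Publish one event: the meta, correlation, and payload arrays must match
        // the stream definition (here there is only payload data).
        dataPublisher.publish(streamId, null, null, new Object[]{"room-1", 23.4});

        dataPublisher.stop();
    }
}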




Just like you create tables before you put data into a database, you first define streams before sending events to the WSO2 Analytics Platform. A stream is a description of what your data looks like (a.k.a. the schema). Then you can publish events. In the code, the "Event Data" is an array of objects, and it must match the types and attributes given in the event stream definition.

You can find an example client in /samples/producers/pizza-shop of the WSO2 CEP distribution.


Appendix A: Dependency Jars

You can find the jars in ${cep.home}/repository/components/plugins/ of the CEP or BAM pack.

  1. org.wso2.carbon.logging_4.2.0.jar
  2. commons-pool_1.5.6.*.jar
  3. httpclient_4.2.5.*.jar
  4. httpcore_4.3.0.*.jar
  5. commons-httpclient_3.1.0.*.jar
  6. commons-codec_1.4.0.*.jar
  7. slf4j.log4j*.jar
  8. slf4j.api_*.jar
  9. axis2_1.6.1.*.jar
  10. axiom_1.2.11.*.jar
  11. wsdl4j_1.6.2.*.jar
  12. XmlSchema_1.4.7.*.jar
  13. neethi_*.jar
  14. org.wso2.securevault_*.jar
  15. org.wso2.carbon.databridge.agent.thrift_*.jar
  16. org.wso2.carbon.databridge.commons.thrift_*.jar
  17. org.wso2.carbon.databridge.commons_*.jar
  18. com.google.gson_*.jar
  19. libthrift_*.jar

Appendix B: Maven POM Dependencies

Add the following WSO2 Nexus repo and dependencies to your pom.xml in the corresponding sections.




<repositories>
    <repository>
        <id>wso2-nexus</id>
        <url>http://maven.wso2.org/nexus/content/groups/wso2-public/</url>
        <releases>
            <updatePolicy>daily</updatePolicy>
            <checksumPolicy>ignore</checksumPolicy>
        </releases>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>org.wso2.carbon</groupId>
        <artifactId>org.wso2.carbon.databridge.agent.thrift</artifactId>
        <version>4.2</version>
    </dependency>
    <dependency>
        <groupId>org.wso2.carbon</groupId>
        <artifactId>org.wso2.carbon.databridge.commons</artifactId>
        <version>4.2</version>
    </dependency>
</dependencies>

  

Friday, March 6, 2015

Embedding WSO2 Siddhi from Java

Siddhi is the CEP engine that powers WSO2 CEP. WSO2 CEP is the server, which can accept messages over the network via a long list of protocols such as Thrift, HTTP/JSON, JMS, Kafka, and WebSocket.

Siddhi, in contrast, is a Java library. That means you can use it from a Java class or a Java main method. I personally do this to debug CEP queries before putting them into WSO2 CEP. The following describes how to do it. You can also embed Siddhi and build your own apps.

First, add the following jars to the classpath. (You can find them in the WSO2 CEP pack, http://wso2.com/products/complex-event-processor/, and at http://mvnrepository.com/artifact/log4j/log4j/1.2.14.) The jar versions might change with new packs, but whatever is in the same CEP pack will work.
  1. siddhi-api-2.1.0-wso2v1.jar (from CEP_PACK/repository/components/plugins/) 
  2. antlr-runtime-3.4.jar (from CEP_PACK/repository/components/plugins/) 
  3. log4j-1.2.14.jar (download from http://mvnrepository.com/artifact/log4j/log4j/1.2.14)
  4. siddhi-query-2.1.0-wso2v1.jar (from CEP_PACK/repository/components/plugins/) 
  5. siddhi-core-2.1.0-wso2v1.jar (from CEP_PACK/repository/components/plugins/) 
Now you can use Siddhi using the following code. You define a Siddhi engine, add queries, register callbacks to receive results, and send events.

SiddhiManager siddhiManager = new SiddhiManager();

//define the stream (the schema of the events)
siddhiManager.defineStream("define stream StockQuoteStream (symbol string, " +
        "value double, time long, count long);");

//add CEP queries
siddhiManager.addQuery("from StockQuoteStream[value>20] " +
        "insert into HighValueQuotes;");

//add callbacks to receive results
siddhiManager.addCallback("HighValueQuotes", new StreamCallback() {
    public void receive(Event[] events) {
        EventPrinter.print(events);
    }
});

//send events into Siddhi
InputHandler inputHandler = siddhiManager.getInputHandler("StockQuoteStream");
inputHandler.send(new Object[]{"IBM", 34.0, System.currentTimeMillis(), 10L});

Here, the events you send in must agree with the event streams you have defined. For example, an event on StockQuoteStream must carry a string, a double, a long, and a long, as per the stream definition.

See my earlier blog for examples of more queries.

Please see  [1] and [2] for more information about the Siddhi query language. If you create a complicated query, you can check intermediate results by adding callbacks to intermediate streams.

Enjoy! Reach us via the wso2 tag on Stack Overflow if you have any questions, or send a mail to dev@wso2.org.

  1. http://www.slideshare.net/suho/wso2-complex-event-processor-wso2-coneu2014
  2. https://docs.wso2.com/display/CEP310/Siddhi+Language+Specification

Update: For CEP 4.0 and later, the API has changed. You can find a new sample (Siddhi 4.0) at https://github.com/wso2/siddhi/blob/master/modules/siddhi-samples/quick-start-samples/pom.xml. Look at the simple filter sample, and look at the POM file for dependencies.


Wednesday, February 25, 2015

WSO2 Demo Videos from O'reilly Strata 2015 Booth


We just came back from O'Reilly Strata. It was great to see most of the Big Data world gathered in one place.

WSO2 had a booth, and the following are the demos we showed there.

Demo 1: Realtime Analytics for a Football Game played with Sensors

This shows realtime analytics done using a dataset created by playing a football game with sensors in the ball and in the players' boots. You can find more information in the earlier post.



Demo 2: GIS Queries using Public Transport for London Data Feeds 

TFL (Transport for London) provides several public data feeds about London public transport. We used those feeds within WSO2 CEP's Geo Dashboard to implement "Speed Alerts", "Proximity Alerts", and Geo Fencing.



Please see this slide deck for more information. 



Introduction to Large Scale Data Analysis with WSO2 Analytics Platform

Slide deck for the talk I did at Indiana University, Bloomington. It walks through the WSO2 Big Data offering, providing example queries.

Tuesday, February 24, 2015

Why Do We Need a SQL-like Query Language for Realtime Streaming Analytics?


I was at O'Reilly Strata last week, and interest in realtime analytics was certainly at its peak.

Realtime analytics has two flavours.
  1. Realtime Streaming Analytics (static queries given once that do not change; they process data as it comes in, without storing it). CEP, Apache Storm, Apache Samza, etc. are examples of this.
  2. Realtime Interactive/Ad-hoc Analytics (users issue ad-hoc dynamic queries and the system responds). Druid, SAP HANA, VoltDB, MemSQL, and Apache Drill are examples of this.
In this post, I am focusing on Realtime Streaming Analytics. (Ad-hoc analytics uses a SQL like query language anyway.)

Still, when thinking about realtime analytics, people think only of counting use cases. However, that is just the tip of the iceberg. Due to the time dimension inherent in realtime data, there is a lot more you can do. Let us look at a few common patterns.
  1. Simple counting (e.g. failure count)
  2. Counting with Windows ( e.g. failure count every hour)
  3. Preprocessing: filtering, transformations (e.g. data cleanup)
  4. Alerts and thresholds (e.g. alarm on high temperature)
  5. Data Correlation, Detect missing events, detecting erroneous data (e.g. detecting failed sensors)
  6. Joining event streams (e.g. detect a hit on soccer ball)
  7. Merge with data in a database, collect, update data conditionally
  8. Detecting Event Sequence Patterns (e.g. a small transaction followed by a large transaction; see the example query after this list)
  9. Tracking - follow some related entity’s state in space, time etc. (e.g. location of airline baggage, vehicle, tracking wild life)
  10. Detect trends – rise, turn, fall, outliers, complex trends like triple bottom, etc. (e.g. algorithmic trading, SLA, load balancing)
  11. Learning a Model (e.g. Predictive maintenance)
  12. Predicting next value and corrective actions (e.g. automated car)
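For instance, pattern 8 can be expressed as a Siddhi-style pattern query; this is a sketch with made-up stream and attribute names, and the exact pattern syntax varies across Siddhi versions.

from every a1 = TransactionStream[amount < 100]
       -> a2 = TransactionStream[amount > 10000 and a1.cardNo == cardNo]
  select a1.cardNo as cardNo, a2.amount as largeAmount
  insert into PossibleFraudStream;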

Why do we need a SQL-like query language for Realtime Streaming Analytics?

Each of the above has come up in use cases, and we have implemented them using SQL-like CEP query languages. Knowing the internals of implementing core CEP concepts like sliding windows and temporal query patterns, I do not think every streaming use case developer should rewrite them. The algorithms are not trivial, and they are very hard to get right!

Instead, we need higher-level abstractions. We should implement those once and for all, and reuse them. The best lesson here comes from Hive and Hadoop, which did exactly that for batch analytics. I have explained Big Data with Hive many times, and most people get it right away. Hive has become the major programming API for most Big Data use cases.

The following is a list of reasons for a SQL-like query language.
  1. Realtime analytics is hard. Not every developer wants to hand-implement sliding windows, temporal event patterns, etc.
  2. It is easy to follow and learn for people who know SQL, which is pretty much everybody.
  3. SQL-like languages are expressive, short, sweet, and fast!!
  4. SQL-like languages define core operations that cover 90% of problems.
  5. They let experts dig in when they like!
  6. Realtime analytics runtimes can better optimize execution with a SQL-like model. Most optimisations have already been studied, and there is a lot you can borrow from database optimisations.
Finally, what are such languages? There are many defined in the world of Complex Event Processing (e.g. WSO2 Siddhi, Esper, TIBCO StreamBase, IBM InfoSphere Streams, etc.). SQLstream has a fully ANSI SQL compliant version of this idea. Last week I did a talk at Strata discussing this problem in detail and how CEP could fit the bill. You can find the slide deck below.


Scalable Realtime Analytics with declarative SQL like Complex Event Processing Scripts from Srinath Perera

Following is a video of the talk.


Implementations of streaming SQL can be found in WSO2 Stream Processor, Apache Storm, Apache Flink, and Apache Kafka.