Friday, December 9, 2011

Siddhi: A Second Look at Complex Event Processing Implementations

The following slide deck presents Siddhi, a University of Moratuwa final year project I worked on with Suho, Isuru, Subash, and Kasun last year. Siddhi is a Complex Event Processing implementation that incorporates most of the state-of-the-art advances.

Siddhi: A Second Look at Complex Event Processing Implementations

We presented the Siddhi paper, titled "Siddhi: A Second Look at Complex Event Processing Architectures," at the Gateway Computing Environments Workshop (GCE), Seattle, 2011 (co-located with Super Computing 2011). The above slides provide an outline of our work.

The following three graphs show some of the key results of the paper. The other CEP engine is Esper, which is a well-known open source CEP engine. In the given scenarios we performed from about 0.3 times to 10 times better.

Query 1: Filter (select symbol, price from StockTick(price>6))

Query 2: Window (select irstream symbol, price, avg(price) from StockTick(symbol='IBM').win:time(.005))

Query 3: Followed-by pattern (select f.symbol, p.accountNumber, f.accountNumber from pattern [every f=FraudWarningEvent2 -> p=PINChangeEvent2(accountNumber= f.accountNumber)])
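To make Query 1 concrete, here is a toy sketch in plain Java of what the filter query computes (this is not Siddhi's actual API; the StockTick class and the sample values are illustrative assumptions): it passes through only the events whose price exceeds the threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (not Siddhi's API) of Query 1's semantics:
// "select symbol, price from StockTick(price>6)".
public class FilterQuerySketch {
    // Illustrative event type; Siddhi's real event model differs.
    record StockTick(String symbol, double price) {}

    // Emits only events whose price exceeds the threshold.
    static List<StockTick> filter(List<StockTick> stream, double threshold) {
        List<StockTick> out = new ArrayList<>();
        for (StockTick t : stream) {
            if (t.price() > threshold) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<StockTick> in = List.of(
                new StockTick("IBM", 5.0),
                new StockTick("WSO2", 7.5),
                new StockTick("IBM", 9.1));
        System.out.println(filter(in, 6).size()); // prints 2
    }
}
```

A real CEP engine evaluates such filters incrementally as each event arrives, rather than over a materialized list as this sketch does.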

Siddhi is currently being used by
  • the Los Angeles Smart Grid Demonstration Project, which forecasts electricity demand, responds to peak load events, and improves sustainable use of energy by consumers.
  • the OpenMRS NCD module, where the idea is to detect and notify the patient when certain conditions have occurred.
It is also likely that a future WSO2 Complex Event Processing Server will use Siddhi, that is, if it can continue to do better. The Siddhi code can be found from, and if you are interested in helping, please join us at for discussions.

CSE 2011 Keynote: Distributed Systems: What? Why? And bit of How?

Last week I did the keynote for the annual research conference of the Computer Science and Engineering Department, University of Moratuwa, and following are the slides I used. I had to talk about distributed systems in 30 minutes, and it was an interesting challenge. I will write down some of the things I talked about soon, especially the timeline for some of the distributed systems technologies.


Tuesday, November 8, 2011

WSO2 Con 2011: Data Panel

Following is the recording of the data panel, and you can find the opening I did from here.

Monday, October 31, 2011

Shelan's talk at Telecom 2011

Shelan gave a talk at Telecom 2011, participating as a young innovator. The conference was held in Vienna. His proposal is a medical report management framework based on the Mahasen project, a final year project at the University of Moratuwa to build a scalable storage framework on top of P2P technologies.

Saturday, October 29, 2011

List of Known Scalable Architecture Templates

For most architects, "scale" is the most elusive aspect of software architecture. Not surprisingly, it is also one of the most sought-after goals of today's software design. However, computer scientists do not yet know of a single architecture that can scale for all scenarios. Instead, we design scalable architectures case by case, composing known scalable patterns together and trusting our instincts. Simply put, building a scalable system has become more an art than a science.

We learn art by studying masterpieces, and scale should be no different! In this post I am listing several architectures that are known to be scalable. Often, architects can use these known scalable architectures as patterns to build new scalable ones.
  1. LB (Load Balancers) + Shared-nothing Units - This model includes a set of units that do not share anything with each other, fronted by a load balancer that routes incoming messages to a unit based on some criteria (round-robin, load, etc.). A unit can be a single node or a cluster of tightly coupled nodes. As the load balancer, users can use DNS round-robin, hardware load balancers, or software load balancers. It is also possible to build a hierarchy of load balancers that combines the above. The article "The Case for Shared Nothing Architecture" by Michael Stonebraker discusses these architectures.
  2. LB + Stateless Nodes + Scalable Storage - The classical three-tier Web architecture follows this model. It includes several stateless nodes talking to scalable storage, with a load balancer distributing load among the nodes. In this model, the storage is the limiting factor, but with NoSQL storage, it is possible to build very scalable systems.
  3. Peer-to-Peer Architectures (Distributed Hash Tables (DHT) and Content Addressable Networks (CAN)) - This model provides several classical scalable algorithms, in which almost every aspect scales up logarithmically. Example systems are Chord, Pastry (FreePastry), and CAN. Several NoSQL systems, like Cassandra, are also based on P2P architectures. The article "Looking up data in P2P systems" discusses these models in detail.
  4. Distributed Queues - This model is based on a queue (FIFO delivery) implemented as a network service. It has found wide adoption through JMS queues. Often it is used for task queues, and scalable versions scale out by keeping a hierarchy of queues, where lower levels send jobs to upper levels when they cannot handle the load.
  5. Publish/Subscribe Paradigm - Implemented using network publish/subscribe brokers that route messages to each other. The classical paper "The many faces of publish/subscribe" describes this model in detail, and among the examples of this model are NaradaBroker and EventJava.
  6. Gossip and Nature-inspired Architectures - This model follows the idea of gossip in everyday life: each node randomly picks and exchanges information with fellow nodes. Just as in real life, gossip algorithms spread information surprisingly fast. Another aspect of this is biology-inspired algorithms. The natural world has remarkable algorithms for coordination at scale; for example, ants, flocks, and bees are capable of coordinating in a scalable manner with minimal communication, and these algorithms borrow ideas from such phenomena. The paper "From epidemics to distributed computing" discusses these models.
  7. MapReduce / Data Flows - First introduced by Google, MapReduce provides a scalable pattern for describing and executing jobs. Although simple, it is the primary pattern in use for OLAP use cases. Data flows are a more advanced approach to expressing executions, and projects like Dryad and Pig provide scalable frameworks to execute them. The paper "MapReduce: Simplified data processing on large clusters" provides a detailed discussion of the topic, and Apache Hadoop is an implementation of this model.
  8. Tree of Responsibility - This model breaks the problem up recursively and assigns it to a tree, with each parent node delegating work to its children. This model is scalable and is used within several scalable architectures.
  9. Stream Processing - This model is used to process data streams, i.e., data that keeps coming. This type of processing is supported through a network of processing nodes (e.g., Aurora, Twitter Storm, Apache S4).
  10. Scalable Storage - Ranges from databases, NoSQL storage, and service registries to file systems. The following article discusses their scalability aspects.
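As an illustration of pattern 3, following is a minimal consistent-hash ring of the kind DHTs such as Chord build on. The node names and the MD5-based hash are my own illustrative choices, not any particular system's: each key is owned by the first node clockwise from its hash, so a node joining or leaving only moves nearby keys, and no global coordination is needed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring sketch in the spirit of DHTs (pattern 3).
public class ConsistentHashRing {
    // Ring positions -> node names, sorted so we can walk "clockwise".
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Illustrative hash: first four bytes of the key's MD5 digest.
    static int hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                 | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    void addNode(String node) { ring.put(hash(node), node); }

    // The owner of a key is the first node at or after the key's hash,
    // wrapping around to the first node on the ring if none is found.
    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing();
        r.addNode("node-1"); r.addNode("node-2"); r.addNode("node-3");
        System.out.println(r.nodeFor("user-42"));
    }
}
```

Production systems add virtual nodes (many ring positions per physical node) to even out the load; this sketch omits that for brevity.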
Having said that, there are only three ways to scale: distribution, caching, and asynchronous processing. Each of the above architectures uses a combination of those in its implementation. On the other hand, the scalability killer, apart from bad coding, is global coordination. Simply put, any kind of global coordination will limit the scalability of your system. Each of the aforementioned architectures achieves local coordination instead of global coordination.

However, combining them to create a scalable architecture is not at all a trivial undertaking. Unless you are trying to discover a new scalable pattern, it is often a better idea to solve problems by adopting known scalable solutions than to come up with a new architecture from first principles.

Wednesday, October 26, 2011

InfoQ Article: Finding the Right Data Solution for Your Application in the Data Storage Haystack

The article "Finding the Right Data Solution for Your Application in the Data Storage Haystack" just went live on InfoQ and you can also find the slide deck of the NoSQL Now talk in this earlier post.

I firmly believe that the reason scale and some other problems are so hard is that we are too lazy to consider specific cases and analyze them in detail. Instead, we try to find general answers that work everywhere. For example, I cannot find a taxonomy of computer science use cases/domains anywhere (I will write about this in a later post).

The article takes four parameters about an application/use case, then takes some 40+ cases that arise from different combinations of those parameters and makes concrete recommendations for each case from the storage-solution haystack (e.g., local memory, relational, files, distributed cache, column family storage, document storage, name-value pairs, graph DBs, service registries, queues, and tuple spaces).

It is intended as a guide to choosing the right storage (SQL or NoSQL) solution for a given use case.

The four parameters are
  • Types of data stored (structured, unstructured, semi-structured)
  • Scalability requirements (small 1-4, medium 10s, and very large 100s)
  • Nature of data retrieval (i.e., types of queries: key lookup, WHERE, JOIN, offline)
  • Consistency requirements (ACID, single atomic operation, loose consistency)
I do not consider the second part done in any way. I am sure there is a lot to argue about and analyze there, and please let me know if you have any thoughts on this.

Tuesday, October 18, 2011

Beautiful Colombo

Great slide deck on Colombo! There is a saying that the frog living in the pond never knows the fragrance of the lotus, while the bee from the forest comes to savor it ;)

Friday, October 14, 2011

Why Do We Need Multi-tenancy?

This is an excerpt from my paper "A Multi-tenant Architecture for Business Process Execution," which was presented at the 9th International Conference on Web Services (ICWS), 2011, with some changes to give context.

This is related to Sumedha's blog and Larry Ellison's comment on multi-tenancy (MT). Similar ideas were also presented in my WSO2 Con talk, Multi-tenancy: Winning Formula for a PaaS.

What is Multi-tenancy? 

The idea is that the same server instance can support multiple tenants. In other words, it gives each user the illusion that he has his own server/app while, actually, the server is shared among many. MT enables hosting organizations to mix and match heavily used and lightly used tenants together, thus enabling them to run the overall infrastructure with far fewer resources.

It is best compared to an apartment complex, where the owner of each apartment (tenant) thinks of it as his own home, while the apartment complex shares resources like real estate, plumbing, ventilation, security, etc. The idea is to provide isolation while achieving maximum sharing.

Why Multi-tenancy? 

1. Each VM runs its own OS, while with MT the sharing happens at a much higher level than VMs, thus enabling better resource sharing.
2. Supporting Pay as you go within a Cloud platform. 

Let me explain this in a bit more detail. Cloud platforms, SaaS, and PaaS have the "pay-as-you-go" model as a key assumption. That is, users can ask for resources and use them only when they need them, and should be able to release resources when they do not. If this assumption holds, applications should cost the user almost nothing while not being used. Therefore, to support the pay-as-you-go model, SaaS or PaaS middleware should be able to support applications owned by many users (we will call them tenants) within the same server while allocating resources on demand.

It is possible to do this through IaaS, where one can run a VM per user. However, often many of the applications and users are not active (in use). For example, if a hosting provider has 10,000 tenants and only a few hundred are in use at a given time, then running a VM for each is a waste. Since booting up a VM takes time and often does not complete fast enough to serve the first request, keeping VMs on disk and booting them on demand is often not practical.

With MT, the cloud provider can handle this by allocating tenants to servers based on projected load (e.g., the cloud provider can offer different classes of QoS). The provider can place many rarely used tenants in the same server using MT, thus reducing the cost.
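To illustrate, following is a toy sketch (my own, not WSO2's actual placement logic) of load-based tenant placement: each tenant goes to the currently least-loaded server, so many lightly used tenants end up packed together on a few servers. The tenant names and projected-load numbers are purely illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy greedy sketch of load-based tenant placement (illustrative only).
public class TenantPlacement {
    static class Server {
        final List<String> tenants = new ArrayList<>();
        double load = 0;
    }

    // Assigns each tenant to the currently least-loaded server.
    static List<Server> place(List<String> tenants, List<Double> projectedLoad,
                              int serverCount) {
        PriorityQueue<Server> byLoad =
                new PriorityQueue<>(Comparator.comparingDouble((Server s) -> s.load));
        List<Server> all = new ArrayList<>();
        for (int i = 0; i < serverCount; i++) {
            Server s = new Server();
            all.add(s);
            byLoad.add(s);
        }
        for (int i = 0; i < tenants.size(); i++) {
            Server s = byLoad.poll();      // least-loaded server right now
            s.tenants.add(tenants.get(i));
            s.load += projectedLoad.get(i);
            byLoad.add(s);                 // re-insert with its updated load
        }
        return all;
    }

    public static void main(String[] args) {
        // One heavy tenant and three light ones on two servers:
        // the light tenants end up sharing a server.
        for (Server s : place(List.of("t1", "t2", "t3", "t4"),
                List.of(0.9, 0.1, 0.1, 0.1), 2)) {
            System.out.println(s.tenants + " load=" + s.load);
        }
    }
}
```

A real provider would also rebalance on observed (not just projected) load and migrate tenants at runtime; this greedy pass only covers initial placement.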

However, there may be other ways to implement pay-as-you-go, and I would love to hear about them.

On a final note, implementing MT is not easy by any means, and it takes some thinking and hard work. The main challenges are data isolation, execution isolation, and performance isolation. I will talk about them more in a later blog post. Meanwhile, the following papers talk about how WSO2 implemented some of them.

  1. A. Azeez and S. Perera et al., WSO2 Stratos: An Industrial Stack to Support Cloud Computing, IT: Methods and Applications of Informatics and Information Technology Journal, the special Issue on Cloud Computing, 2011.
  2. Milinda Pathirage, Srinath Perera, Sanjiva Weerawarana, Indika Kumara, A Multi-tenant Architecture for Business Process Execution, 9th International Conference on Web Services (ICWS), 2011 
  3. Paul Fremantle, Srinath Perera, Afkham Azeez, Sameera Jayasoma, Sumedha Rubasinghe, Ruwan Linton, Sanjiva Weerawarana, and Samisa Abeysinghe. Carbon: towards a server building framework for SOA platform. In Proceedings of the 5th International Workshop on Middleware for Service Oriented Computing (MW4SOC '10). ACM, New York, NY, USA, 7-12. DOI=10.1145/1890912.1890914, 2010
  4. Afkham Azeez, Srinath Perera, Dimuthu Gamage, Ruwan Linton, Prabath Siriwardana, Dimuthu Leelaratne, Sanjiva Weerawarana, Paul Fremantle, Multi-Tenant SOA Middleware for Cloud Computing 3rd International Conference on Cloud Computing, Florida, 2010

Monday, October 10, 2011

ICWS Paper: A Multi-tenant Architecture for Business Process Execution

Following are the slides for the paper "A Multi-tenant Architecture for Business Process Execution" that I presented at ICWS 2011 last July. The paper discusses in detail our work on extending multi-tenancy support to business processes, and the discussed technology is now in use within WSO2 Business Process Server and WSO2 Stratos.

Milinda Pathirage, Srinath Perera, Sanjiva Weerawarana, Indika Kumara, 
A Multi-tenant Architecture for Business Process Execution, 9th International Conference on Web Services (ICWS), 2011


Cloud computing, as a concept, promises cost savings to end-users by letting them outsource their non-critical business functions to a third party in pay-as-you-go style.
However, to enable economic pay-as-you-go services, we need Cloud middleware that maximizes sharing and supports near-zero costs for unused applications. Multi-tenancy, which lets multiple tenants (users) share a single application instance securely, is a key enabler for building such middleware. On the other hand, business processes capture the business logic of organizations in an abstract and reusable manner, and hence play a key role in most organizations. This paper presents the design and architecture of a multi-tenant workflow engine while discussing in detail potential use cases of such an architecture.
The primary contributions of this paper are motivating workflow multi-tenancy, and the design and implementation of a multi-tenant workflow engine that enables multiple tenants to run their workflows securely within the same workflow engine instance without modifications to the workflows.

Thursday, September 22, 2011

Writing Your First Thrift Service

Thrift provides a binary RPC protocol for service invocations, and it provides toolkits to generate Thrift bindings for several languages, including Java, C, C++, etc. Then, just as with Web Services, the Thrift service invocations agree on the wire format, enabling implementations in multiple programming languages to talk to each other.

If you remember CORBA, it was the same thing, except that it was a nightmare to write a service using CORBA tools.

Why am I interested in Thrift? The simple answer is that it is fast and has a lot of traction.

Luckily, writing a Thrift service is pretty easy. Following is how I did it.

  1. Download and build Thrift from; look at the README for instructions. However, I had to disable the Erlang binding while running ./configure (--without-erlang).
  2. The first step is to write the Thrift IDL. Mine looked like the following. is the best source to learn how to write a Thrift IDL.
    namespace java Test

    struct Tuple {
      1: list<string> tuples,
    }

    service Bissa {
      void put(1:Tuple tuple),
      list<Tuple> read(1:string pattern),
      list<Tuple> take(1:string pattern)
    }
  3. Then I ran Thrift to generate code. The command looked like thrift --gen java bissa.thrift.
  4. The above created a service class and the types defined in the IDL. Then I wrote a server that uses the generated code, and it looked like the following. The sample class can be found from.
    // This is a class that implements the Bissa.Iface interface (the class generated by Thrift to represent the service)
    BissaThriftServer handler = new BissaThriftServer(bissa);
    // Then we initialize a processor, passing that handler (implementation)
    Bissa.Processor processor = new Bissa.Processor(handler);
    TServerTransport serverTransport = new TServerSocket(thriftPort);
    // Use this for a multithreaded server
    TServer server = new TThreadPoolServer(new TThreadPoolServer.Args(serverTransport).processor(processor));
    System.out.println("Starting the simple server... on port " + thriftPort);
    // Start serving requests (this call blocks)
    server.serve();
  5. The client looked like the following.
    TTransport transport = new TSocket("localhost", 9092);
    // Open the transport before use
    transport.open();
    TProtocol protocol = new TBinaryProtocol(transport);
    // The following is generated code
    Bissa.Client client = new Bissa.Client(protocol);
    // Now you have a stub; use it
    client.put(BissaThriftServer.createTuple("A", "B"));
  6. You might also find this blog useful. 

Saturday, September 17, 2011

Data, Data Everywhere and Challenges

I had the pleasure and privilege of moderating the Data Panel at WSO2 Conference 2011, which was composed of the distinguished panelists Sumedha Rubasinghe, C. Mohan, and Gregor Hohpe. Obviously, I did my homework for the panel and gave some thought to what to say. I felt at the end that I should write down the opening I did. So, here we go.

Let me start with a quote from Tim Berners-Lee, the inventor of the World Wide Web. He said, "Data is a precious thing because they last longer than systems". For example, if you take companies like Google or Yahoo, you will find many who argue that the data those companies have collected over their operation is indeed the most important asset they have. That data gives them the power to either optimize what they do or to go in new directions.

If you look around, you will see that there is so much data available. Let me try to touch on a few types of data.

  1. Sensors – Human activities (e.g. near field communication), RFID, Nature (Weather), Surveillance, Traffic, Intelligence etc. 
  2. Activities in World Wide Web
  3. POS and transaction logs 
  4. Social networks
  5. Data collected by governments, NGOs etc. 

The paper Miller, H.J., "The Data Avalanche is here, Shouldn't we be digging?", Journal of Regional Science, 2010, is a nice discussion of the subject.

Data comes in many shapes and forms. Some of it is moving data, or data streams, while other data is at rest; some is public, some is tightly controlled; some is small, and some is large.

Think about a day in your life, and you will realize how much data is around you: data that you know is available but is very hard to access or process. For example, do you know the distribution of your spending? Why is it so hard to find the best deal when buying a used car? Why can't I find the best route to drive right now? The list goes on and on.

It is said that we are drowning in an ocean of data, and making sense of that data is considered to be the challenge of our time. Come to think of it, Google has made a fortune by solving a seemingly simple problem: content-based search. There are so many companies that either provide data (e.g., maps, best deals) or provide add-on services on top of data (e.g., analytics, targeted advertising).

As I mentioned earlier, we have two types of data. First, moving data are data streams, and users want to process them in near real time to either adapt themselves (e.g., monitoring the stock market) or control the outcome (e.g., battlefield observations or logistics management). The second is data at rest. We want to store it, search it, and then process it. This processing is done either to detect patterns (fraud detection, anti-money laundering, surveillance) or to make predictions (e.g., predict the cost of a project, predict natural disasters).

So broadly we have two main challenges.
  1. How to store and query data in a scalable manner? 
  2. How to make sense of data (how to run the transformations data -> information -> knowledge -> insight)?
And there are many other challenges; following are some of them.

  1. Supporting semantics. This includes extracting semantics from data (e.g., using heuristics-based AI systems or statistical methods) and supporting efficient semantics-based queries.
  2. Supporting multiple representations of the same data. Is converting on demand the right way to go, or should we standardize? Is standardization practical?
  3. Master data management: making sure all copies of data are updated, and any related data is identified, referenced, and updated together.
  4. Data ownership, delegation, and permissions.
  5. Privacy concerns: unintended uses of data and the ability to correlate too much information.
  6. Exposing private data in a controlled manner.
  7. Making data accessible to all intended parties, from anywhere, anytime, from any device, through any format (subject to permissions).
  8. Making close-to-real-time decisions with large-scale data (e.g., targeted advertising); in other words, how to make analytical jobs faster.
  9. Distributed frameworks and languages to process large data processing tasks. Is MapReduce good enough? What about other parallel problems?
  10. The ability to measure the confidence associated with results generated from a given set of data.
  11. Making decisions in the face of missing data (e.g., lost events). Regardless of the design, some of the data will be lost while monitoring the system. Decision models then still have to work, and be able to ignore or interpolate the missing data.
I am not trying to explain the solutions here, but hopefully I will write future posts about the state of the art for some of these challenges.

Tuesday, September 13, 2011

Multi-tenancy: Winning formula for a PaaS

Following is the slide deck I presented at the WSO2 Conference today. It provides a detailed discussion of what multi-tenancy is, why it is needed, and details about potential implementations.


Wednesday, August 24, 2011

NoSQL Now Talk: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Following are the slides for my NoSQL Now talk. The talk walks through different NoSQL choices and makes concrete recommendations on which one should be used when. At the end of the day, it provides tables that give you the choices directly. You can find the abstract from here.

The following table depicts the key idea of the talk. It provides data store recommendations for a given use case (application) based on three properties. This only covers structured data; the slides have tables for unstructured and semi-structured data. The three properties are
  1. the type of search needed by the application (different columns, colored blue)
  2. the amount of scale needed by the application (colored green)
  3. the amount of consistency required by the application (colored brown)
Here we are presenting 3D data using a 2D table. The repeated columns under each scale column set take care of that. For example, "DB" in the fourth row, fourth column says "if you need WHERE-clause-like search, with small scale and transactions, use a DB". The other cells use the same idea.

The notation I use is KV: key-value systems, CF: column families, Doc: document-based systems. Question marks in the table mean it might work, but you should verify. The table only gives the recommendations and does not say exactly how I came up with them. More details are in the slides, and I will get a writeup out soon. Some of the key ideas are

  1. Transactions and joins do not scale well.
  2. KV scales the most, then the CF and Doc models, then DB. So if KV is good enough, go for that.
  3. The offline case has time to do MapReduce and walk through the data.
  4. If you need transactions or joins with scale, you have to try partitioned DBs. But you have to try and see, and it might not work either. If it does not, you are out of luck.
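To show how such a table can be used mechanically, following is a toy lookup in Java. Only the (WHERE, small scale, transactions) -> DB entry comes from the example in the text above; the other entries are my own illustrative guesses based on the key ideas listed, not cells copied from the actual slides.

```java
import java.util.Map;

// Toy encoding of a (query type, scale, consistency) -> storage table.
// Only the "where|small|transactions" entry is from the post; the rest
// are illustrative placeholders consistent with the stated key ideas.
public class StorageChooser {
    static final Map<String, String> TABLE = Map.of(
            "where|small|transactions", "DB",
            "where|small|atomic-op", "DB",
            "key-lookup|highly-scalable|atomic-op", "KV/CF",
            "offline|highly-scalable|loose", "CF + MapReduce");

    static String recommend(String query, String scale, String consistency) {
        return TABLE.getOrDefault(query + "|" + scale + "|" + consistency,
                "no direct recommendation - check the slides");
    }

    public static void main(String[] args) {
        // The example from the text: WHERE-style search, small scale, transactions.
        System.out.println(recommend("where", "small", "transactions")); // prints DB
    }
}
```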

The scale classes are Small (1-3 nodes), Scalable (10 nodes), and Highly Scalable (1000s of nodes), each subdivided by operation-level consistency. For example, the "Primary Key" row reads:
  • Small (1-3 nodes): DB / CF / Doc
  • Scalable (10 nodes): DB / CF / Doc
  • Highly Scalable (1000s of nodes): CF / Doc (?)

Survey of System Management Frameworks

Following are a couple of tables I took out of a survey I did on system management frameworks. A detailed writeup of this survey is in my thesis, under related work. I want to write a survey paper on this, but have not gotten to it yet.

Let's start with a brief outline of the rationale for the choice of metrics used to compare different implementations. System management frameworks monitor a system, make decisions on how to fix things, and execute those decisions. They often consist of three parts: sensors, a brain that makes decisions, and a set of actuators to execute the decisions. From a 10,000-foot view, most architectures follow this model, but they differ in how they collect data, make decisions, and execute them. The first table compares them on this aspect.

System management frameworks themselves have to scale and provide high availability. To do so, they have to be composed of multiple servers (managers). Making decisions with multiple managers (coordination) is one of the key challenges in system management framework design. The second table compares and contrasts different coordination models using different architectural styles.

The third table discusses decision models, in other words, the implementation of the brain. More details can be found from.

Systems based on Functionality/Design

  • Sensors -> Brain -> (Actuators)?: InfoSpect (Prolog), WRMFDS (AI), Rule-based Diagnosis (hierarchy) (monitoring & diagnosis); Ref Archi, Autopilot, Policy2Actions (centralized); JADE, JINI-Fed, ReactiveSys (managers); Tivoli (managers + manual coordinator)
  • Sensors -> Gauges -> Brain -> (Actuators)?: Paradyn MRNet (monitoring only); Rainbow (Archi Model*), ACME-CMU, CoordOfSysMngServices, Code, eXtreme (KX), Galaxy* (managers)
  • Sensors -> DataModel -> Brain -> Actuators: DealingWithScale (CIM), Marvel (centralized); Scalable Management (PLDB/managers), Nester (distributed directory service), SelfOrgSWArchi
  • Sensors -> ECA -> Actuators: DREAM, SmartSub, IrisLog, HiFi, Sophia (Prolog) (managers)
  • Management hierarchy: NaradaBrokerMng, WildCat, BISE, ReflexEngine
  • Queries on demand: ACME, Decentralizing Network Management, Mng4P2P (managers)
  • Workflow systems / resource management: Unity, RecipeBased (centralized); SelfMngWFEngine, Automate Accord, ScWAreaRM, P2PDesktopGrid (distributed)
  • Fault-tolerant systems: WSRF-Container, GEMS (distributed)
  • Decentralized: DMonA, Guerrilla, K-Components
  • Deployment frameworks: SmartFrog, Plush, CoordApat
  • Monitoring: Inca (centralized), PIPER (multicast over DHT), MDS (LDAP, events), NWS (LDAP), Scalable Info Management, Astrolabe (gossip), Monitoring with Accuracy, RGMA (DB-like access)

Systems based on Coordination and Communication model

The coordination models compared are: monitoring only; one manager without coordination; one manager with coordination; managers without coordination; and managers with coordination. The systems, grouped by communication model, are:
  • Pull / events: InfoSpect; Rainbow, Unity, Ref Archi, Autopilot, InfraMonMangGrid; CoordOfSysMngServices; JADE; DMonA, Tivoli (manual)
  • Pub/sub hierarchy: DealingWithScale, ACME-CMU; Scalable Management, DREAM, SmartSub; NaradaBrokerMng
  • P2P: PIPER; Automate Accord, Mng4P2P, WSRF-Container
  • Hierarchy: Ganglia, Globus MDS, Paradyn MRNet, WRMFDS; IrisLog, HiFi, JINI-Fed, eXtreme (KX), ScWAreaRM; WildCat
  • Gossip: Astrolabe, GEMS; Galaxy
  • Spanning tree (network/P2P): Scalable Info Management, ACME
  • Group communication: ReactiveSys, Galaxy; SelfOrgSWArchi
  • Distributed queue: SelfMngWFEngine

Decision Models in management Systems

Rules Conflict resolution Verifier Batch Mode used? Meta-model used? Planning
DIOS++ If/then yes, use priority No results of rules applied in next iteration Yes No
Rainbow If/then No Yes No
InfoSpect Prolog Like Only Monitoring No No Yes No
Marvel(1995) PreCnd->action->PostCond Yes No Yes No
CoordOfSysMngServices Yes (Detect ->Human Help) Yes Yes No Yes
Sophia Prolog like No No No
RecipeBased Java Code Yes No No No
HiFi, DREAM pub/sub Filter->action No No No No No
IrisLog DB triggers No No No No No
ACME (timer/sensor/completion)
No No possible with timer conditions Yes No
ReactiveSys (1993) if/then No No No Yes No
Policy2Actions (2007) Policy- name, conditions (name of method to run + parameters), actions, target component Yes, based on runtime state + history There are tests associated with each actions that decide should action need to run - No Yes

Friday, August 19, 2011

WSO2 Stratos in contrast to Other Java PaaS Offerings

In the article "Java PaaS shootout," Michael J. Yuan provides a pretty nice comparison of Google App Engine, Amazon Elastic Beanstalk, and CloudBees RUN@Cloud. The following table provides a summary of that while adding WSO2 Stratos to the comparison.

What is it?
  • App Engine: Users can upload servlets; App Engine hosts and manages them.
  • Amazon Beanstalk: Managed Tomcat.
  • CloudBees Run@Cloud: Tomcat plus a load balancer, integrated with SVN; you can change the source code and update all deployment aspects.
  • WSO2 Stratos: SOA platform as a service; fully multi-tenant.

Java support
  • App Engine: Yes, but does not support some I/O and network operations.
  • Amazon Beanstalk: Full Java.
  • CloudBees Run@Cloud: Full Java.
  • WSO2 Stratos: Yes, but file access is limited.

Outbound connections
  • App Engine: Time out in 10 seconds.

Support for standard Java libraries
  • App Engine: Has problems when they use unsupported APIs.
  • WSO2 Stratos: Yes (the Java security manager limits file access).

Performance and scalability
  • App Engine: Auto-scales with high scalability, but has somewhat high latency; swapping the app out might slow down the first request.
  • Amazon Beanstalk: Auto-scales by creating EC2 instances.
  • CloudBees Run@Cloud: Can swap unused processes out of the JVM; can load balance multiple Tomcats in the same EC2 instance.
  • WSO2 Stratos: Can lazy-load services and other artifacts; auto-scales (up and down) by monitoring the load and creating new nodes, with the LB routing the requests.

Data storage
  • App Engine: Supports BigTable and hosted MySQL; however, search support in the BigTable case is limited (e.g., each query can only have 100 results).
  • Amazon Beanstalk: Supports RDS (relational), SimpleDB (NoSQL), or you can run your own DB.
  • CloudBees Run@Cloud: Has managed MySQL databases and provides a console to manage them.
  • WSO2 Stratos: Supports Cassandra as a service, managed MySQL, and HDFS; Cassandra and HDFS support native multi-tenancy.

Import/export data
  • App Engine: No (hard due to the 30-second limit).
  • Amazon Beanstalk: Can write code to automate.
  • CloudBees Run@Cloud: Can write code to automate.
  • WSO2 Stratos: Can write code to automate.

Integration with others
  • App Engine: Integrates well with other Google services.
  • Amazon Beanstalk: SQS, SES (email service), payment APIs.
  • CloudBees Run@Cloud: S3, SQS, SES, etc.
  • WSO2 Stratos: With the Google auth model and other WSO2 services; also S3, SQS, SES, etc.

Session handling
  • App Engine: Stores sessions to storage and handles them seamlessly.
  • Amazon Beanstalk: Only sticky sessions.
  • CloudBees Run@Cloud: Transparent session management.
  • WSO2 Stratos: Only sticky sessions.

Let me also list a couple of key differentiators of WSO2 Stratos.

  1. All these offerings support Web App Hosting as a Service. WSO2 Stratos supports that, but provides much more: in addition to web app hosting, it supports hosting Axis2-based services, mediation, and workflows as a service. It is a real SOA Platform as a Service, and the only one that does that.
  2. It lets you move your Axis2-based web services (.aar files) and workflows to the cloud (to WSO2 Stratos Live) without any change. If you have Axis2-based services, chances are that you can upload them to WSO2 Stratos and they will just work.
  3. WSO2 Stratos provides real multi-tenancy support. That is, each tenant thinks it has its own server, while actually all tenants are served from one Java server. In other words, isolation is done at the Java level, not at the virtualization level. That means it can provide greater sharing, and supports "pay as you go" and "pay for what you use" better than a VM-based model. Of the other three PaaS offerings, only App Engine does that. More details are in the following papers.
    • A. Azeez, S. Perera, D. Gamage, et al., "Multi-tenant SOA Middleware for Cloud Computing," in 2010 IEEE 3rd International Conference on Cloud Computing, 2010, pp. 458–465.
    • A. Azeez, S. Perera, et al., "WSO2 Stratos: An Industrial Stack to Support Cloud Computing," IT: Methods and Applications of Informatics and Information Technology Journal, special issue on Cloud Computing, 2011.
    • M. Pathirage, S. Perera, S. Weerawarana, and I. Kumara, "A Multi-tenant Architecture for Business Process Execution," in 9th International Conference on Web Services (ICWS), 2011.

What can you do with WSO2 Platform?

We have explained the WSO2 platform in many ways. Here I am taking a more scenario-driven approach.

1. Implementing Business Logic

If you are looking to build a SOA-based architecture, the first step is to implement your business logic as a set of services. Users can use WSO2 Application Server (AS) to do this, and AS enables users to implement business logic using any of the following methods.
  1. Using Java code (WSO2 AS; any Axis2 .aar file works)
  2. Exposing data in a data source (e.g. a relational database, CSV file, etc.; see WSO2 Data Services Server)
  3. Using business rules (supports Drools; see WSO2 BRS)
  4. Using a scripting language (e.g. JavaScript, Python; see WSO2 AS)
Once the services are created, users can secure them using HTTPS or WS-Security. Furthermore, they can enable most of the WS-* support using WSO2 Application Server.
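To make option 1 concrete: with Axis2, business logic can be as simple as a plain Java class whose public methods become web service operations once it is packaged into a .aar archive (with a services.xml descriptor). The class, method, and values below are hypothetical; this is only a sketch of the shape such a service takes.

```java
// A minimal sketch of business logic that Axis2 (and hence WSO2 AS) can
// expose as a web service. Packaged into a .aar file, each public method
// becomes a service operation. All names and values here are illustrative.
public class ExchangeRateService {

    // Returns a conversion rate for the given currency pair.
    // A real service would look up live data; this one is hard-coded.
    public double getRate(String from, String to) {
        if ("USD".equals(from) && "LKR".equals(to)) {
            return 110.0; // hypothetical rate, for illustration only
        }
        throw new IllegalArgumentException("Unsupported pair: " + from + "/" + to);
    }
}
```

The point is that the class has no Axis2-specific code at all; deployment descriptors and the container handle the SOAP plumbing.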

2. Mediating Messages

Often a SOA deployment has third-party services, legacy applications, and other kinds of friction that force the architecture to mediate the messages that flow between services. Users can use WSO2 Enterprise Service Bus (ESB) to do this. Mediation can take several forms (there are others; the following are only some of them).
  1. Message transformation
  2. Message editing
  3. Message filtering
  4. Routing (e.g. load balancing, pub/sub)
  5. Support for Enterprise Integration Patterns (EIP)
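Conceptually, a mediation flow is a chain of steps applied to each message. The plain-Java sketch below only illustrates that idea; in WSO2 ESB these steps are declared as mediators in its XML configuration rather than written as code, and all names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// A conceptual sketch of message mediation: each step transforms or
// filters a message payload. Illustrative only; not a WSO2 ESB API.
public class MediationSketch {

    // Applies a small mediation chain: transform, then filter.
    // Returns the mediated payload, or null if the message was dropped.
    public static String mediate(String payload) {
        List<UnaryOperator<String>> chain = new ArrayList<>();

        // 1. Transformation: rewrite an old element name to a new one.
        chain.add(p -> p.replace("<oldTag>", "<newTag>"));

        // 2. Filtering: drop messages that lack the expected element.
        chain.add(p -> p.contains("<newTag>") ? p : null);

        for (UnaryOperator<String> mediator : chain) {
            payload = mediator.apply(payload);
            if (payload == null) return null; // message filtered out
        }
        // 3. Routing would happen here: pick an endpoint based on content.
        return payload;
    }

    public static void main(String[] args) {
        System.out.println(mediate("<oldTag>42")); // prints "<newTag>42"
    }
}
```

The value of an ESB is that such chains are configured declaratively and changed without touching the services on either side.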

3. Creating the User Interface Layer

Often these services have user-facing components. The WSO2 platform provides two choices for building them.
  1. Java web applications (.war files) in WSO2 AS
  2. Google Gadget support in WSO2 Gadget Server

4. Composing Services

If the target application is complicated, especially when service execution flows are dynamic and themselves contain business logic, users may opt to compose those services into business processes. The WSO2 platform provides three choices for doing this.
  1. Create the process as a BPEL document, then deploy and run it using WSO2 Business Process Server
  2. Create the process as a mediation sequence using WSO2 ESB
  3. Create the process using JavaScript with WSO2 Mashup Server

5. Scaling

If some services receive loads that exceed single-node capacity, users will want to scale the system by clustering those bottleneck services. The exact details are use-case specific, but the WSO2 platform supports clustering through Axis2 clustering.

Cross-Cutting Concerns

In addition, the WSO2 platform supports two cross-cutting aspects: governance and business activity monitoring.

1. Govern SOA

If a SOA deployment has many services and they go through frequent business logic and configuration changes, ad-hoc management of the deployment can become very complex. WSO2 Governance Registry lets users store all service WSDLs in a single repository and manage their configurations and lifecycles from a single point.

2. Monitoring the Business

Most businesses need a detailed, close to real-time view of their activities. WSO2 Business Activity Monitoring (BAM) lets users define trace points in activities (service executions). At runtime, BAM collects data at those trace points and presents it to end users through gadgets that aggregate and visualize it.

Tuesday, August 16, 2011

Useful data sets

If you are doing a performance test, it is always a good idea to use a real dataset. Following are several useful datasets.

Often you can find the data in CSV format, and then parsing and using it is pretty easy.
  1. DBLP catalog - this is data about publications, about 900MB raw.
  2. Google Fusion Tables - this has several useful datasets as CSV.
  3. Federal Reserve economic data -
  4. Amazon public datasets -
There are a lot more. If anyone knows of a list giving the sizes and nature of datasets, please let me know.
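As noted above, once a dataset is in CSV, loading it is straightforward. A minimal Java sketch (the record below is made up; a naive split does not handle quoted fields containing commas, so use a proper CSV library for messy data):

```java
import java.util.Arrays;
import java.util.List;

// A minimal sketch of parsing one CSV record. Illustrative only: this
// naive split breaks on quoted fields that contain commas.
public class CsvSketch {

    public static List<String> parseLine(String line) {
        // limit -1 keeps trailing empty fields instead of dropping them
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        // A hypothetical record, e.g. one row of an economic time series.
        List<String> fields = parseLine("2011-08-16,GDP,15.2");
        System.out.println(fields); // prints [2011-08-16, GDP, 15.2]
    }
}
```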

Monday, August 8, 2011

Offset Property in any WSO2 Server

If you want to run multiple WSO2 servers (of any type, e.g. ESB, AS, GReg, etc.), you can easily switch the ports using the Ports/Offset property in repository/conf/carbon.xml. When a WSO2 server starts, it listens on several ports; the offset moves all of them by the given value, which is very handy when you play with things.
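For example, an offset of 1 shifts every port the server opens by one, so the default 9443 HTTPS management console port becomes 9444. A sketch of the relevant element (the value here is illustrative):

```xml
<!-- repository/conf/carbon.xml (only the relevant element shown) -->
<Ports>
    <!-- All ports the server opens are shifted by this value.
         With Offset 1, e.g. the default 9443 console port becomes 9444. -->
    <Offset>1</Offset>
</Ports>
```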

It does not make a huge difference in a production deployment, but it removes a major pain point when you are playing with stuff.

IESL Talk Series: Apache System Projects in the Real World

I did a public talk on "Apache System Projects in the Real World" at the IESL talk series organized by the IT & Communications Sectional Committee of the Institution of Engineers, Sri Lanka (IESL). Following are the slides. This is an updated version of my earlier talk at ApacheCon Roadshow 2009. The talk discussed open source, Apache, and the LEAD project. The LEAD project is a heavy user of Apache projects like Axis2, ODE, etc., and it has pushed them to the limit and contributed back. At the end, the LEAD project itself was donated to Apache as "Apache Airavata". Therefore, I believe the LEAD project is indeed a great example of Apache Web Services in the real world.
