Wednesday, August 24, 2011

NoSQL Now Talk:Finding the Right Data Solution for Your Application in the Data Storage Haystack

Following are the slides for my NoSQL Now talk. Talk walk through different NoSQL choices and make concreate recommendations on which one to should be used when. End of the day, it provides tables that give you the choices directly. You can find the abstract from here.

Following table depicts the key idea of the talk. The table provide data store recommendations for given usecase (application) based on three properties. This only covers structured data, and slides have tables for unstructured and semi-structured data. Thee properties are

type of Search needed by the application(Different Columns, colored Blue)
amount of Scale needed by the application (colored green)
amount of consistency required by the application (colored brown)

Here we are presenting 3D data using a 2D table. Repeated columns under each type of scale column sets takes care of that. For example, "DB" in the forth row forth column says "if you need where clause like search, with small scale and transactions, use a DB". Other cells use the same idea.

The notation I use is KV: Key-Value Systems, CF: Column Families, Doc: document based Systems. Questions marks in the table means, it might work, but you should verify. The table only put the recommendations and does not exactly say how I come up with the recommendations. More details are in the slides, and I will get out a writeup soon. Some of the key ideas are

Transactions and Joins does not scale great
KV scale most, then CF and Doc models, then DB. So if KV is good enough, go for that.
Offline case have time to do MapReduce and walk through the data.
If you need transactions or Joins with scale, you have to try partitioned DBs. But you have to try and see, and it might not work either. If it does not, you are out of luck.

	Small (1-3 nodes)			Scalable (10 nodes)			Highly Scalable (1000s nodes)
	Loose Consistency	Operation Consistency	ACID Transactions	Loose Consistency	Operation Consistency	ACID Transactions	Loose Consistency	Operation Consistency	ACID Transactions
Primary Key	DB/ KV/ CF	DB/ KV/ CF	DB	KV/CF	KV/CF	DB?	KV/CF	KV/CF	No
Where	DB/ CF/Doc	DB/ CF/Doc	DB	CF/Doc(?)	CF/Doc (?)	DB?	CF/Doc	CF/Doc	No
JOIN	DB	DB	DB	??	??	??	No	No	No
Offline	DB/CF/Doc	DB/CF/Doc	DB/CF/Doc	CF/Doc	CF/Doc	No	CF/Doc	CF/Doc	No

Survey of System Management Frameworks

Following is couple of tables I took out of Survey I did on system management frameworks. A detailed writeup of this survey is in my thesis under related works. I want to write up a survey paper on this, but have not got to that yet.

Lets start with brief outline on rational for the choice of matrices to compare different implementations. System management frameworks monitor a system, take decisions on how to fix things, and execute those decisions. They often consist of three parts: Sensors, a brain that make decisions, and set of actuators to execute decisions. From 10,000-foot view, most architectures follow this model, but differ on how they collect data, make decisions, and execute them. First table compares them on this aspect.

System management frameworks themselves have to scale and provide High availability. To do those, they have to be composed of multiple servers (managers). Taking decisions with multiple managers (coordination) is one of the key challenges in system management framework design. The second table compares and contrast different coordination models using different architectural styles.

The 3rd table discusses decision models—in other words, implementation of the brain. More details can find from http://people.apache.org/~hemapani/research/system-management-survey.html.

Systems based on Functionality/Design


Sensors->Brain-> (Actuators)?	InfoSpect(Prolog),WRMFDS (AI), Rule-basedDignosis (Hierachy) (Monitoring & Dignosis) Ref Archi, Autopilot, Policy2Actions (Centralized) JADE, JINI-Fed, ReactiveSys, (Managers) Tivoli (Managers + Manual Coordinator)
Sensors->Gauges->Brain-> (Actuators)?	Paradyn MRNet (Monitoring Only) Rainbow (Archi Model), ACME-CMU, CoordOfSysMngServices, Code InfraMonMangGrid(Centralized) eXtreme(KX),Galaxy (Managers)
Sensors->DataModel->Brain->Actuators	DealingWithScale(CIM), Marvel( Centralized) Scalable Management (PLDB/Managers), Nester (Distributed directory service), SelfOrgSWArchi
Sensors->ECA->Actuators	DREAM, SmartSub, IrisLog, HiFi, Sophia (Prolog) - (Managers)
Management Hierarchy	NaradaBrokerMng, WildCat, BISE,ReflexEngine
Queries on demand	ACME , Decentralizing Network Management, Mng4P2P (Managers)
Workflow systems/ Resource management	Unity, RecipeBased (Centralized), SelfMngWFEngine, Automate Accord, ScWAreaRM, P2PDesktopGrid (Distributed)
Fault tolerant systems	WSRF-Container, GEMS (Distributed)
Decentralized	DMonA,Guerrilla, K-Componets
Deployment frameworks	SmartFrog , Plush, CoordApat
Monitoring	Inca (Centralized), PIPER (multicast over DHT), MDS (LDAP, Events), NWS(LDAP), Scalable Info management, Astorable (Gossip), Monitoring with accuracy, RGMA (DB like access)

Systems based on Coordination and Communication model

	Monitoring Only	One Manager without coordination	One Manager with coordination	Managers without coordination	Managers with coordination
Pull / events	InfoSpect Sophia	Rainbow, Unity, Ref Archi, Autopilot, InfraMonMangGrid	CoordOfSysMngServices	JADE	DMonA, Tivoli (Manual)
Pub/Sub hierarchy		DealingWithScale, ACME-CMU		Scalable Management, DREAM, SmartSub	NaradaBrokerMng,
P2P	PIPER,			Automate Accord, Mng4P2P, WSRF-Container,
Hierarchy	Gangila, Globus MDS, Paradyn MRNet, WRMFDS			IrisLog, HiFi, JINI-Fed, eXtreme(KX), ScWAreaRM	WildCat
Gossip	Astorable, GEMS			Galaxy
Spanning Tree (Network/P2P)				Scalable Info management, ACME
Group Communication				ReactiveSys, Galaxy	SelfOrgSWArchi
Distributed Queue		SelfMngWFEngine

Decision Models in management Systems

	Rules	Conflict resolution	Verifier	Batch Mode used?	Meta-model used?	Planning
DIOS++	If/then	yes, use priority	No	results of rules applied in next iteration	Yes	No
Rainbow	If/then		No		Yes	No
InfoSpect	Prolog Like	Only Monitoring	No	No	Yes	No
Marvel(1995)	PreCnd->action->PostCond		Yes	No	Yes	No
CoordOfSysMngServices		Yes (Detect ->Human Help)	Yes	Yes	No	Yes
Sophia	Prolog like		No	No		No
RecipeBased	Java Code	Yes		No	No	No
HiFi, DREAM	pub/sub Filter->action	No	No	No	No	No
IrisLog	DB triggers	No	No	No	No	No
ACME	(timer/sensor/completion) conditions->action	No	No	possible with timer conditions	Yes	No
ReactiveSys (1993)	if/then	No	No	No	Yes	No
Policy2Actions (2007)	Policy- name, conditions (name of method to run + parameters), actions, target component	Yes, based on runtime state + history	There are tests associated with each actions that decide should action need to run	-	No	Yes

Friday, August 19, 2011

WSO2 Stratos in contrast to Other Java PaaS Offerings

In the article "Java PaaS shootout" Michael J. Yuan provides a pretty nice comparison of Google AppEngine, Amazon Elastic Beanstalk, and CloudBees RUN@Cloud. Following table provides a summary of that while adding WSO2 Stratos to the comparison.

	App Engine	Amazon beanstalk	CloudBee's Run@Cloud	WSO2 Stratos
What is it?	Users can upload servlets. AppEngine hosts them and manage them.	Managed Tomcat Expensive	Tomcat, load balancer. Integrated with SVN. Can change source code and update all deployment aspects.	SOA platform as a service. Fully multi-tenant.
Java support	Yes, does not support some I/O and network operations	Full java	Full Java	Yes, but File access is limited
Outbound connections	Time out in 10 seconds	OK	OK	OK
Support for standard java Libs	Have problems when they use unsupported APIs	Yes	Yes	Yes (java security manager limits file accesses)
Performance and scalability	Auto scale, High scalability, but have bit high latency. Swapping the app out might slow down first request	Auto scale by creating EC2 instances	Can swap unused processes out of JVM. Can load balance multiple tomcats in the same EC2 instance	Can lazy load services and other artifacts. Auto scale (up down) by monitoring the load and creating new nodes. LB route the requests.
Storage	Support Big Table and Hosted MySQL. However, search support in BigTable case is limited. e.g. Each query can only have 100 results.	Support RDS (relational) , SimpleDB (NoSQL) or can run with your own DB	Has managed MySQL databases and provide a console to manage them	Support Cassandra as a Service, managed MySQL, and HDFS. Cassandra and HDFS support native multi-tenancy
Import/ export data	No (hard due to 30 sec limit)	Can write code to automate	Can write code to automate	Can write code to automate
Integration with others	Integrate well with other Google services	SQS, SES (email service), payment APIs	S3, SQS, SES etc.	With Google auth model and other WSO2 services. Also S3, SQS, SES etc.
Session handling	Store sessions to storage and handles them seamlessly	Only sticky sessions	Transparent session management	Only sticky sessions
Multi-tenancy	Yes	No	No	Yes

Also let me listout couple of key differentiators of WSO2 Stratos.

All these offerings support Web App Hosting as a Service. WSO2 Stratos supports that, but provides much more. In addition to Web App Hosting, it supports hosting Axis2 based services, Mediation, and Workflow hosting as a Service. It is real SOA platform as a Service, the only one to does that.
It let you move your Axis2 based Web Services (.aar files) and workflows to the Cloud (to WSO2 Statos Live) without any change to them. If you have some Axis2 based services, chances are that you can upload them to WSO2 Stratos and it will just work.
WSO2 Stratos provides real multi-tenancy support. That is different tenants will think that he has his own server, while actually all are served from one Java Server. In other words, Isolation is done at Java level, not at Virtualization level. That means it can provide greater sharing and provide "Pay as you go" and "Pay for what you use" better than VM based model. Only AppEngine does that out of other three PaaS offering. More details are in following papers.

A. Azeez, S. Perera, D. Gamage et al. (2010) Multi-tenant SOA Middleware for Cloud Computing, 458–465. In 2010 IEEE 3rd International Conference on Cloud Computing.
A. Azeez and S. Perera et al., WSO2 Stratos: An Industrial Stack to Support Cloud Computing, IT: Methods and Applications of Informatics and Information Technology Journal, the special Issue on Cloud Computing, 2011.
Milinda Pathirage, Srinath Perera, Sanjiva Weerawarana, Indika Kumara, A Multi-tenant Architecture for Business Process Execution, 9th International Conference on Web Services (ICWS), 2011

What can you do with WSO2 Platform?

We have explained WSO2 platform in many ways. Here I am trying to take a more scenario driven approach.

1. Implementing Business Logic

If you are looking to build a SOA based architecture, first step is to implement your business logic as a set of services. Users can use WSO2 Application Server (AS) to do this, and AS enables users to implement a business logic using any of the following methods.

Using Java Code (WSO2 AS, any Axis2 .aar file works)
Exposing data in a data source (e.g. Relational Database, CSV file etc., see WSO2 Data services Server)
Using Business Rules (Supports Drools, see WSO2 BRS)
Using a script language (e.g. Javascript, Phython, see WSO2 AS)

Once the service is created, users can secure them using HTTPs or WS-Security. Furthermore, he can enable most of the WS-* support also using WSO2 Application Server.

2. Mediating Messages

Often SOA deployment has third party services, legacy applications and other types of frication that force the architecture to mediate messages that flow between services. Users can use WSO2 Enterprise Service Bus (ESB) to do this. Mediation can be of several forms. (* there are others, and following are only some of them.)

Message transformations
Message editing
Message filtering
Route (e.g. load balancing, Pub/Sub)
Support Enterprise Integration Patterns (EIP)

3. Creating the User Interface Layer

Often these services have user facing components. WSO2 platform provides two choices to do this.

Java Web Applications (.war file) in WSO2 AS
Google Gadget support in WSO2 Gadget Server

4. Composing Services

If the target application is complicated, specially when service execution flows are dynamic and themselves contains business logic, users may opt to compose those services to create business processes. WSO2 platform provides three choices for doing this.

Create the process as a BPEL document, deploy, and run the business process using WSO2 Business Process Server
Create the process as a mediation sequence using WSO2 ESB
Create the process using java scripts with WSO2 Mashup Server

5. Scaling

If some of the services receive higher loads that exceed single node capacity, users would want to scale the system by clustering those bottleneck service. Exact details are usecase specific, but WSO2 platform support clustering through Axis2 clustering

Cross Cutting Concerns

In addition, WSO2 platform supports two cross cutting aspects: namely governance and business actively monitoring.

1. Govern SOA

If a SOA deployment has many services and they goes through frequent business logic and configurations changes, ad-hoc management of the deployment could be very complex. WSO2 Governance registry let users store all service WSDL in a single repository and manage their configurations and lifecycles from a single point.

2. Monitoring the Business

Most businesses needs a detailed close to real time view of their activities. WSO2 Business Activity Monitoring (BAM) let users define trace points in activities (service executions). At the runtime, BAM and monitors data collected at those trace points, and present them to end users through Gadgets that aggregate and visualize them.

Wednesday, August 17, 2011

WSO2 Conference 2011

Tuesday, August 16, 2011

Useful data sets

If you are doing a performance test, it always a good thing to do that using a real dataset. Following are several useful datasets.

Often you can find data in the CSV format, and then parsing and using it is pretty easy.

DLPB catalog - http://kdl.cs.umass.edu/data/dblp/dblp-info.html - this is data about publications. About 900MB raw size.
Google Fusion tables, http://www.google.com/fusiontables/Home
- this has several useful datasets as CSV.
Federal reserve economic data - http://research.stlouisfed.org/fred2/
Amazon public datasets - aws.amazon.com/publicdatasets/

There are lot more. Following are some of them. If anyone knows list giving a sizes of datasets and nature of datasets, please let me know.

http://dvn.iq.harvard.edu/dvn/dv/cid
http://www.thejanuarist.com/9-fascinating-datasets-available-online-for-free/
http://bios.dfg.ca.gov/dataset_index.asp
http://www.uic.edu/orgs/rin/dataset.html
http://www.nas.nasa.gov/Resources/datasets.html
http://www.datawrangling.com/some-datasets-available-on-the-web
http://news.ycombinator.com/item?id=2165497
https://bitly.com/bundles/hmason/1
http://www.gutenberg.org/wiki/Gutenberg:Feeds
http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
http://getthedata.org/
http://news.ycombinator.com/item?id=2165497

Monday, August 8, 2011

Offset Property in any WSO2 Server

If you want to run multiple WSO2 Servers (any type .. e.g. ESB, AS, GReg etc.), you can easily switch the ports using the Ports/offset property in the repository/conf/carbon.xml. When a WSO2 server starts, it listens on several ports. Offset moves all ports by the given value, and very handy tool when you play with things.

Yes does not make a huge difference on a production deployment, but a remove a major pain factor if you are playing with stuff.

IESL Talk Series: Apache System Projects in the Real World

I did a public talk on "Apache System Projects in the Real World" at the IESL talk series organized by IT & Communications Sectional Committee, The Institution of Engineers, Sri Lanka (IESL). Following are the slides. This is a updated version of my earlier talk at Apache Con Roadshow 2009. The talk discussed opensource, apache, and LEAD project. LEAD project (http://portal.leadproject.org/) is a heavy user of Apache Projects like Axis2, ODE etc., and it has pushed those to the limit and made contributions back. Also at the end, the LEAD project itself was donated to Apache as "Apache Airvata" (http://incubator.apache.org/airavata/). Therefore, I believe LEAD project is indeed an great example of Apache Web Services in the real world.

Srinath's Blog :My views of the World