Thursday, September 22, 2011

Writing Your First Thrift Service

Thrift provides a binary RPC protocol for service invocations, and it provides a toolkit to generate Thrift bindings for several languages, including Java, C, and C++. Just like Web Services, Thrift service invocations agree on a wire format, enabling implementations in different programming languages to talk to each other.

If you remember CORBA, it was much the same idea, except that writing a service with CORBA tools was a nightmare.

Why am I interested in Thrift? The simple answer is that it is fast and has a lot of traction.

Luckily, writing a Thrift service is pretty easy. Following is how I did it.

  1. Download and build Thrift from thrift.apache.org. Look at the README for instructions. However, I had to disable the Erlang binding while running ./configure (--without-erlang). 
  2. The next step is to write the Thrift IDL. Mine looked like the following. http://wiki.apache.org/thrift/Tutorial is the best source to learn how to write a Thrift IDL.
    namespace java Test

    struct Tuple {
      1: list<string> tuples,
    }

    service Bissa {
      void put(1:Tuple tuple),
      list<Tuple> read(1:string pattern),
      list<Tuple> take(1:string pattern)
    }
    
    
  3. Then I ran Thrift to generate code. The command looked like: thrift --gen java bissa.thrift. 
  4. The above generated a service class and the types defined in the IDL. Then I wrote a server that uses the generated code; it looked like the following (a minimal sketch of the handler class appears after this list). A sample class can be found at http://svn.apache.org/viewvc/thrift/trunk/tutorial/java.
    // imports needed from libthrift:
    // import org.apache.thrift.server.TServer;
    // import org.apache.thrift.server.TThreadPoolServer;
    // import org.apache.thrift.transport.TServerSocket;
    // import org.apache.thrift.transport.TServerTransport;

    // BissaThriftServer implements the Bissa.Iface interface
    // (the interface generated by Thrift to represent the service)
    BissaThriftServer handler = new BissaThriftServer(bissa);

    // then we initialize a processor, passing it the handler (implementation)
    Bissa.Processor processor = new Bissa.Processor(handler);
    TServerTransport serverTransport = new TServerSocket(thriftPort);

    // use TThreadPoolServer for a multithreaded server
    TServer server = new TThreadPoolServer(new TThreadPoolServer.Args(serverTransport).processor(processor));
    System.out.println("Starting the simple server... on port " + thriftPort);
    server.serve();
    
    
  5. The client looked like the following. 
    // imports needed from libthrift:
    // import org.apache.thrift.protocol.TBinaryProtocol;
    // import org.apache.thrift.protocol.TProtocol;
    // import org.apache.thrift.transport.TSocket;
    // import org.apache.thrift.transport.TTransport;

    TTransport transport = new TSocket("localhost", 9092);
    transport.open();
    TProtocol protocol = new TBinaryProtocol(transport);

    // following is generated code
    Bissa.Client client = new Bissa.Client(protocol);
    // now you have a stub; use it
    client.put(BissaThriftServer.createTuple("A", "B"));

    // close the transport when done (the generated client has no close() method)
    transport.close();
  6. You might also find this blog useful. 
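
Step 4 uses a handler class without showing it. The following is a minimal sketch of what such a handler could look like, given the IDL above. The in-memory store, the pattern-matching rule, and the createTuple helper (used by the client in step 5) are illustrative assumptions of mine; the real BissaThriftServer wraps an existing bissa object rather than a local list.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // A minimal handler sketch: implements the Bissa.Iface interface generated
    // by Thrift from the IDL above. The in-memory store is illustrative only.
    public class BissaThriftServer implements Bissa.Iface {

        private final List<Tuple> store = new ArrayList<Tuple>();

        // helper used by the client example; builds a Tuple from strings
        // (an illustrative assumption, not generated code)
        public static Tuple createTuple(String... values) {
            Tuple tuple = new Tuple();
            tuple.setTuples(new ArrayList<String>(Arrays.asList(values)));
            return tuple;
        }

        public synchronized void put(Tuple tuple) {
            store.add(tuple);
        }

        // read returns matching tuples without removing them; here "matching"
        // simply means the tuple contains the pattern as one of its values
        public synchronized List<Tuple> read(String pattern) {
            List<Tuple> matches = new ArrayList<Tuple>();
            for (Tuple t : store) {
                if (t.getTuples().contains(pattern)) {
                    matches.add(t);
                }
            }
            return matches;
        }

        // take returns matching tuples and removes them from the store
        public synchronized List<Tuple> take(String pattern) {
            List<Tuple> matches = read(pattern);
            store.removeAll(matches);
            return matches;
        }
    }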

Saturday, September 17, 2011

Data, Data Everywhere and Challenges

I had the pleasure and privilege of moderating the Data Panel at WSO2 Conference 2011, which comprised the distinguished panelists Sumedha Rubasinghe, C. Mohan, and Gregor Hohpe. Obviously, I did my homework for the panel and gave some thought to what to say. I felt at the end that I should write down my opening remarks. So, here we go.

Let me start with a quote from Tim Berners-Lee, the inventor of the World Wide Web. He said, "Data is a precious thing and will last longer than the systems themselves." For example, if you take companies like Google or Yahoo, you will find many who argue that the data those companies have collected over their operation is indeed the most important asset they have. That data gives them the power to either optimize what they do or to move in new directions.

If you look around, you will see how much data is available. Let me touch on a few types of data.

  1. Sensors – Human activities (e.g. near field communication), RFID, Nature (Weather), Surveillance, Traffic, Intelligence etc. 
  2. Activities on the World Wide Web
  3. POS and transaction logs 
  4. Social networks
  5. Data collected by governments, NGOs etc. 

The paper Miller, H.J., "The Data Avalanche Is Here. Shouldn't We Be Digging?", Journal of Regional Science, 2010, is a nice discussion of the subject.

Data come in many shapes and forms. Some are moving data (data streams), while others are at rest; some are public, some are tightly controlled; some are small, and some are large.

Think about a day in your life, and you will realize how much data is around you that you know exists but is very hard to access or process. For example, do you know the distribution of your spending? Why is it so hard to find the best deal when buying a used car? Why can't I find the best route to drive right now? The list goes on and on.

It is said that we are drowning in an ocean of data, and making sense of that data is considered the challenge of our time. Come to think of it, Google has made a fortune by solving a seemingly simple problem: content-based search. There are many companies that either provide data (e.g. maps, best deals) or provide add-on services on top of data (e.g. analytics, targeted advertising).

As I mentioned earlier, we have two types of data. First, moving data are data streams, and users want to process them in near real time either to adapt (e.g. monitoring the stock market) or to control the outcome (e.g. battlefield observations or logistics management). The second is data at rest. We want to store it, search it, and then process it. This processing is done either to detect patterns (fraud detection, anti-money laundering, surveillance) or to make predictions (e.g. predict the cost of a project, predict natural disasters).
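
To make the two modes concrete, here is a minimal Java sketch; the StockEvent type and the alert threshold are hypothetical, purely for illustration. Moving data is acted on per event as it arrives, while data at rest is stored first and queried later.

    import java.util.ArrayList;
    import java.util.List;

    // hypothetical event type, for illustration only
    class StockEvent {
        final String symbol;
        final double price;
        StockEvent(String symbol, double price) { this.symbol = symbol; this.price = price; }
    }

    public class TwoKindsOfData {
        // moving data: react to each event in near real time, without storing it first
        static void onEvent(StockEvent e) {
            if (e.price > 100.0) { // hypothetical alert threshold
                System.out.println("ALERT: " + e.symbol + " crossed 100 at " + e.price);
            }
        }

        // data at rest: store first, query later
        static final List<StockEvent> store = new ArrayList<StockEvent>();

        static double averagePrice(String symbol) {
            double sum = 0;
            int count = 0;
            for (StockEvent e : store) {
                if (e.symbol.equals(symbol)) { sum += e.price; count++; }
            }
            return count == 0 ? 0 : sum / count;
        }

        public static void main(String[] args) {
            StockEvent event = new StockEvent("WSO2", 101.5);
            onEvent(event);   // stream path: immediate decision
            store.add(event); // at-rest path: keep for later analysis
            System.out.println("average = " + averagePrice("WSO2"));
        }
    }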

So broadly we have two main challenges.
  1. How to store and query data in a scalable manner? 
  2. How to make sense of data (how to run the transformations data -> information -> knowledge -> insight)? 
There are many other challenges as well; following are some of them.

Representations
  1. Supporting semantics. This includes extracting semantics from data (e.g. using heuristics-based AI systems or statistical methods) and supporting efficient semantics-based queries. 
  2. Supporting multiple representations of the same data. Is converting on demand the right way to go, or should we standardize? Is standardization practical? 
  3. Master data management – making sure all copies of data are updated, and any related data is identified, referenced, and updated together.
Security
  1. Data ownership, delegation, and permissions.
  2. Privacy concerns: unintended use of data and the ability to correlate too much information. 
  3. Exposing private data in a controlled manner. 
  4. Making data accessible to all intended parties, from anywhere, at any time, from any device, in any format (subject to permissions). 
Analytics
  1. Making close to real-time decisions with large-scale data (e.g. targeted advertising); in other words, how to make analytical jobs faster. 
  2. Distributed frameworks and languages for large-scale data processing tasks. Is Map-Reduce good enough? What about other classes of parallel problems? (See the word-count sketch after this list.) 
  3. Ability to measure the confidence associated with results generated from a given set of data. 
  4. Making decisions in the face of missing data (e.g. lost events). Regardless of the design, some data will be lost while monitoring a system; decision models still have to work, ignoring or interpolating the missing data. 
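
As a point of reference for the Map-Reduce question in the list above, the following is a sketch of the canonical word-count example against Hadoop's MapReduce API: the map step emits a (word, 1) pair per word, and the reduce step sums the counts per word. It is a minimal sketch, assuming Hadoop is on the classpath and input/output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map: emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // reduce: sum the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
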
I am not trying to explain the solutions here, but hopefully I will write future posts on the state of the art for some of these challenges.


Tuesday, September 13, 2011

Multi-tenancy: Winning formula for a PaaS

Following is the slide deck I presented at the WSO2 Conference today. It provides a detailed discussion of what multi-tenancy is, why it is needed, and potential implementations.
