Thursday, October 23, 2008

Book on RESTful PHP Web services

Samisa has written a book on RESTful PHP Web services. That is the second book from the Axis2 team and WSO2, the first being Quickstart Apache Axis2 by Deepal. Congratulations Samisa! Having worked on a few papers, I can imagine what it takes to write a book!!

Monday, October 13, 2008

Eating our own dog food: Wisdom of the Crowd for Rating the Papers

Could we use the idea of the Wisdom of the Crowd (e.g. tagging, comments, ranking, etc.) with research papers? In a way, we already do, using citations. However, citation is a slow process, and at best it takes around six months for a citation to appear. I believe it would be interesting to enable comments, tags, recommendations, etc., to allow more involved discussion. Like it or not, there is a lack of discussion between academia and industry (the research labs of companies do not qualify as industry), and one reason is that people from industry do not have the time, energy, or incentive to write a paper and go through the process of publishing it, even though they do have a comment on, or an improvement to, an idea. However, if comments are enabled, there is a better chance that more people will comment. Of course, there will be concerns about the quality of comments, but just like Wikipedia, quality will prevail in the long run.

We have seen many blog posts lead to lengthy discussions on various aspects, and more often than not, research work has more than one aspect and could benefit from more involved discussion. Therefore, features like comments and tags would help. Maybe one day we could augment peer review using similar ideas, making a paper really peer reviewed, by all the peers.

Furthermore, in my opinion we should be ashamed that CS papers, despite being the state of the art of information processing, are very hard to search and categorize. When you think of papers, they do have well defined relationships in terms of citations. But if I picked a paper now, how hard would it be to trace the provenance of its ideas? If we create a graph linked by citations and weight it using something like the PageRank algorithm (Google's algorithm to rank web pages), we could easily identify hubs (authoritative papers, authors, and maybe even groups) and important paths of development (the provenance of ideas). I am sure this has already been proposed somewhere else, and maybe some tools already have it. But I think it is a shame that the ACM or IEEE sites do not support it. We should use the results of our own research before expecting other people to use them.

Saturday, October 11, 2008

Acceptance Rate, Impact Rank and CFPs for Conferences

Acceptance rate and conference impact rank are two measures of a conference, and lists of those rankings can be found at Networking Conferences Statistics and http://citeseer.ist.psu.edu/impact.html respectively. Also, a few good lists of Calls for Papers can be found on the following sites. Some are old, but one can Google for newer editions of those conferences.

  1. http://ft.ornl.gov/cfp/
  2. http://www.sigcomm.org/calls/papers/
  3. http://www.usenix.org/event/
  4. http://www.iaria.org/conferences.html
  5. http://i.cs.hku.hk/~scho/cfp.html
  6. http://www.ee.unsw.edu.au/~timm/netconf/

Case studies on Highly Scalable Systems

This site (http://highscalability.com) links to whatever data is available on large scale systems, starting from Google but covering lots of others. I found it interesting, as there is nothing like "really doing it!!", and I love to hear first hand experiences of scaling up!!

Thursday, October 9, 2008

A Unix Trick for Debugging Asynchronous Messaging

These days, I am playing with a large scale messaging system (a broker network). Things are great when it works, but when things start to go wrong with one of these systems, you do not want to be there.

One big downside of asynchronous messaging is that it is so hard to debug, especially when messages jump from node to node and you are not the author of the code (so you do not know the magic places to put stdouts :( ).

The following is a little trick that helped me. It is not a silver bullet, but it does give some comfort. You need to start with the following.

1. Log the message ID, or some unique ID, with every log message. Usually, with luck, the developers have already done this.
2. Configure log4j to print the timestamp as the first thing in each log statement.
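
For step 2, a minimal log4j 1.x properties fragment could look like this (the appender name and log file name are my own picks for illustration, not from any particular system):

```properties
# Put an ISO8601 timestamp first on every line, so a plain sort orders the logs
log4j.rootLogger=INFO, A1
log4j.appender.A1=org.apache.log4j.FileAppender
log4j.appender.A1.File=node1.log
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n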

Assuming you were able to get all the logs into the same directory (in my case I have an NFS mount shared across all nodes, so that was easy), the following command will list everything related to a given message, in real time order, so you can walk through it and tell what happened (sort simply sorts the lines by the log4j timestamps).

grep message-id *.log | sed 's/.*\.log://' | sort

It is a simple command, but it can be very useful. Also, if you need to merge all the logs in time order, do cat *.log | sort. As you can guess, there are many variations of this. Actually, maybe messaging system developers should define some standard log format for message sending, receiving, and routing, which would let people write log mining code/scripts that can uncover problems.
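
To make the trick concrete, here is one variation that uses grep's -h/-H flags instead of sed (the sample log lines, node names, and message IDs below are made up for illustration):

```shell
# Made-up sample logs from two nodes (in reality these would sit on the NFS mount)
printf '2008-10-09 10:00:02,120 recv ID:m1 on nodeB\n'   > nodeB.log
printf '2008-10-09 10:00:01,450 send ID:m1 from nodeA\n' > nodeA.log
printf '2008-10-09 10:00:03,000 send ID:m2 from nodeA\n' >> nodeA.log

# Trace one message across all node logs, in timestamp order.
# grep -h drops the "file.log:" prefix, and a plain sort works
# because the log4j timestamp is the first thing on each line.
grep -h 'ID:m1' *.log | sort

# Merge *all* logs into one time-ordered stream, keeping the file name
# as a prefix so you can still tell which node said what
# (sort on everything after the first colon, i.e. the timestamp).
grep -H '' *.log | sort -t: -k2
```

The -H variant is handy when which node a line came from matters as much as the ordering.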

If anyone knows some useful log mining tool, please! please! drop me a note :).

Wednesday, October 8, 2008

Learning/Teaching to do large scale parallel programming

With the advent of the cloud, learning to do large scale parallel programming is becoming a useful skill. For example, Google needs graduates to learn those things at the university. This is an old field, but it has resurfaced with the cloud. NSF did a workshop on the topic, the 2008 NSF Data-Intensive Scalable Computing in Education Workshop. There is some material, and some pointers, there.

As you would guess, Map-Reduce is the typical starting point, but there are many others.
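
The classic starting exercise is a word count, and the shape of Map-Reduce can even be mimicked with a plain Unix pipeline (just an analogy to show where each phase's logic lives, not Hadoop):

```shell
# Word count in the Map-Reduce shape:
#   map:     split the input into one word ("key") per line
#   shuffle: sort brings identical keys together
#   reduce:  uniq -c counts each run of identical keys
printf 'the cloud the cluster the grid\n' |
  tr ' ' '\n' |   # map
  sort        |   # shuffle
  uniq -c     |   # reduce
  sort -rn        # most frequent key first
```

On a real cluster the shuffle is the expensive, distributed part; the point of the pipeline is only that the map and reduce steps are independent per key, which is what makes them easy to spread across machines.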

Monday, October 6, 2008

Computing at Scale: Challenges & Opportunities



I was watching the video Computing at Scale: Challenges & Opportunities, a panel at the Google Faculty Summit. Here are a few of the interesting points made.

They observed a few trends/problems (the ones that caught my ear, not comprehensive):
  1. We are drowning in data - data intensive computing; how to handle lots of data (e.g. a telescope could generate 200GB/sec).
  2. A data driven approach is becoming popular.
  3. How to program large scale systems? Patterns, middleware, and teaching students to program using them?
  4. Storage and computing power are becoming cheaper, and they are going to be placed remotely.
  5. The need for multidisciplinary collaborations to solve problems (e.g. e-science problems).
A few observations:
  1. With the cloud, the cost of 1000 CPUs for 1 day = 1 CPU for 1000 days - Prof. Patterson's observation.
  2. In large scale systems, no matter how reliable the H/W is, it fails, and the S/W has to handle it - an observation at Google.
  3. Animoto (a company running on EC2) was using about 50 nodes, but due to a Facebook app they had to handle a 10X user base within a week, and they were able to bump their system up to 3500 nodes using EC2. See here for details.

A Few High Level CS Talks

I found a few interesting talks/papers on the Computing Community Consortium web page, http://www.cra.org/ccc/resources.php, e.g. Computer Science: Past, Present and Future, Ed Lazowska's SIGCSE Keynote, March 15, 2008.