<<If you are facing a
computing problem far beyond the
capabilities of your present hardware, you are probably evaluating the
possibility of cloud computing as being faster and taking much less
investment than buying more hardware. Now it is time to consider how
you are going to distribute your computing jobs to cloud computing
resources, manage the whole thing, and get results back. In this
article I am going to look at two technologies, one very mature and the
other rather new.
Objects with Tuple-Spaces
Tuple-Space
processing
presents a model of distributed processing that is strikingly different
from other schemes. A client application needing some computation done
creates an object that contains the necessary information and writes it
to a "space manager." A space manager is responsible for matching the
client object with a worker process which can perform the desired job.
When finished, the worker process writes a new object containing the
results back to the space manager.
A tuple-space distributed computing environment will have one or more
space managers and any number of worker processes. Possible operations
on a Space have the virtue of simplicity, similar to REST. In their
simplest form these operations are:
- Write - A process writes a serialized object to the Space
Manager. This is analogous to a Rest PUT operation.
- Read - A process requests a copy of an object from the
Space
Manager by specifying the object contents it wants to match. This is
analogous to a Rest GET operation getting the current state of a
resource.
- Take - A process removes an object from the Space Manager.
In Rest terms this would be a GET followed by a DELETE operation.
- Notify - Sends a message to a process that a object of
interest has been written into the space. There is no exact Rest
analogy, a Rest client might do repeated GET operations using the
If-Modified-Since request header.
Tuple-Space pros and cons
Space based computing uses an object to define a computing problem.
Since these objects can be quite complex, a wide variety of computation
problems can be tackled. Space based computing easily adapts to
problems in which a series of operations needs to take place. For
example, suppose the problem is to conduct lexical analysis on large
blocks of text to determine the likely author. An object specifying the
text to be analyzed could first go to a simple parsing and counting
program which would write the modified object back into the space to be
picked up by a more complex program performing statistical analysis.
The advantage for cloud computing is the extreme flexibility in scaling
computing power to match demand by adding or removing worker processes.
The process of serializing, transmitting, and deserializing the objects
which represent tasks may be a significant consumer of bandwidth and
cpu time. For jobs that work on large datasets, some programmer
ingenuity may be required to keep object size small by transmitting
only a reference to the dataset rather than including the dataset in
the task object.
JavaSpaces in the cloud
Java is a natural fit for tuple-space computing due to the ease of
serializing Java objects, but implementations exist in many languages.
I discussed the applicability of JavaSpaces to web services in this article.
Sun appears to regard JavaSpaces as a mature technology and is leaving
commercial implementation to organizations such as GigaSpaces
corporation, which has been commercializing JavaSpaces since 2000. The
GigaSpaces company has created a Cloud Application Server platform called XAP
running on Amazon EC2 and private systems. Free downloads,
demonstrations and extensive documentation are available on the
GigaSpaces site.
Hadoop - the Open-Source MapReduce Implementation
Google famously uses a programming architecture they call "MapReduce"
to index the entire web, terabytes of data, using thousands of
commodity computers. A publication describing the programming model and
implementation is available here.
Many programmers have implemented their own versions of this
architecture, with the Apache Hadoop
open source project using Java the best known. The Apache Hadoop
project is very active with many contributors and frequent releases but
is still at the release 0.20 (April 2009) stage, so this has to be
regarded as an immature technology.
In spite of being such a relatively new project, Hadoop is widely used
in commercial systems such as Yahoo's search engine. Just about
everybody trying to sell cloud computing services, such as Amazon's
Elastic Compute Cloud has some sort of Hadoop support in place or
planned. The essential parts of Hadoop are the file system and the job
processing system.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is far from what your desktop
operating system uses. It is a "distributed" system that achieves fault
tolerance on cheap commodity hardware, typically running Linux, by
extensive duplication and high throughput by choosing not to provide
many typical file system functions. Files once written to the system
and closed are not expected to be changed. This vastly simplifies
replicating data, buffering and access control. A single HDFS instance
may contain terabytes in tens of millions of files and be spread over
hundreds of thousands of servers. For speed and simplification of
networking, file read operations deal with large blocks of data, 64MB
or more in size. In spite of all these simplifications, processes deal
with HDFS using typical file names and directory structures.
The purpose of these design decisions is to give reliable fast data
transfer rates to computing processes that require a stream of data.
The low latency and random access required by interactive applications
are sacrificed for the sake of very high throughput. HDFS can be
managed so that a process that needs a file can read a copy from the
physically closest server.
Hadoop Job Processing
In order to use Hadoop, you must have a way to fit it into the
MapReduce logical structure. It must be possible to break the problem
into pieces which can be processed independently and which together
cover all of the input data. Furthermore, it must be possible to
express the results of each process in a form which lends itself to
being easily combined with all of the other processes to create a final
result.
- MAP - The total problem space is broken into suitable
independent subsets for distribution to workers. Input data takes the
form of a collection of pairs of key and value objects where both key
objects and value objects must be serializable Java objects. Worker
processes return a collection of pairs of serialized key and value
objects. Note that "map" is being used in two senses here, the problem
space is mapped into independent tasks, and the description of each
task and the returned data are in the form of a map.
- REDUCE - When all workers have returned result lists, one
or
more "reduce" processes combines them progressively to get a final
product. In order for this to work, it must be possible to sort the
returned list by the keys so that the reduce process only has to look
at the next item in each result list to determine if the results can be
combined. The usual example given is counting the number of times
different words appear in a collection of documents. The result lists
have word keys and count values, sorted alphabetically by word so the
reduce process can determine when counts can be added and combine the
lists.
Pros and Cons of Hadoop
Hadoop is well suited to analysis of huge volumes of data by virtue of
the efficiency, capacity, and fault tolerance of HDFS. Fault tolerance
permits use of cheap commodity hardware In addition to the limit on the
types of jobs suitable for mapping, Hadoop has a single point of
failure in that there is only one file system master index process and
one job distributing process.
Summary
JavaSpaces concentrates on distributing work to multiple processors
with a very flexible system for defining jobs. JavaSpaces processes can
use any kind of file system or database. Hadoop concentrates on
reliable fast access to huge datasets through a restricted client api
for file reading and can only work on problems which can be mapped to
separate processes.>>
You can find this at:
http://searchsoa.techtarget.com/tip/0,289483,sid26_gci1357905_mem1,00.html?track=NL-130&ad=712058&Offer=mn_eh062409sSOAUNSC_pcc2&asrc=EM_USC_8103522&uid=5532089
Gervas