Objects with Tuple-Spaces
Tuple-Space
processing
presents a model of distributed processing that is strikingly different
from other schemes. A client application needing some computation done
creates an object that contains the necessary information and writes it
to a "space manager." A space manager is responsible for matching the
client object with a worker process which can perform the desired job.
When finished, the worker process writes a new object containing the
results back to the space manager.
- Write - A process writes a serialized object to the Space Manager. This is analogous to a Rest PUT operation.
- Read - A process requests a copy of an object from the Space Manager by specifying the object contents it wants to match. This is analogous to a Rest GET operation getting the current state of a resource.
- Take - A process removes an object from the Space Manager. In Rest terms this would be a GET followed by a DELETE operation.
- Notify - Sends a message to a process that a object of interest has been written into the space. There is no exact Rest analogy, a Rest client might do repeated GET operations using the If-Modified-Since request header.
Tuple-Space pros and cons
Space based computing uses an object to define a computing problem.
Since these objects can be quite complex, a wide variety of computation
problems can be tackled. Space based computing easily adapts to
problems in which a series of operations needs to take place. For
example, suppose the problem is to conduct lexical analysis on large
blocks of text to determine the likely author. An object specifying the
text to be analyzed could first go to a simple parsing and counting
program which would write the modified object back into the space to be
picked up by a more complex program performing statistical analysis.
The advantage for cloud computing is the extreme flexibility in scaling
computing power to match demand by adding or removing worker processes.
JavaSpaces in the cloud
Java is a natural fit for tuple-space computing due to the ease of
serializing Java objects, but implementations exist in many languages.
I discussed the applicability of JavaSpaces to web services in this article.
Hadoop - the Open-Source MapReduce Implementation
Google famously uses a programming architecture they call "MapReduce"
to index the entire web, terabytes of data, using thousands of
commodity computers. A publication describing the programming model and
implementation is available here.
Many programmers have implemented their own versions of this
architecture, with the Apache Hadoop
open source project using Java the best known. The Apache Hadoop
project is very active with many contributors and frequent releases but
is still at the release 0.20 (April 2009) stage, so this has to be
regarded as an immature technology.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is far from what your desktop
operating system uses. It is a "distributed" system that achieves fault
tolerance on cheap commodity hardware, typically running Linux, by
extensive duplication and high throughput by choosing not to provide
many typical file system functions. Files once written to the system
and closed are not expected to be changed. This vastly simplifies
replicating data, buffering and access control. A single HDFS instance
may contain terabytes in tens of millions of files and be spread over
hundreds of thousands of servers. For speed and simplification of
networking, file read operations deal with large blocks of data, 64MB
or more in size. In spite of all these simplifications, processes deal
with HDFS using typical file names and directory structures.
Hadoop Job Processing
In order to use Hadoop, you must have a way to fit it into the
MapReduce logical structure. It must be possible to break the problem
into pieces which can be processed independently and which together
cover all of the input data. Furthermore, it must be possible to
express the results of each process in a form which lends itself to
being easily combined with all of the other processes to create a final
result.
- MAP - The total problem space is broken into suitable independent subsets for distribution to workers. Input data takes the form of a collection of pairs of key and value objects where both key objects and value objects must be serializable Java objects. Worker processes return a collection of pairs of serialized key and value objects. Note that "map" is being used in two senses here, the problem space is mapped into independent tasks, and the description of each task and the returned data are in the form of a map.
- REDUCE - When all workers have returned result lists, one or more "reduce" processes combines them progressively to get a final product. In order for this to work, it must be possible to sort the returned list by the keys so that the reduce process only has to look at the next item in each result list to determine if the results can be combined. The usual example given is counting the number of times different words appear in a collection of documents. The result lists have word keys and count values, sorted alphabetically by word so the reduce process can determine when counts can be added and combine the lists.
Pros and Cons of Hadoop
Hadoop is well suited to analysis of huge volumes of data by virtue of
the efficiency, capacity, and fault tolerance of HDFS. Fault tolerance
permits use of cheap commodity hardware In addition to the limit on the
types of jobs suitable for mapping, Hadoop has a single point of
failure in that there is only one file system master index process and
one job distributing process.
Summary
JavaSpaces concentrates on distributing work to multiple processors
with a very flexible system for defining jobs. JavaSpaces processes can
use any kind of file system or database. Hadoop concentrates on
reliable fast access to huge datasets through a restricted client api
for file reading and can only work on problems which can be mapped to
separate processes.>>
You can find this at:
Gervas