Alliance Global Services: Data Processing: Do you Map Reduce for Large Datasets

A common problem that customers raise is how to process large amounts of data quickly. Some of them can scale out horizontally; others have to find alternative ways to improve processing time. Map Reduce implementations can be applied to any Cloud deployment model.

Map Reduce is a patented[1] software framework introduced by Google to support distributed computing on large data sets across clusters of computers. Hadoop Map-Reduce is a software framework based on Google’s Map Reduce for writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The three core pillars of Hadoop are Map, Reduce and HDFS (Hadoop Distributed File System).

Large data sets are split into smaller units, and each unit is processed by a Map task. Map tasks run in parallel, independently of one another and of any other tasks. The output of the Map tasks is normally referred to as “intermediate output” and always takes the form of key/value pairs. The Hadoop framework sorts the intermediate output by key and passes it to the Reduce tasks as input. Each Reduce task is then responsible for reducing the set of intermediate values that share the same key. In general, both input and output are stored on a file system. In addition to Map and Reduce, we can introduce a Combiner operation that aggregates the output of each Map task locally by key before it reaches the Reduce task. Figure 1 below depicts map reduce processing.

Figure 1
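To make the flow concrete, here is a minimal single-machine sketch of the map, combine, sort and reduce steps, using word counting as the example. It is plain Python rather than Hadoop code, and every function name in it (map_task, combine, shuffle_and_sort, reduce_task, run_job) is my own choice for illustration, not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_task(split):
    """Emit one (word, 1) pair per word found in an input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per key within one map task's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle_and_sort(all_pairs):
    """Sort the intermediate pairs by key and group the values of each key."""
    ordered = sorted(all_pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_task(key, values):
    """Reduce all intermediate values that share a key to one result."""
    return key, sum(values)


def run_job(splits):
    """Toy driver: a real framework would run the map tasks on many nodes."""
    intermediate = []
    for split in splits:
        intermediate.extend(combine(map_task(split)))
    return dict(reduce_task(key, values)
                for key, values in shuffle_and_sort(intermediate))


# Two input splits, each a list of text lines (made-up sample data).
splits = [
    ["the quick brown fox", "the lazy dog"],
    ["the dog barks"],
]
word_counts = run_job(splits)
print(word_counts)
```

Here the combiner turns the two occurrences of "the" in the first split into a single ("the", 2) pair before anything is sorted, which is the kind of saving it would bring to network traffic on a real cluster.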

One of the core features of the Hadoop framework is HDFS, the Hadoop Distributed File System. It allows the framework to schedule tasks effectively on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster. The distributed file system handles large files with sequential read/write operations. Files are split into smaller units (blocks) and stored on the local disks of the nodes.
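The idea of splitting a file into blocks and then sending the work to the data can be sketched with a toy model. This is not the HDFS API: the function names, the round-robin placement, the node names and the tiny block size are all assumptions made for illustration, and real HDFS blocks are far larger.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` nodes, round-robin.

    Assumes replication <= len(nodes) so the replicas land on distinct nodes.
    """
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def schedule_map_tasks(placement, free_slots):
    """Prefer a node that already holds the block, else any node with a free slot.

    Note: this decrements the counts in `free_slots` as tasks are assigned.
    Returns {block_id: (node, is_data_local)}.
    """
    schedule = {}
    for block_id, holders in placement.items():
        local = [n for n in holders if free_slots.get(n, 0) > 0]
        candidates = local or [n for n, slots in free_slots.items() if slots > 0]
        if not candidates:
            raise RuntimeError("no free task slots in the cluster")
        node = candidates[0]
        free_slots[node] -= 1
        schedule[block_id] = (node, node in holders)
    return schedule


# Hypothetical three-node cluster and a 1000-byte "file".
nodes = ["node-a", "node-b", "node-c"]
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_blocks(len(blocks), nodes, replication=2)
schedule = schedule_map_tasks(placement, {"node-a": 1, "node-b": 1, "node-c": 2})
for block_id, (node, is_local) in schedule.items():
    print(block_id, node, "local" if is_local else "remote read")
```

In this run the first three blocks are processed on nodes that already hold a copy, while the last block has to be read remotely because both of its holders are busy. The more tasks land in the first category, the less data crosses the network.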

Through Hadoop, you can run parallel, concurrent instances of the mapping and reduction code to generate the desired result in batch processing mode. A typical problem I have seen is that server hardware is unavailable in the desired environment while plenty of server hardware sits idle elsewhere, and the logic to process large amounts of data is very machine dependent. Major configuration work is required before a server can be used and, as always, the data should have been delivered yesterday. This is where a Hadoop implementation offers a major advantage: programmers familiar with the problem domain can write elegant solutions without worrying about the multi-processing nature of the task. As long as the infrastructure is in place to synchronize process execution across multiple threads, processors, or complete systems, adding Hadoop can work wonders in utilizing available capacity with minimal configuration change.
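The separation between domain logic and parallel execution can be illustrated with a small sketch. The map and reduce functions below know nothing about scheduling; a hypothetical driver, run_parallel, fans the map work out to a pool of workers. Threads from Python's standard library stand in for cluster nodes here, which is an assumption for illustration only, and the log lines are made-up sample data.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_status_codes(log_lines):
    """Map step: pure domain logic, unaware of how it is scheduled."""
    return Counter(line.split()[-1] for line in log_lines if line.strip())


def merge_counts(partials):
    """Reduce step: merge the partial counts produced by every map task."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def run_parallel(splits, map_fn, reduce_fn, workers=4):
    """Hypothetical driver: run one map task per split on a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_fn, splits))
    return reduce_fn(partials)


# Each inner list plays the role of one input split.
splits = [
    ["GET /index 200", "GET /missing 404"],
    ["POST /login 200", "GET /index 200"],
    ["GET /admin 403"],
]
status_totals = run_parallel(splits, count_status_codes, merge_counts)
print(status_totals)
```

Changing the number of workers, or swapping the thread pool for machines in a cluster, would not require touching count_status_codes or merge_counts, which is the point the paragraph above makes about Hadoop.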

One question I want to pose: when you replace legacy home-grown systems, whose functional and non-functional data processing logic is embedded in code, with a Map Reduce implementation, how do you ensure that the replacement is correct? This problem is typical of data-intensive systems. It is a case of “How do you test a reservoir without filling it?”: how do you know the data is correct unless you have tested against the full data?
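One possible starting point, offered as a sketch rather than an answer, is to run the legacy logic and the Map Reduce logic side by side on a sample and compare the results. Everything below is hypothetical: legacy_totals and mapreduce_totals are stand-ins for the two real implementations, and the records are made-up (customer, amount) pairs. The sample is drawn by key so that every sampled key keeps all of its records.

```python
import random
from collections import defaultdict


def legacy_totals(records):
    """Stand-in for the legacy system: one sequential pass over the records."""
    totals = {}
    for customer, amount in records:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


def mapreduce_totals(records, num_splits=3):
    """Stand-in for the replacement: map per split, group by key, reduce."""
    splits = [records[i::num_splits] for i in range(num_splits)]
    grouped = defaultdict(list)
    for split in splits:
        for customer, amount in split:  # map: emit (key, value)
            grouped[customer].append(amount)
    return {customer: sum(amounts) for customer, amounts in grouped.items()}


def sample_by_key(records, fraction, seed):
    """Keep every record of a random subset of keys, so totals stay complete."""
    rng = random.Random(seed)
    keys = sorted({customer for customer, _ in records})
    chosen = set(rng.sample(keys, max(1, int(len(keys) * fraction))))
    return [record for record in records if record[0] in chosen]


def outputs_match(records, fraction=0.5, seed=0):
    """Compare both implementations on the same key-based sample."""
    sample = sample_by_key(records, fraction, seed)
    return legacy_totals(sample) == mapreduce_totals(sample)


records = [("acme", 10), ("globex", 5), ("acme", 7), ("initech", 3), ("globex", 1)]
print(outputs_match(records))
```

A match on a sample raises confidence but does not prove that the full data set will agree, so the reservoir question still stands.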

An obvious question that follows data processing is how to use the results efficiently through a machine learning system. I am fascinated by the concept and implementation of Lucene Mahout and will share my thoughts on that later.

Trackback URL for this post:

http://www.allianceglobalservices.com/trackback/718


Sunday, October 10, 2010
