Optimizing Hadoop for MapReduce PDF


... with others in the Hadoop and MapReduce community. ... Did you know that Packt offers eBook versions of every book published, with PDF ... Chapter 6, Optimizing MapReduce Tasks, explains when you need to use combiners. ... In this book, we address the MapReduce optimization problem, how to identify ... Download Optimizing Hadoop for MapReduce from Packt in PDF, ePub and Mobi. Learn how to configure your Hadoop Cluster to run optimal MapReduce jobs.

Language:English, Spanish, French
Genre:Fiction & Literature
Published (Last):13.03.2016


... MapReduce programming paradigm to distribute work among the machines; so a ... needed to be converted to the PDF format; so the managers made use of this ... The current Hadoop optimization research is mainly conducted by Apache ... an open source implementation of the MapReduce model, and has been widely ... model [ ] and simulator to optimize a Hadoop job based ... Hadoop MapReduce has become a major computing technology in support of big data analytics. The Hadoop framework has over ...

We now live in an era in which Big Data is one of the most talked-about issues, together with the problems associated with it. MapReduce has been the backbone and heart of Apache Hadoop, a software tool designed for the distributed processing and storage of these massive datasets. Over the years, many researchers have tried to find ways to improve the performance of the MapReduce process, and many techniques have been designed. In this paper we present these techniques in an effort to reveal other research avenues in this domain.

Introduction

Big data is the collection of large and complex data sets that are difficult to process using on-hand database management tools or traditional data processing applications. The invention of online social networks, smart phones, fine tuning of ubiquitous computing and many other technological advancements have led to the generation of multiple petabytes of structured, unstructured and semi-structured data. These massive data sets have led to the birth of distributed data processing and storage technologies like Apache Hadoop and MongoDB [1]. In the years building up to [1] Google embarked on a project to set up their very own proprietary database, which they named "Big Table" [3]. Big Table was an instant hit, and it solved many of the problems with relational databases.

Figure 1: MapReduce Illustration

Apache Hadoop consists of five daemons [5], which are also ...

This approach divides the parameters search space into subsearch spaces and then searches for optimum values by trying different values for parameters iteratively within the range.
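The idea of narrowing a search range through sub-search spaces can be sketched for a single numeric parameter as follows. This is only an illustration under stated assumptions, not the cited approach: the function name `subspace_search`, the `parts` and `rounds` arguments, and the toy cost function are all hypothetical, and a real tuner would measure job execution time instead of evaluating a formula.

```python
def subspace_search(cost, low, high, parts=4, rounds=5):
    """Narrow [low, high] by repeatedly keeping the best of `parts` equal sub-ranges.

    `cost` stands in for a measured job execution time (hypothetical here).
    """
    for _ in range(rounds):
        width = (high - low) / parts
        # Try one value (the midpoint) from each sub-range.
        candidates = [low + (i + 0.5) * width for i in range(parts)]
        best = min(candidates, key=cost)
        # Keep only the sub-range around the best candidate for the next round.
        low, high = best - width / 2, best + width / 2
    return (low + high) / 2


# Toy cost with its minimum at 30, standing in for real measurements.
estimate = subspace_search(lambda x: (x - 30) ** 2, 0, 100)
```

Each round costs `parts` evaluations, so the total number of trial runs grows linearly with the number of rounds while the range shrinks geometrically. The sketch assumes the cost has a single minimum in the range; with several local minima it can settle on the wrong sub-range.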

However, both approaches are unable to provide a sophisticated search technique and a mathematical function that represents the correlation of the Hadoop configuration parameters.

Li et al. The model analyzes the hardware and software levels and explores the performance issues in both levels. The model mainly focuses on the impact of different configuration settings on job execution time instead of tuning the configuration parameters. Yu et al.

The Random-Forest approach is used to build performance models for the map phase and the reduce phase and a GA is employed to search optimum configuration parameter settings within the parameter space. It should be noted that a Hadoop job is executed in overlapping and nonoverlapping stages [24], which are ignored in the proposed performance model.

As a result, the performance estimation of the proposed model may be inaccurate.

Furthermore, the proposed model uses a dynamic instrumentation tool, BTrace, to collect the timing characteristics of tasks. As a result, the proposed model overestimates the execution time of a job.

Hadoop Parameters

The Hadoop framework has a large number of tunable configuration parameters that allow users to manage the flow of a Hadoop job in different phases during the execution process.

Some of them are core parameters and have a significant impact on the performance of a Hadoop job [14, 18, 25].

Considering all of the configuration parameters for optimization purposes would be unrealistic and time consuming. In order to reduce the parameter search space and effectively speed up the search process, we consider only core parameters in this research. The selection of the core parameters is based on previous research studies [14, 17, 18, 25, 26].

The core parameters, listed in Table 1, are briefly described as follows.

Table 1: Hadoop core parameters as GEP variables.

This parameter determines the number of file streams to be merged during the sorting process of map tasks.

The default value is 10, but increasing its value improves the utilization of the physical memory and reduces the overhead in IO operations.
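To see why a larger merge factor can reduce IO, consider a simplified model in which every merge round combines groups of up to `factor` files into one file each, so every round rereads and rewrites the data once. The sketch below counts those rounds. The function name `merge_rounds` is hypothetical, and the model is an assumption for illustration; Hadoop's actual merge scheduling differs in its details.

```python
def merge_rounds(num_files, factor):
    """Count merge rounds needed to reduce `num_files` sorted files to one.

    Simplified model: each round merges groups of up to `factor` files.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")
    rounds = 0
    files = num_files
    while files > 1:
        files = -(-files // factor)  # ceiling division: files left after this round
        rounds += 1
    return rounds


rounds_default = merge_rounds(100, 10)   # 100 -> 10 -> 1
rounds_larger = merge_rounds(100, 100)   # 100 -> 1
```

Under this model, 100 files need two rounds with a factor of 10 and a single round with a factor of 100, at the cost of keeping more streams open at once.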

During job execution, the output of a map task is not directly written into the hard disk but is written into an in-memory buffer which is assigned to each map task. The size of the in-memory buffer is specified through the io. The default value of this parameter is MB.

The default value of this parameter is 0. It is recommended that the value of io.

This parameter can have a significant impact on the performance of a Hadoop job [24]. The default value is 1. The optimum value of this parameter is mainly dependent on the size of an input dataset and the number of reduce slots configured in a Hadoop cluster.

Setting a small number of reduce tasks for a job decreases the overhead in setting up tasks on a small input dataset while setting a large number of reduce tasks improves the hard disk IO utilization on a large input dataset.

These parameters define the number of the map and reduce tasks that can be executed simultaneously on each cluster node. Increasing the values of these parameters increases the utilization of CPUs and physical memory of the cluster node which can improve the performance of a Hadoop job.

The default value is -Xmxm which gives at most MB physical memory to each child task.

The threshold value indicates the number of files for the in-memory merge process. When the number of accumulated map output files equals the threshold value, the system initiates the process of merging the map output files and spilling them to disk. A value of zero for this parameter means there is no threshold and the spill process is controlled by the mapred.

The Optimization of Hadoop Using GEP

The automated Hadoop performance tuning approach is based on a GEP technique, which automatically searches for Hadoop optimum configuration parameter settings by building a mathematical correlation among the configuration parameters. GEP uses a combined chromosome and expression tree structure [15] to represent a targeted solution of the problem being investigated.

The factors of the targeted solution are encoded into a linear chromosome format together with some potential functions, which can be used to describe the correlation of the factors. Each chromosome generates an expression tree, and the chromosomes containing these factors are evolved during the evolutionary process.

GEP Design

The execution time of a Hadoop job can be expressed using (1), a function whose variables represent the Hadoop configuration parameters. In this research, we consider 9 core Hadoop parameters and, based on the data types of these Hadoop configuration parameters, the functions shown in Table 2 can be applied in the GEP method. A correlation of the Hadoop parameters can be represented by a combination of the functions. Figure 1 shows an example of mining a correlation of 2 parameters, which is conducted in the following steps in the proposed GEP method: (i) Based on the data types of the two parameters, find a function which has the same input data type as either parameter and takes 2 input parameters.

In this case, the Plus function is selected.

Table 2: Mathematical functions used in GEP.

Figure 1: An example of parameter correlation mining.

Similarly, a correlation of all the parameters can be mined using the GEP method. The chromosome and expression tree structure of GEP is used to hold the parameters and functions. A combination of functions, which takes the parameters as inputs, is encoded into a linear chromosome that is maintained and developed during the evolution process. Meanwhile, the expression tree generated from the linear chromosome produces a form of the execution time function, based on which an estimated execution time is computed and compared with the actual execution time.

A final form of the function will be produced at the end of the evolution process, whose estimated execution time is the closest to the actual execution time. In the GEP method, a chromosome can consist of one or more genes. For computational simplicity, each chromosome has only one gene in the proposed method. A gene is composed of a head and a tail. The elements of the head are selected randomly from the set of Hadoop parameters listed in Table 1 and the set of functions listed in Table 2.


However, the elements of the tail are selected only from the Hadoop parameter set. The length of a gene head is set to 20, which covers all the possible combinations of the functions.
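The head and tail structure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function set (only Plus is named in the text; minus and times are assumptions), the terminal names `x1` to `x9`, and the helpers `random_gene` and `evaluate` are hypothetical. The tail length follows the standard GEP rule `tail = head * (max_arity - 1) + 1`, which guarantees that every gene decodes to a complete expression tree.

```python
import operator
import random

# Function set: symbol -> (callable, arity). Only "+" (Plus) comes from the text.
FUNCTIONS = {"+": (operator.add, 2), "-": (operator.sub, 2), "*": (operator.mul, 2)}


def random_gene(head_len, terminals, rng):
    """Build one gene: head from functions and terminals, tail from terminals only."""
    max_arity = max(arity for _, arity in FUNCTIONS.values())
    tail_len = head_len * (max_arity - 1) + 1
    head_pool = list(FUNCTIONS) + list(terminals)
    head = [rng.choice(head_pool) for _ in range(head_len)]
    tail = [rng.choice(list(terminals)) for _ in range(tail_len)]
    return head + tail


def evaluate(gene, values):
    """Decode the gene breadth-first into an expression tree and evaluate it."""
    children = []      # children[i] = gene positions of node i's arguments
    next_child = 1     # next unassigned position in the gene
    used = 1           # number of gene symbols that belong to the tree
    i = 0
    while i < used:
        arity = FUNCTIONS[gene[i]][1] if gene[i] in FUNCTIONS else 0
        children.append(list(range(next_child, next_child + arity)))
        next_child += arity
        used += arity
        i += 1
    # Children always sit after their parent, so evaluate from the end backwards.
    results = [None] * used
    for i in reversed(range(used)):
        symbol = gene[i]
        if symbol in FUNCTIONS:
            fn = FUNCTIONS[symbol][0]
            results[i] = fn(*(results[c] for c in children[i]))
        else:
            results[i] = values[symbol]
    return results[0]


# The gene "+ * a b c" decodes to (b * c) + a.
estimated = evaluate(["+", "*", "a", "b", "c"], {"a": 2, "b": 3, "c": 4})

# A random gene with head length 20 over nine hypothetical parameter names.
params = [f"x{i}" for i in range(1, 10)]
gene = random_gene(20, params, random.Random(0))
```

A fitness measure in the spirit of the text would compare `evaluate(gene, settings)` against the measured execution time for the same settings, for example by absolute difference, and favour genes with the smallest gap.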


MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimize communication overhead. A MapReduce framework (or system) is usually composed of three operations (or steps):

Map: each worker node applies the map function to the local data, and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
Reduce: worker nodes now process each group of output data, per key, in parallel.

MapReduce allows for distributed processing of the map and reduction operations. Maps can be performed in parallel, provided that each mapping operation is independent of the others. Similarly, a set of 'reducers' can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative.

Another way to look at MapReduce is as a 5-step parallel and distributed computation:

1. Prepare the Map input: the "MapReduce system" designates Map processors, assigns the input key K1 that each processor would work on, and provides that processor with all the input data associated with that key.
2. Run the user-provided Map code: Map is run exactly once for each K1 key, generating output organized by key K2.
3. Shuffle the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the K2 key each processor should work on, and provides that processor with all the Map-generated data associated with that key.
4. Run the user-provided Reduce code: Reduce is run exactly once for each K2 key produced by the Map step.
5. Produce the final output: the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.

These five steps can be logically thought of as running in sequence (each step starts only after the previous step is completed), although in practice they can be interleaved as long as the final result is not affected.

In many situations, the input data might already be distributed ("sharded") among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data. Similarly, step 3 could sometimes be sped up by assigning Reduce processors that are as close as possible to the Map-generated data they need to process.
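The map, shuffle and reduce steps can be illustrated with a small single-process word count. This is a sketch for intuition only and does not use Hadoop: the names `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical, and everything runs sequentially in memory, whereas a real framework distributes each phase across worker nodes.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: sum all counts that share the same word."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle and reduce over (K1, V1) records in one process."""
    # Map: call the user function once per input record.
    mapped = []
    for k1, v1 in records:
        mapped.extend(map_fn(k1, v1))
    # Shuffle: group all intermediate values by their K2 key.
    groups = defaultdict(list)
    for k2, v2 in mapped:
        groups[k2].append(v2)
    # Reduce: call the user function once per K2 key, with output sorted by K2.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


lines = [(0, "a b a"), (1, "b c")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Because addition is associative, the same `reduce_fn` could also be applied to partial groups on each mapper before the shuffle, which is the role a combiner plays in reducing the data sent over the network.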

Mathematical Problems in Engineering

Logical view

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. The Map function is applied in parallel to every pair (keyed by k1) in the input dataset. This produces a list of pairs (keyed by k2) for each call. After that, the MapReduce framework collects all pairs with the same key k2 from all lists and groups them together, creating one group for each key.
