Data Locality in Hadoop



Data locality is a core concept in Hadoop. MapReduce is built on the assumption that it is cheaper to keep data on disks close to the CPU and RAM that will process it than to move that data across the network. In short: keep the computation near the data.

Introduction:

In Hadoop, moving computation to the data costs less than moving data to the computation, so the scheduler tries to place tasks on nodes that are local to their input, which yields better performance. This post explains how data locality works and describes a couple of locality issues that can arise.


A dataset stored in HDFS is divided into blocks, and those blocks are distributed across the DataNodes in the Hadoop cluster. When a MapReduce job executes against the dataset, individual Mappers process the blocks. If the data a Mapper needs is not available on the node where the Mapper is executing, the data must be copied over the network from the DataNode that holds it to the node running the Mapper task.
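As a rough sketch of the block-to-Mapper relationship: each HDFS block typically becomes one input split and therefore one Mapper. Assuming the default HDFS block size of 128 MB (Hadoop 2.x and later), the number of Mappers for a file can be estimated from its size; the file size below is purely illustrative:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in Hadoop 2.x+

def num_blocks(file_size_bytes):
    """Each HDFS block typically maps to one input split / one Mapper."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 10 GiB file splits into 80 blocks, so roughly 80 Mapper tasks.
print(num_blocks(10 * 1024 ** 3))  # 80
```

Note this is only an approximation: custom input formats, small files, and a non-default `dfs.blocksize` all change the split count.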

Imagine a MapReduce job with over 70 Mappers, each trying to copy its data from another DataNode in the cluster at the same time. The result would be a congested network, as all the Mappers pull data simultaneously, which is far from ideal. It is therefore cheaper and more effective to move the computation closer to the data than the other way around.

How is data proximity defined?

When the ApplicationMaster receives a request to run a job, it looks at which nodes in the cluster have enough resources to execute the job's Mappers and Reducers. At this point it carefully decides which node each individual Mapper will run on, based on where the data for that Mapper is located.


When the data is located on the same node as the Mapper working on it, this is referred to as Data Local. In this case the data is as close as possible to the computation, so the ApplicationMaster prefers the node that already holds the data the Mapper needs.



Although Data Local is the ideal choice, it is not always possible to execute the Mapper on the same node as the data, due to resource constraints on a busy cluster. In that case the next preference is to run the Mapper on a different node within the same rack as the node that holds the data; this is called Rack Local, and the data is copied between nodes within the rack. On a busy cluster, sometimes even Rack Local is not possible. A node on a different rack is then chosen to execute the Mapper, and the data is copied across racks from the node that holds it to the node running the Mapper.
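The fallback order described above (Data Local, then Rack Local, then Off-Rack) can be sketched as a simple placement function. This is a minimal illustration, not how YARN's scheduler is actually implemented; the node and rack names are hypothetical, and real schedulers weigh many more factors (queues, delay scheduling, container sizes):

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose a node to run a Mapper, preferring locality.

    replica_nodes: nodes holding a replica of the Mapper's input block
    free_nodes:    nodes with enough free resources, in preference order
    rack_of:       mapping from node name to rack name
    """
    # 1. Data Local: a node that both holds the data and has capacity.
    for node in replica_nodes:
        if node in free_nodes:
            return node, "DATA_LOCAL"
    # 2. Rack Local: a free node on the same rack as some replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "RACK_LOCAL"
    # 3. Off-Rack: any free node; the block is copied across racks.
    for node in free_nodes:
        return node, "OFF_RACK"
    return None, "NO_NODE"

# Hypothetical cluster: two racks; replicas on n1 and n3, both busy.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_node(["n1", "n3"], ["n2", "n4"], rack_of))  # ('n2', 'RACK_LOCAL')
```

In the example, neither replica node has free capacity, so the scheduler falls back to n2, which at least shares rack r1 with replica node n1, keeping the copy traffic inside one rack.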
