
YARN in Hadoop

YARN – “Yet Another Resource Negotiator”.
In this article, we will discuss YARN. With the introduction of YARN, Hadoop became more powerful. YARN was introduced in Hadoop 2.x and offers advantages over previous versions of Hadoop, including better scalability, cluster utilization, and user agility. It provides full backward compatibility with existing MapReduce tasks and applications. The YARN project was started by the Apache community to give Hadoop the ability to run non-MapReduce programs on the Hadoop framework.
The figure illustrates how YARN fits into the new Hadoop ecosystem: the ResourceManager runs on a single master node, while a NodeManager runs on every other node.

Services in Hadoop (2.x)

After installing Hadoop 2.x, format the NameNode and start all services with the start-all.sh command on the terminal, then run the jps command. If all the services shown in fig.2 (NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager) are running, the installation of Hadoop is successful.
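For example, on a single-node (pseudo-distributed) installation the sequence typically looks like this; start-all.sh is deprecated in newer releases in favour of start-dfs.sh and start-yarn.sh, but it still works:

$ hdfs namenode -format
$ start-all.sh
$ jps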

In Hadoop 1.x, the JobTracker was responsible for both resource management and job scheduling/monitoring. The fundamental idea of YARN is to split these two major responsibilities of the JobTracker into separate daemons: a global ResourceManager and a per-application ApplicationMaster.

The NodeManager is the per-machine slave, which is responsible for launching the application’s containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager. The ResourceManager divides the resources among all the applications in the system. The ResourceManager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications. The scheduler performs its scheduling function based on the resource requirements of an application, using the abstract notion of a resource container, which incorporates resource dimensions such as memory, CPU, disk, and network. The per-application ApplicationMaster is responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

Execution steps of YARN:

  • A container executes a specific application task.
  • A single NodeManager can host multiple containers.
  • After a container has been allocated on a NodeManager, the ResourceManager starts the ApplicationMaster process within that container.
  • The container performs the computation required for the task.
  • In MapReduce, the ApplicationMaster coordinates the mapper/reducer processes, which run inside containers.
  • If the ApplicationMaster requires extra resources, it requests additional resources from the ResourceManager; these additional resources come in the form of containers.
  • The ResourceManager scans the cluster and finds NodeManagers with free capacity.
  • A NodeManager has no information about other free NodeManagers, so such requests must go through the ResourceManager.
  • The ApplicationMaster on the original node then starts off the tasks on the newly assigned nodes (see the commands after this list for a way to observe this).
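Assuming the cluster is up, the registered NodeManagers and the applications whose ApplicationMasters are currently running can be observed with the YARN CLI:

$ yarn node -list
$ yarn application -list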

Various scheduler options

  1. FIFO Scheduler: Basically a simple “first come, first served” scheduler in which jobs are pulled from a work queue, oldest job first.
  2. Capacity Scheduler: The Capacity Scheduler is another pluggable scheduler for YARN that allows multiple groups to securely share a large Hadoop cluster.
    • Capacity is distributed among different queues.
    • Each queue is allocated a share of the cluster resources.
    • A job can be submitted to a specific queue.
    • Within a queue, FIFO scheduling is used.
    • The Capacity Scheduler is the default scheduler.
  3. Fair Scheduler: Fair scheduling is a method of assigning resources to applications such that all applications get, on average, an equal share of resources over time.

Configure scheduling policy:

$ vi yarn-site.xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
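To switch back to the default Capacity Scheduler, the same property can instead point at its class (this is also the value YARN falls back to when the property is not set):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>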

We can define separate queues for development and production

$ vi etc/hadoop/capacity-scheduler.xml
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>dev,prod</value>
  <description>The queues at this level (root is the root queue).
  </description>
</property>

<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>

<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
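Most queue changes in capacity-scheduler.xml can be picked up at runtime, without restarting the ResourceManager:

$ yarn rmadmin -refreshQueues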

We can submit a job to a specific queue

$ hadoop jar sample.jar wordCount.Main -D mapreduce.job.queue.name=prod input output

You can check the queues in the ResourceManager web UI at http://localhost:8088/cluster
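The same queue information can also be inspected from the command line, for example:

$ mapred queue -list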

Containers:

A container is a collection of physical resources such as RAM, CPU cores, and disks on a single node. There can be multiple containers on a single node (or a single large one). Every node in the system is considered to be composed of multiple containers of a minimum size of memory (e.g., 512 MB or 1 GB) and CPU. The ApplicationMaster can request any container so as to occupy a multiple of the minimum size.
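The minimum and maximum container sizes are set on the ResourceManager in yarn-site.xml; the values below are only illustrative, not recommendations:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>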

NodeManager:

The NodeManager is YARN’s per-node “worker” agent, taking care of the individual compute nodes in a Hadoop cluster. Its duties include keeping up-to-date with the ResourceManager, overseeing application containers’ life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, log management, and auxiliary services that may be exploited by different YARN applications. On start-up, the NodeManager registers with the ResourceManager; it then sends heartbeats with its status and waits for instructions. Its primary goal is to manage application containers assigned to it by the ResourceManager.
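How much memory and CPU a NodeManager advertises to the ResourceManager is configured per node in yarn-site.xml; again, the values shown here are only examples:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>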

ApplicationMaster (AM):

The AM is the process that coordinates an application’s execution in the cluster. Each application has its own unique AM, which is tasked with negotiating resources (containers) from the ResourceManager and working with the NodeManager to execute and monitor the tasks. In the YARN design, MapReduce is just one application framework; this design permits building and deploying distributed applications using other frameworks. Once the AM is started (as a container), it will periodically send heartbeats to the ResourceManager to affirm its health and to update the record of its resource demands.
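For example, once an application finishes, the logs of its AM and its containers can be fetched with the YARN CLI (the application id below is a placeholder, and log aggregation must be enabled via yarn.log-aggregation-enable):

$ yarn logs -applicationId <application_id>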

ResourceManager (RM):

The RM is primarily a pure scheduler. It is strictly limited to arbitrating requests for available resources in the system made by the competing applications. It optimizes for cluster utilization (i.e., keeps all resources in use all the time) against various constraints such as capacity guarantees, fairness, and service level agreements (SLAs). To allow for different policy constraints, the RM has a pluggable scheduler that enables different algorithms, such as those focusing on capacity and fair scheduling, to be used as necessary.

That’s all for this article; I hope you are now clear about the YARN architecture and its working model.
