Apache Spark: How to Decide the Number of Executors & Memory per Executor?

Well, to understand the internals of Spark, one should have a good grasp of its terminology. Once you are clear on the terminology, it becomes much easier to debug a Spark job.

So, let's first understand what a Cluster, Driver, and Executor are, followed by Jobs, Stages & Tasks.

Cluster 

A cluster is a group of nodes (JVMs) connected over a network, each of which runs Spark as either a Driver or a Worker.

Driver 

The driver is one of the nodes in the cluster. It is the master node of a Spark cluster, and it is where the main() method of our program runs. It executes the user code and creates the SparkSession, which is responsible for creating RDDs, DataFrames, and Datasets, and for executing SQL.

It also coordinates with all the executors for task execution: it checks the current set of executors and schedules tasks accordingly. In addition, it keeps track of the metadata for any data persisted in the executors' memory.
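For illustration, here is a minimal PySpark sketch (the application name is just a placeholder) of the driver creating a SparkSession and using it to build a DataFrame and run SQL:

from pyspark.sql import SparkSession

# The driver's main() creates the SparkSession, the entry point for
# building DataFrames/Datasets and running SQL.
spark = (SparkSession.builder
         .appName("terminology-demo")      # placeholder application name
         .getOrCreate())

df = spark.range(0, 1000)                  # DataFrame created via the SparkSession
df.createOrReplaceTempView("numbers")
spark.sql("SELECT COUNT(*) AS n FROM numbers").show()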

Executor 

Executors are launched at the beginning of a Spark application and reside on the worker nodes. They run the tasks and return the results to the driver. They can also persist (cache) data on the worker node.

Now, when a job is executed, an execution plan is created according to the lineage graph. The job is split into stages, and each stage can have multiple tasks performing the same operation on different partitions of the data. So, in short: Job --> Stages --> Tasks.

Job

A Job is a sequence of stages, triggered by an action such as count(), collect(), save(), etc.
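As a rough sketch (reusing the spark session from above), note that transformations are lazy and only an action actually triggers a job:

# Transformations such as map() are lazy -- they only record lineage.
rdd = spark.sparkContext.parallelize(range(100), 4)
doubled = rdd.map(lambda x: x * 2)         # no job triggered yet

# Each action below triggers its own job, which is split into stages and tasks.
total = doubled.count()                    # job 1
values = doubled.collect()                 # job 2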

Stages

Each job is divided into smaller sets of tasks called stages, which depend on each other. Stages form the nodes of the DAG and are created based on which operations can be performed serially or in parallel. Not all Spark operations can happen in a single stage (a shuffle, for example, forces a stage boundary), so a job may be divided into multiple stages.
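For example (continuing the sketch above), a wide transformation such as reduceByKey() requires a shuffle, which introduces a stage boundary:

# The shuffle forces two stages: one up to the map, one for the aggregation.
pairs = spark.sparkContext.parallelize(range(100), 4).map(lambda x: (x % 10, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)   # shuffle -> new stage
counts.collect()                                 # action triggers the two-stage job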

Tasks

A task is a single unit of work or execution that is sent to a Spark executor. Each stage is comprised of Spark tasks, which are distributed across the Spark executors; each task maps to a single core and works on a single partition of data.
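A quick way to see the partition-to-task mapping (again with the spark session from above):

# A stage gets one task per partition; each task occupies one executor core.
rdd = spark.sparkContext.parallelize(range(1000), 8)
print(rdd.getNumPartitions())              # 8 partitions
rdd.count()                                # so this stage runs 8 tasks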

Now that we are clear on the basic terminology of Spark, let's understand how to decide the number of executors and the memory per executor with the example below:

Note: The best choice is 5 or fewer cores per executor (for good HDFS throughput).

Total Nodes - 6 
Cores per Node - 16 cores
Memory per node - 64 GB RAM

Deciding the Number of Executors

Available cores per node – 15
Available memory per node – 63 GB
(1 core and 1 GB per node are reserved for the Hadoop daemons and the OS)

Number of executors per node = 15 / 5 = 3 (with 5 cores per executor being the best choice)

Total executors = 6 nodes * 3 executors per node = 18 executors

Total available executors = 18 - 1 = 17 (1 is set aside for the YARN Application Master)
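The same arithmetic, written out as a small Python sketch for this example cluster:

nodes = 6
cores_per_node = 16
cores_for_spark = cores_per_node - 1                 # 1 core reserved for Hadoop/OS
cores_per_executor = 5                               # the "5 or fewer" rule of thumb

executors_per_node = cores_for_spark // cores_per_executor    # 15 // 5 = 3
total_executors = nodes * executors_per_node - 1              # 18 - 1 = 17 (1 for the AM)
print(total_executors)                                        # 17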

Memory Per Executor

Memory per executor = 63 GB / 3 executors = 21 GB

Memory overhead = 7% of 21 GB ≈ 1.5 GB (round up to 2 GB)

So actual executor memory = 21 - 2 = 19 GB
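Continuing the sketch, the memory side of the calculation (7% is the usual rule-of-thumb figure for spark.executor.memoryOverhead):

memory_per_node_gb = 64 - 1                          # 1 GB reserved for Hadoop/OS
memory_per_executor_gb = memory_per_node_gb // 3     # 63 // 3 = 21 GB per executor
overhead_gb = 0.07 * memory_per_executor_gb          # ~1.5 GB, round up to 2 GB
executor_memory_gb = memory_per_executor_gb - 2      # 21 - 2 = 19 GB of heap
print(executor_memory_gb)                            # 19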

So, in total, we have 17 executors with about 19 GB of memory per executor.
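Assuming YARN, these values would typically be passed to spark-submit as --num-executors 17 --executor-cores 5 --executor-memory 19G, or set on the SparkSession as in this minimal sketch:

from pyspark.sql import SparkSession

# Applying the numbers derived above (valid only for this example cluster).
spark = (SparkSession.builder
         .appName("sized-app")                       # placeholder application name
         .config("spark.executor.instances", "17")
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "19g")
         .getOrCreate())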

I hope this is now clear and that you can easily calculate the number of executors and the memory required per executor.

If you like this blog, please show your appreciation by hitting the like button and sharing it. Also, drop a comment about the post and any improvements you'd suggest. Till then, HAPPY LEARNING.
