Does partitioning affect your job performance?
Spark splits data into partitions, and computation runs in parallel on each partition. It is very important to understand how data is partitioned, and when you need to adjust the partitioning manually, in order to run Spark applications efficiently.
Now, let's dive into our main topic: Repartition vs. Coalesce.
What is Coalesce?
The coalesce method reduces the number of partitions in a DataFrame. Coalesce avoids a full shuffle: instead of creating new partitions and redistributing all the data, it merges existing partitions into fewer ones. This means it can only decrease the number of partitions.
What is Repartitioning?
The repartition method can be used to either increase or decrease the number of partitions in a DataFrame. Repartition is a full shuffle operation: all the data is taken out of the existing partitions and distributed evenly across the newly formed partitions.
Where to use what?
Let's look at the example below for the answer.
Now, if I manually set the number of partitions to 10, see how the data gets distributed:
Coalesce took less time than repartitioning, and the data gets partitioned as below:
If you observe the table above, when repartitioned, the data is equally populated across all the partitions, but when we used coalesce the data is not equally distributed.
Also, notice that coalesce didn't split your data into 10 partitions; instead it created only 6. That means even if you ask for a larger number of partitions, coalesce keeps the existing partition count, which in the above case is 6.
Now that we understand the behavior, let's return to our initial question: where to use which function?
Coalesce use case: if we pass all 10 partitions above into our RDD and perform some action, the executor processing file part-00000 will finish first, followed by the others, but the executor with part-00005 will still be running while the first executor sits idle. Hence, the load is not balanced equally across the executors.
Repartition use case: All the executors finish the job at the same time, and the resources are consumed equally because all input partitions have the same size.
So, here is the answer:
- If you have loaded a dataset that contains a huge amount of data, and a lot of transformations need an equal distribution of load across executors, use repartition.
- Once all the transformations are applied and you want to save the data into fewer files (no. of files = no. of partitions) instead of many, use coalesce.
So, this was all about repartition & coalesce. I hope that, taking the inputs from this blog, you can now partition your data better and improve your job performance.
In our next blog, we will be discussing Windows Operations in Spark SQL.
If you like this blog,
please do show your appreciation by hitting like button and sharing this
blog. Also, drop any comments about the post & improvements if
needed. Till then HAPPY LEARNING.