Apache Spark: Tricks to Increase Job Performance



Apache Spark is quickly being adopted in the real world, and companies like Uber are already running it in production.
Spark keeps gaining popularity because, beyond batch processing, it also lets you build streaming applications and do machine learning, which helps companies get better results in production along with proper analysis.
Even though companies are already running Spark in production, you should know the best practices to follow to write a better Spark job and increase its performance.

1. Avoid UDFs...But why?

Because internally, the Catalyst optimiser does not look inside UDFs at all: they are black boxes to it, so you lose that whole level of optimisation. Instead, try using the Spark SQL API to develop your application.
//using UDF
import org.apache.spark.sql.functions.udf
import spark.implicits._

val addOne = udf((num: Int) => num + 1)
val res1 = df.withColumn("col2", addOne($"col1"))
res1.show()
//time taken by UDF: ~0.7 seconds
There are many Spark SQL DataFrame functions available which are already optimised, easy to use, and noticeably shorten your code. In many use cases we end up writing long custom code, only to find on closer inspection that a built-in function already exists for it.
//using SparkSQL API
import org.apache.spark.sql.Column

def addOne2(col1: Column) = col1.plus(1)
val res2 = df.withColumn("col2", addOne2($"col1"))
res2.show()

//the same thing as a SQL query (the DataFrame must be registered as a view first)
df.createOrReplaceTempView("df")
val res3 = spark.sql("select *, col1 + 1 as col2 from df")
res3.show()
//time taken by SparkSQL API: ~70 milliseconds
It is clearly visible that the UDF took much more time than the Spark SQL API.
Of course, it is not always possible to stick to the SQL API, but whenever you do have to write your own UDF, try to keep it small and write optimised code with a better approach to the use case, as in the sketch below.
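For instance, here is a minimal sketch of a slightly more careful UDF, assuming a DataFrame df with a nullable integer column col1 (the column and variable names are just illustrative):
import org.apache.spark.sql.functions.udf

//take a boxed Integer and handle nulls explicitly, so the UDF neither throws
//nor silently turns missing values into wrong results
val safeAddOne = udf((num: java.lang.Integer) => Option(num).map(_ + 1))
val resSafe = df.withColumn("col2", safeAddOne($"col1"))
resSafe.show()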

2. Use Catalyst to optimise your Code

Do you know what Catalyst is and how it helps in optimising your code? If not, here is your answer.
Whenever you write a Spark application, Spark builds an execution plan for it. You can inspect that plan with a simple call to the explain method on your DataFrame. It shows how the code you have written is translated and how the Catalyst optimiser has rewritten the job for better performance.
Example: execution planning using explain
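A quick sketch of what that call looks like, assuming a small DataFrame with an integer column col1 (the column and variable names are illustrative):
import spark.implicits._

//a small DataFrame with a couple of transformations on top of it
val df = Seq(1, 2, 3).toDF("col1")
val transformed = df.filter($"col1" > 1).withColumn("col2", $"col1" + 1)

//prints the parsed, analysed and optimised logical plans plus the physical plan
transformed.explain(true)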

It is also a best practice to avoid shuffles wherever you can while writing your job, as shuffling data across the cluster is very expensive and degrades job performance.
So try to reduce such operations, and use the explain output to spot where shuffles (the Exchange nodes in the physical plan) show up, as in the sketch below.
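One common way to avoid a shuffle, for example, is to broadcast a small lookup table instead of letting Spark do a shuffle-based join. A minimal sketch, assuming a large DataFrame largeDf and a small one smallLookupDf that share an id column (all of these names are illustrative):
import org.apache.spark.sql.functions.broadcast

//without the hint, this join may shuffle both sides across the cluster;
//broadcasting the small table ships a copy of it to every executor instead
val joined = largeDf.join(broadcast(smallLookupDf), Seq("id"))
joined.explain()  //the plan should now show a BroadcastHashJoin instead of a SortMergeJoin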
So, these are some of the best practices to follow while writing your Spark job, and they will definitely help you increase its performance.

Interested in a better way to partition your job? Please read this post: Apache Spark: Repartitioning v/s Coalesce.
If you like this blog, please show your appreciation by hitting the like button and sharing it. Also, drop a comment about the post and any improvements you would suggest. Till then, HAPPY LEARNING.
