Apache Spark: Delta Lake as a Solution - Part II

Well, in Delta Lake as a Solution: Part 1 we have already covered the features missing from Apache Spark and the causes of those issues. Today we will be talking about what Delta Lake is and how it solves all the problems discussed there.

As we all know, Spark is just a processing engine; it doesn't have its own storage or metadata store. Instead, it uses systems such as S3 or HDFS for storage, and while creating tables and views it uses the Hive metastore. That's the major reason Spark on its own doesn't provide reliability guarantees such as ACID transactions. Delta Lake came out as a solution for exactly this. How? That is what we will be discussing throughout this blog.

What is Delta Lake?

Databricks Delta is a unified data management system that brings reliability and performance (10-100x faster than Spark on Parquet) to cloud data lakes. Delta's core abstraction is a Spark table with built-in reliability and performance optimizations.
You can read and write data stored in Databricks Delta using the same familiar Apache Spark SQL batch and streaming APIs you use to work with Hive tables or DBFS directories. Databricks Delta provides the following functionality:
  • ACID transactions - Multiple writers can simultaneously modify a data set and see consistent views.
  • DELETES/UPDATES/UPSERTS - Writers can modify a data set without interfering with jobs reading the data set.
  • Automatic file management - Data access is sped up by organizing data into large files that can be read efficiently.
  • Statistics and data skipping - Reads are 10-100x faster when statistics are tracked about the data in each file, allowing Delta to avoid reading irrelevant information.
So, to enable ACID guarantees, Delta Lake is added as a library/service in the Spark ecosystem. Once it is added, instead of interacting with storage directly, your program talks to Delta Lake for reading and writing data.
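For reference, here is a minimal setup sketch for enabling Delta Lake in a Spark application. The two session configs below apply to recent open-source Delta Lake releases, and the artifact version is a placeholder you should match to your Spark and Scala versions.

// Add the Delta Lake dependency, e.g. with spark-shell / spark-submit:
//   --packages io.delta:delta-core_2.12:<version>   (version is a placeholder)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-lake-demo")
  // Register Delta's SQL extensions and catalog (needed on newer Delta releases
  // for SQL commands such as DELETE, UPDATE and MERGE)
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()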

Now let's understand how Delta Lake solves the atomicity and consistency problems present in plain Apache Spark.

How does Delta Lake solve these issues?

First, let's discuss what happens in the case of Spark without Delta Lake:

For example, consider a dataframe rawDF as below:
// Job 1: write the dataframe to the specified path
rawDF.write.mode("overwrite").csv(path)

Now you have another dataframe, rawDF1, which gets saved to the same path:

// Job 2: overwrite the same path (throws an exception midway)
rawDF1.write.mode("overwrite").csv(path)

But consider: Job 2 (rawDF1) fails and throws an exception partway through.
 
What will happen?

Obviously, there will be data loss: in overwrite mode Spark first removes the old data and only then writes the new data, so a failure in between leaves you with neither. That's the reason plain Spark is not ACID compliant.
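To make this concrete, here is a rough sketch of the scenario (the path and the simulated failure are made up for illustration), assuming a running SparkSession named spark, e.g. in spark-shell:

import org.apache.spark.sql.functions.udf
import spark.implicits._

val path = "/tmp/demo/csv-table"   // illustrative path

// Job 1: write the original data successfully
Seq(1, 2, 3).toDF("id").write.mode("overwrite").csv(path)

// Simulate a mid-write failure: this UDF throws while Job 2 is running
val failOn = udf((id: Int) => if (id == 5) throw new RuntimeException("simulated task failure") else id)

// Job 2: the overwrite removes the old files first, then the job dies
try {
  Seq(4, 5, 6).toDF("id").withColumn("id", failOn($"id"))
    .write.mode("overwrite").csv(path)
} catch {
  case e: Exception => println(s"Job 2 failed: ${e.getMessage}")
}

// The original rows are gone: this read now fails ("Path does not exist")
// or returns only whatever partial output survived
spark.read.csv(path).show()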

Now, what happens if we do the same thing with Delta Lake?

Note: Delta Lake always stores data in Parquet format.
// Job 1: write the dataframe to the specified path as a Delta table
rawDF.write.mode("overwrite").format("delta").save(path)

Now you have another dataframe, rawDF1, which gets saved to the same path:

// Job 2: overwrite the same path (throws an exception midway)
rawDF1.write.mode("overwrite").format("delta").save(path)
Again, suppose Job 2 (rawDF1) fails and throws an exception partway through.

Will you still lose your data?
And the answer to this heartbreaking question is No, you won't be losing your data at all. Hurray!!

But how is that possible?

In the example above, whenever a job commits with Delta Lake enabled, the first (successful) dataframe write creates Parquet files at the specified path, and in parallel a directory named _delta_log is created automatically, which contains the commit history and logs, as shown below:
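Roughly, the target directory ends up looking like this (file names are illustrative):

path/
  _delta_log/
    00000000000000000000.json        // commit log for Job 1
  part-00000-xxxx.snappy.parquet     // data file(s) written by Job 1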

And that commit log contains commit info with every detail about the job, for example:

{
  "commitInfo": {
    "timestamp": 1585913491306,
    "operation": "WRITE",
    "operationParameters": {
      "mode": "Overwrite"
    },
    "isBlindAppend": false
  }
}
 
And if you check with a count operation (or any other read), the data written by Job 1 will still be present.
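For example, a quick check could look like this (using the same illustrative path as before):

// Delta only reads the committed version of the table,
// so all the rows written by Job 1 are still there
val df = spark.read.format("delta").load(path)
df.count()   // same count that Job 1 wrote
df.show()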

But why didn't it lose the data?

This is because the failed job does write some new Parquet files, but Delta Lake never creates a commit log entry for it. Readers only see files that belong to a committed entry in _delta_log, so the files from the failed job are simply ignored and the original data is not lost.
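If you want to verify this, the table history API from the delta-core library (a sketch, assuming the same path as above) shows only the successful commit; the failed Job 2 never appears:

import io.delta.tables.DeltaTable

// Only version 0 (Job 1's WRITE) shows up; the failed job left no commit behind
DeltaTable.forPath(spark, path)
  .history()
  .select("version", "timestamp", "operation")
  .show()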

That's the reason we say Delta Lake comes as a solution for Spark, making it atomic and consistent.

Hope this gives you a clear and detailed overview of Delta Lake. In our next blog, we will discuss Delta Lake batch operations such as Create, Append, Upsert and much more.

If you like this blog, please show your appreciation by hitting the like button and sharing it. Also, drop a comment about the post and any improvements if needed. Till then, HAPPY LEARNING.

Comments

  1. Great post. Looking forward to learning more

    Reply: Thanks, phynics... will surely be posting more such blogs.

  2. This is available from which version of Spark, and how do we enable it?

    Reply: This can be enabled in Spark; you just need to add the Delta Lake dependency. Hope this helps. Thanks!!
