Well, anyone working with Spark is very familiar with the ways of reading a file locally, from a table, or from HDFS.
But do you know how tricky it can be to read data into Spark from an S3 bucket?
This blog gives you a step-by-step guide on how to read data from an S3 bucket.
Before moving to our actual topic, we should know: what is an S3 bucket?
Amazon S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.
So in short, an S3 bucket is a container in which you can store any type of data.
Accessing S3 Bucket through Spark
Now, coming to the actual topic: how to read data from an S3 bucket into Spark.
Well, it is not as simple as adding the Spark core dependencies to your project and using spark.read; a few extra configuration steps are needed before you can read your data from an S3 bucket.
So, to read data from S3, follow the steps below:
1. Edit the spark-defaults.conf file
You need to add the below 3 lines, consisting of your S3 access key, secret key, and the file system implementation:
spark.hadoop.fs.s3a.access.key <accessKey>
spark.hadoop.fs.s3a.secret.key <secretKey>
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
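If you would rather not edit spark-defaults.conf, the same properties can be set programmatically on a running session. Here is a minimal sketch for spark-shell; note that when set directly on the Hadoop configuration, the spark.hadoop. prefix is dropped, and <accessKey> and <secretKey> are placeholders for your own credentials:

// Sketch: configure S3A on the active session instead of spark-defaults.conf.
// <accessKey> / <secretKey> are placeholders for your own credentials.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "<accessKey>")
hadoopConf.set("fs.s3a.secret.key", "<secretKey>")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")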
2. Start Spark with the AWS SDK package
Add aws-java-sdk along with the hadoop-aws package to your spark-shell as written in the below command:
./spark-shell --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
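If you are building a standalone application rather than using the shell, the equivalent is to declare the same artifacts in your build. A build.sbt sketch follows; the aws-java-sdk and hadoop-aws versions match the command above, while the Spark version is an assumption you should adjust to your own setup:

// build.sbt sketch: same dependencies as --packages, pulled in at build time.
// The spark-sql version (2.2.0) is an assumed placeholder; match it to your cluster.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"    % "2.2.0",
  "com.amazonaws"     %  "aws-java-sdk" % "1.7.4",
  "org.apache.hadoop" %  "hadoop-aws"   % "2.7.3"
)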
3. Now read the data from S3
Let's say your S3 bucket contains Parquet data; to read it, do as below:
spark.read.parquet("S3 Bucket URL")
Example:
spark.read.parquet("s3a://<bucketname>/<path>")
Since the access and secret keys are already set in spark-defaults.conf, the URL only needs the bucket and path. (They can also be embedded inline as s3a://<accesskey>:<secretkey>@<bucketname>, but keeping credentials out of URLs is the safer choice.)
This is how you can access data from an S3 bucket through Spark.
If you like this blog, please do show your appreciation by hitting the like button and sharing it. Also, drop a comment about the post and any improvements if needed. Till then, HAPPY LEARNING.