Well, anyone working with Spark is very familiar with the ways of reading a file locally, from a table, or from HDFS.
But do you know how tricky it can be to read data into Spark from an S3 bucket?
This blog gives you a step-by-step guide on how to read data from an S3 bucket.
Before moving to our actual topic, we should know: what is an S3 bucket?
Amazon S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.
So in short, an S3 bucket is a container in which you can store any type of data.
Accessing S3 Bucket through Spark
Now, coming to the actual topic: how to read data from an S3 bucket into Spark.
Well, it is not as simple as adding the Spark core dependencies to your project and using spark.read; a few extra configuration steps are needed before you can read your data from an S3 bucket.
So, to read data from S3, follow the steps below:
1. Edit the spark-defaults.conf file
You need to add the below 3 lines, consisting of your S3 access key, secret key, and the file system implementation:
spark.hadoop.fs.s3a.access.key <accessKey>
spark.hadoop.fs.s3a.secret.key <secretKey>
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
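If you would rather not edit spark-defaults.conf, the same properties can be set programmatically on a running session. Here is a minimal sketch for spark-shell; note that when set directly on the Hadoop configuration, the spark.hadoop. prefix is dropped, and <accessKey> and <secretKey> are placeholders for your own credentials:

// Sketch: configure S3A on the active session instead of spark-defaults.conf.
// <accessKey> / <secretKey> are placeholders for your own credentials.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "<accessKey>")
hadoopConf.set("fs.s3a.secret.key", "<secretKey>")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")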
2. Start Spark with the AWS SDK package
Add aws-java-sdk along with the hadoop-aws package to your spark-shell as written in the below command:
./spark-shell --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
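If you are building a standalone application rather than using the shell, the equivalent is to declare the same artifacts in your build. A build.sbt sketch follows; the aws-java-sdk and hadoop-aws versions match the command above, while the Spark version is an assumption you should adjust to your own setup:

// build.sbt sketch: same dependencies as --packages, pulled in at build time.
// The spark-sql version (2.2.0) is an assumed placeholder; match it to your cluster.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"    % "2.2.0",
  "com.amazonaws"     %  "aws-java-sdk" % "1.7.4",
  "org.apache.hadoop" %  "hadoop-aws"   % "2.7.3"
)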
3. Now read the data from S3
Let's say your S3 bucket contains Parquet data; to read it, do as below:
spark.read.parquet("S3 Bucket URL")
Example:
spark.read.parquet("s3a://<bucketname>/<path>")
Since the access and secret keys are already set in spark-defaults.conf, the URL only needs the bucket and path. (They can also be embedded inline as s3a://<accesskey>:<secretkey>@<bucketname>, but keeping credentials out of URLs is the safer choice.)
This is how you can access data from an S3 bucket through Spark.
If you like this blog, please do show your appreciation by hitting the like button and sharing it. Also, drop a comment about the post and any improvements if needed. Till then, HAPPY LEARNING.