Big Data Evolution: Migrating an On-Premise Database to Hadoop



We are now generating massive volumes of data at an accelerating rate. To meet business needs, respond to changing market dynamics, and improve decision-making, sophisticated analysis of data from disparate sources is required. The challenge is how to capture, store, and model these massive pools of data effectively in relational databases.
Big data is not a fad. We are just at the beginning of a revolution that will touch every business and every life on this planet.
Every second we create new data. For example, Google alone handles about 40,000 search queries every second, which works out to roughly 3.5 billion searches per day and 1.2 trillion searches per year.

Why migrate to Hadoop?



Real-time streaming
Consider a shopping website. It needs to maintain product information, transaction data, and product reviews. That data can easily be stored in a relational database management system (RDBMS). But as the number of comments grows, the tables must be altered to accommodate the growth. These changes have to happen in near real time, and data modeling becomes very challenging because of the time and resources the changes require. Any change to the RDBMS schema may also affect the performance of the production database. There are many similar scenarios where the nature and volume of the stored information forces schema changes. These challenges can be addressed with toolsets from the Hadoop ecosystem, as the sketch below illustrates.
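For example, here is a minimal PySpark sketch of how the Hadoop ecosystem sidesteps the ALTER TABLE problem (the HDFS path and field names are hypothetical): reviews land as raw JSON files, and the schema is inferred at read time, so a new review field never forces a schema migration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reviews").getOrCreate()

# Reviews arrive as JSON files in HDFS; Spark infers the schema when it
# reads them, so a new field in tomorrow's files needs no ALTER TABLE.
reviews = spark.read.json("hdfs:///data/product_reviews/")
reviews.printSchema()

# Query the files like a table even though no schema was declared upfront.
reviews.groupBy("product_id").count().show()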

Workflow





MySQL

First of all, if you are coming from MySQL (which many companies currently use) or any other relational database, Hadoop may look different. Very different. Hadoop is, in many ways, the opposite of a relational database. Instead of a set of tables and indexes, Hadoop works with a set of files, often plain text. And there are no indexes at all. Yes, this may be shocking, but all scans are sequential.
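To make the "files, not tables" point concrete, here is a minimal PySpark sketch (the HDFS path and the search string are hypothetical). With no index to consult, the lookup has to read everything:

from pyspark import SparkContext

sc = SparkContext(appName="scan-demo")

# The data is just files on HDFS; there is no index, so this lookup reads
# every line of every file in the directory: a full sequential scan.
lines = sc.textFile("hdfs:///data/orders/")
matches = lines.filter(lambda line: "customer_42" in line)
print(matches.count())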

So, when does Hadoop make sense?



Hadoop is great if you need to store huge amounts of data (we are talking about petabytes) and that data does not require real-time (millisecond) response times. Hadoop works as a cluster of nodes (similar to MySQL Cluster), and all data is spread across the cluster with redundancy, so it provides both high availability (if implemented correctly) and scalability. The data retrieval process (map/reduce) is parallel, so the more data nodes you add to Hadoop, the faster the process becomes.
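To illustrate that parallelism, here is the classic map/reduce word count as a hedged PySpark sketch (the input and output paths are placeholders). Each stage runs on every data node at once, which is why adding nodes shortens the job.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# Classic map/reduce: each stage runs in parallel across the data nodes.
counts = (sc.textFile("hdfs:///data/logs/")
            .flatMap(lambda line: line.split())   # map: line -> words
            .map(lambda word: (word, 1))          # map: word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))     # reduce: sum the counts
counts.saveAsTextFile("hdfs:///output/wordcount")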

Sqoop



Sqoop imports and exports data between MySQL and the Hadoop ecosystem.

# Import a MySQL table into HDFS (the -P flag prompts for the password):
sqoop import --connect "jdbc:mysql://ip-172-31-20-247:3306/dbname" --table pipeline --username sqoopuser -P --target-dir /path
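The reverse direction looks much the same. As a hedged sketch of a Sqoop export (the connection string, table, and HDFS path are the placeholders carried over from the import above), results computed in Hadoop can be pushed back into MySQL:

# Export an HDFS directory back into a MySQL table:
sqoop export --connect "jdbc:mysql://ip-172-31-20-247:3306/dbname" --table pipeline --username sqoopuser -P --export-dir /path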

Apache Spark

For the analysis of big data, the industry is extensively using Apache Spark. Hadoop enables a flexible, scalable, cost-effective, and fault-tolerant computing solution, but the main concern is maintaining speed while processing big data. The industry needs a powerful engine that can respond in under a second, perform in-memory processing, and handle stream processing as well as batch processing. This is the gap that brought Apache Spark into existence.
Apache Spark is a powerful open-source framework that provides interactive processing, real-time stream processing, batch processing, and in-memory processing at very high speed, with a standard interface and ease of use.
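As a minimal sketch of a batch job with in-memory processing (not the post's actual pipeline; it assumes the Sqoop import above landed headerless CSV under /path, so Spark names the first column _c0):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# Batch-process the data imported by Sqoop and keep it in memory, which
# is where Spark gains its speed over disk-based MapReduce.
pipeline = spark.read.csv("hdfs:///path", inferSchema=True)
pipeline.cache()  # later actions reuse the in-memory copy

# An interactive, SQL-like aggregation over the cached data.
pipeline.groupBy("_c0").count().orderBy(F.desc("count")).show(10)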

Why Data Visualisation?



Data Insights
As data grows day by day, it becomes a challenge to visualize it and deliver useful results in less time. Data visualization comes to the rescue: it conveys concepts in a universal manner and makes it possible to experiment with different scenarios by making slight adjustments. It:
  1. Helps identify areas that need attention or improvement.
  2. Clarifies which factors influence customer behavior.
  3. Helps in understanding which fields to place where (for example, on a dashboard).
  4. Helps predict scenarios, and more.
Data visualization does not just make data look better; it provides insight into complex data sets by communicating their key aspects in more meaningful ways.
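As a hedged sketch of the idea (the categories and counts below are made-up placeholders, standing in for a small aggregate pulled out of the pipeline, e.g. via toPandas() on a summarized Spark result):

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical summary, e.g. the output of a Spark groupBy pulled down
# with toPandas(); the numbers here are placeholders.
summary = pd.DataFrame({
    "category": ["electronics", "books", "clothing"],
    "orders":   [1200, 800, 950],
})

# A simple bar chart surfaces "areas that need attention" faster than
# scanning the raw table.
summary.plot.bar(x="category", y="orders", legend=False)
plt.ylabel("orders")
plt.tight_layout()
plt.show()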
Refer to the insights below:





Demo Insights
To learn more about the implementation of the end-to-end pipeline, visit the GitHub repo.

If you like this blog, please show your appreciation by hitting the like button and sharing it. Also, drop a comment about the post and any improvements you'd suggest. Till then, HAPPY LEARNING.
