Big Data and PySpark with Spark 2.4.5: Based on the Udemy Video Series by Jose (Pierian Data)

vamseemudradi
May 4, 2020
INTRODUCTION

Big Data is data that is too large to fit on one machine for analysis, or whose processing requires more memory than a single machine can be configured with. To work with such data, we use a distributed file system.

To make such datasets workable, Hadoop introduced the Hadoop Distributed File System (HDFS), which distributes data across multiple worker (slave) nodes. The master node then used Hadoop MapReduce to split the computation on the big dataset across the computational resources of the worker nodes. MapReduce performs its operations on disk, which makes it slow.

In MapReduce, a Job Tracker on the master node allocates tasks to the worker nodes. Each worker node runs a Task Tracker, which monitors its tasks and returns the result of that node's computation to the master node. The image below depicts this operation:

Image: The Job Tracker sends code to run on the Task Trackers. The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes.
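The map, shuffle, and reduce flow described above can be illustrated with a small word count in plain Python. This is only a sketch of the idea and not Hadoop's actual API: the function names and the sample text are made up for illustration, and everything runs on one machine instead of being sent to worker nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Shuffle step: group the emitted values by key across all mapper outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input: each string stands in for a block held by a different worker node.
chunks = ["big data needs big clusters", "spark handles big data"]

# In a real cluster the map calls would run in parallel on the workers.
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In Hadoop MapReduce the intermediate results between these steps are written to disk, which is the main source of the slowness mentioned above.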

HDFS splits each file to be analysed into blocks of 128 MB (the default size) and keeps 3 replicas of every block, distributed across different worker machines, for fault tolerance.
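To get a feel for these numbers, here is a small calculation of how many blocks a file produces and how much raw storage its replicas need. It assumes the 128 MB block size and 3 replicas mentioned above; the helper name and the 1000 MB file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # block size stated above
REPLICATION_FACTOR = 3   # replica count stated above

def hdfs_footprint(file_size_mb):
    """Return (blocks, block copies, raw storage in MB) for a single file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    block_copies = blocks * REPLICATION_FACTOR
    # The last block only takes up as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, block_copies, raw_storage_mb

# Hypothetical 1000 MB file
blocks, copies, storage = hdfs_footprint(1000)
print(blocks, copies, storage)
```

So a 1000 MB file is cut into 8 blocks, and the cluster stores 24 block copies in total.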

Spark is not limited to HDFS as a data source. It can also connect to SQL databases over JDBC, read files from the local file system, and work with Cassandra. Additionally, it can fetch datasets and blob files from AWS S3 storage.

Image: RDDs are immutable, lazily evaluated, and cacheable
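The three properties in the caption can be shown with a toy class in plain Python. This is not how Spark implements RDDs. The class LazyDataset is hypothetical, and only its method names (map, filter, cache, collect) are borrowed from the PySpark RDD API to make the parallel easier to see.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable data, lazy evaluation, optional caching."""

    def __init__(self, source, transformations=()):
        self._source = tuple(source)                    # data is never modified
        self._transformations = tuple(transformations)  # recorded, not yet run
        self._cache_enabled = False
        self._cached = None

    def map(self, fn):
        # A transformation returns a NEW dataset; nothing is computed yet.
        return LazyDataset(self._source, self._transformations + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._transformations + (("filter", predicate),))

    def cache(self):
        # Ask for the result to be kept in memory after the first action.
        self._cache_enabled = True
        return self

    def collect(self):
        # An action: only now are the recorded transformations actually run.
        if self._cached is not None:
            return list(self._cached)
        data = list(self._source)
        for kind, fn in self._transformations:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._cache_enabled:
            self._cached = tuple(data)
        return data

calls = []  # records every time the map function really runs

def square(x):
    calls.append(x)
    return x * x

numbers = LazyDataset(range(5))
squares = numbers.map(square).filter(lambda x: x > 3).cache()
print(len(calls))          # still 0 here: nothing has been computed
print(squares.collect())   # first action runs the transformations
print(squares.collect())   # second action is served from the cache
print(len(calls))          # square ran once per element, not twice
```

The original numbers dataset is untouched by the transformations, and the second collect does not recompute anything.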

Spark can be up to 100x faster than MapReduce because it performs computations (on the master node and on the worker nodes) in memory, spilling over to disk only when the memory a computation requires exceeds the in-memory capacity of the local machine.

Even for disk-based computations, Spark can be up to 10x faster than Hadoop MapReduce.

The Resilient Distributed Dataset (RDD), the distributed data structure that Spark uses, has the following benefits:

1. Distributed data from various sources

2. Fault Tolerance

3. Parallelism, since the data is partitioned into chunks that can be processed at the same time
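The parallelism point can be sketched in plain Python by splitting a dataset into partitions and processing each one independently. This is only an illustration under simple assumptions: threads stand in for worker nodes, and the helper names and the dataset are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split a list into num_partitions roughly equal, contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def partial_sum(chunk):
    """Work done independently on one partition."""
    return sum(chunk)

data = list(range(1, 101))        # hypothetical dataset: the numbers 1 to 100
partitions = partition(data, 4)

# Threads stand in for worker nodes; each one handles a single partition.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum, partitions))

total = sum(partial_results)      # combine the per-partition results
print(partial_results, total)
```

Because each partition is processed on its own, a failed partition could also be recomputed on its own, which is the idea behind the fault tolerance point.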

As of Spark 2.0+, Spark's main syntax has moved from RDDs to DataFrames. In Spark terminology, a DataFrame is a table format in which a dataset is stored.

Under the hood, the way the data is physically distributed is still the RDD format.

This marks the end of the introduction to Spark RDDs, their use, and their benefits over analysing Big Data with Hadoop's MapReduce mechanism.
