
MapReduce, Spark Project Help

Table of contents:

Word count

  • Output the number of occurrences of each word in the dataset.

Naïve solution:

def word_count(D):
    # Count how many times each word appears, in one pass over the data.
    H = {}
    for w in D:
        H[w] = H.get(w, 0) + 1
    for w in H:
        print(w, H[w])

How to speed up?

Make use of multiple workers

There are some problems…

Data reliability

  • Splitting the data evenly

  • Worker delays (stragglers)

  • Worker failures

  • Aggregating the results

In the traditional approach to parallel and distributed processing, we would need to handle all of these issues ourselves!
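To see how much bookkeeping the hand-rolled approach takes, here is a minimal sketch (plain Python with multiprocessing; the function names are illustrative) that splits the data across workers and merges their partial counts — and it still handles none of the delay or failure cases:

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(words):
    # Each worker counts its own chunk independently.
    return Counter(words)

def parallel_word_count(words, n_workers=4):
    # Split the data across workers (one of the problems above:
    # a skewed split leaves some workers idle while others lag).
    chunks = [words[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial = pool.map(count_chunk, chunks)
    # Aggregate the per-worker results back on the driver.
    total = Counter()
    for c in partial:
        total += c
    return dict(total)

if __name__ == "__main__":
    data = "the bear ate the fish the end".split()
    print(parallel_word_count(data))
```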


MapReduce is a programming framework that:

  • allows us to perform distributed and parallel processing on large data sets in a distributed environment

  • removes the need to worry about issues such as reliability and fault tolerance

  • offers the flexibility to write code logic without caring about the design issues of the system

  • MapReduce consists of two phases: Map and Reduce

  • Map

      • reads a block of data

      • produces key-value pairs as intermediate outputs

  • Reduce

      • receives key-value pairs from multiple map jobs

      • aggregates the intermediate data tuples into the final output

A Simple MapReduce Example

Pseudo Code of Word Count

Map(D):
    for each w in D:
        emit(w, 1)

Reduce(t, counts):  # e.g., ("bear", [1, 1])
    sum = 0
    for c in counts:
        sum = sum + c
    emit(t, sum)
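The same pseudo code can be simulated end to end in plain Python — an illustrative sketch of the map, shuffle, and reduce phases, not how Hadoop actually runs them:

```python
from collections import defaultdict

def map_phase(block):
    # Map: read a block of data and emit (word, 1) pairs.
    return [(w, 1) for w in block]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, e.g. "bear" -> [1, 1].
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, counts):
    # Reduce: aggregate the values for one key into the final count.
    return key, sum(counts)

blocks = [["bear", "fish"], ["bear", "river"]]
intermediate = [pair for block in blocks for pair in map_phase(block)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'bear': 2, 'fish': 1, 'river': 1}
```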

Advantages of MapReduce

Parallel processing

  • Jobs are divided to multiple nodes

  • Nodes work simultaneously

  • Processing time reduced

Data locality

  • Moving the processing to the data

  • The opposite of the traditional approach, which moves data to the processing

Motivation of Spark

  • MapReduce greatly simplified big data analysis on large, unreliable clusters. It is great at one-pass computation.

But as soon as it got popular, users wanted more:

  • more complex, multi-pass analytics (e.g. ML, graph)

  • more interactive ad hoc queries

  • more real-time stream processing

Limitations of MapReduce

As a general programming model:

  • more suitable for one-pass computation on a large dataset

  • hard to compose and nest multiple operations

  • no means of expressing iterative operations

As implemented in Hadoop:

  • all datasets are read from disk, and results are written back to disk

  • all data is (usually) triple replicated for reliability

Data Sharing in Hadoop MapReduce

  • Slow due to replication, serialization, and disk IO

  • Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks:

  • Efficient primitives for data sharing

What is Spark?

  • Apache Spark is an open-source cluster computing framework for real-time processing.

  • Spark provides an interface for programming entire clusters with

      • implicit data parallelism

      • fault tolerance

  • Built on top of Hadoop MapReduce

      • extends the MapReduce model to efficiently support more types of computations

Spark Features

  • Polyglot

  • Speed

  • Multiple formats

  • Lazy evaluation

  • Real-time computation

  • Hadoop integration

  • Machine learning

Spark Ecosystem

Spark Architecture

  • Master Node

      • takes care of the job execution within the cluster

  • Cluster Manager

      • allocates resources across applications

  • Worker Node

      • executes the tasks

Spark Resilient Distributed Dataset (RDD)

  • RDD is where the data stays

  • RDD is the fundamental data structure of Apache Spark

  • Dataset: a collection of elements

  • Distributed: can be operated on in parallel

  • Resilient: fault-tolerant

Features of Spark RDD

  • In-memory computation

  • Partitioning

  • Fault tolerance

  • Immutability

  • Persistence

  • Coarse-grained operations

  • Location stickiness

Create RDDs

Parallelizing an existing collection in your driver program

  • Normally, Spark tries to set the number of partitions automatically based on your cluster

Referencing a dataset in an external storage system

  • HDFS, HBase, or any data source offering a Hadoop InputFormat

  • By default, Spark creates one partition for each block of the file
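As an illustration of the partitioning idea (not Spark's actual implementation), a driver-side collection can be chunked into partitions like this; `partition` is a hypothetical helper name:

```python
def partition(collection, num_partitions):
    # Split a collection into roughly equal contiguous partitions,
    # mimicking what parallelizing a driver-side collection does.
    n = len(collection)
    size, extra = divmod(n, num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # Spread the remainder over the first `extra` partitions.
        end = start + size + (1 if i < extra else 0)
        parts.append(collection[start:end])
        start = end
    return parts

print(partition(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```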

RDD Operations


Transformations

  • functions that take an RDD as input and produce one or more RDDs as output

  • Narrow Transformation

  • Wide Transformation

Actions

  • RDD operations that produce non-RDD values

  • return the final result of RDD computations

Narrow and Wide Transformations

Narrow transformation involves no data shuffling

  • map

  • flatMap

  • filter

  • sample

Wide transformation involves data shuffling

  • sortByKey

  • reduceByKey

  • groupByKey

  • join
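The distinction can be sketched in plain Python (illustrative only): a narrow transformation touches each partition in isolation, while a `groupByKey`-style operation must pull matching keys out of every partition:

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]

# Narrow transformation (map): each output partition depends on exactly
# one input partition, so it runs locally with no data movement.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide transformation (groupByKey): records with the same key may live
# in different partitions, so they must be shuffled together.
def group_by_key(parts):
    groups = defaultdict(list)
    for part in parts:
        for k, v in part:
            groups[k].append(v)
    return dict(groups)

print(group_by_key(mapped))  # {'a': [10, 10], 'b': [10], 'c': [10]}
```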


Actions are operations applied to an RDD that instruct Apache Spark to perform the computation and return the result to the driver

  • collect

  • take

  • reduce

  • foreach

  • takeSample

  • count

  • save (e.g., saveAsTextFile)
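A rough sketch of how such actions behave over partitioned data (plain Python with hypothetical helper names; real Spark runs the per-partition work on the cluster and only the combine step on the driver):

```python
from functools import reduce as fold

partitions = [[1, 2, 3], [4, 5]]

def collect(parts):
    # collect: bring every element back to the driver.
    return [x for p in parts for x in p]

def count(parts):
    # count: each partition counts locally, the driver sums the counts.
    return sum(len(p) for p in parts)

def reduce_action(parts, f):
    # reduce: combine within each partition, then combine the partials.
    partials = [fold(f, p) for p in parts if p]
    return fold(f, partials)

print(collect(partitions))                            # [1, 2, 3, 4, 5]
print(count(partitions))                              # 5
print(reduce_action(partitions, lambda a, b: a + b))  # 15
```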


RDD lineage is the graph of all the ancestor RDDs of an RDD

  • Also called RDD operator graph or RDD dependency graph

Nodes: RDDs

Edges: dependencies between RDDs

Fault tolerance of RDD

All the RDDs generated from fault-tolerant data are fault-tolerant.

If a worker fails and any partition of an RDD is lost:

  • the partition can be recomputed from the original fault-tolerant dataset using the lineage

  • the task will be assigned to another worker
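A toy sketch of this recovery idea (illustrative, not Spark internals): the lost partition is rebuilt by replaying the lineage's transformations, assumed deterministic, on the fault-tolerant source data:

```python
# Lineage of a partition: the source data plus the ordered list of
# transformations that produced it.
source_partition = [1, 2, 3, 4]
lineage = [
    lambda p: [x * 2 for x in p],      # map
    lambda p: [x for x in p if x > 2], # filter
]

def recompute(source, transformations):
    # If a worker fails and a partition is lost, replay the lineage on
    # the source to rebuild exactly the same partition on another worker.
    part = source
    for t in transformations:
        part = t(part)
    return part

print(recompute(source_partition, lineage))  # [4, 6, 8]
```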

DAG in Spark

  • A DAG (directed acyclic graph) is a directed graph with no cycles

  • Node: RDDs, results

  • Edge: Operations to be applied on RDD

  • When an action is called, the DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks

  • The DAG enables better global optimization than systems like MapReduce

DAG, Stages, and Tasks

  • DAG Scheduler splits the graph into multiple stages

  • Stages are created based on transformations

  • Narrow transformations are grouped together into a single stage

  • A wide transformation defines the boundary between two stages

  • DAG scheduler will then submit the stages into the task scheduler

  • The number of tasks depends on the number of partitions

  • The stages that are not interdependent may be submitted to the cluster for execution in parallel
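One simplified way to picture the stage split (an illustrative sketch; real Spark splits on shuffle dependencies between RDDs, not on operation names):

```python
# Wide transformations from the list above: each one forces a shuffle.
WIDE_OPS = {"sortByKey", "reduceByKey", "groupByKey", "join"}

def split_into_stages(operations):
    # Group consecutive narrow operations into one stage; a wide
    # transformation opens a new stage at the shuffle boundary.
    stages, current = [], []
    for op in operations:
        if op in WIDE_OPS and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

plan = ["map", "filter", "reduceByKey", "map", "sortByKey", "collect"]
print(split_into_stages(plan))
# [['map', 'filter'], ['reduceByKey', 'map'], ['sortByKey', 'collect']]
```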

Lineage vs. DAG in Spark

  • They are both DAG (data structure)

  • Different end nodes

  • Different roles in Spark

