top of page

Spark, Data Structure, Shuffle In Map Reduce

Data Structure in MapReduce

Key-value pairs are the basic data structure in MapReduce:

  • Keys and values can be: integers, float, strings, raw bytes

  • They can also be arbitrary data structures

The design of MapReduce algorithms involves:

  • Imposing the key-value structure on arbitrary datasets

    • E.g., for a collection of Web pages, input keys may be URLs and values may be the HTML content

  • In some algorithms, input keys are not used (e.g., wordcount), in others the uniquely identify a record

  • Keys can be combined in complex ways to design various algorithms

Recall of Map and Reduce


  • Reads data (split in Hadoop, RDD in Spark)

  • Produces key-value pairs as intermediate outputs


  • Receive key-value pairs from multiple map jobs

  • aggregates the intermediate data tuples to the final output

MapReduce in Hadoop

  • Data stored in HDFS (organized as blocks)

  • Hadoop MapReduce Divides input into fixed-size pieces, input splits

    • Hadoop creates one map task for each split

    • Map task runs the user defined map function for each record in the split

    • Size of a split is normally the size of a HDFS block

Data locality optimization

  • Run the map task on a node where the input data resides in HDFS

  • This is the reason why the split size is the same as the block size

    • The largest size of the input that can be guaranteed to be stored on a single node

    • If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks

MapReduce in Hadoop

Map tasks write their output to local disk (not to HDFS)

  • Map output is intermediate output

  • Once the job is complete the map output can be thrown away

  • Storing it in HDFS with replication would be overkill

  • If the node of the map task fails, Hadoop will automatically rerun the map task on another node

Reduce tasks don’t have the advantage of data locality

  • Input to a single reduce task is normally the output from all mappers

  • Output of the reduce is stored in HDFS for reliability

  • The number of reduce tasks is not governed by the size of the input, but is specified independently

More Detailed MapReduce Dataflow

When there are multiple reducers, the map tasks partition their output:

  • One partition for each reduce task

  • The records for every key are all in a single partition

  • Partitioning can be controlled by a user-defined partitioning function


  • Shuffling is the process of data redistribution

    • To make sure each reducer obtains all values associated with the same key.

  • It is needed for all of the operations which require grouping

    • E.g., word count, compute avg. the score for each department,

  • Spark and Hadoop have different approaches implemented for handling the shuffles.

Shuffle In Hadoop

  • Happens between each Map and Reduce phase

  • Use the Shuffle and Sort mechanism

    • Results of each Mapper are sorted by the key

    • Starts as soon as each mapper finishes

  • Use combiner to reduce the amount of data shuffled

    • Combiner combines key-value pairs with the same key in each par

    • This is not handled by the framework!

Shuffle in Spark

Hash Shuffle

  • Data are hash partitioned on the map side

    • Hashing is much faster than sorting

  • Files created to store the partitioned data portion

    • # of mappers X # of reducers

  • Use consolidateFiles to reduce the # of files

    • From M * R => E*C/T *R

  • Pros:

    • Fast

    • No memory overhead of sorting

  • Cons:

    • Large amount of output files (when # partition is big)

Sort Shuffle

  • For each mapper 2 files are created

    • Ordered (by key) data

    • Index of beginning and ending of each '

  • Merged on the fly while being read by reducers

  • Default way

    • Fallback to hash shuffle if # partitions is small

  • Pros

    • Smaller amount of files created

  • Cons

    • Sorting is slower than hashing

Map Reduce in Spark

  • Transformation

    • Narrow transformation

    • Wide transformation

  • Action

    • The job is a list of Transformations followed by one Action

    • Only action will trigger the 'real' execution

      • I.e., lazy evaluation

Transformation = Map? Action = Reduce?


  • RDD([K, V]) to RDD([K, C])

    • K: key, V: value, C: combined type

  • Three parameters (functions)

    • createCombiner

    • What is done to a single row when it is FIRST met?

    • V => C

  • mergeValue

    • What is done to a single row when it meets a previously reduced row?

    • C, V => C

    • In a partition

  • mergeCombiners

    • What is done to two previously reduced rows?

    • C, C => C

    • Across partitions

The Efficiency of MapReduce in Spark

  • Number of transformations

    • Each transformation involves a linearly scan of the dataset (RDD)

  • Size of transformations

    • Smaller input size => less cost on linearly scan

  • Shuffles

    • data transferring between partitions is costly

      • especially in a cluster!

        • Disk I/O

        • Data serialization and deserialization

        • Network I/O

Number of Transformations (and Shuffles)

rdd = sc.parallelize(data)

  • data: (id, score) pairs

Bad design

  • maxByKey= rdd.combineByKey(…)

  • sumByKey = rdd.combineByKey(…)

  • sumMaxRdd maxByKey.join(sumByKey)

Good design

  • sumMaxRdd=rdd.combineByKey(…)

Size of Transformation


  • data: (word, pairs)

Bad design

  • countRdd= rdd.reduceByKey(…)

  • fileteredRdd countRdd.filter(…)

Good design

  • fileteredRdd = countRdd.filter(…)

  • countRdd = fileteredRdd.reduceByKey(…)



  • data: (word, pairs)

Bad design

  • countRdd= rdd.reduceByKey(…)


Good design

  • paritionedRdd

  • data.partitionBy(…)


How to Merge Two RDDs?

  • Union

    • Concatenate two RDDs

  • Zip

    • Pair two RDDs

  • Join

    • Merge based on the keys from 2 RDDs

    • Just like join in DB


  • How do A and B union together?

    • What is the number of partitions for the union of A and B?

  • Case 1: Different partitioner:

    • Note: default partitioner is None

  • Case 2: Same partitioner:


  • Key-Value pairs after

    • Key: tuples in A

    • Value: tuples in B

  • Assumes that the two RDDs have

    • The same number of partitions

    • The same number of elements in each partition

    • E.g., 1 to 1 map


  • E.g., A.*Join(B)

  • join

    • All pairs with matching Keys from A and B

  • leftOuterJoin

    • Case 1: in both A and B

    • Case 2: in A but not B

    • Case 3: in B but not A

  • rightOuterJoin

    • Opposite to leftOuterJoin

  • fullOuterJoin

    • Union of leftOuterJoin and rightOuterJoin

MapReduce of “Strips"

  • Map a sentence into stripes

    • ForAll term u in sent s do: H u = new dictionary

    • ForAll term v in Neighbors(u) do: H u (v) H u (v)+1

  • Reduce by key and merge the dictionaries

    • element wise sum of dictionaries

“Stripes” Analysis


  • Far less sorting and shuffling of key value pairs


  • More difficult to implement

  • Underlying object more heavyweight

  • Fundamental limitation in terms of size of event space

Pairs vs. Stripes

  • The pair's approach

    • Keeps track of each pair of co-occur terms separately

    • Generates a large number of key-value pairs (also intermediate)

    • The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of a pair of words

  • The stripe approach

    • Keeps track of all terms that co-occur with the same term

    • Generates fewer and shorted intermediate keys

      • The framework has less sorting to do

    • Greatly benefits from combiners, as the keyspace is the vocabulary

    • More efficient, but may suffer from a memory problem

MapReduce in Real World: Search Engine

  • Information retrieval (IR)

    • Focus on textual information (= text/document retrieval)

    • Other possibilities include image, video, music,......

  • Boolean Text retrieval

    • Each document or query is treated as a bag ” of words or terms. Word sequence is not considered

    • Query terms are combined logically using the Boolean operators AND, OR, and NOT.

      • E.g., ((data AND mining) AND (NOT text))

    • Retrieval

      • Given a Boolean query, the system retrieves every document that makes the query logically true

      • Exact match

Contact us to get help related to "map-reduce assignment help", "map-reduce project help", "map-reduce homework help" or other project topics related to MapReduce at:
66 views0 comments

Recent Posts

See All


bottom of page