Data Structure in MapReduce
Key-value pairs are the basic data structure in MapReduce:
Keys and values can be: integers, floats, strings, raw bytes
They can also be arbitrary data structures
The design of MapReduce algorithms involves:
Imposing the key-value structure on arbitrary datasets
E.g., for a collection of Web pages, input keys may be URLs and values may be the HTML content
In some algorithms, input keys are not used (e.g., wordcount); in others, they uniquely identify a record
Keys can be combined in complex ways to design various algorithms
Recap of Map and Reduce
Map
Reads data (split in Hadoop, RDD in Spark)
Produces key-value pairs as intermediate outputs
Reduce
Receives key-value pairs from multiple map tasks
Aggregates the intermediate data tuples into the final output
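As a concrete illustration, a minimal word-count sketch in PySpark (local mode and the input path are hypothetical):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount-sketch")

# Map: read the input and produce (word, 1) key-value pairs as intermediate output
pairs = sc.textFile("lines.txt") \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word, 1))

# Reduce: aggregate the intermediate tuples per key into the final output
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())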
MapReduce in Hadoop
Data stored in HDFS (organized as blocks)
Hadoop MapReduce divides the input into fixed-size pieces called input splits
Hadoop creates one map task for each split
The map task runs the user-defined map function for each record in the split
The size of a split is normally the size of an HDFS block
Data locality optimization
Run the map task on a node where the input data resides in HDFS
This is why the split size is the same as the block size
A block is the largest unit of input guaranteed to be stored on a single node
If a split spanned two blocks, it would be unlikely that any single HDFS node stored both blocks
MapReduce in Hadoop (cont.)
Map tasks write their output to local disk (not to HDFS)
Map output is intermediate output
Once the job is complete the map output can be thrown away
Storing it in HDFS with replication would be overkill
If the node of the map task fails, Hadoop will automatically rerun the map task on another node
Reduce tasks don’t have the advantage of data locality
Input to a single reduce task is normally the output from all mappers
Output of the reduce is stored in HDFS for reliability
The number of reduce tasks is not governed by the size of the input, but is specified independently
More Detailed MapReduce Dataflow
When there are multiple reducers, the map tasks partition their output:
One partition for each reduce task
The records for every key are all in a single partition
Partitioning can be controlled by a user-defined partitioning function
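In Hadoop this is a custom Partitioner (Java API); in PySpark the analogous control is partitionBy with a user-supplied partition function (a sketch; the function below is illustrative, assuming a SparkContext sc):

pairs = sc.parallelize([("apple", 1), ("avocado", 2), ("banana", 3)])
by_first_char = pairs.partitionBy(4, lambda key: ord(key[0]))  # custom partitionFunc: key -> int
print(by_first_char.getNumPartitions())  # 4; all records with the same key share one partition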
Shuffle
Shuffling is the process of data redistribution
It ensures that each reducer obtains all values associated with the same key
It is needed for all operations that require grouping
E.g., word count, or computing the average score for each department
Spark and Hadoop implement shuffling differently
Shuffle In Hadoop
Happens between each Map and Reduce phase
Uses the shuffle-and-sort mechanism
The results of each mapper are sorted by key
Starts as soon as each mapper finishes
Use a combiner to reduce the amount of data shuffled
The combiner merges key-value pairs that share a key within each mapper's local output
This is not applied automatically; the user must specify the combiner!
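In the Hadoop Java API the combiner is set explicitly (e.g., via job.setCombinerClass(...)). As a rough Spark-side analogue (a sketch, not the Hadoop mechanism itself, assuming sc):

pairs = sc.parallelize([("a", 1), ("a", 1), ("b", 1)], 2)
combined = pairs.reduceByKey(lambda a, b: a + b)  # combines same-key pairs inside each partition before shuffling
grouped = pairs.groupByKey().mapValues(sum)       # no map-side combining: every pair crosses the network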
Shuffle in Spark
Triggered by some operations
distinct, join, repartition, and all *By / *ByKey operations
I.e., Happens between stages
Hash shuffle
Sort shuffle
Tungsten sort shuffle
More on https://issues.apache.org/jira/browse/SPARK-7081
Hash Shuffle
Data are hash partitioned on the map side
Hashing is much faster than sorting
Files are created to store each partitioned data portion
# of mappers × # of reducers files in total
Use spark.shuffle.consolidateFiles to reduce the # of files
From M × R to E × C/T × R files (E: # executors, C: # cores per executor, T: # CPUs per task)
Pros:
Fast
No memory overhead of sorting
Cons:
Large number of output files (when the number of partitions is big)
Sort Shuffle
For each mapper, 2 files are created:
Data ordered by key
An index of the beginning and end of each data chunk
The files are merged on the fly while being read by the reducers
The default approach
Falls back to hash-style shuffle if the # of partitions is small
Pros
Smaller number of files created
Cons
Sorting is slower than hashing
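In Spark 1.x these implementations were selectable and tunable via configuration; the keys below applied to those legacy versions (the sort-shuffle fallback threshold still exists in later releases):

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.shuffle.manager", "sort")                    # "hash" or "tungsten-sort" in Spark 1.x
        .set("spark.shuffle.consolidateFiles", "true")           # hash shuffle: share output files per core
        .set("spark.shuffle.sort.bypassMergeThreshold", "200"))  # below this # of partitions, skip sorting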
Map Reduce in Spark
Transformation
Narrow transformation
Wide transformation
Action
A job is a list of transformations followed by one action
Only the action triggers the 'real' execution
I.e., lazy evaluation
Transformation = Map? Action = Reduce?
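A minimal sketch of the distinction (assuming a SparkContext sc):

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
mapped = rdd.mapValues(lambda v: v * 2)          # narrow transformation: no shuffle, nothing runs yet
summed = mapped.reduceByKey(lambda a, b: a + b)  # wide transformation: defines a shuffle, still lazy
result = summed.collect()                        # action: triggers the actual execution of the chain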
combineByKey
RDD([K, V]) to RDD([K, C])
K: key, V: value, C: combined type
Three parameters (functions)
createCombiner
What is done to a single row when it is FIRST met?
V => C
mergeValue
What is done to a single row when it meets a previously reduced row?
C, V => C
In a partition
mergeCombiners
What is done to two previously reduced rows?
C, C => C
Across partitions
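A standard sketch (not from the slides): per-key average, where the combined type C is a (sum, count) tuple:

scores = sc.parallelize([("alice", 80), ("bob", 70), ("alice", 90)])

avg = scores.combineByKey(
    lambda v: (v, 1),                              # createCombiner: V => C, first value seen in a partition
    lambda c, v: (c[0] + v, c[1] + 1),             # mergeValue: (C, V) => C, within a partition
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1])  # mergeCombiners: (C, C) => C, across partitions
).mapValues(lambda c: c[0] / c[1])

print(avg.collect())  # [('alice', 85.0), ('bob', 70.0)] (order may vary)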
The Efficiency of MapReduce in Spark
Number of transformations
Each transformation involves a linear scan of the dataset (RDD)
Size of transformations
Smaller input size => lower cost for the linear scan
Shuffles
Transferring data between partitions is costly
especially in a cluster!
Disk I/O
Data serialization and deserialization
Network I/O
Number of Transformations (and Shuffles)
rdd = sc.parallelize(data)
data: (id, score) pairs
Bad design
maxByKey = rdd.combineByKey(…)
sumByKey = rdd.combineByKey(…)
sumMaxRdd = maxByKey.join(sumByKey)
Good design
sumMaxRdd = rdd.combineByKey(…)
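A sketch of the good design: one combineByKey computes both aggregates in a single scan and shuffle, with the combined type C a (max, sum) tuple (the input values are illustrative):

rdd = sc.parallelize([(1, 5.0), (1, 3.0), (2, 4.0)])  # (id, score) pairs

sumMaxRdd = rdd.combineByKey(
    lambda v: (v, v),                                   # createCombiner: C = (max, sum)
    lambda c, v: (max(c[0], v), c[1] + v),              # mergeValue, within a partition
    lambda c1, c2: (max(c1[0], c2[0]), c1[1] + c2[1]))  # mergeCombiners, across partitions
# versus the bad design: two full scans, two shuffles, plus a shuffling join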
Size of Transformation
rdd = sc.parallelize(data)
data: (word, count) pairs
Bad design
countRdd = rdd.reduceByKey(…)
filteredRdd = countRdd.filter(…)
Good design
filteredRdd = rdd.filter(…)
countRdd = filteredRdd.reduceByKey(…)
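A sketch of the good design with a concrete (hypothetical) predicate; shrinking the RDD before the shuffle reduces both scan and shuffle cost:

rdd = sc.parallelize([("the", 1), ("mapreduce", 1), ("the", 1)])  # (word, count) pairs

filteredRdd = rdd.filter(lambda kv: len(kv[0]) > 3)     # hypothetical predicate: drop short words first
countRdd = filteredRdd.reduceByKey(lambda a, b: a + b)  # the shuffle now moves far fewer records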
Partition
rdd = sc.parallelize(data)
data: (word, count) pairs
Bad design
countRdd = rdd.reduceByKey(…)
countBy2ndCharRdd = countRdd.map(…).reduceByKey(…)
Good design
partitionedRdd = rdd.partitionBy(…)
countBy2ndCharRdd = partitionedRdd.map(…).reduceByKey(…)
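A sketch of how pre-partitioning avoids the second shuffle (the partition function and sizes are illustrative; in PySpark the shuffle is skipped only if the very same partitionFunc object is passed again):

words = sc.parallelize([("apple", 3), ("spark", 2), ("sport", 5)])  # (word, count) pairs

by_char = lambda key: ord(key)                      # deterministic partition function over 2nd-char keys
re_keyed = words.map(lambda kv: (kv[0][1], kv[1]))  # re-key by the word's 2nd character
partitioned = re_keyed.partitionBy(4, by_char)      # the single shuffle, on the final key
counts = partitioned.reduceByKey(lambda a, b: a + b, 4, by_char)  # same partitioner => no extra shuffle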
How to Merge Two RDDs?
Union
Concatenate two RDDs
Zip
Pair two RDDs
Join
Merge based on the keys from 2 RDDs
Just like join in DB
Union
How do A and B union together?
What is the number of partitions for the union of A and B?
Case 1: Different partitioners: the partitions are simply concatenated, giving partitions(A) + partitions(B)
Note: the default partitioner is None
Case 2: Same partitioner: the union keeps that partitioner and its number of partitions
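A sketch of Case 1 (assuming sc; partition counts are illustrative). Case 2 refers to Scala Spark's PartitionerAwareUnionRDD behavior, which keeps the shared partitioner:

a = sc.parallelize(range(10)).map(lambda x: (x, 1)).partitionBy(4)
c = sc.parallelize(range(10), 3).map(lambda x: (x, 2))  # default partitioner: None

print(a.union(c).getNumPartitions())  # 7 = 4 + 3: the partitions are concatenated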
Zip
Key-value pairs after A.zip(B)
Key: the elements of A
Value: the elements of B
Assumes that the two RDDs have
The same number of partitions
The same number of elements in each partition
I.e., a 1-to-1 mapping
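A quick sketch (assuming sc; both RDDs are built to satisfy the constraints above):

a = sc.parallelize([1, 2, 3], 2)
b = sc.parallelize(["a", "b", "c"], 2)  # same # of partitions, same # of elements per partition

print(a.zip(b).collect())  # [(1, 'a'), (2, 'b'), (3, 'c')]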
Join
E.g., A.*Join(B)
join
All pairs with matching Keys from A and B
leftOuterJoin
Case 1: key in both A and B: output (key, (valueA, valueB))
Case 2: key in A but not B: output (key, (valueA, None))
Case 3: key in B but not A: no output
rightOuterJoin
Opposite to leftOuterJoin
fullOuterJoin
Union of leftOuterJoin and rightOuterJoin
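A sketch of all four variants on two tiny RDDs (assuming sc; output order may vary):

A = sc.parallelize([("k1", 1), ("k2", 2)])
B = sc.parallelize([("k1", "x"), ("k3", "y")])

A.join(B).collect()            # [('k1', (1, 'x'))]: matching keys only
A.leftOuterJoin(B).collect()   # [('k1', (1, 'x')), ('k2', (2, None))]
A.rightOuterJoin(B).collect()  # [('k1', (1, 'x')), ('k3', (None, 'y'))]
A.fullOuterJoin(B).collect()   # all three keys; missing sides become None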
MapReduce of "Stripes"
Map a sentence into stripes
ForAll term u in sentence s do: H_u = new dictionary
ForAll term v in Neighbors(u) do: H_u(v) = H_u(v) + 1; finally emit (u, H_u)
Reduce by key and merge the dictionaries
Element-wise sum of the dictionaries
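A PySpark sketch of the pattern (tokenization by whitespace and a +/-1 neighbor window are simplifying assumptions):

from collections import Counter

def stripes(sent):
    # Map: for each term u, build a dictionary of neighbor counts and emit (u, H_u)
    words = sent.split()
    for i, u in enumerate(words):
        neighbors = words[max(0, i - 1):i] + words[i + 1:i + 2]  # +/-1 window: an assumption
        yield (u, Counter(neighbors))

sents = sc.parallelize(["a b a", "b a c"])
co = sents.flatMap(stripes).reduceByKey(lambda h1, h2: h1 + h2)  # element-wise sum of dictionaries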
“Stripes” Analysis
Advantages
Far less sorting and shuffling of key value pairs
Disadvantages
More difficult to implement
The underlying object is more heavyweight
Fundamental limitation in terms of size of event space
Pairs vs. Stripes
The pairs approach
Keeps track of each pair of co-occurring terms separately
Generates a large number of (intermediate) key-value pairs
The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of the same pair of words
The stripe approach
Keeps track of all terms that co-occur with the same term
Generates fewer and shorter intermediate keys
The framework has less sorting to do
Greatly benefits from combiners, as the keyspace is the vocabulary
More efficient, but may suffer from a memory problem
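For contrast, a sketch of the pairs approach on the same toy input (same +/-1 window assumption as the stripes sketch above):

def pairs(sent):
    # Map: emit one ((u, v), 1) pair per co-occurrence
    words = sent.split()
    for i, u in enumerate(words):
        for v in words[max(0, i - 1):i] + words[i + 1:i + 2]:
            yield ((u, v), 1)

sents = sc.parallelize(["a b a", "b a c"])
co_pairs = sents.flatMap(pairs).reduceByKey(lambda a, b: a + b)  # many more, but smaller, intermediate pairs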
MapReduce in Real World: Search Engine
Information retrieval (IR)
Focus on textual information (= text/document retrieval)
Other possibilities include image, video, music,......
Boolean Text Retrieval
Each document or query is treated as a "bag" of words or terms; word sequence is not considered
Query terms are combined logically using the Boolean operators AND, OR, and NOT.
E.g., ((data AND mining) AND (NOT text))
Retrieval
Given a Boolean query, the system retrieves every document that makes the query logically true
Exact match
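A minimal sketch of exact-match Boolean retrieval over an inverted index (the toy documents are assumptions):

docs = {
    1: "data mining and text mining",
    2: "data mining of graphs",
    3: "text retrieval",
}

index = {}  # inverted index: term -> set of ids of documents containing it
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

# ((data AND mining) AND (NOT text)) as exact-match set operations
result = (index["data"] & index["mining"]) - index.get("text", set())
print(result)  # {2}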