Let us understand the concepts of shuffling and combining in MapReduce in layman's terms.

- Take the example of a deck of cards: a random collection of playing cards
- Each card has 3 attributes: color, suit and pip
- There are 2 colors, 4 suits and 13 pips distributed across the 52 cards of each deck
- Let us say there are 5 people to process the data
- 3 of them are mappers
- 2 of them are aggregators
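
The setup above can be modelled in a few lines of Python. This is only a sketch: the tuple layout `(color, suit, pip)`, the helper names `build_deck` and `split_for_mappers`, and the round-robin dealing are assumptions made for illustration, not part of any framework.

```python
from itertools import product

# Each suit has exactly one color, which gives 2 colors across 4 suits.
SUITS = {"spade": "black", "club": "black", "heart": "red", "diamond": "red"}
PIPS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]

def build_deck():
    """Return the 52 cards of one deck as (color, suit, pip) tuples."""
    return [(color, suit, pip) for (suit, color), pip in product(SUITS.items(), PIPS)]

def split_for_mappers(cards, num_mappers=3):
    """Deal the cards round-robin so every mapper gets a roughly equal share."""
    return [cards[i::num_mappers] for i in range(num_mappers)]

deck = build_deck()
shares = split_for_mappers(deck)
print(len(deck))                 # 52
print([len(s) for s in shares])  # [18, 17, 17]
```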

- Processing examples: sort all cards within each pip by suit and color, and get the count of all cards per pip
- Let us divide the processing into stages
- First we divide the cards into 3 parts, so that each mapper can process its own set of cards
- The output of the mappers is then processed by the aggregators
- Take the example of sorting the data
- Stage 1:
- Get the cards, group them by pip and partition them into 2 buckets (for example, even and odd)
- Each mapper processes its own subset
- Each mapper receives all kinds of cards and groups them by pip
- The grouped cards are also divided into 2 buckets, so that each aggregator gets the bucket it is supposed to aggregate
- Let us say one partition contains 2, 4, 6, 8, 10, Q, A and the other partition contains 3, 5, 7, 9, J, K
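
Stage 1 can be sketched as a single function that one mapper runs over its share of the cards. The names `partition_for` and `run_mapper` are hypothetical, and the partition rule simply hard-codes the two buckets listed above.

```python
from collections import defaultdict

# Pips routed to the first aggregator; every other pip goes to the second.
PARTITION_0 = {"2", "4", "6", "8", "10", "Q", "A"}

def partition_for(pip):
    """Return the bucket (0 or 1) that a pip belongs to."""
    return 0 if pip in PARTITION_0 else 1

def run_mapper(cards, num_partitions=2):
    """Group one mapper's cards by pip, keeping a separate group per bucket."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for color, suit, pip in cards:
        buckets[partition_for(pip)][pip].append((color, suit, pip))
    return buckets

sample = [
    ("red", "heart", "2"),
    ("black", "spade", "K"),
    ("red", "diamond", "2"),
    ("black", "club", "7"),
]
buckets = run_mapper(sample)
print(sorted(buckets[0]))  # ['2']
print(sorted(buckets[1]))  # ['7', 'K']
```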

- Stage 2:
- One aggregator gets all the cards of one partition from all the mappers, while the other aggregator gets all the cards of the other partition
- Each aggregator sorts the data after grouping by pip
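
Stage 2 might look like the sketch below for a single aggregator. It assumes each mapper hands over a dictionary from pip to a list of `(color, suit, pip)` cards for that aggregator's partition; `run_aggregator` is a hypothetical name.

```python
def run_aggregator(mapper_outputs):
    """Merge one partition's groups from every mapper, then sort within each pip."""
    merged = {}
    for groups in mapper_outputs:
        for pip, cards in groups.items():
            merged.setdefault(pip, []).extend(cards)
    # Within each pip, order the cards by suit first and color second.
    return {pip: sorted(cards, key=lambda c: (c[1], c[0])) for pip, cards in merged.items()}

# Output of two mappers for the same partition (the third is omitted for brevity).
mapper_a = {"2": [("red", "heart", "2")], "Q": [("black", "spade", "Q")]}
mapper_b = {"2": [("black", "club", "2"), ("red", "diamond", "2")]}

result = run_aggregator([mapper_a, mapper_b])
print(result["2"])
# [('black', 'club', '2'), ('red', 'diamond', '2'), ('red', 'heart', '2')]
```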

- Let us see some numbers for the sorting example, across both stages
- Let us say there are 750,000 records
- Each mapper processes 250,000 records
- After reading each card, the mapper groups and partitions the data by pip
- After the mappers are done, their output is passed to the aggregators
- One aggregator gets ~400K records with pips 2, 4, 6, 8, 10, Q, A and the other gets ~350K records with pips 3, 5, 7, 9, J, K (7 of 13 pips versus 6 of 13, assuming pips are evenly distributed)
- The output is all 750K records, sorted
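
The rounded figures above follow from simple arithmetic, shown here as a small sketch. It assumes the 13 pips are evenly represented among the records, which the example implies but does not state.

```python
total_records = 750_000
num_mappers = 3
pips_per_partition = [7, 6]  # 2,4,6,8,10,Q,A versus 3,5,7,9,J,K
total_pips = sum(pips_per_partition)

# Every mapper reads an equal share of the input.
records_per_mapper = total_records // num_mappers

# Expected records per aggregator if every pip is equally common.
expected_per_aggregator = [
    round(total_records * n / total_pips) for n in pips_per_partition
]

print(records_per_mapper)       # 250000
print(expected_per_aggregator)  # [403846, 346154]
```

Without a combiner, all 750,000 records move from the mappers to the aggregators.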

- For aggregations such as count by pip, instead of passing 750K records between mappers and aggregators, mappers can perform an additional step called combine
- Instead of passing the cards to the aggregator, each mapper can compute the number of cards it received for each pip
- The aggregators receive those intermediate results and compute the final count

- Let us see some numbers for the count example
- Let us say there are 750,000 records
- Each mapper processes 250,000 records
- After reading each card, the mapper groups and partitions the data by pip
- While grouping and partitioning, the mapper can update its counts once every batch of records (say, every 1,000)
- It keeps the counts and discards the original cards
- At the end, each mapper has one count for each pip it received
- In this case, one aggregator gets 7 counts from each mapper, while the other aggregator gets 6 counts from each mapper (39 counts in total instead of 750K records)
- Each aggregator just needs to group by pip and sum the counts to compute the final result
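
The combine step can be sketched as follows. Only the pip of each card matters for counting, so the input is a list of pips per mapper; `map_and_combine` and `aggregate` are hypothetical names, and the tiny inputs stand in for the 250,000 records each mapper would really read.

```python
from collections import Counter

PARTITION_0 = {"2", "4", "6", "8", "10", "Q", "A"}

def map_and_combine(pips, num_partitions=2):
    """Count pips locally so only one (pip, count) entry per pip leaves the mapper."""
    out = [Counter() for _ in range(num_partitions)]
    for pip in pips:
        out[0 if pip in PARTITION_0 else 1][pip] += 1
    return out

def aggregate(partials):
    """Sum the partial counts that one aggregator receives from all mappers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

mapper_inputs = [["2", "K", "2", "7"], ["K", "Q", "2"], ["7", "7", "A"]]
combined = [map_and_combine(pips) for pips in mapper_inputs]
final = [aggregate(m[p] for m in combined) for p in range(2)]

print(sorted(final[0].items()))  # [('2', 3), ('A', 1), ('Q', 1)]
print(sorted(final[1].items()))  # [('7', 3), ('K', 2)]
```

However many cards a mapper reads, it sends at most 7 entries to one aggregator and at most 6 to the other.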

**The process of grouping and partitioning is called shuffling, and the process of computing intermediate results on the mapper is called combining. In Spark, reduceByKey and aggregateByKey are used for aggregations. Both of them use a combiner.**
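
To make the combiner idea concrete, here is a plain-Python imitation of the two-function shape that aggregateByKey takes: one function folds values within a partition, another merges the partial results. This is not Spark code. The helper `aggregate_by_key` and its parameter names are hypothetical, and the example (total cards and red cards per pip) is chosen only for illustration.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Fold values per key inside each partition, then merge the partial results."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            # Combine step: runs on the mapper side, before any data moves.
            local[key] = seq_op(local.get(key, zero), value)
        for key, acc in local.items():
            # Merge step: runs on the aggregator side, on the small partial results.
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged

# (pip, color) pairs held by two mappers.
partitions = [
    [("2", "red"), ("2", "black"), ("K", "red")],
    [("2", "red"), ("K", "black")],
]

result = aggregate_by_key(
    partitions,
    zero=(0, 0),  # (cards seen, red cards seen)
    seq_op=lambda acc, color: (acc[0] + 1, acc[1] + (color == "red")),
    comb_op=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
print(result["2"])  # (3, 2)
print(result["K"])  # (2, 1)
```

A reduceByKey-style count is the special case where the same addition function is used for both steps.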

