Following are the functions which can be used to perform by key transformations. reduce can be used for total aggregations.
Using these APIs we can perform aggregations such as total revenue per day, minimum salaried employee per department, average salary per department etc. Standard aggregations are count, sum, average, standard deviation, mode, min, max etc.
|Uses Combiner||Uses Combiner||Does not use combiner|
|Take one parameter as function – for seqOp and combOp||Take 2 parameters as functions – one for seqOp and other for combOp||No parameters as functions. Generally followed by map or flatMap|
|Implicit Combiner||Explicit Combiner||No combiner|
|seqOp or combiner logic are same as combOp or final reduce logic||seqOp or combiner logic are different from combOp or final reduce logic||No combiner|
|Input and output value type need to be same||Input and output value type can be different||No parameters|
|Performance is high for aggregations||Performance is high for aggregations||Relatively slow for aggregations|
|Only aggregations||Only aggregations||Any by key transformation – aggregation, sorting, ranking etc.|
|eg: sum, min, max etc||eg: average||eg: any aggregation is possible, but not preferred for performance reasons|
Let us see them in action
- We will use order_items. It has six fields
- Each order item have total associated with it
- Let us compute revenue for each order. We have to group by order_item_order_id and get the revenue for each order. We should use reduceByKey as it is simple to implement.
- Apply map function to get order_id and revenue in a tuple
- Apply appropriate aggregate function (reduceByKey) to get revenue for each order.
- Let us also get number of order_items for each order. This can be implemented using aggregateByKey or reduceByKey (either of the approach is good)
- Both of them can be implemented using groupByKey but not preferred as groupByKey does not use combiner
- We can run these examples on local spark installation or virtual machines or big data labs
We will also look at the usage of reduceByKey to get the minimum priced product per product category.