Set operations let us get either the elements common to two data sets or all the elements from both data sets.
- union will get all the elements from both the data sets
- intersection (the method name on RDDs) will get the elements common to both the data sets
- distinct will get all the distinct elements in a data set
- union does not remove duplicates. Apply distinct after union if you want only the distinct elements.
- When we use set operations such as union and intersection, both data sets should have the same structure
- RDDs have no operation named diff or complement; subtract returns the elements of one data set that are not present in the other
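The behaviour of union, distinct and intersection described above can be modelled with a minimal plain-Python sketch. It is not Spark code: lists stand in for RDDs so that it runs without a cluster, the product ids are made-up sample values, and the RDD calls named in the comments (union, distinct, intersection) are the PySpark methods this sketch is assumed to mirror.

```python
# Plain-Python model of the set-operation semantics; lists stand in for RDDs.
# The sample product ids below are hypothetical.
dec_products = [957, 1073, 957, 365]   # a data set may contain duplicates
jan_products = [1073, 502, 365, 365]

# union keeps every element from both sides, duplicates included
# (on RDDs: dec_products.union(jan_products))
union_all = dec_products + jan_products

# distinct removes the duplicates left behind by union
# (on RDDs: dec_products.union(jan_products).distinct())
union_distinct = sorted(set(union_all))

# intersection keeps only the elements present on both sides
# (on RDDs: dec_products.intersection(jan_products))
common = sorted(set(dec_products) & set(jan_products))

print(len(union_all))      # 8 elements, duplicates kept
print(union_distinct)      # [365, 502, 957, 1073]
print(common)              # [365, 1073]
```

Sorting is only there to make the printed output stable; RDD results come back in no guaranteed order.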
Let us see an example
- We have the order_items data set with six fields, including order_item_product_id.
- It is the 3rd field
- Let us find all the distinct order_item_product_id values sold in a given month (December 2013)
- But the month is not available in order_items. Hence we need to join with orders to get the month
- Find the common products, sold in both December 2013 and January 2014
- Also find all the products sold in either December 2013 or January 2014
- We can run these examples on a local Spark installation, on virtual machines, or on big data labs
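The steps of this example can be sketched in plain Python before running them on Spark. Everything below is illustrative: the records are made-up samples, the orders layout (order_id, order_date, order_customer_id, order_status) and the position of order_item_order_id as the 2nd field of order_items are assumptions, and products_sold_in is a hypothetical helper that stands in for the join between orders and order_items.

```python
# Hypothetical sample records in comma-separated form.
# Assumed orders layout: order_id,order_date,order_customer_id,order_status
orders = [
    "1,2013-12-05 00:00:00.0,11599,CLOSED",
    "2,2013-12-20 00:00:00.0,256,COMPLETE",
    "3,2014-01-03 00:00:00.0,12111,COMPLETE",
    "4,2014-01-15 00:00:00.0,8827,CLOSED",
    "5,2014-02-01 00:00:00.0,11318,COMPLETE",
]
# order_items has six fields; order_item_product_id is the 3rd field and
# order_item_order_id is assumed to be the 2nd.
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,3,1073,1,199.99,199.99",
    "5,4,403,1,129.99,129.99",
    "6,4,957,1,299.98,299.98",
    "7,5,365,2,119.98,59.99",
]


def products_sold_in(month):
    """Return product ids sold in `month` (format YYYY-MM), duplicates kept."""
    # Ids of the orders whose order_date falls in the requested month
    order_ids = {
        int(o.split(",")[0]) for o in orders if o.split(",")[1].startswith(month)
    }
    # Stand-in for the join: keep order items that belong to those orders
    return [
        int(oi.split(",")[2])
        for oi in order_items
        if int(oi.split(",")[1]) in order_ids
    ]


dec_2013 = products_sold_in("2013-12")
jan_2014 = products_sold_in("2014-01")

distinct_dec = sorted(set(dec_2013))                      # distinct
common_products = sorted(set(dec_2013) & set(jan_2014))   # intersection
all_products = sorted(set(dec_2013) | set(jan_2014))      # union + distinct

print(distinct_dec)      # [502, 957, 1073]
print(common_products)   # [957, 1073]
print(all_products)      # [403, 502, 957, 1073]
```

On Spark, the same flow would filter orders by month, join with order_items on the order id, map to order_item_product_id, and then apply distinct, intersection, and union followed by distinct.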