This blog provides step-by-step instructions to set up a single-node lab for the Hadoop and Spark ecosystem using the Cloudera Distribution.
Almost all the leading vendors provide virtual machine images for learning the tools in the Big Data ecosystem, including but not limited to Hadoop and Spark. However, running such an image requires a high-end configuration (for example, 16 GB of RAM, an i7 quad-core CPU, and an SSD), and even with 16 GB of RAM it may not be feasible to run all the tools. Many cloud providers offer infrastructure on a pay-as-you-go model, along with free credits to explore their environments, so for some it may be more practical to set up the cluster in the cloud. This exercise also builds an understanding of the basics of setting up clusters.
This lesson covers how to set up a single-node lab in the cloud (for example, on AWS). Apart from provisioning the instance, the remaining steps are the same as those provided here to install the Cloudera Distribution of Hadoop.
- Sign up with the cloud provider
- Provision an EC2 instance on AWS
- Set up the MySQL database
- Set up OS-level prerequisites for Hadoop
- Install Cloudera Manager
- Install the Cloudera Distribution of Hadoop
- Validate HDFS and YARN+MR2
- Validate Hive, Pig, Sqoop, etc.
- Set up the retail_db database (for Sqoop)
- Set up gen_logs (for streaming)
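The first two steps above can be sketched from the command line. The snippet below is a minimal sketch, assuming the AWS CLI is installed and configured; the AMI ID, key pair name, and security group ID are placeholders you must replace with your own values, and the instance type is just one reasonable choice for a 16 GB single-node lab. The OS-level settings shown afterwards are standard Cloudera Manager prerequisites on a CentOS/RHEL-style host.

```shell
# Provision a single EC2 instance for the lab.
# NOTE: ami-..., my-key-pair, and sg-... below are placeholders.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type m4.xlarge \
  --key-name my-key-pair \
  --security-group-ids sg-0123456789abcdef0 \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100}'
# m4.xlarge gives 4 vCPUs / 16 GB RAM; 100 GB root volume leaves room
# for HDFS data on a single node.

# After logging in to the instance: common OS-level prerequisites
# that Cloudera Manager checks for.
sudo setenforce 0                   # disable SELinux for this session
sudo sysctl vm.swappiness=1         # Cloudera-recommended swappiness
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```

To make the SELinux and swappiness changes survive a reboot, you would also set `SELINUX=disabled` in `/etc/selinux/config` and add `vm.swappiness=1` to `/etc/sysctl.conf`.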