Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
Typically, a Hadoop cluster can have a few hundred to a few thousand nodes (physical servers). Setting up plain vanilla Apache Hadoop and monitoring it can be a tedious task, and hence there are several distributions that provide tools for setting up and managing clusters.
Major Hadoop Distributions
There are several distributions of Hadoop supported by respective vendors.
- Cloudera (CDH)
- Hortonworks (HDP)
- Amazon EMR
These vendors provide training, support, and services for their clients. As part of their distributions, they also provide tools that simplify both cluster setup and ongoing operations.
Hadoop ecosystem
The Hadoop ecosystem can be divided into core components and other tools. HDFS (Hadoop Distributed File System) is the foundation of the ecosystem, Map Reduce is the distributed computing framework developed in tandem with HDFS, and the remaining tools can be categorized as Map Reduce based or non Map Reduce based.
- Hadoop core components
  - HDFS – Hadoop Distributed File System, the storage layer
  - Map Reduce – the distributed computing framework developed in tandem with HDFS
- Map Reduce based tools
  - Hive – logical database on top of HDFS with a SQL-based interface on top of Map Reduce to process the data
  - Pig – data flow language based interface on top of Map Reduce to process the data in HDFS
  - Sqoop – generic data movement tool that copies data between relational databases and HDFS using Map Reduce, leveraging its distributed processing capabilities
  - Mahout – machine learning library that uses the Map Reduce framework to process the data
  - Oozie – Map Reduce based workflow tool
- Non Map Reduce based tools
  - Flume – data integration tool whose agents collect streaming data from sources such as weblogs and load it into targets such as HDFS
  - Spark – in-memory data processing tool that can accelerate data processing
  - Impala – alternative to Hive for processing smaller volumes of data in a quicker, interactive fashion
  - HBase – a NoSQL database for building operational applications at scale
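The map-shuffle-reduce model that Hive, Pig, and Mahout build on can be sketched in plain Python. This is only a single-machine simulation of the programming model (a real Map Reduce job runs the same phases in parallel across HDFS blocks on many nodes); the word-count example is the classic illustration:

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit a (word, 1) pair for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: sort intermediate pairs and group them by key (word),
# so each reducer sees all counts for one word together.
def shuffle_phase(pairs):
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [count for _, count in group]

# Reduce phase: sum the grouped counts for each word.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped}

lines = ["hadoop stores data in hdfs",
         "map reduce processes data in hdfs"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["hdfs"])  # 2
```

A tool like Hive compiles a SQL query into jobs of exactly this shape, which is why it can process data it never loads into a traditional database.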
Vendors such as Cloudera and Hortonworks package all these tools as part of their distributions and provide wizards and tools to set up and maintain larger clusters.
Job roles in Hadoop
There are several specialized roles in Hadoop.
| Job Role | Experience Required | Desired Skills |
|---|---|---|
| Hadoop Developer | 0-7 years | Hadoop; programming using Java, Spark, Hive, Pig, Sqoop, etc. |
| Hadoop Administrator | 0-10 years | Linux; Hadoop administration using distributions |
| Big Data Engineer | 3-15 years | Data warehousing, ETL, Hadoop, Hive, Pig, Sqoop, Spark, etc. |
| Big Data Solutions Architect | 12-15 years | Deep understanding of the Big Data ecosystem, such as Hadoop, NoSQL, etc. |
| Infrastructure Architect | 12-15 years | Deep understanding of infrastructure as well as the Big Data ecosystem |
For most of the above job roles, 2 years of hands-on Hadoop experience as part of the overall years of experience suffices.
Vendors such as Cloudera and Hortonworks not only provide training and support; they also issue certifications that are highly recognized in the industry. Most of the certifications are practical, testing the test takers' level of understanding.
Why should one get certified?
- Tests the level of understanding of several Hadoop ecosystem tools
- Instills confidence in individuals while delivering projects
- Can give some traction in the job search process
- Instills confidence when taking interviews
- Separate certifications are available for separate roles
- Tests both breadth and depth across ecosystem tools
- Most certifications are no longer objective (multiple choice); they are scenario based, simulating real-world problems
Where should one get certified?
- Certifications issued by major Big Data vendors such as Cloudera, Hortonworks, and Databricks are well recognized
- Most of these certifications are online and proctored
- Certifications can be taken from anywhere with a computer and a webcam
- There is no need to visit proctoring centers
What are the certifications that are available?
- Administration
  - CCAH – Cloudera Certified Administrator of Apache Hadoop
  - HDPCA – Hortonworks Data Platform Certified Administrator
- Developer
  - CCA – Cloudera Certified Associate Spark and Hadoop Developer (HDFS, Sqoop, Flume, Spark with Python, Spark with Scala, Hive, Impala, and Avro tools)
  - HDPCD – Hortonworks Data Platform Certified Developer (Flume, Hive, Pig, and Sqoop)
  - HDPCD:Java – Hortonworks Data Platform Certified Developer (Java Map Reduce APIs)
  - HDPCD:Spark – Hortonworks Data Platform Certified Developer (Spark)
  - There is a considerable amount of overlap between CCA, HDPCD, and HDPCD:Spark
- Data Engineer
  - CCP DE – Cloudera Certified Professional Data Engineer (Sqoop, Flume, Hive, and Oozie)
There are other certifications offered by other vendors as well, but these are the most popular.
How can one prepare for certifications?
- itversity, llc is a startup that runs a YouTube channel called itversity, where the video content is published
- Content is developed based on the published curriculum of each certification
- Follow-up videos are added based on feedback from test takers
- Around 40 people have already acknowledged that they became certified by following the content on the channel
How can one access the content to prepare for the certification?
Here is the table mapping each certification to its playlist. A few of the certifications are still in progress. Please click on a certification name to be redirected to the respective certification. There will be more blogs for each of these Hadoop certifications.
| Category | Certification | Status |
|---|---|---|
| Administration | Cloudera Certified Administrator of Apache Hadoop (CCAH) | In progress |
| Administration | Hortonworks Data Platform Certified Administrator (HDPCA) | Done |
| Developer | Cloudera Certified Associate Spark and Hadoop Developer (CCA) – HDFS, Sqoop, Flume, Spark with Python, Spark with Scala, Avro tools, Hive, Impala, etc. | |
| Developer | Hortonworks Data Platform Certified Developer (HDPCD) – Hadoop, Sqoop, Flume, Hive, Pig, etc. | |
| Developer | Hortonworks Data Platform Certified Developer – Java (HDPCD:Java) | Almost done |
| Developer | Hortonworks Data Platform Certified Developer – Spark (HDPCD:Spark) | Not started |
| Data Engineer | Cloudera Certified Professional Data Engineer (CCP DE) | Just started |
How can one discuss Big Data or certifications further?
itversity, llc manages several LinkedIn groups; here is the list with URLs for Big Data and certifications.
Stay connected with us!!!
Here are the details for staying connected with itversity, llc. Please click on the hyperlinks to stay connected using the platform of your choice.