Big Data Engineer immersion – classes

Introduction

For RSVP and reminders, please join my meetup group. There will be events for all the sessions.

This post has details about the live classes starting September 12th, covering

  • Setting up environment
  • Hadoop Distributed File System
  • Java Map Reduce Overview
  • Hive
  • Sqoop
  • Spark with Scala and Python
  • Data ingestion using Kafka/Flume
  • Oozie
  • Visualization using Tableau
  • Setting up of cluster

Classes started on 09/12 and will run for 8 to 9 weeks (3 classes every week). This post will be updated if there are any changes. Hangouts will be created for all the sessions so people can watch how the course is going.

For the upcoming batch in January - fill out this form

Class 01 - 09/12/2016 - Big Data Introduction

Following are the topics covered as part of the session

  • Big Data Introduction
  • Hadoop Introduction
  • Setting up the environment
  • What does a typical production cluster look like?
  • Accessing the lab

Make sure you set up the environment or access the lab to learn from future classes.

Here is the presentation used for the class - hadoopandsparkdeveloper


Follow this link for the instructions to set up the environment

Class 02 - 09/14/2016 - HDFS

Following are the topics covered as part of the session

  • Copying files to HDFS
    • hadoop fs -mkdir
    • hadoop fs -ls
    • hadoop fs -put or -copyFromLocal
    • hadoop fs -get or -copyToLocal
    • hadoop fs -cat
  • Understanding concepts behind HDFS
    • How are files stored in HDFS?
    • What is block size?
    • What is metadata?
    • What are the namenode and datanode?
  • To be continued...
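
A typical session with the commands above looks like the sketch below; the file and directory names (orders.csv, demo) are made up, and the script assumes a configured hadoop client - it prints a note instead if one is not found.

```shell
# Hypothetical HDFS session illustrating the commands covered above.
# Assumes a configured hadoop client; file and directory names are made up.
if command -v hadoop >/dev/null 2>&1; then
  echo "1,2013-07-25,CLOSED" > /tmp/orders.csv        # small local sample file
  hadoop fs -mkdir -p demo                            # create a directory in HDFS
  hadoop fs -put /tmp/orders.csv demo/                # local -> HDFS (same as -copyFromLocal)
  hadoop fs -ls demo                                  # list the directory
  hadoop fs -cat demo/orders.csv                      # print file contents
  hadoop fs -get demo/orders.csv /tmp/orders_copy.csv # HDFS -> local (same as -copyToLocal)
else
  echo "hadoop client not found - run these commands on the gateway node"
fi
status="done"
echo "$status"
```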

Class 03 - 09/15/2016 - HDFS Continued

Here are the topics covered in this session

  • What are the different parameter files? What is the importance of these parameters? What does marking a parameter as “final” mean?
  • What is a gateway node? What is its role with respect to HDFS?
  • What are blocks and block size, and how is data distributed?
  • What is fault tolerance? What is the role of the replication factor?
  • What are the default block size and replication factor?
  • Given a scenario, explain how files are stored in HDFS, including the size of each block and the replication factor.
  • How to override parameters such as block size and replication factor while copying files?
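
To make the block size and replication questions concrete, here is a small plain-Python sketch (not part of the class material) that computes how a file splits into blocks and how much raw storage it occupies. The defaults shown (128 MB blocks, replication factor 3) are the stock Hadoop 2.x defaults, and the -D flags in the comments mirror the copy-time overrides discussed above.

```python
import math

def hdfs_layout(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, size of the last block in MB, total raw
    storage in MB) for a file stored with the given block size and replication."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    total_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, total_storage_mb

# A 300 MB file with defaults: 3 blocks (128 + 128 + 44 MB), 900 MB raw storage.
print(hdfs_layout(300))         # (3, 44, 900)
# Overriding at copy time, e.g.:
#   hadoop fs -D dfs.blocksize=67108864 -D dfs.replication=2 -put ...
print(hdfs_layout(300, 64, 2))  # (5, 44, 600)
```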

Class 04 - 09/20/2016 - Java Map Reduce overview

As part of this class, we have seen an overview of

  • Map Reduce APIs
  • Map function
  • Reduce function
  • Shuffle and Sort
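
To make the three phases concrete, here is a plain-Python sketch (not actual Hadoop Java code) of a word count flowing through map, shuffle and sort, and reduce:

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data is big", "data is everywhere"]

# Map: emit a (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: the framework sorts the pairs and groups them by key
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reduce: sum the values for each key
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```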

Class 05 - 09/22/2016 - Map Reduce Framework overview

As part of this session, we have seen

  • Compiling map reduce programs
  • Building jar files
  • Copying jar files to the cluster
  • Executing programs
  • Validating the output
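
On a gateway node, the steps above look roughly like the following; the class, jar, and path names are hypothetical, and the commands assume the Hadoop client tools are installed.

```shell
# Hypothetical build-and-run workflow; names and paths are made up for illustration.
if command -v hadoop >/dev/null 2>&1; then
  # Compile the map reduce program against the Hadoop client libraries
  javac -classpath "$(hadoop classpath)" -d classes src/WordCount.java
  # Package the compiled classes into a jar
  jar cf wordcount.jar -C classes .
  # Copy the jar to the cluster's gateway node (if it was built elsewhere)
  scp wordcount.jar gateway.example.com:~/
  # Execute on the cluster, then validate the output
  hadoop jar wordcount.jar WordCount /user/$USER/input /user/$USER/output
  hadoop fs -cat /user/$USER/output/part-r-00000 | head
else
  echo "hadoop not found - run these steps on a node with the Hadoop client installed"
fi
status="done"
echo "$status"
```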

To learn Map reduce in detail

Class 06 - 09/23/2016 - Introduction to Spark

As part of this class we have seen

  • Setting up environment - click here
  • Important concepts and architecture - click here
  • We will get into the details of the following in future classes
    • Writing wordcount program
    • Understanding transformations and actions
    • Developing programs using Scala IDE for Eclipse
    • Compile and run the programs on the cluster

Class 07 - 09/27/2016 - Spark - Word count program

As part of this class we have seen

  • Make sure Scala IDE for Eclipse is set up - click here
  • Writing wordcount program
  • Understand Scala basics
    • val vs. var
    • object vs. class
    • Create SparkContext and SparkConf - click here
    • Launch the Scala interpreter
  • Understand Spark APIs - Click here for code reference
    • sc.textFile - to read the data
    • flatMap, map and filter
    • reduceByKey
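
A plain-Python analogue of the word count pipeline (real Spark code would call sc.textFile and RDD methods; the comments map each step to its Spark API):

```python
lines = ["spark makes word count easy", "word count with spark"]

# sc.textFile gives an RDD of lines; here a plain list stands in for it.
# flatMap: one line -> many words
words = [w for line in lines for w in line.split()]
# filter: keep only non-empty tokens
words = [w for w in words if w]
# map: word -> (word, 1)
pairs = [(w, 1) for w in words]
# reduceByKey: merge the values for each key with an associative function
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n
print(counts["spark"], counts["count"])  # 2 2
```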

Class 08 - 09/29/2016 - Understanding transformations and actions

As part of this class, we continue to work on understanding

  • Transformations and actions
  • Define problem statement
  • Design the application
  • Start developing the code for problem statement

As part of the next class, you will see implementations of several transformations using Scala as the programming language

  • filter
  • reduceByKey
  • join
  • aggregateByKey

Class 09 - 09/30/2016 - Develop average revenue per day

As part of this class, we have completed the implementation using Scala

  • Use Scala interpreter to validate the code
  • Understand SparkContext and SparkConf
  • Develop program
  • Understand how data is represented in RDD
  • Run the program using Scala IDE for Eclipse
  • Validate the results
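
In plain-Python terms, average revenue per day reduces to carrying a (sum, count) pair per key and then dividing; the data below is made up, and this is only a conceptual sketch, not the class code:

```python
# Hypothetical (date, order_revenue) records, e.g. after joining orders
# with order_items in the retail dataset.
revenue_per_order = [
    ("2013-07-25", 200.0), ("2013-07-25", 300.0),
    ("2013-07-26", 100.0), ("2013-07-26", 150.0), ("2013-07-26", 50.0),
]

# aggregateByKey-style: accumulate a (sum, count) pair per key ...
totals = {}
for day, amount in revenue_per_order:
    s, c = totals.get(day, (0.0, 0))
    totals[day] = (s + amount, c + 1)

# ... then map each pair to the average for that day.
avg_revenue = {day: s / c for day, (s, c) in totals.items()}
print(avg_revenue)  # {'2013-07-25': 250.0, '2013-07-26': 100.0}
```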

As part of the next class, we will see how to externalize parameters and run the program on the cluster

Class 10 - 10/03/2016 - Broadcast variables and Accumulators

Following Spark topics are covered as part of this class

  • Accumulators
  • Broadcast variables
  • Implementation using Scala
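
Conceptually, a broadcast variable is a read-only lookup shipped once to every executor, and an accumulator is a write-only counter the driver reads back. A plain-Python analogue (made-up data, not actual Spark API calls):

```python
# Broadcast analogue: a small read-only lookup table every task can use.
product_names = {1: "Nike Slides", 2: "Pele Ball"}   # made-up data

# Accumulator analogue: a counter tasks add to, read back on the driver.
bad_records = 0

order_items = [(1, 49.99), (2, 24.99), (99, 9.99)]   # (product_id, price)

enriched = []
for product_id, price in order_items:
    name = product_names.get(product_id)
    if name is None:
        bad_records += 1      # accumulator.add(1) in real Spark
        continue
    enriched.append((name, price))

print(enriched)      # [('Nike Slides', 49.99), ('Pele Ball', 24.99)]
print(bad_records)   # 1
```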

As part of the next class we will start with

  • Spark APIs using pyspark

Class 11 - 10/06/2016 - Pyspark - Overview of Transformations and Actions

Following are the topics covered as part of this class

  • Overview of transformations and actions in pyspark
  • Launching pyspark
  • Writing simple programs using pyspark

As part of the next class, we will see

  • Sorting and ranking
  • Using groupByKey for ranking
  • Demonstrate groupByKey using Scala as well as Python

Class 12 - 10/07/2016 - groupByKey (python as well as scala)

Following are the topics covered as part of this class

  • Using groupByKey for ranking
  • Define problem statement (get top n priced products)
  • Use pyspark to implement by-key ranking
  • Use Scala to implement by-key ranking
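
The by-key ranking can be sketched in plain Python (toy data; the real implementations use groupByKey in pyspark and Scala): gather all values per key, then sort each group by price descending and keep the top n.

```python
# (category_id, (product_name, price)) pairs - made-up sample data.
products = [
    (2, ("Ball", 20.0)), (2, ("Bat", 50.0)), (2, ("Gloves", 35.0)),
    (3, ("Shoes", 90.0)), (3, ("Socks", 5.0)),
]

def top_n_per_key(pairs, n):
    # groupByKey analogue: gather all values for each key
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    # rank each group's values by price, descending, and keep the top n
    return {k: sorted(v, key=lambda p: p[1], reverse=True)[:n]
            for k, v in groups.items()}

print(top_n_per_key(products, 2))
# {2: [('Bat', 50.0), ('Gloves', 35.0)], 3: [('Shoes', 90.0), ('Socks', 5.0)]}
```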

With this, we are done with transformations and actions using both Scala and Python.

Let us start looking into Hive; we will come back to data frames after covering Hive, Sqoop, etc. in detail.

Please follow this link to prepare for CCA Spark and Hadoop Developer. It has all the relevant Spark material to clear the certification.

Class 13 - 10/10/2016 - Hive Architecture and DDL

The following Hive topics are covered as part of this class

  • Architecture of Hive
  • Launching Hive in different modes
  • Overview of creating databases and tables

As part of the next class, we will cover

  • Create databases and tables
  • Managed tables and external tables
  • Load command
  • Partitioning and bucketing
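
As a preview, the topics listed above look roughly like this in HiveQL; the database, table, and path names are made up for illustration:

```sql
-- Hypothetical HiveQL illustrating the upcoming DDL topics.
CREATE DATABASE IF NOT EXISTS retail_db;
USE retail_db;

-- Managed table: Hive owns the data under its warehouse directory.
CREATE TABLE orders (
  order_id INT,
  order_date STRING,
  order_status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: the data stays at the given location when the table is dropped.
CREATE EXTERNAL TABLE orders_ext (
  order_id INT,
  order_date STRING,
  order_status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/orders';

-- Load command
LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE orders;

-- Partitioned and bucketed table
CREATE TABLE orders_part (
  order_id INT,
  order_status STRING
)
PARTITIONED BY (order_month STRING)
CLUSTERED BY (order_id) INTO 8 BUCKETS;
```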

Class 14 - 10/12/2016 - Hive DDL and DML

Please join the meetup and RSVP

Class 15 - 10/13/2016 - HiveQL

Please join the meetup and RSVP

Class 16 - 10/17/2016 - Hive functions

As part of this session, we have seen

  • Predefined functions
  • User defined functions
    • Develop the Java class
    • Compile it into a jar file
    • Ship it to the gateway node
    • Add the jar and create the function
    • Update .hiverc

The session also covered some important Linux concepts.
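
End to end, the user defined function workflow above looks roughly like this; the class, jar, and function names (SimpleUpper, simple_upper) are hypothetical, and the hive-exec jar location varies by distribution.

```shell
# Hypothetical end-to-end flow for deploying a Hive user defined function.
if command -v hive >/dev/null 2>&1; then
  # Compile the Java class against Hive's exec library (path varies by install)
  javac -classpath /usr/lib/hive/lib/hive-exec.jar -d classes src/SimpleUpper.java
  jar cf simpleupper.jar -C classes .
  # Ship the jar to the gateway node if it was built elsewhere
  scp simpleupper.jar gateway.example.com:~/
  # Register the jar and the function inside Hive, then try it out
  hive -e "ADD JAR simpleupper.jar;
           CREATE TEMPORARY FUNCTION simple_upper AS 'SimpleUpper';
           SELECT simple_upper(order_status) FROM orders LIMIT 5;"
  # To make the jar available in every session, add the ADD JAR line to ~/.hiverc
else
  echo "hive not found - run these steps on the gateway node"
fi
status="done"
echo "$status"
```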

 

Class 17 - 10/19/2016

Please join the meetup and RSVP

Class 18 - 10/20/2016

Please join the meetup and RSVP

 


Big Data Introduction - YouTube live video

Please click here
