Big Data Introduction


As part of this blog we will see an overview of Big Data

  • About speaker and itversity
  • Categorization of enterprise applications
  • Typical data integration architecture
  • Challenges with conventional technologies
  • Big Data eco system
  • Data Storage - Distributed file system
  • Data Processing - Distributed computing frameworks
  • Data Ingestion - Open source tools
  • Data Visualization - BI tools as well as custom
  • Role of Apache and other companies
  • Different Certifications in Big Data
  • Resources for deep dive into Big Data
  • Job roles in Big Data

The following questions will be answered after going through this blog, the video, or both

  • What is Big Data?
  • What is the Hadoop eco system?
  • How are Hadoop Map Reduce and Spark different?
  • Why are there so many technologies such as Hadoop, Hive, Pig, Sqoop, Spark, Flume, Kafka, Storm, Flink, Oozie, ZooKeeper?
  • How can these technologies be categorized?
  • What is NoSQL?
  • How are Big Data and Data Science related?
  • What are the use cases being addressed using the Big Data eco system?
  • What is the role of Apache?
  • How are Big Data clusters typically built and supported? What is the role of Hortonworks, Cloudera etc.?
  • What are the job roles, and how can one plan a transition based on their background?

Big Data Introduction - YouTube video

Here is the YouTube live video

Big Data Introduction - PowerPoint

Here is the PowerPoint presentation of the same.


About speaker and itversity

About me

  • Seasoned IT professional with 13 years of experience
  • Deep expertise in databases and data warehousing
  • Erudite in Big Data and Open Source technologies
  • Launched the company in the US and in India
  • On a mission to build a one stop shop free IT university

About itversity

  • Dallas-based startup
  • Primary focus is on providing content on many technologies
  • Planning to get into staffing in the coming years
  • Focused on Cloud and Open Source technologies
  • Free certification oriented courses
  • and many more

Categorization of enterprise applications

Enterprise applications can be broadly categorized into

  • Operational
    • Transactional
    • Non-transactional
  • Decision Support
  • Customer analytics

Let us understand each category from the perspective of an eCommerce platform

  • Operational
    • Transactional (check out of market basket)
    • Non-transactional (recommendation engine)
  • Decision Support (trends of sales)
  • Customer analytics (reports for online customers such as categories spent)

Typical data integration architecture

Here are the details about data integration in an organization

  • Data integration can be categorized into
    • Real time
    • Batch
  • Traditionally, before customer analytics and recommendation engines, we used to have
    • ODS (compliance and single source of truth)
    • EDW (facts and dimensions to support reports)
    • ETL (to perform transformations and load data into the EDW)
    • BI (to visualize and publish reports)
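The ODS → ETL → EDW → BI flow above can be sketched as a toy Python pipeline. All names and data here are hypothetical, for illustration only; real ETL tools such as Informatica operate on databases, not in-memory lists.

```python
# Toy sketch of the traditional ODS -> ETL -> EDW -> BI flow.
# All names and data are hypothetical.

# ODS: operational records (single source of truth)
ods_orders = [
    {"order_id": 1, "customer": "alice", "amount": 120.0},
    {"order_id": 2, "customer": "bob",   "amount": 80.0},
    {"order_id": 3, "customer": "alice", "amount": 50.0},
]

def etl(orders):
    """Transform operational rows into a fact table keyed by customer."""
    facts = {}
    for row in orders:
        facts[row["customer"]] = facts.get(row["customer"], 0.0) + row["amount"]
    return facts

edw_sales_fact = etl(ods_orders)   # EDW: aggregated facts

# BI: a trivial "report" over the fact table, highest spend first
report = sorted(edw_sales_fact.items(), key=lambda kv: -kv[1])
print(report)  # [('alice', 170.0), ('bob', 80.0)]
```

The point is only the separation of stages: operational data stays in the ODS, transformations happen in the ETL layer, and reports read from the EDW.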

Data Integration - Technologies

Here are the details about conventional technologies in data integration and visualization

  • Batch ETL – Informatica, Data Stage etc
  • Real time data integration – Goldengate, Shareplex etc
  • ODS – Oracle, MySQL etc
  • EDW/MPP – Teradata, Greenplum etc
  • BI Tools – Cognos, Business Objects etc

Challenges with conventional technologies

Here are the challenges with respect to data integration in conventional technologies

  • Almost all operational systems use relational databases (RDBMS such as Oracle).
    • RDBMS are originally designed for operational and transactional workloads
  • Not linearly scalable, due to
    • Transactions
    • Data integrity constraints
  • Expensive
  • Predefined schema
  • Data processing does not happen where data is stored (storage layer) - no data locality
    • Some processing happens at the database server level (SQL)
    • Some processing happens at the application server level (Java/.NET)
    • Some processing happens at the client/browser level (JavaScript)
  • Almost all Data Warehouse appliances are expensive and not very flexible for customer analytics and recommendation engines

Big Data eco system - History

Here is the brief history about Big Data

  • Started with the Google search engine
  • Google’s use case is different from that of typical enterprises
    • Crawl web pages
    • Index based on key words
    • Return search results
  • As conventional database technologies did not scale, they implemented
    • GFS (distributed file system)
    • Google Map Reduce (distributed processing engine)
    • Google Big Table (distributed indexed table)
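The Map Reduce model Google described can be sketched in a few lines of plain Python: map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. This is a single-process simulation; the real framework runs the map and reduce tasks in parallel across cluster nodes.

```python
# Single-process sketch of the Map Reduce model: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data", "big table", "data locality"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'table': 1, 'locality': 1}
```

Word count is the canonical example because the shuffle step, not the arithmetic, is where a distributed implementation does the hard work.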

Big Data eco system - Myths

Here are few Big Data myths

  • Big Data is Hadoop
  • The Big Data eco system can only solve problems with very large data sets
  • Big Data is cheap
  • Big Data provides a variety of tools and can solve problems quickly
  • Big Data is a technology
  • Big Data is Data Science
    • Data Scientists need to have specialized mathematical skills
    • Domain knowledge
    • Minimal technology orientation
    • Data Science itself is a separate domain - Big Data technologies can be used if required

Often people have unrealistic expectations of Big Data technologies.

Big Data eco system - Characteristics

Here are few characteristics of Big Data technologies

  • Distributed storage
    • Fault tolerance (RAID is replaced by replication)
  • Distributed computing/processing
    • Data locality (code goes to data)
  • Scalability (almost linear)
  • Low cost hardware (commodity)
  • Low licensing costs

* Low cost hardware and software do not mean that Big Data is cheap for enterprises

Big Data eco system - categorization

Big Data eco system of tools can be categorized into

  • Distributed file systems
    • HDFS
    • Cloud storage (s3/Azure blob)
  • Distributed processing engines
    • Map Reduce
    • Spark
  • Distributed databases (operational)
    • NoSQL databases (HBase, Cassandra)
    • Search databases (Elasticsearch)

Big Data eco system - Evolution

After successfully building its search engine on these new technologies, Google published white papers on

  • Distributed file system – GFS
  • Distributed processing engine – Map Reduce
  • Distributed database – Big Table

* Development of Big Data technologies such as Hadoop started with these white papers

Data Storage - Distributed file system or Distributed Databases

Data storage options in Big Data eco systems

  • Distributed file systems (streaming and batch access)
    • HDFS
    • Cloud storage
  • Distributed Databases (random access - distributed indexed tables)
    • Cassandra
    • HBase
    • MongoDB
    • Solr

Data Ingestion

Data ingestion strategies are defined by sources from which data is pulled and sinks where data is stored

  • Sources
    • Relational Databases
    • Non relational Databases
    • Streaming web logs
    • Flat files
  • Sinks
    • HDFS
    • Relational or Non relational Databases
    • Data processing frameworks
  • Sqoop is used to get data from relational databases
  • Flume and/or Kafka are used to read data from web logs
  • Spark Streaming, Storm, Flink etc. are used to process data from Flume and/or Kafka before loading it into sinks
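The source → processing → sink pattern described above can be sketched with a generator standing in for the streaming source (the role Flume/Kafka play for web logs), a filter standing in for the stream processor, and a list standing in for the sink (e.g. HDFS). The log lines and filtering rule are hypothetical.

```python
# Toy sketch of the ingestion pattern: source -> processing -> sink.
def log_source():
    """Simulated streaming source; in practice Flume/Kafka would feed this."""
    raw = [
        "2017-01-01 GET /home 200",
        "2017-01-01 GET /cart 500",
        "2017-01-02 GET /home 200",
    ]
    for line in raw:
        yield line

def process(lines):
    """Keep only error responses, as a stand-in for stream processing."""
    for line in lines:
        if line.endswith("500"):
            yield line

sink = []                      # stand-in for HDFS or a database
for record in process(log_source()):
    sink.append(record)

print(sink)  # only the error line reaches the sink
```

The generators never hold the full stream in memory, which is the same reason tools like Spark Streaming process data in small batches as it arrives.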

Data Processing - Batch

In the Big Data eco system, there are two major categories of distributed processing frameworks

  • I/O based
    • Map Reduce
    • Hive and Pig are wrappers on top of Map Reduce
  • In memory
    • Spark
    • Spark Data Frames are a wrapper on top of core Spark
  • As part of data processing typically we focus on transformations such as
    • Aggregations
    • Joins
    • Sorting
    • Ranking

Data Processing - Operational

There are many options for distributed databases

  • Data is typically stored in distributed databases
  • Supports CRUD operations
  • Data is typically distributed
  • Data is typically sorted by key
  • Fast and scalable random reads
  • NoSQL
    • HBase
    • Cassandra
    • MongoDB
  • Search databases
    • Elasticsearch
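The "sorted by key" property above is what makes random reads fast in stores like HBase and Cassandra: with keys kept in order, a lookup is a binary search instead of a full scan. The tiny class below sketches that idea with CRUD operations; keys, values, and the in-memory design are illustrative only (real stores persist sorted files on disk).

```python
# Sketch of a sorted key-value store: CRUD with O(log n) reads via
# binary search over sorted keys. Purely illustrative, in-memory only.
import bisect

class SortedKVStore:
    def __init__(self):
        self.keys = []      # kept sorted at all times
        self.values = []

    def put(self, key, value):          # Create / Update
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):                 # Read via binary search
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def delete(self, key):              # Delete
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i], self.values[i]

store = SortedKVStore()
store.put("row2", "b"); store.put("row1", "a"); store.put("row3", "c")
print(store.get("row2"))  # b
```

Keeping keys sorted also makes range scans cheap, which is why HBase row-key design matters so much in practice.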

Data Visualization - BI tools as well as custom

Processed data is analyzed or visualized using

  • BI Tools
  • Custom visualization frameworks (d3js)
  • Ad hoc query tools

Role of Apache and other distributions

  • Each of these is a separate project incubated under Apache
    • HDFS and MapReduce/YARN
    • Hive
    • Pig
    • Sqoop
    • HBase
    • Etc
  • Setup Process
    • Get jar files from Apache
    • Deploy on all the nodes in the cluster
    • Configure the cluster
    • Manual setup is not practical
  • Tools/Distributions used to setup process
    • DevOps tools (puppet/chef)
    • Cloudera
    • Hortonworks
    • MapR

Installation (plain vanilla)

  • In plain vanilla mode, depending upon the architecture, each tool/technology needs to be manually downloaded, installed and configured.
  • Typically people use Puppet or Chef to set up clusters using plain vanilla tools
  • Advantages
    • You can set up your cluster with latest versions from Apache directly
  • Disadvantages
    • Installation is tedious and error prone
    • Need to integrate with monitoring tools

Hadoop Distributions

  • Different vendors pre-package the Apache suite of big data tools into their distributions to facilitate
    • Easier installation/upgrade using wizards
    • Better monitoring
    • Easier maintenance
    • and many more
  • Leading distributions include, but are not limited to
    • Cloudera
    • Hortonworks
    • MapR
    • AWS EMR
    • IBM Big Insights
    • and many more

Different Certifications in Big Data

  • Why certify?
    • To promote skills
    • Demonstrate industry recognized validation for your expertise.
    • Meet global standards required to ensure compatibility between Spark and Hadoop
    • Stay up to date with the latest advances in Big Data technologies such as Spark and Hadoop

Resources for deep dive into Big Data

Resources to learn Big Data with hands on practice

  • YouTube Channel: (please subscribe)
  • 900+ videos
  • 100+ playlists
  • 6 Certification courses
  • - launched recently
  • Few courses added
  • Other courses will be added over time
  • Courses will be either role based or certification based
  • Will be working on blogging platform for IT content


Job roles in Big Data

Here are typical job roles, along with experience required and desired skills

  • Hadoop Developer (0-7 years): Hadoop; programming using Java, Spark, Hive, Pig, Sqoop etc.
  • Hadoop Administrator (0-10 years): Linux; Hadoop administration using distributions
  • Big Data Engineer (3-15 years): data warehousing, ETL, Hadoop, Hive, Pig, Sqoop, Spark etc.
  • Big Data Solutions Architect (12-18 years): deep understanding of the Big Data eco system such as Hadoop, NoSQL etc.
  • Infrastructure Architect (12-18 years): deep understanding of infrastructure as well as the Big Data eco system
