Big Data|Hadoop|Spark|Scala|Kafka

Course Overview

Hadoop is a Big Data framework that stores, processes, and analyzes large volumes of data, including unstructured data, on clusters of commodity hardware. It is an open-source software framework written in Java that supports distributed applications. It was created by Doug Cutting and Michael J. Cafarella and introduced in 2006. Yahoo was one of its earliest commercial users (2008).

Hadoop has two major generations: Hadoop 1.0 and Hadoop 2.0, the latter of which is based on the YARN (Yet Another Resource Negotiator) architecture. Enterprises looking to leverage the big data environment need Big Data Architects who can design, build, and deploy large-scale Hadoop applications.

Big Data refers to collections of data so large that their total volume is hard to measure, manage, and process. We live in the data age, and this flood of data comes from many different sources, such as the New York Stock Exchange, Facebook, Twitter, aircraft, and Walmart.

Apache Spark is a big data processing framework. Its popularity lies in the fact that it is fast, easy to use, and offers sophisticated solutions for data analysis. Its built-in modules for streaming, machine learning, SQL, and graph processing make it useful in diverse industries such as banking, insurance, retail, healthcare, and manufacturing.


  • Resume & Interview Preparation Support
  • Hands-on Project Experience
  • 100% Placement Assistance
  • Multiple Flexible Batches
  • Missed Sessions Covered
  • Practice Course Material

At the end of this Training Course, Participants will be able to:

  • Understand the Apache Hadoop framework in depth.
  • Work with HDFS.
  • Explain how MapReduce processes data.
  • Design and develop big data applications using the Hadoop ecosystem.
  • Understand how YARN manages cluster resources.
  • Write and run programs on YARN.
  • Explain the fundamentals of Apache Spark and Scala.
  • Describe the differences between Spark and Hadoop.
  • Implement Spark on a cluster.
  • Use the Scala programming language and its concepts.
  • Work with Scala and Java interoperability.

Course Duration

  • Weekends: 12-14 weekends
  • Weekdays: 3-3.5 months

Prerequisites

  • Basics of Core Java and OOP concepts.
  • Basic knowledge of databases and SQL is helpful.

Who Should Attend?

  • Software Engineers looking to upgrade Big Data skills
  • Data Engineers and ETL Developers
  • Data Scientists and Analytics Professionals
  • Graduates looking to make a career in Big Data
  • IT Professionals
  • Software Testing Professionals
  • Anyone interested in Big Data Analytics


1.1 Big Data Introduction

  • What is data? Types of data. What is Big Data?
  • Evolution of Big Data, need for Big Data Analytics
  • Sources of data, how to define Big Data using the three V’s

1.2 Apache Hadoop and the Hadoop Ecosystem

  • History of Hadoop
  • Problems with traditional large-scale systems
  • Requirements for a new approach
  • Why is Hadoop in demand in the market today?
  • Why do we need Hadoop?
  • What is Hadoop?
  • Apache Hadoop Overview
  • Data Ingestion and Storage
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Activity: Querying Hadoop Data

1.3 Basic Java Overview for Hadoop

  • Object-oriented concepts
  • Variables and Data types
  • Static data type
  • Primitive data types
  • Objects & Classes
  • Types: Wrapper classes
  • Java Operators
  • Methods and their types
  • Constructors
  • Conditional statements
  • Looping in Java
  • Access Modifiers
  • Inheritance
  • Polymorphism
  • Method overloading & overriding
  • Interfaces

1.4 Getting started with Cloudera QuickStart VM

  • Getting started with a Big Data Hadoop Cluster using Cloudera CDH
  • Creating a virtual environment (demo)
  • QuickStart VM CDH Navigation
  • Introduction to Cloudera Manager

1.5 Apache Hadoop File Storage

  • Apache Hadoop Cluster Components
  • HDFS Architecture
  • Using HDFS
  • Activity: Accessing HDFS with the Command Line and Hue

1.6 Distributed Processing on an Apache Hadoop Cluster

  • YARN Architecture
  • Working With YARN
  • Activity: Running and Monitoring a YARN Job

1.7 MapReduce

  • Introduction to MapReduce
  • Concepts of MapReduce
  • MapReduce architecture
  • Advanced Concepts of MapReduce
  • Understanding how distributed processing addresses the big data challenge and how MapReduce helps
  • Understanding the concept of Mappers and Reducers
  • Phases of a MapReduce program
  • Anatomy of a MapReduce Job Run
  • Data types in Hadoop MapReduce
  • Role of InputSplit and RecordReader
  • Input format and Output format in Hadoop
  • Concepts of Combiner and Partitioner
  • Running and Monitoring MapReduce jobs
  • Writing your own MapReduce job using MapReduce API
  • Difference between Hadoop 1 & Hadoop 2
  • The Hadoop Java API for MapReduce
  • Mapper Class
  • Reducer Class
  • Driver Class
  • Basic Configuration of MapReduce
  • Writing and Executing the Basic MapReduce Program using Java
  • Submission & Initialization of a MapReduce Job
  • Walkthrough of the Driver, Mapper and Reducer code
  • Word count problem and solution
  • Configuring development environment – Eclipse
  • Testing and debugging the project in Eclipse, then packaging and deploying the code on a Hadoop Cluster
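
To give a feel for the word count problem listed above, here is a minimal sketch of the map, shuffle, and reduce phases in plain Java. It deliberately does not use the Hadoop MapReduce API, so it runs on a single machine without a cluster. The class name `WordCountSketch` and its method names are illustrative assumptions, not part of Hadoop or of the course material.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the word count flow; no Hadoop classes are used.
class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle" phase: group all emitted values by key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // "Driver": wires the three phases together for a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hadoop stores data", "Spark and Hadoop process data");
        // prints {and=1, data=2, hadoop=2, process=1, spark=1, stores=1}
        System.out.println(wordCount(lines));
    }
}
```

On a real cluster the same three roles are played by a Mapper class, the framework's shuffle and sort step, and a Reducer class, with a Driver class configuring and submitting the job.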

2.1 Data Analysis using Pig

  • Introduction to Apache Pig
  • Why Pig when MapReduce already exists?
  • Pig Architecture and components
  • Pig Installation
  • Accessing Pig Grunt Shell
  • Pig Data types
  • Pig Commands – Load, Store, Describe, Dump
  • Pig Relational Operators
  • Pig User Defined Functions
  • Tight coupling between Pig and MapReduce
  • Pig Latin scripting
  • Pig running modes
  • Pig in local mode
  • Pig in MapReduce mode

2.2 Data Analysis using Hive

  • Introduction to Hive
  • Hive Architecture
  • Components of Hive
  • Installation & configuration
  • Working with Tables
  • Primitive and complex data types
  • Hive Bucketed Tables and Sampling
  • Dynamic Partition
  • Differences between ORDER BY, DISTRIBUTE BY and SORT BY
  • Bucketing and Sorted Bucketing with Dynamic Partition
  • RCFile
  • Compression on Hive tables and Migrating Hive tables
  • Dynamic substitution in Hive and different ways of running Hive
  • How to enable Updates in Hive
  • Log Analysis on Hive
  • Accessing HBase tables using Hive
  • Hive Services, Hive Shell, Hive Server and Hive Web Interface (HWI)
  • Metastore
  • Creating tables, loading datasets, and performing analysis on those datasets
  • How to Capture Inserts, Updates and Deletes in Hive
  • Hue Interface for Hive
  • How to analyze data using Hive scripts

2.3 Data Integration using Sqoop

  • Introduction to Apache Sqoop
  • Sqoop Architecture and installation
  • Import Data using Sqoop in HDFS
  • Import all tables in Sqoop
  • Export data from HDFS
  • Setting up an RDBMS server (MySQL) and creating & loading datasets into it
  • Writing Sqoop import commands to transfer data from an RDBMS to HDFS/Hive/HBase
  • Writing Sqoop export commands to transfer data from HDFS/Hive to an RDBMS

2.4 Real-Time Data Streaming using Flume

  • Installation
  • Introduction to Flume
  • Flume Agents: Sources, Channels and Sinks
  • Flume Commands
  • Flume Use Cases
  • How to load data into Hadoop from a web server or other storage
  • How to load streaming Twitter data into HDFS

2.5 Hadoop NoSQL Database HBase

  • Introduction to HBase
  • HBase Architecture
  • HBase Installation and configuration
  • HBase concepts
  • HBase Data Model and Comparison between RDBMS and NoSQL
  • Master & Region Servers
  • HBase Operations (DDL and DML) through the Shell and through Programming
  • Catalog Tables

2.6 Zookeeper

  • Introduction to ZooKeeper
  • How ZooKeeper helps in the Hadoop Ecosystem
  • How to load data from relational storage into Hadoop
  • Data Model of ZooKeeper
  • Operations of ZooKeeper
  • ZooKeeper Implementation
  • Sessions, States and Consistency.

3.1 Scala Basics

  • Functional programming basics
  • Scala Programming constructs walkthrough
  • Apache Spark Basics
  • What is Apache Spark?
  • Starting the Spark Shell
  • Using the Spark Shell
  • Introduction to Spark framework
  • Spark Architecture

3.2 Scala Programming

  • Why Scala?
  • Data Types, Variables
  • Operators
  • While Loop
  • Do While Loop
  • For Loop
  • Loop Control Statement
  • Control Structures
  • Functions, Comments
  • ArrayBuffer
  • Tuples
  • Map
  • Sets

3.3 Programming with RDDs

  • RDD Overview
  • RDD Data Sources
  • Creating and Saving RDDs
  • RDD Operations
  • Activity: Working with RDDs

3.4 Transforming Data with RDDs

  • Writing and Passing Transformation Functions
  • Transformation Execution
  • Use Cases with RDDs
  • Activity: Transforming Data using RDDs

3.5 Introduction to Pair RDDs

  • Key-Value Pair RDDs
  • Map-Reduce with Pair RDDs
  • Other Pair RDD Operations
  • Activity: Joining Data Using Pair RDDs

3.6 Spark SQL: DataFrames and Schemas

  • Getting Started with Datasets and DataFrames
  • DataFrame Operations
  • Activity: Exploring DataFrames Using the Apache Spark Shell
  • Creating DataFrames from Data Sources
  • Saving DataFrames to Data Sources
  • DataFrame Schemas
  • Eager and Lazy Execution
  • Querying DataFrames Using Column Expressions
  • Grouping and Aggregation Queries
  • Joining DataFrames
  • Activity: Analyzing Data with DataFrame Queries
  • Activity: Working with DataFrames and Schemas

3.7 Querying tables using Spark SQL

  • Querying Files and Views
  • The Catalog API
  • Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
  • Activity: Querying Tables and Views with SQL

3.8 Working with Datasets in Scala

  • Datasets and DataFrames
  • Creating Datasets
  • Loading and Saving Datasets
  • Dataset Operations
  • Activity: Using Datasets in Scala

3.9 Writing, Configuring, and Running Apache Spark Applications

  • Writing a Spark Application
  • Building and Running an Application
  • Application Deployment Mode
  • The Spark Application Web UI
  • Configuring Application Properties
  • Activity: Writing, Configuring, and Running a Spark Application

3.10 Apache Spark Streaming: Introduction to DStreams

  • Apache Spark Streaming Overview
  • Example: Streaming Request Count
  • DStreams
  • Developing Streaming Applications
  • Activity: Writing a Streaming Application

3.11 Apache Spark Streaming: Processing Multiple Batches

  • Multi-Batch Operations
  • Time Slicing
  • State Operations
  • Sliding Window Operations
  • Activity: Processing Multiple Batches of Streaming Data
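
As a rough illustration of the sliding window idea listed above, the sketch below sums per-batch event counts over a window that covers several batches and moves forward by a fixed number of batches. It is written in plain Java and does not use the Spark Streaming API; the class `SlidingWindowSketch`, the method `windowedSums`, and the sample numbers are all hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of a sliding window over micro-batches; no Spark classes are used.
class SlidingWindowSketch {

    // Sums per-batch counts over the last `windowLength` batches,
    // producing one result every `slideInterval` batches.
    static List<Integer> windowedSums(List<Integer> batchCounts, int windowLength, int slideInterval) {
        List<Integer> sums = new ArrayList<>();
        for (int end = slideInterval; end <= batchCounts.size(); end += slideInterval) {
            // Early windows may cover fewer batches than windowLength.
            int start = Math.max(0, end - windowLength);
            int sum = 0;
            for (int i = start; i < end; i++) {
                sum += batchCounts.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Six batches of request counts, a window of 3 batches, sliding by 2 batches.
        List<Integer> batchCounts = List.of(4, 1, 3, 2, 5, 0);
        // prints [5, 6, 7]
        System.out.println(windowedSums(batchCounts, 3, 2));
    }
}
```

In this sketch the window length and slide interval are counted in batches. In a streaming engine they are usually expressed as durations that are multiples of the batch interval.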

3.12 Apache Spark Streaming: Data Sources

  • Streaming Data Source Overview
  • Apache Flume and Apache Kafka Data Sources
  • Example: Using a Kafka Direct Data Source
  • Activity: Processing Streaming Apache Kafka Messages

3.13 Message Processing with Apache Kafka

  • What Is Apache Kafka?
  • Apache Kafka Overview
  • Apache Kafka Cluster Architecture
  • Apache Kafka Command Line Tools
  • Activity: Producing and Consuming Apache Kafka Messages

3.14 Capturing Data with Apache Flume

  • What Is Apache Flume?
  • Basic Architecture
  • Sources, Sinks, Channels and Configuration
  • Activity: Collecting Web Server Logs with Apache Flume

3.15 Integrating Apache Flume and Apache Kafka

  • Overview
  • Use Cases
  • Configuration
  • Activity: Sending Messages from Flume to Kafka

3.16 Hadoop on Google Cloud

3.17 Hadoop on AWS


  • Java 1.6.x or higher.
  • Linux and Windows are the supported operating systems, but BSD, Mac OS X, and OpenSolaris are known to work.

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

The “Spark with Scala” course is hands-on. All code and exercises are completed in the classroom sessions. Our batch sizes are generally small so that each learner receives personalized attention.

Classes are held on weekdays and weekends. You can check available schedules and choose the batch timings which are convenient for you.
