Big Data | Hadoop | Spark | Scala | Kafka


  • Batch Timings:
  • Starting Date:

Course Overview

Hadoop is a Big Data framework that helps store, process, and analyze large volumes of unstructured data on commodity hardware. It is an open-source software framework written in Java that supports distributed applications. It was introduced by Doug Cutting & Michael J. Cafarella in mid-2006, and Yahoo became its first large commercial user of Hadoop in 2008.

Hadoop has two major generations: Hadoop 1.0 and Hadoop 2.0, the latter based on the YARN (Yet Another Resource Negotiator) architecture. Enterprises looking to leverage big data environments need Big Data Architects who can design, build, and deploy large-scale Hadoop applications.

Big Data refers to collections of data so huge that they are hard to measure, manage, and process with traditional tools. We live in the data age, and this flood of Big Data comes from many different sources, such as the New York Stock Exchange, Facebook, Twitter, aircraft sensors, Walmart, etc.

Apache Spark is a big data processing framework whose popularity lies in the fact that it is fast, easy to use, and offers sophisticated tools for data analysis. Its built-in modules for streaming, machine learning, SQL, and graph processing make it useful in diverse industries such as banking, insurance, retail, healthcare, and manufacturing.

COURSE FEATURES

  • Resume & Interview Preparation Support
  • Hands-on Experience on Projects
  • 100% Placement Assistance
  • Multiple Flexible Batches
  • Missed Sessions Covered
  • Practice Course Material

At the end of this training course, participants will be able to:

  • Understand the Apache Hadoop framework completely
  • Learn to work with HDFS
  • Discover how MapReduce works with data and processes it
  • Design and develop big data applications using the Hadoop ecosystem
  • Learn how YARN helps in managing resources in a cluster
  • Write and execute programs on YARN
  • Understand the fundamentals of Apache Spark and Scala
  • Differentiate between Spark and Hadoop
  • Implement Spark on a cluster
  • Learn the Scala programming language and its concepts
  • Understand Scala and Java interoperability

Course Duration

  • Weekend: 60 hours

Prerequisites:

  • Basics of Core Java and OOP concepts.
  • Basic knowledge of databases, SQL, and query languages is helpful.

Who Should Attend?

  • Software Engineers looking to upgrade their Big Data skills
  • Data Engineers and ETL Developers
  • Data Scientists and Analytics Professionals
  • IT Professionals
  • Software Testing Professionals
  • Graduates looking to build a career in Big Data
  • Anyone interested in Big Data analytics

Course Content

1.1 Big Data Introduction

  • What is Big Data
  • Evolution of Big Data
  • Why Big Data?
  • Role Played by Big Data
  • Data management – Industry Challenges
  • Types of data
  • Sources of Big Data
  • Big Data examples
  • What is streaming data?
  • Batch vs Streaming data processing

1.2 Hadoop Introduction

  • History of Hadoop
  • Problems with traditional large-scale systems
  • Requirements for a new approach
  • Why is Hadoop in demand in the market today?
  • Why do we need Hadoop?
  • What is Hadoop?
  • How Hadoop solves the Big Data problem
  • Hadoop Architecture.
  • Hadoop ecosystem Components
  • HDFS Overview
  • Hadoop 1.x vs Hadoop 2.x
  • Hadoop 1.X Architecture
  • Hadoop 1.X Core Components
  • Hadoop 1.X Job Process
  • Overview of Hadoop Daemons
  • Hadoop Daemons in Hadoop Release-1.x
  • Hadoop Daemons in Hadoop Release-2.x
  • Hadoop Release-3.x
  • Comparing Hadoop & SQL.

1.3 Basic Java Overview for Hadoop

  • Object-oriented concepts
  • Variables and data types
  • Static data types
  • Primitive data types
  • Objects & classes
  • Wrapper classes
  • Java operators
  • Methods and their types
  • Constructors
  • Conditional statements
  • Looping in Java
  • Access modifiers
  • Inheritance
  • Polymorphism
  • Method overloading & overriding
  • Interfaces

1.4 Building Blocks

  • Quick tour of Java (as Hadoop is written in Java, this will help us understand it better)
  • Quick tour of Linux commands (basic commands to navigate the Linux OS)
  • Quick tour of RDBMS concepts (needed for Hive and Impala)
  • Quick hands-on experience with SQL
  • Introduction to Cloudera VM and usage instructions

1.5 Setting up the Development Environment (Cloudera QuickStart VM)

  • Overview of Big Data Tools
  • Different vendors providing Hadoop and where each fits in the industry
  • Setting up the development environment & performing a Hadoop installation on the user’s laptop
  • Hadoop daemons
  • Starting and stopping daemons using the command line and Cloudera Manager

1.6 Hadoop Cluster (Introduction and Installation)

  • Nodes in a Hadoop cluster (master & slave)
  • Setting up a Hadoop cluster
  • Preparing nodes for Hadoop and VM settings
  • Installing Java and configuring passwordless SSH across nodes
  • Basic Linux commands
  • Hadoop 1.x single-node deployment
  • Hadoop Daemons – NameNode, JobTracker, DataNode, TaskTracker, Secondary NameNode
  • Hadoop configuration files and running Hadoop
  • Important web URLs and logs for Hadoop
  • Running HDFS and Linux commands
  • Hadoop 1.x multi-node deployment
  • Running sample jobs on single-node and multi-node clusters

1.7 HDFS (Hadoop Distributed File System)

  • HDFS design goals
  • Understanding the HDFS architecture
  • Understanding blocks and how to configure block size
  • Block replication and replication factor
  • Understanding Hadoop rack awareness and configuring racks in Hadoop
  • Anatomy of file reads and writes in HDFS
  • Configuring HDFS name and space quotas
  • Configuring and using WebHDFS (REST API for HDFS)
  • Understanding NameNode safemode, the file system image and edits
  • Configuring the Secondary NameNode and using the checkpointing process to support NameNode recovery
  • HDFS DFSAdmin and file system shell commands
  • Hadoop NameNode/DataNode directory structure
  • Metadata, FSImage, edit log, Secondary NameNode and safemode
  • How to add a new DataNode dynamically
  • How to decommission a DataNode dynamically (without stopping the cluster)
  • Data processing and the replication pipeline
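
To make the file read/write anatomy concrete, here is a minimal sketch of the HDFS client (FileSystem) API in Scala. It assumes the usual core-site.xml/hdfs-site.xml are on the classpath; the path and file contents are illustrative.

    import java.nio.charset.StandardCharsets

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsQuickTour {
      def main(args: Array[String]): Unit = {
        // fs.defaultFS, block size and replication factor come from the cluster config files.
        val conf = new Configuration()
        val fs   = FileSystem.get(conf)

        // Write a small file (illustrative path); overwrite if it already exists.
        val file = new Path("/tmp/hdfs-demo/hello.txt")
        val out  = fs.create(file, true)
        out.write("hello from the HDFS client API\n".getBytes(StandardCharsets.UTF_8))
        out.close()

        // Read it back.
        val in  = fs.open(file)
        val buf = new Array[Byte](1024)
        val n   = in.read(buf)
        in.close()
        println(new String(buf, 0, n, StandardCharsets.UTF_8))

        // Inspect block placement: the part that rack awareness and replication influence.
        val status = fs.getFileStatus(file)
        fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
          println(s"block hosts: ${block.getHosts.mkString(", ")}")
        }
        fs.close()
      }
    }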

1.8 MapReduce

  • Introduction to MapReduce
  • Concepts of MapReduce
  • MapReduce architecture
  • Advanced concepts of MapReduce
  • Understanding how distributed processing solves the big data challenge and how MapReduce addresses it
  • Understanding the concept of Mappers and Reducers
  • Phases of a MapReduce program
  • Anatomy of a Map Reduce Job Run
  • Data-types in Hadoop MapReduce
  • Role of InputSplit and RecordReader
  • Input format and Output format in Hadoop
  • Concepts of Combiner and Partitioner
  • Running and Monitoring MapReduce jobs
  • Writing your own MapReduce job using MapReduce API
  • Difference between Hadoop 1 & Hadoop 2
  • The Hadoop Java API for MapReduce
  • Mapper Class
  • Reducer Class
  • Driver Class

1.9 Configuration

  • Basic Configuration of MapReduce
  • Writing and Executing the Basic MapReduce Program using Java
  • Submission & Initialization of MapReduce Job.
  • Explain the Driver, Mapper and Reducer code
  • Word count problem and solution (a sketch follows this list)
  • Configuring development environment – Eclipse
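
For orientation, below is a minimal word-count sketch against the Hadoop MapReduce API. The course exercises use Java; this version is written in Scala (the Mapper/Reducer/Driver structure is the same), and the class names and input/output paths are illustrative.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Mapper: emit (word, 1) for every word in the input split.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(offset: LongWritable, line: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        line.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w.toLowerCase)
          ctx.write(word, one)
        }
    }

    // Reducer (also usable as a combiner): sum the counts for each word.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(word: Text, counts: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it  = counts.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(word, new IntWritable(sum))
      }
    }

    // Driver: configures and submits the job; input/output paths come from the command line.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenMapper])
        job.setMapperClass(classOf[TokenMapper])
        job.setCombinerClass(classOf[SumReducer])
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }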

2.1 Pig

  • What is Apache Pig?
  • Why Apache Pig?
  • Pig features
  • Where should Pig be used
  • Where not to use Pig
  • Why Pig when MapReduce is already there?
  • Pig architecture and components
  • Pig installation
  • Accessing the Pig Grunt shell
  • Pig data types
  • Pig commands – LOAD, STORE, DESCRIBE, DUMP
  • Pig relational operators
  • Pig user-defined functions
  • Configuring Pig to use HCatalog
  • Tight coupling between Pig and MapReduce
  • Pig Latin scripting
  • Pig running modes
  • MapReduce vs. Pig
  • Pig in local mode
  • Pig in MapReduce mode
  • Execution mechanism and data processing
  • Writing UDFs
  • Macros in Pig

2.2 Hive

  • Overview of Hive
  • Background of Hive
  • Hive vs Pig
  • Hive Architecture
  • Components of Hive
  • Installation & Configuration
  • Working with Tables
  • Primitive and Complex Data Types
  • Hive Bucketed Tables and Sampling
  • Dynamic Partitioning
  • Differences between ORDER BY, DISTRIBUTE BY and SORT BY
  • Bucketing and Sorted Bucketing with Dynamic Partitioning
  • RCFile Format
  • Indexes and Views
  • Map-side Joins
  • Compression on Hive Tables and Migrating Hive Tables
  • Dynamic (Variable) Substitution in Hive and Different Ways of Running Hive
  • How to Enable Updates in Hive
  • Log Analysis with Hive
  • Accessing HBase Tables using Hive
  • Hive Services: Hive Shell, HiveServer and Hive Web Interface (HWI)
  • Metastore
  • Creating Tables, Loading Datasets & Performing Analysis on those Datasets
  • How to Capture Inserts, Updates and Deletes in Hive
  • Hue Interface for Hive
  • How to Analyse Data using Hive Scripts
  • Differences between Hive and Impala

2.3 Sqoop

  • Introduction to Apache Sqoop
  • Sqoop Architecture and installation
  • Import Data using Sqoop in HDFS
  • Import all tables in Sqoop
  • Export data from HDFS
  • Setting up an RDBMS server and creating & loading datasets into MySQL
  • Writing Sqoop import commands to transfer data from the RDBMS to HDFS/Hive/HBase
  • Writing Sqoop export commands to transfer data from HDFS/Hive to the RDBMS

2.4 Flume

  • Installation
  • Introduction to Flume
  • Flume Agents: Sources, Channels and Sinks
  • Flume Commands
  • Flume Use Cases
  • How to load data coming from a web server or other storage into Hadoop
  • How to load streaming Twitter data into HDFS using Flume

2.5 NoSQL Databases – HBase

  • Introduction to NoSQL Databases and HBase
  • HBase vs RDBMS, HBase Components, HBase Architecture
  • HBase Cluster Deployment

2.6 HBase

  • Introduction to HBase
  • HBase Architecture
  • HBase Installation and Configuration
  • HBase Concepts
  • HBase Data Model and Comparison between RDBMS and NoSQL
  • Master & Region Servers
  • HBase Operations (DDL and DML) through the Shell and Programmatically
  • Catalog Tables
  • Block Cache and Sharding
  • Splits
  • Data Modeling (Sequential, Salted, Promoted and Random Keys)
  • Java APIs and the REST Interface
  • Client-side Buffering and Processing 1 Million Records using Client-side Buffering
  • HBase Counters
  • Enabling Replication and HBase Raw Scans
  • HBase Filters
  • Bulk Loading and Coprocessors (Endpoints and Observers, with Programs)
  • Real-world Use Case combining HDFS, MapReduce and HBase
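
A minimal sketch of the HBase Java client API called from Scala, covering a single Put and Get. The table name, column family and row key are assumptions, and the table is assumed to already exist.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object HBasePutGet {
      def main(args: Array[String]): Unit = {
        // hbase-site.xml on the classpath supplies the ZooKeeper quorum.
        val conf       = HBaseConfiguration.create()
        val connection = ConnectionFactory.createConnection(conf)

        // Table 'user_profiles' with column family 'info' is assumed to exist,
        // e.g. created in the HBase shell with: create 'user_profiles', 'info'
        val table = connection.getTable(TableName.valueOf("user_profiles"))

        // DML through the API: write one cell for row key "user#1001".
        val put = new Put(Bytes.toBytes("user#1001"))
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"))
        table.put(put)

        // Read the cell back.
        val result = table.get(new Get(Bytes.toBytes("user#1001")))
        val city   = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city")))
        println(s"city = $city")

        table.close()
        connection.close()
      }
    }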

2.7 Oozie, HUE and Yarn (Hadoop Processing Framework)

  • Oozie Fundamentals
  • Oozie: Components
  • Oozie workflow creations
  • Scheduling with Oozie
  • Concepts of Coordinators and Bundles
  • Client Nodes
  • Hands-on Training on Oozie Workflow
  • Oozie Coordinator
  • Oozie Commands
  • Oozie Web Console
  • Oozie for MapReduce
  • Hive in Oozie
  • An Overview of Hue
  • Hue in Real-time Scenarios
  • Introduction to YARN
  • Significance of YARN
  • YARN Daemons – Resource Manager, NodeManager etc.
  • Job assignment & Execution flow

2.8 Zookeeper

  • Introduction to Zookeeper
  • How Zookeeper helps in Hadoop Ecosystem
  • How to load data from Relational storage in Hadoop
  • Data Model of ZooKeeper
  • Operations of ZooKeeper
  • ZooKeeper Implementation
  • Sessions, States and Consistency.

2.9 Introduction to YARN

  • What is YARN? YARN Architecture
  • YARN daemons
  • Active and Standby NameNodes
  • Resource Manager and Application Master
  • Node Manager
  • Container Objects and Container
  • NameNode Federation

2.10 Hadoop on Google Cloud

  • Introduction to Google Cloud infrastructure
  • Creating VM instances on Google Cloud
  • Deploying data to Google Cloud
  • Choosing the size of an instance
  • Configuration of an EMR instance
  • Creating a virtual cluster on Google Cloud
  • Creating a project on Google Cloud
  • Deploying the project

3.1 Introduction to Data Analysis with Spark

  • What Is Apache Spark?
  • A Unified Stack
  • Spark Core
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX
  • Cluster Managers
  • Who Uses Spark, and for What?
  • Data Science Tasks
  • Data Processing Applications
  • A Brief History of Spark
  • Spark Versions and Releases
  • Storage Layers for Spark

3.2 Downloading Spark and Getting Started

  • Downloading Spark
  • Introduction to Spark’s Python and Scala Shells
  • Introduction to Core Spark Concepts
  • Standalone Applications
  • Initializing a SparkContext
  • Building Standalone Applications
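
A minimal standalone-application sketch showing SparkContext initialization in Scala. The local[*] master and the input path are illustrative; on a real cluster the master is usually supplied by spark-submit.

    import org.apache.spark.{SparkConf, SparkContext}

    object FirstSparkApp {
      def main(args: Array[String]): Unit = {
        // local[*] runs Spark inside this JVM; on a cluster the master is
        // normally passed by spark-submit rather than hard-coded here.
        val conf = new SparkConf().setAppName("FirstSparkApp").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Illustrative input path; any text file visible to the application works.
        val lines     = sc.textFile("data/sample.txt")
        val withSpark = lines.filter(_.contains("Spark"))
        println(s"lines mentioning Spark: ${withSpark.count()}")

        sc.stop()
      }
    }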

3.3 Scala Programming

  • Why Scala?
  • OOP
  • Environment Setup
  • Data Types, Variables
  • Functions, Comments
  • Inheritance
  • Annotations
  • if – else statement
  • String
  • Singleton & Companion Object
  • Case Class
  • Implement Abstract Class
  • String Method
  • Method Overloading
  • Method & Field Overriding
  • String Interpolation
  • Arrays
  • Operators
  • While Loop
  • Do While Loop
  • For Loop
  • Loop Control Statement
  • Control Structures
  • Tuples
  • Map
  • Sets
  • Constructor
  • Extractors
  • Iterators
  • Pattern Matching
  • List
  • Closures
  • Option
  • Final & This
  • Trait
  • Regular Expressions
  • Partial Functions
  • Currying Functions
  • Access Modifiers
  • File I/O
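
A small, self-contained Scala sketch touching several of the topics above (case classes, pattern matching, higher-order functions, Option, string interpolation, for-comprehensions). All names and values are illustrative.

    object ScalaTour {
      // Case classes give immutable fields, equals/hashCode and pattern-matching support for free.
      case class Employee(name: String, dept: String, salary: Double)

      // Pattern matching on a field value.
      def bonus(e: Employee): Double = e.dept match {
        case "Engineering" => e.salary * 0.15
        case "Sales"       => e.salary * 0.10
        case _             => e.salary * 0.05
      }

      def main(args: Array[String]): Unit = {
        val staff = List(
          Employee("Asha",  "Engineering", 90000),
          Employee("Ravi",  "Sales",       60000),
          Employee("Meera", "HR",          50000)
        )

        // Higher-order functions over an immutable collection.
        val totalBonus = staff.map(bonus).sum
        println(f"total bonus: $totalBonus%.2f")

        // Option instead of null, consumed with pattern matching.
        staff.find(_.name == "Ravi") match {
          case Some(e) => println(s"${e.name} works in ${e.dept}")
          case None    => println("not found")
        }

        // String interpolation inside a for-comprehension with a guard.
        for (e <- staff if e.salary > 55000) println(s"${e.name} earns ${e.salary}")
      }
    }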

3.4 Programming with RDDs

  • RDD Basics
  • Creating RDDs
  • RDD Operations
  • Transformations
  • Actions
  • Lazy Evaluation
  • Passing Functions to Spark
  • Common Transformations and Actions
  • Basic RDDs
  • Converting Between RDD Types
  • Persistence (Caching)

3.5 Working with Key/Value Pairs (Cloudera QuickStart VM)

  • Motivation
  • Creating Pair RDDs
  • Transformations on Pair RDDs
  • Aggregations
  • Grouping Data
  • Joins
  • Sorting Data
  • Actions Available on Pair RDDs
  • Data Partitioning (Advanced)
  • Determining an RDD’s Partitioner
  • Operations That Benefit from Partitioning
  • Operations That Affect Partitioning
  • Example: PageRank
  • Custom Partitioners
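
A sketch of pair-RDD operations in Scala: reduceByKey aggregation, explicit partitioning with a HashPartitioner, a join, and sorting. The sales and lookup data is made up for illustration.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PairRddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PairRddBasics").setMaster("local[*]"))

        // (productId, amount) sales and (productId, name) lookup data; values are illustrative.
        val sales = sc.parallelize(Seq(("p1", 100.0), ("p2", 40.0), ("p1", 60.0), ("p3", 25.0)))
        val names = sc.parallelize(Seq(("p1", "keyboard"), ("p2", "mouse"), ("p3", "cable")))

        // Aggregation: total amount per product.
        val totals = sales.reduceByKey(_ + _)

        // Partition explicitly so operations that reuse `partitioned` can benefit from it.
        val partitioned = totals.partitionBy(new HashPartitioner(4)).cache()
        println(s"partitioner = ${partitioned.partitioner}")

        // Join with the lookup RDD and sort by total, descending.
        val report = partitioned.join(names)
          .map { case (id, (total, name)) => (name, total) }
          .sortBy(_._2, ascending = false)

        report.collect().foreach { case (name, total) => println(f"$name%-10s $total%8.2f") }
        sc.stop()
      }
    }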

3.6 Loading and Saving Your Data

  • Hadoop Input and Output Formats
  • Text Files
  • JSON
  • Comma-Separated Values and Tab-Separated Values
  • SequenceFiles
  • Object Files
  • File Compression
  • Filesystems
  • Local/“Regular” FS
  • HDFS
  • Structured Data with Spark SQL
  • Apache Hive
  • Databases
  • Java Database Connectivity
  • HBase
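
A Scala sketch of a few of the load/save paths above: text files, manually parsed CSV, SequenceFiles, and compressed text output. All file paths and the CSV layout (name,age per line, no header) are assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object LoadAndSave {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LoadAndSave").setMaster("local[*]"))

        // Plain text: one record per line. Paths may be local, hdfs:// URIs,
        // or anything Hadoop's FileSystem API understands.
        val logs = sc.textFile("data/access.log")

        // CSV parsed manually at the RDD level (Spark SQL's csv reader is the higher-level option).
        val rows   = sc.textFile("data/people.csv").map(_.split(","))
        val adults = rows.filter(r => r(1).trim.toInt >= 18).map(r => (r(0), r(1).trim.toInt))

        // SequenceFiles store key/value pairs of Hadoop Writable types.
        adults.saveAsSequenceFile("out/adults-seq")
        val reloaded = sc.sequenceFile[String, Int]("out/adults-seq")

        // Compressed text output using a codec shipped with Hadoop.
        logs.saveAsTextFile("out/logs-gz", classOf[org.apache.hadoop.io.compress.GzipCodec])

        println(s"adult rows: ${reloaded.count()}")
        sc.stop()
      }
    }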

3.7 Advanced Spark Programming

  • Introduction
  • Accumulators
  • Accumulators and Fault Tolerance
  • Custom Accumulators
  • Broadcast Variables
  • Optimizing Broadcasts
  • Working on a Per-Partition Basis
  • Piping to External Programs
  • Numeric RDD Operations
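
A sketch of broadcast variables and accumulators in Scala. The lookup map and input records are illustrative; note that Spark only guarantees exactly-once accumulator updates inside actions, so counts updated in transformations should be treated as approximate.

    import org.apache.spark.{SparkConf, SparkContext}

    object AccumulatorsAndBroadcast {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("AdvancedSpark").setMaster("local[*]"))

        // Broadcast a small lookup table once per executor instead of shipping it with every task.
        val countryCodes = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

        // Accumulator updated on the executors; read its value only on the driver.
        val badRecords = sc.longAccumulator("badRecords")

        // Illustrative "countryCode,amount" records.
        val raw = sc.parallelize(Seq("IN,200", "US,350", "XX,10", "IN,125"))
        val amountsByCountry = raw.flatMap { line =>
          val Array(code, amount) = line.split(",")
          countryCodes.value.get(code) match {
            case Some(name) => Some(name -> amount.toDouble)
            case None       => badRecords.add(1); None   // unknown code: count and drop
          }
        }.reduceByKey(_ + _)

        amountsByCountry.collect().foreach(println)
        println(s"bad records skipped: ${badRecords.value}")
        sc.stop()
      }
    }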

3.8 Running on a Cluster

  • Introduction
  • Spark Runtime Architecture
  • The Driver
  • Executors
  • Cluster Manager
  • Launching a Program
  • Driver Class

3.9 Deploying Applications with spark-submit

  • Packaging Your Code and Dependencies
  • A Java Spark Application Built with Maven
  • A Scala Spark Application Built with sbt
  • Dependency Conflicts
  • Scheduling Within and Between Spark Applications
  • Cluster Managers
  • Standalone Cluster Manager
  • Hadoop YARN
  • Apache Mesos
  • Amazon EC2
  • Which Cluster Manager to Use?

3.10 Tuning and Debugging Spark

  • Configuring Spark with SparkConf
  • Components of Execution: Jobs, Tasks, and Stages
  • Finding Information
  • Spark Web UI
  • Driver and Executor Logs
  • Key Performance Considerations
  • Level of Parallelism
  • Serialization Format
  • Memory Management
  • Hardware Provisioning
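
A sketch of configuring Spark through SparkConf in Scala. The property values are illustrative; properties set directly on SparkConf take precedence over spark-submit flags and spark-defaults.conf.

    import org.apache.spark.{SparkConf, SparkContext}

    object TunedApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("TunedApp")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster than Java serialization
          .set("spark.default.parallelism", "200")   // level of parallelism for shuffles
          .set("spark.executor.memory", "4g")        // per-executor heap
          .set("spark.ui.port", "4041")              // Spark web UI port

        val sc = new SparkContext(conf)
        println(sc.getConf.toDebugString)            // inspect the effective configuration
        sc.stop()
      }
    }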

3.11 Spark SQL

  • Linking with Spark SQL
  • Using Spark SQL in Applications
  • Initializing Spark SQL
  • Basic Query Example
  • SchemaRDDs
  • Caching
  • Loading and Saving Data
  • Apache Hive
  • Parquet
  • JSON
  • From RDDs
  • JDBC/ODBC Server
  • Working with Beeline
  • Long-Lived Tables and Queries
  • User-Defined Functions
  • Spark SQL UDFs
  • Hive UDFs
  • Spark SQL Performance
  • Performance Tuning Options

3.12 Spark Streaming

  • What is Spark Streaming?
  • Need for Streaming in Apache Spark
  • Why Streaming in Spark?
  • Spark Streaming Architecture and Advantages
  • Goals of Spark Streaming
  • How does Spark Streaming work?
  • Streaming Sources
  • Streaming Operations
  • Transformation Operations in Spark
  • Output Operations in Apache Spark
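
A minimal Spark Streaming sketch in Scala: a socket text source, transformation operations on the DStream, and an output operation. The host/port and the 5-second batch interval are illustrative (the source can be fed locally with a tool such as netcat).

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // At least two local threads: one to receive data, one to process it.
        val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

        // Streaming source: a text socket (illustrative host/port).
        val lines = ssc.socketTextStream("localhost", 9999)

        // Transformation operations on the DStream, then an output operation.
        val counts = lines.flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.print()            // output operation: print each batch's counts

        ssc.start()               // start receiving and processing
        ssc.awaitTermination()    // block until the streaming job is stopped
      }
    }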

3.13 Spark Streaming Tool – Kafka

  • What is Kafka?
  • Use of an Apache Kafka Cluster
  • Kafka Architecture
  • Components
    • Topic
    • Producer
    • Consumer
    • Broker
    • ZooKeeper
  • Partition in Kafka
  • Kafka Use Cases
  • Apache Kafka vs Apache Flume
  • RabbitMQ vs Apache Kafka
  • Traditional queuing systems vs Apache Kafka
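
A minimal Kafka producer sketch in Scala using the Kafka Java client. The broker address, topic name and record contents are assumptions, and the topic is assumed to exist. The record key determines the partition, which is how per-key ordering is preserved.

    import java.util.Properties

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object KafkaProducerDemo {
      def main(args: Array[String]): Unit = {
        // Broker address is illustrative; key/value serializers turn Strings into bytes.
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        try {
          // Records with the same key always land in the same partition of the topic.
          for (i <- 1 to 5) {
            val record = new ProducerRecord[String, String]("page-views", s"user-$i", s"/home visited, event $i")
            producer.send(record)
          }
        } finally {
          producer.flush()
          producer.close()
        }
      }
    }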

FAQ

What are the system requirements?

  • Java 1.6.x or higher.
  • Linux and Windows are the supported operating systems, but BSD, Mac OS X, and OpenSolaris are known to work.

What is Apache Spark, and how does it relate to Hadoop?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

How is the training conducted?

The “Spark with Scala” training is a hands-on course. All the code and exercises are done in the classroom sessions. Our batch sizes are generally small so that personalized attention can be given to each and every learner.

What are the batch timings?

Classes are held on weekdays and weekends. You can check the available schedules and choose the batch timings that are convenient for you.
