Big Data with Spark and Scala


Course Overview

Apache Spark is a big data processing framework whose popularity stems from the fact that it is fast, easy to use, and offers sophisticated tools for data analysis. Its built-in modules for streaming, machine learning, SQL, and graph processing make it useful across industries such as banking, insurance, retail, healthcare, and manufacturing.

COURSE FEATURES

  • Resume & Interview Preparation Support
  • Hands-on Experience on Projects
  • 100% Placement Assistance
  • Multiple Flexible Batches
  • Missed Sessions Covered
  • Practice Course Material

At the end of the Big Data with Spark and Scala training course, participants will be able to:

  • Understand the fundamentals of Apache Spark and Scala
  • Explain the differences between Hadoop and Apache Spark
  • Implement Spark on a cluster
  • Learn the Scala programming language and its concepts
  • Write Spark applications in Scala, Java, and Python
  • Use Scala-Java interoperability
  • Learn Storm architecture and basic distributed-systems concepts
  • Learn the key features of Big Data
  • Understand the legacy architecture of real-time systems
  • Understand logic dynamics and components in Storm
  • Gain insight into the functioning of Scala
  • Develop real-life Storm projects

Course Duration

  • Weekend: 40-50 hours

Prerequisites :

  • Basic knowledge of databases, SQL, and query languages is helpful.
  • Basic knowledge of object-oriented programming is sufficient.

Who Should Attend?

  • Software Engineers looking to upgrade Big Data skills
  • Data Engineers and ETL Developers
  • Data Scientists and Analytics Professionals
  • Graduates looking to make a career in Big Data

Course Content

1.1 Big Data Introduction

  • What is Big Data
  • Evolution of Big Data
  • Why Big Data?
  • Role Played by Big Data
  • Data management – Industry Challenges
  • Types of data
  • Sources of Big Data
  • Big Data examples
  • What is streaming data?
  • Batch vs Streaming data processing

1.2 Introduction to Data Analysis with Spark

  • What Is Apache Spark?
  • A Unified Stack
  • Spark Core
  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX
  • Cluster Managers
  • Who Uses Spark, and for What?
  • Data Science Tasks
  • Data Processing Applications
  • A Brief History of Spark
  • Spark Versions and Releases
  • Storage Layers for Spark

1.3 Downloading Spark and Getting Started

  • Downloading Spark
  • Introduction to Spark’s Python and Scala Shells
  • Introduction to Core Spark Concepts
  • Standalone Applications
  • Initializing a SparkContext
  • Building Standalone Applications
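
A minimal sketch of the standalone application this module builds toward, assuming the classic RDD API; the application name, master URL, and input path are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  // A classic word count as a standalone Scala application
  object WordCount {
    def main(args: Array[String]): Unit = {
      // Name the application and pick a master URL (local[*] uses all local cores)
      val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // Transformations build a lineage; reduceByKey sums the counts per word
      val counts = sc.textFile("input.txt")
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      counts.take(10).foreach(println) // action: triggers execution
      sc.stop()
    }
  }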

1.4 Scala Programming

  • Why Scala?
  • OOP
  • Environment Setup
  • Data Types, Variables
  • Functions, Comments
  • Inheritance
  • Annotations
  • if – else statement
  • String
  • Singleton & Companion Object
  • Case Class
  • Implement Abstract Class
  • String Method
  • Method Overloading
  • Method & Field Overriding
  • String Interpolation
  • Arrays
  • Operators
  • While Loop
  • Do While Loop
  • For Loop
  • Loop Control Statement
  • Control Structures
  • Tuples
  • Map
  • Sets
  • Constructor
  • Extractors
  • Iterators
  • Pattern Matching
  • List
  • Closures
  • Option
  • Final & This
  • Trait
  • Regular Expressions
  • Partial Functions
  • Currying Functions
  • Access Modifiers
  • File I/O
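
A short, self-contained sketch tying several of these topics together: case classes, pattern matching, string interpolation, Option, and currying. All names and data are illustrative:

  // Case class: immutable data with built-in pattern-matching support
  case class Person(name: String, age: Int)

  object ScalaBasics {
    def main(args: Array[String]): Unit = {
      val people = List(Person("Asha", 31), Person("Ravi", 17))

      // Pattern matching with a guard, plus string interpolation
      people.foreach {
        case Person(name, age) if age >= 18 => println(s"$name is an adult")
        case Person(name, _)                => println(s"$name is a minor")
      }

      // Option avoids nulls; currying splits the parameter list
      val maybeAge: Option[Int] = people.find(_.name == "Asha").map(_.age)
      def add(a: Int)(b: Int): Int = a + b
      val addTen = add(10) _          // partially applied curried function
      println(maybeAge.map(addTen))   // prints Some(41)
    }
  }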

1.5 Programming with RDDs

  • RDD Basics
  • Creating RDDs
  • RDD Operations
  • Transformations
  • Actions
  • Lazy Evaluation
  • Passing Functions to Spark
  • Common Transformations and Actions
  • Basic RDDs
  • Converting Between RDD Types
  • Persistence (Caching)
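
A minimal illustration of lazy transformations, actions, and caching, assuming `sc` is an existing SparkContext (for example, the one provided by spark-shell):

  import org.apache.spark.storage.StorageLevel

  val nums = sc.parallelize(1 to 1000)

  // Transformations are lazy: nothing runs until an action is called
  val squares = nums.map(n => n * n)
  val evens   = squares.filter(_ % 2 == 0)

  // Cache the RDD because two actions below reuse it
  evens.persist(StorageLevel.MEMORY_ONLY)

  println(evens.count())         // action: triggers the computation
  println(evens.take(5).toList)  // action: served from the cached data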

1.6 Working with Key/Value Pairs (Cloudera QuickStart VM)

  • Motivation
  • Creating Pair RDDs
  • Transformations on Pair RDDs
  • Aggregations
  • Grouping Data
  • Joins
  • Sorting Data
  • Actions Available on Pair RDDs
  • Data Partitioning (Advanced)
  • Determining an RDD’s Partitioner
  • Operations That Benefit from Partitioning
  • Operations That Affect Partitioning
  • Example: PageRank
  • Custom Partitioners
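
A brief sketch of pair-RDD aggregation, joins, sorting, and partitioning, again assuming an existing SparkContext `sc`; the data is made up:

  import org.apache.spark.HashPartitioner

  val sales = sc.parallelize(Seq(("pune", 100.0), ("mumbai", 250.0), ("pune", 75.0)))

  // Aggregate values per key
  val totals = sales.reduceByKey(_ + _)

  // Join two pair RDDs on their key: (city, (total, manager))
  val managers = sc.parallelize(Seq(("pune", "Asha"), ("mumbai", "Ravi")))
  val joined   = totals.join(managers)

  // A custom hash partitioning benefits repeated key-oriented operations
  val partitioned = totals.partitionBy(new HashPartitioner(4))

  joined.sortByKey().collect().foreach(println)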

1.7 Loading and Saving Your Data

  • Hadoop Input and Output Formats
  • Text Files
  • JSON
  • Comma-Separated Values and Tab-Separated Values
  • SequenceFiles
  • Object Files
  • File Compression
  • Filesystems
  • Local/“Regular” FS
  • HDFS
  • Structured Data with Spark SQL
  • Apache Hive
  • Databases
  • Java Database Connectivity
  • HBase
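
A sketch of a few of these load/save paths, assuming an existing SparkContext `sc` and a Spark SQL SQLContext `sqlContext` (Spark 1.3+); every path below is a placeholder:

  // Text files, from the local filesystem or HDFS
  val lines = sc.textFile("hdfs:///data/logs/*.txt")

  // CSV as plain text: split each line on commas
  val fields = lines.map(_.split(","))

  // JSON through Spark SQL (one JSON record per line)
  val people = sqlContext.read.json("file:///tmp/people.json")

  // SequenceFiles of key/value pairs
  val pairs = sc.sequenceFile[String, Int]("hdfs:///data/seq")

  // Save back out as tab-separated text
  fields.map(_.mkString("\t")).saveAsTextFile("hdfs:///out/tsv")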

1.8 Advanced Spark Programming

  • Introduction
  • Accumulators
  • Accumulators and Fault Tolerance
  • Custom Accumulators
  • Broadcast Variables
  • Optimizing Broadcasts
  • Working on a Per-Partition Basis
  • Piping to External Programs
  • Numeric RDD Operations
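
An illustrative sketch of accumulators and broadcast variables, assuming the classic Spark 1.x accumulator API and an existing SparkContext `sc`:

  // Accumulator: a counter the executors write to and only the driver reads
  val badRecords = sc.accumulator(0, "bad records")

  // Broadcast variable: ship a read-only lookup table to each node once
  val countryCodes = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

  val lines = sc.parallelize(Seq("IN,100", "US,200", "garbage"))
  val parsed = lines.flatMap { line =>
    line.split(",") match {
      case Array(code, amount) =>
        Some(countryCodes.value.getOrElse(code, "Unknown") -> amount.toInt)
      case _ =>
        // Note: updates made inside a transformation can be re-applied on task retry
        badRecords += 1
        None
    }
  }

  parsed.collect().foreach(println)
  println("Bad records: " + badRecords.value)  // read on the driver, after an action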

1.9 Running on a Cluster

  • Introduction
  • Spark Runtime Architecture
  • The Driver
  • Executors
  • Cluster Manager
  • Launching a Program
  • Driver Class

1.10 Deploying Applications with spark-submit

  • Packaging Your Code and Dependencies
  • A Java Spark Application Built with Maven
  • A Scala Spark Application Built with sbt
  • Dependency Conflicts
  • Scheduling Within and Between Spark Applications
  • Cluster Managers
  • Standalone Cluster Manager
  • Hadoop YARN
  • Apache Mesos
  • Amazon EC2
  • Which Cluster Manager to Use?
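
As an illustration of the sbt route, a minimal build.sbt (itself written in Scala); names and versions are placeholders:

  // build.sbt -- minimal build definition for a Scala Spark application
  name := "spark-example"
  version := "0.1.0"
  scalaVersion := "2.11.12"

  // "provided": the cluster supplies Spark at runtime through spark-submit,
  // so it stays out of the packaged jar and avoids dependency conflicts
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3" % "provided"

  // Package with `sbt package`, then launch with something like:
  //   spark-submit --class com.example.WordCount --master yarn \
  //     target/scala-2.11/spark-example_2.11-0.1.0.jar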

1.11 Tuning and Debugging Spark

  • Configuring Spark with SparkConf
  • Components of Execution: Jobs, Tasks, and Stages
  • Finding Information
  • Spark Web UI
  • Driver and Executor Logs
  • Key Performance Considerations
  • Level of Parallelism
  • Serialization Format
  • Memory Management
  • Hardware Provisioning
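
A small sketch of configuration through SparkConf; the values are illustrative only, since the right settings depend on your workload and cluster:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("TunedApp")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster serialization format
    .set("spark.default.parallelism", "200")                               // level of parallelism
    .set("spark.executor.memory", "4g")                                    // memory per executor
  val sc = new SparkContext(conf)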

1.12 Spark SQL

  • Linking with Spark SQL
  • Using Spark SQL in Applications
  • Initializing Spark SQL
  • Basic Query Example
  • SchemaRDDs
  • Caching
  • Loading and Saving Data
  • Apache Hive
  • Parquet
  • JSON
  • From RDDs
  • JDBC/ODBC Server
  • Working with Beeline
  • Long-Lived Tables and Queries
  • User-Defined Functions
  • Spark SQL UDFs
  • Hive UDFs
  • Spark SQL Performance
  • Performance Tuning Options
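
A compact Spark SQL example using the Spark 1.x SQLContext API (the era in which DataFrames succeeded SchemaRDDs); the JSON path is a placeholder:

  import org.apache.spark.sql.SQLContext

  // Assumes `sc` is an existing SparkContext
  val sqlContext = new SQLContext(sc)

  // Load JSON, register a temporary table, and query it with SQL
  val people = sqlContext.read.json("file:///tmp/people.json")
  people.registerTempTable("people")
  sqlContext.sql("SELECT name, age FROM people WHERE age >= 18").show()

  // A simple UDF registered for use inside SQL queries
  sqlContext.udf.register("strLen", (s: String) => s.length)
  sqlContext.sql("SELECT name, strLen(name) FROM people").show()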

1.13 Spark Streaming

  • What is Spark Streaming?
  • Need for Streaming in Apache Spark
  • Why Streaming in Spark?
  • Spark Streaming Architecture and Advantages
  • Goals of Spark Streaming
  • How does Spark Streaming work?
  • Streaming Sources
  • Streaming Operations
  • Transformation Operations in Spark
  • Output Operations in Apache Spark
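
The classic streaming word count over a socket source, as a minimal sketch; the host, port, and batch interval are placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

  // Streaming source: lines arriving on a TCP socket
  val lines = ssc.socketTextStream("localhost", 9999)

  // Transformation operations on the DStream
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

  counts.print()          // output operation: print each batch to the console
  ssc.start()             // start receiving and processing
  ssc.awaitTermination()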

1.14 Spark Streaming Tool – Kafka

  • What is Kafka?
  • Uses of an Apache Kafka Cluster
  • Kafka Architecture
  • Components
    • Topic
    • Producer
    • Consumer
    • Broker
    • ZooKeeper
  • Partition in Kafka
  • Kafka Use Cases
  • Apache Kafka vs Apache Flume
  • RabbitMQ vs Apache Kafka
  • Traditional queuing systems vs Apache Kafka
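
A sketch of wiring Kafka into Spark Streaming using the direct (receiver-less) approach from the spark-streaming-kafka 0.8 integration; the broker address and topic name are placeholders:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(5))

  // Connect directly to the Kafka brokers, without a receiver
  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  val topics = Set("events")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  // Each record is a (key, value) pair; count words in the values
  stream.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

  ssc.start()
  ssc.awaitTermination()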

FAQ

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

The “Spark with Scala” course is hands-on: all the code and exercises are done in the classroom sessions. Our batch sizes are generally small so that each learner receives personalized attention.

Feel free to drop us a mail at info@stackodes.com and we will get back to you at the earliest regarding your queries on the “Spark with Scala” course.
