Spark Development
Course Outline
Introduction to Apache Hadoop and the Hadoop Ecosystem
Apache Hadoop Overview
Data Ingestion and Storage
Data Processing
Data Analysis and Exploration
Other Ecosystem Tools
Introduction to the Hands-On Exercises
Apache Hadoop File Storage
Apache Hadoop Cluster Components
HDFS Architecture
Using HDFS
Distributed Processing on an Apache Hadoop Cluster
YARN Architecture
Working With YARN
Apache Spark Basics
What is Apache Spark?
Starting the Spark Shell
Using the Spark Shell
Getting Started with Datasets and DataFrames
DataFrame Operations
Working with DataFrames and Schemas
Creating DataFrames from Data Sources
Saving DataFrames to Data Sources
DataFrame Schemas
Eager and Lazy Execution
Analyzing Data with DataFrame Queries
Querying DataFrames Using Column Expressions
Grouping and Aggregation Queries
Joining DataFrames
RDD Overview
RDD Overview
RDD Data Sources
Creating and Saving RDDs
RDD Operations
Transforming Data with RDDs
Writing and Passing Transformation Functions
Transformation Execution
Converting Between RDDs and DataFrames
Aggregating Data with Pair RDDs
Key-Value Pair RDDs
Map-Reduce
Other Pair RDD Operations
Querying Tables and Views with Apache Spark SQL
Querying Tables in Spark Using SQL
Querying Files and Views
The Catalog API
Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
Working with Datasets in Scala
Datasets and DataFrames
Creating Datasets
Loading and Saving Datasets
Dataset Operations
Writing, Configuring, and Running Apache Spark Applications
Writing a Spark Application
Building and Running an Application
Application Deployment Mode
The Spark Application Web UI
Configuring Application Properties
Distributed Processing
Review: Apache Spark on a Cluster
RDD Partitions
Example: Partitioning in Queries
Stages and Tasks
Job Execution Planning
Example: Catalyst Execution Plan
Example: RDD Execution Plan
Distributed Data Persistence
DataFrame and Dataset Persistence
Persistence Storage Levels
Viewing Persisted RDDs
Common Patterns in Apache Spark Data Processing
Common Apache Spark Use Cases
Iterative Algorithms in Apache Spark
Machine Learning
Example: k-means
Apache Spark Streaming: Introduction to DStreams
Apache Spark Streaming Overview
Example: Streaming Request Count
DStreams
Developing Streaming Applications
Apache Spark Streaming: Processing Multiple Batches
Multi-Batch Operations
Time Slicing
State Operations
Sliding Window Operations
Preview: Structured Streaming
Apache Spark Streaming: Data Sources
Streaming Data Source Overview
Apache Flume and Apache Kafka Data Sources
Example: Using a Kafka Direct Data Source