IL - Apache Spark Programming with Azure Databricks
In this course, you will explore Spark internals and the architecture of Azure Databricks. The course begins with a brief introduction to Scala. Using the Scala programming language, you will then explore the core functionality and use cases of Azure Databricks, including Spark SQL, Spark Streaming, MLlib, and GraphFrames.
- Duration: 3 Days
- Level: 300
Who this course is designed for
- Data Engineers
- Data Administrators
- Data Scientists
- Data Architects
- Software Developers
What you will learn
- Understand the Azure Databricks architecture
- Understand Apache Spark internals
- Manipulate data using the Spark APIs
- Work with large data sets and query data with Spark SQL
- Build structured streaming jobs
- Implement machine learning pipelines with the MLlib API
- Process data using the GraphFrames API
Prerequisites
- Familiarity with cloud computing concepts
- Familiarity with Azure
- Familiarity with SQL
- Background in programming
Introduction to Azure Databricks
This module provides an overview of Azure Databricks and Spark and explains where Azure Databricks fits in the big data landscape in Azure. Key features of Azure Databricks, such as workspaces and notebooks, will be covered. You will also learn the basic architecture of Spark, including its core APIs and job scheduling and execution. This module prepares developers and administrators for more advanced work in Azure Databricks, such as Python or Scala development.
Introduction to Scala
This session will introduce students to the Scala programming language. We will look at basic Scala syntax including variables, types, control flow, functions, scoping, inference, imports, and object-oriented programming.
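As a small taste of this material, here is a minimal sketch of core Scala syntax; the names and values are illustrative and not taken from the course itself:

```scala
// Immutable vs. mutable bindings; types are usually inferred.
val greeting: String = "Hello, Scala"
var counter = 0 // inferred as Int

// A function with an explicit return type.
def square(x: Int): Int = x * x

// Control flow: `if` is an expression that yields a value.
val parity = if (square(3) % 2 == 0) "even" else "odd"

// A case class illustrates Scala's object-oriented side.
case class Point(x: Int, y: Int) {
  def shifted(dx: Int, dy: Int): Point = Point(x + dx, y + dy)
}

val p = Point(1, 2).shifted(3, 4) // Point(4, 6)
```

Type inference keeps the code concise while remaining statically typed, a theme that carries through the Spark APIs covered later in the course.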
Introduction to Spark Programming
In this session, students will learn the basics of Spark and Spark programming. We will cover the DataFrames and Datasets API, processing data with Spark SQL, and working with the Functions API. Students will also look at basic Spark programming concepts and techniques such as aggregation, column operations, joins and broadcasting, user defined functions, caching and performance analysis.
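To give a flavor of these APIs, the sketch below builds a small DataFrame, aggregates it, runs the same query through Spark SQL, and applies a user-defined function. It assumes a Spark runtime is available (here, local mode); the data and column names are illustrative only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("intro-example")
  .master("local[*]") // local mode for illustration only
  .getOrCreate()
import spark.implicits._

// A small DataFrame built from an in-memory sequence.
val sales = Seq(("US", 100), ("US", 250), ("DE", 75)).toDF("country", "amount")

// Column operations and aggregation via the DataFrame API.
val totals = sales.groupBy($"country").agg(sum($"amount").as("total"))

// The same query expressed in Spark SQL.
sales.createOrReplaceTempView("sales")
val totalsSql = spark.sql(
  "SELECT country, SUM(amount) AS total FROM sales GROUP BY country")

// A user-defined function applied as a column expression.
val isLarge = udf((amount: Long) => amount > 200)
totals.withColumn("large", isLarge($"total")).show()
```

Caching (`totals.cache()`) and explicit broadcast joins (`broadcast(df)`) build on the same API and are covered in the session.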
Spark Architecture and Internals
In this session, students will learn about Spark internals. We will look at the Spark cluster architecture, covering topics such as job and task execution and scheduling, shuffling, and the Catalyst optimizer.
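One practical way to see these internals at work is `explain`, which prints the plans Catalyst produces for a query. A minimal sketch, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("internals-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Catalyst rewrites this query; `explain(true)` prints the parsed,
// analyzed, optimized, and physical plans.
df.filter($"id" > 1).select($"label").explain(true)

// A wide transformation such as groupBy introduces a shuffle,
// visible in the physical plan as an Exchange operator.
df.groupBy($"label").count().explain()
```

Reading these plans is how the course connects API-level code to the scheduling and shuffle behavior discussed in this session.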
Building Structured Streaming Jobs in Spark
In this session, students will be introduced to Spark Structured Streaming. Students will learn about data sources and sinks and how to work with the Structured Streaming APIs. Students will look at stream processing techniques such as windowing, aggregation functions, checkpointing, and watermarking, and how they are used in stream processing jobs. Finally, students will investigate fault tolerance in stream processing jobs.
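These pieces fit together as sketched below: a windowed aggregation over a streaming source, bounded by a watermark and made fault tolerant through checkpointing. The built-in rate source and the checkpoint path are stand-ins for illustration; a production job would read from a source such as Kafka or files:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("streaming-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The rate source generates (timestamp, value) rows for testing.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// Windowed aggregation with a watermark to bound late-arriving data.
val counts = events
  .withWatermark("timestamp", "30 seconds")
  .groupBy(window($"timestamp", "10 seconds"))
  .count()

// Checkpointing makes the job fault tolerant: on restart, Spark
// resumes from the recorded offsets and state.
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "/tmp/stream-checkpoint") // illustrative path
  .start()

query.awaitTermination(15000) // run briefly for the example
query.stop()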
Implementing Machine Learning Pipelines in Spark using MLlib
In this session, students will learn how to use the Spark Machine Learning Libraries to build machine learning pipelines. Students will be introduced to machine learning pipeline concepts such as Transformer and Estimator. They will learn to perform feature processing and how to evaluate and apply machine learning models.
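A minimal sketch of such a pipeline, on a tiny made-up data set: a `VectorAssembler` (a Transformer) handles feature processing, a `LogisticRegression` (an Estimator) fits a model, and an evaluator scores the predictions. It assumes a local SparkSession:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("mllib-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Tiny illustrative data set: two features and a binary label.
val data = Seq(
  (1.1, 0.1, 0.0),
  (1.0, 2.0, 1.0),
  (1.3, 2.1, 1.0),
  (1.2, 0.0, 0.0)
).toDF("f1", "f2", "label")

// Transformer: assembles raw columns into a feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// Estimator: fitting it produces a model, which is itself a Transformer.
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model = pipeline.fit(data)

// Apply the fitted model and evaluate with area under the ROC curve.
val predictions = model.transform(data)
val auc = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .evaluate(predictions)
```

In practice the model would be evaluated on a held-out split rather than the training data; the point here is the Transformer/Estimator composition.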
Graph Processing with the GraphFrames API
In this session, attendees will learn to leverage the GraphFrames API for graph processing. Topics will include transforming data frames into a graph and performing graph analysis including page rank, shortest path, connected components, and label propagation.
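A sketch of that workflow is below. Note that GraphFrames ships as a separate package that must be added to the cluster; the vertex and edge data here are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder
  .appName("graphframes-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A GraphFrame is built from two DataFrames: vertices (with an `id`
// column) and edges (with `src` and `dst` columns).
val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
val edges = Seq(("a", "b"), ("b", "c"), ("c", "a")).toDF("src", "dst")
val graph = GraphFrame(vertices, edges)

// PageRank: rank vertices by link structure.
val ranks = graph.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").show()

// Shortest paths from every vertex to the given landmark vertices.
graph.shortestPaths.landmarks(Seq("a")).run().show()

// Connected components require a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoint") // illustrative path
graph.connectedComponents.run().show()
```

Label propagation follows the same pattern (`graph.labelPropagation.maxIter(5).run()`), which is why the session can cover all four algorithms through one API.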