Performing Data Engineering on Microsoft HDInsight (20775)

Description

The main purpose of the course is to give students the ability to plan and implement big data workflows on HDInsight.

The primary audience for this course is data engineers, data architects, data scientists, and data developers who plan to implement big data engineering workflows on HDInsight.

After completing this course, students will be able to:

Deploy HDInsight clusters.
Authorize users to access resources.
Load data into HDInsight.
Troubleshoot HDInsight.
Implement batch solutions.
Design batch ETL solutions for big data with Spark.
Analyze data with Spark SQL.
Analyze data with Hive and Phoenix.
Describe Stream Analytics.
Implement Spark Streaming using the DStream API.
Develop big data real-time processing solutions with Apache Storm.
Build solutions that use Kafka and HBase.

Prerequisites

In addition to their professional experience, students who attend this course should have:

Programming experience using R, and familiarity with common R packages.
Knowledge of common statistical methods and data analysis best practices.
Basic knowledge of the Microsoft Windows operating system and its core functionality.
Working knowledge of relational databases.

What’s included?

  • Authorized Courseware
  • Intensive Hands-on Skills Development with an Experienced Subject Matter Expert
  • Hands-on practice on real servers and extended lab support (1.800.482.3172)
  • Examination Vouchers & Onsite Certification Testing (excluding Adobe and PMP Boot Camps)
  • Academy Code of Honor: Test Pass Guarantee
  • Optional: Package for Hotel Accommodations, Lunch and Transportation

With several convenient training delivery methods offered, The Academy makes getting the training you need easy. Whether you prefer to learn in a classroom, in a live online virtual environment, through online training videos, or in private group classes hosted at your site, we offer expert instruction to individuals, government agencies, non-profits, and corporations. Our live classes, on-sites, and online training videos all feature certified instructors who teach a detailed curriculum and share their expertise and insights with trainees. No matter how you prefer to receive the training, you can count on The Academy for an engaging and effective learning experience.

Methods

  • Instructor Led (the best training format we offer)
  • Live Online Classroom – Online Instructor Led
  • Self-Paced Video

Speak to an Admissions Representative for complete details

Start        Finish       Public Price   Public Enroll   Private Price   Private Enroll
12/25/2023   12/29/2023
1/15/2024    1/19/2024
2/5/2024     2/9/2024
2/26/2024    3/1/2024
3/18/2024    3/22/2024
4/8/2024     4/12/2024
4/29/2024    5/3/2024
5/20/2024    5/24/2024
6/10/2024    6/14/2024
7/1/2024     7/5/2024
7/22/2024    7/26/2024
8/12/2024    8/16/2024
9/2/2024     9/6/2024
9/23/2024    9/27/2024
10/14/2024   10/18/2024
11/4/2024    11/8/2024
11/25/2024   11/29/2024
12/16/2024   12/20/2024
1/6/2025     1/10/2025

Curriculum

Module 1: Getting Started with HDInsight

This module introduces Hadoop, the MapReduce paradigm, and HDInsight.

Lessons

What is Big Data?
Introduction to Hadoop
Working with the MapReduce Function
Introducing HDInsight
Lab: Working with HDInsight

Provision an HDInsight cluster and run MapReduce jobs
After completing this module, students will be able to:

Describe Hadoop, MapReduce, and HDInsight.
Use scripts to provision an HDInsight Cluster.
Run a word-counting MapReduce program using PowerShell.
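
To make the MapReduce paradigm concrete before the lab, here is a minimal pure-Python sketch of the map, shuffle, and reduce phases of the word-count example (an illustration of the concept only, not the PowerShell lab code):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"])  # → 3
```

On a real cluster the same three phases run distributed across nodes, with the shuffle moving data over the network; the logic per phase is unchanged.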

Module 2: Deploying HDInsight Clusters

This module provides an overview of the Microsoft Azure HDInsight cluster types, in addition to the creation and maintenance of the HDInsight clusters. The module also demonstrates how to customize clusters by using script actions through the Azure Portal, Azure PowerShell, and the Azure command-line interface (CLI). This module includes labs that provide the steps to deploy and manage the clusters.

Lessons

Identifying HDInsight cluster types
Managing HDInsight clusters by using the Azure portal
Managing HDInsight Clusters by using Azure PowerShell
Lab: Managing HDInsight clusters with the Azure Portal

Create an HDInsight cluster that uses Data Lake Store storage
Customize HDInsight by using script actions
Delete an HDInsight cluster
After completing this module, students will be able to:

Identify HDInsight cluster types
Manage HDInsight clusters by using the Azure Portal.
Manage HDInsight clusters by using Azure PowerShell.

Module 3: Authorizing Users to Access Resources

This module provides an overview of non-domain and domain-joined Microsoft HDInsight clusters, in addition to the creation and configuration of domain-joined HDInsight clusters. The module also demonstrates how to manage domain-joined clusters using the Ambari management UI and the Ranger Admin UI. This module includes the labs that will provide the steps to create and manage domain-joined clusters.

Lessons

Non-domain Joined clusters
Configuring domain-joined HDInsight clusters
Manage domain-joined HDInsight clusters
Lab: Authorizing Users to Access Resources

Prepare the Lab Environment
Manage a non-domain joined cluster
After completing this module, students will be able to:

Identify the characteristics of non-domain and domain-joined HDInsight clusters.
Create and configure domain-joined HDInsight clusters through the Azure PowerShell.
Manage the domain-joined cluster using the Ambari management UI and the Ranger Admin UI.
Create Hive policies and manage user permissions.

Module 4: Loading data into HDInsight

This module provides an introduction to loading data into Microsoft Azure Blob storage and Microsoft Azure Data Lake storage. At the end of this lesson, you will know how to use multiple tools to transfer data to an HDInsight cluster. You will also learn how to load and transform data to decrease your query run time.

Lessons

Storing data for HDInsight processing
Using data loading tools
Maximizing value from stored data
Lab: Loading Data into your Azure account

Load data for use with HDInsight
After completing this module, students will be able to:

Discuss the architecture of key HDInsight storage solutions.
Use tools to upload data to HDInsight clusters.
Compress and serialize uploaded data for decreased processing time.
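
As a rough illustration of why this module recommends compressing data before upload, the snippet below (a local sketch using Python's gzip, not an HDInsight tool) compresses a repetitive log-style payload and compares sizes:

```python
import gzip

# Repetitive log-style data compresses very well; real cluster data varies.
payload = ("2024-01-01 INFO job started\n" * 1000).encode("utf-8")
compressed = gzip.compress(payload)

ratio = len(compressed) / len(payload)
print(f"raw={len(payload)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2%}")
```

Smaller uploads mean less time moving data into Blob or Data Lake storage and fewer bytes to scan per query, which is the "decreased processing time" benefit the module describes.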

Module 5: Troubleshooting HDInsightIn this module, you will learn how to interpret logs associated with the various services of Microsoft Azure HDInsight cluster to troubleshoot any issues you might have with these services. You will also learn about Operations Management Suite (OMS) and its capabilities.

Lessons

Analyze HDInsight logs
YARN logs
Heap dumps
Operations management suite
Lab: Troubleshooting HDInsight

Analyze HDInsight logs
Analyze YARN logs
Monitor resources with Operations Management Suite
After completing this module, students will be able to:

Locate and analyze HDInsight logs.
Use YARN logs for application troubleshooting.
Understand and enable heap dumps.
Describe how the OMS can be used with Azure resources.
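
The log-analysis skill above usually starts with filtering entries by severity. The sketch below scans log lines in a hypothetical "timestamp LEVEL component: message" shape (a made-up format for illustration, not the exact YARN log layout) and extracts the ERROR entries:

```python
import re

# Hypothetical log lines in a YARN-like "timestamp LEVEL component: message" shape.
log_lines = [
    "2024-05-01 10:00:01 INFO  ResourceManager: container allocated",
    "2024-05-01 10:00:05 ERROR NodeManager: container killed, exit code 137",
    "2024-05-01 10:00:09 WARN  NodeManager: high memory usage",
    "2024-05-01 10:00:12 ERROR ApplicationMaster: task attempt failed",
]

LOG_PATTERN = re.compile(r"^(\S+ \S+) (\w+)\s+(\w+): (.*)$")

def find_errors(lines):
    """Return (component, message) for every ERROR-level entry."""
    errors = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(2) == "ERROR":
            errors.append((match.group(3), match.group(4)))
    return errors

for component, message in find_errors(log_lines):
    print(component, "->", message)
```

The lab applies the same idea to real HDInsight and YARN logs, where the formats differ but the filter-by-severity workflow is identical.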

Module 6: Implementing Batch Solutions

In this module, you will look at implementing batch solutions in Microsoft Azure HDInsight by using Hive and Pig. You will also discuss the approaches for data pipeline operationalization that are available for big data workloads on an HDInsight stack.

Lessons

Apache Hive storage
HDInsight data queries using Hive and Pig
Operationalize HDInsight
Lab: Implement Batch Solutions

Deploy HDInsight cluster and data storage
Use data transfers with HDInsight clusters
Query HDInsight cluster data
After completing this module, students will be able to:

Understand Apache Hive and the scenarios where it can be used.
Run batch jobs using Apache Hive and Apache Pig.
Explain the capabilities of Microsoft Azure Data Factory and Apache Oozie, and how they can orchestrate and automate big data workflows.
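
Hive's value in batch processing is that a SQL-style query replaces hand-written MapReduce. To make that idea concrete, this pure-Python sketch mimics the effect of a hypothetical `SELECT page, COUNT(*) FROM visits GROUP BY page` over log rows (illustrative only, not HiveQL syntax):

```python
from collections import Counter

# Rows as a Hive table might expose them: (user, page) visit records.
visits = [
    ("alice", "/home"), ("bob", "/home"), ("alice", "/cart"),
    ("carol", "/home"), ("bob", "/cart"), ("alice", "/home"),
]

# Equivalent in spirit to: SELECT page, COUNT(*) FROM visits GROUP BY page
page_counts = Counter(page for _, page in visits)
print(page_counts.most_common())  # /home seen 4 times, /cart seen 2 times
```

On HDInsight, Hive compiles the declarative query into distributed jobs for you; the grouping and counting semantics are what this sketch shows.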

Module 7: Design Batch ETL solutions for big data with Spark

This module provides an overview of Apache Spark, describing its main characteristics and key features. Before you start, it’s helpful to understand the basic architecture of Apache Spark and the different components that are available. The module also explains how to design batch Extract, Transform, Load (ETL) solutions for big data with Spark on HDInsight. The final lesson includes some guidelines to improve Spark performance.

Lessons

What is Spark?
ETL with Spark
Spark performance
Lab: Design Batch ETL solutions for big data with Spark.

Create an HDInsight cluster with access to Data Lake Store
Use the HDInsight Spark cluster to analyze data in the Data Lake Store
Analyze website logs by using a custom library with an Apache Spark cluster on HDInsight
Manage resources for an Apache Spark cluster on Azure HDInsight
After completing this module, students will be able to:

Describe the architecture of Spark on HDInsight.
Describe the different components required for a Spark application on HDInsight.
Identify the benefits of using Spark for ETL processes.
Create Python and Scala code in a Spark program to ingest or process data.
Identify cluster settings for optimal performance.
Track and debug jobs running on an Apache Spark cluster in HDInsight.
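
The extract-transform-load pattern this module teaches can be sketched, independently of Spark itself, as a chain of pure functions over records (the CSV input below is hypothetical sample data):

```python
import csv
import io

RAW = """id,name,amount
1,Alice,10.50
2,Bob,not_a_number
3,Carol,7.25
"""

def extract(text):
    """Extract: parse raw CSV into dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: coerce types and drop rows that fail validation."""
    clean = []
    for row in records:
        try:
            clean.append({"id": int(row["id"]),
                          "name": row["name"],
                          "amount": float(row["amount"])})
        except ValueError:
            continue  # a real job would route bad rows to a quarantine sink
    return clean

def load(records, sink):
    """Load: append validated records to the target store."""
    sink.extend(records)

warehouse = []
load(transform(extract(RAW)), warehouse)
print(len(warehouse))  # Bob's malformed row is dropped → 2 rows
```

In a Spark job on HDInsight the same three stages run over DataFrames partitioned across the cluster, with storage accounts or Data Lake Store as the source and sink.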

Module 8: Analyze Data with Spark SQL

This module describes how to analyze data by using Spark SQL. In it, you will learn to explain the differences between RDDs, Datasets, and DataFrames; identify the use cases for iterative and interactive queries; and describe best practices for caching, partitioning, and persistence. You will also look at how to use Apache Zeppelin and Jupyter notebooks, carry out exploratory data analysis, and submit Spark jobs remotely to a Spark cluster.

Lessons

Implementing iterative and interactive queries
Perform exploratory data analysis
Lab: Performing exploratory data analysis by using iterative and interactive queries

Build a machine learning application
Use Zeppelin for interactive data analysis
View and manage Spark sessions by using Livy
After completing this module, students will be able to:

Implement interactive queries.
Perform exploratory data analysis.
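
The caching guidance behind iterative queries boils down to this: iterative workloads re-read the same dataset, so persist it after the first computation. A toy counter-based sketch of the difference (Spark's `.cache()`/`.persist()` in spirit, not the Spark API):

```python
compute_calls = 0

def expensive_scan():
    """Stand-in for re-reading and re-parsing a large dataset."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in range(1000)]

# Without caching, every iteration repeats the full scan.
for _ in range(3):
    expensive_scan()
assert compute_calls == 3

# With caching, the scan runs once and later iterations reuse the result.
compute_calls = 0
cached = None
for _ in range(3):
    if cached is None:
        cached = expensive_scan()
    data = cached  # each iteration works from the cached dataset
print(compute_calls)  # → 1
```

For interactive notebook sessions the payoff is the same: the first query pays the I/O cost, and subsequent queries against the cached DataFrame return quickly.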

Module 9: Analyze Data with Hive and Phoenix

In this module, you will learn about running interactive queries using the Interactive Hive (also known as Hive LLAP or Live Long and Process) and Apache Phoenix. You will also learn about the various aspects of running interactive queries using Apache Phoenix with HBase as the underlying query engine.

Lessons

Implement interactive queries for big data with Interactive Hive
Perform exploratory data analysis by using Hive
Perform interactive processing by using Apache Phoenix
Lab: Analyze Data with Hive and Phoenix

Implement interactive queries for big data with Interactive Hive
Perform exploratory data analysis by using Hive
Perform interactive processing by using Apache Phoenix
After completing this module, students will be able to:

Implement interactive queries with Interactive Hive.
Perform exploratory data analysis using Hive.
Perform interactive processing by using Apache Phoenix.

Module 10: Stream Analytics

The Microsoft Azure Stream Analytics service has built-in features and capabilities that make it an easy-to-use, flexible stream processing service in the cloud. You will see that there are a number of advantages to using Stream Analytics for your streaming solutions, which you will discuss in more detail. You will also compare the features of Stream Analytics to those of other services available within the Microsoft Azure HDInsight stack, such as Apache Storm. You will learn how to deploy a Stream Analytics job, connect it to Microsoft Azure Event Hubs to ingest real-time data, and execute a Stream Analytics query to gain low-latency insights. After that, you will learn how Stream Analytics jobs can be monitored when deployed and used in production settings.

Lessons

Introduction to Stream Analytics
Process streaming data with Stream Analytics
Manage Stream Analytics jobs
Lab: Implement Stream Analytics

Process streaming data with Stream Analytics
Manage Stream Analytics jobs
After completing this module, students will be able to:

Describe Stream Analytics and its capabilities.
Process streaming data with Stream Analytics.
Manage Stream Analytics jobs.
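
A typical Stream Analytics query aggregates events over a tumbling window (fixed, non-overlapping intervals). The sketch below reproduces that windowing logic in plain Python over made-up timestamped sensor readings, to show what the query computes:

```python
from collections import defaultdict

# (timestamp_seconds, sensor_value) events, as an event source might deliver them.
events = [(1, 10.0), (4, 14.0), (9, 12.0), (11, 20.0), (15, 22.0), (23, 30.0)]

WINDOW = 10  # tumbling window length in seconds

def tumbling_averages(events, window):
    """Group events into fixed, non-overlapping windows and average each."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[ts // window].append(value)
    return {w * window: sum(v) / len(v) for w, v in sorted(buckets.items())}

print(tumbling_averages(events, WINDOW))
# window [0,10): avg of 10, 14, 12 = 12.0; [10,20): 21.0; [20,30): 30.0
```

In the real service you declare this with the windowing functions of the Stream Analytics query language and the engine handles late arrivals and watermarks; the grouping semantics are what the sketch demonstrates.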

Module 11: Implementing Streaming Solutions with Kafka and HBase

In this module, you will learn how to use Kafka to build streaming solutions. You will also see how to use Kafka to persist data to HDFS by using Apache HBase and then query this data.

Lessons

Building and Deploying a Kafka Cluster
Publishing, Consuming, and Processing data using the Kafka Cluster
Using HBase to store and Query Data
Lab: Implementing Streaming Solutions with Kafka and HBase

Create a virtual network and gateway
Create a Storm cluster for Kafka
Create a Kafka producer
Create a streaming processor client topology
Create a Power BI dashboard and streaming dataset
Create an HBase cluster
Create a streaming processor to write to HBase
After completing this module, students will be able to:

Build and deploy a Kafka Cluster.
Publish data to a Kafka Cluster, consume data from a Kafka Cluster, and perform stream processing using the Kafka Cluster.
Save streamed data to HBase, and perform queries using the HBase API.
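
Kafka's core model, an append-only log that producers write to and consumer groups read from at their own offsets, can be sketched in a few lines of Python (a toy in-memory stand-in, not the Kafka client API):

```python
class InMemoryTopic:
    """Toy stand-in for a Kafka topic: an append-only log with per-group offsets."""

    def __init__(self):
        self.log = []       # the append-only message log
        self.offsets = {}   # consumer group -> next offset to read

    def produce(self, message):
        self.log.append(message)

    def consume(self, group):
        """Return messages the group has not yet seen, advancing its offset."""
        offset = self.offsets.get(group, 0)
        new = self.log[offset:]
        self.offsets[group] = len(self.log)
        return new

topic = InMemoryTopic()
topic.produce({"sensor": "a", "temp": 21})
topic.produce({"sensor": "b", "temp": 19})

batch = topic.consume("dashboard")
print(len(batch))                   # → 2
print(topic.consume("dashboard"))   # → [] (offset already advanced)
```

Real Kafka adds partitions, replication, and durable offset storage, but the offset-per-consumer-group mechanic shown here is what makes independent consumers possible.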

Module 12: Develop big data real-time processing solutions with Apache Storm

This module explains how to develop big data real-time processing solutions with Apache Storm.

Lessons

Persist long term data
Stream data with Storm
Create Storm topologies
Configure Apache Storm
Lab: Developing big data real-time processing solutions with Apache Storm

Stream data with Storm
Create Storm topologies
After completing this module, students will be able to:

Persist long term data.
Stream data with Storm.
Create Storm topologies.
Configure Apache Storm.
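
A Storm topology wires spouts (stream sources) to bolts (processing steps). The toy Python sketch below chains a sentence spout into a split bolt into a count bolt, mimicking the classic word-count topology (illustrative only, not the Storm API):

```python
from collections import Counter

def sentence_spout():
    """Spout: emits a stream of sentences (here, a fixed batch)."""
    yield from ["storm processes streams", "streams of tuples"]

def split_bolt(sentences):
    """Bolt: splits each sentence tuple into word tuples."""
    for sentence in sentences:
        yield from sentence.split()

def count_bolt(words):
    """Bolt: maintains running counts per word."""
    return Counter(words)

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["streams"])  # → 2
```

In Storm the same graph runs continuously and in parallel, with tuples flowing between spout and bolt tasks across the cluster rather than through Python generators.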

Module 13: Create Spark Streaming Applications

This module describes Spark Streaming, explains how to use discretized streams (DStreams), and shows how to apply these concepts to develop Spark Streaming applications.

Lessons

Working with Spark Streaming
Creating Spark Structured Streaming Applications
Persistence and Visualization
Lab: Building a Spark Streaming Application

Installing Required Software
Building the Azure Infrastructure
Building a Spark Streaming Pipeline
After completing this module, students will be able to:

Describe Spark Streaming and how it works.
Use discretized streams (DStreams).
Work with sliding window operations.
Apply the concepts to develop Spark Streaming applications.
Describe Structured Streaming.
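
The DStream model treats a stream as a sequence of micro-batches, and a sliding window aggregates the last N batches at each slide interval. A minimal Python sketch of windowed counting over micro-batches (conceptual, not the Spark Streaming API):

```python
batches = [["a", "b"], ["b"], ["c", "b"], ["a"]]  # one list per micro-batch

def windowed_counts(batches, window_length, slide):
    """At each slide point, count events across the last `window_length` batches."""
    results = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_length)
        window = [x for batch in batches[start:end] for x in batch]
        results.append(len(window))
    return results

print(windowed_counts(batches, window_length=2, slide=1))  # → [2, 3, 3, 3]
```

Spark Streaming's `countByWindow` performs this same computation incrementally over live micro-batches; the overlap between successive windows is why events can be counted in more than one window.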