Spark

Introduction

Apache Spark is a powerful open-source framework designed for fast and versatile big data processing, enabling efficient large-scale data manipulation and real-time analytics. In this course, you will learn how to leverage Spark's capabilities to process massive datasets and perform complex data analysis tasks with ease.

Educational goals

Discover the main features of Apache Spark and why it is so widely adopted.
Understand the internals of Spark.
Learn to use Spark for batch and streaming data analytics.

Prerequisites

Working knowledge of Python and basic familiarity with the Linux/Unix shell.

Modules

Module 1 - Introduction to Spark & RDDs

Presentation
Spark in Hadoop ecosystem
Use cases
Spark ecosystem
Internals
Data structures
Operations
Resilient Distributed Datasets (RDDs)

Module 2 - Spark SQL and DataFrames

RDDs: Pros and Cons
DataFrames
RDDs vs DataFrames
Working with DataFrames
Why SQL?

Module 3 - Spark Structured Streaming

Streaming introduction
Difference between batch and stream processing
Stream processing models
Different processing semantics
Programming model
Event-time vs. processing time
Windows: tumbling, overlapping
Handling late data and how long to wait
Vocabulary
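
To make the windowing and late-data topics of Module 3 more concrete, here is a minimal pure-Python sketch. It is not Spark code and not the course's implementation: the helper names `assign_windows` and `window_is_closed`, the alignment of windows to the Unix epoch, and the simplified watermark rule (maximum event time seen minus an allowed delay) are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

# Assumption: windows are aligned to the Unix epoch.
EPOCH = datetime(1970, 1, 1)


def assign_windows(event_time, size, slide=None):
    """Return the [start, end) windows that contain event_time.

    With slide omitted (slide == size) the windows are tumbling: each event
    falls into exactly one window. With slide < size the windows overlap and
    an event can belong to several of them.
    """
    slide = slide or size
    # Latest slide-aligned window start that is <= event_time.
    start = event_time - ((event_time - EPOCH) % slide)
    windows = []
    # Walk backwards while the window still covers the event.
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def window_is_closed(window_end, max_event_time_seen, allowed_delay):
    """Simplified watermark rule (hypothetical helper).

    The watermark trails the largest event time seen so far by allowed_delay.
    Once it passes the end of a window, later events for that window are
    treated as too late.
    """
    watermark = max_event_time_seen - allowed_delay
    return window_end <= watermark


event = datetime(2024, 1, 1, 12, 7)
tumbling = assign_windows(event, timedelta(minutes=10))
overlapping = assign_windows(event, timedelta(minutes=10), timedelta(minutes=5))
```

In this sketch, `tumbling` holds a single window and `overlapping` holds two, which is the practical difference between the two window types. The `allowed_delay` value expresses the "how long to wait" trade-off: a longer delay accepts more late events but keeps window state around for longer.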

Resource

The book used for this course can be downloaded for free:

Learning Spark, 2nd Edition
