Apache Spark is an open-source framework for fast, general-purpose big data processing, supporting both large-scale batch workloads and near-real-time analytics. In this course, you will learn how to use Spark to process massive datasets and carry out complex data analysis tasks.
- Discover the main features of Apache Spark and why it is so widely adopted.
- Understand the internals of Spark.
- Learn to use Spark for batch and streaming data analytics.
Python programming knowledge and basic knowledge of the Linux/Unix shell.
- Presentation
- Spark in Hadoop ecosystem
- Use cases
- Spark ecosystem
- Internals
- Data structures
- Operations
- Resilient Distributed Datasets (RDDs)
- RDDs: Pros and Cons
- DataFrames
- RDDs vs DataFrames
- Working with DataFrames
- Why SQL?
- Streaming introduction
- Difference between batch and stream processing
- Stream processing models
- Different processing semantics
- Programming model
- Event-time vs. processing time
- Windows: tumbling and overlapping (sliding)
- Handling late data and how long to wait
- Vocabulary
You can freely download the book used for this course: