Big Data Processing with Apache Spark

Apache Spark has emerged as a standard platform for Big Data processing on cloud infrastructure, thanks to its speed, flexibility, and ease of use.

In this class we will start from the general concepts of cloud computing, then introduce the MapReduce paradigm and its implementation in distributed computing frameworks such as Hadoop and Spark. We will then learn how to use the Apache Spark Python interface (PySpark) to load and process data at different levels of abstraction, and finally work through an example of Machine Learning with Spark.
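As a taste of what is to come, here is a minimal MapReduce-style word count in PySpark; the file name input.txt and the local setup are illustrative assumptions, not part of the course material:

    # Minimal MapReduce-style word count in PySpark.
    # Assumes a local PySpark installation and a plain-text file
    # named "input.txt" in the working directory (hypothetical).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    sc = spark.sparkContext

    counts = (
        sc.textFile("input.txt")                  # load lines as an RDD
          .flatMap(lambda line: line.split())     # map: one record per word
          .map(lambda word: (word, 1))            # map: (key, value) pairs
          .reduceByKey(lambda a, b: a + b)        # reduce: sum counts per key
    )

    print(counts.take(10))                        # action: triggers the computation
    spark.stop()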

The concepts acquired while learning Spark are general and can be applied to other distributed computing frameworks.

The class places a strong emphasis on hands-on exercises in Python and requires basic knowledge of the Python programming language.

Course Goals / Learning Objectives

The goal of this course is to teach the basics of distributed computing and introduce the Spark Python interface. By the end of this course, the student will be able to:

  • Understand how a distributed computing framework handles resources.
  • Access a running Spark cluster.
  • Use the Spark Python interface to load data from text files or the Hadoop Distributed File System (HDFS) and perform distributed operations.
  • Use Spark DataFrames to efficiently perform joining and grouping operations (see the sketch after this list).
  • Apply basic Machine Learning algorithms to datasets loaded in Spark.
  • Gain insight into possible computational bottlenecks in a Spark computation.
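To make the DataFrame objectives concrete, the following sketch joins two datasets and aggregates per group; the file names (users.csv, orders.csv) and column names (user_id, country, amount) are hypothetical placeholders:

    # Sketch of the DataFrame join/grouping objectives above.
    # File names and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframes").getOrCreate()

    users = spark.read.csv("users.csv", header=True, inferSchema=True)
    orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

    # Join the two datasets on a shared key, then aggregate per group.
    totals = (
        users.join(orders, on="user_id")
             .groupBy("country")
             .agg(F.sum("amount").alias("total_amount"))
    )

    totals.show()
    spark.stop()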

Topical Outline

  • Introduction to cloud computing
  • MapReduce paradigm
  • Hadoop and Spark
  • Resilient Distributed Datasets (RDDs)
  • Reading from the Hadoop Distributed File System (HDFS)
  • Spark distributed primitives
  • Joining two datasets
  • Spark actions
  • Spark DataFrames
  • Machine Learning with Spark (see the sketch below)
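As a brief preview of the last topic, the sketch below fits a logistic regression with Spark MLlib on a tiny inlined dataset; the column names and values are invented for illustration:

    # Sketch of basic Machine Learning with Spark MLlib.
    # The inlined training data and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-example").getOrCreate()

    df = spark.createDataFrame(
        [(0.0, 1.0, 0.0), (1.0, 0.5, 1.0), (0.2, 0.8, 0.0), (0.9, 0.1, 1.0)],
        ["f1", "f2", "label"],
    )

    # Assemble raw columns into the single feature vector MLlib expects.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    train = assembler.transform(df)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(train).select("label", "prediction").show()
    spark.stop()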