Nov 19 – 20, 2018
CSC - IT Center for Science
Europe/Helsinki timezone


As the volume of data used in analysis tasks grows rapidly, processing it with standard methods becomes increasingly difficult. Users typically run into several problems: too little memory or CPU, jobs that take very long to complete, or having to start all over again when a job fails. Enter Spark, a high-performance distributed computing framework that tackles big-data problems by distributing the workload across a cluster of machines. This two-day course covers the technical architecture and use cases of Spark, writing Spark code in Python, and using Spark's machine learning library for ML tasks. It then looks at methods for running a Spark cluster on cloud-based infrastructure, along with ways to manage and fine-tune the cluster. The course also demonstrates how to work with real-time data streams.

The first day covers the overview, architectural concepts, programming with Spark's fundamental data structure (the RDD), and Spark's machine learning library. The second day focuses on data analysis with SQL queries in Spark, working with real-time data streams, and setting up and managing a Spark cluster.

Learning outcome

After the course, participants should be able to write simple to intermediate programmes in Spark using RDDs and DataFrames.
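
To give a rough idea of what a "simple" programme means here, the sketch below counts words in the style that Spark's RDD API expresses as flatMap, map and reduceByKey. It is written in plain Python so that it runs without a Spark installation; the function name `word_count` and the sample lines are illustrative assumptions, not course material.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words in an iterable of text lines (hypothetical helper).

    Mirrors the three RDD steps in plain Python. With PySpark the same
    pipeline would roughly read:
        sc.textFile(path).flatMap(lambda l: l.split())
          .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    """
    # flatMap step: split each line into words and flatten into one stream
    words = chain.from_iterable(line.lower().split() for line in lines)
    # map step: pair every word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey step: sum the counts that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Illustrative input, assumed for this sketch only
sample_lines = [
    "spark makes big data simple",
    "big data needs big clusters",
]
counts = word_count(sample_lines)
print(sorted(counts.items()))
```

In Spark the same three steps run in parallel across the machines of a cluster, whereas this sketch runs them in a single local process.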

Intended Audience and Prerequisites

The course is intended for researchers, students, and professionals with programming skills, preferably in Python, as the exercises are in Python. Some knowledge of SQL is also recommended.

Please NOTE: This is not a regular programming course. Participants are expected to learn emerging concepts in big data and distributed processing, which may differ completely from the concepts of a general-purpose programming language.


Day 1, Monday 19.11

  •    09.00 – 09.45    Overview and architecture of Spark
  •    09.45 – 10.30    Basics of RDDs and Demo
  •    10.30 – 10.45    Coffee break
  •    10.45 – 11.30    RDD: Transformations and Actions
  •    11.30 – 12.00    Exercises
  •    12.00 – 13.00    Lunch
  •    13.00 – 13.30    Word Count Example
  •    13.30 – 14.00    Exercises
  •    14.00 – 14.30    Short overview of Spark's machine learning library
  •    14.30 – 14.45    Coffee break
  •    14.45 – 15.30    Exercises
  •    15.30 – 15.45    Wrap-up and further topics
  •    15.45 – 16.00    Summary of the first day & exercises walk-through

Day 2, Tuesday, 20.11

  •    09.00 – 09.30    Spark DataFrames and SQL Overview
  •    09.30 – 10.15    Exercises
  •    10.15 – 10.30    Coffee break
  •    10.30 – 10.45    DataFrames and SQL (contd.)
  •    10.45 – 12.00    Exercises
  •    12.00 – 13.00    Lunch
  •    13.00 – 14.00    Setting up a Spark cluster
  •    14.00 – 14.30    Exercises
  •    14.00 – 14.30    Best practices and other useful stuff
  •    14.30 – 14.45    Coffee break
  •    14.45 – 15.00    Brief overview of Spark Streaming
  •    15.00 – 15.15    Demo: Processing live Twitter stream data
  •    15.15 – 16.00    Summary of the course & exercises walk-through


Apurva Nandan (CSC, lecturer), Juha Hulkkonen (CSC, teaching assistant)

Language: English
Price: Free of charge

CSC - IT Center for Science
CSC Training room Dogmi, 1st floor
Life Science Center, Keilaranta 14, Espoo, Finland

How to reach us

CSC is located in Keilaniemi, Espoo, 10 km west of the Helsinki City Center. Detailed information is available here.


We recommend choosing one of the hotels closest to our premises. The nearest hotel is Radisson Blu Espoo, which is within walking distance (only 500 m) of CSC. Another hotel close to the venue (1.8 km) is Sokos Hotel Tapiola Garden. Other hotels are located in downtown Helsinki, with a frequent and fast metro connection to Keilaniemi. Please note that there are no special rates for participants at any of the hotels.

If you have any questions, please click on the support link on the left to send an e-mail to the local organizers.