16-17 November 2017
CSC - IT Center for Science
Europe/Helsinki timezone


Thank you to those of you who have already registered for our training course!

The course is now FULLY BOOKED! If you have registered for this course and are unable to attend, please CANCEL your registration in advance by sending an email to patc at csc.fi.

We are now opening a waiting list. If you wish to be added to the waiting list, please contact patc at csc.fi. You will be notified by e-mail of your position on the waiting list, and we will keep you informed about whether you have been enrolled in the course.

Please note that the waiting list is handled on a first-come, first-served basis.

Welcome to the course!


As the volume of data used in data analysis tasks grows rapidly, it becomes more and more challenging to process it using standard methods. Enter Spark, a high-performance distributed computing framework, which allows us to tackle big-data problems by distributing the workload across a cluster of machines.

This two-day course covers the technical architecture and use cases of Spark, setting it up for your work, best practices and programming aspects. The first day includes the overview, architectural concepts and programming with Spark's fundamental data structure (RDD). The second day focuses on the SQL module of Spark, which allows the user to analyse data held in Spark's distributed collections (DataFrames) using traditional SQL queries.
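To give a flavour of the RDD programming style mentioned above, here is a minimal sketch of a word count written in plain Python. It is not Spark code and is not taken from the course material: the function name `word_count` and the sample sentences are made up for illustration. The three steps only mirror the kind of pipeline that is typically built in Spark with the RDD operations `flatMap`, `map` and `reduceByKey`, which Spark would run in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words in an iterable of text lines (hypothetical helper)."""
    # Step 1 (analogous to flatMap): split every line into words
    # and flatten the result into a single stream of words.
    words = chain.from_iterable(line.lower().split() for line in lines)

    # Step 2 (analogous to map): pair each word with the count 1.
    pairs = ((word, 1) for word in words)

    # Step 3 (analogous to reduceByKey): sum the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Made-up sample input, standing in for lines read from a text file.
sample_lines = [
    "spark makes big data simple",
    "big data needs big clusters",
]
counts = word_count(sample_lines)
print(counts["big"])
```

Here everything runs on a single machine. The appeal of Spark is that the same three-step logic can be expressed over a distributed collection without the user managing the distribution by hand.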

Learning outcome

After this course you should be able to write simple to intermediate programs in Spark using RDDs and DataFrames/SQL.


Basic knowledge of programming in general is recommended (ideally, Python).

Please NOTE: This is not a regular programming course. Participants will be expected to learn emerging concepts in the field of big data / distributed processing, which might be completely different from the concepts of a general programming language.


Day 1, Thursday 16.11

  •    09.00 – 09.30 Overview and architecture of Spark
  •    09.30 – 10.15 Basics of RDDs + Demo
  •    10.15 – 10.30 Coffee break
  •    10.30 – 11.00 RDD: Transformations and Actions
  •    11.00 – 12.00 Exercises
  •    12.00 – 13.00 Lunch
  •    13.00 – 13.30 Word Count Example
  •    13.30 – 14.00 Exercises
  •    14.00 – 14.15 Short overview of Spark's machine learning library
  •    14.15 – 14.30 Coffee break
  •    14.30 – 15.30 Exercises
  •    15.30 – 16.00 Summary of the first day & exercises walk-through

Day 2, Friday 17.11

  •    09.00 – 09.30 Spark DataFrames and SQL overview
  •    09.30 – 10.15 Exercises
  •    10.15 – 10.30 Coffee break
  •    10.30 – 10.45 DataFrames and SQL contd.
  •    10.45 – 12.00 Exercises
  •    12.00 – 13.00 Lunch
  •    13.00 – 13.30 Best practices and other useful stuff
  •    13.30 – 14.30 Exercises
  •    14.30 – 14.45 Coffee break
  •    14.45 – 15.00 Brief overview of Spark Streaming
  •    15.00 – 15.15 Demo: Processing live Twitter stream data
  •    15.15 – 16.00 Summary of the course & exercises walk-through


Apurva Nandan (CSC), Teaching Assistant: Tommi Jalkanen (CSC)


Language: English
Price: Free of charge

Starts 16 Nov 2017 09:00
Ends 17 Nov 2017 16:00
CSC - IT Center for Science
Training room Dogmi, 1st floor
Life Science Center, Keilaranta 14, Espoo, Finland

How to reach us

CSC is located in Keilaniemi, Espoo, 10 km west of the Helsinki City Center. Detailed information is available here.


We recommend choosing one of the hotels closest to our premises. The nearest hotel is Radisson Blu Espoo, which is within walking distance (only 500 m) of CSC. Another hotel close to the venue (1.8 km) is Sokos Hotel Tapiola Garden. Other hotels are located in downtown Helsinki, with a frequent and fast bus connection to Keilaniemi. Please note that there are no special rates for participants at any of the hotels.

If you have any questions, please click on the support link on the left to send an e-mail to the local organizers.