Apache Spark is an open-source cluster-computing framework designed for performance and ease of use, and well suited to large-scale parallel data processing. It is generally faster and simpler to use than Hadoop MapReduce, and provides a rich set of APIs in Python, Java and Scala.
This hands-on course will cover the following topics:
- Introduction to Spark
- Map, Filter and Reduce
- Running on a Spark Cluster
- Key-value pairs
- Correlations, logistic regression
- Decision trees, K-means
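
As a taste of the map, filter and reduce pattern and of key-value aggregation listed above, here is a minimal sketch in plain Python. It is not taken from the course materials and does not use Spark itself; the function names sum_of_even_squares and reduce_by_key are hypothetical, chosen for illustration. In PySpark the analogous operations are the RDD methods map, filter, reduce and reduceByKey, which apply the same ideas across a cluster.

```python
from functools import reduce
from operator import add


def sum_of_even_squares(numbers):
    """Square each number (map), keep the even squares (filter), sum them (reduce)."""
    squared = map(lambda x: x * x, numbers)
    evens = filter(lambda x: x % 2 == 0, squared)
    return reduce(add, evens, 0)


def reduce_by_key(pairs, func):
    """Combine the values of (key, value) pairs that share a key, using func.

    Hypothetical single-machine helper that mimics the idea behind
    Spark's reduceByKey; it is not part of any Spark API.
    """
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


if __name__ == "__main__":
    print(sum_of_even_squares(range(1, 6)))
    words = "to be or not to be".split()
    print(reduce_by_key(((w, 1) for w in words), add))
```

The second function is the classic word count: each word is turned into a (word, 1) pair and the counts are summed per key.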
10:00 - 17:30 (Thu)
10:00 - 15:30 (Fri)
Attendees will be given access to EPCC's Tier-2 Cirrus system for all practical exercises.
The practicals will use Jupyter notebooks, so a basic knowledge of Python would be extremely useful.
Registration: Closed. The course is full and has a long waiting list.
Full timetable and course materials to follow.