Apache Spark is an open-source cluster-computing framework for large-scale parallel data processing, designed for performance and ease of use. It is generally faster and simpler to use than Hadoop MapReduce, and provides a rich set of APIs in Python, Java and Scala.
This hands-on course will cover the following topics:
- Introduction to Spark
- Map, Filter and Reduce
- Running on a Spark Cluster
- Key-value pairs
- Correlations, logistic regression
- Decision trees, K-means
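
As a flavour of the map, filter and reduce pattern listed above, here is a minimal sketch in plain Python (not Spark itself). The word list is invented for illustration. Spark's RDD API provides operations with the same names (`map`, `filter`, `reduce`), but applies them in parallel across a cluster rather than on a single local list.

```python
from functools import reduce

# Hypothetical sample data, invented for this illustration.
words = ["spark", "hadoop", "spark", "cirrus", "spark"]

# Map: transform every element (here, each word becomes its length).
lengths = list(map(len, words))

# Filter: keep only the elements that satisfy a condition.
spark_words = list(filter(lambda w: w == "spark", words))

# Reduce: combine all elements into one value (here, a sum of lengths).
total_length = reduce(lambda a, b: a + b, lengths)

print(lengths)
print(spark_words)
print(total_length)
```
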
Sessions:
- Thursday: 09:30 - 17:30
- Friday: 09:30 - 15:30
Attendees will be given access to EPCC's Tier-2 Cirrus system for all practical exercises.
The practicals use Jupyter notebooks, so a basic knowledge of Python will be extremely useful.