[ONLINE] Introduction to Scalable Deep Learning @ JSC

CET
Online

Description

In this course, we will cover machine learning and deep learning and how to scale these methods to high-performance computing systems. The course aims to cover all levels, from fundamental software design to specific compute environments and toolkits. We want to enable the participants to unlock the resources of machines like the JUWELS Booster for their machine learning workflows. Unlike in previous years, we assume that the participants have a background from a university-level introductory course in machine learning. Suggested options for self-teaching are given below.

We will start the course with a presentation of high-performance computing system architectures and the design paradigms for HPC software. In the tutorial, we familiarize the participants with the environment. Furthermore, we give a recap of important machine learning concepts and algorithms, and the participants will train and test a reference model. Afterwards, we introduce how deep learning algorithms can be parallelized for supercomputer usage with Horovod. We also discuss best practices and pitfalls in adopting deep learning algorithms on supercomputers, and the participants learn to test their function and performance. Finally, we apply the gained expertise to large-scale unsupervised learning, with a particular focus on Generative Adversarial Networks (GANs).
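To give a flavor of the Horovod-based data parallelism taught on Day 2, below is a minimal sketch of a data-parallel Keras training script. The MNIST model and dataset are placeholder choices for illustration, not the course material; the Horovod calls (hvd.init, hvd.DistributedOptimizer, the broadcast callback) are the standard horovod.tensorflow.keras API.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod; the launcher starts one process per GPU.
hvd.init()

# Pin this process to its local GPU so workers do not contend for devices.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder data pipeline: each worker reads a distinct shard of MNIST.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = (
    tf.data.Dataset.from_tensor_slices(
        (x_train[..., None].astype('float32') / 255.0, y_train))
    .shard(num_shards=hvd.size(), index=hvd.rank())
    .shuffle(10_000)
    .batch(128)
)

# Placeholder model; any Keras model is parallelized the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate with the worker count, then wrap the optimizer
# so gradients are averaged across all workers after every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

model.fit(
    dataset,
    epochs=3,
    # Broadcast rank 0's initial weights so all workers start identically.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,  # only rank 0 prints progress
)
```

Started with one process per GPU (e.g., via horovodrun, or srun on a Slurm system such as JUWELS), every worker trains on its own data shard while Horovod averages the gradients after each step.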

Prerequisites:

We assume that the participants are familiar with general concepts of machine learning and/or deep learning, such as widely used models, losses, regularization, and basic model training/testing. Many excellent self-training resources are available, such as:

Hands-on experience with an ML/DL framework is required; first experience with HPC systems is helpful.

Learning outcome:

After this course, participants will be able to parallelize TensorFlow and PyTorch ML workflows on HPC machines, taking the HPC system architecture into account and circumventing typical pitfalls and bottlenecks.

Date:

14–18 March 2022, 9:00–13:00

Instructors:

Dr. Stefan Kesselheim, Dr. Jenia Jitsev, Roshni Kamath, Dr. Mehdi Cherti, Dr. Alexandre Strube, Jan Ebert, Jülich Supercomputing Centre

    • 9:00 AM – 11:00 AM
      Day 1: Introduction to Supercomputers
      • 9:00 AM
        Introduction 30m
      • 9:30 AM
        Lecture 1.1: Intro to supercomputers 30m
      • 10:00 AM
        Tutorial 1.1: First Steps on the Supercomputer 1h
    • 11:30 AM – 1:00 PM
      Day 1: Introduction to parallel programming
      • 11:30 AM
        Lecture 1.2: Supercomputer architecture and MPI primer 30m
      • 12:00 PM
        Tutorial 1.2: Hello MPI World 1h (see the minimal sketch after the schedule)
    • 9:30 AM – 11:00 AM
      Day 2: Deep Learning Basics Recap
      • 9:30 AM
        Lecture 2.1: Motivation for Large Scale Deep Learning and Deep Learning Basics Recap 30m
      • 10:00 AM
        Tutorial 2.1: Deep Learning Basics Recap 1h
    • 11:30 AM – 1:00 PM
      Day 2: Distributed Training
      • 11:30 AM
        Lecture 2.2: Distributed Training Schemes and Data Parallelism with Horovod 30m
      • 12:00 PM
        Tutorial 2.2: A first parallelization with Horovod 1h
    • 9:30 AM – 11:00 AM
      Day 3: Large Datasets and Scaling Distributed Training
      • 9:30 AM
        Lecture 3.1: Large Datasets and Scaling Distributed Training 30m
      • 10:00 AM
        Tutorial 3.1: Distributed Training with ImageNet and Scaling Basics 1h
    • 11:30 AM – 1:00 PM
      Day 3: Performance analysis
      • 11:30 AM
        Lecture 3.2: Is my code fast? Performance analysis 30m
      • 12:00 PM
        Tutorial 3.2: Tools for training performance analysis 1h
    • 9:30 AM – 11:00 AM
      Day 4: Combating Accuracy Loss
      • 9:30 AM
        Lecture 4.1: Combating Accuracy Loss 30m
      • 10:00 AM
        Tutorial 4.1 Part I: Combating Accuracy Loss, Basics 1h
    • 11:30 AM – 1:00 PM
      Day 4: Outlook
      • 11:30 AM
        Tutorial 4.1 Part II: Combating Accuracy Loss, Advanced 1h
      • 12:30 PM
        Lecture 4.2: Outlook beyond the basics 30m
    • 9:30 AM – 11:00 AM
      Day 5: GANs I
      • 9:30 AM
        Lecture 5.1: Generative models, GAN basics 30m
      • 10:00 AM
        Tutorial 5.1: Parallelizing a basic GAN: DCGAN architecture 1h
    • 11:30 AM – 1:15 PM
      Day 5: GANs II
      • 11:30 AM
        Lecture 5.2: Advanced GANs 30m
      • 12:00 PM
        Tutorial 5.2: Parallelizing an advanced GAN: StyleGAN2 architecture 1h
      • 1:00 PM
        Concluding Remarks 15m
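For those curious what Tutorial 1.2 (Hello MPI World) involves, a minimal MPI program in Python might look as follows. This is a sketch assuming the mpi4py bindings; the filename hello_mpi.py is a hypothetical choice.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD   # default communicator containing all started processes
rank = comm.Get_rank()  # unique id of this process within the communicator
size = comm.Get_size()  # total number of processes

print(f"Hello MPI World from rank {rank} of {size}")
```

Launched under Slurm with, e.g., srun -n 4 python hello_mpi.py, each of the four ranks prints its own greeting, illustrating the single-program, multiple-data model introduced in Lecture 1.2.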