With the growing performance gap between compute and storage, even the best use of I/O bandwidth might not be enough. This is especially critical for checkpointing in the context of fault tolerance. This course offers an introduction to FTI, a library that aims to give computational scientists the means to perform fast and efficient multilevel checkpointing on large-scale supercomputers.
To cope with the proliferation of libraries, this course will also introduce the PDI Data Interface, a solution for coupling simulation codes with I/O and post-processing libraries (including HDF5, NetCDF, FTI, ...) through simple annotations. PDI improves code quality with a) annotations that are independent of the library used, b) a choice of I/O strategy made at runtime in YAML and c) negligible overhead when accessing the full power of the underlying libraries.
Day 1:
- 9h30 - 9h45: Welcome & introduction
- 9h45 - 11h15: Storage@TGCC & Lustre filesystems (Thomas Leibovici - TGCC, CEA)
- 11h30 - 12h30: NetCDF (Olga Abramkina - MdlS/IDRIS, CNRS)
- 14h00 - 17h00: NetCDF (Olga Abramkina - MdlS/IDRIS, CNRS)
Day 2:
- 9h30 - 12h30: Sequential HDF5 (Matthieu Haefele - LMAP, CNRS)
- 14h00 - 17h00: Parallel HDF5 (Matthieu Haefele - LMAP, CNRS)
Day 3:
- 9h30 - 12h30: The FTI fault-tolerance library (Leonardo Bautista Gomez - BSC)
- 14h00 - 17h00: The PDI Data Interface (Julien Bigot - MdlS, CEA)
Instructors: Olga Abramkina (MdlS/IDRIS, CNRS), Leonardo Bautista Gomez (BSC), Julien Bigot (MdlS, CEA), Matthieu Haefele (LMAP, CNRS), Kai Keller (BSC), Thomas Leibovici (TGCC, CEA)
Learning outcomes: After this course, participants should understand the trade-offs involved in using a parallel file system and know how to use parallel I/O libraries efficiently. Participants will also have a basic understanding of FTI and PDI and some hands-on practice with both.
Prerequisites: Knowledge of the C or Fortran programming language and of parallel programming with MPI