PRACE Winter School 2015 includes the following training sessions:
The age of "Big Data" has reached high performance computing, and new tools and technologies must be integrated into the software ecosystem of scientists in order to extract knowledge from data. New challenges emerge from the complexity of simulations that integrate more physics and higher fidelity, while at the same time deepening memory and storage hierarchies have dramatically increased the difficulty of coping with large volumes and high velocities of data. In this tutorial, students will learn the best practices and techniques that are crucial for working with exponentially growing data sizes. Our tutorial will teach students the basics of high performance I/O, analytics, and visualization.
Part I of this tutorial will introduce parallel I/O in general, along with middleware libraries that were created to work with "Big Scientific Data". First, we will teach serial and parallel HDF5 and show how to incorporate it into serial and parallel simulations. Next, we will summarize the lessons and key techniques that our team gained through years of collaboration with domain scientists working in areas such as fusion, combustion, astrophysics, materials science, and seismology. This experience and knowledge resulted in the creation of ADIOS, a tool that makes scaling I/O easy, portable, and efficient.
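As a minimal illustration of the serial HDF5 concepts covered in Part I, the following sketch uses the h5py Python binding (the tutorial itself may well use the C or Fortran APIs) to write and read back a dataset whose metadata travels with it; the file name and dataset name are arbitrary examples:

```python
import h5py
import numpy as np

# Write a dataset with a unit attribute; HDF5 files are self-describing,
# so the metadata is stored alongside the data itself.
with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.linspace(270.0, 300.0, 8))
    dset.attrs["units"] = "K"

# Read it back: no external schema is needed to interpret the file.
with h5py.File("example.h5", "r") as f:
    temps = f["temperature"][:]
    units = f["temperature"].attrs["units"]
```

In the parallel case, the same API is used with an MPI-enabled file driver so that all ranks write into a single shared file.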
In the second part of the tutorial we will discuss ADIOS and how it has helped many applications move from I/O-dominated to compute-dominated simulations. We will show the APIs that allow ADIOS to utilize different methods to write self-describing files and achieve high performance I/O. This will be followed by a hands-on session on how to write and read data, and how to use different I/O componentizations inside the ADIOS framework.
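ADIOS's actual BP file format is far more sophisticated, but the core idea of a self-describing file, where every variable carries its own name, type, and shape metadata, can be sketched in a few lines of plain Python (a toy format for illustration only, not the ADIOS API or layout):

```python
import json
import struct

def write_self_describing(path, variables):
    """Write float arrays with per-variable metadata (toy format, not ADIOS BP)."""
    with open(path, "wb") as f:
        for name, values in variables.items():
            header = json.dumps({"name": name, "dtype": "f8",
                                 "shape": [len(values)]}).encode()
            f.write(struct.pack("<I", len(header)))            # header length
            f.write(header)                                    # metadata
            f.write(struct.pack(f"<{len(values)}d", *values))  # payload

def read_self_describing(path):
    """Recover every variable using only the metadata stored in the file."""
    out = {}
    with open(path, "rb") as f:
        while True:
            raw = f.read(4)
            if not raw:
                break
            (hlen,) = struct.unpack("<I", raw)
            meta = json.loads(f.read(hlen))
            n = meta["shape"][0]
            out[meta["name"]] = list(struct.unpack(f"<{n}d", f.read(8 * n)))
    return out
```

Because each variable is described in the file itself, a reader needs no external schema, which is the property that lets tools like ADIOS swap I/O methods without changing application code.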
Part III will teach students how to take advantage of the ADIOS framework to do topology-aware data movement, compression and data staging/streaming using advanced methods.
The session will be conducted by Jeremy Logan and Norbert Podhorszki.
Jeremy Logan is a Computational Scientist at the University of Tennessee and works closely with the Scientific Data Group at Oak Ridge National Laboratory. Jeremy’s research interests include I/O performance, data and workflow management, and the application of domain specific, generative techniques to High Performance Computing.
Norbert Podhorszki is a Research Scientist in the Scientific Data Group at Oak Ridge National Laboratory. He is the lead developer of ADIOS. He works with application users of the Oak Ridge Leadership Facility to improve their I/O performance using ADIOS. His research interest is in how to enable data processing on-the-fly using memory-to-memory data movements, e.g. for speeding up I/O, coupling simulation codes, and building in-situ workflows.
The R language has been called the "lingua franca" of data analysis and statistical computing. This tutorial will introduce attendees to the basics of the R language with a focus on its recent high performance extensions enabled by the "Programming with Big Data in R" (pbdR) project, which is aimed at medium to large HPC platforms. The overview part of the tutorial will include a brief introduction to R and an introduction to the pbdR project and its use of scalable HPC libraries.
The first hands-on portion will cover pbdR's profiling tools and its simplified interface to MPI. We will work through a range of examples, beginning with simple MPI use, moving through basic examples with data, and ending with more advanced uses in statistics and data analysis. We will also introduce students to concepts of partitioning data across the nodes of a large platform.
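The idea of partitioning data across ranks can be illustrated independently of pbdR. This pure-Python sketch (not the pbdR API) computes the contiguous block of global indices owned by each rank, with earlier ranks absorbing the remainder, which mirrors common MPI practice:

```python
def block_partition(n, nranks, rank):
    """Return the half-open (start, stop) range of global indices owned by
    `rank` when n items are split as evenly as possible over nranks ranks.
    The first (n % nranks) ranks each receive one extra item."""
    base, extra = divmod(n, nranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop
```

In an actual pbdR or MPI program, each rank would call this with its own rank number and then read or compute only its slice of the data.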
The advanced topics portion will cover pbdR's connections to scalable libraries and their partitioned data structure requirements. We will cover ways of reading in and redistributing data, including R's native data input, input from HDFS, and scalable input via ADIOS. We will also cover more advanced examples that make use of the distributed data structures and object-oriented components of R, which allow many functions to use identical syntax in parallel and serial code.
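The scalable linear algebra libraries behind pbdR use ScaLAPACK-style block-cyclic layouts, in which fixed-size blocks are dealt out to ranks round-robin. The mapping can be sketched in a small pure-Python helper for the 1-D case (an illustration of the layout only, not the pbdR or ScaLAPACK API):

```python
def block_cyclic_owner(i, block, nranks):
    """Map global index i to (rank, local index) in a 1-D block-cyclic
    layout with the given block size: blocks of `block` consecutive
    indices are assigned to ranks in round-robin order."""
    b = i // block              # which block the index falls in
    rank = b % nranks           # blocks are dealt out round-robin
    local_block = b // nranks   # how many earlier blocks this rank holds
    return rank, local_block * block + i % block
```

Block-cyclic layouts balance load for algorithms (such as LU factorization) whose active region shrinks as they proceed, which is why redistributing input data into this layout is a key step before calling the scalable routines.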
The session will be conducted by George Ostrouchov. George is a Senior Research Scientist in the Scientific Data Group at the Oak Ridge National Laboratory and Joint Faculty Professor of Statistics at the University of Tennessee. His doctoral work was on large sparse least squares computations in data analysis. His research interests have been for many years at the intersection of high performance computing and statistics. He initiated and continues to lead the pbdR project. George is a Fellow of the American Statistical Association.
VisIt is an open source, turnkey application for large scale simulated and experimental data sets. Its charter goes beyond pretty pictures; the application is an infrastructure for parallelized, general post-processing of extremely massive data sets. Target use cases include data exploration, comparative analysis, visual debugging, quantitative analysis, and presentation graphics.
The first hands-on portion will cover how to use the VisIt GUI to perform a variety of visualization tasks on scalar and vector field data, using techniques such as isocontouring, volume rendering, and particle advection. It will also include an introduction to VisIt's powerful analysis and expression engine for creating derived quantities. The advanced section will include an introduction to VisIt's Python API, how to use the client-server architecture to run VisIt on remote resources, and advanced rendering techniques.
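The kind of derived quantity produced by an expression engine, such as the pointwise magnitude of a vector field, can be sketched in plain NumPy (the toy grid data here is illustrative; it is not VisIt's expression language or API):

```python
import numpy as np

# Toy 2-D vector field sampled on a small grid (illustrative data).
vx = np.array([[3.0, 0.0], [1.0, 0.0]])
vy = np.array([[4.0, 2.0], [0.0, 0.0]])

# Derived scalar quantity: pointwise vector magnitude, analogous to
# defining a magnitude expression over a vector field in a
# visualization tool's expression engine.
speed = np.sqrt(vx**2 + vy**2)
```

The derived field can then be visualized like any native scalar field, e.g. with isocontours or a pseudocolor plot.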
The session will be conducted by Dr. David Pugmire. David is a Research Scientist in the Scientific Data Group, in the Computer Science and Mathematics Division at ORNL. His research interests are in visualization of large scientific data.
Hadoop session [CANCELLED]
[9/1/2015: The tutor Shadi Ibrahim cancelled his Hadoop session due to an emergency (sudden serious illness) in his closest family.]
Data volumes are ever growing, across a broad application spectrum ranging from traditional database applications and scientific simulations to emerging applications such as Web 2.0 and online social networks. To cope with the added weight of Big Data, we have recently witnessed a paradigm shift in the way data is processed, through the MapReduce model. First promoted by Google, MapReduce has become, thanks to the popularity of its open-source implementation Hadoop, the de facto programming paradigm for Big Data processing in large-scale data centers and clouds.
The goal of this tutorial is to serve as a first step towards exploring the Hadoop platform and to provide a short introduction to working with big data in Hadoop. An overview of Big Data will be presented, including definitions, the sources of Big Data, and the main challenges it introduces. We will then present the MapReduce programming model as an important programming model for Big Data processing in the Cloud. The Hadoop ecosystem and some of its major features will then be discussed. Finally, we will discuss several approaches and methods used to optimise the performance of Hadoop in the Cloud.
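The MapReduce model can be illustrated without Hadoop itself. A minimal pure-Python word count below separates the three phases the framework provides: map (emit key-value pairs), shuffle (group values by key), and reduce (aggregate each group); this is a conceptual sketch, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "Big compute", "data"])))
```

In Hadoop, the same map and reduce functions would run in parallel over partitions of the input, with the shuffle handled transparently by the framework.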
Several hands-on exercises will be provided to study the operation of the Hadoop platform along with the implementation of MapReduce applications.
The Hadoop session will be conducted by Dr. Shadi Ibrahim. Shadi is a permanent Inria research scientist within the KerData research team. He obtained his Ph.D. in Computer Science from Huazhong University of Science and Technology in Wuhan, China, in 2011. His research interests are in cloud computing, big data management, data-intensive computing, high performance computing, virtualization technology, and file and storage systems. He has published several research papers in recognized big data and cloud computing journals and conferences, including several papers on optimizing and improving Hadoop MapReduce performance in the cloud, as well as a book chapter on the MapReduce framework.