PRACE Winter School 2015 - HPC Tools for Data Intensive Processing

VŠB-Technical University of Ostrava, Czech Republic

Address: 17. listopadu 15/2172, 708 33 Ostrava-Poruba, Czech Republic, http://www.vsb.cz
Ondřej JAKL
Description

Introduction

 

Historically, High Performance Computing (HPC) was primarily about compute-intensive processing, as opposed to data-intensive jobs. Nowadays, complex HPC simulations combine these two attributes: large-scale modelling as a rule performs demanding computations on very large data sets. Moreover, a new class of HPC jobs has emerged that searches for useful information and patterns in the data itself – high-performance data analysis. Data is becoming the focus of HPC. PRACE Winter School 2015, presenting HPC Tools for Data Intensive Processing, reflects this development.

The PRACE Winter School 2015 will take place on 12 – 15 January 2015 on the campus of VŠB-Technical University of Ostrava, Czech Republic. It is organized jointly by the National Supercomputing Center IT4Innovations, VŠB-Technical University of Ostrava and PRACE.

  

About the programme

 

The School’s programme offers a portfolio of tutorials (sessions) in which students will learn the best practices and techniques that are crucial for working with exponentially growing data sizes.

 

The first and largest session, High Performance I/O, will start with parallel I/O in general, teach serial and parallel HDF5 and then focus on ADIOS, a middleware layer that has already helped many applications achieve high-performance I/O.
 
In the Analytics session, we shall get acquainted with the R language, called the "lingua franca" of data analysis and statistical computing, and its recent high performance extensions enabled by the "Programming with Big Data in R" (pbdR) project.

 

The optional Visualization session introduces VisIt, a turnkey visualization tool for large-scale simulated and experimental data sets, going far beyond pretty pictures.
 
The Visualization session (as well as some advanced add-ons of the High Performance I/O and Analytics sessions) will run in parallel with another optional session, Hadoop. If you make this choice, you will enter the world of distributed processing and unstructured data storage, and follow a paradigm shift in the way data in large-scale data centers and clouds is processed through the MapReduce model.

The tutors invited to the PRACE Winter School 2015 are distinguished scientists and experts in the focus areas of the School. Many of them give tutorials at the Supercomputing Conference (SC) series in the US or at the International Supercomputing Conference (ISC), its European counterpart.

Two social events will be offered to the participants. On Monday evening, a Welcome reception will take place in the patio of the new IT4Innovations building. On Wednesday, we shall visit the former Anselm Mine, nowadays housing the largest mining museum in the Czech Republic. Dinner will be served in the nearby Harenda Miner’s Pub.

See Programme in the menu on the left for further details.
 

Prerequisites and Registration

 

Applicants are expected to be active developers/programmers of HPC applications dealing with large data sets. Thus, the prerequisites include working skills in (parallel) programming and some experience with big data processing. The attendees are encouraged to bring a poster on their work related to the topics of the school.
 

The school participants will be selected by an admission committee based on the information they provide through a registration form (accessible via the menu on the left). The registration deadline is December 8, 2014. The selection procedure will be performed on the fly, and applicants will be informed about the result usually within three weeks after registration, and at the latest one week after the registration deadline. The number of the School's participants is limited to forty.

The registration is closed.

The participants are expected to bring their own laptops for the hands-on exercises, with VirtualBox >= 4.3.4 installed. Please install this software in advance; there will be no time to install it on the spot before the School starts on Monday. VirtualBox can be downloaded from https://www.virtualbox.org/wiki/Downloads.

Make sure that you have at least 10 GB of free disk space for the virtual machine itself.


Remarks
 

The school is offered free of charge to students, researchers and academics residing in PRACE member states and eligible countries. Lunches, coffee breaks and social events are included. It is the responsibility of the attendees to arrange and cover travel and accommodation. The school's official language is English.
 

School's flyer
Slides
E-mail contact
    • 08:30 09:00
      Registration 30m auditorium NA4 (VŠB-Technical University of Ostrava)

    • 09:00 09:30
      Opening 30m auditorium NA4 (VŠB-Technical University of Ostrava)

      Welcome and introductory information
    • 09:30 11:00
      High Performance I/O auditorium NA4 (VŠB-Technical University of Ostrava)

      The age of "Big Data" has reached high performance computing, and new tools and technologies must be integrated into the software ecosystem of scientists in order to extract knowledge from data. New challenges emerge from the complexity of simulations that integrate more physics and higher fidelity, while at the same time deeper memory and storage hierarchies have dramatically increased the difficulty of coping with the large volume and high velocity of data. In this tutorial students will learn the best practices and techniques that are crucial to allow them to work with exponentially growing data sizes. Our tutorial will teach students the basics of high performance I/O, analytics, and visualization.

      Part I of this tutorial will introduce parallel I/O in general, and middleware libraries that were created to work with "Big Scientific Data". First, we will teach serial and parallel HDF5, and learn how to incorporate it into serial and parallel simulations. Next, we will summarize the lessons and key techniques that our team gained through years of collaboration with domain scientists working in areas such as fusion, combustion, astrophysics, materials science, and seismology. This experience and knowledge resulted in the creation of ADIOS, a tool that makes scaling I/O easy, portable, and efficient.

      In the second part of the tutorial we will discuss ADIOS, and how it has helped many applications move from I/O-dominated to compute-dominated simulations. We will show the APIs that allow ADIOS to use different methods to write self-describing files and achieve high performance I/O. This will be followed by a hands-on session on how to write/read data, and how to use different I/O componentizations inside the ADIOS framework. Part III will teach students how to take advantage of the ADIOS framework to do topology-aware data movement, compression and data staging/streaming using advanced methods.
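      The core idea behind parallel I/O with libraries such as parallel HDF5 or MPI-IO is that each process writes a disjoint hyperslab (offset + count) of a global array. A minimal, library-free sketch of that decomposition (the function name and the block distribution are illustrative, not part of any of the libraries taught in the session):

```python
def hyperslab(global_n, nranks, rank):
    """Offset and count of one rank's contiguous block of a 1-D array with
    global_n elements; the remainder is spread over the first ranks.  These
    are exactly the values one would pass to a hyperslab selection when
    writing with parallel HDF5 or MPI-IO."""
    base, rem = divmod(global_n, nranks)
    count = base + (1 if rank < rem else 0)
    offset = rank * base + min(rank, rem)
    return offset, count

# 10 elements over 4 ranks -> counts 3,3,2,2 covering [0,10) without overlap.
slabs = [hyperslab(10, 4, r) for r in range(4)]
print(slabs)  # [(0, 3), (3, 3), (6, 2), (8, 2)]
```

      Because the blocks are disjoint and cover the whole array, all ranks can write concurrently without coordination beyond the collective file open.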

      The session will be conducted by Jeremy Logan and Norbert Podhorszki.

      Jeremy Logan is a Computational Scientist at the University of Tennessee and works closely with the Scientific Data Group at Oak Ridge National Laboratory. Jeremy’s research interests include I/O performance, data and workflow management, and the application of domain specific, generative techniques to High Performance Computing.

      Norbert Podhorszki is a Research Scientist in the Scientific Data Group at Oak Ridge National Laboratory. He is the lead developer of ADIOS. He works with application users of the Oak Ridge Leadership Facility to improve their I/O performance using ADIOS. His research interest is in how to enable data processing on-the-fly using memory-to-memory data movements, e.g. for speeding up I/O, coupling simulation codes, and building in-situ workflows.

      • 09:30
        High performance I/O discussion 45m auditorium NA4

        Speaker: Jeremy Logan (University of Tennessee, ORNL)
      • 10:15
        Using HDF5 45m auditorium NA4

        Speaker: Norbert Podhorszki
    • 11:00 11:30
      Coffee 30m
    • 11:30 12:30
      High Performance I/O auditorium NA4 (VŠB-Technical University of Ostrava)

      • 11:30
        Using parallel HDF5 1h
        Speaker: Norbert Podhorszki
    • 12:30 13:30
      Lunch 1h
    • 13:30 15:00
      High Performance I/O auditorium NA4 (VŠB-Technical University of Ostrava)

      • 13:30
        ADIOS overview 30m auditorium NA4

        Speaker: Jeremy Logan (University of Tennessee, ORNL)
      • 14:00
        ADIOS – write files 1h auditorium NA4

        Speaker: Jeremy Logan (University of Tennessee, ORNL)
    • 15:00 15:30
      Coffee 30m
    • 15:30 17:30
      High Performance I/O auditorium NA4 (VŠB-Technical University of Ostrava)

      • 15:30
        ADIOS tools 30m
        Speaker: Jeremy Logan (University of Tennessee, ORNL)
      • 16:00
        ADIOS no-xml 1h
        Speaker: Norbert Podhorszki
      • 17:00
        HW assignments 30m
        Speakers: Jeremy Logan (University of Tennessee, ORNL), Norbert Podhorszki (ORNL)
    • 18:00 21:00
      Social event
      • 18:30
        Welcome reception 2h 30m patio (IT4Innovations building)

        The Welcome reception will take place in the patio of the new IT4Innovations building, within walking distance of the School's venue (and the Garni hotel). Transport to the Mercure hotel after the event will be provided.
    • 09:00 10:30
      High Performance I/O

      • 09:00
        Go over HW assignment 30m
        Speaker: Norbert Podhorszki (ORNL)
      • 09:30
        ADIOS reading 1h
        Speaker: Jeremy Logan (University of Tennessee, ORNL)
    • 10:30 11:00
      Coffee 30m
    • 11:00 12:00
      High Performance I/O

      • 11:00
        ADIOS Staging 1h
        Speaker: Norbert Podhorszki (ORNL)
    • 12:00 13:30
      Lunch 1h 30m
    • 12:00 13:30
      Poster session
    • 13:30 14:30
      Analytics session

      The R language has been called the "lingua franca" of data analysis and statistical computing. This tutorial will introduce attendees to the basics of the R language with a focus on its recent high performance extensions enabled by the "Programming with Big Data in R" (pbdR) project, which is aimed at medium to large HPC platforms. The overview part of the tutorial will include a brief introduction to R and an introduction to the pbdR project and its use of scalable HPC libraries.

      The first hands-on portion will include pbdR's profiling and its simplified interface to MPI. We will cover a range of examples beginning with simple MPI use through simple examples with data to more advanced uses in statistics and data analysis. We will also introduce students to concepts of partitioning data across nodes of a large platform.

      The advanced topics portion will cover pbdR's connections to scalable libraries and their partitioned data structure requirements. We will cover ways of reading in and redistributing data. This will include R's native data input, input from HDFS, as well as scalable input via ADIOS. We also cover more advanced examples that make use of the distributed data structures and object-oriented components of R for identical parallel and serial syntax capability in many functions.
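      pbdR itself is R, but the SPMD pattern it teaches — each rank holds one chunk of the data, computes local partial statistics, and a reduction (MPI_Allreduce in pbdMPI) combines them into a global result — can be sketched in plain Python. Here a list of chunks stands in for the MPI ranks, and the function names are illustrative:

```python
def local_stats(chunk):
    # Each "rank" computes partial sums over its own chunk of the data.
    return sum(chunk), sum(x * x for x in chunk), len(chunk)

def allreduce_mean_var(partials):
    # The reduction step combines the per-rank partials into global statistics.
    s = sum(p[0] for p in partials)
    ss = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    mean = s / n
    var = ss / n - mean * mean   # population variance from the two moments
    return mean, var

data = list(range(1, 9))                               # global data: 1..8
chunks = [data[0:2], data[2:4], data[4:6], data[6:8]]  # partitioned over 4 "ranks"
partials = [local_stats(c) for c in chunks]
mean, var = allreduce_mean_var(partials)
print(mean, var)  # 4.5 5.25
```

      The point of the pattern is that no rank ever needs the whole data set; only the small partial sums travel over the network.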

      The session will be conducted by George Ostrouchov. George is a Senior Research Scientist in the Scientific Data Group at the Oak Ridge National Laboratory and Joint Faculty Professor of Statistics at the University of Tennessee. His doctoral work was on large sparse least squares computations in data analysis. His research interests have been for many years at the intersection of high performance computing and statistics. He initiated and continues to lead the pbdR project. George is a Fellow of the American Statistical Association.

      • 13:30
        pbdR overview 1h
        Speaker: George Ostrouchov (ORNL)
    • 14:30 15:00
      Coffee 30m
    • 15:00 17:30
      Analytics session

      • 15:00
        pbdR hands-on 2h
        Speaker: George Ostrouchov (ORNL)
      • 17:00
        HW assignments 30m
        Speaker: George Ostrouchov (ORNL)
    • 09:00 10:30
      Hadoop session [CANCELLED]

      Data volumes are ever growing, for a large application spectrum ranging from traditional database applications and scientific simulations to emerging applications including Web 2.0 and online social networks. To cope with this added weight of Big Data, we have recently witnessed a paradigm shift in the way data is processed, through the MapReduce model. First promoted by Google, MapReduce has become, due to the popularity of its open-source implementation Hadoop, the de facto programming paradigm for Big Data processing in large-scale data centers and clouds.

      The goal of this tutorial is to serve as a first step towards exploring the Hadoop platform and to provide a short introduction to working with big data in Hadoop. An overview of Big Data, including definitions, the sources of Big Data, and the main challenges introduced by Big Data, will be presented. We will then present MapReduce as an important programming model for Big Data processing in the Cloud. The Hadoop ecosystem and some of its major features will then be discussed. Finally, we will discuss several approaches and methods used to optimise the performance of Hadoop in the Cloud.

      Several hands-on exercises will be provided to study the operation of the Hadoop platform along with the implementation of MapReduce applications.
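      The MapReduce model itself fits in a few lines. Below is a plain-Python sketch of Hadoop's canonical WordCount: the mapper emits (word, 1) pairs, the shuffle groups values by key (Hadoop does this for you), and the reducer sums each group. This only illustrates the data flow, not Hadoop's actual Java API or its distributed execution:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key (performed by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts collected for each word.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big compute", "big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"], counts["compute"])  # 3 2 1
```

      Because mappers see one record at a time and reducers see one key group at a time, both phases parallelize trivially across a cluster, which is the essence of the paradigm shift discussed above.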

      The Hadoop session will be conducted by Dr. Shadi Ibrahim. Shadi is a permanent Inria research scientist within the KerData research team. He obtained his Ph.D. in Computer Science from Huazhong University of Science and Technology in Wuhan, China, in 2011. His research interests are in cloud computing, big data management, data-intensive computing, high performance computing, virtualization technology, and file and storage systems. He has published several research papers in recognized big data and cloud computing journals and conferences, among them several papers on optimizing and improving Hadoop MapReduce performance in the cloud and a book chapter on the MapReduce framework.

      • 09:00
        An introduction to Big Data 1h
        Speaker: Shadi Ibrahim (INRIA)
      • 10:00
        Big Data processing in the Cloud: The MapReduce programming model 30m
        Speaker: Shadi Ibrahim (INRIA)
    • 09:00 10:30
      Visualization session

      VisIt is an open source, turnkey application for large scale simulated and experimental data sets. Its charter goes beyond pretty pictures; the application is an infrastructure for parallelized, general post-processing of extremely massive data sets. Target use cases include data exploration, comparative analysis, visual debugging, quantitative analysis, and presentation graphics.

      The first hands-on portion will cover how to use the VisIt GUI to perform a variety of visualization tasks on scalar and vector field data, using techniques such as iso-contouring, volume rendering, and particle advection. It will also include an introduction to VisIt's powerful analysis and expression engine for creating derived quantities. The advanced section will include an introduction to VisIt's Python API, how to use the client-server architecture to run VisIt on remote resources, and advanced rendering techniques.

      The session will be conducted by Dr. David Pugmire. David is a Research Scientist in the Scientific Data Group, in the Computer Science and Mathematics Division at ORNL. His research interests are in visualization of large scientific data.

      • 09:00
        Introduction to VisIt 1h 30m
        Speaker: Dave Pugmire (ORNL)
    • 10:30 11:00
      Coffee 30m
    • 11:00 12:15
      Hadoop session [CANCELLED]

      Data volumes are ever growing across a broad spectrum of applications, ranging from traditional database applications and scientific simulations to emerging applications such as Web 2.0 and online social networks. To cope with this deluge of Big Data, we have recently witnessed a paradigm shift in the way data is processed, centred on the MapReduce model. First promoted by Google, MapReduce has become, thanks to the popularity of its open-source implementation Hadoop, the de facto programming paradigm for Big Data processing in large-scale data centers and clouds.

      The goal of this tutorial is to serve as a first step towards exploring the Hadoop platform and to provide a short introduction to working with big data in Hadoop. An overview of Big Data will be presented, including definitions, the sources of Big Data, and the main challenges it introduces. We will then present MapReduce as an important programming model for Big Data processing in the Cloud. The Hadoop ecosystem and some of the major Hadoop features will then be discussed. Finally, we will discuss several approaches and methods used to optimise the performance of Hadoop in the Cloud.
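
      The MapReduce model itself can be sketched in a few lines of plain Python; this is a single-process illustration of the map/shuffle/reduce phases, not Hadoop's distributed implementation:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user-defined map function to every input record,
    # producing intermediate (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle_phase(pairs):
    # Group intermediate values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Apply the user-defined reduce function to each key group.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Classic word count: map emits (word, 1), reduce sums the counts.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

lines = ["big data big compute", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines, wc_map)), wc_reduce)
print(counts)  # {'big': 3, 'data': 2, 'compute': 1}
```

      In Hadoop, only the map and reduce functions are user code; the framework handles partitioning the input, the shuffle, and fault tolerance across the cluster.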

      Several hands-on exercises will be provided to study the operation of the Hadoop platform and the implementation of MapReduce applications.

      The Hadoop session will be conducted by Dr. Shadi Ibrahim. Shadi is a permanent Inria research scientist within the KerData research team. He obtained his Ph.D. in Computer Science from Huazhong University of Science and Technology in Wuhan, China, in 2011. His research interests are in cloud computing, big data management, data-intensive computing, high performance computing, virtualization technology, and file and storage systems. He has published several research papers in recognized big data and cloud computing journals and conferences, including several papers on optimizing and improving Hadoop MapReduce performance in the cloud and a book chapter on the MapReduce framework.

      • 11:00
        Hadoop ecosystem: An overview 1h 15m
        Speaker: Shadi Ibrahim (INRIA)
    • 11:00 12:15
      Visualization session

      • 11:00
        Hands on with VisIt 1h 15m
        Speaker: Dave Pugmire (ORNL)
    • 12:15 13:30
      Lunch 1h 15m
    • 13:30 15:00
      Analytics session

      The R language has been called the "lingua franca" of data analysis and statistical computing. This tutorial will introduce attendees to the basics of the R language with a focus on its recent high performance extensions enabled by the "Programming with Big Data in R" (pbdR) project, which is aimed at medium to large HPC platforms. The overview part of the tutorial will include a brief introduction to R and an introduction to the pbdR project and its use of scalable HPC libraries.

      The first hands-on portion will include pbdR's profiling and its simplified interface to MPI. We will cover a range of examples beginning with simple MPI use through simple examples with data to more advanced uses in statistics and data analysis. We will also introduce students to concepts of partitioning data across nodes of a large platform.
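
      The partitioning idea can be sketched without real MPI. The following plain-Python code simulates ranks within a single process (pbdR itself uses genuine MPI communicators via pbdMPI, and the block sizes follow the usual "remainder to the early ranks" convention assumed here):

```python
def block_partition(data, nranks):
    # Split data into nranks contiguous blocks, the way SPMD codes
    # distribute a vector across MPI ranks (remainder goes to early ranks).
    n = len(data)
    base, extra = divmod(n, nranks)
    blocks, start = [], 0
    for rank in range(nranks):
        size = base + (1 if rank < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks

def global_mean(blocks):
    # Each "rank" computes a local sum and count; an allreduce-style
    # combination (here a plain Python sum) yields the global mean.
    local = [(sum(b), len(b)) for b in blocks]
    total = sum(s for s, _ in local)
    count = sum(c for _, c in local)
    return total / count

data = list(range(1, 11))          # 1..10
blocks = block_partition(data, 4)  # [[1,2,3],[4,5,6],[7,8],[9,10]]
print(global_mean(blocks))         # 5.5
```

      In real SPMD code each rank holds only its own block and never sees the full vector; the combination step is an MPI allreduce rather than a loop over blocks.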

      The advanced topics portion will cover pbdR's connections to scalable libraries and their partitioned data structure requirements. We will cover ways of reading in and redistributing data, including R's native data input, input from HDFS, and scalable input via ADIOS. We will also cover more advanced examples that make use of the distributed data structures and object-oriented components of R, which provide identical parallel and serial syntax in many functions.
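
      As a toy illustration of redistribution (the scalable libraries behind pbdR actually use two-dimensional block-cyclic layouts, which are more involved), the following sketch converts a one-dimensional block distribution into a cyclic one:

```python
def redistribute_block_to_cyclic(blocks):
    # Flatten a block (contiguous) distribution and deal the elements
    # out cyclically, the kind of layout change scalable libraries may
    # require before computation.
    flat = [x for b in blocks for x in b]
    nranks = len(blocks)
    return [flat[r::nranks] for r in range(nranks)]

blocks = [[1, 2, 3], [4, 5, 6]]              # block layout on 2 "ranks"
print(redistribute_block_to_cyclic(blocks))  # [[1, 3, 5], [2, 4, 6]]
```

      In a distributed setting this reshuffle is an all-to-all exchange; the sketch only shows which element lands on which rank.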

      The session will be conducted by George Ostrouchov. George is a Senior Research Scientist in the Scientific Data Group at the Oak Ridge National Laboratory and Joint Faculty Professor of Statistics at the University of Tennessee. His doctoral work was on large sparse least squares computations in data analysis. His research interests have been for many years at the intersection of high performance computing and statistics. He initiated and continues to lead the pbdR project. George is a Fellow of the American Statistical Association.

      • 13:30
        pbdR advanced topics 1h 30m
        Speaker: George Ostrouchov (ORNL)
    • 13:30 15:00
      Hadoop session [CANCELLED]

      • 13:30
        Hands on deploying/using Hadoop 1h 30m
        Speaker: Shadi Ibrahim (INRIA)
    • 15:00 15:30
      Coffee 30m
    • 15:30 17:00
      Analytics session

      • 15:30
        pbdR advanced topics 1h 30m
        Speaker: George Ostrouchov (ORNL)
    • 15:30 17:00
      Hadoop session [CANCELLED]

      • 15:30
        Configuring Hadoop hands on 1h 30m
        Speaker: Shadi Ibrahim (INRIA)
    • 17:30 21:30
      Social event Landek Park

      Landek Park

      • 17:30
        Sightseeing tour to the Coal Mining Museum & Dinner 4h
        A bus will take us to Landek Park and the former Anselm Mine, nowadays the largest mining museum in the Czech Republic. Its unique exhibition highlights the evolution of coal mining in the Ostrava-Karvina region, as well as mining technology and rescue services. In fact, it is the largest exhibition of its kind in the world. See https://www.ostrava.cz/en/turista/co-navstivit/technicke-pamatky/landek-park-en?set_language=en for more details. By the way, IT4Innovations' first supercomputer got its name (Anselm) in honour of the beginnings of coal mining in our region, which took place here at the end of the 18th century. The dinner will be served in the Harenda Miner's Pub, also located in Landek Park, a great place to sit down for a cool beer and period food; its interior is decorated with all kinds of mining memorabilia. Transport to the Mercure and Garni hotels will be provided after the event.
    • 09:00 09:30
      Analytics session

      • 09:00
        Go over HW assignment 30m
        Speaker: George Ostrouchov (ORNL)
    • 09:00 10:30
      Hadoop session [CANCELLED]

      • 09:00
        Go over HW assignment 30m
        Speaker: Shadi Ibrahim (INRIA)
      • 09:30
        Hadoop: Optimizations and open issues 1h
        Speaker: Shadi Ibrahim (INRIA)
    • 09:30 10:30
      High Performance I/O

      The age of "Big Data" has reached high performance computing, and new tools and technologies must be integrated into the software ecosystem of scientists in order to extract knowledge from data. New challenges emerge from the complexities of simulations that integrate more physics and more fidelity, while at the same time deepening memory and storage hierarchies make it dramatically harder to cope with the large volume and high velocity of data. In this tutorial students will learn the best practices and techniques that are crucial to working with exponentially growing data sizes. Our tutorial will teach students the basics of high performance I/O, analytics, and visualization.

      Part I of this tutorial will introduce parallel I/O in general, along with middleware libraries that were created to work with "Big Scientific Data". First, we will teach serial and parallel HDF5 and show how to incorporate it into serial and parallel simulations. Next, we will summarize the lessons and key techniques that our team gained through years of collaboration with domain scientists working in areas such as fusion, combustion, astrophysics, materials science, and seismology. This experience and knowledge resulted in the creation of ADIOS, a tool that makes scaling I/O easy, portable, and efficient.
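
      A core idea of parallel I/O, which parallel HDF5 builds on via MPI-IO, is that each process writes its own block of a shared file at a non-overlapping offset. The following plain-Python sketch simulates the ranks sequentially in one process, with an in-memory buffer standing in for the shared file:

```python
import io
import struct

def rank_write(buffer, rank, local_values):
    # Each rank writes its local block of doubles at an offset computed
    # from its rank, so no two ranks overlap in the shared file.
    itemsize = struct.calcsize("d")
    buffer.seek(rank * len(local_values) * itemsize)
    buffer.write(struct.pack(f"{len(local_values)}d", *local_values))

shared = io.BytesIO()          # stands in for a shared parallel file
nranks, block = 4, 2
for rank in range(nranks):     # ranks simulated sequentially here
    local = [float(rank * block + i) for i in range(block)]
    rank_write(shared, rank, local)

values = struct.unpack(f"{nranks * block}d", shared.getvalue())
print(values)  # (0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0)
```

      In real codes the writes happen concurrently, and collective I/O layers reorganize them into large contiguous requests for the file system; the offset arithmetic is the part this sketch preserves.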

      In the second part of the tutorial we will discuss ADIOS and how it has helped many applications move from I/O-dominated to compute-dominated simulations. We will show the APIs that allow ADIOS to use different methods to write self-describing files and achieve high performance I/O. This will be followed by a hands-on session on how to write/read data and how to use different I/O componentizations inside the ADIOS framework. Part III will teach students how to take advantage of the ADIOS framework to do topology-aware data movement, compression and data staging/streaming using advanced methods.
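
      A file is "self-describing" when the metadata needed to locate and interpret each variable travels with the data. The following plain-Python sketch illustrates that idea with a JSON index in front of a packed payload; the real ADIOS BP and HDF5 formats are, of course, far more elaborate:

```python
import json
import struct

def write_self_describing(variables):
    # Pack named arrays together with their own metadata (shape, offset,
    # count) into one byte stream: a toy self-describing container.
    payload = bytearray()
    index = {}
    for name, (shape, values) in variables.items():
        index[name] = {"shape": shape, "offset": len(payload),
                       "count": len(values)}
        payload += struct.pack(f"{len(values)}d", *values)
    header = json.dumps(index).encode()
    return struct.pack("I", len(header)) + header + bytes(payload)

def read_variable(blob, name):
    # Recover a variable by name using only the embedded metadata.
    hlen = struct.unpack_from("I", blob)[0]
    index = json.loads(blob[4:4 + hlen].decode())
    meta = index[name]
    data_start = 4 + hlen + meta["offset"]
    return struct.unpack_from(f"{meta['count']}d", blob, data_start)

blob = write_self_describing({
    "temperature": ([2, 2], [1.0, 2.0, 3.0, 4.0]),
    "pressure": ([2], [10.0, 20.0]),
})
print(read_variable(blob, "pressure"))  # (10.0, 20.0)
```

      Because each variable carries its name, shape, and location, a reader needs no external schema; this is what lets tools open such files years later and extract just the variables they need.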

      The session will be conducted by Jeremy Logan and Norbert Podhorszki.

      Jeremy Logan is a Computational Scientist at the University of Tennessee and works closely with the Scientific Data Group at Oak Ridge National Laboratory. Jeremy’s research interests include I/O performance, data and workflow management, and the application of domain specific, generative techniques to High Performance Computing.

      Norbert Podhorszki is a Research Scientist in the Scientific Data Group at Oak Ridge National Laboratory. He is the lead developer of ADIOS. He works with application users of the Oak Ridge Leadership Facility to improve their I/O performance using ADIOS. His research interest is in how to enable data processing on-the-fly using memory-to-memory data movements, e.g. for speeding up I/O, coupling simulation codes, and building in-situ workflows.

      • 09:30
        Staging Plugins 1h
        Speakers: Jeremy Logan (University of Tennessee, ORNL), Norbert Podhorszki (ORNL)
    • 10:30 11:00
      Coffee 30m
    • 11:00 12:30
      Hadoop session [CANCELLED]

      • 11:00
        Hands on writing MapReduce applications 1h 30m
        Speaker: Shadi Ibrahim (INRIA)
    • 11:00 12:30
      High Performance I/O

      • 11:00
        Hands on creating/using plugins 1h 30m
        Speakers: Jeremy Logan (University of Tennessee, ORNL), Norbert Podhorszki (ORNL)
    • 12:30 13:00
      Closing 30m
    • 13:00 14:00
      Lunch 1h