Description
The R language has been called the "lingua franca" of data analysis and statistical computing. This tutorial will introduce attendees to the basics of the R language, with a focus on its recent high-performance extensions enabled by the "Programming with Big Data in R" (pbdR) project, which targets medium to large HPC platforms. The overview part of the tutorial will include a brief introduction to R, followed by an introduction to the pbdR project and its use of scalable HPC libraries.
The first hands-on portion will cover pbdR's profiling tools and its simplified interface to MPI. The examples will progress from basic MPI use, through simple examples with data, to more advanced uses in statistics and data analysis. We will also introduce attendees to the concepts of partitioning data across the nodes of a large platform.
The advanced topics portion will cover pbdR's connections to scalable libraries and their partitioned data structure requirements. We will cover ways of reading in and redistributing data, including R's native data input, input from HDFS, and scalable input via ADIOS. We will also cover more advanced examples that use the distributed data structures and object-oriented components of R, so that many functions share identical syntax in parallel and serial code.
The session will be conducted by George Ostrouchov. George is a Senior Research Scientist in the Scientific Data Group at the Oak Ridge National Laboratory and Joint Faculty Professor of Statistics at the University of Tennessee. His doctoral work was on large sparse least squares computations in data analysis. For many years his research interests have been at the intersection of high-performance computing and statistics. He initiated and continues to lead the pbdR project. George is a Fellow of the American Statistical Association.