Speaker
Dr Apurva Nandan
(Centre for Scientific Computing - CSC, Finland)
Description
As the volume of data used in analysis tasks grows rapidly, processing it with standard methods becomes increasingly challenging. Enter Spark, a high-performance distributed computing framework that lets us tackle big-data problems by distributing the workload across a cluster of machines.
This two-day course discusses the advantages of cloud computing for big-data workloads, why you should use Spark for big-data analysis, and why you should care about running Spark in the cloud. It then covers the technical architecture and use cases of Spark, some ways to set it up, best practices, and programming aspects.
The first day includes an overview, architectural concepts, programming with Spark's fundamental data structure, the Resilient Distributed Dataset (RDD), and the basics of machine learning with Spark.
The second day focuses on Spark's SQL module, which lets the user analyse data held in Spark's distributed collections (DataFrames) using traditional SQL queries. It also covers best practices when using Spark, a demo of a working Spark cluster, and the use of Spark Streaming on live Twitter data.
Spark can be an ideal platform for bioinformatics when it comes to building analysis pipelines and workflows. It supports languages such as R, Python, and SQL, which eases the learning curve for practicing bioinformaticians. New Spark libraries for bioinformatics analysis keep appearing, although widespread adoption will take some time because traditional methods need some rewriting for Spark. With the community constantly evolving, this is a good chance to learn Spark and implement your own methods in it for large-scale data analysis.