Apr 23 – 26, 2018
SAS campus
Europe/Bratislava timezone

Analysing large datasets with Apache Spark (6/8)

Apr 25, 2018, 11:00 AM
1h 30m
SAS campus

Plant Science and Biodiversity Center Dúbravská cesta 9 845 23 Bratislava 4 Slovak Republic GPS: 48.17289° N, 17.0665° E

Speaker

Dr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)

Description

As the volume of data used in analysis tasks grows rapidly, processing it with standard methods becomes increasingly challenging. Enter Spark, a high-performance distributed computing framework that lets us tackle big-data problems by distributing the workload across a cluster of machines. This two-day course discusses the advantages of cloud computing for big-data workloads, why you should use Spark for big-data analysis, and why you should care about running Spark in the cloud. It then covers Spark's technical architecture and use cases, some ways to set it up, best practices, and programming aspects.

The first day covers the overview, architectural concepts, programming with Spark's fundamental data structure (the RDD), and the basics of machine learning with Spark. The second day focuses on Spark's SQL module, which lets the user analyse data held in Spark's distributed collections (DataFrames) using traditional SQL queries. It also covers best practices when using Spark, a demo of a working Spark cluster, and using Spark Streaming on live Twitter data.

Spark can be an ideal platform for bioinformatics when it comes to building analysis pipelines and workflows. Spark supports languages such as R, Python, and SQL, which eases the learning curve for practicing bioinformaticians. New Spark libraries for bioinformatics analysis keep appearing, although widespread adoption will take some time because traditional methods need some rewriting for Spark. But with the community constantly evolving, this is a good chance to learn Spark and implement your own methods in it for large-scale data analysis.
