PRACE Seasonal School on Bioinformatics

Europe/Bratislava
SAS campus

Plant Science and Biodiversity Centre
Dúbravská cesta 9
845 23 Bratislava 4
Slovak Republic
GPS: 48.17289° N, 17.0665° E
Description

Bioinformatics has grown rapidly in popularity in recent years for several reasons: the availability of high-throughput sequencing techniques, of high-quality (open-access) databases, and of ready-to-use machine learning frameworks and libraries. The applicability of bioinformatics algorithms and programs is, however, often limited by the computing resources researchers have at hand. The goal of the Seasonal School is to promote HPC (High-Performance Computing) as a tool researchers can use to overcome this barrier limiting their research.

Many tools are available to researchers working in this field, but most of them were developed as desktop applications. In this workshop we will present tools and frameworks designed for parallel computer architectures, and therefore suitable for running on computer clusters and supercomputers. Participants will learn not only about dedicated bioinformatics software, such as BLAST (Basic Local Alignment Search Tool), but also about generally applicable tools, such as the R programming language and the Apache Spark framework, suitable for parallel (pre)processing of Big Data.

The workshop is composed of introductory sessions on bioinformatics and NGS (Next-Generation Sequencing), lectures and hands-on sessions on using R in parallel, a short course on the analysis of large data sets with Apache Spark (primarily using Python), and finally a lecture on BLAST and how to convert BLAST tasks into parallel jobs.

The school is organized by the Computing Centre of the Slovak Academy of Sciences (CC SAS). The venue, the Plant Science and Biodiversity Centre, is located directly on the SAS campus close to the Bratislava city centre. Bratislava is a beautiful historical city with an international airport, not far from Vienna, Budapest and Prague, offering plenty of opportunities for sightseeing and social life.

Event speakers:

  • Prof. Erik Bongcam-Rudloff (Swedish University of Agricultural Sciences)
  • Dr. Seija Sirkiä (Centre for Scientific Computing - CSC, Finland)
  • Mr. Apurva Nandan (CSC)
  • Dr. Kimmo Mattila (CSC)
    • 09:00 09:30
      Opening 30m
      Speaker: Dr Lukáš Demovič (Computing Center of the SAS)
    • 09:30 10:30
      Introduction to bioinformatics 1/2 1h
      Speaker: Prof. Erik Bongcam-Rudloff (Swedish University of Agricultural Sciences)
    • 10:30 11:00
      Coffee break 30m
    • 11:00 12:00
      Introduction to bioinformatics 2/2 1h
      Speaker: Prof. Erik Bongcam-Rudloff (Swedish University of Agricultural Sciences)
    • 12:00 13:30
      Lunch 1h 30m
    • 13:30 15:15
      Parallel programming with R (1/2) 1h 45m
      This lecture is aimed at R users with very limited or no experience in parallel computing. You will learn how and when taking advantage of parallel computing can help you run your R scripts in less time, when it cannot, and how to tell the difference. More importantly, you will get an idea of how to approach parallelising your task in practice. We will consider the Intel Math Kernel Library (MKL) together with Microsoft R Open, and the R packages snow and foreach, both used as backends by various Bioconductor and CRAN packages. The lecture will include live coding demos. Prerequisites: experience in using R for data analysis in research.
      Speaker: Dr Seija Sirkiä (Centre for Scientific Computing - CSC, Finland)
    • 15:15 15:45
      Coffee break 30m
    • 15:45 17:30
      Parallel programming with R (2/2) 1h 45m
      Speaker: Dr Seija Sirkiä (Centre for Scientific Computing - CSC, Finland)
    • 09:00 10:30
      Analysing large datasets with Apache Spark (1/8) 1h 30m
      With the rapid growth in the volume of data used in analysis tasks, it becomes increasingly challenging to process it using standard methods. Enter Spark, a high-performance distributed computing framework that allows us to tackle big-data problems by distributing the workload across a cluster of machines. This two-day course discusses the advantages of cloud computing for big-data workloads, why you should use Spark for big-data analysis, and why you should care about running Spark in the cloud. It then covers Spark's technical architecture and use cases, some ways to set it up, best practices, and programming aspects. The first day includes an overview, architectural concepts, programming with Spark's fundamental data structure (the RDD), and the basics of machine learning with Spark. The second day focuses on Spark's SQL module, which lets the user analyse data in Spark's distributed collections (DataFrames) using traditional SQL queries, along with best practices for using Spark, a demo of a working Spark cluster, and Spark Streaming over live Twitter data. Spark can be an ideal platform for bioinformatics when it comes to building analysis pipelines and workflows. It supports languages such as R, Python, and SQL, which eases learning for practicing bioinformaticians. Spark is constantly gaining new libraries for bioinformatics analysis, although widespread adoption will take some time because traditional methods need some rewriting for Spark. With the community constantly evolving, now is a good chance to learn Spark and implement your own methods in it for large-scale data analysis.
      Speaker: Mr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)
    • 10:30 11:00
      Coffee break 30m
    • 11:00 12:30
      Analysing large datasets with Apache Spark (2/8) 1h 30m
      Speaker: Dr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)
    • 12:30 14:00
      Lunch 1h 30m
    • 14:00 15:30
      Analysing large datasets with Apache Spark (3/8) 1h 30m
      Speaker: Dr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)
    • 15:30 16:00
      Coffee break 30m
    • 16:00 17:30
      Analysing large datasets with Apache Spark (4/8) 1h 30m
      Speaker: Dr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)
    • 19:00 21:30
      Social event (dinner) 2h 30m
    • 09:00 10:30
      Analysing large datasets with Apache Spark (5/8) 1h 30m
      Speaker: Dr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)
    • 10:30 11:00
      Coffee break 30m
    • 11:00 12:30
      Analysing large datasets with Apache Spark (6/8) 1h 30m
      Speaker: Dr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)
    • 12:30 14:00
      Lunch 1h 30m
    • 14:00 15:30
      Analysing large datasets with Apache Spark (7/8) 1h 30m
      Speaker: Dr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)
    • 15:30 16:00
      Coffee break 30m
    • 16:00 17:30
      Analysing large datasets with Apache Spark (8/8) 1h 30m
      Speaker: Dr Apurva Nandan (Centre for Scientific Computing - CSC, Finland)
    • 09:00 10:30
      BLAST (1/2) 1h 30m
      Running BLAST in clusters. NCBI BLAST is one of the most frequently used bioinformatics tools. BLAST answers the question: “Which known sequences are significantly similar to my sample sequence?” The answer to this question is needed in numerous bioinformatics analyses and workflows. As the sequence databases keep growing, as do the sizes of the data sets to be analyzed, an HPC cluster environment is often needed for BLAST analyses. In this half-day session we briefly go through the basic features of BLAST and issues related to maintaining and using BLAST in HPC cluster environments.
      Speaker: Dr Kimmo Mattila (Centre for Scientific Computing - CSC, Finland)
    • 10:30 11:00
      Coffee break 30m
    • 11:00 12:40
      BLAST (2/2) 1h 40m
      Speaker: Dr Kimmo Mattila (Centre for Scientific Computing - CSC, Finland)
    • 12:40 13:00
      Closing remarks 20m
      Speaker: Dr Lukáš Demovič (Computing Center of the SAS)
    • 13:00 14:30
      Lunch 1h 30m
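
The parallel-R sessions in the timetable stress learning when parallel computing will, and will not, shorten a run. One standard first-order model for that judgement is Amdahl's law: if a fraction p of the work can run in parallel on n workers, speedup is bounded by 1 / ((1 - p) + p / n). A minimal sketch, in Python for illustration (the course itself works in R), with a hypothetical 90%-parallel workload:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Best-case speedup when a fraction p of the work runs on n workers.

    Amdahl's law: speedup = 1 / ((1 - p) + p / n).
    """
    return 1.0 / ((1.0 - p) + p / n)

# Even a 90%-parallel script is capped at 10x, and returns diminish quickly:
for n in (2, 4, 16, 64):
    print(f"{n:>2} workers: {amdahl_speedup(0.9, n):.2f}x")
```

The serial fraction dominates quickly, which is why the lecture emphasises telling the difference between tasks that benefit from parallelisation and those that do not.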
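
The Spark sessions centre on the split-apply-combine pattern behind RDD operations such as map and reduceByKey: each partition of the data is processed independently, then the partial results are merged. A rough pure-Python sketch of that pattern on a toy word count (pyspark itself is deliberately not assumed here; the commented pyspark line is only the approximate equivalent):

```python
from collections import Counter
from functools import reduce

# Two "partitions" of a dataset, as Spark would hold them on different nodes.
partitions = [
    ["gene", "protein", "gene"],
    ["protein", "rna"],
]

# Map side: each partition is counted independently (in parallel on a cluster).
partials = [Counter(part) for part in partitions]

# Reduce side: the partial counts are merged into the final result.
totals = reduce(lambda a, b: a + b, partials, Counter())

# In pyspark the same idea reads roughly:
#   sc.parallelize(words).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(dict(totals))
```

Because the map side touches each partition independently, the same code scales from two in-memory lists to a cluster-sized dataset; that locality is what makes the pattern distribute well.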
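
The BLAST lectures note that large query sets are typically broken into many independent cluster jobs. A minimal sketch of the query-splitting step, in Python; the chunk count, the record format and the commented blastn command line are illustrative assumptions, not the course's actual scripts:

```python
def split_fasta(records, n_chunks):
    """Round-robin a list of (header, sequence) records into n_chunks groups.

    Each group can then be written out and searched as its own BLAST job,
    e.g. one task of a cluster array job.
    """
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks

# Ten toy query sequences split for a hypothetical 4-task array job:
records = [(f">seq{i}", "ACGT" * 5) for i in range(10)]
chunks = split_fasta(records, 4)
print([len(c) for c in chunks])  # [3, 3, 2, 2]

# Each chunk would be written to its own file and searched independently, e.g.:
#   blastn -query query_0.fasta -db nt -out hits_0.tsv -outfmt 6
```

Since BLAST hits for one query do not depend on the other queries, the per-chunk results can simply be concatenated afterwards, which is what makes this embarrassingly parallel.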