HPC Pipes
Graduate School of Health and Medical Sciences at University of Copenhagen
This is a generic course. This means that the course is reserved for PhD students at the Graduate School of Health and Medical Sciences at UCPH.
Anyone can apply for the course, but if you are not a PhD student at the Graduate School, you will be placed on the waiting list until the enrollment deadline. After the enrollment deadline, available seats will be allocated to the waiting list.
The course is free of charge for PhD students at Danish universities (except Copenhagen Business School), and for PhD students at NorDoc member faculties. All other participants must pay the course fee.
Learning objectives
A student who has met the objectives of the course will be able to:
1. Explain the purpose and structure of a bioinformatics pipeline
2. Develop and manage reproducible data analysis pipelines using Snakemake or Nextflow
3. Control the software environment of a workflow/pipeline using workspace management tools like Conda and Docker
4. Manage data and computing using best practices (RDM) and appropriate compute provisioning (HPC)
Content
The course HPC-Pipes introduces best practices for setting up, running, and sharing reproducible bioinformatics pipelines and workflows. Rather than instruct on the whys and wherefores of using particular tools for a bioinformatics analysis, we will cover the general process of building a robust pipeline (regardless of data type) using workflow languages, environment/package managers, optimized HPC resources, and FAIRly managed data and tools. On course completion, participants will be able to use this knowledge to design their own custom pipelines with tools appropriate for their individual analysis needs.
The course will provide guidance on how to automate data analysis using common workflow languages such as Snakemake or Nextflow. We will then look at how to ensure that pipelines are reproducible and explore the available options. Participants will learn how to share their data analysis and software with the research community. We will also cover different strategies for managing the research data produced, including the challenges posed by large volumes of data and computational approaches that aid in data organization, documentation, processing, analysis, storage, sharing, and preservation. We will discuss the reasons behind the increasing popularity of Docker and other container technologies, and demonstrate how to use package and environment managers like Conda to control the software environment within a workflow. Finally, participants will learn how to manage and optimize their pipeline projects on HPC platforms, using compute resources efficiently.
Exercises will be run on the UCloud HPC platform, and participants will be expected to build on existing familiarity with bioinformatics tools and the scripting languages bash and R/Python.
Participants
The course is intended for PhD students, postdocs, and junior faculty at SUND who are interested in learning how to construct and manage bioinformatics pipelines and projects on high-performance computing resources.
Requirements
The workshop is for PhD students at SUND who seek to acquire skills in effectively managing data and analyses in bioinformatics. Knowledge of R/Python and bash is required, as well as a basic understanding of an omics analysis pipeline.
We strongly recommend taking this course after completing the course HPC-Launch, a single-day course which covers theoretical concepts for HPC and RDM in health data science.
Relevance to graduate programs
The course is relevant to PhD students from the following graduate programs at the Graduate School of Health and Medical Sciences, UCPH:
- All graduate programmes
Language
English
Form
Lectures with active discussion sessions, interactive demos using the UCloud platform, and group work and exercises navigating UCloud and practicing with workflow languages and tools for RDM-compliant project set-up.
Course director
Anders Krogh
Professor, Head of Center for Health Data Science, Head of Health Data Science Sandbox
Center for Health Data Science
anders.krogh@sund.ku.dk
Teachers
The workshop is provided by project members of the Health Data Science Sandbox, a national training and research infrastructure project.
The Sandbox team is building training resources and guides for learning bioinformatics, predictive modeling in precision medicine, high performance computing and data carpentry.
These resources are accessible to all Danish university employees (PhD students and up) via academic supercomputing infrastructure.
Jennifer Bartell
PhD, Senior consultant and Sandbox project manager
Center for Health Data Science, KU
bartell@sund.ku.dk
Alba Refoyo Martinez
PhD, Data Scientist, Sandbox Team
Center for Health Data Science, KU
alba.martinez@sund.ku.dk
Adrija Kalvisa
PhD, Special Research Consultant
ReNEW Genomics Platform, KU
adrija.kalvisa@sund.ku.dk
Stefano Pupe
PhD, Senior Consultant
Center for Health Data Science, KU
stefano.pupe@sund.ku.dk
Dates
4 - 5 November 2024
Course location
Faculty of Health and Medical Sciences, Panum,
Blegdamsvej 3B, 2200 København.
4-Nov Panum 13.1.41/61
5-Nov Holst Auditorium
Registration
Please register by 10 October 2024
Expected frequency
This course will be repeated in Spring 2025.
Seats for PhD students from other Danish universities will be allocated on a first-come, first-served basis and according to the applicable rules.
Applications from other participants will be considered after the last day of enrollment.
Note: All applicants are asked to submit invoice details in case of no-show, late cancellation or obligation to pay the course fee (typically non-PhD students). If you are a PhD student, your participation in the course must be in agreement with your principal supervisor.