Reproducible Quantitative Data Science
PhD School at the Faculty of SCIENCE at University of Copenhagen
The course runs over 5 days plus personal work: 2 days, course work, 2 days, course work, and 1 day with presentations.
The 5 physical days can be structured in the following fashion: 2 days in the lecture-free week of block 4, 2 days in the fall holiday and one day at the start of December.
Day 1 - Data Collection and data storage:
Date: 19th of June 2025, 09.00-16.00
Venue: TBA
- Introduction to reproducibility: Definitions, issues and origins - lecture (2h)
- Data provenance: keeping track of where data are coming from - lecture and exercises (1h)
- How do you store data on your computer? Data structures and data naming - lecture and exercises (1h)
- Ethics and GDPR - lecture and practical case reviews (2h)
Day 2 - Reproducible designs, protocols and pre-registration:
Date: 20th of June 2025, 09.00-16.00
Venue: TBA
- Concepts and tools for protocol documentation, and study pre-registration - lecture (1.5h)
- Case studies - exercise (1h)
- Using Markdown for documentation - practical (0.5h)
- Version control and social coding with Git and GitHub - practical (3h)
Course work (10 hours):
- Using your PhD research data, protocol, code, etc., write a report (minimum 3 pages) describing your starting point and which measures are already in place to increase reproducibility, as per the concepts presented during days 1 and 2. Which further measures can be taken to increase reproducibility, and if some cannot be implemented, why not?
Day 3 - Better coding:
Date: 21st of August 2025, 09.00-16.00
Venue: TBA
- Literate programming - lecture and exercises (1h)
- Good coding practices - lecture and exercises (2h)
- Time to update your code - practical from student's own analysis scripts (3h)
Day 4 - Better analyses:
Date: 22nd of August 2025, 09.00-16.00
Venue: TBA
- P-hacking your data - lecture (1h)
- Encapsulate code for reproducibility using containers (2h)
- An introduction to computational analysis methods: permutation, bootstrap, cross-validation, out-of-sample generalization - lecture and exercises (3h)
Course work (18 hours):
- Make a copy of existing code that you have used and/or that is used in your lab, and improve its reproducibility using any of the tools reviewed during the course: from better inline documentation and variable naming to updated analyses.
- Prepare a 10-minute presentation summarizing all of your course work and the measures you have taken to improve reproducibility in your PhD.
Day 5 - Data sharing:
Date: 10th of November 2025, 09.00-16.00
Venue: TBA
- The ‘data’ cycle, sharing from raw data to figures - lecture (1h)
- Reproducible publishing - a case study (1h)
- Presentations and discussions/social event with drinks and pizza (4h)
Aim and content
The Reproducible Quantitative Data Science course introduces key concepts, tools and analysis methods for reproducible data analysis in any type of quantitative research study. It is meant as a hands-on crash course in reproducible data analysis for PhD students.
In the course, we will cover research data management and best practices for handling data before introducing the concepts of reproducible designs, protocols and pre-registration of research studies. Next, we will turn to literate programming and good coding practices, with an emphasis on how students can make their own code more reproducible. This includes using version control and encapsulating code using containers. We will then address issues in the actual data analysis and cover computational analysis methods such as permutation, bootstrap, cross-validation and out-of-sample generalization. We will finish the course by introducing the topic of reproducible publishing.
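As an informal illustration of two of the computational methods named above, the following Python sketch shows a permutation test for a difference in means and a percentile bootstrap confidence interval. It is not taken from the course materials: the function names, default parameters and the example measurements are all hypothetical, and it uses only the Python standard library. Fixing the random seed is what makes the result repeatable from run to run.

```python
import random
import statistics


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Hypothetical helper for illustration; returns an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (count + 1) / (n_perm + 1)


def bootstrap_ci(x, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of x."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(x, k=len(x))) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up example measurements (assumption, for illustration only).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [4.2, 4.4, 4.0, 4.8, 4.5, 4.1]

p_value = permutation_test(group_a, group_b)
ci_low, ci_high = bootstrap_ci(group_a)
print(f"permutation p-value: {p_value:.4f}")
print(f"95% bootstrap CI for mean of group_a: [{ci_low:.2f}, {ci_high:.2f}]")
```

Calling either function twice with the same seed returns the same value, which is one simple way to make a resampling-based analysis repeatable.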
Formal requirements
We expect students to join the course several months after starting their PhD, so that they already have data and some code to which they can apply the concepts covered.
We assume that the students have some experience with programming, as analyses cannot be reproduced using a graphical interface but only using code. We’ll try to be as language-agnostic as possible, but prior exposure to Bash/Git, MATLAB or Python is a plus.
During the course, active participation is expected, including sharing an example of code written by the student for code review.
Learning outcome
Knowledge:
> Understand the concepts of reproducible designs, protocols and pre-registration of research studies
> Understand good coding practices
> Understand computational analysis methods such as permutation, bootstrap, cross-validation and out-of-sample generalization
Skills:
> Version control and social coding
> Develop literate programming and good coding practices
> Encapsulate code for reproducibility using containers
Competences:
> Propose measures to increase reproducibility in their own PhD research data analysis
> Prepare a manuscript in a reproducible fashion
Literature
All the course literature is collected in a Zotero group. To be added to the group, e-mail the course responsible, Melanie Ganz-Benjaminsen (ganz@di.ku.dk).
Target group
The number of participants is limited to 30, and priority will be given to PhD students from UCPH-SCIENCE and UCPH-SUND.
Teaching and learning methods
Students need to prepare before the course by going through the provided reading material.
During the physical meeting days, we intersperse lectures with exercises. A full overview of our teaching materials is publicly available on GitHub:
https://github.com/CPernet/ReproducibleQuantitativeDataScience
Between the physical meetings the students will individually work on exercises.
Lecturers
UCPH lecturers
> Senior Scientist Cyril Pernet https://di.ku.dk/CP
> Associate Prof. Melanie Ganz-Benjaminsen https://research.ku.dk/MG-B
Guest Lecturers
Physical visitors:
> Russ Poldrack, Stanford University, poldrack@stanford.edu
> Robert Oostenveld, Radboud University, r.oostenveld@donders.ru.nl
> Michael Hanke, Forschungszentrum Jülich GmbH, m.hanke@fz-juelich.de
Zoom lecturers from the US/Canada:
> Jean Baptiste Poline, McGill
> Ariel Rokem, University of Washington
Remarks
There is no participation fee for PhD students enrolled at a Danish institution/university.
All other students are required to pay the participation fee of 3000 DKK.
***