Reproducible Research
Astro 497, Week 13, Monday
What should we expect of science?
Reproducible
Replicable
Valid
Historically, different fields of science have used these terms in different ways. As their importance became more widely recognized, the National Academies produced a report that attempts to standardize language.
Reproducibility
"obtaining consistent results using the same input data, computational steps, methods, and code, and conditions of analysis."
Focuses on the reliability of the computations and their implementation
If a study isn't reproducible, then there are likely errors that should be corrected.
(Some subtleties in the context of stochastic algorithms)
Minimal requirement for a study to be trusted.
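The subtlety with stochastic algorithms can be illustrated with a minimal Julia sketch. It assumes a toy Monte Carlo estimate of π; the function name `estimate_pi` and the seed values are hypothetical choices for illustration, not part of the original lecture. Fixing the seed of an explicit random number generator makes a rerun bit-for-bit identical, while a different seed should give a result that is consistent only within the Monte Carlo uncertainty.

```julia
using Random

"Monte Carlo estimate of π from `n` random points; `seed` fixes the random stream."
function estimate_pi(n; seed)
    rng = MersenneTwister(seed)   # explicit generator owned by this call
    # Count points in the unit square that fall inside the quarter circle
    hits = count(_ -> rand(rng)^2 + rand(rng)^2 <= 1, 1:n)
    return 4 * hits / n
end

a = estimate_pi(100_000; seed = 497)
b = estimate_pi(100_000; seed = 497)   # same seed: identical result
c = estimate_pi(100_000; seed = 498)   # different seed: consistent, not identical

println("a = $a, b = $b, c = $c")
```

Julia's documentation warns that the stream produced for a given seed may change between Julia releases, so the Julia version should be recorded alongside the seed.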
Replicability
"obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data"
Robustness of a scientific conclusion, given the researcher's choices (e.g., definition of sample, analysis method), while allowing for natural variations in data.
Even if a study isn't replicable, it still might be high-quality science.
Validity
"obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data"
Robustness of a scientific conclusion
Making research replicable & valid is very hard!
"when a researcher transparently reports a study and makes available the underlying digital artifacts, such as data and code, the results should be computationally reproducible. In contrast, even when a study was rigorously conducted according to best practices, correctly analyzed, and transparently reported, it may fail to be replicated."
Common Barriers to Reproducibility
Inadequate recordkeeping (e.g., failing to archive data & metadata)
Availability of data & metadata (e.g., not sharing data)
Obsolescence of data (e.g., glass plates, digital media, file formats,...)
Obsolescence of code (e.g., programs, libraries, compilers, OS,...)
Flaws in attempt to replicate (e.g., lack of expertise, failure to follow protocols)
Barriers in the culture of research: resources & incentives
How is astronomy doing?
Good
Federally funded observatories (and many larger private ones) have archives for their data.
Institutional & discipline-specific services for archiving data products:
ScholarSphere & Data Commons (Penn State)
Zenodo (CERN)
Dataverse (Harvard)
SciServer (JHU)
FITS format has been standardized since 1981.
Programming languages used for Data Science (e.g., Julia, Python, R) incorporate package managers
Funding agencies & AAS journals increasingly encourage archiving data and providing the "data behind the figures".
Not-so-good
Smaller, private observatories are less likely to have funding to archive data
Higher-level data products, and the metadata and documentation needed to make use of them, are much less likely to be archived
Large datasets often need performant file formats that have yet to prove their longevity (e.g., HDF5)
Most computational R and Python packages rely on underlying C/C++ or Fortran code, which in turn relies on Makefiles that are customized by hand for different architectures.
Making research reproducible takes serious time and funding. When there are finite resources, difficult choices have to be made.
Common Barriers to Replicability
Human error (typically unintentional)
Misuse of statistical methods
Publication bias
Inadequate experimental design
Inadequate reporting of study protocols
Incentive system that encourages "significant" results
Failure to Replicate can lead to Scientific Progress!
Different research groups can make different, reasonable choices
One (or more) of those choices affects the results
Subsequent investigation identifies which choice(s) were responsible for the different outcomes
Only works if both groups precisely document their choices.
Who is reproducing & replicating research?
Original investigator(s) reproducing their own results to convince themselves (most common)
Original investigator(s) reproducing their own results to convince others (e.g., collaborators, other scientists in the field, or industry/government), particularly if a result is highly surprising or has significant ramifications
Different investigators (potentially from the same or a different lab) may attempt to replicate a study using a data set they are collecting as a stepping stone in their research process.
Different investigators may try to build on a previous study, not succeed, and then decide to try to replicate the previous study to identify why they didn't succeed.
Maybe no one.
Strategies to make your work reproducible
Make input data publicly available (when allowed & ethical)
Use open-source software for analysis
Use a package manager to completely specify the languages, libraries & packages used.
Version control source code and scripts
Only use results, tables & figures generated by scripts
For complex calculations, use workflow management software
Make code used to generate results public
Archive code & data
Provide sufficient documentation for others to reproduce calculations.
Encourage a team to replicate your results from the documentation you've provided.
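Several of the strategies above (archiving data, specifying the software used, generating results only from scripts) come down to recording exactly what went into a result. One possible way to do this is sketched below in Julia, using only the standard libraries `SHA` and `Dates`. The helper names `file_sha256` and `provenance_record`, the fields of the record, and the example file are all hypothetical.

```julia
using SHA, Dates

"SHA-256 checksum of the file at `path`, as a hex string."
file_sha256(path) = open(io -> bytes2hex(sha256(io)), path)

"Collect details someone would need to rerun an analysis on the same inputs."
function provenance_record(data_path; seed)
    return (
        julia_version = string(VERSION),          # language version in use
        data_file = basename(data_path),
        data_sha256 = file_sha256(data_path),     # detects altered or corrupted inputs
        seed = seed,                              # random seed used by the analysis
        recorded_at = string(now()),
    )
end

# Usage with a throwaway file standing in for real input data
data_path = joinpath(mktempdir(), "example_input.csv")
write(data_path, "time,flux\n0.0,1.00\n1.0,0.98\n")
record = provenance_record(data_path; seed = 497)
println(record)
```

A record like this could be saved next to each table or figure. In a Julia project, the package versions themselves are pinned by the `Project.toml` and `Manifest.toml` files that the package manager maintains, so archiving those two files with the code covers the "libraries & packages" part.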
Tools to automate workflows
Build tools: make, cmake
Scientific workflows: Snakemake, Galaxy, Nextflow, BigDataScript, ...
Example scripts for code/data behind figures in AAS journals: Showyourwork
Dangers of Big Data
Multiple testing: Performing many possible tests (explicitly or "by eye") and then reporting the one that appears to be significant in isolation
$p$-hacking: "the practice of collecting, selecting, or analyzing data until a result of statistical significance is found" (RRiS 2019)
Overfitting: Overconfidence in model performance, especially when applied to out-of-sample data
Machine learning models: Overreliance on optimizing predictive performance using complex models, rather than prioritizing interpretability and explainability
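The multiple testing danger listed above can be made concrete with a small simulation, sketched here in Julia. It assumes independent tests and a significance threshold of 0.05. The function name `any_false_positive_rate` and all parameter values are hypothetical choices for illustration. Because every null hypothesis in the simulation is true, any "significant" result is a false positive.

```julia
using Random

"""
Fraction of simulated studies in which at least one of `n_tests` independent
tests comes out "significant" even though every null hypothesis is true.
"""
function any_false_positive_rate(n_tests; alpha = 0.05, n_studies = 10_000, seed = 497)
    rng = MersenneTwister(seed)
    # Under a true null hypothesis, a p-value is uniform on [0, 1],
    # so rand(rng) stands in for the p-value of one test.
    n_flagged = count(1:n_studies) do _
        any(rand(rng) < alpha for _ in 1:n_tests)
    end
    return n_flagged / n_studies
end

single = any_false_positive_rate(1)        # one pre-planned test
many = any_false_positive_rate(20)         # report the best of 20 tests
expected_many = 1 - (1 - 0.05)^20          # analytic value for independent tests
# Bonferroni correction (a standard remedy, not discussed in the lecture):
corrected = any_false_positive_rate(20; alpha = 0.05 / 20)

println("1 test: $single, 20 tests: $many (analytic $expected_many), corrected: $corrected")
```

For 20 independent tests, the analytic chance of at least one false positive is about 64%, far above the nominal 5% of a single test. Dividing the threshold by the number of tests brings the family-wise rate back to roughly 5%.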
Other relevant terms:
Rigor
"the strict application of the scientific method to ensure robust and unbiased experimental design"
NIH 2018 via Reproducibility & Replicability in Science (2019)
Reliability
"A predominant focus on the replicability of individual studies is an inefficient way to assure the reliability of scientific knowledge. Rather, reviews of cumulative evidence on a subject, to assess both the overall effect size and generalizability, is often a more useful way to gain confidence in the state of scientific knowledge."
Generalizability
"Generalizability, another term frequently used in science, refers to the extent that results of a study apply in other contexts or populations that differ from the original one."
Transparency
Setup/Helper Code
using PlutoUI, PlutoTeachingTools
Built with Julia 1.8.3, PlutoTeachingTools 0.2.5, and PlutoUI 0.7.48
To run this tutorial locally, download this file and open it with Pluto.jl.