Reproducible Research

Astro 497, Week 13, Monday

What should we expect of science?

  • Reproducible

  • Replicable

  • Valid

Historically, different fields of science have used these terms in different ways. As their importance became more widely recognized, the National Academies produced a report that attempts to standardize language.

Reproducibility

"obtaining consistent results using the same input data, computational steps, methods, and code, and conditions of analysis."

– Reproducibility & Replicability in Science (2019)

  • Focuses on the reliability of the computations and their implementation

  • If a study isn't reproducible, then there are likely errors that should be corrected.

  • (Some subtleties arise in the context of stochastic algorithms; see the seeding sketch after this list)

  • Minimal requirement for a study to be trusted.
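
For stochastic algorithms, one common way to recover exact reproducibility is to record and fix the pseudorandom number generator seed (results may still differ across language or library versions, since default RNG streams can change between releases). A minimal sketch in Julia, with an arbitrary seed and sample size:

using Random, Statistics

Random.seed!(42)                 # fix the RNG seed so the stochastic part can be rerun exactly
draws = randn(10_000)            # stand-in for a Monte Carlo step in an analysis
println("sample mean = ", mean(draws))   # rerunning the script reproduces this value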

Replicability

"obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data”

– Reproducibility & Replicability in Science (2019)

  • Robustness of a scientific conclusion... given the researcher's choices (e.g., definition of sample, analysis method), but allowing for natural variations in data.

  • Even if a study isn't replicable, it still might be high-quality science.

Validity

"obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data”

–- Reproducibility & Replicability in Science (2019)

  • Robustness of a scientific conclusion

Making research replicable & valid is very hard!

"when a researcher transparently reports a study and makes available the underlying digital artifacts, such as data and code, the results should be computationally reproducible. In contrast, even when a study was rigorously conducted according to best practices, correctly analyzed, and transparently reported, it may fail to be replicated."

– Reproducibility & Replicability in Science (2019)

Common Barriers to Reproducibility

  • Inadequate recordkeeping (e.g., failing to archive data & metadata)

  • Availability of data & metadata (e.g., not sharing data)

  • Obsolescence of data (e.g., glass plates, digital media, file formats,...)

  • Obsolescence of code (e.g., programs, libraries, compilers, OS,...)

  • Flaws in attempt to replicate (e.g., lack of expertise, failure to follow protocols)

  • Barriers in the culture of research: resources & incentives

How is astronomy doing?

Good

  • Federally funded observatories (and many larger private ones) have archives for their data.

  • Institutional & discipline-specific services for archiving data products:

    • ScholarSphere & Data Commons (Penn State)

    • Zenodo (CERN)

    • Dataverse (Harvard)

    • SciServer (JHU)

  • The FITS format has been standardized since 1981 (see the short FITSIO.jl example after this list).

  • Programming languages used for Data Science (e.g., Julia, Python, R) incorporate package managers

  • Funding agencies & AAS journals increasingly encourage archiving data and providing the "data behind the figures".
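
As an illustration of how a long-lived standard pays off, a FITS file can be read in a few lines of Julia using the FITSIO.jl package (one option among several; the file name below is a hypothetical placeholder):

using FITSIO                      # wrapper around the standard CFITSIO library

f = FITS("observation.fits")      # hypothetical file name
hdr = read_header(f[1])           # header keywords of the primary HDU
img = read(f[1])                  # image data from the primary HDU (if it contains an image)
close(f)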

Not-so-good

  • Smaller, private observatories less likely to have funding to archive data

  • Higher-level data products, along with the metadata and documentation necessary to make use of them, are much less likely to be archived

  • Large datasets often need performant file formats that have yet to prove their longevity (e.g., HDF5)

  • Most computational packages in R and Python rely on underlying C/C++ or Fortran code, which in turn relies on Makefiles customized by hand for different architectures.

  • Making research reproducible takes serious time and funding. When there are finite resources, difficult choices have to be made.

Common Barriers to Replicability

  • Human error (typically unintentional)

  • Misuse of statistical methods

  • Publication bias

  • Inadequate experimental design

  • Inadequate reporting of study protocols

  • Incentive system that encourages "significant" results

Failure to Replicate can lead to Scientific Progress!

  • Different research groups can make different, reasonable choices

  • One (or more) choices affect results

  • Subsequent investigation identifies which choice(s) were responsible for the different outcomes

  • Only works if both groups precisely document their choices.

Who is reproducing & replicating research?

  • Original investigator(s) reproducing their own results to convince themselves (most common)

  • Original investigator(s) reproducing their own results to convince others (e.g., collaborators, other scientists in the field, or industry/government), particularly if a result is highly surprising or has significant ramifications

  • Different investigators (potentially from same or different lab) may attempt to replicate a study using a data set they are collecting as a stepping stone in their research process.

  • Different investigators may try to build on a previous study, not succeed, and then decide to try to replicate the previous study to identify why they didn't succeed.

  • Maybe no one.

Strategies to make your work reproducible

  • Make input data publicly available (when allowed & ethical)

  • Use open-source software for analysis

  • Use a package manager to completely specify the languages, libraries & packages used (see the Pkg sketch after this list).

  • Version control source code and scripts

  • Only use results, tables & figures generated by scripts

  • For complex calculations, use workflow management software

  • Make code used to generate results public

  • Archive code & data

  • Provide sufficient documentation for others to reproduce calculations.

  • Encourage a team to replicate your results from the documentation you've provided.
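
For Julia, the built-in package manager can record an exact, shareable environment in a Project.toml/Manifest.toml pair that is committed alongside the code. A minimal sketch (the package names are just examples):

using Pkg

Pkg.activate(".")                    # create/use a project environment in the current directory
Pkg.add(["CSV", "DataFrames"])       # example dependencies, recorded in Project.toml
Pkg.status()                         # exact resolved versions are pinned in Manifest.toml

# A collaborator can recreate the identical environment later with:
#   Pkg.activate("."); Pkg.instantiate()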

Tools to automate workflows

  • Build tools: make, cmake (see the toy dependency-check sketch after this list)

  • Scientific workflows: Snakemake, Galaxy, Nextflow, BigDataScript, ...

  • Example scripts for code/data behind figures in AAS journals: Showyourwork
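
The core idea behind build tools like make is to regenerate an output only when its inputs have changed. A toy illustration of that dependency check in Julia (the file names and the copy step are hypothetical placeholders, not a replacement for a real workflow tool):

# Rebuild `output` from `input` only if the output is missing or older than the input.
function rebuild_if_stale(build::Function, input::String, output::String)
    if !isfile(output) || mtime(input) > mtime(output)
        @info "Rebuilding $output from $input"
        build(input, output)
    else
        @info "$output is up to date"
    end
end

# Hypothetical step: copy a cleaned data file into a results file.
rebuild_if_stale("data.csv", "results.csv") do src, dst
    cp(src, dst; force=true)
end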

Dangers of Big Data

  • Multiple testing: Perform many possible tests (explicitly or "by eye") and then report one that appears to be significant in isolation (see the simulation after this list)

  • p-hacking: "the practice of collecting, selecting, or analyzing data until a result of statistical significance is found" (RRiS 2019)

  • Overfitting: Overconfidence in model performance, especially when applied to out-of-sample data

  • Machine learning models: Overreliance on optimizing predictive performance using complex models, rather than prioritizing interpretability and explainability
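
The multiple-testing danger can be quantified: with 20 independent tests of true null hypotheses at the 5% level, the chance of at least one spurious "detection" is 1 - 0.95^20 ≈ 0.64. A small simulation illustrating this (the numbers of observations, tests, and trials are arbitrary choices):

using Random, Statistics

Random.seed!(1)
n_obs, n_tests, n_trials = 100, 20, 10_000
threshold = 1.96                 # two-sided 5% threshold for a z-statistic under the null

# Each trial draws n_tests independent pure-noise datasets and checks whether any
# of their z-statistics would look "significant" when reported in isolation.
one_trial() = any(abs(mean(randn(n_obs)) * sqrt(n_obs)) > threshold for _ in 1:n_tests)
frac = mean(one_trial() for _ in 1:n_trials)

println("Fraction of trials with at least one spurious detection: ", frac)
# Analytic expectation: 1 - 0.95^20 ≈ 0.64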

Other relevant terms:

Rigor

"the strict application of the scientific method to ensure robust and unbiased experimental design"

– NIH 2018, via Reproducibility & Replicability in Science (2019)

Reliability

"A predominant focus on the replicability of individual studies is an inefficient way to assure the reliability of scientific knowledge. Rather, reviews of cumulative evidence on a subject, to assess both the overall effect size and generalizability, is often a more useful way to gain confidence in the state of scientific knowledge."

– Reproducibility & Replicability in Science (2019)

Generalizability

"Generalizability, another term frequently used in science, refers to the extent that results of a study apply in other contexts or populations that differ from the original one."

– Reproducibility & Replicability in Science (2019)

Transparency

Setup/Helper Code

     
# Packages providing the interactive widgets and formatting used in this notebook
using PlutoUI, PlutoTeachingTools

Built with Julia 1.8.3 and

PlutoTeachingTools 0.2.5
PlutoUI 0.7.48

To run this tutorial locally, download this file and open it with Pluto.jl.