Reproducible Research

Astro 497, Week 13, Monday

What should we expect of science?

  • Reproducible

  • Replicable

  • Valid

Historically, different fields of science have used these terms in different ways. As their importance became more widely recognized, the National Academies produced a report that attempts to standardize language.


Reproducibility:

"obtaining consistent results using the same input data, computational steps, methods, and code, and conditions of analysis."

–- Reproducibility & Replicability in Science (2019)

  • Focuses on the reliability of the computations and their implementation

  • If a study isn't reproducible, then there are likely errors that should be corrected.

  • (Some subtleties in the context of stochastic algorithms)

  • Minimal requirement for a study to be trusted.
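One of the subtleties with stochastic algorithms is that a result is only reproducible if the random number stream is controlled. The following is a minimal sketch, not taken from the course materials: the function name `monte_carlo_mean` and the seed values are hypothetical, chosen only for illustration.

```python
import random


def monte_carlo_mean(n_samples, seed):
    """Estimate the mean of a standard normal from n_samples random draws."""
    # A local generator avoids hidden global state shared with other code.
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(draws) / n_samples


run_a = monte_carlo_mean(10_000, seed=497)
run_b = monte_carlo_mean(10_000, seed=497)  # same seed as run_a
run_c = monte_carlo_mean(10_000, seed=498)  # different seed

# Same seed, same code, same environment: the two estimates are identical.
print("same seed identical:", run_a == run_b)
# A different seed gives a different number that should still agree
# within the Monte Carlo uncertainty (about 0.01 for 10,000 draws).
print("difference between seeds:", abs(run_a - run_c))
```

Recording the seed alongside the results makes a run repeatable in the same environment. Identical output across different library versions or platforms is not guaranteed in general, which is one reason the software environment also needs to be recorded.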


Replicability:

"obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data"

–- Reproducibility & Replicability in Science (2019)

  • Robustness of a scientific conclusion... given the researcher's choices (e.g., definition of sample, analysis method), but allowing for natural variations in data.

  • Even if a study isn't replicable, it still might be high-quality science.


"obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data"

–- Reproducibility & Replicability in Science (2019)

  • Robustness of a scientific conclusion

Making research replicable & valid is very hard!

"when a researcher transparently reports a study and makes available the underlying digital artifacts, such as data and code, the results should be computationally reproducible. In contrast, even when a study was rigorously conducted according to best practices, correctly analyzed, and transparently reported, it may fail to be replicated."

–- Reproducibility & Replicability in Science (2019)

Common Barriers to Reproducibility

  • Inadequate recordkeeping (e.g., failing to archive data & metadata)

  • Availability of data & metadata (e.g., not sharing data)

  • Obsolescence of data (e.g., glass plates, digital media, file formats,...)

  • Obsolescence of code (e.g., programs, libraries, compilers, OS,...)

  • Flaws in attempt to replicate (e.g., lack of expertise, failure to follow protocols)

  • Barriers in the culture of research: resources & incentives

How is astronomy doing?


  • Federally funded observatories (and many larger private ones) have archives for their data.

  • Institutional & discipline-specific services for archiving data products:

    • ScholarSphere & Data Commons (Penn State)

    • Zenodo (CERN)

    • Dataverse (Harvard)

    • SciServer (JHU)

  • FITS format has been standardized since 1981.

  • Programming languages used for Data Science (e.g., Julia, Python, R) incorporate package managers

  • Funding agencies & AAS journals increasingly encourage archiving data and providing the "data behind the figures".


  • Smaller, private observatories are less likely to have funding to archive data

  • Higher-level data products, and the metadata and documentation needed to make use of them, are much less likely to be archived

  • Large datasets often need performant file formats that have yet to prove their longevity (e.g., HDF5)

  • Most computational R and Python packages rely on underlying C/C++ or Fortran code, which in turn often relies on Makefiles that are customized by hand for different architectures.

  • Making research reproducible takes serious time and funding. When there are finite resources, difficult choices have to be made.

Common Barriers to Replicability

  • Human error (typically unintentional)

  • Misuse of statistical methods

  • Publication bias

  • Inadequate experimental design

  • Inadequate reporting of study protocols

  • Incentive system that encourages "significant" results

Failure to Replicate can lead to Scientific Progress!

  • Different research groups can make different, reasonable choices

  • One (or more) of those choices affects the results

  • Subsequent investigation identifies which choice(s) were responsible for the different outcomes

  • Only works if both groups precisely document their choices.

Who is reproducing & replicating research?

  • Original investigator(s) reproducing their own results to convince themselves (most common)

  • Original investigator(s) reproducing their own results to convince others (e.g., collaborators, other scientists in the field, or industry/government), particularly if a result is highly surprising or has significant ramifications

  • Different investigators (potentially from same or different lab) may attempt to replicate a study using a data set they are collecting as a stepping stone in their research process.

  • Different investigators may try to build on a previous study, not succeed, and then decide to try to replicate the previous study to identify why they didn't succeed.

  • Maybe no one.

Strategies to make your work reproducible

  • Make input data publicly available (when allowed & ethical)

  • Use open-source software for analysis

  • Use a package manager to completely specify the languages, libraries & packages used.

  • Version control source code and scripts

  • Only use results, tables & figures generated by scripts

  • For complex calculations, use workflow management software

  • Make code used to generate results public

  • Archive code & data

  • Provide sufficient documentation for others to reproduce calculations.

  • Encourage a team to replicate your results from the documentation you've provided.
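Several of these strategies (archiving data, specifying package versions, documenting the calculation) can be supported by writing a small provenance record next to each result. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of_file` and `provenance_record`, the record layout, and the demo file are all hypothetical, not a prescribed format.

```python
import hashlib
import json
import os
import platform
import sys
import tempfile
from importlib import metadata


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(input_paths, packages):
    """Collect interpreter, platform, package versions and input checksums."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "inputs": {str(p): sha256_of_file(p) for p in input_paths},
    }


# Demo with a throwaway file standing in for a real input data set.
with tempfile.NamedTemporaryFile(suffix=".dat", delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name

record = provenance_record([demo_path], ["definitely-not-installed-package"])
os.remove(demo_path)
print(json.dumps(record, indent=2))
```

A record like this does not replace a package manager's lock file or a data archive, but it lets a later reader check that they are running the same inputs and a comparable environment.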

Tools to automate workflows

  • Build tools: make, cmake

  • Scientific workflows: Snakemake, Galaxy, Nextflow, BigDataScript, ...

  • Example scripts for code/data behind figures in AAS journals: Showyourwork
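The core idea shared by build tools such as make is to rerun a step only when its output is missing or older than its inputs. The following is a rough sketch of that idea in Python, assuming file modification times are a good enough signal; `needs_rebuild`, `run_step` and the file names are hypothetical, and real workflow tools do considerably more (dependency graphs, parallelism, environments).

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def run_step(target, dependencies, action):
    """Run action() only when target is stale; return whether it ran."""
    if needs_rebuild(target, dependencies):
        action()
        return True
    return False


with tempfile.TemporaryDirectory() as workdir:
    data = os.path.join(workdir, "data.csv")      # hypothetical input file
    figure = os.path.join(workdir, "figure.txt")  # hypothetical output file
    with open(data, "w") as f:
        f.write("x,y\n1,2\n")

    def make_figure():
        # Stand-in for a plotting script: summarize the input file.
        with open(data) as src, open(figure, "w") as dst:
            dst.write(f"rows: {len(src.readlines()) - 1}\n")

    first = run_step(figure, [data], make_figure)   # target missing: runs
    second = run_step(figure, [data], make_figure)  # up to date: skipped
    print("first run executed:", first)
    print("second run executed:", second)
```

Expressing each figure or table as a target with explicit dependencies is what makes it practical to use only results generated by scripts.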

Dangers of Big Data

  • Multiple testing: Perform many possible tests (explicitly or "by eye") and then report one that appears to be significant in isolation

  • $p$-hacking: "the practice of collecting, selecting, or analyzing data until a result of statistical significance is found" (RRiS 2019)

  • Overfitting: Overconfidence in model performance, especially when applied to out-of-sample data

  • Machine learning models: Overreliance on optimizing predictive performance using complex models, rather than prioritizing interpretability and explainability
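The multiple testing problem can be illustrated with a small simulation in which every data set is pure noise, so any "significant" result is a false positive. This is a sketch under stated assumptions: a two-sided z-test with known unit variance, a significance threshold of 0.05, and a Bonferroni correction as one simple (and conservative) remedy. The function names and parameter values are hypothetical.

```python
import math
import random


def two_sided_p_value(sample):
    """p-value for H0: mean = 0, assuming known unit variance (z-test)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_false_positives(n_tests, n_per_test, alpha, seed):
    """Run n_tests tests on pure noise and count the 'detections'."""
    rng = random.Random(seed)
    p_values = [
        two_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(n_per_test)])
        for _ in range(n_tests)
    ]
    # Naive: each test is judged in isolation at level alpha.
    naive = sum(p < alpha for p in p_values)
    # Bonferroni: divide alpha by the number of tests performed.
    bonferroni = sum(p < alpha / n_tests for p in p_values)
    return naive, bonferroni


naive, corrected = count_false_positives(
    n_tests=1000, n_per_test=25, alpha=0.05, seed=497
)
print("naive 'detections' in pure noise:", naive)
print("after Bonferroni correction:", corrected)
```

With 1000 tests at a threshold of 0.05, roughly 50 spurious detections are expected from noise alone, so reporting any one of them in isolation would be misleading. The correction only works if the number of tests actually performed, including informal ones done "by eye", is counted honestly.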

Other relevant terms:


Rigor:

"the strict application of the scientific method to ensure robust and unbiased experimental design"

–- NIH 2018 via Reproducibility & Replicability in Science (2019)


"A predominant focus on the replicability of individual studies is an inefficient way to assure the reliability of scientific knowledge. Rather, reviews of cumulative evidence on a subject, to assess both the overall effect size and generalizability, is often a more useful way to gain confidence in the state of scientific knowledge."

–- Reproducibility & Replicability in Science (2019)


"Generalizability, another term frequently used in science, refers to the extent that results of a study apply in other contexts or populations that differ from the original one."

–- Reproducibility & Replicability in Science (2019)


