Reproducible Research in R (and friends)

Cheatsheet

Author: M. Rolland, MSc

Published: June 8, 2024

Modified: November 21, 2024

This cheatsheet provides essential guidelines and best practices for conducting reproducible research using R and related tools. It covers project organization, version control, data management, documentation, environment management, workflow automation, and more. For any suggestions or feedback, please feel free to email me.

1 Project Organization

  • Use a consistent folder structure:
    • data/ - Analysis data files
    • scripts/ - Analysis scripts
    • outputs/ - Results (figures, tables)
    • docs/ - Documentation and reports
  • Use RStudio Projects to facilitate project management and environment isolation
  • Maintain a project log to document progress, changes, and key decisions throughout the analysis
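
The folder layout above can be created from R in one step. The following is a minimal sketch in base R; the helper name `create_project_skeleton()` and the project name `my-analysis` are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: creates the four top-level folders listed above.
create_project_skeleton <- function(root) {
  folders <- c("data", "scripts", "outputs", "docs")
  paths <- file.path(root, folders)
  for (p in paths) {
    # recursive = TRUE also creates `root` itself;
    # showWarnings = FALSE keeps repeated runs quiet
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Demonstrated in a temporary directory so that nothing is written
# to the current working directory
project_root <- file.path(tempdir(), "my-analysis")
create_project_skeleton(project_root)
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```

In a real project, `root` would typically be the folder that holds the RStudio Project file.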

2 Version Control

  • Use Git to track changes in scripts and documents
  • Commit regularly with meaningful messages
  • Use one repository per analysis
  • List your data/ folder in the .gitignore file so that data files are never committed
  • Keep sensitive information (passwords, API keys, personal data) out of your code
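
One possible way to keep data/ out of the repository is to add it to .gitignore from R. This is a sketch in base R only; the helper `add_to_gitignore()`, the demo folder, and the environment variable name `MY_API_KEY` are assumptions made for illustration.

```r
# Hypothetical helper: appends entries to .gitignore, skipping any
# entry that is already present.
add_to_gitignore <- function(entries, project_root = ".") {
  ignore_file <- file.path(project_root, ".gitignore")
  existing <- if (file.exists(ignore_file)) readLines(ignore_file) else character(0)
  updated <- c(existing, setdiff(entries, existing))
  if (!identical(updated, existing)) {
    writeLines(updated, ignore_file)
  }
  invisible(updated)
}

# Demonstrated in a temporary directory
demo_root <- file.path(tempdir(), "gitignore-demo")
dir.create(demo_root, showWarnings = FALSE)
add_to_gitignore(c("data/", ".Renviron"), demo_root)
add_to_gitignore("data/", demo_root)  # second call changes nothing
readLines(file.path(demo_root, ".gitignore"))

# Secrets can be read from an environment variable instead of being
# written in a script; Sys.getenv() returns "" when the variable is unset.
api_key <- Sys.getenv("MY_API_KEY")
```

Ignoring `.Renviron` as well means that environment variables stored in that file stay out of version control.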

3 Data Management

  • Store raw data in data/raw/ and never modify it directly
  • Produce a README describing the source data
  • Use scripts to clean and process data, save the cleaned data in data/processed/
  • Document each step of data cleaning
  • Keep data cleaning separate from analysis
  • Organize your data in a tidy format: each variable is a column, each observation is a row
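
The separation between raw and processed data can be sketched as follows. The file name `weights.csv`, the column names, the values, and the function `clean_weights()` are all made up for this example; only base R is used.

```r
# Temporary project root for the demonstration
root <- file.path(tempdir(), "data-demo")
dir.create(file.path(root, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data", "processed"), recursive = TRUE, showWarnings = FALSE)

# Made-up raw data in wide format: one weight column per visit
raw_path <- file.path(root, "data", "raw", "weights.csv")
write.csv(
  data.frame(id = c(1, 2, 3),
             weight_v1 = c(60.2, NA, 81.0),
             weight_v2 = c(61.0, 72.5, 80.4)),
  raw_path, row.names = FALSE
)

# Hypothetical cleaning step: reads the raw file, reshapes it to a tidy
# format (one row per id and visit), drops missing weights, and writes
# the result to data/processed/. The raw file is only ever read.
clean_weights <- function(raw_path, processed_path) {
  raw <- read.csv(raw_path)
  tidy <- data.frame(
    id = rep(raw$id, times = 2),
    visit = rep(c(1L, 2L), each = nrow(raw)),
    weight = c(raw$weight_v1, raw$weight_v2)
  )
  tidy <- tidy[!is.na(tidy$weight), ]
  write.csv(tidy, processed_path, row.names = FALSE)
  invisible(tidy)
}

processed_path <- file.path(root, "data", "processed", "weights_clean.csv")
tidy <- clean_weights(raw_path, processed_path)
tidy
```

Analysis scripts would then read only from data/processed/, which keeps cleaning and analysis separate.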

4 Documentation

  • Comment code extensively to explain steps and logic
  • Create README files to explain project structure and instructions for running the analysis
  • Document all functions clearly, including input parameters, output, and purpose
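
As an illustration of function documentation, the sketch below uses roxygen2-style comments (`#'`) to record purpose, inputs, and output. The function `standardize()` is a made-up example; the comments are ordinary R comments, so the code runs without roxygen2 installed.

```r
#' Standardize a numeric vector
#'
#' Centers `x` on its mean and scales it by its standard deviation.
#'
#' @param x A numeric vector with at least two non-missing values.
#' @param na.rm Logical; if `TRUE`, missing values are ignored when
#'   computing the mean and standard deviation.
#' @return A numeric vector of the same length as `x`.
#' @examples
#' standardize(c(1, 2, 3))
standardize <- function(x, na.rm = TRUE) {
  stopifnot(is.numeric(x))
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

standardize(c(1, 2, 3))
# returns -1 0 1
```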

5 Environment Management

6 Workflow Automation

  • Organize your analysis into a series of numbered and ordered scripts to create a clear and reproducible workflow (e.g., 01-data-cleaning.R, 02-data-analysis.R, 03-visualization.R)
  • Create a master script (e.g., run_all.R) that sequentially runs each numbered script
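
A master script along these lines could look like the sketch below. It assumes the numbered scripts follow the `NN-name.R` pattern from the examples above; the function name `run_all()` and the demo folder are hypothetical.

```r
# run_all.R: runs every numbered script in a folder, in order.
run_all <- function(script_dir = "scripts") {
  scripts <- sort(list.files(script_dir,
                             pattern = "^[0-9]{2}-.*\\.R$",
                             full.names = TRUE))
  for (script in scripts) {
    message("Running ", basename(script))
    # Each script is evaluated in a fresh environment so that objects
    # left over from one step do not leak into the next
    source(script, local = new.env())
  }
  invisible(basename(scripts))
}

# Demonstration with two placeholder scripts in a temporary folder
demo_dir <- file.path(tempdir(), "scripts-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("cleaned <- TRUE", file.path(demo_dir, "01-data-cleaning.R"))
writeLines("analysed <- TRUE", file.path(demo_dir, "02-data-analysis.R"))
ran <- run_all(demo_dir)
ran
```

Because each script runs in its own environment in this sketch, steps pass results to each other through files (for example in data/processed/ and outputs/) rather than through objects in memory.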

7 Analysis Scripts

8 Computational Reproducibility

  • Set seeds to ensure reproducibility when using randomness in your analysis
  • Document all warnings your code produces and how you handled them
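
Both points can be illustrated with a short base R sketch. The seed value and the example input are arbitrary, and collecting warnings in a character vector is only one possible way to document them.

```r
# Setting the same seed before drawing random numbers gives identical results
set.seed(2024)
first_draw <- rnorm(5)
set.seed(2024)
second_draw <- rnorm(5)
identical(first_draw, second_draw)

# Collect warnings instead of letting them scroll past, so that they
# can be written to the project log afterwards
logged_warnings <- character(0)
converted <- withCallingHandlers(
  as.numeric(c("1", "2", "three")),  # "three" cannot be converted
  warning = function(w) {
    logged_warnings <<- c(logged_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
converted
logged_warnings
```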

9 Reporting

  • Use R Markdown (.Rmd) or Quarto (.qmd) files to combine code, results, and narrative in dynamic reports
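
To show what such a file contains, the sketch below writes a minimal R Markdown report from R. The report title and file name are made up, and the code-chunk delimiters are built with `strrep()` only so that they do not clash with the code block shown here. Rendering is left as a comment because it needs the rmarkdown package and pandoc.

```r
fence <- strrep("`", 3)  # chunk delimiter, built programmatically

report_lines <- c(
  "---",
  "title: \"Analysis report\"",
  "output: html_document",
  "---",
  "",
  "The chunk below is executed when the report is rendered.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",  # `cars` is a dataset that ships with R
  fence,
  ""
)

report_path <- file.path(tempdir(), "report.Rmd")
writeLines(report_lines, report_path)

# With rmarkdown and pandoc installed, the report could be rendered with:
# rmarkdown::render(report_path)
```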

10 Validation

11 Sharing Code And Data

12 Advanced Analysis Practices

References

Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. “Packaging Data Analytical Work Reproducibly Using R (and Friends).” Preprint. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3192v2.