Reproducible Research in R (and friends)
Cheatsheet
This cheatsheet provides essential guidelines and best practices for conducting reproducible research using R and related tools. It covers project organization, version control, data management, documentation, environment management, workflow automation, and more. For any suggestions or feedback, please feel free to email me.
1 Project Organization
- Use a consistent folder structure:
  - data/: analysis data files
  - scripts/: analysis scripts
  - outputs/: results (figures, tables)
  - docs/: documentation and reports
- Use RStudio Projects to facilitate project management and environment isolation
- Maintain a project log to document progress, changes, and key decisions throughout the analysis
- Reference:
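The folder structure above can be created from R. This is a minimal sketch using base R only; the helper name `create_project_dirs` and the temporary demo location are assumptions made for illustration, not part of the cheatsheet.

```r
# Hypothetical helper: create the cheatsheet's top-level folders under `root`.
create_project_dirs <- function(root = ".") {
  dirs <- file.path(root, c("data", "scripts", "outputs", "docs"))
  for (d in dirs) {
    # recursive = TRUE also creates `root` itself if it does not exist yet
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Demonstrate in a temporary directory so no real project is touched.
demo_root <- file.path(tempdir(), "demo-project")
created <- create_project_dirs(demo_root)
```

In a real project you would call `create_project_dirs()` once from the project root.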
2 Version Control
- Use Git to track changes in scripts and documents
- Commit regularly with meaningful messages
- One repository per analysis
- Make sure your data/ folder is listed in the .gitignore file
- Make sure there is no sensitive information in your code
- Reference:
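One way to keep data/ out of version control is to add it to .gitignore from R. The sketch below uses base R only; `ignore_path` is a hypothetical helper, and the demo writes to a temporary file rather than a real .gitignore.

```r
# Hypothetical helper: add `entry` to a .gitignore file unless already listed.
ignore_path <- function(entry = "data/", gitignore = ".gitignore") {
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  if (!entry %in% existing) {
    writeLines(c(existing, entry), gitignore)
  }
  invisible(readLines(gitignore, warn = FALSE))
}

# Demonstrate on a throwaway file; calling twice does not duplicate the entry.
demo_gitignore <- file.path(tempdir(), "demo.gitignore")
unlink(demo_gitignore)
ignore_path("data/", demo_gitignore)
ignore_path("data/", demo_gitignore)
```

Note that .gitignore only affects untracked files; data that was already committed stays in the repository history.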
3 Data Management
- Store raw data in data/raw/ and never modify it directly
- Produce a README describing the source data
- Use scripts to clean and process data, and save the cleaned data in data/processed/
- Document each step of data cleaning
- Keep data cleaning separate from analysis
- Organize your data in a tidy format: each variable is a column, each observation is a row
- Reference:
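The raw/processed split can be sketched as a short cleaning script. The file names and the tiny made-up dataset are assumptions for illustration only; the point is that the raw file is read but never overwritten, and the cleaned result goes to a separate folder.

```r
# Set up a throwaway project with a small, made-up raw file.
project <- file.path(tempdir(), "demo-data-project")
raw_dir <- file.path(project, "data", "raw")
processed_dir <- file.path(project, "data", "processed")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

raw_file <- file.path(raw_dir, "measurements.csv")
writeLines(c("id,height_cm", "1,170", "2,", "3,182"), raw_file)

# Cleaning step: read the raw file, drop incomplete rows, write a NEW file.
# Tidy layout: one column per variable, one row per observation.
raw <- read.csv(raw_file)
clean <- raw[complete.cases(raw), ]
clean_file <- file.path(processed_dir, "measurements_clean.csv")
write.csv(clean, clean_file, row.names = FALSE)
```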
4 Documentation
- Comment code extensively to explain steps and logic
- Create README files to explain project structure and instructions for running the analysis
- Document all functions clearly, including input parameters, output, and purpose
- Reference:
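As an illustration of documenting purpose, inputs, and output, the sketch below uses roxygen2-style comments. That convention is an assumption on my part (the cheatsheet does not prescribe a format), and the function itself is a made-up example.

```r
#' Convert temperatures from Celsius to Fahrenheit
#'
#' @param celsius Numeric vector of temperatures in degrees Celsius.
#' @return Numeric vector of the same length, in degrees Fahrenheit.
#' @examples
#' celsius_to_fahrenheit(c(0, 100))
celsius_to_fahrenheit <- function(celsius) {
  # Fail early with a clear error if the input is not numeric.
  stopifnot(is.numeric(celsius))
  celsius * 9 / 5 + 32
}
```

The `#'` lines are plain comments to R, so the function runs whether or not roxygen2 is installed.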
5 Environment Management
- Use sessionInfo() or devtools::session_info() to capture the R session information
- Use {renv} to manage package versions
- Reference:
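A minimal sketch of both points: the session details are written to a text file (the file name and location are assumptions), and the usual {renv} calls are shown commented out because they modify the project they are run in.

```r
# Save the session details alongside the results (here: a temporary file).
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Typical {renv} calls, commented out because they change the project:
# renv::init()      # set up a project-local library and a lockfile
# renv::snapshot()  # record the package versions currently in use
# renv::restore()   # reinstall the versions recorded in renv.lock
```

Committing the lockfile (renv.lock) to version control is what lets collaborators restore the same package versions.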
6 Workflow Automation
- Organize your analysis into a series of numbered and ordered scripts to create a clear and reproducible workflow (e.g., 01-data-cleaning.R, 02-data-analysis.R, 03-visualization.R)
- Create a master script (e.g., run_all.R) that sequentially runs each numbered script
- Alternatively, use a Makefile or the {targets} package to automate and document the workflow
- Reference:
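A master script along these lines might look as follows. This is a sketch under stated assumptions: the scripts live in a scripts/ folder, their names start with a number followed by a hyphen, and the function name `run_all` is hypothetical.

```r
# Hypothetical run_all.R: source every numbered script in order.
run_all <- function(script_dir = "scripts") {
  scripts <- sort(list.files(
    script_dir,
    pattern = "^[0-9]+-.*\\.R$",
    full.names = TRUE
  ))
  for (script in scripts) {
    message("Running ", basename(script))
    # A fresh environment per script keeps steps from sharing variables
    # by accident; steps should communicate through files instead.
    source(script, local = new.env())
  }
  invisible(scripts)
}

# Usage from the project root:
# run_all("scripts")
```

Zero-padded prefixes (01, 02, ..., 10) matter here, because the scripts are ordered alphabetically.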
7 Analysis Scripts
- Break analysis into small, reusable functions
- Use meaningful and consistent naming conventions, such as those provided by the Tidyverse Naming Conventions for variables and functions and by Data Carpentry for folders and files
- Style your code according to standardized recommendations from the Tidyverse Style Guide
- Reference:
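To illustrate small, reusable functions with consistent names, here is a sketch using snake_case throughout. The function names and the example values are made up for illustration.

```r
# Small, single-purpose functions with snake_case names (all hypothetical).
remove_missing <- function(x) {
  x[!is.na(x)]
}

standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

summarise_values <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x))
}

# Each step does one thing, so the steps can be combined and tested separately.
heights_cm <- c(170, NA, 182, 165)
heights_z <- standardize(remove_missing(heights_cm))
heights_summary <- summarise_values(heights_z)
```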
8 Computational Reproducibility
- Set seeds to ensure reproducibility when using randomness in your analysis
- Document all warnings
- Reference:
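Both points can be sketched in base R. The seed value is arbitrary, and collecting warnings into a character vector is just one possible way to document them, not a method prescribed by the cheatsheet.

```r
# Setting the same seed before each draw gives identical random numbers.
set.seed(2024)
first_draw <- rnorm(5)
set.seed(2024)
second_draw <- rnorm(5)
same_draws <- identical(first_draw, second_draw)

# Collect warning messages in a log instead of letting them scroll past.
warning_log <- character(0)
parsed <- withCallingHandlers(
  as.numeric(c("1.5", "abc")),
  warning = function(w) {
    warning_log <<- c(warning_log, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

The collected messages in `warning_log` can then be written to the project log or a report.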
9 Reporting
- Use R Markdown (.Rmd) or Quarto (.qmd) files to combine code, results, and narrative for creating dynamic reports
- Reference:
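As a rough sketch, the script below writes a minimal R Markdown file that mixes narrative with one code chunk. The file name, title, and chunk contents are assumptions for illustration; the render call is commented out because it needs the {rmarkdown} package and pandoc.

```r
# Build the chunk delimiter programmatically (three backticks).
fence <- strrep("`", 3)

report_lines <- c(
  "---",
  "title: \"Analysis report\"",
  "output: html_document",
  "---",
  "",
  "The mean of the simulated values is computed when the report is rendered.",
  "",
  paste0(fence, "{r}"),
  "set.seed(1)",
  "mean(rnorm(100))",
  fence
)

report_path <- file.path(tempdir(), "report.Rmd")
writeLines(report_lines, report_path)

# Render to HTML (requires the rmarkdown package and pandoc):
# rmarkdown::render(report_path)
```

Because the numbers in the report are produced by the chunk at render time, the text cannot drift out of sync with the analysis.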
10 Validation
- Get your code reviewed prior to publication
- Reference:
12 Advanced Analysis Practices
- Use the “many models” approach to fit and compare models across many subsets of data (e.g. EWAS). Storing models as list-columns in tibbles simplifies storage, manipulation and visualization while promoting modularity and reusability.
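A minimal sketch of the "many models" approach, assuming the {dplyr}, {tidyr}, and {purrr} packages are installed. The built-in mtcars data and the model formula are stand-ins chosen for illustration, not an EWAS analysis.

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per subset: the data and the fitted model are stored as list-columns.
models <- mtcars %>%
  nest(data = -cyl) %>%
  mutate(
    fit = map(data, function(d) lm(mpg ~ wt, data = d)),
    slope = map_dbl(fit, function(m) coef(m)[["wt"]])
  )
```

Each row of `models` keeps a subset, its model, and a summary together, so further steps (diagnostics, plots) can be added as extra columns with the same pattern.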
References
Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. “Packaging Data Analytical Work Reproducibly Using R (and Friends).” Preprint. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3192v2.