Practical reproducibility
Important: Golden rule
git tracks code history; bulky data does not belong in public repos. Document external access in the README.
Principles
- git for history · Large data off GitHub.
- Relative paths · Avoid machine-specific absolute paths in shared repos.
- README · Enough story that someone understands what runs and why without running line 1.
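One way to keep paths relative is to resolve everything from the repository root rather than from a user's home directory. The sketch below assumes the repository contains a `.git` directory at its top level; the helper names `find_repo_root` and `data_path` are illustrative, not part of any guideline.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = ".git") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")


def data_path(root: Path, *parts: str) -> Path:
    """Build a path below the repository root from relative components."""
    return root.joinpath(*parts)
```

A script would then call something like `data_path(find_repo_root(Path.cwd()), "data", "raw", "survey.csv")`, where the file location is hypothetical. The same script runs unchanged on any machine that has cloned the repository.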
Per org guidelines: repositories host code; analysis tables are linked externally (e.g. Dropbox, Google Drive) with explicit access notes in the README.
What “reproducible” means here
| Level | Meaning | When required |
|---|---|---|
| A — Internal | Script + documented environment + package versions | Thesis / internal preprints in the group. |
| B — Public / paper | A + public data download steps + Output_Analysis | Manuscripts, reports with DOI or active preprint. |
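For Level A, the "documented environment + package versions" part can be captured by the pipeline itself. The following is a minimal sketch for a Python project using only the standard library; the function name `environment_snapshot` and the example package list are assumptions, and an R project would more likely rely on an renv lockfile instead.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages: list[str]) -> dict:
    """Collect interpreter, OS, and installed versions for the named distributions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of silently skipping it.
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# "pip" stands in for the project's real dependencies.
print(json.dumps(environment_snapshot(["pip"]), indent=2, sort_keys=True))
```

Writing this snapshot next to the outputs of each run gives reviewers a record of what was actually installed, which may differ from what a requirements file asked for.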
Quick validation
- Clean-machine rerun (container, renv, or virtualenv) by someone who did not write the pipeline.
- Row parity between inputs and aggregated outputs after joins.
- Comparable figures between final exports and regenerated runs.
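The row-parity check can be automated. The sketch below reads "row parity" as: after a left join, every key from the input appears exactly as often as before, so nothing was dropped or duplicated. That reading, the plain-Python join, and the names `left_join` and `check_row_parity` are assumptions made for illustration; a real pipeline would apply the same comparison to its own data frames.

```python
from collections import Counter


def left_join(left, right, key):
    """Left-join two lists of dicts on `key`; duplicate right keys multiply rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # Unmatched left rows are kept once, with no extra columns.
        for match in index.get(row[key], [{}]):
            joined.append({**row, **match})
    return joined


def check_row_parity(left, joined, key):
    """Raise if the join dropped or duplicated rows relative to the left input."""
    before = Counter(row[key] for row in left)
    after = Counter(row[key] for row in joined)
    if before != after:
        duplicated = sorted(k for k in after if after[k] > before[k])
        dropped = sorted(k for k in before if after[k] < before[k])
        raise ValueError(
            f"Row parity broken: duplicated={duplicated}, dropped={dropped}"
        )


# Illustrative data only.
samples = [{"id": "s1", "value": 1.0}, {"id": "s2", "value": 2.0}]
sites = [{"id": "s1", "site": "north"}, {"id": "s2", "site": "south"}]
check_row_parity(samples, left_join(samples, sites, "id"), "id")  # passes silently
```

If the lookup table contained a second row for `s1`, the join would return three rows and the check would raise, naming the duplicated key.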
Privacy
Never push direct identifiers or re-identifiable raw tables to public repos. Keep secrets in .Renviron / .env (gitignored); list required variables in the README without exposing values.
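A pipeline can check for its required variables at startup and report only the missing names, never the values. This is a minimal sketch: the variable names in `REQUIRED_VARS` are hypothetical, and it assumes the variables have already been loaded into the process environment (for example by the shell or by whatever tool reads the `.env` file).

```python
import os

# Hypothetical names; the README should list the project's real ones.
REQUIRED_VARS = ("DATA_URL", "API_TOKEN")


def missing_variables(required, environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def require_variables(required, environ=None):
    """Fail early, naming the missing variables without printing any values."""
    missing = missing_variables(required, environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_variables(REQUIRED_VARS)` as the first step of a run turns a forgotten secret into an immediate, readable error rather than a failure halfway through the analysis.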