Project Types
Privacy Sensitive Project
Enhanced privacy controls with separate private/public flows
Overview
The Privacy Sensitive Project is designed for research involving sensitive or protected data. It provides enhanced privacy controls with separate private and public directories, and stricter .gitignore rules that prevent private data from being committed.
Directory Structure
my-study/
├── functions/ # Custom R functions (auto-loaded)
│ └── helpers.R
├── inputs/
│ ├── private/ # Sensitive data (gitignored)
│ │ ├── raw/
│ │ ├── intermediate/
│ │ └── final/
│ └── public/ # Non-sensitive data
│ ├── raw/
│ ├── intermediate/
│ └── final/
├── notebooks/ # Analysis notebooks
├── outputs/
│ └── private/
│ └── notebooks/ # Rendered notebooks (gitignored)
├── scripts/ # Automation scripts
├── framework.db # Metadata database
└── settings.yml # Project configuration
Additional directories are created on first use:
outputs/private/cache- Cached computationsoutputs/private/figures- Private figuresoutputs/private/tables- Private tablesoutputs/public/- Shareable results
Privacy Controls
What Gets Gitignored
The privacy sensitive .gitignore template excludes:
inputs/private/- All private input dataoutputs/private/- All private outputs.env- Environment variables with credentialsframework.db- May contain data hashes
Public vs Private Outputs
Use the public parameter when saving outputs:
# Save to private (default)
save_table(results, "analysis_results")
# Save to public for sharing
save_table(summary_stats, "summary", public = TRUE)
Workflow
1. Set Up Environment
library(framework)
scaffold()
2. Organize Data by Sensitivity
Place sensitive data in private directories:
data:
# Private: contains PHI
inputs.private.raw.patients:
path: inputs/private/raw/patients.csv
type: csv
locked: true
# Public: aggregated, de-identified
inputs.public.final.demographics:
path: inputs/public/final/demographics.csv
type: csv
3. Work with Private Data
# Load private data
patients <- data_read("inputs.private.raw.patients")
# Process and de-identify
demographics <- patients |>
group_by(age_group, region) |>
summarize(n = n())
# Save de-identified data to public
data_save(demographics, "inputs.public.final.demographics")
4. Generate Public Outputs
# Private analysis output
save_figure(detailed_plot, "patient_outcomes")
# Public summary for publication
save_figure(summary_plot, "figure_1", public = TRUE)
Configuration
Example settings.yml for a privacy sensitive project:
default:
project_type: project_sensitive
directories:
notebooks: notebooks
scripts: scripts
functions: functions
inputs_private_raw: inputs/private/raw
inputs_private_intermediate: inputs/private/intermediate
inputs_private_final: inputs/private/final
inputs_public_raw: inputs/public/raw
inputs_public_intermediate: inputs/public/intermediate
inputs_public_final: inputs/public/final
cache: outputs/private/cache
packages:
- name: dplyr
auto_attach: true
- name: ggplot2
auto_attach: true
data:
inputs.private.raw.patients:
path: inputs/private/raw/patients.csv
type: csv
locked: true
Best Practices
Lock Sensitive Source Data
Use locked: true to ensure source data hasn't changed:
data:
inputs.private.raw.patients:
path: inputs/private/raw/patients.csv
type: csv
locked: true # Error if file changes
De-identify Before Publishing
Always aggregate or de-identify data before moving to public directories:
# Never do this
save_table(raw_patient_data, "patients", public = TRUE)
# Do this instead
public_summary <- raw_patient_data |>
group_by(category) |>
summarize(mean_value = mean(value), n = n())
save_table(public_summary, "summary", public = TRUE)
Review Before Committing
Check git status before committing to ensure no private data is staged:
git_status()