| Title: | Privacy-Preserving Synthetic Data for 'LLM' Workflows |
|---|---|
| Description: | Generate privacy-preserving synthetic datasets that mirror structure, types, factor levels, and missingness; export bundles for 'LLM' workflows (data plus 'JSON' schema and guidance); and build fake data directly from 'SQL' database tables without reading real rows. Methods are related to approaches in Nowok, Raab and Dibben (2016) <doi:10.32614/RJ-2016-019> and the foundation-model overview by Bommasani et al. (2021) <doi:10.48550/arXiv.2108.07258>. |
| Authors: | Zobaer Ahmed [aut, cre] |
| Maintainer: | Zobaer Ahmed <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.2 |
| Built: | 2026-06-05 10:20:55 UTC |
| Source: | https://github.com/zobaer09/fakedatar |
Uses a broad, configurable regex library to match likely PII columns.
You can extend it with extra_patterns (they get ORed in) or replace
everything with a single override_regex.
detect_sensitive_columns(x_names, extra_patterns = NULL, override_regex = NULL)
x_names |
Character vector of column names to check. |
extra_patterns |
Character vector of additional regexes to OR in. Examples: c("MRN", "NHS", "Aadhaar", "passport") |
override_regex |
Optional single regex string that fully replaces the
defaults (case-insensitive). When supplied, |
Character vector of names from x_names that matched.
detect_sensitive_columns(c("id", "email", "home_phone", "zip", "notes"))
detect_sensitive_columns(names(mtcars), extra_patterns = c("^vin$", "passport"))
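The OR-combination and override behavior described above can be sketched in base R. This is a minimal illustration, not the package's implementation: `default_patterns` and `match_sensitive()` are hypothetical names, and the real default library is broader than the five patterns assumed here.

```r
# Hypothetical stand-in for the package's default pattern library.
default_patterns <- c("email", "phone", "^id$", "ssn", "zip")

match_sensitive <- function(x_names, extra_patterns = NULL, override_regex = NULL) {
  # An override replaces everything; otherwise defaults and extras are ORed together.
  rx <- if (!is.null(override_regex)) {
    override_regex
  } else {
    paste(c(default_patterns, extra_patterns), collapse = "|")
  }
  # Matching is case-insensitive, as the argument description states.
  x_names[grepl(rx, x_names, ignore.case = TRUE)]
}

cols <- c("id", "email", "home_phone", "zip", "notes", "Passport_No")
hits_default  <- match_sensitive(cols)
hits_extra    <- match_sensitive(cols, extra_patterns = "passport")
hits_override <- match_sensitive(cols, override_regex = "^notes$")
```

With the assumed defaults, `hits_extra` additionally picks up `Passport_No`, while `hits_override` ignores the defaults entirely and returns only `notes`.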
Save a data.frame to CSV, RDS, or Parquet based on the file extension.
export_fake(x, path)
x |
A data.frame (e.g., output of |
path |
File path. Supported extensions: .csv, .rds, .parquet. |
(Invisibly) the path written.
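Choosing a writer from the file extension might look like the sketch below. `write_by_extension()` is a hypothetical helper written for illustration; it assumes Parquet output goes through the 'arrow' package, which the source does not confirm.

```r
write_by_extension <- function(x, path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = utils::write.csv(x, path, row.names = FALSE),
    rds = saveRDS(x, path),
    parquet = {
      # Assumption: Parquet support relies on the optional 'arrow' package.
      if (!requireNamespace("arrow", quietly = TRUE)) {
        stop("Parquet output needs the 'arrow' package.")
      }
      arrow::write_parquet(x, path)
    },
    stop("Unsupported extension: ", ext)
  )
  # Return the path invisibly, mirroring the documented return value.
  invisible(path)
}

demo <- data.frame(a = 1:3, b = c("x", "y", "z"))
csv_path <- write_by_extension(demo, file.path(tempdir(), "demo.csv"))
rds_path <- write_by_extension(demo, file.path(tempdir(), "demo.rds"))
```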
Generate Fake Data from Real Dataset Structure
generate_fake_data(data, n = 30, category_mode = c("preserve", "generic", "custom"), numeric_mode = c("range", "distribution"), column_mode = c("keep", "generic", "custom"), custom_levels = NULL, custom_names = NULL, seed = NULL, verbose = FALSE, sensitive = NULL, sensitive_detect = TRUE, sensitive_strategy = c("fake", "drop"), normalize = TRUE)
data |
A tabular object; will be coerced via |
n |
Rows to generate (default 30). |
category_mode |
One of "preserve","generic","custom".
|
numeric_mode |
One of "range","distribution".
|
column_mode |
One of "keep","generic","custom".
|
custom_levels |
optional named list of allowed levels per column (for |
custom_names |
optional named character vector old->new (for
|
seed |
Optional RNG seed. |
verbose |
Logical; print progress. |
sensitive |
Optional character vector of original column names to treat as sensitive. |
sensitive_detect |
Logical; auto-detect common sensitive columns by name. |
sensitive_strategy |
One of "fake","drop". Only applied if any sensitive columns exist. |
normalize |
Logical; lightly normalize inputs (trim, %→numeric, short date-times→POSIXct). |
A data.frame of n rows with attributes:
name_map (named chr: original -> output)
column_mode (chr)
sensitive_columns (chr; original names)
dropped_columns (chr; original names that were dropped)
Generate fake data from a DB schema data.frame
generate_fake_from_schema(sch_df, n = 30, seed = NULL)
sch_df |
A data.frame returned by |
n |
Number of rows to generate. |
seed |
Optional integer seed for reproducibility. |
A base data.frame with n rows and one column per schema
entry. Column classes follow the schema type values
(integer, numeric, character, logical, Date, POSIXct);
missingness is injected when nullable is TRUE.
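To illustrate how a schema can drive generation without touching real rows, here is a rough base-R sketch. The schema layout (columns `name`, `type`, `nullable`), the value ranges, and the `na_rate` parameter are all assumptions made for this example; `fake_column()` and `fake_from_schema()` are hypothetical helpers, not package functions.

```r
# Hypothetical schema layout: one row per column, with a type and a nullable flag.
sch <- data.frame(
  name = c("id", "score", "label", "active", "joined"),
  type = c("integer", "numeric", "character", "logical", "Date"),
  nullable = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Draw n placeholder values of the requested class (ranges are arbitrary).
fake_column <- function(type, n) {
  switch(type,
    integer = sample.int(1000L, n, replace = TRUE),
    numeric = runif(n, 0, 100),
    character = paste0("value_", sample.int(50L, n, replace = TRUE)),
    logical = sample(c(TRUE, FALSE), n, replace = TRUE),
    Date = as.Date("2020-01-01") + sample.int(365L, n, replace = TRUE),
    POSIXct = as.POSIXct("2020-01-01", tz = "UTC") + runif(n, 0, 365 * 86400),
    stop("Unknown type: ", type)
  )
}

fake_from_schema <- function(sch, n = 30, seed = NULL, na_rate = 0.1) {
  if (!is.null(seed)) set.seed(seed)
  cols <- lapply(seq_len(nrow(sch)), function(i) {
    col <- fake_column(sch$type[i], n)
    # Inject missingness only where the schema marks the column as nullable.
    if (isTRUE(sch$nullable[i])) col[runif(n) < na_rate] <- NA
    col
  })
  names(cols) <- sch$name
  as.data.frame(cols, stringsAsFactors = FALSE)
}

fake <- fake_from_schema(sch, n = 20, seed = 1)
```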
Create synthetic timestamps either by mimicking an existing POSIXct vector
(using its range and NA rate) or by sampling uniformly between start and end.
generate_fake_posixct_column(like = NULL, n = NULL, start = NULL, end = NULL, tz = "UTC", na_prop = NULL)
like |
Optional POSIXct vector to mimic. If supplied, |
n |
Number of rows to generate. Required when |
start, end
|
Optional POSIXct bounds to sample between when |
tz |
Timezone to use if |
na_prop |
Optional NA proportion to enforce in the output (0–1). If |
A POSIXct vector of length n.
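Both modes described above reduce to uniform sampling on the underlying seconds-since-epoch scale. The sketch below shows one way to do that in base R; `sample_posixct()` is a hypothetical helper, and rounding `na_prop * n` to a whole number of NAs is an assumption about how the proportion is enforced.

```r
sample_posixct <- function(n, start, end, tz = "UTC", na_prop = 0) {
  # Sample uniformly on the numeric (seconds since epoch) scale.
  secs <- runif(n, as.numeric(start), as.numeric(end))
  out <- as.POSIXct(secs, origin = "1970-01-01", tz = tz)
  # Blank out a fixed number of positions to hit the requested NA proportion.
  n_na <- round(na_prop * n)
  if (n_na > 0) out[sample.int(n, n_na)] <- NA
  out
}

set.seed(42)
start <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2024-12-31 23:59:59", tz = "UTC")
ts <- sample_posixct(50, start, end, na_prop = 0.1)

# Mimic an existing vector by reusing its observed range and NA rate.
mimic <- sample_posixct(
  length(ts),
  min(ts, na.rm = TRUE),
  max(ts, na.rm = TRUE),
  na_prop = mean(is.na(ts))
)
```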
Generates a synthetic copy of data, then optionally detects/handles
sensitive columns by name. Detection uses the ORIGINAL column names and
maps to output via attr(fake, "name_map") if present.
generate_fake_with_privacy(data, n = 30, level = c("low", "medium", "high"), seed = NULL, sensitive = NULL, sensitive_detect = TRUE, sensitive_strategy = c("fake", "drop"), normalize = TRUE, sensitive_patterns = NULL, sensitive_regex = NULL)
data |
A data.frame (or coercible) to mirror. |
n |
Rows to generate (default 30, per the usage above; if NULL, the same number of rows as the input). |
level |
One of "low","medium","high". |
seed |
Optional RNG seed. |
sensitive |
Character vector of original column names to treat as sensitive. |
sensitive_detect |
Logical; auto-detect common sensitive columns by name. |
sensitive_strategy |
One of "fake" or "drop". |
normalize |
Logical; lightly normalize inputs. |
sensitive_patterns |
Optional named list of patterns to treat as sensitive (e.g., list(id = "...", email = "...", phone = "...")). Overrides defaults. |
sensitive_regex |
Optional fully-combined regex (single string) to detect sensitive columns by name. If supplied, it is used instead of defaults. |
Generate fake data with privacy controls
data.frame with attributes: sensitive_columns, dropped_columns, name_map
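Because detection works on the original column names, the name_map attribute is what links a sensitive original name to its column in the output. The sketch below uses a hand-built stand-in data frame rather than real package output; the column names and the mapping are invented for illustration.

```r
# Stand-in for a generated data frame whose columns were renamed.
fake <- data.frame(var1 = 1:3, var2 = c("a", "b", "c"))
attr(fake, "name_map") <- c(patient_id = "var1", diagnosis = "var2")

# Translate an original (sensitive) name to the corresponding output column.
sensitive_original <- "patient_id"
name_map <- attr(fake, "name_map")
sensitive_output <- unname(name_map[sensitive_original])

# Applying a "drop" strategy by hand: remove the mapped output column.
fake_dropped <- fake[, setdiff(names(fake), sensitive_output), drop = FALSE]
```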
Create a copy-paste prompt for LLMs
generate_llm_prompt(fake_path, schema_path = NULL, notes = NULL, write_file = TRUE, path = dirname(fake_path), filename = "README_FOR_LLM.txt")
fake_path |
Path to the fake data file (CSV/RDS/Parquet). |
schema_path |
Optional path to the JSON schema. |
notes |
Optional extra notes to append for the analyst/LLM. |
write_file |
Write a README txt next to the files? Default TRUE. |
path |
Output directory for the README if write_file = TRUE. |
filename |
README file name. Default "README_FOR_LLM.txt". |
The prompt string (invisibly returns the file path if written).
Generates fake data, writes files (CSV/RDS/Parquet), writes a scrubbed JSON schema, and optionally writes a README prompt and a single ZIP file containing everything.
llm_bundle(data, n = 30, level = c("medium", "low", "high"), formats = c("csv", "rds"), path = tempdir(), filename = "fake_bundle", seed = NULL, write_prompt = TRUE, zip = FALSE, prompt_filename = "README_FOR_LLM.txt", zip_filename = NULL, sensitive = NULL, sensitive_detect = TRUE, sensitive_strategy = c("fake", "drop"), normalize = FALSE)
data |
A data.frame (or coercible) to mirror. |
n |
Number of rows in the fake dataset (default 30). |
level |
Privacy level: "low", "medium", or "high". Controls stricter defaults. |
formats |
Which data files to write: any of "csv","rds","parquet". |
path |
Folder to write outputs. Default: tempdir(). |
filename |
Base file name (no extension). Example: "demo_bundle". This becomes files like "demo_bundle.csv", "demo_bundle.rds", etc. |
seed |
Optional RNG seed for reproducibility. |
write_prompt |
Write a README_FOR_LLM.txt next to the data? Default TRUE. |
zip |
Create a single zip archive containing data + schema + README? Default FALSE. |
prompt_filename |
Name for the README file. Default "README_FOR_LLM.txt". |
zip_filename |
Optional custom name for the ZIP file (no path).
If |
sensitive |
Character vector of column names to treat as sensitive (optional). |
sensitive_detect |
Logical, auto-detect common sensitive columns (id/email/phone). Default TRUE. |
sensitive_strategy |
"fake" (replace with realistic fakes) or "drop". Default "fake". |
normalize |
Logical; if TRUE, attempt light auto-normalization before faking. |
Tips
Avoid using angle brackets in examples; prefer plain tokens like NAME
or FILE_NAME. If you truly want bracket glyphs, use Unicode ⟨name⟩.
List with paths: $data_paths (named), $schema_path, $readme_path (optional), $zip_path (optional), and $fake (data.frame).
Reads just the schema from table on conn, synthesizes n fake rows,
writes a schema JSON, fake dataset(s), and a README prompt, and optionally
zips them into a single archive.
llm_bundle_from_db(conn, table, n = 30, level = c("medium", "low", "high"), formats = c("csv", "rds"), path = tempdir(), filename = "fake_from_db", seed = NULL, write_prompt = TRUE, zip = FALSE, zip_filename = NULL, sensitive_strategy = c("fake", "drop"))
conn |
A DBI connection. |
table |
Character scalar: table name to read. |
n |
Number of rows in the fake dataset (default 30). |
level |
Privacy level: "low", "medium", or "high". Controls stricter defaults. |
formats |
Which data files to write: any of "csv","rds","parquet". |
path |
Folder to write outputs. Default: tempdir(). |
filename |
Base file name (no extension). Example: "demo_bundle". This becomes files like "demo_bundle.csv", "demo_bundle.rds", etc. |
seed |
Optional RNG seed for reproducibility. |
write_prompt |
Write a README_FOR_LLM.txt next to the data? Default TRUE. |
zip |
Create a single zip archive containing data + schema + README? Default FALSE. |
zip_filename |
Optional custom name for the ZIP file (no path).
If |
sensitive_strategy |
"fake" (replace with realistic fakes) or "drop". Default "fake". |
Invisibly, a list with useful paths:
schema_path – schema JSON
files – vector of written fake-data files
zip_path – zip archive path (if zip = TRUE)
if (requireNamespace("DBI", quietly = TRUE) && requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbWriteTable(con, "cars", head(cars, 20), overwrite = TRUE)
  out <- llm_bundle_from_db(
    con, "cars",
    n = 100, level = "medium",
    formats = c("csv", "rds"),
    path = tempdir(), filename = "db_bundle",
    seed = 1, write_prompt = TRUE, zip = TRUE
  )
}
Converts common tabular objects to a base data.frame, and if normalize = TRUE
it applies light, conservative value normalization:
Converts common date/time strings to POSIXct (best-effort across several formats)
Converts percent-like character columns (e.g. "85%") to numeric (85)
Maps a configurable set of "NA-like" strings to NA, while keeping common survey
responses like "not applicable" or "prefer not to answer" as real levels
Normalizes yes/no character columns to an ordered factor c("no","yes")
prepare_input_data(data, normalize = TRUE, na_strings = c("", "NA", "N/A", "na", "No data", "no data"), keep_as_levels = c("not applicable", "prefer not to answer", "unsure"), percent_detect_threshold = 0.6, datetime_formats = c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %H:%M", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M", "%m/%d/%Y", "%Y-%m-%d"))
data |
An object coercible to |
normalize |
Logical, run value normalization step (default TRUE). |
na_strings |
Character vector that should become |
keep_as_levels |
Character vector that should be kept as values (not |
percent_detect_threshold |
Proportion of non-missing values that must contain |
datetime_formats |
Candidate formats tried (in order) when parsing date-time strings.
The best-fitting format (most successful parses) is used. Defaults cover
|
A base data.frame.
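Three of the normalization steps listed above can be approximated in a few lines of base R. This is a sketch under assumptions, not the package's code: the helper names are hypothetical, and the percent rule assumes the threshold refers to the share of non-missing values that end in a percent sign.

```r
na_strings <- c("", "NA", "N/A", "na", "No data", "no data")

# Map NA-like strings to NA after trimming whitespace.
blank_to_na <- function(x) {
  x <- trimws(x)
  x[x %in% na_strings] <- NA
  x
}

# Convert a percent-like character column to numeric when enough values qualify.
percent_to_numeric <- function(x, threshold = 0.6) {
  ok <- !is.na(x)
  is_pct <- grepl("^-?[0-9.]+\\s*%$", x[ok])
  # Leave the column untouched if too few non-missing values look like percents.
  if (!any(ok) || mean(is_pct) < threshold) return(x)
  as.numeric(sub("\\s*%$", "", x))
}

# Normalize yes/no answers to an ordered factor with levels no < yes.
yes_no_factor <- function(x) {
  factor(tolower(x), levels = c("no", "yes"), ordered = TRUE)
}

pct <- percent_to_numeric(blank_to_na(c("85%", " 12.5 %", "N/A", "40%")))
yn  <- yes_no_factor(blank_to_na(c("Yes", "no", "", "YES")))
```

Here "N/A" and the empty string become NA before conversion, so they do not count against the percent threshold.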
Returns a data frame describing the columns of a database table.
schema_from_db(conn, table, level = c("medium", "low", "high"))
conn |
A DBI connection. |
table |
Character scalar: table name to introspect. |
level |
Privacy preset to annotate in schema metadata: one of "low", "medium", "high". Default "medium". |
A data.frame with column metadata (e.g., name, type).
if (requireNamespace("DBI", quietly = TRUE) && requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbWriteTable(con, "mtcars", mtcars[1:3, ])
  sc <- schema_from_db(con, "mtcars")
  head(sc)
}
Compares classes, NA/blank proportions, and simple numeric ranges.
validate_fake(original, fake, tol = 0.15)
original |
data.frame |
fake |
data.frame (same columns) |
tol |
numeric tolerance for proportion differences (default 0.15) |
data.frame summary by column
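A per-column comparison of this kind could be sketched as follows. `compare_columns()` is a hypothetical helper, and the output column names (`class_match`, `na_diff`, `na_ok`, `range_ok`) are assumptions; the actual summary returned by validate_fake() may be shaped differently.

```r
compare_columns <- function(original, fake, tol = 0.15) {
  shared <- intersect(names(original), names(fake))
  rows <- lapply(shared, function(nm) {
    o <- original[[nm]]
    f <- fake[[nm]]
    # Absolute difference in NA proportion, checked against the tolerance.
    na_diff <- abs(mean(is.na(o)) - mean(is.na(f)))
    # Range check only applies when both columns are numeric.
    in_range <- if (is.numeric(o) && is.numeric(f)) {
      all(f >= min(o, na.rm = TRUE) & f <= max(o, na.rm = TRUE), na.rm = TRUE)
    } else {
      NA
    }
    data.frame(
      column = nm,
      class_match = identical(class(o), class(f)),
      na_diff = na_diff,
      na_ok = na_diff <= tol,
      range_ok = in_range,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

original <- data.frame(x = c(1, 5, NA, 9), g = c("a", "b", "a", "b"))
fake     <- data.frame(x = c(2, 8, 3, NA), g = c("b", "b", "a", "a"))
report   <- compare_columns(original, fake)
```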
Zip a set of files for easy sharing
zip_llm_bundle(files, zipfile)
files |
Character vector of file paths. |
zipfile |
Path to the zip file to create. |
The path to the created zip file.