What Is Synthetic Data Generation (SDG)?

Synthetic Data Generation (SDG) - Definition

Synthetic data generation (SDG) is a controlled process of creating artificial data that preserves the key statistical or structural properties of source data without being a direct copy of it. In normative terms, synthetic data is data generated artificially rather than collected directly from observations of phenomena or individuals (ISO/IEC 22989:2022). SDG can apply to images, video, audio, and tabular data.

In the context of image and video anonymization, SDG is primarily used for two purposes. First, to create training and validation datasets for face detection, face blurring, and license plate detection and redaction models. Second, to replace parts of an image with synthetic textures or faces that have low (ideally near-zero) biometric similarity, thereby reducing the risk of re-identification. The mere creation of synthetic data does not automatically constitute anonymization under the GDPR. For data to be considered anonymous, identifying an individual must be practically impossible using reasonable means, as stated in GDPR Recital 26 and Article 29 Working Party Opinion 05/2014.

The Role of SDG in Image and Video Anonymization

In practice, SDG is one link in a processing chain that includes detection, segmentation, and masking of elements requiring protection. By generating synthetic faces and license plates, teams can train and test detection models in line with the data minimization principle, without widely distributing real-world data. This is particularly important for on-premise deployments and environments with elevated data security requirements.

SDG also increases the diversity of imaging conditions, such as lighting, camera angles, occlusions, license plate types, and visual artifacts. As a result, face and license plate blurring models achieve higher sensitivity in crowded scenes, under motion blur, and at low resolution. From a Data Protection Officer (DPO) perspective, SDG is a compliance-supporting tool: it improves the effectiveness of anonymization techniques but does not replace risk assessment or re-identification resilience testing.

SDG Technologies Used in Anonymization

Specialized generative models are used to create synthetic images and video sequences. In anonymization workflows, identity detection and verification models are also critical, as they help assess the risk of information disclosure in synthetic outputs.

  • Generative models: diffusion-based image models, GANs, and VAEs for generating faces, license plates, and background textures (Heusel et al., 2017; diffusion research from 2020 onward).
  • Detection models: YOLO, RetinaFace, EfficientDet for locating faces and license plates in source material and in synthetic training data.
  • Biometric verification models: e.g., ArcFace, used to measure similarity between synthetic and real faces and to monitor excessive biometric resemblance.
  • Privacy-preserving training: DP-SGD and memorization-reduction techniques to lower the risk of training data reconstruction by generative models (Abadi et al., 2016; Carlini et al., 2023).
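The biometric-similarity monitoring mentioned above can be sketched as a cosine-similarity check over face embeddings. This is a minimal illustration, not a production pipeline: the embeddings below are random stand-ins, whereas a real deployment would extract them with a model such as ArcFace, and the 0.4 threshold is a hypothetical value that must be calibrated per model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_biometric_matches(synthetic, real, threshold=0.4):
    """Return (synthetic_idx, real_idx) pairs whose embedding similarity
    exceeds the match threshold, i.e. candidate re-identification risks."""
    flagged = []
    for i, s in enumerate(synthetic):
        for j, r in enumerate(real):
            if cosine_similarity(s, r) >= threshold:
                flagged.append((i, j))
    return flagged

# Stand-in embeddings; in practice these come from a face model such as ArcFace.
rng = np.random.default_rng(0)
real = rng.normal(size=(5, 512))
synthetic = rng.normal(size=(5, 512))
synthetic[0] = real[2] + 0.01 * rng.normal(size=512)  # deliberately similar pair

matches = flag_biometric_matches(synthetic, real)
```

In this sketch only the deliberately perturbed pair is flagged; in an anonymization workflow, any flagged synthetic face would be regenerated or rejected before release.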

Key Parameters and Metrics for SDG in Anonymization

Evaluating the effectiveness of synthetic data generation requires balancing utility for redaction models with privacy risk. The table below summarizes commonly used metrics in imaging and anonymization, along with their interpretation and references.

| Category | Metric | Description | Interpretation |
| --- | --- | --- | --- |
| Detection utility | mAP@IoU | Mean Average Precision at a given IoU threshold, measured for a detector trained on synthetic data | Higher is better - indicates whether SDG improves face and license plate detection |
| Generative quality | FID | Fréchet Inception Distance - similarity between feature distributions of real and synthetic datasets | Lower is better - lower FID means higher fidelity (Heusel et al., 2017) |
| Diversity | Precision-Recall for generative models | Metric balancing sample precision (fidelity) and recall (coverage of data modes) | High precision and recall - no spurious modes and no mode collapse (Kynkäänniemi et al., 2019) |
| Memorization risk | Membership inference AUC | Ability of an attack to determine whether a sample was part of the generator's training set | AUC close to 0.5 - lower leakage risk (MIA literature; NIST SDNist) |
| Biometric risk | Match rate | Percentage of matches between synthetic and real faces according to a biometric classifier | Low match rate - synthetic faces do not resemble real individuals |
| Redaction quality | SSIM / PSNR within mask area | Structural consistency and noise relative to the intended redaction effect | Aligned with policy - no artifacts that could facilitate identification |
| Performance | Generation time, number of steps | Latency and computational complexity, e.g., number of diffusion steps | Optimized for batch, on-premise processing - no real-time requirement |
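The FID metric from the table can be computed directly from the Gaussian summary statistics (mean and covariance) of real and synthetic feature sets, using the formula from Heusel et al. (2017): ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2)). The sketch below uses random 8-dimensional features as stand-ins; a real pipeline would extract Inception pool3 activations from the two image sets.

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet Inception Distance between two Gaussian feature summaries
    (Heusel et al., 2017): ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny numerical imaginary residue
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def stats(x):
    """Mean vector and covariance matrix of a (samples, features) array."""
    return x.mean(axis=0), np.cov(x, rowvar=False)

# Stand-in features; real pipelines use Inception activations of the images.
rng = np.random.default_rng(1)
real_feats = rng.normal(size=(1000, 8))
synth_feats = rng.normal(loc=0.5, size=(1000, 8))

mu_r, sig_r = stats(real_feats)
mu_s, sig_s = stats(synth_feats)
score = fid(mu_r, sig_r, mu_s, sig_s)        # shifted distribution: FID > 0
self_score = fid(mu_r, sig_r, mu_r, sig_r)   # identical statistics: FID ~ 0
```

Comparing a feature set against itself yields an FID near zero, which is a useful sanity check before comparing real and synthetic datasets.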

Challenges and Limitations of SDG

Deploying synthetic data generation for privacy protection requires addressing both technical and legal risks. Below are key considerations for DPOs and technical teams.

  • No automatic anonymization: synthetic data may still disclose information if a model memorizes training samples or reproduces rare feature combinations. Research documents extraction of training data fragments from generative models without adequate safeguards (Carlini et al., 2023).
  • Domain shift: overly “clean” synthetic data can reduce detector performance in real-world conditions. Domain randomization and validation on real data are necessary, while respecting data minimization and GDPR principles.
  • Risk management: AI risk management practices in line with ISO/IEC 23894:2023 are required, including documentation of decisions and reference datasets.
  • Compliance and transparency: public materials should avoid synthetic content that could mislead as to authenticity. For internal anonymization processes, re-identification and re-profiling resistance testing is essential.
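The membership inference AUC used above to quantify memorization risk can be estimated from per-sample attack scores (for example, negative reconstruction error) via the Mann-Whitney U statistic. The scores below are hypothetical Gaussian stand-ins; a real evaluation would score held-out members and non-members of the generator's training set.

```python
import numpy as np

def membership_auc(member_scores, nonmember_scores):
    """AUC of an attack that ranks samples by score (higher = 'more likely a
    training member'), via the Mann-Whitney U statistic. 0.5 means the attack
    performs at chance, i.e. low leakage risk."""
    m = np.asarray(member_scores)
    n = np.asarray(nonmember_scores)
    # Fraction of (member, non-member) pairs ranked correctly; ties count half.
    greater = (m[:, None] > n[None, :]).sum()
    ties = (m[:, None] == n[None, :]).sum()
    return (greater + 0.5 * ties) / (m.size * n.size)

# Hypothetical attack scores: a leaky model separates members from non-members,
# a well-regularized model does not.
rng = np.random.default_rng(2)
leaky_auc = membership_auc(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
safe_auc = membership_auc(rng.normal(0.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
```

An AUC drifting well above 0.5 is the signal, flagged in the table above, that the generator is memorizing training data and that mitigations such as DP-SGD should be considered.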

Examples of SDG Applications in Face and License Plate Blurring

In solutions such as Gallio PRO, deployed on-premise and performing automated batch anonymization of faces and license plates, SDG supports multiple stages of the model lifecycle. The examples below relate to image and video processing and do not apply to text documents.

  • Training data augmentation for face and license plate detectors - synthetic crowded scenes, multiple countries and plate formats, varied lighting conditions.
  • Redaction effectiveness validation - generating challenging test cases with partial occlusions and motion blur.
  • Synthetic identity replacement - creating faces with low (ideally near-zero) biometric similarity and filling masks instead of applying simple blur to reduce reversibility risk.
  • Compliance support - in some jurisdictions, license plate blurring is mandatory or recommended; SDG improves detection of rare plate formats. In Poland, whether license plates constitute personal data depends on context, so a precautionary policy and risk-based testing aligned with EDPB and UODO guidance is recommended.
  • Manual operations - for logos, tattoos, nameplates, or screens not automatically detected, SDG can provide training patterns for operators and test scenarios for built-in manual editors.
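The training data augmentation and domain randomization described above (varied lighting, motion blur) can be sketched as simple image transforms applied to synthetic crops before detector training. This is a minimal illustration with a random grayscale patch; the parameter ranges are hypothetical and real pipelines use richer transforms.

```python
import numpy as np

def randomize_domain(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply simple domain randomization to a grayscale crop:
    brightness jitter plus horizontal motion blur."""
    out = img.astype(np.float64) * rng.uniform(0.6, 1.4)  # lighting variation
    k = int(rng.integers(1, 6))                           # blur kernel width
    kernel = np.ones(k) / k
    # Convolve each row with a box kernel to imitate horizontal motion blur.
    out = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, out
    )
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(3)
synthetic_crop = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
augmented = randomize_domain(synthetic_crop, rng)
```

Applying such randomization at training time helps close the domain gap noted in the challenges section, so detectors trained on synthetic faces and plates generalize to noisy real-world footage.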

Normative References and Sources

Below is a list of standards and technical sources used for SDG definitions and metrics. Edition numbers and dates allow for verification.

  • ISO/IEC 22989:2022 - Artificial intelligence - Concepts and terminology. Definition of synthetic data.
  • ISO/IEC 23894:2023 - Artificial intelligence - Risk management. AI risk management framework.
  • ISO/IEC 27559:2022 - Privacy-enhancing data de-identification framework. De-identification and privacy risk assessment.
  • GDPR - Recital 26 and Article 4. Definitions of personal data and anonymization criteria.
  • EDPB, Guidelines 3/2019 on processing of personal data through video devices, final version 2020.
  • Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques.
  • NIST AI RMF 1.0, January 2023. AI risk management framework, including data and testing.
  • NIST SDNist Toolkit, 2023-2024. Tools for assessing privacy and utility of synthetic data.
  • Heusel et al., 2017, GANs Trained by a Two Time-Scale Update Rule - FID metric.
  • Kynkäänniemi et al., 2019, Improved Precision and Recall Metric for Assessing Generative Models.
  • Abadi et al., 2016, Deep Learning with Differential Privacy - DP-SGD.
  • Carlini et al., 2023, Extracting Training Data from Diffusion Models.