AI Model Training with Photo & Video Datasets: Anonymization and Face Blurring Workflow

Mateusz Zimoch
Published: 12/2/2025
Updated: 3/10/2026

Visual data anonymization means transforming photos and videos so that natural persons are no longer identifiable. In practice this often involves face blurring and license plate blurring, combined with removal of metadata and safeguards against re-identification. For AI model training, anonymization can enable the use of rich datasets while reducing personal data risk and supporting data protection by design and by default requirements.

A black-and-white photo featuring a phone with an AI chat app open against the screen background

Regulatory context for training models with photos and videos

Under the GDPR and the UK GDPR, a photo or video is personal data if a person can be identified directly or indirectly, including by combining elements such as setting, clothing or unique objects [1][2]. If individuals are identifiable, model training requires a lawful basis and must follow the principles of purpose limitation, data minimization and storage limitation [1]. Anonymized data falls outside the scope of the GDPR only if identification of a person is no longer possible by any means reasonably likely to be used, taking account of available technology and costs (Recital 26) [1].

The EU AI Act introduces governance across the AI lifecycle. It includes requirements on risk management, data governance and technical documentation for certain AI systems, and it interacts with existing EU data protection law rather than replacing it. Anonymization and robust redaction can support data minimization and reduce risks such as unintended memorization and model inversion, but they do not automatically make a use-case compliant if individuals remain identifiable [5].

Supervisory authorities highlight special considerations for images from CCTV or public spaces, especially when used beyond security purposes, for analytics or publication [2][3]. Organisations often carry out a Data Protection Impact Assessment (DPIA) before large-scale or systematic monitoring of publicly accessible areas, or when new technology could heighten risks [1][3].

A black-and-white photo showing a phone with an AI chat app running, a finger touching it, against the background of the screen with the same app

While many publishing and training scenarios require a lawful basis or anonymization, image rights practice often cites three well-known exceptions. They are context-dependent and vary by jurisdiction:

  • The person is widely known (a public figure), and the image was taken in connection with their public role.
  • The person appears only as a part of a larger scene, such as a meeting, landscape, or public event.
  • The person was paid to pose, unless they explicitly stated that they do not consent to the distribution of their image.

These exceptions do not switch off data protection duties when individuals remain identifiable. They are often considered in parallel with legitimate interests tests, freedom of expression exemptions and local image rights. For AI training, reliance on such exceptions is less predictable than anonymization, because model training is often a repurposing beyond the original context of capture.

A colorless graphic showing a laptop on a desk, with a 3D graphic of connected points forming lines on the screen, resembling a brain

Common risk points in visual data anonymization

Re-identification risk. Even when faces are blurred, a combination of distinctive clothing, tattoos, location landmarks or timestamps may make a person identifiable. Organisations often treat blurring as one layer within a broader strategy that can include cropping, masking or background redaction for high-risk scenes, guided by Recital 26’s standard of reasonable means [1].

Background identifiers. Whiteboards, screens, documents in the frame and building signage can expose names, emails or addresses. License plates in the background are easy to miss without multi-scale detection.

Metadata. EXIF data can include GPS coordinates, device identifiers and capture dates. Removing or minimizing metadata before sharing or publishing can significantly reduce linkage risk [2].
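To make metadata removal concrete, the sketch below strips EXIF-bearing APP1 segments from a JPEG byte stream using only the Python standard library. It is a minimal illustration, not a production sanitizer: real pipelines typically use an imaging library and also handle XMP, IPTC and embedded thumbnails.

```python
def strip_exif(jpeg_bytes: bytes) -> bytes:
    """Remove EXIF-bearing APP1 segments from a JPEG byte stream.

    Minimal sketch: walks the JPEG marker structure and drops APP1
    (0xFFE1) segments, which carry EXIF fields such as GPS
    coordinates, device identifiers and capture dates.
    """
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    out = bytearray(b"\xff\xd8")
    i = 2
    while i < len(jpeg_bytes):
        if jpeg_bytes[i] != 0xFF:
            out += jpeg_bytes[i:]  # entropy-coded data: copy verbatim
            break
        marker = jpeg_bytes[i + 1]
        if marker == 0xD9:  # EOI: end of image
            out += jpeg_bytes[i:i + 2]
            break
        if 0xD0 <= marker <= 0xD8 or marker == 0x01:  # standalone markers
            out += jpeg_bytes[i:i + 2]
            i += 2
            continue
        # Segment length field includes its own two bytes.
        seg_len = int.from_bytes(jpeg_bytes[i + 2:i + 4], "big")
        segment = jpeg_bytes[i:i + 2 + seg_len]
        if marker != 0xE1:  # keep everything except APP1 (EXIF/XMP)
            out += segment
        i += 2 + seg_len
        if marker == 0xDA:  # SOS: compressed scan data follows
            out += jpeg_bytes[i:]
            break
    return bytes(out)
```

The same idea applies to video containers, where creation timestamps and GPS atoms sit in format-specific metadata boxes.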

Detection errors. Face and plate detectors generate false negatives and false positives. Missed detections expose identities. Over-blurring can degrade dataset utility. Accuracy is highly context-dependent and varies by lighting, angle, occlusion and camera type. Human-in-the-loop review remains a common practice for sensitive releases.

A black-and-white graphic showing a small robot holding two squares with a mountain logo and a speech bubble reading "prompt...GENERATE"

A practical workflow for face blurring and license plate blurring

  1. Define the purpose. Describe whether images will be published, used for internal analytics, or included in AI model training. The use determines anonymization strength and retention periods.
  2. Select lawful basis and risk controls. Where individuals are identifiable, organisations assess an appropriate lawful basis (for example legitimate interests where applicable, or consent in some contexts) and decide whether a DPIA is required [1][3]. When in doubt, push toward anonymization that meets Recital 26’s standard.
  3. Ingest and classify assets. Separate photos and videos by scenario, camera type and location sensitivity. Track provenance and rights, including model releases for paid posing when available.
  4. Choose on-premise software (where appropriate). On-premise software can keep datasets within the organisation’s network and reduce external transfer risk. It can support encryption at rest, identity-based access, and audit logging aligned with accountability and data protection by design [1].
  5. Configure detectors and thresholds. Use models for faces and license plates. Calibrate minimum face size, confidence thresholds, and motion-based pre-detection for video. For crowded scenes, enable multi-scale detection and overlapping-mask resolution.
  6. Automate redaction. Apply face blurring and license plate blurring. For high-risk contexts, add full-body or background masking. Use consistent kernels, pixelation levels or Gaussian blur that prevent practical reversal under reasonably likely means (rather than assuming blur is irreversible in all cases).
  7. Human-in-the-loop review. Sample frames, search for missed detections, and correct errors with annotation tools. Create playbooks for recurring edge cases such as reflections, posters with faces, screens showing people in video calls, and mirrored helmets.
  8. Strip metadata and prepare outputs. Remove EXIF and device identifiers. Export publishing copies at necessary resolution only. For training datasets, keep a mapping of originals to anonymized versions only if needed, store it separately, and restrict access (for example via role-based access controls). If possible, avoid retaining direct linkability.
  9. Test re-identification risk. Attempt linkage using context clues and reverse image search where applicable. Record residual risk and improvement actions. Re-run on diverse scenes and devices.
  10. Log, retain and delete. Keep processing logs and redaction manifests to the minimum necessary for accountability. Define retention by purpose. Delete non-essential originals or move them to a sealed archive with strict access policies.
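As one illustration of steps 5–6, the redaction pass can be decoupled from detection: whatever detector produces the bounding boxes, an irreversible block-averaging step discards the underlying detail rather than merely smoothing it. The NumPy sketch below is a minimal assumption-laden example, not a production implementation (real pipelines typically use OpenCV or a dedicated tool):

```python
import numpy as np

def pixelate_regions(image: np.ndarray, boxes, block: int = 16) -> np.ndarray:
    """Irreversibly pixelate rectangular regions of an image.

    `boxes` is a list of (x, y, w, h) detections, e.g. from a face or
    license-plate detector (detector choice is up to the pipeline; this
    sketch covers only the redaction step). Each block inside a region
    is replaced by its average value, so fine detail is destroyed
    rather than blurred in a potentially recoverable way.
    """
    out = image.copy()
    for x, y, w, h in boxes:
        roi = out[y:y + h, x:x + w]  # view into the output array
        for by in range(0, h, block):
            for bx in range(0, w, block):
                patch = roi[by:by + block, bx:bx + block]
                patch[...] = patch.mean(axis=(0, 1), keepdims=True).astype(out.dtype)
    return out
```

Larger `block` values destroy more information; as the workflow notes, the chosen strength should resist practical reversal under reasonably likely means, which is a test to run, not an assumption to make.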

On-premise software considerations

On-premise software can reduce transfers of personal data to external processors and can help manage exposure to third-country access, depending on the organisation's architecture and vendors. It can also facilitate auditability, supporting GDPR accountability and aligning with the EU AI Act's lifecycle governance expectations for in-scope systems [1][5]. Check out Gallio PRO for on-premise processing options that fit this workflow.

A graphic with desaturated colors depicting a modern computer screen with connected graphics, text: 'TEXT TO IMAGE, input image prompt, Generate'

GDPR vs UK GDPR for publishing photos and videos

The following table highlights common practice points. It does not replace legal analysis and should be read as high-level, context-dependent guidance derived from publicly available materials.

| Topic | GDPR (EU) | UK GDPR + Data Protection Act 2018 |
| --- | --- | --- |
| Images as personal data | Photos and videos are personal data if a person is identifiable, directly or indirectly [1]. | Same approach. ICO guidance provides practical examples for photos and CCTV [2][3]. |
| Lawful basis for publishing | Often legitimate interests for operational publishing, subject to a balancing test and context. Consent is common in scenarios such as close-up marketing portraits. Context-dependent. | Same. The ICO stresses transparency, reasonable expectations and the right to object where appropriate [2]. |
| DPIA signals | Large-scale systematic monitoring of publicly accessible areas, or new technology raising risk, is a common DPIA trigger [1]. | ICO guidance indicates that systematic monitoring and use of new technology are likely to require a DPIA, depending on scale and risk [3]. |
| Anonymization standard | Anonymized if identification is no longer reasonably likely given available means and costs (Recital 26) [1]. | Same standard in the UK GDPR. ICO guidance discusses robust anonymisation and managing residual risk [2]. |
| Freedom of expression carve-outs | Member State rules apply for journalistic and academic/artistic/literary expression purposes. Highly contextual. | The DPA 2018 provides exemptions, including for journalism and for research/statistics under specific conditions. Highly contextual [4]. |

Teams planning regular publishing or dataset sharing can operationalize these points in template DPIA checklists, redaction profiles and release procedures. Download a demo to test how this looks in an on-premise environment.

A gray photo showing a screen with the text and logo 'OpenAI'

Quality assurance for anonymized datasets

Quality assurance should focus on measurable coverage and error rates. Create ground-truth samples with manual annotations. Compare automated face blurring and license plate blurring against ground truth to estimate false negatives and false positives. Track performance by scenario, such as night footage, helmets, masks, and fisheye cameras. Report results as context-dependent metrics rather than universal accuracy claims. For publishing, apply stricter thresholds and manual checks. For model training, balance anonymization strength with utility by suppressing high-risk attributes while retaining non-identifying features relevant to the model task.
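The matching step behind such error-rate estimates can be as simple as greedy IoU (intersection-over-union) matching between ground-truth annotations and automated detections. A sketch under that assumption follows; production QA often uses stricter matching (e.g. Hungarian assignment) and per-scenario breakdowns:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def coverage_report(ground_truth, detections, thresh=0.5):
    """Greedy IoU matching; returns (false_negatives, false_positives).

    False negatives are annotated faces/plates the pipeline missed
    (the critical metric for anonymization); false positives are
    spurious redactions that cost dataset utility.
    """
    unmatched = list(detections)
    fn = 0
    for gt in ground_truth:
        best = max(unmatched, key=lambda d: iou(gt, d), default=None)
        if best is not None and iou(gt, best) >= thresh:
            unmatched.remove(best)
        else:
            fn += 1
    return fn, len(unmatched)
```

Reporting these counts per scenario (night footage, helmets, fisheye cameras) keeps the metrics context-dependent, as recommended above.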

Organisations aiming to operationalize this workflow can align it with internal policy and supplier due diligence. Contact us to discuss on-premise processing controls, role-based access and audit logging.

A white question mark spray-painted on the asphalt road

FAQ: AI Model Training with Photo & Video Datasets

Does face blurring alone make a dataset anonymous under the GDPR?

Not always. If a person remains identifiable by reasonably likely means, such as distinctive clothing or location cues, the dataset still contains personal data. A combination of face blurring, license plate blurring, background redaction and metadata removal may be required depending on context and risk [1][2].

When should license plate blurring be applied?

Apply it whenever vehicles appear in a way that could identify a driver, owner, or be linkable to an individual (for example when plates are readable and can be connected to a person in context). This is common in street scenes, parking lots and building entrances. In model training, enable plate detection at multiple scales to handle distant vehicles.
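One simple way to get multi-scale behaviour from a single-scale detector is an image pyramid: run the detector on progressively downscaled copies of the frame and map the boxes back to full-frame coordinates. In the sketch below, `detect` is a hypothetical callable standing in for whatever plate detector the pipeline uses, and merging of overlapping detections across levels (e.g. non-maximum suppression) is omitted:

```python
import numpy as np

def pyramid_detect(frame: np.ndarray, detect, steps=(1, 2, 4)):
    """Run a single-scale detector over a crude image pyramid.

    `detect` takes an image array and returns (x, y, w, h) boxes in
    that image's coordinates. Downscaling by `step` brings plates of
    different apparent sizes into the detector's working range; each
    box is rescaled back to the original frame.
    """
    boxes = []
    for step in steps:
        scaled = frame[::step, ::step]  # nearest-neighbour downscale (sketch only)
        for x, y, w, h in detect(scaled):
            boxes.append((x * step, y * step, w * step, h * step))
    return boxes
```

Production systems would use proper resampling and deduplicate overlapping boxes before redaction.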

Is cloud processing acceptable for blurring?

It depends on risk, architecture and contracts. On-premise software can reduce external transfers and support stronger control over access and retention. If cloud is used, implement appropriate security measures and ensure a compliant controller-processor arrangement, including any requirements for international transfers under GDPR/UK GDPR.

How should organisations handle metadata?

Remove GPS coordinates and device identifiers from publishing copies. For internal compliance, keep only the minimum technical logs needed for accountability and troubleshooting, and avoid storing unnecessary metadata that would enable re-identification. ICO guidance discusses careful handling of images and associated information [2].

What level of blur is sufficient?

There is no universal level. Choose pixelation or Gaussian blur that prevents practical identification and is robust against reasonably likely enhancement. Test across lighting, angles and motion. Strength should typically be higher for public release than for internal analytics.
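One automated sanity check is to re-run the detector on the redacted output and strengthen the redaction until it no longer fires on the original regions. This is necessary but not sufficient (a human may still recognize a person the detector cannot), so it complements rather than replaces manual review. In the sketch, `redact(image, boxes, strength)` and `detect(image)` are hypothetical callables standing in for the chosen tools:

```python
def overlaps(a, b):
    """True if two (x, y, w, h) boxes intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def verify_redaction(image, boxes, redact, detect, max_rounds=3):
    """Escalate redaction strength until the detector stops firing
    on the redacted regions; flag the asset for manual review if it
    never does."""
    strength = 1
    for _ in range(max_rounds):
        out = redact(image, boxes, strength)
        residual = [d for d in detect(out)
                    if any(overlaps(d, b) for b in boxes)]
        if not residual:
            return out
        strength += 1
    raise RuntimeError("redaction never suppressed detections; review manually")
```

Running this check across lighting, angle and motion conditions gives evidence for the "robust against reasonably likely enhancement" claim rather than an assertion.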

How does the EU AI Act affect visual datasets?

It strengthens lifecycle governance expectations (for in-scope systems), including risk management and data governance requirements, and it operates alongside existing data protection law. Anonymization and minimization can help reduce personal data risks, but they do not remove GDPR obligations if people remain identifiable [5].

Are the three exceptions safe to rely on for AI training?

They are context-dependent and typically relate to image publication/image rights, not broad repurposing for training. For training datasets, anonymization (or another clearly applicable lawful basis with appropriate safeguards) usually offers more predictable compliance outcomes.

References list

[1] Regulation (EU) 2016/679 (General Data Protection Regulation), in particular Recital 26 and Articles 4, 5, 25 and 35.
[2] ICO, Guide to the UK GDPR - What is personal data - What about photographs and video? https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/
[3] ICO, Video surveillance (including CCTV) guidance. https://ico.org.uk/for-organisations/guide-to-data-protection/ico-codes-of-practice/video-surveillance-cctv/
[4] Data Protection Act 2018 (UK), relevant exemptions including journalism and research/statistics (context-dependent conditions apply).
[5] European Commission, Artificial Intelligence Act (AI Act) policy and legislative page. https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence