T-Closeness Definition
T-closeness is a privacy model introduced by Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian in 2007 as an extension of the earlier k-anonymity and l-diversity models. Its purpose is to limit attribute disclosure: the situation in which, once a record is assigned to an anonymity group, a sensitive attribute can be inferred with high probability from the distribution of values within that group. Under the t-closeness model, the distance between the distribution of a sensitive attribute in each equivalence class and the distribution of that attribute across the entire dataset must not exceed a threshold t.
In the original literature, this distance is defined using Earth Mover’s Distance (EMD). Formally, for every equivalence class E, the condition is: distance(D(E), D(T)) <= t, where D(E) denotes the distribution of the sensitive attribute in class E, and D(T) denotes the distribution of that attribute in the entire dataset. Original paper: Li, Li, Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity", ICDE 2007, IEEE.
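The condition above can be sketched in Python for a nominal sensitive attribute, where EMD reduces to half the sum of absolute differences between the two distributions (the variational distance used in the original paper). The function names and the toy data are illustrative, not part of any library:

```python
from collections import Counter

def distribution(values, domain):
    """Empirical distribution of a sensitive attribute over a fixed domain."""
    counts = Counter(values)
    total = len(values)
    return [counts.get(v, 0) / total for v in domain]

def emd_nominal(p, q):
    """EMD for nominal attributes: half the sum of absolute differences
    (equal ground distance between all categories, per the 2007 paper)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def satisfies_t_closeness(classes, full_dataset, domain, t):
    """Check distance(D(E), D(T)) <= t for every equivalence class E."""
    global_dist = distribution(full_dataset, domain)
    return all(
        emd_nominal(distribution(cls, domain), global_dist) <= t
        for cls in classes
    )
```

For example, two classes that each mirror the global 50/50 split satisfy any positive t, while two perfectly homogeneous classes sit at distance 0.5 from the global distribution and fail a threshold of 0.3.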
In the context of photo and video anonymization, t-closeness is not a face blurring or license plate blurring mechanism. It is a model for assessing the risk of disclosing information derived from metadata, labels, detection results, or scene descriptions that remain after processing the material. It therefore matters when an organization builds datasets, statistical exports, or reports from photo and video anonymization processes, not when the software simply applies a mask to a face.
The Role of T-Closeness in Photo and Video Anonymization
In systems that process images and video recordings, privacy risk does not end with face blurring. Even after direct identifiers are removed, data may still remain that indirectly reveals information about individuals or events. T-closeness is useful as an analytical layer for secondary data.
In practice, this mainly applies to derived datasets such as content descriptions, detection statistics, training annotations, or operational reports. In such cases, an equivalence class may be, for example, a group of recordings from the same location, day, or event type.
- Quasi-identifiers - camera location, time of day, object type, weather conditions, shot length, place category.
- Sensitive attributes - the presence of a child, medical intervention, emergency services, a protest, a traffic incident, or another elevated-risk context.
- Risk - combining quasi-identifiers with the distribution of sensitive attributes may reveal more than face blurring alone would suggest.
Practical example: if a report for a specific camera and time window shows almost exclusively recordings labeled as "medical intervention," then even without visible identities, a sensitive event context may still be disclosed. T-closeness is designed to prevent this kind of distributional skew.
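The skew check described in this example can be sketched as follows. The camera identifiers, time windows, and labels below are hypothetical; each (camera, window) pair forms an equivalence class whose label distribution is compared against the global one:

```python
from collections import Counter, defaultdict

# Hypothetical records: (camera_id, time_window, sensitive_label)
records = [
    ("cam_03", "22-23", "medical intervention"),
    ("cam_03", "22-23", "medical intervention"),
    ("cam_03", "22-23", "medical intervention"),
    ("cam_03", "22-23", "medical intervention"),
    ("cam_07", "22-23", "routine"),
    ("cam_07", "22-23", "routine"),
    ("cam_07", "22-23", "routine"),
    ("cam_07", "22-23", "routine"),
    ("cam_11", "22-23", "medical intervention"),
    ("cam_11", "22-23", "medical intervention"),
    ("cam_11", "22-23", "routine"),
    ("cam_11", "22-23", "routine"),
]

def skewed_classes(records, t):
    """Flag (camera, window) classes whose sensitive-label distribution
    drifts more than t from the global distribution (nominal EMD)."""
    labels = sorted({r[2] for r in records})
    total = Counter(r[2] for r in records)
    n = len(records)
    global_dist = {lab: total[lab] / n for lab in labels}

    groups = defaultdict(list)
    for cam, window, lab in records:
        groups[(cam, window)].append(lab)

    flagged = []
    for key, labs in groups.items():
        local = Counter(labs)
        m = len(labs)
        dist = 0.5 * sum(abs(local[lab] / m - global_dist[lab]) for lab in labels)
        if dist > t:
            flagged.append(key)
    return flagged
```

Here cam_03 (only "medical intervention") and cam_07 (only "routine") are both flagged at t = 0.3, while cam_11, whose mix matches the global 50/50 split, passes.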
How T-Closeness Works in Practice
The model is based on equivalence classes, meaning groups of records that are indistinguishable in terms of quasi-identifiers. The distribution of the sensitive attribute within each group is then compared with the global distribution.
For ordered or numerical data, EMD is typically used because it takes into account the "distance" between categories. For nominal data, the original paper uses a distance equal to half the sum of absolute differences between distributions. The choice of metric should be explicitly documented.
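A minimal sketch of the two distances, assuming equally spaced values for the ordered attribute (both formulas follow the 2007 paper). Note how the ordered EMD distinguishes a one-step shift of probability mass from a two-step shift, while the nominal distance treats them identically:

```python
def emd_ordered(p, q):
    """EMD for an ordered attribute with equally spaced values,
    normalized to [0, 1]: sum of absolute cumulative differences
    divided by (m - 1)."""
    m = len(p)
    cum, total = 0.0, 0.0
    for i in range(m - 1):  # the final cumulative sum is always zero
        cum += p[i] - q[i]
        total += abs(cum)
    return total / (m - 1)

def dist_nominal(p, q):
    """Variational distance for nominal attributes: half the sum of
    absolute differences (equal ground distance between categories)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

Shifting all mass from the first to the second of three categories gives an ordered EMD of 0.5 but a nominal distance of 1.0; shifting it to the third category gives 1.0 under both.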
| Model Element | Meaning in Photo and Video Data |
|---|---|
| Quasi-identifiers | descriptive features of the material that do not identify a person on their own but can narrow the set when combined |
| Sensitive attribute | a feature that reveals the event context or a category requiring special caution |
| Equivalence class | a group of recordings or images with the same generalized quasi-identifiers |
| Threshold t | the maximum permissible difference between the local and global distributions |
The lower the threshold t, the stronger the privacy protection, but also the greater the loss of data utility. There is no single universal threshold imposed by law or by any ISO standard. The value of t is chosen depending on the processing purpose, dataset size, and tolerated risk.
Key T-Closeness Parameters and Metrics
Evaluating t-closeness requires clearly defined, measurable parameters. In practical project work, documentation should cover not only the value of t itself, but also how equivalence classes are constructed and what information cost the anonymization introduces.
- t - the maximum allowable distance between distributions.
- EMD - the primary metric for measuring distance between distributions for ordered or numerical attributes, as indicated in the original 2007 paper.
- Equivalence class size - affects the stability of distribution estimates.
- Information loss - the loss of information after data generalization or suppression.
- Disclosure risk - the risk of attribute disclosure after anonymization.
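The interplay between equivalence class size and information loss can be illustrated with a toy sketch. The records and the hour-bucket generalization below are hypothetical: coarser buckets produce larger, more stable classes, at the cost of temporal precision:

```python
from collections import Counter

# Hypothetical per-recording quasi-identifiers: (camera_id, hour of day)
records = (
    [("cam_03", h) for h in (21, 22, 22, 23)]
    + [("cam_07", h) for h in (21, 23)]
)

def class_sizes(records, hour_bucket):
    """Equivalence class sizes after generalizing the hour into buckets
    of `hour_bucket` hours; larger buckets mean larger classes (more
    stable distribution estimates) but greater information loss."""
    return dict(Counter((cam, hour // hour_bucket) for cam, hour in records))
```

With one-hour buckets the smallest class holds a single recording, which makes its local distribution meaningless; with four-hour buckets the same data collapses into two classes of sizes 4 and 2.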
In photo and video environments, it is also worth adding operational metrics that are not part of the formal definition of t-closeness but still affect the security of the overall process:
- Face and license plate detection precision and recall - detection errors affect the quality of the input data used for further anonymization.
- False negative rate - a missed face or license plate creates a direct privacy risk that t-closeness does not compensate for.
- Batch processing time - operationally important, but not a parameter of the t-closeness model.
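The first two operational metrics above follow directly from standard detection counts. The helper below is a generic sketch, not tied to any specific anonymization product:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and false negative rate from detection counts.

    tp: correctly detected faces/plates
    fp: spurious detections
    fn: missed faces/plates (each one is a direct privacy exposure
        that t-closeness does not compensate for)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fnr = fn / (tp + fn) if tp + fn else 0.0
    return precision, recall, fnr
```

For instance, 90 true positives, 10 false positives, and 10 false negatives yield precision 0.9, recall 0.9, and a false negative rate of 0.1.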
T-Closeness vs. Face and License Plate Blurring
These two levels of protection should be clearly separated. Face blurring and license plate blurring operate at the image pixel level. T-closeness operates at the descriptive or analytical data level. They are not interchangeable solutions.
In systems such as Gallio PRO, automatic processing applies to faces and license plates. It does not include automatic detection of logos, tattoos, name badges, documents, or content displayed on monitors. Such elements may be masked manually in the editor. If, after anonymization, an organization stores additional labels or metadata about the material, that is where a model such as t-closeness may become relevant.
Automatic face and license plate blurring requires AI models, usually based on deep learning and trained on image data for object detection tasks. T-closeness is not used to train those models. It can, however, support safer sharing of annotation datasets, statistics, or model evaluation results.
Challenges and Limitations of T-Closeness
The model is more restrictive than k-anonymity and l-diversity, but it does not solve every problem. In photo and video applications, limitations related to high-dimensional data and image semantics are especially important.
- It does not work on raw pixels - it requires a tabular representation of attributes.
- Sensitivity to the definition of the sensitive attribute - incorrect scene categorization reduces the value of the model.
- Utility cost - heavy generalization may reduce the analytical value of the dataset.
- No normative threshold t - risk assessment and decision documentation are necessary.
- It does not replace legal compliance - satisfying t-closeness alone does not mean GDPR compliance.
Normative References and Sources
T-closeness is a scientific concept, not an ISO standard or a requirement explicitly stated in the GDPR. Even so, it fits within the broader logic of data protection by design and risk minimization.
- Li, N., Li, T., Venkatasubramanian, S., "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity", IEEE 23rd International Conference on Data Engineering, 2007.
- Regulation (EU) 2016/679 of the European Parliament and of the Council - GDPR, in particular Article 5, Article 25, and Recital 26.
- Opinion 05/2014 of the Article 29 Working Party on anonymisation techniques, together with EDPB guidance on pseudonymisation and risk assessment, can be read alongside re-identification risk assessment, although these documents do not establish t-closeness as a mandatory standard.
In compliance practice, t-closeness can be treated as a supporting technique for risk assessment of derived data related to photos and video recordings. It does not replace access controls, retention rules, legal basis analysis, or the technical effectiveness of face and license plate blurring.