What is Multi-Object Tracking (MOT)?

Multi-Object Tracking (MOT) definition
The role of Multi-Object Tracking in image and video anonymization
Technologies used in Multi-Object Tracking
Key Multi-Object Tracking parameters and metrics
Challenges and limitations of Multi-Object Tracking
Regulatory references and practical context of use

Multi-Object Tracking (MOT) definition

Multi-Object Tracking, or MOT for short, is a computer vision and video analysis task in which multiple objects are located and followed simultaneously across consecutive video frames. The goal is not just to detect an object in a single frame, but to preserve its consistent identity over time despite motion, partial occlusion, and changes in scale, lighting, and viewing angle. In the technical literature, MOT is typically defined as the problem of estimating the trajectories of multiple objects from a sequence of visual observations. This definition is used in, for example, the MOTChallenge benchmarks developed since 2015, as well as in IEEE and Springer publications on computer vision.

In the context of image and video anonymization, MOT has clear practical value. A face detector or license plate detector only identifies an object in a single frame. A tracking mechanism, by contrast, makes it possible to assign an identifier to that same object over time and maintain masking continuity between frames. As a result, face blurring and license plate blurring are more stable and less prone to flickering, missed detections, or incorrect mask shifts. In offline anonymization systems, MOT is therefore a supporting layer for consistent video processing rather than a standalone business objective.

In practice, an MOT model works on input data produced by object detection. For video anonymization, this usually means combining two stages: first, an AI model detects faces or license plates, and then a tracking algorithm links detections from successive frames into trajectories. Only on that basis is a mask, blur, or pixelation applied. Deep learning is used here primarily to build detection models and, increasingly, re-identification and object association models that improve tracking quality.

The role of Multi-Object Tracking in image and video anonymization

For a single image, MOT does not apply because there is no time dimension. Its importance emerges in video footage, where the same object appears in many consecutive frames. For a Data Protection Officer or a person responsible for publishing materials, what matters is not only whether a face was detected, but whether it was blurred consistently throughout the entire period it appears in the footage.

In an anonymization system, MOT primarily improves the stability and completeness of masking. This directly reduces the risk of personal data being exposed through individual unblurred frames. Specifically:

it maintains continuous tracking of the same face or the same license plate between frames,
it reduces mask flicker when detection quality temporarily drops,
it allows the system to predict the object’s position during brief occlusions,
it reduces the number of situations in which an object is blurred only partially or with a delay,
it makes it easier to assess anonymization quality at the level of the entire sequence rather than a single frame.
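
The point about brief occlusions can be illustrated with a minimal Python sketch. It is not a Kalman filter and does not describe how Gallio PRO or any particular tracker works; it simply assumes the object keeps moving at the velocity last observed. The class name CoastingTrack, the max_missed limit, and the (x1, y1, x2, y2) box format are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class CoastingTrack:
    """One tracked face or plate (hypothetical helper for illustration)."""
    box: Box
    velocity: Tuple[float, float] = (0.0, 0.0)  # box shift per frame (dx, dy)
    missed: int = 0                             # consecutive frames without a detection

    def step(self, detection: Optional[Box], max_missed: int = 5) -> Optional[Box]:
        """Advance one frame; return the region to mask, or None if the track is dropped."""
        if detection is not None:
            # Estimate velocity from the shift of the box's top-left corner.
            self.velocity = (detection[0] - self.box[0], detection[1] - self.box[1])
            self.box = detection
            self.missed = 0
            return self.box
        self.missed += 1
        if self.missed > max_missed:
            return None  # occlusion lasted too long; stop masking a guessed position
        # No detection: shift the last known box by the last observed velocity.
        dx, dy = self.velocity
        x1, y1, x2, y2 = self.box
        self.box = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
        return self.box


track = CoastingTrack(box=(100, 50, 140, 90))
detections = [(104, 50, 144, 90), None, None, (116, 50, 156, 90)]
masks = [track.step(d) for d in detections]
```

In this example the two frames without a detection still receive a mask, at (108, 50, 148, 90) and (112, 50, 152, 90), so the blur does not flicker off while the detector briefly fails. The max_missed limit is a trade-off: a larger value bridges longer occlusions but risks masking a position the object has already left.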

It is worth clarifying the scope. In anonymization software such as Gallio PRO, automation applies to faces and license plates. MOT can therefore support stable blurring of these two object classes. This does not mean automatic detection of logos, tattoos, name badges, documents, or content displayed on monitor screens. Such elements may require manual work in the editor unless the system includes separate detection models for them.

Technologies used in Multi-Object Tracking

Modern MOT systems combine traditional motion estimation methods with machine learning models. In practice, most systems use the tracking-by-detection architecture, in which a detector runs on each frame and the tracker links its outputs over time. This is currently the dominant approach in both industrial and research applications.

A typical pipeline includes several technical stages:

object detection - for example, detecting faces or license plates in each frame,
motion prediction - often using a Kalman filter, originally described by R.E. Kalman in 1960,
data association - matching new detections to existing tracks, often using the Hungarian algorithm,
appearance features - re-identification embeddings that help distinguish similar objects,
handling occlusions and track termination - rules for initializing, maintaining, and closing tracks.
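
The data association stage listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it matches tracks to detections greedily by intersection over union (IoU), which is a simpler stand-in for the Hungarian algorithm and does not guarantee an optimal assignment. The function names, the min_iou threshold, and the box format are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: Dict[int, Box], detections: List[Box], min_iou: float = 0.3):
    """Greedily match predicted track boxes to detections by descending IoU.

    Returns (matches, unmatched_track_ids, unmatched_detection_indices),
    where matches maps a track id to a detection index.
    """
    pairs = sorted(
        ((iou(box, det), tid, di)
         for tid, box in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_dets = set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid in matches or di in used_dets:
            continue
        matches[tid] = di
        used_dets.add(di)
    unmatched_tracks = [tid for tid in tracks if tid not in matches]
    unmatched_dets = [di for di in range(len(detections)) if di not in used_dets]
    return matches, unmatched_tracks, unmatched_dets


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [(51, 50, 61, 60), (1, 0, 11, 10), (200, 200, 210, 210)]
matches, lost_tracks, new_dets = associate(tracks, detections)
```

Here track 1 is linked to detection 1 and track 2 to detection 0, while detection 2 is left unmatched. In a full pipeline, unmatched detections would typically start new tracks, and unmatched tracks would be kept alive for a few frames before being closed. Where SciPy is available, scipy.optimize.linear_sum_assignment can replace the greedy loop to obtain an optimal assignment.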

Well-known methods include SORT from 2016 and Deep SORT from 2017. SORT relies mainly on geometry and motion, making it fast but less effective under frequent occlusions. Deep SORT extends this model with appearance descriptors, which usually improves robustness against ID switches. Between 2021 and 2023, approaches such as ByteTrack and BoT-SORT were also widely cited for improved results on the MOTChallenge benchmarks. ByteTrack achieves this by also associating lower-confidence detections instead of discarding them, and BoT-SORT builds on that idea with camera-motion compensation and appearance cues.
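
The role of appearance descriptors can be shown with a short Python sketch. It is written in the spirit of appearance-based association and is not the Deep SORT implementation: real systems use embeddings with many dimensions produced by a re-identification network and combine them with motion cues. The three-element vectors, the function names, and the max_distance threshold below are assumptions made for illustration.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def closest_track(track_embeddings: Dict[int, Sequence[float]],
                  detection_embedding: Sequence[float],
                  max_distance: float = 0.4) -> Optional[int]:
    """Return the id of the track with the most similar appearance, or None."""
    best_id, best_dist = None, max_distance
    for tid, emb in track_embeddings.items():
        dist = cosine_distance(emb, detection_embedding)
        if dist < best_dist:
            best_id, best_dist = tid, dist
    return best_id


# Made-up embeddings for two tracks and one new detection.
known = {1: (1.0, 0.0, 0.0), 2: (0.0, 1.0, 0.0)}
print(closest_track(known, (0.9, 0.1, 0.0)))  # prints 1
print(closest_track(known, (0.0, 0.0, 1.0)))  # prints None
```

When two objects cross paths and their boxes overlap, geometry alone cannot tell them apart, whereas an appearance check like this one gives the tracker a second signal for keeping identifiers stable.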

Key Multi-Object Tracking parameters and metrics

MOT evaluation should not be based solely on detection performance. For video anonymization, tracking continuity and the risk of losing an object between frames also matter. The literature uses a set of standardized benchmark metrics.

Metric	Meaning	Interpretation in anonymization
MOTA	Multi-Object Tracking Accuracy - combines false positives, false negatives, and ID switches	A higher value means fewer overall tracking errors
MOTP	Multi-Object Tracking Precision - a measure of localization precision for matches, used in older MOT evaluation protocols	Affects the precision of the blur mask position
IDF1	Identification F1 score - a measure of identification consistency over time	Important for maintaining consistent blurring of the same object
HOTA	Higher Order Tracking Accuracy - a metric combining detection and association	Provides a good representation of the real tracking quality of full trajectories
FPS / latency	Processing speed and delay	Operationally relevant, although Gallio PRO does not perform real-time anonymization

For clarity, it is worth noting the simple relationship used in the literature for MOTA:

MOTA = 1 - (FN + FP + IDSW) / GT

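where FN is the number of missed objects, FP the number of false detections, IDSW the number of identity switches, and GT the number of ground-truth objects, each summed over all frames of the sequence. Because the total error count can exceed GT, MOTA can be negative. These metric definitions are used in, for example, the MOTChallenge benchmarks and in comparative publications since 2015.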
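
As a worked example, the relationship above can be computed directly. The Python sketch below assumes the per-frame error counts are already available from an evaluation tool; the tuple layout and the sample numbers are hypothetical.

```python
from typing import Iterable, Tuple

FrameCounts = Tuple[int, int, int, int]  # (FN, FP, IDSW, GT) for one frame


def mota(frames: Iterable[FrameCounts]) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with every term summed over all frames."""
    fn = fp = idsw = gt = 0
    for f_fn, f_fp, f_idsw, f_gt in frames:
        fn += f_fn
        fp += f_fp
        idsw += f_idsw
        gt += f_gt
    if gt == 0:
        raise ValueError("MOTA is undefined when there are no ground-truth objects")
    return 1.0 - (fn + fp + idsw) / gt


# Hypothetical three-frame clip with four annotated faces per frame.
counts = [(0, 0, 0, 4), (1, 0, 0, 4), (0, 1, 1, 4)]
print(round(mota(counts), 3))  # 3 errors over 12 ground-truth objects -> 0.75
```

For anonymization, the FN term deserves the most attention, since each missed object roughly corresponds to a face or license plate that would be left unmasked in that frame.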

Challenges and limitations of Multi-Object Tracking

MOT does not eliminate problems related to input data quality. If face detection or license plate detection is weak, tracking will also be unreliable. That is why anonymization effectiveness depends on the entire processing chain, not just the tracking module itself.

The most common limitations are as follows:

heavy occlusions and the object leaving the frame,
small object size and low footage resolution,
motion blur and lossy video compression,
high visual similarity between objects in the same scene,
sudden shot changes or editing cuts that break track continuity.

From a privacy compliance perspective, this means the final result must be validated. MOT improves masking stability, but it does not replace quality control in the anonymization process. This is particularly important for materials that are published or shared with third parties.
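
One simple automated check that can support such validation is to look for frames in which a tracked object has no mask between its first and last appearance. The Python sketch below assumes the pipeline can export, per track identifier, the list of frame numbers in which a mask was applied; that export format and the function name are hypothetical. It only flags gaps inside existing tracks and cannot reveal objects that were never detected, so it complements manual review rather than replacing it.

```python
from typing import Dict, List, Tuple


def find_mask_gaps(masked_frames: Dict[int, List[int]]) -> Dict[int, List[Tuple[int, int]]]:
    """For each track id, return inclusive frame ranges that have no mask
    between the track's first and last masked frame."""
    gaps: Dict[int, List[Tuple[int, int]]] = {}
    for tid, frames in masked_frames.items():
        ordered = sorted(set(frames))
        track_gaps = [
            (prev + 1, cur - 1)
            for prev, cur in zip(ordered, ordered[1:])
            if cur - prev > 1
        ]
        if track_gaps:
            gaps[tid] = track_gaps
    return gaps


# Hypothetical export: track 7 has no mask in frames 13 and 14.
report = find_mask_gaps({7: [10, 11, 12, 15, 16], 8: [3, 4, 5]})
print(report)  # {7: [(13, 14)]}
```

Frames reported this way are natural candidates for a reviewer to inspect first before material is published or shared.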

Regulatory references and practical context of use

MOT is not a term explicitly defined in the GDPR or in data protection standards, nor is it a standalone legal obligation. It is an image processing technique that supports the goal of effective anonymization or de-identification of video material. In practice, it should be viewed as a technical measure supporting the principles of data protection by design and by default (often called privacy by design and privacy by default) set out in Article 25 GDPR, as well as security of processing under Article 32 of the same regulation, Regulation (EU) 2016/679 of 27 April 2016.

In operational use, it should be remembered that Gallio PRO runs in an on-premise model and is designed for offline anonymization of images and video recordings. The software automatically blurs faces and license plates, but it does not anonymize video streams or operate in real time. In this context, MOT should be understood as a mechanism that improves processing consistency after footage has been uploaded into the system, not as a real-time surveillance tool. This matters for risk assessment, deployment architecture, and the scope of operational data. In addition, in line with the system design assumptions, logs should not contain personal data or records of face and license plate detections.