What is object tracking in video?

Object tracking definition

Object tracking, or tracking objects across consecutive video frames, is the process of assigning the same object a consistent temporal identity throughout an image sequence. In the practical context of photo and video anonymization, this means preserving the information that a face or license plate detected in frame t is the same object that appeared in frames t-1 and t-2 and that will appear in subsequent frames. As a result, the blur or redaction mask does not “jump” between objects or disappear temporarily during brief drops in detection quality.

In the technical literature, object tracking is usually treated separately from object detection. Detection answers the question of whether a face or license plate is present in a given frame and where it is located. Tracking answers the question of whether it is the same object as before and how to predict its position between detections. In video anonymization systems, object tracking therefore acts as a stabilization layer for the detection algorithm. It is especially important in cases of partial occlusion, camera movement, changes in object scale, and temporary image blur.

This definition is consistent with the approach used in research on multi-object tracking in video, including the MOTChallenge benchmarks developed since 2015 and IEEE survey papers on Multiple Object Tracking. In the context of Gallio PRO, the term refers to tracking faces and license plates between frames in order to maintain continuity of video anonymization. It does not refer to real-time stream anonymization, because Gallio PRO does not perform anonymization in real time.

The role of object tracking in video anonymization

In a system for blurring faces and license plates, running detection on every frame is not enough on its own. A detector may briefly lose an object because of glare, motion, low resolution, or occlusion by another element in the scene. Object tracking reduces the impact of such interruptions and helps keep the anonymization mask in a logical position.

In practice, tracking performs several functions that are critical for compliance and processing quality:

  • maintaining continuous blurring of the same face or the same license plate across consecutive frames,
  • reducing mask “flicker” when the detector behaves unstably,
  • predicting the object’s position between detections based on a motion model,
  • lowering the risk of temporarily exposing personal data in individual frames,
  • enabling consistent manual correction in the editor when automation needs adjustment.

For a Data Protection Officer, this has direct practical significance. An anonymization incident does not have to affect an entire recording. Just a few unblurred frames may be enough for a face or license plate number to become readable when the video is paused. For that reason, object tracking should be treated as a risk-reduction mechanism, not merely as a feature that improves the visual quality of the export.
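
One way to make this risk measurable is to check a track for frames in which no mask was applied. The sketch below is only an illustration: it assumes a track is available as a list of frame indices where the mask was drawn, and it assumes the object stays visible between its first and last masked frame. The function name `find_mask_gaps` and this data layout are hypothetical and do not describe any Gallio PRO interface.

```python
def find_mask_gaps(masked_frames):
    """Return (first_missing, last_missing) frame ranges inside one track.

    masked_frames: iterable of frame indices where the mask was applied.
    Assumption: the object is visible between its first and last masked
    frame, so every missing index in between is a potential exposure.
    """
    frames = sorted(set(masked_frames))
    gaps = []
    for prev, curr in zip(frames, frames[1:]):
        if curr - prev > 1:
            gaps.append((prev + 1, curr - 1))
    return gaps


# Frames 13-14 and 17-19 have no mask in this made-up track
gaps = find_mask_gaps([10, 11, 12, 15, 16, 20])
print(gaps)
```

A non-empty result does not prove that personal data was exposed, but it points a reviewer to the exact frames worth checking by hand.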

How face and license plate tracking works between frames

A typical pipeline consists of detection, motion estimation, object association, and trajectory updates. In modern systems, detection is usually performed by deep learning models, because faces and license plates vary in scale, angle, and quality in ways that are difficult to describe with simple rules. The AI model detects the object in a single frame, and the tracker then follows that object between frames.

The most commonly used technical components are:

  • an object detector, such as a CNN or transformer model that detects faces or license plates in a single frame,
  • a motion model, often a Kalman filter, classically described by R.E. Kalman in 1960 and used to predict the object’s next position,
  • an association algorithm, for example the Hungarian algorithm applied to the problem of assigning detections to existing tracks,
  • similarity measures, such as IoU, visual feature distance, trajectory consistency, and bounding box size consistency,
  • track management mechanisms, including initialization, confirmation, loss, and track termination.
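
Two of these components, the IoU similarity measure and the association step, can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, the `min_iou` threshold of 0.3 is an arbitrary assumption, and the assignment is found by brute force over permutations. That is a stand-in for the Hungarian algorithm: it yields an optimal assignment for a handful of boxes but does not scale.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign(tracks, detections, min_iou=0.3):
    """Return {track_index: detection_index} with the highest total IoU.

    Brute-force search, used here instead of the Hungarian algorithm;
    practical only for a few boxes per frame.
    """
    scores = [[iou(t, d) for d in detections] for t in tracks]
    # Pad with None so that any track may also remain unmatched
    slots = list(range(len(detections))) + [None] * len(tracks)
    best_pairs, best_score = {}, 0.0
    for perm in permutations(slots, len(tracks)):
        pairs = {t: d for t, d in enumerate(perm)
                 if d is not None and scores[t][d] >= min_iou}
        score = sum(scores[t][d] for t, d in pairs.items())
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(assign(tracks, detections))  # {0: 1, 1: 0}
```

Pairs below `min_iou` are dropped rather than forced, so a track with no plausible detection stays unmatched and can be handled by track management.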

A simplified scheme can be described by the formula:

Track(t) = Associate(Detections(t), Predict(Track(t-1)))

Here, Predict determines the expected position of the object in the new frame, while Associate matches new detections to existing trajectories. If a detection temporarily disappears, the tracker can maintain the track for a limited time based on prediction. If the absence of detection lasts too long, the track is terminated.

Key object tracking parameters and metrics

Object tracking performance should not be assessed solely by a general statement that the system “tracks well.” In practice, you should measure identity preservation, trajectory stability, and the impact on anonymization effectiveness. Some metrics come directly from the MOTChallenge ecosystem and from the 2008 publication by Bernardin and Stiefelhagen on MOTA and MOTP.

| Parameter / metric | Meaning | Relevance for anonymization |
| --- | --- | --- |
| ID Switches | The number of incorrect identity changes for a tracked object | Affects the risk of transferring the mask to the wrong object |
| MOTA | An aggregate measure of tracking errors | Shows the overall stability of multi-object tracking |
| MOTP | A measure of localization precision in the classical benchmark definition | Affects whether the mask accurately covers the face or license plate |
| HOTA | A metric combining detection and association quality, published in 2020 | Better reflects the quality of linking an object across frames |
| Latency | Computational processing delay | Important for process performance, although it does not necessarily imply real-time operation |
| Track fragmentation | The number of times one trajectory is split into multiple short tracks | Increases the risk of temporary gaps in anonymization |
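
As a worked example of the aggregate metric, MOTA in the Bernardin and Stiefelhagen definition is one minus the ratio of all errors (missed objects, false positives, and identity switches) to the number of ground-truth objects, with every count summed over all frames. The short sketch below computes it; the counts are invented for illustration and are not measured results.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if ground_truth <= 0:
        raise ValueError("ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / ground_truth


# Hypothetical counts for one clip, not measured results
score = mota(false_negatives=40, false_positives=25, id_switches=5,
             ground_truth=1000)
print(round(score, 3))
```

Because the formula weights a missed object and a false positive equally, a good MOTA score alone does not show how often a face or plate was left unmasked; the false-negative count is worth reporting separately.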

In privacy protection use cases, a low level of false negatives, meaning missed objects, is particularly important. From a compliance perspective, it is sometimes better for the mask to cover a slightly larger area than to leave part of a face or license plate visible.
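
A common way to bias the mask toward over-coverage is to enlarge each box by a margin before blurring. The following sketch assumes `(x1, y1, x2, y2)` pixel boxes; the function name `pad_box` and the default margin of 10% per side are illustrative assumptions, not recommended values.

```python
def pad_box(box, frame_width, frame_height, margin=0.1):
    """Grow a box by `margin` of its size on each side, clamped to the frame."""
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    return (
        max(0, x1 - dx),
        max(0, y1 - dy),
        min(frame_width, x2 + dx),
        min(frame_height, y2 + dy),
    )


# A 100x100 px face box in a 1920x1080 frame grows by 10 px on every side
print(pad_box((100, 50, 200, 150), 1920, 1080))
# A box touching the frame corner is clamped instead of going negative
print(pad_box((0, 0, 40, 40), 1920, 1080))
```

The trade-off is visual: a larger margin hides more of the surrounding image, so the value is usually tuned per use case.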

Challenges and limitations of object tracking

Object tracking does not eliminate every problem. Its effectiveness depends on the quality of the input detection, frame rate, video compression, lighting conditions, and the degree of object occlusion. Partially turned faces, small license plates in the background, or strong compression artifacts reduce tracking stability.

The most common limitations include:

  • partial and full occlusion of the object by other people or vehicles,
  • sudden camera motion and motion blur,
  • too few pixels covering the face or license plate,
  • similar appearance of multiple objects within the same scene,
  • errors inherited from the detector that the tracker cannot correct on its own.

It is also important to define the scope of automation correctly. Gallio PRO automatically detects and blurs faces and license plates. It does not automatically detect logos, tattoos, name badges, documents, or images displayed on monitors. Such elements can be blurred manually in the editor. From the perspective of object tracking, this means that tracking applies only to those object classes the system actually detects automatically.

Normative references and practical importance for compliance

Object tracking is not a separate legal obligation explicitly stated in the GDPR, but it is a technique that supports the principles of integrity and confidentiality under Article 5(1)(f) and security of processing under Article 32 of Regulation (EU) 2016/679. If a controller anonymizes video material, the stability of that anonymization matters for the real effectiveness of the technical safeguard. Even gaps in masking that last only a few frames may undermine the practical protective effect.

In the case of faces, regulations concerning a person’s likeness under civil law and copyright law may also be relevant. In the case of license plates, the legal situation in Poland remains inconsistent, while in many other European countries data protection practice and interpretation may call for masking them. From a technical standpoint, object tracking improves the consistency of that masking throughout the entire video.