TSL Sign Spotting — Real-World News Videos

Abstract

Turkish Sign Language (TSL) is a low-resource language where creating large annotated datasets is prohibitively expensive and time-consuming. In continuous sign language video, sign boundaries are never explicitly marked — the camera captures an unbroken stream of movement, and every boundary must be inferred from visual evidence alone. This work addresses the problem of detecting and temporally localising signs in real-world Turkish news broadcast footage without relying on any manually annotated frame-level labels.

We developed a weakly supervised sign spotting pipeline for TSL. Individual video frames are passed through a DINOv2 Vision Transformer to extract rich spatial embeddings, which are then processed by a temporal sign encoder to capture how signs evolve over consecutive frames. Rather than annotated glosses, supervision is derived from automatically generated pseudo-glosses extracted from sentence-level transcripts. Turkish morphological analysis is applied to lemmatize inflected word forms, compressing the pseudo-gloss vocabulary from approximately 16,000 surface-level entries to 4,802 semantically meaningful lemmas.

Vocabulary compression nearly tripled the sign detection F1-score (0.142 → 0.406). Downstream translation quality improved substantially across all n-gram levels, with BLEU-4 rising from 9.60 to 12.60 and ROUGE from 23.48 to 26.88. Embedding similarity analysis using isolated sign videos from the Turkish Sign Language Dictionary further confirms that the learned representations capture meaningful visual-semantic correspondences. The results demonstrate that useful sign-level information can be extracted from sentence transcripts alone, without any frame-level annotation burden.

Data

Dataset

This study was conducted on the TSL-News dataset, a large-scale continuous Turkish Sign Language corpus collected from TRT Hearing-Impaired News Bulletin broadcasts between 2021 and 2023. The dataset contains 13,378 sentence-level video segments (11,507 training, 802 validation, and 1,069 test samples), comprising more than 21 hours of sentence-transcribed signing footage and approximately 3.6 million video frames recorded across three different signers.

Segments last approximately 6 seconds on average, and each transcript contains about 10 words. Unlike laboratory-recorded datasets, TSL-News captures naturally occurring signing behaviour under broadcast conditions: variable lighting, different interpreters, and spontaneous signing speed. Only sentence-level transcripts are provided: there are no temporal boundaries, no gloss-level annotations, and no sign-order labels.

The corpus contains more than 136,000 total words and 28,578 unique tokens. The agglutinative morphology of Turkish causes a single semantic concept to appear in dozens of inflected surface forms, producing severe vocabulary sparsity. Turkish morphological analysis collapses these variants to 4,802 lemma-level pseudo-glosses, making the weak supervision signal far more tractable for training.

Field	Value
Source	TRT Hearing-Impaired Bulletin
Total Sentences	13,378
Train / Val / Test	11,507 / 802 / 1,069
Total Duration	21+ hrs
Video Frames	~3.6M
Avg. Segment Duration	6 sec
Signers	3
Unique Tokens (raw)	28,578
Compressed Vocabulary	4,802
Temporal Labels	None

Methodology

Pipeline

Visual Feature Extraction

Individual frames are encoded by a DINOv2 Vision Transformer backbone pretrained via large-scale self-supervised learning. Each frame yields a high-dimensional feature embedding that captures rich spatial and textural information about the signer's hand shape, position, and body posture.

DINOv2 · ViT-B/14 · Self-supervised

Temporal Sign Encoding

Frame embeddings are processed by a temporal sign encoder that models how signs evolve across consecutive frames, producing temporally aware representations that carry both spatial and motion information. This is essential because sign meaning is defined by movement trajectories, not by individual still frames.

Temporal Modelling · Sequential Encoding

Pseudo-Gloss Supervision

Words are extracted automatically from sentence transcripts. Turkish morphological analysis lemmatizes inflected forms, collapsing ~16,000 surface entries to 4,802 lemmas. Each lemma is associated with a learnable prototype; the encoder learns to activate when the corresponding concept appears visually in the video.

NLP · Morphological Analysis · Prototype Learning
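
The pseudo-gloss step can be illustrated with a minimal sketch. It is not the project's implementation: the lemma table `TOY_LEMMAS` is a hypothetical stand-in for a real Turkish morphological analyzer, and the helper names `tokenize` and `pseudo_glosses` are assumptions made for illustration. Deduplicating lemmas in order of first appearance is also an assumption. The sketch only shows how mapping inflected forms to lemmas shrinks the vocabulary.

```python
import re

# Hypothetical lemma table standing in for a real Turkish morphological
# analyzer; each inflected surface form maps to its lemma.
TOY_LEMMAS = {
    "çocuklar": "çocuk",
    "okula": "okul",
    "okullara": "okul",
    "gitti": "git",
    "gidiyor": "git",
}


def tokenize(sentence: str) -> list[str]:
    """Lowercase and split into word tokens.

    Note: str.lower() does not apply Turkish casing rules for dotted and
    dotless I; a real pipeline would need locale-aware lowercasing.
    """
    return re.findall(r"\w+", sentence.lower())


def pseudo_glosses(sentence: str) -> list[str]:
    """Return the unique lemmas of a transcript, in order of first appearance."""
    seen: dict[str, None] = {}
    for token in tokenize(sentence):
        lemma = TOY_LEMMAS.get(token, token)  # unknown forms map to themselves
        seen.setdefault(lemma, None)
    return list(seen)


transcripts = ["Çocuklar okula gitti", "Çocuk okullara gidiyor"]
surface_vocab = {t for s in transcripts for t in tokenize(s)}
lemma_vocab = {g for s in transcripts for g in pseudo_glosses(s)}
print(len(surface_vocab), "surface forms ->", len(lemma_vocab), "lemmas")
```

In this toy corpus the two sentences share no surface form except through their lemmas, which is the sparsity problem that lemmatization is meant to reduce.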

Figure 1 — Framework Architecture

Fig. 1 — Overview of the proposed weakly supervised sign spotting framework. Input video frames are encoded by DINOv2, processed by the temporal sign encoder, and matched against pseudo-gloss prototypes to produce temporal probability maps.

Training Strategy

The framework was trained under a weakly supervised learning setting — no frame-level sign annotations, temporal boundaries, or sign-order gloss labels were available during training. Supervision is derived exclusively from sentence-level transcripts and the automatically generated pseudo-gloss vocabulary.

Training proceeded in two stages: pseudo-gloss training, where the model learns to detect whether a concept is present in a video sequence; and representation learning, where the temporal sign encoder aligns visual features with pseudo-gloss prototypes. Cosine similarity between projected sign features and prototype embeddings is converted to temporal probability distributions via temperature-scaled normalization. Peaks in these distributions indicate candidate sign locations — discovered without any explicit temporal supervision.
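
The similarity-to-probability step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the normalization is assumed to be a softmax over time, the temperature value `0.1` and the threshold `min_prob` are arbitrary, and the function names `temporal_distribution` and `local_peaks` are hypothetical. The frame features are toy 2-D vectors rather than real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def temporal_distribution(frame_features, prototype, temperature=0.1):
    """Softmax over time of cosine(frame, prototype) / temperature."""
    logits = [cosine(f, prototype) / temperature for f in frame_features]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def local_peaks(probs, min_prob):
    """Time steps strictly above both neighbours and at least min_prob."""
    peaks = []
    for t, p in enumerate(probs):
        left = probs[t - 1] if t > 0 else float("-inf")
        right = probs[t + 1] if t + 1 < len(probs) else float("-inf")
        if p >= min_prob and p > left and p > right:
            peaks.append(t)
    return peaks


# Toy projected features for six time steps and one pseudo-gloss prototype.
frames = [[0.0, 1.0], [0.2, 1.0], [1.0, 0.1], [1.0, 0.0], [0.3, 1.0], [0.0, 1.0]]
prototype = [1.0, 0.0]

probs = temporal_distribution(frames, prototype)
peaks = local_peaks(probs, min_prob=0.2)
print([round(p, 3) for p in probs], peaks)
```

A lower temperature sharpens the distribution around the best-matching time steps, while a higher one spreads probability mass more evenly across the sequence.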

Evaluation

Results

Metric	Before	After	Change
Sign Detection F1	0.142	0.406	×2.86
BLEU-4	9.60	12.60	+31.3%
ROUGE	23.48	26.88	+14.5%

Vocabulary compression: ~16,000 original entries (F1: 0.142) → 4,802 entries after morphological compression (F1: 0.406), a ×2.86 F1-score multiplier. Language-specific preprocessing is critical.

Metric	Without Embeddings	With Our Embeddings	Relative Δ
BLEU-1	23.41	29.26	+25.0%
BLEU-2	16.44	21.06	+28.1%
BLEU-3	12.27	15.94	+29.9%
BLEU-4	9.60	12.60	+31.3%
ROUGE	23.48	26.88	+14.5%

All improvements were achieved without introducing any additional manual annotations. The model relies entirely on weak supervision from sentence transcripts, so the gains reflect better learned representations rather than a stronger ground-truth signal. The largest relative gain appears in BLEU-4 (+31.3%), which rewards matches of up to four consecutive words; this suggests that the learned representations support phrase-level coherence, not merely word-level correspondence.
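
The relative changes can be recomputed from the raw scores with a few lines of arithmetic. The sketch below copies the score pairs from the table above; the helper name `relative_gain` is an assumption made for illustration.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# (without embeddings, with embeddings) pairs copied from the table above.
scores = {
    "BLEU-1": (23.41, 29.26),
    "BLEU-2": (16.44, 21.06),
    "BLEU-3": (12.27, 15.94),
    "BLEU-4": (9.60, 12.60),
    "ROUGE": (23.48, 26.88),
}

for name, (base, ours) in scores.items():
    print(f"{name}: {base:.2f} -> {ours:.2f} ({relative_gain(base, ours):+.2f}%)")

# Sign detection F1 before and after vocabulary compression.
f1_multiplier = 0.406 / 0.142
print(f"F1 multiplier: x{f1_multiplier:.2f}")
```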

Embedding Similarity

NCC-Based Localization

To verify temporal localization without frame-level ground truth, isolated sign videos from the Turkish Sign Language Dictionary (TİD) were matched against continuous video embeddings using a sliding-window Normalized Cross-Correlation (NCC) search over 865 samples spanning 276 unique glosses.

Metric	Value
Samples evaluated	865
Unique glosses	276
Mean NCC score	0.3789
Peak NCC score	0.8866
Random baseline IoU	0.0301
Top-1 IoU	0.1599
Top-3 IoU	0.2786
Top-5 IoU	0.3478
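
A sliding-window NCC search of this kind can be sketched in a few lines. This is an illustration under assumptions, not the evaluation code behind the table: NCC is taken to be the zero-mean normalized cross-correlation of flattened embedding windows, IoU is taken to be the overlap of frame intervals, and the names `ncc`, `sliding_ncc`, and `temporal_iou` are hypothetical. The reference interval used for the IoU is invented purely to exercise the function.

```python
import math


def _flatten(frames):
    """Concatenate per-frame embeddings into one vector."""
    return [x for frame in frames for x in frame]


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def sliding_ncc(query, video):
    """NCC of the query against every window of the same length in the video."""
    n = len(query)
    q = _flatten(query)
    return [ncc(q, _flatten(video[s:s + n])) for s in range(len(video) - n + 1)]


def temporal_iou(pred, truth):
    """IoU of two half-open frame intervals given as (start, end)."""
    inter = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union else 0.0


# Toy 2-D embeddings: an 8-frame continuous video and a 2-frame isolated sign.
video = [[0.1, 0.9], [0.2, 0.8], [0.1, 0.7], [0.9, 0.1],
         [0.8, 0.3], [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]]
query = [[0.9, 0.1], [0.8, 0.3]]

scores = sliding_ncc(query, video)
best_start = max(range(len(scores)), key=scores.__getitem__)
predicted = (best_start, best_start + len(query))
reference = (3, 6)  # hypothetical reference interval, for illustration only
print(predicted, round(temporal_iou(predicted, reference), 3))
```

A Top-k variant would keep the k highest-scoring windows and report the best IoU among them, which is one plausible reading of the Top-1, Top-3, and Top-5 rows above.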

Analysis

Visualizations

Because the TSL-News dataset provides no frame-level ground truth, qualitative analysis plays a central role in evaluation. Four complementary visualization tools were developed to inspect temporal sign activations, embedding quality, and localization behaviour — providing evidence that the model learns meaningful structure rather than memorizing transcript statistics.

Figure 2 — Sentence-Level Probability Heatmap

Fig. 2 — Sentence-level pseudo-gloss probability heatmap. Each row is a pseudo-gloss word; each column is a video time step. Brighter cells indicate higher predicted sign probability — revealing temporal structure of the full signing sequence.

The heatmap exhibits an approximately monotonic temporal structure: pseudo-glosses appearing earlier in the transcript tend to activate earlier temporal regions. Perfect alignment is not expected, because TSL sign order differs from spoken Turkish word order, but the overall pattern indicates that the model learned meaningful temporal correspondences without any explicit temporal supervision.

Figure 3 — Per-Word Probability Curves

Fig. 3 — Per-word confidence curves over the video timeline. Peaks mark predicted sign locations for each pseudo-gloss in the transcript.

Figure 4 — Embedding Similarity (NCC)

Fig. 4 — Normalized Cross-Correlation similarity between isolated TİD dictionary signs and continuous video embeddings. Peaks indicate candidate sign locations.

Figure 5 — Web Visualization Interface

Fig. 5 — Web-based visualization interface. Displays pseudo-gloss predictions as temporal overlays synchronized with video playback, with per-gloss probability curves and the full probability heatmap matrix.

The web interface serves dual purposes: as an analysis tool for researchers to inspect whether detected sign locations correspond to meaningful signing segments, and as a demonstration platform. Probability curves allow manual inspection of how model confidence changes over time for individual pseudo-glosses — making model behaviour interpretable without requiring frame-level ground truth.

Significance

Impact & Future Work

Contributions

Scalable weak supervision — demonstrates that meaningful sign-level representations can be learned from sentence transcripts alone, eliminating the need for costly frame-by-frame annotation by expert linguists.

Morphological preprocessing as a key component — shows that language-specific NLP preprocessing is not optional but essential when applying weakly supervised methods to morphologically rich languages. Vocabulary compression alone yielded a near-3× improvement in F1.

Multi-modal evaluation without ground truth — establishes a practical evaluation framework combining translation quality, sign detection F1, embedding similarity (NCC), and qualitative heatmaps when temporal annotations are unavailable.

Low-resource transferability — the pipeline makes no assumptions specific to TSL and can be adapted to other morphologically rich, low-resource sign languages that lack large annotated corpora.

Future Directions

TİD benchmark evaluation — a dedicated evaluation pipeline using isolated sign videos from the Turkish Sign Language Dictionary (TİD) is under development. Hundreds of sign instances with manually annotated temporal boundaries will enable direct measurement of localization accuracy.

Sign retrieval systems — the learned representations can index large sign language video archives, enabling keyword-based search without manual annotation of the target corpus.

Semi-automatic annotation tools — probability curves and heatmaps can pre-populate annotation timelines for expert review, dramatically reducing the time required to build ground-truth datasets.

Improved temporal localization — future work will explore dedicated temporal grounding modules and contrastive learning objectives to sharpen sign boundary predictions beyond the current prototype-matching approach.
