3.2 Data Processing
The dataset was divided into training and validation subsets according to an 80:20 ratio, resulting in 52 classes assigned to the training set and 12 classes reserved for validation. The validation samples were strictly excluded from the training phase to prevent the model from adapting its embeddings to unseen data. A detailed description of the dataset partitioning is provided in Table
4. All parameters governing the data split, including the fixed random seed used to ensure reproducibility, are specified in the accompanying code
1.
During preprocessing, all video samples were resized to a spatial resolution of $224\times 224$ pixels and normalized to the range $[0,1]$. These steps ensured compatibility with the downstream model requirements, while retaining all three RGB channels throughout training.
Since the video sequences had different frame counts, shorter sequences were padded with zeros to match the length of the longest sequence in each batch. This zero-padding ensured uniform input sizes within every batch.
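The per-batch padding step can be illustrated with a minimal sketch. It is not the authors' implementation: each frame is represented here by a short flat list of floats as a hypothetical stand-in for a normalized $224\times 224$ RGB frame, and the helper name `pad_batch` is an assumption.

```python
from typing import List

Frame = List[float]  # hypothetical stand-in for one flattened, normalized frame


def pad_batch(videos: List[List[Frame]]) -> List[List[Frame]]:
    """Pad every video with all-zero frames up to the longest video in the batch."""
    max_len = max(len(video) for video in videos)
    frame_size = len(videos[0][0])
    padded = []
    for video in videos:
        missing = max_len - len(video)
        # Original frames are kept as-is; zero frames are appended at the end.
        padded.append(video + [[0.0] * frame_size for _ in range(missing)])
    return padded


batch = [
    [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]],  # video with 3 frames
    [[0.5, 0.5]],                          # video with 1 frame
]
padded = pad_batch(batch)
print([len(video) for video in padded])
```

Because the target length is taken from the current batch, two batches may end up with different sequence lengths; only samples within one batch are guaranteed to match.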
3.3 Training Details
Training was conducted over 64 epochs, with 100 episodes sampled per epoch. Episodes were generated using a custom sampling strategy (see source code
1) designed to ensure a balanced representation of classes within each episode. The exact episodic configuration depended on the selected model architecture and its computational complexity. For certain architectures, increasing the number of support or query samples significantly enlarged the computational graph and gradient memory footprint, particularly in models with skip connections. As a result, larger episodic configurations exceeded the available GPU memory during backpropagation. Therefore, for memory-intensive models, the (N,K,Q) setting was reduced to ensure stable training (see Table
6).
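A balanced $(N,K,Q)$ episode of the kind described above could be drawn as in the following sketch. The function name `sample_episode`, the dictionary layout, and the toy video identifiers are assumptions made for illustration; the authors' custom sampler may differ.

```python
import random
from typing import Dict, List, Tuple


def sample_episode(
    samples_by_class: Dict[str, List[str]],
    n_way: int,
    k_shot: int,
    q_query: int,
    rng: random.Random,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Draw N classes, then K support and Q query samples per class, without overlap."""
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = {}, {}
    for cls in classes:
        picked = rng.sample(samples_by_class[cls], k_shot + q_query)
        support[cls] = picked[:k_shot]
        query[cls] = picked[k_shot:]
    return support, query


rng = random.Random(123)  # local generator; 123 is the seed value reported in the text
# Toy data: 52 training classes with 10 hypothetical video identifiers each.
data = {f"class_{i:02d}": [f"c{i:02d}_v{j}" for j in range(10)] for i in range(52)}
support, query = sample_episode(data, n_way=5, k_shot=3, q_query=2, rng=rng)
print({cls: (len(support[cls]), len(query[cls])) for cls in support})
```

Every class in the episode contributes exactly $K$ support and $Q$ query samples, which is what keeps the class representation balanced.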
For the SlowFast (Feichtenhofer
et al.,
2019) architecture, a 5-way setting was used, meaning that each episode included 5 classes, with 3 support samples per class for prototype computation and 2 query samples per class for distance-based evaluation. In the case of the R(2+1)D (Tran
et al.,
2018) model, a 5-way configuration was likewise adopted; however, due to its greater computational complexity, only 2 support samples and 1 query sample per class were feasible. To provide a fair and informative comparison, the SlowFast model was additionally evaluated under a reduced 3-way 2-shot 1-query configuration. This enables direct comparison with architectures that were restricted to lower episodic settings due to hardware limitations. The remaining architectures, including S3D (Xie
et al.,
2018), I3D (Carreira and Zisserman,
2018), and R(3+2+1)D (Zhou
et al.,
2021b), were trained under a 3-way 2-shot 1-query episodic configuration, corresponding to the highest configuration achievable without exceeding GPU memory limits (see Table
6).
Within each episode, both the classes and their corresponding support and query samples were selected at random. To guarantee reproducibility, the seed value was fixed at 123 for both the PyTorch and Python random libraries, and applied consistently throughout training, validation, and testing. The training objective relied on the Euclidean distance metric, following the algorithm and loss formulation described in Snell
et al. (
2017), with necessary adjustments to accommodate video inputs.
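For orientation, the episode-level objective can be sketched as follows. The sketch assumes the standard formulation of Snell et al. (2017), a softmax over negative squared Euclidean distances to the prototypes; the adjustments for video inputs mentioned above are not reproduced, and plain Python lists stand in for embedding tensors.

```python
import math


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypical_loss(query_embeddings, query_labels, prototypes):
    """Mean negative log-probability of the true class over all query samples."""
    total = 0.0
    for z, y in zip(query_embeddings, query_labels):
        # Logit of class k is the negative squared distance to its prototype.
        logits = {k: -squared_euclidean(z, c) for k, c in prototypes.items()}
        top = max(logits.values())
        # Log-sum-exp with the maximum subtracted for numerical stability.
        log_norm = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += log_norm - logits[y]
    return total / len(query_embeddings)


prototypes = {"a": [0.0, 0.0], "b": [4.0, 0.0]}
loss_correct = prototypical_loss([[0.0, 0.0]], ["a"], prototypes)
loss_wrong = prototypical_loss([[0.0, 0.0]], ["b"], prototypes)
print(loss_correct, loss_wrong)
```

A query that coincides with its own prototype yields a loss close to zero, whereas the same embedding labelled with the distant class is penalized heavily.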
3.4 Metrics
Evaluating models trained under the few-shot learning paradigm requires metrics that both compare models and assess how well they embed sign language samples. Classification quality can be measured by checking whether unseen samples lie closest to the centroid of their own class. The quality of the centroids should also be evaluated: they must be well separated from each other to keep samples of different signs apart, while embeddings of samples within the same class remain grouped. Accuracy alone may therefore not be sufficient for a complete assessment. Even if samples are close to their centroid, centroids may lie too close to one another in the embedding space, leading to potential misclassification of future samples due to variations in setting, environment, signer, or angle. For this reason, beyond accuracy, we use the following additional metrics: Prototype-Centroid Distance Ratio, Intra-/Inter-Class Distance Ratio, and Prototype Rank Consistency.
To thoroughly validate the model’s performance, embeddings were generated for each sample within the test dataset. These embeddings were subsequently organized by their corresponding classes, and a prototype, representing the centroid of all embeddings within each class, was computed for each group. Prototypes were computed as the mean embedding vector for each class based on training samples. For each test embedding, the Euclidean distance to each class prototype was then calculated, and the sample was assigned to the class associated with the nearest prototype.
Prototypical Calculation
The prototype for each class $k$, denoted as ${\mathbf{c}_{k}}$, is computed as the mean of the embedding vectors ${f_{\phi }}({\mathbf{x}_{i}})$ over all samples $({\mathbf{x}_{i}},{y_{i}})$ in the support set ${S_{k}}$, which contains only samples of class $k$:
$$\mathbf{c}_{k}=\frac{1}{|S_{k}|}\sum_{(\mathbf{x}_{i},y_{i})\in S_{k}}f_{\phi}(\mathbf{x}_{i})$$
For a test embedding ${f_{\phi }}(\mathbf{x})$, classification is performed by finding the nearest prototype under the Euclidean distance function $d$:
$$\hat{y}=\arg\min_{k}\,d\big(f_{\phi}(\mathbf{x}),\mathbf{c}_{k}\big)$$
Accuracy Calculation
The overall accuracy ($Acc$) for the evaluation task is then computed using the indicator function $\mathbb{I}[\cdot ]$ over all samples in the test dataset $Q$:
$$Acc=\frac{1}{|Q|}\sum_{(\mathbf{x}_{i},y_{i})\in Q}\mathbb{I}[\hat{y}_{i}=y_{i}]$$
where $|Q|$ represents the total number of test samples and $\hat{y}_{i}$ is the label predicted for $\mathbf{x}_{i}$.
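The three steps described in this subsection (class-mean prototypes, nearest-prototype assignment, and accuracy) can be sketched in a few lines. The function names and the toy two-dimensional embeddings are assumptions for illustration; real embeddings would come from the backbone ${f_{\phi }}$.

```python
import math


def compute_prototypes(embeddings, labels):
    """Prototype of class k = mean of all embeddings labelled k."""
    by_class = {}
    for z, y in zip(embeddings, labels):
        by_class.setdefault(y, []).append(z)
    return {k: [sum(col) / len(v) for col in zip(*v)] for k, v in by_class.items()}


def predict(z, prototypes):
    """Label of the prototype nearest to z under the Euclidean distance."""
    return min(prototypes, key=lambda k: math.dist(z, prototypes[k]))


def accuracy(embeddings, labels, prototypes):
    hits = sum(predict(z, prototypes) == y for z, y in zip(embeddings, labels))
    return hits / len(labels)


support_emb = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
support_lab = ["a", "a", "b", "b"]
prototypes = compute_prototypes(support_emb, support_lab)

test_emb = [[1.0, 1.0], [9.0, 0.0], [7.0, 0.0]]
test_lab = ["a", "b", "a"]  # the third sample lies closer to the prototype of "b"
acc = accuracy(test_emb, test_lab, prototypes)
print(prototypes, acc)
```

In this toy case two of the three test samples are assigned correctly, so the accuracy is 2/3.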
Prototype-Centroid Distance Ratio (PCDR)
To evaluate the discriminative quality of the learned embedding space, we employ the Prototype-Centroid Distance Ratio (PCDR). This metric measures how well a sample is clustered around its corresponding class prototype relative to its separation from competing class prototypes.
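For a test sample $\mathbf{x}$ belonging to class $y$, the PCDR is defined as the ratio between the squared Euclidean distance to its assigned class prototype ${\mathbf{c}_{y}}$ and the minimum squared Euclidean distance to any other class prototype ${\mathbf{c}_{k}}$, $k\ne y$:
$$\mathrm{PCDR}(\mathbf{x})=\frac{\| f_{\phi}(\mathbf{x})-\mathbf{c}_{y}\|^{2}}{\min_{k\ne y}\| f_{\phi}(\mathbf{x})-\mathbf{c}_{k}\|^{2}}$$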
A lower average PCDR across the test set indicates that samples are more tightly clustered around their correct prototypes while remaining well separated from prototypes of other classes, reflecting higher feature discriminability.
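A small numerical sketch may help to read PCDR values. It follows the verbal definition above (squared distance to the own prototype divided by the smallest squared distance to any other prototype); the helper names and toy values are assumptions.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pcdr(z, y, prototypes):
    """Squared distance to the own prototype over the nearest competing one."""
    own = sq_dist(z, prototypes[y])
    nearest_other = min(sq_dist(z, c) for k, c in prototypes.items() if k != y)
    return own / nearest_other


prototypes = {"a": [0.0, 0.0], "b": [4.0, 0.0]}
well_placed = pcdr([1.0, 0.0], "a", prototypes)  # 1 / 9
misplaced = pcdr([3.0, 0.0], "a", prototypes)    # 9 / 1
mean_pcdr = (well_placed + misplaced) / 2
print(well_placed, misplaced, mean_pcdr)
```

Values below 1 mean the sample is closer to its own prototype than to any competitor; values above 1 correspond to samples that nearest-prototype classification would misassign.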
Intra-/Inter-Class Distance Ratio (ICDR)
To further assess the structure of the learned embedding space, we employ the Intra-/Inter-Class Distance Ratio (ICDR), which evaluates the compactness of samples within the same class relative to the separation between different class prototypes.
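The intra-class distance for a class $k$ is defined as the average squared Euclidean distance between the embeddings of test samples belonging to that class and the corresponding class prototype ${\mathbf{c}_{k}}$:
$$D_{\mathrm{intra}}(k)=\frac{1}{|Q_{k}|}\sum_{\mathbf{x}\in Q_{k}}\| f_{\phi}(\mathbf{x})-\mathbf{c}_{k}\|^{2}$$
where ${Q_{k}}$ denotes the set of test samples belonging to class $k$.
The inter-class distance is defined as the average squared Euclidean distance between the prototype ${\mathbf{c}_{k}}$ and all other class prototypes:
$$D_{\mathrm{inter}}(k)=\frac{1}{C-1}\sum_{j\ne k}\|\mathbf{c}_{k}-\mathbf{c}_{j}\|^{2}$$
where $C$ is the total number of classes.
The ICDR for class $k$ is then computed as the ratio between the intra-class and inter-class distances:
$$\mathrm{ICDR}_{k}=\frac{D_{\mathrm{intra}}(k)}{D_{\mathrm{inter}}(k)}$$
Finally, the overall ICDR score is obtained by averaging over all classes:
$$\mathrm{ICDR}=\frac{1}{C}\sum_{k=1}^{C}\mathrm{ICDR}_{k}$$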
A lower ICDR value indicates tighter clustering of samples within each class and greater separation between different class prototypes, reflecting a more discriminative embedding space.
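The ICDR computation can be sketched as below, following the verbal definitions: per class, the mean squared distance of its samples to the prototype is divided by the mean squared distance from that prototype to all other prototypes, and the per-class ratios are averaged. Function names and toy values are assumptions.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def icdr(embeddings, labels, prototypes):
    """Average over classes of (intra-class distance / inter-class distance)."""
    ratios = []
    for k, c_k in prototypes.items():
        members = [z for z, y in zip(embeddings, labels) if y == k]
        intra = sum(sq_dist(z, c_k) for z in members) / len(members)
        others = [c for j, c in prototypes.items() if j != k]
        inter = sum(sq_dist(c_k, c) for c in others) / len(others)
        ratios.append(intra / inter)
    return sum(ratios) / len(ratios)


prototypes = {"a": [0.0, 0.0], "b": [4.0, 0.0]}
embeddings = [[1.0, 0.0], [-1.0, 0.0], [4.0, 2.0], [4.0, -2.0]]
labels = ["a", "a", "b", "b"]
score = icdr(embeddings, labels, prototypes)  # (1/16 + 4/16) / 2
print(score)
```

Class "a" is four times more compact than class "b" in this example, yet both share the same inter-class distance, so the overall score is pulled up by the looser class.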
Prototype Rank Consistency (PRC)
To assess the stability of class relationships in the learned embedding space, we introduce the Prototype Rank Consistency (PRC) metric. This metric evaluates whether the relative ranking of class prototypes with respect to a given sample is consistent with the ground-truth class assignment.
For a test sample x, the distances to all class prototypes ${\mathbf{c}_{k}}$ are computed and sorted in ascending order, producing an ordered list of class indices based on proximity in the embedding space. Let $r(\mathbf{x})$ denote the rank position of the correct class prototype ${\mathbf{c}_{y}}$ within this ordered list, where $r(\mathbf{x})=1$ indicates that the correct prototype is the nearest.
The PRC score for a test sample is defined as:
$$\mathrm{PRC}(\mathbf{x})=1-\frac{r(\mathbf{x})-1}{C-1}$$
where $C$ denotes the total number of classes. This formulation yields a normalized score in the range $[0,1]$, where a value of 1 indicates that the correct prototype is ranked first (i.e. it is the closest prototype to the sample) and a value of 0 indicates that it is ranked last.
The overall PRC metric is then computed as the average PRC score over all test samples in the query set $Q$:
$$\mathrm{PRC}=\frac{1}{|Q|}\sum_{\mathbf{x}\in Q}\mathrm{PRC}(\mathbf{x})$$
A higher PRC value indicates greater consistency in prototype ranking, reflecting a more stable and reliable embedding space for prototype-based classification.
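The rank-based score can be sketched as follows. The normalization $1-(r(\mathbf{x})-1)/(C-1)$ used here is an assumption chosen to match the stated range $[0,1]$ with the value 1 for a first-ranked prototype; function names and the one-dimensional toy embeddings are likewise illustrative.

```python
import math


def prc_score(z, y, prototypes):
    """1 when the correct prototype is nearest, 0 when it is the farthest."""
    ranked = sorted(prototypes, key=lambda k: math.dist(z, prototypes[k]))
    rank = ranked.index(y) + 1  # r(x), starting at 1
    n_classes = len(prototypes)
    return 1.0 - (rank - 1) / (n_classes - 1)


prototypes = {"a": [0.0], "b": [5.0], "c": [10.0]}
first = prc_score([1.0], "a", prototypes)   # "a" is nearest
second = prc_score([4.0], "a", prototypes)  # order: b, a, c
last = prc_score([6.0], "a", prototypes)    # order: b, c, a
overall = (first + second + last) / 3
print(first, second, last, overall)
```

Unlike accuracy, the score distinguishes a near miss (correct prototype ranked second) from a complete failure (correct prototype ranked last).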
3.5 Experimental Setup and Results
All experiments were conducted on the test split described in Section
3.2. As motivated in the Introduction, the SlowFast network (Feichtenhofer
et al.,
2019) was assumed to provide the best performance on the considered datasets; therefore, the initial detailed evaluation was performed using this model. Following the episodic evaluation protocol defined in Section
2.5, experiments were conducted using 1000 episodes per configuration. Each episode used a fixed number of
$Q=15$ query samples per class, independently sampled for each episode. Performance was measured as the average episode-level accuracy over 1000 episodes. The evaluation was performed for
$N\in \{5,10\}$ and
$K\in \{1,5,10\}$, and the resulting average accuracies are summarized in Table
5. It should be emphasized that the accuracies reported in Table
5 (maximum 92.0%, obtained for the 5-way 10-shot configuration) correspond to episodic few-shot evaluation, where performance is averaged over randomly sampled tasks with limited support and query sets. These results are therefore not directly comparable to the higher accuracy reported later (94.33%), which is obtained under a different evaluation protocol using the full test set and complete class prototypes. The episodic evaluation reflects performance under stochastic, low-data classification scenarios, where each task is constructed from a limited number of support and query samples. In contrast, the full test-set evaluation represents a deterministic setting in which all available samples contribute to prototype estimation, resulting in a more stable and typically higher accuracy.
Table 5
Average classification accuracy for different few-shot configurations obtained using the SlowFast model (${q_{\mathrm{query}}}=15$, ${n_{\mathrm{episodes}}}=1000$; values in %).
| n_way | Accuracy (1-shot) | Accuracy (5-shot) | Accuracy (10-shot) |
|---|---|---|---|
| 5-way | 80.0 | 90.7 | 92.0 |
| 10-way | 66.0 | 81.9 | 84.9 |
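The episodic protocol behind Table 5 (random $N$-way $K$-shot tasks with 15 query samples per class, accuracy averaged over episodes) can be sketched as follows. This is an illustration on synthetic, well-separated embeddings, not the evaluation code used in the study; the function names, the toy data, and the reduced episode count in the usage loop are assumptions.

```python
import math
import random


def nearest(z, prototypes):
    return min(prototypes, key=lambda k: math.dist(z, prototypes[k]))


def episode_accuracy(emb_by_class, n_way, k_shot, q_query, rng):
    """Sample one task, build prototypes from the support set, score the queries."""
    classes = rng.sample(sorted(emb_by_class), n_way)
    prototypes, queries = {}, []
    for cls in classes:
        picked = rng.sample(emb_by_class[cls], k_shot + q_query)
        support, query = picked[:k_shot], picked[k_shot:]
        prototypes[cls] = [sum(col) / k_shot for col in zip(*support)]
        queries += [(z, cls) for z in query]
    hits = sum(nearest(z, prototypes) == y for z, y in queries)
    return hits / len(queries)


def episodic_evaluation(emb_by_class, n_way, k_shot, q_query=15, n_episodes=1000, seed=123):
    rng = random.Random(seed)
    scores = [episode_accuracy(emb_by_class, n_way, k_shot, q_query, rng)
              for _ in range(n_episodes)]
    return sum(scores) / n_episodes


# Synthetic 2-D embeddings: 12 classes, 50 samples each, class centres 10 units apart.
gen = random.Random(0)
toy = {
    f"{c:03d}": [[10.0 * c + gen.uniform(-1, 1), gen.uniform(-1, 1)] for _ in range(50)]
    for c in range(12)
}
for n_way in (5, 10):
    for k_shot in (1, 5, 10):
        print(n_way, k_shot, episodic_evaluation(toy, n_way, k_shot, n_episodes=20))
```

Because the toy classes do not overlap, every configuration reaches perfect accuracy here; with real embeddings the spread across $N$ and $K$ would resemble the pattern in Table 5.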
In the second experiment, a full test-set prototype-based evaluation was performed. In contrast to the episodic setting, where only
$Q=15$ query samples per class are evaluated in each randomly sampled episode, this experiment uses the entire test set of 600 samples in a single deterministic evaluation. Following the full test-set evaluation protocol (Section
2.5), a single prototype per class was computed using all 50 samples of that class, and all 600 test instances were classified against these prototypes. Under this setting, 34 samples were misclassified, resulting in an overall accuracy of 94.33%. The detailed classification outcomes are visualized in the confusion matrix shown in Fig.
4, which highlights both correct predictions (diagonal entries) and inter-class confusions (off-diagonal entries).

Fig. 4
Confusion matrix of test set predictions for the SlowFast model under the 5-way, 3-shot, 2-query (5-3-2) configuration. The matrix summarizes classification results over 600 test samples, achieving an overall accuracy of 94.33%, with 34 misclassified instances. Diagonal elements indicate correct predictions, while off-diagonal entries highlight class-level confusions.
The same evaluation protocol was applied to additional spatiotemporal architectures, including S3D (Xie
et al.,
2018), I3D (Carreira and Zisserman,
2018), R(2+1)D (Tran
et al.,
2018), and R(3+2+1)D (Zhou
et al.,
2021b). For each model, class prototypes were computed using all available samples per class, and classification was performed by nearest-prototype matching across all 600 test instances. The resulting accuracies, along with the prototype-based metrics defined in Section 3.4, namely the Prototype-Centroid Distance Ratio (PCDR), the Intra-/Inter-Class Distance Ratio (ICDR), and Prototype Rank Consistency (PRC), are reported in Table
6, providing a consistent comparison across architectures under the same evaluation protocol.
Table 6
Comparison of few-shot models across accuracy and prototype-based metrics. The column (N,K,Q) specifies the N-way classification setting with K-support samples per class and Q-query samples.
| Model | $(N,K,Q)$ | Accuracy (% ↑) | PCDR (↓) | ICDR (↓) | PRC (↑) |
|---|---|---|---|---|---|
| SlowFast (Feichtenhofer et al., 2019) | 5-3-2 | 94.33 | 0.4448 | 0.1433 | 0.9938 |
| SlowFast (Feichtenhofer et al., 2019) | 3-2-1 | 87.33 | 0.5881 | 0.1424 | 0.9853 |
| S3D (Xie et al., 2018) | 3-2-1 | 82.00 | 0.6051 | 0.1311 | 0.9798 |
| I3D (Carreira and Zisserman, 2018) | 3-2-1 | 82.83 | 0.6549 | 0.0841 | 0.9830 |
| R(2+1)D (Tran et al., 2018) | 5-2-1 | 90.33 | 0.5195 | 0.1503 | 0.9897 |
| R(3+2+1)D (Zhou et al., 2021b) | 3-2-1 | 82.67 | 0.7335 | 0.1281 | 0.9809 |
Finally, an additional evaluation was conducted to assess the stability of the prototype-based representation using a split-based protocol. Following the split-based evaluation protocol (Section
2.5), performance was computed as the mean and standard deviation across all five splits, as summarized in Table
7. This evaluation provides additional insight into the robustness and generalization of the learned embeddings under varying data partitions.
Table 7
Five-split evaluation of prototype-based metrics. For each model, one split is used as the query set while the remaining splits are used to compute class prototypes. The final row for each model reports the mean and standard deviation across splits.
| Model $(N,K,Q)$ | Split | Accuracy (↑) | PCDR (↓) | ICDR (↓) | PRC (↑) |
|---|---|---|---|---|---|
| SlowFast (5-3-2) | 1 | 0.9250 | 0.4822 | 0.1634 | 0.9932 |
| | 2 | 0.9750 | 0.4392 | 0.1604 | 0.9977 |
| | 3 | 0.9583 | 0.4689 | 0.1677 | 0.9947 |
| | 4 | 0.9083 | 0.4730 | 0.1587 | 0.9902 |
| | 5 | 0.9500 | 0.4680 | 0.1611 | 0.9955 |
| | Mean ± Std | 0.9433 ± 0.0237 | 0.4663 ± 0.0148 | 0.1623 ± 0.0032 | 0.9943 ± 0.0026 |
| SlowFast (3-2-1) | 1 | 0.8667 | 0.5759 | 0.1454 | 0.9856 |
| | 2 | 0.8333 | 0.6689 | 0.1507 | 0.9833 |
| | 3 | 0.8750 | 0.5685 | 0.1448 | 0.9856 |
| | 4 | 0.8250 | 0.6278 | 0.1531 | 0.9788 |
| | 5 | 0.9000 | 0.5782 | 0.1365 | 0.9894 |
| | Mean ± Std | 0.8600 ± 0.0296 | 0.6039 ± 0.0408 | 0.1461 ± 0.0060 | 0.9845 ± 0.0035 |
| S3D (3-2-1) | 1 | 0.8250 | 0.6633 | 0.1350 | 0.9833 |
| | 2 | 0.7917 | 0.6159 | 0.1383 | 0.9727 |
| | 3 | 0.8500 | 0.5738 | 0.1286 | 0.9856 |
| | 4 | 0.8250 | 0.5944 | 0.1360 | 0.9803 |
| | 5 | 0.7750 | 0.6652 | 0.1330 | 0.9773 |
| | Mean ± Std | 0.8133 ± 0.0265 | 0.6225 ± 0.0359 | 0.1342 ± 0.0037 | 0.9798 ± 0.0050 |
| I3D (3-2-1) | 1 | 0.8083 | 0.7270 | 0.0864 | 0.9803 |
| | 2 | 0.8250 | 0.6624 | 0.0920 | 0.9818 |
| | 3 | 0.8333 | 0.7700 | 0.0886 | 0.9848 |
| | 4 | 0.8250 | 0.5898 | 0.0825 | 0.9841 |
| | 5 | 0.8083 | 0.6354 | 0.0820 | 0.9811 |
| | Mean ± Std | 0.82 ± 0.01 | 0.6769 ± 0.0643 | 0.0863 ± 0.0038 | 0.9824 ± 0.0017 |
| R(2+1)D (5-2-1) | 1 | 0.8917 | 0.5194 | 0.1508 | 0.9894 |
| | 2 | 0.9000 | 0.5188 | 0.1581 | 0.9902 |
| | 3 | 0.9250 | 0.5213 | 0.1520 | 0.9924 |
| | 4 | 0.8167 | 0.6271 | 0.1623 | 0.9818 |
| | 5 | 0.9167 | 0.4812 | 0.1469 | 0.9909 |
| | Mean ± Std | 0.8900 ± 0.0389 | 0.5336 ± 0.0535 | 0.1540 ± 0.0057 | 0.9889 ± 0.0041 |
| R(3+2+1)D (3-2-1) | 1 | 0.8500 | 0.6963 | 0.1343 | 0.9848 |
| | 2 | 0.8500 | 0.6509 | 0.1456 | 0.9833 |
| | 3 | 0.8833 | 0.6335 | 0.1424 | 0.9879 |
| | 4 | 0.8000 | 0.6764 | 0.1468 | 0.9795 |
| | 5 | 0.8083 | 0.6813 | 0.1465 | 0.9803 |
| | Mean ± Std | 0.8383 ± 0.0305 | 0.6677 ± 0.0225 | 0.1431 ± 0.0047 | 0.9832 ± 0.0031 |
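The split-based protocol summarized in Table 7 can be sketched as follows for the accuracy column. Several details are assumptions rather than facts from the study: samples are assigned to splits by index stride, the spread is reported as a population standard deviation, and the toy one-dimensional embeddings only serve to make the example runnable.

```python
import math
import statistics


def class_means(pairs):
    by_class = {}
    for z, y in pairs:
        by_class.setdefault(y, []).append(z)
    return {k: [sum(col) / len(v) for col in zip(*v)] for k, v in by_class.items()}


def nearest(z, prototypes):
    return min(prototypes, key=lambda k: math.dist(z, prototypes[k]))


def split_evaluation(embeddings, labels, n_splits=5):
    """Each split serves once as the query set; the rest define the prototypes."""
    n = len(labels)
    folds = [list(range(i, n, n_splits)) for i in range(n_splits)]
    scores = []
    for query_idx in folds:
        held_out = set(query_idx)
        reference = [(embeddings[i], labels[i]) for i in range(n) if i not in held_out]
        prototypes = class_means(reference)
        hits = sum(nearest(embeddings[i], prototypes) == labels[i] for i in query_idx)
        scores.append(hits / len(query_idx))
    return scores, statistics.mean(scores), statistics.pstdev(scores)


# Toy data: two well-separated classes, 10 samples each, interleaved by index.
labels = ["a" if i % 2 == 0 else "b" for i in range(20)]
embeddings = [[0.1 * i] if y == "a" else [10.0 + 0.1 * i] for i, y in enumerate(labels)]
scores, mean_acc, std_acc = split_evaluation(embeddings, labels)
print(scores, mean_acc, std_acc)
```

The same loop could report PCDR, ICDR, and PRC per split by swapping the accuracy computation for the corresponding metric function.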
For completeness, Table
8 presents a comparison of the evaluated architectures in terms of the number of trainable parameters used in this study, reflecting the computational complexity of the adapted backbone models within the proposed few-shot learning framework.
Table 8
Comparison of evaluated spatiotemporal architectures in terms of the number of trainable parameters used in this study. The values reflect the parameter counts of the adapted backbone models employed within the prototypical few-shot learning framework.
| Model | Trainable parameters |
|---|---|
| SlowFast (Feichtenhofer et al., 2019) | 411 506 |
| I3D (Carreira and Zisserman, 2018) | 12 418 464 |
| S3D (Xie et al., 2018) | 12 418 464 |
| R(2+1)D (Tran et al., 2018) | 4 684 167 |
| R(3+2+1)D (Zhou et al., 2021b) | 3 793 300 |
3.7 Discussion and Study Limitation
The proposed SlowFast meta-based network achieved an accuracy of 94.33% (error rate 5.67%) on the test split, effectively handling unseen data. Predictions were made by calculating the distance between a sample’s embedding and the class prototypes, each formed by averaging the embeddings within its class. This result indicates that the network fulfills the prototypical learning objective and supports its reliability as a feature extractor for isolated SLR.
Class separability depends strongly on intra-class consistency. Signs that differ in starting position, motion speed, or hand orientation produce more dispersed embeddings and thus reduce prototype compactness; this variation is intrinsic to LSA64, where different signers sometimes modify the same sign’s execution. For instance, signs “027”, “056”, “063” and “064” show variation in handshape, signing height, finishing position, and temporal execution speed, occasionally accompanied by non-manual cues (e.g. facial expressions) that may mislead the model. UMAP visualization (McInnes
et al.,
2020) of test embeddings confirms this: several classes form distinct clusters, yet visually similar gestures such as “050”, “054”, “056” and “063” overlap (Fig.
5). These overlaps are echoed in the confusion matrix (e.g. 14 of 50 samples from “056” misclassified as “063”, Fig.
4), indicating that visual/temporal inconsistencies (including subtle non-manual cues) are primary confusion sources. A similar behaviour is observed for class “054”, where 8 out of 50 samples are misclassified as “050”. In addition, a single sample from class “064” is incorrectly predicted as “027”. These misclassifications are reflected in the UMAP projection of the test subset (Fig.
5), where the clusters corresponding to classes “054” and “050”, as well as “056” and “063”, are located in close proximity and partially overlap. A comparable overlap is visible between classes “027” and “064”; however, in this case only one misclassification occurs. This indicates that, for some class pairs, embeddings remain close to their class prototype, while for others (e.g. “054” vs. “050”), increased intra-class spread causes some samples to lie nearer to neighbouring clusters, resulting in higher confusion. While the 2-D UMAP projection validates the qualitative cluster structure, it can distort some 128-dimensional relationships; stronger high-dimensional evaluation metrics would complement these visualizations. Finally, the few-shot episodic sampling, where only a subset of classes appears per episode, can limit prototype refinement because not all samples contribute to prototype updates.
Beyond the detailed analysis of the SlowFast model, the comparative evaluation presented in Table 6 offers additional insight into the behaviour of different spatiotemporal architectures under the prototypical few-shot learning framework. Among all evaluated models, the SlowFast architecture achieves the highest classification accuracy (94.33%) together with the most favourable prototype-based metrics, including the lowest Prototype-Centroid Distance Ratio (PCDR) and the highest Prototype Rank Consistency (PRC). These results suggest that the embeddings produced by the SlowFast model form both compact intra-class clusters and well-separated inter-class boundaries, which is especially beneficial for prototype-based classification.
However, an exception is observed in the Intra-/Inter-Class Distance Ratio (ICDR), where the I3D model achieves the lowest value. This suggests that, although I3D produces relatively compact clusters within individual classes, the separation between different class prototypes is less pronounced than for the SlowFast architecture. Consequently, the overall classification accuracy remains lower despite favourable intra-class compactness. This shows that both intra-class consistency and inter-class separation are important when assessing embedding quality in prototype-based learning.
The comparison also reveals the influence of the episodic configuration $(N,K,Q)$ on model performance. In few-shot learning, the number of support samples per class plays a crucial role in prototype estimation, as prototypes are constructed by averaging the embeddings of the support examples. Experiments conducted with larger support sets generally lead to more stable prototype representations and improved classification performance. This trend can be observed when comparing the SlowFast model configuration (5–3–2) with settings involving fewer samples per class, where a reduction in accuracy is evident.
Despite operating under more constrained episodic settings in some experiments, the SlowFast model consistently performs as well as or better than architectures such as S3D, I3D, and R(3+2+1)D. This indicates that the SlowFast architecture is particularly effective at encoding discriminative spatiotemporal representations that stay reliable even when prototype estimation relies on limited data. One possible explanation for this behaviour is the design of SlowFast networks, which explicitly model temporal information at two different frame rates. The slow pathway captures semantic and structural motion patterns over longer temporal contexts, while the fast pathway focuses on rapidly changing motion cues. Such a design is particularly well suited to sign language recognition, where meaningful information is conveyed through both broader arm movements and subtle hand or finger details. The combination of these complementary temporal representations can help the SlowFast backbone generate embeddings that better reflect the spatiotemporal features of signing gestures.
Another important aspect highlighted by this study is the parameter efficiency of the evaluated architectures. As summarized in Table
8, the SlowFast model used in this research contains the smallest number of trainable parameters among the evaluated networks. Despite its lower parameter count, it consistently achieves the best performance across most evaluation metrics. This indicates that the architecture is not only effective in capturing relevant motion features but also computationally efficient, allowing more samples to be processed within the available GPU memory constraints. Consequently, the SlowFast model represents a favourable balance between representational capacity and computational cost in the context of few-shot sign language recognition.
To address the class overlaps discussed above, a margin-based prototypical loss could be adopted to explicitly enlarge inter-class distances and improve discrimination among visually similar signs. We used the standard prototypical loss, in which class anchors are computed as class means, for its robustness. However, experimenting with margin terms and alternative anchors (e.g. median anchors or hybrid anchor estimators) may increase prototype stability and reduce confusion caused by outliers.
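The remark about alternative anchor estimators can be illustrated with a toy comparison of a mean anchor and a coordinate-wise median anchor in the presence of one outlier. This is only an illustration of the idea, not a method evaluated in the study; the function names, the choice of a coordinate-wise median, and the data are assumptions.

```python
import math
import statistics


def mean_anchor(embeddings):
    return [statistics.fmean(col) for col in zip(*embeddings)]


def median_anchor(embeddings):
    # Coordinate-wise median: one possible robust alternative to the class mean.
    return [statistics.median(col) for col in zip(*embeddings)]


clean = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1]]
with_outlier = clean + [[10.0, 10.0]]  # one atypical execution of the sign

# How far does each anchor move when the outlier is added?
mean_shift = math.dist(mean_anchor(with_outlier), mean_anchor(clean))
median_shift = math.dist(median_anchor(with_outlier), median_anchor(clean))
print(mean_shift, median_shift)
```

In this example the mean anchor is displaced by several units while the median anchor barely moves, which is the kind of stability gain the paragraph above refers to.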
It is important to note that all evaluated architectures were adjusted to match the available computational resources. In particular, GPU memory limits required changes to input resolution, batch sizes, and episodic setups during training. As a result, some models might not have operated under their most optimized conditions. Although each network was trained for 64 epochs with 100 episodes per epoch, giving repeated exposure to different support–query combinations, the smaller episode sizes for certain models may have limited their ability to fully utilize their representational capacity.
Several further limitations must be recognized. First, the scope of the evaluation is constrained by the dataset setup. Although the LSA64 dataset includes 64 gesture classes, only 12 were used in the experiments (Table
4), which may limit the ability to assess the model’s generalizability to a broader vocabulary. Additionally, while the model performs well for most categories, it still struggles with overlapping or visually similar signs, indicating limited class separation in certain areas of the embedding space. Conducting broader experiments with multiple random seeds would better quantify variability, although this would require more computational resources than currently available. The use of fluorescent gloves during recording (Ronchetti
et al.,
2023) facilitated hand segmentation and improved recognition, but limits real-world applicability since users are unlikely to wear such aids. Relying only on the RGB input may also cause the model to pick up background or contextual cues that do not generalize.
Overall, the model demonstrates strong performance in few-shot sign language recognition, with effective feature extraction, although challenges persist in separating certain classes and ensuring robustness across broader datasets and more unconstrained visual conditions.