Model Architecture Design


The core of the recognition system is a custom Attention-based Bidirectional LSTM (BiLSTM) network designed for processing streaming skeletal data.

Architecture Overview

The model maps an input sequence of spatial keypoints, shaped (T, F) for T time steps and F keypoint features, to a probability distribution over the sign classes.

Diagram

```mermaid
graph TD
    Input["Input Sequence (T,F)"] --> SGE[Spatial Group Embedding]
    SGE --> L1[ResBiLSTM Block 1]
    L1 --> L2[ResBiLSTM Block 2]
    L2 --> L3[ResBiLSTM Block 3]
    L3 --> L4[ResBiLSTM Block 4]
    L4 --> Attn[Multihead Attention]
    Attn --> Pool[Attention Pooling]
    Pool --> FC[Classifier Head]
    FC --> Softmax[Softmax Probabilities]
```

1. Spatial Group Embedding (SGE)

Before temporal processing, we project each body-part group independently into a shared latent space, which lets the model learn part-specific features before fusion (see the sketch after this list).

  • Inputs: Pose, Face, Left Hand, Right Hand.
  • Projections: 4 separate Linear layers.
  • Fusion: Concatenation → GELU → BatchNorm → Permute.
  • Output: A unified feature vector per time step.
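A minimal PyTorch sketch of the SGE. The class name, the per-part widths in `part_dims` (2-D keypoints assumed), and `embed_dim` are illustrative assumptions, not the project's actual values.

```python
import torch
import torch.nn as nn

class SpatialGroupEmbedding(nn.Module):
    """Projects each body-part group independently, then fuses them."""
    def __init__(self, part_dims=None, embed_dim=64):
        super().__init__()
        # Hypothetical per-part widths: (x, y) pairs per keypoint.
        part_dims = part_dims or {"pose": 66, "face": 140, "lhand": 42, "rhand": 42}
        self.projs = nn.ModuleDict(
            {name: nn.Linear(dim, embed_dim) for name, dim in part_dims.items()}
        )
        self.act = nn.GELU()
        # BatchNorm1d normalizes over channels, so forward() permutes to (B, C, T).
        self.norm = nn.BatchNorm1d(embed_dim * len(part_dims))

    def forward(self, parts):
        # parts: dict mapping part name -> (B, T, part_dim) tensor.
        x = torch.cat([proj(parts[name]) for name, proj in self.projs.items()], dim=-1)
        x = self.act(x)                                     # (B, T, 4 * embed_dim)
        x = self.norm(x.permute(0, 2, 1)).permute(0, 2, 1)  # normalize channels, restore (B, T, C)
        return x
```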

2. Residual BiLSTM Layers

We use a stack of BiLSTM blocks to capture temporal dependencies (a sketch follows the list below).

  • Bidirectional: Processes the sequence forwards and backwards, so each time step sees both past and future context.
  • Residual Connection: The block's input is added to its output, easing gradient flow and mitigating vanishing gradients in the stack.
  • Layer Normalization: Applied after the residual addition for training stability.
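A minimal sketch of one block under the same assumptions. Setting the per-direction hidden size to half the input width is one way to keep the residual addition shape-compatible; the real model may size it differently.

```python
import torch.nn as nn

class ResBiLSTMBlock(nn.Module):
    """One residual BiLSTM block: BiLSTM -> residual add -> LayerNorm."""
    def __init__(self, dim):
        super().__init__()
        # dim // 2 per direction keeps the BiLSTM output width equal to its input width.
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):           # x: (B, T, dim)
        out, _ = self.lstm(x)       # (B, T, dim): forward and backward halves concatenated
        return self.norm(x + out)   # residual addition, then LayerNorm
```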

3. Self-Attention Pooling

Instead of taking only the last hidden state (which discards early context) or averaging all states (which dilutes informative frames), we pool with a self-attention mechanism, sketched after this list.

  • Query: A learned query vector scores each time step by its relevance to the classification.
  • Weighted Sum: The final representation is the softmax-weighted sum of all time steps.
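A sketch of learned-query attention pooling; the optional `mask` argument for ignoring padded frames is an assumption added for completeness, not necessarily part of the actual model.

```python
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learned-query attention pooling over the time axis."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # trainable relevance score per time step

    def forward(self, x, mask=None):        # x: (B, T, dim)
        scores = self.score(x).squeeze(-1)  # (B, T)
        if mask is not None:                # optionally ignore padded frames
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = scores.softmax(dim=-1)    # attention weights over time
        return (weights.unsqueeze(-1) * x).sum(dim=1)  # (B, dim)
```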

4. Classification Head

  • Dropout: Applied to the pooled representation for regularization.
  • Linear Layer: Maps the pooled representation to num_classes logits (see the assembly sketch below).
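A sketch of the head, and of how the pieces above might be wired together end to end to mirror the diagram; `dim`, `num_classes`, `num_heads`, and the dropout rate are illustrative assumptions.

```python
import torch.nn as nn

class SignClassifier(nn.Module):
    """End-to-end assembly, reusing the SGE, ResBiLSTMBlock, and AttentionPooling sketches above."""
    def __init__(self, dim=256, num_classes=250, num_blocks=4, num_heads=4, p_drop=0.3):
        super().__init__()
        self.blocks = nn.ModuleList(ResBiLSTMBlock(dim) for _ in range(num_blocks))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pool = AttentionPooling(dim)
        self.head = nn.Sequential(nn.Dropout(p_drop), nn.Linear(dim, num_classes))

    def forward(self, x):            # x: (B, T, dim), the SGE output
        for block in self.blocks:
            x = block(x)
        x, _ = self.attn(x, x, x)    # multi-head self-attention over time
        pooled = self.pool(x)        # (B, dim)
        return self.head(pooled)     # raw logits
```

Note that the module returns raw logits: the diagram's final softmax is typically folded into the training loss (e.g. `nn.CrossEntropyLoss`) and applied explicitly only at inference.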