Architecture Overview

Tags: architecture, system-design, components

This document provides a high-level overview of the Arabic Sign Language Recognition system architecture, component interactions, and data flow.

System Architecture

graph TB
    subgraph "Frontend (Browser)"
        A[User Camera] --> B[HTML5 Canvas]
        B --> C[WebSocket Client]
        C --> D[live-signs.js]
    end
    
    subgraph "Backend (FastAPI)"
        E[WebSocket Handler] --> F[Frame Buffer]
        F --> G[Motion Detector]
        G --> H[MediaPipe Processor]
        H --> I[Keypoint Extractor]
        I --> J[ONNX Inference]
        J --> K[Sign Classifier]
    end
    
    subgraph "Models & Data"
        L[ONNX Model]
        M[MediaPipe Models]
        N[Sign Labels]
    end
    
    C <-->|Binary Frames| E
    J --> L
    H --> M
    K --> N
    K -->|JSON Response| C
    
    style A fill:#e1f5ff
    style L fill:#ffe1e1
    style M fill:#ffe1e1
    style N fill:#ffe1e1

Component Overview

1. Frontend Layer

Technology: HTML5, CSS3, JavaScript (Vanilla)

Components:

  • Camera Handler: Captures video frames from webcam
  • WebSocket Client: Establishes real-time connection to backend
  • UI Controller: Displays recognized signs, confidence scores, and skeletal visualizations
  • Frame Encoder: Encodes canvas frames as JPEG (with optimization flags) for transmission
  • Visualization Engine: Renders body-region-specific landmarks and connections

See Web Interface Design for details.

2. API Layer

Technology: FastAPI, Uvicorn, WebSockets

Components:

  • FastAPI Application: HTTP server and routing
  • WebSocket Handler: Manages real-time frame processing
  • CORS Middleware: Handles cross-origin requests
  • Lifespan Manager: Model loading and cleanup

Functions:

  • lifespan() - Loads ONNX model on startup
  • ws_live_signs() - Main WebSocket handler
  • live_signs_ui() - Serves frontend HTML

See FastAPI Application for details.
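
The lifespan() hook follows the standard async-context-manager pattern. Below is a stdlib-only sketch of that pattern; load_onnx_model is a stub standing in for the real loader, and in the actual app the context manager is handed to FastAPI(lifespan=...).

```python
from contextlib import asynccontextmanager
import asyncio

# Stdlib-only sketch of the lifespan pattern: load the ONNX session once at
# startup, release it at shutdown. load_onnx_model is a stub here; the real
# app passes this context manager to FastAPI(lifespan=...).

STATE = {}

def load_onnx_model():
    return "onnx-session"  # placeholder for an onnxruntime InferenceSession

@asynccontextmanager
async def lifespan(app=None):
    STATE["model"] = load_onnx_model()  # startup: load once, reuse everywhere
    try:
        yield
    finally:
        STATE.clear()  # shutdown: release the session

async def demo():
    async with lifespan():
        return STATE["model"]

print(asyncio.run(demo()))  # onnx-session
```

Keeping the session in shared state means every WebSocket connection reuses the same loaded model instead of reloading it per request.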

3. Processing Pipeline

Technology: OpenCV, MediaPipe, NumPy

Components:

Frame Buffer

Asynchronous queue for managing incoming frames between the producer and consumer tasks.

Key Class: asyncio.Queue (used in live_processing.py)

Methods:

  • put_nowait() - Adds a decoded frame, raising QueueFull when the buffer is at capacity
  • get() - Awaits and removes the next frame in FIFO order
  • qsize() / empty() - Report how many frames are waiting
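
The frame buffer's behaviour can be sketched with a bounded asyncio.Queue. The drop-oldest policy shown here is an assumption about how a full buffer is handled; the point is that a fixed maxsize keeps memory bounded while the consumer lags.

```python
import asyncio

# Sketch of the producer/consumer hand-off: a bounded asyncio.Queue
# decouples frame ingestion from processing. When the queue is full, the
# oldest frame is dropped (an assumed policy) so the stream stays live.

async def put_latest(queue: asyncio.Queue, frame) -> None:
    """Enqueue a frame, discarding the oldest one if the buffer is full."""
    if queue.full():
        queue.get_nowait()  # drop the stalest frame
    queue.put_nowait(frame)

async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=3)
    for i in range(5):           # producer: 5 frames into a 3-slot buffer
        await put_latest(queue, i)
    consumed = []
    while not queue.empty():     # consumer: drain what survived
        consumed.append(await queue.get())
        queue.task_done()
    return consumed

print(asyncio.run(demo()))  # [2, 3, 4] - frames 0 and 1 were dropped
```

With maxsize=3, pushing five frames leaves only the three newest in the buffer, so a slow consumer always sees recent video rather than a growing backlog.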

Motion Detection

Detects movement to trigger sign recognition.

Key Class: MotionDetector in cv2_utils.py

Methods:

  • detect() - Compares consecutive frames
  • convert_small_gray() - Preprocesses frames
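
The detect()/convert_small_gray() pair can be approximated without OpenCV. This NumPy sketch uses an assumed threshold and downscale factor, not the values in cv2_utils.py.

```python
import numpy as np

# Illustrative frame-differencing motion detector. The real MotionDetector
# in cv2_utils.py uses OpenCV; the threshold and downscale factor here are
# assumptions for demonstration only.

def to_small_gray(frame: np.ndarray, factor: int = 4) -> np.ndarray:
    """Downscale by striding and average RGB channels to grayscale."""
    small = frame[::factor, ::factor].astype(np.float32)
    return small.mean(axis=2)

def motion_detected(prev: np.ndarray, curr: np.ndarray, threshold: float = 5.0) -> bool:
    """Mean absolute pixel difference above the threshold counts as motion."""
    return float(np.abs(curr - prev).mean()) > threshold

rng = np.random.default_rng(0)
frame_a = rng.integers(0, 255, (480, 640, 3), dtype=np.uint8)
frame_b = frame_a.copy()
frame_b[100:300, 100:300] = 255          # simulate a moving hand
a, b = to_small_gray(frame_a), to_small_gray(frame_b)
print(motion_detected(a, a), motion_detected(a, b))  # False True
```

Downscaling first makes the comparison cheap enough to run on every incoming frame, which is what lets the pipeline skip MediaPipe entirely during idle periods.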

Keypoint Extraction

Extracts pose, face, and hand landmarks using MediaPipe.

Key Class: LandmarkerProcessor in mediapipe_utils.py

Methods:

  • extract_frame_keypoints() - Extracts all landmarks
  • init_mediapipe_landmarkers() - Initializes MediaPipe models

See MediaPipe Integration for details.
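
extract_frame_keypoints() ultimately has to produce a fixed-size array even when MediaPipe misses a region. A hedged sketch of that flattening step follows; the per-region landmark counts (33 pose, 109 face, 21 per hand) are guesses that happen to total FEAT_NUM = 184, so check mediapipe_utils.py for the real layout.

```python
import numpy as np

# Hedged sketch of flattening per-frame landmarks into the fixed
# (FEAT_NUM, FEAT_DIM) array the model expects. The per-region counts
# below are assumptions; see mediapipe_utils.py for the actual layout.

FEAT_NUM, FEAT_DIM = 184, 4  # from constants.py: features x (x, y, z, v)

def flatten_keypoints(regions: dict) -> np.ndarray:
    """Concatenate landmark regions, zero-filling any region MediaPipe missed."""
    counts = {"pose": 33, "face": 109, "left_hand": 21, "right_hand": 21}  # assumed split
    parts = []
    for name, n in counts.items():
        lms = regions.get(name)
        if lms is None:
            parts.append(np.zeros((n, FEAT_DIM), dtype=np.float32))  # region not detected
        else:
            parts.append(np.asarray(lms, dtype=np.float32).reshape(n, FEAT_DIM))
    return np.concatenate(parts, axis=0)

frame = flatten_keypoints({"pose": np.ones((33, 4)), "left_hand": np.ones((21, 4))})
print(frame.shape)  # (184, 4); face and right hand are zero-filled
```

Zero-filling keeps every frame the same shape, so downstream sequence batching never has to handle missing regions specially.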

4. Model Layer

Technology: PyTorch, ONNX Runtime

Components:

Model Architecture

Spatial-Temporal Transformer (ST-Transformer) for sequence classification.

Key Classes in model.py:

  • STTransformer - Main model architecture
  • GroupTokenEmbedding - Body part tokenization layer
  • STTransformerBlock - Spatial-Temporal dual attention block
  • AttentionPooling - Attention-based temporal aggregation

Model Pipeline:

  1. Input: Keypoint sequences (batch, seq_len, features)
  2. Embedding: Group token embedding (4 tokens: Pose, Face, Left Hand, Right Hand)
  3. Positioning: Sinusoidal positional encoding
  4. Transformer: N consecutive Spatial-Temporal attention blocks
  5. Pooling: Attention-based temporal pooling
  6. Output: Class logits (502 classes)

See Model Architecture for details.
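
Step 3 of the pipeline (sinusoidal positional encoding) is standard enough to sketch directly; d_model = 64 here is an illustrative embedding width, not necessarily the one STTransformer uses.

```python
import numpy as np

# Classic fixed sinusoidal positional encoding: sin on even dimensions,
# cos on odd dimensions. The embedding width is illustrative.

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angle = pos / np.power(10000.0, dim / d_model)
    enc = np.zeros((seq_len, d_model), dtype=np.float32)
    enc[:, 0::2] = np.sin(angle)
    enc[:, 1::2] = np.cos(angle)
    return enc

pe = sinusoidal_positions(seq_len=50, d_model=64)  # SEQ_LEN = 50 from constants.py
print(pe.shape)  # (50, 64)
```

Because the encoding is deterministic, it adds no learned parameters: the same position matrix is added to every sequence before the transformer blocks.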

Inference Engine

ONNX Runtime for optimized CPU inference.

Key Functions in model.py:

  • load_onnx_model() - Loads ONNX model
  • onnx_inference() - Runs inference
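
onnx_inference() returns raw logits (502 per the pipeline above), which still need softmax and a confidence check before anything is sent to the client. A minimal sketch, with an assumed 0.5 threshold and a toy three-class label set:

```python
import math

# Hedged post-processing sketch: turn raw logits into a (label, confidence)
# pair. The 0.5 confidence threshold is an assumed value.

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels, min_confidence=0.5):
    """Return (label, confidence), or (None, confidence) below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = labels[best] if probs[best] >= min_confidence else None
    return label, probs[best]

print(classify([0.1, 4.0, 0.3], ["hello", "thanks", "yes"]))
```

Suppressing low-confidence predictions keeps the UI from flickering between unrelated signs while the user is mid-gesture.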

5. Data Layer

Technology: PyTorch, NumPy, Pandas

Components:

Dataset Loaders

  • LazyDataset: On-demand loading from NPZ files
  • MmapDataset: Memory-mapped dataset for efficient access

Data Preparation

  • Video preprocessing
  • Keypoint extraction from videos
  • Dataset splitting (train/val/test)

See Data Preparation Pipeline for details.
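
The on-demand loading that LazyDataset relies on can be demonstrated with NumPy alone: an NpzFile reads each array from disk only when it is indexed. Key names and shapes below are illustrative, not the dataset's actual schema.

```python
import os
import tempfile
import numpy as np

# Sketch of LazyDataset-style on-demand loading from an NPZ archive.
# Key names and shapes are illustrative assumptions.

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "keypoints.npz")
    np.savez(path, **{f"sample_{i}": np.random.rand(50, 184, 4).astype(np.float32)
                      for i in range(3)})   # SEQ_LEN x FEAT_NUM x FEAT_DIM

    archive = np.load(path)    # cheap: reads the zip index, not the arrays
    keys = sorted(archive.files)
    first = archive[keys[0]]   # this array is read from disk only now
    print(len(keys), first.shape)  # 3 (50, 184, 4)
    archive.close()
```

Because samples are materialized one at a time, a dataset far larger than RAM can still feed a DataLoader.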

Data Flow

Real-Time Recognition Flow

sequenceDiagram
    participant User
    participant Browser
    participant WebSocket as WS Router
    participant Producer as Producer Handler
    participant Queue as asyncio.Queue
    participant Consumer as Consumer Handler
    participant MediaPipe
    participant ONNX
    
    User->>Browser: Perform sign
    Browser->>WebSocket: Connect WS
    WebSocket->>Queue: Initialize (max_size=50)
    WebSocket->>Producer: Spawn Task
    WebSocket->>Consumer: Spawn Task
    
    loop Stream
        Browser->>Producer: Send Frame (JPEG)
        Producer->>Queue: Put (Decoded Frame)
        
        Queue->>Consumer: Get Frame
        Consumer->>Consumer: Motion Detection
        
        alt Motion Detected
            Consumer->>MediaPipe: Extract Keypoints
            Consumer->>Consumer: Buffer Keypoints
            
            alt Keypoints >= 15
                Consumer->>ONNX: Run Inference
                ONNX-->>Consumer: Logits
                Consumer->>Browser: Send Prediction (JSON)
            end
        else No Motion
            Consumer->>Browser: Send Idle Status
        end
    end
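
The consumer branch of this loop amounts to a small state machine: clear the keypoint buffer when motion stops, otherwise accumulate frames and run inference once 15 are available (the trigger shown in the diagram). A stubbed sketch, with extraction and inference replaced by stand-ins:

```python
# State-machine sketch of the consumer loop. extract and infer are stubs
# standing in for MediaPipe extraction and ONNX inference.

MIN_FRAMES = 15  # inference trigger from the diagram above

class ConsumerState:
    def __init__(self):
        self.keypoints = []

    def step(self, frame, motion: bool, extract, infer):
        """Process one frame; return a status message for the client."""
        if not motion:
            self.keypoints.clear()           # idle: discard partial sequence
            return {"status": "idle"}
        self.keypoints.append(extract(frame))
        if len(self.keypoints) < MIN_FRAMES:
            return {"status": "collecting", "frames": len(self.keypoints)}
        prediction = infer(self.keypoints)   # enough frames: classify
        self.keypoints.clear()
        return {"status": "prediction", "sign": prediction}

state = ConsumerState()
results = [state.step(f, motion=True, extract=lambda f: f, infer=lambda kp: "hello")
           for f in range(16)]
print(results[13]["status"], results[14]["status"], results[15]["status"])
```

Resetting the buffer after each prediction (and on idle) keeps sequences from bleeding into one another between signs.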

Training Flow

graph LR
    A[Raw Videos] --> B[Video Preprocessing]
    B --> C[MediaPipe Extraction]
    C --> D[NPZ Keypoints]
    D --> E[Dataset Loader]
    E --> F[DataLoader]
    F --> G[Model Training]
    G --> H[PyTorch Checkpoint]
    H --> I[ONNX Export]
    I --> J[ONNX Model]
    J --> K[Production Inference]
    
    style A fill:#e1f5ff
    style J fill:#e1ffe1
    style K fill:#ffe1e1

Configuration Management

Environment Variables

Managed through .env file:

ONNX_CHECKPOINT_FILENAME  # Model filename
DOMAIN_NAME               # CORS allowed origin
LOCAL_DEV                 # Local vs Kaggle paths
USE_CPU                   # Force CPU execution

See Environment Configuration for all options.
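
Reading these variables at startup might look like the following; the default values are illustrative assumptions, not the application's real fallbacks.

```python
import os

# Hedged sketch of loading the .env-driven settings listed above.
# Defaults are assumptions for illustration.

def load_settings() -> dict:
    return {
        "onnx_checkpoint": os.getenv("ONNX_CHECKPOINT_FILENAME", "model.onnx"),
        "allowed_origin": os.getenv("DOMAIN_NAME", "http://localhost:8000"),
        "local_dev": os.getenv("LOCAL_DEV", "1") == "1",
        "use_cpu": os.getenv("USE_CPU", "1") == "1",
    }

os.environ["USE_CPU"] = "0"   # e.g. set in .env or the shell
settings = load_settings()
print(settings["use_cpu"])    # False
```

Centralizing the reads in one function keeps path switching (local vs Kaggle) and device selection out of the request-handling code.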

Constants

Defined in constants.py:

SEQ_LEN = 50              # Sequence length
FEAT_NUM = 184            # Number of features
FEAT_DIM = 4              # Feature dimensions (x, y, z, v)
DEVICE = "cpu" | "cuda"   # Execution device
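
Together these constants pin down the model's input shape. The pipeline above describes the input as (batch, seq_len, features), so one sequence carries FEAT_NUM × FEAT_DIM values per frame (the flattening of the last two axes is an assumption consistent with that shape):

```python
# How the constants above determine the inference input shape; whether the
# last two axes are flattened is an assumption based on the pipeline's
# (batch, seq_len, features) description.

SEQ_LEN, FEAT_NUM, FEAT_DIM = 50, 184, 4

batch = 1
input_shape = (batch, SEQ_LEN, FEAT_NUM * FEAT_DIM)
print(input_shape)  # (1, 50, 736)
```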

Deployment Architecture

Docker Deployment

graph TB
    subgraph "Docker Container"
        A[Uvicorn Server]
        B[FastAPI App]
        C[ONNX Runtime]
        D[MediaPipe]
        E[Static Files]
    end
    
    F[Host Port 8000] --> A
    A --> B
    B --> C
    B --> D
    B --> E
    
    G[Volume: ./] --> B
    H[Volume: ./models] --> C
    I[Volume: ./landmarkers] --> D
    
    style A fill:#e1f5ff
    style G fill:#ffe1e1
    style H fill:#ffe1e1
    style I fill:#ffe1e1

Features:

  • Hot reload enabled for development
  • Volume mounts for code and models
  • Automatic dependency installation
  • Consistent environment across platforms

See Docker Setup for configuration.

Performance Considerations

  1. ONNX Runtime: Inference engine for CPU-bound environments.
  2. CPU Execution: Configured for hardware without GPU acceleration.
  3. Frame Buffering: Asynchronous queue management to prevent memory exhaustion.
  4. Motion Detection: Frame differencing to reduce processing load during idle periods.
  5. Async Processing: Non-blocking concurrency for client-server communication.
  6. Thread Pool: Parallel execution for compute-intensive keypoint extraction.
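
Item 6 can be sketched with asyncio's executor support: blocking, CPU-heavy work is pushed onto a thread pool so the event loop stays free to serve WebSocket traffic. extract_keypoints below is a stand-in stub for the real MediaPipe call.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Offload blocking work to a thread pool via run_in_executor so the event
# loop is never blocked. extract_keypoints is a stub for MediaPipe work.

def extract_keypoints(frame):
    return sum(range(1000))  # placeholder for blocking, CPU-bound extraction

async def process(frames):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as pool:
        tasks = [loop.run_in_executor(pool, extract_keypoints, f) for f in frames]
        return await asyncio.gather(*tasks)

results = asyncio.run(process(range(4)))
print(len(results))  # 4
```

Since MediaPipe releases the GIL during its native processing, threads (rather than processes) are usually enough to overlap extraction with I/O.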

Bottlenecks

  • MediaPipe Processing: ~20-30ms per frame
  • ONNX Inference: ~10-20ms per sequence
  • Network Latency: WebSocket frame transmission

Security Considerations

  • CORS: Configured allowed origins
  • WebSocket: No authentication (add for production)
  • Input Validation: Frame size and format checks
  • Resource Limits: Frame buffer size limits

Scalability

Current Limitations

  • Single-threaded WebSocket handler
  • In-memory frame buffer
  • No load balancing

Future Improvements

  • Multi-worker deployment
  • Redis for session management
  • Load balancer for multiple instances
  • GPU acceleration for inference
