Live Processing Pipeline


The Live Processing Pipeline is the engine that converts a stream of raw video frames into meaningful sign language predictions. It orchestrates frame decoding, motion detection, buffering, keypoint extraction, and model inference.

Processing Flow

```mermaid
graph LR
    A[Raw Bytes] -->|Decode| B[BGR Image]
    B -->|Check| C{Motion?}
    C -- No --> D[Discard/Reset]
    C -- Yes --> E[Frame Buffer]
    E -->|Full?| F{Ready?}
    F -- No --> B
    F -- Yes --> G[MediaPipe]
    G -->|Keypoints| H[ONNX Model]
    H -->|Probabilities| I[Prediction]
```

Pipeline Stages

1. Frame Acquisition

  • Input: Raw bytes from the WebSocket.
  • Action: Decoded into a NumPy array (BGR image) using cv2.imdecode.
  • Optimization: Decoding happens in a separate thread to avoid blocking the async event loop.
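A minimal sketch of this stage, assuming each WebSocket message carries one complete encoded image; the decode is dispatched to a worker thread with run_in_executor so the event loop stays responsive:

```python
import asyncio

import cv2
import numpy as np

async def decode_frame(raw: bytes):
    """Decode raw image bytes into a BGR NumPy array without blocking
    the async event loop."""
    loop = asyncio.get_running_loop()
    buf = np.frombuffer(raw, dtype=np.uint8)
    # cv2.imdecode is CPU-bound, so it runs in the default thread pool;
    # the WebSocket handler keeps accepting frames in the meantime.
    frame = await loop.run_in_executor(None, cv2.imdecode, buf, cv2.IMREAD_COLOR)
    return frame  # None if the payload was not a decodable image
```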

2. Motion Detection

To save computational resources and reduce false positives, the system processes frames only when motion is detected.

  • Mechanism: A MotionDetector class compares the current frame with a running average of previous frames.
  • Threshold: If the pixel difference exceeds a set threshold (sensitivity), the frame is flagged as “active”.
  • Reset: If no motion is detected for a certain period, the system resets its state and clears the frame buffer.
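The sketch below shows one plausible shape for the MotionDetector described above, built on a running average via cv2.accumulateWeighted; the sensitivity, blur kernel, and averaging weight are illustrative values, not the project's actual settings:

```python
import cv2
import numpy as np

class MotionDetector:
    """Compares each frame against a running average of previous frames.
    Parameter values here are illustrative, not the project's defaults."""

    def __init__(self, sensitivity: float = 8.0, alpha: float = 0.1):
        self.sensitivity = sensitivity  # mean pixel difference that counts as motion
        self.alpha = alpha              # weight of the newest frame in the average
        self._avg = None

    def is_active(self, frame_bgr: np.ndarray) -> bool:
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (21, 21), 0).astype(np.float32)
        if self._avg is None:
            self._avg = gray.copy()
            return False  # need at least one frame of history
        diff = cv2.absdiff(gray, self._avg)
        cv2.accumulateWeighted(gray, self._avg, self.alpha)
        return float(diff.mean()) > self.sensitivity

    def reset(self) -> None:
        self._avg = None  # called after a quiet period, alongside clearing the buffer
```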

3. Frame Buffering

The model requires a sequence of frames (temporal context) to recognize a sign, not just a single static image.

  • Buffer Size: Fixed sequence length (e.g., SEQ_LEN frames).
  • Logic:
    • Valid frames (with motion) are appended to a ring buffer.
    • When the buffer reaches the required length, it is passed to the inference stage.
    • A sliding window approach allows for continuous recognition.
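As a sketch, the ring buffer plus sliding window can be expressed with collections.deque; the SEQ_LEN and STRIDE values below are illustrative placeholders:

```python
from collections import deque

SEQ_LEN = 30  # frames per model window; illustrative value
STRIDE = 10   # frames advanced between consecutive inferences; illustrative value

class FrameBuffer:
    def __init__(self, seq_len: int = SEQ_LEN, stride: int = STRIDE):
        self._frames = deque(maxlen=seq_len)  # ring buffer: oldest frame drops off
        self._stride = stride
        self._since_last = 0

    def push(self, frame):
        """Append one motion-active frame; return a full window when it is
        time to run inference, otherwise None."""
        self._frames.append(frame)
        self._since_last += 1
        if len(self._frames) == self._frames.maxlen and self._since_last >= self._stride:
            self._since_last = 0
            return list(self._frames)  # sliding window: the buffer is not cleared
        return None

    def clear(self) -> None:
        """Called on motion-detector reset."""
        self._frames.clear()
        self._since_last = 0
```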

4. Keypoint Extraction (MediaPipe)

Raw images are processed by MediaPipe to extract skeletal landmarks.

  • Components:
    • Face: 468 landmarks (reduced to a relevant subset).
    • Pose: 33 landmarks (shoulders, arms).
    • Hands: 21 landmarks per hand.
  • Normalization: Landmarks are normalized relative to the image center or a specific body point (e.g., the nose) to ensure scale invariance.
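A simplified extraction routine using the MediaPipe Holistic solution might look like the following; the face subset is omitted for brevity, and the nose-centred, shoulder-width-scaled normalization is one plausible reading of the scheme described above:

```python
import cv2
import mediapipe as mp
import numpy as np

def extract_keypoints(frame_bgr: np.ndarray, holistic) -> np.ndarray:
    """Extract pose and hand landmarks from one frame and normalize them.
    The face subset is omitted here for brevity."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    results = holistic.process(rgb)

    def as_xy(landmarks, n: int) -> np.ndarray:
        if landmarks is None:
            return np.zeros((n, 2), dtype=np.float32)  # part not detected
        return np.array([(lm.x, lm.y) for lm in landmarks.landmark], dtype=np.float32)

    pose = as_xy(results.pose_landmarks, 33)
    left_hand = as_xy(results.left_hand_landmarks, 21)
    right_hand = as_xy(results.right_hand_landmarks, 21)
    pts = np.concatenate([pose, left_hand, right_hand])

    # Translate to the nose (pose landmark 0) and divide by shoulder width
    # (landmarks 11 and 12), one plausible way to achieve scale invariance.
    nose = pose[0]
    scale = np.linalg.norm(pose[11] - pose[12]) or 1.0
    return (pts - nose) / scale

# Usage: one Holistic instance is reused across the whole stream.
# with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
#     keypoints = extract_keypoints(frame, holistic)
```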

5. Inference

The extracted keypoints are formatted into a tensor and passed to the ONNX Runtime session.

  • Input: Shape (Batch, Time, Channels, Keypoints).
  • Model: Attention-based BiLSTM.
  • Output: Softmax probability distribution over the 502 sign classes.
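A minimal inference wrapper with ONNX Runtime; the tensor layout follows the shape given above, but the model path and single-input assumption should be verified against the actual export:

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder path; session.get_inputs() reports the
# real input name and shape of the exported model.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

def infer(window: np.ndarray) -> np.ndarray:
    """window: (Time, Channels, Keypoints) float32 stack of keypoints.
    Returns the softmax probability vector over the sign classes."""
    tensor = window[np.newaxis].astype(np.float32)  # prepend Batch -> (B, T, C, K)
    probs = session.run(None, {input_name: tensor})[0]
    return probs[0]  # drop the batch dimension
```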

6. Post-Processing

  • Decoding: The class index with the highest probability is mapped to its text label (e.g., “HELLO”).
  • Confidence Threshold: Predictions below a certain confidence score (e.g., 0.6) are discarded or marked as “Unknown”.
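Putting both rules together, post-processing reduces to an argmax plus a threshold check; the labels.json file and its format here are hypothetical:

```python
import json

import numpy as np

CONF_THRESHOLD = 0.6  # the example threshold quoted above

# Hypothetical label file mapping class index to gloss, e.g. {"0": "HELLO"}.
with open("labels.json", encoding="utf-8") as f:
    LABELS = {int(k): v for k, v in json.load(f).items()}

def decode_prediction(probs: np.ndarray):
    idx = int(np.argmax(probs))
    confidence = float(probs[idx])
    if confidence < CONF_THRESHOLD:
        return "Unknown", confidence  # suppress low-confidence output
    return LABELS[idx], confidence
```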