# Live Processing Pipeline
The Live Processing Pipeline is the engine that converts a stream of raw video frames into meaningful sign language predictions. It orchestrates frame decoding, motion detection, buffering, keypoint extraction, and model inference.
## Processing Flow
```mermaid
graph LR
    A[Raw Bytes] -->|Decode| B[BGR Image]
    B -->|Check| C{Motion?}
    C -- No --> D[Discard/Reset]
    C -- Yes --> E[Frame Buffer]
    E -->|Full?| F{Ready?}
    F -- No --> B
    F -- Yes --> G[MediaPipe]
    G -->|Keypoints| H[ONNX Model]
    H -->|Probabilities| I[Prediction]
```
## Pipeline Stages
### 1. Frame Acquisition
- Input: Raw bytes from the WebSocket.
- Action: Decoded into a NumPy array (BGR image) using `cv2.imdecode`.
- Optimization: Decoding happens in a separate thread to avoid blocking the async event loop (see the sketch below).
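A minimal sketch of this handoff, assuming the frames arrive as encoded JPEG/PNG bytes; the name `decode_frame` is illustrative rather than taken from the codebase:

```python
import asyncio

import cv2
import numpy as np

async def decode_frame(raw: bytes) -> np.ndarray | None:
    """Decode image bytes into a BGR array without blocking the event loop."""
    loop = asyncio.get_running_loop()
    buf = np.frombuffer(raw, dtype=np.uint8)
    # cv2.imdecode is CPU-bound, so it runs in the default thread pool
    # while the WebSocket handler keeps servicing other messages.
    # Returns None if the bytes do not decode to a valid image.
    return await loop.run_in_executor(None, cv2.imdecode, buf, cv2.IMREAD_COLOR)
```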
### 2. Motion Detection
To save computational resources and reduce false positives, the system processes frames only when motion is detected.
- Mechanism: A `MotionDetector` class compares the current frame with a running average of previous frames.
- Threshold: If the pixel difference exceeds a set threshold (sensitivity), the frame is flagged as "active".
- Reset: If no motion is detected for a certain period, the system resets its state and clears the frame buffer.
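The actual `MotionDetector` class is not reproduced here; the sketch below captures the running-average comparison using OpenCV, with the grayscale conversion and the `alpha` update rate as assumptions:

```python
import cv2
import numpy as np

class MotionDetector:
    def __init__(self, sensitivity: float = 10.0, alpha: float = 0.1):
        self.sensitivity = sensitivity      # mean pixel-difference threshold
        self.alpha = alpha                  # running-average update rate
        self.background = None              # running average of past frames

    def is_active(self, frame: np.ndarray) -> bool:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if self.background is None:         # first frame seeds the average
            self.background = gray.copy()
            return False
        diff = cv2.absdiff(gray, self.background)
        cv2.accumulateWeighted(gray, self.background, self.alpha)
        return float(diff.mean()) > self.sensitivity
```

Resetting, as described above, amounts to setting `background` back to `None` alongside clearing the frame buffer.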
### 3. Frame Buffering
The model requires a sequence of frames (temporal context) to recognize a sign, not just a single static image.
- Buffer Size: Fixed sequence length (e.g., `SEQ_LEN` frames).
- Logic:
  - Valid frames (with motion) are appended to a ring buffer.
  - When the buffer reaches the required length, it is passed to the inference stage.
  - A sliding window approach allows for continuous recognition (see the sketch after this list).
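A `collections.deque` with `maxlen` gives this ring-buffer / sliding-window behaviour directly; `SEQ_LEN = 30` and the function name `push_frame` are assumptions:

```python
from collections import deque

import numpy as np

SEQ_LEN = 30                            # assumed sequence length
buffer: deque = deque(maxlen=SEQ_LEN)   # ring buffer: oldest frame falls off

def push_frame(frame: np.ndarray):
    """Append one motion-positive frame; return a full window when ready."""
    buffer.append(frame)
    if len(buffer) == SEQ_LEN:
        # The deque always holds the newest SEQ_LEN frames, so every
        # subsequent frame yields a fresh sliding window for inference.
        return np.stack(buffer)
    return None
```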
### 4. Keypoint Extraction (MediaPipe)
Raw images are processed by MediaPipe to extract skeletal landmarks.
- Components:
  - Face: 468 landmarks (reduced to a relevant subset).
  - Pose: 33 landmarks (shoulders, arms).
  - Hands: 21 landmarks per hand.
- Normalization: Landmarks are normalized relative to the image center or specific body points (e.g., the nose) to ensure scale invariance (see the sketch below).
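A rough illustration of this stage via the MediaPipe Holistic API; the 468-point face subset and scale normalization are omitted for brevity, and the nose-centring shown is just one of the normalization options mentioned above:

```python
import cv2
import mediapipe as mp
import numpy as np

holistic = mp.solutions.holistic.Holistic(static_image_mode=False)

def extract_keypoints(frame_bgr: np.ndarray) -> np.ndarray:
    """Return a flat (x, y) keypoint vector centred on the nose."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

    def to_array(landmarks, count):
        if landmarks is None:               # body part not detected
            return np.zeros((count, 2), dtype=np.float32)
        return np.array([(lm.x, lm.y) for lm in landmarks.landmark],
                        dtype=np.float32)

    pose = to_array(results.pose_landmarks, 33)
    left = to_array(results.left_hand_landmarks, 21)
    right = to_array(results.right_hand_landmarks, 21)
    nose = pose[0]                          # pose landmark 0 is the nose
    # Subtracting the nose position gives translation invariance.
    return (np.concatenate([pose, left, right]) - nose).reshape(-1)
```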
### 5. Inference
The extracted keypoints are formatted into a tensor and passed to the ONNX Runtime session.
- Input: Shape `(Batch, Time, Channels, Keypoints)`.
- Model: Attention-based BiLSTM.
- Output: Softmax probability distribution over the 502 sign classes.
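The ONNX Runtime call itself is short; the model file name, and whether the exported graph already ends in a softmax node, are assumptions in this sketch:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("sign_model.onnx")   # assumed file name
input_name = session.get_inputs()[0].name

def predict(window: np.ndarray) -> np.ndarray:
    """Run one (Time, Channels, Keypoints) window through the model."""
    tensor = window[np.newaxis].astype(np.float32)  # add the batch dimension
    logits = session.run(None, {input_name: tensor})[0][0]
    exp = np.exp(logits - logits.max())             # numerically stable softmax
    return exp / exp.sum()                          # probabilities over 502 classes
```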
### 6. Post-Processing
- Decoding: The class index with the highest probability is mapped to its text label (e.g., “HELLO”).
- Confidence Threshold: Predictions below a certain confidence score (e.g., 0.6) are discarded or marked as “Unknown”.
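Both steps reduce to an argmax plus a comparison, sketched here with a two-entry stand-in for the full 502-label table:

```python
import numpy as np

CONF_THRESHOLD = 0.6                        # example threshold from above
LABELS = ["HELLO", "THANK_YOU"]             # stand-in for the real label list

def decode(probs: np.ndarray) -> str:
    idx = int(np.argmax(probs))             # most likely class
    if probs[idx] < CONF_THRESHOLD:         # too uncertain: suppress it
        return "Unknown"
    return LABELS[idx]

print(decode(np.array([0.85, 0.15])))       # -> "HELLO"
print(decode(np.array([0.55, 0.45])))       # -> "Unknown"
```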