mmap_dataset.py
Tags: source, data, pytorch, performance
File Path: src/data/mmap_dataset.py
Purpose: High-performance PyTorch Dataset backed by numpy.memmap, allowing training on datasets larger than RAM.
Overview
Instead of loading thousands of small files (the lazy-loading approach) or the whole dataset into RAM, this class maps a single giant binary file (train_X.mmap) into virtual memory. The OS pages data in and out of RAM as needed.
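As a minimal illustration of that behavior, the snippet below opens a hypothetical memmap and touches a small slice; the path and shape here are placeholders, not values taken from this repository.

```python
import numpy as np

# Illustration only (hypothetical path and shape): opening the memmap does
# not read the file; the OS pages data into RAM only when a slice is touched.
X = np.memmap("train_X.mmap", dtype="float32", mode="r",
              shape=(10_000_000, 75, 3))   # mapped size can exceed physical RAM
sample = np.array(X[1_000:1_050])          # only these rows are paged in
```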
Class MmapKArSLDataset
Inherits: torch.utils.data.Dataset
__init__
Logic:
- Loads metadata:
  - X_shape.npy: total dimensions of the giant array.
  - y.npz: labels array.
  - X_map_samples_lens.npy: length of each sample within the giant array.
- Memmap: creates a read-only view (mode="r") of the data: self.X = np.memmap(data_path, dtype="float32", mode="r", shape=X_shape).
- Offset Calculation: pre-calculates the start index (X_offsets) of every sample to allow O(1) random access (see the constructor sketch below).
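A minimal sketch of this constructor under those assumptions; the directory layout, constructor arguments, and the key inside y.npz are guesses for illustration, not the file's actual signature.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class MmapKArSLDataset(Dataset):
    def __init__(self, data_dir, sampler=None, augmentor=None):
        # 1. Metadata: overall array shape, labels, and per-sample lengths.
        X_shape = tuple(np.load(f"{data_dir}/X_shape.npy"))
        self.y = np.load(f"{data_dir}/y.npz")["y"]                  # key "y" is assumed
        self.X_lens = np.load(f"{data_dir}/X_map_samples_lens.npy")

        # 2. Read-only memmap view over the single giant binary file.
        self.X = np.memmap(f"{data_dir}/train_X.mmap",
                           dtype="float32", mode="r", shape=X_shape)

        # 3. Offset calculation: cumulative sum of lengths gives the start
        #    row of every sample, so lookups in __getitem__ are O(1).
        self.X_offsets = np.concatenate(([0], np.cumsum(self.X_lens)[:-1]))

        # Per-item transforms (e.g. TSNSampler, DataAugmentor) are assumed
        # to be injected here.
        self.sampler = sampler
        self.augmentor = augmentor

    def __len__(self):
        return len(self.X_lens)
```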
__getitem__(index)
Logic:
- Uses index to find start_offset and length.
- Slices the memmap (zero-copy operation): raw = self.X[start:start+len].
- Applies TSNSampler to get a fixed-size sample.
- Applies DataAugmentor (see the continuation sketch below).
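Continuing the class sketch from the __init__ section (same imports), a hedged version of these steps could look as follows; the TSNSampler and DataAugmentor call signatures are assumptions.

```python
    def __getitem__(self, index):
        start = int(self.X_offsets[index])
        length = int(self.X_lens[index])

        # Zero-copy slice: this returns a view into the memmap; pages are
        # faulted in only when the data is actually read below.
        raw = self.X[start:start + length]

        clip = self.sampler(raw) if self.sampler else raw          # fixed-size sampling
        clip = self.augmentor(clip) if self.augmentor else clip    # augmentation

        # Copies only the selected frames into RAM before tensor conversion.
        x = torch.from_numpy(np.ascontiguousarray(clip, dtype=np.float32))
        return x, int(self.y[index])
```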
Performance Note
TIP
This is the recommended dataset for training on high-performance clusters or machines with fast SSDs (NVMe). It significantly increases GPU utilization by removing CPU/IO bottlenecks.
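For context, a possible training-side usage with a hypothetical data directory and a trivial stand-in for TSNSampler. On Linux (fork start method) the DataLoader worker processes inherit the read-only mapping and share pages through the OS page cache, so several workers can read the same file cheaply.

```python
from torch.utils.data import DataLoader


def fixed_len(clip):
    # Trivial stand-in for TSNSampler: keep the first 32 frames so default
    # batch collation works; real code would use the project's sampler.
    return clip[:32]


dataset = MmapKArSLDataset("data/mmap_preprocessed", sampler=fixed_len)  # illustrative path
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=8, pin_memory=True, persistent_workers=True)

for X_batch, y_batch in loader:
    ...  # forward/backward pass
```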
Related Documentation
Depends On:
- mmap_dataset_preprocessing.py - Creates the mmap files
- constants.py -
MMAP_PREPROCESSED_DIR
Used By: