For example, with split="val" the indices are [n_train : n_train + n_val]. Since shuffle=False, the index sampler is SequentialSampler(Subset(data_container, indices)). Notice that this sampler produces indices in the range [0, n_val), i.e. positions local to the subset, not positions in the full dataset.
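A quick way to see this (a minimal sketch; the container and split sizes are toy values for illustration):

from torch.utils.data import SequentialSampler, Subset

data_container = list(range(10))                 # toy stand-in for the full set
n_train, n_val = 6, 3
indices = list(range(n_train, n_train + n_val))  # validation positions [6, 7, 8]

# The sampler enumerates positions *within the subset*, not within the full set.
print(list(SequentialSampler(Subset(data_container, indices))))  # -> [0, 1, 2]

The loader, however, was constructed against the full container: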
super().__init__(
    data_container,                # <- the FULL dataset, not the subset
    batch_sampler=batch_sampler,   # a BatchSampler goes here, not in sampler=
    collate_fn=lambda x: collate(x, data_container),
    pin_memory=True,               # page-locked memory speeds CPU-to-GPU transfer
    **kwargs,
)
As the snippet shows, the "dataset" passed to the DataLoader is the full dataset, and the loader's iterator fetches samples at whatever indices the sampler yields. Since the sampler yields indices in [0, n_val), the loader actually reads the first n_val items of the full dataset, i.e. data from the training split.
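To make the failure concrete, here is a self-contained reproduction (a sketch with toy data; the values encode their position so the mix-up is visible):

from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, Subset

data_container = list(range(100, 110))           # full set: item i has value 100 + i
n_train, n_val = 6, 3
indices = list(range(n_train, n_train + n_val))  # validation positions [6, 7, 8]

idx_sampler = SequentialSampler(Subset(data_container, indices))
batch_sampler = BatchSampler(idx_sampler, batch_size=2, drop_last=False)

# Buggy construction: full container + subset-local indices [0, 1, 2].
loader = DataLoader(data_container, batch_sampler=batch_sampler)
print([batch.tolist() for batch in loader])      # [[100, 101], [102]] -- training data!

The fix is to make the dataset and the sampler agree on an index space, which is what the corrected loader does: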
import torch
from torch.utils.data import (
    BatchSampler, DataLoader, SequentialSampler, Subset, SubsetRandomSampler
)


class CustomDataLoader(DataLoader):
    def __init__(
        self, data_container, batch_size, indices, shuffle, seed=None, **kwargs
    ):
        if shuffle:
            generator = torch.Generator()
            if seed is not None:
                generator.manual_seed(seed)
            # SubsetRandomSampler yields the given indices themselves,
            # i.e. positions in the FULL dataset.
            idx_sampler = SubsetRandomSampler(indices, generator=generator)
        else:
            # SequentialSampler(Subset(...)) yields 0, 1, 2, ..., i.e.
            # positions LOCAL to the subset, not the full dataset.
            idx_sampler = SequentialSampler(Subset(data_container, indices))
        batch_sampler = BatchSampler(
            idx_sampler, batch_size=batch_size, drop_last=False
        )
        # The dataset must match the sampler's index space: the full container
        # for SubsetRandomSampler, the Subset for SequentialSampler. Passing
        # the full container in the sequential case would silently serve the
        # first len(indices) items, i.e. training data.
        dataset = data_container if shuffle else Subset(data_container, indices)
        super().__init__(
            dataset,
            batch_sampler=batch_sampler,  # a BatchSampler goes here, not in sampler=
            collate_fn=data_container.collate_fn,
            pin_memory=True,              # page-locked memory speeds CPU-to-GPU transfer
            **kwargs,
        )
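Usage sketch: the ToyContainer below is hypothetical, standing in for whatever data_container is in the real code; the loader only assumes it is indexable, has a length, and exposes a collate_fn attribute (default_collate is importable from torch.utils.data in torch >= 1.11).

from torch.utils.data import Dataset, default_collate

class ToyContainer(Dataset):
    def __init__(self, values):
        self.values = values
        self.collate_fn = default_collate  # the loader reads this attribute
    def __len__(self):
        return len(self.values)
    def __getitem__(self, i):
        return self.values[i]

container = ToyContainer(list(range(100, 110)))
val_loader = CustomDataLoader(
    container, batch_size=2, indices=list(range(6, 9)), shuffle=False
)
print([batch.tolist() for batch in val_loader])  # [[106, 107], [108]] -- validation data

With shuffle=True and a fixed seed, the same three positions come back in a reproducible random order instead.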