Proposal: reproducibility in DataLoader #105

Open
CarlosNacher opened this issue Aug 10, 2022 · 1 comment

@CarlosNacher

Hi,

In the DataLoader class, there is an argument seed_for_shuffle which controls reproducibility when using DataLoader with infinite=False and shuffle=True (between two different training runs, the data feeding the network will always be the same). But why not do the same when infinite=True? Even if batch generation is infinite, you may want reproducibility between two different training runs.

So, there are two alternatives (a sketch of the first one follows the list):

  1. When doing return np.random.choice(self.indices, self.batch_size, replace=True, p=self.sampling_probabilities) in line 118 of DataLoader, do it using self.rs instead (i.e. return self.rs.choice(self.indices, self.batch_size, replace=True, p=self.sampling_probabilities)).

  2. Instead of setting self.rs at init of the DataLoader instance, call np.random.seed(seed).
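
A minimal sketch of option 1, assuming a simplified loader (ReproducibleLoader is a hypothetical stand-in for illustration, not the actual batchgenerators DataLoader):

```python
import numpy as np

class ReproducibleLoader:
    """Hypothetical, simplified stand-in for batchgenerators' DataLoader,
    only to illustrate option 1."""

    def __init__(self, data, batch_size, seed_for_shuffle=None,
                 sampling_probabilities=None):
        self.indices = np.arange(len(data))
        self.batch_size = batch_size
        self.sampling_probabilities = sampling_probabilities
        # dedicated stream, isolated from the global np.random state
        self.rs = np.random.RandomState(seed_for_shuffle)

    def get_indices(self):
        # option 1: self.rs.choice instead of np.random.choice, so other
        # np.random consumers cannot shift the index sequence
        return self.rs.choice(self.indices, self.batch_size,
                              replace=True, p=self.sampling_probabilities)
```

Two loaders built with the same seed_for_shuffle then yield identical index sequences, no matter how many unrelated np.random calls happen in between.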

Thanks for your time!

@CarlosNacher
Author

UPDATE:

To ensure reproducibility even when infinite=True, the second option must also be applied: with the first one alone, transformations that draw from np.random are still not reproducible. But the second option alone is not enough either, because as soon as other lines of code consume the global np.random stream, the loader's reproducibility is lost. For example, if the transformation we pass to the SingleThreadedAugmenter/MultiThreadedAugmenter is MirrorTransform, then with axes=(0,) versus axes=(0, 1) we will not get the same data back from the DataLoader after doing only np.random.seed(seed), because inside MirrorTransform methods involving np.random are executed once and twice per sample, respectively. I have observed exactly this case; a toy reproduction follows below.
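
A toy reproduction of this effect (hypothetical code, not batchgenerators itself): a transform that burns a different number of global draws per sample shifts every subsequent np.random.choice, even though the seed was identical:

```python
import numpy as np

def simulate(draws_per_sample):
    """Simulate option 2 alone: seed the global stream, let a transform
    consume some draws, then sample indices like the DataLoader does."""
    np.random.seed(42)                   # option 2 alone
    for _ in range(draws_per_sample):    # e.g. 1 draw for axes=(0,),
        np.random.uniform()              # 2 draws for axes=(0, 1)
    return np.random.choice(100, 4)      # the loader's index draw

print(simulate(1))  # indices with a 1-draw transform
print(simulate(2))  # a different draw: the stream position has shifted
```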

Moreover, both options must be applied together to ensure full reproducibility. This way, between two trainings where we only want to change, for example, the optimizer, the same random transformations are applied in both runs (see the combined sketch below).
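
A sketch of applying both fixes together, reusing the hypothetical ReproducibleLoader from the first sketch:

```python
import numpy as np

SEED = 1234

# Option 2: seed the global stream so transforms that draw from
# np.random (e.g. MirrorTransform) behave the same across trainings.
np.random.seed(SEED)

# Option 1: the loader keeps its own RandomState, so its index draws
# cannot be shifted by however many global draws the transforms consume.
loader = ReproducibleLoader(data=list(range(100)), batch_size=4,
                            seed_for_shuffle=SEED)

print(loader.get_indices())  # identical across runs, regardless of how
print(loader.get_indices())  # the transforms consume the global stream
```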

Best regards,
Nácher.
