Proposal: reproducibility in DataLoader #105

Open
CarlosNacher opened this issue Aug 10, 2022 · 1 comment

@CarlosNacher

Hi,

In the DataLoader class, there is an argument seed_for_shuffle which controls reproducibility when using DataLoader with infinite=False and shuffle=True (between two different training runs, the data feeding the network will always be the same). But why not do the same when infinite=True? Even if batch generation is infinite, you may want reproducibility between two different training runs.

So, there are two alternatives (a sketch of the first one follows the list):

  1. When doing return np.random.choice(self.indices, self.batch_size, replace=True, p=self.sampling_probabilities) in line 118 of DataLoader, do it using self.rs instead (i.e. return self.rs.choice(self.indices, self.batch_size, replace=True, p=self.sampling_probabilities)).

  2. Instead of setting self.rs at init of the DataLoader instance, call np.random.seed(seed).
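
A minimal sketch of option 1, assuming a simplified loader (ReproducibleLoader is a hypothetical stand-in for illustration, not the actual batchgenerators DataLoader):

```python
import numpy as np

class ReproducibleLoader:
    """Hypothetical, simplified stand-in for batchgenerators' DataLoader,
    only to illustrate option 1."""

    def __init__(self, data, batch_size, seed_for_shuffle=None,
                 sampling_probabilities=None):
        self.indices = np.arange(len(data))
        self.batch_size = batch_size
        self.sampling_probabilities = sampling_probabilities
        # dedicated stream, isolated from the global np.random state
        self.rs = np.random.RandomState(seed_for_shuffle)

    def get_indices(self):
        # option 1: self.rs.choice instead of np.random.choice, so other
        # np.random consumers cannot shift the index sequence
        return self.rs.choice(self.indices, self.batch_size,
                              replace=True, p=self.sampling_probabilities)
```

Two loaders built with the same seed_for_shuffle then yield identical index sequences, no matter how many unrelated np.random calls happen in between.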

Thanks for your time!

@CarlosNacher
Author

UPDATE:

To ensure reproducibility even when infinite=True, the second option must also be applied: with the first one alone, transformations that draw from np.random are still not reproducible. But the second option alone is not enough either, because as soon as other lines of code consume the global np.random stream, the loader's reproducibility is lost. For example, if the transformation we pass to the SingleThreadedAugmenter/MultiThreadedAugmenter is MirrorTransform, then with axes=(0,) versus axes=(0, 1) we will not get the same data back from the DataLoader after doing only np.random.seed(seed), because inside MirrorTransform methods involving np.random are executed once and twice per sample, respectively. I have observed exactly this case; a toy reproduction follows below.
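
A toy reproduction of this effect (hypothetical code, not batchgenerators itself): a transform that burns a different number of global draws per sample shifts every subsequent np.random.choice, even though the seed was identical:

```python
import numpy as np

def simulate(draws_per_sample):
    """Simulate option 2 alone: seed the global stream, let a transform
    consume some draws, then sample indices like the DataLoader does."""
    np.random.seed(42)                   # option 2 alone
    for _ in range(draws_per_sample):    # e.g. 1 draw for axes=(0,),
        np.random.uniform()              # 2 draws for axes=(0, 1)
    return np.random.choice(100, 4)      # the loader's index draw

print(simulate(1))  # indices with a 1-draw transform
print(simulate(2))  # a different draw: the stream position has shifted
```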

Moreover, both options must be applied together to ensure full reproducibility. This way, between two trainings where we only want to change, for example, the optimizer, the same random transformations are applied in both runs (see the combined sketch below).
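
A sketch of applying both fixes together, reusing the hypothetical ReproducibleLoader from the first sketch:

```python
import numpy as np

SEED = 1234

# Option 2: seed the global stream so transforms that draw from
# np.random (e.g. MirrorTransform) behave the same across trainings.
np.random.seed(SEED)

# Option 1: the loader keeps its own RandomState, so its index draws
# cannot be shifted by however many global draws the transforms consume.
loader = ReproducibleLoader(data=list(range(100)), batch_size=4,
                            seed_for_shuffle=SEED)

print(loader.get_indices())  # identical across runs, regardless of how
print(loader.get_indices())  # the transforms consume the global stream
```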

Best regards,
Nácher.
