Update on "enable data loading for data parallel training"
Tested that data loading now has the expected behavior:
- different dp ranks get different data
- different tp ranks within the same dp rank get the same data 


[ghstack-poisoned]
tianyu-l committed Feb 8, 2024
1 parent b920b23 commit f8a6e76
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions torchtrain/datasets/alpaca.py
@@ -61,6 +61,8 @@ def __iter__(self):

         for idx, sample in enumerate(self.data_iterator):
             # select samples to pack in a round-robin fashion
+            # TODO: This is a temporary solution for small datasets like Alpaca.
+            # For larger datasets we need to use a more scalable approach.
             if idx % self.world_size != self.rank:
                 continue
             sample_text = sample["text"]
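
The check added around this hunk is what shards the stream across data-parallel ranks: each rank keeps only every `world_size`-th sample, so different dp ranks read disjoint data, while any ranks constructed with the same dp rank (e.g. tp ranks within one dp group) replay the same stream. The following is a minimal standalone sketch of that round-robin selection, not the torchtrain implementation; the class name `ToyAlpacaIterable` and the plain in-memory sample list are illustrative assumptions.

    # Sketch of round-robin sharding by data-parallel rank (assumptions noted above).
    from torch.utils.data import IterableDataset


    class ToyAlpacaIterable(IterableDataset):
        def __init__(self, samples, rank: int, world_size: int):
            self.samples = samples          # stand-in for the real data iterator
            self.rank = rank                # data-parallel rank of this worker
            self.world_size = world_size    # number of data-parallel ranks

        def __iter__(self):
            for idx, sample in enumerate(self.samples):
                # keep every world_size-th sample, offset by this rank
                if idx % self.world_size != self.rank:
                    continue
                yield sample["text"]


    if __name__ == "__main__":
        data = [{"text": f"sample {i}"} for i in range(8)]
        # dp rank 0 of 2 sees samples 0, 2, 4, 6; dp rank 1 sees 1, 3, 5, 7
        print(list(ToyAlpacaIterable(data, rank=0, world_size=2)))
        print(list(ToyAlpacaIterable(data, rank=1, world_size=2)))

As the added TODO notes, modulo-skipping the iterator works for a small dataset like Alpaca but still iterates over every sample on every rank, which is why a more scalable sharding approach is planned for larger datasets.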
