sigh
dlwh committed Oct 4, 2024
1 parent 04ce944 commit 7ad4092
Showing 3 changed files with 6 additions and 4 deletions.
5 changes: 3 additions & 2 deletions config/data/dclm_gpt_neo.yaml
@@ -1,7 +1,8 @@
-cache_dir: "gs://marin-us-central2/scratch/dlwh/tokenized/gpt_neox/"
+cache_dir: "gs://marin-us-central2/tokenized/gpt_neox/"
tokenizer: "EleutherAI/gpt-neox-20b"
cache_options:
-batch_size: 4096
+batch_size: 256
num_shard_groups: 1024
stop_strategy: restart
shuffle: 100000
configs:
2 changes: 1 addition & 1 deletion config/data/dolma_olmo_paloma.yaml
@@ -135,4 +135,4 @@ train_weights:
paloma/ptb: 0.0
paloma/redpajama: 0.0
paloma/twitterAAE_HELM_fixed: 0.0
-paloma/wikitext_103: 0.0
+paloma/wikitext_103: 0.0
3 changes: 2 additions & 1 deletion src/levanter/store/cache.py
@@ -1156,7 +1156,8 @@ def generator():
generator_fns = [_make_generator_fn(group) for group in groups]

readers = [
-RayPrefetchQueue(fn, 128, producer_options=dict(name=name)) for name, fn in zip(group_names, generator_fns)
+RayPrefetchQueue(fn, 128, producer_options=dict(name=name, scheduling_strategy="SPREAD"))
+for name, fn in zip(group_names, generator_fns)
]

# then figure out the first shard to start from. This is the first unfinished shard with the minimum number of rows