Adding configs related to DCLM #663
Open
abhinavg4 wants to merge 101 commits into fineweb_data from dclm
Changes shown are from 6 of the 101 commits.
Commits
4aa2f2c  Adding configs related to DCLM (abhinavg4)
dde9ed0  Adding configs related to DCLM (abhinavg4)
b991e29  Adding Z loss (abhinavg4)
bb674bb  pre commit changes (abhinavg4)
6c99dfb  Adding z_loss as part of train_lm.py (abhinavg4)
24469e7  Reverting changes to llama.py for z_loss (abhinavg4)
76092c4  Address capacity_type and env variables (#665) (Ivan-Zhou)
2e55856  fix best effort test (#662) (dlwh)
2e64f14  Enable multislice in launch script (#666) (blahBlahhhJ)
4950a8e  Fineweb Text + Partial revert of kiloshard (#669) (dlwh)
c17f653  log run_progress for a special x axis. Fixes #671 (#674) (dlwh)
ac0882d  refactor trainer to always need a loss function, add z_loss (#672) (dlwh)
cb3638e  Specify node_count as int in launch.py (#682) (Ivan-Zhou)
8111f29  Bump ray[default] from 2.32.0 to 2.34.0 (#683) (dependabot[bot])
04b0904  wandb seems to be broken in latest release (#688) (dlwh)
8c10a7a  switch to setup tools and forget the config thing (#691) (dlwh)
e8b6003  set logging level to INFO (dlwh)
441af5c  update docker image, build it in ci, make the args point to the new v… (dlwh)
ef6349c  RE-Allow adding extrenal directory to docker image (#695) (blahBlahhhJ)
e12c1b6  Merge remote-tracking branch 'origin/dclm' into dclm (dlwh)
c9ebc88  match specs in dclm (dlwh)
7727696  publish dev build (dlwh)
55e4d98  wip (dlwh)
de51236  fix imports and such (dlwh)
7863989  get default zone from gcloud config (dlwh)
a550bb5  factor out docker command, build (dlwh)
6341252  Merge remote-tracking branch 'origin/main' into dclm (dlwh)
715a04a  Update beta2=0.95 (#701) (dlwh)
c0ae0f9  publish full tpu image (#703) (dlwh)
ca7c9a6  fix incremental build on CI (#704) (dlwh)
d16482b  sigh (dlwh)
c823c75  grr (#705) (dlwh)
7ec7bb5  Adding multiple configs (#685) (abhinavg4)
20faff3  Expose infra as a package, publish dev builds (#696) (dlwh)
5c53a19  Llama mixture (#706) (abhinavg4)
277e728  Fix base again (#707) (dlwh)
0c628d5  Fix tpu vm autoshutdown (#708) (dlwh)
97358f9  suppress stderr in describe_tpu since it usually logs a dumb error (#… (dlwh)
e9ca517  Merge remote-tracking branch 'origin/main' into dclm (dlwh)
d674dd9  wip (dlwh)
4913df2  fix pyprojec.toml and pre-commit wandb issues (#712) (dlwh)
06dc304  wip (dlwh)
ffa8e28  fix device kind for mfu v5e (#713) (dlwh)
fd7888d  add haps configuration (cycle lr schedule) (#709) (blahBlahhhJ)
8dd32c6  Bump ray[default] from 2.34.0 to 2.35.0 (#714) (dependabot[bot])
ea4ea25  use hf config from checkpoint by default (#715) (dlwh)
fbe27bc  Completely rework dataset/cache system: instant resume, perfect shuff… (dlwh)
944a19f  unpin ray (#718) (dlwh)
f13cfde  bump equinox (dlwh)
8d3dfe0  wip (dlwh)
8ecb7ea  768 (dlwh)
9ba6b20  Update gcsfs requirement from <2024.7,>=2024.2 to >=2024.2,<2024.10 (… (dependabot[bot])
78da902  Update fsspec[http] requirement (#722) (dependabot[bot])
5b685c3  Bump equinox from 0.11.4 to 0.11.5 (#721) (dependabot[bot])
a91ef81  fix extra context docker build bug (#724) (blahBlahhhJ)
5c18557  Fix eqx (#725) (dlwh)
b6f334e  get rid of eraconfig b/c draccus can't handle it (dlwh)
e33a905  ugh (dlwh)
2645efb  missed some prints (dlwh)
5fc4084  attempt at launching small fast in CI, add tqdm_loggable (#719) (dlwh)
d05036c  Update datasets requirement from ~=2.18 to >=2.18,<4.0 (#732) (dependabot[bot])
ca16aa0  Bump tensorstore from 0.1.64 to 0.1.65 (#731) (dependabot[bot])
79fa64c  Bump equinox from 0.11.3 to 0.11.6 (#730) (dependabot[bot])
07b3f16  add bits-per-byte calculation to levanter (#729) (dlwh)
fe3e2f3  fix sequence parallel attention in splash attention (#738) (dlwh)
9fa3aaa  fix llama 3 rotary embeddings (#740) (dlwh)
2b42bfb  Support for running in a Ray cluster (#737) (dlwh)
8ad3074  see if it's this file in particular (#742) (dlwh)
541ff12  Update README.md (#656) (devactivity-team)
91be677  bump levanter version (#743) (dlwh)
cd82fb3  Make new tokenization ~67% faster (#744) (dlwh)
43268e0  Adding supervised data config (TheQuantumFractal)
d6ad71f  Fixing linter error (TheQuantumFractal)
71bd696  Tweaks to Ray TPU stuff (#747) (dlwh)
f5b32cd  Fixing supervised training (TheQuantumFractal)
6483b42  Making linter happy (TheQuantumFractal)
45d41d8  Making linter happy (TheQuantumFractal)
b41838f  Simplify tokenization pipeline, make it work with large numbers of sh… (dlwh)
3bae9d3  allow mixture components to override cache_dir (#754) (dlwh)
9847728  a few final tweaks for marin runs (#755) (dlwh)
36b29fd  Update Audio Data Loader to Support Mixture Dataset (#758) (Helw150)
5370c72  Update src/levanter/data/text.py (ahmeda14960)
1063fd8  Merge remote-tracking branch 'origin/main' into ksalahi/supervised-data (ahmeda14960)
2f625d3  address david's comments (ahmeda14960)
cf2c9e5  lint and minor (ahmeda14960)
8bed0aa  Adding supervised data config (#746) (ahmeda14960)
adf4b6d  Add an actor pool for batch processing, switch to a thread for writin… (dlwh)
36459da  pre-commit (dlwh)
6499656  flaky hf (dlwh)
074477f  Fix actor pool in python 3.11, add better scaling down logic (#760) (dlwh)
1c0e10e  Fix ray docs (#761) (blahBlahhhJ)
51f9bf1  ensure everything always uses at least some CPU to avoid flooding ray… (dlwh)
c3b3dd8  cap the size of the core writer task rather than the number of batche… (dlwh)
52bff4f  add parquet support (nikil-ravi)
af78281  lint, shard name fix (nikil-ravi)
8d09cfd  pre-commit (nikil-ravi)
50715e9  read as binary file (nikil-ravi)
3fe8995  simplify test (nikil-ravi)
fc26c74  Support Parquet files in ShardedDataSource (#764) (nikil-ravi)
02f34ac  fix crash in data loader caused by using stale array (#765) (dlwh)
0ea3eb4  Merge remote-tracking branch 'origin/main' into dclm (dlwh)
@@ -0,0 +1,74 @@
cache_dir: "gs://marin-data/tokenized/dclm/gpt_neo_tokenizer"
tokenizer: "EleutherAI/gpt-neox-20b"
stop_strategy: restart
configs:
  "dclm":
    train_urls:
      - gs://marin-data/datacomp/dclm-baseline-dedup-07-09/*/*/*.jsonl.zstd
  # these are just for eval
  "paloma/4chan":
    validation_urls:
      - gs://levanter-data/paloma/4chan_meta_sep/val/val*.jsonl.gz
  "paloma/c4_100_domains":
    validation_urls:
      - gs://levanter-data/paloma/c4_100_domains/val/val*.jsonl.gz
  "paloma/c4_en":
    validation_urls:
      - gs://levanter-data/paloma/c4_en/val/val*.jsonl.gz
  "paloma/dolma-v1_5":
    validation_urls:
      - gs://levanter-data/paloma/dolma-v1_5/val/val*.jsonl.gz
  "paloma/dolma_100_programing_languages":
    validation_urls:
      - gs://levanter-data/paloma/dolma_100_programing_languages/val/val*.jsonl.gz
  "paloma/dolma_100_subreddits":
    validation_urls:
      - gs://levanter-data/paloma/dolma_100_subreddits/val/val*.jsonl.gz
  "paloma/falcon-refinedweb":
    validation_urls:
      - gs://levanter-data/paloma/falcon-refinedweb/val/val*.jsonl.gz
  "paloma/gab":
    validation_urls:
      - gs://levanter-data/paloma/gab/val/val*.jsonl.gz
  "paloma/m2d2_s2orc_unsplit":
    validation_urls:
      - gs://levanter-data/paloma/m2d2_s2orc_unsplit/val/val*.jsonl.gz
  "paloma/m2d2_wikipedia_unsplit":
    validation_urls:
      - gs://levanter-data/paloma/m2d2_wikipedia_unsplit/val/val*.jsonl.gz
  "paloma/manosphere_meta_sep":
    validation_urls:
      - gs://levanter-data/paloma/manosphere_meta_sep/val/val*.jsonl.gz
  "paloma/mc4":
    validation_urls:
      - gs://levanter-data/paloma/mc4/val/val*.jsonl.gz
  "paloma/ptb":
    validation_urls:
      - gs://levanter-data/paloma/ptb/val/val*.jsonl.gz
  "paloma/redpajama":
    validation_urls:
      - gs://levanter-data/paloma/redpajama/val/val*.jsonl.gz
  "paloma/twitterAAE_HELM_fixed":
    validation_urls:
      - gs://levanter-data/paloma/twitterAAE_HELM_fixed/val/val*.jsonl.gz
  "paloma/wikitext_103":
    validation_urls:
      - gs://levanter-data/paloma/wikitext_103/val/val*.jsonl.gz
train_weights:
  dclm: 1.0
  paloma/4chan: 0.0
  paloma/c4_100_domains: 0.0
  paloma/c4_en: 0.0
  paloma/dolma-v1_5: 0.0
  paloma/dolma_100_programing_languages: 0.0
  paloma/dolma_100_subreddits: 0.0
  paloma/falcon-refinedweb: 0.0
  paloma/gab: 0.0
  paloma/m2d2_s2orc_unsplit: 0.0
  paloma/m2d2_wikipedia_unsplit: 0.0
  paloma/manosphere_meta_sep: 0.0
  paloma/mc4: 0.0
  paloma/ptb: 0.0
  paloma/redpajama: 0.0
  paloma/twitterAAE_HELM_fixed: 0.0
  paloma/wikitext_103: 0.0
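
In this mixture config, train_weights controls how training batches are sampled across the named sources: dclm at 1.0 supplies all of the training data, while the Paloma entries carry weight 0.0 and are used only as validation sets. A minimal, hypothetical sketch of how a second training source would be weighted in (the fineweb name and URL below are illustrative and not part of this PR):

configs:
  "dclm":
    train_urls:
      - gs://marin-data/datacomp/dclm-baseline-dedup-07-09/*/*/*.jsonl.zstd
  "fineweb":  # hypothetical extra source, not part of this PR
    train_urls:
      - gs://example-bucket/fineweb/*.jsonl.gz  # placeholder URL
train_weights:
  dclm: 0.7      # roughly 70% of training batches drawn from DCLM
  fineweb: 0.3   # roughly 30% from the hypothetical second source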
@@ -0,0 +1,30 @@
data: !include data/dclm_gpt_neo.yaml
model:  # 1B class model
  type: llama
  seq_len: 2048
  hidden_dim: 2048
  intermediate_dim: 8192
  num_layers: 24
  num_heads: 16
  num_kv_heads: 16
  use_flash_attention: True
  flash_attention_block_size: 1024
trainer:
  tracker:
    type: wandb
    project: "marin"
    tags: ["llama", "fineweb", "markdown"]

  mp: p=f32,c=bfloat16
  train_batch_size: 256  # 2048 * 2048 = 4,194,304
  num_train_steps: 71526  # 300,000,000,000 / 4,194,304 = 71,526
  steps_per_eval: 1000
  tensor_parallel_axes: ["mlp", "heads"]
  fsdp_axis: "embed"
  batch_axis: "batch"
optimizer:
  learning_rate: 3E-3
  weight_decay: 0.033
  min_lr_ratio: 0.1
  warmup: 5000
  cooldown: 3E-5
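
The step-count comments above follow the usual token-budget arithmetic: tokens per step = train_batch_size * seq_len, and num_train_steps = token budget / tokens per step. As written, the comments assume steps of 2048 sequences of 2048 tokens each:

2048 * 2048 = 4,194,304 tokens per step
300,000,000,000 / 4,194,304 ≈ 71,526 steps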
@@ -0,0 +1,29 @@
data: !include data/dclm_gpt_neo.yaml
model:  # 7B class model
  type: llama
  seq_len: 2048
  hidden_dim: 4096
  intermediate_dim: 11008
  num_layers: 32
  num_heads: 32
  num_kv_heads: 32
  use_flash_attention: True
  flash_attention_block_size: 1024
trainer:
  tracker:
    type: wandb
    project: "marin"
    tags: ["dclm", "7B", "llama"]

  mp: p=f32,c=bfloat16
  train_batch_size: 2048
  num_train_steps: 750000  # 3,000,000,000,000 / 4,000,000 = 750,000
  steps_per_eval: 1000
  tensor_parallel_axes: ["mlp", "heads"]
  fsdp_axis: "embed"
  batch_axis: "batch"
optimizer:
  learning_rate: 4E-4
  weight_decay: 0.1
  min_lr_ratio: 0.1
  warmup: 0.01
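
The 7B config's step count uses the same token-budget arithmetic, with the per-step token count (2048 * 2048 = 4,194,304) approximated as 4,000,000:

3,000,000,000,000 / 4,000,000 = 750,000 steps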
i still don't like this but I think I can't really articulate what I want. i'm gonna push a change to my fork