Adding configs related to DCLM #663
base: fineweb_data
Conversation
LGTM. Can you run pre-commit checks to fix the pre-commit issue? I think it is just some formatting issue.
src/levanter/models/llama.py
Outdated
@@ -64,6 +65,7 @@ class LlamaConfig(HFCompatConfig):
    activation_function: str = "silu"
    initializer_range: float = 0.02
    layer_norm_epsilon: float = 1e-5
    z_loss_weight: float = 0.0
i would rather this not be a property of the model's config but an option on TrainLmConfig, and define a loss function in train_lm and pass it into trainer
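A rough sketch of what that could look like, purely to illustrate the suggestion; the `z_loss_weight` field on `TrainLmConfig`, `make_loss_function`, and the `model.logits_for` / `example.targets` accessors below are assumptions, not Levanter's actual API:

```python
# Illustrative sketch only; names are assumptions about the suggested design.
from dataclasses import dataclass

import jax.numpy as jnp
from jax.scipy.special import logsumexp


@dataclass
class TrainLmConfig:
    # ... existing trainer/model/data options ...
    z_loss_weight: float = 0.0  # assumed option name; 0.0 keeps plain cross-entropy


def make_loss_function(config: TrainLmConfig):
    """Defined in train_lm and handed to the trainer, so the model config stays loss-agnostic."""

    def loss_fn(model, example):
        logits = model.logits_for(example)  # hypothetical accessor: [batch, pos, vocab]
        logz = logsumexp(logits, axis=-1)   # per-position log partition function
        target_logit = jnp.take_along_axis(
            logits, example.targets[..., None], axis=-1
        ).squeeze(-1)
        loss = jnp.mean(logz - target_logit)  # next-token cross-entropy
        if config.z_loss_weight > 0:
            # auxiliary z-loss (PaLM-style): penalize large log-partition values
            loss = loss + config.z_loss_weight * jnp.mean(logz**2)
        return loss

    return loss_fn
```

The point of the factory is that `z_loss_weight` never touches `LlamaConfig`; the trainer just receives a closure.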
src/levanter/models/lm_model.py
Outdated
loss = cross_entropy_loss(
    logits, self.Vocab, target_y, reduction, reduction_axis=reduction_axis, where=example.loss_mask
)
if hasattr(self.config, "z_loss_weight") and self.config.z_loss_weight > 0:
really don't like using this here. much cleaner to just pull out the loss function
src/levanter/main/train_lm.py
Outdated
from levanter.utils.jax_utils import parameter_count


logger = logging.getLogger(__name__)


class ModuleComputeZLoss(ComputeLossFunction[M, X]):
i still don't like this but I think I can't really articulate what I want. i'm gonna push a change to my fork
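For readers following the thread, below is a hedged reconstruction of the general shape being debated: a loss object the trainer holds, rather than a closure built in `train_lm`. It assumes `ComputeLossFunction` boils down to "callable(model, example) -> scalar loss" and reuses the placeholder accessors from the earlier sketch; it is not the PR's actual implementation.

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp


class ModuleComputeZLoss:  # the PR parameterizes this as ComputeLossFunction[M, X]
    """Wraps a base loss callable and adds the auxiliary z-loss term."""

    def __init__(self, base_loss, z_loss_weight: float):
        self.base_loss = base_loss          # e.g. the model's plain cross-entropy loss
        self.z_loss_weight = z_loss_weight

    def __call__(self, model, example, **kwargs):
        loss = self.base_loss(model, example, **kwargs)
        logits = model.logits_for(example)  # placeholder accessor, as in the earlier sketch
        logz = logsumexp(logits, axis=-1)
        return loss + self.z_loss_weight * jnp.mean(logz**2)
```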
* refactor queued-resources
* fix multislice
* add auto tear down
* reuse docker image
* tiny fix
* switch to concurrent executor for parallel subprocesses & small fix & logs
* Add llama 1b with fineweb txt
* replace with 50 fineweb urls
* wip
* revert many of the changes, which seems to fix the crashing
* revert many of the changes, which seems to fix the crashing
* remove now-unused option
* cleanup
* cleanup
* sigh
* Adding changes for dclm
---------
Co-authored-by: Ivan Zhou <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
Bumps [ray[default]](https://github.com/ray-project/ray) from 2.32.0 to 2.34.0.
- [Release notes](https://github.com/ray-project/ray/releases)
- [Commits](ray-project/ray@ray-2.32.0...ray-2.34.0)
---
updated-dependencies:
- dependency-name: ray[default]
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* wandb seems to be broken in latest release
* oops
* what?
* add mounting dir
* minor fix
* support abs and rel path
* add docs
* refactor to extra context
* minor fix docs
* minor fix
* modify docs
1. num_tpus=1 is actually a bad idea because Ray will mask out the other TPUs
2. force non-docker workloads to run in a separate process for stability
…ards again, (re)add configuration metadata to cache (#752) Co-authored-by: Ahmed Ahmed <[email protected]>
Pulls in the New Mixture Features Into Audio Space! Tested that this fixes the previous epoching errors in the whisper_tiny config.
Co-authored-by: David Hall <[email protected]>
…g batches instead of a ray actor/task (#757) About a 5x speedup. Memory usage isn't super well controlled in mixtures and that needs some work
… head node, add code to change max size of actor pool
#762) This is marginally slower, but pile now builds fine on a v4-32, which is an improvement.
This PR creates a `ParquetDataSource` class to support loading `.parquet` files. Closes #763
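A minimal sketch of what such a data source might do, assuming it reads shards with pyarrow and fsspec; the class and method names below follow the general shape of a sharded data source and are illustrative, not the merged implementation:

```python
import fsspec
import pyarrow.parquet as pq


class ParquetDataSource:
    """Iterates rows from a set of .parquet shards (local paths or fsspec URLs)."""

    def __init__(self, urls, columns=None):
        self.urls = urls
        self.columns = columns  # e.g. ["text"] to read only the text column

    def iter_shard(self, url, batch_size=1024):
        with fsspec.open(url, "rb") as f:
            pf = pq.ParquetFile(f)
            for batch in pf.iter_batches(batch_size=batch_size, columns=self.columns):
                # each batch is a pyarrow RecordBatch; yield plain dicts, one per row
                yield from batch.to_pylist()
```

Reading record batches keeps memory bounded per shard instead of materializing whole files.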
DCLM 7B related configs