Enable multiple tokenization schemes #163

Closed
valedan opened this issue Apr 12, 2023 · 1 comment
valedan commented Apr 12, 2023

We will likely need other tokenization schemes in the future, but right now we use a single scheme that is hardcoded in many places. We need to think through how to design this feature.
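One way to stop hardcoding the scheme is to put it behind a small interface plus an enum of named modes, so call sites depend on the interface rather than on a vocabulary scattered through the repo. The sketch below is purely illustrative; all names are hypothetical and this is not necessarily the design that was eventually adopted:

```python
from abc import ABC, abstractmethod
from enum import Enum


class TokenizationMode(Enum):
    """Registry of named schemes -- these scheme names are made up."""
    AOTP_UNIFORM = "aotp_uniform"
    COORD_TUPLES = "coord_tuples"


class MazeTokenizerBase(ABC):
    """Single place a tokenization scheme is defined; call sites depend
    on this interface instead of a hardcoded token vocabulary."""

    @abstractmethod
    def maze_to_tokens(self, maze) -> list[str]:
        """Serialize a maze (and its solution) into a token sequence."""

    @abstractmethod
    def tokens_to_maze(self, tokens: list[str]):
        """Recover the maze from a token sequence."""
```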

mivanit added a commit that referenced this issue Aug 6, 2023
Refactor to be compatible with `maze-dataset` versions `0.2.1` and onwards. 

See PRs:
- [`maze_dataset` PR #5](understanding-search/maze-dataset#5)
- [`maze_dataset` PR #6](understanding-search/maze-dataset#6)

See related issues:
- #164 
- #163 
- #77 

These changes also revert the changes from #118, to keep the special tokens consistent, with underscores appearing only once in each token.

# commit history:

* test_cfg_post_init working

* migrated SPECIAL_TOKENS usage

* wip

* wip

* wip, all but 3 in tok tests passing

* test_tokenizers passing

* unit tests passing (but need to update maze_dataset dep)

* poetry lock

* format

* remove deprecated kwarg to process_weights_

Upgrading transformer_lens to 1.4.0 caused
`HookedTransformer.process_weights_()` to no longer accept
the keyword arg `move_state_dict_to_device`.

However, I'm not sure this behavior was important in the first place.
If any issues come up, move the state dict to the device manually in
`ZanjHookedTransformer._load_state_dict_wrapper()`, where all of this
was previously happening (see the sketch after the commit history).

* fixed MazeTokenizer not being passed to as_tokens() in some spots

* updated changed dataset config key

since we removed tokenizer stuff from the dataset

* fixed eval_model nb, added ZanjHookedTransformer.config ref

the `eval_model.ipynb` notebook has a function `testdata_plot_predicted_path`
which used `model.zanj_model_config` to get the tokenizer; that attribute
is missing from the `RandomBaseline` class, since it only inherits from `HookedTransformer`

to fix this:

- `ZanjHookedTransformer` now has a `config` property which simply
  accesses the `zanj_model_config` used by the parent `ConfiguredModel`
  (see the sketch after the commit history)
- `testdata_plot_predicted_path` now uses `model.config` everywhere

* lock after update maze-dataset to 0.2.1

* fixed minor import issue

* update configs refs in train_model notebook

* lock poetry, re-run notebook

* format

* update coverage
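For the `process_weights_` change above: if the removed `move_state_dict_to_device` behavior ever turns out to matter, moving the state dict manually would look roughly like this. A minimal sketch in plain PyTorch; the helper name is made up, and the real home for this logic would be `ZanjHookedTransformer._load_state_dict_wrapper()`:

```python
import torch


def load_state_dict_on_device(
    model: torch.nn.Module, state_dict: dict[str, torch.Tensor]
) -> None:
    """Stand-in for the removed move_state_dict_to_device kwarg: move
    every tensor onto the model's device, then load as usual."""
    device = next(model.parameters()).device
    model.load_state_dict({k: v.to(device) for k, v in state_dict.items()})
```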
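The `config` property described in the last commit amounts to a one-line alias. A self-contained sketch, assuming `zanj_model_config` is a plain attribute set by the parent class (the real class also inherits from `ConfiguredModel` and `HookedTransformer`):

```python
class ZanjHookedTransformer:
    def __init__(self, zanj_model_config):
        # in the real class this is set up by the parent ConfiguredModel
        self.zanj_model_config = zanj_model_config

    @property
    def config(self):
        # alias: expose zanj_model_config under the generic name
        # that testdata_plot_predicted_path now uses everywhere
        return self.zanj_model_config
```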
mivanit commented Aug 6, 2023

added in #191; see the linked maze-dataset PRs there for more details

mivanit closed this as completed Aug 6, 2023