Enable multiple tokenization schemes #163

Closed
valedan opened this issue Apr 12, 2023 · 1 comment
valedan commented Apr 12, 2023

We will likely need other tokenization schemes in the future, but right now we use a single scheme that is hardcoded in many places. We need to think through how to design this feature.
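One way to stop hardcoding the scheme is to put it behind a small interface plus an enum of named modes, so call sites depend on the interface rather than on a vocabulary scattered through the repo. The sketch below is purely illustrative; all names are hypothetical and this is not necessarily the design that was eventually adopted:

```python
from abc import ABC, abstractmethod
from enum import Enum


class TokenizationMode(Enum):
    """Registry of named schemes -- these scheme names are made up."""
    AOTP_UNIFORM = "aotp_uniform"
    COORD_TUPLES = "coord_tuples"


class MazeTokenizerBase(ABC):
    """Single place a tokenization scheme is defined; call sites depend
    on this interface instead of a hardcoded token vocabulary."""

    @abstractmethod
    def maze_to_tokens(self, maze) -> list[str]:
        """Serialize a maze (and its solution) into a token sequence."""

    @abstractmethod
    def tokens_to_maze(self, tokens: list[str]):
        """Recover the maze from a token sequence."""
```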

mivanit added a commit that referenced this issue Aug 6, 2023
Refactor to be compatible with `maze-dataset` versions `0.2.1` and onwards. 

See PRs:
- [`maze_dataset` PR #5](understanding-search/maze-dataset#5)
- [`maze_dataset` PR #6](understanding-search/maze-dataset#6)

See related issues:
- #164 
- #163 
- #77 

These changes also revert the changes from #118, to keep the special tokens consistent, with underscores appearing only once in each token.

# commit history:

* test_cfg_post_init working

* migrated SPECIAL_TOKENS usage

* wip

* wip

* wip, all but 3 in tok tests passing

* test_tokenizers passing

* unit tests passing (but need to update maze_dataset dep)

* poetry lock

* format

* remove deprecated kwarg to process_weights_

Upgrading transformer_lens to 1.4.0 caused
`HookedTransformer.process_weights_()` to no longer accept
the keyword arg `move_state_dict_to_device`.

However, I'm not sure this behavior was important in the first place.
If any issues come up, move the state dict to the device manually in
`ZanjHookedTransformer._load_state_dict_wrapper()`, where all of this
was previously happening (see the sketch after the commit history).

* fixed MazeTokenizer not being passed to as_tokens() in some spots

* updated changed dataset config key

since we removed tokenizer stuff from the dataset

* fixed eval_model nb, added ZanjHookedTransformer.config ref

the `eval_model.ipynb` notebook has a function `testdata_plot_predicted_path`
which used `model.zanj_model_config` to get the tokenizer; that attribute
is missing from the `RandomBaseline` class, since it only inherits from `HookedTransformer`

to fix this:

- `ZanjHookedTransformer` now has a `config` property which simply
  accesses the `zanj_model_config` used by the parent `ConfiguredModel`
  (see the sketch after the commit history)
- `testdata_plot_predicted_path` now uses `model.config` everywhere

* lock after update maze-dataset to 0.2.1

* fixed minor import issue

* update configs refs in train_model notebook

* lock poetry, re-run notebook

* format

* update coverage
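For the `process_weights_` change above: if the removed `move_state_dict_to_device` behavior ever turns out to matter, moving the state dict manually would look roughly like this. A minimal sketch in plain PyTorch; the helper name is made up, and the real home for this logic would be `ZanjHookedTransformer._load_state_dict_wrapper()`:

```python
import torch


def load_state_dict_on_device(
    model: torch.nn.Module, state_dict: dict[str, torch.Tensor]
) -> None:
    """Stand-in for the removed move_state_dict_to_device kwarg: move
    every tensor onto the model's device, then load as usual."""
    device = next(model.parameters()).device
    model.load_state_dict({k: v.to(device) for k, v in state_dict.items()})
```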
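The `config` property described in the last commit amounts to a one-line alias. A self-contained sketch, assuming `zanj_model_config` is a plain attribute set by the parent class (the real class also inherits from `ConfiguredModel` and `HookedTransformer`):

```python
class ZanjHookedTransformer:
    def __init__(self, zanj_model_config):
        # in the real class this is set up by the parent ConfiguredModel
        self.zanj_model_config = zanj_model_config

    @property
    def config(self):
        # alias: expose zanj_model_config under the generic name
        # that testdata_plot_predicted_path now uses everywhere
        return self.zanj_model_config
```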
mivanit commented Aug 6, 2023

added in #191; see the linked maze-dataset PRs there for more details

mivanit closed this as completed Aug 6, 2023