We need to compose an evaluation suite to support various decisions during the pre-training data preparation stage, such as:
- Language filtering, quality filtering, content filtering
- Deduplication
- Mixing proportions of different data sources
- Deciding whether or not to use synthetic data, code data, etc. (see the config sketch below)
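To make the decision space concrete, here is a hedged sketch of what a single data-ablation configuration could look like. All field names, filter names, and default values are illustrative assumptions, not a proposed schema:

```python
from dataclasses import dataclass, field

# Illustrative only: one record per ablation run, capturing the data-preparation
# decisions listed above so runs can be compared apples-to-apples.
@dataclass
class DataAblationConfig:
    run_name: str
    source_mixture: dict[str, float] = field(                  # mixing proportions
        default_factory=lambda: {"pile": 0.5, "c4": 0.5})
    language_filter: str = "en"                                 # language filtering
    quality_filter: str | None = "gopher_rules"                 # quality filtering (placeholder name)
    content_filter: bool = True                                 # content filtering
    deduplication: str | None = "minhash"                       # dedup strategy, or None to disable
    include_synthetic_data: bool = False
    include_code_data: bool = False

baseline = DataAblationConfig(run_name="baseline-pile-c4")
no_dedup = DataAblationConfig(run_name="no-dedup", deduplication=None)
```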
It would be interesting to explore whether practices followed in LLM pre-training (e.g. Dolma/OLMo) transfer to encoder models. For this evaluation suite, we will require a set of high-signal benchmarks. A high-signal eval dataset should satisfy the following properties (reference: https://youtu.be/2-SPH9hIKT8?t=1893); a rough screening sketch follows the list:
- Monotonicity: model performance improves monotonically as training progresses (avoid early saturation)
- Low variance:
  - when comparing two known reference datasets (e.g. Pile vs C4)
  - when comparing across different sub-parts of the data and across random seeds
- Performance above the random baseline
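As a deliberately simple way to screen candidate benchmarks against these properties, assuming we have per-checkpoint scores for a few seeds per reference dataset, something like the sketch below could work. The score format and the toy numbers are assumptions, not results:

```python
import numpy as np
from scipy.stats import spearmanr

# scores[dataset][seed] = list of eval scores at successive intermediate checkpoints.
def screen_benchmark(scores: dict, random_baseline: float) -> dict:
    report, final_means = {}, {}
    for dataset, runs in scores.items():
        curves = np.array(list(runs.values()))            # shape: (n_seeds, n_checkpoints)
        mean_curve = curves.mean(axis=0)
        # Monotonicity: rank correlation between checkpoint index and mean score.
        monotonicity, _ = spearmanr(np.arange(len(mean_curve)), mean_curve)
        report[dataset] = {
            "monotonicity": float(monotonicity),
            "seed_std_final": float(curves[:, -1].std()),           # variance across seeds
            "above_random_final": float(mean_curve[-1] - random_baseline),
        }
        final_means[dataset] = float(mean_curve[-1])
    # The gap between reference datasets (e.g. Pile vs C4) should exceed the seed noise.
    report["dataset_gap_final"] = max(final_means.values()) - min(final_means.values())
    return report

# Toy usage with made-up numbers:
screen_benchmark(
    {"pile": {0: [0.51, 0.56, 0.61], 1: [0.50, 0.55, 0.60]},
     "c4":   {0: [0.50, 0.52, 0.56], 1: [0.51, 0.53, 0.57]}},
    random_baseline=0.5,
)
```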
We can test various common NLP benchmarks (e.g. tasks in SuperGLUE) against these properties. A viable strategy could be to train randomly initialized `deberta-v3-base` models on subsets of Pile/C4 (with RTD/MLM) and track eval metrics at intermediate checkpoints (rough sketch below). Alternative suggestions/feedback are welcome!
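To make that strategy concrete, here is a rough sketch of one such ablation run with Hugging Face transformers. It uses plain MLM only (an RTD setup needs a separate generator/discriminator and is omitted), and every path, subset size, and hyperparameter is a placeholder:

```python
from datasets import load_dataset
from transformers import (AutoConfig, AutoTokenizer, DataCollatorForLanguageModeling,
                          DebertaV2ForMaskedLM, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
config = AutoConfig.from_pretrained("microsoft/deberta-v3-base")
model = DebertaV2ForMaskedLM(config)  # built from config only => randomly initialized weights

# Placeholder subset; in practice we'd pre-tokenize a fixed slice of Pile or C4.
raw = load_dataset("allenai/c4", "en", split="train[:200000]")
ds = raw.map(lambda b: tok(b["text"], truncation=True, max_length=512),
             batched=True, remove_columns=raw.column_names)
split = ds.train_test_split(test_size=0.005, seed=0)

args = TrainingArguments(
    output_dir="ablations/c4-subset",
    max_steps=50_000,                    # placeholder training budget
    per_device_train_batch_size=32,
    learning_rate=5e-4,
    evaluation_strategy="steps",         # `eval_strategy` in newer transformers versions
    eval_steps=5_000,                    # intermediate checkpoints for the eval suite
    save_steps=5_000,
    logging_steps=500,                   # watch MLM loss and loss spikes here
)
Trainer(model=model, args=args,
        train_dataset=split["train"], eval_dataset=split["test"],
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15)).train()
```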
Quick question - does the data quality validation benchmark need to be different from the final downstream benchmarks?
I think data quality validation and the downstream benchmarks will have a high overlap. However, it's better to have a separate evaluation setup for the data ablation studies:
- Instead of finetuning the base model, we should measure performance in a zero-shot manner (isolating the impact of pre-training). As a start, I'm thinking of cloze-style reformulations using prompt + verbalizer pairs, as in the AdaPET paper, Appendix A: https://arxiv.org/pdf/2103.11955 (a minimal sketch follows this list)
- Additionally, we can monitor training efficiency, stability/loss spikes, and the MLM/RTD loss
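For illustration, a minimal sketch of such a zero-shot cloze evaluation on a sentiment-style task is below; the checkpoint path, prompt template, and verbalizers are made-up examples rather than the ones we would actually commit to:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

ckpt = "ablations/c4-subset/checkpoint-5000"      # hypothetical intermediate checkpoint
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForMaskedLM.from_pretrained(ckpt).eval()

# One verbalizer token per label; the logits at the [MASK] position decide the prediction.
verbalizers = {"positive": "great", "negative": "terrible"}
label_ids = {lbl: tok.convert_tokens_to_ids(tok.tokenize(" " + word)[0])
             for lbl, word in verbalizers.items()}

def predict(review: str) -> str:
    text = f"{review} It was {tok.mask_token}."    # cloze-style prompt (illustrative)
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    return max(label_ids, key=lambda lbl: logits[mask_pos, label_ids[lbl]].item())

print(predict("A gripping, beautifully shot film."))   # hopefully "positive"
```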
I quite like the data quality evaluation strategy in FineWeb -- but we need to adapt it for encoder training.