
Data Quality Evaluation Suite #22

Open

rbiswasfc opened this issue May 16, 2024 · 3 comments

We need to compose an evaluation suite to support various decisions during the pre-training data preparation stage, such as:

  • Language filtering, quality filtering, content filtering
  • Deduplication
  • Mixing proportions of different data sources
  • Deciding whether or not to use synthetic data, code data, etc.

It would be interesting to explore whether practices followed in LLM pre-training (e.g. Dolma/OLMo) transfer to encoder models. For this evaluation suite, we will need a set of high-signal benchmarks. A high-signal eval dataset should satisfy the following properties (reference: https://youtu.be/2-SPH9hIKT8?t=1893):

  • Monotonicity: model performance improves monotonically as training progresses (avoiding early saturation)
  • Low variance:
    • when comparing two known reference datasets (e.g. Pile vs. C4)
    • when comparing various sub-parts of the data and different seeds
  • Performance above the random baseline
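
To make these criteria concrete, here is a rough sketch of how a candidate benchmark could be scored, assuming we have already collected per-checkpoint eval scores for a few seeds. All function names, thresholds, and numbers below are illustrative placeholders, not a proposed implementation:

```python
# Illustrative scoring of the "high signal" criteria for one candidate benchmark.
# Assumes eval scores were already collected at several intermediate checkpoints
# for a few seeds; all names and numbers here are placeholders.
import numpy as np
from scipy.stats import spearmanr


def monotonicity(steps, scores):
    """Rank correlation between training step and eval score; close to 1.0
    means the metric improves steadily without saturating early."""
    rho, _ = spearmanr(steps, scores)
    return rho


def seed_variance(scores_by_seed):
    """Standard deviation of the final-checkpoint score across seeds; lower is better."""
    finals = [scores[-1] for scores in scores_by_seed.values()]
    return float(np.std(finals))


def margin_over_random(final_score, random_baseline):
    """How far the benchmark ends up above chance level."""
    return final_score - random_baseline


# Made-up numbers for a 2-class task trained on a C4 subset with two seeds:
steps = [1_000, 5_000, 10_000, 20_000, 40_000]
runs = {
    "seed_0": [0.52, 0.58, 0.63, 0.67, 0.70],
    "seed_1": [0.51, 0.57, 0.64, 0.66, 0.71],
}
print("monotonicity:", monotonicity(steps, runs["seed_0"]))
print("variance across seeds:", seed_variance(runs))
print("margin over random:", margin_over_random(0.70, 0.50))
```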

We can test various common NLP benchmarks (e.g. tasks in SuperGLUE) against these properties. A viable strategy could be to train randomly initialized deberta-v3-base models on subsets of Pile/C4 (with RTD/MLM) and track eval metrics at intermediate checkpoints. Alternative suggestions/feedback are welcome!
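
For the MLM variant, something along these lines could serve as a starting point. This is a minimal Hugging Face Trainer sketch, not the final recipe: the dataset slice, hyperparameters, and output directory are placeholders, and RTD would need a separate ELECTRA-style generator/discriminator setup.

```python
# Rough sketch: MLM pre-training of a randomly initialised deberta-v3-base on a
# streamed C4 slice, saving intermediate checkpoints for the eval suite to score.
# All hyperparameters and paths are placeholders.
from datasets import load_dataset
from transformers import (AutoConfig, AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
config = AutoConfig.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForMaskedLM.from_config(config)  # random init, not the released weights

# Small streamed slice for illustration; swap in Pile / filtered subsets here.
raw = load_dataset("allenai/c4", "en", split="train", streaming=True).take(100_000)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


train_ds = raw.map(tokenize, batched=True, remove_columns=["text", "timestamp", "url"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="ablation-c4-mlm",
    max_steps=50_000,
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    warmup_steps=2_000,
    save_steps=5_000,    # intermediate checkpoints for the zero-shot eval suite
    logging_steps=100,   # also lets us watch loss curves / spikes for stability
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator).train()
```

The same script would be rerun per candidate data subset, so the only variable across runs is the data itself.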

rbiswasfc self-assigned this May 16, 2024
bclavie (Contributor) commented May 16, 2024

@orionw probably of interest to you too!

orionw self-assigned this May 16, 2024
griff4692 self-assigned this May 20, 2024
griff4692 (Contributor) commented:

Quick question - does the data quality validation benchmark need to be different from the final downstream benchmarks?

rbiswasfc (Contributor, Author) commented:

> Quick question - does the data quality validation benchmark need to be different from the final downstream benchmarks?

I expect the data quality validation and downstream benchmarks to overlap heavily. However, it's better to have a separate evaluation setup for the data ablation studies:
- Instead of finetuning the base model, we should measure performance in a zero-shot manner (isolating the impact of pre-training). As a start, I'm thinking of a cloze-style reformulation using prompt + verbalizer pairs, as in the AdaPET paper, Appendix A: https://arxiv.org/pdf/2103.11955 (a rough sketch follows below)
- Additionally, we can monitor training efficiency, stability/loss spikes, and the MLM/RTD loss

I quite like the data quality evaluation strategy in FineWeb, but we need to adapt it for encoder training.
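
To make the zero-shot point above concrete, here is a rough sketch of the cloze-style scoring with a single prompt + verbalizer pair. The checkpoint path, template, and verbalizer words are placeholders (verbalizers are assumed to map to single sentencepiece tokens), and a real suite would sweep several patterns per task:

```python
# Rough sketch of AdaPET-style zero-shot scoring for a binary sentiment-like task:
# put the input into a cloze template and compare the MLM logits of the two
# verbalizer tokens at the [MASK] position. Paths and templates are placeholders.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

ckpt = "ablation-c4-mlm/checkpoint-5000"  # hypothetical intermediate checkpoint
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")  # same tokenizer as the base config
model = AutoModelForMaskedLM.from_pretrained(ckpt).eval()

# label -> verbalizer word (assumed to be a single sentencepiece token)
verbalizers = {0: " terrible", 1: " great"}
verbalizer_ids = {label: tokenizer.encode(word, add_special_tokens=False)[0]
                  for label, word in verbalizers.items()}


@torch.no_grad()
def predict(text: str) -> int:
    prompt = f"{text} Overall, it was {tokenizer.mask_token}."
    enc = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    logits = model(**enc).logits[0, mask_pos]
    # Pick the label whose verbalizer token gets the higher score at [MASK].
    return max(verbalizer_ids, key=lambda label: logits[verbalizer_ids[label]].item())


print(predict("The movie kept me hooked from the first scene to the last."))
```

Accuracy from a loop like this, tracked across the saved checkpoints, is what would feed into the monotonicity/variance checks sketched earlier.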

rbiswasfc reopened this May 21, 2024