
Data Quality Evaluation Suite #22

Open

rbiswasfc opened this issue May 16, 2024 · 3 comments

We need to compose an evaluation suite to support various decisions during the pre-training data preparation stage, such as:

  • Language filtering, quality filtering, content filtering
  • Deduplication
  • Mixing proportions of different data sources
  • Deciding whether or not to use synthetic data, code data, etc.

It would be interesting to explore whether practices followed in LLM pre-training (e.g. Dolma/OLMo) transfer to encoder models. For this evaluation suite, we will need a set of high-signal benchmarks. A high-signal eval dataset should satisfy the following properties (reference: https://youtu.be/2-SPH9hIKT8?t=1893):

  • Monotonicity: model performance improves monotonically as training progresses (avoiding early saturation)
  • Low variance:
    • when comparing two known reference datasets (e.g. Pile vs. C4)
    • when comparing various sub-parts of the data and different seeds
  • Performance above the random baseline
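
To make these criteria concrete, here is a rough sketch of how a candidate benchmark could be scored, assuming we have already collected per-checkpoint eval scores for a few seeds. All function names, thresholds, and numbers below are illustrative placeholders, not a proposed implementation:

```python
# Illustrative scoring of the "high signal" criteria for one candidate benchmark.
# Assumes eval scores were already collected at several intermediate checkpoints
# for a few seeds; all names and numbers here are placeholders.
import numpy as np
from scipy.stats import spearmanr


def monotonicity(steps, scores):
    """Rank correlation between training step and eval score; close to 1.0
    means the metric improves steadily without saturating early."""
    rho, _ = spearmanr(steps, scores)
    return rho


def seed_variance(scores_by_seed):
    """Standard deviation of the final-checkpoint score across seeds; lower is better."""
    finals = [scores[-1] for scores in scores_by_seed.values()]
    return float(np.std(finals))


def margin_over_random(final_score, random_baseline):
    """How far the benchmark ends up above chance level."""
    return final_score - random_baseline


# Made-up numbers for a 2-class task trained on a C4 subset with two seeds:
steps = [1_000, 5_000, 10_000, 20_000, 40_000]
runs = {
    "seed_0": [0.52, 0.58, 0.63, 0.67, 0.70],
    "seed_1": [0.51, 0.57, 0.64, 0.66, 0.71],
}
print("monotonicity:", monotonicity(steps, runs["seed_0"]))
print("variance across seeds:", seed_variance(runs))
print("margin over random:", margin_over_random(0.70, 0.50))
```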

We can test various common NLP benchmarks (e.g. tasks in SuperGLUE) against these properties. A viable strategy could be to train randomly initialized deberta-v3-base models on subsets of Pile/C4 (with RTD/MLM) and track eval metrics at intermediate checkpoints. Alternative suggestions/feedback are welcome!
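
For the MLM variant, something along these lines could serve as a starting point. This is a minimal Hugging Face Trainer sketch, not the final recipe: the dataset slice, hyperparameters, and output directory are placeholders, and RTD would need a separate ELECTRA-style generator/discriminator setup.

```python
# Rough sketch: MLM pre-training of a randomly initialised deberta-v3-base on a
# streamed C4 slice, saving intermediate checkpoints for the eval suite to score.
# All hyperparameters and paths are placeholders.
from datasets import load_dataset
from transformers import (AutoConfig, AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
config = AutoConfig.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForMaskedLM.from_config(config)  # random init, not the released weights

# Small streamed slice for illustration; swap in Pile / filtered subsets here.
raw = load_dataset("allenai/c4", "en", split="train", streaming=True).take(100_000)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


train_ds = raw.map(tokenize, batched=True, remove_columns=["text", "timestamp", "url"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="ablation-c4-mlm",
    max_steps=50_000,
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    warmup_steps=2_000,
    save_steps=5_000,    # intermediate checkpoints for the zero-shot eval suite
    logging_steps=100,   # also lets us watch loss curves / spikes for stability
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator).train()
```

The same script would be rerun per candidate data subset, so the only variable across runs is the data itself.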

rbiswasfc self-assigned this May 16, 2024
bclavie (Contributor) commented May 16, 2024

@orionw probably of interest to you too!

orionw self-assigned this May 16, 2024
griff4692 self-assigned this May 20, 2024
griff4692 (Contributor) commented:

Quick question - does the data quality validation benchmark need to be different from the final downstream benchmarks?

rbiswasfc (Contributor, Author) commented:

> Quick question - does the data quality validation benchmark need to be different from the final downstream benchmarks?

I expect the data quality validation and downstream benchmarks to overlap heavily. However, it's better to have a separate evaluation setup for the data ablation studies:
- Instead of finetuning the base model, we should measure performance in a zero-shot manner (isolating the impact of pre-training). As a start, I'm thinking of a cloze-style reformulation using prompt + verbalizer pairs, as in the AdaPET paper, Appendix A: https://arxiv.org/pdf/2103.11955 (a rough sketch follows below)
- Additionally, we can monitor training efficiency, stability/loss spikes, and the MLM/RTD loss

I quite like the data quality evaluation strategy in FineWeb, but we need to adapt it for encoder training.
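
To make the zero-shot point above concrete, here is a rough sketch of the cloze-style scoring with a single prompt + verbalizer pair. The checkpoint path, template, and verbalizer words are placeholders (verbalizers are assumed to map to single sentencepiece tokens), and a real suite would sweep several patterns per task:

```python
# Rough sketch of AdaPET-style zero-shot scoring for a binary sentiment-like task:
# put the input into a cloze template and compare the MLM logits of the two
# verbalizer tokens at the [MASK] position. Paths and templates are placeholders.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

ckpt = "ablation-c4-mlm/checkpoint-5000"  # hypothetical intermediate checkpoint
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")  # same tokenizer as the base config
model = AutoModelForMaskedLM.from_pretrained(ckpt).eval()

# label -> verbalizer word (assumed to be a single sentencepiece token)
verbalizers = {0: " terrible", 1: " great"}
verbalizer_ids = {label: tokenizer.encode(word, add_special_tokens=False)[0]
                  for label, word in verbalizers.items()}


@torch.no_grad()
def predict(text: str) -> int:
    prompt = f"{text} Overall, it was {tokenizer.mask_token}."
    enc = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    logits = model(**enc).logits[0, mask_pos]
    # Pick the label whose verbalizer token gets the higher score at [MASK].
    return max(verbalizer_ids, key=lambda label: logits[verbalizer_ids[label]].item())


print(predict("The movie kept me hooked from the first scene to the last."))
```

Accuracy from a loop like this, tracked across the saved checkpoints, is what would feed into the monotonicity/variance checks sketched earlier.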

rbiswasfc reopened this May 21, 2024