This repository contains the data and code for the paper "Does Compressing Activations Help Model Parallel Training?" (MLSys 2024). Our code is based on Megatron-LM, developed by NVIDIA.
Installation 🛠️ • Data 🗃️ • Checkpoint ⚙️ • Quick Start 🚀 • Contributing 🐜
To get started, first set up the environment:
pip install -r requirements.txt --find-links https://download.pytorch.org/whl/torch_stable.html
We use Python 3.9 and CUDA 11.3. If you are using different versions of Python or CUDA, make sure the torch build you install is compatible with them.
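If you want to double-check which build you ended up with, the short snippet below (a suggestion of ours, not part of the repository) prints the Python, torch, and CUDA versions before you move on to building apex:

import sys
import torch

# Confirm that the interpreter and the torch build match the versions we used
# (Python 3.9, CUDA 11.3); a mismatch here usually breaks the apex build later.
print("python:", sys.version.split()[0])
print("torch :", torch.__version__)        # e.g. a +cu113 wheel for CUDA 11.3
print("cuda  :", torch.version.cuda)
print("gpu   :", torch.cuda.is_available())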
To install apex, please proceed with the following steps:
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 22.04-dev
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
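As a quick sanity check (again a suggestion, not part of the repository), the imports below should only succeed when apex was actually built with its C++/CUDA extensions rather than falling back to a Python-only install:

import apex                    # base package
import fused_layer_norm_cuda   # compiled only when --cpp_ext/--cuda_ext were used
print("apex CUDA extensions are available")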
We provide two examples illustrating how to prepare data for fine-tuning and pre-training, respectively.
Download the GLUE dataset:
python download_glue_data.py
Download vocabulary files:
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt
Download the Wikipedia dataset:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Preprocess the Wikipedia dataset:
python -m wikiextractor.WikiExtractor -o output --json enwiki-latest-pages-articles.xml.bz2
cd tools
bash preprocess_wiki.sh
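preprocess_wiki.sh drives the actual preprocessing. Purely as an illustration of the intermediate format (file and directory names below are examples, and this snippet is not part of the repository): wikiextractor's --json mode writes one JSON document per line with "title" and "text" fields, and a merging step of roughly this shape turns the shards into a single loose-JSON file of the kind the downstream tooling consumes:

import glob
import json

# wikiextractor writes shards such as output/AA/wiki_00, output/AB/wiki_01, ...
with open("wiki_all.json", "w", encoding="utf-8") as merged:
    for path in sorted(glob.glob("output/*/wiki_*")):
        with open(path, encoding="utf-8") as shard:
            for line in shard:
                doc = json.loads(line)
                if doc.get("text"):               # skip empty articles
                    merged.write(json.dumps({"text": doc["text"]}) + "\n")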
Download vocabulary files:
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
Download Checkpoints:
cd examples
mkdir checkpoints
cd checkpoints
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
unzip megatron_bert_345m_v0.1_cased.zip -d bert_345m
Split the checkpoints:
cd tools
bash split_single.sh
Note: the pipeline parallelism degree and tensor parallelism degree used when splitting the checkpoint must match the configuration used for fine-tuning.
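The script performs the actual splitting. Conceptually, each weight matrix of the full checkpoint is partitioned across the tensor-parallel ranks, as in the toy sketch below (illustrative only, with made-up shapes; this is not the code in split_single.sh):

import torch

tp_degree = 2                          # must match the fine-tuning launch settings
full_weight = torch.randn(4096, 1024)  # e.g. an h->4h MLP projection

# Column-parallel layers are split along the output dimension (dim 0);
# row-parallel layers would instead be split along the input dimension (dim 1).
shards = torch.chunk(full_weight, tp_degree, dim=0)
for rank, shard in enumerate(shards):
    print(f"tensor-parallel rank {rank}: {tuple(shard.shape)}")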
With the checkpoint split as described above, we fine-tune BERT-345M (BERT-Large):
cd examples
bash finetune_mrpc_distributed_with_mp.sh
To use checkpoints from Huggingface, follow these steps:
- Implement the Transformer-based model using the Transformer implementation provided by Megatron-LM.
- Download and preprocess the Huggingface checkpoints.
- Split the checkpoints for fine-tuning.
Here is an example. Since BERT-Base is already implemented in our repository, we only demonstrate the final two steps.
Download and preprocess the Huggingface checkpoints:
python preprocess_hf_bert_checkpoint.py
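The real conversion, including the full parameter-name mapping and the handling of the fused QKV layout, lives in preprocess_hf_bert_checkpoint.py. The sketch below only illustrates the general idea of exporting a Huggingface state dict and renaming keys; the mapping and output file name shown here are hypothetical:

import torch
from transformers import BertModel

hf_model = BertModel.from_pretrained("bert-base-cased")
state_dict = hf_model.state_dict()

# Hypothetical rename of one naming prefix; the actual script maps every
# Huggingface parameter name onto Megatron's layout.
converted = {k.replace("encoder.layer", "transformer.layers"): v
             for k, v in state_dict.items()}
torch.save({"model": converted}, "hf_bert_base_converted.pt")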
Split the checkpoints:
bash split_single_hf.sh
Fine-tune BERT-Base:
cd examples
bash finetune_mrpc_bert_base_with_mp.sh
Expanding our repository to include additional Huggingface models requires implementing those models ourselves. The steps are as follows:
- Implement the parallel MLP and parallel attention (please refer to megatron/model/transformer.py); a condensed sketch of the tensor-parallel MLP pattern is given after this list.
- Implement the language model using the parallel MLP and parallel attention (please refer to megatron/model/language_model.py).
- Implement the full model by combining the language model with the embedding and the head (please refer to megatron/model/bert_model.py or megatron/model/gpt_model.py).
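As a starting point for the first step, here is a condensed, self-contained sketch of the column-/row-parallel MLP pattern that megatron/model/transformer.py implements with ColumnParallelLinear and RowParallelLinear. The class and argument names below are illustrative, not Megatron's actual API; real code also wraps the communication in autograd functions so gradients flow correctly.

import torch
import torch.nn as nn
import torch.distributed as dist

class ToyParallelMLP(nn.Module):
    """Each tensor-parallel rank holds one slice of the MLP."""

    def __init__(self, hidden_size, ffn_size, tp_degree):
        super().__init__()
        assert ffn_size % tp_degree == 0
        local_ffn = ffn_size // tp_degree
        # Column-parallel: slice the h->4h projection along its output columns.
        self.dense_h_to_4h = nn.Linear(hidden_size, local_ffn)
        self.act = nn.GELU()
        # Row-parallel: slice the 4h->h projection along its input rows
        # (bias omitted; it would be added once, after the reduction).
        self.dense_4h_to_h = nn.Linear(local_ffn, hidden_size, bias=False)

    def forward(self, hidden_states):
        out = self.dense_4h_to_h(self.act(self.dense_h_to_4h(hidden_states)))
        # Partial outputs from all tensor-parallel ranks are summed; this is
        # the kind of cross-GPU activation traffic that model-parallel
        # training incurs.
        if dist.is_initialized():
            dist.all_reduce(out)
        return out

With a tensor-parallel group of size tp_degree, each rank would construct this module with its own shard of the weights and run it inside the usual torch.distributed process group.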
Authors: Song Bian*, Dacheng Li*, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman
Affiliations: University of Wisconsin-Madison, Carnegie Mellon University, MBZUAI, and Petuum Inc.
If you find the idea or code useful for your research, please consider citing our paper:
@article{bian2023does,
  title={Does compressing activations help model parallel training?},
  author={Bian, Song and Li, Dacheng and Wang, Hongyi and Xing, Eric P and Venkataraman, Shivaram},
  journal={arXiv preprint arXiv:2301.02654},
  year={2023}
}