Train_loss = 0 and Eval_loss = NaN in stage2_sft #31

Open
xuxiaoang opened this issue Jul 20, 2024 · 4 comments

@xuxiaoang

Hello!
Thank you for your work on MLLMs.
I've run into a fine-tuning bug that I can't fix: when I run the stage2_sft.sh script and train with speech_conv_datasets only, the logger shows a train loss of 0 at every step and an eval loss of NaN, as shown in the screenshot.
[Screenshot 2024-07-20 210750]

The command in stage2_sft.sh is as follows:

torchrun \
    --nproc_per_node 2 \
    anygpt/src/train/stage2_sft.py \
    --model_name_or_path "${METAROOT}" \
    --run_name "mm_sft" \
    --cache_dir ${CACHEROOT} \
    --report_to "wandb" \
    --speech_conv_datasets "$speech_conv_datasets" \
    --speech_datasets "$speech_datasets" \
    --preprocessing_num_workers 100 \
    --bf16 True \
    --do_train \
    --do_eval \
    --output_dir "${OUTROOT}" \
    --model_max_length 4096 \
    --save_strategy "steps" \
    --save_steps 5 \
    --evaluation_strategy "steps" \
    --eval_steps 5 \
    --max_steps 5 \
    --concatenating False \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --val_set_size 10 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --log_level debug \
    --logging_steps 1 \
    --overwrite_output_dir False \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --use_flash_attn True \
    --ddp_timeout 7200 \
    --save_total_limit 10

I'm using the following python environment:

transformers              4.34.1
huggingface-hub           0.24.0
tokenizers                0.14.1
torch                     2.1.0
torchaudio                2.1.0
torchvision               0.16.0
flash-attn                2.5.9.post1
@JunZhan2000
Collaborator

Hi, is your training set very small? Could you try a larger one?

@xuxiaoang
Author

Hi, thank you for your reply.

I changed the dataset to the whole metadata.jsonl in part1 of the AnyInstruct dataset, but the issue persists.

While debugging, I found that the preprocess method in anygpt/src/train/stage2_sft.py masks every token in targets with IGNORE_TOKEN_ID and returns the result as labels, as shown below:
[Image: masked_targets]
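
A quick way to confirm this symptom across a batch is to count how many label positions survive masking. The following is only a minimal sketch: the helper names are hypothetical (not part of stage2_sft.py), and IGNORE_TOKEN_ID is assumed to be -100, the value mentioned later in this thread.

```python
from typing import List, Sequence

# Assumed value; the thread mentions that masked targets are set to -100.
IGNORE_TOKEN_ID = -100


def supervised_token_count(labels: Sequence[int], ignore_id: int = IGNORE_TOKEN_ID) -> int:
    """Return how many label positions would contribute to the loss."""
    return sum(1 for token in labels if token != ignore_id)


def fully_masked_indices(batch_labels: Sequence[Sequence[int]],
                         ignore_id: int = IGNORE_TOKEN_ID) -> List[int]:
    """Return indices of samples whose labels are entirely masked."""
    return [
        index
        for index, labels in enumerate(batch_labels)
        if supervised_token_count(labels, ignore_id) == 0
    ]


if __name__ == "__main__":
    batch = [[-100, -100, -100], [-100, 42, 43]]
    # Sample 0 has no supervised tokens; sample 1 has two.
    print(fully_masked_indices(batch))
```

If every sample index is reported, no token contributes to the loss, which is consistent with the behavior described above.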

I noticed the comment on line 248 of the source code: "Mask targets. Only compute loss on the assistant outputs." Does this mean that the anygpt_system_prompt part and the user_massage part need to be masked, so that only the "anygpt_massage" part remains? I suspect there is a minor bug in the masking logic of the preprocess method.

By the way, could you explain why the user_massage part needs to be masked? Is this based on a rule or on experience? What happens if the user_massage part is not masked?

Looking forward to your reply.

Thanks.

@JunZhan2000
Collaborator

> Does this mean that anygpt_system_prompt part and user_massage part need to be masked and only the "anygpt_massage" part should remain?

Yes, it does.

I think this code works fine on my data. Ideally, the targets for every token outside the model's response are set to -100, which means no loss is computed for them. We do this because it seems to be common practice in instruction fine-tuning, but we also tried not masking and computing the loss over the entire sequence, and I don't think it made much difference.
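
The masking described here can be sketched in plain Python. This is an illustration under stated assumptions, not the code in stage2_sft.py: the function names and the span representation are hypothetical, and the loss is a simplified mean negative log-likelihood. It also shows why a fully masked sample is a problem: a mean taken over zero supervised tokens is 0/0, and PyTorch's mean-reduced cross-entropy with ignore_index likewise returns NaN when every target is ignored.

```python
import math
from typing import List, Sequence, Tuple

IGNORE_TOKEN_ID = -100  # assumed to match the -100 mentioned above


def mask_non_assistant(token_ids: Sequence[int],
                       assistant_spans: Sequence[Tuple[int, int]],
                       ignore_id: int = IGNORE_TOKEN_ID) -> List[int]:
    """Build labels that keep only tokens inside assistant spans.

    Spans are half-open [start, end) index pairs (a hypothetical format).
    """
    labels = [ignore_id] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels


def masked_mean_nll(token_log_probs: Sequence[float],
                    labels: Sequence[int],
                    ignore_id: int = IGNORE_TOKEN_ID) -> float:
    """Mean negative log-likelihood over the non-ignored positions only."""
    kept = [-lp for lp, label in zip(token_log_probs, labels) if label != ignore_id]
    if not kept:
        # No supervised tokens: the mean is 0 / 0, i.e. undefined.
        return float("nan")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    token_ids = [1, 2, 3, 4, 5]
    log_probs = [-1.0, -1.0, -1.0, -2.0, -4.0]

    labels = mask_non_assistant(token_ids, [(3, 5)])
    print(labels, masked_mean_nll(log_probs, labels))

    all_masked = [IGNORE_TOKEN_ID] * len(token_ids)
    print(math.isnan(masked_mean_nll(log_probs, all_masked)))
```

Under these assumptions, an eval loss of NaN is what one would expect if preprocessing masks every target in every sample.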

@jingfanke

Your problem seems similar to this issue: lm-sys/FastChat#3266 (comment). I hope my solution helps.

You can modify

for i, turn in enumerate(turns):
    if turn == "":
        break
    turn_len = len(tokenizer(turn).input_ids)

to the following lines:

for i, turn in enumerate(turns):
    if turn == "":
        break
    turn += conv.sep2  # append sep2 so the separator token is counted in this turn
    turn_len = len(tokenizer(turn).input_ids) - 1  # subtract 1 for the start-of-sequence token the tokenizer prepends
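
To check which counting scheme agrees with a given tokenizer, one can compare the accumulated per-turn lengths against the length of the fully tokenized conversation. The sketch below is a hypothetical diagnostic, not code from the repository. It assumes the running length starts at 1 for the leading start-of-sequence token, and it uses a toy whitespace tokenizer as a stand-in; with a real tokenizer you would pass something like `lambda s: tokenizer(s).input_ids` instead.

```python
from typing import Callable, Dict, List


def turn_length_report(tokenize: Callable[[str], List[str]],
                       conversation: str,
                       sep2: str) -> Dict[str, int]:
    """Compare the full tokenized length with two per-turn counting schemes."""
    total_len = len(tokenize(conversation))
    turns = [turn for turn in conversation.split(sep2) if turn != ""]

    # Original counting: tokenize each turn on its own (start token included).
    original = 1 + sum(len(tokenize(turn)) for turn in turns)

    # Patched counting: append sep2, then drop the prepended start token.
    patched = 1 + sum(len(tokenize(turn + sep2)) - 1 for turn in turns)

    return {"total": total_len, "original": original, "patched": patched}


def toy_tokenize(text: str) -> List[str]:
    """Toy stand-in: start token + whitespace words, with "</s>" as its own token."""
    return ["<s>"] + text.replace("</s>", " </s> ").split()


if __name__ == "__main__":
    conv = "USER: hi ASSISTANT: hello</s>USER: bye ASSISTANT: ok</s>"
    print(turn_length_report(toy_tokenize, conv, "</s>"))
```

With the toy tokenizer both schemes match the total, so it does not reproduce the bug. On a real tokenizer, a scheme whose sum differs from "total" is the kind of discrepancy that would explain labels being masked entirely.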
