
Exception: Current loss scale already at minimum - cannot decrease scale anymore #280

Open
Z-eloto opened this issue Nov 12, 2024 · 6 comments

Comments


Z-eloto commented Nov 12, 2024

Thank you for sharing your code.
When running gpt2/kd/kd_medium.sh on 2×3090 GPUs, the program encountered this error. What should I do? For example, should I adjust the learning rate?

@shiboyu1999

You can use fp32 to train the model or decrease the batch size to 1.
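In DeepSpeed, "use fp32" typically just means disabling the fp16 block in the JSON config. A minimal sketch under that assumption — the repo's actual config file may contain more options:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": { "stage": 1 },
  "fp16": { "enabled": false }
}
```

With `fp16.enabled` set to false, training runs in fp32 and no dynamic loss scaling is involved, at the cost of roughly double the memory per tensor — which is also why the batch size may need to drop to 1 on a 24 GB 3090.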


Z-eloto commented Nov 19, 2024

> You can use fp32 to train the model or decrease the batch size to 1.

Thanks. I will try it. :)

@t1101675 (Contributor)

You can also try using bfloat16 by replacing ds_config_zero1_fp16.json in this line with ds_config_zero1_bf16.json.
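For reference, the essential difference between the two configs is the precision block. A hedged sketch of what ds_config_zero1_bf16.json likely contains — the actual file in the repo may differ:

```json
{
  "zero_optimization": { "stage": 1 },
  "bf16": { "enabled": true }
}
```

Because bf16 has the same 8-bit exponent as fp32, DeepSpeed does not use dynamic loss scaling with it, so the "loss scale at minimum" failure mode disappears entirely.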


Z-eloto commented Dec 4, 2024

> You can also try using bfloat16 by replacing ds_config_zero1_fp16.json in this line with ds_config_zero1_bf16.json.

OK, I have already resolved this problem. Thanks :)
But I also want to know: will this modification affect the experimental results?


t1101675 commented Dec 4, 2024

This will not affect the results much. In fact, bf16 will be more stable in training than fp16 and will not suffer from the "Current loss scale already at minimum" problem.
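The underlying issue: fp16's dynamic loss scaler halves the scale whenever gradients overflow, and when overflows keep recurring the scale is driven down to its minimum, producing this exception. fp16's narrow 5-bit exponent is the root cause. A small NumPy sketch of the range problem (NumPy has no native bf16, so fp32 — which shares bf16's 8-bit exponent — stands in for the range comparison):

```python
import numpy as np

# fp16 has a narrow exponent range: the largest finite value is 65504
# and the smallest subnormal is ~6e-8. Values outside this range
# overflow to inf or flush to zero.
loss = np.float16(1e5)        # 1e5 > 65504 -> overflows to inf
tiny_grad = np.float16(1e-8)  # 1e-8 < ~6e-8 -> flushes to 0

print(np.isinf(loss))    # True: an overflow like this makes the scaler cut the loss scale
print(tiny_grad == 0)    # True: the gradient information is silently lost

# fp32 (and bf16, which shares its 8-bit exponent) represents both values:
print(np.isinf(np.float32(1e5)))  # False
print(np.float32(1e-8) == 0)      # False
```

This is why bf16 trades precision (fewer mantissa bits) for range, and why it can train stably without any loss scaling at all.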


Z-eloto commented Dec 4, 2024

> This will not affect the results much. In fact, bf16 will be more stable in training than fp16 and will not suffer from the "Current loss scale already at minimum" problem.

I see. Many thanks!
