
RuntimeError: The size of tensor a (151936) must match the size of tensor b (152064) at non-singleton dimension 1 #286

Open
Harryjun opened this issue Dec 12, 2024 · 6 comments


@Harryjun

When I tried to train MiniLLM, distilling Qwen-14B into Qwen-1.5B, I encountered the following problem:

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/miniLLM/LMOps-main/minillm/train_minillm.py", line 103, in <module>
[rank0]:     main()
[rank0]:   File "/miniLLM/LMOps-main/minillm/train_minillm.py", line 89, in main
[rank0]:     train(
[rank0]:   File "/miniLLM/LMOps-main/minillm/minillm/__init__.py", line 37, in train
[rank0]:     sampler.run_sample(args.num_rollouts_per_device)
[rank0]:   File "/miniLLM/LMOps-main/minillm/minillm/sampler.py", line 70, in run_sample
[rank0]:     gen_out = self.trainer.generate(**batch, return_dict_in_generate=True, mode=mode, teacher_mixed_sample=(self.args.teacher_mixed_alpha is not None), output_scores=True)
[rank0]:   File "/miniLLM/LMOps-main/minillm/minillm/trainer.py", line 618, in generate
[rank0]:     gen = model.generate(
[rank0]:   File "/miniLLM/LMOps-main/minillm/minillm/model.py", line 21, in generate
[rank0]:     return self.base_model.generate(**x)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/miniLLM/LMOps-main/minillm/transformers/src/transformers/generation/utils.py", line 2229, in generate
[rank0]:     result = self._sample(
[rank0]:   File "/miniLLM/LMOps-main/minillm/transformers/src/transformers/generation/utils.py", line 3331, in _sample
[rank0]:     probs = (1 - mix_in_alpha) * probs + mix_in_alpha * m_probs
[rank0]: RuntimeError: The size of tensor a (151936) must match the size of tensor b (152064) at non-singleton dimension 1
E1212 22:55:56.524920 139651309172544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 739442) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/miniLLM/LMOps-main/minillm/train_minillm.py FAILED
```
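The failing line mixes the student's and the teacher's next-token distributions element-wise, so both tensors must have the same vocabulary dimension. The mismatch can be reproduced in isolation with random tensors (only the shapes are from the traceback; `mix_in_alpha` here is an illustrative value, not the repo's setting):

```python
import torch

# Vocabulary sizes from the traceback: student (Qwen-1.5B) vs. teacher (Qwen-14B).
probs = torch.rand(1, 151936)    # student next-token probabilities
m_probs = torch.rand(1, 152064)  # teacher probabilities mixed in during sampling
mix_in_alpha = 0.2               # illustrative value only

try:
    probs = (1 - mix_in_alpha) * probs + mix_in_alpha * m_probs
except RuntimeError as e:
    # -> The size of tensor a (151936) must match the size of tensor b
    #    (152064) at non-singleton dimension 1
    print(e)
```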

@t1101675
Contributor

It seems that qwen14b and qwen1.5b use different vocabulary sizes, and KD methods generally require the teacher and student models to share the same vocabulary.

However, the vocabulary difference between different-sized Qwen models is just padding. Therefore, either cutting the larger vocabulary down to the smaller one or padding the smaller one up to the larger one would be fine.
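Concretely, both directions can be sketched on the logits before mixing. The sizes below are the Qwen values from this issue; everything else (tensor contents, variable names) is illustrative:

```python
import torch
import torch.nn.functional as F

STUDENT_VOCAB = 151936  # Qwen-1.5B
TEACHER_VOCAB = 152064  # Qwen-14B; the extra entries are padding tokens

teacher_logits = torch.randn(1, TEACHER_VOCAB)
student_logits = torch.randn(1, STUDENT_VOCAB)

# Option 1: cut the teacher distribution down to the student vocabulary.
cut_logits = teacher_logits[..., :STUDENT_VOCAB]

# Option 2: pad the student distribution up to the teacher vocabulary with
# -inf logits, so the padded tokens get exactly zero probability after softmax.
pad = TEACHER_VOCAB - STUDENT_VOCAB
padded_logits = F.pad(student_logits, (0, pad), value=float("-inf"))
```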

@Harryjun
Author

@t1101675 How do I pad the smaller vocabulary to the larger one?

@Harryjun
Author

@t1101675 So I need to modify the vocabulary size, re-train SFT, and then perform distillation?

@t1101675
Contributor

t1101675 commented Dec 15, 2024

> @t1101675 So I need to modify the vocabulary size, re-train SFT, and then perform distillation?

Yes. That said, the SFT re-training does not need to be extensive, since the model only has to adapt to a few padding tokens.

@Harryjun
Author

Harryjun commented Dec 16, 2024

@t1101675 How do I pad the smaller vocabulary to the larger one? I changed config.json, but it had no effect. Also, couldn't the code support automatic padding?

@t1101675
Contributor

You probably need to resize the embeddings of the student model after changing config.json.
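For example, with `resize_token_embeddings` from transformers (sketched here on a tiny stand-in model; for the actual student you would load your SFT'd Qwen-1.5B checkpoint with `AutoModelForCausalLM.from_pretrained(...)`, call the same method with 152064, and then `save_pretrained` the result):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model so the sketch runs quickly; substitute your checkpoint.
model = GPT2LMHeadModel(GPT2Config(vocab_size=100, n_embd=8, n_layer=1,
                                   n_head=1, n_positions=16))

# Pad the vocabulary: 100 -> 128 (for Qwen: 151936 -> 152064). This grows
# both the input embedding and the tied lm_head, and updates config.vocab_size.
model.resize_token_embeddings(128)

print(model.get_input_embeddings().weight.shape)  # torch.Size([128, 8])
print(model.config.vocab_size)                    # 128
```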
