
RuntimeError: The size of tensor a (151936) must match the size of tensor b (152064) at non-singleton dimension 1 #286

Open
Harryjun opened this issue Dec 12, 2024 · 6 comments


@Harryjun

When I tried to train MiniLLM, distilling Qwen-14B into Qwen-1.5B, I encountered the following problem:

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/miniLLM/LMOps-main/minillm/train_minillm.py", line 103, in <module>
[rank0]:     main()
[rank0]:   File "/miniLLM/LMOps-main/minillm/train_minillm.py", line 89, in main
[rank0]:     train(
[rank0]:   File "/miniLLM/LMOps-main/minillm/minillm/__init__.py", line 37, in train
[rank0]:     sampler.run_sample(args.num_rollouts_per_device)
[rank0]:   File "/miniLLM/LMOps-main/minillm/minillm/sampler.py", line 70, in run_sample
[rank0]:     gen_out = self.trainer.generate(**batch, return_dict_in_generate=True, mode=mode, teacher_mixed_sample=(self.args.teacher_mixed_alpha is not None), output_scores=True)
[rank0]:   File "/miniLLM/LMOps-main/minillm/minillm/trainer.py", line 618, in generate
[rank0]:     gen = model.generate(
[rank0]:   File "/miniLLM/LMOps-main/minillm/minillm/model.py", line 21, in generate
[rank0]:     return self.base_model.generate(**x)
[rank0]:   File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/miniLLM/LMOps-main/minillm/transformers/src/transformers/generation/utils.py", line 2229, in generate
[rank0]:     result = self._sample(
[rank0]:   File "/miniLLM/LMOps-main/minillm/transformers/src/transformers/generation/utils.py", line 3331, in _sample
[rank0]:     probs = (1 - mix_in_alpha) * probs + mix_in_alpha * m_probs
[rank0]: RuntimeError: The size of tensor a (151936) must match the size of tensor b (152064) at non-singleton dimension 1
E1212 22:55:56.524920 139651309172544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 739442) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/miniLLM/LMOps-main/minillm/train_minillm.py FAILED
```
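The failing line mixes the student's and the teacher's next-token distributions element-wise, so both tensors must have the same vocabulary dimension. The mismatch can be reproduced in isolation with random tensors (only the shapes are from the traceback; `mix_in_alpha` here is an illustrative value, not the repo's setting):

```python
import torch

# Vocabulary sizes from the traceback: student (Qwen-1.5B) vs. teacher (Qwen-14B).
probs = torch.rand(1, 151936)    # student next-token probabilities
m_probs = torch.rand(1, 152064)  # teacher probabilities mixed in during sampling
mix_in_alpha = 0.2               # illustrative value only

try:
    probs = (1 - mix_in_alpha) * probs + mix_in_alpha * m_probs
except RuntimeError as e:
    # -> The size of tensor a (151936) must match the size of tensor b
    #    (152064) at non-singleton dimension 1
    print(e)
```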

@t1101675
Contributor

It seems that qwen14b and qwen1.5b use different vocabulary sizes, and KD methods generally require the teacher and student models to share the same vocabulary.

However, the vocabulary difference between different-sized Qwen models is just padding. Therefore, either cutting the larger vocabulary down to the smaller one or padding the smaller one up to the larger one would be fine.
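Concretely, both directions can be sketched on the logits before mixing. The sizes below are the Qwen values from this issue; everything else (tensor contents, variable names) is illustrative:

```python
import torch
import torch.nn.functional as F

STUDENT_VOCAB = 151936  # Qwen-1.5B
TEACHER_VOCAB = 152064  # Qwen-14B; the extra entries are padding tokens

teacher_logits = torch.randn(1, TEACHER_VOCAB)
student_logits = torch.randn(1, STUDENT_VOCAB)

# Option 1: cut the teacher distribution down to the student vocabulary.
cut_logits = teacher_logits[..., :STUDENT_VOCAB]

# Option 2: pad the student distribution up to the teacher vocabulary with
# -inf logits, so the padded tokens get exactly zero probability after softmax.
pad = TEACHER_VOCAB - STUDENT_VOCAB
padded_logits = F.pad(student_logits, (0, pad), value=float("-inf"))
```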

@Harryjun
Author

@t1101675 How do I pad the smaller vocabulary to the larger one?

@Harryjun
Author

@t1101675 So I need to modify the vocabulary size, re-train SFT, and then perform distillation?

@t1101675
Contributor

t1101675 commented Dec 15, 2024

> @t1101675 So I need to modify the vocabulary size, re-train SFT, and then perform distillation?

Yes. That said, the SFT re-training does not need to be extensive, since the model only has to adapt to a few padding tokens.

@Harryjun
Author

Harryjun commented Dec 16, 2024

@t1101675 How do I pad the smaller vocabulary to the larger one? I changed config.json, but it had no effect. Also, couldn't the code support automatic padding?

@t1101675
Contributor

You probably need to resize the embeddings of the student model after changing config.json.
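For example, with `resize_token_embeddings` from transformers (sketched here on a tiny stand-in model; for the actual student you would load your SFT'd Qwen-1.5B checkpoint with `AutoModelForCausalLM.from_pretrained(...)`, call the same method with 152064, and then `save_pretrained` the result):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model so the sketch runs quickly; substitute your checkpoint.
model = GPT2LMHeadModel(GPT2Config(vocab_size=100, n_embd=8, n_layer=1,
                                   n_head=1, n_positions=16))

# Pad the vocabulary: 100 -> 128 (for Qwen: 151936 -> 152064). This grows
# both the input embedding and the tied lm_head, and updates config.vocab_size.
model.resize_token_embeddings(128)

print(model.get_input_embeddings().weight.shape)  # torch.Size([128, 8])
print(model.config.vocab_size)                    # 128
```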
