
Issue with Trace Option Causing TypeError in mLoRA Training #268

Open
EricLabile opened this issue Nov 4, 2024 · 5 comments

Comments

@EricLabile commented Nov 4, 2024:

I encountered an error when using the --trace option. The error message indicates the following:

```
/u/.conda/envs/mlora/lib/python3.12/site-packages/bitsandbytes/autograd/functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/mLoRA/mlora_train.py", line 68, in <module>
    executor.execute()
  File "/mLoRA/mlora/executor/executor.py", line 110, in execute
    output = self.model_.forward(data.model_data())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mLoRA/mlora/model/llm/model_llama.py", line 174, in forward
    data = seq_layer.forward(data)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mLoRA/mlora/model/llm/model_llama.py", line 138, in forward
    return forward_func_dict[module_name](data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mLoRA/mlora/model/llm/model_llama.py", line 108, in decoder_forward
    set_backward_tracepoint(output.grad_fn, "b_checkpoint")
  File "/mLoRA/mlora/profiler/profiler.py", line 139, in set_backward_tracepoint
    if TRACEPOINT_KEY in grad_fn.metadata():
                         ^^^^^^^^^^^^^^^^^^
TypeError: 'dict' object is not callable
Generating '/tmp/nsys-report-4fe1.qdstrm'
```

I executed the command:

```shell
nsys profile -w true -t cuda,nvtx -s none -o test_report -f true -x true python mlora_train.py --base_model TinyLlama/TinyLlama-1.1B-Chat-v0.4 --device "cuda:0" --config /projects/bcrn/mLoRA/demo/lora/lora_case_1.yaml --trace
```

or simply added `--trace` after my normal command.

Could you please help me understand why this error is occurring? And could you help me with using trace? Thanks!

@yezhengmao1 (Collaborator) commented:

It seems PyTorch changed the `metadata` function. Can you check the type of `grad_fn.metadata`? It seems it's no longer a function.

@EricLabile (Author) commented:

Thanks for the reply. The type of `grad_fn.metadata` is `<class 'dict'>`, and it is empty at this point.

@yezhengmao1 (Collaborator) commented:

> Thanks for the reply. `grad_fn.metadata` type is `<class 'dict'>` and it is empty at this point.

Can we just change `grad_fn.metadata()` to `grad_fn.metadata`? I don't know when PyTorch changed this function.

@EricLabile (Author) commented:

Thank you! It works well. However, I’m unable to open test_report.nsys-rep with NVIDIA Nsight Compute. Could you provide a compilable version as mentioned in the pull request?

(I’ll provide a CLI version later to automatically generate key summaries.)

Thanks!

@yezhengmao1 (Collaborator) commented:

> Thank you! It works well. However, I'm unable to open test_report.nsys-rep with NVIDIA Nsight Compute. Could you provide a compilable version as mentioned in the pull request?
>
> (I'll provide a CLI version later to automatically generate key summaries.)

You can use any version, but make sure your NVIDIA Nsight viewer version is higher than the version of the nsys CLI that generated the report. Just install the latest release: https://developer.nvidia.com/nsight-systems
