
Issue with Trace Option Causing TypeError in mLoRA Training #268

Open
EricLabile opened this issue Nov 4, 2024 · 5 comments

Comments

@EricLabile commented Nov 4, 2024:

I encountered an error when using the --trace option. The error message indicates the following:

```
/u/.conda/envs/mlora/lib/python3.12/site-packages/bitsandbytes/autograd/functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/mLoRA/mlora_train.py", line 68, in <module>
    executor.execute()
  File "/mLoRA/mlora/executor/executor.py", line 110, in execute
    output = self.model_.forward(data.model_data())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mLoRA/mlora/model/llm/model_llama.py", line 174, in forward
    data = seq_layer.forward(data)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mLoRA/mlora/model/llm/model_llama.py", line 138, in forward
    return forward_func_dict[module_name](data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mLoRA/mlora/model/llm/model_llama.py", line 108, in decoder_forward
    set_backward_tracepoint(output.grad_fn, "b_checkpoint")
  File "/mLoRA/mlora/profiler/profiler.py", line 139, in set_backward_tracepoint
    if TRACEPOINT_KEY in grad_fn.metadata():
                         ^^^^^^^^^^^^^^^^^^
TypeError: 'dict' object is not callable
Generating '/tmp/nsys-report-4fe1.qdstrm'
```

I executed the command:

```shell
nsys profile -w true -t cuda,nvtx -s none -o test_report -f true -x true python mlora_train.py --base_model TinyLlama/TinyLlama-1.1B-Chat-v0.4 --device "cuda:0" --config /projects/bcrn/mLoRA/demo/lora/lora_case_1.yaml --trace
```

or simply added `--trace` after my normal command.

Could you please help me understand why this error is occurring? And could you help me with using trace? Thanks!

@yezhengmao1 (Collaborator) commented:

It seems PyTorch changed the `metadata` function. Can you check the type of `grad_fn.metadata`? It seems it's no longer a function.

@EricLabile (Author) commented:

Thanks for the reply. The type of `grad_fn.metadata` is `<class 'dict'>`, and it is empty at this point.

@yezhengmao1 (Collaborator) commented:

> Thanks for the reply. `grad_fn.metadata` type is `<class 'dict'>` and it is empty at this point.

Can we just change `grad_fn.metadata()` to `grad_fn.metadata`? I don't know when PyTorch changed this function.

@EricLabile (Author) commented:

Thank you! It works well. However, I’m unable to open test_report.nsys-rep with NVIDIA Nsight Compute. Could you provide a compilable version as mentioned in the pull request?

(I’ll provide a CLI version later to automatically generate key summaries.)

Thanks!

@yezhengmao1 (Collaborator) commented:

> Thank you! It works well. However, I'm unable to open test_report.nsys-rep with NVIDIA Nsight Compute. Could you provide a compilable version as mentioned in the pull request?
>
> (I'll provide a CLI version later to automatically generate key summaries.)

You can use any version, but make sure your NVIDIA Nsight viewer version is higher than the version of the nsys CLI that generated the report. Just install the latest release: https://developer.nvidia.com/nsight-systems
