I'm trying to run Neuralangelo with the test set "lego," but I haven't been able to get past the point where I invoke the command:
torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --show_pbar
This command throws an error I haven't been able to resolve. I've tried changing many of the parameters in the project's files, but nothing has fixed it. Below is the full error I'm encountering, in case anyone has a solution.
Thank you.
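For context, the traceback below ends inside wandb.watch (called from imaginaire/trainers/base.py), not in the model or data code. A minimal sketch of the same call chain, run outside Neuralangelo, might show whether the wandb install itself is at fault; the project name and the tiny nn.Linear model are just stand-ins, not part of the project:

# Minimal isolation sketch (assumption: the AttributeError is reproducible
# outside Neuralangelo). Mirrors the failing chain: wandb.init() -> wandb.watch(model).
import torch.nn as nn
import wandb

run = wandb.init(project="neuralangelo-debug", mode="offline")  # offline: no login needed
model = nn.Linear(4, 4)  # stand-in for the real model module
wandb.watch(model)       # the call that raises in the traceback below
run.finish()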
Error:
torchrun --nproc_per_node=${GPUS} train.py --logdir=logs/${GROUP}/${NAME} --config=${CONFIG} --show_pbar
(Setting affinity with NVML failed, skipping...)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Training with 1 GPUs.
Using random seed 0
Make folder logs/example_group/example_name
wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
[rank0]:[W Utils.hpp:108] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarString)
Allow TensorFloat32 operations on supported devices
Train dataset length: 100
Val dataset length: 4
Training from scratch.
Initialize wandb
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 104, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/d/Documents/neuralangelo/train.py", line 85, in main
[rank0]:     trainer.init_wandb(cfg,
[rank0]:   File "/mnt/d/Documents/neuralangelo/imaginaire/trainers/base.py", line 269, in init_wandb
[rank0]:     wandb.watch(self.model_module)
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_watch.py", line 49, in watch
[rank0]:     tel.feature.watch = True
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/telemetry.py", line 42, in __exit__
[rank0]:     self._run._telemetry_callback(self._obj)
[rank0]:   File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 799, in _telemetry_callback
[rank0]:     self._telemetry_obj.MergeFrom(telem_obj)
[rank0]: AttributeError: 'Run' object has no attribute '_telemetry_obj'
E0822 21:49:57.518840 139941045491520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 25214) of binary: /home/miguel12/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/home/miguel12/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miguel12/miniconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-08-22_21:49:57
host : DESKTOP-Q0DS9I2.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 25214)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
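In case it helps anyone hitting the same AttributeError: a possible stopgap, assuming the crash is confined to the wandb integration and that Neuralangelo's init_wandb does not pass an explicit mode to wandb.init (so the standard WANDB_MODE variable is honored), is to turn wandb off for the run, e.g. at the very top of train.py or exported in the shell before invoking torchrun:

# Hedged workaround sketch: disable wandb logging so training can proceed while
# the telemetry error is investigated. "offline" keeps local logs instead of none.
import os
os.environ["WANDB_MODE"] = "disabled"

Whether this actually sidesteps the crash depends on the wandb build; pinning wandb to a different release would be the other obvious thing to try.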