
How can DeepSpeed be configured to prevent the merging of parameter groups? #6878

CLL112 opened this issue Dec 16, 2024 · 3 comments

CLL112 commented Dec 16, 2024

I have re-implemented the optimizer to group parameters and set a different learning rate for each group. However, after switching to DeepSpeed, all of the param_groups are merged into one. How can this be prevented? My DeepSpeed configuration is below:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

tjruwase (Contributor) commented

@CLL112, DeepSpeed already supports this. For example, we do not merge weights and biases, which are typically implemented as separate param groups. It would be helpful to have a full repro so we can understand what is going on in your scenario.
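
Below is a minimal sketch (not from this thread) of that behaviour: a torch optimizer built with two param groups is handed to deepspeed.initialize(), and the group count can then be checked on the client optimizer. The toy model, the inline config, and the printed count are assumptions for illustration only.

# Hedged sketch: two param groups (weights vs. biases) passed as a client
# optimizer to deepspeed.initialize(); the config deliberately has no
# "optimizer" block because the optimizer is supplied in code.
import torch
import deepspeed

model = torch.nn.Linear(16, 16)
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "weight" in n],
     "lr": 5e-5, "weight_decay": 0.1},
    {"params": [p for n, p in model.named_parameters() if "bias" in n],
     "lr": 5e-4, "weight_decay": 0.0},
]
client_optimizer = torch.optim.AdamW(param_groups)

ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},          # mirrors the mixed-precision setup in the thread
    "zero_optimization": {"stage": 3},
}
engine, ds_optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=client_optimizer,
    config=ds_config,
)

# If the groups are preserved, the client optimizer should still report two
# entries; how the ZeRO wrapper exposes its own groups depends on the stage.
print("client param groups:", len(client_optimizer.param_groups))

Run this under the deepspeed launcher (e.g. deepspeed repro.py) so the distributed backend is initialized.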

CLL112 (Author) commented Dec 17, 2024

> @CLL112, DeepSpeed already supports this. For example, we do not merge weights and biases, which are typically implemented as separate param groups. It would be helpful to have a full repro so we can understand what is going on in your scenario.

I have rewritten the optimizer and set a separate learning rate for the act_fn parameters in the model. Training works as expected without DeepSpeed, but with DeepSpeed the custom groups no longer appear to take effect:

# `model` and `args` (a TrainingArguments instance) are defined earlier in the training script.
from transformers import Trainer

decay_parameters = Trainer.get_decay_parameter_names(None, model)
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters() if (n in decay_parameters and p.requires_grad and
                                                               'act_fn' not in n)],
        'weight_decay': args.weight_decay,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters() if (n not in decay_parameters and p.requires_grad)],
        'weight_decay': 0.0,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters() if (n in decay_parameters and p.requires_grad and
                                                               'act_fn' in n)],
        'weight_decay': 0.0,
        'lr': 0.5,  # Custom learning rate for act_fn
    },
]

optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args)
optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

# Debugging optimizer parameter groups
for i, param_group in enumerate(optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")

Printing the parameter groups gives:

Param group 0: lr=5e-05, weight_decay=0.1
Param group 1: lr=5e-05, weight_decay=0.0
Param group 2: lr=0.5, weight_decay=0.0

However, inside transformers' Trainer, right after self.optimizer.step(), I checked again with:

self.optimizer.step()
for i, param_group in enumerate(self.optimizer.optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")

The output is:

Param group 0: lr=5e-05, weight_decay=0.1

This is strange; param groups 1 and 2 are no longer there. I am using DeepSpeed ZeRO-3. Does ZeRO-3 change the param groups?
The ZeRO-3 configuration is as follows:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

tjruwase (Contributor) commented

@CLL112, can you please share a simple but complete repro so we can debug this?
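
For reference, here is a hedged sketch of the kind of standalone repro being asked for, modeled on the scenario above. The toy module, the parameter named so that "act_fn" appears in its name, the inline ZeRO-3 config, and the single training step are all assumptions, not code from the thread.

# Hypothetical repro sketch: three param groups (decay, no-decay, act_fn with
# lr=0.5), passed to deepspeed.initialize with a ZeRO-3 config; the group
# count is printed before and after one optimizer step.
import torch
import deepspeed


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        # Named so that "act_fn" appears in the parameter name, as in the issue.
        self.act_fn_scale = torch.nn.Parameter(torch.ones(8))

    def forward(self, x):
        return self.linear(x) * self.act_fn_scale


model = Toy()
groups = [
    {"params": [p for n, p in model.named_parameters()
                if "act_fn" not in n and "bias" not in n],
     "lr": 5e-5, "weight_decay": 0.1},
    {"params": [p for n, p in model.named_parameters() if "bias" in n],
     "lr": 5e-5, "weight_decay": 0.0},
    {"params": [p for n, p in model.named_parameters() if "act_fn" in n],
     "lr": 0.5, "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(groups)

ds_config = {
    "train_batch_size": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}
engine, ds_opt, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)

print("groups before step:", len(optimizer.param_groups))

x = torch.randn(4, 8).to(engine.device).to(torch.bfloat16)
loss = engine(x).float().pow(2).mean()
engine.backward(loss)
engine.step()

print("groups after step:", len(optimizer.param_groups))

Launch with the deepspeed launcher (e.g. deepspeed repro.py); whether the counts stay at 3 under ZeRO-3 is exactly what the issue is about.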
