
How can DeepSpeed be configured to prevent the merging of parameter groups? #6878

CLL112 opened this issue Dec 16, 2024 · 3 comments

CLL112 commented Dec 16, 2024

I have re-implemented the optimizer to group parameters and set a different learning rate for each group. However, after switching to DeepSpeed, all of the param_groups are merged into one. How can this be prevented? My DeepSpeed configuration is below:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

tjruwase (Contributor) commented

@CLL112, DeepSpeed already supports this. For example, we do not merge weights and biases, which are typically implemented as separate param groups. It would be helpful to have a full repro so we can understand what is going on in your scenario.
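
Below is a minimal sketch (not from this thread) of that behaviour: a torch optimizer built with two param groups is handed to deepspeed.initialize(), and the group count can then be checked on the client optimizer. The toy model, the inline config, and the printed count are assumptions for illustration only.

# Hedged sketch: two param groups (weights vs. biases) passed as a client
# optimizer to deepspeed.initialize(); the config deliberately has no
# "optimizer" block because the optimizer is supplied in code.
import torch
import deepspeed

model = torch.nn.Linear(16, 16)
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "weight" in n],
     "lr": 5e-5, "weight_decay": 0.1},
    {"params": [p for n, p in model.named_parameters() if "bias" in n],
     "lr": 5e-4, "weight_decay": 0.0},
]
client_optimizer = torch.optim.AdamW(param_groups)

ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},          # mirrors the mixed-precision setup in the thread
    "zero_optimization": {"stage": 3},
}
engine, ds_optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=client_optimizer,
    config=ds_config,
)

# If the groups are preserved, the client optimizer should still report two
# entries; how the ZeRO wrapper exposes its own groups depends on the stage.
print("client param groups:", len(client_optimizer.param_groups))

Run this under the deepspeed launcher (e.g. deepspeed repro.py) so the distributed backend is initialized.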

CLL112 (Author) commented Dec 17, 2024

> @CLL112, DeepSpeed already supports this. For example, we do not merge weights and biases, which are typically implemented as separate param groups. It would be helpful to have a full repro so we can understand what is going on in your scenario.

I have rewritten the optimizer and set a separate learning rate for the act_fn parameters in the model. Training works as expected without DeepSpeed, but with DeepSpeed the custom groups no longer appear to take effect:

# `model` and `args` (a TrainingArguments instance) are defined earlier in the training script.
from transformers import Trainer

decay_parameters = Trainer.get_decay_parameter_names(None, model)
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters() if (n in decay_parameters and p.requires_grad and
                                                               'act_fn' not in n)],
        'weight_decay': args.weight_decay,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters() if (n not in decay_parameters and p.requires_grad)],
        'weight_decay': 0.0,
        'lr': args.learning_rate,  # Default learning rate
    },
    {
        'params': [p for n, p in model.named_parameters() if (n in decay_parameters and p.requires_grad and
                                                               'act_fn' in n)],
        'weight_decay': 0.0,
        'lr': 0.5,  # Custom learning rate for act_fn
    },
]

optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args)
optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)

# Debugging optimizer parameter groups
for i, param_group in enumerate(optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")

Printing the parameter groups gives:

Param group 0: lr=5e-05, weight_decay=0.1
Param group 1: lr=5e-05, weight_decay=0.0
Param group 2: lr=0.5, weight_decay=0.0

However, inside transformers' Trainer, right after self.optimizer.step(), I checked again with:

self.optimizer.step()
for i, param_group in enumerate(self.optimizer.optimizer.param_groups):
    print(f"Param group {i}: lr={param_group.get('lr', args.learning_rate)}, "
          f"weight_decay={param_group['weight_decay']}")

The output is:

Param group 0: lr=5e-05, weight_decay=0.1

This is strange; param groups 1 and 2 are no longer there. I am using DeepSpeed ZeRO-3. Does ZeRO-3 change the param groups?
The ZeRO-3 configuration is as follows:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

tjruwase (Contributor) commented

@CLL112, can you please share a simple but complete repro so we can debug this?
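
For reference, here is a hedged sketch of the kind of standalone repro being asked for, modeled on the scenario above. The toy module, the parameter named so that "act_fn" appears in its name, the inline ZeRO-3 config, and the single training step are all assumptions, not code from the thread.

# Hypothetical repro sketch: three param groups (decay, no-decay, act_fn with
# lr=0.5), passed to deepspeed.initialize with a ZeRO-3 config; the group
# count is printed before and after one optimizer step.
import torch
import deepspeed


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        # Named so that "act_fn" appears in the parameter name, as in the issue.
        self.act_fn_scale = torch.nn.Parameter(torch.ones(8))

    def forward(self, x):
        return self.linear(x) * self.act_fn_scale


model = Toy()
groups = [
    {"params": [p for n, p in model.named_parameters()
                if "act_fn" not in n and "bias" not in n],
     "lr": 5e-5, "weight_decay": 0.1},
    {"params": [p for n, p in model.named_parameters() if "bias" in n],
     "lr": 5e-5, "weight_decay": 0.0},
    {"params": [p for n, p in model.named_parameters() if "act_fn" in n],
     "lr": 0.5, "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(groups)

ds_config = {
    "train_batch_size": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}
engine, ds_opt, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)

print("groups before step:", len(optimizer.param_groups))

x = torch.randn(4, 8).to(engine.device).to(torch.bfloat16)
loss = engine(x).float().pow(2).mean()
engine.backward(loss)
engine.step()

print("groups after step:", len(optimizer.param_groups))

Launch with the deepspeed launcher (e.g. deepspeed repro.py); whether the counts stay at 3 under ZeRO-3 is exactly what the issue is about.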
