
replace transformerencoder with mamba #199

Open
sunxin010205 opened this issue Apr 29, 2024 · 9 comments

@sunxin010205

Hi!
After replacing an eight-layer Transformer encoder with Mamba, the training loss fails to decrease. Could it be that Mamba doesn't perform as effectively as the Transformer in the diffusion model? Looking forward to your response.
Here is my code:

mamba.txt
mdm.txt
minimamba.txt
loss_log.txt
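
For readers following along, here is a minimal sketch of the kind of swap being described: replacing the sequence-first nn.TransformerEncoder stack with a stack of Mamba blocks. It assumes the mamba-ssm package; the class and argument names (MambaEncoder, d_model, num_layers) are illustrative and are not taken from the attached files.

```python
# Minimal sketch, not the code from the attachments: a Mamba stack shaped like a
# drop-in replacement for nn.TransformerEncoder (sequence-first tensors in and out).
# Assumes `pip install mamba-ssm`; all hyperparameters are illustrative.
import torch
import torch.nn as nn
from mamba_ssm import Mamba


class MambaEncoder(nn.Module):
    def __init__(self, d_model=512, num_layers=8):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(num_layers))
        self.blocks = nn.ModuleList(
            Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
            for _ in range(num_layers)
        )

    def forward(self, x):
        # nn.TransformerEncoder (batch_first=False) takes (seq_len, batch, d_model);
        # the Mamba block expects (batch, seq_len, d_model), so permute around the stack.
        x = x.permute(1, 0, 2)
        for norm, block in zip(self.norms, self.blocks):
            x = x + block(norm(x))  # pre-norm residual around each Mamba block
        return x.permute(1, 0, 2)


if __name__ == "__main__":
    enc = MambaEncoder(d_model=512, num_layers=8).cuda()
    xseq = torch.randn(60, 4, 512, device="cuda")  # (seq_len, batch, d_model)
    out = enc(xseq)
    assert out.shape == xseq.shape
```

One structural difference worth keeping in mind: each Mamba block scans the sequence in one direction, whereas the self-attention in nn.TransformerEncoder is bidirectional, so the two are not strictly interchangeable.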

@GuyTevet
Owner

GuyTevet commented May 7, 2024

Hi @sunxin010205, what is Mamba?

@sunxin010205
Author

Hi, what is Mamba?

Mamba (Linear-Time Sequence Modeling with Selective State Spaces) is a new architecture proposed as a linear-complexity alternative to the Transformer.
When I use Mamba instead of the Transformer encoder, the other losses are normal; only loss_q3 does not decrease. Do you know what might be going on? Looking forward to your early reply!

@GuyTevet
Owner

I'm not familiar with this one.
What's loss_q3?

@Suzixin7894946

Hi! After replacing an eight-layer Transformer encoder with Mamba, the training loss fails to decrease. Could it be that Mamba doesn't perform as effectively as the Transformer in the diffusion model? Looking forward to your response. Here is my code.

mamba.txt mdm.txt minimamba.txt loss_log.txt

I tried replacing the Transformer layer with other architectures and ran into the same situation as you: the loss stays around 2.4 and cannot be reduced. May I ask if you have since found a solution?

@sunxin010203

Hi! After replacing an eight-layer Transformer encoder with Mamba, the training loss fails to decrease. Could it be that Mamba doesn't perform as effectively as the Transformer in the diffusion model? Looking forward to your response. Here is my code.
mamba.txt mdm.txt minimamba.txt loss_log.txt

I tried replacing the Transformer layer with other architectures and ran into the same situation as you: the loss stays around 2.4 and cannot be reduced. May I ask if you have since found a solution?

Sorry, I have not continued this part of the work for now, but my overall loss after the replacement is about 0.2. If you do find the problem, could you please tell me the solution? I have one idea: perhaps the original encoder layer is no longer suitable after the Transformer is replaced, so you could try other encoder approaches.

@Suzixin7894946

Suzixin7894946 commented Jun 18, 2024 via email

@sunxin010203

Thank you for your reply. I am also trying to use Mamba or a combination of Mamba with the transformer to replace the transformer layer. May I ask if you have conducted any experiments combining Mamba with the transformer?

I have used 4 Mamba layers + 4 Transformer layers, and the loss goes down normally; the results are also decent. I used a structure that alternates one Mamba layer with one Transformer layer, which may be a bit faster, with roughly the same number of parameters.
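
For reference, a minimal sketch of the alternating structure described above (one Mamba layer followed by one TransformerEncoderLayer, four of each). It assumes the mamba-ssm package; the dimensions, head count, and class names are illustrative rather than taken from the poster's code.

```python
# Sketch of an alternating Mamba / Transformer encoder (4 + 4 layers), as described above.
# Assumes `pip install mamba-ssm`; all hyperparameters and names are illustrative.
import torch.nn as nn
from mamba_ssm import Mamba


class ResidualMamba(nn.Module):
    """Pre-norm residual wrapper around one Mamba block, sequence-first in and out."""

    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, x):                                  # x: (seq_len, batch, d_model)
        h = self.mamba(self.norm(x).transpose(0, 1))       # Mamba wants (batch, seq_len, d_model)
        return x + h.transpose(0, 1)


class HybridEncoder(nn.Module):
    """Alternates one Mamba layer with one TransformerEncoderLayer, num_pairs times."""

    def __init__(self, d_model=512, n_heads=4, ff_size=1024, dropout=0.1, num_pairs=4):
        super().__init__()
        layers = []
        for _ in range(num_pairs):
            layers.append(ResidualMamba(d_model))
            layers.append(nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=ff_size,
                dropout=dropout, activation="gelu"))        # sequence-first by default
        self.layers = nn.ModuleList(layers)

    def forward(self, x):                                   # x: (seq_len, batch, d_model)
        for layer in self.layers:
            x = layer(x)
        return x
```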

@Suzixin7894946

Suzixin7894946 commented Jun 18, 2024 via email

@sunxin010203

I'm not familiar with this one. What's loss_q3?

The logger output during training looks like this; I can't find the definition of loss_q3 in the code.

| grad_norm | 1.84 |
| loss | 1.37 |
| loss_q0 | 1.33 |
| loss_q1 | 1.21 |
| loss_q2 | 1.11 |
| loss_q3 | 1.82 |
| param_norm | 462 |
| rot_mse | 1.37 |
| rot_mse_q0 | 1.33 |
| rot_mse_q1 | 1.21 |
| rot_mse_q2 | 1.11 |
| rot_mse_q3 | 1.82 |
| samples | 64 |
| step | 0 |
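
For what it's worth, training loops derived from OpenAI's guided-diffusion log every loss term twice: once overall and once binned by the quartile of the sampled diffusion timestep, with the key built by string formatting rather than an explicit variable, which is why there is no `loss_q3` definition to grep for. Assuming MDM's trainer follows that convention, loss_q3 would be the loss averaged over samples whose timestep falls in the last (noisiest) quartile. A sketch of that logging pattern:

```python
# Sketch of guided-diffusion-style quartile logging that produces the *_q0..*_q3 keys.
# Assumption: MDM's TrainLoop follows this convention; `logger` stands in for the
# project's metric logger (anything with a logkv_mean(key, value) method).
def log_loss_dict(diffusion, ts, losses, logger):
    for key, values in losses.items():
        logger.logkv_mean(key, values.mean().item())
        # Bin each per-sample loss by the quartile of its sampled timestep t, so
        # "loss_q3" is the running mean over the noisiest 25% of timesteps.
        for sub_t, sub_loss in zip(ts.cpu().numpy(), values.detach().cpu().numpy()):
            quartile = int(4 * sub_t / diffusion.num_timesteps)
            logger.logkv_mean(f"{key}_q{quartile}", sub_loss)
```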
