Nexus VM runs out of RAM #4074

Closed
TonyWildish-BH opened this issue Aug 20, 2024 · 25 comments · Fixed by #4189 · May be fixed by Barts-Life-Science/AzureTRE#213
Labels
bug Something isn't working

Comments

@TonyWildish-BH
Contributor

Description

I'm trying to merge the current AzureTRE into my own repository to get the latest changes. The merge went smoothly, no conflicts, and now I'm testing it.

The issue I see is that the Nexus VM gets wedged after a while. I'm able to create one or two VMs, either Windows or Linux, and they work, booting to completion. However, if I deploy more VMs, they eventually get stuck, with Nexus failing to respond.

Restarting the Nexus VM clears things up for a while, but the problem recurs just a short while later, when I deploy more VMs.

I'm able to connect to the Nexus VM in the Azure portal, via the bastion, but when the problem happens, that session gets wedged too. It's a whole-VM phenomenon.

I haven't changed anything relating to any shared services in my TRE, and in particular, I haven't touched Nexus at all; the configuration there is exactly as it is in this repo. So while I can't rule out that it's something I've done, I'm wondering if anyone else has seen this, or anything like it?

Any suggestions of what to look for would be greatly appreciated.

@TonyWildish-BH TonyWildish-BH added the question Further information is requested label Aug 20, 2024
@TonyWildish-BH
Contributor Author

Update: This is reproducible in the current HEAD of this repository, so I'd like to redefine this as a bug, not a question.

@tim-p-allen
Collaborator

Hey @TonyWildish-BH what version of Nexus do you have deployed?

@TonyWildish-BH
Contributor Author

I pulled the HEAD a week ago, it's whatever's there, I haven't touched the nexus code. We did a separate test, pulling this repo yesterday, and that shows the same problem. That's why I'm thinking it's not anything I've done, since that second test had no modifications whatsoever w.r.t. this repo.

@tim-p-allen
Collaborator

Sure. What version is the nexus template?

@TonyWildish-BH
Contributor Author

3.0.0

@tim-p-allen
Collaborator

Thanks. I'll take a look, see if I can reproduce.

@TonyWildish-BH
Contributor Author

hi @tim-allen-ck, did you get a chance to look at this?

What I have found since is that the Windows VMs seem not to provoke the problem, though the Linux VMs definitely do. Probably because they have so much more to update than the Windows VMs.

@akolensky

Hi @tim-allen-ck , I understand it is a busy season - and wondered if this has been looked into?

@marrobi
Member

marrobi commented Sep 18, 2024

@akolensky what troubleshooting steps have you tried? It's not something we've seen elsewhere.

@TonyWildish-BH
Contributor Author

All I've managed to deduce so far is that it seems to be related to the Linux VMs doing a mass update. The load average in the nexus container goes over 40, and it stops responding, completely - which isn't surprising at that load average.

Rebooting the Nexus VM clears the issue, but a 'restart' in the portal doesn't work because the VM doesn't respond to it. You have to 'stop' and then 'start' it, which usually takes a very long time.

The problem is repeatable, but not guaranteed. With a fresh install of nexus, it wedges about ⅔ of the time, on one of the first 2 or 3 Linux VMs - often the first. It's certainly not rare.
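
A load average above 40 is easier to judge relative to the number of CPU cores. The sketch below is a hypothetical helper (the name `load_status` and the threshold of 1.0 per core are assumptions, not part of AzureTRE) that classifies a 1-minute load average against a core count. On Linux the load average counts tasks in uninterruptible sleep as well as runnable ones, so a high value can indicate I/O or memory pressure rather than CPU contention alone.

```shell
# Hypothetical helper: classify a load average relative to the core count.
# Usage: load_status <1-minute load average> <number of CPU cores>
load_status() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    ratio = load / cores
    # Assumed rule of thumb: more than 1.0 per core means work is queueing.
    state = (ratio > 1) ? "overloaded" : "ok"
    printf "%s (%.1f per core)\n", state, ratio
  }'
}

# On the Nexus VM itself, live values could be fed in like this:
#   load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Running the commented line inside the VM, rather than inside the container, shows the pressure on the host as a whole.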

@marrobi
Member

marrobi commented Sep 19, 2024

Have you added some custom repositories?

We've got instances running elsewhere and Nexus has been working without issue for long periods. So something must be different in your instance.

Have you tried using a larger VM?

It might be that the container needs some resource limits, so as to leave the host some resources.
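
As a rough illustration of the resource-limit idea, the sketch below builds `--memory` and `--cpus` arguments for `docker run` from the host size and an amount reserved for the host OS. The helper name `container_limits` and the example figures are assumptions for illustration only; the AzureTRE deployment script does not set these flags.

```shell
# Hypothetical helper: print docker run resource flags that leave headroom for the host.
# Usage: container_limits <host RAM in MB> <MB reserved for the host> <CPU limit>
container_limits() {
  limit_mb=$(( $1 - $2 ))
  if [ "$limit_mb" -le 0 ]; then
    echo "reserved memory exceeds host memory" >&2
    return 1
  fi
  # --memory caps the container's RAM; --cpus accepts fractional values.
  echo "--memory=${limit_mb}m --cpus=$3"
}

# Example: a 4096 MB host, keeping 1024 MB and half a core for the OS:
#   docker run $(container_limits 4096 1024 1.5) ... <image>
```

A container that exceeds its `--memory` limit is killed by the kernel rather than starving the host, so a limit like this only helps if the JVM's own settings fit inside it.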

@TonyWildish-BH
Contributor Author

Marcus, this is in fresh installations, predominantly. It looks like a first-time cache-filling problem where the requests are not throttled, and the server gets overloaded. After rebooting, it tends to behave itself, but still spits the dummy every now and then.

This happens in a virgin installation, with unmodified code, as stated. No custom anything. We see it in the pure MS code base, and also in our own, where we have not touched anything relating to nexus, or to any of the core resources.

This is repeatable, three different people using three different setups have seen it, including one outside Barts. It's not our environment.

I did try using a larger VM (64 GB x 8 cores), that didn't help.

Restricting the container isn't likely to help much, though it might let the host OS kill and restart it, at best. If the container is spawning > 40 threads, all bets are off, that's too many. My best guess is that the server needs throttling, which means either Nexus or Java VM configuration.

Do you know if Tim tried to reproduce it?

@tim-p-allen
Collaborator

Hi @TonyWildish-BH I've not been able to reproduce it. Was it only 1 or 2 VMs you'd deployed when you'd found the issue?

@TonyWildish-BH
Contributor Author

hi @tim-allen-ck, I've been able to reproduce it on the first Linux VM I boot in a new SDE. It happens about 50% of the time in that situation, more or less.

@marrobi
Member

marrobi commented Sep 19, 2024

What's the exact SKU you are using for the VM? What additional software is installed?

In the Terraform I can see it's a B-series VM. If you are using the default, it might not be appropriate for your needs given the nature of burstable CPU. I suggest you try a different SKU.

It would be useful if the SKU were a parameter.

Also, are you using VM images with packages preinstalled, as recommended, or are you installing them using a startup script on the VM?

@TonyWildish-BH
Contributor Author

This has all happened with a completely unmodified installation from the HEAD of this repository. A fresh checkout of the code, with nothing changed. Not the Nexus VM, not the Linux template I'm trying to boot from it. Nothing.

I set my config.yaml at the top level and install, from scratch, following the instructions. I create a Linux VM, and with high probability, Nexus wedges.

@marrobi
Member

marrobi commented Sep 20, 2024

> What's the exact SKU you are using for the VM? What additional software is installed?
>
> In the Terraform I can see it's a B-series VM. If you are using the default, it might not be appropriate for your needs given the nature of burstable CPU. I suggest you try a different SKU.
>
> It would be useful if the SKU were a parameter.
>
> Also, are you using VM images with packages preinstalled, as recommended, or are you installing them using a startup script on the VM?

@akolensky are you able to help @TonyWildish-BH answer my question above? Thanks.

@TonyWildish-BH
Contributor Author

Hi Marcus,

> What's the exact SKU you are using for the VM? What additional software is installed?

SKU is 22_04-lts-gen2.
As stated, there is no additional software installed. None.

> Also, are you using VM images with packages preinstalled, as recommended, or are you installing them using a startup script on the VM?

As stated, I'm seeing this error on multiple installations. One is our own, with custom VMs that have nearly all the packages installed, the other is the unmodified Microsoft codebase, commit hash c3e4c8d. That uses a cloud-init script to update the vanilla OS which comes with the TRE.

I see the issue in both these environments, therefore, this is not an issue of customisation from our side.

@marrobi
Member

marrobi commented Sep 20, 2024

That's the image sku rather than VM SKU. The VM SKU will be a letter followed by number(s).

My thinking is that you have something different going on in the VM. Antivirus, maybe? That, in conjunction with the VM scripts, is causing all the credits to be used on the B-series Nexus VM.

In addition as per https://microsoft.github.io/AzureTRE/latest/tre-templates/user-resources/guacamole-linux-vm/ I suggest you use VM images in production.

@TonyWildish-BH
Contributor Author

Where do I find the VM SKU?

Whatever is happening on the VM is whatever happens out of the box, because we haven't modified it in any way at all. There is no customisation of the Nexus VM. We haven't changed anything there. We haven't installed anything extra. Nothing.

I'm aware of that recommendation, and we will indeed be using our own VM images, but I need this bug fixed before we can consider going into production.
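
For anyone else looking for the VM SKU: one way to read it is with the Azure CLI, as sketched below. `az vm show` with `--query hardwareProfile.vmSize` prints the compute size; the resource group and VM name arguments are placeholders to fill in for your deployment. The second function is a hypothetical check (not an official rule) for telling a VM size such as `Standard_B2s` apart from an image SKU such as `22_04-lts-gen2`.

```shell
# Print the compute size (VM SKU) of a VM.
# Usage: vm_size <resource group> <VM name>   (both are placeholders)
vm_size() {
  az vm show --resource-group "$1" --name "$2" \
    --query hardwareProfile.vmSize --output tsv
}

# Hypothetical check: does the string look like a VM size rather than an image SKU?
# Assumes sizes follow the common Standard_<letters><digits>... naming pattern.
looks_like_vm_size() {
  case "$1" in
    Standard_[A-Z]*[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}
```

The same value is shown as "Size" on the VM's overview page in the Azure portal.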

@marrobi
Member

marrobi commented Dec 11, 2024

This has just happened for the first time for me, looks like a memory issue:
Image

There are plenty of CPU credits left on the B-series SKU.

@marrobi
Member

marrobi commented Dec 11, 2024

I have resized to a Standard D2s v3, which has more RAM, and will see if it continues to run out of RAM.

@marrobi marrobi added the bug (Something isn't working) label and removed the question (Further information is requested) label Dec 11, 2024
@marrobi marrobi changed the title from "Nexus VM gets wedged?" to "Nexus VM runs out of RAM" Dec 11, 2024
@marrobi
Member

marrobi commented Dec 11, 2024

Looking at the Nexus Dockerfile, Java memory usage is configured as follows:

"-Xms2703m -Xmx2703m -XX:MaxDirectMemorySize=2703m -Djava.util.prefs.userRoot=${NEXUS_DATA}/javaprefs"

So maybe 4 GB of RAM, with OS memory consumption on top, was not always sufficient. I hope that with 8 GB of RAM we retain some available RAM on the host OS.
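
To make the arithmetic explicit: the settings quoted above allow up to 2703 MB of heap plus up to 2703 MB of direct memory, which is 5406 MB and already more than a 4 GB host holds. The sketch below works that through; the function names and the 512 MB OS reserve are illustrative assumptions. The direct-memory figure is a cap rather than an up-front allocation, so actual usage depends on the workload.

```shell
# Worst-case JVM footprint in MB: heap (-Xmx) plus direct buffers
# (-XX:MaxDirectMemorySize). Metaspace, thread stacks and other native
# overhead are ignored, so the real ceiling is somewhat higher.
jvm_worst_case_mb() {
  echo $(( $1 + $2 ))
}

# Succeeds if the worst case fits in host RAM after reserving some for the OS.
# Usage: fits_in_host <host RAM MB> <MB reserved for OS> <JVM worst case MB>
fits_in_host() {
  [ $(( $1 - $2 )) -ge "$3" ]
}

# Example with the values from the thread (the 512 MB reserve is an assumption):
#   fits_in_host 4096 512 "$(jvm_worst_case_mb 2703 2703)"   # fails
#   fits_in_host 8192 512 "$(jvm_worst_case_mb 2703 2703)"   # succeeds
```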

@TonyWildish-BH
Copy link
Contributor Author

@marrobi, I've also tried resizing the VM, but the Nexus container always takes only 2.7 GB, and still freezes. I have a patch for this that I'm testing now, calculating the memory available to Java from the host VM and setting the INSTALL4J_ADD_VM_PARAMS environment variable for the Docker container. Should know in a few days if it works, and can contribute it back.

It's a very small patch, here it is, in case you want to try it yourself:

diff --git a/templates/shared_services/sonatype-nexus-vm/scripts/deploy_nexus_container.sh b/templates/shared_services/sonatype-nexus-vm/scripts/deploy_nexus_container.sh
index 84e4d964..5246c5cc 100644
--- a/templates/shared_services/sonatype-nexus-vm/scripts/deploy_nexus_container.sh
+++ b/templates/shared_services/sonatype-nexus-vm/scripts/deploy_nexus_container.sh
@@ -20,7 +20,17 @@ while true; do
   ((docker_pull_timeout--));
 done
 
+# Deduce memory available to Java. Either 3/4 of the system RAM, or a set minimum
+mem_total_mb=$(( $(cat /proc/meminfo | head -1 | awk '{ print $2 }') / 1024 ))
+java_mem=2703
+if [ $mem_total_mb -gt 4096 ]; then
+  java_mem=$(( $mem_total_mb * 3 / 4 ))
+fi
+
+echo "System memory: ${mem_total_mb} MB. Java memory: ${java_mem} MB"
+
 docker run -d -p 80:8081 -p 443:8443 -p 8083:8083 -v /etc/nexus-data:/nexus-data \
+    -e INSTALL4J_ADD_VM_PARAMS="-Xmx${java_mem}m -Xms${java_mem}m" \
     --restart always \
     --name nexus \
     --log-driver local \
diff --git a/templates/shared_services/sonatype-nexus-vm/terraform/vm.tf b/templates/shared_services/sonatype-nexus-vm/terraform/vm.tf
index 79dfa044..1d4ee13f 100644
--- a/templates/shared_services/sonatype-nexus-vm/terraform/vm.tf
+++ b/templates/shared_services/sonatype-nexus-vm/terraform/vm.tf
@@ -100,7 +100,7 @@ resource "azurerm_linux_virtual_machine" "nexus" {
   resource_group_name             = local.core_resource_group_name
   location                        = data.azurerm_resource_group.rg.location
   network_interface_ids           = [azurerm_network_interface.nexus.id]
-  size                            = "Standard_B2s"
+  size                            = "Standard_B8ms"
   disable_password_authentication = false
   admin_username                  = "adminuser"
   admin_password                  = random_password.nexus_vm_password.result
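
The memory calculation in the patch can also be written as two small functions, which makes it easier to try out with sample values. This is a sketch of the same rule, not the change that was eventually merged; the function names are hypothetical. It looks up `MemTotal` by name instead of relying on it being the first line of `/proc/meminfo`. Note that passing `INSTALL4J_ADD_VM_PARAMS` with `-e` replaces the image's default value entirely, so any default flags that are still wanted would need to be repeated.

```shell
# Read MemTotal (reported in kB) from a meminfo-format file and print it in MB.
# Defaults to /proc/meminfo; a different file can be passed for testing.
mem_total_mb() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Same rule as the patch: above 4096 MB of RAM use 3/4 of it for Java,
# otherwise keep the 2703 MB default.
java_mem_mb() {
  if [ "$1" -gt 4096 ]; then
    echo $(( $1 * 3 / 4 ))
  else
    echo 2703
  fi
}

# Usage on the Nexus VM:
#   java_mem=$(java_mem_mb "$(mem_total_mb)")
#   echo "-Xmx${java_mem}m -Xms${java_mem}m"
```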

@marrobi
Member

marrobi commented Dec 11, 2024

Thank you for sharing. Yes, see how it goes, and we would welcome a PR. 🤞

4 participants