Bicep kubernetes extension random 'System.OutOfMemoryException' #15758

Open
Kravca opened this issue Dec 4, 2024 · 6 comments

Kravca commented Dec 4, 2024

Bicep version
Bicep CLI version 0.31.92

Describe the bug
We use the Bicep Kubernetes extension heavily to deploy apps to an Azure AKS cluster. Around 28 November our pipelines started to fail with a 'System.OutOfMemoryException', like this:

Status Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details. (Code: DeploymentFailed)        
 - {"error":{"code":"InvalidImportConfig","target":"/imports/kubernetes/config","message":"Exception of type 'System.OutOfMemoryException' was thrown."}} (Code:)
CorrelationId: 6d98b9ad-62a8-489d-870d-7f6d763c6043
At line:1 char:1
+ New-AzDeployment -Location westeurope -TemplateFile .\.devops\bicep\m ...

Or like this:

Status Message: The template deployment 'app' is not valid according to the validation procedure. The tracking id is '60acfe03-d1ca-46c5-9141-4d7f40c49fe8'. See inner errors for details. (Code:
InvalidTemplateDeployment)
 - Exception of type 'System.OutOfMemoryException' was thrown. (Code:InvalidImportConfig)
CorrelationId: 6774b0e2-de98-4d13-a2b4-b05fc0aaba16
At line:1 char:1
+ New-AzDeployment -Location westeurope -TemplateFile .\.devops\bicep\m ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [New-AzDeployment], Exception
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.NewAzureSubscriptionDeploymentCmdlet

Both errors come from the same deployment, without changing anything; we are just redeploying the same thing over and over (example below).

To Reproduce
I have created the smallest Bicep files that still hit this error (they only create a Namespace). Since the example is really simple, it succeeds roughly 90% of the time.
main.bicep:

targetScope = 'subscription'

//Resource Group
resource rg 'Microsoft.Resources/resourceGroups@2023-07-01' = {
  name: '****-dev-rg'
  location: 'westeurope'   
}

//AKS Cluster
resource aks 'Microsoft.ContainerService/managedClusters@2024-02-01' existing = {
  name: '****-dev-aks'
  scope: rg
}

module kubernetes './app.bicep' = {
  name: 'app'
  scope: rg
  params: {
    kubeConfig: aks.listClusterAdminCredential().kubeconfigs[0].value
  }
}

app.bicep:

@secure()
param kubeConfig string
param namespaceName string = 'temp'

extension kubernetes with {
  namespace: namespaceName
  kubeConfig: kubeConfig
}

resource namespace 'core/Namespace@v1' = {
  metadata: {
    name: namespaceName
    labels: {
      name: namespaceName
    }
  }
} 

Additional context
We tried:

  • different clusters with different authentication modes (v1.30.5, v1.31.1; 'Local accounts' vs 'MS Entra with Azure RBAC')
  • upgrading the cluster to a different version
  • simplifying our apps: fewer file reads, data transformations, mappings, etc.
  • deploying Bicep from a local PC with New-AzDeployment (the pipelines use az)
  • reading kubeConfig from a file (it was one of the main suspects, as kubeConfig is pretty big, ~20 KB, and has to be passed to the other Bicep file as a parameter, which the extension forces on you); see the sketch below this list
  • reducing the size of kubeConfig (by removing the cert auth part, as it's not used)

Nothing really helps; the error appears randomly. Some pipelines see roughly a 50% success rate, others only 10%.
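
For reference, a minimal sketch of the "reading kubeConfig from a file" attempt, assuming a kubeconfig exported locally (e.g. with az aks get-credentials --admin) and saved as kubeconfig.yaml next to main.bicep; the file name is a placeholder, not what our pipeline actually uses. loadFileAsBase64 is used here because the extension appears to expect the base64-encoded kubeconfig, the same form that listClusterAdminCredential().kubeconfigs[0].value returns. Note the file contents get embedded into the compiled ARM template, so this is only suitable for troubleshooting:

targetScope = 'subscription'

//Resource Group (same as in main.bicep)
resource rg 'Microsoft.Resources/resourceGroups@2023-07-01' = {
  name: '****-dev-rg'
  location: 'westeurope'
}

module kubernetes './app.bicep' = {
  name: 'app'
  scope: rg
  params: {
    // Embed a locally exported kubeconfig at compile time instead of
    // calling listClusterAdminCredential() on the cluster. The path must
    // be a compile-time constant; 'kubeconfig.yaml' is a placeholder.
    kubeConfig: loadFileAsBase64('kubeconfig.yaml')
  }
}

As noted above, this didn't change the failure rate for us.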

@danielhagstromsogeti

I can confirm we're seeing the same intermittent failures. It is stopping our shop almost in its tracks, to the point where we can no longer tell if our deployments are actually broken because of our changes or just randomly failing, making continuous integration/deployment of our IaC close to impossible.

I've spent quite a bit of time trying to iron out what changed on our end. To complement your list, @Kravca, I've also tried downloading a bunch of different Bicep CLI binaries, building the ARM template with each, and diffing the results to see if a recent update changed anything there. I've found no changes in our generated template across all official releases from v0.28.1 to v0.31.92 (current at the time of writing).

@danielhagstromsogeti

Have you also observed that it isn't necessarily the same resource (e.g. namespace or configmap) that fails? We're deploying roughly:

  • 4 secrets
  • 12 namespaces
  • 1 azmonitoring.coreos.com/ServiceMonitor@v1
  • 1 azmonitoring.coreos.com/PodMonitor@v1
  • 1 service account, and with it
    • 1 rbac.authorization.k8s.io/ClusterRole@v1
    • 1 rbac.authorization.k8s.io/ClusterRoleBinding@v1

We get intermittent failures mostly on the namespaces, but sometimes on the secrets too.


Kravca commented Dec 4, 2024

Have you also observed that it isn't necessarily the same resource (e.g. namespace or configmap) that fails? We're deploying roughly:

  • 4 secrets
  • 12 namespaces
  • 1 azmonitoring.coreos.com/ServiceMonitor@v1
  • 1 azmonitoring.coreos.com/PodMonitor@v1
  • 1 service account, and with it
    • 1 rbac.authorization.k8s.io/ClusterRole@v1
    • 1 rbac.authorization.k8s.io/ClusterRoleBinding@v1

We get intermittent failures mostly on the namespaces, but sometimes on the secrets too.

We also suspected ConfigMaps at some point (because of the word 'config' in the error), but removing them from the Bicep didn't change anything, which is why I focused on kubeConfig instead. The error is not very meaningful: in our case it comes from the module that uses the Kubernetes extension, i.e. the one you supply kubeConfig to. Which resource in that module is actually failing is unknown (I'd guess it could be any of them); the error doesn't reveal it. That's why I used the simplest example with just a namespace; I didn't mean that namespace resources specifically are failing. It's also unknown which API produces this error: the ARM API (we can see the error in the ARM deployment logs), the Kubernetes API (deeper down), or something else.

@stephaniezyen (Contributor)

@shenglol

shenglol commented Dec 4, 2024

@Kravca Thanks for bringing this to our attention. We’re aware of the issue and have already developed a fix. However, due to the Azure service deployment freeze, we won’t be able to roll it out until early January. In the meantime, we will be making manual adjustments to our service VMs to mitigate the OOM issue. This process may take about two weeks since it involves making changes across all Azure regions. I’ll keep you updated on our progress.

shenglol self-assigned this Dec 12, 2024
shenglol added the bug label and removed the Needs: Triage 🔍 label Dec 12, 2024
shenglol added this to the v0.33 milestone Dec 12, 2024
shenglol moved this from Todo to In Progress in Bicep Dec 12, 2024

shenglol commented Dec 12, 2024

We were able to apply for an exception to proceed with deploying the fix. The rollout is currently underway and is expected to take about a week to reach all regions.

The deployment succeeded in the Canary regions, but further work is needed to roll it out to the other regions.
