Puzzle ITC - Kubernetes Worker down – Incident details

Kubernetes Worker down

Resolved
Partial outage
Started almost 4 years ago
Lasted about 6 hours

Affected

Puzzle Services

Partial outage from 1:26 PM to 6:59 PM

Rocket.Chat

Partial outage from 1:26 PM to 6:59 PM

Puzzle SSO (Keycloak)

Partial outage from 1:26 PM to 6:59 PM

CodiMD

Partial outage from 1:26 PM to 6:59 PM

Quay (registry.puzzle.ch)

Partial outage from 1:26 PM to 6:59 PM

Updates
  • Resolved

    I checked the resource requests/limits and optimized them a bit. In addition, a new worker node is now running. I'm closing this incident; all services are up and running.

  • Monitoring

    We implemented a fix and are currently monitoring the result.

  • Identified

    The node is up and running again, but we still need to identify what caused the problem, since there was already a node outage earlier. Our current guess is insufficient memory: the nodes are too crowded and resource management is probably not optimal. We are adding a new node (possibly temporary) to mitigate the problem.

  • Investigating

    Another Kubernetes worker is unfortunately down.
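
The "Identified" update suspects that the nodes were too crowded and short on memory. As a rough illustration of that kind of check, the sketch below compares the summed memory requests and limits of the pods on one node against the node's allocatable memory. All pod names and numbers are hypothetical and are not taken from this incident; this is an assumption-laden sketch, not the analysis that was actually performed.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    memory_request_mib: int  # memory the scheduler reserves for the pod
    memory_limit_mib: int    # maximum memory the pod may use


def memory_pressure(pods, allocatable_mib):
    """Return (request_ratio, limit_ratio) relative to node allocatable memory."""
    total_requests = sum(p.memory_request_mib for p in pods)
    total_limits = sum(p.memory_limit_mib for p in pods)
    return total_requests / allocatable_mib, total_limits / allocatable_mib


# Hypothetical workload on a hypothetical node with 8 GiB allocatable memory.
pods = [
    Pod("chat", 1024, 4096),
    Pod("sso", 512, 2048),
    Pod("registry", 512, 4096),
]
request_ratio, limit_ratio = memory_pressure(pods, allocatable_mib=8192)
print(f"requests: {request_ratio:.0%} of allocatable, limits: {limit_ratio:.0%}")
```

A limit ratio above 100% means the pods could together use more memory than the node can provide. In that situation low requests let the scheduler pack the node tightly while actual usage can still exhaust memory, which is consistent with the guess in the update, though it does not confirm it.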