Downscaling

Downscaling and Continuous consolidation #

Stackbooster actively detects underutilized nodes within a cluster and rearranges pods to make node usage more efficient, enabling nodes to be scaled down and reducing operational costs. This effectively increases resource utilization. Stackbooster continuously evaluates whether pods on active nodes can be relocated elsewhere within the cluster. Nodes identified for reallocation are then systematically drained (the node is cordoned and its pods are gracefully evicted) to keep the infrastructure optimized and cost efficient.
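Conceptually, the drain step matches what you would do manually with kubectl: cordon the node so no new pods are scheduled onto it, then evict its workloads. A minimal illustration follows (the node name is a placeholder; Stackbooster performs the equivalent automatically):

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data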

Node and Pod Management by Stackbooster #

Building on the continuous consolidation described above, Stackbooster identifies underutilized nodes, relocates their pods elsewhere in the cluster, and drains the freed nodes (cordoning them and gracefully evicting their pods) so that the infrastructure stays optimized and cost efficient.

In parallel, the Stackbooster Autoscaler removes nodes that have remained idle for extended periods, typically after jobs complete or scaling operations reduce pod counts. The Evictor mechanism within Stackbooster continuously reassesses pod placement and consolidates pods onto fewer nodes, which can leave some nodes with no active workloads. Once a node is identified as empty, the Node Deletion Policy is triggered, automatically removing it to prevent unnecessary resource expenditure. Together, these strategies ensure Stackbooster not only optimizes resource distribution but also maintains a streamlined, cost-effective cluster environment.
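To understand why a particular node has or has not been emptied, you can list the pods still scheduled on it. This is a standard kubectl query rather than a Stackbooster-specific command; the node name is a placeholder:

kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>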

Pod consolidation workflow #

  • Stackbooster constantly evaluates the cluster to identify pods on underutilized nodes that could be relocated to optimize space.
  • By cordoning off the node and draining it, the Evictor shifts workloads to other nodes, effectively reducing unnecessary resource use.
  • Once relocation is complete, the now-empty node is removed.
  • The Stackbooster process aims to prevent service downtime by focusing on applications with redundant replicas (see the example after this list). An aggressive mode is available for single-replica applications, but it is generally not recommended for production environments.
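For illustration, the kind of workload that consolidates safely is a Deployment with redundant replicas. The sketch below uses placeholder names and images and is not a Stackbooster-specific manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25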

Operational Rules to Avoid Downtime #

  • Pods suitable for eviction should be managed by a controller (such as a ReplicaSet) and have multiple replicas.
  • Pods should not be part of a StatefulSet.
  • Pods must not be marked as non-disruptable via labels or annotations.
  • Pod Disruption Budgets are respected (see the example after this list).
  • DaemonSet pods are considered for eviction.
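A Pod Disruption Budget limits how many replicas can be evicted at once during consolidation. For example, the following standard Kubernetes PodDisruptionBudget (reusing the placeholder names from the Deployment sketch above) keeps at least two replicas running while pods are relocated:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web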

Control node and pod disruption #

To prevent eviction-eligible pods from being disrupted during the scale-down process, you can add the stackbooster.io/scale-down-disabled: "true" label or annotation to the pod. Use the following sample YAML for a Deployment:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        stackbooster.io/scale-down-disabled: "true"

Alternatively, apply this setting using the command:
kubectl label deployment <deployment-name> stackbooster.io/scale-down-disabled="true"
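Keep in mind that kubectl label on a Deployment sets the label on the Deployment object itself. If the annotation needs to land on the pod template, as in the YAML above, a standard kubectl patch can be used instead (the deployment name is a placeholder):

kubectl patch deployment <deployment-name> --type merge -p '{"spec":{"template":{"metadata":{"annotations":{"stackbooster.io/scale-down-disabled":"true"}}}}}'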

To control node-level disruptions during the scale-down process, set the stackbooster.io/scale-down-disabled: "true" annotation on the node with this YAML:

apiVersion: v1
kind: Node
metadata:
  annotations:
    stackbooster.io/scale-down-disabled: "true"

Or use this command to apply the label:
kubectl label nodes <node-name> stackbooster.io/scale-down-disabled="true"
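If you prefer to set the annotation (matching the Node YAML above) rather than the label, the standard kubectl annotate command works the same way:

kubectl annotate node <node-name> stackbooster.io/scale-down-disabled="true"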

NOTE: The methods described above will not prevent node disruption if the node runs on a spot instance that is about to be terminated due to market conditions. In such cases, additional measures specific to handling spot instance interruptions need to be considered.