Kubernetes Auto Scaling

This tutorial shows how to autoscale runtime nodes of the Curity Identity Server based on its built-in custom metrics and the features of the Kubernetes platform. The system runs on a development computer if you clone the GitHub repository linked above, which provides helper resources to automate the setup and teardown:

Autoscaling Resources

Built-in Auto Scaling

Kubernetes has built-in autoscaling based on CPU and memory usage, which can be activated for runtime nodes using the Helm Chart for the Curity Identity Server. To do so, you express one or both limits beyond which additional runtime nodes should be added:

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80

You then need to install the Metrics Server, and you may also need to set the corresponding resource limits in the Helm Chart, based on the amount of CPU you wish to allocate to runtime nodes:

resources:
  requests:
    cpu: 500m
  limits:
    cpu: 500m

Using Custom Requirements

This tutorial will instead use custom metrics, which enable you to define autoscaling based on your own requirements. The custom behavior will be based on OAuth request time, an example custom metric that is important to the performance of your applications:

Desired State

When the Curity Identity Server's OAuth request time has averaged greater than 100 milliseconds for a warmup period, the system will automatically scale up. When the system returns to a level below 100 milliseconds for a cooldown period, the number of runtime nodes will automatically scale down.
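The desired behavior can be sketched as follows. This is a simplified illustration with hypothetical warmup and cooldown sample counts, not how the Kubernetes autoscaler is actually implemented:

```python
# Simplified sketch of the desired scaling behavior: scale up only after the
# average request time stays above the threshold for a warmup period, and
# scale down only after it stays below the threshold for a cooldown period.

THRESHOLD_MS = 100

def scaling_decision(samples_ms, warmup_samples=3, cooldown_samples=3):
    """samples_ms: recent average request times in milliseconds, oldest first."""
    if len(samples_ms) >= warmup_samples and all(s > THRESHOLD_MS for s in samples_ms[-warmup_samples:]):
        return "scale-up"
    if len(samples_ms) >= cooldown_samples and all(s < THRESHOLD_MS for s in samples_ms[-cooldown_samples:]):
        return "scale-down"
    return "no-change"

print(scaling_decision([120, 130, 140]))  # scale-up
print(scaling_decision([90, 80, 110]))    # no-change
```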

Kubernetes Base System

This tutorial provides some resources that can be used with the Kubernetes Demo Installation, which uses its own base GitHub repository. Before deploying the Curity Identity Server, ensure that the idsvr/helm-values.yaml from the base installation is updated to use autoscaling, with the following settings. The value expressed as a target is the 100ms threshold beyond which the system needs to start scaling up:

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 0
  targetMemoryUtilizationPercentage: 0
  metrics:
    - type: External
      external:
        metric:
          name: idsvr_http_server_request_time_rate
        target:
          type: Value
          value: 100m

The base system initially contains 2 runtime nodes for the Curity Identity Server, both of which provide Prometheus compliant metrics:

Deployed Containers

Query Metrics

Once the system is up, run the 1-generate-metrics.sh script, which runs 100 OAuth requests to authenticate via the Client Credentials Grant and displays the access tokens returned:

Generate Metrics

The script ends by using the kubectl tool to call the Curity Identity Server metrics endpoints directly, to get the fields used in this tutorial:

  • idsvr_http_server_request_time_sum
  • idsvr_http_server_request_time_count


The monitoring system will frequently call the Identity Server metrics endpoints and store the results as a time series. The Kubernetes Horizontal Pod Autoscaler will then be configured to receive an overall result, so that it is informed when the number of nodes needs to change:

Autoscaling Overview

In this tutorial, Prometheus will be used as the monitoring system. This requires installing both the Prometheus system and a Custom Metrics API, whose role is to transform values for the autoscaler:

Monitoring System

The autoscaler will receive a single aggregated rate value to represent the average OAuth request time for all runtime nodes in the cluster. The autoscaler will then compare this value to a threshold beyond which auto-scaling is needed.

Install Prometheus

A fast installation of Prometheus components can be done by running the 2-install-prometheus.sh script from the helper resources, which deploys the Kube Prometheus System with default settings:

Prometheus Install

This deploys multiple components within a Kubernetes monitoring namespace:

Prometheus Resources

The Prometheus Admin UI and the Grafana UI can both be exposed to the local computer via port forwarding commands:

kubectl -n monitoring port-forward svc/prometheus-k8s 9090
kubectl -n monitoring port-forward svc/grafana 3000

Install the Custom Metrics API

Next run the 3-install-custom-metrics-api.sh script, which will provide endpoints from which the Horizontal Pod Autoscaler gets metrics:

Custom Metrics API Install

Once complete, the Custom Metrics API components are available under a Kubernetes custom-metrics namespace:

kubectl get all -n custom-metrics

Metrics Components

The following command verifies that the Custom Metrics API is correctly communicating with Prometheus, and should result in a large JSON response:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1

The script adds the following custom entry to the Custom Metrics API's ConfigMap, under an externalRules section, using PromQL query syntax:

- seriesQuery: '{namespace!="",__name__!~"^container_.*"}'
  resources:
    overrides:
      namespace:
        resource: namespace
  name:
    matches: "^(.*)$"
    as: "idsvr_http_server_request_time_rate"
  metricsQuery: 'sum(rate(idsvr_http_server_request_time_sum[5m])) / sum(rate(idsvr_http_server_request_time_count[5m]))'

Enable Metrics Collection

Next run the 4-install-service-monitor.sh script, which creates a Kubernetes ServiceMonitor resource that tells the Prometheus system to start collecting data from the Curity Identity Server's metrics endpoints:

kind: ServiceMonitor
apiVersion: monitoring.coreos.com/v1
metadata:
  name: curity-idsvr-runtime
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: idsvr
      role: curity-idsvr-runtime
  endpoints:
    - port: metrics

The initial and desired state of the autoscaler can then be queried. The status will look healthy, since the metric is well within its limits:

kubectl describe hpa/curity-idsvr

Healthy Rate

Collect Metrics

Autoscaling must be based on an accurate calculation. This tutorial uses the following value, which takes the total request time across all nodes over the last 5 minutes, then divides it by the total number of requests from all nodes:

sum(rate(idsvr_http_server_request_time_sum[5m])) /
sum(rate(idsvr_http_server_request_time_count[5m]))
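To see why this expression yields an average request time, note that both idsvr metrics are cumulative counters, so the per-second rate of the time counter divided by the per-second rate of the request counter gives the average duration per request. A sketch of the arithmetic, using hypothetical sample values rather than real PromQL evaluation:

```python
# Sketch of what the PromQL expression computes from two counter snapshots
# taken at either end of the rate() window.

def average_request_time(sum_start, sum_end, count_start, count_end, window_seconds):
    rate_sum = (sum_end - sum_start) / window_seconds        # request time accrued per second
    rate_count = (count_end - count_start) / window_seconds  # requests per second
    return rate_sum / rate_count                             # average seconds per request

# 50 requests in a 5 minute window, consuming 4 seconds of total request time:
print(average_request_time(10.0, 14.0, 200, 250, 300))  # ≈0.08, i.e. 80ms average
```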

Next, run the 1-generate-metrics.sh script again, then navigate to the Prometheus UI at http://localhost:9090. Metrics from the Curity Identity Server will soon become available, and these will have an idsvr prefix:

Identity Server Metrics

Paste in the calculation and click Execute to see the average request time:

Metrics Calculation

When there are no requests, the query will return a not-a-number (NaN) or negative result. This is the standard behavior, but if preferred the calculation can be updated to return zero instead:

(sum(rate(idsvr_http_server_request_time_sum[5m])) /
 sum(rate(idsvr_http_server_request_time_count[5m])))
 > 0 or on() vector(0)
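The effect of the fallback clause can be mimicked in Python. This is an illustrative analogy of the zero-substitution, not a reproduction of PromQL semantics:

```python
import math

# With no traffic both counter rates are zero, and 0/0 yields NaN in PromQL.
# The '> 0 or on() vector(0)' clause substitutes zero, mimicked here:
def average_or_zero(rate_sum: float, rate_count: float) -> float:
    value = rate_sum / rate_count if rate_count != 0 else math.nan
    # NaN compares False against everything, so NaN falls through to zero
    return value if value > 0 else 0.0

print(average_or_zero(4.0, 50.0))  # 0.08
print(average_or_zero(0.0, 0.0))   # 0.0
```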

When troubleshooting, the raw and calculated metric values can be queried directly from the Custom Metrics API:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/idsvr_http_server_request_time_sum"   | jq
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/idsvr_http_server_request_time_count" | jq
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/idsvr_http_server_request_time_rate" | jq

Testing Autoscaling

Next, log in to the Admin UI that the base deployment installed. Navigate to Profiles / Token Service / Endpoints / Type=oauth-token / Client Credentials and select the New Procedure option, then paste the following code before the return statement to simulate slow performance:

  // Busy-wait for 200 milliseconds to slow down every token issuance request
  var sleepMilliseconds = 200;
  var startTime = Date.now();
  var currentTime = null;
  do {
    currentTime = Date.now();
  } while (currentTime - startTime < sleepMilliseconds);

Simulated Load

After committing changes, run the 1-generate-metrics.sh script again, and responses will now be slower. Describe the autoscaler again and you will see that the system has determined that it needs to scale up:

Unhealthy Rate

The algorithm for selecting the number of new nodes is explained in the Kubernetes Documentation:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
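As an illustrative check of this formula (a hypothetical calculation, not output from the cluster):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, desired_metric: float) -> int:
    """Kubernetes HPA scaling formula: scale in proportion to how far the
    observed metric is from its target, rounding up."""
    return math.ceil(current_replicas * (current_metric / desired_metric))

# With 2 replicas and an average request time of 300ms against a 100ms target,
# the autoscaler requests ceil(2 * 3.0) replicas, capped by maxReplicas.
print(desired_replicas(2, 300, 100))  # 6
```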

Next, undo and commit the above JavaScript changes, and notice that after a few minutes of performance being comfortably within limits, the number of runtime nodes is gradually scaled down again.

Tuning Autoscaling

The Kubernetes system deals with autoscaling in a mature manner, with these qualities:

  • Phased Scaling: Kubernetes supports warm up and cool down delays to prevent 'thrashing', so that a metric value change does not instantly trigger spinning up of new nodes.
  • Extensibility: This tutorial's autoscaler could easily be extended to use multiple metrics. The desired replica count algorithm then selects the highest value from all included metrics.
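If finer control over these phases is needed, the Kubernetes autoscaling/v2 HorizontalPodAutoscaler exposes a behavior section for tuning them. The following values are illustrative, not taken from this tutorial's deployment:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60     # require 60s of sustained load before scaling up
    policies:
      - type: Pods
        value: 2                       # add at most 2 pods per period
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300    # wait 5 minutes of calm before scaling down
    policies:
      - type: Pods
        value: 1                       # remove at most 1 pod per period
        periodSeconds: 120
```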

It is important to combine autoscaling with other checks on the system, and the following options are commonly used:

  • A custom Grafana dashboard can be imported as a JSON document and used by technical staff to visualize the system state
  • The Curity Alarm Subsystem can raise alerts to people when a particular node performs poorly
  • If the autoscaling metric frequently flips above and below its threshold, the default number of instances or the choice of metrics may need further refinement


The Curity Identity Server implements the industry-standard approach to providing metrics about its runtime performance and resource usage. Results can then be supplied to monitoring systems, which aggregate them. One of the main use cases for metrics is to autoscale the system whenever it is struggling under load, and any modern cloud native platform supports this.