Kubernetes Auto Scaling

This tutorial shows how to auto scale runtime nodes of the Curity Identity Server based on its built-in custom metrics and the features of the Kubernetes platform. The system runs on a development computer, and a GitHub repository provides helper resources to automate the setup and teardown:

Autoscaling Resources

Example Autoscaling Requirements

The autoscaling logic for this tutorial is summarized below, and scaling will be based on OAuth request time, though the approach demonstrated could be adapted to use alternative or additional metrics:

Desired State

When the Curity Identity Server's OAuth request time has averaged greater than 100 milliseconds for a warmup period, the system will automatically scale up. When the system returns to a level below 100 milliseconds for a cooldown period, the number of runtime nodes will automatically scale down.
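The desired state above can be sketched as a simple decision function. This is an illustration of the intended behavior rather than the Horizontal Pod Autoscaler's actual algorithm, and the sample window sizes are assumptions:

```python
# Illustrative sketch of the desired scaling behavior: scale up only after the
# average request time has stayed above the threshold for a warmup period, and
# scale down only after a cooldown period below it.
THRESHOLD_SECONDS = 0.100  # the tutorial's 100 millisecond threshold
WARMUP_SAMPLES = 3         # assumed consecutive samples before scaling up
COOLDOWN_SAMPLES = 3       # assumed consecutive samples before scaling down

def scaling_decision(samples: list[float]) -> str:
    """Return 'up', 'down' or 'hold' for a window of recent average request times."""
    if len(samples) >= WARMUP_SAMPLES and all(
            s > THRESHOLD_SECONDS for s in samples[-WARMUP_SAMPLES:]):
        return "up"
    if len(samples) >= COOLDOWN_SAMPLES and all(
            s <= THRESHOLD_SECONDS for s in samples[-COOLDOWN_SAMPLES:]):
        return "down"
    return "hold"

print(scaling_decision([0.12, 0.15, 0.13]))  # up
print(scaling_decision([0.05, 0.06, 0.04]))  # down
print(scaling_decision([0.05, 0.12, 0.13]))  # hold
```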

Kubernetes Base System

To follow this article, either run the Kubernetes Demo Installation using minikube, or alternatively supply your own cluster. If using minikube, first edit create-cluster.sh and increase the cpus to 4 and memory to 16384, since a number of extra monitoring components will be deployed:

minikube start --cpus=4 --memory=16384 --disk-size=50g --driver=hyperkit --profile curity

The minikube system initially contains 2 runtime nodes for the Curity Identity Server, and each of these provides Prometheus compliant metrics:

Deployed Containers

Once the system is up, run the 1-generate-metrics.sh script, which runs 100 OAuth requests to authenticate via the Client Credentials Grant and displays the access tokens returned:

Generate Metrics

The script ends by using the kubectl tool to call Identity Server metrics endpoints directly, to get the fields used in this tutorial:

  • idsvr_http_server_request_time_sum
  • idsvr_http_server_request_time_count


The monitoring system will frequently call the Identity Server metrics endpoints and store the results as a time series. The Kubernetes Horizontal Pod Autoscaler will then be configured to receive an overall result, so that it is informed when the number of nodes needs to change:

Autoscaling Overview

In this tutorial, Prometheus will be used as the monitoring system, which requires installing both the Prometheus system and a Custom Metrics API, whose role is to transform values for the autoscaler:

Monitoring System

The autoscaler will receive a single aggregated rate value to represent the average OAuth request time for all runtime nodes in the cluster. The autoscaler will then compare this value to a threshold beyond which auto-scaling is needed.

Installing Prometheus

A fast installation of Prometheus components can be done by running the 2-install-prometheus.sh script from the helper resources, which deploys the Kube Prometheus System with default settings:

Prometheus Install

This deploys multiple components within a Kubernetes monitoring namespace:

Prometheus Resources

The Prometheus Admin UI and the Grafana UI can both be exposed to the local computer via port forwarding commands:

kubectl -n monitoring port-forward svc/prometheus-k8s 9090
kubectl -n monitoring port-forward svc/grafana 3000

Next run the 3-install-custom-metrics-api.sh script, which will provide endpoints from which the Horizontal Pod Autoscaler gets metrics:

Custom Metrics API Install

Once complete, the Custom Metrics API components are available in a Kubernetes custom-metrics namespace:

kubectl get all -n custom-metrics

Metrics Components

The following command can be run to verify that the custom metrics API is correctly communicating with Prometheus, and this should result in a large JSON response:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1

Integrating Identity Server Metrics

Next run the 4-install-service-monitor.sh script, which creates a Kubernetes ServiceMonitor resource that tells the Prometheus system to start collecting data from the Curity Identity Server's metrics endpoints:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: curity-idsvr-runtime
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: idsvr
      role: curity-idsvr-runtime
  endpoints:
    - port: metrics

Run the 1-generate-metrics.sh script again and then browse to the Prometheus UI at http://localhost:9090. You will then be able to view metrics, and those beginning with idsvr will have been supplied by the Curity Identity Server:

Custom Metric

The raw metric values can then be queried by calling the Custom Metrics API directly:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/idsvr_http_server_request_time_sum"   | jq
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/idsvr_http_server_request_time_count" | jq

As an alternative to using service monitors, Identity Server metrics can be reported to Prometheus via scrape annotations, for customers who prefer that option. The Curity Identity Server's Helm chart writes the following annotations to support this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: curity-idsvr-runtime
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: /metrics
        prometheus.io/port: "4466"

Autoscaling Calculation

It is important to base autoscaling on an accurate calculation, and this tutorial will use the following value, which gets the total request time across all nodes for the last 5 minutes, then divides it by the total number of requests from all nodes:

sum(rate(idsvr_http_server_request_time_sum[5m])) /
sum(rate(idsvr_http_server_request_time_count[5m]))
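To make the calculation concrete, the following sketch reproduces it in Python, using hypothetical counter samples from two runtime nodes taken 5 minutes apart:

```python
# Sketch of what the PromQL expression computes; the counter values below are
# hypothetical samples from two runtime nodes.
WINDOW_SECONDS = 300  # the 5 minute rate() window

# (previous, current) values of each node's counters over the window
node_time_counters = [(10.0, 16.0), (12.0, 20.0)]   # request time in seconds
node_count_counters = [(100, 160), (120, 200)]      # request counts

# rate() is the per-second increase of a counter over the window
time_rate = sum((cur - prev) / WINDOW_SECONDS for prev, cur in node_time_counters)
count_rate = sum((cur - prev) / WINDOW_SECONDS for prev, cur in node_count_counters)

# Dividing total time by total requests yields the average request time
average_request_time = time_rate / count_rate
print(average_request_time)  # approximately 0.1 seconds (total time / total requests)
```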

The expected result can be verified by entering the formula into the Prometheus UI:

Metrics Calculation

When there are no requests the query will return a not a number (NaN) or empty result, since both counters have a zero rate. This is the standard behavior, but if preferred the calculation can be updated to return zero instead:

(sum(rate(idsvr_http_server_request_time_sum[5m])) /
 sum(rate(idsvr_http_server_request_time_count[5m])))
 > 0 or on() vector(0)

Creating the Autoscaler

Run the 5-install-autoscaler.sh script, which first adds an external metric for the aggregated value and gives it the name idsvr_http_server_request_rate. This is expressed in the Custom Metrics API's configmap via a PromQL query:

- seriesQuery: '{namespace!="",__name__!~"^container_.*"}'
  resources:
    overrides:
      namespace:
        resource: namespace
  name:
    matches: "^(.*)$"
    as: "idsvr_http_server_request_rate"
  metricsQuery: 'sum(rate(idsvr_http_server_request_time_sum[5m])) / sum(rate(idsvr_http_server_request_time_count[5m]))'

The script then creates the actual autoscaler Kubernetes resource, using the following YAML. The value expressed as the target is the 100ms threshold beyond which the system needs to start scaling up:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: curity-idsvr-runtime-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: curity-idsvr-runtime
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: External
      external:
        metric:
          name: idsvr_http_server_request_rate
        target:
          type: Value
          value: 100m

Once deployed, run the 1-generate-metrics.sh script again, then run the following external metrics query a few seconds later to get the aggregated value that is used for autoscaling:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/idsvr_http_server_request_rate" | jq
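The value field in the response is a Kubernetes quantity, where a suffix of m denotes milli-units, so 100m corresponds to 0.1 seconds. As a sketch, assuming a simplified sample payload, the quantity can be converted like this:

```python
import json

# Hypothetical, simplified sample of an external metrics response; the real
# payload is an ExternalMetricValueList and the value shown is illustrative
sample = '{"kind": "ExternalMetricValueList", "items": [{"metricName": "idsvr_http_server_request_rate", "value": "86m"}]}'

def parse_quantity(quantity: str) -> float:
    """Convert a Kubernetes quantity such as '86m' (milli-units) to a float."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000.0
    return float(quantity)

items = json.loads(sample)["items"]
rate = parse_quantity(items[0]["value"])
print(rate)  # 0.086, i.e. an average request time of 86 milliseconds
```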

The current state of OAuth requests and the number of nodes can then be queried. This will look healthy initially, since the metric is well within its limits:

kubectl describe hpa/curity-idsvr-runtime-autoscaler

Healthy Rate

Testing Autoscaling

Next, in the Admin UI for the Curity Identity Server, navigate to Profiles / Token Service / Endpoints / oauth-token / Client Credentials Grant, select the New Procedure option, then paste in the following code to simulate slow performance:

// Busy-wait for 200 milliseconds to simulate a slow token request
var sleepMilliseconds = 200;
var startTime = Date.now();
var currentTime = null;
do {
    currentTime = Date.now();
} while (currentTime - startTime < sleepMilliseconds);

Simulated Load

After committing changes, run the 1-generate-metrics.sh script again, and responses will now be slower. Describe the autoscaler again and you will see that the system has determined that it needs to scale up:

Unhealthy Rate

The algorithm for selecting the number of new nodes is explained in the Kubernetes Documentation:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
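As a worked example of this formula, assuming the tutorial's bounds of 2 to 4 replicas and the metric expressed in milliseconds:

```python
import math

# Worked example of the Kubernetes replica formula, using this tutorial's
# replica bounds; the metric values are illustrative
def desired_replicas(current_replicas, current_metric_value, desired_metric_value,
                     min_replicas=2, max_replicas=4):
    desired = math.ceil(current_replicas * (current_metric_value / desired_metric_value))
    # The autoscaler clamps the result to the configured replica bounds
    return max(min_replicas, min(max_replicas, desired))

# An average request time of 250ms against the 100ms target:
print(desired_replicas(2, 250, 100))  # ceil(2 * 2.5) = 5, clamped to maxReplicas = 4
```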

Next undo and commit the above JavaScript changes, and notice that after a few minutes of performance being comfortably within limits, the number of nodes is gradually scaled down again.

Tuning Autoscaling

The Kubernetes system deals with auto-scaling in a mature manner with these qualities:

  • Phased Scaling: Kubernetes supports warm up and cool down delays to prevent 'thrashing', so that a metric value change does not instantly trigger the spinning up of new nodes.
  • Extensibility: This tutorial's autoscaler could easily be extended to use multiple metrics, in which case the desired replica count algorithm selects the highest value from all included metrics.
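On Kubernetes versions that support the autoscaler's behavior field, these warm up and cool down delays can also be tuned explicitly. The following fragment is an illustrative sketch, and the values are assumptions rather than recommendations:

```yaml
# Hypothetical tuning of the HorizontalPodAutoscaler's stabilization windows
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
```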

Metrics can be visualized over time using the Grafana system that is deployed with Prometheus, by browsing to http://localhost:3000 and signing in as admin / admin. This enables a dashboard to be created with Prometheus as the data source, to show the value over time of the metric used for autoscaling.

Grafana Dashboard

It is important to combine autoscaling with other checks on the system, and the below options are commonly used:

  • The Curity Grafana Dashboard can be imported to visualize the system state
  • The Curity Alarm Subsystem can be used to raise alerts to people when a particular node performs poorly
  • If the autoscaling metric is frequently flipping above and below its threshold then the default number of instances or the use of metrics may need to be further refined


The Curity Identity Server implements the industry standard solution for providing metrics about its runtime performance and use of resources. Results can then be supplied to monitoring systems, which will aggregate the results. One of the main use cases for metrics is to auto-scale the system if it is ever struggling under load, and any modern cloud native platform will support this.