Health and Auto Healing

Status Endpoint

The Curity Identity Server provides a status endpoint, enabled by default on port 4465, which monitoring systems use to check the health of the Curity Identity Server. The simplest way to check the health of runtime nodes is to make a GET or HEAD request to the base URL, then check the HTTP status returned:

curl -X GET http://runtimenode-instance1:4465

An HTTP status of 200 means the instance is healthy and serving requests, whereas a 503 indicates that the instance is unhealthy or not yet available. You need to avoid false readings, though, such as classifying an instance as unhealthy while it is booting, or during an intermittent infrastructure issue that lasts only a few seconds.
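One common way to avoid such false readings is to mark an instance down only after several consecutive failing probes, which is the same counting model Kubernetes applies. The following Python sketch illustrates that threshold logic over a sequence of observed HTTP statuses; it is an illustrative helper, not part of the Curity Identity Server:

```python
def is_healthy_status(status_code):
    """A 200 from the status endpoint means healthy; anything else
    (such as 503) counts as a failing probe."""
    return status_code == 200

def is_unhealthy(statuses, failure_threshold=3):
    """Mark the instance down only after `failure_threshold`
    consecutive failing probes, filtering out transient glitches."""
    consecutive_failures = 0
    for status in statuses:
        if is_healthy_status(status):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
    return False
```

With a threshold of three, a brief blip such as `[200, 503, 503, 200]` does not mark the instance down, whereas `[503, 503, 503]` does.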

Kubernetes Health Checks

To understand end-to-end usage of the status endpoint, see how this is handled in a Kubernetes deployment. The Curity Identity Server is usually deployed to Kubernetes using the Helm Chart, and a values file can be used to customize behavior. The following default values are used, and port 4465 only needs to be contactable inside the cluster:

curity:
  healthCheckPort: 4465
  adminUiPort: 6749
  adminUiHttp: false

  runtime:
    role: default
    service:
      type: ClusterIP
      port: 8443
    livenessProbe:
      timeoutSeconds: 1
      failureThreshold: 3
      periodSeconds: 10
      initialDelaySeconds: 30
    readinessProbe:
      timeoutSeconds: 1
      failureThreshold: 3
      successThreshold: 3
      periodSeconds: 10
      initialDelaySeconds: 30

The Helm chart then adds standard httpGet probes. Every 10 seconds, the Kubernetes platform calls this endpoint on each instance of the Curity Identity Server in the cluster. The same HTTP request is used for both liveness probes, which check whether the container is still working, and readiness probes, which check whether it should continue to receive traffic. An initialDelaySeconds is also provided, to give the containers for the Curity Identity Server time to start:

spec:
  containers:
    - name: idsvr-runtime
      image: "custom_idsvr:7.3.1"
      ports:
        - name: http-port
          containerPort: 8443
          protocol: TCP
        - name: health-check
          containerPort: 4465
          protocol: TCP
        - name: metrics
          containerPort: 4466
          protocol: TCP
      livenessProbe:
        httpGet:
          path: /
          port: health-check
        timeoutSeconds: 1
        failureThreshold: 3
        periodSeconds: 10
        initialDelaySeconds: 120
      readinessProbe:
        httpGet:
          path: /
          port: health-check
        timeoutSeconds: 1
        failureThreshold: 3
        successThreshold: 3
        periodSeconds: 10
        initialDelaySeconds: 120

If a liveness probe fails three or more times consecutively for a particular container, that container is considered down: the platform restarts it in order to maintain the desired state of the cluster. A failing readiness probe instead removes the pod from service endpoints, so that it stops receiving traffic until it passes again. Various settings can be adjusted to refine this behavior, and these are explained in the Kubernetes documentation on Configuring Probes.
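The interplay of failureThreshold and successThreshold can be sketched as a small state machine. This is an illustrative model of the readiness probe counting rules described above, not Kubernetes code:

```python
class ReadinessTracker:
    """Models readiness probe counting: a pod becomes unready after
    `failure_threshold` consecutive failures, and becomes ready again
    only after `success_threshold` consecutive successes."""

    def __init__(self, failure_threshold=3, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.ready = False
        self._failures = 0
        self._successes = 0

    def record(self, probe_passed):
        """Record one probe result and return the current readiness."""
        if probe_passed:
            self._failures = 0
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._successes = 0
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self.ready = False
        return self.ready
```

Note how a pod that has become ready tolerates up to two failed probes in a row without being taken out of service, matching the settings shown in the manifest above.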

Other Platforms

The same concepts exist in other cloud native platforms. As an example, the Curity Identity Server could be deployed to AWS using EC2 instances. The status endpoints would then be used by Amazon EC2 Auto Scaling to implement equivalent auto-healing behavior.

Status Responses

The full response from the status endpoint includes a JSON payload, containing more detailed information. For each runtime node this returns the following information:

{
  "isReady": true,
  "nodeState": "RUNNING",
  "clusterState": "CONNECTED",
  "configurationState": "CONFIGURED",
  "transactionId": "67D-56A4D-85FBA",
  "isServing": true
}

For the admin node, similar information is returned:

{
  "isReady": true,
  "nodeState": "RUNNING",
  "clusterState": "ADMIN",
  "configurationState": "CONFIGURED",
  "transactionId": "67D-56A4D-85FBA",
  "isServing": false
}

In most cases, the default health checks will be sufficient, but the extra detail allows warning level information to be conveyed. An example might be a runtime node with a nodeState of WAITING: it is in a working state, but is in the process of connecting to the admin node to get a configuration update. If that state persists, you might decide that health checks should trigger replacement of the instance. Full details about the response fields and their possible states are provided in the System Admin Guide.
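As an illustration, a monitoring script could parse the JSON payload and separate healthy, warning, and failed states. The field names below come from the responses shown above, but the classification rules are an assumption that you would tune to your own operational policy:

```python
import json

def classify_node(payload):
    """Classify a runtime node's status payload as
    'healthy', 'warning', or 'unhealthy' (illustrative rules)."""
    status = json.loads(payload)
    if not status.get("isReady"):
        return "unhealthy"
    # A WAITING node is still serving, but is in the process of
    # fetching configuration from the admin node. Treat it as a
    # warning that is worth alerting on if it persists.
    if status.get("nodeState") == "WAITING":
        return "warning"
    return "healthy"
```

For example, the runtime node response shown earlier, with `"isReady": true` and `"nodeState": "RUNNING"`, would classify as healthy.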

Alarms

Although health checks may be passing for the Curity Identity Server, this does not guarantee that OAuth requests are working correctly for applications. A different type of failure occurs if a dependency of the Curity Identity Server, such as a data source, becomes unavailable, or its connection details are configured incorrectly.

In this case the Curity Identity Server itself is not failing, so its health checks will continue to indicate success. There is no point in the platform replacing instances of the Curity Identity Server, since that will not solve the problem. Instead, this scenario is handled by a different monitoring use case: alarms.

When a dependency fails, the Curity Identity Server raises events that other systems can subscribe to. This enables you to use one of the built-in notifiers, or implement a custom notifier, to enable your preferred behavior. The Integrate Alarms with Monitoring Systems tutorial provides further details on how alarms can be managed.

Conclusion

The Curity Identity Server implements health checks in a standard way, and can be integrated with any monitoring system. This enables you to use modern cloud native approaches to maintain the desired state of the cluster, or raise alerts to people. In most cases this will require very little work, and will simply use the platform's built-in features.