Monitoring

This section of the admin guide describes information related to monitoring the Curity Identity Server.

Tip

🔥🔥🔥 If you just want to know how to determine if your instance of Curity is unhealthy and on fire, refer to the information below. 🔥🔥🔥

JMX

Java Management Extensions (JMX) is a commonly used interface for monitoring the internals of a Java-based application like the Curity Identity Server. This ability to peer inside the application, however, can be dangerous, which is why JMX is disabled by default. To enable it, set the ENABLE_JMX environment variable before starting the Curity Identity Server; the value itself is ignored and can be any non-empty value (e.g., true, 1, etc.). This can be done on the command line like this, for instance:

Listing 80 Example of how to enable JMX from the command line by setting the ENABLE_JMX environment variable
$ ENABLE_JMX=1 idsvr

With JMX enabled, the following can be monitored and, in some cases, changed:

  • LDAP connection pools
  • JDBC (database) connection pools
  • Web server information (thread pools, object pools, ciphers, etc.)
  • JVM settings (e.g., memory, CPU usage, etc.)
  • Logging settings (including log levels per logger)

Note that serialization must be enabled for the javax.management.* classes in order for JMX to function properly. This should be handled automatically in typical scenarios.

Tracing

To make tracing easier, Curity supports adding an HTTP response header containing the node’s service-id to every HTTP response. The service-id is a unique string based on the service-name (see Setup Nodes for more information).

To enable this header, set the system property called se.curity:identity-server:http:service-id-header when starting up Curity. The value of this property should be the name of the header you want to contain the service-id.

Listing 81 Example of how to enable the service-id response header in a Curity node.
$ JAVA_OPTS="-Dse.curity:identity-server:http:service-id-header=X-Service-Id" idsvr

When that’s done, every Curity response will contain a header like X-Service-Id: the-service-id, making it easier to trace requests to particular Curity nodes.
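As a rough illustration of how a test script or client might use this header, the following Python sketch extracts the service-id from a set of response headers. The header name X-Service-Id matches Listing 81; the function names are hypothetical and not part of Curity.

```python
from urllib.request import urlopen


def service_id_from_headers(headers, header_name="X-Service-Id"):
    """Return the service-id header value, or None if it is absent.

    `headers` is an iterable of (name, value) pairs. HTTP header names are
    case-insensitive, so the comparison is done in lower case.
    """
    wanted = header_name.lower()
    for name, value in headers:
        if name.lower() == wanted:
            return value
    return None


def fetch_service_id(url, header_name="X-Service-Id"):
    """Request `url` and report which node answered (hypothetical helper)."""
    with urlopen(url) as response:
        return service_id_from_headers(response.headers.items(), header_name)
```

Calling fetch_service_id with the URL of any endpoint served by a node would then return that node's service-id, provided the header was enabled when the node was started.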

Zulu Flight Recorder

The Curity Identity Server ships with support for Zulu Flight Recorder and Zulu Mission Control. These are branded, fully-tested builds of the open-source JDK Flight Recorder and JDK Mission Control which Oracle released in 2018. Coupled with the instrumentation and data collection performed by Flight Recorder inside of Curity, Zulu Mission Control provides monitoring and management possibilities with very little impact on Curity’s performance.

Zulu Mission Control can be downloaded from Azul’s Web site.[1] Once it is installed and started, you can use it to attach to an instance of Curity if JMX is enabled (as described above). This will allow you to monitor, in real-time, such things as CPU, memory, threads, garbage collections, and much, much more.

You can also record performance data for later analysis, which can be very helpful in difficult support cases, for instance. A recording can be made in Zulu Mission Control or by using the jcmd command that is shipped with Curity. Either way, it is a two-step process:

  1. Start the recording
  2. Stop the recording and save the results to a file

Starting a Recording Manually

Starting a recording on, or connecting to, a remote instance of Curity using Zulu Mission Control requires additional parameters to be provided when starting that instance. Refer to the Monitoring and Management section of the Java SE documentation for details of what these parameters are. As an example, when Curity runs in a local Docker container (which effectively makes it remote), Zulu Mission Control can connect to it if it is started with additional parameters passed in the JAVA_OPTS environment variable, like this:

Listing 82 Starting Curity in a Docker container with remote monitoring enabled
$ docker run -it \
    -e ENABLE_JMX=1 \
    -e JAVA_OPTS="-Dcom.sun.management.jmxremote.ssl=false
        -Dcom.sun.management.jmxremote.authenticate=false
        -Dcom.sun.management.jmxremote.port=7091
        -Dcom.sun.management.jmxremote.rmi.port=7091
        -Djava.rmi.server.hostname=localhost
        -Dcom.sun.management.jmxremote.local.only=false" \
    -e PASSWORD=$PASSWORD \
    -p 6749:6749 \
    -p 7091:7091 \
    -p 8443:8443 \
    curity/idsvr:latest

Warning

The example in Listing 82 is provided only for demonstration; it disables SSL and authentication for the JMX connection, so more secure options should be used in production.

With remote access enabled, a connection can be made using Zulu Mission Control in any of three ways: by selecting the Connect… menu option in the File menu, by selecting New Connection from the context menu of the JVM Browser, or by clicking the New Connect button from the toolbar shown in Fig. 28:

../../_images/jmc-remote-connect.png

Fig. 28 Creating a new connection to a remote instance of Curity

Whichever method is used, the following dialogue page will be shown:

../../_images/connect-modal.png

Fig. 29 JMX connection modal in Zulu Mission Control

Clicking the Test connection button should briefly flicker a status dialogue box, after which the Status field should change to OK. If so, click Finish. Otherwise, tweak the settings used when starting the instance of Curity and refer to the Java Monitoring and Management documentation for additional guidance and troubleshooting tips. Barring any connectivity issues, a new JVM should be shown in the JVM Browser treeview:

../../_images/jmc-treeview.png

Fig. 30 JVM Browser showing new connection to Curity instance

If the MBean Server child node of this new connection is clicked, a dashboard like that of Fig. 31 is shown:

../../_images/jmc-dashboard.png

Fig. 31 Dashboard showing memory, JVM CPU, remaining heap memory, and other aspects of the running instance of Curity

To start a recording in Zulu Mission Control, right-click the Flight Recorder item in the JVM Browser treeview under the established connection, as shown in Fig. 30, and then select Start Flight Recording…. The following dialogue will be shown:

../../_images/jmc-start-recording.png

Fig. 32 Starting a recording of the performance of an instance of Curity using Zulu Mission Control

Select Finish to accept the defaults or make adjustments on the current or subsequent pages as required.

Starting a Recording from the Command Line

When shell access to the machine running Curity is available, an alternative way to start a recording is to use jcmd. This tool needs either the process ID of the instance of the Curity Identity Server to be profiled or the fully qualified name of its bootstrapper class. Both alternatives work, but only the latter is shown in the following listings:

$ jcmd se.curity.identityserver.app.Bootstrapper JFR.start

Note

The jcmd command is included in various versions of the JDK, but is not directly provided by Curity. The various options and subcommands supported by this tool can be found in the jcmd documentation.

After the JFR.start subcommand is run, it will print the command necessary to dump a snapshot of analysis data. It will be something like this:

$ jcmd 59896 JFR.dump name=1 filename=FILEPATH # where 59896 is the process ID of Curity

The state of the recording can be checked using the JFR.check subcommand with either the process ID or the bootstrapper class name, like this:

$ jcmd se.curity.identityserver.app.Bootstrapper JFR.check

If recording is not currently underway, this command will provide the instructions on how to start one.

When the dump subcommand is run, the recording will be captured in the specified file. At that point, the file can be opened in Zulu Mission Control or other tools that support the JDK Flight Recorder format.

Dumping the recording to a file does not stop the recording. To do that, use the JFR.stop subcommand. Like JFR.dump, it accepts the name and filename parameters, but it stops the recording and the filename parameter is optional. If the filename is not provided, the recording is stopped without a dump being made. An example of stopping a recording named 1 is shown in the following listing:

$ jcmd se.curity.identityserver.app.Bootstrapper JFR.stop name=1 filename=/tmp/recording_1.jfr
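
When recordings need to be scripted, for example from a support tool, the start, dump and stop steps described above can be wrapped in a few lines of Python. This is only a sketch: the helper names are hypothetical, and it assumes jcmd is on the PATH of the user running the script.

```python
import subprocess

BOOTSTRAPPER = "se.curity.identityserver.app.Bootstrapper"


def jcmd_command(target, subcommand, **params):
    """Build a jcmd argument list, e.g. for JFR.start, JFR.dump or JFR.stop.

    `target` is a process ID or main class name; `params` become key=value
    arguments in the order given.
    """
    args = ["jcmd", str(target), subcommand]
    args += [f"{key}={value}" for key, value in params.items()]
    return args


def run_jcmd(target, subcommand, **params):
    """Run jcmd and return what it printed; raises if jcmd exits with an error."""
    result = subprocess.run(
        jcmd_command(target, subcommand, **params),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# The three steps as argument lists (nothing is executed here).
start = jcmd_command(BOOTSTRAPPER, "JFR.start")
dump = jcmd_command(BOOTSTRAPPER, "JFR.dump", name=1, filename="/tmp/recording_1.jfr")
stop = jcmd_command(BOOTSTRAPPER, "JFR.stop", name=1)
```

Passing the same arguments to run_jcmd instead of jcmd_command executes the step; the recording name (1 here) is the one reported by JFR.start or JFR.check.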

Starting a Recording on Startup

It is also possible to start a recording by providing certain command line options to the Curity Identity Server. This can be done in various ways, for example by configuring the JVM options. Typically, though, a better way is to set the JAVA_OPTS environment variable to include the parameters necessary to start the recording when the Curity Identity Server starts. Either way, the parameters will be something like these:

-XX:StartFlightRecording=filename=my-good-file.jfr,duration=10m

For information about available flags that can be passed when starting a recording, refer to the flight recorder command reference.

Whether it is produced using Zulu Mission Control, command line options provided to the Curity Identity Server, or jcmd, the resulting file can be analyzed and potentially shared with support. This will give a lot of insight into the source of potential issues. For more information on Flight Recorder and Mission Control, refer to the following sources:

Status Endpoint

The Curity Identity Server provides an HTTP endpoint that reports node status information. Its operation is configured by the following environment variables.

Environment variable    Description                           Default value
STATUS_CMD_ENABLED      Endpoint enable state                 true
STATUS_CMD_PORT         HTTP port to bind to                  4465
STATUS_CMD_HOST         Network host or address to listen on  0.0.0.0
STATUS_CMD_MAX_THREADS  Maximum thread number                 16

By default, this status endpoint is enabled. It can be disabled by setting the STATUS_CMD_ENABLED environment variable to false or by starting idsvr with the --no-status parameter.

The status endpoint supports HTTP GET and HEAD requests to the / path. The response will have status code:

  • 200 if the node is started and configured. Note that a 200 status code means the node is configured but doesn’t ensure the server is listening for functional requests. For instance, the node may have a service role that is disabled or the internal HTTP server may still be starting/reconfiguring. To know if a node is ready to serve functional requests use the isServing JSON field or the /serving request path (see below).
  • 503 if the node is not started or not yet configured.

In both cases, the response body will contain a JSON representation of the node status, containing the following fields:

  • isReady
    • false - the node is not ready to process requests (e.g. it is still booting or is shutting down).
    • true - the node is ready to receive and process requests (but may be disconnected from the admin).
  • nodeState
    • BOOTING - the node is starting up and not ready to process requests.
    • WAITING - the node is ready to process requests with the latest configuration that it has; however, it is still waiting to connect to the admin, so configuration may be stale.
    • RUNNING - the node is ready to process requests and is connected to the admin node.
    • ERROR - the node is in an unrecoverable error state.
    • STOPPING - the node is shutting down and not able to process requests.
../../_images/node-state.svg
  • clusterState
    • STANDALONE - the node has clustering disabled.
    • CONNECTING_TO_CLUSTER - the node is not an admin node and is trying to connect (for the first time) or reconnect to the admin node.
    • CONNECTED - the node is not admin and is connected to the admin node.
    • ADMIN - the node is an admin node.
    • ERROR - an unexpected error occurred when checking the cluster state.
../../_images/cluster-state.svg
  • configurationState
    • UNINITIALIZED - the node is not configured and therefore unable to correctly process requests.
    • CONFIGURED - the node is fully configured.
    • RECONFIGURING - the node is currently consuming a new configuration. A previous configuration is still valid and will be used in the meanwhile for any request processing.
../../_images/configuration-state.svg
  • transactionId - an opaque string identifier for the last committed transaction seen by the current node.
  • isServing
    • false - the node’s HTTP server that serves the runtime endpoints is not running.
    • true - the node’s HTTP server that serves the runtime endpoints is running.
  • isAdminServing (this field is returned only on admin nodes)
    • false - the node’s HTTP server that serves the admin endpoints (i.e. an endpoint serving Admin UI requests) is not running.
    • true - the node’s HTTP server that serves the admin endpoints is running.
  • pluginsInitialized
    • true - all the plugins installed in the node are done initializing.
    • false - some plugins installed in the node are still initializing.
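
A monitoring script can combine these fields into a list of findings. The following Python sketch shows one way to do so; the function name, the wording of the findings and the sample body are illustrative assumptions, and only the field names and values come from the list above.

```python
import json


def summarize_status(body):
    """Return a list of problems found in a status endpoint JSON body."""
    status = json.loads(body)
    problems = []
    if not status.get("isReady", False):
        problems.append("node is not ready")
    if not status.get("isServing", False):
        problems.append("runtime HTTP server is not running")
    if status.get("nodeState") == "ERROR":
        problems.append("node is in an unrecoverable error state")
    if status.get("nodeState") == "WAITING":
        problems.append("not connected to the admin; configuration may be stale")
    if status.get("configurationState") == "UNINITIALIZED":
        problems.append("node is not configured")
    # isAdminServing is only returned on admin nodes, so only an explicit
    # false value counts as a problem.
    if status.get("isAdminServing") is False:
        problems.append("admin HTTP server is not running")
    return problems


# Hypothetical response body; the values are made up for illustration.
sample = json.dumps({
    "isReady": True,
    "nodeState": "WAITING",
    "clusterState": "CONNECTING_TO_CLUSTER",
    "configurationState": "CONFIGURED",
    "transactionId": "example-transaction-id",
    "isServing": True,
    "pluginsInitialized": True,
})
findings = summarize_status(sample)
```

An empty list means none of the checked conditions applied; which findings should raise an alert is a decision for each deployment.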

The status endpoint also contains the /serving and /admin-serving paths to expose the isServing and isAdminServing information respectively via the status code, which is useful if the probing system is unable to process JSON representations. These paths accept both GET and HEAD requests. The response for the /serving path will have an empty body and status code:

  • 200 - the node’s HTTP server that serves the runtime endpoints is running.
  • 503 - the node’s HTTP server that serves the runtime endpoints is not running.

The response for the /admin-serving path will have an empty body and status code:

  • 200 - the node’s HTTP server that serves the admin endpoints is running.
  • 503 - the node’s HTTP server that serves the admin endpoints is not running.
  • 404 - the node is not admin.

Note that the /admin-serving path is only served on admin nodes. On runtime nodes, a request to this path will return 404.
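
For probing systems written in Python, the status codes above can be mapped to their meanings as in the following sketch. The helper names are hypothetical; the host and port defaults assume a local node using the default STATUS_CMD_PORT.

```python
import http.client

# Status code meanings for the two paths, as described above.
MEANINGS = {
    ("/serving", 200): "runtime endpoints are being served",
    ("/serving", 503): "runtime endpoints are not being served",
    ("/admin-serving", 200): "admin endpoints are being served",
    ("/admin-serving", 503): "admin endpoints are not being served",
    ("/admin-serving", 404): "node is not an admin node",
}


def interpret(path, status_code):
    """Translate a (path, status code) pair into a short description."""
    return MEANINGS.get((path, status_code), f"unexpected status {status_code}")


def probe(path, host="localhost", port=4465, timeout=5.0):
    """Send a HEAD request to the status endpoint and return the status code."""
    connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        connection.request("HEAD", path)
        return connection.getresponse().status
    finally:
        connection.close()
```

For example, interpret("/serving", probe("/serving")) describes whether the local node is serving runtime requests.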

Command line tool

The Curity Identity Server installation also contains the bin/status command line tool that can be used to probe the HTTP status endpoint. It uses the same environment variables the server uses and accepts the following invocation parameters:

  • -j or --json - if present, the response written to the standard output is in the JSON format; otherwise it is written in plain text.
  • -h or --help - prints the synopsis of the tool
  • -v - not used but maintained for backward compatibility reasons

The status tool performs a request to the local node status endpoint and writes the response body to the standard output. The tool's exit codes are described in the following table.

Exit code  Description
0          The probed node is ready.
1          The status endpoint is disabled and was not probed.
4          There was an IO error while communicating with the status endpoint.
103        A response with a 3xx status was received from the status endpoint.
104        A response with a 4xx status was received from the status endpoint.
105        A response with a 5xx status was received from the status endpoint.
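
A wrapper script may want to turn these exit codes into readable messages. The Python sketch below does this; the helper names are hypothetical, and the default path assumes the script is run from the root of the installation directory.

```python
import subprocess

# Exit codes of the status tool, as listed in the table above.
EXIT_CODES = {
    0: "The probed node is ready.",
    1: "The status endpoint is disabled and was not probed.",
    4: "There was an IO error while communicating with the status endpoint.",
    103: "A response with a 3xx status was received from the status endpoint.",
    104: "A response with a 4xx status was received from the status endpoint.",
    105: "A response with a 5xx status was received from the status endpoint.",
}


def describe_exit_code(code):
    """Return the documented meaning of a status tool exit code."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def run_status_tool(path="bin/status"):
    """Run the status tool with --json; return (exit code, standard output)."""
    completed = subprocess.run([path, "--json"], capture_output=True, text=True)
    return completed.returncode, completed.stdout
```

The exit code returned by run_status_tool can be passed straight to describe_exit_code, while the second element holds the JSON body written by the tool.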

Prometheus-compliant Metrics

Each run-time and admin node exposes an endpoint where certain information is published in a Prometheus-compliant format (i.e., Prometheus’ OpenMetrics format). This allows the Prometheus monitoring tool (or others that can process data in this format) to monitor certain metrics about the behavior of the node. This endpoint is exposed over HTTP and listens on the same interface as the status endpoint described above. The port used is one greater than that of the status endpoint (4466 by default). The URI is /metrics, so, for example, the URL of the data would be http://localhost:4466/metrics.
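
Because the metrics port is derived from the status endpoint's port, a scraper's target URL can be computed rather than hard-coded. The following Python sketch illustrates the rule; the function name and the host and scheme defaults are assumptions.

```python
import os


def metrics_url(environ=None, host="localhost", scheme="http"):
    """Derive the metrics URL from the status endpoint's port setting."""
    environ = os.environ if environ is None else environ
    status_port = int(environ.get("STATUS_CMD_PORT", "4465"))
    # The metrics port is one greater than the status endpoint's port.
    return f"{scheme}://{host}:{status_port + 1}/metrics"
```

With no STATUS_CMD_PORT set, this yields port 4466; with STATUS_CMD_PORT=5000 it yields port 5001.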

The metrics exposed and their meanings are described in the following table:

Metric Name                                     Type     Labels                             Meaning
idsvr_authentication_login_total                Counter  acr, profile_id                    The number of authentication events that have occurred
idsvr_authentication_sso_total                  Counter  acr, profile_id                    The number of Single Sign-on events that have occurred
idsvr_cpu_usage                                 Gauge    -                                  The amount of CPU used (0 <= x <= 1) by the Java process that the node started
idsvr_datasource_account_sum                    Counter  ds_id, ds_type                     The total time (in seconds) taken by all account data sources
idsvr_datasource_account_count                  Counter  ds_id, ds_type                     The number of operations performed by all account data sources
idsvr_datasource_attribute_sum                  Counter  ds_id, ds_type                     The total time (in seconds) taken by all attribute data sources
idsvr_datasource_attribute_count                Counter  ds_id, ds_type                     The number of operations performed by all attribute data sources
idsvr_datasource_credential_sum                 Counter  ds_id, ds_type                     The total time (in seconds) taken by all credential data sources
idsvr_datasource_credential_count               Counter  ds_id, ds_type                     The number of operations performed by all credential data sources
idsvr_datasource_database_client_sum            Counter  ds_id, ds_type                     The total time (in seconds) taken by all database client data sources
idsvr_datasource_database_client_count          Counter  ds_id, ds_type                     The number of operations performed by all database client data sources
idsvr_datasource_dcr_sum                        Counter  ds_id, ds_type                     The total time (in seconds) taken by all dynamic client registration data sources
idsvr_datasource_dcr_count                      Counter  ds_id, ds_type                     The number of operations performed by all dynamic client registration data sources
idsvr_datasource_delegation_sum                 Counter  ds_id, ds_type                     The total time (in seconds) taken by all delegation data sources
idsvr_datasource_delegation_count               Counter  ds_id, ds_type                     The number of operations performed by all delegation data sources
idsvr_datasource_device_sum                     Counter  ds_id, ds_type                     The total time (in seconds) taken by all device data sources
idsvr_datasource_device_count                   Counter  ds_id, ds_type                     The number of operations performed by all device data sources
idsvr_datasource_nonce_sum                      Counter  ds_id, ds_type                     The total time (in seconds) taken by all nonce data sources
idsvr_datasource_nonce_count                    Counter  ds_id, ds_type                     The number of operations performed by all nonce data sources
idsvr_datasource_session_sum                    Counter  ds_id, ds_type                     The total time (in seconds) taken by all session data sources
idsvr_datasource_session_count                  Counter  ds_id, ds_type                     The number of operations performed by all session data sources
idsvr_datasource_token_sum                      Counter  ds_id, ds_type                     The total time (in seconds) taken by all token data sources
idsvr_datasource_token_count                    Counter  ds_id, ds_type                     The number of operations performed by all token data sources
idsvr_datasource_bucket_sum                     Counter  ds_id, ds_type                     The total time (in seconds) taken by all bucket data sources
idsvr_datasource_bucket_count                   Counter  ds_id, ds_type                     The number of operations performed by all bucket data sources
idsvr_http_server_request_time_sum              Counter  -                                  The total amount of time (in seconds) that all HTTP requests have taken
idsvr_http_server_request_time_count            Counter  -                                  The number of HTTP requests that have been made
idsvr_jvm_memory_used                           Gauge    memory_id, memory_area             The amount of memory used (in bytes) by the Java process that the node started
log4j2_appender_total                           Counter  level                              The number and severity of log messages which have been written since start up
idsvr_oauth_delegation_issued_total             Counter  client_id, profile_id              The number of delegations issued
idsvr_oauth_delegation_revoked_total            Counter  client_id, profile_id              The number of delegations revoked
idsvr_oauth_token_issued_total                  Counter  client_id, token_type, profile_id  The number of OAuth tokens (access, ID, refresh) issued
idsvr_oauth_token_revoked_total                 Counter  client_id, token_type, profile_id  The number of OAuth tokens revoked
idsvr_oauth_introspection_denied_total          Counter  client_id, profile_id              The number of introspections denied because the token was not active
idsvr_http_client_request_successful_sum        Counter  http_client_id, authority          The total duration of requests issued by an HTTP client resulting in a response with a successful status code.
idsvr_http_client_request_successful_count      Counter  http_client_id, authority          The number of requests issued by an HTTP client resulting in a response with a successful status code.
idsvr_http_client_request_client_error_sum      Counter  http_client_id, authority          The total duration of requests issued by an HTTP client resulting in a response with a client error status code (4xx).
idsvr_http_client_request_client_error_count    Counter  http_client_id, authority          The number of requests issued by an HTTP client resulting in a response with a client error status code (4xx).
idsvr_http_client_request_server_error_sum      Counter  http_client_id, authority          The total duration of requests issued by an HTTP client resulting in a response with a server error status code (5xx).
idsvr_http_client_request_server_error_count    Counter  http_client_id, authority          The number of requests issued by an HTTP client resulting in a response with a server error status code (5xx).
idsvr_http_client_pool_connections              Gauge    http_client_id                     The total number of connections in the connection pool of an HTTP client.
idsvr_http_client_pool_active_connections       Gauge    http_client_id                     The number of active (in-use) connections in the connection pool of an HTTP client.
idsvr_credential_verification_successful_total  Counter  credential_manager_id              The number of successful credential verifications done by a credential manager.
idsvr_credential_verification_failed_total      Counter  credential_manager_id              The number of failed credential verifications done by a credential manager.
idsvr_request_pool_threads                      Gauge    -                                  The total number of threads in the request thread pool.
idsvr_request_pool_available_threads            Gauge    -                                  The number of threads in the request thread pool that are available to handle requests.
idsvr_request_pool_active_threads               Gauge    -                                  The number of threads in the request thread pool that are in use.
idsvr_request_pool_queue_size                   Gauge    -                                  The request thread pool’s queue size (i.e. number of requests waiting for a thread)
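
To inspect these metrics outside of Prometheus, the text published at /metrics can be read with a few lines of Python. The parser below is a deliberately small sketch, not a complete implementation of the exposition format, and the sample text uses made-up values.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse metrics text into a list of (name, labels, value) tuples.

    Comment lines (# HELP, # TYPE) are skipped, and escape sequences inside
    label values are left as-is.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of a /metrics response.
SAMPLE_TEXT = """\
# TYPE idsvr_cpu_usage gauge
idsvr_cpu_usage 0.25
idsvr_jvm_memory_used{memory_id="G1 Old Gen",memory_area="heap"} 15000000.0
log4j2_appender_total{level="error"} 0.0
"""
samples = parse_metrics(SAMPLE_TEXT)
```

Each tuple carries the metric name, a dictionary of its labels, and the value as a float, which is enough for simple threshold checks.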

Note

The client_id of the idsvr_oauth_token_issued_total metric for ID tokens will be that of the requesting client (i.e., the authorized party). For all other tokens, the client_id is the ID of the client to whom the token was issued.

Note

Request thread pool metrics (whose name starts with idsvr_request_pool_) are only available when JMX is enabled.

In addition to the global metrics described above, some plugins may contribute metrics related to their particular usage. For example, the JDBC data source plugin exposes metrics of the underlying connection pool.

The labels in the previous table have the meanings described in the following table:

Label Name             Meaning
acr                    The authentication context class reference (ACR) of the authenticator used for login or SSO (as applicable)
client_id              The identifier of the OAuth client to which the metric is related. This is disabled by default.
ds_id                  The identifier of the data source to which the metric is related
ds_type                The type of data source to which the metric is related (e.g., ldap, jdbc, etc.)
level                  The level of the log message (e.g., error, warn, etc.)
memory_id              The identifier representing the pool of memory being measured (e.g., G1 Old Gen, etc.)
memory_area            The type of memory being measured (heap, non-heap, etc.)
token_type             The type of token to which the measurement is related (e.g., access_token, etc.)
profile_id             The identifier of the profile to which the metric is related. This is disabled by default.
http_client_id         The identifier of the HTTP client to which the metric is related
authority              The authority (hostname and port) used to contact the target system to which the metric is related.
credential_manager_id  The identifier of the credential manager to which the metric is related.

Note

By default, no unique values are reported for client_id, to prevent an explosion of label values, which Prometheus has a hard time handling. If the system only contains a small number of clients, then this can be enabled by setting the system property se.curity:identity-server:reporting:include-client-id-label=true when starting the Curity Identity Server.

Gathering of data can be disabled. If this is set when the node starts, no data will be published. To disable gathering of data, in the admin UI, go to System ‣ General. There, toggle off Enable Reporting. Once that change is committed, all nodes will stop gathering data.

Common Alerts

If you want to set up alerts for when things go wrong in the Curity Identity Server, the following are a good starting point:

  • If idsvr_datasource_*_sum / idsvr_datasource_*_count >= 800 since the last poll to the metrics endpoint, your database is having issues. The result of this arithmetic is the average response time from the Curity Identity Server to the database (for the given period).
  • If log4j2_appender_total with a label of error is > 0, call support!
  • If log4j2_appender_total with a label of warn is greater than at the last poll, look into the issue immediately, and raise a support case if you can’t figure out the problem.
  • If idsvr_cpu_usage is >= 0.95 (95%) at an unexpected time or for a prolonged period of time, you should take action.
  • If idsvr_http_server_request_time_sum / idsvr_http_server_request_time_count >= 1000 since the last poll to the metrics endpoint, the server is responding slowly. The result of this arithmetic is the average HTTP response time of the Curity Identity Server Web server (for the given period).
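
The averages used in these alerts are computed from the change in a _sum and _count pair between two polls, not from their absolute values. The Python sketch below shows the arithmetic; the function names are hypothetical, and the threshold must be expressed in the same unit as the _sum metric, which is left to the caller.

```python
def average_since_last_poll(previous, current):
    """Average time per operation between two polls of a _sum/_count pair.

    `previous` and `current` are (sum, count) tuples read from the metrics
    endpoint. Returns None when no new operations were recorded.
    """
    delta_sum = current[0] - previous[0]
    delta_count = current[1] - previous[1]
    if delta_count <= 0:
        # No new samples, or the counters were reset by a restart.
        return None
    return delta_sum / delta_count


def should_alert(previous, current, threshold):
    """Return True when the average since the last poll reaches the threshold."""
    average = average_since_last_poll(previous, current)
    return average is not None and average >= threshold
```

For example, if the sum grew from 10.0 to 12.0 while the count grew from 100 to 104, the average over that period is 0.5.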

[1] Azul is the provider of Zulu, a branded, supported version of Java which Curity delivers.

Configuration

Prometheus metrics are configured under the /environments/environment/reporting section.

Parameter           Description
enable              Flag to enable/disable gathering of Prometheus metrics
include-profile-id  Flag indicating whether the profile_id label should be enabled

Listing 83 The reporting configuration as shown in the CLI
% show environments environment reporting
enable             true;
include-profile-id true;