OAuth Troubleshooting for DevOps

OAuth Troubleshooting for DevOps

Intro

These days it is common for developers and DevOps to collaborate when companies build digital solutions. Troubleshooting and failure scenarios are prepared for during engineering stages of the deployment pipeline, so that there is nothing extra to do for production rollouts.

When using OAuth security, this process includes managing an Authorization Server provided by a third party, which becomes a critical component. Therefore it requires certain operational features so that you can manage it using modern operational design patterns.

DevOps staff should also have an understanding of OAuth Troubleshooting for Developers, and feed into behavior for error code paths. Essentially though, the Authorization Server should be managed in a similar manner to your own APIs.

Hosting Infrastructure

The Identity System will be clustered and will run on multiple containers, to ensure no single point of failure. In more sophisticated setups this may involve Multi Region Deployment or Dynamic User Routing of OAuth and API requests. DevOps will often be involved in managing the following types of event:

  • An update to the OAuth Production Configuration
  • An upgrade of the Authorization Server software
  • Investigating an OAuth error in an application
  • Troubleshooting an availability or load problem

Configuration Updates

The Authorization Server will connect to various data sources. User accounts, credentials, account linking information and audit data are stored permanently, while session related data is stored temporarily.

Meanwhile the OAuth configuration contains client settings such as redirect URIs and allowed scopes. This information is replicated to all runtime nodes of the Authorization Server and it should be possible to scale configuration management to a global setup.

Global Cluster

Parameterized Configuration

When the Authorization Server is deployed down a pipeline, it is recommended to keep the main configuration the same for at least some stages, such as Staging and Production, to avoid duplication. The Curity Identity Server enables this via parameterized configuration, where variables are used to express environment specific values.

Incorrect Configurations

In the event of an OAuth client being configured with incorrect settings, there will be an application error. If this ever occurs it may not be possible for the app itself to recover from the problem. Instead the app will present an error screen and the configuration will need correcting.

Configuration Backups

The Authorization Server should provide hooks that enable you to backup the configuration whenever it changes. It is common to save a configuration history in source control. In the event of problems you will then be able to quickly rollback to the last good configuration.

Safe Upgrades

Upgrading the Authorization Server is something that is best done frequently, such as once per quarter. As well as providing apps with the latest security features, this also ensures that DevOps staff are comfortable with the technical procedure. The end-to-end process should be similar to how you upgrade your own UIs and APIs.

These days, hosting platforms help to enable zero downtime upgrades for end users. While the Authorization Server is being upgraded, it must be possible to run some nodes on the old version and some on the new version. The Deployment Concepts article summarizes how upgrades work for the Curity Identity Server. Platforms such as Kubernetes also enable the use of advanced patterns such as canary or blue green if required.

After upgrading the Authorization Server it is also recommended to run some automated tests, to provide immediate feedback that basic operations are still in a working state. This might include an automated UI test that signs in a test user account, then calls an API to perform a read only operation.

Incidents

In the event of the Authorization Server experiencing a problem, it will either return an error to the calling application, or will present an error. Incidents are always in the past, such as an important American user experiencing a problem yesterday at 9pm European time. DevOps will often be informed of the incident via a screenshot of an error display:

Error Display

This particular example is quite self explanatory, that an invalid client ID was sent in an OAuth request from an application. It is not expected to happen in production systems, since this type of issue should be resolved during development. In other cases the cause may be more subtle and it may be necessary to look up the displayed Error Identifier in logs.

For OAuth back channel errors, such as invalid requests to the token endpoint, the app will receive an error code from the Authorization Server. In these cases, developers should aim for a similar UI display, to provide hints to DevOps staff on what to do next. One option can be to display the current time and error code, which will then be useful during log lookup.

OAuth Technical Support Logs

The Authorization Server should provide modern logging features, since it may occasionally need to provide causes of intricate security problems, such as cryptographic failures. The Logging Best Practices article explains the logging architecture of the Curity Identity Server.

Production systems are configured with an INFO log level, since this is both secure and also performs well. At this log level, the Authorization Server must log problem causes so that DevOps have the information they need to resolve OAuth related incidents in a timely manner.

Log Aggregation

In order to be able to quickly look up issues in your own APIs it is common to use log aggregation, so that all logs for the cluster are available in a central place. Modern open source logging stacks make this much easier than it used to be, and companies can get started without a great deal of development work.

It should then be possible to feed logs from the Authorization Server into the log management system. The Log to Elasticsearch tutorial provides an end-to-end log aggregation example, after which the built-in UI and dashboard capabilties can be used by your DevOps team.

Kibana Live Analysis

Alarms

Production incidents typically fall into two main categories. Client errors occur when an invalid request is sent, in which case error responses include an error code and an optional error_description, and further details may be included in logs.

Server errors usually occur if the Authorization Server cannot connect to downstream data sources, in which case it is important that DevOps staff receive early warning. The Curity Identity Server provides an Alarms feature to provide early warning of this type of failure. Custom Event Handlers can then be used to notify other systems when alarms occur.

JDBC Alarm

The Authorization Server logs must then explain the underlying cause of server errors. The following log entry was produced by intentionally misconfiguring a JDBC connection, so it is straightforward to rehearse the use of alarms to enable your preferrred behavior:

org.postgresql.util.PSQLException: FATAL: password authentication failed for user "postgres"
	at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:525) ~[?:?]
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:146) ~[?:?]
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:197) ~[?:?]
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) ~[?:?]
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:217) ~[?:?]
	at org.postgresql.Driver.makeConnection(Driver.java:458) ~[?:?]
	at org.postgresql.Driver.connect(Driver.java:260) ~[?:?]
	at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138) ~[?:?]
	at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:364) ~[?:?]
	at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:206) ~[?:?]
	at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:476) ~[?:?]

Availability

The Authorization Server should have a Health Status Endpoint, to notify the platform or a monitoring system if an instance of a component ever becomes non-contactable. Typically then the instance will be automatically marked down and a new instance spun up, to restore the desired state of the overall cluster.

You will also want early warning if the Authorization Server is struggling to serve the required load. The Curity Identity Server provides Prometheus compliant time-series metrics, where each node reports values such as CPU / memory usage and the average HTTP request time. This enables DevOps to use techniques such as Horizontal Autoscaling, to automatically increase the number of instances at busy times.

DevOps Toolset

DevOps will need to manage the Identity and Access Management system, to perform tasks such as granting access to other users within the organization, or to troubleshoot configuration. The Curity Identity Server provides a number of tools focused on providing visibility and control to DevOps staff. You can read more about them in the following links:

Lower Level Troubleshooting

The Authorization Server is a component that you should be able to extend via plugins or custom scripts. It is possible that a poorly written plugin could cause problems such as thread deadlocks or memory leaks. To diagnose this type of problem, the Curity Identity Server runs on an industry leading Java Azul Virtual Machine, on an up to date Java version. This enables JVM Monitoring to be used for worst case scenarios.

Conclusion

In a connected systems world, technical issues are always possible. Therefore it is recommended that DevOps and developers collaborate to rehearse points of failure, to ensure the most resilient apps. Your Identity and Access Management system should also be a part of this people process.

At Curity we know that managing an identity system in production is not trivial, so we provide the best modern cloud native troubleshooting features. This ensures that you will receive early warning of any problems, along with useful technical logs to enable you to diagnose causes quickly.