
Error Handling and Recovery Comparison

serac edited this page Jan 16, 2013 · 3 revisions

LDAP Authentication Pool Recovery Testing

We compared the behavior of CAS under ldaptive and Spring LDAP components by simulating directory outages of various kinds and measuring the high-level behavior of CAS during each outage and after directory recovery.

System Architecture

The system architecture for our tests is conceptually similar to the following diagram.

[Figure: CAS LDAP System Architecture]

In the ldaptive configuration, CAS maintains a pool of persistent connections to each of the three directory pools, whereas in the Spring LDAP configuration there are persistent pools to all but the authentication pool. A connection to the authorization pool is required for successful authentication since that pool is used for CAS principal resolution during the authentication process.

Test Plan

We measured CAS availability with a simple JMeter test plan that attempted to log into CAS and obtain a service ticket for a fictional service. The test sequence was configured to run indefinitely under a consistent low-level load (30 authentications/min). For each test case, the directory administrator shut down the directory in the manner described, and we observed and recorded the result, i.e. whether authentication resumed upon directory recovery. The LDAP pool configuration was equivalent in all test configurations except where noted.

Results

ldaptive Components

  1. Graceful directory restart: recovered immediately
  2. Restart http health check (directory not restarted): recovered immediately
  3. kill -9 OpenLDAP, restart: recovered immediately

Spring LDAP Components

  1. Graceful directory restart: did not recover; required CAS restart
  2. Restart http health check (directory not restarted): recovered immediately
  3. kill -9 OpenLDAP, restart: did not recover; required CAS restart

The second test case merits some discussion. An Apache instance provides a health check endpoint for the load balancer to monitor whether an instance is available for routing connections. If the health check endpoint is disabled (e.g. via httpd restart), the directory remains up but is removed from the pool by the load balancer, which effectively takes it offline.

Discussion

It's worth discussing the software design that accounts for the results above. Spring LDAP uses commons-pool to pool the JNDI InitialContext objects that represent LDAP connections to higher-level components. The commons-pool library uses a LIFO queue (by default) to store and retrieve items in the pool, and it can validate pool members at checkout, at check-in, and during idle periods. In the tests above, the pools were configured for idle validation exclusively. When the directory becomes unavailable, the Socket underlying each InitialContext object becomes invalid, but since the pool members are not idle, nothing notifies the pool that its members are defunct and need to be culled and reprovisioned. Because the pool contains invalid members and has no way of identifying them, the CAS authentication process that requires these connections cannot proceed and CAS must be restarted.

It's worth noting that Spring LDAP components can recover gracefully if testOnBorrow is set to true on the pool. With this option enabled, the pool resource is validated before use; on failure it is culled and a new resource is provisioned to replace it. Enabling this facility provides resilience at the expense of throughput since an additional LDAP operation is performed every time a pool member is requested.
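The difference between the two pool behaviors can be illustrated with a toy model. The sketch below is not the commons-pool or Spring LDAP API; `ToyPool`, `Conn`, `killAll`, and `successesAfterOutage` are hypothetical names invented for illustration, and the `open` flag is an assumed stand-in for the state of the underlying socket. It only shows why a LIFO pool without borrow-time validation keeps handing out the same dead connection, while a pool that validates on borrow replaces it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ToyPoolDemo {
    /** Hypothetical stand-in for a pooled LDAP connection; 'open' models the socket state. */
    static final class Conn {
        boolean open = true;
    }

    /** Toy LIFO pool for illustration only; this is NOT the commons-pool API. */
    static final class ToyPool {
        private final Deque<Conn> idle = new ArrayDeque<>();
        private final boolean testOnBorrow;

        ToyPool(int size, boolean testOnBorrow) {
            this.testOnBorrow = testOnBorrow;
            for (int i = 0; i < size; i++) {
                idle.push(new Conn());
            }
        }

        Conn borrow() {
            // LIFO: the most recently returned connection is handed out first.
            Conn c = idle.isEmpty() ? new Conn() : idle.pop();
            if (testOnBorrow && !c.open) {
                // Validation failed: discard the dead member and provision a new one.
                c = new Conn();
            }
            return c;
        }

        void giveBack(Conn c) {
            idle.push(c);
        }

        /** Simulates a directory outage: every pooled socket becomes invalid. */
        void killAll() {
            for (Conn c : idle) {
                c.open = false;
            }
        }
    }

    /** Counts how many of n borrow/use/return cycles succeed after a simulated outage. */
    static int successesAfterOutage(boolean testOnBorrow, int n) {
        ToyPool pool = new ToyPool(3, testOnBorrow);
        pool.killAll();
        int ok = 0;
        for (int i = 0; i < n; i++) {
            Conn c = pool.borrow();
            if (c.open) {
                ok++;
            }
            pool.giveBack(c);
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println("no validation: " + successesAfterOutage(false, 10));
        System.out.println("testOnBorrow:  " + successesAfterOutage(true, 10));
    }
}
```

In this model the unvalidated pool never succeeds because the dead connection is returned to the top of the LIFO stack and borrowed again, whereas the validating pool pays one extra check per borrow and recovers on the first request.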

The ldaptive library provides a facility to recover without unconditionally testing connections before handing them out. Instead, it provides a retry mechanism that automatically reprovisions pool members on certain error conditions, one of which is a JNDI CommunicationException that arises from a closed socket. Thus ldaptive recovers gracefully using this facility without enabling its equivalent of testOnBorrow.
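The retry-on-error idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ldaptive's actual implementation or API: `Holder`, `Conn`, `LdapOp`, and `searchAfterOutage` are hypothetical names, and a single retry is assumed. The only real types used are the JNDI exception classes from `javax.naming`.

```java
import javax.naming.CommunicationException;
import javax.naming.NamingException;

public class RetryDemo {
    /** Hypothetical operation run against a connection. */
    interface LdapOp<T> {
        T run(Conn c) throws NamingException;
    }

    /** Hypothetical connection; 'open' models the state of the underlying socket. */
    static final class Conn {
        boolean open = true;

        String search(String query) throws NamingException {
            if (!open) {
                throw new CommunicationException("socket closed");
            }
            return "result:" + query;
        }
    }

    /** Holds one connection and reprovisions it when a communication error occurs. */
    static final class Holder {
        Conn conn = new Conn();
        int reconnects = 0;

        <T> T execute(LdapOp<T> op) throws NamingException {
            try {
                // No validation up front: the operation is simply attempted.
                return op.run(conn);
            } catch (CommunicationException e) {
                // The connection is defunct: replace it and retry once.
                conn = new Conn();
                reconnects++;
                return op.run(conn);
            }
        }
    }

    /** Simulates an outage, then runs a search that should recover via the retry. */
    static String searchAfterOutage() {
        Holder h = new Holder();
        h.conn.open = false; // simulated directory outage closes the socket
        try {
            String result = h.execute(c -> c.search("uid=test"));
            return result + " reconnects=" + h.reconnects;
        } catch (NamingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchAfterOutage());
    }
}
```

The design trade-off in this sketch is that the healthy path costs nothing extra; the cost of detecting a dead connection is paid only by the one request that hits it.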