Refactoring: Access Check and Exclusion

Kenji Nagahashi edited this page Feb 3, 2016 · 3 revisions

Access Check and Exclusion

(this is still a pile of random thoughts - I'd welcome comments/clean ups - Kenji) There are two frameworks for filtering out captures: one for ResourceIndex/CaptureSearchResult and another for CDXServer/CDXLine. Since we plan to consolidate the index implementation into CDXServer, I focus on the CDXServer version of capture filtering here. To reuse filtering components originally written for the ResourceIndex framework, CDXServer/CDXLine contains some dirty bridge work. That is one area needing clean-up.

Two kinds of filtering are recognized to date:

  • scope filtering (for hosting multiple collections on top of a single Wayback index)
  • access control (for prohibiting playback of certain captures, often controlled by an external filtering rule database)

Scope filtering is currently tightly coupled with CompositeAccessPoint, and is specific to the use case at Internet Archive. Access control can change the visibility of individual CDX fields as well as filter out CDX lines altogether.

The key difference between the two is:

  • scope filtering is silent; there's no need to communicate to the user that captures are being filtered out, whereas
  • access control needs to communicate what filtering took effect and how (ex. "excluded by robots.txt"). This communication is not well implemented in my opinion (more later).

Another (minor) difference is:

  • scope filtering is usually statically configured, whereas
  • access control often varies by client (username, IP address etc.)

Current Implementation

Description of classes involved.

AuthChecker

This interface is a factory of CDXAccessFilter. CDXServer is configured with an implementation of this interface at startup. CDXAccessFilter is a per-session filtering object.

AuthChecker also grants permissions to AuthToken. The primary implementation, PrivTokenAuthChecker, is configured with a list of pre-defined tokens, and determines whether a user (subject; represented by AuthToken) has certain permissions.

The mix-up of these two functions is a legacy of the stand-alone CDXServer implementation, and it makes authentication/authorization functionality harder to reuse in Wayback. A more common approach is to have a separate authentication/authorization component, and let other parts of the application consult the subject object for user permissions. With that architecture, we can remove the isAllUrlAccessAllowed and isAllCdxFieldAccessAllowed methods from AuthChecker.

The getPublicCdxFields method appears to be unused; I don't know why it must be part of the AuthChecker interface. PrivTokenAuthChecker has setPublicCdxFields(String), which updates the publicCdxFormat property with a FieldSplitFormat object. There's no setPublicCdxFormat(FieldSplitFormat) method.

getPublicCdxFormat is used by CDXServer.writeCdxResponse method:

if (!authChecker.isAllCdxFieldAccessAllowed(authToken)) {
    outputFields = this.authChecker.getPublicCdxFormat();
}

This property could be moved to CDXServer.

WaybackAPAuthChecker has been superseded by AccessPointAuthChecker, and there is currently no known user of WaybackAPAuthChecker. Its base class WaybackAuthChecker has no other sub-classes. Both classes can be removed.

AuthToken

The name of this class suggests a close tie to PrivTokenAuthChecker, but it is more like the Subject class defined by JAAS. The authToken field is the name of a subject (JAAS allows for multiple identities, each represented by a Principal object). cachedAllUrlAllow, cachedAllCdxAllow and ignoreRobots are permissions.

AuthToken is abused to pass AccessPoint to AuthChecker. Its sole sub-class APContextAuthToken saves the AccessPoint object passed to its constructor for later use by an AuthChecker implementation (ex. AccessPointAuthChecker) to build a collection-specific CDXAccessFilter. Since the introduction of the AccessPoint.createExclusionFilter method, calling that method has become the standard way of instantiating ExclusionFilter. We should add a CollectionContext (or CDX Server equivalent of it) parameter to CDXServer.getCdx to make this confusing trick unnecessary.

The setAllCdxFieldsAllow() and setIgnoreRobots(boolean) methods are worth a special note. They are used by EmbeddedCDXServerIndex to configure APContextAuthToken for its internal use of CDXServer.

CDXAccessFilter

An interface for the per-session filtering object. It defines two methods:

  • boolean includeUrl(String urlKey, String originalUrl)
  • boolean includeCapture(CDXLine line)

includeUrl is called (by CDXServer.getCdx(CDXQuery, AuthToken)) just once per URL, before any calls to includeCapture, to check for per-URL filtering. This method exists so that per-URL exclusion can be detected quickly, even before loading the first line of CDX.

Another purpose of this method is to communicate the act of filtering. If this method returns false, CDXServer silently returns an empty result; there's no way to tell whether the URL has never been captured or was excluded on a per-URL basis. To communicate the act of filtering, AccessCheckFilter (the primary implementation of CDXAccessFilter) throws RuntimeIOException wrapping an instance of AccessControlException, which carries more information on the type of filtering applied. Wayback defines four sub-classes of AccessControlException:
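The call order described above can be sketched roughly as follows. This is only an illustration, not the actual CDXServer code: SketchLine, SketchAccessFilter and SketchQuery are hypothetical stand-ins for CDXLine, CDXAccessFilter and the query loop in CDXServer.getCdx.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified stand-in for CDXLine.
final class SketchLine {
    final String urlKey;
    final String timestamp;

    SketchLine(String urlKey, String timestamp) {
        this.urlKey = urlKey;
        this.timestamp = timestamp;
    }
}

// Mirrors the two methods listed above, with SketchLine in place of CDXLine.
interface SketchAccessFilter {
    boolean includeUrl(String urlKey, String originalUrl);

    boolean includeCapture(SketchLine line);
}

// Hypothetical query loop: one per-URL check, then one check per CDX line.
final class SketchQuery {
    static List<SketchLine> run(String urlKey, String originalUrl,
            Iterable<SketchLine> index, SketchAccessFilter filter) {
        List<SketchLine> result = new ArrayList<>();
        if (!filter.includeUrl(urlKey, originalUrl)) {
            // Excluded per URL: return before the index is touched at all.
            return result;
        }
        for (SketchLine line : index) {
            if (filter.includeCapture(line)) {
                result.add(line);
            }
        }
        return result;
    }
}
```

Note that in this sketch a per-URL exclusion and a never-captured URL both produce an empty list, so the caller cannot tell them apart from the return value alone.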

  • AdministrativeAccessControlException - excluded by other (possibly manually set up) policy rules
  • RobotAccessControlException - excluded by robots.txt rules
  • RobotNotAvailableException - unused in CDXServer
  • RobotTimedOutAccessControlException - unused in CDXServer

EmbeddedCDXServerIndex.doQuery catches RuntimeIOException and re-throws inner AccessControlException. We should be able to make AuthChecker and CDXAccessFilter throw AccessControlException directly.
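The catch-and-rethrow step can be pictured with the sketch below. All types here are hypothetical stand-ins (prefixed Sketch) for the Wayback exception classes, and the sketch assumes the access control exception is a checked exception. If the filter could throw it directly, the unwrapping helper would not be needed.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for AccessControlException (assumed to be checked).
class SketchAccessControlException extends Exception {
    SketchAccessControlException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for the robots.txt flavour of the exception.
class SketchRobotAccessControlException extends SketchAccessControlException {
    SketchRobotAccessControlException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for the unchecked wrapper thrown by the filter.
class SketchRuntimeIOException extends RuntimeException {
    SketchRuntimeIOException(Throwable cause) {
        super(cause);
    }
}

final class SketchUnwrap {
    // Runs the query; if the unchecked wrapper carries an access control
    // exception, re-throws the inner exception so callers see the real reason.
    static <T> T unwrapping(Supplier<T> query) throws SketchAccessControlException {
        try {
            return query.get();
        } catch (SketchRuntimeIOException e) {
            if (e.getCause() instanceof SketchAccessControlException) {
                throw (SketchAccessControlException) e.getCause();
            }
            throw e; // unrelated failure: leave it wrapped
        }
    }
}
```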

The semantics of the first two exceptions are loose. AccessCheckFilter throws AdministrativeAccessControlException when whatever ExclusionFilter is given as its adminFilter parameter returns a value other than FILTER_INCLUDE. Similarly, it throws RobotAccessControlException whenever its robotsFilter returns a non-FILTER_INCLUDE value. As such, AccessCheckFilter cannot be extended to allow for other types of exclusions. For this reason, recent changes (at IA) are moving away from this approach: the new AccessPointAuthChecker creates AccessCheckFilter with just one ExclusionFilter object, returned by AccessPoint.createExclusionFilter(), and the new CompositeExclusionFilterFactory allows for configuring multiple exclusion filters.
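The idea of a single filter that aggregates several exclusion filters can be sketched as below. This is not the code of CompositeExclusionFilterFactory: the interface is a hypothetical stand-in for ExclusionFilter, the constant values are illustrative, and filterObject takes a plain URL key here rather than a CaptureSearchResult to keep the example short.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for ExclusionFilter; constant values are illustrative.
interface SketchExclusionFilter {
    int FILTER_INCLUDE = 0;
    int FILTER_EXCLUDE = 1;

    int filterObject(String urlKey);
}

// Hypothetical composite: a capture is included only if every delegate
// includes it; the first non-include verdict is returned as-is.
final class SketchCompositeFilter implements SketchExclusionFilter {
    private final List<SketchExclusionFilter> delegates;

    SketchCompositeFilter(SketchExclusionFilter... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    @Override
    public int filterObject(String urlKey) {
        for (SketchExclusionFilter delegate : delegates) {
            int verdict = delegate.filterObject(urlKey);
            if (verdict != FILTER_INCLUDE) {
                return verdict;
            }
        }
        return FILTER_INCLUDE;
    }
}
```

With this shape, the caller needs to know about only one filter object, and adding another kind of exclusion means adding another delegate.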

ExclusionFilter originates from the ResourceIndex/CaptureSearchResult framework. AccessCheckFilter uses it so that existing exclusion filters (most notably OracleExclusionFilter and StaticMapExclusionFilter) can be reused with CDXServer. As its filterObject method needs a CaptureSearchResult as an argument, AccessCheckFilter creates a temporary wrapper CaptureSearchResult object for every CDXLine (CDXLine cannot implement CaptureSearchResult as it is not an interface). This is very inefficient.

Older code did not have this issue because filtering was run in CDXToCaptureSearchResultsWriter, which already had CaptureSearchResult objects. This old method is still supported, but strongly discouraged as it corrupts the CDX query result if exclusion and collapsing are combined. CDXWriter should focus on converting CDXLine to final output. The fact that other CDXWriter implementations, like PlainTextWriter, do not implement exclusion supports this argument.

I'd suggest re-implementing exclusion filters natively with CDX Server interfaces.
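A native filter of this kind might look roughly like the sketch below. It is only an illustration under simplifying assumptions: NativeLine and NativeCdxFilter are hypothetical stand-ins for CDXLine and CDXAccessFilter, and the exact-match key set is far simpler than what StaticMapExclusionFilter does. The point is that the filter reads the CDX line directly, so no wrapper object is created per line.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, heavily simplified stand-in for CDXLine.
final class NativeLine {
    final String urlKey;
    final String timestamp;

    NativeLine(String urlKey, String timestamp) {
        this.urlKey = urlKey;
        this.timestamp = timestamp;
    }
}

// Mirrors the two CDXAccessFilter methods, with NativeLine for CDXLine.
interface NativeCdxFilter {
    boolean includeUrl(String urlKey, String originalUrl);

    boolean includeCapture(NativeLine line);
}

// Hypothetical static exclusion list keyed by exact URL key.
final class NativeStaticFilter implements NativeCdxFilter {
    private final Set<String> excludedKeys;

    NativeStaticFilter(Set<String> excludedKeys) {
        this.excludedKeys = new HashSet<>(excludedKeys); // defensive copy
    }

    @Override
    public boolean includeUrl(String urlKey, String originalUrl) {
        return !excludedKeys.contains(urlKey);
    }

    @Override
    public boolean includeCapture(NativeLine line) {
        // Works on the line's own field; no temporary wrapper is allocated.
        return !excludedKeys.contains(line.urlKey);
    }
}
```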

Possible Refactoring & API Changes

Authentication/Authorization

Adopt design pattern from JAAS:

  • AuthToken as Subject
  • AuthChecker as LoginModule (or LoginContext)

This implies:

  • move permission attributes and test methods from AuthChecker to AuthToken
  • add a new method to the AuthChecker interface that corresponds to LoginModule#initialize and LoginModule#login
  • move code related to cookie-based authorization from CDXServer to PrivTokenAuthChecker (ex. constant CDX_AUTH_TOKEN, the cookieAuthToken field and its getter/setter, the extractAuthToken method etc.)
  • remove the createAccessFilter method from AuthChecker, and embed the essential portion of its functionality in CDXServer

Adopting full JAAS would make Wayback configuration way too complex without clear benefits. Unless other requirements arise, a simplified framework would suffice. Alignment with the JAAS design makes it easier to understand and extend.
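The division of roles listed above could look something like the following sketch. Every name in it is hypothetical (SketchSubject plays the AuthToken/Subject role, SketchLogin the AuthChecker/LoginModule role), and it uses none of the real JAAS classes. It only illustrates permissions living on the subject while the checker is reduced to producing a subject from a credential.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical permission names, modelled on the flags AuthToken caches today.
enum SketchPermission { ALL_URL_ACCESS, ALL_CDX_FIELD_ACCESS, IGNORE_ROBOTS }

// Subject role: an identity plus the permissions granted to it.
final class SketchSubject {
    private final String name;
    private final Set<SketchPermission> permissions;

    SketchSubject(String name, Set<SketchPermission> permissions) {
        this.name = name;
        this.permissions = permissions;
    }

    String getName() { return name; }

    boolean isAllUrlAccessAllowed() {
        return permissions.contains(SketchPermission.ALL_URL_ACCESS);
    }

    boolean isAllCdxFieldAccessAllowed() {
        return permissions.contains(SketchPermission.ALL_CDX_FIELD_ACCESS);
    }

    boolean isIgnoreRobots() {
        return permissions.contains(SketchPermission.IGNORE_ROBOTS);
    }
}

// LoginModule role: turns a credential into a subject, nothing more.
interface SketchLogin {
    SketchSubject login(String credential);
}

// Token-list implementation, loosely analogous to pre-defined tokens.
final class SketchTokenLogin implements SketchLogin {
    private final Map<String, Set<SketchPermission>> grants = new HashMap<>();

    void grant(String token, SketchPermission first, SketchPermission... rest) {
        grants.put(token, EnumSet.of(first, rest));
    }

    @Override
    public SketchSubject login(String credential) {
        Set<SketchPermission> granted = grants.get(credential);
        if (granted == null) {
            // Unknown or missing token: anonymous subject with no permissions.
            return new SketchSubject("anonymous", EnumSet.noneOf(SketchPermission.class));
        }
        return new SketchSubject(credential, granted);
    }
}
```

Other parts of the application would then ask the subject, not the checker, whether a permission is held.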

Update: recent change at IA is a move in this direction: df182ec

Exclusion

  • remove CDXToCaptureSearchResultsWriter#setExclusionFilter and #getExclusionFilter
  • add more arguments to ContextExclusionFilterFactory#getExclusionFilter
    • AuthToken

See iipc/openwayback#290 for the requirements for exclusion based on the client's IP address. While I (Kenji) suggested WaybackRequest as an additional argument, it should not be part of the CDX server API. Use of an extended AuthToken carrying the client's IP address would meet the requirements.
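One way to picture an extended token carrying the client's IP address is sketched below. All names are hypothetical: IpAuthToken stands in for an extended AuthToken, and IpFilterFactory for a factory whose getExclusionFilter takes the token as an extra argument (the collection context argument is omitted for brevity). The allow-list policy is an assumption made for the example, not a requirement taken from the issue.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical extended token: subject name plus client IP (may be null).
final class IpAuthToken {
    final String name;
    final String clientIp;

    IpAuthToken(String name, String clientIp) {
        this.name = name;
        this.clientIp = clientIp;
    }
}

// Hypothetical per-session filter, reduced to a single URL-key check.
interface IpUrlFilter {
    boolean include(String urlKey);
}

// Hypothetical factory: the token is an argument, so no request object
// needs to leak into the CDX server API.
interface IpFilterFactory {
    IpUrlFilter getExclusionFilter(IpAuthToken token);
}

// Assumed policy: clients on the allow-list see everything, all other
// clients are denied the restricted URL keys.
final class AllowListFilterFactory implements IpFilterFactory {
    private final Set<String> trustedIps;
    private final Set<String> restrictedKeys;

    AllowListFilterFactory(Set<String> trustedIps, Set<String> restrictedKeys) {
        this.trustedIps = new HashSet<>(trustedIps);
        this.restrictedKeys = new HashSet<>(restrictedKeys);
    }

    @Override
    public IpUrlFilter getExclusionFilter(IpAuthToken token) {
        if (token.clientIp != null && trustedIps.contains(token.clientIp)) {
            return urlKey -> true;
        }
        return urlKey -> !restrictedKeys.contains(urlKey);
    }
}
```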