- Source Files - Domains | DAP | Pulse | Other | OMB
- Snapshots of Source Files Used in Most Recent Build - Domains | DAP | Pulse | Other
- Do the entries look right when skimmed?
- Are there inconsistencies, e.g. agency/organization in the domain registry?
-
Datasets are combined
- Does the snapshot look right when skimmed?
- Does the number of entries in the snapshot equal the sum of the entries in each source file (analysis report)?
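
The count check for the combine step could be scripted rather than read off the analysis report by eye. The following is a minimal sketch, assuming each source file and the combined snapshot are CSV files with a single header row; the function names and the sample contents are hypothetical and only illustrate the comparison.

```python
import csv
import io


def count_rows(handle):
    """Count non-blank data rows in an open CSV file, excluding the header."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row (assumed to be present)
    return sum(1 for row in reader if row)


def combined_count_matches(source_handles, snapshot_handle):
    """Return (matches, source_total, snapshot_total) for the combine step."""
    source_total = sum(count_rows(handle) for handle in source_handles)
    snapshot_total = count_rows(snapshot_handle)
    return source_total == snapshot_total, source_total, snapshot_total


# In-memory stand-ins for real files (hypothetical contents).
domains = io.StringIO("target_url\nexample.gov\nwww.example.gov\n")
dap = io.StringIO("target_url\nexample.gov\n")
combined = io.StringIO("target_url\nexample.gov\nwww.example.gov\nexample.gov\n")

ok, source_total, snapshot_total = combined_count_matches([domains, dap], combined)
print(ok, source_total, snapshot_total)
```

With real files, the same functions accept handles from `open(path, newline="", encoding="utf-8")`.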
-
List is deduped
- Does the snapshot look right when skimmed?
- Do the snapshot and the removed list add up in size to the previous snapshot (analysis report)?
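
The "add up in size" question recurs for every step that produces a removed list, so one helper could cover all of them. This is a rough sketch, assuming entries are compared as exact strings (the real dedup step may normalize case or trailing slashes, which this does not model); `check_dedup` is a hypothetical name.

```python
from collections import Counter


def check_dedup(previous, kept, removed):
    """Return a list of problems; an empty list means the step balances."""
    problems = []
    if len(kept) + len(removed) != len(previous):
        problems.append(
            f"size mismatch: {len(kept)} kept + {len(removed)} removed "
            f"!= {len(previous)} previous"
        )
    duplicates = sorted(url for url, n in Counter(kept).items() if n > 1)
    if duplicates:
        problems.append(f"duplicates remain: {duplicates}")
    # Every previous entry should be accounted for exactly once.
    if Counter(kept) + Counter(removed) != Counter(previous):
        problems.append("kept + removed does not match the previous snapshot")
    return problems


previous = ["a.gov", "b.gov", "a.gov"]
print(check_dedup(previous, kept=["a.gov", "b.gov"], removed=["a.gov"]))
```

The size comparison alone also applies to the ignore list and nonfederal steps; the duplicate check is specific to this one.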
-
Ignore list is applied
- Does the snapshot look right when skimmed?
- Are there entries in the snapshot that we would like the ignore list to filter out? If so, how could the ignore list be modified to do that?
- Are there entries in the removed list that we wish the ignore list hadn't filtered out? If so, how could the ignore list be modified to do that?
- Do the snapshot and the removed list add up in size to the previous snapshot (analysis report)?
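
When considering a change to the ignore list, a dry run can show what a candidate list would keep and remove before anything is committed. The sketch below assumes ignore entries behave as case-insensitive substrings; that matching rule is an assumption for illustration only, since the actual ignore list format is not described here, and `preview_ignore` is a hypothetical helper.

```python
def preview_ignore(urls, ignore_terms):
    """Split urls into (kept, removed) using case-insensitive substring matching.

    The substring rule is an assumption, not the real ignore list behavior.
    """
    lowered = [term.lower() for term in ignore_terms]
    kept, removed = [], []
    for url in urls:
        if any(term in url.lower() for term in lowered):
            removed.append(url)
        else:
            kept.append(url)
    return kept, removed


urls = ["www.example.gov", "staging.example.gov", "dev.example.gov"]
kept, removed = preview_ignore(urls, ["staging.", "DEV."])
print(kept)
print(removed)
```

Comparing `removed` from the current list against `removed` from a candidate list shows exactly which entries a proposed change would affect.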
-
Nonfederal entries are removed, resulting in the Target URL list
- Does the resulting file look right when skimmed?
- Do the resulting file and the removed list add up in size to the previous snapshot (analysis report)?
-
The completed Target URL list
- Are there any empty cells in the file outside of the agency code, bureau, and bureau code columns (analysis report)?
- Are any of the values in the analysis report unusually different from recent history?
- Does the number of results in the primary snapshot (analysis report) equal the number of URLs that returned a 2xx server code in the all snapshot (analysis report)?
- In the all snapshot, do any records with a non-2xx final_url_status_code appear to be live and thus should have returned a 2xx code?
- In the all snapshot, do any records with a failing scan status appear to be live and thus should have completed?
- Note - in particular, analyze and think about each error type.
- [Proposal: filter out certain MIME types (e.g. JSON, XML) from the primary snapshot as well. If done, we should add a similar step to the checks above]
- Are there certain fields in either snapshot which should not have any empty cells (if so, note them here)?
- Looking for incorrect redirect info:
- False negative examples: http://spsweb.jpl.nasa.gov
- Consider doing a one-off pass at capturing the redirect chains
- What else...?
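
Two of the checks above, the empty cell scan and the 2xx count, lend themselves to a quick script. This is a sketch under stated assumptions: the files are CSVs read with `csv.DictReader`, the column names `agency_code`, `bureau`, and `bureau_code` are hypothetical stand-ins for the agency code, bureau, and bureau code columns, and the sample rows are invented. Only `final_url_status_code` is a name taken from this checklist.

```python
import csv
import io

# Hypothetical names for the columns that are allowed to be empty.
ALLOWED_EMPTY = {"agency_code", "bureau", "bureau_code"}


def find_empty_cells(rows, allowed_empty=ALLOWED_EMPTY):
    """Return (row_number, column) pairs for empty cells outside allowed columns."""
    found = []
    for number, row in enumerate(rows, start=1):
        for column, value in row.items():
            if column is None or column in allowed_empty:
                continue  # skip overflow fields and permitted blanks
            if not (value or "").strip():
                found.append((number, column))
    return found


def count_2xx(rows, status_column="final_url_status_code"):
    """Count rows whose status code falls in the 200-299 range."""
    total = 0
    for row in rows:
        code = (row.get(status_column) or "").strip()
        if code.isdigit() and 200 <= int(code) <= 299:
            total += 1
    return total


# Invented sample data standing in for the real files.
target_list = list(csv.DictReader(io.StringIO(
    "target_url,agency,agency_code,bureau\n"
    "example.gov,Example Agency,,\n"
    "test.gov,,123,Example Bureau\n"
)))
all_snapshot = list(csv.DictReader(io.StringIO(
    "target_url,final_url_status_code\n"
    "example.gov,200\n"
    "test.gov,404\n"
    "demo.gov,\n"
)))

print(find_empty_cells(target_list))
print(count_2xx(all_snapshot))
```

The value from `count_2xx` on the all snapshot is the number to compare against the row count of the primary snapshot.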
API
- Does the number of records shown in the endpoint (meta: totalItems, at the bottom) equal the number of records in the target URL list analysis report?
- What is the oldest entry?
(more here)
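
The record count comparison for the API can be sketched as follows. It assumes the endpoint returns JSON with a `meta` object containing `totalItems`, as the check above suggests; the surrounding fields in the sample response are invented, and fetching the response is left out because the endpoint URL and any authentication are not specified here.

```python
import json


def total_items_matches(payload, expected_count):
    """Compare meta.totalItems in a parsed API response with an expected count.

    Returns (matches, total_items).
    """
    total = int(payload["meta"]["totalItems"])
    return total == expected_count, total


# Invented sample response; only meta.totalItems comes from the checklist.
sample = json.loads('{"items": [], "meta": {"totalItems": 3, "currentPage": 1}}')

print(total_items_matches(sample, 3))
```

Here `expected_count` would be the record count from the target URL list analysis report.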