Target URL List creation

Source Files - Domains | DAP | Pulse | Other | OMB
Snapshots of Source Files Used in Most Recent Build - Domains | DAP | Pulse | Other
- Do the entries look right when skimmed?
- Are there inconsistencies, e.g. agency/organization in the domain registry?

Datasets are combined
- Does the snapshot look right when skimmed?
- Does the number of entries in the snapshot equal the sum of the entries in each source file (analysis report)?
List is deduped
- Does the snapshot look right when skimmed?
- Do the snapshot and the removed list add up in size to the previous snapshot (analysis report)?
Ignore list is applied
- Does the snapshot look right when skimmed?
- Are there entries in the snapshot that we would like to have the ignore list filter out? If so, how could the ignore list be modified to do that?
- Are there entries in the removed list that we wish the ignore list hadn't filtered out? If so, how could the ignore list be modified to do that?
- Do the snapshot and the removed list add up in size to the previous snapshot (analysis report)?
Nonfederal entries are removed, resulting in the Target URL list
- Does the resulting file look right when skimmed?
- Do the resulting file and the removed list add up in size to the previous snapshot (analysis report)?
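
The "add up in size" checks in the dedup, ignore list, and nonfederal steps all follow the same pattern, so they could be scripted rather than read off the analysis report by hand. The following is a minimal sketch, assuming each snapshot and removed list is a CSV with one header row; the `target_url` column name and the sample rows are hypothetical, not taken from the actual build.

```python
import csv
import io


def count_rows(csv_text):
    """Count data rows in CSV text, excluding the header row and blank lines."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


def sizes_reconcile(previous_csv, kept_csv, removed_csv):
    """True when kept + removed accounts for every row of the previous snapshot."""
    return count_rows(kept_csv) + count_rows(removed_csv) == count_rows(previous_csv)


# Hypothetical sample data standing in for real snapshot files.
previous = "target_url\na.gov\nb.gov\nc.gov\n"
kept = "target_url\na.gov\nb.gov\n"
removed = "target_url\nc.gov\n"

print(sizes_reconcile(previous, kept, removed))
```

A `False` result only says that rows went missing or were double counted; it does not say which ones, so the skim checks above still apply.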
The completed Target URL list
- Are there any empty cells in the file outside of the agency code, bureau, and bureau code columns (analysis report)?
- Are any of the values in the analysis report unusually different from recent history?
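
The empty-cell check on the completed Target URL list could be sketched as below. The column names in `ALLOWED_EMPTY` (`agency_code`, `bureau`, `bureau_code`) and in the sample data are assumptions about how the file labels those columns; adjust them to the real headers.

```python
import csv
import io

# Assumed header names for the columns that are allowed to be empty.
ALLOWED_EMPTY = {"agency_code", "bureau", "bureau_code"}


def find_unexpected_blanks(csv_text, allowed_empty=ALLOWED_EMPTY):
    """Return (row_number, column) pairs for blank cells outside the allowed columns."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if column is None or column in allowed_empty:
                continue
            # DictReader fills cells missing from a short row with None.
            if value is None or not value.strip():
                problems.append((row_number, column))
    return problems


# Hypothetical sample: the second data row is missing its agency.
sample = (
    "target_url,agency,agency_code,bureau,bureau_code\n"
    "a.gov,GSA,23,,\n"
    "b.gov,,,,\n"
)

print(find_unexpected_blanks(sample))
```

Row numbers are reported as they would appear in a spreadsheet, with the header as row 1, to make the flagged cells easy to find by hand.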

After the scans complete

Does the number of results in the primary snapshot (analysis report) equal the number of URLs that returned a 2xx status code in the all snapshot (analysis report)?
In the all snapshot, do any records with a non-2xx final_url_status_code appear to be live and thus should have returned a 2xx code?
In the all snapshot, do any records with a failing scan status appear to be live and thus should have completed?
- Note - in particular, analyze and think about each error type.
- [Proposal: filter out certain mimetypes (e.g. JSON, XML) from the primary snapshot as well. If done, we should add a similar step to ^^^]
Are there certain fields in either snapshot which should not have any empty cells (if so, note them here)?
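
The first two checks in this section could be partly automated along these lines. This is a sketch only: it assumes the all snapshot is a CSV whose `final_url_status_code` values are plain three-digit strings, and the `target_url` column name and sample rows are hypothetical.

```python
import csv
import io


def is_2xx(code):
    """True for a three-digit status code string in the 200-299 range."""
    code = (code or "").strip()
    return len(code) == 3 and code.isdigit() and code.startswith("2")


def count_2xx(all_csv_text, status_column="final_url_status_code"):
    """Count rows in the all snapshot whose final status code is 2xx."""
    reader = csv.DictReader(io.StringIO(all_csv_text))
    return sum(1 for row in reader if is_2xx(row.get(status_column)))


def non_2xx_urls(all_csv_text, url_column="target_url",
                 status_column="final_url_status_code"):
    """List URLs without a 2xx final status, for a manual liveness review."""
    reader = csv.DictReader(io.StringIO(all_csv_text))
    return [row[url_column] for row in reader if not is_2xx(row.get(status_column))]


# Hypothetical sample standing in for the all snapshot.
all_snapshot = (
    "target_url,final_url_status_code\n"
    "a.gov,200\n"
    "b.gov,404\n"
    "c.gov,\n"
    "d.gov,204\n"
)

print(count_2xx(all_snapshot))     # compare with the primary snapshot's row count
print(non_2xx_urls(all_snapshot))  # candidates to open in a browser
```

The list from `non_2xx_urls` is only a starting point; deciding whether a site "appears to be live" still needs a person to look at it.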

Looking for incorrect redirect info:
- False negative examples: http://spsweb.jpl.nasa.gov
- Consider doing a one-off pass at capturing the redirect chains
What else...?

API

Does the number of records shown in the endpoint (meta: totalItems, at the bottom) equal the number of records in the target URL list analysis report?
What is the oldest entry?
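
Both API questions could be answered from an already-parsed JSON response, roughly as sketched here. Only `meta` and `totalItems` come from the notes above; the `items` key, the `scan_date` field, and the sample values are hypothetical, and the date comparison assumes ISO 8601 strings in a single consistent format.

```python
def total_items_matches(response, expected_count):
    """True when meta.totalItems equals the count from the analysis report."""
    return response["meta"]["totalItems"] == expected_count


def oldest_entry(items, date_field="scan_date"):
    """Return the item with the earliest date, or None if no item has one.

    Assumes date_field holds ISO 8601 strings in one consistent format,
    so that plain string comparison matches chronological order.
    """
    dated = [item for item in items if item.get(date_field)]
    return min(dated, key=lambda item: item[date_field]) if dated else None


# Hypothetical response body, already decoded from JSON.
response = {
    "items": [
        {"target_url": "a.gov", "scan_date": "2023-05-02T04:10:00Z"},
        {"target_url": "b.gov", "scan_date": "2023-04-28T03:55:00Z"},
    ],
    "meta": {"totalItems": 2},
}

print(total_items_matches(response, expected_count=2))
print(oldest_entry(response["items"]))
```

If the endpoint is paginated, `oldest_entry` would need to be run over the items from every page, not just the first.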

(more here)