Search.gov sitemap setup for Global and Policy‐guidance site‐search

John Carroll edited this page Apr 29, 2024 · 42 revisions

The site-search of fec.gov uses the General Services Administration's search.gov search engine in addition to the FEC API for Candidates and Committees (see fec/search/views.py).

To get access to the Search.gov dashboard, Slack or email John Carroll ([email protected]) or Pat Phongverijati ([email protected]).

Search.gov affiliates:

Robots.txt

The search.gov search engine indexes sitemaps for both Global site-search and Policy and other guidance search by referencing the production robots.txt. We serve a different robots.txt in other environments (dev, stage, feature) that disallows crawling so that these subdomain URLs never get indexed.

Production robots.txt (https://www.fec.gov/robots.txt):


User-agent: usasearch
Crawl-delay: 2
Allow: /
Disallow: /search/?*
Disallow: /data/legal/search/?*
Disallow: /data/search/?*
Sitemap: https://www.fec.gov/sitemap-wagtail.xml
Sitemap: https://www.fec.gov/resources/cms-content/documents/sitemap_pdf.xml
Sitemap: https://www.fec.gov/resources/cms-content/documents/sitemap_html.xml

  

User-agent: *
Crawl-delay: 10
Disallow: /search/?*
Disallow: /data/legal/search/?*
Disallow: /data/search/?*
etc...

  

Dev, stage, feature robots.txt (https://dev.fec.gov/robots.txt):

User-agent: *
Disallow: /
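As a rough illustration of how a crawler reads these files, the sketch below feeds copies of the two robots.txt variants above (minus the elided "etc..." rules) to Python's standard-library urllib.robotparser. It is not how search.gov itself works. Because this parser does not necessarily treat the * wildcard in paths the way a given crawler does, the example only checks the advertised sitemaps, the crawl delay, and the blanket Disallow.

```python
from urllib.robotparser import RobotFileParser

PROD_ROBOTS = """\
User-agent: usasearch
Crawl-delay: 2
Allow: /
Disallow: /search/?*
Disallow: /data/legal/search/?*
Disallow: /data/search/?*
Sitemap: https://www.fec.gov/sitemap-wagtail.xml
Sitemap: https://www.fec.gov/resources/cms-content/documents/sitemap_pdf.xml
Sitemap: https://www.fec.gov/resources/cms-content/documents/sitemap_html.xml

User-agent: *
Crawl-delay: 10
Disallow: /search/?*
Disallow: /data/legal/search/?*
Disallow: /data/search/?*
"""

DEV_ROBOTS = """\
User-agent: *
Disallow: /
"""


def load(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


prod = load(PROD_ROBOTS)
dev = load(DEV_ROBOTS)

# Sitemaps advertised to crawlers in production.
for sitemap_url in prod.site_maps():
    print(sitemap_url)

# Crawl delay that applies to the search.gov user agent.
print(prod.crawl_delay("usasearch"))

# On dev/stage/feature, every path is off limits to every agent.
print(dev.can_fetch("usasearch", "https://dev.fec.gov/data/"))
```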

Global site-search:

Accessed via the search box on every page of fec.gov.

The global sitemap, https://www.fec.gov/sitemap-wagtail.xml/, is auto-generated by enabling the built-in Wagtail sitemap feature. It lists only live, public Wagtail pages. Certain page types can be excluded by overriding get_sitemap_urls() on a page model (see the banner models in models.py). Other important non-Wagtail pages, such as data tables and the calendar, are added to the Best Bets section of the search.gov dashboard and show up as suggested results at the top of search results. For more options for expanding our searchable content, see the section titled "Expanding Global site-search beyond just Wagtail pages."
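The exclusion mechanism mentioned above can be sketched as a small mixin. This is a hypothetical illustration, not the code in models.py: the names ExcludeFromSitemapMixin and AlertBanner are made up for this example. Wagtail's sitemap generator calls get_sitemap_urls() on each live page, so returning an empty list keeps that page type out of the generated sitemap.

```python
class ExcludeFromSitemapMixin:
    """Hypothetical mixin: list it before Page in a model's base classes."""

    def get_sitemap_urls(self, request=None):
        # Wagtail expects a list of dicts (keys such as "location" and
        # "lastmod"); an empty list means this page adds no sitemap entries.
        return []


# Intended use inside a Wagtail project (not runnable outside one):
#
#     from wagtail.models import Page  # wagtail.core.models on older Wagtail
#
#     class AlertBanner(ExcludeFromSitemapMixin, Page):  # hypothetical model
#         ...
```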

Policy and other guidance search:

Accessed via the search box on this page: https://www.fec.gov/legal-resources/policy-and-other-guidance/guidance-documents/

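Search results are limited to items in the two Policy and other guidance sitemaps. These are uploaded as documents in Wagtail, and we reference their URLs in robots.txt:

  • https://www.fec.gov/resources/cms-content/documents/sitemap_pdf.xml
  • https://www.fec.gov/resources/cms-content/documents/sitemap_html.xml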

A copy of the latest-uploaded version of each sitemap is also kept in the fec-cms GitHub repo for version control only; these files are not exposed to the web through the CMS.

The separation of Policy and other guidance search results from the Global search results is achieved by putting the documents and pages in a dedicated directory (in S3 or Wagtail) and then limiting the search to those directories using the domains section of the search.gov dashboard.

Note: Currently, Global search sitemap entries are not available to Policy and other guidance search, but the Policy and other guidance sitemap entries are available to the Global search. We will likely isolate the two in the future so that they are mutually exclusive, but this is still under discussion because the current overlap has the benefit of making most FEC-form PDFs searchable in the Global site-search.

Process for adding/updating documents or pages to Policy and other guidance search:

Documents:

  • Ask a developer to upload the document to the production S3 bucket at resources/cms-content/documents/policy-guidance
  • Update the sitemap:
    • If it is a new PDF, add the path to the PDF to sitemap_pdf.xml and update the lastmod date.
    • If it is existing, just update the lastmod date.
  • Replace sitemap_pdf.xml in Wagtail documents area with the updated version.
  • Create a GitHub PR to also update the sitemap in the repo with the latest version.

Webpages:

  • The webpages, most of which are in /updates in Wagtail, are kept in their original location and an alias is created under the /updates/guidance-search/ parent in Wagtail.
  • To add a new page, create the page wherever is logical in Wagtail. Then create an alias of it: use the Copy option in Wagtail, check the Alias checkbox, and choose /updates/guidance-search/ as its parent. (See the Wagtail WIKI on aliasing pages.)
  • To edit an existing page, simply edit the page and publish. The changes will be reflected in the existing alias.
  • Update the sitemap:
    • If it is a new page, add the path to the page to sitemap_html.xml and update the lastmod date.
    • If it is existing, just update the lastmod date.
  • Replace sitemap_html.xml in Wagtail documents area with the updated version.
  • Create a GitHub PR to also update the sitemap in the repo with the latest version.

Removing items from sitemaps and search indexes:

When a Wagtail page is unpublished or changed to draft or private, it is removed from the Global sitemap. For Policy and other guidance, an item must be manually removed from the relevant sitemap (PDF or HTML). However, once an item has been indexed by search.gov, removing it from a sitemap does not automatically remove it from search results. This applies to both Global and Policy and other guidance search. To have items removed from the index immediately, email search.gov support at [email protected]. Otherwise, items no longer on the sitemaps will be removed from the index after 30 days. See search.gov's more detailed explanation below:

The difference between updating indexing vs. indexing new content

  • Ingesting new content is done every two hours off of sitemaps. This picks up URLs we didn’t know about before.
  • At the same time, we scan the sitemaps for any updated timestamps on URLs we already knew about and re-fetch those to get the updates.
  • For URLs that do not or cannot show up on a sitemap with updated timestamps, we have a job that will recheck each URL if it’s been 30 days since we last fetched it. So, today, for example, we’re checking any URLs that were last fetched on July 17, 2023. This is to pick up updates to the pages that didn’t get a new date and to find updated response codes, like 301s or 404s, and remove those URLs from the index. Usually, URLs that are 301ing and 404ing are not included on the sitemaps, so we have to detect them separately.
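To make the 30-day recheck concrete, here is a small sketch of the date arithmetic described in the quote above. The function name and constant are illustrative only. It is a way to estimate when a URL that is no longer refreshed through a sitemap would next be fetched, not search.gov's implementation.

```python
from datetime import date, timedelta

RECHECK_AFTER = timedelta(days=30)  # interval quoted by search.gov above


def recheck_due(last_fetched):
    """Earliest date the periodic job would re-fetch a URL."""
    return last_fetched + RECHECK_AFTER


# The quoted example: URLs last fetched on July 17, 2023.
print(recheck_due(date(2023, 7, 17)))  # 2023-08-16
```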

Best bets

These are search suggestions that you can add manually (or in bulk by uploading a spreadsheet), each of which maps a URL to a specific set of search keywords. For Global search, Best Bets are returned at the top of the search results in a section titled "Suggested results". For Policy and other guidance, Best Bets are always pushed to the top of the result list but do not have a separate section heading.

Search.gov testing dashboard:

You can test search results for each affiliate by going to Preview in the search.gov dashboard. Keep in mind that search terms are cached. To test a term that has already been cached, search.gov suggests slightly changing its capitalization or punctuation (e.g., "Form 3p" vs. "form-3p"). You can see cached queries under Analytics > Queries in the dashboard.

Test Global site-search and Policy and other guidance search locally

Grab the following environment variables from cf target -s prod and export them in your terminal window or your shell configuration (.bash_profile or your shell’s equivalent).


export SEARCHGOV_API_ACCESS_KEY=<>


export SEARCHGOV_POLICY_GUIDANCE_KEY=<>

Now your local site's search boxes will return search results.
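If the local search boxes return nothing, a quick check that both variables are visible to the process can help narrow things down. This is a minimal sketch; only the two variable names come from this page, and the helper missing_searchgov_keys is hypothetical.

```python
import os

REQUIRED_KEYS = ("SEARCHGOV_API_ACCESS_KEY", "SEARCHGOV_POLICY_GUIDANCE_KEY")


def missing_searchgov_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_searchgov_keys()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Both search.gov keys are set.")
```

Run it in the same terminal session that starts the local server, since variables exported in one shell are not visible in another.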

Expanding Global site-search beyond just Wagtail pages

We can dynamically generate sitemaps that index more than just Wagtail pages by using the Django sitemap framework. Two examples would be indexing all of the documents in Reports about the FEC and indexing all of our form PDFs. This WIP PR has an example of both. The sitemap view code is in urls.py for demo purposes, but it would ultimately live in its own file, such as sitemap_views.py, and be imported into urls.py.

We always have the option to write additional sitemaps by hand if necessary, although dynamically generated and updated sitemaps are preferable.