Remove lastmod tag in Sitemap for reindexing #7801

Merged
merged 10 commits into main from kl-sitemap-lastmod
Jul 31, 2024

Conversation

klin2020
Contributor

@klin2020 klin2020 commented Jul 22, 2024

Summary

Search on DG currently returns outdated articles that bury more recent ones.

Search.gov search results are based on a ranking algorithm that looks at the <lastmod> tag in a website's sitemap or a page's date, whichever is most recent. Our sitemap currently sets the <lastmod> tag to the current date, which leads the ranking algorithm to weigh every page on DG equally, rather than by its proper publish date.

Solution

Remove the <lastmod> tag in the DG sitemap build, so that when we re-index DG, the re-index will use the page metadata for its proper date.

Once the re-index occurs, we can edit the <lastmod> tag to reflect the page's publish date, rather than the current date.

Screenshots

Current sitemap (including <lastmod>). Every entry shows the same date, causing issues with the ranking algorithm.
Screenshot 2024-07-22 at 1 14 46 PM

Proposed change to sitemap (temporarily remove <lastmod> for Search.gov re-indexing)
Screenshot 2024-07-22 at 1 13 38 PM

-Removed lastmod tag for search.gov reindexing
@klin2020 klin2020 self-assigned this Jul 22, 2024
@klin2020 klin2020 added the Dev: Search Issues related to our implementation of search.gov label Jul 22, 2024
Contributor

@nick-mon1 nick-mon1 left a comment

@klin2020 Thanks for the detailed explanation. Some questions:

  1. Which meta fields will search.gov use when we re-index after we remove the lastmod field?
  2. We don't set a lastmod field in the markdown for each page. Is this something we should consider adding in the future if we want to improve our sitemap?
  3. Could we use the .Params.date field as the next best option for setting the lastmod field?
<lastmod>{{ safeHTML ( .Params.date "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}

@RileySeaburg
Member

@klin2020 Thanks for the detailed explanation. Some questions:

  1. Which meta fields will search.gov use when we re-index after we remove the lastmod field?

  2. We don't set a lastmod field in the markdown for each page. Is this something we should consider adding in the future if we want to improve our sitemap?

  3. Could we use the .Params.date field as the next best option for setting the lastmod field?

<lastmod>{{ safeHTML ( .Params.date "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}

I was wondering this as well. I was expecting the last modified date to default to the date published.

@klin2020
Contributor Author

Hi @nick-mon1 @RileySeaburg

  1. The search.gov reindexing will look at the date meta tag that is in every page on DG. Refer to the "Freshness" section of this article.
  2. Since we very rarely update our websites, we don't necessarily need to add it in the future, but it is something we can consider. The lastmod field that the search indexing looks at is the lastmod field in the sitemap, so if we add it back into our sitemap, we should just update the lastmod field to store .Params.date.
  3. Yes, I agree. It may be best to add this back into our sitemap after the re-indexing, just to ensure that the re-indexing will only look at the meta tag property. Let me send you some further information on this.

@RileySeaburg
Member

RileySeaburg commented Jul 23, 2024

@klin2020

To be clear, I'm not sure we need to remove <lastmod> to have the site reindexed.

We prefer documents that are fresh. Anything published or updated in the past 30 days is considered fresh. After that, we use a Gaussian decay function to demote documents, so that the older a document is, the more it is demoted. When documents are 5 years old or older, we consider them to be equally old and do not demote further. We use either the article:modified_time on an individual page, or that page’s <lastmod> date from the sitemap, whichever is more recent. If there is only an article:published_time for a given page, we use that date.

Unless I'm misunderstanding something, updating the <lastmod> tags to reflect the content publish date, and then requesting a reindex should fix this issue.

Please explain the proper procedure if I am incorrect.

If I am not, please update the <lastmod> tag.

@klin2020
Contributor Author

@nick-mon1 @RileySeaburg Re-introduced lastmod tag with page date.

Member

@RileySeaburg RileySeaburg left a comment

@klin2020 requested changes.

@@ -2,8 +2,8 @@
{{ range .Data.Pages }}
<url>
<loc>https://digital.gov{{ .Permalink | relURL }}</loc>{{ if not .Lastmod.IsZero }}
<lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
<changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
<lastmod>{{ safeHTML ( .Date | time.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}
Member

Can you format the date as YYYY-MM-DD?
.Lastmod.Format "2006-01-02"

@RileySeaburg RileySeaburg added the Bug Fix This fixes an actual bug label Jul 24, 2024
@RileySeaburg RileySeaburg self-requested a review July 24, 2024 15:41
RileySeaburg
RileySeaburg previously approved these changes Jul 24, 2024
Member

@RileySeaburg RileySeaburg left a comment

Thank you and great job!

@nick-mon1 nick-mon1 self-requested a review July 24, 2024 18:42
nick-mon1
nick-mon1 previously approved these changes Jul 24, 2024
Contributor

@nick-mon1 nick-mon1 left a comment

@klin2020 This looks great, thanks for following up with the search team.

Note

Just a note: pages with no date will display:

<url>
<loc>https://digital.gov/authors/alicia-rouault/</loc>
<lastmod>0001-01-01</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>

@klin2020 klin2020 dismissed stale reviews from nick-mon1 and RileySeaburg via 3d1c582 July 24, 2024 20:59
@klin2020
Contributor Author

Removed lastmod tag for review again @nick-mon1 @RileySeaburg

@nick-mon1 nick-mon1 self-requested a review July 25, 2024 20:32
@nick-mon1 nick-mon1 requested a review from RileySeaburg July 25, 2024 20:32
nick-mon1
nick-mon1 previously approved these changes Jul 25, 2024
RileySeaburg
RileySeaburg previously approved these changes Jul 26, 2024
@RileySeaburg
Member

I'm going to merge this so we can test the re-index today.

@mejiaj there will be another PR where the tag is added back in.

mejiaj
mejiaj previously approved these changes Jul 29, 2024
- Please note that pages with no date, such as authors, still display 0001-01-01 as date
@klin2020 klin2020 dismissed stale reviews from mejiaj, nick-mon1, and RileySeaburg via 187ce99 July 29, 2024 16:13
Contributor

@nick-mon1 nick-mon1 left a comment

What I Tested

  • ran hugo build
  • checked the public/sitemap.xml and lastmod uses the date field as the value
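
A check like the one above could also be scripted. The following Go sketch parses a sitemap, lists entries that still carry the zero date, and counts how many distinct days appear in <lastmod>. The audit function, the report type, and the inline sample are assumptions for illustration; a real run would read public/sitemap.xml with os.ReadFile instead.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// sitemap and entry mirror only the sitemap elements this check needs.
type sitemap struct {
	URLs []entry `xml:"url"`
}

type entry struct {
	Loc     string `xml:"loc"`
	Lastmod string `xml:"lastmod"`
}

// report summarizes what the audit found.
type report struct {
	ZeroDated    []string // <loc> values whose lastmod is 0001-01-01
	DistinctDays int      // number of different days among the other lastmods
}

// audit parses sitemap XML and inspects each <lastmod> value.
func audit(data []byte) (report, error) {
	var sm sitemap
	if err := xml.Unmarshal(data, &sm); err != nil {
		return report{}, err
	}
	var r report
	days := map[string]bool{}
	for _, u := range sm.URLs {
		day := strings.TrimSpace(u.Lastmod)
		if len(day) > 10 {
			day = day[:10] // keep only the YYYY-MM-DD part
		}
		switch day {
		case "":
			continue // no lastmod on this entry
		case "0001-01-01":
			r.ZeroDated = append(r.ZeroDated, u.Loc)
		default:
			days[day] = true
		}
	}
	r.DistinctDays = len(days)
	return r, nil
}

// sample is a hypothetical sitemap; /a/ and /b/ are made-up paths.
const sample = `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://digital.gov/a/</loc><lastmod>2019-03-01</lastmod></url>
<url><loc>https://digital.gov/b/</loc><lastmod>2024-07-01</lastmod></url>
<url><loc>https://digital.gov/authors/alicia-rouault/</loc><lastmod>0001-01-01</lastmod></url>
</urlset>`

func main() {
	r, err := audit([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println("distinct lastmod days:", r.DistinctDays)
	fmt.Println("entries with zero date:", r.ZeroDated)
}
```

If DistinctDays comes back as 1 for a large sitemap, every page is probably still stamped with the build date.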

@RileySeaburg RileySeaburg merged commit 475a9ec into main Jul 31, 2024
8 checks passed
@RileySeaburg RileySeaburg deleted the kl-sitemap-lastmod branch July 31, 2024 19:57
Labels
Bug Fix This fixes an actual bug Dev: Search Issues related to our implementation of search.gov
Projects
Status: Done / Merged
4 participants