More thoughts on reindexing #359

Open
philbudne opened this issue Dec 15, 2024 · 1 comment

@philbudne (Contributor) commented Dec 15, 2024

This is a follow-on to #344, going beyond changes in how we map fields for search.

Something I've wondered about using the Elasticsearch reindex API to populate a new cluster is whether document retrieval from the existing cluster would become the bottleneck (given that URL list download has proven to be such a drag on the current cluster).

If it does incur the same load on the old cluster, I got to thinking about the possible advantages of NOT using the "unattended" reindex API. Running our own copy loop instead (see the sketch after this list), we could:

  1. eliminate fields (full_language, original_url)
  2. address URL normalization issues
    • handling of the final "/"
    • URL encoding (we currently have some URLs in raw UTF-8, and others with %-encoded UTF-8)
  3. address canonical domain issues (if any)
    • i.e. add new domains where the canonical domain is something other than thing.tld
  4. remove pages that look like home pages
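
A minimal sketch of what such a copy loop might look like, using the elasticsearch-py scan/bulk helpers; the endpoint URLs, index names, and the home-page heuristic here are placeholders, not settled decisions:

```python
# Sketch of an "attended" copy in place of the reindex API; endpoints,
# index names, and heuristics are assumptions for illustration.
from urllib.parse import urlsplit, urlunsplit, quote, unquote

from elasticsearch import Elasticsearch, helpers

old_es = Elasticsearch("http://old-cluster:9200")  # assumed endpoints
new_es = Elasticsearch("http://new-cluster:9200")

def normalize_url(url: str) -> str:
    """Normalize the trailing "/" and %-encoding, so raw-UTF-8 and
    %-encoded-UTF-8 spellings of the same URL compare equal (item 2)."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    path = quote(unquote(path), safe="/")  # decode, then re-encode uniformly
    if path == "/":
        path = ""                          # drop a lone trailing slash
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, fragment))

def looks_like_home_page(url: str) -> bool:
    # Crude heuristic for item 4; the real test would need more care.
    return urlsplit(url).path in ("", "/")

def transformed_docs():
    for hit in helpers.scan(old_es, index="mc_search-*",
                            query={"query": {"match_all": {}}}):
        src = hit["_source"]
        if looks_like_home_page(src.get("url", "")):
            continue                        # item 4: skip home pages
        src.pop("full_language", None)      # item 1: drop unused fields
        src.pop("original_url", None)
        if "url" in src:
            src["url"] = normalize_url(src["url"])  # item 2
        yield {"_index": "mc_search_v2", "_id": hit["_id"], "_source": src}

helpers.bulk(new_es, transformed_docs())
```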

An alternative to reading from the existing cluster would be to read from the WARC files on S3 and B2.

In this case an additional field that might be useful would be something to indicate the provenance of each story: both how it was collected and how it was processed, so that we could tell which stories came from which import process. (It might have been helpful to have this when eliminating the 2022 stories in the "dip" that were imported via CSV and RSS files.)
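
As a sketch of what such fields could look like in the new index mapping (the field names and example values here are invented, not the project's schema):

```python
# Hypothetical provenance fields; names and values are illustrative only.
provenance_mapping = {
    "properties": {
        "collection_method": {"type": "keyword"},    # e.g. "rss", "csv", "warc-import"
        "processing_pipeline": {"type": "keyword"},  # e.g. "indexer-v1", "reindex-2024"
    }
}
```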

@philbudne (Contributor, Author) commented

AND we could consider going back to partitioning indices based on publication date. One of the original issues was that stories with no publication date would end up in an index that would grow without bound, but we now understand ILM.
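
An index that grows without bound is exactly what an ILM rollover policy addresses. A hypothetical sketch using the 8.x elasticsearch-py client (the policy name and thresholds are invented):

```python
# Hypothetical ILM policy so the "no publication date" index rolls over
# by size/age instead of growing without bound; thresholds are examples.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://new-cluster:9200")  # assumed endpoint
es.ilm.put_lifecycle(
    name="no-pub-date-rollover",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "30d"}
                }
            }
        }
    },
)
```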

Having indices partitioned by date would allow us to query just the indices that contain stories of interest, rather than having to query each shard of each index. Even if individual shards can answer the question quickly, we would still have an ever-growing number of shards to query (though as the number of indices and shards grows, presumably the number of nodes to handle them will as well).
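
A sketch of the querying pattern this would enable, hitting only the monthly indices that cover the date range of interest (the "mc_search-YYYY.MM" naming scheme is an assumption for illustration):

```python
# Build the list of monthly index names covering a date range, so a
# search touches only those indices rather than every shard everywhere.
from datetime import date

def indices_for_range(start: date, end: date, prefix: str = "mc_search") -> list[str]:
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

# e.g. search only Q1 2023:
# es.search(index=",".join(indices_for_range(date(2023, 1, 1), date(2023, 3, 31))), ...)
```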
