More thoughts on reindexing #359

Open
philbudne opened this issue Dec 15, 2024 · 1 comment

@philbudne (Contributor) commented Dec 15, 2024

This is a follow-on to #344, going beyond changes in how we map fields for search.

Something I've wondered about using the Elasticsearch reindex API to populate a new cluster is whether document retrieval from the existing cluster would become the bottleneck (given that URL list download has proven to be such a drag on the current cluster).

If it does incur the same load on the old cluster, I got to thinking about the possible advantages of NOT using the "unattended" reindex API. Running our own copy loop instead (see the sketch after this list), we could:

  1. eliminate fields (full_language, original_url)
  2. address URL normalization issues
    • handling of the final "/"
    • URL encoding (we currently have some URLs in raw UTF-8, and others with %-encoded UTF-8)
  3. address canonical domain issues (if any)
    • i.e. add new domains where the canonical domain is something other than thing.tld
  4. remove pages that look like home pages
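
A minimal sketch of what such a copy loop might look like, using the elasticsearch-py scan/bulk helpers; the endpoint URLs, index names, and the home-page heuristic here are placeholders, not settled decisions:

```python
# Sketch of an "attended" copy in place of the reindex API; endpoints,
# index names, and heuristics are assumptions for illustration.
from urllib.parse import urlsplit, urlunsplit, quote, unquote

from elasticsearch import Elasticsearch, helpers

old_es = Elasticsearch("http://old-cluster:9200")  # assumed endpoints
new_es = Elasticsearch("http://new-cluster:9200")

def normalize_url(url: str) -> str:
    """Normalize the trailing "/" and %-encoding, so raw-UTF-8 and
    %-encoded-UTF-8 spellings of the same URL compare equal (item 2)."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    path = quote(unquote(path), safe="/")  # decode, then re-encode uniformly
    if path == "/":
        path = ""                          # drop a lone trailing slash
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, fragment))

def looks_like_home_page(url: str) -> bool:
    # Crude heuristic for item 4; the real test would need more care.
    return urlsplit(url).path in ("", "/")

def transformed_docs():
    for hit in helpers.scan(old_es, index="mc_search-*",
                            query={"query": {"match_all": {}}}):
        src = hit["_source"]
        if looks_like_home_page(src.get("url", "")):
            continue                        # item 4: skip home pages
        src.pop("full_language", None)      # item 1: drop unused fields
        src.pop("original_url", None)
        if "url" in src:
            src["url"] = normalize_url(src["url"])  # item 2
        yield {"_index": "mc_search_v2", "_id": hit["_id"], "_source": src}

helpers.bulk(new_es, transformed_docs())
```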

An alternative to reading from the existing cluster would be to read from the WARC files on S3 and B2.

In this case an additional field that might be useful would be something to indicate the provenance of each story: both how it was collected and how it was processed, so that we could tell which stories came from which import process. (It might have been helpful to have this when eliminating the 2022 stories in the "dip" that were imported via CSV and RSS files.)
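
As a sketch of what such fields could look like in the new index mapping (the field names and example values here are invented, not the project's schema):

```python
# Hypothetical provenance fields; names and values are illustrative only.
provenance_mapping = {
    "properties": {
        "collection_method": {"type": "keyword"},    # e.g. "rss", "csv", "warc-import"
        "processing_pipeline": {"type": "keyword"},  # e.g. "indexer-v1", "reindex-2024"
    }
}
```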

@philbudne (Contributor, Author) commented

AND we could consider going back to partitioning indices based on publication date. One of the original issues was that stories with no publication date would end up in an index that would grow without bound, but we now understand ILM.
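
An index that grows without bound is exactly what an ILM rollover policy addresses. A hypothetical sketch using the 8.x elasticsearch-py client (the policy name and thresholds are invented):

```python
# Hypothetical ILM policy so the "no publication date" index rolls over
# by size/age instead of growing without bound; thresholds are examples.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://new-cluster:9200")  # assumed endpoint
es.ilm.put_lifecycle(
    name="no-pub-date-rollover",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "30d"}
                }
            }
        }
    },
)
```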

Having indices partitioned by date would allow us to query just the indices that contain stories of interest, rather than having to query each shard of each index. Even if individual shards can answer the question quickly, we would still have an ever-growing number of shards to query (though as the number of indices and shards grows, presumably the number of nodes to handle them will as well).
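
A sketch of the querying pattern this would enable, hitting only the monthly indices that cover the date range of interest (the "mc_search-YYYY.MM" naming scheme is an assumption for illustration):

```python
# Build the list of monthly index names covering a date range, so a
# search touches only those indices rather than every shard everywhere.
from datetime import date

def indices_for_range(start: date, end: date, prefix: str = "mc_search") -> list[str]:
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

# e.g. search only Q1 2023:
# es.search(index=",".join(indices_for_range(date(2023, 1, 1), date(2023, 3, 31))), ...)
```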
