This is a follow-on to #344 beyond changes in how we map fields for search.
Something I've wondered about using the Elasticsearch reindex API to populate a new cluster is whether document retrieval from the existing cluster would be the bottleneck (given that URL list download has proven to be such a drag on the current cluster).
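For reference, a throttled remote reindex would look roughly like this. This is a sketch, not our actual setup: the hosts and index names are placeholders, and `requests_per_second` is one knob for capping the scroll load on the source cluster.

```python
import requests

# Hypothetical endpoints; substitute the real cluster hosts.
NEW_CLUSTER = "http://new-es:9200"
OLD_CLUSTER = "http://old-es:9200"

# Remote reindex pulls documents from the old cluster. requests_per_second
# throttles the operation so retrieval doesn't swamp the source; the old
# host must also be listed in reindex.remote.whitelist on the new cluster.
resp = requests.post(
    f"{NEW_CLUSTER}/_reindex?wait_for_completion=false&requests_per_second=500",
    json={
        "source": {
            "remote": {"host": OLD_CLUSTER},
            "index": "mc_search-000001",  # hypothetical source index name
            "size": 1000,                 # scroll batch size
        },
        "dest": {"index": "mc_search_v2"},  # hypothetical destination
    },
)
print(resp.json())  # returns a task id when wait_for_completion=false
```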
If it does incur the same load on the old cluster, I got to thinking about the possible advantages of NOT using the "unattended" reindex API. Doing the copy ourselves, we could:
- eliminate fields (full_language, original_url)
- address URL normalization issues (see the sketch after this list):
  - handling of a final "/"
  - URL encoding (we currently have some URLs in raw UTF-8, and others with %-encoded UTF-8)
- address canonical domain issues (if any), i.e. add new domains where the canonical domain is something other than thing.tld
- remove pages that look like home pages
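A minimal sketch of the kind of URL normalization meant above, using Python's urllib; illustrative only, the exact rules would need more thought:

```python
from urllib.parse import urlsplit, urlunsplit, unquote, quote

def normalize_url(url: str) -> str:
    """Strip a final "/" and settle on one percent-encoding convention
    (decode any %-escapes, then re-encode uniformly as UTF-8)."""
    parts = urlsplit(url)
    # Decoding then re-quoting makes raw-UTF-8 and %-encoded variants
    # of the same URL compare equal.
    path = quote(unquote(parts.path), safe="/")
    if path.endswith("/") and path != "/":
        path = path.rstrip("/")
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, parts.query, parts.fragment)
    )

# Both spellings of the same URL normalize to one form:
assert normalize_url("https://example.com/caf%C3%A9/") == \
       normalize_url("https://example.com/café/")
```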
An alternative to reading from the existing cluster would be to read from the WARC files on S3 and B2.
In that case, an additional field indicating the provenance of each story might be useful: both how it was collected and how it was processed, so we could tell which stories came from which import process (it might have been helpful to have this when eliminating the 2022 stories in the "dip" imported via CSV and RSS files).
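A rough sketch of what adding such provenance fields to the mapping could look like; the field names, index name, and example values are all made up here:

```python
import requests

NEW_CLUSTER = "http://new-es:9200"  # hypothetical endpoint

# Two keyword fields recording how a story was collected and which
# pipeline imported it, so a bad batch (like the 2022 csv/rss "dip")
# could later be found or deleted with a simple term query.
requests.put(
    f"{NEW_CLUSTER}/mc_search_v2/_mapping",
    json={
        "properties": {
            "collection_method": {"type": "keyword"},  # e.g. "rss", "csv", "warc"
            "import_pipeline": {"type": "keyword"},    # e.g. "hist-2022-csv"
        }
    },
)
```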
AND we could consider going back to partitioning indices by publication date: one of the earlier issues was that stories with no publication date ended up in an index that grew without bound, and we now understand ILM.
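A minimal sketch of what an ILM rollover policy for the no-publication-date index might look like; the policy name and size threshold are assumptions, not settled values:

```python
import requests

NEW_CLUSTER = "http://new-es:9200"  # hypothetical endpoint

# Stories with no publication date go to a rollover alias governed by
# this policy, so no single index grows without bound.
requests.put(
    f"{NEW_CLUSTER}/_ilm/policy/mc_no_pub_date",
    json={
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_primary_shard_size": "50gb"}
                    }
                }
            }
        }
    },
)
```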
Having indices partitioned by date would allow us to query just the indices that contain stories of interest, rather than having to query each shard of each index. Even if individual shards can answer the question quickly, we would still have an ever-growing number of shards to query (though as the number of indices and shards grows, presumably the number of nodes to handle them will as well).
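Roughly what a date-constrained query could look like with one index per publication month; the mc_search-YYYY.MM naming scheme is hypothetical:

```python
import requests

NEW_CLUSTER = "http://new-es:9200"  # hypothetical endpoint

# A query for a known date range only has to touch the matching
# monthly indices instead of every shard in the cluster.
resp = requests.get(
    f"{NEW_CLUSTER}/mc_search-2023.01,mc_search-2023.02/_search",
    json={"query": {"match": {"text": "climate"}}},
)
print(resp.json()["hits"]["total"])
```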