-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize taxonomy sync process #520
Comments
|
The raw taxonomy JSON from the Content Hub is around 100kb so that should be parsable in one go. With 235 terms and multiple database updates per-term that might be an issue — but looking at the error, it's failing before it even gets to that stage. The error is in I'll investigate further tomorrow and see if I can identify what is causing the error and if I can fix it. Ideally the plugin needs someone more proficient in PHP to rewrite it. I wrote this plugin as a proof-of-concept two years ago. It was only meant to be a working placeholder to illustrate how the WP taxonomy structure should be set up. |
I've identifed some issues with the API data containing invalid parent term IDs like: string(4) "none"
string(19) "string/EBI Training" I'm filtering them out with a regexp pattern: const UUID_PATTERN = '#^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$#'; But that doesn't actually solve this memory bug so I'll keep going... |
Bug is in this part with the nested loop: foreach ($term['parents'] as $parent_id) {
foreach ($api_terms as $new_parent) { embl-taxonomy/includes/register.php That is hugely inefficient 🔥 I will optimise that and see where it gets me. |
@khawkins98 something I've noticed. The API data has similar terms with {
"33f3a158-1ca6-4d20-a75c-23f39dce6c34": {
"name": "EMBL Events term",
"uuid": "33f3a158-1ca6-4d20-a75c-23f39dce6c34",
"nid": "26735",
"name_display": "Seminars",
"primary": "what",
"parents": ["302cfdf7-365b-462a-be65-82c7b783ebf7"],
"type": "term"
},
"4057de86-552d-438e-9782-1a18cec5eeb3": {
"name": "EMBL Seminars term",
"uuid": "4057de86-552d-438e-9782-1a18cec5eeb3",
"nid": "26737",
"name_display": "Seminars",
"primary": "what",
"parents": ["33f3a158-1ca6-4d20-a75c-23f39dce6c34"],
"type": "term"
}
} "EMBL Seminars term" is a child of "EMBL Events term". All seminars are grandchildren resulting in these WordPress terms: Where the display name hierarchy becomes "Seminars > Seminars" 🤔 |
Further updates: With local testing it gets to 18869 generated terms before the memory error. So it's going round in circles somewhere... |
Found where is gets stuck:
This crashed my machine trying to generate the log. Anyway, the related EMBL terms:
†1 Cath Brooksbank has two invalid parent IDs: "parents": {
"who": "3dfcb91f-a022-4dd6-8cba-6391e247f8fb",
"what": "string/EBI Training",
"where": "string/EMBL-EBI"
} †2 EMBL-EBI Training Team is it's own parent! So if one of the term's parents is actually itself — which I assume is a mistake in the data? — it results in a never ending loop of generating WP terms. I just need to add a condition to skip this situation. |
@dbushell Thanks for detail investigation, yeh I also think somehow data is not correct we need lot of cleanup. Sometimes I also found confusing setting up correct who/where/what for group sites. |
I've opened a PR #614 This should fix any WP errors. |
I wonder though if we actually need to generate unique terms in the WP taxonomy for each parent > child relationship. I forget why it was implemented that way. The Content Hub data provides around 250 terms right now, which generated 776 terms in WP. I think we're only using them for the page meta tags: <meta name="embl:who" content="Sharpe Group" uuid="6c31c788-04a1-48b8-a532-fdc251506b57"> Which does not include any parents IDs so having unique terms for each hierarchy is unnecessary. Maybe there is somewhere else I'm forgetting... anyway, I'll leave it as is for now. Possible to look at this again in future. |
I think it is coupled with showing breadcrumb as well fetching site profile like site description, members, GTL etc. If we need to change this approach then we should look at this now, rather then going site live. It will be difficult to manage later and chances of data mismatch. |
@sandykadam true, good point. Maybe better to keep it as is. I should have noted this is my last day working on the project this year! So any further work would have to be picked up by someone else. |
We deployed v28 on all group sites, I was fixing some of the EBI site and found that there limited "who" records. Only following people are showing as who in taxonomy list. Can we please check asap, as all sites will be breaking. |
Looking at https://www.embl.org/api/v1/pattern.json?pattern=embl-ontology&source=contenthub there are only 6 For |
I'm confused now, we have 110 groups site setup which means we already have created 110 "Embl profiles" correct, so I should able to see atleast those 110 peoples to whom they are linked. |
IIRC: many (all?) of the EMBL.org Profile For example, on the Beltrao group: |
Related: I've fixed the onotlogy so EBI Training isn't its own parent. https://content.embl.org/node/7278 When we improve the term picker we could conceivably prevent this. |
I'm still getting timeout error when I try to sync taxonomies on DEV sites. |
Updating on the impact of this issue: With a taxonomy sync, if the value used in a site's Who, What, Where is not present in the synced data before the timeout stops it, the value is erased from the site's Content Hub settings. Breadcrumbs continue to use the previous values and What is usually all that is needed to have the site functioning correctly. "What" has so far never been lost in a timeout sync. |
I think I've found a solution to the sync timeout. Before it would run the sync script from the WordPress doesn't like waiting for it to finish. I've move the sync script to the REST API: I've set it up to process the sync in batches. It will redirect itself from I'll have a working PR for this soon. |
We have option to sync EMBL taxonomy from ContentHub to Wordpress, whenever we click the link to sync we are getting timeout in sync process.
@khawkins98 Spillios We need to remove any unwanted taxonomies as the list is growing so I think it takes time to process and sync them in Wordpress.
The text was updated successfully, but these errors were encountered: