Optimize taxonomy sync process #520

Open
sandykadam opened this issue Oct 5, 2020 · 20 comments · Fixed by #614
Assignees
Labels
bug Something isn't working Critical Priority: Critical

Comments

@sandykadam
Collaborator

We have an option to sync the EMBL taxonomy from ContentHub to WordPress. Whenever we click the sync link, the sync process times out.
Screenshot 2020-10-05 at 10 48 35

@khawkins98 Spillios We need to remove any unwanted taxonomies. The list is growing, so I think it takes a long time to process and sync them in WordPress.

@sandykadam sandykadam added the High Priority: High label Oct 5, 2020
@sandykadam
Collaborator Author

sandykadam commented Nov 25, 2020

Hi!

Since WordPress 5.2 there is a built-in feature that detects when a plugin or theme causes a fatal error on your site, and notifies you with this automated email.

In this case, WordPress caught an error with one of your plugins, EMBL Taxonomy.

First, visit your website (https://wwwdev.embl.org/about/info/imaging-centre/) and check for any visible issues. Next, visit the page where the error was caught (https://wwwdev.embl.org/about/info/imaging-centre/about/info/imaging-centre/wp-admin/edit-tags.php?taxonomy=embl_taxonomy&sync=true) and check for any visible issues.

Please contact your host for assistance with investigating this issue further.

If your site appears broken and you can't access your dashboard normally, WordPress now has a special "recovery mode". This lets you safely login to your dashboard and investigate further.

To keep your site safe, this link will expire in 1 day. Don't worry about that, though: a new link will be emailed to you if the error occurs again after it expires.

When seeking help with this issue, you may be asked for some of the following information:
WordPress version 5.5.3
Current theme: VF-WP Groups (version 1.0.0-beta.5)
Current plugin: EMBL Taxonomy (version 1.0.0-beta.1)
PHP version 7.3.14



Error Details
=============
An error of type E_ERROR was caused in line 404 of the file /var/www/drupal/embl.org.about.info/_imaging-centre/dist/wp-content/plugins/embl-taxonomy/includes/register.php. Error message: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 135168 bytes)

Since WordPress 5.2 there is a built-in feature that detects when a plugin or theme causes a fatal error on your site, and notifies you with this automated email.

In this case, WordPress caught an error with one of your plugins, EMBL Taxonomy.

First, visit your website (https://wwwdev.embl.org/groups/marcia/) and check for any visible issues. Next, visit the page where the error was caught (https://wwwdev.embl.org/groups/marcia/groups/marcia/wp-admin/edit-tags.php?taxonomy=embl_taxonomy&sync=true) and check for any visible issues.

Please contact your host for assistance with investigating this issue further.

If your site appears broken and you can't access your dashboard normally, WordPress now has a special "recovery mode". This lets you safely login to your dashboard and investigate further.

https://wwwdev.embl.org/groups/marcia/wp-login.php?

To keep your site safe, this link will expire in 1 day. Don't worry about that, though: a new link will be emailed to you if the error occurs again after it expires.

When seeking help with this issue, you may be asked for some of the following information:
WordPress version 5.5.3
Current theme: VF-WP Groups (version 1.0.0-beta.5)
Current plugin: EMBL Taxonomy (version 1.0.0-beta.1)
PHP version 7.3.14



Error Details
=============
An error of type E_ERROR was caused in line 404 of the file /var/www/drupal/embl.org.groups/_marcia/dist/wp-content/plugins/embl-taxonomy/includes/register.php. Error message: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 135168 bytes)

@dbushell
Collaborator

The raw taxonomy JSON from the Content Hub is around 100kb so that should be parsable in one go.

With 235 terms and multiple database updates per-term that might be an issue — but looking at the error, it's failing before it even gets to that stage.

The error is in generate_terms, which maps EMBL terms to the WordPress taxonomy structure. It's a recursive function, which might explain the memory error. Or there might be circular references in the EMBL terms.

I'll investigate further tomorrow and see if I can identify what is causing the error and if I can fix it.

Ideally the plugin needs someone more proficient in PHP to rewrite it. I wrote this plugin as a proof-of-concept two years ago. It was only meant to be a working placeholder to illustrate how the WP taxonomy structure should be set up.

@dbushell
Collaborator

I've identified some issues with the API data containing invalid parent term IDs like:

string(4) "none"
string(19) "string/EBI Training"

I'm filtering them out with a regexp pattern:

const UUID_PATTERN = '#^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$#';

But that doesn't actually solve this memory bug so I'll keep going...
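
A minimal sketch of how that filter could be applied, assuming parent IDs arrive as a plain array of strings. The `filter_parent_ids` helper is hypothetical and not taken from the plugin; only the pattern comes from the comment above.

```php
<?php
// Pattern quoted above: matches a lowercase 8-4-4-4-12 hex UUID.
const UUID_PATTERN = '#^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$#';

// Hypothetical helper (not from the plugin): keep only parent IDs
// that look like UUIDs and re-index the result from zero.
function filter_parent_ids(array $parents): array {
  $valid = array_filter($parents, function ($id) {
    return is_string($id) && preg_match(UUID_PATTERN, $id) === 1;
  });
  return array_values($valid);
}

// The two invalid values reported above, plus one well-formed UUID.
$parents = [
  'none',
  'string/EBI Training',
  '302cfdf7-365b-462a-be65-82c7b783ebf7',
];
$clean = filter_parent_ids($parents);
```

Only the well-formed UUID survives, so later lookups never see placeholder strings such as "none".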

@dbushell
Collaborator

Bug is in this part with the nested loop:

      foreach ($term['parents'] as $parent_id) {
        foreach ($api_terms as $new_parent) {

embl-taxonomy/includes/register.php

That is hugely inefficient 🔥 I will optimise that and see where it gets me.
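
One possible way to drop the inner loop is to index the terms by ID once and then look each parent up directly. This is only a sketch: it assumes every term is an array with `uuid` and `parents` keys, and the `index_terms` and `resolve_parents` helpers and the sample IDs are hypothetical.

```php
<?php
// Hypothetical helper: build a uuid => term map in a single pass.
function index_terms(array $api_terms): array {
  return array_column($api_terms, null, 'uuid');
}

// Hypothetical helper: resolve a term's parents by key lookup
// instead of scanning every term for every parent ID.
function resolve_parents(array $term, array $index): array {
  $parents = [];
  foreach ($term['parents'] as $parent_id) {
    if (isset($index[$parent_id])) {
      $parents[] = $index[$parent_id];
    }
  }
  return $parents;
}

// Illustrative data only; real IDs are UUIDs.
$api_terms = [
  ['uuid' => 'root', 'name' => 'What', 'parents' => []],
  ['uuid' => 'child', 'name' => 'Seminars', 'parents' => ['root', 'missing']],
];
$index = index_terms($api_terms);
$found = resolve_parents($index['child'], $index);
```

The index is built once, so resolving parents costs one lookup per parent ID rather than a full pass over all terms. Unknown IDs such as `missing` are skipped.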

@dbushell
Collaborator

@khawkins98 something I've noticed.

The API data has similar terms with a name_display of "Seminars":

{
  "33f3a158-1ca6-4d20-a75c-23f39dce6c34": {
    "name": "EMBL Events term",
    "uuid": "33f3a158-1ca6-4d20-a75c-23f39dce6c34",
    "nid": "26735",
    "name_display": "Seminars",
    "primary": "what",
    "parents": ["302cfdf7-365b-462a-be65-82c7b783ebf7"],
    "type": "term"
  },
  "4057de86-552d-438e-9782-1a18cec5eeb3": {
    "name": "EMBL Seminars term",
    "uuid": "4057de86-552d-438e-9782-1a18cec5eeb3",
    "nid": "26737",
    "name_display": "Seminars",
    "primary": "what",
    "parents": ["33f3a158-1ca6-4d20-a75c-23f39dce6c34"],
    "type": "term"
  }
}

"EMBL Seminars term" is a child of "EMBL Events term".

All seminars are grandchildren resulting in these WordPress terms:

Screenshot 2020-11-26 at 09 58 57

Where the display name hierarchy becomes "Seminars > Seminars" 🤔

@dbushell
Collaborator

Further updates:

With local testing it gets to 18869 generated terms before the memory error. So it's going round in circles somewhere...

@dbushell
Collaborator

Found where it gets stuck:

Who > People directory > Cath Brooksbank > EMBL-EBI Training Team
What > People directory > Cath Brooksbank > EMBL-EBI Training Team
What > All EMBL sites > People directory > Cath Brooksbank > EMBL-EBI Training Team
Where > All EMBL sites > People directory > Cath Brooksbank > EMBL-EBI Training Team
Who > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team
What > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team
What > All EMBL sites > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team
Where > All EMBL sites > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team
Who > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team > EMBL-EBI Training Team
What > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team > EMBL-EBI Training Team
What > All EMBL sites > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team > EMBL-EBI Training Team
Where > All EMBL sites > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team > EMBL-EBI Training Team
Who > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team > EMBL-EBI Training Team > EMBL-EBI Training Team
What > People directory > Cath Brooksbank > EMBL-EBI Training Team > EMBL-EBI Training Team > EMBL-EBI Training Team > EMBL-EBI Training Team

[...]

This crashed my machine trying to generate the log.

Anyway, the related EMBL terms:

| # | name | uuid | parents |
| --- | --- | --- | --- |
| A | Who | 4428d1fd-441a-4d6d-a1c5-5dcf5665f213 | - |
| B | What | 302cfdf7-365b-462a-be65-82c7b783ebf7 | - |
| C | Where | b14d3f13-5670-44fb-8970-e54dfd9c921a | - |
| 1 | People directory | 3dfcb91f-a022-4dd6-8cba-6391e247f8fb | A B 3 |
| 2 | Cath Brooksbank †1 | d8695150-98c0-4fda-a378-fc46851e08f1 | 1 |
| 3 | All EMBL Sites | 89e00fee-87f4-482e-a801-4c3548bb6a58 | B C |
| 4 | EMBL-EBI Training Team †2 | 4b214ef2-8a97-4a55-9b7a-a389574996de | 2 4 5 |
| 5 | EMBL-EBI Hinxton | a99d1a7c-ca83-4c00-ab61-d082d3e41ce3 | 3 |

†1 Cath Brooksbank has two invalid parent IDs:

"parents": {
  "who": "3dfcb91f-a022-4dd6-8cba-6391e247f8fb",
  "what": "string/EBI Training",
  "where": "string/EMBL-EBI"
}

†2 EMBL-EBI Training Team is its own parent!

So if one of the term's parents is actually itself — which I assume is a mistake in the data? — it results in a never ending loop of generating WP terms. I just need to add a condition to skip this situation.
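
The skip condition could look roughly like the sketch below. It is not the plugin's generate_terms: `term_paths` is a hypothetical function, the short keys stand in for the UUIDs listed above, and invalid "string/..." parents are assumed to have been filtered out already. It tracks the IDs on the current path, so it also catches longer cycles, not just a term that lists itself.

```php
<?php
// Hypothetical data mirroring the terms listed above.
$terms = [
  'A'  => ['name' => 'Who', 'parents' => []],
  'B'  => ['name' => 'What', 'parents' => []],
  'C'  => ['name' => 'Where', 'parents' => []],
  'T1' => ['name' => 'People directory', 'parents' => ['A', 'B', 'T3']],
  'T2' => ['name' => 'Cath Brooksbank', 'parents' => ['T1']],
  'T3' => ['name' => 'All EMBL Sites', 'parents' => ['B', 'C']],
  'T4' => ['name' => 'EMBL-EBI Training Team', 'parents' => ['T2', 'T4', 'T5']],
  'T5' => ['name' => 'EMBL-EBI Hinxton', 'parents' => ['T3']],
];

// Return every root-to-term path as a list of names.
// $visiting holds the IDs already on the current path.
function term_paths(string $id, array $terms, array $visiting = []): array {
  // Skip unknown IDs and anything already on this path
  // (a self-parent or a longer cycle).
  if (!isset($terms[$id]) || in_array($id, $visiting, true)) {
    return [];
  }
  $term = $terms[$id];
  $visiting[] = $id;
  if (count($term['parents']) === 0) {
    return [[$term['name']]]; // a root term is a path on its own
  }
  $paths = [];
  foreach ($term['parents'] as $parent_id) {
    foreach (term_paths($parent_id, $terms, $visiting) as $parent_path) {
      $paths[] = array_merge($parent_path, [$term['name']]);
    }
  }
  return $paths;
}

$paths = term_paths('T4', $terms);
foreach ($paths as $path) {
  echo implode(' > ', $path), "\n";
}
```

With the self-reference skipped, the recursion terminates and "EMBL-EBI Training Team" yields six paths: the four via Cath Brooksbank shown in the log above, plus two via EMBL-EBI Hinxton. One consequence of this design is that a term whose parents are all skipped produces no paths at all.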

@dbushell dbushell mentioned this issue Nov 26, 2020
@sandykadam
Collaborator Author

@dbushell Thanks for the detailed investigation. Yes, I also think the data is incorrect somewhere and needs a lot of cleanup. I have sometimes found it confusing to set up the correct who/where/what for group sites.

@dbushell
Collaborator

I've opened a PR #614

This should fix any WP errors.

@dbushell
Collaborator

I wonder though if we actually need to generate unique terms in the WP taxonomy for each parent > child relationship.

I forget why it was implemented that way.

The Content Hub data provides around 250 terms right now, which generated 776 terms in WP.

I think we're only using them for the page meta tags:

<meta name="embl:who" content="Sharpe Group" uuid="6c31c788-04a1-48b8-a532-fdc251506b57">

Which does not include any parent IDs, so having unique terms for each hierarchy is unnecessary. Maybe there is somewhere else I'm forgetting... anyway, I'll leave it as is for now. We could look at this again in future.

@sandykadam
Collaborator Author

I think it is coupled with showing the breadcrumb as well as fetching the site profile (site description, members, GTL, etc.). If we need to change this approach, we should look at it now rather than after the sites go live. It will be difficult to manage later, with a risk of data mismatch.

cc @khawkins98 @kasprzyk-sz @meladawy

@dbushell
Collaborator

@sandykadam true, good point.

Maybe better to keep it as is. I should have noted this is my last day working on the project this year! So any further work would have to be picked up by someone else.

@sandykadam
Collaborator Author

sandykadam commented Dec 1, 2020

We deployed v28 on all group sites. I was fixing some of the EBI sites and found that there are only a limited number of "who" records. Only the following people are showing as "who" in the taxonomy list.

Screenshot 2020-12-01 at 17 13 21

Can we please check this ASAP, as all sites will break.

Testing on https://wwwdev.ebi.ac.uk/research-beta/beltrao/

@khawkins98
Contributor

Looking at https://www.embl.org/api/v1/pattern.json?pattern=embl-ontology&source=contenthub there are only 6 "type": "person" records. So this is probably correct on the VF-WP end.

For person records to show in the taxonomy, they need to be associated with an EMBL.org Profile.

@sandykadam
Collaborator Author

> Looking at https://www.embl.org/api/v1/pattern.json?pattern=embl-ontology&source=contenthub there are only 6 "type": "person" records. So this is probably correct on the VF-WP end.
>
> For person records to show in the taxonomy, they need to be associated with an EMBL.org Profile.

I'm confused now. We have 110 group sites set up, which means we have already created 110 "EMBL profiles", correct? So I should be able to see at least the 110 people they are linked to.

@khawkins98
Contributor

khawkins98 commented Dec 1, 2020

IIRC: many (all?) of the EMBL.org Profile Who associations were reset when the people were deleted and reimported in the contentHub about a month ago.

For example, on the Beltrao group:

image

https://content.embl.org/node/7133

@khawkins98
Contributor

Related: I've fixed the ontology so EBI Training isn't its own parent.

image

https://content.embl.org/node/7278

When we improve the term picker we could conceivably prevent this.

@sandykadam
Collaborator Author

I'm still getting a timeout error when I try to sync taxonomies on DEV sites.

@khawkins98 khawkins98 added bug Something isn't working Critical Priority: Critical and removed High Priority: High labels Jun 16, 2021
@filipe-dias

filipe-dias commented Jun 16, 2021

Updating on the impact of this issue:

During a taxonomy sync, if a value used in a site's Who, What, or Where setting has not yet been synced when the timeout stops the process, that value is erased from the site's Content Hub settings. Breadcrumbs continue to use the previous values, and What is usually all that is needed for the site to function correctly. So far, "What" has never been lost in a timed-out sync.

@dbushell
Collaborator

I think I've found a solution to the sync timeout.

Previously, the sync script ran from the admin_init hook when loading this admin page:
/wp-admin/edit-tags.php?taxonomy=embl_taxonomy&sync=true

WordPress doesn't like waiting for it to finish.

I've moved the sync script to the REST API:
/wp-json/embl-taxonomy/v1/sync

I've set it up to process the sync in batches. It will redirect itself from /sync?offset=0 to /sync?offset=100 etc until complete.

I'll have a working PR for this soon.
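
The offset stepping described above can be sketched without any WordPress code. The `sync_batch` helper, the batch size default, and the 250-item list are all assumptions for illustration. In the real endpoint each step would be a separate request to /sync?offset=N rather than a loop.

```php
<?php
// Hypothetical helper: return one batch of terms plus the offset of
// the next batch, or null once every term has been handed out.
function sync_batch(array $terms, int $offset, int $batch_size = 100): array {
  $batch = array_slice($terms, $offset, $batch_size);
  $next = $offset + count($batch);
  return [
    'batch' => $batch,
    'next_offset' => $next < count($terms) ? $next : null,
  ];
}

// Illustrative stand-in for roughly 250 Content Hub terms.
$terms = range(1, 250);

// Simulate the chain of redirects: each iteration is one request.
$offset = 0;
$offsets = [];
while ($offset !== null) {
  $offsets[] = $offset;
  $step = sync_batch($terms, $offset);
  // ... insert or update the terms in $step['batch'] here ...
  $offset = $step['next_offset'];
}
```

Each request then handles at most 100 terms, so no single request has to stay open for the whole sync.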
