Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harden background job handling #195

Open
2 of 3 tasks
bekarice opened this issue Mar 10, 2017 · 13 comments
Open
2 of 3 tasks

Harden background job handling #195

bekarice opened this issue Mar 10, 2017 · 13 comments
Assignees
Labels
Background Jobs Code concerning background jobs Feature Refactor Issues and pull requests that involve refactoring of FW components

Comments

@bekarice
Copy link
Contributor

bekarice commented Mar 10, 2017

In our utilities classes, we commonly run into a couple issues:

  1. Object caching overrides that improperly handle options (don't purge cache when an option is updated, so job progress is incorrect) -- this can cause duplicated rows in the export file, or extend the time for processing an import.
  2. Hosts that block or rate-limit POST requests to the admin-ajax endpoint. This will cause the job to remain queued, as the maybe_handle() method in the async request class doesn't fire right away or at all.

To resolve these, we should:

  • Use direct SQL queries for setting / getting job status because some hosts don't properly clear the option cache when requested. This will bypass caching.
  • Investigate alternatives for dispatching a job to move it to processing that don't involve a POST request to the admin-ajax endpoint.
  • With ZD 544840, it turned out the server couldn't find its own hostname, so wp remote post using cURL couldn't look up its own DNS and therefore execute that request. We could do some lower level checks to try to ensure the hostname is known, as there have been a couple instances that were similar that I think the same misconfiguration could have been happening.

Related issues

Related threads

@bekarice bekarice added Background Jobs Code concerning background jobs Feature Refactor Issues and pull requests that involve refactoring of FW components labels Mar 10, 2017
@bekarice bekarice added this to the 4.7.0 milestone Mar 10, 2017
@bekarice
Copy link
Contributor Author

updated OP with notes to check that server hostname is properly configured

@ragulka
Copy link
Contributor

ragulka commented Apr 11, 2017

Hosts that block or rate-limit POST requests to the admin-ajax endpoint. This will cause the job to remain queued, as the maybe_handle() method in the async request class doesn't fire right away or at all.

@bekarice wait... do they only limit POST requests? Could we get away by simply using wp_safe_remote_get() instead of wp_safe_remote_post()? Or am I missing something? They both support non-blocking cURL requests. I just tested this locally and the async part worked flawlessly.

Also, just found this article covering a few alternatives we could look into: https://segment.com/blog/how-to-make-async-requests-in-php/

With ZD 544840, it turned out the server couldn't find its own hostname, so wp remote post using cURL couldn't look up its own DNS and therefore execute that request. We could do some lower level checks to try to ensure the hostname is known, as there have been a couple instances that were similar that I think the same misconfiguration could have been happening

I'm curious how this happens? When we wp_remote_post(), we'll simply use admin_url() to create the endpoint to request. This, in turn uses get_option( 'siteurl' ) to get the base url from the database. The siteurl option is set during install, using wp_guess_url(). So if my assumptions are correct, then the site should not really function at all for that customer, since WP so heavily relies on the siteurl option being correctly set. However, it might be that I misunderstood the whole issue.

@ragulka
Copy link
Contributor

ragulka commented Apr 11, 2017

Use direct SQL queries for setting / getting job status because some hosts don't properly clear the option cache when requested. This will bypass caching.

@bekarice do you think we simply want to have a simple SQL query or should we create our own versions of add/update/get_option functions that also trigger all the hooks and filters that are normally run?

EDIT: thinking more about this - I'm not sure how the Object Cache in GoDaddy/WPE works (do they supply their own version of WP_Object_Cache?), but would it be feasible to invalidate/flush the cache before add/update/get_option? For example:

wp_cache_delete( "{$this->identifier}_job_{$id}" );
$results = get_option( "{$this->identifier}_job_{$id}" );

Or am I completely off track here?

@maxrice
Copy link
Contributor

maxrice commented Apr 11, 2017

AFAIK GoDaddy/WPE (and probably other hosts) provide their own version of the object cache so they can use other persistence layers like Redis, memcached, etc. When we were troubleshooting the tickets with these issues, I think we tried clearing the cache but it didn't have any effect.

For the purposes of a job processing system, you really don't want any caching involved at all. I think we'd want simple SQL queries without any hooks or filters.

re: the rate-limited POST requests, I don't know that there is a good solution. A GET requested might work better, but I suspect it's more likely to be cached, which is bad. I think what we would benefit from here is a health check that confirms the server can make POST requests to it's own admin-ajax.php endpoint, so that we have an easy way when doing support to know what's going wrong.

@ragulka
Copy link
Contributor

ragulka commented Apr 12, 2017

re: the rate-limited POST requests, I don't know that there is a good solution. A GET requested might work better, but I suspect it's more likely to be cached, which is bad. I think what we would benefit from here is a health check that confirms the server can make POST requests to it's own admin-ajax.php endpoint, so that we have an easy way when doing support to know what's going wrong.

Hm, you're probably right about GET requests possibly being cached. But could we use cache-busting? If the URL is different each time, then the it should not be cached, right? For example, we can simply append a md5 of the job id and current timestamp to the url, something like this:

http://example.com/wp-admin/admin-ajax.php?action=run_background_process&stamp=jdshh8272318jsadas847

Given that query-strings are not cached on some CDNs/proxies, it should work in our favour. What do you think?

@maxrice
Copy link
Contributor

maxrice commented Apr 12, 2017

Great point, we could definitely add a timestamp to the URLs which should work. I'm still not sure if it's the HTTP method that's the real issue with these sites or other weirdness (like the one where the server's DNS was busted). It stands to reason that if a hosting provider is blocking or rate limiting POST requests to the admin-ajax endpoint, they'd probably do that for any type? Before we make this kind of change, I'd like to be able to test on a server that has this specific issue, so we can see if changing the method makes any difference. @bekarice any recent support tickets that we could use here?

@bekarice
Copy link
Contributor Author

@maxrice @ragulka ZD 544490 has picked back up again actually, that's with WPE so that's probably the best candidate since that seems to be the most problematic with rate limiting.

@ragulka ragulka self-assigned this Apr 18, 2017
ragulka added a commit that referenced this issue Apr 18, 2017
Workaround for issues with some hosts that are rate-limiting the admin
AJAX endpoint for POST requests. The endpoint url will always be unique
due to the nonce param, so caching should not affect the requests.
@ragulka
Copy link
Contributor

ragulka commented Apr 18, 2017

@bekarice I've patched CSV Export with direct SQL queries and GET requests in https://github.com/skyverge/wc-plugins/pull/2110 - can this be tested on the customer's site to see if it fixes the issue for them?

@bekarice
Copy link
Contributor Author

@ragulka getting FTP creds for that site -- WPE says object caching is off, but there's some duplication happening. will loop you into that thread once I have it.

@maxrice
Copy link
Contributor

maxrice commented May 18, 2017

Following up on this with ZD 568347 -- this host blocks so-called "loopback connections" where the server is attempting to connect to itself, responding to both GET and POST requests with HTTP 403. Obviously our async process never starts, and exports are left in the queued state with only the column headers present.

Amusingly (and I guess expected), if you copy the generated URL that the dispatch method tries to connect to, and visit that in the browser, that kicks off all the background jobs.

I'd say that we need

  1. a big red admin notice that indicates to the admin that their hosting provider is blocking loopback connections. Ideally we'll check this on plugin install/upgrade, and provide a system status tool for good measure.
  2. A potential workaround where we make client-side Ajax requests to the dispatch URL in order to get the exports to process. Since they won't appear to be coming from the server, there shouldn't be any issues with these lame hosting providers in that case. The obviously caveat here is that the user won't be able to leave the page when the background job is in-progress, or else processing will stop.

@bekarice
Copy link
Contributor Author

I'm not a huge fan of option 2, but I guess it's at least a workaround to use as a fallback if we detect that the host won't accept this connection. We'd just want to keep this in the job notice + admin notice so people know it will function differently for them, and you'd have to have a js alert likely if they try to leave the page.

Would we need to fail gracefully on the job somehow if they do (ie delete it or just stop progress on the current row)?

I think with Max's points and this

With ZD 544840, it turned out the server couldn't find its own hostname, so wp remote post using cURL couldn't look up its own DNS and therefore execute that request. We could do some lower level checks to try to ensure the hostname is known, as there have been a couple instances that were similar that I think the same misconfiguration could have been happening

we could likely close this out.

cc @mikejolley who may be interested in a couple of the above changes (I think $wpdb vs update_option() / get_option() in #209 has helped at least, but still a bit early to tell if it's solved the majority of caching issues).

@ragulka
Copy link
Contributor

ragulka commented May 18, 2017

I think a fallback js version with an alert when trying to leave the page is a fine workaround. Although even doing so we should really encourage merchants to contact their hosts to enable loopback connections.

@bekarice
Copy link
Contributor Author

one limitation to consider: auto-exports probably couldn't use a js / client side request system since it would require the browser to fire requests, so this really would only be a workaround for manual exports. with that in mind, do we think it would still be worthwhile to pursue, or just require people to use a reasonable host?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Background Jobs Code concerning background jobs Feature Refactor Issues and pull requests that involve refactoring of FW components
Projects
None yet
Development

No branches or pull requests

4 participants