Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature review for various features. #290

Conversation

mfendeksilverstripe
Copy link
Contributor

@mfendeksilverstripe mfendeksilverstripe commented Apr 23, 2020

Feature review for multiple features

Disclaimer

This pull request serves as a preview for features which can be imported into this module. Feature code is included as is from the bespoke project (it's serving as reference only). It's already well organised for module import as it contains no bespoke code dependencies unless stated in the documentation.

The goal here is to review each feature and determine if effort is worth putting into moving this feature to the module. Features identified as "worth the effort" will be separated into their own pull request which follows standard review flow.

Please DO NOT code review this PR. The code is not meant to be merged, only reviewed on do we want this feature in the module or not basis.

Test setup

All features below are used on at least three live projects of various sizes (small, medium, large).

Dependencies

List of features / fixes included in this bundle

This is the full list of features / fixes with details explaining why this feature is needed or what problem is it trying to solve. Features will be numbered for quick reference but the list is in arbitrary order.

Feature 1 - Improved Jobs admin UI

Problem

Finding specific jobs in the Jobs admin is a chore, developers often find themselves scrolling through multiple pages. Finding the right jobs is sometimes impossible as the filter criteria input is quite limited. Using Text fields for input is often tedious.

Solution

Improved UI. More filters, better input fields, more useful information in the columns.

Jobs admin

Code reference

App\Queue\Admin\Extension

Resolution

Feature 2 - Improved Job edit form

Problem

Job data is inaccessible in the jobs admin which makes debugging jobs quite difficult. It often requires the developer to download DB snapshot in order to do anything. In case job messages get corrupted (for example in case of getting truncated), they become inaccessible forcing the developer to rely on DB dump instead.

Solution

Expose job data and raw messages as read only fields in the job edit form.

Jobs data

Jobs messages

Code reference

App\Queue\Admin\DataExtension

Resolution

Feature 3 - Jobs cleanup task

Problem

Default jobs cleanup task is unusable for large number of jobs as it deletes the jobs as models. Jobs cleanup is an ongoing task as completed jobs pile up and deleting becomes a necessity (some operations become slow with large jobs table).

Solution

Implement a dev task to cleanup jobs which is really fast and runs on schedule (every hour).

Code reference

App\Queue\Cleanup\Task

Feature 4 - Queue test task / job

Problem

When setting up a queue it's tricky to test if the setup is working as expected. Queue can be tested with any job but this may require extra effort and may not be reliable:

  • job may require some DB state to run
  • there may be side-effects to the job which are unwanted
  • the job may cause issues which makes testing more difficult because it's not clear that if the issue is queue related or job related
  • most notable case for test jobs is making sure job locking works with multiple runner threads

Solution

Implement a simple test job and task to queue the test jobs, which are reliably not doing anything else apart from testing the queue.

Code reference

App\Queue\Dev

Feature 5 - Factory for jobs

Problem

Some operations (like migrations) require creation of thousands of jobs. This can actually be a challenge mostly because of the following issues:

  • direct CLI access is needed to PROD instance to run the dev task which creates the jobs so it would not time out
  • running the dev task via browser is risky as browser may determine that the page is unresponsive and create a second request. This will result in multiple copies of the dev task running potentially creating redundant jobs as the job signature system is not thread-safe
  • running the dev task via queue job for too long has the same issue as the browser - eventually the queue runner will determine that the process is no longer running and will pause / resume the job. This will result in multiple copies of the dev task running potentially creating redundant jobs as the job signature system is not thread-safe
  • redundant jobs can be detrimental to the operation as the likelihood of copies of the same jobs being processed at the same time is really high, this often results in DB write issues such as deadlocks and timeouts
  • I'm happy to provide more details on this issue if needed as this is quite complex

Solution

Instead of creating thousands of jobs via a dev task, we will use a factory job as an intermediary to ensure we can avoid redundant jobs creation. The jobs creation flow will be as follows:

  • Dev task will be executed but it has only small execution time as it creates Factory jobs (represent batches of target jobs)
  • Factory jobs will generate the target jobs
  • the main difference is that a Factory job indicates progress to the queue runner and is subject to job locking which prevents any redundant processing
  • Factory job essentially represents a part of the original dev task but is thread safe
  • it's now possible to queue up a lot of work for the queue which is convenient for overnight migrations (doesn't require that much attention from the developer while running)

Code reference

App\Queue\Factory

Feature 6 - Job management

Problem

Getting a good overview on how is queue processing jobs is quite difficult from the jobs admin. Also, batches of jobs may need to be manually restarted which is a huge chore if jobs admin UI is used as it has to be done one job at a time. Manually restarting hundreds of jobs is likely too tedious.

Solution

  • Queue report dev task to provide queue overview
  • Update job status dev task which provides job state manipulation in large batches
Queue report / Current state
  • we should include queue waiting lines, job duration per job type and parallel rate per job type (this compares linear job duration vs parallel job duration)
  • this report shouldn't be a dev task but rather a Controller

Queue report

Queue report / UI draft

queue-overview

Job status update / Current state

Job status

Job status update / UI draft
  • this feature should be moved from a dev task into the CMS using standard forms
  • the UI should use standard form layout so no UI customisation is needed

Code reference

App\Queue\Management

Feature 7 - Execution time

Problem

We may want to set limit to execution time on jobs or dev tasks. This should be done in a safe way via callback as the limit should only apply to the job / task execution.

Solution

Execution time trait which provides a small helper method to later execution time.

Code reference

App\Queue\ExecutionTime

Feature 8 - Job error logging

Problem

Default error logging for jobs is not great. Debug information is often missing important details or the information is not available at all. In some cases the information is logged but matching the Raygun error message with a specific job may be quite difficult.

Solution

Use a job error logger which logs all of the issues inside job messages. If job execution fails and triggers errors, all relevant information will be present in the job messages which makes the debugging process for developer that much easier.

Code reference

App\Queue\Extension
App\Queue\Logger

Feature 9 - Job manager process

Problem

Using asynchronous queue runner has a flaw - when the manager process terminates, it also terminates all of its child processes. This is a mistake because there is a valid use case where manager process doesn't have any more work to do (no more jobs to start) but the child processes are still executing jobs.

Solution

Fix the shell command and prevent the manager process from killing of the child processes.

Code reference

App\Queue\Manager

Resolution

Feature 10 - generic process items job

Problem

Majority of non-trivial jobs can be written as a set of sequence of items that need to be processed. This means that there is a lot of code repetition in the job implementation for these kind of jobs.

Solution

Create an abstract class to cover a job which processes sequence of items. The job will act as a parent for any job implementations that are of this kind.

Code reference

App\Queue\Job

Feature 11 - Recommended queue settings for medium stack size

Problem

There are no recommended queue settings based on stack size. There are quite a few constants to get right.

Solution

Start a documentation on recommended queue settings based on stack size.

Code reference

---
Name: app-queue
After: '#queuedjobsettings'
---
Symbiote\QueuedJobs\Services\QueuedJobService:
  lock_file_enabled: true
  lock_file_path: '/public/queue-lock'
  max_init_jobs: 100
  memory_limit: 128m
  use_shutdown_function: false
  extensions:
    - App\Queue\Extension

Symbiote\QueuedJobs\Tasks\Engines\DoormanRunner:
  max_ticks: 12
  tick_interval: 4

SilverStripe\Core\Injector\Injector:
  AsyncPHP\Doorman\Manager\ProcessManager:
    class: App\Queue\Manager
  Symbiote\QueuedJobs\Services\QueuedJobService:
    properties:
      queueRunner: %$Symbiote\QueuedJobs\Tasks\Engines\DoormanRunner
  LowUsageRule:
    class: 'AsyncPHP\Doorman\Rule\InMemoryRule'
    properties:
      Processes: 5
      MinimumProcessorUsage: 0
      MaximumProcessorUsage: 50
      MinimumMemoryUsage: 0
      MaximumMemoryUsage: 50
  MediumUsageRule:
    class: 'AsyncPHP\Doorman\Rule\InMemoryRule'
    properties:
      Processes: 3
      MinimumProcessorUsage: 50
      MaximumProcessorUsage: 75
      MinimumMemoryUsage: 50
      MaximumMemoryUsage: 75
  HighUsageRule:
    class: 'AsyncPHP\Doorman\Rule\InMemoryRule'
    properties:
      Processes: 1
      MinimumProcessorUsage: 75
      MaximumProcessorUsage: 100
      MinimumMemoryUsage: 75
      MaximumMemoryUsage: 100
  Symbiote\QueuedJobs\Tasks\Engines\DoormanRunner:
    properties:
      DefaultRules:
        DefaultRule: null
        LowUsageRule: '%$LowUsageRule'
        MediumUsageRule: '%$MediumUsageRule'
        HighUsageRule: '%$HighUsageRule'

Feature 12 - Automatic Jobs retries

Problem

Some jobs fail and we can't change that. For example jobs that contain third party API requests or automated publishing of multiple localisations of the same page at the same time via Embargo / Expiry module. We have to accept that these jobs may fail but we should be able to minimise the failure occurrence.

Solution

Automated broken job retries. Number of retries are configurable and also the time delay between each retry. This is especially useful if some third party API has a small downtime or there is DB deadlock in place. Waiting with the job retry for a few minutes may make the difference.

Code reference

App\Extensions\QueuedJobs QueuedJobServiceExtension

@madmatt
Copy link
Collaborator

madmatt commented Apr 27, 2020

This is amazing work @mfendeksilverstripe, thanks to you and the whole team! These features look fantastic and I'd love to see every last one of them integrated.

@brynwhyman
Copy link

Wow, these are some fantastic suggested contributions back to the module - big shout out to you and the team! I'm keen to see thoughts from others on the specific features, but they all look like great additions to me.

Copy link
Contributor

@chillu chillu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work Mojmir! Left some minor review comments inline. Got some security concerns with "Feature 6 - Job management" (low impact). Otherwise I'd love to see all eleven features in the module.

I'm assuming you'd convert anything that's currently an Extension into modifications on the class it was extending? Extensions slow everything down, so we should only use them when necessary.

Feature 11 - Recommended queue settings for medium stack size

Hmm my first instinct was to recommend docs on https://docs.silverstripe.cloud. But I think it's more of a general "fine tuning" section in the module readme. Maybe this could be accompanied with some rules of thumb? Some of the details might already be explained, e.g. in the doorman module. But queue fine tuning is so time intensive that it's worth a bit of investment into describing it better IMHO

* @property int $randomID
* @package App\Queue\Dev
*/
class Job extends Queue\Job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be a bit more explicit about this and call it TestJob (rather than relying on the task being in the Dev namespace). Over time, there might be other "dev jobs" with different purposes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code follows the feature based namespace approach which made the code nice and portable within the bespoke code. Of course, I'm happy to change the namespace / class naming convention to whatever makes more sense within the module. This change would be done as a part of standard pull request flow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you had a second "job" that would rely on the "queue" (which all jobs do), then what would the FQCN be?

return;
}

$job = Injector::inst()->create($this->jobClass);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably check that $jobClass actually contains a valid job class definition (for both dev experience and security reasons)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

echo '<br />';

echo 'Submitting will apply change immediately:<br><br>';
echo '<button type="submit">Update status</button>';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the TaskRunner controller enforces authorisation on this task (effectively CLI or ADMIN), it's not automatically guarding against CSRF. While CSRF via POST is a bit harder to achieve, this would still make logged-in admins vulnerable. Check out URLSpecialsMiddleware in core in how we've handled that for dev/build, you might be able to hook into the same system? Alternatively, use the Silverstripe Form APIs which should enable CSRF by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I was hoping to use forms and templates to improve this dev task :)

*
* @package App\Queue
*/
class Logger implements LoggerInterface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would benefit from a config example. How are you injecting the logger only during queued job executions? Are you ensuring that a plaintext formatter is attached to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see this example. The logger is attached to a helper which is executed within the job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've argued in #309 that I think this isn't the right approach :)

*/
protected function getCommand($binary, $worker, $stdout, $stderr) // phpcs:ignore SlevomatCodingStandard.TypeHints
{
return sprintf('nohup %s %s %%s %s %s & echo $!', $binary, $worker, $stdout, $stderr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this shouldn't be the default behaviour? Assuming nohup is a pretty standard unix command, it shouldn't increase the requirements for the module.

How does this deal with runaway child processes that don't have an explicit max execution time set on them (via your ExecutionTime trait)? In most cases. PHP CLI executions won't be time limited. Should you only use this manager if you've done your homework in every job type that's run in your project? Maybe we need to define a relatively generous max exec time by default. It seems better than a) terminating child processes while still running (status quo), or b) risking runaway processes (your change)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick comparison:

Current command:

  • child processes terminate when parent process terminates
  • lower chance of having zombie processes
  • potentially less efficient job execution as child processes may terminate prematurely

New command

  • child processes terminate independently from parent process
  • higher chance of having zombie processes if not handled properly (max execution time needs to be set for jobs)
  • potentially more efficient job execution as child processes may continue execution after parent process terminates

We tested the new approach extensively on a large project with high queue usage. Every non-trivial job was assigned a reasonable max execution time and it works really well.

Overall, I agree that setting a global default for time execution and using this new command by default looks like a great option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK can you a max execution default part of the same feature PR please? And check if there's enough docs around considerations for execution time when implementing jobs.

* @property array $remaining
* @package App\Queue
*/
abstract class Job extends AbstractQueuedJob
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a better name :)

Copy link
Contributor Author

@mfendeksilverstripe mfendeksilverstripe Jun 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to rename this during standard pull request flow.

public function setup(): void
{
$this->remaining = $this->items;
$this->totalSteps = count($this->items);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$items is pretty opaque here, maybe add a getItems() to the interface and describe what it should look like? I guess that's implementation specific, could be a class name, an array, etc.

Do you have an example to illustrate what you could do with this?

Copy link
Contributor Author

@mfendeksilverstripe mfendeksilverstripe Jun 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, some generic examples:

  • job which generates image variants (one item represents one image variant - array)
  • job which migrates a list of objects (one item represents object ID - integer)
  • job which updates static cache (one item represents one URL - string)

Specific example seen here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm wondering how much value there is in defining an abstract class for this, with so much implicit variations. Maybe this could just be a "howto: implement multi step jobs" in the docs? Something that simply points out that all jobs have $remaining and $totalSteps.

@chillu
Copy link
Contributor

chillu commented May 27, 2020

Here's my take on what I'd consider the most valueable features for the module, in case we need to sequence the work (in descending order):

Feature 8 - Job error logging
Feature 9 - Job manager process
Feature 1 - Improved Jobs admin UI
Feature 2 - Improved Job edit form
Feature 6 - Job management
Feature 11 - Recommended queue settings for medium stack size
Feature 7 - Execution time
Feature 3 - Jobs cleanup task
Feature 5 - Factory for jobs
Feature 10 - generic process items job
Feature 4 - Queue test task / job

@mfendeksilverstripe
Copy link
Contributor Author

Thanks for feedback @chillu I'll start breaking out the features into separate pull requests.

@mfendeksilverstripe
Copy link
Contributor Author

Feature 9 - Job manager process covered in #305

@mfendeksilverstripe
Copy link
Contributor Author

Added Feature 12 - Automatic Jobs retries

@emteknetnz
Copy link
Collaborator

emteknetnz commented Aug 23, 2020

@mfendeksilverstripe yes agreed that the features you've listed here looks great. I think the priority order that Ingo listed above looks like a pretty sensible way to order things

Feel free to create PR's that go straight into the queuedjob modules branched off and into the 4 branch (new features that are backwards compatible are released as the next minor version) when you have the opportunity and we'll look to get them merged for you. It probably makes sense to do a single PR for each feature.

Cheers!

@mfendeksilverstripe
Copy link
Contributor Author

Thank you @emteknetnz , will start splitting the features into separate PRs 👍

@GuySartorelli
Copy link
Collaborator

This has been sitting here with no action for a really long time - I'm going to close it on the assumption that it has served its purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants