Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Lazily retrieve pipeline stages from PSR-11 container #55

Open
DvdGiessen opened this issue Aug 6, 2021 · 3 comments

Comments

@DvdGiessen
Copy link

DvdGiessen commented Aug 6, 2021

Description

Pipelines are great for data processing. However, there may be cases where the data fed into the pipeline is invalid, causing any stage to fail. That means there can be quite a few pipeline stages that we loaded, configured, et cetera, that are not going to be called. We can optimize performance in these cases by lazily initializing pipeline stages.

Instead of coming up with some bespoke interface to do so, we can instead delegate this to an existing PSR-11 container implementation. PSR-11 can be considered quite mature at this point, and seems like a good match.

So, instead of doing:

$pipeline = (new PipelineBuilder())
    ->add($container->get(MyFirstStage::class))
    ->add($container->get(MySecondStage::class))
    ->add(function ($payload) {
        return $payload * 10;
    })
    ->add($container->get(MyFourthStage::class))
    ->build();
// Every stage has now gone through initialization

We might have something like:

$pipeline = (new ContainerAwarePipelineBuilder($container))
    ->add(MyFirstStage::class)
    ->add(MySecondStage::class)
    ->add(function ($payload) {
        // Adding callable stages directly still works
        return $payload * 10;
    })
    ->add(MyFourthStage::class)
    ->build();
// Stages from the container will not be initialized at this point,
// they will be initialized when the stage is invoked

What problem does this solve

As mentioned; lazy loading can do a lot for performance in larger applications. This idea came up because in my application I have a data processing pipeline with various stages that can fail. There are also (class based) stages that interact with a remote database, use configuration files, etc, which are expensive to initialize.

The cleanest way to write these stages would usually be a simple class where dependencies are passed to the constructor and initialization like preparing SQL statements, parsing a configuration file, etc are done in the constructor as well. Then the __invoke() method is ready to just do its work.

However, that setup is expensive: not only the initialization that happens within the stage itself, but also the dependencies the stage depends upon need already be resolved. For example, if a stage depends on a PDO object to do it's database work, we need to already set up a connection to the database.

That means that if the pipeline is processing some payload that fails during the very first stage (i.e. a validation step fails), we already have done the expensive initialization for all the stages that follow it but that are never going to be invoked.

(A currently possible workaround is passing a container instance into the stages and have them lazily load their dependencies and do setup lazily whenever the stage is first invoked. This adds a lot of code complexity to the stages, and passing a container around like that is a bit of an anti-pattern. Solving this within the Pipeline abstraction would generally make for much nicer code.)

Brainstorm: If we implement this, how?

  • We can probably support both already ready to go stages and stages that need to be fetched from the container with a single interface: We would widen the types allowable for a stage to callable|string: if it's a callable, it is used directly as a stage. If it's a non-callable string, then it's used as a key to retrieve the stage from the container.
  • Alternatively, we add this functionality to a separate class (for example ContainerAwarePipeline(Builder) as in my example above). We would still need to widen the callable type used in the interfaces.
  • For those interacting with the classes/interfaces widening the types wouldn't be a problem, but for those implementing custom classes based on them that would be a BC break. So yes, this would need a new major version.
  • In documentation and examples, we can use the excellent League Container, which itself already provides documentation on how to do lazy-loading.
  • Another design choice: Do we retrieve the stages once and cache them? Or do we delegate that to the container as well, and just retrieve the stage every time we need it? Probably the latter, but might be worth discussing.

I'd be happy to do the initial work and make a pull request, if the Pipeline maintainers are interested to have this kind of functionality added.

@icanhazstring
Copy link

icanhazstring commented Aug 6, 2021

This is already possible with the current implementation.

class ContainerPipelineBuilder implements PipelineBuilderInterface
{
    private $container;
    private $stages = [];

    public function __construct(Container $container)
    {
        $this->container = $container;
    }

    /**
     * @return self
     */
    public function add($stageClass)
    {
        $this->stages[] = function() use ($stageClass) {
            $stage = $this->container->get($stageClass);
            return call_user_func_array([$stage, '__invoke'], func_get_args());
        };

        return $this;
    }

    public function build(ProcessorInterface $processor = null)
    {
        return new Pipeline($processor, ...$this->stages);
    }
}

class MyStage {
    public function __invoke()
    {
        echo 'first stage';
    }
}

$pb = new ContainerPipelineBuilder(new Container());
$p = $pb->add(MyStage::class)->build();
$p->process('a'); // will print 'first stage';

The "only" thing that needs to be changed is to remove the typehint from add in the builder on your implementation.
You could also add a generic annotation for psalm/phpstan that the parameter given must be a class-string.

See working example: https://3v4l.org/3qGGe

@DvdGiessen
Copy link
Author

Yes, that's roughly what the implementation could look like! To clarify, main reason I raised the feature request is to explore whether having this functionality (and docs, tests, etc) contributed to be part of Pipeline is something maintainers are interested in. If not, users can always build the abstraction themselves.

@icanhazstring
Copy link

I don't think that it should be implemented into this package as it's already possible to provide this. What I could imagine is a pipeline-psr11 package of some sort. This way you can provide an extension to certain other implementations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants