
An idea - a Runner that owns the input and output arrays and freezes shapes and names #40

Open
aldanor opened this issue Nov 28, 2020 · 5 comments


aldanor commented Nov 28, 2020

Bottom line: there's a lot of overhead in run() currently:

  • It allocates something like 15 Vec instances plus a bunch of strings; there are allocations all over the place, so for small inputs and graphs this is noticeable.
  • For big inputs, you are currently required to copy the data in.
  • There's repeated work like building name vecs (should be done once, upon model load?) and shapes (if there are no dynamic axes, there's no need to rebuild them on every call).
  • Outputs are allocated on every call as well.

Here's one idea: what if you could do something like this? (I think this way you could bring the overhead down to almost zero.)

// maybe I've missed something, would like to hear your thoughts, @nbigaouette :)

// note that this is all simplified, as it may require e.g. Pin<> in a few places
struct Runner {
    session: Session,
    inputs: Vec<Array<...>>,
    // owned preallocated outputs as well?
    input_names: Vec<CString>,
    output_names: Vec<CString>,
}

impl Runner {
    fn from_session(session: Session) -> Self { ... }

    pub fn execute(&mut self) -> Result<()> { ... }

    pub fn outputs(&self) -> &[Array<...>] { ... }

    pub fn inputs(&mut self) -> &mut [Array<...>] { ... }
}

let mut session: Session = ...;
let input_arrays: Vec<...> = ...;

// this executes most of what `run()` currently does, all the way up to the actual .Run() call
let mut runner = session.into_runner(input_arrays);

runner.execute()?; // this just calls Run() and converts the status

// if outputs are preallocated, no extra allocations here either
for out in runner.outputs() {
    dbg!(out);
}

// no allocations, no boilerplate, we're just updating the inputs
runner.inputs()[0].fill(42.0);

// no allocations, no boilerplate, just a .Run() call
runner.execute()?;

aldanor commented Nov 28, 2020

In fact, I think that the current run() can probably even be expressed in terms of the above. This may require it to hold a mutable reference to the session, though, i.e. Runner<'a> { session: &'a mut Session, ... }.

So to retain current API you could have

impl Session {
    pub fn run(&mut self, inputs: Inputs) -> Result<Outputs> {
        let mut runner = Runner::new(self, inputs)?;
        runner.execute()?;
        Ok(runner.into_outputs())
    }
}

(As noted in #39 though, things like caching names should probably be done outside of all this anyway, upon model loading; likewise precaching shapes when there are no dynamic axes, etc.)

Note: the above will probably not compile as-is because of potential multiple mutable borrows etc., but those are technical details that can be made to work with a bit of munging and shuffling; I just tried to make the general idea clear.
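To make the borrow structure concrete, here's a minimal self-contained sketch of run()-in-terms-of-Runner that does compile. Everything here is a stand-in: Session is a toy with a single scale field and execute() just multiplies buffers, since the real onnxruntime types and the actual .Run() call are elided in this discussion. Only the ownership/lifetime shape is the point:

```rust
// Sketch only: `Session`, `Runner`, and the f32 buffers stand in for
// the real onnxruntime types, which are elided above.
struct Session {
    scale: f32, // pretend this is the loaded model
}

struct Runner<'a> {
    session: &'a mut Session,
    inputs: Vec<Vec<f32>>,
    outputs: Vec<Vec<f32>>,
}

impl<'a> Runner<'a> {
    fn new(session: &'a mut Session, inputs: Vec<Vec<f32>>) -> Self {
        // preallocate outputs once, matching the input shapes
        let outputs = inputs.iter().map(|i| vec![0.0; i.len()]).collect();
        Runner { session, inputs, outputs }
    }

    fn execute(&mut self) -> Result<(), ()> {
        // stand-in for the single .Run() call: no allocations here
        for (inp, out) in self.inputs.iter().zip(self.outputs.iter_mut()) {
            for (x, y) in inp.iter().zip(out.iter_mut()) {
                *y = x * self.session.scale;
            }
        }
        Ok(())
    }

    fn into_outputs(self) -> Vec<Vec<f32>> {
        self.outputs
    }
}

impl Session {
    // the existing one-shot `run()` API, expressed in terms of `Runner`
    fn run(&mut self, inputs: Vec<Vec<f32>>) -> Result<Vec<Vec<f32>>, ()> {
        let mut runner = Runner::new(self, inputs);
        runner.execute()?;
        Ok(runner.into_outputs())
    }
}

fn main() {
    let mut session = Session { scale: 2.0 };
    let outputs = session.run(vec![vec![1.0, 2.0, 3.0]]).unwrap();
    println!("{:?}", outputs); // [[2.0, 4.0, 6.0]]
}
```

The mutable borrow of the session lives only as long as the Runner, so the compiler forces exactly the "one runner at a time" discipline the design needs anyway.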


aldanor commented Nov 28, 2020

Thinking about it further: a Runner is almost like a Session where the input shape is known. Then we can basically preallocate everything, including inputs and outputs, and all pointers can be frozen. You don't even need to pass input arrays to create a runner; it can just zero-initialise the inputs, since it only needs the shape(s).

I think in most practical cases where execution speed is critical (realtime apps), the input shape would almost always be frozen and known in advance, so all dimensions would be fully known and the only thing that changes between calls is the input data itself (e.g. frames received from a camera, etc.).
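That frozen-shape variant can be sketched like so. Again a self-contained toy: plain Vec<f32> buffers stand in for the real tensors, and execute() is an identity "model" purely for illustration; the point is that a runner can be created from shapes alone, zero-initialise its buffers once, and then reuse the same allocations for every frame:

```rust
// Toy sketch: a runner created from a shape alone, with zero-initialised,
// preallocated input/output buffers reused across all execute() calls.
struct FrozenRunner {
    shape: Vec<usize>,
    input: Vec<f32>,  // preallocated once; caller writes each frame here
    output: Vec<f32>, // preallocated once; "Run()" writes here
}

impl FrozenRunner {
    fn from_shape(shape: Vec<usize>) -> Self {
        let len: usize = shape.iter().product();
        FrozenRunner {
            shape,
            input: vec![0.0; len],  // zero-initialised inputs
            output: vec![0.0; len],
        }
    }

    fn input_mut(&mut self) -> &mut [f32] {
        &mut self.input
    }

    fn execute(&mut self) {
        // stand-in for the frozen .Run() call: same pointers every time,
        // no allocation; a real model would compute instead of copying
        self.output.copy_from_slice(&self.input);
    }

    fn output(&self) -> &[f32] {
        &self.output
    }
}

fn main() {
    // e.g. a fixed 1x4 "frame" shape, known in advance
    let mut runner = FrozenRunner::from_shape(vec![1, 4]);

    // per-frame loop: only the input contents change, nothing reallocates
    for frame in 0..3 {
        runner.input_mut().fill(frame as f32);
        runner.execute();
        assert_eq!(runner.output()[0], frame as f32);
    }
    println!("shape = {:?}", runner.shape);
}
```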


aldanor commented Nov 28, 2020

I have a prototype that I can try to push tonight. In brief, it reduces the execution time of a tiny graph with a few nodes from 15us to 8us, so almost a 2x speedup; plus there are no more extractors, no allocations, and no copies or clones (as suggested above).

@marshallpierce
Contributor

This seems like a good fit for my use case: a service that loads precisely one .onnx file, and then feeds data from each request through the resulting session.


aldanor commented Dec 2, 2020

@marshallpierce See #41 for a preliminary working implementation
