Vraspar/phi 3 ios update #467

Open · wants to merge 2 commits into main
Conversation

@vraspar (Contributor) commented Oct 3, 2024

Update the Phi-3 iOS application:

  • Modify the README and add a sample screenshot
  • Add logging and display token generation stats
  • Use user input instead of a fixed prompt
  • UI improvements

@edgchen1 (Contributor) left a comment

nice improvements!

@@ -1,27 +1,124 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
Contributor

should we keep the copyright notice?

let totalTime = userInfo["totalTime"] as? Int,
let firstTokenTime = userInfo["firstTokenTime"] as? Int,
let tokenCount = userInfo["tokenCount"] as? Int {
stats = "Generated \(tokenCount) tokens in \(totalTime) ms. First token in \(firstTokenTime) ms."
Contributor

could we also include something like this:

token generation rate (tokens/second) = (tokenCount - 1) * 1000 / (totalTimeInMs - firstTokenTimeInMs)

the token generation and prompt processing rates may be different, and it might be useful to get a sense of both.
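In code, that could look something like this (sketch only; the variable names are assumed to match the millisecond/count values this PR already collects and passes to the UI):

// Sketch: generation rate excluding the first token, whose latency is
// dominated by prompt processing rather than steady-state decoding.
double generationTokensPerSec = 0.0;
if (tokenCount > 1 && totalTimeMs > firstTokenTimeMs) {
    generationTokensPerSec =
        (tokenCount - 1) * 1000.0 / static_cast<double>(totalTimeMs - firstTokenTimeMs);
}
NSLog(@"Generation rate: %.2f tokens/s", generationTokensPerSec);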

Contributor

Would be nice to have prompt tokens/s as well, given that prompt and generation tokens/s are the usual metrics that get compared. That would require something to return the number of tokens in the prompt (probably the length of the input sequence post-tokenization), since IIRC it's not 1:1 with the number of words.
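As a rough sketch (assuming OgaSequences::SequenceCount returns the encoded length, and that firstTokenTimeMs is the value already collected in this PR):

// After the existing tokenizer->Encode(prompt, *sequences) call:
size_t promptTokenCount = sequences->SequenceCount(0);
// Time to first token is roughly the prompt-processing time, so an
// approximate prompt rate would be:
double promptTokensPerSec =
    firstTokenTimeMs > 0 ? promptTokenCount * 1000.0 / firstTokenTimeMs : 0.0;
NSLog(@"Prompt: %zu tokens, ~%.2f tokens/s", promptTokenCount, promptTokensPerSec);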


auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt, *sequences);
typedef std::chrono::high_resolution_clock Clock;
Contributor

nit: consider steady_clock

https://en.cppreference.com/w/cpp/chrono/steady_clock
"This clock is not related to wall clock time (for example, it can be time since last reboot), and is most suitable for measuring intervals."

// Log model creation
NSLog(@"Creating model ...");
auto model = OgaModel::Create(modelPath);
if (!model) {
Contributor

the C++ API throws exceptions.

https://github.com/microsoft/onnxruntime-genai/blob/bcf55a6dc563bc8b356128b47504d59a21c5ef2f/src/ort_genai.h#L54-L60

https://github.com/microsoft/onnxruntime-genai/blob/bcf55a6dc563bc8b356128b47504d59a21c5ef2f/src/ort_genai.h#L65

might be simplest to just put most of this method into a try {}. propagating the error back to the UI would be a nice touch.
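Something along these lines (sketch only; how the error reaches the UI depends on whatever notification/callback mechanism the app already uses):

try {
    auto model = OgaModel::Create(modelPath);
    auto tokenizer = OgaTokenizer::Create(*model);
    // ... encode the prompt, run the generation loop, post stats ...
} catch (const std::exception& e) {
    // The C++ API throws on failure rather than returning null.
    NSLog(@"Generation failed: %s", e.what());
    // e.g. post a notification carrying e.what() so the SwiftUI layer can show an alert.
}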


// Log model creation
NSLog(@"Creating model ...");
auto model = OgaModel::Create(modelPath);
Contributor

performance-wise, it's probably nicer to not re-create the model every time.

Contributor

it should definitely be nicer as the first request is usually the slowest (and typically when doing perf testing we do a warmup query first and exclude that from timing data).
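A sketch of that warmup idea; runGeneration is a hypothetical stand-in for this app's existing generation entry point:

// Run one short, untimed generation at startup so one-time costs (weight
// loading, allocator warmup, etc.) don't land on the first user request.
static std::once_flag warmupOnce;
std::call_once(warmupOnce, [] {
    runGeneration(/*prompt=*/"hello", /*reportStats=*/false);  // hypothetical helper
});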

Contributor

❤️

Comment on lines +29 to +34
NSLog(@"Creating tokenizer...");
auto tokenizer = OgaTokenizer::Create(*model);
if (!tokenizer) {
NSLog(@"Failed to create tokenizer.");
return;
}
Contributor

The tokenizer could also be created once and re-used as I believe it's tied to the model not the prompt so can be re-used.
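One way that could look (sketch): keep both as lazily created statics, since they depend only on the model files and not on the prompt:

// Created once, reused for every request.
static std::unique_ptr<OgaModel> model;
static std::unique_ptr<OgaTokenizer> tokenizer;
if (!model) {
    model = OgaModel::Create(modelPath);
    tokenizer = OgaTokenizer::Create(*model);
}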

Comment on lines +66 to +67
NSLog(@"First token generated.");
firstTokenTime = Clock::now();
Contributor

nit: logging takes time so we want to record firstTokenTime prior to logging.

might also be interesting to include per-token timing in the log as that would help get a picture of performance throughout the generation phase. e.g. is the time per token consistent? if not is the variability random or does it gradually increase/decrease? that can provide hints as to potential causes of performance issues (if there are any).

if we do that, we might want to accumulate per-token times in a list and log them at the end, as inserting log calls for every token inside the loop could significantly affect the overall time taken (esp. if the log call is synchronous, which apparently NSLog is).
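Roughly (sketch, reusing the Clock typedef from this file):

// Record the timestamp before any logging, and accumulate per-token times
// so NSLog only runs after generation finishes.
std::vector<double> tokenTimesMs;
auto lastTokenTime = Clock::now();

// inside the generation loop, immediately after a token is produced:
auto now = Clock::now();
tokenTimesMs.push_back(std::chrono::duration<double, std::milli>(now - lastTokenTime).count());
lastTokenTime = now;

// after the loop:
for (size_t i = 0; i < tokenTimesMs.size(); ++i) {
    NSLog(@"token %zu: %.2f ms", i, tokenTimesMs[i]);
}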

break;
}

NSLog(@"Decoded token: %s", decode_tokens);
Contributor

nit: might want a setting to control whether we log inside the loop (if we're after the best perf numbers possible we probably don't want to do that), or a way to exclude time in calls to NSLog from the total.
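For example (sketch; kLogEachToken is a made-up name):

// Compile-time (or user-facing) switch so benchmark runs can skip per-token logging.
static constexpr bool kLogEachToken = false;
if (kLogEachToken) {
    NSLog(@"Decoded token: %s", decode_tokens);
}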
