Reliability of the Portal is one of the top pain points from a customer's perspective. As an extension author, you have a duty to keep your experience at or above the reliability bar.
Area | Reliability Bar | Telemetry Action/s | How is it measured? |
---|---|---|---|
Extension | See Power BI | ExtensionLoad | (# of ExtensionLoad completes / (# of ExtensionLoad completes + cancels)) * 100 |
Blade | See Power BI | BladeLoaded vs BladeLoadErrored | ((# of BladeLoaded started - # of BladeLoadErrored) / # of BladeLoaded started) * 100 |
Part | See Power BI | PartLoaded | ((# of PartLoaded started - # of PartLoaded canceled) / # of PartLoaded started) * 100 |
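The formulas in the table can be sketched as small helper functions. This is a minimal illustration, assuming the event counts have already been pulled from telemetry; the `LoadCounts` shape and the function names are hypothetical and not part of any Fx API.

```typescript
// Counts of telemetry events for one blade or part over a time window.
// Hypothetical shape, used only for this sketch.
interface LoadCounts {
    started: number; // e.g. # of BladeLoaded started or # of PartLoaded started
    failed: number;  // e.g. # of BladeLoadErrored or # of PartLoaded canceled
}

// ((started - failed) / started) * 100, as in the Blade and Part rows.
function reliabilityPercent(counts: LoadCounts): number {
    if (counts.started <= 0) {
        return NaN; // no loads in the window, so reliability is undefined
    }
    return ((counts.started - counts.failed) / counts.started) * 100;
}

// (completes / (completes + cancels)) * 100, as in the Extension row.
function extensionReliabilityPercent(completes: number, cancels: number): number {
    const total = completes + cancels;
    return total > 0 ? (completes / total) * 100 : NaN;
}
```

For example, 400 started blade loads with 100 errors gives a reliability of 75%.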
Extension reliability is core to your customers' experience: if the FX is unable to load your extension, it cannot surface any of your experience. Consequently, your customers will be unable to manage or monitor their resources through the Portal.
After Extension reliability, Blade reliability is the next most critical level. A Blade load can be equated to a page loading in a website, so a Blade failing to load is a critical issue.
Parts are used throughout the portal, on both blades and dashboards. If a part fails to load, this can result in the user:
- not being able to navigate to a blade or the next blade
- not seeing the critical data they expected on the dashboard
- etc...
There are two methods to assess your reliability:
- Visit the IbizaFx provided PowerBi report*
- Run Kusto queries locally to determine your numbers
(*) To get access to the PowerBi dashboard, reference the Telemetry onboarding guide, then access the following Extension performance/reliability report
The first method is the easiest way to determine your current assessment, as the report is maintained on a regular basis by the Fx team. You can, if preferred, run queries locally, but ensure you are using the Fx provided Kusto functions to calculate your assessment.
There are a few items that the FX team advises all extensions to follow.
- Configure CDN
- Extension HomePage Caching
- Persistent Caching of scripts across extension updates
- Geo-distribution: ensure you are serving your extension as close as possible to users. The FX provides an Extension Hosting Service which handles Geo-distribution. To assess your extension's performance by data center, see the Extension performance/reliability report
- Turning on IIS compression
The setDataContext API on view model factories was designed before AMD support existed in TypeScript, and it slows down extension load by increasing the amount of code downloaded on extension initialization. This also increases the risk of extension load failures due to the increase in network activity. Switching to the setDataContextFactory method reduces the amount of code downloaded to the bare minimum, and the individual data contexts are loaded if and when required (e.g. if a blade that's opened requires it).
Old code:
this.viewModelFactories.Blades().setDataContext(new Blades.DataContext());
New code:
this.viewModelFactories.Blades().setDataContextFactory<typeof Blades>(
"./Blades/BladesArea",
(contextModule) => new contextModule.DataContext()
);
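The load-on-demand behaviour described above can be illustrated with a self-contained sketch. It does not use the Portal SDK: `LazyDataContext`, `BladesModule`, and the simulated module loader are hypothetical stand-ins, and the real setDataContextFactory may be implemented differently. The point is only that the module is fetched on first use rather than at extension initialization.

```typescript
// Hypothetical stand-ins for illustration only; these are not the Portal SDK types.
interface BladesModule {
    DataContext: new () => object;
}

// Holds a data context that is created on first use rather than at extension load.
class LazyDataContext<TModule, TContext> {
    private pending: Promise<TContext> | undefined;

    constructor(
        private readonly loadModule: () => Promise<TModule>,
        private readonly create: (contextModule: TModule) => TContext
    ) {}

    // The module is loaded only on the first call; later calls reuse the same promise.
    get(): Promise<TContext> {
        if (this.pending === undefined) {
            this.pending = this.loadModule().then(this.create);
        }
        return this.pending;
    }
}

// Counts how often the simulated module download happens.
let moduleLoads = 0;
const bladesContext = new LazyDataContext<BladesModule, object>(
    () => {
        moduleLoads++;
        return Promise.resolve({ DataContext: class {} });
    },
    (contextModule) => new contextModule.DataContext()
);
```

Nothing is loaded when `bladesContext` is constructed; the simulated module load runs once, on the first `get()` call, and later calls share the result.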
Run the following query
GetExtensionFailuresSummary(ago(1d), now())
| where extension contains "Microsoft_Azure_Compute"
Update the extension name in the query to your own extension, and increase the time range if the last 24 hours isn't sufficient. Address the highest impacting issues first, by occurrences or affected users.
The query returns a summary of all the events in which your extension failed to load.
Field name | Definition |
---|---|
extensionName | The extension the error correlates to |
errorState | The type of error that occurred |
error | The specific error that occurred |
Occurences | Number of occurrences |
AffectedUsers | Number of affected users |
AffectedSessions | Number of affected sessions |
any_sessionId | A sample of an affected session |
any_message | A sample message of what would normally be returned given errorState/error |
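To decide which failures to address first, the summary rows can be ordered by impact. The sketch below assumes the query results have been exported into plain objects (for example as JSON); the `ExtensionFailureRow` interface and `rankByImpact` are hypothetical helpers, and sorting by affected users before occurrences is one reasonable choice rather than Fx guidance.

```typescript
// One row of the failure summary, using the field names from the table above.
interface ExtensionFailureRow {
    extensionName: string;
    errorState: string;
    error: string;
    Occurences: number; // spelled as in the query output
    AffectedUsers: number;
    AffectedSessions: number;
}

// Returns a new array ordered by affected users, then by occurrences, highest first.
// The input array is left unchanged.
function rankByImpact(rows: ExtensionFailureRow[]): ExtensionFailureRow[] {
    return rows
        .slice()
        .sort((a, b) => b.AffectedUsers - a.AffectedUsers || b.Occurences - a.Occurences);
}
```

The first element of the returned array is then the errorState/error pair to investigate first, using its sample session.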
Once you have run the query, you will be shown a list of errorStates and errors. For greater detail, use the any_sessionId to investigate further.
Error State | Definition | Action items |
---|---|---|
FirstResponseNotReceived | This error state means that the shell loaded the extension URL obtained from the config into an IFrame, however there wasn't any response from the extension | |
HomePageTimedOut | The index page failed to load within the max time period | // Need steps to action on |
ManifestNotReceived | This error state means that the bootstrap logic was completed, however the extension did not return a manifest to the shell. The shell waits for a period of time and then times out. | |
InvalidExtensionName | This error state means that the name of the extension specified in the extensions JSON in config doesn't match the name of the extension in the extension manifest. | |
InvalidManifest | This error state means that the manifest that was received from an extension was invalid, i.e. it had validation errors | Scan the error logs for all the validation errors in the extension manifest. |
InvalidDefinition | This error state means that the definition that was received from an extension was invalid, i.e. it had validation errors | Scan the error logs for all the validation errors in the extension definition. |
FailedToInitialize | This error state means that the extension failed to initialize because one or more calls to methods on the extension's entry point class failed | |
TooManyRefreshes | This error state means that the extension tried to reload itself within the IFrame multiple times. The error should specify the number of times it refreshed before the extension was disabled | Scan the events table to see if there are any other relevant error messages during the time frame of the alert |
TooManyBootGets | This error state means that the extension tried to send the bootGet message to request Fx scripts multiple times. The error should specify the number of times it refreshed before the extension was disabled | Scan the events table to see if there are any other relevant error messages during the time frame of the alert |
TimedOut | This error signifies that the extension failed to load after the predefined timeout. | |
MaxRetryAttemptsExceeded | This is a collation of the above events | Inspect the sample message and follow the appropriate step above |
First, run the following query, making sure you update the extension and time range.
GetBladeFailuresSummary(ago(1d), now())
| where extension contains "Microsoft_Azure_Compute"
Field name | Definition |
---|---|
extension | The extension the error correlates to |
blade | The blade the error correlates to |
errorReason | The error reason associated with the failure |
Occurences | Number of occurrences |
AffectedUsers | Number of affected users |
AffectedSessions | Number of affected sessions |
any_sessionId | A sample of an affected session |
any_details | A sample message of what would normally be returned given extension/blade/errorReason |
Once you have that, correlate the error reasons with the below list to see the guided next steps.
Error reason | Definition | Action items |
---|---|---|
ErrorInitializing | The FX failed to initialize the blade due to an invalid definition. | |
ErrorLoadingExtension | The extension failed to load and therefore the blade was unable to load. | Refer to the guidance provided for extension reliability |
ErrorLoadingDefinition | The FX was unable to retrieve the blade definition from the Extension. | Reference a sample session in the ClientEvents Kusto table; there should be correlating events before the blade failure |
ErrorLoadingExtensionAndDefinition | The FX was unable to retrieve the blade definition from the Extension. | Reference a sample session in the ClientEvents Kusto table; there should be correlating events before the blade failure |
ErrorUnrecoverable | The FX failed to restore the blade during journey restoration because of an unexpected error. | This should not occur but if it does file a [shell bug](https://aka.ms/portalfx/shellbug). |
First, run the following query, making sure you update the extension and time range.
GetPartFailuresSummary(ago(1d), now())
| where extension contains "Microsoft_Azure_Compute"
Field name | Definition |
---|---|
extension | The extension the error correlates to |
blade | The blade the part is on; if blade === "Dashboard" then the part was loaded from a dashboard |
part | The part the error correlates to |
errorReason | The error reason associated with the failure |
Occurences | Number of occurrences |
AffectedUsers | Number of affected users |
AffectedSessions | Number of affected sessions |
any_sessionId | A sample of an affected session |
any_details | A sample message of what would normally be returned given extension/blade/part/errorReason |
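Because the blade field doubles as a dashboard marker, it can help to separate dashboard failures from blade failures before triaging. The sketch below assumes the query results have been exported into plain objects; `PartFailureRow` and `splitBySurface` are hypothetical names, and only the `blade === "Dashboard"` convention comes from the table above.

```typescript
// One row of the part failure summary, using the field names from the table above.
interface PartFailureRow {
    extension: string;
    blade: string; // "Dashboard" when the part was loaded from a dashboard
    part: string;
    errorReason: string;
    Occurences: number; // spelled as in the query output
    AffectedUsers: number;
}

// Separates failures of parts pinned to a dashboard from failures of parts hosted on blades.
function splitBySurface(rows: PartFailureRow[]): {
    dashboard: PartFailureRow[];
    blades: PartFailureRow[];
} {
    return {
        dashboard: rows.filter((row) => row.blade === "Dashboard"),
        blades: rows.filter((row) => row.blade !== "Dashboard"),
    };
}
```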
Once you have that, correlate the error reasons with the below list to see the guided next steps.
Error reason | Definition | Action items |
---|---|---|
TransitionedToErrorState | The part was unable to load and failed during its initialization or OnInputsSet | Consult the any_details column; there should be a sample message explaining explicitly what the issue was. Commonly this is a nullRef. |
ErrorLocatingPartDefinition | The FX was unable to determine the part definition. | The likely cause is that the extension has removed the part entirely from the PDL; this is not the guided pattern. See deprecating parts for the explicit guidance. __NEED LINK__ |
ErrorAcquiringViewModel | The FX was unable to retrieve the part view model from the Extension. | You can correlate the start of the sample message with one of the below for common explanations. |
ErrorLoadingControl | The FX was unable to retrieve the control module. | Reach out to the FX team if you see a large amount of these issues. |
ErrorCreatingWidget | The FX failed to create the widget. | Check the sample message; this will indicate the explicit reason why it failed, which was probably a ScriptError or a failure to load the module. |
OldInputsNotHandled | In this case a user has a pinned representation of an old version of the tile. The extension author has changed the inputs in a breaking fashion. | If this happens you need to follow the guided pattern. __NEED LINK__ |