[Proposal] Add token acquisition metrics that cover both successful and failed t…#5928
Conversation
…oken acquisition outcomes
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an opt-in “extended token acquisition metrics” mode to improve OpenTelemetry coverage across both successful and failed token acquisitions, and closes a gap by capturing IDP network latency for background refresh successes.
Changes:
- Introduces V2 OTel histograms (total duration + HTTP duration) tagged with
Succeeded/HttpStatusCode, gated by a new config flag. - Adds a public builder opt-in
WithExtendedTokenAcquisitionMetrics()and threads the flag through OTel instrumentation APIs. - Logs HTTP duration for proactive background refresh successes.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/client/Microsoft.Identity.Client/TelemetryCore/TelemetryConstants.cs | Adds a new telemetry tag constant for HTTP status code. |
| src/client/Microsoft.Identity.Client/TelemetryCore/OpenTelemetry/NullOtelInstrumentation.cs | Updates no-op instrumentation to match new extended-metrics API surface. |
| src/client/Microsoft.Identity.Client/TelemetryCore/OpenTelemetry/IOtelInstrumentation.cs | Extends the instrumentation interface for V2 metrics + HTTP duration logging and failure context. |
| src/client/Microsoft.Identity.Client/Platforms/Features/OpenTelemetry/OtelInstrumentation.cs | Implements V2 histograms, extended/fallback emission logic, and updated failure metric recording. |
| src/client/Microsoft.Identity.Client/Internal/Requests/SilentRequestHelper.cs | Emits success HTTP duration for proactive background refresh; updates failure logging signature. |
| src/client/Microsoft.Identity.Client/Internal/Requests/RequestBase.cs | Captures total duration for failures and passes HTTP status code into failure telemetry. |
| src/client/Microsoft.Identity.Client/AppConfig/BaseAbstractApplicationBuilder.cs | Adds the public opt-in builder method for extended token acquisition metrics. |
| src/client/Microsoft.Identity.Client/AppConfig/ApplicationConfiguration.cs | Adds configuration toggle IsExtendedTokenAcquisitionMetricsEnabled. |
| src/client/Microsoft.Identity.Client/PublicApi/*/PublicAPI.Unshipped.txt | Declares the new public builder API across TFMs. |
| int httpStatusCode = ex is MsalServiceException serviceEx ? serviceEx.StatusCode : 0; | ||
| LogFailureTelemetryToOtel(apiEvent, apiEvent.ApiErrorCode, httpStatusCode, requestStopwatch.ElapsedMilliseconds + measureTelemetryDurationResult.Milliseconds); |
There was a problem hiding this comment.
measureTelemetryDurationResult.Milliseconds is the millisecond component (0–999) of the TimeSpan, not the full duration. This will under-report total duration for telemetry whenever measureTelemetryDurationResult is >= 1 second. Use measureTelemetryDurationResult.TotalMilliseconds (cast/rounded to long as needed) instead of .Milliseconds when computing totalDurationInMs.
| new(TelemetryConstants.Platform, platform), | ||
| new(TelemetryConstants.ErrorCode, errorCode), | ||
| new(TelemetryConstants.ApiId, apiEvent.ApiId), | ||
| new(TelemetryConstants.CallerSdkId, apiEvent.CallerSdkApiId ?? string.Empty + "," + apiEvent.CallerSdkVersion ?? string.Empty), |
There was a problem hiding this comment.
The CallerSdkId value composition is incorrect due to ?? / + precedence and will also drop the version when CallerSdkApiId is non-null. Build the combined value explicitly (e.g., via interpolation or parentheses) so it always includes both parts with empty-string defaults.
| new(TelemetryConstants.CallerSdkId, apiEvent.CallerSdkApiId ?? string.Empty + "," + apiEvent.CallerSdkVersion ?? string.Empty), | |
| new(TelemetryConstants.CallerSdkId, $"{apiEvent.CallerSdkApiId ?? string.Empty},{apiEvent.CallerSdkVersion ?? string.Empty}"), |
| s_durationInExtensionInMs.Value.Record(authResultMetadata.DurationCreatingExtendedTokenInUs, | ||
| new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()), | ||
| new(TelemetryConstants.Platform, platform), | ||
| new(TelemetryConstants.ApiId, apiId), | ||
| new(TelemetryConstants.TokenSource, authResultMetadata.TokenSource), | ||
| new(TelemetryConstants.CacheLevel, authResultMetadata.CacheLevel), | ||
| new(TelemetryConstants.TokenType, authResultMetadata.TelemetryTokenType)); |
There was a problem hiding this comment.
s_durationInExtensionInMs is defined as an “InMs” histogram (unit: "ms"), but the recorded value is DurationCreatingExtendedTokenInUs (microseconds). Convert to milliseconds before recording (or record into a microseconds-based histogram) to avoid emitting incorrect latency values.
| serviceBundle.PlatformProxy.OtelInstrumentation.LogFailureMetrics( | ||
| serviceBundle.PlatformProxy.GetProductName(), | ||
| apiEvent, | ||
| ex.ErrorCode, | ||
| apiEvent.ApiId, | ||
| callerSdkId, | ||
| callerSdkVersion, | ||
| CacheRefreshReason.ProactivelyRefreshed, | ||
| apiEvent.TokenType); | ||
| ex.StatusCode, | ||
| totalDurationInMs: 0, | ||
| serviceBundle.Config.IsExtendedTokenAcquisitionMetricsEnabled); |
There was a problem hiding this comment.
In extended-metrics mode, OtelInstrumentation.LogFailureMetrics only records the V2 total-duration histogram when totalDurationInMs > 0. Passing totalDurationInMs: 0 here means background refresh failures won’t contribute to the new “cover failures” total-duration metric. If the goal is to cover failed outcomes for proactive refresh too, capture and pass a meaningful elapsed duration in this flow (or adjust the recording condition to allow 0 only if that’s an intentional signal).
| if (!isExtendedMetricsEnabled) | ||
| { | ||
| if (s_durationInHttp.Value.Enabled) | ||
| { | ||
| s_durationInHttp.Value.Record(authResultMetadata.DurationInHttpInMs, | ||
| new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()), | ||
| new(TelemetryConstants.Platform, platform), | ||
| new(TelemetryConstants.ApiId, apiId), | ||
| new(TelemetryConstants.TokenType, authResultMetadata.TelemetryTokenType)); | ||
| } | ||
| } | ||
| else | ||
| { | ||
| if (s_durationInHttpV2.Value.Enabled) | ||
| { | ||
| s_durationInHttpV2.Value.Record(authResultMetadata.DurationInHttpInMs, | ||
| new(TelemetryConstants.MsalVersion, MsalIdHelper.GetMsalVersion()), | ||
| new(TelemetryConstants.Platform, platform), | ||
| new(TelemetryConstants.ApiId, apiId), | ||
| new(TelemetryConstants.TokenType, authResultMetadata.TelemetryTokenType), | ||
| new(TelemetryConstants.HttpStatusCode, 200)); | ||
| } | ||
| } |
There was a problem hiding this comment.
The extended/non-extended branching and tag sets for HTTP duration are now duplicated between LogSuccessMetrics and LogSuccessHttpDuration. Consider centralizing the “record HTTP duration” logic into a single private helper to reduce the risk of the two paths drifting (e.g., tag changes, enablement checks, or future histogram revisions).
| if (isBrokerConfigured && !isAccountSourceDeviceCodeFlow) | ||
| { | ||
| _logger.Info("Broker is configured and enabled, attempting to use broker instead."); | ||
| AuthenticationRequestParameters.RequestContext.ApiEvent.TokenSource = TokenSource.Broker; | ||
| var brokerResult = await _brokerStrategyLazy.Value.ExecuteAsync(cancellationToken).ConfigureAwait(false); | ||
|
|
||
| // fallback to local cache if broker fails |
There was a problem hiding this comment.
ApiEvent.TokenSource is set to Broker before attempting the broker flow, but it is never reset when the broker returns null and the code falls back to the local cache. This can mis-tag failure telemetry (and any other ApiEvent-based telemetry) as coming from the broker even when the cache path was used. Set TokenSource back to Cache before _clientStrategy.ExecuteAsync(...) or right before the "Attempting to acquire token using local cache" path.
| case "MsalTotalDuration.2": | ||
| Trace.WriteLine("Verify the metrics captured for MsalTotalDuration.2 histogram."); | ||
| Assert.AreEqual(MetricType.Histogram, exportedItem.MetricType); | ||
|
|
||
| expectedTags.Add(TelemetryConstants.MsalVersion); | ||
| expectedTags.Add(TelemetryConstants.Platform); | ||
| expectedTags.Add(TelemetryConstants.ApiId); | ||
| expectedTags.Add(TelemetryConstants.CacheRefreshReason); | ||
| expectedTags.Add(TelemetryConstants.TokenType); | ||
| expectedTags.Add(TelemetryConstants.Succeeded); | ||
|
|
||
| foreach (var metricPoint in exportedItem.GetMetricPoints()) | ||
| { | ||
| AssertTags(metricPoint.Tags, expectedTags); | ||
| } |
There was a problem hiding this comment.
VerifyMetrics expects MsalTotalDuration.2 to have only a small set of tags, but the implementation currently records additional tags (e.g., TokenSource/CacheLevel/ErrorCode). With the current AssertTags (exact tag count), this case will fail as soon as a test starts validating MsalTotalDuration.2. Align the expected tag list here with the actual emitted tags (or adjust the production tag set to match the intended contract).
| new(TelemetryConstants.Platform, platform), | ||
| new(TelemetryConstants.ErrorCode, errorCode), | ||
| new(TelemetryConstants.ApiId, apiEvent.ApiId), | ||
| new(TelemetryConstants.CallerSdkId, apiEvent.CallerSdkApiId ?? string.Empty + "," + apiEvent.CallerSdkVersion ?? string.Empty), |
There was a problem hiding this comment.
The CallerSdkId tag value construction uses ?? and string concatenation without parentheses, so it won’t reliably include both caller SDK id + version and can yield unexpected values (e.g., "," when id is null). Use an explicit format (e.g., interpolate both null-coalesced pieces) to avoid operator-precedence bugs and keep the tag value consistent.
| new(TelemetryConstants.CallerSdkId, apiEvent.CallerSdkApiId ?? string.Empty + "," + apiEvent.CallerSdkVersion ?? string.Empty), | |
| new(TelemetryConstants.CallerSdkId, $"{apiEvent.CallerSdkApiId ?? string.Empty},{apiEvent.CallerSdkVersion ?? string.Empty}"), |
| new(TelemetryConstants.Platform, platform), | ||
| new(TelemetryConstants.ApiId, apiEvent.ApiId), | ||
| new(TelemetryConstants.TokenSource, apiEvent.TokenSource), | ||
| new(TelemetryConstants.CacheLevel, string.Empty), |
There was a problem hiding this comment.
MsalTotalDuration.2 records the CacheLevel tag as string.Empty on failures. This makes the attribute type inconsistent with the success path (which uses the CacheLevel enum) and can split/invalidly aggregate metric streams in OTel exporters. Record CacheLevel.None (or another CacheLevel enum value) instead of a string.
| new(TelemetryConstants.CacheLevel, string.Empty), | |
| new(TelemetryConstants.CacheLevel, CacheLevel.None), |
| new(TelemetryConstants.TokenSource, authResultMetadata.TokenSource), | ||
| new(TelemetryConstants.CacheLevel, cacheLevel), | ||
| new(TelemetryConstants.CacheRefreshReason, authResultMetadata.CacheRefreshReason), | ||
| new(TelemetryConstants.TokenType, authResultMetadata.TelemetryTokenType), | ||
| new(TelemetryConstants.ErrorCode, string.Empty), |
There was a problem hiding this comment.
The extended MsalTotalDuration.2 histogram is documented/tested as being tagged mainly with Succeeded/CacheRefreshReason/TokenType, but the implementation also adds TokenSource, CacheLevel, and ErrorCode. If this is intentional, update the public XML docs/tests; if not, consider removing extra dimensions (especially ErrorCode) to avoid high-cardinality histograms and reduce telemetry cost.
| new(TelemetryConstants.TokenSource, authResultMetadata.TokenSource), | |
| new(TelemetryConstants.CacheLevel, cacheLevel), | |
| new(TelemetryConstants.CacheRefreshReason, authResultMetadata.CacheRefreshReason), | |
| new(TelemetryConstants.TokenType, authResultMetadata.TelemetryTokenType), | |
| new(TelemetryConstants.ErrorCode, string.Empty), | |
| new(TelemetryConstants.CacheRefreshReason, authResultMetadata.CacheRefreshReason), | |
| new(TelemetryConstants.TokenType, authResultMetadata.TelemetryTokenType), |
| } | ||
| catch | ||
| { | ||
| LastRequestDurationInMs = sw.ElapsedMilliseconds; |
There was a problem hiding this comment.
finally block might be a good place to place the duration here that will cover all the cases. Here and in the RequestBase.cs as well
There was a problem hiding this comment.
I'd keep it scoped to the network‑bound portion — putting it in finally would also include response body read.
| serviceBundle.PlatformProxy.OtelInstrumentation.LogSuccessHttpDuration( | ||
| serviceBundle.PlatformProxy.GetProductName(), | ||
| apiEvent.ApiId, | ||
| authResult.AuthenticationResultMetadata); |
| if (requestContext.ApiEvent != null) | ||
| { | ||
| requestContext.ApiEvent.DurationInHttpInMs += _httpManager.LastRequestDurationInMs; |
|
|
||
| if (_isExtendedMetricsEnabled) | ||
| { | ||
| if (totalDurationInMs > 0 && s_durationTotalV2.Value.Enabled) |
| new(TelemetryConstants.Platform, platform), | ||
| new(TelemetryConstants.ApiId, apiEvent.ApiId), | ||
| new(TelemetryConstants.TokenSource, apiEvent.TokenSource), | ||
| new(TelemetryConstants.CacheLevel, string.Empty), |
| CollectionAssert.Contains( | ||
| _exportedMetrics.Select(m => m.Name).ToList(), | ||
| "MsalDurationInHttp.1A", | ||
| "Background refresh HTTP duration must be recorded in MsalDurationInHttp.1A"); |
| expectedTags.Add(TelemetryConstants.ApiId); | ||
| expectedTags.Add(TelemetryConstants.CacheRefreshReason); | ||
| expectedTags.Add(TelemetryConstants.TokenType); | ||
| expectedTags.Add(TelemetryConstants.Succeeded); |
| } | ||
|
|
||
| bool isTimeout = string.Equals(errorCode, MsalError.RequestTimeout, StringComparison.OrdinalIgnoreCase); | ||
| if ((httpStatusCode > 0 || (isTimeout && apiEvent.DurationInHttpInMs > 0)) && s_durationInHttpV2.Value.Enabled) |
| // 2. Acquire token — failure | ||
| _harness.HttpManager.AddTokenResponse(TokenResponseType.InvalidClient); | ||
| await AssertException.TaskThrowsAsync<MsalServiceException>( | ||
| () => cca.AcquireTokenForClient(TestConstants.s_scopeForAnotherResource) | ||
| .WithExtraQueryParameters(extraQueryParams) | ||
| .WithTenantId(TestConstants.Utid) | ||
| .ExecuteAsync()).ConfigureAwait(false); |
Add token acquisition metrics that cover both successful and failed token acquisition outcomes and close an observability gap by recording IDP network latency for background refresh.