Commit 7ada35a

leoromanovsky, dd-oleksii
authored and committed
fix(openfeature): block initialize() until RC config arrives (#16650)
## Motivation

`DataDogProvider.initialize()` returns immediately without waiting for Remote Config data. The OpenFeature SDK then emits `PROVIDER_READY` (per spec: "READY when initialize() terminates normally"), so consumers believe the provider is ready. Flag evaluations in this window silently return default values with `reason: DEFAULT`: there is no error and no indication that config hasn't loaded yet.

This was reported by a customer running a Python script (not a long-running server). On servers the bug is masked because RC config typically arrives during startup before any evaluations happen. In scripts and short-lived processes, `set_provider()` returns in 0.00s and the very next evaluation gets defaults.

Every other Datadog OpenFeature provider blocks inside `initialize()` until config arrives:

- **Java**: `CountDownLatch.await(timeout, unit)`, default 30s
- **Go**: `sync.Cond.Wait()` inside a loop, default 30s
- **Node.js**: `await initController.wait()` (deferred Promise), default 30s

Fixes: FFL-1843

## Changes

- `DataDogProvider.__init__()` now creates a `threading.Event` (`_config_received`) used to block `initialize()` until config arrives.
- `initialize()` checks if config already exists (fast path), otherwise calls `_config_received.wait(timeout)`. If the timeout expires without config, it raises `ProviderNotReadyError` (the SDK then dispatches `PROVIDER_ERROR`).
- `on_configuration_received()` calls `_config_received.set()` to unblock `initialize()` when the first RC payload arrives. If init already timed out, it emits `PROVIDER_READY` for late recovery.
- `shutdown()` clears the event for clean re-initialization.
- New env var `DD_EXPERIMENTAL_FLAGGING_PROVIDER_INITIALIZATION_TIMEOUT_MS` (default 30000) controls the timeout. Also configurable via constructor, in seconds: `DataDogProvider(initialization_timeout=10.0)`.

## Decisions

- **Blocking by default** (30s timeout) matches Java/Go/Node.js. The OpenFeature Python SDK only has `set_provider()` (no `set_provider_and_wait()` yet), and it calls `initialize()` synchronously on the caller's thread. So blocking here means `set_provider()` itself blocks, which is the correct default behavior for most users.
- **Timeout raises `ProviderNotReadyError`** rather than returning silently. This puts the provider in `ERROR` state (not premature `READY`), which is the same pattern Java and Node.js use on timeout.
- **Late recovery supported**: if config arrives after the timeout, `on_configuration_received()` emits `PROVIDER_READY` and the provider transitions from `ERROR` to `READY`.
- **No `initialization_timeout=0` async mode**: rather than adding a special non-blocking mode to the provider, async customers can wrap `set_provider()` in a background thread and listen for `PROVIDER_READY` events. A proper `set_provider_and_wait()` is being contributed upstream to the OpenFeature Python SDK ([open-feature/python-sdk#567](open-feature/python-sdk#567)).

## Testing

Verified locally using system-tests parametric tests against a patched build:

| Test | Before fix | After fix |
|---|---|---|
| `test_ffe_evaluation_immediately_after_start_without_config` | **FAILED**: `ffe_start()` returned in 0.00s, eval returned defaults | **PASSED**: blocked 30s, timed out correctly |
| `test_ffe_init_blocks_until_config_received` | PASSED | PASSED |
| `test_ffe_init_returns_real_values_not_defaults` | PASSED | PASSED |

Server log confirms blocking: `Waiting up to 30.0s for initial FFE configuration from Remote Config`

Existing FFE parametric tests: 13 passed, 0 failed (remaining errors were container resource exhaustion, not code-related).

Co-authored-by: dd-oleksii <oleksii.shmalko@datadoghq.com>
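The last decision above leaves non-blocking startup to the caller. A minimal sketch of that pattern follows, using only the standard library. `blocking_set_provider` is a hypothetical stand-in for the real `api.set_provider(provider)` call, and the `ready` event stands in for a `PROVIDER_READY` handler; neither name comes from ddtrace or the OpenFeature SDK.

```python
import threading
import time
from typing import Tuple


def blocking_set_provider(ready: threading.Event, delay: float) -> None:
    """Hypothetical stand-in for api.set_provider(provider), which now blocks."""
    time.sleep(delay)  # simulates initialize() waiting for Remote Config
    ready.set()        # simulates a PROVIDER_READY handler firing


def start_provider_in_background(delay: float) -> Tuple[threading.Thread, threading.Event]:
    """Run the blocking call on a daemon thread so the caller is not held up."""
    ready = threading.Event()
    worker = threading.Thread(
        target=blocking_set_provider, args=(ready, delay), daemon=True
    )
    worker.start()
    return worker, ready


if __name__ == "__main__":
    worker, ready = start_provider_in_background(delay=0.2)
    # The main thread is free to do other startup work here.
    if ready.wait(timeout=2.0):
        print("provider ready, flag evaluations can be trusted")
    else:
        print("still waiting, evaluations would return defaults")
```

With the real SDK, the readiness signal would presumably come from a handler registered for `ProviderEvent.PROVIDER_READY` rather than from the worker itself.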
1 parent cb5e682 commit 7ada35a

6 files changed

Lines changed: 207 additions & 28 deletions

ddtrace/internal/openfeature/_provider.py

Lines changed: 54 additions & 23 deletions
@@ -8,6 +8,7 @@
 from collections import OrderedDict
 from collections.abc import MutableMapping
 from importlib.metadata import version
+import threading
 import typing
 
 from openfeature.evaluation_context import EvaluationContext
@@ -90,11 +91,20 @@ class DataDogProvider(AbstractProvider):
     Feature Flags and Experimentation (FFE) product.
     """
 
-    def __init__(self, *args: typing.Any, **kwargs: typing.Any):
+    def __init__(self, *args: typing.Any, initialization_timeout: typing.Optional[float] = None, **kwargs: typing.Any):
         super().__init__(*args, **kwargs)
         self._metadata = Metadata(name="Datadog")
         self._status = ProviderStatus.NOT_READY
-        self._config_received = False
+
+        # Initialization timeout: constructor arg takes priority, then env var (default 30s)
+        if initialization_timeout is not None:
+            self._initialization_timeout = initialization_timeout
+        else:
+            self._initialization_timeout = ffe_config.initialization_timeout_ms / 1000.0
+
+        # Event used to block initialize() until config arrives.
+        # Also serves as the "config received" flag via is_set().
+        self._config_received = threading.Event()
 
         # Cache for reported exposures to prevent duplicates
         # Stores mapping of (flag_key, subject_id) -> (allocation_key, variant_key)
@@ -119,9 +129,6 @@ def __init__(self, *args: typing.Any, **kwargs: typing.Any):
         self._flag_eval_metrics = FlagEvalMetrics()
         self._flag_eval_hook = FlagEvalHook(self._flag_eval_metrics)
 
-        # Register this provider instance for status updates
-        _register_provider(self)
-
     def get_metadata(self) -> Metadata:
         """Returns provider metadata."""
         return self._metadata
@@ -142,32 +149,52 @@ def initialize(self, evaluation_context: EvaluationContext) -> None:
         """
         Initialize the provider.
 
-        Called by the OpenFeature SDK when the provider is set.
-        Provider Creation → NOT_READY
-
-        First Remote Config Payload
-
-        READY (emits PROVIDER_READY event)
-
-        Shutdown
-
-        NOT_READY
+        Blocks until Remote Config delivers the first FFE configuration or
+        the initialization timeout expires.
+
+        The timeout is configurable via:
+        - Constructor: DataDogProvider(initialization_timeout=10.0) # seconds
+        - Env var: DD_EXPERIMENTAL_FLAGGING_PROVIDER_INITIALIZATION_TIMEOUT_MS=10000
+
+        Provider lifecycle:
+        NOT_READY -> initialize() blocks -> config arrives -> READY
+        NOT_READY -> initialize() blocks -> timeout -> raises ProviderNotReadyError
         """
         if not self._enabled:
             return
 
+        # Register for RC config callbacks (in initialize, not __init__, so
+        # re-initialization after shutdown re-registers the provider)
+        _register_provider(self)
+
         try:
             # Start the exposure writer for reporting
             start_exposure_writer()
         except ServiceStatusError:
             logger.debug("Exposure writer is already running", exc_info=True)
 
-        # If configuration was already received before initialization, emit ready now
+        # Fast path: config already available (RC delivered before set_provider)
         config = _get_ffe_config()
-        if config is not None and not self._config_received:
-            self._config_received = True
+        if config is not None:
+            logger.debug("FFE configuration already available, provider is READY")
+            self._config_received.set()
             self._status = ProviderStatus.READY
-            self._emit_ready_event()
+            return  # SDK will dispatch PROVIDER_READY
+
+        # Block until config arrives or timeout expires
+        logger.debug(
+            "Waiting up to %.1fs for initial FFE configuration from Remote Config", self._initialization_timeout
+        )
+        if not self._config_received.wait(timeout=self._initialization_timeout):
+            # Timeout expired without receiving config
+            from openfeature.exception import ProviderNotReadyError
+
+            raise ProviderNotReadyError(
+                f"Provider timed out after {self._initialization_timeout:.1f}s waiting for "
+                "initial configuration from Remote Config"
+            )
+
+        # Config received during wait -- on_configuration_received() already set status
 
     def shutdown(self) -> None:
         """
@@ -196,7 +223,7 @@ def shutdown(self) -> None:
         # Unregister provider
         _unregister_provider(self)
         self._status = ProviderStatus.NOT_READY
-        self._config_received = False
+        self._config_received.clear()
 
     def resolve_boolean_details(
         self,
@@ -463,14 +490,18 @@ def on_configuration_received(self) -> None:
         """
         Called when a Remote Configuration payload is received and processed.
 
-        Emits PROVIDER_READY event on first configuration.
+        Updates status first, then signals the event to unblock initialize().
+        Emits PROVIDER_READY for late arrivals (config received after initialize() timed out).
         """
-        if not self._config_received:
-            self._config_received = True
+        if not self._config_received.is_set():
             self._status = ProviderStatus.READY
             logger.debug("First FFE configuration received, provider is now READY")
+            # Emit READY for late recovery: config arrived after init timed out
             self._emit_ready_event()
 
+        # Signal the event last to unblock initialize() after status is updated
+        self._config_received.set()
+
     def _emit_ready_event(self) -> None:
         """
         Safely emit PROVIDER_READY event.
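The control flow in this file can be reproduced in isolation with the standard library. The following is a minimal sketch rather than the ddtrace implementation: `SketchProvider`, `NotReadyError`, the string statuses, and the `events` list are hypothetical stand-ins for `DataDogProvider`, `ProviderNotReadyError`, `ProviderStatus`, and the SDK's event dispatch.

```python
import threading
from typing import List


class NotReadyError(Exception):
    """Hypothetical stand-in for openfeature.exception.ProviderNotReadyError."""


class SketchProvider:
    """Illustrative only: mirrors the Event-based blocking flow of the diff."""

    def __init__(self, initialization_timeout: float = 30.0) -> None:
        self._timeout = initialization_timeout
        # The Event doubles as the "config received" flag via is_set().
        self._config_received = threading.Event()
        self.status = "NOT_READY"
        self.events: List[str] = []

    def initialize(self) -> None:
        # Event.wait() returns False when the timeout expires first.
        if not self._config_received.wait(timeout=self._timeout):
            # Assumption for the sketch: the status is set here. In the real
            # flow the SDK records ERROR after catching the exception.
            self.status = "ERROR"
            raise NotReadyError(f"timed out after {self._timeout:.1f}s")

    def on_configuration_received(self) -> None:
        if not self._config_received.is_set():
            self.status = "READY"
            self.events.append("PROVIDER_READY")  # covers late recovery
        # Set the event last so a blocked initialize() sees the new status.
        self._config_received.set()

    def shutdown(self) -> None:
        self.status = "NOT_READY"
        self._config_received.clear()


if __name__ == "__main__":
    provider = SketchProvider(initialization_timeout=0.1)
    try:
        provider.initialize()
    except NotReadyError as exc:
        print("init failed:", exc)
    provider.on_configuration_received()  # late config arrives
    print(provider.status)
```

Setting the event after updating the status matters: a thread released from `wait()` can then read the status without racing the writer.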

ddtrace/internal/settings/openfeature.py

Lines changed: 10 additions & 0 deletions
@@ -30,10 +30,20 @@ class OpenFeatureConfig(DDConfig):
         default=1.0,
     )
 
+    # Provider initialization timeout in milliseconds.
+    # Controls how long initialize() blocks waiting for the first Remote Config payload.
+    # Default is 30000ms (30 seconds), matching Java, Go, and Node.js SDKs.
+    initialization_timeout_ms = DDConfig.var(
+        int,
+        "DD_EXPERIMENTAL_FLAGGING_PROVIDER_INITIALIZATION_TIMEOUT_MS",
+        default=30000,
+    )
+
     _openfeature_config_keys = [
         "experimental_flagging_provider_enabled",
         "ffe_intake_enabled",
         "ffe_intake_heartbeat_interval",
+        "initialization_timeout_ms",
     ]
 
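The setting is stored in milliseconds while the constructor argument is in seconds, so the precedence and unit conversion are worth spelling out. The sketch below assumes a plain mapping in place of ddtrace's `DDConfig` machinery; `resolve_timeout_seconds` is a hypothetical helper, not a ddtrace function.

```python
from typing import Mapping, Optional

ENV_VAR = "DD_EXPERIMENTAL_FLAGGING_PROVIDER_INITIALIZATION_TIMEOUT_MS"
DEFAULT_TIMEOUT_MS = 30000


def resolve_timeout_seconds(
    explicit: Optional[float], environ: Mapping[str, str]
) -> float:
    """Constructor value (seconds) wins; otherwise env var (ms) or the default."""
    if explicit is not None:
        # Compare against None so an explicit 0.0 is still honoured.
        return explicit
    return int(environ.get(ENV_VAR, DEFAULT_TIMEOUT_MS)) / 1000.0


if __name__ == "__main__":
    import os

    print(resolve_timeout_seconds(None, os.environ))
    print(resolve_timeout_seconds(10.0, os.environ))
```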

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+---
+fixes:
+  - |
+    openfeature: This fix resolves an issue where ``DataDogProvider.initialize()`` returned before
+    configuration was received, causing the OpenFeature SDK to mark the provider as ready to serve
+    evaluations too early and flag evaluations to silently return default values. The provider now
+    waits for configuration before returning.
+features:
+  - |
+    openfeature: This introduces a configurable initialization timeout for ``DataDogProvider``.
+    The timeout controls how long ``initialize()`` waits for configuration before returning,
+    and defaults to 30 seconds. Set it via the
+    ``DD_EXPERIMENTAL_FLAGGING_PROVIDER_INITIALIZATION_TIMEOUT_MS`` environment variable or the
+    ``initialization_timeout`` constructor parameter.

tests/openfeature/conftest.py

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+"""
+Shared fixtures for openfeature tests.
+"""
+
+import pytest
+
+from tests.utils import override_global_config
+
+
+@pytest.fixture(autouse=True)
+def fast_initialization_timeout():
+    """
+    Override the provider initialization timeout to 100ms for all tests.
+
+    The production default is 30s (matching other SDKs), but that makes any
+    test that calls api.set_provider() without pre-loaded config hang for 30s.
+    Tests that need to verify timeout/blocking behaviour set their own explicit
+    initialization_timeout= on DataDogProvider() directly, which takes priority
+    over the config value.
+    """
+    with override_global_config({"initialization_timeout_ms": 100}):
+        yield

tests/openfeature/test_provider.py

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,9 @@ class TestProviderInitializationShutdown:
 
     def test_provider_initialization(self, provider, evaluation_context):
         """Provider should initialize without errors."""
+        # Pre-load config so initialize() takes the fast path (no blocking wait)
+        config = create_config(create_boolean_flag("test-flag", enabled=True, default_value=True))
+        process_ffe_configuration(config)
         # Should not raise
         provider.initialize(evaluation_context)

tests/openfeature/test_provider_status.py

Lines changed: 104 additions & 5 deletions
@@ -5,8 +5,12 @@
 - NOT_READY by default
 - READY when first Remote Config payload is received
 - Event emission on status change
+- Blocking initialization until config arrives or timeout
 """
 
+import threading
+import time
+
 from openfeature import api
 from openfeature.provider import ProviderStatus
 import pytest
@@ -43,7 +47,7 @@ def test_provider_starts_not_ready(self):
             provider = DataDogProvider()
 
             assert provider._status == ProviderStatus.NOT_READY
-            assert provider._config_received is False
+            assert not provider._config_received.is_set()
 
     def test_provider_becomes_ready_after_first_config(self):
         """Test that provider becomes READY after receiving first configuration."""
@@ -61,7 +65,7 @@ def test_provider_becomes_ready_after_first_config(self):
 
                 # Verify becomes READY
                 assert provider._status == ProviderStatus.READY
-                assert provider._config_received is True
+                assert provider._config_received.is_set()
             finally:
                 api.clear_providers()
 
@@ -73,14 +77,14 @@ def test_provider_ready_event_emitted(self):
 
             try:
                 # Provider should not have received config yet
-                assert not provider._config_received
+                assert not provider._config_received.is_set()
 
                 # Process a configuration
                 config = create_config(create_boolean_flag("test-flag", enabled=True))
                 process_ffe_configuration(config)
 
                 # Provider should now have received config and be READY
-                assert provider._config_received
+                assert provider._config_received.is_set()
                 assert provider._status == ProviderStatus.READY
             finally:
                 api.clear_providers()
@@ -140,7 +144,7 @@ def test_provider_status_after_shutdown(self):
 
                 # Verify back to NOT_READY
                 assert provider._status == ProviderStatus.NOT_READY
-                assert provider._config_received is False
+                assert not provider._config_received.is_set()
             finally:
                 api.clear_providers()
 
@@ -194,3 +198,98 @@ def on_provider_ready(event_details):
             finally:
                 api.remove_handler(ProviderEvent.PROVIDER_READY, on_provider_ready)
                 api.clear_providers()
+
+
+class TestProviderInitializationBlocking:
+    """Test that initialize() blocks until config arrives or timeout expires."""
+
+    def test_initialize_blocks_until_config_arrives(self):
+        """initialize() should block and return once config is delivered mid-wait."""
+        with override_global_config({"experimental_flagging_provider_enabled": True}):
+            provider = DataDogProvider(initialization_timeout=5.0)
+
+            # Deliver config from a background thread after 0.5s
+            def deliver_config():
+                time.sleep(0.5)
+                config = create_config(create_boolean_flag("test-flag", enabled=True))
+                process_ffe_configuration(config)
+
+            timer = threading.Thread(target=deliver_config, daemon=True)
+            timer.start()
+
+            try:
+                start = time.monotonic()
+                api.set_provider(provider)
+                elapsed = time.monotonic() - start
+
+                # Should have blocked for ~0.5s (not instant, not full timeout)
+                assert elapsed >= 0.3, f"initialize() returned too fast ({elapsed:.2f}s)"
+                assert elapsed < 4.0, f"initialize() took too long ({elapsed:.2f}s), should have unblocked at ~0.5s"
+                assert provider._status == ProviderStatus.READY
+                assert provider._config_received.is_set()
+            finally:
+                api.clear_providers()
+
+    def test_initialize_fast_path_when_config_exists(self):
+        """initialize() should return immediately if config already exists."""
+        with override_global_config({"experimental_flagging_provider_enabled": True}):
+            # Deliver config BEFORE creating provider
+            config = create_config(create_boolean_flag("test-flag", enabled=True))
+            process_ffe_configuration(config)
+
+            provider = DataDogProvider(initialization_timeout=5.0)
+
+            try:
+                start = time.monotonic()
+                api.set_provider(provider)
+                elapsed = time.monotonic() - start
+
+                # Should be near-instant (config already available)
+                assert elapsed < 1.0, f"initialize() took {elapsed:.2f}s, should be instant with pre-loaded config"
+                assert provider._status == ProviderStatus.READY
+            finally:
+                api.clear_providers()
+
+    def test_initialize_timeout_raises(self):
+        """initialize() should raise ProviderNotReadyError after timeout expires."""
+        with override_global_config({"experimental_flagging_provider_enabled": True}):
+            provider = DataDogProvider(initialization_timeout=0.5)
+
+            try:
+                start = time.monotonic()
+                # set_provider catches the exception and dispatches PROVIDER_ERROR
+                api.set_provider(provider)
+                elapsed = time.monotonic() - start
+
+                # Should have blocked for ~0.5s (the timeout)
+                assert elapsed >= 0.3, f"initialize() returned too fast ({elapsed:.2f}s)"
+                assert elapsed < 2.0, f"initialize() took too long ({elapsed:.2f}s)"
+
+                # Provider should be in ERROR state (SDK caught ProviderNotReadyError)
+                client = api.get_client()
+                assert client.get_provider_status() == ProviderStatus.ERROR
+            finally:
+                api.clear_providers()
+
+    def test_late_recovery_after_timeout(self):
+        """Config arriving after timeout should transition provider to READY."""
+        with override_global_config({"experimental_flagging_provider_enabled": True}):
+            provider = DataDogProvider(initialization_timeout=0.5)
+
+            try:
+                # Let it timeout
+                api.set_provider(provider)
+
+                # Provider should be in ERROR state
+                client = api.get_client()
+                assert client.get_provider_status() == ProviderStatus.ERROR
+
+                # Now deliver config (late recovery)
+                config = create_config(create_boolean_flag("test-flag", enabled=True))
+                process_ffe_configuration(config)
+
+                # Provider should recover to READY
+                assert provider._status == ProviderStatus.READY
+                assert provider._config_received.is_set()
+            finally:
+                api.clear_providers()
