You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Adds /runtime/rule/{addOrUpdate,inactivate,delete,list,bundled,dump,get} on a
new admin port (default 17128, disabled by default) for cluster-wide MAL/LAL
rule management — push, soft-pause, drop, list, dump, single-rule fetch. The
endpoint converges through a single elected main per (catalog, name) with
Suspend/Resume RPCs broadcast to peers across the OAP cluster bus. Rules are
stored in the management storage layer (BanyanDB / Elasticsearch / JDBC) so
hot-updates survive OAP restart — at boot the merged view of bundled YAML and
persisted runtime rows takes effect without a regression to defaults.
The endpoint has no built-in authentication: operators must gateway-protect
it with IP allow-lists and never expose it to the public internet.
Engine model
------------
Three layers, with one boundary between each:
scheduler (DSLManager + REST handler) — DSL-agnostic. Lock acquisition,
cluster Suspend/Resume RPC fan-out, persistence, classloader graveyard,
cross-file ownership enforcement, tick scheduling, self-heal.
orchestrators (DSLRuntimeApply, DSLRuntimeUnregister, DSLRuntimeDelete) —
drive the per-DSL phase pipeline through the engine SPI.
engines (MalRuleEngine, LalRuleEngine) — DSL-specific. Implement compile /
verify / commit / rollback / unregister / dropBackend / reloadStatic
against a per-engine ApplyContext; classify, claimedKeys, storageImpactKeys
drive cross-DSL routing and the storage-change guardrail.
DSLClassLoaderManager (server-core)
-----------------------------------
Process-wide singleton that owns every per-file RuleClassLoader. Boot-time
bundled rules continue to load into the OAP main classloader (shared); per-
file `static:` loaders only mint after a runtime override is removed and the
bundled YAML must serve again. Loader name format
`<kind>:<catalog>/<rule>@<MMdd-HHmmss>` is observable in stack traces and the
graveyard's INFO/WARN log lines. The manager runs an internal daemon sweeper
that observes phantom-reference collection and warns on retired loaders that
stay alive past the configured threshold (the leak signal).
API surface:
newBuilder(catalog, rule, kind, hash) — mints a loader for compile; not yet
registered as active.
commit(loader) — promotes to active, returns the
displaced prior so the caller decides
whether to retire.
retire(loader) — graveyard a specific loader for GC
observability.
dropRuntime(catalog, rule) — drops + retires the active loader.
active(catalog, rule), activeCount(), pendingCount() — diagnostics.
The split between newBuilder and commit means a failed compile leaves the
live loader untouched: the failed loader is just garbage-collected, the
manager's active map still points at the previously-serving one.
/inactivate semantics (Design A)
--------------------------------
/inactivate stamps localState=NOT_LOADED. Bundled rules do NOT auto-
resurrect on /inactivate even when a bundled twin exists on disk — the
operator's "off" intent is preserved across reboots. To bring bundled back,
the operator runs /delete (drops the row, gone-keys reconcile reloads
bundled) or /addOrUpdate (with bundled YAML or their own).
/delete semantics
-----------------
Default mode: removes the runtime row.
No bundled twin — destructive cascade fires (BanyanDB measure / ES
index / JDBC table dropped). Rule fully gone.
Bundled twin exists — non-destructive: backend resources runtime claimed
that bundled doesn't (or claims at different shape)
are dropped; bundled-shared at matching shape is
preserved. The runtime row is removed; bundled is
reinstalled into a `static:` loader synchronously on
the local node. Peers converge via gone-keys reconcile
on their next tick.
?mode=revertToBundled is an explicit operator hint that requires a bundled
twin (returns 400 no_bundled_twin when none exists) — useful for scripts
that want to fail loudly on assumption mismatch.
REST surface
------------
* /list returns a single JSON envelope {generatedAt, loaderStats, rules}
(was NDJSON). Each row carries status, localState, suspendOrigin, loaderGc,
loaderKind (RUNTIME/STATIC/NONE), loaderName, contentHash, bundled,
bundledContentHash, updateTime, lastApplyError. status=BUNDLED replaces
the prior STATIC for bundled-only rules.
* /get accepts ?source=bundled to read bundled YAML even when a runtime row
exists — closes the "compare runtime to bundled" gap for editor flows.
* All JSON build sites use Gson.
Catalog enum at REST boundary
-----------------------------
The catalog query parameter parses to org.apache.skywalking.oap.server.core
.classloader.Catalog at the REST boundary. RuntimeRuleService's public
methods (addOrUpdate / inactivate / delete / get / listBundled / dumpCatalog)
take Catalog; unknown wire values return 400 invalid_catalog uniformly.
Internal helpers convert via getWireName() at DAO / cluster-RPC edges.
DeleteMode enum
---------------
?mode= parses to DeleteMode at the REST boundary. The string never leaves
the handler.
ForwardTarget interface removed
-------------------------------
Single production implementation (RuntimeRuleRestHandler) and zero test
fakes. The cluster gRPC service now references RuntimeRuleService directly;
the REST handler is left as a pure transport adapter (route bindings + parameter
parsing). The Result POJO becomes RuntimeRuleService.ForwardResult.
Cluster
-------
Suspend / Resume / Forward RPCs over the cluster gRPC bus. Single-main
routing (deterministic mainFor(catalog, name)) with REST forward-to-main.
Self-heal sweeps SUSPENDED bundles whose main crashed mid-apply (60 s
default).
Storage
-------
RuntimeRuleManagementDAO — per-backend upsert / read / delete on the rule
rows. Implementations for BanyanDB / ES / JDBC. /inactivate runs under
StorageManipulationOpt.localCacheOnly so the backend measure and history
stay; /delete fires the destructive cascade unless a bundled twin makes
delta-drop the right path.
Per-rule lifecycle docs
-----------------------
docs/en/setup/backend/backend-runtime-rule-api.md walks the full operator
surface: routes, applyStatus matrix, per-row status decoding (status x
loaderKind x bundled), reading bundled-vs-runtime YAML, consistency model.
E2E
---
test/e2e-v2/cases/runtime-rule/ covers the lifecycle on BanyanDB / ES /
JDBC, plus a 2-node cluster scenario (Suspend/Resume + main routing) and
the LAL pipeline. Verified end-to-end on BanyanDB locally:
CREATE → UPDATE-FILTER → UPDATE-STRUCTURAL → DUMP → 4× illegal →
SHAPE-BREAK → INACTIVATE → ACTIVATE → DELETE → DUMP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- No one-line delegate methods. A wrapper whose only body is a single forwarding call
101
+
to another class (`return Other.foo(a, b);`) adds a hop without value. Inline the
102
+
call at the use site, or call the underlying object directly (including via method
103
+
reference: `obj::foo` instead of `this::wrapperOfFoo`).
93
104
94
105
**Required patterns:**
95
106
-`@Override` annotation required for overridden methods
@@ -105,6 +116,13 @@ public class XxxModuleProvider extends ModuleProvider {
105
116
- Package names: `org.apache.skywalking.*` or `test.apache.skywalking.*`
106
117
- Type names: `PascalCase` or `UPPER_CASE_WITH_UNDERSCORES`
107
118
- Local variables/parameters/members: `camelCase`
119
+
-**Function-oriented naming, not abstract metaphor**: classes and methods are named for
120
+
what they do, not for an abstract concept. Prefer concrete verbs (`load`, `apply`,
121
+
`unregister`, `compile`, `verify`, `commit`, `rollback`) over metaphorical ones
122
+
(`seed`, `hydrate`, `bootstrap`, `prime`). Class names follow the same rule —
123
+
`StaticRuleLoader` (loads static rules), not `StaticBundleSeeder`; `DSLSyncTimer` (syncs
124
+
DB → state on a timer), not `TickRunner`. If you can't name a method without reaching
125
+
for a metaphor, the method is probably doing too much; split it.
108
126
109
127
**File limits:**
110
128
- Max file length: 3000 lines
@@ -257,6 +275,10 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all
257
275
10.**Relative paths in docs are valid**: Relative file paths (e.g., `../../../oap-server/...`) in documentation work both in the repo and on the documentation website, supported by website build tooling
258
276
11.**Module service registration**: When adding a service to `CoreModule.services()`, update ALL `CoreModuleProvider` implementations — not just the main one. Search with `grep -rn "extends CoreModuleProvider" oap-server/ --include="*.java"`. The `MockCoreModuleProvider` in `server-tools/profile-exporter/` also needs it, or the profile exporter e2e test will fail at startup.
259
277
12.**Multiple OAP packagings**: The OAP server is not only the main `server-starter`. The `server-tools/` directory contains standalone tools (e.g., profile exporter) that boot with mock module providers and a subset of modules. Changes to core module contracts (services, required modules) must be reflected in these tools too.
278
+
13.**`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup).
279
+
14.**Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly.
280
+
15.**No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines.
281
+
16.**BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.
0 commit comments