Add a runtime-rule receiver that exposes a REST surface on port 17128 (off
by default) so operators can hot-update OTEL MAL, log MAL, telegraf MAL,
and LAL rule files without restarting OAP. State persists in management
storage; every node in an OAP cluster converges on new content within
~30 s.

REST endpoints (full reference in
docs/en/setup/backend/backend-runtime-rule-api.md):
- POST /runtime/rule/addOrUpdate   create or replace
- POST /runtime/rule/inactivate    soft-pause (preserves measure + history)
- POST /runtime/rule/delete        destructive removal of an INACTIVE rule
- GET  /runtime/rule               fetch one rule (raw or JSON, ETag)
- GET  /runtime/rule/bundled       static-vs-runtime overlay per catalog
- GET  /runtime/rule/list          NDJSON rule state across the cluster
- GET  /runtime/rule/dump          tar.gz of all stored rules
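
As a rough illustration of driving the endpoints listed above, the sketch below builds (but does not send) an addOrUpdate request with the JDK HTTP client. The `catalog` and `name` query parameters, the raw-body upload, and the `text/plain` content type are assumptions made for this example; the API reference named above is the authority on the real request shape. `RuntimeRuleRequests` is a hypothetical helper, not a class from this commit.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class RuntimeRuleRequests {
    // Assumption: catalog and name travel as query parameters and the rule
    // file content is the raw request body. Verify against the API reference.
    static HttpRequest addOrUpdate(String host, String catalog, String name, String ruleContent) {
        String query = "catalog=" + URLEncoder.encode(catalog, StandardCharsets.UTF_8)
            + "&name=" + URLEncoder.encode(name, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder()
            // 17128 is the admin port named in the commit message.
            .uri(URI.create("http://" + host + ":17128/runtime/rule/addOrUpdate?" + query))
            .header("Content-Type", "text/plain")
            .POST(HttpRequest.BodyPublishers.ofString(ruleContent))
            .build();
    }
}
```

The request would then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` from a host that the gateway allow-list permits.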

Engine, cluster + storage:
- New management entity RuntimeRule (catalog,name,content,status,updateTime)
  with per-backend RuntimeRuleManagementDAO save/find/list/delete.
- DSLManager + RuleEngine SPI orchestrate compile / verify / commit /
  rollback per file; MAL and LAL each ship an engine, with a shared
  apply pipeline (StructuralCommitCoordinator, SuspendResumeCoordinator,
  ApplierResolver, PostApplyVerifier, alarm-window reset).
- Cluster Suspend/Resume RPC and Forward RPC over the cluster bus so
  any peer can drive a structural cutover that pauses dispatch on every
  node, persists the rule, then resumes, with a 60 s self-heal backstop
  for missed Resumes.
- LOCAL_CACHE_VERIFY mode: non-main nodes verify backend shape on boot
  and refuse to start when their declared model diverges from what the
  main installed, instead of silently registering against an
  incompatible schema.
- Storage-model remove lifecycle: StorageModels.remove + per-backend
  dropTable so a runtime-rule delete can drop the backing measure on
  BanyanDB and the model on every backend without restarting.
- BanyanDB schema-watch fence: schema mutations wait (best-effort,
  bounded 2 s) for every data node to apply the new revision before
  unparking dispatch, so the typical case gets a clean cutover where
  samples after 200 OK use the new shape.
- BanyanDB also gains shape-mismatch detection: at boot, resources whose
  on-disk shape diverges from the declared model are skipped with an
  ERROR diff, instead of silently dropping samples.
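
The compile / verify / commit / rollback ordering described above can be sketched as follows. `RuleEngineSketch` and `ApplyPipelineSketch` are hypothetical stand-ins invented for this illustration; the real RuleEngine SPI and the coordinators named above very likely have different signatures and more steps (suspend/resume, post-apply verification).

```java
// Hypothetical stand-in for the RuleEngine SPI; the real interface may differ.
interface RuleEngineSketch<T> {
    T compile(String content) throws Exception; // parse the rule file
    void verify(T compiled) throws Exception;   // reject illegal changes
    void commit(T compiled);                    // make the rule live
    void rollback(T compiled);                  // undo a failed commit
}

final class ApplyPipelineSketch {
    /** Returns true when the rule was committed, false when it was rejected or rolled back. */
    static <T> boolean apply(RuleEngineSketch<T> engine, String content) {
        T compiled;
        try {
            compiled = engine.compile(content);
            engine.verify(compiled);
        } catch (Exception rejected) {
            return false; // nothing was installed, so there is nothing to roll back
        }
        try {
            engine.commit(compiled);
            return true;
        } catch (RuntimeException failed) {
            engine.rollback(compiled); // restore the previous state after a failed commit
            return false;
        }
    }
}
```

Keeping rollback reachable only after a commit attempt means a rule that fails compilation or verification never touches live state.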

Bundled vs runtime overlay:
- StaticRuleRegistry holds bundled rules; the merge resolver lets a
  runtime rule override a bundled (catalog,name) without rewriting
  static files. RuntimeRuleOverrideResolver SPI threads operator-side
  overrides into the bundle resolution.
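
The overlay idea above, reduced to its core, might look like the sketch below. It assumes rules are keyed by a `catalog/name` string and ignores rule status (for example INACTIVE); `RuleOverlaySketch` is a hypothetical name and is not the StaticRuleRegistry or resolver from this commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RuleOverlaySketch {
    // Key is "catalog/name"; value is the rule file content.
    static Map<String, String> merge(Map<String, String> bundled, Map<String, String> runtime) {
        Map<String, String> effective = new LinkedHashMap<>(bundled);
        // A runtime rule replaces the bundled rule with the same key;
        // bundled rules without a runtime counterpart stay as shipped.
        effective.putAll(runtime);
        return effective;
    }
}
```

Neither input map is modified, which mirrors the point that static files are never rewritten.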

E2E coverage under test/e2e-v2/cases/runtime-rule/:
- mal-storage/{banyandb,elasticsearch,postgresql}: full 10-phase
  lifecycle (CREATE → FILTER_ONLY → STRUCTURAL → DUMP → 4× ILLEGAL
  → SHAPE-BREAK → INACTIVATE → ACTIVATE → DELETE → DUMP) with a
  per-phase `step` label so verification queries attribute data
  back to the phase that wrote it.
- lal: log-mal aggregation rule + LAL hot-swap; swctl asserts that
  the extracted metric carries the swap-flipped step label.
- cluster: 2-OAP convergence over ZooKeeper.

Dependency bumps (driven by BanyanDB schema-consistency RPCs whose
generated validation code requires protobuf-java 4.x):
gRPC 1.70 → 1.80, protobuf-java 3.25.5 → 4.33.1, pgv 1.2.1 → 1.3.0,
Netty 4.2.10 → 4.2.12, Netty-tcnative 2.0.75 → 2.0.77.

Security: the admin port has no built-in authentication; the module is
disabled by default. docs/en/security/README.md spells out the operator
duty to gateway-protect with IP allow-lists, audit every request, and
keep the port off the cluster-external interface.
- No one-line delegate methods. A wrapper whose only body is a single forwarding call
  to another class (`return Other.foo(a, b);`) adds a hop without value. Inline the
  call at the use site, or call the underlying object directly (including via method
  reference: `obj::foo` instead of `this::wrapperOfFoo`).
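
A small sketch of the rule above, using hypothetical classes (`Normalizer`, `RuleNameIndex`) that are not part of the codebase: the forwarding wrapper is shown commented out, and the call site uses a method reference on the underlying object instead.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class Normalizer {
    String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }
}

final class RuleNameIndex {
    private final Normalizer normalizer = new Normalizer();

    // Avoid: a wrapper whose whole body forwards to another class.
    // private String normalizeName(String raw) { return normalizer.normalize(raw); }

    List<String> index(List<String> rawNames) {
        return rawNames.stream()
            .map(normalizer::normalize) // instead of this::normalizeName
            .collect(Collectors.toList());
    }
}
```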

**Required patterns:**
- `@Override` annotation required for overridden methods
@@ -105,6 +116,13 @@ public class XxxModuleProvider extends ModuleProvider {
- Package names: `org.apache.skywalking.*` or `test.apache.skywalking.*`
- Type names: `PascalCase` or `UPPER_CASE_WITH_UNDERSCORES`
- Local variables/parameters/members: `camelCase`
- **Function-oriented naming, not abstract metaphor**: classes and methods are named for
  what they do, not for an abstract concept. Prefer concrete verbs (`load`, `apply`,
  `unregister`, `compile`, `verify`, `commit`, `rollback`) over metaphorical ones
  (`seed`, `hydrate`, `bootstrap`, `prime`). Class names follow the same rule:
  `StaticRuleLoader` (loads static rules), not `StaticBundleSeeder`; `DSLSyncTimer` (syncs
  DB → state on a timer), not `TickRunner`. If you can't name a method without reaching
  for a metaphor, the method is probably doing too much; split it.

**File limits:**
- Max file length: 3000 lines
@@ -257,6 +275,10 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all
10. **Relative paths in docs are valid**: Relative file paths (e.g., `../../../oap-server/...`) in documentation work both in the repo and on the documentation website, supported by website build tooling.
11. **Module service registration**: When adding a service to `CoreModule.services()`, update ALL `CoreModuleProvider` implementations, not just the main one. Search with `grep -rn "extends CoreModuleProvider" oap-server/ --include="*.java"`. The `MockCoreModuleProvider` in `server-tools/profile-exporter/` also needs it, or the profile exporter e2e test will fail at startup.
12. **Multiple OAP packagings**: The OAP server is not only the main `server-starter`. The `server-tools/` directory contains standalone tools (e.g., profile exporter) that boot with mock module providers and a subset of modules. Changes to core module contracts (services, required modules) must be reflected in these tools too.
13. **`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires, not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup).
14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service, which is already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly.
15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer: it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface**: add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines.
16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes. This blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced; do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.
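
Item 15 can be made concrete with a small, self-contained sketch. `PeerModeSketch` and `ApplyMode` are hypothetical names invented for this illustration. The caller sets a `ThreadLocal` flag on its own thread, hands work to an executor, and the worker thread never sees the flag, whereas the explicit enum parameter survives the hand-off.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PeerModeSketch {
    // Anti-pattern: intent hidden in a ThreadLocal that downstream code reads.
    static final ThreadLocal<Boolean> PEER_MODE = ThreadLocal.withInitial(() -> false);

    enum ApplyMode { LOCAL, PEER }

    // Preferred: the mode is part of the signature.
    static boolean skipsBroadcast(ApplyMode mode) {
        return mode == ApplyMode.PEER;
    }

    // Reads the side-channel; the answer depends on which thread runs it.
    static boolean skipsBroadcastViaThreadLocal() {
        return PEER_MODE.get();
    }

    /** Returns {threadLocalResult, explicitResult} as observed on a worker thread. */
    static boolean[] runOnWorker() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        PEER_MODE.set(true); // the caller declares peer mode on its own thread only
        try {
            Callable<Boolean> viaThreadLocal = PeerModeSketch::skipsBroadcastViaThreadLocal;
            Callable<Boolean> viaParameter = () -> skipsBroadcast(ApplyMode.PEER);
            boolean first = worker.submit(viaThreadLocal).get();
            boolean second = worker.submit(viaParameter).get();
            return new boolean[] {first, second};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            PEER_MODE.remove();
            worker.shutdown();
        }
    }
}
```

On the worker thread the `ThreadLocal` read falls back to its initial value, so the caller's intent is silently dropped; the parameterised call carries it across the hand-off and shows up in the method signature.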
`1.2.1` → `1.3.0`. Driven by the new BanyanDB schema-consistency RPCs whose generated
validation code requires the `protobuf-java 4.x` runtime.

#### OAP Server
* Add Zipkin Virtual GenAI e2e test. Use `zipkin_json` exporter to avoid protobuf dependency conflict
@@ -54,4 +92,3 @@
* Add WeChat / Alipay Mini Program monitoring setup documentation, plus a client-side-monitoring section in the security guide covering public-internet ingress (OTLP + `/v3/segments`) for mobile / browser / mini-program SDKs.

All issues and pull requests are [here](https://github.com/apache/skywalking/issues?q=milestone:10.5.0)