perf(throw_error): cheaper error construction and hook installation#4777
Open
sabify wants to merge 3 commits into
Open
perf(throw_error): cheaper error construction and hook installation#4777sabify wants to merge 3 commits into
sabify wants to merge 3 commits into
Conversation
…crete sites
Error construction on the throw path performed two heap allocations and a
memcpy where one allocation suffices.
Problem
-------
The only public constructor is the blanket conversion:
impl<T> From<T> for Error
where
T: Into<Box<dyn error::Error + Send + Sync + 'static>>,
{
fn from(value: T) -> Self {
Error(Arc::from(value.into()))
}
}
For a concrete, sized error, `value.into()` first materialises a
`Box<dyn Error + Send + Sync>` (allocation #1). `Arc::from(Box<T>)` for an
unsized payload cannot reuse the box's allocation, because the `Arc` must
prepend its strong/weak refcount header in front of the value: it allocates a
fresh, larger block (allocation #2), memcpys the error out of the box, and
frees the box. Net cost: 2 allocations + 1 free + 1 memcpy.
`From` itself cannot be narrowed to `T: error::Error` without breaking
`Error::from("str")` / `Error::from(String)` (neither implements
`std::error::Error`); the two bounds overlap, so a second `From` impl would
violate coherence.
Profiling
---------
Isolated bench, mean of 2 runs:
fn bench<F: FnMut()>(name: &str, iters: u64, mut f: F) {
for _ in 0..(iters / 10).max(1) { f(); } // warmup
let t = Instant::now();
for _ in 0..iters { f(); }
println!("{name:<44} {:>9.3} ns/op",
t.elapsed().as_nanos() as f64 / iters as f64);
}
bench("Error::from(DummyErr) [Box->Arc realloc]", 5_000_000,
|| { black_box(&Error::from(black_box(DummyErr { _pad: [0; 32] }))); });
bench("Error::new(DummyErr) [single alloc]", 5_000_000,
|| { black_box(&Error::new(black_box(DummyErr { _pad: [0; 32] }))); });
Error::from(DummyErr) [Box->Arc realloc] ~35.3 ns/op
Error::new(DummyErr) [single alloc] ~20.5 ns/op (~1.7x faster)
The 2x-ish ratio is the fingerprint of the eliminated second allocation: if
`Arc::from(Box)` reused the box's memory the two would land together.
Solution
--------
Add an additive, non-breaking inherent constructor that places a concrete
error directly into the `Arc` (one allocation, no copy):
pub fn new<E>(err: E) -> Self
where
E: error::Error + Send + Sync + 'static,
{
Error(Arc::new(err)) // Arc<E> coerces to Arc<dyn Error> in place
}
Route the hot construction sites that already hold a concrete error type
through it:
- server_fn `From<ServerFnError> for Error`: `ServerFnErrorWrapper` derives
`std::error::Error`, so it can skip the box. Runs on every server-fn
error conversion.
- hydration_context error deserialization: `SerializedError` is a concrete
`std::error::Error`. Runs per serialized error during hydration.
The `&str` / `String` / `anyhow` paths keep using `From` (they are unsized on
arrival and need the box anyway).
…:poll
`ErrorHookFuture::poll` cloned the captured hook `Arc` on every single poll.
Problem
-------
The wrapper installs its captured hook into the thread-local slot for the
duration of each poll:
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
let this = self.project();
let _hook = ResetErrorHookOnDrop(ERROR_HOOK.with_borrow_mut(|cur| {
std::mem::replace(cur, this.hook.clone()) // atomic inc every poll
}));
this.inner.poll(cx) // ...atomic dec on drop
}
Every poll clones `this.hook` (`Option<Arc<dyn ErrorHook>>`): one atomic
increment to install it, then one atomic decrement when the swapped-in clone is
dropped through the guard. That is two atomic RMW operations per poll even
though the hook being installed is the value the future already owns in its own
field — it only needs to lend it, not own a second copy.
`leptos` wraps each `Suspense` body future in `ErrorHookFuture`, which is polled
once per await-point resumption for the life of the suspense, so this is a
genuinely repeated path. Under cross-thread sharing the atomics also risk
cache-line contention beyond the single-thread cost.
Profiling
---------
opt-level=3 + lto, mean of 2 runs. `CloneWrapper`/`SwapWrapper` mirror the old
and new poll bodies over a thread-local slot to isolate the atomic delta:
// clone (old):
let prev = SLOT.with_borrow_mut(|c| mem::replace(c, this.hook.clone()));
let r = Pin::new(&mut this.inner).poll(cx);
SLOT.with_borrow_mut(|c| *c = prev);
// swap (new):
SLOT.with_borrow_mut(|c| mem::swap(c, &mut this.hook));
let r = Pin::new(&mut this.inner).poll(cx);
SLOT.with_borrow_mut(|c| mem::swap(c, &mut this.hook));
poll bare future [baseline] ~0.40 ns/op
poll wrapper [clone hook] ~4.11 ns/op
poll wrapper [swap hook] ~0.60 ns/op (~3.5 ns saved)
poll ErrorHookFuture [real, swap] ~1.65 ns/op (clone was ~4.13)
Solution
--------
Lend the future's own hook to the slot with a swap (no clone) and swap it back
with a panic-safe guard, so zero atomic refcount ops run per poll:
ERROR_HOOK.with_borrow_mut(|cur| std::mem::swap(cur, this.hook));
struct Restore<'a>(&'a mut Option<Arc<dyn ErrorHook>>);
impl Drop for Restore<'_> {
fn drop(&mut self) {
let _ = ERROR_HOOK.try_with(|cell| {
if let Ok(mut cur) = cell.try_borrow_mut() {
std::mem::swap(&mut *cur, self.0);
}
});
}
}
let _restore = Restore(this.hook);
this.inner.poll(cx)
`this.hook` and `this.inner` are disjoint projected fields, so holding
`&mut this.hook` in the guard while polling `inner` is sound. Observable
semantics are unchanged: during the poll the slot holds the captured hook
(including the `None` case, which correctly clears the slot); after the poll the
slot holds the previous ambient hook and `this.hook` is restored — exactly as
the clone version, minus the atomics. The guard mirrors `ResetErrorHookOnDrop`'s
best-effort, panic-safe restore. Adds a regression test asserting the captured
hook overrides the ambient hook during the poll and is restored afterwards.
gbj
reviewed
Jun 12, 2026
Collaborator
|
9cddc6c seems like micro-optimization churn. It's talking about saving nanoseconds in the context of doing async web request/response work... |
Contributor
Author
|
Yes, a 2.41x faster optimization in hot path + panic-safe drop guard. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two small optimizations to
throw_errorhot paths, with no API breakage or behavior change.1. Single-allocation
Error::newfor concrete error typesThe only way to construct an
Errorwas the blanketFromimpl, which boxes the value into aBox<dyn Error>and then re-allocates it behind anArc(the refcount header prevents reusing the box's allocation) — two allocations plus a memcpy per error.This adds an additive inherent constructor that places a concrete error directly into the
Arcin one allocation (~1.7x faster in isolated benches), and routes the two call sites that already hold a concrete type through it:server_fn:From<ServerFnError> for Error(runs on every server-fn error conversion)hydration_context: serialized error deserialization during hydrationThe
&str/String/anyhowpaths keep usingFrom(they need the box anyway).2. Swap instead of clone in
ErrorHookFuture::pollErrorHookFuture::pollcloned its captured hookArcon every poll — two atomic refcount ops per poll on a future that wraps everySuspensebody and is polled once per await-point resumption.Since the future only needs to lend the hook to the thread-local slot for the duration of the poll, it now
mem::swaps it in and back out via a panic-safe drop guard: zero atomic ops per poll (~4.1ns → ~1.7ns per poll in isolated benches). Observable semantics are unchanged; a regression test asserts the captured hook overrides the ambient hook during the poll and the ambient hook is restored afterwards.