[ExecuTorch][WebGPU] Dynamic resize hooks for rms_norm, embedding, rope by JulianCloudNTH · Pull Request #20575 · pytorch/executorch

JulianCloudNTH · 2026-06-28T16:22:14Z

Stack from ghstack (oldest at bottom):

These ops baked their dispatch count, param UBO, and output dims at build() for the max seq-len. On a dynamic-shape graph at a smaller live S they would over-dispatch and leave the output sized at the max, so the resize engine could not actually shrink them.

This adds tensor resize hooks to rms_norm, embedding_q4gsw, and apply_rotary_emb. When an input is resized, each hook recomputes the live row/token count, rewrites the param UBO, updates the dispatch workgroup_count_x, and sets the output's cur_dims. The hook is inert until a resize happens, so static graphs are byte-identical.
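The hook flow described above can be sketched with a mock graph. This is a minimal illustration, not the real WebGPUGraph API: `MockGraph`, `add_resize_hook`, `resize`, `RmsNormParams`, and `build_rms_norm` are hypothetical stand-ins, and the "UBO" is a plain struct rather than a GPU buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not the real WebGPUGraph types.
struct Dispatch {
  uint32_t workgroup_count_x = 0;
};

struct RmsNormParams {  // plays the role of the param UBO contents
  uint32_t num_rows = 0;
  uint32_t row_width = 0;
};

struct MockGraph {
  std::unordered_map<int, std::vector<int64_t>> cur_dims;
  std::vector<Dispatch> dispatches;
  std::unordered_map<int, std::vector<std::function<void(MockGraph&)>>> hooks;

  void add_resize_hook(int tensor_id, std::function<void(MockGraph&)> fn) {
    hooks[tensor_id].push_back(std::move(fn));
  }

  // Update a tensor's live dims, then fire the hooks registered on it.
  void resize(int tensor_id, std::vector<int64_t> dims) {
    cur_dims[tensor_id] = std::move(dims);
    auto it = hooks.find(tensor_id);
    if (it == hooks.end()) return;
    for (auto& fn : it->second) fn(*this);
  }
};

// Rows = product of every dim except the last (assumes non-empty dims).
inline uint32_t live_rows(const std::vector<int64_t>& dims) {
  int64_t rows = 1;
  for (std::size_t i = 0; i + 1 < dims.size(); ++i) rows *= dims[i];
  return static_cast<uint32_t>(rows);
}

// Build for the max shape, then register a hook that re-derives everything.
// `ubo` is captured by reference, so it must outlive the graph in this sketch.
inline void build_rms_norm(MockGraph& g, int in_id, int out_id, RmsNormParams& ubo) {
  const std::vector<int64_t> dims = g.cur_dims.at(in_id);
  ubo.num_rows = live_rows(dims);
  ubo.row_width = static_cast<uint32_t>(dims.back());
  g.cur_dims[out_id] = dims;
  Dispatch d;
  d.workgroup_count_x = ubo.num_rows;  // one workgroup per row
  g.dispatches.push_back(d);
  const std::size_t idx = g.dispatches.size() - 1;

  g.add_resize_hook(in_id, [in_id, out_id, idx, &ubo](MockGraph& gr) {
    const std::vector<int64_t> live = gr.cur_dims.at(in_id);
    ubo.num_rows = live_rows(live);                        // rewrite the params
    gr.dispatches[idx].workgroup_count_x = ubo.num_rows;   // update the dispatch
    gr.cur_dims[out_id] = live;                            // output follows input
  });
}
```

Until `resize` is called, nothing in the hook runs, which is the sense in which a static graph is unaffected.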

Implementation:

rms_norm: recompute num_rows from live cur_dims; out dims follow the input.
embedding_q4gsw: recompute num_indices/total_blocks; out dims = indices dims + [embed_dim].
apply_rotary_emb: add_rope_dispatch now returns its uniform handle; one hook rewrites both the xq and xk dispatches/UBOs for the live S and sets both outputs.
Each keeps its uniform buffer alive via own_uniform_buffer (the hook rewrites it) instead of releasing it at build.
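As a small illustration of the embedding shape rule in the list above (out dims = indices dims + [embed_dim]), the following sketch computes the live output dims and index count. The function names are hypothetical, and `total_blocks` is left out because its definition is not given here.

```cpp
#include <cstdint>
#include <vector>

// Output dims of an embedding lookup: the indices dims followed by embed_dim.
inline std::vector<int64_t> embedding_out_dims(
    const std::vector<int64_t>& indices_dims, int64_t embed_dim) {
  std::vector<int64_t> out = indices_dims;
  out.push_back(embed_dim);
  return out;
}

// Live index count: the product of the live indices dims.
inline int64_t num_indices(const std::vector<int64_t>& indices_dims) {
  int64_t n = 1;
  for (int64_t d : indices_dims) n *= d;
  return n;
}
```

For example, indices of shape [1, 5] with an embed_dim of 2048 would give an output of shape [1, 5, 2048] and 5 indices.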

Mirrors Vulkan per-op resize_*_node (recompute sizes + dispatch each execute). No kernel/WGSL/numerics change. Behavior-neutral on static graphs (hook only fires when live dims differ from max). quantized_linear and SDPA resize hooks land in following diffs; prepack needs none (constants are fixed-size).
@exported-using-ghexport

Differential Revision: D109906096

[ghstack-poisoned]

pytorch-bot · 2026-06-28T16:22:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20575

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Pending, 1 Unrelated Failure

As of commit 6c20da7 with merge base db5f1ba:

NEW FAILURE - The following job has failed:

pull / test-qnn-delegate-linux / linux-job (gh)
RuntimeError: Command docker exec -t eb759c704bc815fb560bbc5391b7eaa2fb76519783ece0375b3f70c659935267 /exec failed with exit code 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-28T16:23:16Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-29T20:22:47Z

@claude review and check for any areas or opportunities for modularization

claude · 2026-06-29T20:23:32Z

Claude finished @JulianCloudNTH's task in 1m 35s.

Review: Dynamic resize hooks for rms_norm, embedding, rope

Read changed files + surrounding context
Inspect WebGPUGraph resize-hook API
Compare against Vulkan resize_*_node patterns
Assess correctness + modularization opportunities
Post review

The change is well-scoped and the design is sound: hooks are inert on static graphs (they only re-derive from cur_dims, which equals the max dims until a resize), the UBO-ownership swap from wgpuBufferRelease → own_uniform_buffer is refcount-balanced (create=1, bind group=2, dtor release=1, bind-group release=0 — no leak), and pulling the duplicated numel_of into utils::numel_of is a clean dedup. Mirrors the Vulkan resize_*_node contract faithfully.
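The refcount arithmetic quoted above (create=1, bind group=2, dtor release=1, bind-group release=0) can be traced with a toy counter. `FakeBuffer`, `add_ref`, and `release` are hypothetical names, and the release order is an assumption; this models only the counting, not any WebGPU call.

```cpp
// Toy reference counter; creation hands the caller one reference.
struct FakeBuffer {
  int refs = 1;
};

inline void add_ref(FakeBuffer& b) { ++b.refs; }
inline void release(FakeBuffer& b) { --b.refs; }

// Walks the lifetime described in the review and returns the final count.
inline int uniform_buffer_final_refs() {
  FakeBuffer ubo;   // create: 1
  add_ref(ubo);     // bind group takes a reference: 2
  release(ubo);     // graph destructor releases the owned reference: 1
  release(ubo);     // bind group is released: 0
  return ubo.refs;
}
```

A final count of zero corresponds to the "no leak" conclusion.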

A few things worth addressing before merge:

Correctness

1. rms_norm error message is wrong for the rows == 0 case. RmsNorm.cpp:204-206 throws the "exceeds 1D dispatch limit (65535)" message even when rows == 0 (the build-time path at :70-77 splits these into two distinct messages). Worth splitting so a zero-row resize doesn't report a misleading limit error.
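One way the split in item 1 could look, as a sketch: the function name and the exact message wording are hypothetical, and only the 65535 limit comes from the review text.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

constexpr uint64_t kMax1dDispatch = 65535;  // limit quoted in the review

// Validates a live row count and returns it as a workgroup count.
// Zero rows and too many rows get distinct messages.
inline uint32_t checked_row_dispatch(uint64_t rows) {
  if (rows == 0) {
    throw std::runtime_error("rms_norm: resize produced zero rows");
  }
  if (rows > kMax1dDispatch) {
    throw std::runtime_error(
        "rms_norm: " + std::to_string(rows) +
        " rows exceeds 1D dispatch limit (65535)");
  }
  return static_cast<uint32_t>(rows);
}
```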

2. rope hook keys only on xq_id but rewrites both xq and xk. RotaryEmbedding.cpp:277 registers the hook on xq_id only, then reads g.cur_dims(xk_id) at fire time. This is correct only if the resize engine always dirties xq whenever xk's live S changes. The build enforces seq_k == seq so they're coupled at build, but the hook silently assumes they stay coupled at runtime — if a future graph resizes xk without xq, the xk dispatch/UBO would go stale with no error. Either register the same hook on xk_id too (the body is idempotent — it reads both and rewrites both), or add a comment documenting the coupling invariant the hook relies on. Vulkan's per-node resize re-runs for each node, so this single-trigger shortcut is a WebGPU-specific assumption that should be made explicit.
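The "register the same hook on xk_id too" option from item 2 might be sketched as follows. `MockGraph`, `RopeState`, and `register_rope_hooks` are hypothetical, and the sketch assumes the sequence length sits at dim index 1; the real hook would rewrite UBOs and dispatches rather than two integers.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the graph's resize-hook registry.
struct MockGraph {
  std::unordered_map<int, std::vector<int64_t>> cur_dims;
  std::unordered_map<int, std::vector<std::function<void(MockGraph&)>>> hooks;

  void add_resize_hook(int id, std::function<void(MockGraph&)> fn) {
    hooks[id].push_back(std::move(fn));
  }

  void resize(int id, std::vector<int64_t> dims) {
    cur_dims[id] = std::move(dims);
    auto it = hooks.find(id);
    if (it == hooks.end()) return;
    for (auto& fn : it->second) fn(*this);
  }
};

// Stands in for the per-dispatch state the real hook would rewrite.
struct RopeState {
  int64_t xq_seq = 0;
  int64_t xk_seq = 0;
};

// Registers one idempotent hook on both inputs, so either resize refreshes both.
inline std::shared_ptr<RopeState> register_rope_hooks(
    MockGraph& g, int xq_id, int xk_id) {
  auto state = std::make_shared<RopeState>();
  auto hook = [state, xq_id, xk_id](MockGraph& gr) {
    // Assumption: S is dim index 1 of both tensors.
    state->xq_seq = gr.cur_dims.at(xq_id).at(1);
    state->xk_seq = gr.cur_dims.at(xk_id).at(1);
  };
  hook(g);  // initialize from the build-time (max) dims
  g.add_resize_hook(xq_id, hook);
  g.add_resize_hook(xk_id, hook);
  return state;
}
```

Because the hook reads both inputs and rewrites both outputs, firing it twice in one resize pass is redundant but harmless.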

3. freqs coverage at the live S is unchecked in the hook. The build validates freqs_numel == seq * half_dim against the max S. At a smaller live S that's fine (positions 0..S-1 are a prefix), but the assumption that freqs is allocated-at-max and indexed by position is load-bearing and undocumented in the hook. A one-line comment would help the next reader.
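The prefix argument in item 3 can be written down as a small check. This is a sketch with a hypothetical function name; it assumes freqs is laid out as max_seq rows of half_dim elements and indexed by position.

```cpp
#include <cstdint>
#include <stdexcept>

// Returns how many leading freqs elements a live sequence length reads.
// Assumes freqs was allocated for max_seq positions of half_dim elements each.
inline uint64_t live_freqs_extent(
    uint64_t freqs_numel, uint64_t max_seq, uint64_t half_dim, uint64_t live_seq) {
  if (freqs_numel != max_seq * half_dim) {
    throw std::invalid_argument("freqs is not sized for the max sequence length");
  }
  if (live_seq > max_seq) {
    throw std::invalid_argument("live sequence length exceeds the max");
  }
  // Positions 0..live_seq-1 occupy a prefix of the max-sized buffer.
  return live_seq * half_dim;
}
```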

Modularization

4. The "rewrite UBO + bump workgroup count" pair is repeated 4× (rms ×1, embedding ×1, rope ×2) with the identical shape:

wgpuQueueWriteBuffer(g.queue(), buf, 0, &p, sizeof(p));
g.dispatch_at(idx).workgroup_count_x =
    utils::compute_1d_workgroup_count(g.device(), threads, wg_size, name);

A small templated helper in WebGPUUtils.h would collapse all four call sites and keep the device/queue plumbing in one place:

template <typename Params>
inline void rewrite_dispatch(
    WebGPUGraph& g, WGPUBuffer buf, const Params& p,
    size_t dispatch_idx, uint32_t threads, uint32_t wg_size,
    const char* name) {
  wgpuQueueWriteBuffer(g.queue(), buf, 0, &p, sizeof(p));
  g.dispatch_at(dispatch_idx).workgroup_count_x =
      utils::compute_1d_workgroup_count(g.device(), threads, wg_size, name);
}

rms_norm is the odd one out — it sets workgroup_count_x = rows directly (one workgroup per row, no div_up), so it either keeps its inline form or the helper grows a "raw count" overload. I'd keep rms_norm inline and apply the helper to embedding + the two rope dispatches; that's where the win is real.
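The difference between the two dispatch shapes can be illustrated with plain arithmetic. The helper names below are hypothetical and stand in for whatever `utils::compute_1d_workgroup_count` does internally, which is not shown in this thread.

```cpp
#include <cstdint>

// Ceiling division written to avoid overflow in n + d - 1 (assumes d > 0).
constexpr uint32_t div_up(uint32_t n, uint32_t d) {
  return n / d + (n % d != 0 ? 1u : 0u);
}

// Thread-count dispatch: enough workgroups to cover every thread.
constexpr uint32_t workgroups_for_threads(uint32_t threads, uint32_t wg_size) {
  return div_up(threads, wg_size);
}

// Row dispatch (the rms_norm shape): exactly one workgroup per row.
constexpr uint32_t workgroups_for_rows(uint32_t rows) {
  return rows;
}
```

With 1000 threads and a workgroup size of 64, the first form yields 16 workgroups, while 7 rows in the second form yields 7.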

5. Pre-existing, optional: utils::make_uniform (WebGPUUtils.h:69) already wraps the mapped-at-creation uniform pattern, but all three ops still hand-roll wgpuDeviceCreateBuffer + GetMappedRange + memcpy + Unmap (RmsNorm.cpp:85-93, EmbeddingQ4gsw.cpp:147-155, RotaryEmbedding.cpp:64-72). Out of scope for this PR, but converting them would remove ~8 lines each and centralize the null-check error handling.

6. Minor: the WGPUBuffer params_buf = uniform_buffer; alias (RmsNorm.cpp:195, EmbeddingQ4gsw.cpp:231) exists only to give the lambda a distinct capture name; capturing uniform_buffer directly reads just as clearly. Not worth changing on its own.

Nothing here is blocking — items 1–3 are the ones I'd want resolved (1 is a trivial message fix; 2 is the one with real runtime-correctness implications if the graph ever decouples xq/xk).
• branch gh/JulianCloudNTH/67/head

Commit 89bce88 (Update)

JulianCloudNTH temporarily deployed to cadence June 28, 2026 16:22 with GitHub Actions (inactive)

meta-cla Bot added the CLA Signed label Jun 28, 2026

Commit 2b1fab6 (Update)

JulianCloudNTH temporarily deployed to cadence June 29, 2026 22:10 with GitHub Actions (inactive)

meta-codesync Bot added the meta-exported label Jun 29, 2026

Commit 6c20da7 (Update)

JulianCloudNTH temporarily deployed to cadence June 30, 2026 02:46 with GitHub Actions (inactive)

JulianCloudNTH wants to merge 3 commits into gh/JulianCloudNTH/67/base from gh/JulianCloudNTH/67/head
