Speed up in-place vector-to-C-contiguous-matrix broadcast on CPU #2981

Open
vchamarthi wants to merge 3 commits into IntelPython:master from vchamarthi:fix/inplace-cmatrix-broadcast-cpu
Speed up in-place vector-to-C-contiguous-matrix broadcast on CPU

In-place binary elementwise ops that broadcast a vector against a C-contiguous matrix
(m += row, m += col[:, None]) fell through to the general strided kernel (scalar, one
element per work-item) on the CPU device, even though a vectorized broadcast kernel already
exists and is used by the out-of-place path.

Changes

  • Row broadcast (m += row): add the missing C-contiguous dispatch branch to
    py_binary_inplace_ufunc (the out-of-place path already had it; the in-place template only
    had the F-style {1,0} branch). Reuses the existing BinaryInplaceRowMatrixBroadcastingFunctor,
    so it benefits all in-place binary ufuncs — no new kernel.
  • Column broadcast (m += col[:, None]): add BinaryInplaceColMatrixBroadcastingFunctor
    (mat[gid] += vec[gid / n1]) and wire it for add via a defaulted extra template
    parameter, leaving all other in-place ufuncs unchanged.
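The index arithmetic behind the two cases can be sketched in plain Python. This is only a scalar reference model of the per-element effect, not the vectorized SYCL functors: the column mapping `vec[gid // n1]` follows the description above, while the row mapping `vec[gid % n1]` and the helper names `inplace_add_row` and `inplace_add_col` are assumptions made for illustration.

```python
def inplace_add_row(mat, vec, n0, n1):
    """Row broadcast on a flat, C-ordered n0 x n1 buffer: every row gets vec added."""
    for gid in range(n0 * n1):
        # Column index of flat element gid in row-major order (assumed mapping).
        mat[gid] += vec[gid % n1]


def inplace_add_col(mat, vec, n0, n1):
    """Column broadcast: every element of row i gets vec[i] added."""
    for gid in range(n0 * n1):
        # Row index of flat element gid in row-major order (integer division).
        mat[gid] += vec[gid // n1]


n0, n1 = 2, 3
m = [0, 1, 2, 3, 4, 5]          # 2 x 3 matrix stored row-major
row = [10, 20, 30]              # length n1
col = [100, 200]                # length n0

m_row = list(m)
inplace_add_row(m_row, row, n0, n1)

m_col = list(m)
inplace_add_col(m_col, col, n0, n1)

print(m_row)
print(m_col)
```

Because the matrix is C-contiguous, consecutive `gid` values touch consecutive memory, which is what lets both cases run over the flat buffer instead of going through general strided indexing.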

Both paths are guarded by exact simplified-stride checks and fall back to the strided kernel
otherwise; results are bitwise-identical.
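A rough sketch of what such a guard could look like, written in Python for readability. It is an assumption-laden illustration rather than the actual dispatch code: the function name `classify_inplace_broadcast` is hypothetical, strides are raw element strides (the real check operates on the simplified strides of the iteration space), and the `(0, 1)` and `(1, 0)` patterns are taken from the stride tuples mentioned in this description.

```python
def classify_inplace_broadcast(shape, mat_strides, vec_strides):
    """Pick a kernel for `mat op= vec` on a 2-D iteration space.

    shape       -- (n0, n1) of the matrix
    mat_strides -- element strides of the matrix
    vec_strides -- element strides of the broadcast vector (0 on the broadcast axis)
    Returns "row", "col", or "strided" (the general fallback).
    """
    n0, n1 = shape
    # Exact match only: anything that is not a plain C-contiguous matrix falls back.
    c_contiguous = tuple(mat_strides) == (n1, 1)
    if c_contiguous and tuple(vec_strides) == (0, 1):
        return "row"      # m += row
    if c_contiguous and tuple(vec_strides) == (1, 0):
        return "col"      # m += col[:, None]
    return "strided"


print(classify_inplace_broadcast((4, 5), (5, 1), (0, 1)))
print(classify_inplace_broadcast((4, 5), (5, 1), (1, 0)))
print(classify_inplace_broadcast((4, 5), (1, 4), (1, 0)))
```

The third call uses an F-contiguous matrix, so it does not match either C-contiguous pattern in this sketch and is routed to the fallback.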

Results

13th Gen Intel Core i5-13400 (CPU/OpenCL), float32, dpnp built from source. Same op, before/after
this change:

op (D = 21846 x 21846)   before    after
D += row                 247 ms    58 ms
D += col[:, None]        252 ms    150 ms
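For reference, the per-op speedups and the matrix footprint implied by these numbers can be worked out with a few lines of Python. The timings are copied from the table; the variable names are illustrative, and the 4-byte element size assumes the float32 dtype stated above.

```python
# Timings from the table, in milliseconds: (before, after).
timings_ms = {
    "D += row": (247, 58),
    "D += col[:, None]": (252, 150),
}

# Speedup = time before / time after.
speedups = {op: before / after for op, (before, after) in timings_ms.items()}

# Size of one 21846 x 21846 float32 matrix (4 bytes per element).
n = 21846
matrix_bytes = n * n * 4

for op, s in speedups.items():
    print(f"{op}: {s:.2f}x")
print(f"matrix size: {matrix_bytes / 2**30:.2f} GiB")
```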

This was the only op category where dpnp-on-CPU lost to NumPy in dpbench pairwise_distance
(M16Gb, single). End-to-end via the dpbench CLI (--repeat 30, --validate passing):
0.92x → 1.52x vs stock NumPy (731 ms → 445 ms; NumPy 678 ms).

Checklist

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
    (TestAdd::test_inplace_row_broadcast, TestAdd::test_inplace_column_broadcast)
  • Have you tested your changes locally for CPU and GPU devices? (row + column broadcast verified
    bitwise-equal to NumPy on opencl:cpu and level_zero:gpu, across float32/64 + int32/64 and
    shapes incl. non-sub-group-multiple row lengths)
  • Have you made sure that new changes do not introduce compiler warnings? (clean recompile with
    icpx 2026.0, -Wall -Wextra: 0 warnings)
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary? (no public API change)
  • Have you added your changes to the changelog?

In-place binary elementwise ops broadcasting a vector against a
C-contiguous matrix (m += row, m += col[:, None]) fell through to the
general strided kernel on CPU, although a vectorized row-broadcast
kernel already exists and is used by the out-of-place path.

- Add the missing C-contiguous row-broadcast dispatch branch to
  py_binary_inplace_ufunc (reuses the existing
  BinaryInplaceRowMatrixBroadcastingFunctor); the in-place template
  previously only had the F-style {1,0} branch while the out-of-place
  path already handled the {0,1} C-contiguous case.
- Add BinaryInplaceColMatrixBroadcastingFunctor for the column case
  (mat[gid] += vec[gid / n1]) and wire it for add via a defaulted extra
  template parameter, keeping all other in-place ufuncs unchanged.

Both paths are guarded by exact simplified-stride checks and fall back
to the strided kernel otherwise. Results are bitwise-identical.

Adds TestAdd::test_inplace_row_broadcast and
TestAdd::test_inplace_column_broadcast covering several shapes (incl.
row lengths not a multiple of the sub-group size) across dtypes.
@coveralls

Coverage Status

Coverage: 78.073% (remained the same) when merging vchamarthi:fix/inplace-cmatrix-broadcast-cpu into IntelPython:master
