Speed up in-place vector-to-C-contiguous-matrix broadcast on CPU by vchamarthi · Pull Request #2981 · IntelPython/dpnp

vchamarthi · 2026-06-30T01:00:39Z

Speed up in-place vector-to-C-contiguous-matrix broadcast on CPU

In-place binary elementwise ops that broadcast a vector against a C-contiguous matrix
(m += row, m += col[:, None]) fell through to the general strided kernel (scalar, one
element per work-item) on the CPU device, even though a vectorized broadcast kernel already
exists and is used by the out-of-place path.

Changes

- Row broadcast (m += row): add the missing C-contiguous dispatch branch to
  py_binary_inplace_ufunc (the out-of-place path already had it; the in-place template only
  had the F-style {1,0} branch). Reuses the existing BinaryInplaceRowMatrixBroadcastingFunctor,
  so it benefits all in-place binary ufuncs without adding a kernel.
- Column broadcast (m += col[:, None]): add BinaryInplaceColMatrixBroadcastingFunctor
  (mat[gid] += vec[gid / n1]) and wire it for add via a defaulted extra template
  parameter, leaving all other in-place ufuncs unchanged.
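For readers unfamiliar with the flat indexing used above, here is a minimal pure-Python sketch of what the two broadcasts compute on a C-contiguous buffer. The column formula (`gid // n1`) mirrors the expression quoted in the description; the row formula (`gid % n1`) and the function names are assumptions for illustration only, and the actual row functor is vectorized rather than a scalar loop.

```python
# Illustrative only: scalar reference semantics for the two broadcast patterns.
# `mat` is a flat list holding an n0 x n1 matrix in C (row-major) order.

def inplace_add_row(mat, vec, n0, n1):
    """m += row: vec has n1 elements, one per column."""
    for gid in range(n0 * n1):
        # Assumed mapping: the column index of flat element gid is gid % n1.
        mat[gid] += vec[gid % n1]


def inplace_add_col(mat, vec, n0, n1):
    """m += col[:, None]: vec has n0 elements, one per row."""
    for gid in range(n0 * n1):
        # From the PR description: the row index of flat element gid is gid / n1.
        mat[gid] += vec[gid // n1]


m_row = [0] * 6
inplace_add_row(m_row, [1, 2, 3], n0=2, n1=3)

m_col = [0] * 6
inplace_add_col(m_col, [10, 20], n0=2, n1=3)
```

In both cases every matrix element is touched exactly once at unit stride, which is what lets a contiguous kernel replace the general strided one.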

Both paths are guarded by exact simplified-stride checks and fall back to the strided kernel
otherwise; results are bitwise-identical.
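The guard logic can be pictured with a small, hypothetical classifier. This is a sketch under stated assumptions, not the dispatch code in the PR: strides are expressed in elements, the function name and return labels are invented, and the real checks run on the simplified shape and strides produced by the library's iteration-space simplification (which this sketch does not model, nor does it model the pre-existing F-style branch).

```python
# Hypothetical sketch of an exact stride check for a 2-D in-place op `lhs op= rhs`.
# Strides are in elements, already broadcast to the shape of lhs.

def classify_inplace_broadcast(shape, lhs_strides, rhs_strides):
    """Return 'row', 'col', or 'strided' (the general fallback)."""
    if len(shape) != 2:
        return "strided"
    n0, n1 = shape
    # The matrix must be exactly C-contiguous: stride n1 between rows, 1 within a row.
    if tuple(lhs_strides) != (n1, 1):
        return "strided"
    # A row vector broadcast down the rows does not advance along axis 0.
    if tuple(rhs_strides) == (0, 1):
        return "row"
    # A column vector broadcast across the columns does not advance along axis 1.
    if tuple(rhs_strides) == (1, 0):
        return "col"
    return "strided"
```

Anything that does not match one of the two patterns exactly, such as a non-contiguous matrix view or a strided vector, takes the general path, which matches the fallback behaviour described above.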

Results

13th Gen Intel Core i5-13400 (CPU/OpenCL), float32, dpnp built from source. Same op, before/after
this change:

| op (`D` = 21846x21846) | before | after |
| --- | --- | --- |
| `D += row` | 247 ms | 58 ms |
| `D += col[:, None]` | 252 ms | 150 ms |

This was the only op category where dpnp-on-CPU lost to NumPy in dpbench pairwise_distance
(M16Gb, single). End-to-end via the dpbench CLI (--repeat 30, --validate passing):
0.92x → 1.52x vs stock NumPy (731 ms → 445 ms; NumPy 678 ms).
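As a quick sanity check on the figures above, the speedups can be recomputed from the rounded timings. The variable names are illustrative; because the inputs are rounded to whole milliseconds, the ratios only agree with the quoted values to within rounding.

```python
# Recompute speedups from the rounded timings quoted in this description.
kernel_ms = {
    "D += row": (247, 58),            # (before, after)
    "D += col[:, None]": (252, 150),
}
kernel_speedup = {op: before / after for op, (before, after) in kernel_ms.items()}
# Row broadcast comes out at roughly 4.3x, column broadcast at roughly 1.7x.

numpy_ms = 678
dpnp_before_ms, dpnp_after_ms = 731, 445
ratio_before = numpy_ms / dpnp_before_ms  # below 1.0: dpnp slower than NumPy
ratio_after = numpy_ms / dpnp_after_ms    # above 1.0: dpnp faster than NumPy

for op, s in kernel_speedup.items():
    print(f"{op}: {s:.2f}x")
print(f"vs NumPy: {ratio_before:.2f}x -> {ratio_after:.2f}x")
```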

Checklist

- Have you provided a meaningful PR description?
- Have you added a test, reproducer or referred to an issue with a reproducer?
  (TestAdd::test_inplace_row_broadcast, TestAdd::test_inplace_column_broadcast)
- Have you tested your changes locally for CPU and GPU devices? (row + column broadcast verified
  bitwise-equal to NumPy on opencl:cpu and level_zero:gpu, across float32/64 + int32/64 and
  shapes incl. non-sub-group-multiple row lengths)
- Have you made sure that new changes do not introduce compiler warnings? (clean recompile with
  icpx 2026.0, -Wall -Wextra: 0 warnings)
- Have you checked performance impact of proposed changes?
- Have you added documentation for your changes, if necessary? (no public API change)
- Have you added your changes to the changelog?

In-place binary elementwise ops broadcasting a vector against a C-contiguous matrix (m += row, m += col[:, None]) fell through to the general strided kernel on CPU, although a vectorized row-broadcast kernel already exists and is used by the out-of-place path.

- Add the missing C-contiguous row-broadcast dispatch branch to py_binary_inplace_ufunc (reuses the existing BinaryInplaceRowMatrixBroadcastingFunctor); the in-place template previously only had the F-style {1,0} branch while the out-of-place path already handled the {0,1} C-contiguous case.
- Add BinaryInplaceColMatrixBroadcastingFunctor for the column case (mat[gid] += vec[gid / n1]) and wire it for add via a defaulted extra template parameter, keeping all other in-place ufuncs unchanged.

Both paths are guarded by exact simplified-stride checks and fall back to the strided kernel otherwise. Results are bitwise-identical.

Adds TestAdd::test_inplace_row_broadcast and TestAdd::test_inplace_column_broadcast covering several shapes (incl. row lengths not a multiple of the sub-group size) across dtypes.

coveralls · 2026-06-30T04:02:40Z

coverage: 78.073%. remained the same — vchamarthi:fix/inplace-cmatrix-broadcast-cpu into IntelPython:master

vchamarthi requested review from antonwolfy, ndgrigorian and vlad-perevezentsev as code owners June 30, 2026 01:00

vchamarthi added 2 commits June 29, 2026 20:01

Update changelog with PR number IntelPython#2981

43beb19

Apply pre-commit formatting (black, clang-format)

5e1ea14

vchamarthi wants to merge 3 commits into IntelPython:master from vchamarthi:fix/inplace-cmatrix-broadcast-cpu

2 participants
