Finalize CollectiveX v1 cross-vendor EP benchmark suite / 完成 CollectiveX v1 跨厂商 EP 基准测试套件 by Oseltamivir · Pull Request #2004 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-07-03T17:38:00Z

Summary

Finalizes the isolated CollectiveX v1 expert-parallel communication benchmark under
experimental/CollectiveX/. The branch is ready for three complete no-canary qualification runs;
it does not claim or include promoted v1 results yet.

Benchmark contract

Covers H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X across DeepEP V1,
DeepEP V2 (PR #605 plus PR #630's scale-up fix), DeepEP Hybrid, UCCL, MoRI, and the
NCCL/RCCL reference.
Resolves 37 runnable shards and 360 requested cases / 840 points: 222 runnable cases / 518
points plus 138 planned-unsupported cases / 322 points.
Uses 8 timed iterations x 64 trials and 32 synchronized full-roundtrip warmups before every
measured component, yielding exactly 512 observations per point with nearest-rank percentiles.
Standardizes combine as activation-only rank-sum. Dispatch weights remain oracle-checked.
Keeps uniform routing as the headline; Zipf and Zipf+EPLB are experimental sensitivity evidence.
Marks only the current H100 runner pool's six DeepEP V2 cases unsupported. Other H100 backends
remain runnable; V2 returns after that pool proves all-rank CUDA P2P/VMM and one full-world LSA team.
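
The sampling rule in this contract (8 timed iterations x 64 trials, summarized with nearest-rank percentiles) can be illustrated with a minimal sketch. It uses the textbook nearest-rank definition and is not the suite's own code, which is not shown on this page; `nearest_rank` and the synthetic observations are illustrative assumptions.

```python
import math

def nearest_rank(samples, pct):
    """Nearest-rank percentile: the value at 1-based rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

ITERS, TRIALS = 8, 64  # values from the PR description
# Synthetic stand-in latencies 1.0 .. 512.0 (assumption, not measured data).
observations = [float(i) for i in range(1, ITERS * TRIALS + 1)]

p50 = nearest_rank(observations, 50)
p99 = nearest_rank(observations, 99)
```

Nearest-rank always returns an observed value rather than an interpolated one, so with a fixed 512 observations the p50 and p99 ranks are the same for every point.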

Qualification fixes

Adds deterministic native correctness for payload, routing, multiplicity, counts, weights,
combine values, and input immutability on every rank.
Binds image/squash bytes, pinned source trees, imported binaries, loaded NCCL/RCCL runtimes,
runtime topology, and generated-kernel evidence.
Hardens B300 cache identity with a private random mount sentinel, root-relative owner/mode checks,
and immutable completed markers; V2 JIT output is isolated per shard.
Pins DeepEP V2 to PR #605 with PR #630's minimal pure-scale-up fix, disables GIN only for declared
scale-up cases, and requires NCCL's realized LSA team to cover the full EP world.
Preserves raw logs privately while exporting only a closed failure category, including detailed
NCCL/topology/JIT classification for GB two-tray jobs.

Artifact architecture

GitHub artifacts are transient delivery inputs to an owner-only, content-addressed local filesystem
publisher. Promotion requires exactly three complete independent runs from one source SHA, exact
coverage, homogeneous build/runtime identity, stable p50/p99 evidence and ordering, and every
controlled cohort. No managed database, object store, or third-party result hosting is introduced.
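
As a rough illustration of what an owner-only, content-addressed filesystem store can look like, here is a minimal sketch. The `publish` function, the two-character fan-out directory, and the permission bits are assumptions made for the example; the PR's actual publisher is not shown on this page.

```python
import hashlib
import os
import tempfile
from pathlib import Path

def publish(store_root: Path, payload: bytes) -> Path:
    """Store payload under its SHA-256 digest; hypothetical layout <root>/<aa>/<digest>."""
    digest = hashlib.sha256(payload).hexdigest()
    target_dir = store_root / digest[:2]
    target_dir.mkdir(mode=0o700, parents=True, exist_ok=True)  # owner-only directory
    target = target_dir / digest
    if target.exists():
        return target  # identical content was already published
    fd, tmp = tempfile.mkstemp(dir=target_dir)  # mkstemp creates the file owner-only
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
        os.chmod(tmp, 0o400)      # read-only for the owner once written
        os.replace(tmp, target)   # atomic rename within the same directory
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return target
```

Because the path is derived from the bytes, publishing the same result twice is idempotent, and any later change to the content would show up as a different address.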

The tracked tree and all reachable refs contain none of the six private runner endpoint literals.
platforms.yaml, local goals/notes, raw logs, and result stores are ignored and untracked.

Validation

132 Python contract/unit tests.
Matrix SHA-256: 17ebafaa4f704e6d309d05f1fa7c44c66d60166b15a1cda8c29905ee39b536c5.
Case-catalog SHA-256: 3b223fef491c79cfd4eef32ac8cef288d2fa35f3051f3c089b6c9cc09e2fe36f.
Independent regeneration confirmed all counts, promotion cohorts (48 library / 12 system / 74
routing), and uniform 8:64:32 / 512-sample / warmup semantics across all 360 cases.
Actionlint, bash -n, ShellCheck, git diff --check, bilingual documentation parity, and exact
endpoint scans across the tracked tree and all reachable Git refs.
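
The counts quoted in this description can be cross-checked with a few lines of arithmetic. This is only a sanity sketch over the numbers stated above, not the regeneration tooling the PR refers to; `partition_ok` is a hypothetical helper.

```python
# Numbers copied from the PR description.
requested = {"cases": 360, "points": 840}
runnable = {"cases": 222, "points": 518}
unsupported = {"cases": 138, "points": 322}

def partition_ok(total, *parts):
    """True when every counter in total equals the sum over the parts."""
    return all(total[key] == sum(part[key] for part in parts) for key in total)

ITERS, TRIALS, WARMUP = 8, 64, 32   # the 8:64:32 protocol
samples_per_point = ITERS * TRIALS  # observations per measured point

counts_consistent = partition_ok(requested, runnable, unsupported)
```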

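Chinese description (translated)

This PR completes the isolated CollectiveX v1 expert-parallel (EP) communication benchmark under experimental/CollectiveX/.
The branch is ready for three complete, no-canary qualification runs; no promoted v1 results are claimed or committed yet.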

Benchmark contract

Covers H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X; the backends are DeepEP V1,
DeepEP V2 (PR #605 plus the scale-up fix from PR #630), DeepEP Hybrid, UCCL, MoRI, and the NCCL/RCCL
reference implementation.
Generates 37 runnable shards covering 360 requested cases / 840 points: 222 runnable cases / 518
points, plus 138 planned-unsupported cases / 322 points.
Every measured component runs 32 synchronized full-roundtrip warmups, then 8 timed iterations x 64 trials; each point
yields exactly 512 observations, summarized with nearest-rank percentiles.
Combine is standardized to activation-only rank-sum on every backend; dispatch weights are still checked by the oracle.
Uniform routing is the headline result; Zipf and Zipf+EPLB serve only as experimental sensitivity evidence.
Marks only the current H100 runner pool's 6 DeepEP V2 cases as unsupported; other H100 backends remain runnable.
V2 returns once that pool proves all-rank CUDA P2P/VMM and an LSA team covering the whole world.

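Qualification fixes

Runs deterministic native correctness checks on every rank for payload, routing, multiplicity, counts, weights, combine values,
and input immutability.
Provenance binds image/squash contents, pinned source trees, imported binaries, the NCCL/RCCL runtimes actually loaded,
runtime topology, and generated-kernel evidence.
Hardens B300 cache identity with a private random mount sentinel, owner/mode checks relative to the cache root, and immutable completion markers;
V2 JIT output is isolated per shard.
Pins DeepEP V2 to PR #605 with PR #630's minimal pure-scale-up fix; disables GIN only for declared scale-up cases,
and requires the LSA team that NCCL actually establishes to cover the whole EP world.
Keeps raw logs in a private directory and exports only a closed failure category; GB two-tray jobs can still be subdivided into NCCL, topology, and
JIT failures without publishing raw logs.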

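Artifact architecture

GitHub artifacts serve only as transient delivery inputs that end up in an owner-only, content-addressed local filesystem publisher. Promotion is
allowed only when three complete independent runs from the same source SHA together satisfy exact coverage, homogeneous build/runtime identity, p50/p99 stability,
ordering stability, and every controlled cohort. No managed database, object store, or third-party result hosting service is introduced.

The tracked files and all reachable Git refs contain none of the 6 private runner endpoint literals. platforms.yaml,
local goals/notes, raw logs, and result stores are ignored and untracked by Git.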

Validation

132 Python contract/unit tests.
Matrix SHA-256: 17ebafaa4f704e6d309d05f1fa7c44c66d60166b15a1cda8c29905ee39b536c5.
Case catalog SHA-256: 3b223fef491c79cfd4eef32ac8cef288d2fa35f3051f3c089b6c9cc09e2fe36f.
Independent regeneration confirmed all counts, the promotion cohorts (48 library / 12 system / 74 routing), and that
all 360 cases uniformly use 8:64:32, 512 samples, and the same warmup semantics.
Actionlint, bash -n, ShellCheck, git diff --check, Chinese/English documentation parity, and exact private-endpoint scans across the tracked files and
all reachable Git refs.

claude · 2026-07-03T18:12:16Z

+  rsync -a --delete --delete-excluded \
+    --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
+    --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
+    --exclude='goal.md' --exclude='notes.md' \
+    "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/" >/dev/null 2>&1 \
+    || cx_die "staging CollectiveX failed"


🔴 The setup step writes the shard JSON to experimental/CollectiveX/results/.shard_${matrix.id}.json and sets CX_SHARD_FILE=results/.shard_${matrix.id}.json (relative), but cx_stage_repo (runtime/common.sh:145-150) rsyncs the CollectiveX tree with --exclude='results/' --delete-excluded and drops the shard file — so for every staged single-tray SKU (b300 always; gb200/gb300 with EP4 via CX_NODES<=1), the [ -f "$CX_SHARD_FILE" ] guard at run_in_container.sh:458 fails and execution falls into the single-bench else branch (line 556+), silently running one wrong-config default (uniform/decode/bf16, empty case_id) instead of the shard's N scheduled cases. Downstream make_bundle will catch this via missing_identity/coverage but only after GPU allocation was spent on the wrong workload. Cheap fix: allow-list the shard file through the rsync (--include='experimental/CollectiveX/results/' --include='experimental/CollectiveX/results/.shard_*.json' before the results/ exclude), copy the shard file into the stage dir after the rsync, or resolve CX_SHARD_FILE against the original repo root in run_in_container.sh's SHARD guard the way the rack (EP8) launchers already do (see launch_gb300-nv.sh:92-93 / launch_gb200-nv.sh cx_ep_cases).

Extended reasoning...

The bug

The sweep workflow's shard-fanout step writes the resolved case list to experimental/CollectiveX/results/.shard_${matrix.id}.json:

# .github/workflows/collectivex-sweep.yml
env:
  CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json   # RELATIVE path
...
- name: Extract shard from matrix artifact
  working-directory: experimental/CollectiveX
  run: |
    ...
    json.dump({...,'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))

The physical file therefore lands at $REPO/experimental/CollectiveX/results/.shard_<id>.json, and CX_SHARD_FILE=results/.shard_<id>.json is interpreted relative to the container's cwd, which is /ix/experimental/CollectiveX.

For every SKU that requires CX_STAGE_DIR (b300 always; gb200/gb300 with EP4 via the CX_NODES<=1 delegate path in launch_gb200-nv.sh:57 / launch_gb300-nv.sh:47), the launcher calls:

# launch_b300.sh:34, launch_gb200-nv.sh:52, launch_gb300-nv.sh:24
MOUNT_SRC="$(cx_stage_repo "$REPO_ROOT" "$CX_STAGE_DIR")"

which rsyncs the tree with an exclude that drops results/:

# experimental/CollectiveX/runtime/common.sh:145-150
rsync -a --delete --delete-excluded \
  --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
  --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
  --exclude='goal.md' --exclude='notes.md' \
  "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/"

Both --exclude='results/' and --delete-excluded guarantee that the shard file the workflow just wrote is missing from the stage dir.
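
The effect can be reproduced in miniature. The sketch below approximates the staging step with `shutil.copytree` and name-based ignore patterns; it is not a model of rsync's filter rules, and `stage_tree`, `shard_survives_staging`, and the placeholder files are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def stage_tree(src: Path, dst: Path) -> None:
    """Copy src to dst, skipping directories the staging step excludes by name."""
    shutil.copytree(
        src, dst,
        ignore=shutil.ignore_patterns("__pycache__", "results", ".cx_workloads"),
    )

def shard_survives_staging(shard_id: str) -> bool:
    """Write a shard file into results/, stage the tree, and report whether it arrived."""
    with tempfile.TemporaryDirectory() as tmp:
        checkout = Path(tmp) / "checkout" / "experimental" / "CollectiveX"
        (checkout / "results").mkdir(parents=True)
        (checkout / "runtime").mkdir()
        (checkout / "runtime" / "common.sh").write_text("# placeholder\n")
        shard_rel = Path("results") / f".shard_{shard_id}.json"
        (checkout / shard_rel).write_text('{"cases": []}')

        staged = Path(tmp) / "stage" / "experimental" / "CollectiveX"
        stage_tree(checkout, staged)
        return (staged / shard_rel).is_file()

survived = shard_survives_staging("b300-deepep")
```

Here `survived` ends up false: the file exists in the checkout but never reaches the staged tree, which is the situation the container later observes.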

The consequence at runtime

The container mounts $MOUNT_SRC:/ix, cwd=/ix/experimental/CollectiveX. Inside run_in_container.sh, the SHARD guard resolves CX_SHARD_FILE relative to that cwd:

# runtime/run_in_container.sh:458
if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "${CX_SHARD_FILE:-/nonexistent}" ]; then
  # SHARD mode: sweep every scheduled case
  ...
else
  # Single-bench (workflow_dispatch) path
  # uses ${CX_MODE:-normal}, ${CX_PHASE:-decode}, ${CX_ROUTING:-uniform},
  # ${CX_DISPATCH_DTYPE:-bf16}, empty CX_CASE_ID/CX_SUITE/CX_WORKLOAD_NAME, ...

The file resolves to /ix/experimental/CollectiveX/results/.shard_<id>.json, which is missing because rsync excluded it, so the test fails and the else branch runs a single default case with none of the shard's identity: one run in place of the intended N-case sweep.

Why the rack (EP8) paths escape

The rack-scale launchers iterate cases themselves in the launcher on the SUBMIT host (not inside the container). Their case-list helpers explicitly resolve the shard file against the original checkout when the relative path misses:

# launch_gb300-nv.sh cx_ep8_cases (and launch_gb200-nv.sh cx_ep_cases)
local sf="${CX_SHARD_FILE:-}"
[ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "$CX_DIR/$sf" ] && sf="$CX_DIR/$sf"

The same workaround is absent from run_in_container.sh:458, so the EP4 single-tray path — which shares the b300/gb200-EP4/gb300-EP4 launchers with the staged mount — hits the missing file.

Affected sweeps

Every single-tray staged shard in the v1 promoted matrix, per sweep_matrix.py + configs/suites.yaml platforms:

b300 (all shards; launch_b300.sh is single-node)

gb200 EP4 (CX_NODES<=1 -> run_in_container.sh)

gb300 EP4 (CX_NODES<=1 -> run_in_container.sh)

The h100-dgxc/h200-dgxc/b200-dgxc/mi325x/mi355x paths do not set CX_STAGE_DIR in this workflow (cx_stage_repo becomes a no-op) and are unaffected.

Concrete walk-through (b300 shard)

1. The setup job resolves the matrix and writes experimental/CollectiveX/results/.shard_b300-deepep.json on the checkout with e.g. 24 cases (varied phase/dtype/routing/eplb across ep-core-v1 + ep-routing-v1).

2. The sweep job on the b300 runner exports CX_SHARD_FILE=results/.shard_b300-deepep.json, checks out the repo, and calls launch_b300.sh.

3. launch_b300.sh:34 -> cx_stage_repo rsyncs to $CX_STAGE_DIR/job_<id>/experimental/CollectiveX/ with --exclude='results/' --delete-excluded. The shard file is not copied.

4. srun --container-workdir=$MOUNT_DIR/experimental/CollectiveX ... run_in_container.sh. cwd inside container = /ix/experimental/CollectiveX.

5. run_in_container.sh:458 tests [ -f "results/.shard_b300-deepep.json" ] -> that resolves to /ix/experimental/CollectiveX/results/.shard_b300-deepep.json -> missing.

6. Execution falls into the else branch at line 556+. It dispatches ${CX_BENCH} once with CX_MODE=normal, CX_PHASE=decode, CX_ROUTING=uniform, CX_DISPATCH_DTYPE=bf16, empty CX_CASE_ID, empty CX_SUITE, empty CX_WORKLOAD_NAME, empty CX_REQUIRED_PUBLICATION.

7. One result JSON is produced with no case_id and mismatched identity; the other 23 scheduled cases never run.

8. In the aggregate job, make_bundle.py's validate_expected_coverage computes missing_identity + missing + identity_mismatch against matrix_full.json and raises SystemExit(...). The whole aggregate fails, after b300 GPU time was spent on the wrong workload.

Impact

For every b300/gb200-EP4/gb300-EP4 shard promoted through v1 (three of the eight SKUs in ep-core-v1 + ep-routing-v1), the sweep silently runs one wrong-config default point instead of the scheduled N-case sweep. Bundle validation catches the divergence but only post-hoc, so the failure is loud yet wasteful: GPU allocations spent, aggregate job red, invalidating the v1 dataset this PR is producing.

Fix

Any one of:

1. Allow the shard file through the rsync in cx_stage_repo (runtime/common.sh:146):

rsync -a --delete --delete-excluded \
  --include='experimental/CollectiveX/results/' \
  --include='experimental/CollectiveX/results/.shard_*.json' \
  --exclude='__pycache__/' --exclude='results/' ...

2. Copy the shard file into the stage dir after the rsync completes:

# The rsync excludes results/, so recreate that directory in the stage dir before copying.
if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" ]; then
  mkdir -p "$(dirname "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE")"
  cp -a "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" \
    "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE"
fi

3. Mirror the rack (EP8) launcher workaround in run_in_container.sh:458:

sf="${CX_SHARD_FILE:-}"
# $CX_DIR is not set inside the container; use the fixed workdir instead.
[ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "/ix/experimental/CollectiveX/$sf" ] \
  && sf="/ix/experimental/CollectiveX/$sf"
if [ -n "$sf" ] && [ -f "$sf" ]; then
  ...

Approach (1) or (2) is the smallest change with the least surface area.
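
For reference, the fallback idea used by the rack launchers (try the path as given, then retry against a second root) can be sketched as below. `resolve_shard_file` and its `fallback_root` parameter are hypothetical names for illustration. Such a fallback only helps when the second root is a location where the shard file really exists, such as the original checkout; a root that points back into the staged tree would miss for the same reason as the first lookup.

```python
import os

def resolve_shard_file(shard_file, cwd, fallback_root):
    """Return a path to an existing shard file, or None if it cannot be found.

    Tries shard_file relative to cwd first, then relative to fallback_root.
    """
    if not shard_file:
        return None
    if os.path.isabs(shard_file):
        candidate = shard_file
    else:
        candidate = os.path.join(cwd, shard_file)
    if os.path.isfile(candidate):
        return candidate
    fallback = os.path.join(fallback_root, shard_file)
    if os.path.isfile(fallback):
        return fallback
    return None
```

Returning `None` rather than silently defaulting lets the caller decide whether a missing shard file is an error.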

claude · 2026-07-03T18:12:17Z

+    elif _run(["ibstat", "-l"]):
+        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]
+    return {


🟡 _rdma() calls _run(["ibstat", "-l"]) twice at env_capture.py:178-179 — once in the elif condition and once in the comprehension body. If the second invocation returns None (which _run does on shutil.which miss, TimeoutExpired/OSError, or nonzero exit), .splitlines() raises AttributeError and takes down env_capture.py under run_in_container.sh's set -euo pipefail. The trigger is genuinely rare (both calls are microseconds apart on a stable IB stack, and this branch runs only when ibv_devinfo is absent), so nit — but the fix is a one-line refactor mirroring the ibv_devinfo branch just above.

Extended reasoning...

The defect. env_capture._rdma() has an asymmetry between its two RDMA-listing branches:

listing = _run(["ibv_devinfo", "-l"])   # assigned once, iterated once
if listing:
    for line in listing.splitlines()[1:]:
        ...
elif _run(["ibstat", "-l"]):            # called once (as a truthiness check)
    devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]   # called AGAIN

The ibv_devinfo branch just above does the right thing: assign once, reuse. The ibstat branch does not.

Why the crash is theoretical but real. _run() returns None on any of: shutil.which(cmd[0]) failing (line 51), subprocess.TimeoutExpired/OSError (line 57), or out.returncode != 0 (line 59). If the first call returns a truthy string but the second returns None — a transient OS timer glitch, an OOM-killed helper, a stray nonzero exit under load — then None.splitlines() raises AttributeError. Under run_in_container.sh's set -euo pipefail (line 33), that aborts the whole shard step before any GPU benchmark runs.

Step-by-step proof of the theoretical crash path:

Node has ibstat in $PATH but no ibv_devinfo (a real config: MI355X-style stacks with ibstat only).

First call: _run(["ibstat", "-l"]) succeeds → returns "mlx5_0\nmlx5_1\n" → elif condition is truthy.

Second call: a transient nonzero exit (e.g. ibstat racing an IB-driver reload, timer wraparound, PID-namespace hiccup) → out.returncode != 0 → _run returns None.

None.splitlines() → AttributeError: 'NoneType' object has no attribute 'splitlines' → Python exits nonzero → set -e aborts run_in_container.sh → the shard step fails before GPU work.

Why this is nit, not normal. Every verifier converged on the same practical assessment: ibstat -l is a fast local device listing with no network/filesystem dependency, so a transient failure between two back-to-back calls (microseconds apart) is extremely improbable. The elif branch itself only runs when ibv_devinfo is absent, which is uncommon on the target runners since both binaries come from the same InfiniBand userspace stack. And env_capture.py produces a diagnostic/provenance artifact — even a genuine crash here would break provenance capture, not the benchmark measurement. The defect exists but doesn't justify blocking merge.

The fix. One-line refactor to mirror the ibv_devinfo branch:

else:
    listing = _run(["ibstat", "-l"])
    if listing:
        devices = [d.strip() for d in listing.splitlines() if d.strip()]

Same idiom the file uses immediately above. Eliminates the wasted subprocess call and the theoretical None-deref in one change. Worth doing as a follow-up cleanup, but the PR does not need to block for it.
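
The difference between the two shapes can be demonstrated without any InfiniBand tooling. In this sketch `make_flaky_run` is a hypothetical stand-in for `_run` that returns a listing on the first call and `None` on the second; the two device functions mirror the double-call and single-call patterns discussed above rather than the actual env_capture.py code.

```python
def make_flaky_run(outputs):
    """Return a fake _run that yields the given outputs on successive calls, then None."""
    pending = iter(outputs)

    def _run(cmd):
        return next(pending, None)

    return _run

def devices_double_call(_run):
    """Double-call pattern: one call for the truthiness check, another for the data."""
    devices = []
    if _run(["ibstat", "-l"]):
        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]
    return devices

def devices_single_call(_run):
    """Single-call pattern: assign once, then reuse the result."""
    listing = _run(["ibstat", "-l"])
    if listing:
        return [d.strip() for d in listing.splitlines() if d.strip()]
    return []
```

With outputs `["mlx5_0\nmlx5_1\n", None]`, the double-call version raises `AttributeError` on the second call, while the single-call version returns both device names.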

claude · 2026-07-03T18:12:19Z

+    "required_publication": env("CX_REQUIRED_PUBLICATION") or None,
+    "backend": backend,
+    "phase": phase,
+    "ep": integer("CX_EP", integer("CX_NGPUS", 1)),
+    "gpus_per_node": integer("CX_GPUS_PER_NODE", integer("CX_NGPUS", 1)),
+    "scale_up_domain": integer("CX_SCALE_UP_DOMAIN", integer("CX_NGPUS", 1)),
+    "dispatch_dtype": env("CX_DISPATCH_DTYPE", "bf16"),
+    "mode": env("CX_MODE", "normal"),
+    "contract": env("CX_MEASUREMENT_CONTRACT", "layout-and-dispatch-v1"),
+    "routing": env("CX_ROUTING", "uniform"),
+    "eplb": enabled("CX_EPLB"),
+    "combine_quant_mode": env("CX_COMBINE_QUANT_MODE", "none"),
+    "resource_mode": env("CX_RESOURCE_MODE", "tuned"),
+    "activation_profile": env("CX_ACTIVATION_PROFILE", "normal"),
+    "placement": env("CX_PLACEMENT", "packed"),
+    "routing_step": env("CX_ROUTING_STEP", "0"),
+    "uneven_tokens": env("CX_UNEVEN_TOKENS", "none"),
+    "tokens_ladder": env("CX_TOKENS_LADDER"),
+    "canonical": enabled("CX_CANONICAL"),
+    "sampling_contract": "fixed-512-v1",
+    "samples_per_point": integer("CX_SAMPLES_PER_POINT", 512),
+    "iters": integer("CX_ITERS", 8),
+    "trials": integer("CX_TRIALS", 64),
+    "warmup": integer("CX_WARMUP", 32),
+    "warmup_semantics": env(
+        "CX_WARMUP_SEMANTICS", "full-roundtrip-per-trial-point-v1"
+    ),


🟡 cx_emit_ep_failed_case (runtime/common.sh:256-287) builds failure.case without the hidden/topk/experts/nodes keys, but every matrix case emitted by sweep_matrix.py always carries all four. On the first sweep where any case exhausts its retries (flashinfer intermittent MNNVL, HybridEP/UCCL empty-rank, any deterministic rc=5), make_bundle's _identity_differences reports the same case_id four times as hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1, and validate_expected_coverage piles on by re-listing that case in missing, so the aggregate job aborts with a dual-report that hides the real signal (the case failed all retries — the intended fail-closed behavior). Fix in either place is fine: add the four fields to cx_emit_ep_failed_case from CX_HIDDEN/CX_TOPK/CX_EXPERTS (defaults 7168/8/256) and CX_NGPUS/SLURM_NNODES, or make _identity_differences skip these fields when the actual doc is a failed-case.

Extended reasoning...

The observed behavior

With the PR merged and any sweep that produces a failed-case record for a scheduled case, the aggregate job will fail with a message like:

bundle: expected-matrix coverage failed ( missing_identity=0 missing=['cxv1-...'] extra=[] duplicates=[] identity_mismatch=['cxv1-...:hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1'])

The same case_id appears in both missing and identity_mismatch, and the mismatch string names four fields that have nothing to do with why the case actually failed.

Step-by-step proof

Take a concrete promoted case, say h100-dgxc/deepep/decode under ep-core-v1 (uniform, canonical, deepseek-v3-v1 defaults). sweep_matrix.py:181-186 builds the matrix entry with:

{
  ...,
  "hidden": "",    # h==7168 -> "" sentinel
  "topk": "",      # t==8 -> ""
  "experts": "",   # e==256 -> ""
  "nodes": "1",    # always str
  ...
}

When every one of the 4 flashinfer attempts wedges on the intermittent MNNVL completion-flag deadlock (documented in run_in_container.sh around line 526), the last attempt's cx_emit_ep_failed_case writes a failed_*.json whose failure.case dict is missing the four keys entirely — the emitter reads CX_DISPATCH_DTYPE/CX_MODE/etc. but has no CX_HIDDEN/CX_TOPK/CX_EXPERTS/SLURM_NNODES reads.

aggregate_results.py keeps that failed-case doc as the newest for that case_id. Then make_bundle.py runs validate_expected_coverage:

_expected_case_identity(matrix_case) — "hidden" in case is true (value ""), so identity["hidden"] = int("" or 7168) = 7168. Same for topk/experts (8/256). "nodes" in case is true, identity["nodes"] = int("1") = 1. Expected identity contains {hidden: 7168, topk: 8, experts: 256, nodes: 1, ...}.

_actual_case_identity(failed_doc) (the failed-case branch, line 184-195) copies failure.case verbatim, calls _expected_case_identity. None of hidden/topk/experts/nodes are in that dict, so the if field in case: guard skips all four. Actual identity contains everything except the four scheduled shape fields.

_identity_differences iterates the expected identity's items; actual_identity.get("hidden") is None, None != 7168 -> hidden=None!=7168. Same for the other three.

validate_expected_coverage (line 294-298) hits the differences branch, appends the case_id to identity_mismatch, and does not add it to actual{}. Then missing = set(expected) - set(actual) (line 301) also contains that case_id. Line 319 raises the dual-report SystemExit.

validate_results.py:validate_doc's failed-case schema (v5, ~lines 234-243) requires a different, smaller field set that happens to match what the emitter writes, so it stays silent about this desync. Only make_bundle notices, and only in a way that obscures the real cause.
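
The mismatch string can be reproduced with a small model of the comparison. The functions below are simplified stand-ins written from the description above, not the code in make_bundle.py; `case_identity`, `identity_differences`, and the `DEFAULTS` table are assumptions for illustration.

```python
DEFAULTS = {"hidden": 7168, "topk": 8, "experts": 256}

def case_identity(case):
    """Build an identity dict, expanding "" sentinels to defaults; absent keys are skipped."""
    identity = {}
    for field, default in DEFAULTS.items():
        if field in case:
            identity[field] = int(case[field] or default)
    if "nodes" in case:
        identity["nodes"] = int(case["nodes"])
    return identity

def identity_differences(expected, actual):
    """List every expected field whose actual value differs (None when missing)."""
    return [
        f"{field}={actual.get(field)}!={value}"
        for field, value in expected.items()
        if actual.get(field) != value
    ]

matrix_case = {"hidden": "", "topk": "", "experts": "", "nodes": "1"}
failed_case = {}  # the failed-case record carries none of the four keys

diffs = identity_differences(case_identity(matrix_case), case_identity(failed_case))
report = ",".join(diffs)
```

The sentinel expansion only runs when a key is present, so an absent key and an empty-string key take different paths, which is what produces the four `None` mismatches.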

Why this fires in practice

The PR explicitly builds in retry logic — CX_FLASHINFER_RETRIES defaults to 3 attempts, and both the container and rack launchers loop attempts and preserve a failed_*.json when all attempts fail. Retry-exhaustion is expected behavior for known intermittents, but the aggregate step will now report those as identity_mismatch + missing for hidden/topk/experts/nodes — the least informative signal possible.

Impact

Bundle validation still correctly rejects the incomplete run (the intended fail-closed behavior), and no incorrect data ships, so this is a diagnostic-clarity regression rather than a correctness bug. It will, however, cost real triage time in CI: an operator staring at hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1 will not obviously infer "one flashinfer case exhausted its retries."

Fix

Either add the four fields to cx_emit_ep_failed_case (read CX_HIDDEN/CX_TOPK/CX_EXPERTS with defaults 7168/8/256, and CX_NGPUS/SLURM_NNODES for nodes), or teach _identity_differences/_actual_case_identity to drop these fields when the actual doc is a failed-case. Either way the two validators stay in sync.
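
A sketch of the emitter-side option follows. The real emitter is a shell function, so this Python version only illustrates the defaulting logic; `shape_fields` is a hypothetical helper, and reading `nodes` from SLURM_NNODES alone with a default of 1 is an assumption, since the exact derivation from CX_NGPUS/SLURM_NNODES is not spelled out here.

```python
def shape_fields(environ):
    """Return the four shape keys for failure.case, using defaults for unset variables."""
    def integer(name, default):
        raw = environ.get(name, "")
        return int(raw) if raw.strip() else default

    return {
        "hidden": integer("CX_HIDDEN", 7168),
        "topk": integer("CX_TOPK", 8),
        "experts": integer("CX_EXPERTS", 256),
        "nodes": integer("SLURM_NNODES", 1),  # assumption: node count taken from Slurm
    }

default_fields = shape_fields({})
custom_fields = shape_fields({"CX_HIDDEN": "4096", "SLURM_NNODES": "2"})
```

In real use the mapping passed in would be `os.environ`; merging the result into the failed-case record gives the coverage check the same four keys the matrix entry carries.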

Freeze the 37-shard cross-vendor EP matrix at 360 requested cases and 840 points on one 32-warmup, 512-observation protocol. Add native correctness, closed provenance, three-allocation promotion gates, and an isolated content-addressed filesystem publisher.

Close defects exposed by rejected allocations: isolate AMD Enroot state; correct MoRI output shape and unweighted combine semantics; standardize activation-only combine across every adapter; stage pinned DeepEP sources before compute allocation; authenticate reusable build outputs; normalize Hybrid enum identity; query loaded NCCL/RCCL runtimes; and harden cleanup and failure classification.

Harden B300 cache identity with a private mount sentinel and root-relative ownership checks, isolate DeepEP V2 JIT output per shard, and keep PR #605 with the official PR #630 scale-up fix. Mark only the current H100 runner pool's V2 cases unsupported until NCCL Device API symmetric memory is available, retain other H100 coverage, restore the production-only workflow, and classify detailed GB failures without publishing raw logs.

Chinese (translated): Complete the isolated CollectiveX v1 expert-parallel benchmark suite. Freeze the cross-vendor matrix of 37 runnable shards, 360 requested cases, and 840 points on a uniform 32 warmups and 512 observations, and add native correctness checks, strict provenance, a promotion gate of three independent allocations, and a local content-addressed filesystem publisher. Fix the problems exposed by rejected allocations: isolate AMD Enroot state; correct MoRI output shape and unweighted combine semantics; unify the activation-only combine boundary across all adapters; stage pinned DeepEP sources before compute-node allocation; verify reusable build outputs; normalize Hybrid enum identity; read versions from the NCCL/RCCL runtime libraries actually loaded; and harden cleanup and failure classification. Harden B300 cache identity with a private mount sentinel and ownership checks relative to the cache root, and isolate DeepEP V2 JIT output to a single shard. DeepEP V2 keeps the PR #605 implementation and pins the official pure scale-up fix from PR #630; mark the current H100 runner pool's V2 cases unsupported only while that pool lacks NCCL Device API symmetric memory, retaining other H100 coverage; also restore the production-only workflow, and refine GB failure classification without publishing raw logs.

Oseltamivir requested a review from a team July 3, 2026 17:38

github-project-automation Bot added this to InferenceMAX Board Jul 3, 2026

claude Bot reviewed Jul 3, 2026

Oseltamivir force-pushed the collectivex branch 4 times, most recently from 758fa52 to 1c5b901 Compare July 4, 2026 01:11

github-advanced-security AI found potential problems Jul 4, 2026

Comment thread experimental/CollectiveX/tests/test_sampling_contract.py Fixed

Oseltamivir force-pushed the collectivex branch 3 times, most recently from 7e5f80a to 28cbac4 Compare July 4, 2026 03:21

functionstackx changed the title ~~CollectiveX v1: cross-vendor EP benchmark suite~~ CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1：跨厂商 EP 基准测试套件 Jul 4, 2026

Oseltamivir force-pushed the collectivex branch from 28cbac4 to aa318f7 Compare July 4, 2026 06:58

Oseltamivir force-pushed the collectivex branch 14 times, most recently from 57efb35 to 4ff5841 Compare July 4, 2026 13:42

Oseltamivir force-pushed the collectivex branch from 4ff5841 to bffd8d6 Compare July 4, 2026 15:05

Oseltamivir wants to merge 1 commit into main from collectivex
