Finalize CollectiveX v1 cross-vendor EP benchmark suite / 完成 CollectiveX v1 跨厂商 EP 基准测试套件 #2004

Open. Oseltamivir wants to merge 1 commit into main from collectivex.

Oseltamivir (Collaborator) commented Jul 3, 2026

Summary

Finalizes the isolated CollectiveX v1 expert-parallel communication benchmark under
experimental/CollectiveX/. The branch is ready for three complete no-canary qualification runs;
it does not claim or include promoted v1 results yet.

Benchmark contract

  • Covers H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X across DeepEP V1,
    DeepEP V2 (PR #605 plus PR #630's scale-up fix), DeepEP Hybrid, UCCL, MoRI, and the
    NCCL/RCCL reference.
  • Resolves 37 runnable shards and 360 requested cases / 840 points: 222 runnable cases / 518
    points plus 138 planned-unsupported cases / 322 points.
  • Uses 8 timed iterations x 64 trials and 32 synchronized full-roundtrip warmups before every
    measured component, yielding exactly 512 observations per point with nearest-rank percentiles.
  • Standardizes combine as activation-only rank-sum. Dispatch weights remain oracle-checked.
  • Keeps uniform routing as the headline; Zipf and Zipf+EPLB are experimental sensitivity evidence.
  • Marks only the current H100 runner pool's six DeepEP V2 cases unsupported. Other H100 backends
    remain runnable; V2 returns after that pool proves all-rank CUDA P2P/VMM and one full-world LSA team.
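
The sampling contract above (8 timed iterations x 64 trials, nearest-rank percentiles) can be illustrated with a minimal sketch. This is not the suite's code: `nearest_rank` and the synthetic latencies are hypothetical names and data, and only the 8 x 64 = 512 arithmetic and the nearest-rank definition come from the contract.

```python
import math

ITERS, TRIALS = 8, 64  # from the benchmark contract: 8 timed iterations x 64 trials


def nearest_rank(samples, pct):
    """Nearest-rank percentile: the sorted value at 1-based rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Synthetic stand-in for one point's observations: 1.0, 2.0, ..., 512.0 microseconds.
latencies_us = [float(i) for i in range(1, ITERS * TRIALS + 1)]
p50 = nearest_rank(latencies_us, 50)  # rank 256 of 512
p99 = nearest_rank(latencies_us, 99)  # rank ceil(506.88) = 507 of 512
```

Nearest-rank always returns an observed sample rather than an interpolated value, so with a fixed 512 observations p50 and p99 are always the 256th and 507th sorted samples.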

Qualification fixes

  • Adds deterministic native correctness for payload, routing, multiplicity, counts, weights,
    combine values, and input immutability on every rank.
  • Binds image/squash bytes, pinned source trees, imported binaries, loaded NCCL/RCCL runtimes,
    runtime topology, and generated-kernel evidence.
  • Hardens B300 cache identity with a private random mount sentinel, root-relative owner/mode checks,
    and immutable completed markers; V2 JIT output is isolated per shard.
  • Pins DeepEP V2 PR #605 with PR #630's minimal pure-scale-up fix, disables GIN only for declared
    scale-up cases, and requires NCCL's realized LSA team to cover the full EP world.
  • Preserves raw logs privately while exporting only a closed failure category, including detailed
    NCCL/topology/JIT classification for GB two-tray jobs.

Artifact architecture

GitHub artifacts are transient delivery inputs to an owner-only, content-addressed local filesystem
publisher. Promotion requires exactly three complete independent runs from one source SHA, exact
coverage, homogeneous build/runtime identity, stable p50/p99 evidence and ordering, and every
controlled cohort. No managed database, object store, or third-party result hosting is introduced.
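
A rough sketch of the structural part of that promotion gate, assuming a hypothetical `Run` record. The field names and the `promotion_blockers` helper are illustrative only and are not taken from the publisher; the p50/p99 stability and cohort checks are omitted because their thresholds are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:  # hypothetical record; field names are illustrative
    source_sha: str
    build_identity: str
    case_ids: frozenset


def promotion_blockers(runs, expected_case_ids):
    """Return the reasons a run set cannot be promoted (empty list means eligible)."""
    blockers = []
    if len(runs) != 3:
        blockers.append(f"need exactly 3 runs, got {len(runs)}")
    if len({r.source_sha for r in runs}) > 1:
        blockers.append("runs come from different source SHAs")
    if len({r.build_identity for r in runs}) > 1:
        blockers.append("build/runtime identity is not homogeneous")
    for i, r in enumerate(runs):
        if r.case_ids != expected_case_ids:
            blockers.append(f"run {i} does not cover the expected cases exactly")
    return blockers
```

Returning every blocker instead of stopping at the first keeps the gate fail-closed while still telling an operator everything that needs fixing in one pass.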

The tracked tree and all reachable refs contain none of the six private runner endpoint literals.
platforms.yaml, local goals/notes, raw logs, and result stores are ignored and untracked.

Validation

  • 132 Python contract/unit tests.
  • Matrix SHA-256: 17ebafaa4f704e6d309d05f1fa7c44c66d60166b15a1cda8c29905ee39b536c5.
  • Case-catalog SHA-256: 3b223fef491c79cfd4eef32ac8cef288d2fa35f3051f3c089b6c9cc09e2fe36f.
  • Independent regeneration confirmed all counts, promotion cohorts (48 library / 12 system / 74
    routing), and uniform 8:64:32 / 512-sample / warmup semantics across all 360 cases.
  • Actionlint, bash -n, ShellCheck, git diff --check, bilingual documentation parity, and exact
    endpoint scans across the tracked tree and all reachable Git refs.

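Chinese description (translated)

This PR completes the isolated CollectiveX v1 expert-parallel (EP) communication benchmark under
experimental/CollectiveX/. The branch is ready for three complete, no-canary qualification rounds;
it does not yet claim or submit any promoted v1 results.

Benchmark contract

  • Covers H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X; backends include DeepEP V1,
    DeepEP V2 (PR #605 plus PR #630's scale-up fix), DeepEP Hybrid, UCCL, MoRI, and the NCCL/RCCL
    reference implementation.
  • Generates 37 runnable shards for 360 requested cases / 840 points in total: 222 runnable cases /
    518 points, plus 138 planned-unsupported cases / 322 points.
  • Every measured component runs 32 synchronized full-roundtrip warmups, then 8 timed iterations x
    64 trials; each point yields exactly 512 observations, with nearest-rank percentiles.
  • Combine is standardized as activation-only rank-sum on every backend; dispatch weights are still
    checked by the oracle.
  • Uniform routing is the headline result; Zipf and Zipf+EPLB serve only as experimental sensitivity evidence.
  • Marks only the current H100 runner pool's 6 DeepEP V2 cases unsupported; other H100 backends remain
    runnable. V2 can return once that pool proves all-rank CUDA P2P/VMM and an LSA team covering the whole world.

Qualification fixes

  • Runs deterministic native correctness checks on every rank for payload, routing, multiplicity,
    counts, weights, combine values, and input immutability.
  • Provenance binds image/squash contents, pinned source trees, imported binaries, the NCCL/RCCL
    runtimes actually loaded, runtime topology, and generated-kernel evidence.
  • Hardens B300 cache identity with a private random mount sentinel, owner/mode checks relative to
    the cache root, and immutable completion markers; V2 JIT output is isolated per shard.
  • DeepEP V2 is pinned to PR #605 with PR #630's minimal pure-scale-up fix; GIN is disabled only for
    declared scale-up cases, and the LSA team that NCCL actually establishes must cover the whole EP world.
  • Raw logs are kept only in a private directory, and only a closed failure category is exported;
    GB two-tray jobs can still distinguish NCCL, topology, and JIT failures without publishing raw logs.

Artifact architecture

GitHub artifacts serve only as transient transfer inputs that end up in an owner-only, local
content-addressed filesystem publisher. Promotion is allowed only when three complete independent
runs from the same source SHA simultaneously satisfy exact coverage, uniform build/runtime identity,
p50/p99 stability, ordering stability, and all controlled cohorts. No managed database, object
store, or third-party result hosting service is introduced.

The tracked files and all reachable Git refs contain none of the 6 private runner endpoint literals.
platforms.yaml, local goals/notes, raw logs, and result stores are all ignored and untracked by Git.

Validation

  • 132 Python contract/unit tests.
  • Matrix SHA-256: 17ebafaa4f704e6d309d05f1fa7c44c66d60166b15a1cda8c29905ee39b536c5.
  • Case-catalog SHA-256: 3b223fef491c79cfd4eef32ac8cef288d2fa35f3051f3c089b6c9cc09e2fe36f.
  • Independent regeneration confirmed all counts, the promotion cohorts (48 library / 12 system /
    74 routing), and that all 360 cases use 8:64:32, 512 samples, and the same warmup semantics.
  • Actionlint, bash -n, ShellCheck, git diff --check, Chinese/English documentation parity, and exact
    private-endpoint scans across the tracked files and all reachable Git refs.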

Comment on lines +145 to +150
rsync -a --delete --delete-excluded \
  --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
  --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
  --exclude='goal.md' --exclude='notes.md' \
  "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/" >/dev/null 2>&1 \
  || cx_die "staging CollectiveX failed"

Contributor


🔴 The setup step writes the shard JSON to experimental/CollectiveX/results/.shard_${matrix.id}.json and sets CX_SHARD_FILE=results/.shard_${matrix.id}.json (relative), but cx_stage_repo (runtime/common.sh:145-150) rsyncs the CollectiveX tree with --exclude='results/' --delete-excluded and drops the shard file — so for every staged single-tray SKU (b300 always; gb200/gb300 with EP4 via CX_NODES<=1), the [ -f "$CX_SHARD_FILE" ] guard at run_in_container.sh:458 fails and execution falls into the single-bench else branch (line 556+), silently running one wrong-config default (uniform/decode/bf16, empty case_id) instead of the shard's N scheduled cases. Downstream make_bundle will catch this via missing_identity/coverage but only after GPU allocation was spent on the wrong workload. Cheap fix: allow-list the shard file through the rsync (--include='experimental/CollectiveX/results/' --include='experimental/CollectiveX/results/.shard_*.json' before the results/ exclude), copy the shard file into the stage dir after the rsync, or resolve CX_SHARD_FILE against the original repo root in run_in_container.sh's SHARD guard the way the rack (EP8) launchers already do (see launch_gb300-nv.sh:92-93 / launch_gb200-nv.sh cx_ep_cases).

Extended reasoning...

The bug

The sweep workflow's shard-fanout step writes the resolved case list to experimental/CollectiveX/results/.shard_${matrix.id}.json:

# .github/workflows/collectivex-sweep.yml
env:
  CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json   # RELATIVE path
...
- name: Extract shard from matrix artifact
  working-directory: experimental/CollectiveX
  run: |
    ...
    json.dump({...,'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))

The physical file therefore lands at $REPO/experimental/CollectiveX/results/.shard_<id>.json, and CX_SHARD_FILE=results/.shard_<id>.json is interpreted relative to the container's cwd, which is /ix/experimental/CollectiveX.

For every SKU that requires CX_STAGE_DIR (b300 always; gb200/gb300 with EP4 via the CX_NODES<=1 delegate path in launch_gb200-nv.sh:57 / launch_gb300-nv.sh:47), the launcher calls:

# launch_b300.sh:34, launch_gb200-nv.sh:52, launch_gb300-nv.sh:24
MOUNT_SRC="$(cx_stage_repo "$REPO_ROOT" "$CX_STAGE_DIR")"

which rsyncs the tree with an exclude that drops results/:

# experimental/CollectiveX/runtime/common.sh:145-150
rsync -a --delete --delete-excluded \
  --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
  --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
  --exclude='goal.md' --exclude='notes.md' \
  "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/"

Both --exclude='results/' and --delete-excluded guarantee that the shard file the workflow just wrote is missing from the stage dir.

The consequence at runtime

The container mounts $MOUNT_SRC:/ix, cwd=/ix/experimental/CollectiveX. Inside run_in_container.sh, the SHARD guard resolves CX_SHARD_FILE relative to that cwd:

# runtime/run_in_container.sh:458
if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "${CX_SHARD_FILE:-/nonexistent}" ]; then
  # SHARD mode — sweep every scheduled case
  ...
else
  # Single-bench (workflow_dispatch) path
  # uses ${CX_MODE:-normal}, ${CX_PHASE:-decode}, ${CX_ROUTING:-uniform},
  # ${CX_DISPATCH_DTYPE:-bf16}, empty CX_CASE_ID/CX_SUITE/CX_WORKLOAD_NAME, ...

The file resolves to /ix/experimental/CollectiveX/results/.shard_<id>.json, which is missing because rsync excluded it, so the test fails and the else branch runs a single default case with none of the shard's identity. That is one case instead of the shard's N scheduled cases.

Why the rack (EP8) paths escape

The rack-scale launchers iterate cases themselves in the launcher on the SUBMIT host (not inside the container). Their case-list helpers explicitly resolve the shard file against the original checkout when the relative path misses:

# launch_gb300-nv.sh cx_ep8_cases (and launch_gb200-nv.sh cx_ep_cases)
local sf="${CX_SHARD_FILE:-}"
[ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "$CX_DIR/$sf" ] && sf="$CX_DIR/$sf"

The same workaround is absent from run_in_container.sh:458, so the EP4 single-tray path — which shares the b300/gb200-EP4/gb300-EP4 launchers with the staged mount — hits the missing file.

Affected sweeps

Every single-tray staged shard in the v1 promoted matrix, per sweep_matrix.py + configs/suites.yaml platforms:

  • b300 (all shards; launch_b300.sh is single-node)
  • gb200 EP4 (CX_NODES<=1 -> run_in_container.sh)
  • gb300 EP4 (CX_NODES<=1 -> run_in_container.sh)

The h100-dgxc/h200-dgxc/b200-dgxc/mi325x/mi355x paths do not set CX_STAGE_DIR in this workflow (cx_stage_repo becomes a no-op) and are unaffected.

Concrete walk-through (b300 shard)

  1. Setup job resolves matrix; writes experimental/CollectiveX/results/.shard_b300-deepep.json on the checkout with e.g. 24 cases (varied phase/dtype/routing/eplb across ep-core-v1 + ep-routing-v1).
  2. Sweep job on the b300 runner exports CX_SHARD_FILE=results/.shard_b300-deepep.json, checks out the repo, and calls launch_b300.sh.
  3. launch_b300.sh:34 -> cx_stage_repo rsyncs to $CX_STAGE_DIR/job_<id>/experimental/CollectiveX/ with --exclude='results/' --delete-excluded. The shard file is not copied.
  4. srun --container-workdir=$MOUNT_DIR/experimental/CollectiveX ... run_in_container.sh. cwd inside container = /ix/experimental/CollectiveX.
  5. run_in_container.sh:458 tests [ -f "results/.shard_b300-deepep.json" ] -> that resolves to /ix/experimental/CollectiveX/results/.shard_b300-deepep.json -> missing.
  6. Execution falls into the else branch at line 556+. It dispatches ${CX_BENCH} once with CX_MODE=normal, CX_PHASE=decode, CX_ROUTING=uniform, CX_DISPATCH_DTYPE=bf16, empty CX_CASE_ID, empty CX_SUITE, empty CX_WORKLOAD_NAME, empty CX_REQUIRED_PUBLICATION.
  7. One result JSON is produced with no case_id and mismatched identity; the other 23 scheduled cases never run.
  8. Aggregate job's make_bundle.py validate_expected_coverage computes missing_identity + missing + identity_mismatch against matrix_full.json and raises SystemExit(...), so the whole aggregate fails after b300 GPU time was spent on the wrong workload.

Impact

For every b300/gb200-EP4/gb300-EP4 shard promoted through v1 (three of the eight SKUs in ep-core-v1 + ep-routing-v1), the sweep silently runs one wrong-config default point instead of the scheduled N-case sweep. Bundle validation catches the divergence but only post-hoc, so the failure is loud yet wasteful: GPU allocations spent, aggregate job red, invalidating the v1 dataset this PR is producing.

Fix

Any one of:

  1. Allow the shard file through the rsync in cx_stage_repo (runtime/common.sh:146):

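    # rsync matches filter patterns against paths relative to the transfer root (CollectiveX/...),
    # and results/* keeps everything else under results/ excluded once the directory is included.
    rsync -a --delete --delete-excluded \
      --include='results/' --include='results/.shard_*.json' --exclude='results/*' \
      --exclude='__pycache__/' --exclude='results/' ...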
  2. Copy the shard file into the stage dir after the rsync completes:

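    # results/ is excluded from the rsync, so create it in the stage dir before copying.
    [ -n "${CX_SHARD_FILE:-}" ] && [ -f "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" ] \
      && mkdir -p "$stage_dir/experimental/CollectiveX/$(dirname "$CX_SHARD_FILE")" \
      && cp -a "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" \
              "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE"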
  3. Mirror the rack (EP8) launcher workaround in run_in_container.sh:458:

    sf="${CX_SHARD_FILE:-}"
    # $CX_DIR is not set inside the container; use the fixed workdir instead.
    [ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "/ix/experimental/CollectiveX/$sf" ] \
      && sf="/ix/experimental/CollectiveX/$sf"
    if [ -n "$sf" ] && [ -f "$sf" ]; then ...

Approach (1) or (2) is the smallest change with the least surface area.
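
Independently of which fix is chosen, the failure mode described above comes from treating "shard file set but missing" the same as "no shard requested". A minimal sketch of a guard that separates the two cases follows. It is written in Python for illustration only (the real guard lives in the shell script run_in_container.sh), and `resolve_shard_file` and its `search_roots` parameter are hypothetical names, not part of the PR.

```python
from pathlib import Path


def resolve_shard_file(shard_file, search_roots):
    """Return the first existing path for shard_file, or None if no shard was requested.

    Raises FileNotFoundError when a shard was requested but cannot be found,
    so the caller can fail before launching any benchmark.
    """
    if not shard_file:
        return None  # no shard requested: single-bench mode is legitimate
    # Try the path as given (relative to cwd), then under each fallback root.
    candidates = [Path(shard_file)] + [Path(root) / shard_file for root in search_roots]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"shard file {shard_file!r} not found in {candidates}")
```

With a guard of this shape, a staged run whose shard file was dropped would abort before GPU work starts instead of falling through to the single-bench defaults.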

Comment thread experimental/CollectiveX/env_capture.py Outdated
Comment on lines +178 to +180
elif _run(["ibstat", "-l"]):
devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]
return {

Contributor


🟡 _rdma() calls _run(["ibstat", "-l"]) twice at env_capture.py:178-179 — once in the elif condition and once in the comprehension body. If the second invocation returns None (which _run does on shutil.which miss, TimeoutExpired/OSError, or nonzero exit), .splitlines() raises AttributeError and takes down env_capture.py under run_in_container.sh's set -euo pipefail. The trigger is genuinely rare (both calls are microseconds apart on a stable IB stack, and this branch runs only when ibv_devinfo is absent), so nit — but the fix is a one-line refactor mirroring the ibv_devinfo branch just above.

Extended reasoning...

The defect. env_capture._rdma() has an asymmetry between its two RDMA-listing branches:

listing = _run(["ibv_devinfo", "-l"])   # assigned once, iterated once
if listing:
    for line in listing.splitlines()[1:]:
        ...
elif _run(["ibstat", "-l"]):             # called once (as a truthiness check)
    devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]  # called AGAIN

The ibv_devinfo branch just above does the right thing: assign once, reuse. The ibstat branch does not.

Why the crash is theoretical but real. _run() returns None on any of: shutil.which(cmd[0]) failing (line 51), subprocess.TimeoutExpired/OSError (line 57), or out.returncode != 0 (line 59). If the first call returns a truthy string but the second returns None — a transient OS timer glitch, an OOM-killed helper, a stray nonzero exit under load — then None.splitlines() raises AttributeError. Under run_in_container.sh's set -euo pipefail (line 33), that aborts the whole shard step before any GPU benchmark runs.

Step-by-step proof of the theoretical crash path:

  1. Node has ibstat in $PATH but no ibv_devinfo (a real config: MI355X-style stacks with ibstat only).
  2. First call: _run(["ibstat", "-l"]) succeeds → returns "mlx5_0\nmlx5_1\n" → elif condition is truthy.
  3. Second call: a transient nonzero exit (e.g. ibstat racing an IB-driver reload, timer wraparound, PID-namespace hiccup) → out.returncode != 0 → _run returns None.
  4. None.splitlines() → AttributeError: 'NoneType' object has no attribute 'splitlines' → Python exits nonzero → set -e aborts run_in_container.sh → the shard step fails before GPU work.

Why this is nit, not normal. Every verifier converged on the same practical assessment: ibstat -l is a fast local device listing with no network/filesystem dependency, so a transient failure between two back-to-back calls (microseconds apart) is extremely improbable. The elif branch itself only runs when ibv_devinfo is absent, which is uncommon on the target runners since both binaries come from the same InfiniBand userspace stack. And env_capture.py produces a diagnostic/provenance artifact — even a genuine crash here would break provenance capture, not the benchmark measurement. The defect exists but doesn't justify blocking merge.

The fix. One-line refactor to mirror the ibv_devinfo branch:

else:
    listing = _run(["ibstat", "-l"])
    if listing:
        devices = [d.strip() for d in listing.splitlines() if d.strip()]

Same idiom the file uses immediately above. Eliminates the wasted subprocess call and the theoretical None-deref in one change. Worth doing as a follow-up cleanup, but the PR does not need to block for it.

Comment on lines +260 to +286
"required_publication": env("CX_REQUIRED_PUBLICATION") or None,
"backend": backend,
"phase": phase,
"ep": integer("CX_EP", integer("CX_NGPUS", 1)),
"gpus_per_node": integer("CX_GPUS_PER_NODE", integer("CX_NGPUS", 1)),
"scale_up_domain": integer("CX_SCALE_UP_DOMAIN", integer("CX_NGPUS", 1)),
"dispatch_dtype": env("CX_DISPATCH_DTYPE", "bf16"),
"mode": env("CX_MODE", "normal"),
"contract": env("CX_MEASUREMENT_CONTRACT", "layout-and-dispatch-v1"),
"routing": env("CX_ROUTING", "uniform"),
"eplb": enabled("CX_EPLB"),
"combine_quant_mode": env("CX_COMBINE_QUANT_MODE", "none"),
"resource_mode": env("CX_RESOURCE_MODE", "tuned"),
"activation_profile": env("CX_ACTIVATION_PROFILE", "normal"),
"placement": env("CX_PLACEMENT", "packed"),
"routing_step": env("CX_ROUTING_STEP", "0"),
"uneven_tokens": env("CX_UNEVEN_TOKENS", "none"),
"tokens_ladder": env("CX_TOKENS_LADDER"),
"canonical": enabled("CX_CANONICAL"),
"sampling_contract": "fixed-512-v1",
"samples_per_point": integer("CX_SAMPLES_PER_POINT", 512),
"iters": integer("CX_ITERS", 8),
"trials": integer("CX_TRIALS", 64),
"warmup": integer("CX_WARMUP", 32),
"warmup_semantics": env(
"CX_WARMUP_SEMANTICS", "full-roundtrip-per-trial-point-v1"
),

Contributor


🟡 cx_emit_ep_failed_case (runtime/common.sh:256-287) builds failure.case without the hidden/topk/experts/nodes keys, but every matrix case emitted by sweep_matrix.py always carries all four. On the first sweep where any case exhausts its retries (flashinfer intermittent MNNVL, HybridEP/UCCL empty-rank, any deterministic rc=5), make_bundle's _identity_differences reports the same case_id four times as hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1, and validate_expected_coverage piles on by re-listing that case in missing, so the aggregate job aborts with a dual-report that hides the real signal (the case failed all retries — the intended fail-closed behavior). Fix in either place is fine: add the four fields to cx_emit_ep_failed_case from CX_HIDDEN/CX_TOPK/CX_EXPERTS (defaults 7168/8/256) and CX_NGPUS/SLURM_NNODES, or make _identity_differences skip these fields when the actual doc is a failed-case.

Extended reasoning...

The observed behavior

With the PR merged and any sweep that produces a failed-case record for a scheduled case, the aggregate job will fail with a message like:

bundle: expected-matrix coverage failed (
  missing_identity=0 missing=['cxv1-...'] extra=[] duplicates=[]
  identity_mismatch=['cxv1-...:hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1'])

The same case_id appears in both missing and identity_mismatch, and the mismatch string names four fields that have nothing to do with why the case actually failed.

Step-by-step proof

Take a concrete promoted case, say h100-dgxc/deepep/decode under ep-core-v1 (uniform, canonical, deepseek-v3-v1 defaults). sweep_matrix.py:181-186 builds the matrix entry with:

{
  ...,
  "hidden": "",     # h==7168 -> "" sentinel
  "topk": "",       # t==8    -> ""
  "experts": "",    # e==256  -> ""
  "nodes": "1",     # always str
  ...
}

When every one of the 4 flashinfer attempts wedges on the intermittent MNNVL completion-flag deadlock (documented in run_in_container.sh around line 526), the last attempt's cx_emit_ep_failed_case writes a failed_*.json whose failure.case dict is missing the four keys entirely — the emitter reads CX_DISPATCH_DTYPE/CX_MODE/etc. but has no CX_HIDDEN/CX_TOPK/CX_EXPERTS/SLURM_NNODES reads.

aggregate_results.py keeps that failed-case doc as the newest for that case_id. Then make_bundle.py runs validate_expected_coverage:

  1. _expected_case_identity(matrix_case): "hidden" in case is true (value ""), so identity["hidden"] = int("" or 7168) = 7168. Same for topk/experts (8/256). "nodes" in case is true, so identity["nodes"] = int("1") = 1. Expected identity contains {hidden: 7168, topk: 8, experts: 256, nodes: 1, ...}.
  2. _actual_case_identity(failed_doc) (the failed-case branch, line 184-195) copies failure.case verbatim, calls _expected_case_identity. None of hidden/topk/experts/nodes are in that dict, so the if field in case: guard skips all four. Actual identity contains everything except the four scheduled shape fields.
  3. _identity_differences iterates the expected identity's items; actual_identity.get("hidden") is None, None != 7168 -> hidden=None!=7168. Same for the other three.
  4. validate_expected_coverage (line 294-298) hits the differences branch, appends the case_id to identity_mismatch, and does not add it to actual{}. Then missing = set(expected) - set(actual) (line 301) also contains that case_id. Line 319 raises the dual-report SystemExit.

validate_results.py:validate_doc's failed-case schema (v5, ~lines 234-243) requires a different, smaller field set that happens to match what the emitter writes, so it stays silent about this desync. Only make_bundle notices, and only in a way that obscures the real cause.

Why this fires in practice

The PR explicitly builds in retry logic — CX_FLASHINFER_RETRIES defaults to 3 attempts, and both the container and rack launchers loop attempts and preserve a failed_*.json when all attempts fail. Retry-exhaustion is expected behavior for known intermittents, but the aggregate step will now report those as identity_mismatch + missing for hidden/topk/experts/nodes — the least informative signal possible.

Impact

Bundle validation still correctly rejects the incomplete run (the intended fail-closed behavior), and no incorrect data ships, so this is a diagnostic-clarity regression rather than a correctness bug. It will, however, cost real triage time in CI: an operator staring at hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1 will not obviously infer "one flashinfer case exhausted its retries."

Fix

Either add the four fields to cx_emit_ep_failed_case (read CX_HIDDEN/CX_TOPK/CX_EXPERTS with defaults 7168/8/256, and CX_NGPUS/SLURM_NNODES for nodes), or teach _identity_differences/_actual_case_identity to drop these fields when the actual doc is a failed-case. Either way the two validators stay in sync.
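
The mismatch can be reproduced in miniature. The sketch below is a simplified reconstruction of the behavior described in the step-by-step proof, not the code in make_bundle.py; `expected_identity` and `identity_differences` are hypothetical stand-ins that handle only the four shape fields.

```python
SHAPE_DEFAULTS = {"hidden": 7168, "topk": 8, "experts": 256}  # defaults quoted above


def expected_identity(case):
    """Normalize shape fields: "" means "use the default", absent keys are skipped."""
    identity = {}
    for field, default in SHAPE_DEFAULTS.items():
        if field in case:
            identity[field] = int(case[field] or default)
    if "nodes" in case:
        identity["nodes"] = int(case["nodes"])
    return identity


def identity_differences(expected, actual):
    """List the fields whose actual value differs from the expected one."""
    return [f"{k}={actual.get(k)}!={v}" for k, v in expected.items() if actual.get(k) != v]


matrix_case = {"hidden": "", "topk": "", "experts": "", "nodes": "1"}
failed_case = {}  # failure.case as emitted today: no shape fields at all
patched_case = {"hidden": "7168", "topk": "8", "experts": "256", "nodes": "1"}

expected = expected_identity(matrix_case)
before = identity_differences(expected, expected_identity(failed_case))
after = identity_differences(expected, expected_identity(patched_case))
```

Here `before` holds the four spurious `None!=...` entries and `after` is empty, which is the effect of adding the four fields on the emitter side.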

Oseltamivir force-pushed the collectivex branch 4 times, most recently from 758fa52 to 1c5b901 on Jul 4, 2026 01:11
Comment thread experimental/CollectiveX/tests/test_sampling_contract.py Fixed
Oseltamivir force-pushed the collectivex branch 3 times, most recently from 7e5f80a to 28cbac4 on Jul 4, 2026 03:21
functionstackx changed the title from "CollectiveX v1: cross-vendor EP benchmark suite" to "CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1:跨厂商 EP 基准测试套件" on Jul 4, 2026
functionstackx changed the title from "CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1:跨厂商 EP 基准测试套件" to "CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1:跨厂商 EP 基准测试套件 / CollectiveX v1: 크로스 벤더 EP 벤치마크 스위트" on Jul 4, 2026
Oseltamivir changed the title from "CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1:跨厂商 EP 基准测试套件 / CollectiveX v1: 크로스 벤더 EP 벤치마크 스위트" to "Finalize CollectiveX v1 cross-vendor EP benchmark suite / 完成 CollectiveX v1 跨厂商 EP 基准测试套件" on Jul 4, 2026
Oseltamivir force-pushed the collectivex branch 14 times, most recently from 57efb35 to 4ff5841 on Jul 4, 2026 13:42
Freeze the 37-shard cross-vendor EP matrix at 360 requested cases and 840 points on one 32-warmup, 512-observation protocol. Add native correctness, closed provenance, three-allocation promotion gates, and an isolated content-addressed filesystem publisher.

Close defects exposed by rejected allocations: isolate AMD Enroot state; correct MoRI output shape and unweighted combine semantics; standardize activation-only combine across every adapter; stage pinned DeepEP sources before compute allocation; authenticate reusable build outputs; normalize Hybrid enum identity; query loaded NCCL/RCCL runtimes; and harden cleanup and failure classification.

Harden B300 cache identity with a private mount sentinel and root-relative ownership checks, isolate DeepEP V2 JIT output per shard, and keep PR #605 with the official PR #630 scale-up fix. Mark only the current H100 runner pool's V2 cases unsupported until NCCL Device API symmetric memory is available, retain other H100 coverage, restore the production-only workflow, and classify detailed GB failures without publishing raw logs.

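Chinese (translated): Completes the isolated CollectiveX v1 expert-parallel benchmark suite. Freezes the cross-vendor matrix of 37 runnable shards, 360 requested cases, and 840 points on a uniform 32 warmups and 512 observations, and adds native correctness checks, strict provenance, a three-independent-allocation promotion gate, and a local content-addressed filesystem publisher.

Fixes problems exposed by rejected allocations: isolates AMD Enroot state; corrects MoRI output shape and unweighted combine semantics; unifies the activation-only combine boundary across all adapters; stages pinned DeepEP sources before compute-node allocation; verifies reusable build outputs; normalizes Hybrid enum identity; reads versions from the NCCL/RCCL runtime libraries actually loaded; and hardens cleanup and failure classification.

Hardens B300 cache identity with a private mount sentinel and owner checks relative to the cache root, and isolates DeepEP V2 JIT output to a single shard. DeepEP V2 keeps the PR #605 implementation pinned with the official PR #630 pure-scale-up fix; the V2 cases are marked unsupported only while the current H100 runner pool lacks NCCL Device API symmetric-memory capability, and other H100 coverage is retained. Also restores the production-only workflow and refines GB failure classification without publishing raw logs.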